Date: Sun, 3 Feb 2019 13:45:27 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Denis Burykin <apingis@...nwall.net>
Subject: Re: sha512crypt & Drupal 7+ password cracking on FPGA

Hi,

As I wrote in:

https://www.openwall.com/lists/john-users/2019/01/12/1

"Denis is currently working on improving these existing designs and
removing unnecessary historical differences between them."

This message is to announce the update of the sha512crypt & Drupal7
design, bringing it in sync with the latest sha256crypt design.

I'll over-quote my original announcement a little bit (since it's been a
long while) and comment on what changed:

On Mon, Jul 23, 2018 at 05:27:51PM +0200, Solar Designer wrote:
> Now Denis has also added support for sha512crypt and Drupal 7+ SHA-512
> based password hashes on those same old boards.
> 
> We had achieved energy-efficiency improvement over current high-end GPUs
> at descrypt and bcrypt, and in the case of bcrypt also decent speed
> improvement per board and per rig (see further messages in the above
> threads).  However, for sha512crypt and Drupal 7+ hashes we're merely on
> par with current high-end GPUs in terms of energy-efficiency and our

Actually, there was some improvement in energy-efficiency vs. GPUs at
sha512crypt, but not at Drupal7, as discussed further in the thread back
then.  The revised design improves energy-efficiency somewhat further -
more on this below.

> speeds per-board are lower (it takes four or so boards to match one
> high-end GPU).  Thus, for practical purposes this is useful to those who
> have those boards anyway or would acquire such boards primarily for
> bcrypt and descrypt, so that the boards can also be put to more uses.
> 
> This is also valuable as being, to the best of my knowledge, the very
> first implementation of these two hash types on FPGA.  And it is also
> our first attempt to use specialized soft CPU cores(*) along with
> cryptographic cores in an FPGA design to combine some limited
> flexibility (in this case, used to implement two higher-level hash types
> in one bitstream) with resource savings (no need to waste logic on
> sha512crypt's higher-level algorithm specifics) and efficient
> cryptographic cores (in this case, SHA-512).  Application of a similar
> approach to newer and much larger FPGAs (such as those available on AWS
> F1) will result in improvement over current GPUs at least in
> energy-efficiency (and for the largest FPGAs probably also in
> performance).
> 
> (*) Denis' bcrypt design uses microcode to save on logic, but it's a
> closer match to historical CPUs' wide microcode than to a CPU program.
> Maybe it'll help us implement bcrypt-pbkdf at some point, though.
> 
> Denis wrote a good description of the design with some ASCII diagrams,
> currently found here:
> 
> https://github.com/magnumripper/JohnTheRipper/tree/bleeding-jumbo/src/ztex/fpga-sha512crypt
> 
> Each soft CPU core is 16-way SMT (runs 16 hardware threads with their
> separate register files) and it controls four SHA-512 cores with each of
> those capable of up to four in-flight hash computations (most of the
> time only two are being computed, but there's some overlap between
> finishing processing on one pair of hashes and starting on the next).
> 
> One soft CPU core (plus its memory and glue logic) and four SHA-512
> cores form a unit.

This is still the case, but the soft CPUs have been changed from 32-bit
to smaller 16-bit ones, similar to those Denis uses in sha256crypt.

> 10 units fit in one Spartan-6 LX150 FPGA.  This means 10 soft CPU cores,
> 160 hardware threads, 40 SHA-512 cores, up to 160 in-flight SHA-512 per
> FPGA.  Four times that per board.

It's 12 units per FPGA now, which means 48 SHA-512 cores, 192 in-flight
SHA-512 hash computations, 12 soft CPU cores, 192 hardware threads.
Four times that per board, meaning 768 hash computations in parallel per
board, or 3072 hash computations in parallel in the four-board
benchmarks below.
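
For anyone wanting to re-derive these counts, here is a minimal Python
sketch of the arithmetic.  The constant and function names are mine
(illustrative only, not taken from Denis' design files); the numbers are
the ones stated above.

```python
# Parallelism figures as stated in this message; names are illustrative.
UNITS_PER_FPGA = 12          # one soft CPU core per unit
SHA512_CORES_PER_UNIT = 4    # each unit has four SHA-512 cores
IN_FLIGHT_PER_CORE = 4       # up to four in-flight computations per core
THREADS_PER_CPU = 16         # each soft CPU core is 16-way SMT
FPGAS_PER_BOARD = 4

def parallelism(fpgas):
    """Return (sha512_cores, in_flight_hashes, hw_threads) for `fpgas` FPGAs."""
    units = fpgas * UNITS_PER_FPGA
    cores = units * SHA512_CORES_PER_UNIT
    in_flight = cores * IN_FLIGHT_PER_CORE
    threads = units * THREADS_PER_CPU
    return cores, in_flight, threads

for label, fpgas in (("one FPGA", 1),
                     ("one board", FPGAS_PER_BOARD),
                     ("four boards", 4 * FPGAS_PER_BOARD)):
    print(label, parallelism(fpgas))
```

The in-flight count and the hardware thread count come out equal at
every level, since each of the 16 threads per soft CPU maps to one of
the 4 x 4 in-flight slots of its unit.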

> Also included are on-device candidate password generator (for mask mode,
> including in hybrid modes along with a wordlist coming from host, etc.)
> and hash comparator (capable of up to 512 loaded hashes per salt; no
> limit on total loaded hashes as that's handled on host).

> sha512crypt and Drupal 7+ hashes are two entry points into the program
> memory.  (The Drupal 7+ program is much simpler than sha512crypt's.
> It could also be more efficient on a more specialized design since it
> does not need unaligned access to the buffers, which we support for
> sha512crypt.  Yet it's good to have it along with sha512crypt
> essentially for free.)
> 
> Per Xilinx tools, this design was supposed to work at 225 MHz.

The design tools' reported frequency is now 215 MHz.

> Unfortunately, in our testing it only works at this frequency with very
> few units built into the bitstream.  We don't know exactly why (maybe
> it's the power draw).  With 10 units, the design works reliably for us
> at 135 MHz on many boards tested, so that's what we set as the current
> default.  It also sometimes works at higher frequencies such as 160 MHz,
> but other times not.  This is configurable in john.conf.

The current design appears to work reliably at 160 MHz, so that's the
new default (it was already the default for a previous revision).

> Here's a test run against 512 of same-salt sha512crypt hashes (good for
> quick reliability testing as all 512 are supposed to be cracked) on one
> board (4 FPGAs) at 135 MHz:
> 
> $ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
> [...]
> Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 327g 0:00:00:42 62.00% (ETA: 15:55:22) 7.746g/s 47003p/s 47003c/s 16282KC/s 40447..40137
> 512g 0:00:01:05 DONE (2018-07-23 15:55) 7.825g/s 46950p/s 46950c/s 12179KC/s 40500..40190

> Four boards (16 FPGAs), 135 MHz:
> 
> $ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
> [...]
> Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 378g 0:00:00:12 72.00% (ETA: 15:53:55) 30.45g/s 185656p/s 185656c/s 62318KC/s 40348..1AF58
> 512g 0:00:00:16 DONE (2018-07-23 15:53) 30.89g/s 185395p/s 185395c/s 51138KC/s 40000..40140

> Scaling efficiency 185395/46950/4 = 98.7%.
> 
> Four boards (16 FPGAs), 160 MHz:
> 
> $ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
> [...]
> Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 174g 0:00:00:04 32.00% (ETA: 15:57:33) 36.78g/s 216490p/s 216490c/s 94714KC/s 40044..1AF54
> 512g 0:00:00:14 DONE (2018-07-23 15:57) 36.44g/s 218647p/s 218647c/s 60310KC/s 40000..40340

One board (4 FPGAs), 160 MHz:

$ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
123g 0:00:00:10 22.00% (ETA: 23:01:23) 11.93g/s 68283p/s 68283c/s 30671KC/s 40243..40333
409g 0:00:00:36 78.00% (ETA: 23:01:24) 11.26g/s 68722p/s 68722c/s 20972KC/s 1117H..11D7H
512g 0:00:00:44 DONE (2019-02-02 23:01) 11.44g/s 68648p/s 68648c/s 17808KC/s 40000..40290

Four boards (16 FPGAs), 160 MHz:

$ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
460g 0:00:00:10 88.00% (ETA: 23:03:02) 44.27g/s 271029p/s 271029c/s 80274KC/s 40009..40049
512g 0:00:00:11 DONE (2019-02-02 23:03) 44.95g/s 269710p/s 269710c/s 74395KC/s 40000..40040

Scaling efficiency 269710/68648/4 = 98.2% (but maybe these runs are too
short to calculate it reliably).
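
The same calculation as a small Python sketch (the function name is
mine; the rates are the final c/s figures from the two runs above):

```python
def scaling_efficiency(multi_rate, single_rate, n_boards):
    """Fraction of ideal linear scaling achieved by n_boards boards."""
    return multi_rate / (single_rate * n_boards)

# Four boards vs. one board, both at 160 MHz, new design.
new = scaling_efficiency(269710, 68648, 4)
# Four boards vs. one board, both at 135 MHz, original design.
old = scaling_efficiency(185395, 46950, 4)

print(f"new design: {new:.1%}, original design: {old:.1%}")
```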

This is 23% faster than we had in the original design at the same clock
rate.  We can also achieve a slightly higher speed with a larger mask:

$ ./john -2='?d?u' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
155g 0:00:00:14 6.69% (ETA: 23:13:52) 10.43g/s 272290p/s 272290c/s 118572KC/s 40472..5YHL2
222g 0:00:00:25 11.32% (ETA: 23:14:03) 8.858g/s 273059p/s 273059c/s 107225KC/s 40429..5YH29
512g 0:00:00:56 DONE (2019-02-02 23:11) 8.996g/s 273273p/s 273273c/s 71083KC/s 40477..##E77
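
To put the two masks in perspective, here is a rough Python sketch of
their keyspace sizes and of how long a full sweep would take at the
measured rates.  It assumes ?d is the 10 digits and ?u is the 26
uppercase ASCII letters (the default mask placeholders without a
codepage); the function name is mine.

```python
def keyspace(charset_size, length):
    """Number of candidates for a mask of `length` identical positions."""
    return charset_size ** length

# Custom set from the first runs: 10 digits plus the letters A-J.
small = keyspace(len(set("1A2B3C4D5E6F7G8H9I0J")), 5)
# ?d?u: 10 digits plus 26 uppercase letters (assumed default placeholders).
large = keyspace(10 + 26, 5)

for name, size, rate in (("20-char set", small, 269710),
                         ("?d?u", large, 273273)):
    print(f"{name}: {size} candidates, ~{size / rate:.0f} s to exhaust "
          f"at {rate} c/s")
```

The smaller mask's full sweep time is in line with the 11-second run
above.  The larger mask would need several minutes to exhaust, so the
56-second run presumably ended early because all 512 hashes had been
cracked by then.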

> This is similar speed to what Jeremi Gosney reported for hashcat on one
> GTX 1080 Ti at stock clocks:
> 
> https://gist.github.com/epixoip/973da7352f4cc005746c627527e4d073
> 
> Hashtype: sha512crypt, SHA512(Unix)
> 
> Speed.Dev.#1.....:   216.0 kH/s (53.53ms)
> 
> Somehow a newer benchmark of 8x GTX 1080 Ti shows slightly higher speed
> per GPU:
> 
> https://gist.github.com/epixoip/ace60d09981be09544fdd35005051505
> 
> Hashtype: sha512crypt $6$, SHA512 (Unix)
> 
> Speed.Dev.#1.....:   235.9 kH/s (96.29ms)
> Speed.Dev.#2.....:   228.3 kH/s (50.67ms)
> Speed.Dev.#3.....:   230.4 kH/s (50.22ms)
> Speed.Dev.#4.....:   230.5 kH/s (50.18ms)
> Speed.Dev.#5.....:   230.6 kH/s (50.16ms)
> Speed.Dev.#6.....:   230.1 kH/s (50.27ms)
> Speed.Dev.#7.....:   232.0 kH/s (49.85ms)
> Speed.Dev.#8.....:   231.3 kH/s (50.01ms)
> Speed.Dev.#*.....:  1849.1 kH/s

We're now slightly faster on four boards than hashcat is on one GTX 1080
Ti.  However, hashcat is faster on RTX 2080 Ti:

https://hashcat.net/forum/thread-7853.html

Hashmode: 1800 - sha512crypt $6$, SHA512 (Unix) (Iterations: 5000)

Speed.Dev.#1.....:   350.9 kH/s (78.30ms) @ Accel:512 Loops:128 Thr:32 Vec:1

> We're probably consuming around 160W for the boards (Denis measured 3.4A
> at 12V per board at 160 MHz, which translates to ~40W/board)

Now it's 3.6A to 3.7A at 12V, which means 43W to 45W per board.
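
A minimal sketch of the conversion in Python, for those who want to plug
in their own measurements (the function name is mine, and this only
covers the 12V input current quoted here, not PSU losses or the host):

```python
SUPPLY_VOLTS = 12.0

def board_power_watts(amps, volts=SUPPLY_VOLTS):
    """DC power drawn by one board: P = V * I."""
    return volts * amps

for amps in (3.4, 3.6, 3.7):   # original design, then the new range
    print(f"{amps} A at 12 V -> {board_power_watts(amps):.1f} W per board, "
          f"{4 * board_power_watts(amps):.0f} W for four boards")
```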

Drupal7 hashes:

> Four boards (16 FPGAs), 160 MHz:
> 
> $ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
> [...]
> Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
> Cost 1 (iteration count) is 16384 for all loaded hashes
> Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:00:12 14.78% (ETA: 16:11:22) 0g/s 65890p/s 65890c/s 65890C/s rpdroapa..dwdporpa
> 0g 0:00:00:31 36.38% (ETA: 16:11:25) 0g/s 66386p/s 66386c/s 66386C/s apawrrws..swarosos
> 0g 0:00:01:16 88.67% (ETA: 16:11:26) 0g/s 66586p/s 66586c/s 66586C/s soapawad..wpssppsd
> password         (?)
> 1g 0:00:01:24 DONE (2018-07-23 16:11) 0.01180g/s 66541p/s 66541c/s 66541C/s password..wsrssdrd

Ditto, new design:

$ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
[...]
Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
Cost 1 (iteration count) is 16384 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:22 32.68% (ETA: 23:06:03) 0g/s 83406p/s 83406c/s 83406C/s raarppss..wppddoss
0g 0:00:01:04 93.79% (ETA: 23:06:04) 0g/s 83604p/s 83604c/s 83604C/s oapdodwd..spddwood
password         (?)
1g 0:00:01:07 DONE (2019-02-02 23:06) 0.01477g/s 83530p/s 83530c/s 83530C/s password..prwaspdd

That's a 25% speedup at the same clock rate.
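
The speedup figures in this message (23% for sha512crypt, 25% for
Drupal7) can be re-derived from the final c/s values of the four-board
160 MHz runs.  A small Python sketch, with a function name of my
choosing:

```python
def speedup(new_rate, old_rate):
    """Relative speed gain of the new design, e.g. 0.25 means 25% faster."""
    return new_rate / old_rate - 1.0

# Final c/s from the four-board, 160 MHz runs (new vs. original design).
sha512crypt_gain = speedup(269710, 218647)
drupal7_gain = speedup(83530, 66541)

print(f"sha512crypt: +{sha512crypt_gain:.1%}")
print(f"Drupal7:     +{drupal7_gain:.1%}")
```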

On energy-efficiency vs. GTX 1080 power-limited to 90W:

On Mon, Jul 23, 2018 at 08:40:48PM +0200, Jens Steube wrote:
> For sha512crypt I'm getting around 377kH/s on all four GPU. That
> translates to ~94300 per 90W.
> For Drupal7 I'm getting around 156kH/s on all four GPU. That translates
> to ~39000 per 90W.

> If I understand your
> measurements correctly a single quad FPGA board is doing 54600H/s at 40W
> on sha512crypt and 16600H/s at 40W on Drupal7. If you scale this up to
> 90W, it's 122850H/s per sha512crypt and 37350H/s per Drupal7. That means
> from power consumption perspective it's 30% faster than the GPU for
> sha512crypt, but at the same time it's slower for Drupal7?

(Jens' understanding was correct.)

Denis' new design appears to deliver about 140k H/s at sha512crypt or
about 42k H/s at Drupal7 per 90W, which is slightly better than we had
before.
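
Here is a rough Python sketch of that normalization.  It assumes the
per-board sha512crypt rate from the one-board run above, a per-board
Drupal7 rate obtained by dividing the four-board result by four (i.e.
assuming linear scaling), and the 3.6A to 3.7A at 12V power range.  The
function name is mine.

```python
def rate_per_watts(rate, watts, target_watts=90.0):
    """Scale a measured rate linearly to a target power budget."""
    return rate * target_watts / watts

SHA512CRYPT_PER_BOARD = 68648        # c/s, one board at 160 MHz
DRUPAL7_PER_BOARD = 83530 / 4        # c/s, four-board run split evenly (assumed)

for watts in (43.2, 44.4):           # 3.6 A and 3.7 A at 12 V
    s = rate_per_watts(SHA512CRYPT_PER_BOARD, watts)
    d = rate_per_watts(DRUPAL7_PER_BOARD, watts)
    print(f"{watts} W/board: sha512crypt ~{s:.0f}, Drupal7 ~{d:.0f} per 90 W")
```

For comparison, Jens' GPU figures quoted above are ~94300 (sha512crypt)
and ~39000 (Drupal7) per 90W.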

Alexander
