john-users - Re: sha512crypt & Drupal 7+ password cracking on FPGA

Message-ID: <ccde395c-bf80-9bd8-73dd-860edaa977c4@hashcat.net>
Date: Mon, 23 Jul 2018 20:40:48 +0200
From: Jens Steube <atom@...hcat.net>
To: john-users@...ts.openwall.com
Subject: Re: sha512crypt & Drupal 7+ password cracking on FPGA

It's nice to see sha512crypt available for ztex boards, this is great work!

I'd like to step in here because you based the GPU comparison on
Jeremi's benchmark tables, where hashcat is tuned to run at maximum
performance. From a power-efficiency perspective, though, I'd recommend
a GTX 1080 limited to 90W. You can do that with "nvidia-smi -pl 90".
Here's a sheet showing how much limiting the power consumption improves
the Performance/Watt ratio:
https://docs.google.com/spreadsheets/d/1yyefbpYOq7UIBeBmi5SDUNTXnkIEsdz_gRC5mAeN1x8/edit?usp=sharing

To make it short: we can limit the GPU to half of its power consumption
without losing half of the performance, just 25%. Limiting the power
consumption has other advantages too. For example, the GPUs are much
easier to cool. On my system the GPUs stay around 70°C even on longer
runs. I'm using them as they are, without any modifications or external
cooling solutions. The fans (air) are auto-controlled by the driver and
stay far below 50%.
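
The effect on the Performance/Watt ratio can be worked out roughly as
follows. This is only a sketch: the 180W stock figure is an assumption
implied by "half of the power" at a 90W limit, and the function name is
made up for illustration.

```python
def perf_per_watt_gain(power_fraction, perf_fraction):
    """Relative Performance/Watt after power limiting (1.0 = unchanged)."""
    return perf_fraction / power_fraction

STOCK_WATTS = 180.0   # assumption: 90W is described as half of the power
LIMIT_WATTS = 90.0    # value passed to "nvidia-smi -pl 90"
PERF_LOSS = 0.25      # "just 25%" performance loss, from the text

power_fraction = LIMIT_WATTS / STOCK_WATTS   # 0.5
perf_fraction = 1.0 - PERF_LOSS              # 0.75
gain = perf_per_watt_gain(power_fraction, perf_fraction)

print(f"Performance/Watt gain: {gain:.2f}x")
```

With these inputs, each watt does about 1.5 times as much work as at
the stock power limit.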

I have a system with four GTX 1080s for development. While running
hashcat and monitoring the power consumption in a second shell (in
parallel, using nvidia-smi), I can see that the power consumption
sometimes peaks at 92W, but most of the time drops to 75W and sometimes
even 70W. I don't know the technical details here, but my gut feeling
tells me it's lower than 90W on average. OTOH we have to keep the host
system running, which isn't needed with a mature FPGA system that can
run fully on its own. Therefore I think using 90W per GPU is a fair
trade-off.
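
To replace the gut feeling with a number, one could average periodic
readings. The sketch below assumes nvidia-smi's CSV query output
(--query-gpu=power.draw --format=csv,noheader,nounits), which prints
one reading in watts per GPU per line. The helper names are
hypothetical, and the example readings are illustrative values taken
from the range mentioned above, not a real log.

```python
import subprocess

# Assumption: this query prints one power reading (watts) per GPU per line.
QUERY = ["nvidia-smi", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"]

def parse_power_lines(text):
    """Parse one power reading in watts per non-empty line."""
    return [float(line) for line in text.splitlines() if line.strip()]

def average(values):
    """Arithmetic mean of a non-empty list of readings."""
    return sum(values) / len(values)

def sample_once():
    """Run nvidia-smi once and return the per-GPU readings (needs a GPU)."""
    out = subprocess.run(QUERY, capture_output=True, text=True,
                         check=True).stdout
    return parse_power_lines(out)

# Illustrative readings only (not measured data): one peak, then lower values.
example = "92.0\n75.0\n70.0\n75.0\n"
readings = parse_power_lines(example)
print(f"average draw: {average(readings):.1f}W over {len(readings)} samples")
```

On a real system, sample_once() would be called in a loop with a sleep
between calls while hashcat is running. Some GPUs report "[N/A]" for
power.draw, which this parser does not handle.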

For sha512crypt I'm getting around 377kH/s across all four GPUs. That
translates to ~94300H/s per 90W.
For Drupal7 I'm getting around 156kH/s across all four GPUs. That
translates to ~39000H/s per 90W.
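
As a quick check of the per-90W figures, here is the division spelled
out. It assumes each of the four GPUs draws its full 90W limit; the
function name is made up for illustration. The sha512crypt result
differs slightly from ~94300 because the 377kH/s input is itself
rounded.

```python
def per_gpu_rate(total_hashes_per_sec, num_gpus):
    """Average speed of one GPU, i.e. speed per 90W at a 90W power limit."""
    return total_hashes_per_sec / num_gpus

NUM_GPUS = 4  # four GTX 1080s, each limited to 90W

sha512crypt_per_90w = per_gpu_rate(377_000, NUM_GPUS)
drupal7_per_90w = per_gpu_rate(156_000, NUM_GPUS)

print(f"sha512crypt: {sha512crypt_per_90w:.0f} H/s per 90W")
print(f"Drupal7:     {drupal7_per_90w:.0f} H/s per 90W")
```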

This is a weird result at first glance. If I understand your
measurements correctly, a single quad-FPGA board does 54600H/s at 40W
on sha512crypt and 16600H/s at 40W on Drupal7. If you scale this up to
90W, it's 122850H/s for sha512crypt and 37350H/s for Drupal7. That means
that, per watt, the FPGA is 30% faster than the GPU for sha512crypt, but
at the same time slightly slower for Drupal7? The reason is the branches
in the loop function of sha512crypt, which is a special case. GPUs
really don't like them. IOW, the GPU implementation of all *crypt
algorithms is a bit below its theoretical maximum. In Drupal7 (and
PBKDF2 and most other KDFs) there are no such branches in the loop, thus
the GPU can perform at full speed on all compute units.
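
The scaling and the comparison can be reproduced with a few lines. This
sketch assumes FPGA speed scales linearly with power (in practice, by
adding boards) and uses the rounded figures quoted in this message; the
function name is hypothetical.

```python
def scale_to_power(rate, measured_watts, target_watts):
    """Linearly scale a hash rate from one power budget to another."""
    return rate * target_watts / measured_watts

# FPGA board figures at 40W, scaled to the 90W GPU power limit.
fpga_sha512crypt = scale_to_power(54600, 40, 90)
fpga_drupal7 = scale_to_power(16600, 40, 90)

# GPU figures per 90W, as quoted in this message.
gpu_sha512crypt = 94300
gpu_drupal7 = 39000

ratio_sha512crypt = fpga_sha512crypt / gpu_sha512crypt
ratio_drupal7 = fpga_drupal7 / gpu_drupal7

print(f"sha512crypt: FPGA {fpga_sha512crypt:.0f} H/s, "
      f"{ratio_sha512crypt:.2f}x the GPU")
print(f"Drupal7:     FPGA {fpga_drupal7:.0f} H/s, "
      f"{ratio_drupal7:.2f}x the GPU")
```

A ratio above 1 means the FPGA board does more work per watt than the
power-limited GPU, and a ratio below 1 means less.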

As you can see, today's GPUs are pretty close to an FPGA board when it
comes to power consumption. I know that the ztex boards are old now
and that there are better solutions, but the same is true for GPUs:
just look at the V100. I'm happy with the results.

- Jens

On 23.07.2018 17:27, Solar Designer wrote:
> Hi,
>
> As many of you are aware, we support descrypt and bcrypt password hash
> cracking on the old ZTEX 1.15y quad-FPGA boards.  Threads:
>
> http://www.openwall.com/lists/john-users/2016/11/06/1
> http://www.openwall.com/lists/john-users/2017/06/25/1
>
> Now Denis has also added support for sha512crypt and Drupal 7+ SHA-512
> based password hashes on those same old boards.
>
> We had achieved energy-efficiency improvement over current high-end GPUs
> at descrypt and bcrypt, and in the case of bcrypt also decent speed
> improvement per board and per rig (see further messages in the above
> threads).  However, for sha512crypt and Drupal 7+ hashes we're merely on
> par with current high-end GPUs in terms of energy-efficiency and our
> speeds per-board are lower (it takes four or so boards to match one
> high-end GPU).  Thus, for practical purposes this is useful to those who
> have those boards anyway or would acquire such boards primarily for
> bcrypt and descrypt, so that the boards can also be put to more uses.
>
> This is also valuable as being, to the best of my knowledge, the very
> first implementation of these two hash types on FPGA.  And it is also
> our first attempt to use specialized soft CPU cores(*) along with
> cryptographic cores in an FPGA design to combine some limited
> flexibility (in this case, used to implement two higher-level hash types
> in one bitstream) with resource savings (no need to waste logic on
> sha512crypt's higher-level algorithm specifics) and efficient
> cryptographic cores (in this case, SHA-512).  Application of a similar
> approach to newer and much larger FPGAs (such as those available on AWS
> F1) will result in improvement over current GPUs at least in
> energy-efficiency (and for the largest FPGAs probably also in
> performance).
>
> (*) Denis' bcrypt design uses microcode to save on logic, but it's a
> closer match to historical CPUs' wide microcode than to a CPU program.
> Maybe it'll help us implement bcrypt-pbkdf at some point, though.
>
> Denis wrote a good description of the design with some ASCII diagrams,
> currently found here:
>
> https://github.com/magnumripper/JohnTheRipper/tree/bleeding-jumbo/src/ztex/fpga-sha512crypt
>
> Each soft CPU core is 16-way SMT (runs 16 hardware threads with their
> separate register files) and it controls four SHA-512 cores with each of
> those capable of up to four in-flight hash computations (most of the
> time only two are being computed, but there's some overlap between
> finishing processing on one pair of hashes and starting on the next).
>
> One soft CPU core (plus its memory and glue logic) and four SHA-512
> cores form a unit.  The SHA-512 cores occupy 80% of the unit's area,
> so in those terms the overhead of using soft CPUs is at most 25% (but
> they actually help save on algorithm-specific logic).
>
> 10 units fit in one Spartan-6 LX150 FPGA.  This means 10 soft CPU cores,
> 160 hardware threads, 40 SHA-512 cores, up to 160 in-flight SHA-512 per
> FPGA.  Four times that per board.
>
> Also included are an on-device candidate password generator (for mask
> mode, including in hybrid modes along with a wordlist coming from host,
> etc.) and a hash comparator (capable of up to 512 loaded hashes per salt; no
> limit on total loaded hashes as that's handled on host).  This is
> similar to what Denis' designs for descrypt and bcrypt also have.
>
> sha512crypt and Drupal 7+ hashes are two entry points into the program
> memory.  (The Drupal 7+ program is much simpler than sha512crypt's.
> It could also be more efficient on a more specialized design since it
> does not need unaligned access to the buffers, which we support for
> sha512crypt.  Yet it's good to have it along with sha512crypt
> essentially for free.)
>
> Per Xilinx tools, this design was supposed to work at 225 MHz.
> Unfortunately, in our testing it only works at this frequency with very
> few units built into the bitstream.  We don't know exactly why (maybe
> it's the power draw).  With 10 units, the design works reliably for us
> at 135 MHz on many boards tested, so that's what we set as the current
> default.  It also sometimes works at higher frequencies such as 160 MHz,
> but other times not.  This is configurable in john.conf.
>
> Here's a test run against 512 same-salt sha512crypt hashes (good for
> quick reliability testing as all 512 are supposed to be cracked) on one
> board (4 FPGAs) at 135 MHz:
>
> $ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
> [...]
> Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 327g 0:00:00:42 62.00% (ETA: 15:55:22) 7.746g/s 47003p/s 47003c/s 16282KC/s 40447..40137
> 512g 0:00:01:05 DONE (2018-07-23 15:55) 7.825g/s 46950p/s 46950c/s 12179KC/s 40500..40190
> Session completed
>
> Four boards (16 FPGAs), 135 MHz:
>
> $ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
> [...]
> Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 378g 0:00:00:12 72.00% (ETA: 15:53:55) 30.45g/s 185656p/s 185656c/s 62318KC/s 40348..1AF58
> 512g 0:00:00:16 DONE (2018-07-23 15:53) 30.89g/s 185395p/s 185395c/s 51138KC/s 40000..40140
> Session completed
>
> Scaling efficiency 185395/46950/4 = 98.7%.
>
> Four boards (16 FPGAs), 160 MHz:
>
> $ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
> [...]
> Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 174g 0:00:00:04 32.00% (ETA: 15:57:33) 36.78g/s 216490p/s 216490c/s 94714KC/s 40044..1AF54
> 512g 0:00:00:14 DONE (2018-07-23 15:57) 36.44g/s 218647p/s 218647c/s 60310KC/s 40000..40340
> Session completed
>
> This is similar speed to what Jeremi Gosney reported for hashcat on one
> GTX 1080 Ti at stock clocks:
>
> https://gist.github.com/epixoip/973da7352f4cc005746c627527e4d073
>
> Hashtype: sha512crypt, SHA512(Unix)
>
> Speed.Dev.#1.....:   216.0 kH/s (53.53ms)
>
> Somehow a newer benchmark of 8x GTX 1080 Ti shows slightly higher speed
> per GPU:
>
> https://gist.github.com/epixoip/ace60d09981be09544fdd35005051505
>
> Hashtype: sha512crypt $6$, SHA512 (Unix)
>
> Speed.Dev.#1.....:   235.9 kH/s (96.29ms)
> Speed.Dev.#2.....:   228.3 kH/s (50.67ms)
> Speed.Dev.#3.....:   230.4 kH/s (50.22ms)
> Speed.Dev.#4.....:   230.5 kH/s (50.18ms)
> Speed.Dev.#5.....:   230.6 kH/s (50.16ms)
> Speed.Dev.#6.....:   230.1 kH/s (50.27ms)
> Speed.Dev.#7.....:   232.0 kH/s (49.85ms)
> Speed.Dev.#8.....:   231.3 kH/s (50.01ms)
> Speed.Dev.#*.....:  1849.1 kH/s
>
> We're probably consuming around 160W for the boards (Denis measured 3.4A
> at 12V per board at 160 MHz, which translates to ~40W/board) or 180W at
> the wall at ~90% PSU efficiency.
>
> I guess GTX 1080 Ti might consume a little bit more at this benchmark
> (it's a 300W TDP card).  Jeremi (or someone else who has one of those
> cards) can probably check via nvidia-smi while running hashcat.
>
> Drupal 7+ hash, one board (4 FPGAs) at 135 MHz:
>
> $ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
> [...]
> Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
> Cost 1 (iteration count) is 16384 for all loaded hashes
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:00:10 2.49% (ETA: 16:08:54) 0g/s 14250p/s 14250c/s 14250C/s prdowaap..oooarsap
> 0g 0:00:02:03 30.91% (ETA: 16:08:49) 0g/s 14421p/s 14421c/s 14421C/s awoppaas..rssoasas
> 0g 0:00:03:31 52.93% (ETA: 16:08:50) 0g/s 14427p/s 14427c/s 14427C/s wdwdwdow..pdawrprw
> 0g 0:00:06:20 95.21% (ETA: 16:08:51) 0g/s 14430p/s 14430c/s 14430C/s wpddwood..ppowrrod
> password         (?)
> 1g 0:00:06:28 DONE (2018-07-23 16:08) 0.002571g/s 14428p/s 14428c/s 14428C/s password..orpadord
> Use the "--show" option to display all of the cracked passwords reliably
> Session completed
>
> Four boards (16 FPGAs), 135 MHz:
>
> $ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
> [...]
> Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
> Cost 1 (iteration count) is 16384 for all loaded hashes
> Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:00:10 10.23% (ETA: 16:01:23) 0g/s 56120p/s 56120c/s 56120C/s oaoopprp..rooddwrp
> 0g 0:00:00:35 35.24% (ETA: 16:01:26) 0g/s 56590p/s 56590c/s 56590C/s dwpadaws..ppawrrws
> 0g 0:00:01:01 60.25% (ETA: 16:01:27) 0g/s 56662p/s 56662c/s 56662C/s adwoowao..ssodwpso
> password         (?)
> 1g 0:00:01:39 DONE (2018-07-23 16:01) 0.01005g/s 56678p/s 56678c/s 56678C/s password..wsrssdrd
> Use the "--show" option to display all of the cracked passwords reliably
> Session completed
>
> Scaling efficiency 56678/14428/4 = 98.2% despite the complaint about
> too small a mask (too few different characters for the mask positions
> handled on device).
>
> Four boards (16 FPGAs), 160 MHz:
>
> $ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
> [...]
> Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
> Cost 1 (iteration count) is 16384 for all loaded hashes
> Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:00:12 14.78% (ETA: 16:11:22) 0g/s 65890p/s 65890c/s 65890C/s rpdroapa..dwdporpa
> 0g 0:00:00:31 36.38% (ETA: 16:11:25) 0g/s 66386p/s 66386c/s 66386C/s apawrrws..swarosos
> 0g 0:00:01:16 88.67% (ETA: 16:11:26) 0g/s 66586p/s 66586c/s 66586C/s soapawad..wpssppsd
> password         (?)
> 1g 0:00:01:24 DONE (2018-07-23 16:11) 0.01180g/s 66541p/s 66541c/s 66541C/s password..wsrssdrd
> Use the "--show" option to display all of the cracked passwords reliably
> Session completed
>
> We'd appreciate more testing, such as on Royce's larger cluster of these
> boards maybe.  Please post your results as follow-ups to this message.
>
> Alexander