john-users - Re: md5crypt & phpass password cracking on FPGA

Message-ID: <20190401095504.GA23065@openwall.com>
Date: Mon, 1 Apr 2019 11:55:04 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Denis Burykin <apingis@...nwall.net>
Subject: Re: md5crypt & phpass password cracking on FPGA

Hi,

Denis has updated the design behind our md5crypt-ztex and phpass-ztex
recently.  Unfortunately, he didn't have as much luck with these updates
as he did with other designs, so this announcement is quite different
from those.  Please see inline:

On Fri, Oct 12, 2018 at 08:59:56PM +0200, Solar Designer wrote:
> As many of you are aware, we support descrypt, bcrypt, sha512crypt,
> sha256crypt, and Drupal7 password hash cracking on the old ZTEX 1.15y
> quad-FPGA boards.  Threads:
> 
> https://www.openwall.com/lists/john-users/2016/11/06/1
> https://www.openwall.com/lists/john-users/2017/06/25/1
> https://www.openwall.com/lists/john-users/2018/07/23/1
> https://www.openwall.com/lists/john-users/2018/08/27/11
> 
> Now Denis has also added support for md5crypt and for phpass "portable
> hashes" on those same boards.  (While phpass as released by Openwall
> primarily uses bcrypt, it also includes a "last resort fallback" to
> MD5-based "portable hashes", which many popular web apps forced use of
> for portability.)
> 
> Similarly to sha512crypt, sha256crypt, and Drupal7, this addition is not
> so much to compete with GPUs as it is to provide a way to put those FPGA
> boards to more uses.  Also just like our implementations of sha512crypt,
> sha256crypt, and Drupal7 hashes on FPGA, this is, to the best of my
> knowledge, the very first time md5crypt and phpass "portable hashes" are
> implemented on FPGA.
> 
> Denis wrote a good description of the design with some ASCII diagrams,
> currently found here:
> 
> https://github.com/magnumripper/JohnTheRipper/tree/bleeding-jumbo/src/ztex/fpga-md5crypt
> 
> Similarly to Denis' designs for sha512crypt + Drupal7 and sha256crypt,
> the new one for md5crypt + phpass uses specialized soft CPU cores along
> with cryptographic cores.  However, the specific parameters of those
> cores changed once again: this time, it's 16-bit 12-way SMT CPU cores
> along with sets of 3 MD5 cores each capable of up to 4 in-flight hashes.
> 
> Three MD5 cores, one soft CPU core, and memory and glue logic form a
> unit.  32 units fit in one Spartan-6 LX150 FPGA.  This means 32 soft CPU
> cores, 384 hardware threads, 96 MD5 cores, up to 384 in-flight MD5 per
> FPGA.  Four times that - meaning 1536 in-flight hashes - per board.
> 
> Also included are on-device candidate password generator (for mask mode,
> including in hybrid modes along with a wordlist coming from host, etc.)
> and hash comparator (capable of up to 512 loaded hashes per salt; no
> limit on total loaded hashes as that's handled on host).  The designs
> for sha512crypt + Drupal7 and sha256crypt include these same components.
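
The per-FPGA and per-board totals quoted above follow directly from the
unit composition.  Here's a minimal arithmetic sketch of that; the
constant names are mine for illustration and are not taken from the
design's source:

```python
# Figures taken from the design description quoted above.
UNITS_PER_FPGA = 32
MD5_CORES_PER_UNIT = 3
IN_FLIGHT_PER_MD5_CORE = 4
SMT_THREADS_PER_CPU = 12   # one soft CPU core per unit
FPGAS_PER_BOARD = 4        # ZTEX 1.15y is a quad-FPGA board

md5_cores = UNITS_PER_FPGA * MD5_CORES_PER_UNIT
hw_threads = UNITS_PER_FPGA * SMT_THREADS_PER_CPU
in_flight_per_fpga = md5_cores * IN_FLIGHT_PER_MD5_CORE
in_flight_per_board = in_flight_per_fpga * FPGAS_PER_BOARD

print(md5_cores, hw_threads, in_flight_per_fpga, in_flight_per_board)
```

Note that the hardware thread count and the in-flight hash count per
FPGA coincide, since 12 = 3 * 4.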

Denis has now updated the documentation available at the GitHub URL
above to describe the existing design better.

> Per Xilinx tools, this design was supposed to work at 202 MHz.  In our
> testing on actual boards, the design works reliably for us at 180 MHz,
> which we set as the default and made configurable in john.conf.

Denis also made an attempt to increase the clock rate through use of
low-level primitives to implement MD5's bit rotates with lower latency.
This did increase the toolset's reported frequency from 202 to 225 MHz,
but in testing on actual boards the new design was somewhat unstable
even at 180 MHz, where the old design is stable.

Thus, we didn't merge these updates into bleeding-jumbo, considering
them a partially failed experiment.

Instead, Denis made conservative updates to the existing design, adding
clock gating.  With this, all *-ztex designs now have clock gating and
thus idle power consumption (with bitstreams loaded) of under 5W per
board.  Other than that, performance and reliability of
md5crypt-ztex and phpass-ztex should have remained unchanged since my
original announcement in October.

There are some updates to the comparison against GPU below:

> Some other implementations of md5crypt and phpass "portable hashes" that
> we have in JtR have password length limitations of 15 and 39 characters,
> respectively - as an optimization.  This FPGA implementation of these
> hashes is capable of password lengths up to 64.  Actual performance
> varies by salt and password length.  For the benchmarks below, I'll use
> salt length of 8 and password length of 7 to match Hashcat benchmarks.
> 
> Here's a test run against one md5crypt hash on one board (4 FPGAs) at
> 180 MHz:
> 
> $ perl -e 'print crypt("passMD5", "\$1\$saltsalt"), "\n";' > pw-md5crypt-1
> $ cat pw-md5crypt-1
> $1$saltsalt$TwZH0EJ82F8jZZW.s.uLn/
> $ ./john -form=md5crypt-ztex -mask='pas?a?a?a?a' pw-md5crypt-1
> [...]
> Loaded 1 password hash (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> passMD5          (?)
> 1g 0:00:00:18 DONE (2018-10-12 19:31) 0.05506g/s 944245p/s 944245c/s 944245C/s passMD5..pas##|5
> 
> A higher frequency might work, but isn't reliable across the boards we
> tested.  That said, here's a lucky run on a lucky board at 210 MHz just
> to hit and exceed 1M c/s:
> 
> $ ./john -form=md5crypt-ztex -mask='pas?a?a?a?a' pw-md5crypt-1 -dev=04A3466XXX
> ZTEX 04A3466XXX bus:2 dev:8 Frequency:210 210 210 210
> Using default input encoding: UTF-8
> Loaded 1 password hash (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> passMD5          (?)
> 1g 0:00:00:15 DONE (2018-10-12 19:43) 0.06389g/s 1095Kp/s 1095Kc/s 1095KC/s passMD5..pas##|5
> 
> For all further tests, I'll use 180 MHz.  Let's pretend more of the
> password is unknown, for a longer test on one board (4 FPGAs) and also
> to test a hybrid mode:
> 
> $ ./john -form=md5crypt-ztex -inc -mask='?w?a?a?a' -min-len=7 -max-len=7 pw-md5crypt-1
> [...]
> Loaded 1 password hash (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> passMD5          (?)
> 1g 0:00:02:23 DONE (2018-10-12 19:55) 0.006945g/s 952837p/s 952837c/s 952837C/s passMD5..pass##|
> 
> Four boards (16 FPGAs):
> 
> $ ./john -form=md5crypt-ztex -inc -mask='?w?a?a?a' -min-len=7 -max-len=7 pw-md5crypt-1
> [...]
> Loaded 1 password hash (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> passMD5          (?)
> 1g 0:00:00:36 DONE (2018-10-12 19:55) 0.02751g/s 3773Kp/s 3773Kc/s 3773KC/s passMD5..pass##|
> 
> Scaling efficiency 3773000/952837/4 = 99.0%.
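
For reference, the scaling efficiency figure is just the four-board
rate divided by four times the one-board rate.  A small sketch, with a
hypothetical helper name and K treated as 1000 as in the calculation
above:

```python
def scaling_efficiency(multi_rate, single_rate, n_boards):
    """Fraction of ideal linear speedup achieved by n_boards."""
    return multi_rate / (single_rate * n_boards)

# c/s figures from the one-board and four-board runs quoted above.
eff = scaling_efficiency(3_773_000, 952_837, 4)
print(f"{eff:.1%}")
```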
> 
> Modern high-end GPUs are several times faster than that.  However, this
> speed is on par with what was achieved with Hashcat on AMD HD 7970 aka
> Tahiti, the fastest GPU contemporary to these FPGA boards (circa 2012),
> and ours is achieved at moderately lower power consumption.  Denis says
> the boards running this design consume around 2.8A at 12V at 190 MHz (we
> don't readily have a figure for 180 MHz and I don't want to delay
> posting this), which means about 135W for the four boards.  HD 7970 was
> a 250W TDP GPU card; its actual power usage could be less, and
> underclocking could provide better power efficiency, but probably not to
> the extent of reaching 135W at this performance level.
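
The ~135W estimate is simply current times voltage times the number of
boards.  A sketch of that arithmetic, with variable names of my own
choosing; keep in mind the current was measured at 190 MHz whereas the
benchmarks were run at 180 MHz, so this is only a rough upper estimate
for the benchmarked configuration:

```python
CURRENT_A = 2.8      # per board, as measured by Denis at 190 MHz
SUPPLY_V = 12.0
BOARDS = 4
GPU_TDP_W = 250.0    # HD 7970 TDP, from the text above

per_board_w = CURRENT_A * SUPPLY_V
total_w = per_board_w * BOARDS

print(f"{per_board_w:.1f} W per board, {total_w:.1f} W for {BOARDS} boards")
print(f"{total_w / GPU_TDP_W:.0%} of the GPU's TDP")
```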
> 
> Now to some multi-hash runs for reliability testing.  Four boards, mask:
> 
> $ perl -e 'for ($i = 100; $i < 612; $i++) { print crypt("pass$i", "\$1\$saltsalt"), "\n"; }' > pw-md5crypt
> $ ./john -form=md5crypt-ztex -mask='pas?a?a?a?a' -verb=1 pw-md5crypt
> [...]
> Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 512g 0:00:00:06 DONE (2018-10-11 20:55) 74.41g/s 3672Kp/s 3672Kc/s 1263MC/s pass477..pas##Mj
> 
> Four boards, larger mask for a longer run:
> 
> $ ./john -form=md5crypt-ztex -mask='pa?l?a?a?a?a' -verb=1 pw-md5crypt
> [...]
> Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:00:06 1.06% (ETA: 21:19:48) 0g/s 3711Kp/s 3711Kc/s 1922MC/s paaaaee..paaa9ee
> 52g 0:00:00:18 3.37% (ETA: 21:19:19) 2.744g/s 3764Kp/s 3764Kc/s 1968MC/s paaaa5i..paaa95i
> 155g 0:00:01:00 10.81% (ETA: 21:19:38) 2.555g/s 3776Kp/s 3776Kc/s 1779MC/s pass372..pass242
> 359g 0:00:01:57 20.92% (ETA: 21:19:43) 3.065g/s 3783Kp/s 3783Kc/s 1414MC/s paaaa%5..paaa9%5
> 507g 0:00:02:28 26.59% (ETA: 21:19:41) 3.404g/s 3781Kp/s 3781Kc/s 1188MC/s pass367..pass587
> 512g 0:00:02:30 DONE (2018-10-11 21:12) 3.413g/s 3779Kp/s 3779Kc/s 1172MC/s pass477..pa###R7
> 
> (I pressed a key a few times.)
> 
> This shows roughly the same speed as we have for a large enough mask when
> running against one hash, meaning the comparator against 512 loaded
> hashes (sharing this same salt for testing) doesn't slow things down.
> 
> Four boards, wordlist and mask:
> 
> $ ./john -form=md5crypt-ztex -w=rtop1m -mask='?w?d' -verb=1 pw-md5crypt
> [...]
> Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 280g 0:00:00:03 DONE (2018-10-11 21:01) 80.92g/s 3283Kp/s 3283Kc/s 1468MC/s 1alyssak#..Tom_Ere_2k9#
> 
> The number 280 is correct - you can also see it for a similar test in
> the posting about sha256crypt referenced above.  But 3 seconds is too
> quick for a speed measurement, and transferring a word from host over
> USB for every 10 hashes computed is probably slow.  Let's pretend we
> didn't know the last character is a digit, so we can increase our "mask
> amplifier":
> 
> $ ./john -form=md5crypt-ztex -w=rtop1m -mask='?w?a' -verb=1 pw-md5crypt
> [...]
> Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 280g 0:00:00:29 DONE (2018-10-11 21:04) 9.582g/s 3693Kp/s 3693Kc/s 1271MC/s 004811010124i..-----
> 
> Now we get performance figures closer to the maximum we've seen before.
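
To make the "mask amplifier" concrete: it's the number of candidates
generated on-device per word received from the host.  Below is a
simplified sketch that computes it for masks like those used here.  It
assumes the default ASCII placeholder sizes (?d = 10, ?l = 26, ?u = 26,
?a = 95) and does not handle other placeholders, custom charsets, or
escaped question marks; the function name is hypothetical and this is
not how JtR itself parses masks:

```python
# Placeholder sizes, assuming the default (ASCII) interpretation of masks.
PLACEHOLDER_SIZES = {"d": 10, "l": 26, "u": 26, "a": 95}

def mask_amplifier(mask):
    """Candidates generated per host-supplied word (or in total if no ?w)."""
    factor = 1
    i = 0
    while i < len(mask):
        if mask[i] == "?" and i + 1 < len(mask):
            code = mask[i + 1]
            if code != "w":              # ?w is the host-supplied word itself
                factor *= PLACEHOLDER_SIZES[code]
            i += 2
        else:
            i += 1                       # a literal character contributes 1
    return factor

for mask in ("?w?d", "?w?a", "?w?d?d", "pas?a?a?a?a"):
    print(mask, mask_amplifier(mask))
```

With only 10 candidates per transferred word, USB transfers are more
likely to be the bottleneck than with 95 or 100.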
> 
> A similarly amplifying mask is two digits:
> 
> $ ./john -form=md5crypt-ztex -w=rtop1m -mask='?w?d?d' -verb=1 pw-md5crypt
> [...]
> Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 300g 0:00:00:30 DONE (2018-10-12 20:27) 9.708g/s 3676Kp/s 3676Kc/s 1282MC/s 1alyssak##..-----
> 
> Again, 300 is the correct number here, as confirmed by a run on the
> original HD 7970 (925 MHz):
> 
> $ ./john -form=md5crypt-opencl -w=rtop1m -mask='?w?d?d' -verb=1 pw-md5crypt
> Using default input encoding: UTF-8
> Loaded 512 password hashes with no different salts (md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 300g 0:00:00:55 DONE (2018-10-12 20:35) 5.404g/s 2043Kp/s 2043Kc/s 691495KC/s !ejr8!69..  nam77
> 
> Incidentally, "  nam" is in fact the last line in the "rtop1m" wordlist.
> Unfortunately, the reporting of the range of candidate passwords is
> wrong when using on-device mask (which for this hash type we have on
> FPGA, but not on GPU).

The performance of md5crypt-opencl in JtR jumbo was far below Hashcat's
(except maybe on NVIDIA Fermi and Kepler) for years.  I finally made
some optimizations to it in January, and now we have Hashcat-like
speed on a Tahiti GPU like the above (but in an HD 7990 this time, using
one of its two GPUs for this test, at stock clocks 950 MHz base, 1 GHz
turbo, 1.5 GHz memory; I no longer have the original HD 7970 handy):

$ ./john -form=md5crypt-opencl -w=rtop1m -mask='?w?d?d' -verb=1 pw-md5crypt
Using default input encoding: UTF-8
Loaded 512 password hashes with no different salts (md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
100g 0:00:00:12 35.71% (ETA: 10:54:55) 8.264g/s 3379Kp/s 3379Kc/s 1482MC/s kiarag95..ilovemom.08
300g 0:00:00:33 DONE (2019-04-01 10:54) 8.960g/s 3384Kp/s 3384Kc/s 1153MC/s .maggie.61..  nam77
Session completed

Also, grouping of candidate passwords by length became more desirable:

$ awk '{ print length, $0 }' < rtop1m | sort -n | cut -d' ' -f2- > rtop1m-by-length
$ ./john -form=md5crypt-opencl -w=rtop1m-by-length -mask='?w?d?d' -verb=1 pw-md5crypt
Using default input encoding: UTF-8
Loaded 512 password hashes with no different salts (md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:01 2.53% (ETA: 11:02:36) 0g/s 3191Kp/s 3191Kc/s 1633MC/s keith50..mhy1531
300g 0:00:00:02 5.03% (ETA: 11:02:36) 140.8g/s 3446Kp/s 3446Kc/s 1395MC/s 07161623..09232570
300g 0:00:00:05 13.33% (ETA: 11:02:34) 59.05g/s 3715Kp/s 3715Kc/s 1083MC/s amagic65..asia0788
300g 0:00:00:09 24.65% (ETA: 11:02:33) 32.96g/s 3802Kp/s 3802Kc/s 972379KC/s sherry61..sprike87
300g 0:00:00:18 50.89% (ETA: 11:02:32) 16.55g/s 3645Kp/s 3645Kc/s 856686KC/s MEXICO1666..Yayarea984
300g 0:00:00:31 DONE (2019-04-01 11:02) 9.422g/s 3557Kp/s 3557Kc/s 820276KC/s teamojosema61..zzzzzzzzzzz77
Session completed

The peak of ~3800K c/s is the average for lengths of up to 8 inclusive.  Then
speeds start to decrease for higher lengths, but even then the final
average remains higher than it was for wordlist not sorted by length.
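
For those who prefer not to use the awk pipeline, here's a rough Python
equivalent of the grouping by length.  It isn't byte-for-byte identical
to the above: Python's sort is stable, so words of equal length keep
their original relative order, whereas sort(1) without -s may reorder
them, and len() counts characters, which may differ from awk's length
depending on locale.  The function name is mine:

```python
def group_by_length(words):
    """Return words ordered by increasing length; ties keep input order."""
    return sorted(words, key=len)

sample = ["delta", "ab", "xyz", "cd", "a"]
print(group_by_length(sample))
```

Since the wordlist here is already ordered by decreasing popularity,
preserving the original order within each length is arguably a nice
property to have.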

Of course, modern GPUs perform way faster, but this shows a comparison
against the most suitable GPU contemporary to those ZTEX boards.
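
To compare the runs above without eyeballing the status lines, one
could extract the c/s figure with a small script.  This is only a
sketch: the helper name is hypothetical, the regular expression is
tailored to the status lines shown in this message, and K/M/G are
treated as decimal multipliers as in the scaling calculation quoted
earlier.  Also keep in mind it's four FPGA boards vs. one GPU:

```python
import re

_RATE = re.compile(r"(\d+(?:\.\d+)?)([KMG]?)c/s")   # lowercase c/s only
_SCALE = {"": 1, "K": 1_000, "M": 1_000_000, "G": 1_000_000_000}

def cps(status_line):
    """Extract the c/s rate from a status line as a float."""
    m = _RATE.search(status_line)
    if m is None:
        raise ValueError("no c/s figure found")
    return float(m.group(1)) * _SCALE[m.group(2)]

fpga = "300g 0:00:00:30 DONE (2018-10-12 20:27) 9.708g/s 3676Kp/s 3676Kc/s 1282MC/s 1alyssak##..-----"
gpu_old = "300g 0:00:00:55 DONE (2018-10-12 20:35) 5.404g/s 2043Kp/s 2043Kc/s 691495KC/s !ejr8!69..  nam77"
gpu_new = "300g 0:00:00:33 DONE (2019-04-01 10:54) 8.960g/s 3384Kp/s 3384Kc/s 1153MC/s .maggie.61..  nam77"

for name, line in (("old OpenCL", gpu_old), ("new OpenCL", gpu_new)):
    print(f"ZTEX x4 vs {name}: {cps(fpga) / cps(line):.2f}x")
```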

> Now to phpass tests.  Four boards, mask only:
> 
> $ cat pw-phpass
> $P$9saltstriXeNc.xV8N.K9cTs/XEn13.
> $ ./john -form=phpass-ztex -mask='a?l?l?l?l?l?l' pw-phpass
> [...]
> Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
> Cost 1 (iteration count) is 2048 for all loaded hashes
> Press 'q' or Ctrl-C to abort, almost any other key for status
> abcdefg          (?)
> 1g 0:00:01:53 DONE (2018-10-11 21:42) 0.008816g/s 1881Kp/s 1881Kc/s 1881KC/s abcdefg..a###eqg
> 
> As expected, this is roughly one half of md5crypt's speed because the
> number of iterations of MD5 is increased from 1000 to 2048.  (It can
> vary between different phpass hashes.  It's just that 2048 is used for
> benchmarks for historical reasons.  This is also the value that phpBB3
> uses, whereas WordPress uses 8192 - thus, cracking of WordPress phpass
> hashes is 4 times slower yet - but is also supported on FPGA now.)
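
The iteration count is encoded in the hash itself: the character right
after "$P$" (or "$H$") is the base-2 logarithm of the count in phpass'
64-character alphabet.  Here's a minimal sketch that decodes it; the
function name is mine, and the slowdown it prints is merely the ratio
of iteration counts, not a measured speed:

```python
ITOA64 = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def phpass_iterations(hash_str):
    """Decode the iteration count from a phpass "portable hash" string."""
    if not hash_str.startswith(("$P$", "$H$")) or len(hash_str) < 4:
        raise ValueError("not a phpass portable hash")
    log2_count = ITOA64.index(hash_str[3])   # ValueError if not in alphabet
    return 1 << log2_count

h = "$P$9saltstriXeNc.xV8N.K9cTs/XEn13."     # the test hash from pw-phpass
count = phpass_iterations(h)
print(count, "iterations;",
      f"a count of 8192 would be {8192 // count}x slower per candidate")
```

For the hash above this gives 2048, matching the "Cost 1" line that
JtR prints.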
> 
> It's possible to implement phpass "portable hashes" on FPGA slightly
> more efficiently, without bothering with the soft CPUs and reclaiming
> their area for more MD5 cores, since it's a much simpler algorithm than
> md5crypt.  However, we preferred to get the implementation almost for
> free on top of the md5crypt design, by having phpass implemented as a
> different entry point into the soft CPU program running on the exact
> same hardware design (same bitstream).  (It's the exact same approach we
> used for having sha512crypt and Drupal7 share the hardware design.)
> 
> Hybrid with somewhat low mask amplifier (one letter):
> 
> $ ./john -form=phpass-ztex -inc=lower -mask='?w?l' -min-len=7 -max-len=7 pw-phpass
> [...]
> Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
> Cost 1 (iteration count) is 2048 for all loaded hashes
> Press 'q' or Ctrl-C to abort, almost any other key for status
> abcdefg          (?)
> 1g 0:00:00:05 DONE (2018-10-11 21:43) 0.1904g/s 1797Kp/s 1797Kc/s 1797KC/s abcdefg..llabat#
> 
> No mask, every candidate password is transferred over USB from host:
> 
> $ ./john -form=phpass-ztex -inc=lower -min-len=7 -max-len=7 pw-phpass
> [...]
> Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
> Cost 1 (iteration count) is 2048 for all loaded hashes
> Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
> Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
> Press 'q' or Ctrl-C to abort, almost any other key for status
> abcdefg          (?)
> 1g 0:00:00:01 DONE (2018-10-11 21:43) 0.7812g/s 1228Kp/s 1228Kc/s 1228KC/s abcdefg..lyziesi
> 
> The c/s rate is lower by a third, but the attack duration is a lot
> lower due to the more optimal ordering of candidate passwords.  In fact,
> the attack duration is still low even if we don't specify the password
> length (let incremental mode try different lengths):
> 
> $ ./john -form=phpass-ztex -inc=lower pw-phpass
> [...]
> Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
> Cost 1 (iteration count) is 2048 for all loaded hashes
> Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
> Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
> Press 'q' or Ctrl-C to abort, almost any other key for status
> abcdefg          (?)
> 1g 0:00:00:02 DONE (2018-10-11 21:45) 0.4149g/s 1305Kp/s 1305Kc/s 1305KC/s abcdefg..shunnas
> 
> And it's essentially zero if we let JtR use its default wordlist, as
> this is a common password:
> 
> $ ./john -form=phpass-ztex pw-phpass
> [...]
> Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
> Cost 1 (iteration count) is 2048 for all loaded hashes
> Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
> Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
> Press 'q' or Ctrl-C to abort, almost any other key for status
> abcdefg          (?)
> 1g 0:00:00:00 DONE 2/3 (2018-10-11 21:45) 1.694g/s 265840p/s 265840c/s 265840C/s abcdefg..Sssing
> 
> This shows that while raw performance is important, being smart is even
> more important.

Alexander