Date: Mon, 3 Jul 2017 19:14:01 +0200
From: Solar Designer <>
Cc: Denis Burykin <>
Subject: Re: bcrypt cracking on ZTEX 1.15y FPGA boards (bcrypt-ztex)

On Mon, Jul 03, 2017 at 07:12:12AM -0800, Royce Williams wrote:
> I now have two more boards for a total of 16, so adjust any
> calculations accordingly.

Great.  Thank you for providing these benchmarks!

> > Denis' implementation works around our current synchronous crypt_all()
> > API by buffering a large number of candidate passwords - many times
> > larger than the number of cores.  The current design has 124 bcrypt
> > cores per chip, so 496 per board.  My tests are with "TargetSetting = 5"
> > (tuning for bcrypt cost 5) in the "[ZTEX:bcrypt]" section in john.conf,
> > and this results in:
> >
> > 0:00:00:00 - Candidate passwords will be buffered and tried in chunks of 63488
> I wasn't paying a lot of attention to it at the time, but looking at
> john.log, unless I've lost track of something, my value was:
> 0:00:00:00 - Candidate passwords will be buffered and tried in chunks of 262140
> ... for values of both 5 and 6 for TargetSetting.

Yes, in your case TargetSetting shouldn't matter, because you have so
many boards that the value is capped anyway.  But you could try hacking
this cap in the source, in ztex_bcrypt.c:

        262140, // Absolute max. keys/crypt_all_interval for all devices.

Try setting it to 2031616 (as 63488*32), and then TargetSetting will
make a difference.

Denis - by the way, 262140 isn't even a multiple of 496 (core count per
board) - perhaps that's wrong and should be fixed.
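The arithmetic behind these constants can be sanity-checked quickly (a rough sketch; the numbers are taken from the messages above, and the chip/board core counts from the earlier description of the design):

```python
# Current absolute cap on keys per crypt_all interval across all devices
# (the constant in ztex_bcrypt.c quoted above).
cap = 262140

# Per-chip and per-board core counts: 124 bcrypt cores per chip,
# 4 FPGAs per ZTEX 1.15y board.
cores_per_chip = 124
chips_per_board = 4
cores_per_board = cores_per_chip * chips_per_board  # 496

# The cap is indeed not a multiple of the per-board core count.
print(cap % cores_per_board)  # non-zero remainder

# Proposed replacement cap: 32x the single-board chunk size of 63488,
# which is itself 496 * 128 and so divides evenly across boards.
proposed_cap = 63488 * 32
print(proposed_cap, proposed_cap % cores_per_board)
```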

> My first tests were with all 16 boards.
> The first test used the default john.conf [ZTEX:bcrypt] TargetSetting
> = 6 value, with john compiled with the keys_per_crypt *= 2 tweak:

The "keys_per_crypt *= 2 tweak" probably didn't matter because of the
cap above.

> $ ./john -format=bcrypt-ztex -inc=lower -min-len=8 -max-len=8
> -mask='?w?l?l?l?l' pw-fake-unix
> Loaded 3107 password hashes with 3107 different salts (bcrypt-ztex
> [Blowfish ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:01:54  0g/s 0p/s 1609Kc/s 1609KC/s loveaaaa..loveioia

> 0g 0:01:00:56  0g/s 260.3p/s 1613Kc/s 1613KC/s lovaaani..lovaioli
> 0g 0:01:13:00 0.00% (ETA: 2032-09-23 00:29) 0g/s 434.5p/s 1613Kc/s
> 1613KC/s lolaaatn..lolaiocn

So this is 101k per board, or about 6% lower than we get with 1 board.
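As a quick sketch of where that figure comes from (cluster rate from the status line above, single-board rate from the single-board run quoted later in this message):

```python
cluster_rate = 1_613_000            # c/s across all 16 boards
boards = 16
per_board = cluster_rate / boards   # ~100.8k c/s per board

single_board = 107_199              # c/s from the single-board run
loss = 1 - per_board / single_board # ~6% scaling overhead
print(f"{per_board:.0f} c/s per board, {loss:.1%} below single-board")
```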

> That test ran at ~505W / 16 = ~31.6W per board, which includes the
> power for the onboard fans. The power consumption actually jumps
> around quite a bit between 495W and 515W, but 505W seemed about
> average.

Cool.  The fluctuation is probably in part because the FPGAs are at
times left idle, sometimes several of them at once.

> The second test was with 16 boards, changing to TargetSetting = 5, and
> still with keys_per_crypt *= 2:

> 0g 0:00:41:28 0.00% (ETA: 2032-02-20 19:37)
>     0g/s 452.0p/s 1632Kc/s 1632KC/s lovaaaay..lovaaidy
> For that test, I'd say that power was very slightly higher, maybe
> averaging 510W, so ~31.9W per board. But this might be normal
> variation.

Yes, this should be normal variation.

> So across the cluster, with known tweaks and settings without
> overclocking, I'm getting 1.632Mc/s for 510W.

That's 102k per board in this test.

> Next, here are single-board versions of both tests, using the same
> board. (I did this by disconnecting the other boards. Is there a way
> to tell john to only use a specific device?)

I think there's no such way currently.  I think we should add that.

> First, TargetSetting = 5, keys_per_crypt *= 2:
> $ ./john -format=bcrypt-ztex -inc=lower -min-len=8 -max-len=8
> -mask='?w?l?l?l?l' pw-fake-unix
> SN XXXXXXXXXX: firmware uploaded
> SN XXXXXXXXXX: uploading bitstreams.. ok
> ZTEX XXXXXXXXXX bus:1 dev:72 Frequency:141 141 141 141
> Using default input encoding: UTF-8
> Loaded 3107 password hashes with 3107 different salts (bcrypt-ztex
> [Blowfish ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:00:14  0g/s 0p/s 106815c/s 106815C/s loveaaaa..loveaaoa
> 0g 0:00:03:12  0g/s 0p/s 107169c/s 107169C/s loveaaaa..loveaaoa

> 0g 0:00:24:07  0g/s 0p/s 107199c/s 107199C/s loveaaaa..loveaaoa

OK, this matches my results.

> Then I enabled the full cluster again.
> Here are all 16 boards again, with TargetSetting = 5, the
> keys_per_crypt *= 2 tweak, and Frequency = 152.

The overclock doesn't appear to have made much of a difference (about an
8% overclock, 152 vs. the default 141 MHz, but only 1% speedup - and
that's with trying to use a 17th board).  Maybe this is because of the
keys_per_crypt cap - so please hack that as above and re-test.

Also, you seem to have excluded the portions of JtR output where it
reports the clock rates - please include those going forward.

> During this test, I was also trying to coax a 17th board into
> usability. I include this test anyway because there appears to have
> been a slight (temporary?) drop in performance associated with the
> attempt to talk to that board (or it might be a coincidence; I will
> test further to check this correlation):

This is interesting as a test of JtR's ability to recover from board
failure, but for further benchmarks please just use the 16 boards you
have working reliably.

> And finally, a more focused example - all 16 boards, a single
> artificial hash, with bcrypt work factor 12, with the same tweaks:
> $ cat single-bf.hash
> $2a$12$S7H1VijH5FFkU/1bWeM98ObKGC6BwfjNnhsPFs3U88yNbYSphoTp.
> $ ./john -format=bcrypt-ztex -inc=lower -min-len=8 -max-len=8
> -mask='?w?l?l?l?l' --progress-every=300 single-bf.hash
> Using default input encoding: UTF-8
> Loaded 1 password hash (bcrypt-ztex [Blowfish ZTEX])
> Press 'q' or Ctrl-C to abort, almost any other key for status
> 0g 0:00:00:12 0.00% (ETA: 2017-12-16 07:45)
>     0g/s 14299p/s 14299c/s 14299C/s loveisxm..lovehjfc
> 0g 0:00:05:00 0.00% (ETA: 2017-12-17 12:58)
>     0g/s 14422p/s 14422c/s 14422C/s laliawhy..lalidtdh

> 0g 0:01:01:00 0.03% (ETA: 2017-12-17 18:31)
>     0g/s 14419p/s 14419c/s 14419C/s johoiouh..johohbdu
> This pulled about 560W from the wall.

Looks good.  This would be something like 1.85M total or 115k per board
if scaled to bcrypt work factor 5.  Is this at standard clocks or o/c?
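The scaling here assumes each +1 in bcrypt cost doubles the work, so work factor 12 is 2^7 = 128 times as expensive as work factor 5:

```python
rate_cost12 = 14419                     # c/s at work factor 12, 16 boards
scale = 2 ** (12 - 5)                   # each cost step doubles the work: 128
rate_cost5_equiv = rate_cost12 * scale  # ~1.85M c/s cost-5 equivalent
per_board = rate_cost5_equiv / 16       # ~115k c/s per board
print(rate_cost5_equiv, per_board)
```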

> I tried to compare this to john on my general-purpose GPU system
> (which isn't working the way I expect it to, as it appears to only be
> using one GPU. Not sure what I'm doing wrong yet):

This may very well be broken in JtR right now.  We implemented
bcrypt-opencl as an experiment and optimized it a little bit back when
HD 7970 was a current card.  We haven't tried tuning it for NVIDIA
Maxwell and Pascal yet - perhaps we should.  And bcrypt cost 12 is hard
for GPUs without what we call a split kernel, which I think we lack for
this format.  With high bcrypt cost settings, single kernel invocations
may be taking too long, resulting in timeouts.  This doesn't explain why
things appear to be working (even if suboptimally) with one of your GPUs
but not with the rest, though - we'll need to debug that and fix it.
Please open a GitHub issue for this.  Thanks!

> 1 0g 0:00:01:16 0.00% (ETA: 2037-12-19 05:32) 0g/s 53.38p/s 53.38c/s
> 53.38C/s GPU:34C lilluela..lilleoya
> ... but maybe all six GPUs would run at 53.38c/s x 6 = 320c/s?

Yes, perhaps.  Or more with some tuning, as you see with hashcat.  IIRC,
on AMD GCN, JtR's bcrypt-opencl and hashcat's have similar performance,
but on NVIDIA Maxwell and Pascal hashcat's is much faster.  We didn't
care much because bcrypt cracking is generally done on CPUs anyway, but
with newer NVIDIA cards showing not so poor bcrypt speeds (compared to
CPUs) perhaps we should revise/tune our code.  Please feel free to open
a GitHub issue for that as well.

> I also compared GPU performance with hashcat.

> Speed.Dev.#1.....:      128 H/s (154.07ms)
> Speed.Dev.#2.....:      125 H/s (157.26ms)
> Speed.Dev.#3.....:      127 H/s (154.83ms)
> Speed.Dev.#4.....:      127 H/s (154.09ms)
> Speed.Dev.#5.....:      128 H/s (154.25ms)
> Speed.Dev.#6.....:      126 H/s (155.12ms)
> Speed.Dev.#*.....:      760 H/s

> Returning the GPUs' default max power (180W) made no difference at all
> for a single $12$ bcrypt hash.
> In both cases, the GPU system was pulling 500W from the wall, and the
> GPUs hardly broke a sweat, temperature-wise. There may be ways to get
> more performance from hashcat for this hash type and work factor, but
> that will take some research on my part.
> So if I'm reading this right, for single-hash bcrypt with work factor
> 12, just using my own hardware and techniques to compare, the best
> performance available to me so far on FPGA (14419c/s) is about 19
> times as fast as the best performance I know how to get on my GPU
> system (760H/s), at around the same power consumption:
> FPGA: 14419c/s / 560W = ~25.75c/s/W
> GPU: 760H/s / 500W = 1.52H/s/W
> So for a focused, single-hash attack on a modern target using my own
> gear, FPGA is ~17 times as efficient as GPU?

This sounds about right.  FPGAs' energy-efficiency advantage at bcrypt
used to be greater than this at the time of Katja's work in 2013-2014,
but NVIDIA's Maxwell and Pascal GPUs have since closed much of that gap
with those FPGAs.  Of course, there are also newer FPGAs.
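A quick check of that efficiency ratio from the numbers reported above (wall-power figures as measured, so fans and host overhead are included on both sides):

```python
fpga_rate, fpga_watts = 14419, 560  # c/s and wall power, 16 FPGA boards
gpu_rate, gpu_watts = 760, 500      # H/s and wall power, 6 GPUs

fpga_eff = fpga_rate / fpga_watts   # ~25.7 c/s/W
gpu_eff = gpu_rate / gpu_watts      # 1.52 H/s/W
print(fpga_eff / gpu_eff)           # ~16.9, i.e. roughly 17x
```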

> I will also do some testing without the keys_per_crypt *= 2 tweak, and
> with different keys_per_crypt values, but I wanted to get this posted.

Thanks.  The keys_per_crypt tweaks shouldn't matter until you lift the
cap, so please do that first.

It would also be nice to see some bcrypt benchmarks that actually crack
passwords - including directly comparable to those I posted in:

And you may repeat your descrypt tests with the 16 boards (and keeping
the keys_per_crypt *= 2 tweak).  Yours should be on par with Jeremi's
8x1080Ti system at descrypt then.  If you do this, please post those to
the descrypt thread.

It would also be interesting to see how the power usage increases with
5% overclocks for both bcrypt and descrypt.  The increase from 510W to
560W you report here isn't that - I think it's primarily a result of
going to the higher bcrypt work factor, which reduces the overhead and
keeps the bcrypt cores busier.

