Date: Sun, 25 Jun 2017 19:07:53 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: apingis@...nwall.net
Subject: bcrypt cracking on ZTEX 1.15y FPGA boards (bcrypt-ztex)

Hi,

After last year's work on descrypt-ztex:

http://www.openwall.com/lists/john-users/2016/11/06/1

Denis proceeded to work on bcrypt-ztex this year.  We had listed this as
planned future work on Katja's project in 2014:

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/

but unfortunately didn't resume that project until this year.  I guess
better late than never, especially given that the results achieved are
still good even by modern standards (relative to current GPUs), despite
those ZTEX 1.15y boards being rather old by now.  As far as I can
tell, Denis' implementation is brand new, not building upon Katja's,
although our past experience was of some indirect help.

We finally got the bcrypt-ztex format into bleeding-jumbo this week.
For technical detail on the implementation, you may read:

https://github.com/magnumripper/JohnTheRipper/commit/4c37300e32c5b8c47e34be3a0b28a94ecd30da2a#diff-af56e15c23e8e70150ed23cb93cbae6fR1

The speed is ~106k c/s at bcrypt cost 5 on ZTEX 1.15y without
overclocking, ~114k with overclocking.  It should scale almost linearly
with multiple boards (e.g. Denis reported ~103k c/s/board with 3 boards
on the same host).  I can't easily measure the power consumption right
now, but I estimate it's ~20W as both the board (with a large but slowly
rotating cooling fan) and the 12V, 5A power adapter (brick) stay barely
warm to the touch.  These used to get much warmer in Bitcoin mining
tests (known to be ~40W).

For comparison, according to Jeremi M Gosney's testing, hashcat achieves
~23k c/s per GPU at bcrypt cost 5 on GTX 1080 Ti:

https://gist.github.com/epixoip/ace60d09981be09544fdd35005051505

Hashtype: bcrypt $2*$, Blowfish (Unix)

Speed.Dev.#1.....:    23223 H/s (37.63ms)
Speed.Dev.#2.....:    22953 H/s (38.08ms)
Speed.Dev.#3.....:    22958 H/s (38.05ms)
Speed.Dev.#4.....:    22821 H/s (38.30ms)
Speed.Dev.#5.....:    23025 H/s (37.89ms)
Speed.Dev.#6.....:    23266 H/s (37.60ms)
Speed.Dev.#7.....:    23342 H/s (37.41ms)
Speed.Dev.#8.....:    23209 H/s (37.62ms)
Speed.Dev.#*.....:   184.8 kH/s

Thus, these FPGAs from several years back are slightly faster at bcrypt
than this year's top GPUs, per chip.  The four-chip ZTEX 1.15y is
slightly faster at bcrypt than four GTX 1080 Ti cards, while consuming
10+ times less power.  (I suspect the GPUs are far from reaching their
peak power usage on this test, hence the conservative 10+ figure.)
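
As a rough sanity check of the per-chip comparison, the arithmetic can
be sketched in Python.  This uses only figures quoted in this message
(the ~106k c/s board speed and the hashcat total for 8 GPUs); the
variable names are illustrative.  Power is deliberately left out, since
the GPU power draw on this test was not measured.

```python
# Per-chip bcrypt cost 5 throughput, from figures quoted in this message.
FPGA_BOARD_CPS = 106_000     # ZTEX 1.15y at stock clocks (approximate)
FPGA_CHIPS_PER_BOARD = 4     # four FPGAs per board
GPU_TOTAL_CPS = 184_800      # hashcat total across all devices
GPU_COUNT = 8                # Speed.Dev.#1 .. #8 in the benchmark

fpga_per_chip = FPGA_BOARD_CPS / FPGA_CHIPS_PER_BOARD
gpu_per_chip = GPU_TOTAL_CPS / GPU_COUNT
ratio = fpga_per_chip / gpu_per_chip

print(f"FPGA: {fpga_per_chip:.0f} c/s per chip")
print(f"GPU:  {gpu_per_chip:.0f} c/s per chip")
print(f"FPGA/GPU per-chip ratio: {ratio:.2f}")
```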

This doesn't mean these FPGAs are so fast and those GPUs are so slow.
Rather, it means that bcrypt is a better fit for FPGAs than for GPUs.

Now to the setup and testing:

To build JtR bleeding-jumbo with ZTEX 1.15y board support, install
libusb (e.g., the libusb-devel package on Fedora) in addition to jumbo's
usual dependencies.  Then use "./configure --enable-ztex".  The rest of
the build is as usual for jumbo.

To access a ZTEX board as non-root (and you shouldn't build or run JtR
as root) on a Linux system with udev, add this:

ATTRS{idVendor}=="221a", ATTRS{idProduct}=="0100", SUBSYSTEMS=="usb", ACTION=="add", MODE="0660", GROUP="ztex"

e.g. to /etc/udev/rules.d/99-local.rules (create the file if it does not
exist).  Then issue these commands as root:

groupadd ztex
usermod -a -G ztex user # where "user" is your non-root username
systemctl restart systemd-udevd # or "service udev restart" if without systemd

In order to trigger udev to set the new permissions, (re)connect the
device after this point.
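
To double-check the result after reconnecting the board, something like
the following Python sketch may help.  It is only an illustration: the
helper names are made up here, and the bus/device numbers have to be
taken from lsusb or from the "ZTEX ... bus:N dev:M" line that john
prints.  It assumes the usual Linux layout of USB device nodes under
/dev/bus/usb.  Note that a newly added group membership normally only
takes effect after logging in again.

```python
import grp
import os


def usb_node_path(bus: int, dev: int) -> str:
    """Device node for a USB device (usual Linux layout, assumed here)."""
    return f"/dev/bus/usb/{bus:03d}/{dev:03d}"


def in_group(name: str) -> bool:
    """True if the current process is a member of the named group."""
    try:
        gid = grp.getgrnam(name).gr_gid
    except KeyError:  # no such group on this system
        return False
    return gid == os.getegid() or gid in os.getgroups()


def can_open(bus: int, dev: int) -> bool:
    """True if the node exists and is readable and writable by us."""
    return os.access(usb_node_path(bus, dev), os.R_OK | os.W_OK)


if __name__ == "__main__":
    # bus 2, dev 19 are example values; substitute your own
    print("in group ztex:", in_group("ztex"))
    print("node:", usb_node_path(2, 19), "accessible:", can_open(2, 19))
```

If the first check fails, the group change has probably not been picked
up by the current session yet; if only the second fails, the udev rule
likely did not match or the device was not reconnected.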

If you use a common Linux distro like Ubuntu or Fedora, the above should
be sufficient.  In my case this time, the system is Fedora in a Qubes OS
VM, so I have to use USB passthrough.  Moreover, I didn't want to pass
the entire USB controller into the VM, so the data is being proxied
through two userspace processes: one in the VM with JtR, and the other
in sys-usb.  It's a setup supported by Qubes.  No customizations other
than enabling the passthrough:

https://www.qubes-os.org/doc/usb/#attaching-a-single-usb-device-to-a-qube-usb-passthrough

There's significant CPU load caused in both of these VMs by such
proxying of the candidate passwords stream, and there must be increased
latency too.  Speeds would probably be slightly higher if I ran the same
tests without use of VMs.  In a way, it's amazing this works at all and
shows decent speeds.

Denis' implementation works around our current synchronous crypt_all()
API by buffering a large number of candidate passwords - many times
larger than the number of cores.  The current design has 124 bcrypt
cores per chip, so 496 per board.  My tests are with "TargetSetting = 5"
(tuning for bcrypt cost 5) in the "[ZTEX:bcrypt]" section in john.conf,
and this results in:

0:00:00:00 - Candidate passwords will be buffered and tried in chunks of 63488

appearing in john.log.  The number 63488 is 496*128.  This buffering is
similar to what GPUs commonly require, albeit for different reasons
(greater concurrency and pipelining on GPUs vs. hiding communication
latency to these FPGA boards).  Either way, this has usability and
efficiency drawbacks when you interrupt/restore a session (especially
with large salt count), but it results in nearly optimal c/s rate
despite the synchronous API and the USB latency (especially in my
testing in a VM).
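
The chunk size arithmetic can be sketched as follows.  The 128
candidates per core is simply what 63488 / 496 works out to with this
TargetSetting, and the helper names are illustrative, not taken from the
JtR source.

```python
CORES_PER_CHIP = 124   # bcrypt cores per FPGA in the current design
CHIPS_PER_BOARD = 4    # ZTEX 1.15y has four FPGAs
KEYS_PER_CORE = 128    # inferred: 63488 / 496, with TargetSetting = 5


def chunk_size() -> int:
    """Candidates buffered per chunk for one board."""
    return CORES_PER_CHIP * CHIPS_PER_BOARD * KEYS_PER_CORE


def seconds_per_chunk(cps: float, salts: int = 1) -> float:
    """Rough time to try one chunk against `salts` salts at `cps` c/s."""
    return chunk_size() * salts / cps


print(chunk_size())
print(round(seconds_per_chunk(106_000), 2))
```

At ~106k c/s and a single salt, one chunk takes roughly 0.6 seconds, so
that is about the granularity at which work can be lost on interrupt in
the single-salt case.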

Here is a test run:

$ ./john -form=bcrypt-ztex -mask='tes?l?l?l?l?l' -u=u2781-bf pw-fake-unix
ZTEX XXXXXXXXXX bus:2 dev:19 Frequency:141 141 141 141 
Using default input encoding: UTF-8
Loaded 1 password hash (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:01 1.18% (ETA: 22:05:58) 0g/s 105720p/s 105720c/s 105720C/s tesaaata..tesaaota
0g 0:00:00:05 4.73% (ETA: 22:06:19) 0g/s 106521p/s 106521c/s 106521C/s tesaaale..tesaaole
0g 0:00:00:17 15.38% (ETA: 22:06:24) 0g/s 106583p/s 106583c/s 106583C/s tesaaaan..tesaaoan
0g 0:00:00:34 30.77% (ETA: 22:06:24) 0g/s 106614p/s 106614c/s 106614C/s tesaaaat..tesaaoat
testtest         (u2781-bf)
1g 0:00:00:35 DONE (2017-06-24 22:05) 0.02807g/s 106581p/s 106581c/s 106581C/s testtest..tes###st
Use the "--show" option to display all of the cracked passwords reliably
Session completed

This is at 141 MHz, which per the design tools is guaranteed to work.
As you can see, the speed is about 106.6k c/s.
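
For reference, the size of this mask's keyspace and the time a full scan
would take at this speed can be estimated as below.  The parser is a
deliberately minimal sketch that only understands ?l (26 lowercase
letters) and literal characters; real JtR masks support much more, and
the function name is made up for this example.

```python
LOWERCASE = 26  # size of the ?l character class


def mask_keyspace(mask: str) -> int:
    """Count candidates for a mask made of literals and ?l only."""
    total = 1
    i = 0
    while i < len(mask):
        if mask[i] == "?" and i + 1 < len(mask):
            if mask[i + 1] != "l":
                raise ValueError("only ?l is handled in this sketch")
            total *= LOWERCASE
            i += 2          # skip the two-character placeholder
        else:
            i += 1          # a literal contributes a factor of 1
    return total


keyspace = mask_keyspace("tes?l?l?l?l?l")
full_scan_seconds = keyspace / 106_581   # c/s from the final status line
print(keyspace, round(full_scan_seconds))
```

This works out to 26^5 candidates and a bit under two minutes for the
whole mask, so finding the password after 35 seconds means it happened
to sit in the first third of the keyspace.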

Now hybrid mode, combining mask (in this case simply having it give the
known 3 characters verbatim) with incremental mode (thus, necessarily
feeding the candidate passwords from host):

$ ./john -form=bcrypt-ztex -mask='tes?w' -inc=lower -min-len=8 -max-len=8 -u=u2781-bf pw-fake-unix
ZTEX XXXXXXXXXX bus:2 dev:19 Frequency:141 141 141 141 
Using default input encoding: UTF-8
Loaded 1 password hash (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:02 2.14% (ETA: 22:07:51) 0g/s 102814p/s 102814c/s 102814C/s tesnivfm..tesjrkto
testtest         (u2781-bf)
1g 0:00:00:04 DONE (2017-06-24 22:06) 0.2331g/s 103593p/s 103593c/s 103593C/s testtest..tesfedal
Use the "--show" option to display all of the cracked passwords reliably
Session completed

The running time is much shorter (4 seconds instead of 35) due to
incremental mode's more optimal ordering of candidate passwords, even
though the c/s rate dropped to 103.6k (but 4 seconds is too short to
measure this precisely).

Another variation, running against many hashes (and salts) and using
mask mode to double the "words" generated by incremental mode:

$ ./john -form=bcrypt-ztex -mask='?w?w' -inc=lower -min-len=8 -max-len=8 pw-fake-unix
ZTEX XXXXXXXXXX bus:2 dev:19 Frequency:141 141 141 141 
Using default input encoding: UTF-8
Loaded 3107 password hashes with 3107 different salts (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:02  0g/s 0p/s 104078c/s 104078C/s lovelove..lvvllvvl
0g 0:00:00:20  0g/s 0p/s 105297c/s 105297C/s lovelove..lvvllvvl
0g 0:00:00:53  0g/s 0p/s 105398c/s 105398C/s lovelove..lvvllvvl
asdfasdf         (u915-bf)
1g 0:00:01:22  0.01211g/s 0p/s 105415c/s 105415C/s lovelove..lvvllvvl
1g 0:00:03:18  0.005044g/s 0p/s 105375c/s 105375C/s lovelove..lvvllvvl
1g 0:00:04:40  0.003567g/s 0p/s 105307c/s 105307C/s lovelove..lvvllvvl
Use the "--show" option to display all of the cracked passwords reliably
Session aborted

I interrupted this one, but it does show that 105.3k c/s is possible
even with incremental mode and a mask on top of it.
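
The unchanging "lovelove..lvvllvvl" range and the 0p/s in the status
lines are what one would expect from the buffering when many salts are
loaded.  A rough estimate, assuming john tries each buffered chunk
against every loaded salt before requesting the next chunk, and assuming
the chunk size here is the same 63488 as reported in john.log:

```python
CHUNK = 63488      # candidates per chunk (assumed unchanged for this run)
SALTS = 3107       # "3107 different salts" in the output above
CPS = 105_307      # c/s from the last status line

seconds = CHUNK * SALTS / CPS   # time to finish the first chunk
elapsed = 4 * 60 + 40           # session was interrupted at 0:00:04:40

print(f"one chunk against all salts: ~{seconds / 60:.0f} minutes")
print(f"fraction of first chunk done: {elapsed / seconds:.0%}")
```

Under these assumptions the first chunk alone would take about half an
hour, so the session was interrupted well before the second chunk of
candidates was even generated.  This is the kind of interrupt/restore
cost mentioned earlier for large salt counts.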

Now extreme overclocking, setting "Frequency = 163" in the
"[ZTEX:bcrypt]" section in john.conf (it is also possible to set
individual frequencies per FPGA - see the comments in john.conf - but I
did not use this here):

$ ./john -form=bcrypt-ztex -mask='tes?l?l?l?l?l' -u=u2781-bf pw-fake-unix
ZTEX XXXXXXXXXX bus:2 dev:19 Frequency:163 163 163 163 
Using default input encoding: UTF-8
Loaded 1 password hash (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:02 2.37% (ETA: 22:25:57) 0g/s 121213p/s 121213c/s 121213C/s tesaaaka..tesaaoka
0g 0:00:00:08 8.28% (ETA: 22:26:09) 0g/s 122572p/s 122572c/s 122572C/s tesaaani..tesaaoni
0g 0:00:00:18 18.93% (ETA: 22:26:08) 0g/s 122868p/s 122868c/s 122868C/s tesaaaxn..tesaaoxn
0g 0:00:00:26 27.22% (ETA: 22:26:08) 0g/s 122871p/s 122871c/s 122871C/s tesaaais..tesaaois
testtest         (u2781-bf)
1g 0:00:00:31 DONE (2017-06-24 22:25) 0.03224g/s 122425p/s 122425c/s 122425C/s testtest..tes###st
Use the "--show" option to display all of the cracked passwords reliably
Session completed

This worked here (163 MHz is actually the maximum that does, with
higher values failing even this quick test), achieving 122.4k c/s.
However, more thorough testing shows this design and board are unstable
at this high frequency, so I didn't quote it above.  The highest that
works reliably for me so far is 152 MHz, where the tests below crack all
of the 239 short passwords, as they should, 7 times in a row:

$ egrep '^([^:]*:){4}[a-z]{4}:' pw-fake-unix > pw-fake-len4
$ for n in `seq 1 7`; do rm john.pot; ./john -form=bcrypt-ztex -mask='?l?l?l?l' -verb=1 pw-fake-len4; done
ZTEX XXXXXXXXXX bus:2 dev:25 Frequency:152 152 152 152 
Using default input encoding: UTF-8
Loaded 239 password hashes with 239 different salts (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
239g 0:00:06:39 N/A 0.5981g/s 1143p/s 114625c/s 114625C/s alex..###q
Session completed
ZTEX XXXXXXXXXX bus:2 dev:25 Frequency:152 152 152 152 
Using default input encoding: UTF-8
Loaded 239 password hashes with 239 different salts (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
239g 0:00:06:40 N/A 0.5972g/s 1141p/s 114458c/s 114458C/s alex..###q
Session completed
ZTEX XXXXXXXXXX bus:2 dev:25 Frequency:152 152 152 152 
Using default input encoding: UTF-8
Loaded 239 password hashes with 239 different salts (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
239g 0:00:06:39 N/A 0.5980g/s 1143p/s 114613c/s 114613C/s alex..###q
Session completed
ZTEX XXXXXXXXXX bus:2 dev:25 Frequency:152 152 152 152 
Using default input encoding: UTF-8
Loaded 239 password hashes with 239 different salts (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
239g 0:00:06:39 N/A 0.5976g/s 1142p/s 114527c/s 114527C/s alex..###q
Session completed
ZTEX XXXXXXXXXX bus:2 dev:25 Frequency:152 152 152 152 
Using default input encoding: UTF-8
Loaded 239 password hashes with 239 different salts (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
239g 0:00:06:39 N/A 0.5976g/s 1142p/s 114542c/s 114542C/s alex..###q
Session completed
ZTEX XXXXXXXXXX bus:2 dev:25 Frequency:152 152 152 152 
Using default input encoding: UTF-8
Loaded 239 password hashes with 239 different salts (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
239g 0:00:06:39 N/A 0.5977g/s 1142p/s 114550c/s 114550C/s alex..###q
Session completed
ZTEX XXXXXXXXXX bus:2 dev:25 Frequency:152 152 152 152 
Using default input encoding: UTF-8
Loaded 239 password hashes with 239 different salts (bcrypt-ztex [Blowfish ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
239g 0:00:06:40 N/A 0.5971g/s 1141p/s 114444c/s 114444C/s alex..###q
Session completed

So that's 114.5k c/s at the maximum stable overclock here.  I must admit
this board is 10% overvolted (extra resistors soldered on by the previous
owner), but in Bitcoin mining tests this only provided a 1% increase in
maximum reasonable clock rates (vs. other non-overvolted boards), so
it's probably similar here.  Denis' boards are not overvolted, but he
mentioned getting similar maximum stable clocks and speeds.  YMMV.
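
To put the three clock rates side by side, here is a small Python
sketch that averages the seven 152 MHz runs and computes c/s per MHz for
each frequency.  Keep in mind the workloads differ (the 141 and 163 MHz
figures are from the single-hash mask runs, the 152 MHz one is from the
239-salt runs), so this is only an approximate comparison.

```python
from statistics import mean

# c/s from the seven 152 MHz runs above
runs_152 = [114625, 114458, 114613, 114527, 114542, 114550, 114444]

# frequency (MHz) -> c/s; 141 and 163 are the single-hash mask runs
cps_by_mhz = {141: 106581, 152: mean(runs_152), 163: 122425}

for mhz, cps in cps_by_mhz.items():
    print(f"{mhz} MHz: {cps:.0f} c/s, {cps / mhz:.1f} c/s per MHz")

# relative difference between the fastest and slowest 152 MHz run
spread = (max(runs_152) - min(runs_152)) / mean(runs_152)
print(f"run-to-run spread at 152 MHz: {spread:.2%}")
```

The c/s per MHz figure stays within about 1% across the three clock
rates, which suggests the speed scales nearly linearly with frequency
and is not yet limited by the host or USB link at these rates.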

If you test our *-ztex formats as well, please share your feedback.
In case you'd like to reproduce these results, our pw-fake-unix is
available at:

http://openwall.info/wiki/john/sample-hashes#Sample-password-hash-files

Also see this recent reply on what else we could implement on FPGAs:

http://www.openwall.com/lists/john-users/2017/05/31/2

And this Twitter poll/thread:

https://twitter.com/solardiz/status/876087192573104128

PBKDF2-HMAC-SHA* won, and we'll likely have it a few months from now.
This means things like WPA and dmg.

Another target we intend to explore is AWS F1, but we don't have
anything ready yet.  F1 turned out to be reasonably priced: $1.65/hour
per FPGA on demand, with the spot price currently at ~$0.18/hour (I
guess there's not much demand yet):

https://aws.amazon.com/ec2/instance-types/f1/
https://aws.amazon.com/ec2/pricing/on-demand/
https://aws.amazon.com/ec2/spot/pricing/ (choose N. Virginia)

Alexander
