john-users - Re: DES-based crypt(3) cracking on ZTEX 1.15y FPGA boards (descrypt-ztex)

Message-ID: <20170630172233.GA6766@openwall.com>
Date: Fri, 30 Jun 2017 19:22:33 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: apingis@...nwall.net
Subject: Re: DES-based crypt(3) cracking on ZTEX 1.15y FPGA boards (descrypt-ztex)

On Thu, Jun 29, 2017 at 09:49:13PM -0800, Royce Williams wrote:
> Fantastic! :)  I'm still being excessively cautious with my
> substandard cooling setup (and I do understand that these boards
> should run much cooler with descrypt-ztex than when they are used for
> cryptocurrency work), so I haven't experimented with overclocking just
> yet. But I have at least now done some basic testing.

Thank you!

> Here are standard-clocks results on my setup - controlled directly
> from a Raspberry Pi 2, through a couple of powered USB 2.0 hubs, to my
> (now down to 14) functional boards:
> 
> $ time ./john -form=descrypt-ztex -inc=alpha -min-len=8 -max-len=8 -mask='?w?l?l?l?l' pw-fake-unix

Wow.  This is very helpful.  It would also be helpful for you to include
the corresponding results for 1 board.  And ditto (14 and 1) for bcrypt.

Quite possibly you have the world's fastest bcrypt cracker right now.

Also, for these tests, can we please standardize on -inc=lower for the
incremental mode portion?  My use of -inc=alpha was a mistake: it's
inconsistent with the mask for the last 4 characters (?l is lowercase
only).

> So after an hour, performance was ~706Mc/s per board (if I'm reading it right).

Yes.  And that's about 15% lower than we'd expect from a single board
running on its own.  Communication latency is the probable cause.
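
For reference, the shortfall arithmetic can be sketched as below.  This
is only an illustration, not code from the JtR tree, and the helper name
is made up.  Which single-board rate to compare against is an
assumption: against the 828M theoretical maximum quoted later in this
message, 706M is about 14.7% lower; against the 806M measured at
standard clocks without the change, it is about 12.4% lower.

```c
#include <assert.h>

/*
 * Hypothetical helper: percentage by which an observed per-board rate
 * falls short of the expected single-board rate.  Both arguments must
 * be in the same units (for example, Mc/s).
 */
static double shortfall_percent(double observed, double expected)
{
	assert(expected > 0.0);
	return (expected - observed) / expected * 100.0;
}
```

Calling shortfall_percent(706.0, 828.0) corresponds to the "about 15%"
figure above.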

> I also used a Kill-A-Watt to roughly measure power consumption (just
> of the supply to the boards, not any of the supporting gear):
> 
> ~110W idle = ~7.8W/board
> ~470W under load = ~33.6W/board

This is also very helpful, and I also want a figure for bcrypt-ztex.

> I also noted during the testing that CPU usage was around 40-44% during the run.

The CPU usage is fine per se, but it indicates there's latency in John
talking to the FPGAs, probably leaving them idle at times.  We need to
implement an asynchronous API at a higher level to fix this long-term.

For now, we could try more workarounds, such as maybe supporting --fork
along with use of ZTEX boards (allocate fewer boards per process, like
we have with 1 GPU/process, which is also not great but works for now).
This isn't supported yet, but maybe Denis could consider it.
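
To make the idea concrete, here is a minimal sketch of how boards might
be split across forked processes.  Nothing like this exists in the code
today; the function names and the even-split policy are assumptions made
purely for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many boards process `proc` (0-based) would
 * drive if `num_boards` boards were split as evenly as possible across
 * `num_procs` processes.  The first (num_boards % num_procs) processes
 * get one extra board each.
 */
static unsigned int boards_for_process(unsigned int num_boards,
	unsigned int num_procs, unsigned int proc)
{
	assert(num_procs > 0 && proc < num_procs);
	unsigned int base = num_boards / num_procs;
	unsigned int extra = num_boards % num_procs;
	return base + (proc < extra ? 1 : 0);
}

/*
 * Hypothetical helper: index of the first board assigned to process
 * `proc`, assuming boards are numbered from 0 and handed out in
 * contiguous ranges.
 */
static unsigned int first_board_for_process(unsigned int num_boards,
	unsigned int num_procs, unsigned int proc)
{
	assert(num_procs > 0 && proc < num_procs);
	unsigned int base = num_boards / num_procs;
	unsigned int extra = num_boards % num_procs;
	return proc * base + (proc < extra ? proc : extra);
}
```

With 14 boards and 4 processes, this gives the processes 4, 4, 3 and 3
boards, starting at boards 0, 4, 8 and 11.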

An easier (and better?) workaround is to buffer even more candidate
passwords within 1 process.  On my Qubes system, adding this line:

+++ b/src/ztex/device_format.c
@@ -111,6 +111,7 @@ void device_format_reset()
        // Mask data is ready, calculate and set keys_per_crypt
        unsigned int keys_per_crypt = jtr_bitstream->candidates_per_crypt
                        / mask_num_cand();
+       keys_per_crypt *= 4;
        if (!keys_per_crypt)
                keys_per_crypt = 1;

makes the standard clocks c/s rate increase from 806M to 828M, which
suspiciously matches the theoretical maximum Denis gives for the current
design in the just committed src/ztex/fpga-descrypt/README.md:

https://github.com/magnumripper/JohnTheRipper/pull/2598/commits/1214de42284c8b66728f0f8fd362a743f54c2ab0#diff-c70cc4e9666091acde7a844bfc24d88d

This appears to work.  The cracked passwords stream isn't expected to be
exactly the same (that is, not in the same order) because the larger
buffers unfortunately result in less optimal ordering of the candidates
with incremental mode and the like.

Also, the total running time of some short sessions (in my testing:
mask, but not wordlist) appears to increase - perhaps the very last
crypt_all() call (for each salt) hashes many more keys than are actually
supplied?  Denis, perhaps this is something you could fix?

Royce, unless Denis says this hack is somehow very wrong, feel free to
try it on your cluster, and maybe you'll regain some of that lost ~15%.

Please note that this code is also used for bcrypt.  The change gives me
a slight speedup for bcrypt, too: about 2%, which was presumably being
lost to USB pass-through.  It's not much of a speedup because the
performance hit on bcrypt wasn't that bad in the first place.

So if you do go for this, please test bcrypt both ways (without and with
this change) as well.  Perhaps keep two john binaries.

keys_per_crypt factors other than 4 (such as 2, 3, 5, more) on this line
may also be tried.  I only tried 4 - I didn't tune.  Denis, maybe make
this configurable?
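
As a rough sketch of what "configurable" could look like, the
calculation from the patch above can be written with the hard-coded 4
replaced by a parameter.  This is standalone illustrative code, not a
proposed patch: the function name and its parameters are hypothetical,
the two inputs stand in for jtr_bitstream->candidates_per_crypt and
mask_num_cand(), and where the multiplier would come from (john.conf, a
command-line option) is left open.

```c
#include <assert.h>

/*
 * Hypothetical standalone version of the keys_per_crypt calculation.
 * `multiplier` replaces the hard-coded "* 4" from the patch; 1 gives
 * the original behavior.  As in the patch, the multiplication happens
 * before the zero check, so a zero quotient still ends up as 1.
 * Overflow of the multiplication is not checked in this sketch.
 */
static unsigned int scaled_keys_per_crypt(unsigned int candidates_per_crypt,
	unsigned int mask_candidates, unsigned int multiplier)
{
	assert(mask_candidates > 0);
	unsigned int keys_per_crypt = candidates_per_crypt / mask_candidates;
	keys_per_crypt *= multiplier;
	if (!keys_per_crypt)
		keys_per_crypt = 1;
	return keys_per_crypt;
}
```

Keeping the same order of operations as the patch means that a
multiplier of 1 reproduces the unpatched result exactly, which makes
before/after comparisons (including for bcrypt) straightforward.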

> The only troubles that I've encountered so far are from either
> apparent problems of scale, or problems with individual boards. (Most
> users won't encounter the scale problems, but given the age of these
> boards, some users may run into the per-board issues, so they may
> warrant closer examination; I will file issues for these as you
> suggest).

I see you've already created two GitHub issues for these - thank you!
The segfaults are definitely bugs for us to fix.  John shouldn't
segfault no matter what happens with the boards.

> With the individual board problems, during earlier (very kind!)
> troubleshooting sessions, Denis already significantly improved how his
> communication methods (ztex_inouttraffic?) handled my various failure
> modes, but a few remain.

Great.  I see you use mostly (or exclusively?) US clones rather than
ZTEX original boards, and this might have been causing some issues.

(There are those edge connectors visible in the picture you tweeted.
They're not present on ZTEX originals.)

> It would of course be
> preferable if a long-running job could continue after dropping an iffy
> board, but I don't know if this would be feasible.

It is feasible, and should already be the case - in fact, you mention it
almost works for you, but not all the time.  It's a bug to fix.

One thing Denis hasn't implemented yet, but maybe should consider
implementing, is similar handling of intermittent failures when running
with just one board: reset it and resume cracking.  Right now, if the
only board fails, JtR terminates right away.

Thanks again,

Alexander