john-users - Re: sha512crypt-opencl / Self test failed (cmp_all(1))

Message-ID: <20200812172333.GA9198@openwall.com>
Date: Wed, 12 Aug 2020 19:23:33 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: sha512crypt-opencl / Self test failed (cmp_all(1))

To include anything on-topic (sha512crypt-opencl) in this thread again:

Until recently, sha512crypt-opencl and sha256crypt-opencl didn't use
optimal internal settings for NVIDIA Volta and NVIDIA Turing cards.
Claudio has fixed this in very recent commits first by recognizing Volta
specially and then (on my advice) by simply treating all newer and
future NVIDIA GPUs the same as the last NVIDIA GPU family we tuned for.
With this, latest jumbo's sha512crypt-opencl delivers speeds on NVIDIA
Tesla V100 and "GeForce RTX 2070 with Max-Q Design" (a laptop GPU) that
are on par with hashcat's.  (Just whatever I happen to have results for.
We still don't have an NVIDIA RTX 20xx GPU in a JtR dev box.)

My own test on AWS p3.2xlarge, V100 16GB, "Driver Version: 418.87.00
CUDA Version: 10.1", before the commits mentioned above:

Device 1: Tesla V100-SXM2-16GB
[...]
LWS=32 GWS=20480 (640 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    266240 c/s real, 264915 c/s virtual

After:

Device 1: Tesla V100-SXM2-16GB
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=256 GWS=2621440 (10240 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	393019 c/s real, 392725 c/s virtual

With these changes, the auto-tuning results in very large GWS, which may
sometimes be inconvenient (2621440/393019 = ~7 seconds per salt, so a
long time to advance to the next batch of candidates when there are many
salts).  However, forcing a lower GWS nevertheless delivers reasonable
speed (just moderately lower):

$ ./john -test -form=sha512crypt-opencl -gws=20480
Device 1: Tesla V100-SXM2-16GB
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=32 GWS=20480 (640 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:	368640 c/s real, 368640 c/s virtual
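
The per-salt figure above is simply GWS divided by the benchmark speed.
A minimal Python sketch of that arithmetic, using the V100 numbers from
the two benchmark runs above (the function name is made up for
illustration and is not part of JtR):

```python
def seconds_per_salt(gws, speed_cps):
    """Approximate time to run one batch of `gws` candidates against one salt."""
    return gws / speed_cps

# Figures taken from the V100 benchmarks quoted above.
auto_tuned = seconds_per_salt(2621440, 393019)  # auto-tuned GWS
forced = seconds_per_salt(20480, 368640)        # forced with -gws=20480

print(f"auto-tuned GWS: {auto_tuned:.2f} s per salt")
print(f"-gws=20480:     {forced:.3f} s per salt")
```

With many salts loaded, the time to advance to the next batch of
candidates is roughly this value multiplied by the number of salts.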

For comparison, on the same AWS instance with current hashcat, CUDA API:

$ ./hashcat -b -O -w4 -m1800
hashcat (v6.1.1-20-gdc9a2468) starting in benchmark mode...
[...]
CUDA API (CUDA 10.1)
====================
* Device #1: Tesla V100-SXM2-16GB, 15814/16130 MB, 80MCU
[...]
Hashmode: 1800 - sha512crypt $6$, SHA512 (Unix) (Iterations: 5000)

Speed.#1.........:   398.6 kH/s (326.13ms) @ Accel:8 Loops:1024 Thr:1024 Vec:1

OpenCL API:

$ ./hashcat -b -O -w4 -m1800 -d2
hashcat (v6.1.1-20-gdc9a2468) starting in benchmark mode...
[...]
OpenCL API (OpenCL 1.2 CUDA 10.1.236) - Platform #1 [NVIDIA Corporation]
========================================================================
* Device #2: Tesla V100-SXM2-16GB, 15744/16130 MB (4032 MB allocatable), 80MCU
[...]
Hashmode: 1800 - sha512crypt $6$, SHA512 (Unix) (Iterations: 5000)

Speed.#2.........:   382.5 kH/s (340.11ms) @ Accel:8 Loops:1024 Thr:1024 Vec:1

So best hashcat is 398k+ (CUDA) and best John is 393k (OpenCL).  The
"-w4" option made a difference - speeds were lower with "-w3".

"GeForce RTX 2070 with Max-Q Design" in a Windows laptop, latest build
of JtR for Windows from:

https://github.com/openwall/john-packages/releases

Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=32 GWS=147456 (4608 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    156576 c/s real, 156618 c/s virtual

BTW, credit for making these builds also goes to Claudio.  Thanks!

hashcat:

Hashmode: 1800 - sha512crypt $6$, SHA512 (Unix) (Iterations: 5000)

Speed.#1.........:   151.7 kH/s (385.92ms) @ Accel:8 Loops:1024 Thr:1024 Vec:1

(Not sure which exact version and command-line, but I had asked for
"hashcat -b -O -w4 -m1800" to be used for this test.)

Now to the ZTEX stuff, which is on-topic for john-users, but not so much
for this thread:

On Wed, Aug 12, 2020 at 07:41:07AM -0800, Royce Williams wrote:
> On Thu, Aug 6, 2020 at 5:21 AM Solar Designer <solar@...nwall.com> wrote:
> > On Wed, Aug 05, 2020 at 09:45:26PM -0800, Royce Williams wrote:
> > > When this happened to me, I dropped the speed on the specific boards by
> > > 10MHz or so until it stopped,
> >
> > When errors are infrequent, it's generally more efficient to just let
> > them happen once in a while, giving a higher average c/s rate than you'd
> > have at a lower clock rate.
> 
> Indeed. There's definitely a sweet spot there. I'm sure that the various
> Bitcoin forums from ZTEX have similar wisdom.

It's actually quite different for password cracking vs. cryptocurrency
mining.  It's also different for password security audits that need to
be reliable vs. those that are opportunistic (or are contests, indeed).

For mining, all that matters is to maximize the effective hashrate (for
shares accepted by a pool).  Occasional errors (both false negatives aka
missed winning nonces and false positives aka nonces that don't actually
produce a valid share) are OK as long as the average effective hashrate
is higher.

For password cracking, one has to decide whether and what error rate to
tolerate.  And this is almost exclusively about false negatives (missing
some otherwise crackable passwords), not about false positives
(reporting a cracked password when there isn't one).  False positives
are almost impossible because we're checking for exact match (not for
the computed hash being below a target, which is the case for mining).

Another aspect is what kind of errors we're getting.  Detected errors
are not the worst - we waste time, but repeat the computation.
Undetected errors are worse.  When you see or don't see occasional
detected errors, this doesn't tell you whether or not there are also
undetected errors, although there might be correlation.  We detect
errors with communication (we use checksums) and some kinds of errors
with computation (if a computation result isn't reported in time,
something might have gone wrong with a state machine or control flow on
a soft CPU).  We do not detect most other potential errors with
computation (can't do that without duplicating the work or using some
kind of slower error-detecting computation primitives).

So if the errors you're getting look like they're related to stress on
the USB subsystem, it may be OK to ignore them and to optimize for the
highest average c/s rate despite the errors.  And regardless of
whether you're getting any detected errors, it may be a good idea to
test that you're getting passwords cracked sufficiently reliably for
your use case (e.g., 99%+ for a contest and 100% for a policy audit).
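
To illustrate the "highest average c/s rate" trade-off for detected
errors only, here is a rough Python sketch.  All numbers are
hypothetical (not measurements), and it assumes the only cost of a
detected error is wall-clock time lost to the error and the repeated
computation:

```python
def average_rate(raw_cps, time_lost_fraction):
    """Average c/s when a fraction of wall-clock time is lost to
    detected errors and the recomputation they trigger."""
    return raw_cps * (1.0 - time_lost_fraction)

# Hypothetical settings: a faster clock that loses 2% of its time to
# detected errors vs. a slower clock with no detected errors at all.
fast_raw, slow_raw = 150_000, 140_000
fast_with_errors = average_rate(fast_raw, 0.02)
slow_no_errors = average_rate(slow_raw, 0.0)

# Fraction of time the faster setting can lose before it stops paying off.
break_even = 1.0 - slow_raw / fast_raw

print(fast_with_errors, slow_no_errors, round(break_even, 4))
```

This says nothing about undetected computation errors, which have to be
assessed separately by checking that known passwords actually get
cracked.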

Yet another aspect is that Bitcoin in particular involves just two
computations of SHA-256.  An error rate of, say, 1% per one SHA-256
computation would result in around a 2% error rate for the whole thing,
which is acceptable and likely profitable (compared to trying to avoid
errors by using a lower clock rate).  However, the password hashes we
have implemented on ZTEX are all of the "slow" kind - they use large
numbers of iterations of a primitive.  For example, sha512crypt hashes
use 5000 iterations by default.  A 1% error rate per one SHA-512
computation would turn into an almost 100% error rate for the whole
construction.  This means that a much more conservative clock rate is
needed to have an error rate that is acceptable for any kind of password
cracking (even for a contest).
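
The compounding can be sketched in Python as follows.  This assumes
errors are independent between computations and that any single error
corrupts the final result; like the text above, it also simplifies
sha512crypt to one SHA-512 computation per iteration.  The function
names are for illustration only:

```python
def overall_error_rate(per_step_error, steps):
    """Probability that at least one of `steps` independent computations fails."""
    return 1.0 - (1.0 - per_step_error) ** steps

def max_per_step_error(target_overall, steps):
    """Per-computation error rate that keeps the overall rate at `target_overall`."""
    return 1.0 - (1.0 - target_overall) ** (1.0 / steps)

bitcoin = overall_error_rate(0.01, 2)          # two SHA-256 computations
sha512crypt = overall_error_rate(0.01, 5000)   # default 5000 iterations

# Per-computation rate needed to stay at 1% overall for 5000 iterations.
needed = max_per_step_error(0.01, 5000)

print(f"Bitcoin:     {bitcoin:.4f}")
print(f"sha512crypt: {sha512crypt:.6f}")
print(f"needed per-computation rate: {needed:.2e}")
```

Under these assumptions, keeping sha512crypt at a 1% overall error rate
requires a per-computation error rate on the order of a few in a
million, which is why the clock rate has to be so much more
conservative.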

When the computation error rate is non-zero, it will also tend to
increase for higher "cost" settings in the variable cost hashes (bcrypt,
sha512crypt, sha256crypt, Drupal7, phpass).  For example, in GitHub
issue #3851 Aleksey reports a 0.1% error rate for bcrypt cost 10 on one
of his ZTEX boards at 150 MHz, but no errors for bcrypt cost 5, nor at
148 MHz or below.  I guess no errors for cost 5 at 150 MHz merely meant
the error rate was 32x lower as expected (the difference in iterations
count between bcrypt costs 5 and 10), which apparently was below the
threshold of Aleksey's tests.
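
The 32x figure and what it implies for testing can be sketched as
follows.  This is a rough Python estimate that assumes the error rate
scales linearly with the iteration count (reasonable only while the
rates are small); the 0.1% input is the figure reported above, and
everything derived from it is an estimate, not a measurement:

```python
def bcrypt_iterations(cost):
    """bcrypt's iteration count doubles with each cost increment."""
    return 2 ** cost

ratio = bcrypt_iterations(10) // bcrypt_iterations(5)

observed_cost10 = 0.001                    # 0.1% reported at cost 10, 150 MHz
expected_cost5 = observed_cost10 / ratio   # linear approximation for small rates

# Roughly how many cost 5 computations one would need to expect a single error.
tests_per_error = 1.0 / expected_cost5

print(ratio, expected_cost5, round(tests_per_error))
```

So a test at cost 5 would need tens of thousands of computations before
a single error becomes likely, which is consistent with none being seen.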

> > Please remember that there's generally no point in adjusting frequencies
> > per board (except for testing) if you use all of your boards as one big
> > cluster.  John is currently only able to use the boards synchronously,
> > so the slowest board will determine the cluster's overall performance.
> >
> > This changes when you use "--fork" or "--devices", but in particular
> > with "--fork" it'd probably be inconvenient for you to have some forked
> > processes terminate much sooner than others.  So the per-board frequency
> > adjustment is generally only useful when you run per-board-set attacks,
> > explicitly targeting attacks to same-frequency lists of "--devices".  Of
> > course, you'd also use "--session" to launch multiple attacks from the
> > same "run" directory.
> 
> Ah, yes - I'd been tuning relative to "--fork" ... once I re-discovered it.
> :)

Right, and please note that usage of "--fork" with ZTEX puts much more
stress on the USB subsystem, so more frequent communication errors are
expected (thus, errors of the kind you can ignore as long as they're
infrequent enough that the average speed improves).

> In my experience, for contests and similar time-critical scenarios, having
> some boards finish sooner also means (theoretically) getting some *results*
> sooner - which may be worth the inconvenience.

Right.

BTW, sha512crypt-ztex might be more suitable than sha512crypt-opencl for
quick experiments in a contest because it becomes reasonably efficient
at lower candidate password counts.  For sha512crypt-ztex, you currently
need at least 768 candidates per board (or preferably an exact multiple
of that, but exactly 768 is OK), which is much less than the GWS figures
seen above.
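
As a small illustration of the "exact multiple of 768 per board" advice,
here is a Python helper that rounds a candidate count up to a full
multiple.  The 768 figure is the one quoted above; the helper itself is
hypothetical and is not how JtR chooses its batch sizes:

```python
CANDIDATES_PER_BOARD = 768  # figure quoted above for sha512crypt-ztex

def round_up_batch(candidates, boards):
    """Smallest multiple of 768 * boards that is >= candidates,
    and never less than one full unit."""
    unit = CANDIDATES_PER_BOARD * boards
    units = max(1, -(-candidates // unit))  # ceiling division
    return units * unit

print(round_up_batch(1000, 1))  # one board
print(round_up_batch(1, 4))     # four boards, tiny wordlist
```

Even with several boards, such batches stay far below the GWS values of
20480 and 2621440 seen in the OpenCL benchmarks above.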

Alexander