john-users - Re: sha512crypt-opencl / Self test failed (cmp

Message-ID: <20200806132149.GB14882@openwall.com>
Date: Thu, 6 Aug 2020 15:21:49 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: sha512crypt-opencl / Self test failed (cmp_all(1))

> On Wed, Aug 5, 2020 at 9:29 PM Albert Veli <albert.veli@...il.com> wrote:
> > I did not see the self test error on ZTEX. But I saw some other errors
> > on my setup, Aleksey saw them too on his setup. Something like this:
> >
> > SN 04A36E226F FPGA #2 error: pkt_comm_status=0x01, debug=0x0000
> > SN 04A36E226F error -1 doing r/w of FPGAs (LIBUSB_ERROR_IO)
> > SN 04A36E226F: Timeout.
> >
> > It happens after a while. Not every time but sometimes.

This is kind of normal.  We're using USB, which involves many not
exactly reliable hardware and software components.  Further, the FPGAs
themselves do misbehave sometimes.  We're running them at combinations
of utilization and clock rate close to the limits we've seen in our own
testing.  (Per the Xilinx design tools, they're supposed to run most of
our designs at higher clock rates, but in practice they don't, so we
adjusted to be near the maximums that actually work.)

> > It is usually enough to power off the boards and power them on again

That's a bit puzzling.  In my experience, when errors like the above
happen, everything recovers from them on its own.  John includes logic
to recover from such errors without needing to be restarted.  It's just
that the average c/s rate becomes lower (because some FPGAs are idle for
a while when an error happens, then are put back to use).

> > (I have connected
> > the PSU to a Silver Shield power manager to do so remotely, a modbus I/O
> > could also be used for this).

That's good.  I use something like this too, but it's mostly just to
power the boards off when not in use, not to recover from errors.

On Wed, Aug 05, 2020 at 09:45:26PM -0800, Royce Williams wrote:
> When this happened to me, I dropped the speed on the specific boards by
> 10MHz or so until it stopped,

When errors are infrequent, it's generally more efficient to just let
them happen once in a while, giving a higher average c/s rate than you'd
have at a lower clock rate.
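
To put rough numbers on this trade-off, here is a minimal Python sketch.
It assumes that c/s scales linearly with clock rate and that each error
idles one FPGA of a board for a fixed recovery time.  The function name,
the 4-FPGA default, and all figures in the example are illustrative
assumptions, not measurements.

```python
def average_rate(clock_mhz, errors_per_hour, recovery_seconds, fpgas=4):
    """Relative average speed of one board, in 'effective MHz'.

    Assumptions (hypothetical model, not taken from John's source):
    - speed is proportional to clock rate;
    - each error idles one of the board's FPGAs for recovery_seconds.
    """
    idle_fraction = errors_per_hour * recovery_seconds / 3600.0 / fpgas
    idle_fraction = min(1.0, idle_fraction)
    return clock_mhz * (1.0 - idle_fraction)


# Made-up example: a higher clock with 2 errors per hour vs. a clock
# lowered by 10 MHz that never errors.
fast_but_flaky = average_rate(215, errors_per_hour=2, recovery_seconds=30)
slow_but_clean = average_rate(205, errors_per_hour=0, recovery_seconds=30)
print(fast_but_flaky > slow_but_clean)
```

Under this model, lowering the clock by 10 MHz out of 215 only pays off
once errors keep the board idle for more than about 4.7% of the time.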

> using the "Frequency_[serial] = 999" syntax
> for that particular algorithm's section.
> 
> If enough boards are lower than the default, it's easier to just change the
> default and create exceptions for the remainder.

Please remember that there's generally no point in adjusting frequencies
per board (except for testing) if you use all of your boards as one big
cluster.  John is currently only able to use the boards synchronously,
so the slowest board will determine the cluster's overall performance.
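
As a simplified model of the above (an assumption for illustration, not
John's actual scheduling code): if every board gets an equal share of
each round of work and the round ends when the slowest board finishes,
then the cluster behaves as if all boards ran at the slowest board's
speed.  The per-board rates below are made up.

```python
def synchronous_cluster_rate(board_rates):
    """All boards wait for the slowest one each round."""
    return len(board_rates) * min(board_rates)


def independent_rate(board_rates):
    """Upper bound if every board could run at its own pace."""
    return sum(board_rates)


rates = [215, 215, 215, 195]            # hypothetical relative speeds
print(synchronous_cluster_rate(rates))  # 780
print(independent_rate(rates))          # 840
```

In this model the three faster boards gain nothing from their higher
clock rate while they share a cluster with the slower one.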

This changes when you use "--fork" or "--devices", but in particular
with "--fork" it'd probably be inconvenient for you to have some forked
processes terminate much sooner than others.  So the per-board frequency
adjustment is generally only useful when you run per-board-set attacks,
explicitly targeting attacks to same-frequency lists of "--devices".  Of
course, you'd also use "--session" to launch multiple attacks from the
same "run" directory.
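
A minimal Python sketch of how such same-frequency board sets could be
derived from a per-board frequency table.  The helper names, the session
naming scheme, and the serials other than the one quoted earlier in this
thread are hypothetical.  Passing a comma-separated list of serials to
"--devices" is an assumption here, so check it against the ZTEX
documentation for your build.

```python
from collections import defaultdict


def group_by_frequency(board_freqs):
    """Map each frequency (MHz) to the sorted serials running at it."""
    groups = defaultdict(list)
    for serial, mhz in board_freqs.items():
        groups[mhz].append(serial)
    # Fastest group first.
    return {mhz: sorted(serials)
            for mhz, serials in sorted(groups.items(), reverse=True)}


def device_options(board_freqs):
    """One '--devices=... --session=...' option string per frequency."""
    return [
        f"--devices={','.join(serials)} --session=ztex-{mhz}mhz"
        for mhz, serials in group_by_frequency(board_freqs).items()
    ]


# Hypothetical frequency table; BOARD_B and BOARD_C are placeholders.
boards = {"04A36E226F": 215, "BOARD_B": 215, "BOARD_C": 205}
for options in device_options(boards):
    print(options)
```

Each resulting option string would go to a separate john invocation, so
that every attack only waits on boards running at the same frequency.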

> If that doesn't work, you have other issues (flaky USB connector, flaky USB
> cable, unstable power, etc.)

Right.

Alexander