Date: Thu, 3 Dec 2020 19:28:07 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: dmg-opencl low performance/ low gpu utilisation

Hi,

On Thu, Dec 03, 2020 at 03:26:27PM +0100, r.wiesbach@....de wrote:
> I use dmg-opencl on a two Radeon RX 580 system.
> 
> However the dmg-opencl has very low utilisation

How low?  And how do you measure it?

> and a speed of only about 2500 pw/s.

This may be a fine speed.  It depends on performance of the system the
dmg file or sparsebundle was created on - the faster that system was,
the slower the file or sparsebundle will be to crack.  This is because
recent versions of macOS tune the time needed to generate the encryption
key from the password to be roughly the same - whatever duration the
developers thought the user would still tolerate.

> --test
> 
> shows Raw 46500 c/s real and 3300 c/s virtual

A default "--test" benchmark of dmg-opencl is for:

Speed for cost 1 (iteration count) of 1000, cost 2 (version) of 2 and 1

This is kind of nominal - useful to compare builds of JtR and different
hardware, but not so useful to predict performance on real input.  Your
actual input is almost certainly version 2 only, and it probably has
something like 100,000 iterations, affecting the speed accordingly.
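Since PBKDF2 cost is roughly linear in the iteration count, the expected real-world rate can be sketched by scaling the nominal benchmark speed (the 100,000-iteration figure is only the assumption made above, not a measured value):

```python
# Rough estimate: PBKDF2-SHA1 cost scales roughly linearly with the
# iteration count, so the nominal --test speed can be scaled down.
# Both figures are taken from this thread; the real iteration count
# of any given dmg file may differ.
nominal_speed = 46500      # c/s real at the benchmark's 1000 iterations
benchmark_iters = 1000
real_iters = 100000        # assumed typical for recent macOS

estimated_speed = nominal_speed * benchmark_iters / real_iters
print(estimated_speed)     # roughly 465 c/s on real input
```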

That said, "46500 c/s real" does sound low, and "3300 c/s virtual" is
weird (the virtual figure is generally equal to or higher than the real
one for this test, because only one CPU thread runs and virtual time
advances more slowly than real time).  Here's what I am getting for a
Vega 64 under Linux:

Device 1: gfx900 [Radeon RX Vega]
Benchmarking: dmg-opencl, Apple DMG [PBKDF2-SHA1 3DES/AES OpenCL]... LWS=64 GWS=32768 (512 blocks) DONE
Speed for cost 1 (iteration count) of 1000, cost 2 (version) of 2 and 1
Raw:	875976 c/s real, 11796K c/s virtual

Once again, the speeds are nominal, and speeds a couple of orders of
magnitude lower are expected during actual cracking of a dmg file or
sparsebundle produced by a non-ancient version of macOS.

> Device 2 (same GPU model as device 1) seems not to be used by default,

That's correct.

> but using
> --devices=1,2 --fork=2
> there is at most a slight increase in performance (not doubling the pw/s
> as one would expect)

When you run with two devices, there should be two status lines printed
for every keypress.  These correspond to the two devices, separately.
So it is expected that the performance reported on each individual line
will not increase, but the cumulative performance across the two lines
will be double what you had when using just one device.
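To illustrate (with made-up per-device figures), the rate that matters under --fork is the sum over the per-process status lines, not the figure on any single line:

```python
# Hypothetical p/s figures, one per --fork=2 status line; the real
# values would come from your own terminal output.
per_device_rates = [2500, 2480]

# Each line alone looks no faster than a single-device run, but the
# cumulative throughput is their sum.
total = sum(per_device_rates)
print(total)  # 4980 p/s across both GPUs
```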

You might want to post an excerpt from your terminal window starting
with the "Loaded ..." line for us to see if it's reasonable or not.

> Knowing that some opencl-kernels do not perform well (without rules) I tried
> --wordlist=wordlist.txt
> --wordlist=wordlist.txt --rules
> --wordlist=wordlist.txt --rules=best64
> --incremental

We have no OpenCL kernels that would perform better with rules.  We do
have some that will perform better with mask, but those are for
so-called "fast hashes".  dmg-opencl is (more than) slow enough not to
need this (and thus doesn't include this unneeded optimization).

> Additionally I tried using more hashes (5) and the pw/s dropped to about
> 500 pw/s. As this is 1/5th of 2500, this is the same speed

This looks correct, as long as the hashes all have different salts.
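A quick sketch of why that arithmetic checks out, using the figures from this thread: with N different salts, each candidate password must have its key derived N times, so the per-salt rate divides accordingly:

```python
single_hash_rate = 2500  # p/s against one salted hash (from the thread)
num_salts = 5            # each candidate is re-derived once per salt

per_salt_rate = single_hash_rate / num_salts
print(per_salt_rate)     # 500.0 - matching the observed ~500 p/s
```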

> but still low utilization.

Again, how low?  And how do you know?

> The wordlist has a size of about 100MB.

At the low speeds expected for this format, things like cracking mode
and wordlist size shouldn't make much of a difference in the resulting
p/s rate.

> I did not see an open issue for dmg-opencl on the issue tracker

We're not currently aware of performance issues with dmg-opencl, and in
my personal experience it works well - but I don't use Windows.

It is quite possible there's an issue - maybe a Windows-specific one,
maybe e.g. with auto-tuning of OpenCL work sizes - just guessing here.
Let's see what you actually have (some lines where JtR reports on the
loaded hashes, their tunable costs, the tuned LWS and GWS figures, the
resulting speeds after it's been running for a while) - then determine
if anything is wrong with that, what exactly, and how it can be fixed.

Thanks,

Alexander
