
Message-ID: <20220104195943.GB6334@openwall.com>
Date: Tue, 4 Jan 2022 20:59:43 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: sha512crypt-opencl / Self test failed (cmp_all(1))

Hi Jason,

On Mon, Jan 03, 2022 at 03:15:17PM -0500, Jason Cooper wrote:
> I'm encountering the following error:
> 
> ```
> $ john --wordlist=crackstation.txt --rules --format=sha512crypt-opencl passwd
> Device 1@...alhost: Intel(R) UHD Graphics [0x9bc4]
> Using default input encoding: UTF-8
> Loaded 1 password hash (sha512crypt-opencl, crypt(3) $6$ [SHA512 OpenCL])
> Cost 1 (iteration count) is 5000 for all loaded hashes
> Error creating binary cache file: No such file or directory
> Self test failed (cmp_all(1))
> ```
> 
> It runs fine, but slow, when I remove `--format=..`

Unfortunately, OpenCL on Intel embedded GPUs is generally unreliable -
if it passes self-test for your desired JtR format, you're lucky.  In
this case, no luck for you, it seems.  As to performance, it is unlikely
that the embedded GPU would be any faster than your CPU cores - more
likely, it'd be slower yet.  However, if you were lucky, you could use both
simultaneously (preferably with separate sessions and different
attacks), so the cumulative performance would be somewhat better.

> john version:
> 
> ```
> $ john
> John the Ripper 1.9.0-jumbo-1 MPI + OMP [linux-gnu 64-bit x86_64 AVX AC]
> ```

I'm not familiar with the Arch Linux package, and I don't know your
laptop's exact specs, but moving from AVX to AVX2 or AVX-512 (if your
CPU supports those) will likely make a greater difference than trying to
use the poor little embedded GPU.
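
One way to check whether the CPU offers more than the AVX that the
packaged build reports is to look at the flags line in /proc/cpuinfo.
The sketch below assumes Linux on x86; best_simd is a hypothetical
helper name, not part of JtR, and it only reports what the CPU
advertises, not what a given john build was compiled for.

```shell
# Print the widest SIMD extension listed in a cpuinfo-style file.
# best_simd is a hypothetical helper; it defaults to /proc/cpuinfo (Linux, x86).
best_simd() {
    file=${1:-/proc/cpuinfo}
    # Take the first "flags" line and split it into one flag per line,
    # so that "avx" cannot match as a substring of "avx2".
    flags=$(grep -m1 '^flags' "$file" 2>/dev/null | tr ' \t' '\n\n') || :
    for ext in avx512f avx2 avx; do
        if printf '%s\n' "$flags" | grep -qx "$ext"; then
            echo "$ext"
            return 0
        fi
    done
    echo "none"
}

best_simd
```

If this prints avx2 or avx512f while john still reports AVX in its
banner, a build targeting the wider instruction set is probably worth
trying.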

For example, here's the old i7-4770K's CPU cores vs. its embedded GPU:

$ ./john -test -form=sha512crypt
Will run 8 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    6493 c/s real, 813 c/s virtual

$ ./john -test -form=sha512crypt-opencl -dev=1
Device 1: Intel(R) HD Graphics
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... Build log: fcl build 1 succeeded.
fcl build 2 succeeded.
bcl build succeeded.

LWS=8 GWS=640 (80 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    1213 c/s
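
To put those two figures side by side, here is a small calculation
using only the numbers from the benchmarks above (6493 c/s real on the
CPU, 1213 c/s on the embedded GPU).  speed_summary is a hypothetical
helper name, and the "gain" is an optimistic upper bound: it assumes
the two would not slow each other down when run together.

```shell
# Compare two benchmark speeds given in c/s (candidates per second).
# speed_summary is a hypothetical helper; $1 = CPU speed, $2 = GPU speed.
speed_summary() {
    awk -v cpu="$1" -v gpu="$2" 'BEGIN {
        printf "CPU/GPU speed ratio: %.1f\n", cpu / gpu
        printf "Upper bound on gain from adding the GPU: %.0f%%\n", 100 * gpu / cpu
    }'
}

# Figures from the i7-4770K runs above.
speed_summary 6493 1213
```

For these inputs the ratio comes out at about 5.4, and adding the GPU
would raise the total by at most about 19%.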

> My GPU:
> 
> ```
> $ john --list=opencl-devices
> Platform #0 name: Intel(R) OpenCL HD Graphics, version: OpenCL 3.0 
>     Device #0 (1) name:     Intel(R) UHD Graphics [0x9bc4]
>     Device vendor:          Intel(R) Corporation
>     Device type:            GPU (LE)
>     Device version:         OpenCL 3.0 NEO 
>     Driver version:         21.49.21786 
>     Native vector widths:   char 16, short 8, int 4, long 1
>     Preferred vector width: char 16, short 8, int 4, long 1
>     Global Memory:          12496 MB
>     Global Memory Cache:    512 KB
>     Local Memory:           64 KB (Local)
>     Constant Buffer size:   4095 MB
>     Max memory alloc. size: 4095 MB
>     Max clock (MHz):        1150
>     Profiling timer res.:   83 ns
>     Max Work Group Size:    256
>     Parallel compute cores: 24
>     Stream processors:      192  (24 x 8)
>     Speed index:            220800
> ```

Yours should be only a tiny bit faster than what I benchmarked above,
which was:

$ ./john --list=opencl-devices -dev=1
Platform #0 name: Intel(R) OpenCL, version: OpenCL 1.2 
    Device #0 (1) name:     Intel(R) HD Graphics
    Device vendor:          Intel(R) Corporation
    Device type:            GPU (LE)
    Device version:         OpenCL 1.2 
    Driver version:         16.4.2.1.39163 
    Native vector widths:   char 1, short 1, int 1, long 1
    Preferred vector width: char 1, short 1, int 1, long 1
    Global Memory:          1630 MiB
    Global Memory Cache:    256 KiB
    Local Memory:           64 KiB (Local)
    Constant Buffer size:   64 KiB
    Max memory alloc. size: 407 MiB
    Max clock (MHz):        1250
    Profiling timer res.:   80 ns
    Max Work Group Size:    512
    Parallel compute cores: 20
    Stream processors:      160  (20 x 8)
    Speed index:            200000

As you can see, these are pretty slow (at least at running this kernel).

> I've tried every way I can think of to increase the verbosity, without 
> success.
> I even ran `strings $(which john) | grep ^OCL` with no results.

There's no need - it's a miscompile resulting in miscomputation, and
that's it.

> How do I debug this?  Or, better yet, how do I fix it?

Unfortunately, only by trial and error.  You could try different
versions of Intel OpenCL.  You could try modifying the OpenCL kernel
code to hopefully avoid triggering whatever miscompile there is.  To
guide this process, you could introduce debugging printf()s in there and
compare against a run that passes self-tests on another device.

However, I wouldn't bother.

Instead, I recommend that you build the latest bleeding-jumbo off GitHub
on your system, and use that.  It might run faster than the package, and
we've made various improvements since the 1.9.0-jumbo-1 release.  And
who knows, maybe it'd also avoid whatever OpenCL kernel miscompile you
ran into on the embedded GPU.
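
A rough sketch of such a build, assuming git, a C compiler, make, and
the usual build dependencies (OpenSSL and OpenCL headers) are already
installed.  The function name and the john-bleeding directory are
placeholders chosen for illustration; see the install notes in the
repository for what your distribution actually needs.

```shell
# Clone and build the bleeding-jumbo branch in a subshell, so the
# caller's working directory is left unchanged.
# build_bleeding_jumbo and the default directory name are hypothetical.
build_bleeding_jumbo() (
    dest=${1:-john-bleeding}
    git clone --depth 1 -b bleeding-jumbo https://github.com/openwall/john "$dest" &&
    cd "$dest/src" &&
    ./configure &&
    make -s clean &&
    make -sj4
)
```

After sourcing the above, something like the following would build it
and then repeat the self-test and benchmark with the fresh binary:

    build_bleeding_jumbo && ./john-bleeding/run/john --test --format=sha512crypt-opencl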

I hope this helps.

Alexander