Message-ID: <d4f090062bb16798c564c3e78054266d@smtp.hushmail.com>
Date: Fri, 2 May 2025 23:13:39 +0200
From: magnum <magnumripper@...hmail.com>
To: john-users@...ts.openwall.com
Subject: Re: Inadequate GPU VRAM Utilization in John the Ripper (OpenCL) – Optimization Inquiry

On 2025-05-02 20:44, Pentester LAB wrote:
> I’ve observed that JtR does not effectively utilize GPU VRAM when run
> without `--fork`, despite using OpenCL formats and correct device
> selection. In contrast, Hashcat fully engages GPU VRAM under similar
> workloads.

JtR uses only as much memory as needed for hash binaries and, sometimes, 
bitmaps for on-device compare. When running wordlist + rules, the rules 
processing runs on the CPU. This does hurt performance for the fastest 
formats such as NT or LM, but since those are so fast anyway, we've put 
our efforts into other cracking modes (such as mask, incremental + mask, 
wordlist + mask).
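For concreteness, the device-side modes listed above map to invocations along these lines (hash file, wordlist, and the exact OpenCL format name are placeholders; check `john --list=formats` on your build):

```shell
# Pure mask mode: candidate generation runs on the GPU
john --format=nt-opencl --mask='?a?a?a?a?a?a?a' hashes.txt

# Incremental + mask (hybrid): ?w is the word fed in by incremental mode
john --format=nt-opencl --incremental=ASCII --mask='?w?a?a' hashes.txt

# Wordlist + mask (hybrid): ?w is the current wordlist entry
john --format=nt-opencl --wordlist=words.txt --mask='?w?d?d' hashes.txt
```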

I think Hashcat may load all or parts of the wordlist onto GPU for 
on-device rules processing.

> 1. Is GPU VRAM underutilization expected in non-forked JtR runs using
> OpenCL?

Well, sort of, but I don't understand your obsession with using more 
memory. There's no universal performance gain from using more. The 
wordlist + rules case is the exception, but it would require us to 
implement an on-device rules engine (not that hard) and a unified "api" 
for OpenCL kernels (lots of work - and hashcat is miles ahead of JtR in 
that very aspect, as it was rewritten from scratch with that in mind).

> 2. Are there flags or configuration options to force or encourage higher
> GPU memory usage, similar to how Hashcat allocates workload?
> 3. Can VRAM utilization be tuned via kernel batch sizes or custom OpenCL
> tweaks?

As long as we (unfortunately) don't run rules on GPU, there's no (or 
very little) gain in loading complete wordlists to GPU. Given that, I 
think we do utilize memory as well as we can. Unless you can think of 
something else we could fill the memory with?

> 4. Is there a recommended approach to achieving full GPU engagement during
> cracking, outside of --fork?

Forget about memory and use mask or hybrid mask.

For example, fast formats like NT can process "?a?a" on the device side 
on a good GPU. If you build a charset for incremental mode from large 
wordlists (or pot files) with a filter truncating the last two letters, 
then run the result with the hybrid mask "?w?a?a", you should get full 
utilization, and you'll likely also crack some things Hashcat wouldn't 
in a lifetime: stuff that doesn't exist in any wordlist, mangled or not, 
even unbelievably long passwords.
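A sketch of that workflow. The section names (`TruncTwo`, `Custom`), file names, and the filter body are mine, not JtR defaults; the external filter is written in john.conf's C-like external mode language, and `--make-charset` builds from pot-format entries, hence the `sed` trick to fake one from a wordlist:

```shell
# Append the custom sections to john-local.conf (read alongside john.conf)
cat >> john-local.conf <<'EOF'
[List.External:TruncTwo]
void filter()
{
	int i;
	i = 0;
	while (word[i]) i++;
	if (i > 2) word[i - 2] = 0; /* drop the last two characters */
}

[Incremental:Custom]
File = custom.chr
EOF

# Fake a pot file from a big wordlist; charsets are built from pot entries
sed 's/^/:/' big-wordlist.txt > fake.pot

# Build the charset with the truncating filter applied
john --pot=fake.pot --make-charset=custom.chr --external=TruncTwo

# Hybrid run: incremental mode supplies ?w, the mask appends ?a?a on device
john --format=nt-opencl --incremental=Custom --mask='?w?a?a' hashes.txt
```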

magnum
