Message-ID: <49f7790a70e66cdeb803dba4101806ff@smtp.hushmail.com>
Date: Fri, 18 Oct 2013 16:40:06 +0200
From: magnum <john.magnum@...hmail.com>
To: "john-dev@...ts.openwall.com" <john-dev@...ts.openwall.com>
Subject: OpenCL vectorizing how-to.
Not sure anyone listens but anyway,
I take pride in aiming to write device-agnostic OpenCL code. Ideally
it should support ridiculously weak devices as well as ones much more
powerful than today's flagships, and ideally with no configuration or
tweaking at runtime (or worse, at build time). I can't beat Hashcat in
raw performance on MD4 hashes, but I can try to beat it with support for
devices not even blueprinted yet.
This week I've been concentrating on vectorizing OpenCL kernels. Until
now I had made some sporadic experiments with optional vectorizing
(width 4, enabled or not at runtime when building the kernel), but I
never figured out a safe way to decide when to actually enable it, so it
remained disabled.
Now I suddenly stumbled over a detail I had totally missed until now:
you can ask a device about its "preferred vector size" for a given type
(e.g. int). Empirical tests showed that this was indeed the missing
piece I needed.
The first thing I did was add this information to --list=opencl-devices.
Examples:
Device #2 (2) name: Juniper
Board name: AMD Radeon HD 6770 Green Edition
(...)
Native vector widths: char 16, short 8, int 4, long 2
Preferred vector width: char 16, short 8, int 4, long 2
Device #0 (0) name: GeForce GTX 570
(...)
Native vector widths: char 1, short 1, int 1, long 1
Preferred vector width: char 1, short 1, int 1, long 1
Interestingly enough, this finally settled not only *why* CPU devices
sometimes suffer from vectorizing and sometimes gain a lot from it (I
already had a pretty good understanding of that part), but also how to
*know* it beforehand and act accordingly:
Platform #0 name: AMD Accelerated Parallel Processing
Platform version: OpenCL 1.2 AMD-APP (1214.3)
(...)
Device #3 (3) name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Native vector widths: char 16, short 8, int 4, long 2
Preferred vector width: char 16, short 8, int 4, long 2
Platform #1 name: Intel(R) OpenCL
Platform version: OpenCL 1.2 LINUX
(...)
Device #0 (4) name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Native vector widths: char 32, short 16, int 8, long 8
Preferred vector width: char 1, short 1, int 1, long 1
At least with these driver versions, it seems that AMD's CPU driver
wants us to supply vectorized code, while Intel's driver prefers to get
scalar code (which it will then try to auto-vectorize). Now both get
what they need automatically. On a side note, the AMD driver doesn't
seem to be AVX2-aware yet, while the native figures from Intel are
somewhat confusing for 'long' but probably indicate AVX2 support.
Anyway, I went ahead and made the vectorizing formats obey this
information - not only "vectorize or not" but also using the actual
recommended *width*: SHA-512 formats should obviously use ulong2 for
SSE2/AVX while most other formats use uint4. Obvious, but I overlooked
it in my earlier experiments and used ulong4. Also, future drivers may
ask for wider vectors with AVX2 or future technologies - and my formats
already support that. The code for achieving all this is trivial: the
host code supplies a V_WIDTH macro and the kernels build accordingly
using #ifdefs and macros.
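The kernel side of that V_WIDTH plumbing might look something like the following OpenCL C fragment. This is an illustrative sketch, not the actual kernel code - the vtype/VLOAD names are assumptions, though vloadn and the uintN types are standard OpenCL C:

```c
/* Host passes e.g. "-DV_WIDTH=4" in the clBuildProgram() options;
   the kernel picks its working type and memory accessors accordingly. */
#if V_WIDTH == 1
typedef uint vtype;
#define VLOAD(p, i)   ((p)[(i)])
#elif V_WIDTH == 2
typedef uint2 vtype;
#define VLOAD(p, i)   vload2((i), (p))
#elif V_WIDTH == 4
typedef uint4 vtype;
#define VLOAD(p, i)   vload4((i), (p))
#elif V_WIDTH == 8
typedef uint8 vtype;
#define VLOAD(p, i)   vload8((i), (p))
#elif V_WIDTH == 16
typedef uint16 vtype;
#define VLOAD(p, i)   vload16((i), (p))
#endif
```

The rest of the kernel is then written once in terms of vtype; component-wise operations and swizzles work the same at every width, which is what keeps the per-format cost of supporting this so low.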
The rest of the week I've been adding vectorizing support to more
formats and improving code/macros to support all widths (except 3 - that
width is a pig to code, is weird for our use, and is unlikely to gain us
anything anyway). In almost all cases, the Juniper and AMD's CPU device
(as well as my Apple CPU device) get a fine boost from vectorizing. And
the automagics seem to work like a champ, even though there might be
cases where vectorizing on e.g. Juniper gives suboptimal results due to
register spilling. The new option --force-scalar, or the john.conf entry
"ForceScalar = Y", will disable vectorizing in that case. BTW, there is
also a new "--force-vector-width=N" option, mostly for testing/debugging.
BTW, on Well's Haswell CPU, Intel's auto-vectorizing beats AMD's driver
running pre-vectorized code every time. Part of that may be that Intel's
driver supports AVX2 while AMD's does not.
Formats that support vectorizing right now:
krb5pa-sha1-opencl Kerberos 5 AS-REQ Pre-Auth etype 17/18
ntlmv2-opencl NTLMv2 C/R
office2007-opencl Office 2007
office2010-opencl Office 2010
office2013-opencl Office 2013
RAKP-opencl IPMI 2.0 RAKP (RMCP+)
WPAPSK-opencl WPA/WPA2 PSK
Err... well that was it I think. Have a nice day. BTW note to self after
wasting a lot of time: There is NO WAY to vectorize RC4, lol. Sometimes
I'm really thick %-)
magnum