[<prev] [next>] [thread-next>] [day] [month] [year] [list]
```Message-ID: <20120514012502.GA6845@openwall.com>
Date: Mon, 14 May 2012 05:25:02 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Blowfish (bcrypt) on CPU vs. GPU (FX-8120 vs. HD 7970)

Hi,

I tried to estimate possible speeds for bcrypt (Eksblowfish) on GCN (HD
7970), using the known speeds on Bulldozer (FX-8120) as reference (to
verify my math, as well as to see the possible speedup over CPU, if any).

Let's assume the \$2a\$05 cost setting (32 iterations), which is what we
historically use for benchmarking (even though modern systems tend to
use higher settings, such as \$2a\$08 or even \$2a\$10).

The number of Blowfish encryptions performed per bcrypt hash computed is:

(9+512)*2*32 + 9+512+64*3 = 34057

or if we move some final encryptions into cmp_exact(), which is normally
not called, then:

(9+512)*2*32 + 9+512+64 = 33929

So let's just use 34000 as our estimate.

FX-8120 running at stock clocks does 955 c/s on one core.  Presumably,
this is at 4.0 GHz (half load turbo).  In terms of cycles per byte
encrypted, this translates to:

4*10^9 / (34000*8*955) = 15.4

For comparison (albeit somewhat unrelated to the purpose of this
posting), on Core i7-2600K at stock clocks (3.8 GHz max turbo)
I am getting:

3.8*10^9 / (34000*8*940) = 14.9

Core 2 Duo E6550 o/c to 3.15 GHz:

3.15*10^9 / (34000*8*884) = 13.1

All of these compare favorably to 18 cycles/byte (not including
higher-level overhead) for Blowfish on the original Pentium:

http://www.schneier.com/blowfish-speed.html

Yet they're pretty close, thus realistic.  Of course, in these builds
JtR actually computes two bcrypt hashes in parallel with mixed
instructions, but the reported c/s rate takes into account the fact that
two hashes were computed at a time.  So 34000 Blowfish encryptions is
the correct number to use here (along with the reported c/s rate) even
though twice more are done per crypt_all() call (non-OpenMP).

BF_ROUND has 14 operations in it.  How these translate to x86
instructions and micro-ops is not entirely obvious.  For example, x86.S
(which is not used in the x86-64 builds used in the benchmarks above,
though) has two versions of BF_ROUND: in the version for the original
Pentium (yes, this old) it uses single micro-op instructions only (to
avoid imperfect pairing on that processor), and it has 18 instructions
as a result.  In the version for most newer processors, it has 11
instructions, many of which are complex (I think these produce at least
15 micro-ops on current CPUs).

Assuming 15 micro-ops (although in practice there are probably more), we
get issue rates of:

34000*(15*16+4+5)*955 / (4*10^9) = 2.02
34000*(15*16+4+5)*884 / (3.15*10^9) = 2.38

for Bulldozer and Core 2, respectively.  The actual micro-op issue rates
are probably slightly higher (since there are probably more micro-ops).

FX-8120 has 4 modules, each capable of decoding 4 x86 instructions
per cycle - thus producing at least 4 micro-ops per cycle per module.
Assuming that we have 15 micro-ops per Blowfish round, and that 4
micro-ops are executed per cycle (we're doing 4 separate Blowfish
encryptions per Bulldozer module, so this is realistic), we get:

4*4*3.4*10^9/((15*16+4+5)*34000) = 6426 c/s

as a speed estimate for an OpenMP-enabled build of the current source
code running on FX-8120 at its full load turbo frequency of 3.4 GHz.
In practice, we're getting around 5500 c/s on such build.

Now to GPU stuff.  Disclaimer: I am still unfamiliar with GPUs, I am
just learning.  I'll be using info on GCN from:

http://developer.amd.com/afds/assets/presentations/2620_final.pdf

The GPU on an HD 7970 card contains 32 compute units (CUs), each of them
having one scalar unit, four 16-lane SIMD units, 64 KB of 32-bank Local
Data Share (LDS) memory, 16 KB read-write L1 data cache, and other
important stuff.  Also relevant to what I'll be talking about below is
the 16 KB 4-way set associative (and 4-port?) scalar read-only L1 cache
shared by 4 CUs (so we have only 8 of these total in a 7970's GPU?)
There's more memory in registers (64 KB per SIMD unit, 8 KB per scalar
unit), but I guess those are not indirectly addressable?

We can issue up to 5 instructions per cycle per CU - apparently, this
maximum is reached with 1 scalar and 4 SIMD instructions.  With four
16-lane SIMD units, we'd normally have 64 work-items per wavefront, and
we'd have at least 10 waves/SIMD, 40 waves/CU, 2560 work-items per CU
(as per slide 11).  However, the 64 KB of LDS only lets us keep up to
16 sets of Blowfish S-boxes in it (we need 4 KB per set).  So maybe we
can/should only run 16 work-items per CU, thus making use of only 1/4 of
total available lanes and incurring stalls on data dependencies after
high-latency instructions.  And this is assuming that we can do gather
loads into VGPRs from LDS - which the slides suggest that we can (I am
surprised!), although I am not 100% sure of that crucial detail.

Maybe we can also make use of the scalar unit for our actual computation,
even though it's normally intended for control.  We also happen to have
exactly 4 KB of read-only L1 cache per scalar unit.  (It is not clear to
me whether the scalar unit can access the LDS memory or not.)  Blowfish
S-boxes are not constant, but writes into them are relatively rare (as
compared to reads).  In Eksblowfish, we do one 64-bit write per Blowfish
block encrypted.

The 16 KB read-write L1 data cache is probably unusable for the S-boxes
as it seems very unlikely that there's gather support for such loads,
and even if we abused vector loads for scalar operations (using only one
vector element for real), we'd be thrashing this small cache even with
just one SIMD unit doing that (it'd need 64 KB then), unless unaligned
vector loads are supported (maybe they are?)

Now to speed estimates.  Assuming that we can use the LDS and the scalar
read-only L1 cache fully, don't use anything else, somehow are able to
issue an instruction every cycle with no stalls (what's the latency for
may be:

32*(16+1)*925*10^6/((16*16+4+5)*34000) = 55849 c/s

Note that I assume 16 instructions per round here, because per the
slides loads and ALU ops appear to be separate instructions on GCN.
If the number of instructions is actually lower for some reason (e.g.,
if there are byte extract instructions similar to Alpha's EXTBL), then
the theoretical speed may be slightly higher.

In practice, we will incur stalls, we will have overhead (more
instructions), and we might not be able to use the scalar unit like
that.  Without the scalar unit, we'll get:

32*16*925*10^6/((16*16+4+5)*34000) = 52564 c/s

Without gather loads, but using the scalar unit, we might have up to:

32*(4+1)*925*10^6/((16*16+4+5)*34000) = 16426 c/s

if we're somehow able to use the 4 SIMDs and LDS anyway (effectively
doing only one Blowfish encryption per SIMD unit), although from the
slides it appears like gather loads are supported.

Finally, using the scalar unit alone, we'd get up to:

32*1*925*10^6/((16*16+4+5)*34000) = 3285 c/s

(or less because of the occasional writes, which will be very slow).

Thus, the possible speedup over CPU should be at most 10x for HD 7970
vs. FX-8120 at stock clocks.

So far I've ignored the possibility of using L2 caches and global
memory, though.  Maybe adding that to make at least some (poor) use of
the otherwise idle SIMD units or lanes would increase the total speed a
little bit.  Maybe it would partially compensate the overall speed
decrease from stalls and overhead, which I did not make an attempt at
estimating above.

Overall, I think that we should give this a try.

BTW, on CPU some speedup might be possible from mixing in MMX
instructions - just to make use of the otherwise-idle vector unit.
I did not include this in any of the estimates above.  Yet we may try
implementing it too.