Date: Wed, 11 Jul 2012 17:00:00 +0400
From: Solar Designer <>
Subject: Re:

On Wed, Jul 11, 2012 at 04:41:22PM +0530, Sayantan Datta wrote:
> At IL level I can see that each LDS load or store is associated with 5 extra
> instructions to specify the address of the data to be fetched or stored.

I think it'd help to look at GCN instructions to see whether these IL
instructions are compacted into fewer GCN instructions (and how many),
and what addressing modes are used by different revisions of OpenCL code.

> ;r79.x == L
>     iand r80.x___, r79.x, r69.x
> ;r69.x == 255
>     ior r80.x___, r80.x, r69.z
> ;r69.z == 768    ,add 768
>     ishl r80.x___, r80.x, r72.y
> ;r72.y == 2 , multiply by 4, i.e. for addressing 4-byte uint
>     iadd r80.x___, r74.z, r80.x

First, this appears to be for a pre-Sptr revision of the code, right?
Do you get fewer IL instructions for the Sptr code revision?

Second, it appears that my alternative version of BF_ROUND for
"architectures with no complicated addressing modes supported" would
result in fewer IL instructions (the shift by two bits would be
avoided for 3 out of 4 S-boxes).  (Whether it would also reduce the GCN
instruction count or not is another question.)  Can you try it?
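
To make the point about the shift concrete, here is a minimal C sketch of the
two ways to form the S-box addresses.  It is an illustration only, not the
actual BF_ROUND macro nor the OpenCL kernel: the function names are
hypothetical, it covers only the Blowfish F function (the four S-box lookups),
and it assumes the four S-boxes are stored back to back in one array of 1024
words, which is what the "768" in the quoted IL appears to imply.  In the
second variant, shifting right by 22, 14, or 6 and masking with 0x3FC yields a
byte offset directly, so only the lowest byte still needs an explicit shift
left by two.

```c
#include <stdint.h>
#include <string.h>

/* Variant 1: element indices.  Each lookup implies an index * 4 scaling,
 * which costs a separate shift where no scaled addressing mode exists. */
static uint32_t bf_f_indexed(const uint32_t *S, uint32_t L)
{
    uint32_t a = L >> 24;
    uint32_t b = (L >> 16) & 0xFF;
    uint32_t c = (L >> 8) & 0xFF;
    uint32_t d = L & 0xFF;
    return ((S[a] + S[256 + b]) ^ S[512 + c]) + S[768 + d];
}

/* Load one 32-bit word at a byte offset from the start of the S-boxes. */
static uint32_t load_at(const uint32_t *S, uint32_t byte_off)
{
    uint32_t v;
    memcpy(&v, (const unsigned char *)S + byte_off, sizeof(v));
    return v;
}

/* Variant 2: byte offsets.  Three of the four offsets come out of a
 * shift-and-mask already multiplied by 4; only the low byte is shifted. */
static uint32_t bf_f_byte_offset(const uint32_t *S, uint32_t L)
{
    uint32_t a = (L >> 22) & 0x3FC;   /* ((L >> 24) & 0xFF) * 4 */
    uint32_t b = (L >> 14) & 0x3FC;   /* ((L >> 16) & 0xFF) * 4 */
    uint32_t c = (L >> 6) & 0x3FC;    /* ((L >> 8) & 0xFF) * 4 */
    uint32_t d = (L & 0xFF) << 2;     /* the one remaining explicit shift */
    return ((load_at(S, a) + load_at(S, 1024 + b))
            ^ load_at(S, 2048 + c)) + load_at(S, 3072 + d);
}
```

Both functions return the same value for any L; whether the second one also
maps to fewer GCN instructions is exactly what would need to be checked in the
generated code.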

> Looking at IL I can't say for sure whether it is using gather addressing. It
> is the same cl code written in assembly.  Although looking at benchmarks it
> seems like we are using gather addressing.

Yes, but only through the lack of performance change between work group
size 8 and 4, assuming that this translates to using 2 vs. 4 SIMDs.
I am not 100% sure that this assumption is correct.  It'd be nice to
validate it explicitly, such as through review of GCN code or through
running some analyzer/profiler that would report which execution units
were in use by a given kernel.

> > BTW, in the wiki page at you
> > mention Bulldozer's L2 cache.  JFYI, this is irrelevant, since on CPUs
> > the S-boxes fit in L1 cache.  We only run 4 concurrent instances of
> > bcrypt per Bulldozer module (two per thread, and there's no need to run
> > more), so that's only 16 KB for the S-boxes.
> Thanks for clearing that up. I will change that statement.


Now you're saying that Bulldozer has 16 KB L1 cache per module, but
actually I think it has 16 KB L1 data cache per core, or 32 KB (in two
separate L1 data caches) per module.  This makes a difference since with
the P arrays we exceed 16 KB per module a little bit.  Since the caches
are actually larger than what you write, both S and P do fit in them.
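
The arithmetic behind this can be sketched in C as follows.  The Blowfish
state sizes (four S-boxes of 256 32-bit words, and a P array of 18 32-bit
words) are standard; the cache sizes are the ones assumed above (16 KB L1 data
cache per core, two such caches per module), and the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    BF_SBOX_BYTES = 4 * 256 * (int)sizeof(uint32_t), /* 4096 bytes */
    BF_P_BYTES = 18 * (int)sizeof(uint32_t),         /* 72 bytes */
    L1D_BYTES_PER_CORE = 16 * 1024,                  /* assumed above */
    L1D_CACHES_PER_MODULE = 2                        /* assumed above */
};

/* Bytes of per-instance bcrypt state held by `instances` concurrent
 * instances, with or without the P arrays. */
static size_t bcrypt_state_bytes(size_t instances, int with_p)
{
    size_t per_instance = BF_SBOX_BYTES;
    if (with_p)
        per_instance += BF_P_BYTES;
    return instances * per_instance;
}
```

With 4 instances per module this gives 16384 bytes for the S-boxes alone and
16672 bytes with the P arrays, which is slightly over 16 KB but well within
32 KB.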

As to your guess that APUs would be good for bcrypt due to their L3
cache, I doubt that they'd soon exceed the speeds of CPUs where we use L1
cache.  L3 cache is usually quite slow, and data is read from it (into
faster caches) in entire cache lines, whereas we only use 32 bits per
lookup.  It's the same problem we're seeing with global memory on GPU.
L3 cache might be a little bit faster (maybe even a few times faster),
but otherwise similar.

What will in fact enable much faster bcrypt cracking is AVX2 (will be
generally available in CPUs next year) and Intel MIC (already available
as a coprocessor in supercomputers, including one entry on the recent
top 500 list, but not available for purchase by mere mortals), assuming
that it inherited scatter/gather addressing from Larrabee.  The latter
might provide a 25x speedup per-chip over what we currently have
(assuming that it will be limited by 32 KB L1 data cache per core;
otherwise 50x speedup seems possible).
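
For reference, gather addressing means that each SIMD lane loads from its own
computed address within a single vector operation, which is what bcrypt's
per-instance, data-dependent S-box lookups need.  The following is only a
scalar C model of that behavior, not AVX2 or MIC code: the lane count of 8 and
the layout (one 256-word S-box per lane, stored back to back) are assumptions
made for illustration.

```c
#include <stdint.h>

#define LANES 8 /* hypothetical vector width in 32-bit lanes */

/* Scalar model of a gather: lane i loads entry (idx[i] & 0xFF) of its own
 * 256-word table.  Hardware with gather support would perform all LANES
 * loads as one vector instruction instead of this loop. */
static void gather_model(uint32_t out[LANES], const uint32_t *tables,
                         const uint32_t idx[LANES])
{
    for (uint32_t lane = 0; lane < LANES; lane++)
        out[lane] = tables[lane * 256 + (idx[lane] & 0xFF)];
}
```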

