Date: Tue, 31 Jan 2012 06:01:28 +0400
From: Solar Designer <>
Subject: Re: DES with OpenMP

On Tue, Jan 31, 2012 at 01:32:35AM +0000, Alex Sicamiotis wrote:
> john1 => DES_bs_cpt=1
> john4 => DES_bs_cpt=4
> etc etc all the way to 256.
> All in all "4" seems to be the best performing option for both 1 thread and 2 threads.

Thanks.  Actually, 1 and 8 are about as good on your system.

> "32" seems to be 100k slower for many salts and 200k slower for one salt with one thread
> "32" seems to be 300k slower for many salts and 1000k slower for one salt with two threads

That's curious.  32 was a result of tuning on a dual quad-core system
(two CPU sockets).  Larger systems (multiple CPU boards) show better
performance with even higher values (even as high as 1024, although the
program becomes less interactive then).

Maybe the code should assume that if there are 4 threads or fewer, that's
probably just one CPU chip - and use DES_bs_cpt=4 or 8 in that case.
This assumption will fail if the number of threads is deliberately
lowered to use only some cores in a multi-socket system, though.  And it
will fail differently for CPU chips with more than four cores.  Not great.
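
A minimal sketch of the heuristic above, for illustration only.  The
function name pick_des_bs_cpt() and the exact threshold are hypothetical
and not taken from the actual source; the returned values are simply the
ones discussed in this thread (4 for a presumed single chip, 32 as tuned
on the dual quad-core system).

```c
/*
 * Hypothetical helper (not actual John the Ripper code): choose a
 * DES_bs_cpt-like value from the number of OpenMP threads in use.
 */
static int pick_des_bs_cpt(int threads)
{
	/* Assumption: 4 threads or fewer means a single CPU chip,
	 * where 4 (or 8) performed best in the tests quoted above. */
	if (threads <= 4)
		return 4;

	/* Otherwise fall back to the value tuned on a dual-socket
	 * quad-core system. */
	return 32;
}
```

A caller would pass in the effective thread count, e.g.
pick_des_bs_cpt(2) for the two-thread runs above.  As noted, the guess
is wrong whenever the thread count does not reflect the number of
sockets.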

> So maybe the extra buffered stuff overflows my 1 MB L2 cache per core - which reduces speed (???).

More likely, this has to do with differences in L1 data cache usage.

> Anyway, if the 32 value is actually a slowing factor (for me),

It might be different in the icc build, which e.g. might be processing
the parallel loop iterations in a different order.

> then there's some other difference in the openMP version

There are also other differences, indeed.

> that not only covers this performance loss of ~100-200k (from the 32X
> relative to 1X of the non-OpenMP version), but also compensates with a
> +300k - making the OpenMP version (1 thread) reach speeds of +200k over
> the non-OpenMP version. This means that there are some other factors
> which make OpenMP (thread=1) faster which may be worth investigating
> and replicating in the non-OpenMP version. You are the programmer so
> you know them best :D

I suspect that it's nothing fundamental, but merely icc happening to do
register allocation or whatever better in one version of code vs. the
other.  It might be the other way around in a slightly different build.

These differences of a few percent are hard, if not unrealistic, to turn
in our favor reliably without explicit assembly code and a focus on a
specific CPU.

