john-users - RE: DES with OpenMP

Message-ID: <BLU159-W47D661ADC76FA445CD38D6A4900@phx.gbl>
Date: Sun, 1 Jan 2012 14:48:28 +0000
From: Alex Sicamiotis <alekshs@...mail.com>
To: <john-users@...ts.openwall.com>
Subject: RE: DES with OpenMP




> Date: Sun, 1 Jan 2012 10:11:55 +0400
> From: solar@...nwall.com
> To: john-users@...ts.openwall.com
> Subject: Re: [john-users] DES with OpenMP
> 
> On Sat, Dec 31, 2011 at 08:32:43PM +0000, Alex Sicamiotis wrote:
> > I read what you said about icc improving per-core speed. I hadn't noticed the difference because non-OpenMP builds produce similar results, since they all use the asm code. I just ran the OpenMP build with OMP_NUM_THREADS=1, and it seems you are right: I got 4527k c/s vs. 4330-4350k c/s for the best GCC/ICC/Open64 builds I have. That is about +4% in single-core cracking speed. Not bad at all. Apparently icc is better even than the asm code. So its two-thread efficiency is actually 8707 / (4527 x 2) = ~96.1%.
> 
> Thanks for the info.  Yes, icc is really good.  Also, you probably have
> it tune for your specific CPU model, whereas in the supplied assembly
> code I couldn't reasonably focus on just one CPU model.
> 
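For reference, the efficiency figure quoted above is the measured two-thread rate divided by twice the single-thread rate. A minimal sketch of that arithmetic, using the c/s figures from this thread; the function name is only illustrative:

```python
def parallel_efficiency(multi_rate, single_rate, threads):
    """Fraction of ideal linear scaling achieved by the multi-threaded run."""
    return multi_rate / (single_rate * threads)


# Figures from the message, in thousands of c/s:
# 8707k with two threads, 4527k with OMP_NUM_THREADS=1.
eff = parallel_efficiency(8707, 4527, 2)
print(f"two-thread efficiency: {eff:.1%}")
```

With these inputs the result is roughly 96%, in line with the figure in the quoted message.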

Yep, it was march=core2. Unfortunately my icc eval license has expired; otherwise I'd also try it with -prof-gen / -prof-use for PGO, now that I know the asm is not used in the OpenMP version. Btw, how can I emulate in a non-OpenMP build what happens in the OpenMP build (bypassing the asm)? I want to see what the AMD Open64 compiler does with it, because it compiles the single-core version but not the OpenMP one. Is there a tweak in the Makefile for this?


> > Btw, I also found very small gains by tweaking the Linux kernel. I'm using openSUSE, which has two kernels: desktop (low latency, more timeslicing) and server (higher latency, more processing throughput). Server gave about +30k c/s peak relative to desktop, and a custom kernel built for my CPU, with less debug code etc., gave another 10k c/s relative to server. I was kind of disappointed though. 5 or 6 years ago, in the Athlon XP days, latency was more critical for performance. IIRC, I was around 920k c/s with the low-latency kernel, 970-980k c/s with the server kernel and >1020k c/s in win32 with Cygwin, but that system was clearly less responsive than even the Linux server kernel. I later read somewhere that XP uses a timer frequency of 100 Hz (Linux server kernel = 250 Hz, Linux desktop kernel = 1000 Hz). The higher timer frequency is more disruptive, and increased timeslicing leads to more cache misses. Somehow, today's kernels only show marginal differences :|
> 
> I think the differences you observed with the Athlon were caused by
> something else.  In fact, even the +30k c/s that you report now sounds
> excessive to me.  Back in the 1990s with 100 to 200 MHz original Pentium
> CPUs, I measured a difference of around 1% between 100 Hz and 1000 Hz
> timer frequency in Linux (custom patch for Linux 2.0.x kernels).  This
> should be like 0.1% with a 1 GHz CPU.
> 
> My guess is that you have some user-space processes running - perhaps
> they're part of a GUI desktop.
> 
> As to the Cygwin build, clearly it was different from the Linux build in
> several ways (relative placement of variables and pieces of code, etc. -
> maybe more lucky in your case).
> 

Aha... Let me clarify, because I was not very clear and I'm comparing apples with oranges unless I explain. The +30k c/s and the further +10k c/s apply only to the icc OpenMP build, which I sensed was too sensitive to interruptions. My rationale was that if I reduced I/O checking, timeslicing etc., it would run even better. Indeed it ran slightly better, but going from 8620-8660k c/s to 8700-8710k c/s (in a real-life cracking benchmark) was, to be honest, not what I expected (just +0.5-0.6%). Single-core builds show marginal to non-existent gains of 4-5k c/s at most, which at 4M c/s is a negligible 0.1% improvement and was a huge disappointment.

I don't know why the Athlon XP results varied so wildly though. I was really impressed when I saw it back then and, naturally, it registered in my mind as something definitely worth trying for a significant gain in similar cases, so I did try it. Maybe it's not a timer issue: it could be that the first implementations of preemption in the kernel were far less efficient than today's preemptible kernels. I remember running the Byte benchmarks on the Athlon XP; they too lagged with the preemptive / desktop kernel. That was in 2005 or 2006, so the kernel must have been something like 2.6.8-2.6.12.

By the way, I always benchmark with X.org shut down; otherwise even the tiniest process makes the results erratic. And there are always processes running on the desktop, especially my desktop (plasma widgets).
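The percentages for the kernel change follow from a simple relative-gain calculation. A rough sketch, taking endpoints of the ranges mentioned in this message (the helper name is made up for illustration):

```python
def relative_gain(before, after):
    """Percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


# OpenMP build, in thousands of c/s: top of the old range vs. the new range.
print(f"OpenMP build: +{relative_gain(8660, 8700):.2f}% to +{relative_gain(8660, 8710):.2f}%")

# Single-core build: about 5k c/s gained on roughly 4000k c/s.
print(f"single core:  +{relative_gain(4000, 4005):.3f}%")
```

Depending on which endpoints are compared, the OpenMP gain lands at around half a percent, and the single-core gain at about a tenth of a percent.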


> > For the time being, I'm absolutely "settled" with the icc openmp version. It practically eliminates the need for two sessions of john, except when I'm running KDE desktop. Then the speed falls from 8600 k c/s to 8200-8300 k c/s, and ~7500k c/s when having mp3s playing or stuff. That's when I prefer 2 concurrent non-openMP versions. Even a 2% load in x.org, mp3s, plasma and stuff is very disruptive to openMP, despite nice values of minus 11 to minus 15 in a "server" kernel.
> 
> Yes, OpenMP is very sensitive to other load.  You may try to mitigate
> this to some extent by tweaking GOMP_SPINCOUNT:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43706
> 
> You may also try adding schedule(dynamic) to the relevant #pragma line.
> This may make things slightly slower on an otherwise idle system, but
> slightly faster when there's other load.
> 
> Alexander
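One way to experiment with the GOMP_SPINCOUNT suggestion is to launch the benchmark with different values set in the environment. The sketch below is only an assumption about how one might script that: the ./john path and the spin count values are hypothetical, and GOMP_SPINCOUNT is read by GCC's libgomp, so a build linked against another OpenMP runtime (such as an icc build) may not honor it.

```python
import os
import subprocess


def openmp_env(threads, spincount=None, base=None):
    """Return a copy of the environment with OpenMP settings applied."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = str(threads)
    if spincount is not None:
        # Only meaningful for binaries using GCC's libgomp runtime.
        env["GOMP_SPINCOUNT"] = str(spincount)
    return env


def run_benchmark(john_path, env):
    """Run John's built-in benchmark (--test) and return its text output."""
    result = subprocess.run([john_path, "--test"], env=env,
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical binary location and spin counts; adjust to your setup.
    for spin in (None, 10000, 1000000):
        print(f"GOMP_SPINCOUNT={spin}")
        print(run_benchmark("./john", openmp_env(2, spin)))
```

Comparing the reported c/s across runs, both on an idle system and with the desktop loaded, should show whether the setting helps in a given case.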

Thanks for the tips.