Date: Sat, 9 May 2015 05:00:39 +0300
From: Solar Designer <>
Subject: Re: Adding OpenMP support to SunMD5

On Sat, May 09, 2015 at 09:23:45AM +0800, Lei Zhang wrote:
> There seem to be some compatibility issues among different compilers' implementations of OpenMP. Initially I used the following OpenMP clause, which works fine with icc on my laptop:
> #pragma omp parallel for default(none) private(idx) copyin(input_buf_big) \
> 	shared(saved_salt, data, constant_phrase, ngroups, group_sz)
> But when I tried it on well, gcc failed to compile it, saying that constant_phrase is already constant and thus need not be declared as shared. So I removed it, and then icc failed to compile... Magnum suggested I use the simplified form to avoid this issue:
> #pragma omp parallel for copyin(input_buf_big)

That's not great.  I prefer that we use default(none).
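
To illustrate why default(none) is worth keeping, here is a minimal sketch, not the SunMD5 code.  The names compute_all() and mix() are made up for the example.  With default(none), the compiler rejects any variable used in the region that has no explicit data-sharing attribute, so a forgotten shared or private shows up at build time rather than as a race.  The sketch deliberately uses no const-qualified variables, since listing those in shared() is exactly where the compilers above disagreed.

```c
/* Hypothetical stand-in for per-candidate work; not taken from the
 * SunMD5 format. */
static unsigned int mix(unsigned int x, unsigned int salt)
{
	return (x * 2654435761u) ^ salt;
}

/* Fill out[0..count-1].  Every variable referenced inside the region
 * is listed explicitly; with default(none) the compiler refuses to
 * build if one of them is left out. */
void compute_all(unsigned int *out, int count, unsigned int salt)
{
	int idx;

#ifdef _OPENMP
#pragma omp parallel for default(none) private(idx) \
	shared(out, count, salt)
#endif
	for (idx = 0; idx < count; idx++)
		out[idx] = mix((unsigned int)idx, salt);
}
```

Without OpenMP the pragma is compiled out and the loop runs serially with the same result.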

> It works both on my laptop and on well. I didn't know it failed on super until you pointed it out. I changed the OpenMP clause back to its longer form, and now it works on super:
> [ src]$ ../run/john --test --format=sunmd5
> Will run 32 OpenMP threads
> Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... (32xOMP) DONE
> Speed for cost 1 (iteration count) of 5000
> Raw:	5907 c/s real, 194 c/s virtual

Note that this is only ~11x faster than single-thread speed.  It should
be ~15x+ faster.
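
The ~11x figure comes from dividing the 32-thread result above (5907 c/s) by the earlier single-thread result quoted further down in this message (538 c/s).  A small sketch of that arithmetic, with helper names that are only illustrative:

```c
/* Ratio of multi-threaded to single-threaded throughput. */
double speedup(double multi_cps, double single_cps)
{
	return multi_cps / single_cps;
}

/* Speedup divided by thread count: 1.0 would mean every thread
 * delivers full single-thread speed. */
double per_thread_efficiency(double multi_cps, double single_cps, int threads)
{
	return speedup(multi_cps, single_cps) / threads;
}
```

speedup(5907, 538) is about 10.98, which is the ~11x above.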

> The default gcc version on super is 4.4.7, and I'm using gcc-4.9.2 on well. I assume there's an incompatibility even between different versions of gcc.

Not necessarily.  Passing self-tests isn't enough proof the code is
actually reliable.  It is possible that there are still issues, and they
just appear on self-tests "randomly" with different versions of code on
different machines.

Anyway, you need to revise this code as I suggested in another message.
Then test it for real (beyond self-test).

> > and the single-thread performance is a bit lower than it was before.
> > It was:
> > 
> > [ run]$ ./john -te -form=sunmd5
> > Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... DONE
> > Speed for cost 1 (iteration count) of 5000
> > Raw:    538 c/s real, 538 c/s virtual
>
> I'm not sure of the penalty introduced by a new outer loop, even if the loop is only iterated once in non-OpenMP mode. I can use macros to disable the outer loop in a non-OpenMP build if necessary. I haven't done so because the code looks cleaner the current way.

I think copyin() might have a measurable performance cost.  A little for
one thread, and more for many threads.  This may also be why the OpenMP
scaling is worse than expected in the 32-thread benchmark.
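
For reference, here is a minimal sketch of what copyin() does.  It is not the SunMD5 code: scratch, BUF_SIZE and sum_with_copyin() are made-up names, and the real input_buf_big has a different type and size.  copyin() applies to threadprivate variables, and on every entry to the parallel region the master thread's copy is copied into each other thread's copy.  That is work proportional to buffer size times thread count, which is why I suspect it is not free.

```c
#include <string.h>

#define BUF_SIZE 64	/* arbitrary size for the sketch */

/* Hypothetical per-thread buffer, loosely modeled on input_buf_big. */
static unsigned char scratch[BUF_SIZE];
#ifdef _OPENMP
#pragma omp threadprivate(scratch)
#endif

/* The master thread fills its copy of scratch; copyin() then
 * replicates it into every thread's copy each time the region
 * is entered. */
unsigned int sum_with_copyin(int count)
{
	unsigned int total = 0;
	int idx;

	memset(scratch, 1, sizeof(scratch));
#ifdef _OPENMP
#pragma omp parallel for default(none) private(idx) shared(count) \
	copyin(scratch) reduction(+:total)
#endif
	for (idx = 0; idx < count; idx++)
		total += scratch[idx % BUF_SIZE];
	return total;
}
```

Whether the copy matters in practice would have to be measured on the actual format; this only shows where the extra work comes from.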

