john-dev - Re: Adding OpenMP support to SunMD5

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150509020039.GA27423@openwall.com>
Date: Sat, 9 May 2015 05:00:39 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Adding OpenMP support to SunMD5

On Sat, May 09, 2015 at 09:23:45AM +0800, Lei Zhang wrote:
> There seems to be some compatibility issue among different compilers' implementation of OpenMP. Initially I used the following OpenMP clause, which works fine with icc on my laptop:
> 
> #pragma omp parallel for default(none) private(idx) copyin(input_buf_big) \
> 	shared(saved_salt, data, constant_phrase, ngroups, group_sz)
> 
> But when I experimented it on well, gcc failed to compile, saying that constant_phrase is already a constant, thus no need to be declared as shared. So I removed it, and then icc failed to compile... Magnum suggested I use the simplified form to avoid this issue:
> 
> #pragma omp parallel for copyin(input_buf_big)

That's not great.  I prefer that we use default(none).

> It works well both on my laptop and well. I don't know it fails on super before you pointing it out. I tried changing the OpenMP clause back to its lengthy form, and now it works on super:
> 
> [lei@...er src]$ ../run/john --test --format=sunmd5
> Will run 32 OpenMP threads
> Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... (32xOMP) DONE
> Speed for cost 1 (iteration count) of 5000
> Raw:	5907 c/s real, 194 c/s virtual

Note that this is only ~11x faster than single-thread speed.  It should
be ~15x+ faster.

> The default gcc version on super is 4.4.7, and I'm using gcc-4.9.2 on well. I assume there's incompatibility even between different versions of gcc.

Not necessarily.  Passing self-tests isn't enough proof the code is
actually reliable.  It is possible that there are still issues, and they
just appear on self-tests "randomly" with different versions of code on
different machines.

Anyway, you need to revise this code as I suggested in another message.
Then test it for real (beyond self-test).

> > and the single-thread performance is a bit lower than it was before.
> > It was:
> > 
> > [solar@...er run]$ ./john -te -form=sunmd5
> > Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... DONE
> > Speed for cost 1 (iteration count) of 5000
> > Raw:    538 c/s real, 538 c/s virtual
> 
> I'm not sure of the penalty introduced by a new outer loop, even if the loop is only iterated once in non-openmp mode. I can use macros to disable the outer loop in a non-openmp build if necessary. I haven't done so because the code looks cleaner the current way.

I think copyin() might have a measurable performance cost.  A little for
one thread, and more for many threads.  This may also be why the OpenMP
scaling is worse than expected in the 32-thread benchmark.

Alexander

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.