john-dev - Re: Adding OpenMP support to SunMD5

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150508212212.GA24976@openwall.com>
Date: Sat, 9 May 2015 00:22:12 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Adding OpenMP support to SunMD5

Hi Lei,

On Fri, May 08, 2015 at 09:52:53AM +0800, Lei Zhang wrote:
> Thanks for clearing it out. Now I've got SunMD5 to work with OpenMP, and here's the result obtained on my laptop:
> 
> [Before]
> Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... DONE
> Speed for cost 1 (iteration count) of 5000
> Raw:	605 c/s real, 610 c/s virtual
> 
> [After]
> Will run 4 OpenMP threads
> Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... (4xOMP) DONE
> Speed for cost 1 (iteration count) of 5000
> Raw:	1269 c/s real, 326 c/s virtual

This may be fine.  I assume your laptop's CPU is dual-core (with two
threads/core).

> Currently I only multiplied max_keys_per_crypt by number of threads. There might be more space for further improvement.

Yes, but given that this is a slow hash, simply multiplying by number of
threads should be fine.

> BTW, there's a minor issue I've encountered. See the following code (simplified):
> --------------------------------------------------------------------
> #if defined (_DEBUG)
> static unsigned char input_buf[BLK_CNT*MD5_CBLOCK];
> static unsigned char out_buf[BLK_CNT*MD5_DIGEST_LENGTH];
> static unsigned char input_buf_big[25][BLK_CNT*MD5_CBLOCK];
> #else
> static unsigned char *input_buf;
> static unsigned char *out_buf;
> static unsigned char (*input_buf_big)[BLK_CNT*MD5_CBLOCK];
> #endif
> --------------------------------------------------------------------

I'd drop the _DEBUG support, and keep only the dynamically allocated
version.

> Those three arrays have static size, but are dynamically allocated in a non-debug build. With OpenMP enabled, each thread should have a private copy of these arrays.

Not necessarily:

You could instead have big arrays for all keys, covering the range for
all threads combined, but the threads would each access their portion of
key sub-groups.  (This is what happens for most JtR formats currently,
albeit in a simpler form: there's only a per-key loop.)  In fact, you
would not need to care about threads - it's OpenMP's job.  The
parallelized top-level loop will simply invoke its body on indices that
do not overlap for different threads.

Or you could in fact have per-thread buffers, yet have them allocated
dynamically.  For example c3_fmt.c allocates some.

Or, as a third alternative, if those buffers are only needed during
crypt_all(), but not to pass data to/from it, then you could put them on
the stack from within the top-level loop.  For example, BF_std.c:
BF_std_crypt() does this for "struct BF_ctx BF_current;"

> If they're statically allocated, I can use a single clause like this:
> --------------------------------------------------------------------
> #pragma omp threadprivate(out_buf, input_buf, input_buf_big)
> --------------------------------------------------------------------

I see that this is what you implemented now, but it makes me moderately
unhappy.  It's the only use of threadprivate in our tree so far.  I'd
rather not introduce extra dependencies unnecessarily.

Similarly, you've introduced the only use of copyin so far.  And this
one might have performance cost.

I suggest that you avoid these.

I've just checked, both threadprivate and copyin are already in OpenMP
2.5, which is probably the minimum we require anyway, so it's not that
big a deal.  Yet I see no reason for us to depend on them.

There are still systems without or with incomplete thread local storage
support.  I don't know if gcc/libgomp would currently (partially) work
on those or not at all, but I don't rule out the possibility.  If so, it
is possible that a subset of OpenMP features would work there.
threadprivate would not.

> But if not, I probably need to do some explicit copying for each thread.

Why does anything need to be copied?

> Currently I manually defined _DEBUG for ease of experimenting. I'm not sure about the point of allocating those array dynamically. Is there some optimization involved?

Not wasting address space.  With static allocation, they'd be in .bss
even when the SunMD5 format isn't in use.

> If dynamic allocation isn't necessary, I'd abandon it and keep the OpenMP code clearer.

We have different opinions on what's cleaner, it seems.

If the copying is in fact needed, I might agree with you.  I just don't
see why it would be.  Other JtR formats manage to do without any such
copying, and this one shouldn't be special in that aspect.  I think you
just need to think of it out of the box, not constrained to the
definitions and code you started with.

Thanks,

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.