Date: Tue, 2 Jun 2015 04:45:18 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on CPU

Agnieszka,

On Mon, Jun 01, 2015 at 11:57:34PM +0200, Agnieszka Bielec wrote:
> Lyra2 uses OpenMP threads by default in one hashing function.

IIRC, their implementation uses pthreads directly, not via OpenMP.
Do you know otherwise?

> The nPARALLEL option determines how many OMP threads are running, and
> if nPARALLEL changes, the output also changes.
> nPARALLEL defaults to 2

What we need is an implementation of Lyra2 that would work for any
thread-level parallelism setting _without_ necessarily creating any
threads.  In its threads-disabled mode, it would compute those threads'
portions of work sequentially.  This is much like how Colin Percival's
original implementation of scrypt works when called with p > 1.

This will enable us to introduce our desired thread-level parallelism
externally, like we normally do - at JtR format level, due to having
multiple candidate passwords to test.

A reason to do things in this way is that a given Lyra2 hash being
cracked might not have sufficient thread-level parallelism for us to
make use of all of our CPU cores and such, or their thread-level
parallelism setting might not be a multiple of our logical CPU count,
nor vice versa (e.g., a given hash uses up to 4 threads, but we're on a
6-core CPU without SMT).

Then it will make sense to explore combinations of these settings since
the locality of reference (and thus cache hit rate) might turn out to be
higher when we do also use Lyra2's built-in thread-level parallelism (if
a given Lyra2 hash has any).

Converting to use of OpenMP may make such experiments easier, since
OpenMP does not forcibly create the max number of threads that the code
has parallelism for, and it includes support for nested parallel regions
(and ways to control its behavior for those).

BTW, it's similar for scrypt (already in jumbo), yescrypt, and Argon2 -
except that these don't default to 2 threads.

I think Lyra2's default of 2 threads matters only for the PHS() wrapper.
We should simply have its thread-level parallelism as one of the cost
parameters that we'll have encoded along with the salts, much like we do
for scrypt's p now.

> I implemented OMP in Lyra2 in 3 different ways, and I am including the
> test results below
> 
> a) I used nested omp
> b) I am using omp only in crypt_all() function and one hash is
> computed by nPARALLEL number of threads
> c) I am using omp only in crypt_all() function and one hash is
> computed by only one thread

Oh, it sounds like you already realized that there are different
approaches to try here.  Great.

I haven't looked at your code yet - I should.

> a)
> Will run 8 OpenMP threads
> Benchmarking: Lyra2, Generic Lyra2 [ ]... (8xOMP) DONE
> Speed for cost 1 (t) of 8, cost 2 (m) of 8
> Many salts:     4896 c/s real, 848 c/s virtual
> Only one salt:  5005 c/s real, 856 c/s virtual
> 
> b)
> Will run 8 OpenMP threads
> Benchmarking: Lyra2, Generic Lyra2 [ ]... (8xOMP) DONE
> Speed for cost 1 (t) of 8, cost 2 (m) of 8
> Many salts:     6608 c/s real, 876 c/s virtual
> Only one salt:  7120 c/s real, 935 c/s virtual
> 
> c)
> Will run 8 OpenMP threads
> Benchmarking: Lyra2, Generic Lyra2 [ ]... (8xOMP) DONE
> Speed for cost 1 (t) of 8, cost 2 (m) of 8
> Many salts:     7032 c/s real, 943 c/s virtual
> Only one salt:  7872 c/s real, 1035 c/s virtual
> 
> without openmp)
> Benchmarking: Lyra2, Generic Lyra2 [ ]... DONE
> Speed for cost 1 (t) of 8, cost 2 (m) of 8
> Many salts:     2130 c/s real, 2152 c/s virtual
> Only one salt:  2160 c/s real, 2138 c/s virtual
> 
> drawback of method b)
> -running_threads % nPARALLEL must be equal to 0
> drawback of method c)
> -we are using nPARALLEL times as much memory as in method b)

Oh, you noticed these too.  Cool.

> I think that method b) is slower because we are using synchronization
> many times and we have barriers for all OMP threads.

Maybe.  You can test this hypothesis by benchmarking at higher cost
settings and thus at lower c/s rates.  At lower c/s rates, the
synchronization overhead becomes relatively lower.

If confirmed, a way to reduce the overhead at higher c/s rates as well
would be via computing larger batches of hashes per parallel for loop.
This is what we normally call OMP_SCALE, but possibly applied at a
slightly lower level in this case.

> I couldn't find a way how to do it for only x threads.

What do you mean by x here?

> I am leaving it to you to decide which method to implement in JtR.

I think the order of our experiments should be as I outlined at the
start of this reply.

> and also I have a question.
> Lyra2 also has options like nPARALLEL, which are passed during
> compilation, but each of these options has a default value.
> I don't know if I should also add these options to the parameters
> passed to my hashing functions, or if I should make a function only
> for the default values

For nPARALLEL, make it a runtime parameter encoded with the hashes.

What other options "like this" are there?

I haven't looked at your Lyra2 code yet.  Is it committed?

While we're at it, have you moved the memory (de)allocation out of the
cracking loop?  And have you done it for POMELO too, as we had
discussed - perhaps reusing the allocators from yescrypt?  I don't
recall you reporting on this, so perhaps this task slipped through the
cracks?  If so, can you please (re-)add it to your priorities?

Thanks,

Alexander
