john-dev - Re: [GSoC] JtR SIMD support enhancements



Message-ID: <20150425182122.GA21302@openwall.com>
Date: Sat, 25 Apr 2015 21:21:22 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] JtR SIMD support enhancements

On Sat, Apr 25, 2015 at 08:06:25PM +0200, magnum wrote:
> On 2015-04-25 14:34, Solar Designer wrote:
> >>Benchmarking: NT [MD4 32/64]... DONE
> >>Raw:	3509K c/s real, 3509K c/s virtual
> >
> >No MIC code for it yet?
> 
> You mean hand crafted assembler? If not I can't see what would be the 
> point considering we have NT2. This format is obsolete for AVX2 builds 
> too (NT2 benches nearly 100M). I think we might want to phase it out (at 
> least rename it to nt-sse2 or something) in favor of NT2 (which I'd 
> prefer to have as "NT" from now on). That is, unless someone actually 
> adds stuff to the assembler code soon.

You're right.

> >>Benchmarking: nt2, NT [MD4 512/512 MIC 16x]... DONE
> >>Raw:	4907K c/s real, 4907K c/s virtual
> >
> >Very little improvement relative to the "NT" format.  I expected more of
> >a difference.  Perhaps this will be seen with --fork=240.  I guess the
> >SIMD instructions have higher latency, so impact the case of running
> >only one thread/core more.  Need to run 4 threads/core here.
> 
> I suppose that also means bumping the interleaving factors will make a 
> whole lot of difference, more than we're used to. Considering this run 
> was without ANY interleaving I'm curious what results we'll see.

Yes, but it's quite pointless to optimize single-thread speeds.  We need
to optimize for the case of running the maximum number of threads on
each core.

> I don't really have a good understanding of the MIC. Is there only 60 
> real cores? Is there some kind of HT involved too or why is it 240 
> threads by default? Is it more like 60 real quad-cores but with some 
> bottlenecks between them?

60 real cores, 4 threads/core.  It's the same SMT concept that we know
as HT on Intel CPUs, but with 4 rather than 2 threads/core.  This isn't
even a very high number - on recent (and even not so recent) SPARC and
POWER CPUs it's already up to 8 threads/core.
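
To spell out where the 240 in --fork=240 comes from, here is a trivial
sketch.  The function name hw_threads() is made up for illustration and
is not part of JtR:

```c
#include <assert.h>

/*
 * Hypothetical helper (not JtR code): total hardware threads on a chip
 * is the number of physical cores times the SMT threads per core.
 */
unsigned int hw_threads(unsigned int cores, unsigned int threads_per_core)
{
	assert(cores > 0 && threads_per_core > 0);
	return cores * threads_per_core;
}
```

With the numbers above, hw_threads(60, 4) is 240, which is the process
count to aim for when loading every hardware thread on this MIC.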

> BTW I think we (at least I) have always established the interleaving 
> factors by benching a non-OpenMP build. Not sure if that's sensible here.

I think we were wrong in doing it that way, unless we try to optimize
for otherwise-similar but HT-less cheaper versions of Intel CPUs, such
as Core i5's and Ivy Bridge Celerons.

For example, for bcrypt I found that on HT-capable Intel CPUs it's
usually optimal to use 2x interleaving, but Frank now complains that on
his HT-incapable but otherwise modern CPU 3x was faster.  I'm not
surprised at all, since 3x was also faster on Core 2'ish CPUs, which
lacked HT.

So maybe we need both kinds of settings... and this means two different
builds, as we're currently unable to link both versions in.  Luckily,
for MIC the issue does not arise: we know it's always 4 threads/core.
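
For readers less familiar with the term, here is a minimal sketch of
what an interleaving factor means and why it ends up being a per-build
setting.  This is not JtR's actual code: the INTERLEAVE macro, the
function names, and the toy round function are all assumptions made up
for illustration.  The idea is only that several independent candidates
are advanced together, so the CPU can overlap their instruction
latencies, and that the factor is fixed at compile time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical compile-time knob; override with e.g. -DINTERLEAVE=3. */
#ifndef INTERLEAVE
#define INTERLEAVE 2
#endif

#define ROUNDS 16

/* Toy mixing step standing in for one round of a real hash. */
static uint32_t toy_round(uint32_t x, uint32_t r)
{
	x += 0x9e3779b9u + r;
	x ^= x >> 15;
	x *= 0x85ebca6bu;
	return x;
}

/* Reference: one candidate at a time; each round waits on the previous. */
uint32_t toy_hash(uint32_t x)
{
	uint32_t r;

	for (r = 0; r < ROUNDS; r++)
		x = toy_round(x, r);
	return x;
}

/*
 * Interleaved: INTERLEAVE independent candidates go through each round
 * together.  Their dependency chains do not touch each other, so the
 * compiler and CPU are free to overlap them.
 */
void toy_hash_interleaved(const uint32_t *in, uint32_t *out, unsigned int n)
{
	unsigned int i, j;
	uint32_t r;

	assert(n % INTERLEAVE == 0); /* sketch only: no tail handling */
	for (i = 0; i < n; i += INTERLEAVE) {
		uint32_t s[INTERLEAVE];

		for (j = 0; j < INTERLEAVE; j++)
			s[j] = in[i + j];
		for (r = 0; r < ROUNDS; r++)
			for (j = 0; j < INTERLEAVE; j++)
				s[j] = toy_round(s[j], r);
		for (j = 0; j < INTERLEAVE; j++)
			out[i + j] = s[j];
	}
}
```

The results are the same as hashing each candidate separately; only the
scheduling differs.  Since the factor is a macro here, comparing 2x and
3x means compiling twice, which mirrors the "two different builds"
situation described above.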

Speaking of OpenMP in particular, as you know it's inefficient for fast
hashes, so for those I'd tune the interleave factors with --fork instead.

Alexander
