john-dev - Re: bcrypt BF_X2=3 is not always best

Message-ID: <20150305144403.GA24579@openwall.com>
Date: Thu, 5 Mar 2015 17:44:03 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bcrypt BF_X2=3 is not always best

On Thu, Dec 25, 2014 at 12:46:35PM +0100, magnum wrote:
> On 2014-12-25 03:14, Solar Designer wrote:
> > The 3x interleaving works significantly better than 2x for Intel
> > x86-64 CPUs without Hyperthreading (such as Core 2 Duo/Quad), but is
> > usually of little help or sometimes even hurts speeds on CPUs that
> > are capable of running 2 threads/core.
> 
> > I don't know how/whether we can reasonably detect which BF_X2 setting is
> > best.  Running benchmarks at build- or run-time is unstable or slow,
> > given the variance seen under light unrelated load.  And these would have
> > to be full OpenMP benchmarks, because relative speeds are different when
> > running only 1 thread.
> 
> I think we should use a shared cpu_detect() function for x86, so we can
> detect HT, XOP/AVX and other things at run time. Another thing that can
> differ a lot between different CPU types is what we usually call
> OMP_SCALE - for the nt2 format I believe 1M is best on Bull while just
> 4K is best on core i7. The current selection is the __XOP__ and __AVX__
> macros at build time.
> 
> Looking at x86.S we already have this function... is it usable as-is?

As it is, it's only usable to terminate the process or exec a fallback
binary.  It does not return the detected CPU type, but rather a Boolean
indicating whether the current CPU satisfies the current build or not.

(Actually, it also patches some function pointers inside x86.S itself,
but this is legacy stuff such as for P1/P4 vs. PPro/P2/P3.  A similar
function in x86-64.S does not patch anything.)
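
As a rough illustration only (this is not code from the JtR tree), a
detection routine that reports features rather than a single Boolean
could look like the sketch below.  The CPU_FEAT_* names and both
function names are hypothetical, and the sketch assumes a GCC-compatible
compiler that provides __builtin_cpu_supports() on x86.  The existing
"does this CPU satisfy this build" answer can still be derived from the
returned mask.

```c
#include <assert.h>

/* Hypothetical feature bits; these names are not from the JtR tree. */
#define CPU_FEAT_SSE4_1	0x1u
#define CPU_FEAT_AVX	0x2u

/*
 * Return a bitmask of detected features instead of a yes/no answer.
 * Assumes GCC or Clang on x86; elsewhere it reports no features.
 */
static unsigned int cpu_detect_features(void)
{
	unsigned int feat = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("sse4.1"))
		feat |= CPU_FEAT_SSE4_1;
	if (__builtin_cpu_supports("avx"))
		feat |= CPU_FEAT_AVX;
#endif
	return feat;
}

/*
 * Reproduce the current Boolean behavior on top of the mask: true only
 * if every feature the build requires is present on this CPU.
 */
static int cpu_satisfies_build(unsigned int feat, unsigned int required)
{
	assert((required & ~(CPU_FEAT_SSE4_1 | CPU_FEAT_AVX)) == 0);
	return (feat & required) == required;
}
```

A caller would pass the result of cpu_detect_features() along with the
features its build was compiled for, and could additionally inspect
individual bits when choosing tunables.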

Enhancing the CPU detection function is relatively easy.  However, it
may be cumbersome to have multiple higher-level implementations of a
crypto primitive or a JtR format linked in at the same time.  Right now,
BF_std.c is compiled for just one BF_X2 setting at a time, and BF_fmt.c
interfaces with just one build of BF_std.c at a time.
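
To make the "multiple implementations linked in at the same time" idea
concrete, here is a minimal sketch of run-time dispatch through a
function pointer.  Everything in it is hypothetical: the bf_variant_*
functions are trivial stand-ins that merely report their interleave
factor, whereas real variants would be separate builds of the bcrypt
code with distinct symbol names, which is where the cumbersome part
comes in.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for one implementation built three times, once
 * per interleave factor.  Each just returns how many instances it would
 * process per call.
 */
#define DEFINE_VARIANT(name, n) \
	static int name(void) { return (n); }

DEFINE_VARIANT(bf_variant_1x, 1)
DEFINE_VARIANT(bf_variant_2x, 2)
DEFINE_VARIANT(bf_variant_3x, 3)

typedef int (*bf_variant_fn)(void);

/*
 * Pick a variant at run time; returns NULL for an unsupported factor so
 * the caller can fall back or report an error.
 */
static bf_variant_fn bf_select_variant(int interleave)
{
	switch (interleave) {
	case 1: return bf_variant_1x;
	case 2: return bf_variant_2x;
	case 3: return bf_variant_3x;
	default: return NULL;
	}
}
```

The format code would then call through the selected pointer instead of
binding to a single build at link time.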

For now, I am getting this into the core tree:

/*
 * 3x (as opposed to 2x) interleaving provides substantial speedup on Core 2
 * CPUs, as well as slight speedup on some other CPUs.  Unfortunately, it
 * results in lower cumulative performance with multiple concurrent threads or
 * processes on some newer SMT-capable CPUs.  While this has nothing to do with
 * AVX per se, building for AVX implies we do not intend to run on a Core 2
 * (which has at most SSE4.1), so checking for AVX here provides an easy way to
 * avoid this performance regression in AVX-enabled builds.  In multi-binary
 * packages with runtime fallbacks, the AVX-enabled binary would invoke a
 * non-AVX fallback binary from its john.c if run e.g. on a Core 2.  We could
 * check for SSE4.2 rather than AVX here, as SSE4.2 was introduced along with
 * SMT-capable Nehalem microarchitecture CPUs, but apparently those CPUs did
 * not yet exhibit the performance regression with 3x interleaving.  Besides,
 * some newer CPUs capable of SSE4.2 but not AVX happen to lack SMT, so will
 * likely benefit from the 3x interleaving with no adverse effects for the
 * multi-threaded case.
 */
#ifdef __AVX__
#define BF_X2				1
#else
#define BF_X2				3
#endif

On a related note, we might want to rename BF_X2 - e.g., call it BF_X
and have it accept values 1, 2, or 3.  That could be cleaner.
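
One possible shape for such a rename, purely as a sketch: derive BF_X
from an existing BF_X2 setting so that older configurations keep
working.  The mapping below assumes the legacy encoding is BF_X2 = 0
for no interleaving, 3 for 3x, and any other nonzero value for 2x; that
encoding, the default of 3 used when BF_X2 is undefined, and the
bf_interleave() helper are assumptions made for this example.

```c
#include <assert.h>

/* Default used only for this sketch when no legacy setting is given. */
#ifndef BF_X2
#define BF_X2				3
#endif

/* Derive the proposed BF_X (1, 2, or 3) from the legacy BF_X2 value. */
#ifndef BF_X
#if BF_X2 == 3
#define BF_X				3
#elif BF_X2
#define BF_X				2
#else
#define BF_X				1
#endif
#endif

#if BF_X < 1 || BF_X > 3
#error "BF_X must be 1, 2, or 3"
#endif

/* Hypothetical helper exposing the interleave factor to C code. */
static int bf_interleave(void)
{
	return BF_X;
}
```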

Alexander