Date: Tue, 9 Oct 2012 01:06:24 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Password hashing at scale (for Internet companies with millions of users) - YaC 2012 slides

On Mon, Oct 08, 2012 at 11:54:16PM +0530, Sayantan Datta wrote:
> If we are doing bcrypt on Xeon Phi, then in order to utilize the 512-bit
> wide SIMDs, I think we must mix at least 16 bcrypt hashes per core at
> instruction level.

Yes, but we might have to limit the usable number of vector elements to 8
(thus 256-bit) in order to fit in 32 KB of L1 data cache.  With full
512-bit vectors, we'll be accessing L2 cache for one half of the vector
elements, which will probably be very slow (entire cache lines will be
getting transferred for each of those L1-missing 32-bit accesses).

> However for GCN GPUs we usually don't have to worry
> about instruction level parallelism (only for GCN architecture, VLIW4 could
> benefit from ILP) because by definition the kernels follow SIMD
> execution.

We do need ILP beyond what's included in one SIMD instruction and VLIW
bundle, especially on GPU.  This is because of pipelining and thus high
instruction latencies.  In fact, from the bcrypt speed numbers on HD
7970 that we've obtained so far, the latencies appear to be on the order
of 10 clock cycles (this would be very high for a CPU).  Unfortunately,
the limited local memory size does not permit us to mix in more
instructions to hide these latencies - hence the poor performance.

> Doesn't this make programming on Xeon Phi harder?  In my
> opinion a GCN GPU with gather-scatter load/store should be the best for the
> programmers.

I think you're wrong.  Sure, programming in OpenCL might be easier than
with explicit intrinsics, especially if in the latter case you have to
explicitly mix multiple instances to provide the ILP - but that's not
inherent to the hardware devices.  We might have OpenCL for Xeon Phi
eventually, too.  As to the performance we obtain, as you're aware
HD 7970 achieves only general-purpose-CPU-like speeds at bcrypt.  Xeon
Phi is likely to allow for much greater performance, as I expect its L1
data cache access latencies to be CPU-like (~1 cycle), not GPU-like
(~10 cycles).  I could be wrong about the latencies, though.

Alexander
