john-dev - Re: [GSoC] John the Ripper support for PHC finalists

Message-ID: <20150406135820.GA12861@openwall.com>
Date: Mon, 6 Apr 2015 16:58:20 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Agnieszka,

Some additional detail below:

On Mon, Apr 06, 2015 at 04:36:54PM +0300, Solar Designer wrote:
> On Mon, Apr 06, 2015 at 02:40:45PM +0200, Agnieszka Bielec wrote:
> > I'm wondering why on --dev=2 opencl using
> > global memory was fast, ~ 150k
> 
> That's because --dev=2 (and =3) is the CPUs.  There's no easy way (nor
> do we want it, most of the time) for OpenCL driver to bypass use of CPU
> caches.  So when you run an OpenCL kernel on CPUs, there is not supposed
> to be a (significant, if any at all) speed difference between local and
> global memory (it's the same memory subsystem anyway, consisting of
> caches and RAM).  Usually, this results in the same CPU instructions,
> possibly with (unimportant) differences in specific memory addresses
> (but the addresses are virtual anyway, and the memory is cacheable
> anyway), in (non-)use of prefetch instructions, and in instruction
> scheduling (in case the compiler has different expectations for
> latencies depending on whether you specified something as being local or
> global).  Chances are that those differences have small or negligible
> effect on performance (and it is unclear in which direction).
> 
> Similarly, when you access global memory on GPUs, some limited in-GPU
> caching may nevertheless go on.  It's just that on GPUs those caches are
> separate from local memory, and GPUs' local memory may only be addressed
> explicitly (so you need to explicitly use it from your OpenCL kernels),
> whereas CPUs don't actually have (explicitly addressable) local memory
> (they only have caches).

BTW, as it relates to the above, Xeon Phi (MIC) is a CPU, not a GPU.
It has caches, but not (explicitly addressable) local memory.  So you'll
likely see the same behavior on it (no speed difference between local
and global memory from OpenCL) once you get your code to use it for
real (as we're aware, when your current OpenCL kernel is run on the MIC
card, it mostly just waits, possibly on some kind of mutex or such).

Here, "no speed difference between local and global memory" does not
mean that all of the device's memory is equally fast (or equally slow)
and that you don't need to care.  You do need to care about locality of
reference, a lot, because caches are a lot faster than RAM.
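
To make the locality point concrete, here is a minimal sketch in plain C
(not OpenCL, and not code from JtR or from any PHC finalist).  The
function name sum_strided and its parameters are made up for
illustration.  It reads every element of an array exactly once and
returns the same sum for any stride, so only the order of the memory
accesses changes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: sum all n elements of a[], visiting them in
 * steps of `stride`.  Every index in [0, n) is read exactly once, so
 * the result does not depend on stride.
 *
 * stride == 1 walks the array sequentially (good locality: consecutive
 * reads share cache lines).  A large stride jumps far ahead on each
 * read, so on an array bigger than the caches most reads are expected
 * to miss.
 */
static uint64_t sum_strided(const uint32_t *a, size_t n, size_t stride)
{
    uint64_t sum = 0;

    assert(stride > 0);
    for (size_t start = 0; start < stride; start++)   /* one pass per residue class */
        for (size_t i = start; i < n; i += stride)    /* indices congruent to start */
            sum += a[i];
    return sum;
}
```

If you time sum_strided(a, n, 1) against, say, sum_strided(a, n, 4096)
on an array assumed to be much larger than the last-level cache, the
sequential walk would normally be expected to finish sooner, even though
both calls do the same arithmetic on the same data.  How large the gap
is depends on the CPU, the cache sizes, and the compiler, so measure
rather than assume.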

There's little you can do about this, though, if the algorithm is fixed
(it is the PHC finalist designer's), unless you come up with another
efficient algorithm to compute the same function (in mathematical sense,
as mapping of a domain to a range).  If you do invent another algorithm
like this, with improved locality of reference, then depending on
whether that algorithm is reasonably usable defensively (is efficient
even on just one {password, salt} input at a time) or only offensively
(is efficient only when invoked on multiple inputs at once, needing that
extra parallelism), you've either demonstrated flexibility of or broken
the proposed password hashing scheme.

Alexander
