john-dev - Re: Parallella: bcrypt

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130517191500.GA15157@openwall.com>
Date: Fri, 17 May 2013 23:15:00 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Hi Katja,

Thank you for bringing this topic up.

On Fri, May 17, 2013 at 06:44:53PM +0200, Katja Malvoni wrote:
> I read the Epiphany documentation and went trough bcrypt implementation.
> And now I am thinking about possible approach. At the first sight, it seems
> that having one bcrypt instance per a core could work. Since each core has
> 32 KB of memory divided into 4 memory banks, memory isn't a problem.

Correct.  We should implement one or maybe two bcrypt instances per
core.  (We might need two to hide instruction latencies yet fully use
the dual-issue capability, although I think that for the current
Epiphany chips one instance will be enough for that.)

Some other data for bcrypt (such as the tiny P-box) and code will also
occupy a portion of the 32 KB local memory.  Overall, though, 32 KB is
more than enough, and much of this memory will likely remain unused even
with the instances of Blowfish fully unrolled (for their 16 rounds).

> What
> worries me is how big impact will have overhead of writing S-boxes in every
> core local memory

bcrypt is slow to compute.  We're talking speeds on the order of 100 per
second per core, or less (depending on bcrypt's cost setting, which is
configurable per hash).  Thus, the overhead and delay needed to transfer
the 4 KB of data to/from Epiphany cores is likely small, even if you do
it sequentially for the 16 or 64 cores and don't start computation until
all of these transfers have completed.  (As a further optimization, you
may implement asynchronous transfers and keep the cores busy computing
on previously transferred data while new transfers are in progress, but
I don't expect this to matter much.  My gut feeling is that it might
provide a speedup of a few percent when bcrypt is used with relatively
low cost settings, and less than that with high cost settings.)

This is assuming that you only implement bcrypt's most costly loop on
Epiphany, leaving the rest for host CPU.  If you implement full bcrypt
on Epiphany, the data transfer sizes would be even smaller (about 100
bytes in each direction per bcrypt computation), and you could keep the
initial values for the S-boxes in each core's local memory (there's
enough of it to hold an extra 4 KB).

> and will those 16 instances running on 1 GHz processors
> be enough to still provide speedup when compared to CPU.

We don't expect the current Epiphany chips to outperform current desktop
CPUs in terms of raw speed at bcrypt.  At best, we'll reach similar
performance, and it will likely take the 64-core Epiphany to do this.

However, our experiments with this may help shape up future generations
of Epiphany, including specifically for symmetric crypto use.  Besides,
future Epiphany chips may have more cores too (such as 1024 "soon" and
4096 eventually).  Also, clusters of Parallella boards may be easier and
cheaper to build (once the boards are easily available in sufficient
quantities) than clusters of traditional computers, and they will almost
certainly be more energy-efficient.

> In worst case it would take 16*512 cycles just to transfer S-boxes.

This may seem like a lot, but for bcrypt it is not.  At 8k cycles,
you'll be able to do this 100k times per second.  Given that you might
only be able to compute about 100 bcrypts per second per core, this
translates to like a 0.1% overhead, which is acceptable.

> What is the overhead in case that S-boxes are hard coded in the device
> code?

What do you mean by this?

> Should I try this approach of one instance per a core

Yes, please.

> And am I missing something in my approach?

You seem to be focusing on the host / Epiphany communication overhead,
but for bcrypt it's likely not where most room for optimization is.
(It would be for other hash types - for faster to compute ones.)

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.