crypt-dev - Re: alternative approach

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BANLkTikVosgo8pmToF8HGs2-pW94622R=w@mail.gmail.com>
Date: Thu, 12 May 2011 10:44:17 -0700
From: David Hulton <0x31337@...il.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

Alexander & Yuri,

On Thu, May 12, 2011 at 2:51 AM, Solar Designer <solar@...nwall.com> wrote:
> Some drawbacks of DES are:
>
> - A lot of people have heard that DES has been "broken", even though it
> hasn't - it's just that its 56-bit key size is insufficient.

True. I think we can probably get around this by using 3DES or doing
some sort of similar "key extension" to get more than the 56-bits out
of it. I think that aside from the small key size it's actually quite
a secure algorithm and has had a much longer time to be peer-reviewed
than most other algorithms.

> - DES has been formally obsoleted, in favor of AES.  (As an excuse, we
> can build upon 3DES and say that we're using that...)
>
> - The key size (and also block size) is in fact inadequate; we'll need
> to use DES such that this does not affect our KDF's properties.
>
> - For the "alternative approach", DES is too software-friendly (but it
> is also extremely hardware-friendly).

True. It's one tradeoff that we make by making the operations
parallelizable, but for an attacker this will always be the case since
they will be able to parallelize even fully sequential algorithms... I
think the best thing that we can do is make it so that the server side
can make use of parallelism as well so we can get away with using much
higher iteration counts and make better use of the server resources
when computing hashes legitimately....

One thought experiment is what if instead of doing the 25 iterations
of DES for crypt() they did 128 and made it parallelizable? This would
probably end up being faster to compute for the server and would also
slow down the attacker at the same time...

>> We're able to do on the order of around 22 billion DES operations/sec
>> on our current M-501 modules (we currently are able to do over 1
>> trillion keys/sec on our 48-board 5U clusters)
>
> That's impressive.
>
> http://picocomputing.com/m_series.html says "Virtex-6 LX FPGA", "Up to
> six M-501 modules per board".  Is the 22 billion figure for one or for
> six FPGAs?

It's for a single M-501. One EX-500 with 6 M-501's is around 22
billion * 6 = 132 billion. Then we put 8 EX-500 backplanes in a
cluster to get the 132 billion * 8 = 1.056 trillion.

> Yuri - the numbers I posted in here before were for DES-based crypt(3),
> which includes 25 iterations of modified-DES.  So you need to multiply
> them by 25 to compare against those posted by David above.  Thus, if
> Core i7-2600 does 20.5M DES-based crypts per second (and it does that
> here), this means that it does 512M modified-DES encryptions per second.
> For plain DES (not modified), it would do approx. 550M per second in a
> similar loop (no key setup in the loop, just iterated DES encryption).
> More efficient parallelization over the multiple cores could provide
> another 10% speedup (my 20.5M figure is for thread-safe code and OpenMP,
> which has overhead).
>
> That's still 35-40 times slower than the 22 billion mentioned by David,
> which is good.

I have a feeling that by just adapting our current DES design (which
is really just a modified version of the fully pipelined DES core
provided by asics.ws/opencores.org) we should be able to get somewhere
around 2-4 billion on the E-101...

> This makes sense.  As a bonus, since bitslice DES is also CPU-friendly,
> we could try to design our KDF such that we treat the FPGA and the CPU
> almost the same (at high level).
>
> However, if 10x (or even 40x?) is possible in this way, this leads me to
> think that we could achieve 100x+ with a truly CPU-unfriendly component
> for the FPGA.

That's true. We would have to stray a bit far from the certified
algorithms but we could definitely put in some CPU/GPU roadblocks...

> Disclaimer: I am not into GPU programming (yet).  This is why I use
> words such as "apparently" below.
>
> Yes, DES is somewhat unfriendly to GPUs so far, but this is somewhat
> likely to change.  Bitslice implementations of DES with the S-box
> expressions published so far need up to approx. 20 registers, which is
> apparently more than what typical GPUs have available per processing
> element.  Also, apparently accessing a temporary variable spilled to
> memory is relatively slow (compared to accessing the L1 cache in a CPU).
>
> One expected change is that we (Openwall) are going to release more
> GPU-friendly DES S-box expressions in about a month from now.  These
> match the instruction set of high-end ATI/AMD GPUs (as well as AMD XOP
> and PPC AltiVec, which is why they'd get into JtR this soon), and they
> will need fewer registers.
>
> Another change that may or may not happen is GPUs getting more registers
> or/and faster access to temporary variables.  This sounds quite
> realistic to me, but I am not an expert in this area.  Based on what I
> read (about the implementation difficulty for DES on GPUs), I wouldn't
> be surprised if a mere 2x increase in register count results in a 10x
> speedup for DES on GPUs.

Yeah, this is true and we'll need to look into this a bit more to
actually get a feel for how fast GPUs are at this right now and where
they'll be in the near future...

> This makes sense.  Wouldn't ASIC provide similar advantage for
> Blowfish-like constructs, though?

Yeah it would and it would be good for us to do a speed comparison
with a fully pipelined version of blowfish because I think we'll see a
lot better performance if it's implemented for speed.

> This makes sense.  I think the degree of parallelism would need to be
> far greater, though.  For the CPU, it will need to be at least 128
> (since we'll do bitslice).
>
> But I also need to post my thoughts on us using a sequential memory-hard
> algorithm as defined in the scrypt paper.  Maybe that is what will
> occupy the CPU.

I agree.. For something like this it may be worth doing larger S-Boxes
(even of the size that would take up a full or half BlockRAM) so they
can't be bitsliced efficiently to force the CPU/GPU to go to
cache/memory. The only issue with this is making the algorithm work
efficiently and securely with a fewer number of s-boxes because
BlockRAM resources are somewhat limited (the LX45 has 116 18Kbit
BlockRAMs)...

-David
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.