Date: Thu, 11 Sep 2014 12:57:57 +0400
From: Solar Designer <>
Cc: Steve Thomas <>
Subject: Re: nVidia Maxwell support (especially descrypt)?

On Wed, Sep 10, 2014 at 11:02:17PM -0800, Royce Williams wrote:
> Because of the $150 price and relatively low power requirements (~60W,
> no extra power connector needed) of the new nvidia Maxwell (GTX 750
> and 750 Ti) cards, and my own interest in descrypt, I'm interested in
> seeing JtR take advantage of Maxwell if feasible.
> Steve Thomas did a Passwords14 presentation on bitslice DES with
> LOP3.LUT (23 min video):

Yes, I watched a video of Steve's talk a while ago.  He brought the
topic up fine, but other than that I was disappointed: he had no results
yet at least as of the time of this talk.  Steve's best average gate
count per S-box by using LOP3.LUT (36.25 gates) was still worse than
Roman's result from 2011 for plain bitselect(), which is 32.875 gates:

I don't understand why Steve chose to start with Matthew's S-box
expressions from 1998 rather than with Roman's from 2011.  And indeed,
better results are possible with custom algorithms targeting LOP3.LUT
"gates" directly, which is a point Steve made in the talk too.

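To make the comparison between bitselect() and LOP3.LUT concrete, here is a minimal software model in C.  It is only a sketch: it assumes the usual truth-table convention in which the 8-bit immediate is obtained by evaluating the desired function on the constants 0xF0, 0xCC and 0xAA, and the names lop3_model, bitselect_model and LUT_BITSELECT are made up for this illustration (they are not taken from JtR or from Steve's talk).  The point it shows is that bitselect() is just one of the 256 three-input functions such a "gate" can compute.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: truth table for bitselect(a, b, c), derived below. */
#define LUT_BITSELECT 0xD8

/*
 * Software model of a 3-input lookup-table gate.  At every bit position
 * the bits of a, b and c form the index (a << 2) | (b << 1) | c into the
 * 8-bit truth table lut.
 */
static uint32_t lop3_model(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (int i = 0; i < 8; i++) {
        if (lut & (1u << i)) {
            /* Mask of bit positions whose (a, b, c) bits equal index i */
            uint32_t ma = (i & 4) ? a : ~a;
            uint32_t mb = (i & 2) ? b : ~b;
            uint32_t mc = (i & 1) ? c : ~c;
            r |= ma & mb & mc;
        }
    }
    return r;
}

/* OpenCL-style bitselect(): take the bit from a where c is 0, from b where c is 1. */
static uint32_t bitselect_model(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Check that the derived truth table reproduces bitselect() on sample inputs. */
static void check_bitselect_as_lop3(void)
{
    const uint32_t a = 0xDEADBEEFu, b = 0x01234567u, c = 0xFF00F0F0u;

    /* Evaluating the function on 0xF0, 0xCC, 0xAA yields its truth table. */
    assert((bitselect_model(0xF0, 0xCC, 0xAA) & 0xFF) == LUT_BITSELECT);
    assert(lop3_model(a, b, c, LUT_BITSELECT) == bitselect_model(a, b, c));
}
```

Since every bitselect() maps onto one such gate, S-box expressions written for bitselect() give an upper bound on what a LOP3.LUT-targeted search could achieve.
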
For now, I think someone with a Maxwell GPU should try building and
benchmarking our descrypt-opencl on it with the S-boxes that use

GPUs are weird, though, especially when we exceed the instruction caches -
or maybe we're weird when we do so.  It's not all about the "gate
count" alone, by far.  IIRC, in our benchmarks on AMD GPUs, which do
support bitselect(), we often had better speeds with S-box expressions
that do not use bitselect(), but only use the MMX-like "non-standard"
gates (which was another target set of gates in Roman's work from
2010-2011), even though they obviously have higher average gate count.
This probably means that there's much room for optimization of our
descrypt-opencl code to match actual GPUs better, but my point is that
(especially when the rest of this thing isn't exactly optimal for a
particular GPU) the local maximum for overall performance isn't
necessarily achieved at the highest speed with which individual S-boxes
may be computed, and the latter is also not a monotonic function of
gate count.
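
As a rough illustration of why expressions without bitselect() "obviously" need more gates, the C sketch below writes the same select operation three ways.  The function names are hypothetical, and this is not one of Roman's S-box expressions, just the single building block: what is one gate with bitselect() takes at least two two-input gates without it, since a function of three inputs cannot be computed by one two-input gate.

```c
#include <assert.h>
#include <stdint.h>

/* Reference definition: bit from a where c is 0, bit from b where c is 1.
 * With bitselect() available this whole function counts as one gate. */
static uint32_t mux_reference(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Two gates using only XOR and AND. */
static uint32_t mux_xor_and(uint32_t a, uint32_t b, uint32_t c)
{
    return a ^ ((a ^ b) & c);
}

/* Three gates when AND-NOT (as in MMX's PANDN), AND and OR are used. */
static uint32_t mux_andn_or(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t keep_a = ~c & a;   /* AND-NOT */
    uint32_t take_b = b & c;    /* AND */
    return keep_a | take_b;     /* OR */
}

/* The constants 0xF0, 0xCC, 0xAA cover all 8 input combinations at once,
 * so agreement on them means agreement on every input. */
static int mux_variants_agree(void)
{
    uint32_t ref = mux_reference(0xF0, 0xCC, 0xAA);
    return mux_xor_and(0xF0, 0xCC, 0xAA) == ref &&
           mux_andn_or(0xF0, 0xCC, 0xAA) == ref;
}
```

Which variant actually runs faster on a given GPU is a separate question from these counts, which is the point made above.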

