Date: Thu, 11 Sep 2014 12:57:57 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Steve Thomas <steve@...tu.com>
Subject: Re: nVidia Maxwell support (especially descrypt)?

On Wed, Sep 10, 2014 at 11:02:17PM -0800, Royce Williams wrote:
> Because of the $150 price and relatively low power requirements (~60W,
> no extra power connector needed) of the new nvidia Maxwell (GTX 750
> and 750 Ti) cards, and my own interest in descrypt, I'm interested in
> seeing JtR take advantage of Maxwell if feasible.
>
> Steve Thomas did a Passwords14 presentation on bitslice DES with
> LOP3.LUT (23 min video):
>
> http://www.irongeek.com/i.php?page=videos/passwordscon2014/bitslice-des-with-lop3lut-steve-thomas

Yes, I watched a video of Steve's talk a while ago. He brought the
topic up fine, but other than that I was disappointed: he had no
results yet, at least as of the time of this talk. Steve's best average
gate count per S-box using LOP3.LUT (36.25 gates) was still worse than
Roman's result from 2011 for plain bitselect(), which is 32.875 gates:

http://www.openwall.com/lists/announce/2011/06/22/1

I don't understand why Steve chose to start with Matthew's S-box
expressions from 1998 rather than with Roman's from 2011. And indeed,
better results are possible with custom algorithms targeting LOP3.LUT
"gates" directly, which is a point Steve made in the talk too.

For now, I think someone with a Maxwell GPU should try building and
benchmarking our descrypt-opencl on it with the S-boxes that use
bitselect().

GPUs are weird, though, especially when we exceed the instruction
caches - or maybe we're weird when we do so. It's not all about the
"gate count" alone, by far.
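To make the two "gate" types discussed above concrete, here is a minimal Python sketch. It is an illustration only, not code from JtR or from Steve's talk. It models bitselect() as OpenCL defines it (each result bit comes from b where the matching bit of c is set, and from a otherwise) and models a three-input lookup-table gate using the truth-table convention NVIDIA documents for lop3 (the 8-bit table is the function applied to the constants 0xF0, 0xCC, 0xAA). The helper names lop3 and lut_for and the 32-bit word width are assumptions made for this example.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit lanes; bitslice code works on whole words


def bitselect(a, b, c):
    """Per bit: take b where c is 1, take a where c is 0 (OpenCL semantics)."""
    return ((a & ~c) | (b & c)) & MASK


def lop3(a, b, c, lut):
    """Model of a 3-input LUT gate: bit (4*a_bit + 2*b_bit + c_bit) of the
    8-bit table 'lut' gives the output for that input combination."""
    result = 0
    for idx in range(8):
        if (lut >> idx) & 1:
            ta = a if idx & 4 else ~a
            tb = b if idx & 2 else ~b
            tc = c if idx & 1 else ~c
            result |= ta & tb & tc
    return result & MASK


def lut_for(f):
    """Derive the 8-bit table for any 3-input bitwise function f."""
    return f(0xF0, 0xCC, 0xAA) & 0xFF


# bitselect() is just one of the 256 possible tables.
BITSELECT_LUT = lut_for(bitselect)  # 0xD8

# A function that needs two ordinary gates still fits in one table.
XOR_OR_LUT = lut_for(lambda a, b, c: (a ^ b) | c)  # 0xBE

if __name__ == "__main__":
    a, b, c = 0x12345678, 0x9ABCDEF0, 0x0F0F0F0F
    print(hex(BITSELECT_LUT), lop3(a, b, c, BITSELECT_LUT) == bitselect(a, b, c))
```

In this model bitselect() is a single special case of the lookup-table gate, so a gate count obtained for one gate set does not translate directly into a count for the other, and neither count by itself predicts speed on a given GPU.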
IIRC, in our benchmarks on AMD GPUs, which do support bitselect(), we
often had better speeds with S-box expressions that do not use
bitselect(), but only use the MMX-like "non-standard" gates (which was
another target set of gates in Roman's work from 2010-2011), even
though they obviously have a higher average gate count.

This probably means that there's much room for optimization of our
descrypt-opencl code to match actual GPUs better, but my point is that
(especially when the rest of this thing isn't exactly optimal for a
particular GPU) the local maximum for overall performance isn't
necessarily achieved at the highest speed with which individual S-boxes
may be computed, and the latter is also not a monotonic function of
gate count.

Alexander