Date: Sat, 10 Oct 2015 21:00:25 +0800
From: John Doe <deeplearningjohndoe@...il.com>
To: john-users@...ts.openwall.com
Cc: Roman Rusakov <rusakovster@...il.com>, Solar Designer <solar@...nwall.com>
Subject: Re: nVidia Maxwell support (especially descrypt)?

Hello,

Thanks for trying out my work.

Just a quick recap: there are two ways of approaching the bitslice DES
crypt(3) algorithm in Meriken's (the original author's)
merikens-tripcode-engine
<https://github.com/meriken2ch/merikens-tripcode-engine> (MTE) and in my
fork of MTE
<https://github.com/DeepLearningJohnDoe/merikens-tripcode-engine/tree/PRV>.

The first is to use shared memory for communication between work-items,
distributing the S-boxes across them, and to aim for higher occupancy.
The second is to keep everything in registers, from the 56-bit keys to
the 64-bit data blocks (bitsliced, of course), and to hope that
instruction interleaving is enough to hide instruction latency at the
resulting low occupancy.
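
Since this whole thread is about LOP3.LUT S-boxes, here is a minimal,
self-contained CUDA sketch of the primitive both approaches build their
S-box circuits from (my own naming, not code from MTE; Maxwell only,
e.g. nvcc -arch=sm_52). Every gate of a bitsliced S-box circuit becomes
one 3-input LUT instruction whose truth table is an 8-bit immediate:

#include <cstdio>

/* Convention: input a contributes 0xF0 to the immediate, b 0xCC, c 0xAA.
   The immediate must be a compile-time constant, hence a macro. */
#define LUT3(r, a, b, c, imm) \
    asm("lop3.b32 %0, %1, %2, %3, " #imm ";" \
        : "=r"(r) : "r"(a), "r"(b), "r"(c))

__global__ void demo(unsigned *out, unsigned a, unsigned b, unsigned c)
{
    unsigned r;
    LUT3(r, a, b, c, 0x6A); /* 0x6A = (0xF0 & 0xCC) ^ 0xAA, i.e. (a & b) ^ c */
    *out = r;
}

int main()
{
    unsigned *d, h, a = 0xFFFF0000u, b = 0xFF00FF00u, c = 0xF0F0F0F0u;
    cudaMalloc(&d, sizeof h);
    demo<<<1, 1>>>(d, a, b, c);
    cudaMemcpy(&h, d, sizeof h, cudaMemcpyDeviceToHost);
    printf("%08x (expect %08x)\n", h, (a & b) ^ c);
    cudaFree(d);
    return 0;
}

This is why the LOP3 gate counts for the S-boxes come out so much lower
than the plain AND/OR/XOR/NOT counts: an expression like (a & b) ^ c is
one instruction instead of two.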

Apparently you'll need one kernel for each of the 4096 salt values if
you're going to use the second method, as registers can't be dynamically
indexed. My latest implementation of the second method (on the PRV
branch of my code), which has already been integrated into Meriken's
MTE, performs at 967 MH/s (with my S-boxes) on a +250 MHz reference 980
Ti and, if memory serves me right, is within 7% of the card's
theoretical peak performance.
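
To illustrate why per-salt kernels pay off (a sketch in my own naming,
not MTE code): descrypt's 12 salt bits each swap a pair of E-expansion
outputs, bits i and i + 24. With the salt as a template parameter, each
of the 4096 instantiations resolves the swaps at compile time, so the
bitsliced state never needs a dynamically indexed array (which would
spill out of registers):

template <unsigned SALT>
__device__ __forceinline__ void apply_salt(unsigned e[48])
{
    #pragma unroll
    for (int i = 0; i < 12; i++)
        if (SALT & (1u << i)) { /* compile-time constant after unrolling */
            unsigned t = e[i];
            e[i]      = e[i + 24];
            e[i + 24] = t;
        }
}

In a real kernel you'd fold the swap into which register feeds each
S-box input rather than moving data around, but the specialization trick
is the same.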

Only one kernel in total is needed for the first method, and I assume
that is the one you're more interested in. My naive ~300 MH/s
implementation of the first method (on the master branch of my code) has
already been obsoleted by Meriken's later updates. His implementation
<https://github.com/meriken2ch/merikens-tripcode-engine/blob/master/MerikensTripcodeEngine/Source%20Files/CUDA10_SharedMemory.cu>
performs at 826 MH/s (with my S-boxes) on a +250 MHz reference 980 Ti.
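
As I understand it, the structure of that method is roughly the
following (a skeleton with a stub S-box, emphatically not Meriken's
actual kernel): a group of 4 threads shares one bitsliced DES block,
each thread evaluates 2 of the 8 S-boxes per round in its registers, and
the 32 bitsliced output words are exchanged through shared memory:

/* Assume blockDim.x == 4: one 4-thread group per bitsliced block. */
__device__ __forceinline__ void sbox_stub(const unsigned in[6],
                                          unsigned out[4])
{
    /* Stand-in for one of the 8 real LOP3-optimized S-box circuits. */
    for (int i = 0; i < 4; i++)
        out[i] = in[i] ^ in[i + 1] ^ in[5 - i];
}

__global__ void round_demo(const unsigned *r_in, unsigned *f_out)
{
    __shared__ unsigned f[32]; /* bitsliced f(R,K), shared by the group */
    int lane = threadIdx.x;    /* 0..3: which pair of S-boxes we own */

    for (int s = 2 * lane; s < 2 * lane + 2; s++) {
        unsigned in[6], out[4];
        for (int i = 0; i < 6; i++) /* E expansion and key XOR omitted */
            in[i] = r_in[blockIdx.x * 32 + (4 * s + i) % 32];
        sbox_stub(in, out);
        for (int i = 0; i < 4; i++)
            f[4 * s + i] = out[i];
    }
    __syncthreads();           /* publish all 32 words to the group */

    /* A real kernel would XOR f[] into L and loop over the rounds;
       here each thread just copies its share back out. */
    for (int i = lane; i < 32; i += 4)
        f_out[blockIdx.x * 32 + i] = f[i];
}

Per thread, this roughly quarters the register pressure compared to the
all-registers method, at the cost of a shared-memory round trip and a
synchronization every round.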

As I last touched my code more than two months ago, a lot of the
(presumably important) details have already started to elude me, so I'm
afraid I can't offer additional insights. However, I learned a great
deal from Meriken's MTE code
<https://github.com/meriken2ch/merikens-tripcode-engine> and from the
NVIDIA devtalk forum thread
<https://devtalk.nvidia.com/default/topic/860120/bitslice-des-optimization/>
I started. I developed the code to get acclimated to CUDA so that I
could start doing deep learning work in it, and my workstation now runs
24/7 on deep learning projects. So, as things stand, my bitslice DES
code is no longer maintained.

I suggest you get in touch with Meriken (you can find his email address
in most of the files in his MTE GitHub project, like this one
<https://github.com/meriken2ch/merikens-tripcode-engine/blob/master/MerikensTripcodeEngine/Source%20Files/Main.cpp>)
to see if he can shed some light on the issue. He actively updates MTE.
Although MTE is written as a companion to a piece of Japanese software,
Meriken is very fluent in English.

I wish you all good luck with this.

Regards,
DeepLearningJohnDoe

On Thu, Oct 8, 2015 at 5:37 AM, Solar Designer <solar@...nwall.com> wrote:

> DeepLearningJohnDoe - thank you for your work in this area, and we'd
> appreciate any comments you might have on the below.
>
> On Wed, Oct 07, 2015 at 06:54:20PM +0200, magnum wrote:
> > >On Wed, Oct 7, 2015 at 8:44 AM, Solar Designer <solar@...nwall.com>
> wrote:
> > >>And of course we'll also need to include some LOP3.LUT S-boxes.
> > >>If Roman's are still unreleased (except for S4), then Janet's.
> [...]
> > I implemented this in 9c82bcc, using DeepLearningJohnDoe's (a.k.a.
> > Janet's) S-boxes except for S4.
>
> Are you getting better speeds with Roman's S4?
>
> > The boost appears to be on the order of 10% for LM and 20% for DES.
>
> Confirmed, on Titan X against the same 10 descrypt hashes (10 different
> salts) as yesterday:
>
> 0g 0:00:03:10 2.04% (ETA: 02:51:06) 0g/s 22303Kp/s 226641Kc/s 226641KC/s
> GPU:67C util:100% fan:26% aacxytna..aacxytna
>
> This is now roughly the same speed as Tahiti.  Titan X has got to be
> better than that.  Maybe that split of the S-box "lookups" across 4
> work-items is key to better performance (more work done per register
> consumed).  Sayantan, please look into that.
>
> I'd run on many more salts to reduce the key setup overhead, but then
> the kernel build time becomes large and distorts the reported c/s
> figures too much for quick runs like this.  Maybe we need to reset the
> timer to zero once the kernels are built, and/or maybe we need to add
> computation and reporting of instantaneous speeds (not just the
> all-time averages).
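>
> For instance (the helper names below are made up, not john's actual
> internals), instantaneous speed is just a difference of the counters
> between two status samples:
>
> #include <stdint.h>
> #include <stdio.h>
> #include <time.h>
>
> extern uint64_t total_crypts_done(void); /* stand-in for john's counter */
>
> static double wall_clock_seconds(void)
> {
>     struct timespec ts;
>     clock_gettime(CLOCK_MONOTONIC, &ts);
>     return ts.tv_sec + ts.tv_nsec * 1e-9;
> }
>
> static void report_instantaneous(void)
> {
>     static uint64_t prev_crypts;
>     static double prev_sec;
>     double sec = wall_clock_seconds();
>     uint64_t crypts = total_crypts_done();
>
>     if (prev_sec && sec > prev_sec)
>         printf("%.0f c/s (instantaneous)\n",
>             (double)(crypts - prev_crypts) / (sec - prev_sec));
>     prev_crypts = crypts;
>     prev_sec = sec;
> }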
>
> > Is there any special place to look for more of Roman's work?
>
> No.  We need to ask Roman.  I sort of just did, by CC'ing him.
>
> BTW, our current opencl_sboxes.h defaults to using nonstd.c derived
> expressions when !HAVE_LUT3.  Maybe it should also have an option for
> using sboxes-s.c derived expressions, which are supposed to be faster on
> AMD GPUs.
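>
> Sketching the shape of it (only HAVE_LUT3 exists today; the other
> macro name is hypothetical):
>
> #if HAVE_LUT3
> /* LOP3.LUT-based expressions */
> #elif USE_SBOXES_S /* hypothetical new option */
> /* sboxes-s.c derived expressions, faster on AMD GPUs */
> #else
> /* nonstd.c derived expressions, the current default */
> #endif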
>
> > BTW, we now also use LOP3.LUT for many MD4, MD5 and SHA-2 OpenCL
> > formats.  A driver bug prevented me from using it in SHA-1 with
> > nvidia 352.39 (the code is there, just disabled), and md5crypt
> > disables it because of a performance regression (still to be
> > investigated).  Some formats show a fine boost, but none as much as
> > DEScrypt.
>
> ... with our guess as to why the boost is lower being that LOP3.LUT
> was often used anyway, introduced during the PTX-to-ISA translation.
>
> Thank you all for working on this!
>
> Alexander
>
