john-users - Re: nVidia Maxwell support (especially descrypt)?

Date: Wed, 7 Oct 2015 06:14:33 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Roman Rusakov <rusakovster@...il.com>
Subject: Re: nVidia Maxwell support (especially descrypt)?

On Tue, Oct 06, 2015 at 07:42:24PM +0200, magnum wrote:
> Have you seen this work by Janet Yellen? I can't recall it mentioned here.

I had not seen it.  Cool stuff!

> https://github.com/DeepLearningJohnDoe/merikens-tripcode-engine/tree/master
> https://devtalk.nvidia.com/default/topic/860120/cuda-programming-and-performance/bitslice-des-optimization/post/4622827/#4622827
> 
> "Gate counts: 25 24 25 18 25 24 24 23 (avg. 23.5)
>  Depth: 8 7 7 6 8 10 10 8 (avg. 8)"

Roman's S4 posted here is 1 gate shorter (17 vs. 18):

http://www.openwall.com/lists/john-users/2014/09/18/2

> "With this version, I get a performance of 950 MH/s for UNIX DES 
> crypt(3) (or equivalently 23750 MH/s for 1 round of DES) on my reference 
> Gigabyte GTX 980 Ti (+270 MHz). Considering hashcat's implementation 
> gets 165.5 MH/s on a GTX Titan X (+225 MHz), it's a great improvement. 
> Even my naive implementation bounded by shared memory/synchronization 
> with old SBOXes from JtR is faster (300 MH/s on 980 Ti +300 MHz)."

Also from that NVIDIA forum:

"As for your Nvidia/AMD comparison, I am currently getting 800MH/s on my
7990 with OpenCL and rewriting my implementation with a GCN assembler.
We will see how that goes :)"

These are very good speeds indeed.  The "naive implementation ... with
old SBOXes from JtR" mentioned there was somehow using the
bitselect-lacking S-boxes (I just took a look at the code on GitHub), so
would likely run even faster with the bitselect-enabled ones.  OTOH, it
is very interesting how they appear to split (even in that naive
implementation) one DES computation across 4 "threads" (aka work-items),
for 4 different pairs of S-box lookups.
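
For readers unfamiliar with the distinction, here is a minimal sketch in plain C of what a "bitselect" gate computes. The function names are made up for illustration and are not taken from JtR or from the code on GitHub. OpenCL's bitselect(a, b, c) picks each result bit from b where c has a 1 and from a where c has a 0. Without such an instruction the same function costs about three ordinary gates, which is why S-box expressions written for bitselect come out shorter:

```c
#include <stdint.h>

/* Portable equivalent of OpenCL's bitselect(a, b, c): for each bit
 * position, take the bit from b where c is 1 and from a where c is 0.
 * On hardware with a native select instruction this is a single gate. */
static inline uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* The same function written with XOR/AND only (three gates), as it would
 * have to appear in S-box expressions that avoid bitselect. */
static inline uint32_t select_xor_form(uint32_t a, uint32_t b, uint32_t c)
{
    return a ^ ((a ^ b) & c);
}
```

In a bitslice implementation each of the 32 bit positions belongs to a different candidate, so one such gate processes 32 DES computations at once.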

During this summer's CMIYC contest, I was getting ~235M c/s per Tahiti
GPU at 1 GHz (in 7990s), using Sayantan's code in JtR and his
instructions for pre-building per-salt kernels (which takes about an
hour).  (Some less lucky Catalyst versions on other machines gave
slightly lower speeds on Tahiti.)
I think this is ~2x faster than hashcat, but clearly (as seen from the
speeds above) it is possible to do better yet.

BTW, we should post such instructions somewhere public: include them in
a file under doc/ and/or put them on the wiki.  Also, I think Sayantan
has changed the code greatly since then.

Running the latest code against 10 descrypt hashes on Titan X (stock
clocks) with -mask='?l?l?l?l?l?l?l?l', I get:

0g 0:00:02:31 1.27% (ETA: 09:15:40) 0g/s 17601Kp/s 181316Kc/s 181316KC/s GPU:66C util:100% fan:24% aayuspia..aayuspia

(BTW, the range of candidates looks weird here.  I think it's the
initial "aa" that is being iterated, but the status line doesn't show
this, making it appear as though the start and end of the range are the
same.)

On Tahiti (1050 MHz with Catalyst 15.7 here), it quickly gets to higher
speeds:

0g 0:00:02:19 1.53% (ETA: 08:34:44) 0g/s 22864Kp/s 233304Kc/s 233304KC/s aaiemika..aaiemika

I've just tried the latest code, and it pre-builds kernels by default,
apparently before the start of cracking.  Is it possible to disable this
behavior, or are we only supporting per-salt kernels now?  Is it still
possible to lazy-build during cracking?  Oh, and the current code
appears to build kernels for some salts (the self-test ones?) multiple
times - a minor bug?

And it still doesn't appear to allow easy use of multiple CPU cores for
the kernel building.  I think it should be made --fork friendly for
that - I think this is easy to implement, but it's a topic for john-dev.
I found there's PARALLEL_BUILD in opencl_DES_hst_dev_shared.h now, which
enables use of OpenMP, but enabling it made the compiler crash with a
weird error (when targeting Titan X), so maybe something isn't MT-safe.
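
To make the --fork idea concrete, here is a rough standalone sketch in C. It is not JtR code: the names build_fn, build_stub, salts_for_worker and build_all_forked are hypothetical, and it assumes that each child process can build the kernels for its share of the 4096 descrypt salts independently (for example, leaving the binaries in an on-disk cache). Salts are dealt out round-robin:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DESCRYPT_SALTS 4096 /* descrypt uses a 12-bit salt */

/* Hypothetical callback: build the kernel for one salt, 0 on success. */
typedef int (*build_fn)(int salt);

/* Stand-in for a real per-salt kernel build; does nothing. */
static int build_stub(int salt)
{
    (void)salt;
    return 0;
}

/* Number of salts assigned to `worker` under round-robin partitioning. */
static int salts_for_worker(int worker, int nworkers)
{
    int n = 0;
    for (int salt = worker; salt < DESCRYPT_SALTS; salt += nworkers)
        n++;
    return n;
}

/* Fork up to nworkers children; child w builds salts w, w + nworkers, ...
 * Returns how many children exited with status 0. */
static int build_all_forked(int nworkers, build_fn build)
{
    int started = 0, ok = 0;

    for (int w = 0; w < nworkers; w++) {
        pid_t pid = fork();
        if (pid < 0)
            break; /* could not fork; wait for those already started */
        if (pid == 0) {
            int failed = 0;
            for (int salt = w; salt < DESCRYPT_SALTS; salt += nworkers)
                if (build(salt) != 0)
                    failed = 1;
            _exit(failed); /* child never returns to the caller */
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Calling build_all_forked(4, build_stub) would start four children that each handle 1024 salts.  Since each child is a separate process, this avoids relying on the compiler being MT-safe.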

Then, I don't understand why it pre-builds by default when HARDCODE_SALT
is not enabled by default:

#define OVERRIDE_AUTO_CONFIG    0
#define HARDCODE_SALT           0
#define FULL_UNROLL             0
#define PARALLEL_BUILD          0

IIRC, previously HARDCODE_SALT was the setting to enable this mode.

We really need documentation for this, even if it's still in
development.  And then we could proceed to discuss how to revise it to
make it more usable.  And of course we'll also need to include some
LOP3.LUT S-boxes.  If Roman's are still unreleased (except for S4), then
we can use Janet's.
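
As background on why LOP3.LUT matters for the gate counts: on Maxwell this instruction evaluates any 3-input Boolean function in one operation, selected by an 8-bit truth table.  Below is a small C emulation to illustrate the semantics.  The name lop3_emul is made up, and this is only a sketch of the instruction's behavior, not code from any of the implementations discussed.  The truth table for a function f is conventionally obtained as f(0xF0, 0xCC, 0xAA):

```c
#include <stdint.h>

/* Emulate a 3-input LUT operation on 32 bit lanes at once.  For each bit
 * position, the input bits (a, b, c) form an index a*4 + b*2 + c, and the
 * result bit is bit number `index` of the 8-bit truth table `lut`. */
static uint32_t lop3_emul(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;

    for (int i = 0; i < 8; i++) {
        if ((lut >> i) & 1) {
            /* Mask of bit positions where (a, b, c) equals index i */
            uint32_t ta = (i & 4) ? a : ~a;
            uint32_t tb = (i & 2) ? b : ~b;
            uint32_t tc = (i & 1) ? c : ~c;
            r |= ta & tb & tc;
        }
    }
    return r;
}

/* Truth tables derived by applying the function to 0xF0, 0xCC, 0xAA */
enum {
    LUT_BITSELECT = (0xF0 & ~0xAA) | (0xCC & 0xAA), /* (a & ~c) | (b & c) */
    LUT_XOR3      = 0xF0 ^ 0xCC ^ 0xAA              /* a ^ b ^ c */
};
```

With this, lop3_emul(a, b, c, LUT_BITSELECT) matches bitselect, and a three-way XOR that would otherwise take two gates also becomes one, which is where S-boxes designed for LOP3.LUT get their lower gate counts.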

Sayantan?

Alexander