Date: Tue, 1 Feb 2011 13:59:30 +0200
From: Milen Rangelov <>
Subject: Re: FreeBSD crypt() / MD5-crypt implementation question


I have experience with OpenCL, mostly on ATI cards. Yes, the compiler is very
good at optimizing code, yet there are of course many things that can be
done to help it do better. Some of them are documented in the
corresponding ATI/Nvidia OpenCL guides, some are not. For example, for some
reason the rotate() function on ATI generates fast bitalign code, while a
leftrotate macro (with shifts/or) does not. Some code generates slow-path
accesses and other code does not, and that is not documented well. Also,
sometimes quite weird changes produce much better code and you're left
without an explanation for it (OK, the dumped ISA code might give you some
insight).
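
To make the rotate() point concrete, here is a minimal sketch in plain C of the two ways of writing a 32-bit left rotation. The names LEFTROTATE and rotl32 are only illustrative and do not come from any particular cracker. In an OpenCL kernel the second form would simply be a call to the built-in rotate(x, n); which of the two a given compiler turns into a single rotate/bitalign instruction is something to check in the dumped ISA, not something this sketch can promise.

```c
#include <stdint.h>

/* Shift/or rotation, the "leftrotate macro" style.
 * Only valid for 0 < n < 32: with n == 0 the right shift would be by 32,
 * which is undefined behaviour in C. */
#define LEFTROTATE(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Function form with the shift counts masked, so n == 0 is also safe.
 * In OpenCL C the equivalent is the built-in rotate(x, n). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

Both forms compute the same value for in-range shift counts, so swapping one for the other in a kernel changes only the generated code, not the hash.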

OTOH I had many issues trying to run the same code on Nvidia; that's why I
split the kernels in two: an AMD version and an Nvidia one. Sometimes the
Nvidia compiler even crashed during clBuildProgram() because of some
construct that is perfectly legal in ATI's implementation. Basically, though,
you can write code that runs on both platforms; however, it won't perform
well. There are tweaks that need to be done for ATI and for Nvidia, related
mostly to global memory accesses and vectorization. Most of them are
documented in the

On your platform BarsWF would probably be faster, because it does the hash
reversal trick and skips up to the 48th step. However, unlike oclHashcat, it
is not capable of doing multi-hash and mask attacks. IMHO, oclHashcat is the
best cracker currently available, as it offers a very good tradeoff between
performance and functionality. There might be faster single-hash crackers,
but they can't offer such a rich set of features.
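
As a rough illustration of why the hash reversal trick works (this is not BarsWF's actual code, and the helper names are made up for the example): each MD5 step is invertible when the message word it consumes is known. So for a single target hash, the final steps can be undone once on the host, and each candidate only needs to be computed up to the point where the two meet. A minimal sketch of one round-4 step and its inverse in C:

```c
#include <stdint.h>

/* Rotations with masked shift counts (hypothetical helper names). */
static inline uint32_t md5_rotl(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

static inline uint32_t md5_rotr(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

/* MD5 round-4 boolean function I(b, c, d). */
static inline uint32_t md5_i(uint32_t b, uint32_t c, uint32_t d)
{
    return c ^ (b | ~d);
}

/* Forward step: a' = b + rotl(a + I(b,c,d) + m + k, s). */
static uint32_t md5_step_fwd(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                             uint32_t m, uint32_t k, unsigned s)
{
    return b + md5_rotl(a + md5_i(b, c, d) + m + k, s);
}

/* Inverse step: recovers the old a from a', given the same b, c, d and a
 * known message word m. All arithmetic is modulo 2^32. */
static uint32_t md5_step_rev(uint32_t a_new, uint32_t b, uint32_t c,
                             uint32_t d, uint32_t m, uint32_t k, unsigned s)
{
    return md5_rotr(a_new - b, s) - md5_i(b, c, d) - m - k;
}
```

The inverse depends on the message word m, which is why the trick applies to a single target hash with the relevant words fixed, and why it does not carry over directly to multi-hash attacks.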

On Tue, Feb 1, 2011 at 1:27 PM, Freddie Witherden <> wrote:

> That's interesting.  Do you have any experience with the higher-level
> languages/compilers (CUDA C/OpenCL) and how they perform?  I ask as x86
> compilers are generally quite good at spotting and optimising bit
> manipulations (endian swapping macro => bswap; "two shifts, one or" =>
> ror).  It would indeed be nice if a single OpenCL kernel could take care
> of current and future AMD/Nvidia hardware without needing to hand-tune
> code for different ISAs.
> I've looked at a few CUDA MD5 implementations (although not MD5 crypt,
> just raw MD5) with the performance on my 295 GTX varying from ~100
> Mhash/s ("CUDA MD5", Mario Juric, GPL v2) to ~600 Mhash/s ("oclHashcat",
> blob).  "MD5 Crack GPU", LGPL v3, will do ~400 Mhash/s and I am yet to
> benchmark BarsWF.
> Polemically yours, Freddie.
