Date: Tue, 12 Apr 2011 00:11:14 +0200
From: Łukasz Odzioba <>
Subject: Re: sha256 format patches

Of course I will upload the pictures, results, and patch to the wiki.

>Yes, but you're not hitting the PCIe bandwidth limit yet, although with
>synchronous transfers some time is in fact wasted until the transfer is
>complete and before actual processing starts.

On my GPU it is in fact 5%, but it is easy to buy a 10x faster GPU, so I
am aware that the PCIe transfer matters. Of course I could send less data
over the bus, but for now that is not the main optimization target (it
will be). In fact, I need to check how much time it takes to build the
512-bit SHA input block from a "string". Moving that step to the GPU may
decrease the "GPU idle" time. It should also give nice main memory savings.
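
To make the "string to 512-bit input" step concrete, here is a minimal
host-side sketch in C of what that packing involves for SHA-256. The
function name sha256_pack_block and the single-block limit (keys of at
most 55 bytes) are assumptions for illustration, not code from the patch.
The same logic could be moved into a kernel so that only the raw key
bytes cross the bus.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the patch): pack a short key into one
 * 512-bit SHA-256 block, stored as 16 big-endian 32-bit words.
 * Returns 0 on success, -1 if the key does not fit in a single block. */
int sha256_pack_block(const char *key, size_t len, uint32_t w[16])
{
    unsigned char buf[64];
    uint64_t bits;
    size_t i;

    if (len > 55)               /* 1 byte of 0x80 + 8 bytes of length */
        return -1;

    memset(buf, 0, sizeof(buf));
    memcpy(buf, key, len);
    buf[len] = 0x80;            /* the mandatory "1" bit after the message */

    /* Message length in bits, big-endian, in the last 8 bytes.
     * For len <= 55 only the two lowest bytes can be non-zero. */
    bits = (uint64_t)len * 8;
    buf[62] = (unsigned char)((bits >> 8) & 0xff);
    buf[63] = (unsigned char)(bits & 0xff);

    /* SHA-256 reads the block as big-endian 32-bit words. */
    for (i = 0; i < 16; i++)
        w[i] = ((uint32_t)buf[4 * i] << 24) |
               ((uint32_t)buf[4 * i + 1] << 16) |
               ((uint32_t)buf[4 * i + 2] << 8) |
               (uint32_t)buf[4 * i + 3];
    return 0;
}
```

With this layout a key costs 64 bytes per candidate on the bus, while the
key itself is usually much shorter, which is where the transfer and
memory savings would come from.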

>Why not do it in init() and keep the allocation until the "john" process terminates?
I did it, it worked well, and it gives the mentioned +8% boost. I wasn't
sure that leaving cleanup to the OS is a good idea, so it's not included in
the patch, but it can be done with a 4-line modification. I didn't implement
async copy for the same reason. It's not a big deal so will be done before

>Is this just because you haven't implemented unrolling for the slow
>hashes case yet?
Yes, I compared slow vs. fast yesterday but did the unrolling on fast
SHA today. It is easy, so I will post results next time.

>Now how about implementing SHA-crypt?  You'll also need to implement SHA-512 for that, which is trickier (64-bit integers).
I'll try and see what I can do. CUDA offers 32-bit and 24-bit integer
operations. The trick is that 24-bit operations are almost 8 times faster
(but NVIDIA says this might change in the future), so maybe it is worth
implementing 64-bit operations on both types and comparing efficiency.
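
As a rough sketch of the 32-bit variant, the two 64-bit primitives that
SHA-512 leans on most (addition and rotate right) can be built from pairs
of 32-bit words as below. The type u64pair and the function names are
made up for this example. Note that SHA-512 itself uses only additions,
rotations, shifts, and bitwise logic, so whether the faster 24-bit
operations help at all would need to be measured.

```c
#include <stdint.h>

/* Hypothetical representation: one 64-bit value as two 32-bit halves. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} u64pair;

/* 64-bit addition: add the low words, then propagate the carry. */
u64pair add64(u64pair a, u64pair b)
{
    u64pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* 1 if the low word wrapped */
    return r;
}

/* 64-bit rotate right by n bits (n is taken modulo 64). */
u64pair rotr64(u64pair a, unsigned n)
{
    u64pair r;

    n &= 63;
    if (n >= 32) {              /* rotating by 32 just swaps the halves */
        uint32_t t = a.hi;
        a.hi = a.lo;
        a.lo = t;
        n -= 32;
    }
    if (n == 0)                 /* avoid an undefined shift by 32 below */
        return a;

    r.lo = (a.lo >> n) | (a.hi << (32 - n));
    r.hi = (a.hi >> n) | (a.lo << (32 - n));
    return r;
}
```

On the host, both functions can be checked against native uint64_t
arithmetic before moving them into a kernel.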

>That "certain change" I had mentioned was switching to 64-bit partial
>hashes.  So 4 times less data to transfer from the GPU.  In my hack of
>the code, I did not deal with potential false positives in any way, but
>a proper implementation will need to do it, likely by having cmp_exact()
>invoke an on-CPU implementation (it doesn't need to be fast).  Then even
>32-bit partial hashes would work, for further speedup and memory savings.

I do not understand what a partial hash is and how it affects speed.
Could you please give me more details on how to do it?
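
One possible reading of the quoted description, sketched in C under
stated assumptions: the GPU returns only the first few 32-bit words of
each SHA-256 result, the host compares those against the target, and only
on a match is the full hash recomputed on the CPU and compared. The names
cmp_partial and cmp_full are hypothetical and only loosely mirror the
cmp_exact() idea mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FULL_WORDS 8   /* a SHA-256 digest is 8 x 32-bit words */

/* Cheap first pass: compare only the leading partial_words words that
 * came back from the GPU. A match here can be a false positive. */
int cmp_partial(const uint32_t *partial, const uint32_t *target,
                size_t partial_words)
{
    return memcmp(partial, target, partial_words * sizeof(uint32_t)) == 0;
}

/* Slow confirmation: compare a full digest (assumed to be recomputed on
 * the CPU for this one candidate) against the full target. */
int cmp_full(const uint32_t full[FULL_WORDS],
             const uint32_t target[FULL_WORDS])
{
    return memcmp(full, target, FULL_WORDS * sizeof(uint32_t)) == 0;
}
```

With partial_words = 2 (64 bits) each result is 8 bytes instead of 32,
which matches the "4 times less data" in the quote. With partial_words = 1
false positives become more likely, so the full check matters more.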

Thanks for the benchmark. There is still work to do in optimizing this
code. I think the results can be improved by finding optimal
threads/blocks/registers settings. As I mentioned, it is important to
develop some self-configuration script to get maximum occupancy on
every card, which I will do soon.
I'll try to get access to a faster GPU before I am able to buy a Fermi-based card.
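
The self-configuration idea could start as simply as timing a few
candidate settings and keeping the fastest. Below is a minimal sketch in
C. The bench_fn callback, pick_block_size, and fake_bench are all
hypothetical names, and fake_bench is only a stand-in cost model; a real
version would launch the kernel and time it.

```c
#include <stddef.h>

/* Hypothetical callback: run the kernel with the given number of threads
 * per block and return the measured time in seconds (lower is better). */
typedef double (*bench_fn)(int threads_per_block);

/* Try every candidate (n must be at least 1) and return the fastest. */
int pick_block_size(const int *candidates, size_t n, bench_fn bench)
{
    int best = candidates[0];
    double best_time = bench(candidates[0]);
    size_t i;

    for (i = 1; i < n; i++) {
        double t = bench(candidates[i]);
        if (t < best_time) {
            best_time = t;
            best = candidates[i];
        }
    }
    return best;
}

/* Stand-in for a real measurement, for demonstration only: pretends
 * that 128 threads per block is optimal on this imaginary card. */
double fake_bench(int threads_per_block)
{
    int d = threads_per_block - 128;
    return 1.0 + (d < 0 ? -d : d) / 1000.0;
}
```

The same loop could be nested to cover the block count as well, at the
cost of a longer startup benchmark.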

Thanks for all.
