Date: Tue, 12 Apr 2011 03:16:56 +0400
From: Solar Designer <>
Subject: Re: sha256 format patches

On Tue, Apr 12, 2011 at 12:11:14AM +0200, Łukasz Odzioba wrote:
> On my GPU it is in fact 5%, but it is easy to buy a 10x faster GPU, so I
> am aware that PCIe transfer matters.

Right, but your current code will likely bump into another bottleneck
first.  I think it already does on my system.

> >Now how about implementing SHA-crypt?  You'll also need to implement SHA-512 for that, which is trickier (64-bit integers).

> I'll try and see what I can do. CUDA offers 32-bit and 24-bit integers.
> The trick is that 24-bit operations are almost 8 times faster (but
> NVIDIA claims it might change in the future)

I think that you're confused.  I did a web search on this, and only
found it mentioned in integer multiplication context, which we don't use
in these hashes.
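
For illustration, here is a minimal C sketch of the SHA-256 round functions as
specified in FIPS 180-4.  The function names are illustrative and not taken
from the patch.  It shows that the building blocks are only logic operations,
shifts and rotates (the rounds then combine them with 32-bit additions), so a
faster integer multiply would not help here:

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (valid for 0 < n < 32). */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* "Choose": each bit of x selects the matching bit of y (1) or z (0). */
static inline uint32_t Ch(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

/* "Majority": each result bit is the majority vote of x, y and z. */
static inline uint32_t Maj(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

/* The two "big sigma" functions used in every compression round. */
static inline uint32_t big_sigma0(uint32_t x)
{
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

static inline uint32_t big_sigma1(uint32_t x)
{
    return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
}

/* The two "small sigma" functions used in the message schedule. */
static inline uint32_t small_sigma0(uint32_t x)
{
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

static inline uint32_t small_sigma1(uint32_t x)
{
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}
```

No multiplication appears anywhere in these functions.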

> so maybe it is worth
> implementing 64-bit operations on both types and comparing efficiency.

I guess there are two ways to implement SHA-512 on GPUs that only
support up to 32-bit integers:

1. Use pairs of 32-bit integers and handle carry on addition manually
(or are there special instructions for that?)

2. Use a bitslice implementation.

But I haven't really looked into this.  I am leaving it for you.
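
As a rough sketch of option 1, assuming nothing beyond plain C: a 64-bit value
is kept as a pair of 32-bit halves, the carry out of the low half is detected
by checking for unsigned wrap-around, and a 64-bit rotate is built from 32-bit
shifts.  The type and function names (u64pair, add64, rotr64) are hypothetical
and chosen for this example only; this is not tuned GPU code.

```c
#include <stdint.h>

/* A 64-bit value stored as two 32-bit halves (hypothetical helper type). */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} u64pair;

static inline u64pair from_u64(uint64_t v)
{
    u64pair r = { (uint32_t)(v >> 32), (uint32_t)v };
    return r;
}

static inline uint64_t to_u64(u64pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* 64-bit addition modulo 2^64 using only 32-bit arithmetic.
 * If the low-half sum wrapped around, it is smaller than either operand,
 * so (r.lo < a.lo) yields the carry bit to add into the high half. */
static inline u64pair add64(u64pair a, u64pair b)
{
    u64pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);
    return r;
}

/* 64-bit rotate right by n bits, for 0 < n < 64.
 * Rotating by 32 is a swap of the halves; the remainder (0..31) is done
 * with paired 32-bit shifts that move bits across the half boundary. */
static inline u64pair rotr64(u64pair x, unsigned n)
{
    u64pair r;
    if (n >= 32) {
        uint32_t t = x.hi;
        x.hi = x.lo;
        x.lo = t;
        n -= 32;
    }
    if (n == 0)
        return x;
    r.lo = (x.lo >> n) | (x.hi << (32 - n));
    r.hi = (x.hi >> n) | (x.lo << (32 - n));
    return r;
}
```

SHA-512 uses rotate counts both below and above 32 (for example 1, 8, 14, 28,
34, 39 and 41), which is why the sketch handles both ranges.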

> I do not understand what a partial hash is and how it affects speed.
> Could you please give me more details on how to do it?

I've attached a patch (hack) demonstrating this.
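
The idea can be sketched in plain C as follows.  This is an illustration only,
not the code from the patch: the helper names (cmp_partial, cmp_full,
result_bytes) are hypothetical, and the assumption is that the point of
keeping fewer words per candidate is to shrink the result buffer that has to
be copied back from the GPU.  Only the first two 32-bit words of each computed
hash are kept and compared; a candidate that passes this cheap filter still
has to be confirmed against the full 256-bit hash.

```c
#include <stddef.h>
#include <stdint.h>

#define FULL_WORDS    8  /* a SHA-256 digest is 8 x 32 bits */
#define PARTIAL_WORDS 2  /* words kept per candidate, as in the patch */

/* Cheap filter: compare only the first PARTIAL_WORDS words of the target
 * hash against the partial result returned for one candidate. */
static int cmp_partial(const uint32_t *binary, const uint32_t *partial)
{
    int i;
    for (i = 0; i < PARTIAL_WORDS; i++)
        if (binary[i] != partial[i])
            return 0;
    return 1;
}

/* Exact check, needed only for the rare candidates that pass the filter. */
static int cmp_full(const uint32_t *binary, const uint32_t *full)
{
    int i;
    for (i = 0; i < FULL_WORDS; i++)
        if (binary[i] != full[i])
            return 0;
    return 1;
}

/* Size in bytes of the result buffer for a given number of candidates. */
static size_t result_bytes(size_t keys, size_t words_per_key)
{
    return keys * words_per_key * sizeof(uint32_t);
}
```

With THREADS 128 and BLOCKS 16384 as in the patch, there are 2097152
candidates per call, so result_bytes() gives 64 MiB of results with 8 words
per candidate versus 16 MiB with 2 words.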

> Thanks for the benchmark.  There is still work to do in optimizing this
> code. I think the results can be improved by finding optimal
> threads/blocks/registers settings. As I mentioned, it is important to
> develop some self-configuration script to get maximum occupancy on
> every card, which I will do soon.

I think there must be many other changes to make.  I've just experimented
with hashcat tools on the same machine, and here are some numbers
(checks per second with one hash loaded for cracking):

SHA-256: 52M
MD5: 556M
MD5: 405M

As you can see, much higher speeds are possible, although
oclHashcat-lite is limited to cracking just one hash at a time, which
enables it to partially reverse the hash loaded for cracking (not an
optimization we should consider now).

Anyway, I'd like to see your patch for a slow hash, and benchmark that.



diff -urp john-1.7.6-sha256cuda-1/src/cuda/ john-1.7.6-sha256cuda-1mod/src/cuda/
--- john-1.7.6-sha256cuda-1/src/cuda/	2011-04-11 19:37:33.000000000 +0000
+++ john-1.7.6-sha256cuda-1mod/src/cuda/	2011-04-11 21:42:30.000000000 +0000
@@ -90,7 +90,9 @@ __global__ void kernel_sha256(sha256_pas
diff -urp john-1.7.6-sha256cuda-1/src/Makefile john-1.7.6-sha256cuda-1mod/src/Makefile
--- john-1.7.6-sha256cuda-1/src/Makefile	2011-04-11 23:03:42.000000000 +0000
+++ john-1.7.6-sha256cuda-1mod/src/Makefile	2011-04-11 21:38:41.000000000 +0000
@@ -21,7 +21,8 @@ OMPFLAGS =
 #OMPFLAGS = -xopenmp
 NVCCFLAGS = -c --maxrregcount 14 -Xptxas -v
 CUDAPATH = /usr/local/cuda/lib
-CUDA64PATH = /usr/local/cuda/lib64
+#CUDA64PATH = /usr/local/cuda/lib64
+CUDA64PATH = /usr/lib64/cuda32
 CFLAGS = -c -Wall -O2 -fomit-frame-pointer  $(OMPFLAGS)
diff -urp john-1.7.6-sha256cuda-1/src/sha256cuda_fmt.c john-1.7.6-sha256cuda-1mod/src/sha256cuda_fmt.c
--- john-1.7.6-sha256cuda-1/src/sha256cuda_fmt.c	2011-04-11 19:40:51.000000000 +0000
+++ john-1.7.6-sha256cuda-1mod/src/sha256cuda_fmt.c	2011-04-11 21:43:07.000000000 +0000
@@ -167,7 +167,7 @@ static int cmp_all(void *binary,int coun
 static int cmp_one(void *binary,int index){
   int i;
   uint32_t *t=(uint32_t *)binary;
-  for(i=0;i<8;i++)
+  for(i=0;i<2;i++)
       return 0;
   return 1;
diff -urp john-1.7.6-sha256cuda-1/src/sha256.h john-1.7.6-sha256cuda-1mod/src/sha256.h
--- john-1.7.6-sha256cuda-1/src/sha256.h	2011-04-11 19:08:31.000000000 +0000
+++ john-1.7.6-sha256cuda-1mod/src/sha256.h	2011-04-11 21:47:05.000000000 +0000
@@ -16,7 +16,8 @@
 //#define KEYS_PER_CRYPT 65536			///256 blocks * 256 threads todo - auto detection
 #define THREADS 128
-#define BLOCKS 4096
+//#define BLOCKS 4096
+#define BLOCKS 16384
 typedef struct{
@@ -24,7 +25,7 @@ typedef struct{
 typedef struct{
-  uint32_t v[8]; 				///256bit
+  uint32_t v[2]; 				///64-bit
 const uint32_t host_h[]={
