Date: Mon, 28 Mar 2011 13:41:45 +0300
From: Milen Rangelov <gat3way@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: GSOC - GPU for hashes

Hello,

> Debugging CL code is a real pain: forget useful printf, forget an
> easy-to-use IDE. When everything is syntactically correct and you
> still don't get the right result, good luck my friend - it's time to
> get pen and paper.

With ATI, there is a cl_amd_printf extension that allows you to use
printf() in kernels. Some format specifiers are not allowed, but
generally it does a good job... OK, unless you put it in a branch -
then it does not work correctly due to a bug in AMD's OpenCL
implementation.

> In this way, the first check to see whether 1/5 of the hash matches is
> done very fast; in case it matches, I get the remaining 4 blocks and
> compare them, and if everything is OK, the password has been cracked.
>
> This is done because of the slow data transfer between GPU and CPU: as
> Milen pointed out, this is the real pain you will be working on.

That's good :)

> As you can see, there's a difference in speed depending on whether you
> "upload" or "download" data on the GPU; you need to take that into
> account when you code. You also need to understand that different video
> cards will have different data transfer rates. This one is for an
> HD6970, which I found at about 200 euro and which replaced my "old"
> 5750, whose "download" data rate was lower than its upload rate.

I've done benchmarks with various AMD and Nvidia hardware. It turns out
that clEnqueueMapBuffer()/clEnqueueUnmapMemObject() is fastest when
smaller amounts of memory are transferred, while
clEnqueueReadBuffer()/clEnqueueWriteBuffer() are better for larger
buffers when CL_MEM_USE_HOST_PTR is used.

> Even with this monster I can "feel" something is wrong, because the
> video card is not stressed enough; compared to the times I run pyrit
> (which segfaults because of heat), with john running I can watch movies
> with no problem.
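The partial-match scheme quoted above can be sketched in plain C. This is only an illustration, not code from john or hashkill; it assumes a 20-byte digest split into five 32-bit words, per the "1/5 of the hash ... remaining 4 blocks" description:

```c
#include <stdint.h>
#include <string.h>

/* Early-reject comparison: check only the first of the five 32-bit
   digest words first; fetch and compare the remaining four words only
   on the rare partial match.  On the GPU side this keeps expensive
   device-to-host transfers down to ~one per 2^32 candidates. */
static int hash_matches(const uint8_t target[20], const uint8_t computed[20])
{
    if (memcmp(target, computed, 4) != 0)
        return 0;                          /* cheap reject, almost always taken */
    return memcmp(target + 4, computed + 4, 16) == 0;  /* rare full check */
}
```

The false-pass rate of the cheap check is about 2^-32, so the slow full comparison almost never runs.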
Best utilization is achieved when:

* NDRange (aka global work size) is large enough - at least 15000-20000
  workitems (probably even more to effectively "hide" memory latencies,
  depending on how much ALU work is being performed in the kernel)
* local work size is divisible by 64
* the FetchUnitBusy value is as low as possible compared to ALUUnitBusy
  (use the AMD Stream Kernel Analyzer for that purpose - a very useful
  profiling tool). That means fewer __global memory accesses and more
  ALU work
* host-device transfers are kept to a minimum

In addition, I found out that creating two threads with their own
contexts and queues on a single GPU device works faster by more than 10%
on "fast" algos, due to the fact that the GPU is better utilized then
(less time lost between kernel invocations?).

P.S. - you may have a look at my (ATI) md5 bruteforce kernel:

http://hashkill.svn.sourceforge.net/viewvc/hashkill/src/kernels/amd_md5_long.cl?revision=109&content-type=text%2Fplain

The code is very ugly and you would probably not understand why I am
doing lots of things without looking at the host code, but it
demonstrates several things:

1) Single-hash vs. multi-hash optimizations based on a preprocessor
define passed when building the kernel (-DSINGLE_MODE).

2) 5xxx/6xxx vs. 4xxx-optimized code (again a preprocessor define -
OLD_ATI). 4xxx GPUs generally work better with 4-component vectors as
opposed to 8-component ones, dunno why.

3) Bitmap checks for multi-hash cases, in order to avoid lots of slow
host-device transfers.

4) Candidate generation based on a lookup table in global memory (ugh)
plus a couple of __private function arguments. Overall it's not bad, but
it has to be improved.

5) MD5 algorithm reversal (single-hash case) up to step 43 (about 20 MD5
steps skipped).

6) Using amd_bytealign() seems rather illogical, but I do this because I
patch the compiled binary kernel on the fly, replacing BYTEALIGN_INT
with BFI_INT.
This brings a nice ~20% performance improvement on 5xxx/6xxx cards, as
BFI_INT is a single instruction that does the round 1 and round 2
transformations (F and G).

7) An additional optimization in case plaintexts are less than 8 bytes
in size (MAX8).

8) As for the "DOUBLE" thing - in my program that corresponds to a
command-line option. In effect, this increases speeds a bit, as global
memory reads are cut in half and more work is done in the kernel, so the
ratio of "kernel execution time" to "transfer time" is lower. OTOH,
memory usage rises 2x. You may just ignore it, as is done in the 4xxx
code path (OLD_ATI).
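To see why the BFI_INT patch in (6) pays off, here is the identity it exploits, sketched in plain C (the helper names are illustrative): AMD's BFI_INT computes the bitwise select (src0 & src1) | (~src0 & src2), and both the MD5 round 1 function F and round 2 function G are instances of that select, so each collapses from three ALU operations to one instruction:

```c
#include <stdint.h>

/* Bitwise select as computed by AMD's BFI_INT instruction:
   for each bit, pick from s1 where s0 is 1, from s2 where s0 is 0. */
static uint32_t bfi(uint32_t s0, uint32_t s1, uint32_t s2)
{
    return (s1 & s0) | (s2 & ~s0);
}

/* MD5 round 1 function: F(x,y,z) = (x & y) | (~x & z) = bfi(x, y, z) */
static uint32_t md5_F(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

/* MD5 round 2 function: G(x,y,z) = (x & z) | (y & ~z) = bfi(z, x, y) */
static uint32_t md5_G(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) | (y & ~z);
}
```

Since OpenCL (at the time) exposed no intrinsic for BFI_INT, the trick is to emit amd_bytealign() calls and patch the instruction opcode in the compiled binary.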