Date: Sun, 26 Aug 2012 23:51:43 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

On Mon, Aug 27, 2012 at 12:44:27AM +0530, Sayantan Datta wrote:
> So you are saying that I should do this portion of the code on cpu.

No. I wrote "you could do the expansion on CPU side, but the reason why
you won't ..." This means that you "could", but you "should not".

> I think
> I could do it in the DES_bs_init() and get rid of KSp all together(Right?).

No. The expansion may only be performed for a specific set of keys,
whereas DES_bs_init() is called just once per program invocation.

However, you may fill an array of pointers (for use on CPU) or indices
(for use on GPU) just once, in DES_bs_init() or even in a standalone
program (then use a static initializer for a const array in your OpenCL
kernel). In fact, the latter approach might be best, since it makes the
const nature of the indices array very explicit.

> There is another bigger problem than the previous one. I see that address
> stored in E.E array is aliased with the B array. Transferring data to
> gpu would change the location of B array and result in same kind of
> problem as above.

You will likely similarly want E to be shared between work-items (to
save local memory), which means that you'll similarly need to fill it
with indices rather than with pointers. However, it won't be fully
constant: it'll change between crypt_all() calls. Perhaps it'd be best
to compute this E filled with indices (not pointers) on CPU and transfer
it to GPU on every crypt_all() call.

> The most complex part being the initialization of E.E
> array in the set_salt() function. I could port both DES_bs_init() and
> set_salt() to GPU(as different kernel) but it would then result in same
> problem as I would need to transfer data to and from GPU in order
> to synchronize it with the CPU in between the kernel calls. Whenever I
> transfer data from CPU to GPU the address of B will change.

I don't understand you here. You seem to be imagining some problems
that you won't even run into for other reasons anyway.

In the current OpenMP code, E is separate for each thread slot because
it stores pointers, as well as to be friendlier to CPU caches. On GPU,
you'll want to store indices rather than pointers and you'll want E to
be shared between the work-items - or at least that's what I'd expect to
run faster.

(A next step would be to customize OpenCL kernels for each of the 4096
salts - maybe by patching binary code. This will let us get rid of E.
On CPU, an equivalent change would result in a 7% speedup per my
experiments, but on GPU it might make a lot more of a difference.)

> One more thing I would like to know that immediately after set_salt()
> function which function is called?

Usually crypt_all() is called next, but there may be some special cases.
For example, when there's just one salt value seen among the loaded
hashes, then set_salt() may be called just once - followed by multiple
calls of crypt_all(), set_key() ..., then crypt_all() again.

You could in fact embed set_salt() inside crypt_all() - only doing the
salt setting if the salt has changed since what was previously set.
However, this will result in duplicate work being done (the salt
setting) by each work-item, if you're storing indices rather than
pointers into E anyway. If you choose to be storing pointers into E,
then this merging of the functions is the way to go.

Alexander
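
The indices-instead-of-pointers idea discussed above can be illustrated
with a minimal host-side sketch in C. It is not the actual John the
Ripper code: E_idx, fill_base_E(), set_salt_indices() and expand_bit()
are hypothetical names, and the bitslice vectors are simplified to
uint32_t. It assumes the traditional 12-bit DES crypt(3) salt, where a
set salt bit i swaps outputs i and i+24 of the expansion E. Because the
table holds offsets into B rather than addresses, it stays valid
wherever B ends up after a transfer to the device, and the return value
tells the caller whether the table needs to be re-sent.

```c
#include <stdint.h>

#define E_SIZE 48
#define SALT_BITS 12

/* Hypothetical table: each entry is an index into B (0..31),
 * not a pointer, so it does not depend on where B is located. */
static uint8_t E_idx[E_SIZE];
static int cur_salt = -1;   /* no salt set yet */

/* Standard DES expansion E as 0-based positions in the 32-bit block:
 * 31,0,1,2,3,4, 3,4,5,6,7,8, ... 27,28,29,30,31,0 */
static void fill_base_E(uint8_t *e)
{
    for (int g = 0; g < 8; g++)
        for (int j = 0; j < 6; j++)
            e[6 * g + j] = (uint8_t)((4 * g + j + 31) % 32);
}

/* Rebuild E_idx for the given 12-bit salt.
 * Returns 1 if the table changed (and would need to be transferred
 * to the device), 0 if the salt is the same as the one already set. */
static int set_salt_indices(int salt)
{
    if (salt == cur_salt)
        return 0;

    fill_base_E(E_idx);
    for (int i = 0; i < SALT_BITS; i++) {
        if ((salt >> i) & 1) {
            /* salt bit i swaps E outputs i and i + 24 */
            uint8_t tmp = E_idx[i];
            E_idx[i] = E_idx[i + 24];
            E_idx[i + 24] = tmp;
        }
    }
    cur_salt = salt;
    return 1;
}

/* Look up expanded bit i through the index table; B holds the 32
 * bitslice vectors of the current block (simplified to uint32_t). */
static uint32_t expand_bit(const uint32_t *B, int i)
{
    return B[E_idx[i]];
}
```

A caller would invoke set_salt_indices() from its salt-setting path and
copy the 48-byte table to the device only when it returns 1, so that
the work-items share one read-only table instead of each redoing the
salt setup.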