Date: Thu, 9 Feb 2012 11:18:01 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: cryptmd5 optimizations

Hi Lukas -

On Wed, Feb 08, 2012 at 01:19:36AM +0100, Lukas Odzioba wrote:
> I am trying to optimize opencl and cuda cryptmd5 code,

Thank you for working on this!

> and I got to
> dead end from my perspective. I want to understand what optimizations
> were done in MD5_std.c, but so many #ifdef's are distracting me. From
> Alex i know that MD5_X2 should be off. My question is what else
> "flags" should be on/off (and what they mean) to make code easier to
> understand?

When you say "should", you mean to make the code easier to understand,
right?  If so, it's probably MD5_X2 off, MD5_IMM on, MD5_ASM off.

MD5_X2 means compute two hashes at a time with mixed instructions, for
greater instruction-level parallelism.  You get two inter-mixed
implementations of MD5 (and of the higher-level logic as well) when you
enable this.

MD5_IMM means use immediate values for the 32-bit constants.  When you
disable this, you instead get array lookups.  Immediate values work
better on x86 and x86-64 where instruction size is variable and 32-bit
immediate operands may be encoded right in ALU instructions.  Array
lookups work better on RISC architectures where instruction size is
fixed (usually 32-bit), load instructions are separate from ALU
instructions, and 32-bit immediate operands do not fit.  On such RISC
architectures, it'd take two load instructions to put one 32-bit
immediate value in a register.  With array lookups, we reduce this to
one load instruction (assuming that the array start address is already
loaded in a register and the constant offset to a specific element is
small enough that it fits in the immediate operand field of just one
instruction).

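To make the MD5_IMM distinction concrete, here is a minimal sketch of
the first four MD5 steps written both ways.  This is not the actual
MD5_std.c code: the names sketch_ac, sketch_steps_imm, and
sketch_steps_tab are hypothetical, invented for this illustration.  The
constants and shift counts are the round 1 values from RFC 1321.

```c
#include <assert.h>
#include <stdint.h>

/* MD5 round 1 function F, in the common form d ^ (b & (c ^ d)). */
#define F(b, c, d) ((d) ^ ((b) & ((c) ^ (d))))
#define ROTL32(v, s) (((v) << (s)) | ((v) >> (32 - (s))))

/* One MD5 step: a = b + rotl(a + F(b,c,d) + x + ac, s). */
#define STEP(a, b, c, d, x, ac, s) \
    do { \
        (a) += F((b), (c), (d)) + (x) + (ac); \
        (a) = ROTL32((a), (s)); \
        (a) += (b); \
    } while (0)

/* Hypothetical table holding the first four round 1 constants. */
const uint32_t sketch_ac[4] = {
    0xd76aa478u, 0xe8c7b756u, 0x242070dbu, 0xc1bdceeeu
};

/* Immediate style: each constant is a literal in the step itself. */
void sketch_steps_imm(uint32_t st[4], const uint32_t x[4])
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];

    STEP(a, b, c, d, x[0], 0xd76aa478u, 7);
    STEP(d, a, b, c, x[1], 0xe8c7b756u, 12);
    STEP(c, d, a, b, x[2], 0x242070dbu, 17);
    STEP(b, c, d, a, x[3], 0xc1bdceeeu, 22);

    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}

/* Lookup style: each constant is read from a table through a pointer. */
void sketch_steps_tab(uint32_t st[4], const uint32_t x[4],
                      const uint32_t *ac)
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];

    assert(ac != 0); /* the table base must be valid */

    STEP(a, b, c, d, x[0], ac[0], 7);
    STEP(d, a, b, c, x[1], ac[1], 12);
    STEP(c, d, a, b, x[2], ac[2], 17);
    STEP(b, c, d, a, x[3], ac[3], 22);

    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}
```

Both functions compute the same values; they differ only in where the
constants come from.  An optimizing compiler may turn the table lookups
back into immediates when it can see the table contents, so checking
the generated code is the only way to know which form was emitted.
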
I do not know which of these is more suitable for GPU architectures; my
(limited) understanding is that on one hand you have fixed instruction
size (RISC or VLIW), but on the other you have a very limited amount of
fast memory (and even loads from fast memory might be slower than loads
of immediate operands from the instruction stream, even if you have to
use twice as many load instructions for the latter).  You'll need to
research this and experiment with it.  I won't be surprised if
different approaches are required for different GPUs.

MD5_ASM is obvious - it excludes some C code in favor of assembly code
from a .S file.

Alexander