
Date: Wed, 19 Aug 2015 07:10:43 +0300 From: Solar Designer <solar@...nwall.com> To: johndev@...ts.openwall.com Subject: Re: PHC: Argon2 on GPU On Wed, Aug 19, 2015 at 04:47:32AM +0200, Agnieszka Bielec wrote: > I have argon from github https://github.com/khovratovich/Argon2 from > branch master. should I work now on 'enhcance' branch? You'll need to switch at some point, but since you already did so much work on the original Argon2, I think it makes sense for you to stick with it for a while longer. Regarding optimizations, especially for GPU: I think the modulo division operations are causing a lot of latency: [solar@...er opencl]$ fgrep % argon2*.cl argon2d_kernel.cl: reference_block_offset = (phi % r); argon2i_kernel.cl: uint reference_block_index = addresses[0] % r; argon2i_kernel.cl: uint reference_block_index = addresses[i] % r; On GPU, this means having local and private memory tied up to code that doesn't actually touch that memory but is instead doing this division operation for many cycles in a row. This is extremely wasteful. I think this might explain the unexpectedly poor performance on AMD GCN. (Maybe NVIDIA has relatively low latency integer division hardware.) For Argon2i, you should be able to easily optimize this overhead out, since all the indices are known in advance (they are the same each time, by design, as required to avoid cache timing leaks). You should also be able to optimize out the hashing that produces those indices (before the modulo division), but that's relatively minor (yet by all means make this optimization as well if you do precompute the indices). This means you will need some memory to store those indices in (1536 of them for our current benchmarks? meaning something like 3 KB?), but this memory can be shared between different concurrent hash computations. For Argon2d, optimizing this is not easy, and the speedup potential is lower. Yet there could be ways: Fast Division Algorithm with a Small Lookup Table http://arith.stanford.edu/~hung/papers/asilomar.pdf "This paper presents a new division algorithm, which requires two multiplication operations and a single lookup in a small table. The division algorithm takes two steps. The table lookup and the first multiplication are processed concurrently in the first step, and the second multiplication is executed in the next step. This divider uses a single multiplier and a lookup table with 2/sup m/(2m+1) bits to produce 2 mbit results that are guaranteed correct to one ulp. By using a multiplier and a 12.5 KB lookup table, the basic algorithm generates a 24bit result in two cycles." It might also be practical to adapt methods normally used for floatingpoint to producing precise integer results (there might need to be some trial and error to confirm or come up with precise results, and you'll need to implement this in code too): https://en.wikipedia.org/wiki/Division_algorithm#Fast_division_methods http://stackoverflow.com/questions/12227126/divisionasmultiplyandlutfastfloatdivisionreciprocal Someone with a GPU working on a much simpler subset of the problem: http://stackoverflow.com/questions/2616072/fasterintegerdivisionwhendenominatorisknown (just to illustrate the problem of slow integer division on GPUs). Before you spend a lot of time on this, I suggest that you replace this modulo operation with something simpler (and wrong), yet in some ways similar, e.g.: static inline uint32_t wrap(uint64_t x, uint32_t n) { uint64_t a = (x + n) & (n  1); uint64_t b = x & n; uint64_t c = (x << 1) & n; return ((a << 1) + b + c) >> 2; } (and its OpenCL equivalent, with proper data types). Of course, this revision of Argon2 won't match Argon2's normal test vectors, but you should be able to see roughly what performance you could get if you later optimize the division. Alexander
Powered by blists  more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.