Date: Sun, 21 Jul 2013 19:21:27 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

On Sun, Jul 21, 2013 at 6:01 PM, Solar Designer <solar@...nwall.com> wrote:
> Hi Katja,
>
> On Sun, Jul 21, 2013 at 03:40:30PM +0200, Katja Malvoni wrote:
> > Since Epiphany code became much smaller when I integrated it with JtR, I
> > tried using internal.ldf instead of fast.ldf and code fits in local
> > memory.
>
> How do internal.ldf and fast.ldf differ?

With internal.ldf, all code is placed in local memory; with fast.ldf, user
code and the stack are placed in local memory while the standard libraries
go to external SRAM.

> > Speed is 932 c/s (sometimes it's 934 c/s).
>
> So that's a speedup by more than 10% from 830 c/s that you had before.
> Where does this speedup come from?

It comes from placing the standard libraries in local memory instead of
external SRAM. I only changed that one word in the Makefile to get this
speedup.

> > BF_fmt port was slow because I
> > didn't copy salt in local memory; when I did that, speed was 790 c/s.
> > Then I used internal.ldf and got speed of 932 c/s.
>
> Do you think the speedup is from basing your code on JtR's BF_std now,
> as opposed to the size-optimized revision of the code in musl?  I doubt
> that this alone would make a 10% difference, considering that we had
> already unrolled the Blowfish rounds loop (and tried unrolling some
> others as well) in musl's code.  There must be something else.
> Are you able to use internal.ldf on the musl-derived code?  What speed
> are you getting?

Sorry for not being clear - "Speed is 932 c/s (sometimes it's 934 c/s)" is
for the musl-derived code. Both implementations now have the same speed.

> > If I try to interleave two
> > instances of bcrypt, then the code can't fit in local memory.
>
> You may reduce the amount of unrolling to make it fit.
>
> > At the moment,
> > interleaving two instances of bcrypt doesn't work, it fails self test on
> > get_hash(1).
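For reference, the interleaving idea looks roughly like this in C - a
simplified sketch with a toy Feistel round standing in for the real
BF_ROUND (the subkeys and round function here are placeholders, not actual
Blowfish, and the names are made up for illustration):

```c
#include <stdint.h>

/* Toy subkeys and round function standing in for Blowfish's P-array and
   F-function - placeholders only, not the real bcrypt code. */
static const uint32_t P[16] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
};

static uint32_t f(uint32_t x)
{
    return (x * 2654435761u) ^ (x >> 7);
}

/* One instance, 16 rounds. */
static void encrypt_one(uint32_t *L, uint32_t *R)
{
    int i;
    for (i = 0; i < 16; i += 2) {
        *L ^= P[i];     *R ^= f(*L);
        *R ^= P[i + 1]; *L ^= f(*R);
    }
}

/* Two independent instances with their rounds interleaved: because the
   streams share no data, one stream's loads can overlap with the other
   stream's ALU work, hiding load latency. */
static void encrypt_two(uint32_t *L0, uint32_t *R0,
                        uint32_t *L1, uint32_t *R1)
{
    int i;
    for (i = 0; i < 16; i += 2) {
        *L0 ^= P[i];     *L1 ^= P[i];
        *R0 ^= f(*L0);   *R1 ^= f(*L1);
        *R0 ^= P[i + 1]; *R1 ^= P[i + 1];
        *L0 ^= f(*R0);   *L1 ^= f(*R1);
    }
}

/* Sanity check: interleaving must not change either instance's result. */
int interleave_matches(void)
{
    uint32_t aL = 0x12345678, aR = 0x9abcdef0;
    uint32_t bL = 0xdeadbeef, bR = 0xcafebabe;
    uint32_t cL = aL, cR = aR, dL = bL, dR = bR;

    encrypt_one(&aL, &aR);
    encrypt_one(&bL, &bR);
    encrypt_two(&cL, &cR, &dL, &dR);

    return aL == cL && aR == cR && bL == dL && bR == dR;
}
```

A self-test failure like the one on get_hash(1) often means the two streams
got cross-wired somewhere (e.g. one instance reading the other's state), and
a comparison against the single-instance code like the one above can catch
that early.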
> How are you testing it?  Does it run from off-chip memory in this test?

The bcrypt code is in local memory; the standard library code is in
external memory. For now I'll leave it that way and just focus on getting
it working.

[...]

> 2. Various out-of-inner-loop code inefficiencies, such as the memcpy().
> We may deal with some of those by optimizing the code.

This is not a problem any more. Since the memcpy() code is in local memory
now, it takes 0.007 ms to copy the initial BF_ctx structure.

[...]

> So I don't buy it when you say that BF_ROUND and BF_encrypt are already
> optimal.  Surely they look that way at first, and the compiler does a
> pretty good job - it's just that there ought to be ways to do better.
> Usage of IADD is one of those.  It might also be possible to replace
> some left shift and addition (used for address calculation) with IMADD.
> If all additions performed for address calculation are currently
> embedded in load instructions, then bundling them with shifts instead
> might not help... or it might, in some cases: by changing which
> execution units remain available for other use at which times (IMADD is
> on the FPU, whereas shifts are not, so we're freeing up an execution
> unit for other use by using IMADD).  There's more to explore here than
> what you see at first.

I agree now - I was looking at the whole thing from completely the wrong
perspective. I missed the fact that using IADD doesn't mean it will be
faster than ADD; it means the IALU will be free for other instructions.

Katja
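To illustrate the address-calculation pattern discussed above: each S-box
lookup in the Blowfish F-function compiles down to a shift, a mask, and an
add (base plus scaled index), and the shift+add pairs are the kind of thing
IMADD could potentially bundle. A simplified sketch follows - the S-box
contents are a placeholder fill pattern, not real Blowfish S-boxes, and this
is not JtR's actual BF_ROUND:

```c
#include <stdint.h>

/* Placeholder S-boxes (a simple fill pattern, NOT real Blowfish S-boxes). */
static uint32_t S[4][256];
static int s_ready;

static void s_init(void)
{
    int i, j;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 256; j++)
            S[i][j] = (uint32_t)(i * 256 + j);
    s_ready = 1;
}

/* Blowfish-style F-function.  Each indexing expression below involves a
   shift, a mask, and an add used purely for address calculation; on
   Epiphany, a shift+add pair is a candidate for IMADD, which runs on the
   FPU pipe and so leaves the IALU free for the XORs and adds - unless the
   add is already folded into the load instruction's addressing mode. */
uint32_t bf_f(uint32_t x)
{
    uint32_t a, b, c, d;

    if (!s_ready)
        s_init();

    a = S[0][(x >> 24) & 0xFF];
    b = S[1][(x >> 16) & 0xFF];
    c = S[2][(x >> 8) & 0xFF];
    d = S[3][x & 0xFF];

    return ((a + b) ^ c) + d;
}
```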