Date: Thu, 25 Jul 2013 06:02:52 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Parallella: bcrypt Katja, Yaniv - On Wed, Jul 24, 2013 at 07:28:19PM +0200, Katja Malvoni wrote: > When I moved the same code in function instead of leaving it under #define > it compiled. I don't remember what was I doing wrong but after some changes > it compiled as a macro too. I'm still working on optimizing BF_ROUND, at > the moment I'm using two FPU instructions (IADD and IMADD) and I'm getting > 926 c/s. I'm sorry for being so slow with this optimization, it took me a > lot just to get it working. That's OK. So I am looking at parallella_e_bcrypt.c with the pieces of inline asm, and here are my comments: | #ifdef assembly Thank you for keeping the plain C version in the same source file as well. That's how I like it. | #define BF_ROUND(L, R, N) \ | __asm__ __volatile__ ( \ | "and r44, %1, r25\n" \ | "lsr r23, %1, 0xe\n" \ | "and r23, r23, r21\n" \ | "lsr r24, %1, 0x16\n" \ | "and r24, r24, r21\n" \ | "imadd r44, r44, r46\n" \ | "ldr r23, [r18, +r23]\n" \ | "ldr r24, [r26, +r24]\n" \ | "lsr r22, %1, 6\n" \ | "and r22, r22, r21\n" \ | "iadd r23, r24, r23\n" \ | "ldr r22, [r19, +r22]\n" \ | "ldr r27, [r45], 0x1\n" \ | "ldr r44, [r20, +r44]\n" \ | "eor %0, %0, r27\n" \ | "eor r23, r22, r23\n" \ | "add r23, r23, r44\n" \ | "eor %0, %0, r23\n" \ | : "+r" (R) \ | : "r" (L) \ | ); The code itself mostly looks good to me (including your delayed use of results from IMADD and IADD). Shouldn't you re-order these two, though? - | "eor %0, %0, r27\n" \ | "eor r23, r22, r23\n" \ because r22 is loaded sooner than r27? Well, maybe this makes no difference on the current chip, but it might if load's latency is increased in a future revision of Epiphany. With 2x interleave, once you implement it, you might be able to reasonably replace the "add r23, r23, r44" with IADD, because you'd have some more instructions to put before the EOR. You might need to put two rounds into the macro, though (so that the two instances' rounds are slightly out of sync in order to have more parallelism in each round's end-game). Now, here's an issue/bug in the above: you rely on registers being preserved across multiple pieces of inline asm, but gcc does not guarantee you that. Also, you don't declare which registers you clobber. To fix this, your BF_ROUND should not be the entire __asm__ block, but rather just a portion of the string you put inside such block. The asm block itself, with proper confession on what registers you clobber, should be in the BF_encrypt function. Instead of the asm block before the loop, where you initialize register values, I think you should simply have assignments to local variables in C, and have the compiler allocate registers to those. Then you reference those variables via the "r" (var) notation in the one and only asm block inside the loop (or better yet, you can have the loop in asm too, which is easy to do once you have a macro for one or two rounds in the form of a string). Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.