Date: Tue, 30 Jul 2013 03:38:00 +0400
From: Solar Designer <>
Subject: Re: Parallella: bcrypt

Katja, Yaniv -

On Tue, Jul 30, 2013 at 12:44:09AM +0200, Katja Malvoni wrote:
> I moved to separate assembly file and my code from yesterday worked. I
> implemented whole BF_encrypt2() in assembly.
> There are not enough registers to preload both P arrays so I'm preloading
> only one.

How is that - not enough registers to preload both P arrays?  We've got 64
registers and little demand for them other than for the two P's (18 words
each, so 36 registers in total).

> Speed is 1175 c/s.

Good, but it should be 1200 c/s with both P's preloaded. ;-)

> Code is in
> I left two rounds in one macro - since there needs to be 4 cycles between
> ALU following FPU instruction to have dual issue, with only one round it's
> not possible to have shift right by 22 on ALU and FPU for both instances
> and to use iadd at the end. With two rounds in one macro one shift right by
> 22 and one add are not parallelised for one macro.

I find the above description a bit confusing, but I understand the
general issue.  OK.

Have you tried replacing the right shift by 22 followed by AND with
right shift by 24 followed by IMUL?  (AND is non-free, whereas IMUL is
potentially free.)
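
To illustrate why the two sequences are interchangeable, here is a minimal C
sketch.  It assumes the shift-and-AND pair computes the byte offset of a
32-bit S-box entry selected by the top byte of the input, so the mask would
be 0x3FC; that mask value and the function names are my assumptions for
illustration, not taken from the actual assembly code.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset into a table of 32-bit words, selected by the top byte of x:
 * shift right by 22, then clear the two low bits (shift + AND).
 * The 0x3FC mask is an assumption about what the AND does. */
static uint32_t offset_shift_and(uint32_t x)
{
    return (x >> 22) & 0x3FCu;
}

/* Same offset computed as shift right by 24, then multiply by 4
 * (shift + integer multiply, the IMUL candidate). */
static uint32_t offset_shift_mul(uint32_t x)
{
    return (x >> 24) * 4u;
}

/* Checks both variants for every possible top byte, with the low 24 bits
 * all clear and all set.  Returns 1 if they always agree. */
static int offsets_agree(void)
{
    uint32_t top;

    for (top = 0; top < 256; top++) {
        uint32_t lo_clear = top << 24;
        uint32_t lo_set = lo_clear | 0x00FFFFFFu;

        assert(offset_shift_and(lo_clear) == offset_shift_mul(lo_clear));
        if (offset_shift_and(lo_set) != offset_shift_mul(lo_set))
            return 0;
    }
    return 1;
}
```

Both forms yield the top byte times 4, so the choice only affects which
execution unit does the work.  Whether the multiply actually ends up free
depends on how it pairs with the neighboring instructions.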

> When I used r2 and r3 for L0 and R0, before preloading P array, speed was
> 1083 c/s. But when I changed to r48 and r56 it became 1136 c/s. I guess
> it's because with r2 and r3 some instructions were 16-bit.

That's puzzling.  Yaniv - is dual-issue (or something else) hampered by
having some 16-bit instructions inter-mixed with 32-bit ones?
