john-dev - Re: Katja's weekly report #15

Message-ID: <20130926030150.GA22958@openwall.com>
Date: Thu, 26 Sep 2013 07:01:50 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Katja's weekly report #15

Hi Katja,

On Mon, Sep 23, 2013 at 05:49:27PM +0200, Katja Malvoni wrote:
> I wasn't able to work yesterday and I won't be able to work today, I caught
> a flu.

Get well soon!

> Accomplishments:
> 1. Updated wiki page

Thanks!  As I had mentioned, we/you need to get the page at
http://openwall.info/wiki/john/development/Parallella linked from some
other wiki page(s), such as from john/development and/or from john.

> 2. Fixed bug so that bcrypt on FPGA doesn't fail self test on first run

Great.  What was the bug?

> 3. Partially optimized bcrypt on FPGA
>        - using true dual port RAM for Sbox with two cycle latency. In
> simulation I have it with 1 cycle latency, 3 cycles per BF_ROUND and
> 1709766 cycles in total but it doesn't work on ZedBoard.

3 cycles per BF_ROUND sounds just right to me.  I assume it's one cycle
to fetch first two S-box elements, another cycle to fetch the other two,
and a third cycle to process these fetched values and compute the next
set of S-box indices, for the next round.  Correct?
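
For reference, the four lookups being counted here can be sketched in C
as a textbook Blowfish round.  This is only an illustration: the struct
and function names are hypothetical, and the ordering of operations in
JtR's BF_ROUND macro or in the FPGA design may differ.

```c
#include <stdint.h>

/* Hypothetical layout: four 256-entry S-boxes and an 18-entry P-array,
 * as in standard Blowfish. */
typedef struct {
    uint32_t S[4][256];
    uint32_t P[18];
} bf_ctx;

/* The F function: four S-box lookups indexed by the bytes of x,
 * combined as ((a + b) ^ c) + d. */
uint32_t bf_f(const bf_ctx *ctx, uint32_t x)
{
    uint32_t a = ctx->S[0][x >> 24];           /* lookup 1 */
    uint32_t b = ctx->S[1][(x >> 16) & 0xff];  /* lookup 2 */
    uint32_t c = ctx->S[2][(x >> 8) & 0xff];   /* lookup 3 */
    uint32_t d = ctx->S[3][x & 0xff];          /* lookup 4 */
    return ((a + b) ^ c) + d;
}

/* One round in textbook form (the L/R swap is left to the caller). */
void bf_round(const bf_ctx *ctx, uint32_t *L, uint32_t *R, int n)
{
    *L ^= ctx->P[n];
    *R ^= bf_f(ctx, *L);
}
```

With a dual-port block RAM, lookups 1 and 2 can share one cycle and
lookups 3 and 4 the next, which is where the two fetch cycles come from.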

Can you perhaps reduce this further, to two cycles per Blowfish round
(for most rounds), by fetching the next round's first two S-box elements
during the current round's "computation" cycle?  In other words, we can
and should be doing two S-box lookups from the block RAM on every cycle.
There's probably no good enough reason to waste a cycle on computation
alone, when we can also use this cycle to send two addresses to memory
and have the data ready the next cycle.  Yes, the maximum clock rate
might be a bit lower than with the 3-cycle approach, but probably by
very little.
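
A rough cycle-count model of the two schedules, as a sketch only.  The
function names are made up for illustration, and the single extra cycle
charged to the first round of the overlapped schedule is an assumption
(the first round has no preceding computation cycle to hide its first
fetch behind); the real overhead depends on the design.

```c
/* Cycles for n Blowfish rounds under two hypothetical schedules. */

/* fetch 2, fetch 2, compute: no overlap between rounds */
unsigned cycles_3_per_round(unsigned n_rounds)
{
    return 3 * n_rounds;
}

/* The next round's first fetch overlaps the current compute cycle.
 * Assumption: the very first round has nothing to overlap with,
 * so it costs one extra cycle. */
unsigned cycles_2_per_round(unsigned n_rounds)
{
    return n_rounds ? 2 * n_rounds + 1 : 0;
}
```

Under these assumptions, 16 rounds take 48 cycles with the 3-cycle
schedule and 33 with the overlapped one.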

And indeed, you need to get this working on the device, not just in
simulation.

If you do get the 2-cycle approach working, then it'd make sense to use
two block RAMs per core and do all four S-box lookups at once - which
means you can do one Blowfish round per cycle.  Yes, we'll waste half of
our block RAM capacity in this way, but the alternative of having twice
as many cores (even if we can fit them in the device) that do one Blowfish
round per two cycles is possibly not any better.  Or is it?  Well, there
may be slight clock rate differences, as well as differences in
"overhead" related to first and/or last round.  We'd need to try both
approaches.
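
The trade-off above can be put into a small throughput model.  This is
a hedged sketch: blocks_per_us and its parameters are hypothetical, the
per-block overhead and the clock rates are placeholders to be measured,
and it assumes block RAM is the only limiting resource.

```c
/* Blowfish block encryptions per microsecond for a group of cores.
 * cycles_per_round: 2 for one block RAM per core, 1 for two.
 * overhead_cycles: assumed fixed first/last-round cost per block. */
double blocks_per_us(unsigned cores, unsigned cycles_per_round,
                     unsigned overhead_cycles, double clock_mhz)
{
    unsigned cycles_per_block = 16 * cycles_per_round + overhead_cycles;
    return cores * clock_mhz / cycles_per_block;
}
```

For a budget of 8 block RAMs at the same clock and zero overhead,
blocks_per_us(8, 2, 0, 100.0) and blocks_per_us(4, 1, 0, 100.0) come out
equal.  In this model a fixed per-block overhead favors the design with
more cores, while a clock rate difference can tip the result either way,
which is why both need to be tried.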

Does the above sound right to you?

> I will be moving to a new place this week and I won't be able to do much
> work but I will list here everything I can think of at the moment
> 
> Priorities:
> 1. Finish optimization - it's about figuring out why having 1 cycle latency
> RAM doesn't produce correct result and figuring out clock problems
> 2. Implement multiple bcrypt cores in FPGA

Sounds good.

> 3. Replace mmap() calls in BF_fpga.c with proper drivers

What would those proper drivers be?  UIO, as I mentioned here?

http://www.openwall.com/lists/john-dev/2013/06/04/2
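
If UIO is the route taken, the userspace side could look roughly like
the sketch below.  The helper name uio_map is hypothetical, and the
device path and region size would come from however the bcrypt core
ends up being exposed; nothing here is taken from BF_fpga.c.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map region `map_index` of a UIO device (e.g. /dev/uio0).
 * UIO selects the region through the mmap offset: N * page size.
 * Returns NULL on failure. */
volatile uint32_t *uio_map(const char *dev_path, int map_index, size_t len)
{
    int fd = open(dev_path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
                   (off_t)map_index * page);
    close(fd);  /* the mapping stays valid after the descriptor is closed */

    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}
```

Compared with mapping /dev/mem directly, this keeps the physical
address out of the userspace code, since the kernel side of the UIO
device describes the region.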

> 4. Try to get bcrypt on 64-core Epiphany to work

Right.  I did not expect it'd be as tricky as it turned out to be, but I
am happy that you'd like to keep trying.

Thanks,

Alexander