Date: Wed, 17 Aug 2011 01:29:35 -0300
From: Yuri Gonzaga <yuriggc@...il.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: Yuri's Status Report - #14 of 15

>
> So previously you had 23 seconds per Eksblowfish main loop invocation
> (which is also unacceptably slow).  Now you report 99 seconds for
> _parallel_ execution of 4 instances of the loop, yet you say you receive
> the same speed (or maybe I misinterpret what you wrote).  These numbers
> just do not agree; for parallel execution, you should have received the
> same total runtime for 4 instances as you did for 1 instance, but you
> got it to run 4 times longer now, as if your instances are running
> sequentially rather than in parallel?


I did some experiments and I think I figured out what is wrong.
We have to account for the time spent on the write and read operations.
This time is constant per loop invocation in each case.
I measured and found:

   - Sequential case: 22.35 seconds spent on writing and reading for one
   loop invocation
   - Parallel case: 26.75 seconds spent on writing and reading for one
   loop invocation (per core)

The parallel case is higher because it needs one extra operation during the
writing and reading cycles to select the appropriate core.

Using that same example (cost = 5), the processing time (between writing and
reading) is equivalent in both cases: about 0.06 seconds.

So, the parallel case will beat the sequential one only once it overcomes
its higher writing and reading overhead. Considering the exponential growth
of the processing time with cost, and comparing 4 parallel cores against
4 sequential loop invocations, I did some calculations and found that this
should happen at cost >= 18.
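Plugging the measured overheads into a simple time model reproduces the
cost = 18 measurement below reasonably well (a minimal Python sketch; the
assumption that the per-core write/read transfers are serialized while the
processing itself overlaps is mine, not something I verified in the RTL):

```python
# Simple time model for 4 Eksblowfish loop invocations on the e101.
# Assumption (mine): per-core write/read transfers are serialized, so the
# parallel case pays its overhead once per core, while the computation
# itself runs concurrently on the 4 cores.

IO_SEQ = 22.35   # s of write/read overhead per invocation, sequential case
IO_PAR = 26.75   # s of write/read overhead per invocation, parallel case
T5 = 0.06        # s of processing time at cost = 5
CORES = 4

def t_proc(cost):
    """Processing time doubles with each cost step (2^cost iterations)."""
    return T5 * 2 ** (cost - 5)

def t_sequential(cost):
    # N invocations back to back, each paying overhead plus processing
    return CORES * (IO_SEQ + t_proc(cost))

def t_parallel(cost):
    # N serialized transfers, but only one processing time (cores overlap)
    return CORES * IO_PAR + t_proc(cost)

if __name__ == "__main__":
    print(f"sequential: {t_sequential(18) / 60:.1f} min")
    print(f"parallel:   {t_parallel(18) / 60:.1f} min")
```

At cost = 18 this predicts roughly 34 minutes sequential and 10 minutes
parallel, in line with what I measured on the board.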

Then, I ran it on the board to verify this conclusion.
With cost = 18, 4 cores vs. 4 sequential invocations, I got:

   - Sequential total time: ~ 33 minutes
   - Parallel total time: ~ 9 minutes

So, this "theory" seems to hold.
Higher cost values should increase the gain of the parallel approach over
the sequential one even further.
More cores, of course, should improve it as well. Unfortunately, I could
fit only 4 cores in the e101's FPGA.
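To make the scaling claim concrete, here is a quick sketch of the speedup
as a function of cost and core count, using the overheads measured above
(22.35 s and 26.75 s per invocation); the serialized-transfer assumption
is mine:

```python
# Speedup of N parallel cores over N sequential loop invocations.
# Overheads are the measured write/read times; processing time doubles
# with each cost step (0.06 s at cost = 5).

IO_SEQ, IO_PAR, T5 = 22.35, 26.75, 0.06

def speedup(cost, cores):
    t = T5 * 2 ** (cost - 5)  # processing time at this cost
    # sequential: N invocations back to back; parallel: N transfers, one t
    return cores * (IO_SEQ + t) / (cores * IO_PAR + t)

for cost in (12, 18, 24):
    print(cost, round(speedup(cost, cores=4), 2))
```

The speedup tends toward the core count as cost grows, which is why both
higher cost values and more cores (on a larger FPGA) favor the parallel
design.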

> I asked you some questions in:
> http://www.openwall.com/lists/crypt-dev/2011/07/05/2
> to which you never replied.  Perhaps I should have insisted on a reply,
> so we'd catch the unacceptable performance sooner.


The following questions?

> Any info on the performance? This is one Eksblowfish core running in
> the FPGA right? If so, it got to be several times slower than native
> (compiled) code on the CPU, right?


> Somehow I thought
> that such execution times would apply to something like 1000 invocations
> of the Eksblowfish loop (you did mention just 10, though...) or to a
> much higher "cost" setting (but you appear to keep it at 5 in the code
> that you uploaded to the wiki).


We can run with those parameters to see what happens, mainly on the other
board, which has a higher clock frequency and could fit more cores.

> The code in eksblowfish-loop-interface.zip and
> 4-eksblowfish-loop-cores-pico-e101.zip looks like you do just 10
> invocations of Eksblowfish at cost=5 in the former and just 4 of them
> in the latter


Right.

> So where does the extreme performance loss come from?  A ridiculously
> low clock rate, like 1 MHz or below?  Or am I misreading things?
> You need to use a reasonable clock rate, comparable to what we'd
> actually use, to validate that your design works as intended in hardware.


The clock rate is the one provided by the e101 board: 48 MHz.

> On a 1 GHz CPU, Eksblowfish runs at something between 100 and 200
> invocations per second, at cost=5.  You're reporting it running at 23
> seconds, which is thus 2000 to 5000 times slower.  Indeed, the clock
> rate is lower, but maybe only by a factor of 10 (I am assuming that you
> run this at 100 MHz or so).


48 MHz on pico e101.


> This is partially compensated by the
> reduced number of clock cycles per Blowfish round.  On a CPU, it can be
> something between 5 and 10 cycles per Blowfish round:
> http://www.schneier.com/blowfish-speed.html
> This gives 9 for the original Pentium, I am getting 5.5 for the code in
> JtR on newer CPUs.  On an FPGA, you should have between 1 and 5 clock
> cycles per Blowfish round.  IIRC, the first implementation with BlockRAMs
> that you did was meant to provide 5 cycles per round, and we discussed
> how to reduce that.  Even if we take 5 cycles per round for both the CPU
> and the FPGA, this gives us only a 10x slowdown per core because of the
> clock rate difference alone (an Eksblowfish core in an FPGA vs. code
> running on a CPU core).  (And we'd compensate for that by having
> something like 100 cores in a larger FPGA chip, as well as by reducing
> the number of clock cycles per round, which doesn't have to be as bad as
> 5.)
> But you're getting a ridiculous 2000 to 5000 times slowdown instead?


Let's see what we can do to achieve this goal.
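To make the target concrete: redoing the quoted arithmetic with the actual
48 MHz clock gives the slowdown we should be seeing per core (a quick
Python check; the 5.5 and 5 cycles-per-round figures are the ones from the
quote, not new measurements):

```python
# Expected per-core slowdown of the FPGA Eksblowfish core vs. a 1 GHz CPU,
# using the cycles-per-round figures from the message above.

CPU_HZ = 1e9
CPU_CYCLES_PER_ROUND = 5.5   # JtR on newer CPUs (quoted figure)
FPGA_HZ = 48e6               # pico e101 clock
FPGA_CYCLES_PER_ROUND = 5.0  # BlockRAM implementation (quoted figure)

cpu_rounds_per_sec = CPU_HZ / CPU_CYCLES_PER_ROUND
fpga_rounds_per_sec = FPGA_HZ / FPGA_CYCLES_PER_ROUND

slowdown = cpu_rounds_per_sec / fpga_rounds_per_sec
print(f"expected per-core slowdown: {slowdown:.1f}x")
```

So a single core at 48 MHz should be only about 19x slower than a CPU core,
not 2000-5000x; the gap is almost entirely the write/read overhead, which
is what we need to attack.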


> As to LUT count, it appears that you'd fit only 7 cores in the E-101
> board's Spartan-6.


4 cores, although they (plus the Pico bus internal logic) already occupy
65%, which is what suggested trying more cores.
But if I add even one more core, the tool is no longer able to fit
everything in that FPGA.


> That's obviously too few, but we'll target much
> larger chips and we'll need to optimize the LUT count.


In this case, we will need to revisit the LUT count optimization step.

Thanks,

---
Yuri Gonzaga Gonçalves da Costa
-------------------------------------------------------------
Master's student in Informatics
LASID - Laboratório de Sistemas Digitais (Digital Systems Laboratory)
Universidade Federal da Paraíba

