crypt-dev - Re: Yuri's Status Report

Message-ID: <20110817151728.GC28733@openwall.com>
Date: Wed, 17 Aug 2011 19:17:28 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: Yuri's Status Report - #14 of 15

Yuri -

On Wed, Aug 17, 2011 at 01:29:35AM -0300, Yuri Gonzaga wrote:
> I did some experiments and I think I figured out what is wrong.
> We have to consider the time spent on write and read operations.

Right.  It appears that picoXfaceP->WriteDevice() is very slow, then -
maybe it involves bidirectional communication, to make sure the data was
actually received at the other end.

Would it be possible for you to invoke picoXfaceP->WriteDevice() just
once, or just once per core, on the entire block of data?  Perhaps start
by turning this loop:

		for(int i = 0; i < 1024; i++) {
			if(picoXfaceP->WriteDevice(7,&sBoxes[i],4) < 0){
				printf("WRITE ERROR. ABORTING\n");
				exit(EXIT_FAILURE);
			}
		}

into one call to picoXfaceP->WriteDevice(), for the 4 KB of data.
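
To illustrate, here is a minimal sketch of the batched version.  MockDevice
is a hypothetical stand-in, not the Pico API: it only mimics the
WriteDevice(address, buffer, byte_count) call shape used in the loop above
and counts the calls.  The sketch also assumes that sBoxes is a contiguous
array of 1024 32-bit words and that the real call accepts a 4096-byte
length, both of which need to be checked against the actual code and API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the Pico interface object.  It only mimics the
// WriteDevice(address, buffer, byte_count) call shape seen in the loop above;
// the real API may differ.
struct MockDevice {
	std::vector<unsigned char> received;	// bytes "sent" to the board
	int calls = 0;				// number of WriteDevice() calls

	int WriteDevice(int address, const void *buf, std::size_t bytes) {
		(void)address;
		if (buf == nullptr)
			return -1;
		const unsigned char *p = static_cast<const unsigned char *>(buf);
		received.insert(received.end(), p, p + bytes);
		calls++;
		return static_cast<int>(bytes);
	}
};

const std::size_t SBOX_WORDS = 1024;	// 4 S-boxes x 256 entries

// Original approach: one 4-byte write per S-box word (1024 calls).
template <typename Dev>
int write_sboxes_per_word(Dev &dev, const std::uint32_t *sBoxes) {
	for (std::size_t i = 0; i < SBOX_WORDS; i++)
		if (dev.WriteDevice(7, &sBoxes[i], 4) < 0)
			return -1;
	return 0;
}

// Batched approach: a single write for the whole 4 KB block.
template <typename Dev>
int write_sboxes_batched(Dev &dev, const std::uint32_t *sBoxes) {
	const std::size_t bytes = SBOX_WORDS * sizeof(std::uint32_t);
	assert(bytes == 4096);
	return dev.WriteDevice(7, sBoxes, bytes) < 0 ? -1 : 0;
}
```

Both functions deliver the same bytes; the batched one pays the per-call
overhead once instead of 1024 times.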

On the other hand, if this is somehow time-consuming for you (would take
more than a day to implement) and the API for M-501 under Linux is very
different (I don't know if it is), then don't bother.

But we definitely do need to combine writes/reads from the FPGA board
into larger chunks for good performance.

> Using that same example (cost = 5), the processing time (between writing and
> reading) in both cases is equivalent, about 0.06 seconds.

This is a lot slower than desired, but it is reasonable.  It is very
similar to the performance of the original Pentium at 100 MHz.  This makes
sense given that the board is running at 48 MHz and it needs maybe half
as many clock cycles per Blowfish round.  So running this on 4 cores
you'll achieve roughly the speed of a Pentium II at 350 MHz.  Obviously, we
need a more optimal implementation (fewer cycles per Blowfish round,
fewer FPGA resources used per core), a much bigger FPGA, and a faster
clock rate for this to make any sense for practical use.  I hope we'll
get much better speed on M-501 once you're able to access it again.

> With cost = 18, and 4 cores vs. 4 sequential invocations, I got:
> 
>    - Sequential total time: ~ 33 minutes
>    - Parallel total time: ~ 9 minutes

These numbers looked reasonable to me at first, but then I did some math
and they don't agree with the 0.06 seconds figure for cost=5 that you
gave above.  Specifically:

33 * 60 / (2 ^ (18 - 5)) = 0.24

I expected to see something close to 0.06.  Why is it 4 times slower
here?  The difference between sequential and parallel times suggests
that the reads/writes overhead is indeed pretty low at cost=18, so this
overhead does not explain the 0.06 vs. 0.24 discrepancy.
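
The normalization above can be sketched in code.  The function names are
made up for illustration, and the only assumption is that run time doubles
with each cost increment (the fixed setup part is ignored).  The second
helper divides by an invocation count, in case the 33 minutes is a total
over all 4 sequential invocations; I don't know which way it was measured.

```cpp
#include <cassert>
#include <cmath>

// Scale a run time measured at one bcrypt cost down to a lower cost,
// assuming run time doubles with each cost increment.
double scale_to_cost(double seconds, int from_cost, int to_cost) {
	assert(from_cost >= to_cost);
	return seconds / std::ldexp(1.0, from_cost - to_cost);
}

// Same, for a total that covers several back-to-back invocations.
double scale_per_invocation(double total_seconds, int invocations,
    int from_cost, int to_cost) {
	assert(invocations > 0);
	return scale_to_cost(total_seconds, from_cost, to_cost) / invocations;
}

// Ratio of sequential to parallel wall-clock time.
double speedup(double sequential, double parallel) {
	assert(parallel > 0.0);
	return sequential / parallel;
}
```

scale_to_cost(33 * 60, 18, 5) gives about 0.24, matching the figure above.
If the total does cover 4 invocations, scale_per_invocation(33 * 60, 4, 18, 5)
gives about 0.06, so that is one thing to check.  speedup(33, 9) is about
3.7 for the 4 cores.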

Do you have an explanation?

> > I asked you some questions in:
> > http://www.openwall.com/lists/crypt-dev/2011/07/05/2
> > to which you never replied.  Perhaps I should have insisted on a reply,
> > so we'd catch the unacceptable performance sooner.
> 
> The following questions?
> 
> > Any info on the performance? This is one Eksblowfish core running in
> > the FPGA, right? If so, it's got to be several times slower than native
> > (compiled) code on the CPU, right?

Yes, these are what I meant above.  You've provided the answers now.

> 4 cores, although they (plus Pico bus internal logic) are occupying 65%,
> suggesting room for more cores.
> But if I add only one more core, the tool is no longer able to fit everything
> in that FPGA.

Any idea why not?  What specific error message does it give when you try
to fit 5 cores?

> In this case, we will need to revisit the LUT count optimization step.

Yes, we'll need to revisit it, but perhaps you should start by reducing
the number of clock cycles per Blowfish round.  I think you can do 2
cycles/round while using the same number of BlockRAMs, and 1 cycle/round
when you use twice as many BlockRAMs (acceptable if LUTs and not BlockRAMs
are the scarce resource anyway).
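
As a rough guide to what fewer cycles/round would buy, here is a
back-of-the-envelope estimate.  The count of Blowfish block encryptions
follows the standard Eksblowfish definition rather than anything measured
on your design, the function names are made up, and everything other than
Blowfish rounds (key XORs, bus transfers, control logic) is ignored, so
treat the results as approximate lower bounds only.

```cpp
#include <cassert>
#include <cstdint>

// Blowfish block encryptions in one bcrypt computation, per the standard
// Eksblowfish definition (an assumption here, not taken from the FPGA
// design): each key expansion rewrites 18 P-array words and 1024 S-box
// words, two words per encryption, i.e. 521 encryptions; there are
// 1 + 2 * 2^cost expansions, followed by 64 iterations over 3 blocks.
std::uint64_t bcrypt_block_encryptions(unsigned cost) {
	assert(cost < 32);
	const std::uint64_t per_expansion = (18 + 1024) / 2;	// 521
	const std::uint64_t expansions = 1 + 2 * (std::uint64_t(1) << cost);
	return per_expansion * expansions + 64 * 3;
}

// Estimated seconds per hash for one core, counting only Blowfish rounds
// (16 per block encryption) and ignoring all other overhead.
double estimated_seconds(unsigned cost, double cycles_per_round,
    double clock_hz) {
	const double rounds =
	    16.0 * static_cast<double>(bcrypt_block_encryptions(cost));
	return rounds * cycles_per_round / clock_hz;
}
```

At 48 MHz and cost=5 this gives roughly 0.057 s at 5 cycles/round, 0.023 s
at 2 cycles/round, and 0.011 s at 1 cycle/round.  If the estimate holds, the
measured 0.06 s would correspond to about 5 cycles/round.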

Thanks,

Alexander