Date: Fri, 20 May 2011 05:16:24 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

Yuri, David -
On Thu, May 19, 2011 at 06:32:33PM -0300, Yuri Gonzaga wrote:
> I did the first attempt. I think I spent more time than I expected, mainly
> because of the adaptation to the Xilinx dev environment.
These results are very helpful, and I am pleased that you did this soon
enough considering the dev environment change. Thank You!
> Device Utilization Summary (estimated values)
I looked at the HTML version of your message to figure this out.
(Normally, I read text/plain sections only.)
It's four columns:
> Logic Utilization
> Used
> Available
> Utilization
which contain, respectively (and so on for other lines):
> Number of Slice Registers
> 35
> 93120
> 0%
> Number of Slice LUTs
> 131
> 46560
> 0%
I find this puzzling. We have 272 bits of internal state (not including
auxiliary variables for loop control, etc.). How do they fit in 35
registers? http://www.1-core.com/library/digital/fpga-logic-cells/ says
we have four 1-bit registers per slice in a Virtex-5. Has something
changed in Virtex-6? (I doubt it.) Or is this a terminology difference?
(I find this more likely.) Yuri, David - can one of you explain this to
me, please?
s[] is declared as:
reg [7:0] s [31:0];
Shouldn't this consume 256 registers right away?
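Just to make my counting explicit (a rough tally only; I'm assuming l and
r are 8 bits each, like the s[] entries):

    reg [7:0] s [31:0];  // 32 x 8 = 256 bits
    reg [7:0] l, r;      // assuming 8-bit l and r: 2 x 8 = 16 bits, 272 total

so I'd naively expect on the order of 272 flip-flops for the state alone.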
As to 131 LUTs, this is believable.
> Number of fully used LUT-FF pairs
> 35
> 131
> 26%
The same info repeated - 35 flip-flops, but 131 LUTs. Since there's an
equal number of flip-flops (aka registers?) and LUTs per slice, this
means that we're only using 26% of the flip-flops that we could have.
...but I am puzzled why we need so few of them. I'd expect that we'd
have unused LUTs, not unused flip-flops (because we'd need more of them
than LUTs).
> Number of bonded IOBs
> 14
> 240
> 5%
>
> Number of BUFG/BUFGCTRLs
> 2
> 32
> 6%
What are these? Could we reasonably use more of them by redefining our
Blowfish-like algorithm slightly?
> Maximum frequency: 261.356MHz
Assuming that we're LUT-bound and we can make use of all LUTs with this
(which might not be true), this gives us:
46560/131*261.356 = 92891
That's 93 billion two-round iterations per second (assuming sufficient
parallelism, which we'll provide).
We need to compare this against a CPU implementation. (And maybe also
against GPU, but that's trickier.) For a quad-core CPU at 3.5 GHz
capable of 3 instructions per cycle (on each core) running a
non-bitslice implementation, assuming 30 instructions per 2 rounds (the
best gcc-generated code I was able to get for an unrolled loop):
3500*4*3/30 = 1400
That's 1.4 billion two-round iterations per second. So the speedup is:
92891/1400 = 66.35
Better-optimized code might use 24 instructions per 2 rounds (with 3-operand
instructions if those are supported by the arch, which saves some moves):
3500*4*3/24 = 1750
92891/1750 = 53.08
For DES, we were getting a figure of around 40 here - but that compared
actual performance (both FPGA and CPU), not theoretical estimates.
So we have some (theoretical) improvement over DES, but it's not as good
as I would have liked it to be (I wanted to see 100+ here). Some
further improvement is possible by adding bit permutations.
Is "Virtex6 xc6vlx75t-3ff484" the same as the device that David
mentioned getting 22 billion DES encryptions per second for, though?
If it's of a different size or speed grade, that could easily turn my
66x estimate into 100x or 50x (for the actual device).
> For NROUNDS = 4 (same target device):
This is also very helpful, thanks!
> Number of Slice Registers
> 35
> 93120
> 0%
No more registers needed. Makes sense.
(But the low number is just as puzzling.)
> Number of Slice LUTs
> 183
> 46560
> 0%
More LUTs, but not 2x more. Also makes sense. The relatively small
increase in LUT count means that with a 2-round implementation, we waste
most LUTs on the state machine, right?
> Maximum frequency: 147.432MHz
This gives:
46560/183*147.432*2 = 75021
That's 75 billion 2-round iterations per second. Slightly slower than
the 2-round implementation. But this is worth another try with our
closer-to-final Blowfish-like algorithm, if we choose to use one. On
the other hand, the numbers so far suggest that 2-round is likely
better, and that if we simplify the state machine logic (possible or
not?) it will be better yet.
> About the pipelining, how can we deal with the fact that there are
> dependencies between r, l and s in calculations?
> Will each stage have to store locally r, l and s?
Yes. We'll need the full set of registers for these for each pipeline
stage. If right now we're LUT- rather than register-bound (which is
puzzling), then this makes sense (for a pipeline with 4 stages or so).
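Roughly, I picture each pipeline stage carrying something like this (just
a sketch with made-up module and signal names; s[] is flattened to one
wide bus here since plain Verilog doesn't allow array ports):

    module pipe_stage (
        input              clk,
        input        [7:0] l_in, r_in,
        input      [255:0] s_in,     // 32 x 8-bit s[] entries, flattened
        output reg   [7:0] l_out, r_out,
        output reg [255:0] s_out
    );
        always @(posedge clk) begin
            // the real round datapath would go here; shown as a plain
            // pass-through of the state for illustration only
            l_out <= r_in;
            r_out <= l_in;
            s_out <= s_in;
        end
    endmodule

With 4 such stages we'd have 4 independent computations in flight, each
with its own copy of l, r, and s[].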
> The verilog code is attached, including simulation.
Thank you! Some comments on it:
> `define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1))
Are you sure this is good enough for the compiler to do the smart thing,
realizing that we have an adder with some holes punched in it?
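(A quick way to gain confidence in the macro itself, independently of
what the synthesizer makes of it: the macro is an ordinary adder whose
carry term is masked, so with an all-ones mask it should reduce to a
plain add. A throwaway simulation sketch - module name made up, macro
copied from the attached code:

    module pcadd_check;
        `define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1))
        reg [7:0] a, b;
        integer i;
        initial begin
            for (i = 0; i < 1000; i = i + 1) begin
                a = $random;
                b = $random;
                if (`PCADD(a, b, 8'hff) !== a + b)
                    $display("PCADD mismatch: a=%h b=%h", a, b);
            end
            $display("pcadd_check done");
        end
    endmodule

This doesn't answer the synthesis question, of course - only the mapped
netlist or the LUT count would.)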
> s[0] = 8'ha6;
> s[1] = 8'hac;
> s[2] = 8'hdb;
> s[3] = 8'hb7;
...
> s[28] = 8'ha5;
> s[29] = 8'h29;
> s[30] = 8'h40;
> s[31] = 8'h3e;
Do these initial value assignments consume any logic? Perhaps they do.
If so, perhaps you can save some logic by having the initial state of
s[] uploaded from the host?
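Something like this is what I have in mind (a sketch only, with made-up
port names: a write strobe, a 5-bit address, and 8-bit data driven from
the host interface):

    reg [7:0] s [31:0];

    always @(posedge clk)
        if (s_we)                   // write strobe from the host (assumed)
            s[s_waddr] <= s_wdata;  // s_waddr: 5 bits, s_wdata: 8 bits (assumed)

That trades the constant initialization for a small write port, and also
lets the host change the s[] contents without re-synthesizing.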
Did you verify correctness of your implementation in any way? (Perhaps
by making sure it produces the same outputs for the same inputs as the C
implementation does.)
Thanks again,
Alexander