Date: Fri, 20 May 2011 00:07:05 -0300
From: Yuri Gonzaga <yuriggc@...il.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

>
> I looked at the HTML version of your message to figure this out.
> (Normally, I read text/plain sections only.)


Oh! I am sorry. Next time I will send plain text only.
It was easy to copy and paste the results table from Xilinx ISE.

I have to say that I am not used to Xilinx terminology, so it is better
to wait for David to explain how these slice figures relate to each other.

Even so, I did some research on Google.
According to [1], in the section "CLBs, Slices, and LUTs", one Virtex-6
slice has 4 LUTs and 8 flip-flops, so each LUT is paired with up to
2 flip-flops.
So "Slice Registers = LUT-FF pairs = 35" may mean that 35 LUTs were used as
registers and that all of them used both flip-flops.
Following this idea, that adds up to 70 flip-flops, which is still far fewer
than the 256 expected.
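
As a rough check of that arithmetic, here is a small Python sketch. It
uses only the figures quoted above; the reading of "LUT-FF pairs" as LUTs
that each use both flip-flops is my guess, not something confirmed by the
Xilinx documentation, and the variable names are only illustrative.

```python
# Figures quoted in this message; the interpretation is an assumption.
LUTS_PER_SLICE = 4    # Virtex-6 slice: 4 LUTs, according to [1]
FFS_PER_SLICE = 8     # Virtex-6 slice: 8 flip-flops, according to [1]
LUT_FF_PAIRS = 35     # "Slice Registers = LUT-FF pairs" in the ISE report
EXPECTED_FFS = 256    # flip-flops expected for the design

# Each LUT is paired with at most this many flip-flops.
ffs_per_lut = FFS_PER_SLICE // LUTS_PER_SLICE

# Upper bound on flip-flops if every reported pair uses both of them.
max_ffs = LUT_FF_PAIRS * ffs_per_lut

# Flip-flops still unaccounted for under this interpretation.
shortfall = EXPECTED_FFS - max_ffs

print(ffs_per_lut, max_ffs, shortfall)
```

Even under the most generous reading, most of the expected flip-flops do
not show up in the register count.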

Maybe some of the registers were merged into the logic in the 131 LUTs.
I remember the synthesizer log warning about possible latch inference.
In that case, some bits of s, l and r could be implemented in LUTs instead
of flip-flops.

IOBs are Input-Output Blocks, and their count depends only on the number of
inputs and outputs at the top (FPGA) level.
I guess BUFG/BUFGCTRLs are buffers used to make global signals more stable.
In this project, 2 are used: one for the clock and the other for the reset
signal.

> The relatively small
> increase in LUT count means that with a 2-round implementation, we waste
> most LUTs on the state machine, right?


I am not sure. Some experiments could clarify this question.
For example, I could remove part of the state machine and see the impact on
the results.

> On the other hand, the numbers so far suggest that 2-round is likely
> better, and that if we simplify the state machine logic (possible or
> not?) it will be better yet.


I could try to optimize it, but I don't know whether I can save many more
resources.

> Are you sure this is good enough for the compiler to do the smart thing,
> realizing that we have an adder with some holes punched in it?


And what do you suggest? Finding an equivalent but simpler function?
In any case, the synthesizer already optimizes the logic (or at least
tries to).

> Do these initial value assignments consume any logic?


Yes.

> If so, perhaps you can save some logic by having the initial state of
> s[] uploaded from the host?


It is possible. However, they would have to be uploaded for each
calculation, unless we upload them just once and keep them stored in some
registers.
Another possibility is to write them to a ROM (depending on what the board
provides) and read them from there.

> Did you verify correctness of your implementation in any way?  (Perhaps
> by making sure it produces the same outputs for the same inputs as the C
> implementation does.)


Yes. The testbench.v I sent has this goal.
I did exactly what you said.
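
To illustrate the kind of comparison I mean, here is a minimal Python
sketch. It is not part of testbench.v; the function name first_mismatch
and the hex strings in the example are hypothetical placeholders, and it
assumes both the simulation and the C program dump one hex value per
output.

```python
def first_mismatch(expected, actual):
    """Return the index of the first differing output, or None if all match.

    Both arguments are lists of hex strings; they are compared as
    integers, so letter case and leading zeros do not matter.
    """
    for i, (e, a) in enumerate(zip(expected, actual)):
        if int(e, 16) != int(a, 16):
            return i
    # One side produced fewer outputs than the other.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# Placeholder values only, not real outputs of the design.
c_outputs = ["0a1b", "ffff", "0001"]
sim_outputs = ["0A1B", "ffff", "0001"]
print(first_mismatch(c_outputs, sim_outputs))
```

Any result other than None would point to the first output where the
hardware and the C implementation disagree.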

[1]
http://www.dinigroup.com/product/data/DN-DualV6-PCIe-4/files/Virtex6_Overview_v11.pdf

Yuri Gonzaga
