Date: Fri, 12 Jul 2013 00:16:09 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Thinking about release

On Fri, Jul 12, 2013 at 03:36:42PM +1200, Andre Renaud wrote:
> > I was unable to measure any difference in performance of your version
> > with the prefetch hack versus simply:
> >
> >         __asm__ __volatile__(
> >         "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
> >         "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
> >         : "+r"(d), "+r"(s) :
> >         : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");
> 
> What kind of machine were you using? I see a change of 115MB/s ->

It's a combined ARM Cortex-A9 & FPGA chip from Xilinx. Supposedly the
timings match the Cortex-A9 in other ARM chips.

> 105MB/s when I drop the prefetch, even using the code that you
> suggested. This is on an Atmel AT91sam9g45 (ARM926ejs @ 400MHz). I'm
> assuming this is some subtlety about how the cache is operating?

Perhaps so.
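
For anyone who wants to try the with/without-prefetch comparison outside of hand-written asm, here is a rough portable C sketch of the same idea: a 32-byte-per-iteration copy (matching the eight-register ldmia/stmia pair quoted above) with an optional software prefetch. This is not the code that was measured in this thread. The names copy_blocks, use_prefetch and PREFETCH_AHEAD are made up for illustration, the 64-byte prefetch distance is an untuned guess, and __builtin_prefetch is a GCC/Clang builtin, so other compilers just get the plain loop.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 32           /* bytes per iteration, like 8 registers of ldmia/stmia */
#define PREFETCH_AHEAD 64  /* hypothetical prefetch distance; tune per CPU */

/*
 * Copy n bytes from s to d in BLOCK-sized chunks.
 * Assumes n is a multiple of BLOCK and that the buffers do not overlap.
 * If use_prefetch is nonzero (and the compiler supports it), hint the
 * source data PREFETCH_AHEAD bytes ahead of the current read position.
 */
static void copy_blocks(unsigned char *d, const unsigned char *s,
                        size_t n, int use_prefetch)
{
    unsigned char tmp[BLOCK];

    (void)use_prefetch; /* unused when the builtin is unavailable */

    while (n >= BLOCK) {
#if defined(__GNUC__)
        /* Only prefetch addresses that are still inside the source buffer. */
        if (use_prefetch && n > PREFETCH_AHEAD)
            __builtin_prefetch(s + PREFETCH_AHEAD, 0, 0);
#endif
        /* Fixed-size memcpy calls; compilers typically lower these
         * to plain loads and stores. */
        memcpy(tmp, s, BLOCK);
        memcpy(d, tmp, BLOCK);
        s += BLOCK;
        d += BLOCK;
        n -= BLOCK;
    }
}
```

Whether the prefetch hint helps at all probably depends on the core and its cache behaviour, which would be consistent with the ARM926EJ-S and Cortex-A9 numbers disagreeing.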

By the way, I also did some tests with misaligning the src/dest with
respect to cache lines, and the timing did change, but not in any way
I could make sense of...
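
A misalignment test of that kind could be set up roughly as follows. This is only a sketch, not the harness actually used for the numbers in this thread: time_copy, align_up and CACHE_LINE are hypothetical names, the 32-byte line size is an assumption to check against the CPU manual, and it times the C library's memcpy using clock(), which is fairly coarse.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CACHE_LINE 32  /* assumed line size; verify for the target CPU */

/* Round p up to the next CACHE_LINE boundary. */
static unsigned char *align_up(unsigned char *p)
{
    uintptr_t a = (uintptr_t)p;
    return p + ((CACHE_LINE - a % CACHE_LINE) % CACHE_LINE);
}

/*
 * Time reps copies of len bytes, with the source and destination placed
 * src_off and dst_off bytes past a cache-line boundary (both offsets
 * must be less than CACHE_LINE). Returns elapsed CPU time in seconds,
 * or -1.0 on bad arguments, allocation failure, or a miscopy.
 */
static double time_copy(size_t src_off, size_t dst_off, size_t len, int reps)
{
    if (src_off >= CACHE_LINE || dst_off >= CACHE_LINE)
        return -1.0;

    /* Over-allocate so there is room for alignment plus the offset. */
    unsigned char *sbuf = malloc(len + 2 * CACHE_LINE);
    unsigned char *dbuf = malloc(len + 2 * CACHE_LINE);
    if (!sbuf || !dbuf) {
        free(sbuf);
        free(dbuf);
        return -1.0;
    }

    unsigned char *s = align_up(sbuf) + src_off;
    unsigned char *d = align_up(dbuf) + dst_off;
    memset(s, 0xA5, len);

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++) {
        memcpy(d, s, len);
        /* Compiler barrier so repeated copies are not optimized away. */
        __asm__ __volatile__("" : : "r"(d) : "memory");
    }
    clock_t t1 = clock();

    int ok = memcmp(d, s, len) == 0;
    free(sbuf);
    free(dbuf);
    return ok ? (double)(t1 - t0) / CLOCKS_PER_SEC : -1.0;
}
```

Looping src_off and dst_off over 0..CACHE_LINE-1 and computing len * reps / seconds for each pair gives a throughput table; len and reps need to be large enough for clock() to resolve the interval.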

It may turn out that the issues are complex enough that we won't get
ideal performance without either copying the BSD code you suggested,
or fully understanding what it's doing (along with other ARM
performance issues) and developing something new based on that
understanding. In that case, copying/adapting the BSD code might turn
out to be the right solution for now.

Rich
