Date: Thu, 11 Jul 2013 23:16:15 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Thinking about release

On Fri, Jul 12, 2013 at 10:34:31AM +1200, Andre Renaud wrote:
> I've rejiggled it a bit, and it appears to be working. I wasn't
> entirely sure what you meant about the proper constraints. There is an
> additional reason why 8*4 was used for the alignment - to force the
> whole loop to work in cache-line blocks. I've now done this explicitly
> in the lead-in by doing the first few copies as 32-bit, then switching
> to the full cache-line asm. This has the same performance as the fully
> native assembler. However, to get that I had to use the same trick the
> native assembler uses - doing a load of the next block prior to
> storing this one. I'm a bit concerned that this would mean we'd be
> doing a read that was out of bounds, and I can't entirely see why this
> wouldn't be happening with the existing assembler (but I'm presuming
> it doesn't). Any comments on this side of it?
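
In C, the structure you're describing looks roughly like the sketch
below. This is illustrative only, not your actual patch; to keep it
short I'm assuming n is a multiple of 4 and that the pointers are
mutually word-aligned:

#include <stddef.h>
#include <stdint.h>

void *memcpy_pipelined(void *restrict dst, const void *restrict src, size_t n)
{
	uint32_t *d = dst;
	const uint32_t *s = src;

	/* Lead-in: 32-bit copies until the destination reaches a
	 * 32-byte (cache line) boundary. */
	while (n >= 4 && ((uintptr_t)d & 31)) {
		*d++ = *s++;
		n -= 4;
	}

	/* Main loop, software-pipelined: the block after the one
	 * being stored is loaded first. */
	if (n >= 32) {
		uint32_t t0 = s[0], t1 = s[1], t2 = s[2], t3 = s[3];
		uint32_t t4 = s[4], t5 = s[5], t6 = s[6], t7 = s[7];
		s += 8;
		while (n >= 64) {
			/* Load block i+1, then store block i. */
			uint32_t u0 = s[0], u1 = s[1], u2 = s[2], u3 = s[3];
			uint32_t u4 = s[4], u5 = s[5], u6 = s[6], u7 = s[7];
			d[0] = t0; d[1] = t1; d[2] = t2; d[3] = t3;
			d[4] = t4; d[5] = t5; d[6] = t6; d[7] = t7;
			t0 = u0; t1 = u1; t2 = u2; t3 = u3;
			t4 = u4; t5 = u5; t6 = u6; t7 = u7;
			d += 8; s += 8; n -= 32;
		}
		/* Store the last loaded block; no further load needed. */
		d[0] = t0; d[1] = t1; d[2] = t2; d[3] = t3;
		d[4] = t4; d[5] = t5; d[6] = t6; d[7] = t7;
		d += 8; n -= 32;
	}

	/* Word tail. */
	while (n >= 4) {
		*d++ = *s++;
		n -= 4;
	}
	return dst;
}

Written this way -- first block loaded before the loop, last block
stored after it -- the look-ahead never reads past the end of the
source. An asm loop that instead reloads unconditionally on every
iteration would read up to 32 bytes beyond it on the final pass,
which I take to be the out-of-bounds read you're asking about.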

I was unable to measure any performance difference between your
version with the prefetch hack and simply:

	__asm__ __volatile__(
	"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
	"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
	: "+r"(d), "+r"(s) :
	: "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");

in the inner loop.
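
For reference, here's roughly how that might sit in a complete
routine. This is just a sketch with made-up names, not the actual
musl memcpy.c; note that ldm/stm require word-aligned addresses,
hence the alignment handling:

#include <stddef.h>
#include <stdint.h>

void *memcpy_sketch(void *restrict dst, const void *restrict src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Byte lead-in until the destination is word-aligned. */
	for (; n && ((uintptr_t)d & 3); n--) *d++ = *s++;

	/* Fast path only if the source is now word-aligned too. */
	if (!((uintptr_t)s & 3)) {
		/* 32 bytes per iteration; ldmia/stmia move eight
		 * registers and post-increment both pointers, so d
		 * and s advance inside the asm. */
		for (; n >= 32; n -= 32)
			__asm__ __volatile__(
			"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
			"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
			: "+r"(d), "+r"(s) :
			: "a4", "v1", "v2", "v3", "v4", "v5",
			  "v6", "v7", "memory");
	}

	/* Byte tail, also the fallback for mutually misaligned
	 * inputs. */
	for (; n; n--) *d++ = *s++;
	return dst;
}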

Rich
