Date: Thu, 08 Aug 2013 09:03:51 -0400
From: Andrew Bradford <andrew@...dfordembedded.com>
To: musl@...ts.openwall.com
Subject: Re: Optimized C memcpy

On Thu, Aug 8, 2013, at 08:59 AM, Andrew Bradford wrote:
> On Wed, Aug 7, 2013, at 02:21 PM, Rich Felker wrote:
> > Attached is the latest version of my "pure C" (modulo aliasing issues)
> > memcpy implementation. Compiled with -O3 on arm, it matches the
> > performance of the assembly language memcpy from Bionic for aligned
> > copies, and is only 25% slower than the asm for misaligned copies. And
> > it's only mildly larger. It uses the same principle as the Bionic
> > code: large block copies as aligned 32-bit units for aligned copies,
> > and aligned-load, bitshift-then-or, aligned-store for misaligned
> > copies. This should, in principle, work well on typical risc archs
> > that have plenty of registers but no misaligned load or store support.
> > 
> > Unfortunately it only works on little-endian (I haven't thought much
> > yet about how it could be adapted to big-endian), but testing it on
> > qemu-ppc with the endian check disabled (thus wrong behavior)
> > suggested that this approach would work well on there too if we could
> > adapt it. Of course tests under qemu are not worth much; the ARM tests
> > were on real hardware and I'd like to see real-hardware results for
> > other archs (mipsel?) too.
> > 
> > This is not a replacement for the ARM asm (which is still better), but
> > it's a step towards avoiding the need to have written-by-hand assembly
> > for every single new arch we add as a prerequisite for tolerable
> > performance.
> 
> Sorry if this has been discussed before but Google isn't much help.  Why
> is 32 bytes chosen as the block size over other sizes?
> 
> It seems that the code would be fewer lines if blocks were 4 bytes,

Sorry, I now see why 4 byte blocks won't work due to the misalignment,
but 8 or 16 seem like they should be possible.
Is the point of the larger block just to avoid the cost of evaluating
the for loop condition so often?
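
To make the question concrete, here is a minimal sketch of the
aligned-load, bitshift-then-or, aligned-store idea from the quoted
description. It is not the attached implementation: the names
sketch_memcpy, load32 and store32 are hypothetical, it moves one 32-bit
word per loop iteration instead of a 32-byte block, and it uses a 4-byte
memcpy for word access to sidestep the aliasing issue. The word path
assumes little-endian and falls back to a byte copy otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: 32-bit load/store done via memcpy to avoid the
 * aliasing problem; compilers typically lower this to one word access. */
static uint32_t load32(const unsigned char *p)
{
    uint32_t w;
    memcpy(&w, p, 4);
    return w;
}

static void store32(unsigned char *p, uint32_t w)
{
    memcpy(p, &w, 4);
}

/* Sketch only (hypothetical name): copies n bytes between
 * non-overlapping buffers, one 32-bit word per iteration. */
void *sketch_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    const uint32_t one = 1;
    int little_endian = *(const unsigned char *)&one == 1;

    /* Byte-copy until the source pointer is 4-byte aligned. */
    for (; (uintptr_t)s % 4 && n; n--) *d++ = *s++;

    size_t k = (uintptr_t)d % 4;        /* destination misalignment, 0..3 */

    if (k == 0) {
        /* Source and destination both aligned: plain word copy. */
        for (; n >= 4; s += 4, d += 4, n -= 4)
            store32(d, load32(s));
    } else if (little_endian && n >= 4) {
        uint32_t w = load32(s), x;      /* aligned load of the first word */
        size_t head = 4 - k;            /* bytes needed to align d */

        for (size_t i = 0; i < head; i++) *d++ = *s++;
        n -= head;

        /* Now d and s + k are both aligned, and the top k bytes of w
         * have not been written yet. Each iteration reads k + 4 bytes
         * ahead of s at most, hence the n >= 4 + k bound. */
        for (; n >= 4 + k; s += 4, d += 4, n -= 4) {
            x = load32(s + k);                               /* aligned load  */
            store32(d, (w >> (8 * head)) | (x << (8 * k)));  /* aligned store */
            w = x;
        }
    }

    /* Remaining tail bytes (or everything, on the fallback path). */
    for (; n; n--) *d++ = *s++;
    return dest;
}
```

Unrolling the inner loop to 8, 16 or 32 bytes would repeat the
load/shift/or/store group two, four or eight times per loop test, at
the cost of a larger minimum length before the word path can be used.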

Thanks,
Andrew
