musl - Re: Optimized C memcpy

Message-Id: <1375967031.16501.7506555.32466BB3@webmail.messagingengine.com>
Date: Thu, 08 Aug 2013 09:03:51 -0400
From: Andrew Bradford <andrew@...dfordembedded.com>
To: musl@...ts.openwall.com
Subject: Re: Optimized C memcpy

On Thu, Aug 8, 2013, at 08:59 AM, Andrew Bradford wrote:
> On Wed, Aug 7, 2013, at 02:21 PM, Rich Felker wrote:
> > Attached is the latest version of my "pure C" (modulo aliasing issues)
> > memcpy implementation. Compiled with -O3 on arm, it matches the
> > performance of the assembly language memcpy from Bionic for aligned
> > copies, and is only 25% slower than the asm for misaligned copies. And
> > it's only mildly larger. It uses the same principle as the Bionic
> > code: large block copies as aligned 32-bit units for aligned copies,
> > and aligned-load, bitshift-then-or, aligned-store for misaligned
> > copies. This should, in principle, work well on typical risc archs
> > that have plenty of registers but no misaligned load or store support.
> > 
> > Unfortunately it only works on little-endian (I haven't thought much
> > yet about how it could be adapted to big-endian), but testing it on
> > qemu-ppc with the endian check disabled (thus wrong behavior)
> > suggested that this approach would work well on there too if we could
> > adapt it. Of course tests under qemu are not worth much; the ARM tests
> > were on real hardware and I'd like to see real-hardware results for
> > other archs (mipsel?) too.
> > 
> > This is not a replacement for the ARM asm (which is still better), but
> > it's a step towards avoiding the need to have written-by-hand assembly
> > for every single new arch we add as a prerequisite for tolerable
> > performance.
> 
> Sorry if this has been discussed before but Google isn't much help.  Why
> is 32 bytes chosen as the block size over other sizes?
> 
> It seems that the code would be fewer lines if blocks were 4 bytes,

Sorry, I now see why 4 byte blocks won't work due to the misalignment,
but 8 or 16 seem like they should be possible.
Is the larger block size just meant to avoid the cost of evaluating the
for loop condition so often?

Thanks,
Andrew
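
For reference, the aligned-load, bitshift-then-or, aligned-store idea from
the quoted message can be sketched roughly as below. This is not the
attached implementation: the name sketch_memcpy, the 12-byte threshold, and
the one-word-per-iteration loop (rather than larger unrolled blocks) are
assumptions made for illustration, and the may_alias attribute is a
GCC/Clang extension used to sidestep the aliasing issue mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: lets us access byte buffers as 32-bit words. */
typedef uint32_t __attribute__((__may_alias__)) u32;

/* Hypothetical sketch, not the code attached to the thread. */
void *sketch_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    static const u32 one = 1;
    int little_endian = *(const unsigned char *)&one; /* runtime check */

    /* Copy single bytes until the source is 4-byte aligned. */
    while (n && ((uintptr_t)s & 3)) { *d++ = *s++; n--; }

    unsigned off = (uintptr_t)d & 3;
    if (off == 0) {
        /* Both aligned: plain word copies. */
        for (; n >= 4; n -= 4, s += 4, d += 4)
            *(u32 *)d = *(const u32 *)s;
    } else if (little_endian && n >= 12) {
        /* Align the destination; the source becomes misaligned by k. */
        unsigned k = 4 - off;
        for (unsigned i = 0; i < k; i++) *d++ = *s++;
        n -= k;

        const u32 *sa = (const u32 *)(s - k); /* aligned word holding s[0] */
        unsigned rs = 8 * k, ls = 32 - rs;    /* shifts are 8, 16 or 24 */
        u32 w = *sa++;
        /* n >= 8 keeps every aligned load inside the source buffer. */
        for (; n >= 8; n -= 4, s += 4, d += 4) {
            u32 x = *sa++;                     /* aligned load */
            *(u32 *)d = (w >> rs) | (x << ls); /* shift, or, aligned store */
            w = x;
        }
    }

    /* Remaining tail (or whole copy on big-endian misaligned input). */
    while (n--) *d++ = *s++;
    return dest;
}
```

With one word per iteration, the loop test and pointer updates run once
per 4 bytes copied; processing a larger block per iteration amortizes that
overhead over more loads and stores, which is the trade-off the question
above is asking about.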