Date: Thu, 8 Aug 2013 11:15:02 -0400
From: Rich Felker <>
Subject: Re: Optimized C memcpy

On Thu, Aug 08, 2013 at 09:03:51AM -0400, Andrew Bradford wrote:
> > > This is not a replacement for the ARM asm (which is still better), but
> > > it's a step towards avoiding the need to have written-by-hand assembly
> > > for every single new arch we add as a prerequisite for tolerable
> > > performance.
> > 
> > Sorry if this has been discussed before but Google isn't much help.  Why
> > is 32 bytes chosen as the block size over other sizes?
> > 
> > It seems that the code would be fewer lines if blocks were 4 bytes,
> Sorry, I now see why 4 byte blocks won't work due to the misalignment,
> but 8 or 16 seem like they should be possible.
> Is it just the evaluation of the for loop being expensive that's trying
> to be avoided?

The reasons are purely empirical. 8 is the smallest block size that
would work without extra logic to shuffle w/x. With 16-byte blocks the
code runs 50% slower than the ARM asm; with 32-byte blocks it runs
only 25% slower.
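
To illustrate the w/x point, here is a minimal sketch of the
shift-and-merge idea for a misaligned source. It is not the memcpy
code from this thread: copy_shifted, its parameters, and the LS/RS
macros (which assume the GCC/Clang predefined __BYTE_ORDER__ macros)
are hypothetical names chosen for the example. With two words (8
bytes) per iteration, w and x simply swap roles, so no "w = x" shuffle
is needed at the end of the loop body.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift directions depend on byte order. Assumption: the GCC/Clang
 * predefined macros are available; otherwise little-endian is assumed. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define LS <<
#define RS >>
#else
#define LS >>
#define RS <<
#endif

/*
 * Hypothetical helper for illustration only: fill dst[0..nwords-1] with
 * the bytes that begin k bytes (k = 1, 2 or 3) into src[0].
 * Reads src[0..nwords]; nwords must be even.
 */
static void copy_shifted(uint32_t *dst, const uint32_t *src,
                         size_t nwords, unsigned k)
{
	unsigned lo = 8 * k, hi = 32 - 8 * k;
	uint32_t w = src[0], x;
	size_t i;

	/* 8 bytes per iteration: w carries the leftover bytes in, x takes
	 * over that role halfway through, and w holds the carry again by
	 * the time the next iteration starts. */
	for (i = 0; i + 2 <= nwords; i += 2) {
		x = src[i + 1];
		dst[i] = (w LS lo) | (x RS hi);
		w = src[i + 2];
		dst[i + 1] = (x LS lo) | (w RS hi);
	}
}
```

A 4-byte version of this loop would need an extra "w = x" at the end
of each iteration; larger blocks just repeat the same pair of steps
more times per loop test.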

