Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 30 Jul 2013 23:23:15 -0400
From: Rich Felker <>
To: Harald Becker <>
Subject: Re: ARM memcpy post-0.9.12-release thread

On Wed, Jul 31, 2013 at 05:13:47AM +0200, Harald Becker wrote:
> Hi Rich !
> 30-07-2013 22:26 Rich Felker <>:
> > Some rough times (128k copy repeated 10000 times):
> > 
> > Aligned case:
> > Current C code: 1.2s
> > My best-attempt C code: 0.75s
> > My best-attempt inline asm: 0.57s
> > Bionic asm: 0.63s
> > Bionic asm without prefetch: 0.57s
> > 
> > Misaligned case:
> > Current C code: 4.7s
> > My best-attempt inline asm: 2.9s
> > Bionic asm: 1.1s
> I like to throw in a question, as my cent to this topic:
> Does modern C Compiler not try to align all data types? So
> following this path in most cases aligned data structures are
> used and copying them around usually hit the aligned case. The

Yes but these are small anyway and the compiler will be generating
inline code to copy them with ldmia/stmia.

> misaligned case happens mostly due to working with strings, and
> those are usually short. Can't we consider other misaligned cases
> violation of the programmer or code generator? If so, I would
> prefer the best-attempt inline asm versions of code or even
> best attempt C code over arch specific asm versions ... and add

Part of the problem discussed on #musl was that I was having to be
really careful with "best attempt C" since GCC will _generate_ calls
to memcpy for some code, even when -ffreestanding is used. The folks
on #gcc claim this is not a bug. So, if compilers deem themselves at
liberty to make this kind of transformation, any C implementation of
memcpy that's not intentionally crippled (e.g. using volatile temps
and 20x slower than it should be) is a time-bomb that might blow up on
us with the next GCC version...

This makes asm (either inline or standalone) a lot more appealing for
memcpy than it otherwise would be.

> a warning for performance lose on misaligned data in
> documentation, with giving a rough percentage of this lose.

You'd prefer video processing being 4 to 5 times slower? Video
typically consists of single-byte samples (planar YUV) and operations
like cropping to a non-multiple-of-4 size, motion compensation, etc.
all involve misaligned memcpy. Same goes for image transformations in
gimp, image blitting in web browsers (not necessarily aligned to
multiple-of-4 boundaries unless you're using 32bpp), etc...


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.