musl - Re: Optimized C memcpy



Message-ID: <20130808202615.GR221@brightrain.aerifal.cx>
Date: Thu, 8 Aug 2013 16:26:15 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Optimized C memcpy

On Fri, Aug 09, 2013 at 08:17:22AM +1200, Andre Renaud wrote:
> Hi Rich,
> From the looks of the code, compared to the original bionic assembly,
> I assume the remaining speed difference is caused by the C code doing
> 8 discrete store operations, whereas the bionic code batches these
> all up into registers and does them as a single multiple-store. Would
> it be worth having a structure with 8 32-bit ints in it, and doing a
> single write to d of one of these (hoping that gcc will catch it and
> turn it into a stm instruction)? It unfortunately runs the risk that
> gcc will decide a 32-byte copy is worth using memcpy for, resulting in
> the recursive issue you've seen previously.

In principle, the ARM codegen calls to memcpy (these are different
from the optimizer's calls to memcpy, which we already know how to
turn off) only happen when the compiler cannot determine that the
accesses are aligned. You can experiment with using a structure, but I
think there's also a risk that the compiler will not allocate enough
registers for it and will thereby end up using temp space on the
stack. If you want to experiment, just do one of the cases, not all
three, to make it easier.
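
A minimal sketch of the structure experiment discussed above, for one
case only (source and destination both 4-byte aligned). The names
struct block32 and copy_aligned_blocks are hypothetical and not taken
from musl or bionic; may_alias is a GCC/Clang extension. Whether the
compiler actually emits ldm/stm for the struct assignment, spills to
the stack, or falls back to a memcpy call is exactly what would need
checking in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-byte block: eight 32-bit words copied as one unit.
 * may_alias (GCC/Clang extension) permits access through this type
 * to memory of any other type. */
struct block32 {
	uint32_t w[8];
} __attribute__((__may_alias__));

/* Copy n bytes from src to dest.
 * Assumption: both pointers are 4-byte aligned and do not overlap. */
void *copy_aligned_blocks(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* One struct assignment per 32 bytes; the hope is that the
	 * compiler lowers this to a multiple-load/multiple-store pair. */
	for (; n >= sizeof(struct block32); n -= sizeof(struct block32)) {
		*(struct block32 *)d = *(const struct block32 *)s;
		d += sizeof(struct block32);
		s += sizeof(struct block32);
	}

	/* Remaining tail, byte by byte. */
	while (n--) *d++ = *s++;

	return dest;
}
```

Compiling this with optimization and -S for an ARM target and reading
the loop body should show which of the outcomes above the compiler
picked.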

Rich
