Date: Tue, 9 Jul 2013 01:37:12 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Thinking about release

On Tue, Jul 09, 2013 at 05:06:21PM +1200, Andre Renaud wrote:
> Hi Rich,
> > I think the first step should be benchmarking on real machines.
> > Somebody tried the asm that was posted and claimed it was no faster
> > than musl's C code; I don't know the specific hardware they were using
> > and I don't even recall right off who made the claim or where it was
> > reported, but I think before we start writing or importing code we
> > need to have a good idea how the current C code compares in
> > performance to other "optimized" implementations.
> 
> In the interests of furthering this discussion (and because I'd like
> to start using musl as the basis for some of our projects, but the
> current speed degradation is noticeable), I've created some patches

Then it needs to be fixed. :-)

> that enable memcmp, memcpy & memmove ARM optimisations. I've ignored
> the str* functions, as these are generally not used on the same bulk
> data as the mem* functions, and as such the performance issue is less
> noticeable.

I think that's a reasonable place to begin. I do mildly question the
relevance of memmove to performance, so if we end up having to do a
lot of review or changes to get the asm committed, it might make sense
to leave memmove for later.

> Using a fairly rudimentary test application, I've benchmarked it as
> having the following speed improvements (this is all on an actual ARM
> board - 400MHz arm926ejs):
> memcpy: 160%
> memmove: 162%
> memcmp: 272%
> These numbers bring musl in line with glibc (at least on ARMv5).
> memcmp in particular seems to be faster (90MB/s vs 75MB/s on my
> platform).
> I haven't looked at using the __hwcap feature at this stage to swap
> between these implementations and NEON-optimised versions. I assume
> this can come later.
> 
> From a code size point of view (this is all with -O3), memcpy goes
> from 1996 to 1680 bytes, memmove goes from 2592 to 2088 bytes, and
> memcmp goes from 1040 to 1452 bytes, for a net decrease of 408 bytes
> overall.
> 
> The code is from NetBSD and Android (essentially unmodified), and it
> is all BSD 2-clause licensed.

At first glance, this looks like a clear improvement, but have you
compared it to much more naive optimizations? My _general_ experience
with complex, optimized memcpy asm like this, which goes out of its
way to deal explicitly with cache lines and such, is that it's no
faster than just naively moving large blocks at a time. Of course
this may or may not be the case on ARM, but I'd like to know if
you've done any tests.
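
To be concrete, by "naively moving large blocks" I mean something
along these lines (a rough, untested sketch, not actual musl code;
the word-typed accesses technically violate strict aliasing, so real
code would need may_alias or -fno-strict-aliasing):

#include <stddef.h>
#include <stdint.h>

/* Naive block copy: word-at-a-time fast path when both pointers
 * are word-aligned, no explicit cache-line or prefetch handling. */
void *naive_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if (((uintptr_t)d | (uintptr_t)s) % sizeof(size_t) == 0) {
        size_t *dw = (void *)d;
        const size_t *sw = (const void *)s;
        /* Move 4 words (16 bytes on 32-bit ARM) per iteration. */
        for (; n >= 4*sizeof(size_t); n -= 4*sizeof(size_t)) {
            dw[0] = sw[0]; dw[1] = sw[1];
            dw[2] = sw[2]; dw[3] = sw[3];
            dw += 4; sw += 4;
        }
        d = (void *)dw;
        s = (const void *)sw;
    }
    while (n--) *d++ = *s++;    /* byte tail / misaligned fallback */
    return dest;
}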

The basic principle in my mind here is that a complex solution is not
necessarily wrong if it's a big win in other ways, but that a complex
solution which is at most 1-2% faster than a much simpler solution is
probably not the best choice.

I also have access to a good test system now, by the way, so I could
do some tests too.
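
For reference, the sort of rudimentary throughput test I'd run is
something like this (untested sketch; buffer size and iteration count
are arbitrary, and you'd want both in-cache and out-of-cache sizes):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t len = 4 << 20;   /* 4 MiB: well out of cache on arm926ejs */
    int iters = 256;
    char *src = malloc(len), *dst = malloc(len);
    if (!src || !dst) return 1;
    memset(src, 'x', len);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* MB/s of data copied (not counting the read traffic). */
    printf("%.1f MB/s\n", (double)len * iters / secs / 1e6);
    return 0;
}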

> The git tree is available here:
> https://github.com/AndreRenaud/musl/commit/713023e7320cf45b116d1c29b6155ece28904e69

It's an open question whether it's better to sync something like this
with an 'upstream' or adapt it to musl coding conventions. Generally
musl uses explicit instructions rather than pseudo-instructions/macros
for prologue and epilogue, and does not use named labels.

> Does anyone have any comments on the suitability of this code, or what

If nothing else, it fails to be armv4 compatible. Fixing that should
not be hard, but it would require a bit of an audit. The return
sequences are the obvious issue (bx lr requires armv4t or later;
plain armv4 has to return via mov pc, lr), but there may be other
instructions in use that are not available on armv4, or maybe not
even on armv5...?

> kind of more rigorous testing could be applied?

See above.

What also might be worth testing is whether GCC can compete if you
just give it a naive loop (not the fancy pseudo-vectorized stuff
currently in musl) and good CFLAGS. I know on x86 I was able to beat
the fanciest asm strlen I could come up with simply by writing the
naive loop in C and unrolling it a lot. The only reason musl isn't
already using that version is that I suspect it hurts branch
prediction in the caller....
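
For illustration, the naive-but-unrolled approach is just this kind
of thing (rough sketch, not the exact code I tested):

#include <stddef.h>

/* Plain byte-at-a-time strlen, manually unrolled by 8. No
 * word-at-a-time bit tricks; the point is that the compiler can
 * schedule this well, and with aggressive CFLAGS GCC will do
 * similar unrolling to a plain loop on its own. */
size_t naive_strlen(const char *s)
{
    const char *p = s;
    for (;;) {
        if (!p[0]) return p - s;
        if (!p[1]) return p - s + 1;
        if (!p[2]) return p - s + 2;
        if (!p[3]) return p - s + 3;
        if (!p[4]) return p - s + 4;
        if (!p[5]) return p - s + 5;
        if (!p[6]) return p - s + 6;
        if (!p[7]) return p - s + 7;
        p += 8;
    }
}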

Rich
