Date: Sat, 11 Feb 2023 08:35:33 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: qsort

On Sat, Feb 11, 2023 at 10:06:02AM +0100, alice wrote:
> On Sat Feb 11, 2023 at 9:39 AM CET, Joakim Sindholt wrote:
> > On Sat, 11 Feb 2023 06:44:29 +0100, "alice" <alice@...ya.dev> wrote:
> > > based on the glibc profiling, glibc also has their natively-loaded-cpu-specific
> > > optimisations, the _avx_ functions in your case. musl doesn't implement any
> > > SIMD optimisations, so this is a bit apples-to-oranges unless musl implements
> > > the same kind of native per-arch optimisation.
> > > 
> > > you should rerun these with GLIBC_TUNABLES, from something in:
> > > https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
> > > which should let you disable them all (if you just want to compare C to C code).
> > > 
> > > ( unrelated, but has there been some historic discussion of implementing
> > >   something similar in musl? i feel like i might be forgetting something. )
> >
> > There already are arch-specific asm implementations of functions like
> > memcpy.
> 
> apologies, i wasn't quite clear- the difference
> between src/string/x86_64/memcpy.s and the glibc fiesta is that the latter
> utilises subarch-specific SIMD (as you explain below), e.g. AVX like in the
> above benchmarks. a baseline x86_64 asm is more fair-game if the difference is
> as significant as it is for memcpy :)

Folks are missing the point here. It has nothing to do with AVX or
even with glibc's memcpy making glibc faster here. Rather, it's that
glibc is *not calling memcpy at all* for 4-byte element sizes (and
likely a bunch of other specialized cases). Either they special-case
them manually, or the compiler (due to the lack of -ffreestanding, and
likely -O3 or something) is inlining the memcpy.
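
For illustration, a hypothetical example (not glibc's or musl's actual
code) of how the call can disappear: once the size is a compile-time
constant, the compiler is free to expand the memcpy itself.

#include <string.h>

/* Hypothetical illustration, not code from either libc: because the
 * size here is a compile-time constant, GCC/Clang without
 * -ffreestanding will normally expand this memcpy into a single
 * 4-byte load and store, so no call to the library memcpy ever
 * appears in a profile. */
static inline void copy4(void *dst, const void *src)
{
	memcpy(dst, src, 4);
}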

Based on the profiling data, I would predict an instant 2x speed
boost from special-casing small sizes to swap directly with no memcpy
call.
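
Something along these lines is what I have in mind (a rough sketch
with made-up names, not musl's actual qsort internals):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch only: swap two elements, special-casing the
 * common 4- and 8-byte widths so the fixed-size memcpy calls get
 * inlined to plain loads/stores, with a byte loop as the fallback
 * for other sizes. */
static void swap_elems(unsigned char *a, unsigned char *b, size_t width)
{
	if (width == 4) {
		uint32_t t;
		memcpy(&t, a, 4); memcpy(a, b, 4); memcpy(b, &t, 4);
		return;
	}
	if (width == 8) {
		uint64_t t;
		memcpy(&t, a, 8); memcpy(a, b, 8); memcpy(b, &t, 8);
		return;
	}
	while (width--) {
		unsigned char t = *a;
		*a++ = *b;
		*b++ = t;
	}
}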

Incidentally, our memcpy is almost surely at least as fast as glibc's
for 4-byte copies. It's at very large sizes that performance is likely
to diverge.

Rich
