|
|
Message-ID: <CAKHv7pj=HvDsPpMT-JQ7Q0PzD7SQkRP7qS=tPMSe2AcMa9BKaQ@mail.gmail.com>
Date: Fri, 22 May 2026 12:15:19 +0200
From: Paul Schutte <sjpschutte@...il.com>
To: musl@...ts.openwall.com, Justine Tunney <jtunney@...il.com>
Subject: Re: [PATCH] Make qsort 50% faster
Hi all
I apologize if I am going slightly off topic here, but this discussion
reminds me of some observations I made regarding memcpy
visibility/inlining.
On x86_64 (Void Linux), I observed the following with one of my
hashmap benchmarks.
When linked against musl:
./obj/last/Ember_test2 ~/enwiki-20230720-all-titles-in-ns0.shuf
Insertion took: 3.132s
Failed lookups: 0
Lookups took: 2.589s
When linked against glibc:
./obj/last/Ember_test2 ~/enwiki-20230720-all-titles-in-ns0.shuf
Insertion took: 2.345s
Failed lookups: 0
Lookups took: 2.593s
I then replaced libc memcpy with this very naive implementation while
still linking against musl:
sub memcpy -> Pointer:Byte
args
d Pointer:Byte cp
s Pointer:Byte cp
len Word cp
for (svar Word i:=0; i < len; inc i)
d:(i) := s:(i)
return d
which resulted in:
./obj/last/Ember_test2 ~/enwiki-20230720-all-titles-in-ns0.shuf
Insertion took: 2.349s
Failed lookups: 0
Lookups took: 2.570s
This makes me wonder whether visibility/inlining opportunities around
memcpy may play a larger role than expected in some workloads,
especially for smaller or compiler-visible copy sizes.
Kind Regards
Paul
On Fri, May 22, 2026 at 11:07 AM Szabolcs Nagy <nsz@...t70.net> wrote:
>
> * Justine Tunney <jtunney@...il.com> [2026-05-21 19:29:36 -0700]:
> > This change makes musl qsort go 50% faster on common sizes by avoiding
> > memcpy calls. On x86-64 with cc -Os it can be as much as 2x faster. No
> > changes have been made to the smoothsort algorithm itself. This simple
> > patch brings musl much closer to other C libraries in terms of sorting
> > perf, while maintaining smoothsort's highly desirable properties, e.g.
> > small size, stack safety, signal safety, no malloc dependency, etc.
> >
> > The cost is +670 bytes of .text (on x86-64) to specialize cycle() for
> > widths 4, 8, 12, 16, 20, 24, 28, and 32 (gcc, without stack protector).
> > But if you want to be like OpenBSD's qsort and only specialize 4 and 8,
> > then you're looking at a +140 byte increase. I chose the upper bound of
> > what's practical; after 32, gcc stops inlining memcpy cleanly.
> >
> > Further analysis and benchmark numbers are available in my repo:
> > https://github.com/jart/musl-qsort-speedup
>
> i think upstream musl would need
> __builtin_memcpy for this to work.
>
> > +/* specializes cycle() for constant width */
> > +#define DECLARE_CYCLE(NAME, WIDTH) \
> > + static void NAME(size_t width, unsigned char* ar[], long n) \
> > + { \
> > + long i; \
> > + char t[WIDTH]; \
> > + if(n < 2) \
> > + return; \
> > + memcpy(t, ar[0], WIDTH); \
> > + for (i = 0; i < n - 1; i++) \
> > + memcpy(ar[i], ar[i + 1], WIDTH); \
> > + memcpy(ar[n - 1], t, WIDTH); \
> > + }
> > +
> > +DECLARE_CYCLE(cycle4, 4)
> > +DECLARE_CYCLE(cycle8, 8)
> > +DECLARE_CYCLE(cycle12, 12)
> > +DECLARE_CYCLE(cycle16, 16)
> > +DECLARE_CYCLE(cycle20, 20)
> > +DECLARE_CYCLE(cycle24, 24)
> > +DECLARE_CYCLE(cycle28, 28)
> > +DECLARE_CYCLE(cycle32, 32)
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.