Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANtdasTm_ZCzu+ZfjQjLHdcuAa5iUcD_8ceSos-w0wSiL9LV5Q@mail.gmail.com>
Date: Fri, 22 May 2026 08:14:18 -0700
From: Justine Tunney <jtunney@...il.com>
To: Justine Tunney <jtunney@...il.com>, musl@...ts.openwall.com
Subject: Re: [PATCH] Make qsort 50% faster

Yes the musl build system passes the -ffreestanding flag which implies
-fno-builtin. So Musl could solve this by simply adding the following
to its Makefile:

obj/src/stdlib/qsort.lo: src/stdlib/qsort.c $(GENH) $(IMPH)
        $(CC) $(filter-out -ffreestanding,$(CFLAGS_ALL)) -c -o $@ $<

I'm happy to mail out a v2 patch to this list which includes the above
change. Or I could modify the code to just use __builtin_memcpy. I
prefer to not write non-standard source, but it's a feature that's
been available since GCC 3.0 (2001) and is supported by Clang too.

On Fri, May 22, 2026 at 2:07 AM Szabolcs Nagy <nsz@...t70.net> wrote:
>
> * Justine Tunney <jtunney@...il.com> [2026-05-21 19:29:36 -0700]:
> > This change makes musl qsort go 50% faster on common sizes by avoiding
> > memcpy calls. On x86-64 with cc -Os it can be as much as 2x faster. No
> > changes have been made to the smoothsort algorithm itself. This simple
> > patch brings musl much closer to other C libraries in terms of sorting
> > perf, while maintaining smoothsort's highly desirable properties, e.g.
> > small size, stack safety, signal safety, no malloc dependency, etc.
> >
> > The cost is +670 bytes of .text (on x86-64) to specialize cycle() for
> > widths 4, 8, 12, 16, 20, 24, 28, and 32 (gcc, without stack protector).
> > But if you want to be like OpenBSD's qsort and only specialize 4 and 8,
> > then you're looking at a +140 byte increase. I chose the upper bound of
> > what's practical; after 32, gcc stops inlining memcpy cleanly.
> >
> > Further analysis and benchmark numbers are available in my repo:
> > https://github.com/jart/musl-qsort-speedup
>
> i think upstream musl would need
> __builtin_memcpy for this to work.
>
> > +/* specializes cycle() for constant width */
> > +#define DECLARE_CYCLE(NAME, WIDTH)                                   \
> > +     static void NAME(size_t width, unsigned char* ar[], long n)     \
> > +     {                                                               \
> > +             long i;                                                 \
> > +             char t[WIDTH];                                          \
> > +             if(n < 2)                                               \
> > +                     return;                                         \
> > +             memcpy(t, ar[0], WIDTH);                                \
> > +             for (i = 0; i < n - 1; i++)                             \
> > +                     memcpy(ar[i], ar[i + 1], WIDTH);                \
> > +             memcpy(ar[n - 1], t, WIDTH);                            \
> > +     }
> > +
> > +DECLARE_CYCLE(cycle4, 4)
> > +DECLARE_CYCLE(cycle8, 8)
> > +DECLARE_CYCLE(cycle12, 12)
> > +DECLARE_CYCLE(cycle16, 16)
> > +DECLARE_CYCLE(cycle20, 20)
> > +DECLARE_CYCLE(cycle24, 24)
> > +DECLARE_CYCLE(cycle28, 28)
> > +DECLARE_CYCLE(cycle32, 32)

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.