Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sat, 11 Feb 2023 10:22:49 +0100
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: Re:Re: qsort

On Mon, Jan 30, 2023 at 06:04:39PM +0800, David Wang wrote:
> On average, qsort with musl is significantly slower than c++ sort,
> while with glibc the average performance of qsort is bettern than c++
> sort.

Getting back to this, with the benchmarks recently performed, I notice
one area where C++ is able to significantly outperform C, and that is in
the area of generic functions. C++ std::sort is a template function, and
the compiler can optimize the code for the situation. For example, in
your case, you only ever need to swap two elements by way of copying
four bytes, which the compiler can just inline.

In C, the same code can only call memcpy(), and a call to cycle() will
involve many calls to memcpy(), and they cannot be inlined since the
compiler will not know the copy size at the time it is compiling the
calls. We end up with tons of calls to memcpy() with a size of four,
when memcpy() is optimized for large sizes.

Merge sort, as glibc implements it, allows at least a few memcpy() calls
with larger sizes. Glibc also avoids calling memcpy by duplicating the
code for common sizes, only resorting to memcpy() for the default case.
That may be an avenue worth pursuing, at least for cycle().

I'm not quite certain why glibc chose the sizes they chose for
msort_with_tmp(), but they are branching out for 32-bit integers, 64-bit
integers, longs (which in a System V ABI are always one of the former
two) and void* (which in a System V ABI are always the same as a long).
I think special casing 32-bit and 64-bit integers ought to get us most
of the way there, since those are the most common sizes that can still
be copied simply inline.

Also note that glibc further reduces the need for memcpy() by sorting
indirectly if the size of a single element exceeds 32 bytes. That is of
course something they can do, since they already are allocating memory
and have a fall-back strategy in place. Not sure if this is worthwhile
for musl. At least in the static linking case, it would suddenly pull
malloc() and free() into executables that previously did not have those.

That last optimization would also not have any bearing on the current
batch of benchmarks, since in those, a size of four is set in stone.

Ciao,
Markus

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.