musl - Re: qsort

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230216160734.GV4163@brightrain.aerifal.cx>
Date: Thu, 16 Feb 2023 11:07:35 -0500
From: Rich Felker <dalias@...c.org>
To: David Wang <00107082@....com>
Cc: musl@...ts.openwall.com
Subject: Re: qsort

On Thu, Feb 16, 2023 at 11:15:24PM +0800, David Wang wrote:
> 
> At 2023-02-02 02:01:15, "Markus Wichmann" <nullplan@....net> wrote:
> 
> >Anyway, there is an inefficiency in musl's smoothsort implementation,
> >namely in the sift() function:
> >
> >
> >| if(cmp(ar[0], lf, arg) >= 0 && cmp(ar[0], rt, arg) >= 0) {
> >| 	break;
> >| }
> >| if(cmp(lf, rt, arg) >= 0) {
> >| 	ar[i++] = lf;
> >| 	head = lf;
> >| 	pshift -= 1;
> >| } else {
> >| 	ar[i++] = rt;
> >| 	head = rt;
> >| 	pshift -= 2;
> >| }
> >
> >As you can see, this does two or three comparisions, but actually needs
> >to do only two in any case: If we detect ar[0] >= lf && ar[0] < rt, then
> >by transitivity we have also detected rt > lf, so there is no need to
> >compare lf and rt directly anymore. I have not really found a good way
> >to formulate code that takes advantage of that, maybe something like
> >this?
> >
> >
> >		if(cmp(ar[0], lf, arg) < 0) {
> >			if(cmp(lf, rt, arg) >= 0) {
> >				ar[i++] = lf;
> >				head = lf;
> >				pshift -= 1;
> >			} else {
> >go_right:
> >				ar[i++] = rt;
> >				head = rt;
> >				pshift -= 2;
> >			}
> >		} else if (cmp(ar[0], rt, arg) < 0) {
> >			goto go_right;
> >		} else {
> >			break;
> >		}
> >
> >Yeah, not great. But it does use only two compares ever. This ought to
> >lower the number of compares in your test somewhat.
> >
> >HTH,
> >Markus
> 
> 
> I tried this change, but improvement is not much.
> When sorting 1<<20 random items, my test shows average count of
> comparisons decrease from 2.733*nlogn to 2.717*nlogn, improved less
> than 1%
> Three comparisons, in the original code ,only happen when parent is
> larger than its left child, and smaller then its right child, if
> (parent, left, right) are independent, the probability for this
> order right>parent>left is 1/6, the expected comparison would be
> 3*1/6+2*5/6 = 13/6 for the original code, and we can get 1/13
> improvement, but heapsort procedure may make the probability much
> less than 1/6 since every round a leaf node is put to the head, and
> so I got less than 1% improvement.

OK, so it looks like this doesn't really matter. Kinda disappointing
but at least it means we don't need to do anything.

> Strange thing is, I made a textbook implementation of heapsort as in
> following patch, optimized for size 8, the runtime and comparison
> improved a lot.

Do you have a good theoretical reason why that might be the case?

> Here is what I collect
> +------------+-----------+-------------+
> |            |  runtime  |  comparison |
> +------------+-----------+-------------+
> |   glibc    | 0m32.093s | 0.937*nlogn |
> |  heapsort  | 0m54.708s | 1.846*nlogn |
> | smoothsort | 1m13.137s | 2.733*nlogn | <-- this version of smoothsort has optimized for 8byte data item.
> +------------+-----------+-------------+
> 
> I think smoothsort has worse average performance than heapsort based
> on the number of comparisons made.
> But the ~2*nlogn comparison of heapsort could not beat ~1*nlogn of
> merge sort, used by glibc.

Are those comparison counts just measured, or do they have some
theoretical basis? It'd be nice to know if we're possibly doing
something suboptimal still or if smoothsort is really more expensive
for a fundamental reason.

> So back to this thread
> 
> > On Thu, 9 Feb 2023, Rich Felker wrote:
> >
> >Before we consider any such change though we should probably see
> >profiling data to determine where the time is being spent in
> >smoothsort, in case it's stupid stuff that's fixable.
> 
> I think, optimize for small size 4/8/16 is practical, since feeding
> large item to qsort is less efficient than feeding an index or
> address to qsort, in my opinion, a typical structure for sorting
> would be { long long k, v; } or something alike, where k is
> index/address referring to the real data. The performance improved a
> lot if optimize out those memcpy call.

Yes, I think optimizing out the external memcpy calls for small sizes
is a good idea regardless of whatever else we do. The code you tested
is not valid (it has UB via aliasing violations) but we could fix it
either using __may_alias__ types conditioned on __GNUC__, or exposing
memcpy intrinsics where they don't result in circular definitions with
some magic in src/include/string.h. I'd like to do the latter at some
point anyway.

> And further improvement could be made, for the average performance,
> if switch back to heapsort. (Hope someone else can help confirm
> those two above).

This is kinda disappointing but might make sense if true.

If an introsort that switches from quicksort to heapsort is faster
still, that would also be a candidate.

> But heapsort in theory is always significantly slower than
> mergesort, I think it should be about 2 time slower when using full
> set of benchmark tests

Mergesort is simply not a candidate unless there's a way to make
in-place merge practical, but as I understand it that's prohibitively
costly, and I've never really seen a comprehensible implementation
that was convincingly correct. If I'm wrong on this and it is doable,
we could consider that.

Just using malloc opporunistically is not appropriate behavior. Even
if it succeeds, it could gratuitously consume vast amounts of memory,
making *other operations* in the same processs or other processes fail
when they should succeed. It's pretty basic to the principles of musl
that we don't do stuff like that (unnecessary allocation).

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.