musl - Re: qsort

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45a265c9.2f3.1865acb64f0.Coremail.00107082@163.com>
Date: Thu, 16 Feb 2023 23:15:24 +0800 (CST)
From: "David Wang" <00107082@....com>
To: musl@...ts.openwall.com
Subject: Re: qsort





At 2023-02-02 02:01:15, "Markus Wichmann" <nullplan@....net> wrote:

>Anyway, there is an inefficiency in musl's smoothsort implementation,
>namely in the sift() function:
>
>
>| if(cmp(ar[0], lf, arg) >= 0 && cmp(ar[0], rt, arg) >= 0) {
>| 	break;
>| }
>| if(cmp(lf, rt, arg) >= 0) {
>| 	ar[i++] = lf;
>| 	head = lf;
>| 	pshift -= 1;
>| } else {
>| 	ar[i++] = rt;
>| 	head = rt;
>| 	pshift -= 2;
>| }
>
>As you can see, this does two or three comparisions, but actually needs
>to do only two in any case: If we detect ar[0] >= lf && ar[0] < rt, then
>by transitivity we have also detected rt > lf, so there is no need to
>compare lf and rt directly anymore. I have not really found a good way
>to formulate code that takes advantage of that, maybe something like
>this?
>
>
>		if(cmp(ar[0], lf, arg) < 0) {
>			if(cmp(lf, rt, arg) >= 0) {
>				ar[i++] = lf;
>				head = lf;
>				pshift -= 1;
>			} else {
>go_right:
>				ar[i++] = rt;
>				head = rt;
>				pshift -= 2;
>			}
>		} else if (cmp(ar[0], rt, arg) < 0) {
>			goto go_right;
>		} else {
>			break;
>		}
>
>Yeah, not great. But it does use only two compares ever. This ought to
>lower the number of compares in your test somewhat.
>
>HTH,
>Markus


I tried this change, but improvement is not much.
When sorting 1<<20 random items, my test shows average count of comparisons decrease from 2.733*nlogn to 2.717*nlogn, improved less than 1%
Three comparisons, in the original code ,only happen when parent is larger than its left child, and smaller then its right child, if (parent, left, right) are independent, the probability for this order right>parent>left is 1/6,  the expected comparison would be 3*1/6+2*5/6 = 13/6 for the original code, and we can get 1/13 improvement, but heapsort procedure may make the probability much less than 1/6 since every round a leaf node is put to the head, and so I got less than 1% improvement. 

Strange thing is, I made a textbook implementation of heapsort as in following patch, optimized for size 8, the runtime and comparison improved a lot.

The code patch is
diff --git a/src/stdlib/qsort_nr.c b/src/stdlib/qsort_nr.c
index 8ffe71d0..710b1aaa 100644
--- a/src/stdlib/qsort_nr.c
+++ b/src/stdlib/qsort_nr.c
@@ -1,4 +1,5 @@
 #define _BSD_SOURCE
+#include <stdint.h>
 #include <stdlib.h>
 
 typedef int (*cmpfun)(const void *, const void *);
@@ -8,7 +9,36 @@ static int wrapper_cmp(const void *v1, const void *v2, void *cmp)
 	return ((cmpfun)cmp)(v1, v2);
 }
 
+static void _heapifyx2(uint64_t *base, int s, int n, cmpfun cmp) {
+	int ms, ns;
+	uint64_t t;
+	while(1) {
+		ns=s*2+1;
+		if (ns>=n) break;
+		ms=s;
+		if (cmp(base+ms, base+ns)<0) ms=ns;
+		ns++;
+		if (ns<n&&cmp(base+ms, base+ns)<0) ms=ns;
+		if (s==ms) break;
+		t=base[ms]; base[ms]=base[s]; base[s]=t;
+		s=ms;
+	}
+}
+
+static void _hsortx2(uint64_t *base, size_t n, cmpfun cmp) {
+	int i;
+	uint64_t t;
+	for (i=n/2; i>=0; i--) _heapifyx2(base, i, n, cmp);
+	for (n--; n>0; n--) {
+		t=base[0]; base[0]=base[n]; base[n]=t;
+		_heapifyx2(base, 0, n, cmp);
+	}
+}
+
 void qsort(void *base, size_t nel, size_t width, cmpfun cmp)
 {
+	if (width==8) {
+		return _hsortx2(base, nel, cmp);
+	}
 	__qsort_r(base, nel, width, wrapper_cmp, (void *)cmp);
 }



And the test code is:
-----------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXM 20
#define MAXN 1<<MAXM


long long count=0;
typedef struct { int k, v; } VNode;
VNode vs[MAXN];
int mycmp(const void *a, const void *b) {
    count++;
    return ((const VNode*)a)->v - ((const VNode*)b)->v;
}


int main() {
    int i, j, k, n;
    double f;
    VNode t;
    srand(time(NULL));
    for (k=0; k<256; k++) {
        for (i=0; i<MAXN; i++) vs[i].v=i; 
        for (n=MAXN; n>1; n--) {
            i=n-1; j=rand()%n;
            if (i!=j) { t=vs[i]; vs[i]=vs[j]; vs[j]=t; }
        }
        qsort(vs, MAXN, sizeof(vs[0]), mycmp);
        for (i=0; i<MAXN; i++) if (vs[i].v!=i) { printf("error\n") ;return 1; }
    }
    f=count; f/=k; f/=MAXN; f/=MAXM; printf("count: %.3f*nlogn\n", f);
    return 0;
}
-----------------------

Here is what I collect
+------------+-----------+-------------+
|            |  runtime  |  comparison |
+------------+-----------+-------------+
|   glibc    | 0m32.093s | 0.937*nlogn |
|  heapsort  | 0m54.708s | 1.846*nlogn |
| smoothsort | 1m13.137s | 2.733*nlogn | <-- this version of smoothsort has optimized for 8byte data item.
+------------+-----------+-------------+

I think smoothsort has worse average performance than heapsort based on the number of comparisons made.
But the ~2*nlogn comparison of heapsort could not beat ~1*nlogn of merge sort, used by glibc.

So back to this thread

> On Thu, 9 Feb 2023, Rich Felker wrote:
>
>Before we consider any such change though we should probably see
>profiling data to determine where the time is being spent in
>smoothsort, in case it's stupid stuff that's fixable.
>
>Rich


I think, optimize for small size 4/8/16 is practical, since feeding large item to qsort is less efficient than feeding an index or address to qsort, in my opinion,  a typical structure for sorting would be  { long long k, v; } or something alike, where k is index/address referring to the real data. The performance improved a lot if optimize out those memcpy call.
And further improvement could be made, for the average performance, if switch back to heapsort.
 (Hope someone else can help confirm those two above).

But heapsort in theory is always significantly slower than mergesort, I think it should be about 2 time slower when using full set of benchmark tests





David
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.