Date: Wed, 21 Sep 2022 13:58:17 -0400
From: Rich Felker <dalias@...c.org>
To: 王志强 <00107082@....com>
Cc: musl@...ts.openwall.com, Quentin Rameau <quinq@...th.space>,
	Florian Weimer <fweimer@...hat.com>
Subject: Re: Re:Re: The heap memory performance
 (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to
 1.1)

On Wed, Sep 21, 2022 at 01:15:35PM -0400, Rich Felker wrote:
> On Wed, Sep 21, 2022 at 06:15:02PM +0800, 王志强 wrote:
> > Hi Rich,
> > 
> > 
> > 
> > I am quite interested in the topic, and made a comparison between glibc and musl with the following code:
> > ```c
> > #include <stdlib.h>
> > 
> > #define MAXF 4096
> > void* tobefree[MAXF];
> > int main() {
> >     long long i;
> >     int v, k;
> >     size_t s, c=0;
> >     char *p;
> >     for (i=0; i<100000000L; i++) {
> >         v = rand();
> >         s = ((v%256)+1)*1024;   /* random size, 1k .. 256k */
> >         p = (char*) malloc(s);
> >         p[1023]=0;              /* touch the first page */
> >         if (c>=MAXF) {          /* pool full: free a random earlier block */
> >             k = v%c;
> >             free(tobefree[k]);
> >             tobefree[k]=tobefree[--c];
> >         }
> >         tobefree[c++]=p;
> >     }
> >     return 0;
> > }
> > ```
> > 
> > The results show a significant difference.
> > With glibc (running within a Debian docker image):
> > # gcc -o m.debian -O0 app_malloc.c
> > 
> > # time ./m.debian
> > real    0m37.529s
> > user    0m36.677s
> > sys    0m0.771s
> > 
> > With musl (running within an alpine3.15 docker image):
> > 
> > # gcc -o m.alpine -O0 app_malloc.c
> > 
> > # time ./m.alpine
> > real    6m 30.51s
> > user    1m 36.67s
> > sys    4m 53.31s
> > 
> > 
> > 
> > musl seems to spend way too much time in the kernel, while glibc keeps
> > most of the work in userspace.
> > I used perf_event_open to profile those programs.
> > The musl profile (302899 samples total) shows that the malloc/free
> > sequence spends lots of time dealing with pagefault/munmap/madvise/mmap:
> > 
> > munmap(30.858% 93469/302899)
> > _init?(22.583% 68404/302899)
> >     aligned_alloc?(89.290% 61078/68404)
> >         asm_exc_page_fault(45.961% 28072/61078)
> >     main(9.001% 6157/68404)
> >         asm_exc_page_fault(29.170% 1796/6157)
> >     rand(1.266% 866/68404)
> > aligned_alloc?(20.437% 61904/302899)
> >     asm_exc_page_fault(56.038% 34690/61904)
> > madvise(13.275% 40209/302899)
> > mmap64(11.125% 33698/302899)
> > 
> > 
> > But the glibc profile (29072 samples total) is much lighter; page faults
> > are the main cost, while glibc spends significant time on "free":
> > 
> > 
> > 
> > pthread_attr_setschedparam?(82.021% 23845/29072)
> >     asm_exc_page_fault(1.657% 395/23845)
> > _dl_catch_error?(16.714% 4859/29072)
> >     __libc_start_main(100.000% 4859/4859)
> >         cfree(58.839% 2859/4859)
> >         main(31.138% 1513/4859)
> >             asm_exc_page_fault(2.115% 32/1513)
> >         pthread_attr_setschedparam?(3.725% 181/4859)
> >         random(2.099% 102/4859)
> >         random_r(1.832% 89/4859)
> >         __libc_malloc(1.420% 69/4859)
> > It seems to me that glibc caches the memory it obtains from the kernel
> > and thereby avoids lots of page faults and syscalls.
> > Should this performance difference concern real-world applications?
> > On average, musl actually spends about 3~4µs per malloc/free here, which
> > is quite acceptable in real-world applications, I think.
> > 
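For anyone who wants to reproduce this style of measurement, a minimal
self-monitoring counter built on perf_event_open might look like the sketch
below. This is not the sampling profiler behind the call trees above; it only
counts software page faults around a workload, and the workload loop here is
an arbitrary stand-in:

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

/* Thin wrapper: libc provides no perf_event_open() symbol. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_PAGE_FAULTS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* so it works with the default
                                  perf_event_paranoid setting */

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* stand-in workload: large allocations, each touched once */
    for (int i = 0; i < 100000; i++) {
        char *q = malloc(200*1024);
        q[0] = 1;
        free(q);
    }

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    if (read(fd, &count, sizeof count) != sizeof count) perror("read");
    printf("page faults: %lld\n", count);
    close(fd);
    return 0;
}
```
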
> > 
> > 
> > (It seems to me that the performance difference has nothing to do with
> > malloc_usable_size, which may indeed have been just a speculative guess
> > without any basis.)
> 
> Indeed this has nothing to do with it. What you're seeing is just that
> musl/mallocng returns freed memory to the kernel, and glibc, basically,
> doesn't (modulo the special case of a large contiguous free block at the
> 'top' of the heap). This inherently has a time cost.
> 
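
One way to see this difference directly is to watch the process's resident
set size across a batch of frees. A rough sketch (not from this thread; the
/proc parsing is Linux-specific and the 64k/1024-block numbers are arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read VmRSS (in kB) from /proc/self/status. */
static long vm_rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f) return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
    fclose(f);
    return kb;
}

int main(void)
{
    enum { N = 1024, SZ = 64*1024 };
    static char *p[N];
    for (int i = 0; i < N; i++) {
        p[i] = malloc(SZ);
        memset(p[i], 0, SZ);           /* touch it so it shows up in RSS */
    }
    printf("RSS after alloc: %ld kB\n", vm_rss_kb());
    for (int i = 0; i < N; i++) free(p[i]);
    printf("RSS after free:  %ld kB\n", vm_rss_kb());
    /* With mallocng the freed pages are unmapped or marked reclaimable
     * (MADV_FREE); with glibc most of them typically stay cached in the
     * arena, so the second number barely moves. */
    return 0;
}
```
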
> mallocng does make significant efforts to avoid hammering mmap/munmap
> under repeated malloc/free, at least in cases where it can reasonably
> be deemed to matter. However, this is best-effort, and always a
> tradeoff on (potential) large unwanted memory usage vs performance.
> More on this later.
> 
> Your test case, with the completely random size distribution across
> various large sizes, is likely a worst case. The mean size you're
> allocating is 128k, which is the threshold for direct mmap/munmap of
> each allocation, so at least half of the allocations you're making can
> *never* be reused, and will always be immediately unmapped on free. It
> might be interesting to change the scaling factor from 1k to 256 bytes
> so that basically all of the allocation sizes are in the
> malloc-managed range.
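
For concreteness, that change is just the size line of the quoted test
program; a sketch of the adjusted test follows (note that p[1023] also has to
become p[s-1], since the smallest request is now only 256 bytes):

```c
#include <stdlib.h>

#define MAXF 4096
static void *tobefree[MAXF];

int main(void)
{
    size_t c = 0;
    for (long long i = 0; i < 100000000LL; i++) {
        int v = rand();
        size_t s = ((v % 256) + 1) * 256;   /* 256 bytes .. 64k: all below
                                               the direct-mmap threshold */
        char *p = malloc(s);
        p[s-1] = 0;                         /* touch the allocation */
        if (c >= MAXF) {                    /* pool full: free a random block */
            int k = v % (int)c;
            free(tobefree[k]);
            tobefree[k] = tobefree[--c];
        }
        tobefree[c++] = p;
    }
    return 0;
}
```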

One observation if this change is made: it looks like at least 70% of
the time is spent performing madvise(MADV_FREE), and that a large
portion of the rest (just looking at strace) seems to be repeatedly
mapping and freeing a 17-page (68k) block, probably because this size
happens to be at the boundary of some threshold where bounce
protection isn't happening. I think we should look at both of these in
more detail, since they both suggest opportunities for large
performance improvements at low cost.
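
(For readers unfamiliar with MADV_FREE: the kernel interface involved looks
roughly like the sketch below. This only illustrates the syscall itself, not
mallocng's internal bookkeeping, and it needs Linux 4.5 or later.)

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64*1024;
    char *p = mmap(0, len, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    memset(p, 'x', len);                 /* fault the pages in */

    /* Hint that the contents are disposable: the pages stay mapped and
     * are only reclaimed lazily under memory pressure, but each hint is
     * still a syscall -- which is where the madvise time in the profile
     * above comes from. */
    if (madvise(p, len, MADV_FREE) < 0) perror("madvise");

    p[0] = 'y';                          /* writing again cancels the hint
                                            for the touched page */
    munmap(p, len);
    return 0;
}
```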

Rich
