Date: Wed, 21 Sep 2022 13:15:35 -0400
From: Rich Felker <dalias@...c.org>
To: 王志强 <00107082@....com>
Cc: musl@...ts.openwall.com, Quentin Rameau <quinq@...th.space>,
	Florian Weimer <fweimer@...hat.com>
Subject: Re: Re:Re: The heap memory performance
 (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to
 1.1)

On Wed, Sep 21, 2022 at 06:15:02PM +0800, 王志强 wrote:
> Hi Rich,
> 
> 
> 
> I am quite interested in the topic, and made a comparison between glibc and musl with the following code:
> ```
> #include <stdlib.h>
> 
> #define MAXF 4096
> void* tobefree[MAXF];
> int main() {
>     long long i;
>     int v, k;
>     size_t s, c=0;
>     char *p;
>     for (i=0; i<100000000L; i++) {
>         v = rand();
>         s = ((v%256)+1)*1024;      /* random size, 1kB..256kB */
>         p = (char*) malloc(s);
>         p[1023]=0;                 /* touch one byte so a page is faulted in */
>         if (c>=MAXF) {             /* keep at most MAXF live allocations */
>             k = v%c;
>             free(tobefree[k]);     /* free a randomly chosen one */
>             tobefree[k]=tobefree[--c];
>         }
>         tobefree[c++]=p;
>     }
>     return 0;
> }
> ```
> 
> The results show a significant difference.
> With glibc (running within a debian docker image):
> # gcc -o m.debian -O0 app_malloc.c
> 
> # time ./m.debian
> real    0m37.529s
> user    0m36.677s
> sys    0m0.771s
> 
> With musl (running within an alpine3.15 docker image):
> 
> # gcc -o m.alpine -O0 app_malloc.c
> 
> # time ./m.alpine
> real    6m 30.51s
> user    1m 36.67s
> sys    4m 53.31s
> 
> 
> 
> musl seems to spend way too much time in the kernel, while glibc
> keeps most of the work in userspace.
> I used perf_event_open to profile the two programs.
> The musl profile (302899 samples total) shows that the malloc/free
> sequence spends a lot of time handling page faults and
> munmap/madvise/mmap calls:
> 
> munmap(30.858% 93469/302899)
> _init?(22.583% 68404/302899)
>   aligned_alloc?(89.290% 61078/68404)
>     asm_exc_page_fault(45.961% 28072/61078)
>   main(9.001% 6157/68404)
>     asm_exc_page_fault(29.170% 1796/6157)
>   rand(1.266% 866/68404)
> aligned_alloc?(20.437% 61904/302899)
>   asm_exc_page_fault(56.038% 34690/61904)
> madvise(13.275% 40209/302899)
> mmap64(11.125% 33698/302899)
> 
> 
> But the glibc profile (29072 samples total) is much lighter; page
> faults are the main kernel cost, while glibc spends significant time
> in "free":
> 
> pthread_attr_setschedparam?(82.021% 23845/29072)
>   asm_exc_page_fault(1.657% 395/23845)
> _dl_catch_error?(16.714% 4859/29072)
>   __libc_start_main(100.000% 4859/4859)
>     cfree(58.839% 2859/4859)
>     main(31.138% 1513/4859)
>       asm_exc_page_fault(2.115% 32/1513)
>     pthread_attr_setschedparam?(3.725% 181/4859)
>     random(2.099% 102/4859)
>     random_r(1.832% 89/4859)
>     __libc_malloc(1.420% 69/4859)
> It seems to me that glibc caches and reuses the memory it gets from
> the kernel, avoiding a lot of page faults and syscalls.
> Should this performance difference concern real-world applications?
> On average, musl actually spends about 3~4µs per malloc/free, which
> is quite acceptable in real-world applications, I think.
> 
> 
> 
> (It seems to me that the performance difference has nothing to do
> with malloc_usable_size; that may indeed have been just a speculative
> guess without any basis.)

Indeed this has nothing to do with it. What you're seeing is just that
musl/mallocng returns freed memory to the kernel, and glibc, basically,
doesn't (modulo the special case of a large contiguous free block at
the 'top' of the heap). This inherently has a time cost.
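
For a rough sense of that cost on the glibc side, here's a minimal
sketch (glibc-specific and purely illustrative; malloc_trim() is a
glibc extension, and the loop count and size are arbitrary) that forces
glibc to hand its free memory back on every iteration, so it pays a
similar price:

#include <malloc.h>   /* glibc extension: malloc_trim() */
#include <stdlib.h>

int main() {
    /* 64kB is below glibc's default 128k mmap threshold, so these
       allocations are served from the heap rather than by direct
       mmap/munmap */
    for (int i = 0; i < 1000000; i++) {
        char *p = malloc(64*1024);
        if (!p) return 1;
        p[0] = 0;
        free(p);
        malloc_trim(0);   /* give trimmable free memory back to the kernel */
    }
    return 0;
}

With the trim call in place, glibc ends up doing roughly the same kind
of per-free kernel traffic that shows up in the musl numbers above.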

mallocng does make significant efforts to avoid hammering mmap/munmap
under repeated malloc/free, at least in cases where it can reasonably
be deemed to matter. However, this is best-effort, and always a
tradeoff on (potential) large unwanted memory usage vs performance.
More on this later.
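
To make the tradeoff concrete, here's a minimal sketch of the two
general ways an allocator can give a range of whole pages back to the
kernel (illustration only, not mallocng's actual code; the size and
flags are arbitrary):

#define _DEFAULT_SOURCE
#include <sys/mman.h>

int main(void)
{
    size_t len = 256*1024;
    /* stand-in for an allocator-owned region of whole pages */
    void *p = mmap(0, len, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    /* Option 1: keep the mapping but mark the contents disposable; the
       kernel may reclaim the pages lazily under memory pressure, and
       reusing the range later needs no new mmap() call.
       (Linux-specific advice flag.) */
    madvise(p, len, MADV_FREE);

    /* Option 2: drop the mapping entirely; lowest memory use, but any
       reuse needs a fresh mmap() and new page faults. */
    munmap(p, len);

    return 0;
}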

Your test case, with the completely random size distribution across
various large sizes, is likely a worst case. The mean size you're
allocating is 128k, which is the threshold for direct mmap/munmap of
each allocation, so at least half of the allocations you're making can
*never* be reused, and will always be immediately unmapped on free. It
might be interesting to change the scaling factor from 1k to 256 bytes
so that basically all of the allocation sizes are in the
malloc-managed range.
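
Concretely, that would just mean changing the size computation in the
test above, something like (untested sketch; note the p[1023] write
also has to move, since the smallest sizes would then be under 1024
bytes):

        s = ((v%256)+1)*256;   /* 256B..64kB: all below the mmap threshold */
        p = (char*) malloc(s);
        p[s-1] = 0;            /* touch the allocation without overflowing it */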

Now, I mentioned above "when it can reasonably be deemed to matter".
One of the assumptions mallocng makes for deciding where to risk
larger memory usage for the sake of performance is that you're
*actually going to use* the memory you're allocating. Yes, 6min might
seem like a lot of time to do nothing, but that "doing nothing" was
churning through allocating 12 TB of memory (10^8 allocations
averaging ~128.5kB each). How much time would you have spent *actually
processing* 12 TB of data (and not as flat linear data, but
avg-128kB chunks), and what percentage of that time would the 6min in
malloc have been?

Rich
