Date: Tue, 20 Sep 2022 01:32:31 +0800 From: baiyang <baiyang@...il.com> To: "Rich Felker" <dalias@...c.org> Cc: musl@...ts.openwall.com Subject: Re: Re: The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) Hi Rich, Thanks for your reply. > Unless you have an application that's explicitly using > malloc_usable_size all over the place, it's highly unlikely that this > is the cause of your real-world performance problems. 1. Yes, we have a real scenario where `malloc_usable_size` is called frequently: we need to optimize the realloc experience. We add an extra parameter to realloc - minimalCopyBytes: it represents the actual size of data that needs to be copied after fallback to malloc-copy-free mode. We will judge whether to call realloc or complete malloc-memcpy-free by ourself based on factors such as the size of the data that realloc needs to copy (obtained through `malloc_usable_size`), the size that we actually need to copy when we doing malloc-memcpy-free ourself (minimalCopyBytes) and the chance of merging chunks (small blocks) or mremap (large blocks) in the underlayer realloc. So, this is a real scenario, we need to call `malloc_usable_size` frequently. 2. As I mentioned before, this isn't just a problem with `malloc_usable_size`, since we actually include a full `malloc_usable_size` procedure in both `realloc` and `free`, it actually slows down The speed of other calls such as `free` and `realloc`. So this problem actually slows down not only the `malloc_usable_size` call itself, but also the realloc and free calls. Thanks :-) -- Best Regards BaiYang baiyang@...il.com http://i.baiy.cn **** < END OF EMAIL > **** From: Rich Felker Date: 2022-09-19 21:43 To: baiyang CC: musl Subject: Re: [musl] The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Mon, Sep 19, 2022 at 03:53:30PM +0800, baiyang wrote: > Hi there, > > As we have discussed at > https://github.com/openwrt/openwrt/issues/10752. The > malloc_usable_size() function in musl 1.2 (mallocng) seems to have > some performance issues. > > It caused realloc and free spends too long time for get the chunk size. > > As we mentioned in the discussion, tcmalloc and some other > allocators can also accurately obtain the size class corresponding > to a memory block and its precise size, and it is also very fast at > the same time. > > Can we make some improvements to the existing malloc_usable_size > algorithm in mallocng? This should significantly improve the > performance of existing algorithms. Can you please start from a point of identifying the real-world case in which you're hitting a performance degredation? Made-up tests are generally not helpful and will almost always lead to focusing on the wrong problem. For now I'm going to focus on some things from the linked thread: > > Considering that realloc itself contains a complete > > malloc_usable_size (refer to here and here), So actually most > > (66.7%) of the realloc time is spent doing malloc_usable_size. In your test that increments the realloc size by one each iteration, only one in every PAGESIZE calls has any real work to do. The rest do nothing but set_size after obtaining the metadata on the object they're acting on. It's completely expected that the runtime of these will be dominated by obtaining the metadata; this isn't evidence of anything wrong. And, moreover, it's almost surely a lot more than 66.7%. Most of the 0.8s difference is likely spent on the 2560 mmap syscalls and page faults accessing the new pages they produce. > > In implementations such as: glibc, tcmalloc, msw crt (_msize), mac > > os x (malloc_size), and musl 1.1, even on low-end embedded > > processors, the consumption of malloc_usable_size per 10 million > > calls is mostly not more than a few hundred milliseconds. It looks like mallocng's malloc_usable_size is taking around 150 ns per call on your system, vs maybe 30-50 for others? > > In addition, this very slow slab size acquisition algorithm also > > needs to be called every time free (see here). So we believe it > > should be the main reason for malloc/free and realloc performance > > degradation in version 1.2. Unless you have an application that's explicitly using malloc_usable_size all over the place, it's highly unlikely that this is the cause of your real-world performance problems. The vast majority of reported problems with malloc performance have been in multithreaded applications, where the dominating time cost is fundamental: synchronization cost of having global consistency. There you'll expect to find very similar performance figures from any other allocator with global consistency, such as hardened_malloc. If you're really having single-threaded performance problems that aren't just in made-up benchmarks, please see if you can narrow down the cause empirically rather than speculatively. For example, running the program under perf and looking at where the time is being spent. > > If we can improve its speed and make it close to implementations > > like tcmalloc (tcmalloc can also accurately return the size of the > > size class to which the chunk belongs), it should significantly > > improve the performance of mallocng (at least in single-threaded > > scenarios) . tcmalloc is fast by not having global consistency, not being hardened against memory errors like double-free and use-after-free, and not avoiding fragmentation and excessive memory usage. Likewise for most of the others. The run time costs in mallocng for looking up the out-of-band metadata are largely fundamental to it being out-of-band (not subject to direct falsification via typically exploitable application bugs), size-efficient, 32-bit-compatible, nommu-compatible, etc. Other approaches like in hardened_malloc can be moderately more efficient to access the metadata, at the price of not being at all amenable to small systems, which are a core goal of musl we can't really disregard. I can't say for sure there's not any room for optimization in the metadata fetching though. Looking at the assembly output might be informative, to see if we're doing anything that's making the compiler emit gratuitously inefficient code. Rich Content of type "text/html" skipped
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.