Message-ID: <abxvLi-IsFsH_VNI@devuan>
Date: Fri, 20 Mar 2026 00:16:00 +0100
From: Alejandro Colomar <une+c@...jandro-colomar.es>
To: Charles Munger <clm@...gle.com>
Cc: sc22wg14@...n-std.org, Rich Felker <dalias@...c.org>, 
	musl@...ts.openwall.com
Subject: Re: [SC22WG14.35943] N3849 - Memory allocation with size feedback
 v2, request for feedback

[CC += Rich, musl@]

Hi Charles,

On 2026-03-16T22:50:40-0700, Charles Munger wrote:
> This paper is the second revision of the original N3813,

<https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3813.pdf>

> updated to address
> feedback on this mailing list; thank you Alejandro for providing feedback
> on the initial version.
> 
> This new revision includes more discussion of design choices, as well as
> fixes for typos in example code, and no longer special casing allocation
> failure vs allocation success with regard to the returned size. In
> addition, two examples I have personal experience with in C codebases are
> included, as challenges dealing with these motivated this standards
> proposal.
> 
> I look forward to your feedback.

I'll quote the paper to reply.
TL;DR: I don't see evidence that this is necessary.

|	Use Cases
|	Dynamic Array
|	A common data structure is a resizable array
|	that grows on insertions by reallocating.
|	To achieve amortized linear time,
|	implementations allocate larger blocks of memory
|	than they actually need to hold the current set of items
|	- if they could use additional space at low/no cost,
|	fewer resizes would be needed,
|	minimizing copying and allocator churn.

I expect most dynamic arrays --just like every dynamic data structure--
to grow with the usual 2x algorithm: double the capacity whenever we
need to grow.  If you do that, then

-  "Fewer resizes" is something you don't care about.  The amount of
   resizes is logarithmic with respect to the size.  We're not talking
   about hundreds of realloc(3) calls.

-  Considering allocators with different arenas for different sizes,
   you'd only gain a small percentage of extra space, and that extra
   space is statistically not going to be meaningful.  And the larger
   the allocation, the smaller the percentage (you'll get at most the
   page size).

You should quantify the gains, with some experiments in real code,
measuring how many realloc(3) calls you save if you use those proposed
APIs compared to not using them.  I expect this will be at most on the
order of 5~10%; possibly less.

And then, measure the performance difference too.  Because if you
eliminate maybe 7% of the realloc(3) calls, but each call to this API
has a 7% performance penalty, then it makes no sense.

But my comments above would assume you also take advantage of realloc(3)
wasted space, which your proposal doesn't.  You only take advantage in
the initial allocation, which makes the savings even less meaningful.

I'd like to see numbers showing that this is really meaningful, compared
to requesting a reasonable round number initially (which is likely to be
quite optimized for space).

|	Arena/Bump allocators
|	Repeated small allocations with the same lifetime
|	can be grouped together by using malloc
|	to obtain a large block of memory,
|	and then adding the size of each small allocation
|	to the returned pointer until it is full.
|	If needed, allocate another block and repeat.
|	However, chosen block sizes can waste substantial memory
|	if the requested sizes are a poor fit
|	for underlying size classes used by malloc.

How large is the usual waste?  Have you measured it?  I expect that as
long as you use round numbers, you're not going to waste much, if
anything.  I expect this is not a real problem.
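For reference, this is the kind of bump allocator I have in mind (a
minimal sketch of my own, not code from the paper; a real arena would
chain its blocks so they can be freed later).  With a round BLOCK_SIZE
such as 4096, the waste per block is bounded and small:

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE  4096  /* round number; a good fit for most allocators */

struct arena {
	char    *block;  /* current block; old blocks would be chained */
	size_t  used;
};

void *
arena_alloc(struct arena *a, size_t n)
{
	void  *p;

	/* Round up to the strictest alignment.  */
	n = (n + alignof(max_align_t) - 1) & ~(alignof(max_align_t) - 1);
	if (n > BLOCK_SIZE)
		return NULL;

	if (a->block == NULL || BLOCK_SIZE - a->used < n) {
		a->block = malloc(BLOCK_SIZE);
		if (a->block == NULL)
			return NULL;
		a->used = 0;
	}

	p = a->block + a->used;
	a->used += n;
	return p;
}
```

As long as BLOCK_SIZE is a round number, I'd expect the allocator's
round-up on top of it to be zero or nearly zero.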

|	I/O Buffers
|	When streaming data to/from
|	a file, the network, or some other system,
|	we often don’t know how much data we’ll move in advance.
|	Larger buffers use more memory
|	but often increase throughput
|	and decrease syscall and device overhead;
|	if we can obtain a larger buffer than required
|	at no additional cost,
|	we want to do so.

Again, you need to show numbers.  The size extensions aren't going to
be an order of magnitude; more likely, they will be a small % of what
you have, which will not affect performance in a meaningful way.  You
should probably optimize your system calls and network usage instead.

|	Examples
|	In SQLite, [...]
|	In upb, [...]

It would be interesting (actually, necessary) to see numbers for how
much memory those projects save, or how much performance they gain,
thanks to the use of these APIs.

Apart from the lack of numbers that would show whether this is a real
problem or not, I don't like the proposed APIs.

|	Choice of Return Type
|	The proposed API returns a struct,
|	which is atypical although not unheard of in C’s standard library,
|	div_t being a prior example.

div(3) is quite weird.

|	The following alternative structures were considered:
|		void* alloc_at_least(size_t* size);
|	The former is somewhat similar to
|	atomic_compare_exchange_strong’s expected parameter.
|	Like that parameter,
|	taking an address precludes
|	passing a literal or expression for the size,

Yes.  But is that a problem?

	size = 42;
	p = alloc_at_least(&size);

|	and the developer has to pass an address,
|	usually of a local variable, instead.
|	If the original value is still needed later,
|	it has to be stored somewhere else.

Why would you want to store the guess?  You want to know the amount of
memory used and the amount of memory available; the originally
requested amount seems not very useful.

Since you have mentioned code bases that do something like this, you
should be able to show whether the guess is something that needs to be
stored often, and how each of these designs would affect their source
code.  It would be good to see real code using this, rather than
hypotheses.

|	alloc_result_t r = alloc_at_least(offsetof(T, data[4]));
|	if (r.ptr) {
|		T* result = r.ptr;
|		result->count = (r.size - offsetof(T, data[0])) / sizeof(uint64_t);
|	}

This seems quite unreadable, IMO.

|	size_t size = offsetof(T, data[4]);
|	T* result = alloc_at_least(&size);
|	if (result)
|		result->count = (size - offsetof(T, data[0])) / sizeof(uint64_t);

Much better, IMO.

|	size_t actual_size;
|	T* result = alloc_at_least(offsetof(T, data[4]), &actual_size);
|	if (result)
|		result->count = (actual_size - offsetof(T, data[0])) / sizeof(uint64_t);

The former seems more idiomatic.

|	Each of these is basically the same amount of code,
|	and offer similar opportunities for programmer error.

I tend to disagree.  The second is much more robust than the others, by
having less code.

|	However, there are some practical concerns.
|	With the out-param style,
|	an ABI is effectively required to load and store.

Are you really concerned about a "load and store"?  We're talking of
malloc(3) here.  That's a monster, and the least of our problems is a
load and a store.

|	When returning a struct by value,
|	some ABIs will return the small struct on the stack
|	at cost similar to
|	putting a size_t on the stack and passing its address.
|	However,
|	implementors may pass the struct’s members in registers,
|	and some do - notably including
|	aarch64, x86_64 using the System V ABI, and Risc-V.
|	WebAssembly/Wasm can also take advantage of multiple return values.

You're micro optimizing performance in malloc(3).  It makes no sense.
Please show numbers to see that this performance difference is
measurable at all.  We should be optimizing for safety, readability, and
usability when designing an API, unless there are strong reasons to care
about performance.

|	As far as developer ergonomics,
|	there are costs and benefits to each approach.
|	The functions that return a pointer
|	(and use an out-param for the size)
|	can assign the return value directly
|	to where it’s likely to be consumed,
|	to a non-`void` pointer.

This is what makes it importantly safer.

	alx@...uan:~/tmp$ cat m.c 
	#include <stdlib.h>

	#define typeas(T)          typeof((T){})
	#define mallocarray(n, z)  reallocarray(NULL, (n)?:1, z)
	#define malloc_T(n, T)     malloc_T_(n, typeas(T))
	#define malloc_T_(n, T)    ((void)0, (T *){mallocarray(n, sizeof(T))})

	int
	main(void)
	{
		long *p;

		p = malloc_T(42, long);  // ok
		free(p);

		p = malloc_T(42, int);   // -Wincompatible-pointer-types
		free(p);
	}
	alx@...uan:~/tmp$ gcc m.c 
	m.c: In function ‘main’:
	m.c:16:11: error: assignment to ‘long int *’ from incompatible pointer type ‘int *’ [-Wincompatible-pointer-types]
	   16 |         p = malloc_T(42, int);   // -Wincompatible-pointer-types
	      |           ^

Your variant is inherently unsafe, and can't be made type-safe in any
way.  This is a non-starter, IMO.

|	However,
|	they also generally have to declare a variable
|	to pass the address for the actual size;
|	often the allocated size will be stored
|	inside the just-allocated block.

Count the lines of code in the examples you provided above.  I count
4 LoC in each, so you're not really gaining anything in this regard
with your design.

|	Interaction with Sized Free

These interactions seem bad.  free_sized() loses its main
justification, which was to mitigate some heap misuse bugs (such as
frees with a mismatched size).  This proposal is mitigating the
mitigation.

|	Why not realloc alone?
|	realloc must determine from the allocator’s metadata
|	the true size of the block.
|	Even if paired with extensions like malloc_usable_size
|	to resize to the precise, actual size,
|	these pointer-to-size lookups are costly.

Do you have actual numbers?

|	Avoiding this lookup
|	was a motivation behind C++14’s sized delete
|	and C23’s free_sized features.

Is the performance of free(3) vs free_sized() really a concern?
The security argument is slightly more compelling (although I'm still
not convinced either), but optimizing free(3) surprises me.
Is free(3) a performance issue in any program?  I expect the risk of
passing an incorrect size to outweigh the cost of calling free(3).

The paper that added free_sized() (n2699) claims a 30% performance
gain.  But is 30% meaningful?  I expect free(3) to be quite cheap,
relatively.  Also, I'm a bit confused: if free_sized() is about
safety, it must really check the actual size of the allocation.  And
if it checks the actual size, where does the performance gain come
from?  I'd like to question those numbers, and see an actual
experiment.
<https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm>

|	When a program can make use of the added space,
|	the best time to determine it is at allocation time
|	when the allocator has all of the relevant metadata available.


Have a lovely night!
Alex

-- 
<https://www.alejandro-colomar.es>
