musl - Re: [PATCH v2] MT fork

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <C6YXH9CO8YXT.1ASOTS17EZ872@mussels>
Date: Mon, 09 Nov 2020 15:01:24 -0300
From: Érico Nogueira <ericonr@...root.org>
To: <musl@...ts.openwall.com>, <musl@...ts.openwall.com>
Subject: Re: [PATCH v2] MT fork

On Mon Nov 9, 2020 at 9:07 AM -03, Rich Felker wrote:
> On Sun, Nov 08, 2020 at 05:12:15PM +0100, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@...c.org> [2020-11-05 22:36:17 -0500]:
> > > On Fri, Oct 30, 2020 at 11:31:17PM -0400, Rich Felker wrote:
> > > > One thing I know is potentially problematic is interaction with malloc
> > > > replacement -- locking of any of the subsystems locked at fork time
> > > > necessarily takes place after application atfork handlers, so if the
> > > > malloc replacement registers atfork handlers (as many do), it could
> > > > deadlock. I'm exploring whether malloc use in these systems can be
> > > > eliminated. A few are almost-surely better just using direct mmap
> > > > anyway, but for some it's borderline. I'll have a better idea sometime
> > > > in the next few days.
> > > 
> > > OK, here's a summary of the affected locks (where there's a lock order
> > > conflict between them and application-replaced malloc):
> > 
> > if malloc replacements take internal locks in atfork
> > handlers then it's not just libc internal locks that
> > can cause problems but locks taken by other atfork
> > handlers that were registered before the malloc one.
>
> No other locks taken there could impede forward process in libc. The
> only reason malloc is special here is because we allowed the
> application to redefine a family of functions used by libc. For
> MT-fork with malloc inside libc, we do the malloc_atfork code last so
> that the lock isn't held while other libc components' locks are being
> taken. But we can't start taking libc-internal locks until the
> application atfork handlers run, or the application could see a
> deadlocked libc state (e.g. if something in the atfork handlers used
> time functions, maybe to log time of fork, or gettext functions, maybe
> to print a localized message, etc.).
>
> > i don't think there is a clean solution to this. i
> > think using mmap is ugly (uless there is some reason
> > to prefer that other than the atfork issue).
>
> When the size is exactly a 4k page, it's preferable in terms of memory
> usage to use mmap instead of malloc, except on wacky archs with large
> pages (which I don't think it makes sense to memory-usage-optimize
> for; they're inherently and unfixably memory-inefficient).
>
> But for the most part, yes, it's ugly and I don't want to make such a
> change.
>
> > > - atexit: uses calloc to allocate more handler slots of the builtin 32
> > >   are exhausted. Could reasonably be changed to just mmap a whole page
> > >   of slots in this case.
>
> Not sure on this. Since it's only used in the extremely rare case
> where a huge number of atexit handlers are registered, it's probably
> nicer to use mmap anyway -- it avoids linking a malloc that will
> almost surely never be used by atfork.
>
> > > - dlerror: the lock is just for a queue of buffers to be freed on
> > >   future calls, since they can't be freed at thread exit time because
> > >   the calling context (thread that's "already exited") is not valid to
> > >   call application code, and malloc might be replaced. one plausible
> > >   solution here is getting rid of the free queue hack (and thus the
> > >   lock) entirely and instead calling libc's malloc/free via dlsym
> > >   rather than using the potentially-replaced symbol. but this would
> > >   not work for static linking (same dlerror is used even though dlopen
> > >   always fails; in future it may work) so it's probably not a good
> > >   approach. mmap is really not a good option here because it's
> > >   excessive mem usage. It's probably possible to just repeatedly
> > >   unlock/relock around performing each free so that only one lock is
> > >   held at once.
>
> I think the repeated unlock/relock works fine here, but it would also
> work to just null out the list in the child (i.e. never free any
> buffers that were queued to be freed in the parent before fork). Fork
> of MT process is inherently leaky anyway (you can never free thread
> data from the other parent threads). I'm not sure which approach is
> nicer. The repeated unlock/relock is less logically invasive I think.
>
> > > - gettext: bindtextdomain calls calloc while holding the lock on list
> > >   of bindings. It could drop the lock, allocate, retake it, recheck
> > >   for an existing binding, and free in that case, but this is
> > >   undesirable because it introduces a dependency on free in
> > >   static-linked programs. Otherwise all memory gettext allocates is
> > >   permanent. Because of this we could just mmap an area and bump
> > >   allocate it, but that's wasteful because most programs will only use
> > >   one tiny binding. We could also just leak on the rare possibility of
> > >   concurrent binding allocations; the number of such leaks is bounded
> > >   by nthreads*ndomains, and we could make it just nthreads by keeping
> > >   and reusing abandoned ones.
>
> Thoughts here?
>
> > > - sem_open: a one-time calloc of global semtab takes place with the
> > >   lock held. On 32-bit archs this table is exactly 4k; on 64-bit it's
> > >   6k. So it seems very reasonable to just mmap instead of calloc.
>
> This should almost surely just be changed to mmap. With mallocng, an
> early malloc(4k) will take over 5k, and this table is permanent so the
> excess will never be released. mmap avoids that entirely.
>
> > > - timezone: The tz core allocates memory to remember the last-seen
> > >   value of TZ env var to react if it changes. Normally it's small, so
> > >   perhaps we could just use a small (e.g. 32 byte) static buffer and
> > >   replace it with a whole mmapped page if a value too large for that
> > >   is seen.
> > > 
> > > Also, somehow I failed to find one of the important locks MT-fork
> > > needs to be taking: locale_map.c has a lock for the records of mapped
> > > locales. Allocation also takes place with it held, and for the same
> > > reason as gettext it really shouldn't be changed to allocate
> > > differently. It could possibly do the allocation without the lock held
> > > though and leak it (or save it for reuse later if needed) when another
> > > thread races to load the locale.
> > 
> > yeah this sounds problematic.
> > 
> > if malloc interposers want to do something around fork
> > then libc may need to expose some better api than atfork.
>
> Most of them use existing pthread_atfork if they're intended to be
> usable with MT-fork. I don't think inventing a new musl-specific API
> is a solution here. Their pthread_atfork approach already fully works
> with glibc because glibc *doesn't* make anything but malloc work in
> the MT-forked child. All the other subsystems mentioned here will
> deadlock or blow up in the child with glibc, but only with 0.01%
> probability since it's rare for them to be under hammering at fork
> time.
>
> One solution you might actually like: getting rid of
> application-provided-malloc use inside libc. This could be achieved by
> making malloc a thin wrapper for __libc_malloc or whatever, which
> could be called by everything in libc that doesn't actually have a
> contract to return "as-if-by-malloc" memory. Only a few functions like
> getdelim would be left still calling malloc.

This code block in glob() uses strdup(), which I'd assume would have to
use the application provided malloc. Wouldn't that have to be worked
around somehow?

	if (*pat) {
		char *p = strdup(pat);
		if (!p) return GLOB_NOSPACE;
		buf[0] = 0;
		size_t pos = 0;
		char *s = p;
		if ((flags & (GLOB_TILDE | GLOB_TILDE_CHECK)) && *p == '~')
			error = expand_tilde(&s, buf, &pos);
		if (!error)
			error = do_glob(buf, pos, 0, s, flags, errfunc, &tail);
		free(p);
	}

>
> The other pros of such an approach are stuff like making it so
> application code doesn't get called as a callback from messy contexts
> inside libc, e.g. with dynamic linker in inconsistent state. The major
> con I see is that it precludes omitting the libc malloc entirely when
> static linking, assuming you link any part of libc that uses malloc
> internally. However, almost all such places only call malloc, not
> free, so you'd just get the trivial bump allocator gratuitously
> linked, rather than full mallocng or oldmalloc, except for dlerror
> which shouldn't come up in static linked programs anyway.
>
> Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.