musl - Re: [PATCH v2] MT fork

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201109184456.GB534@brightrain.aerifal.cx>
Date: Mon, 9 Nov 2020 13:44:56 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH v2] MT fork

On Mon, Nov 09, 2020 at 03:01:24PM -0300, Érico Nogueira wrote:
> On Mon Nov 9, 2020 at 9:07 AM -03, Rich Felker wrote:
> > On Sun, Nov 08, 2020 at 05:12:15PM +0100, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@...c.org> [2020-11-05 22:36:17 -0500]:
> > > > On Fri, Oct 30, 2020 at 11:31:17PM -0400, Rich Felker wrote:
> > > > > One thing I know is potentially problematic is interaction with malloc
> > > > > replacement -- locking of any of the subsystems locked at fork time
> > > > > necessarily takes place after application atfork handlers, so if the
> > > > > malloc replacement registers atfork handlers (as many do), it could
> > > > > deadlock. I'm exploring whether malloc use in these systems can be
> > > > > eliminated. A few are almost-surely better just using direct mmap
> > > > > anyway, but for some it's borderline. I'll have a better idea sometime
> > > > > in the next few days.
> > > > 
> > > > OK, here's a summary of the affected locks (where there's a lock order
> > > > conflict between them and application-replaced malloc):
> > > 
> > > if malloc replacements take internal locks in atfork
> > > handlers then it's not just libc internal locks that
> > > can cause problems but locks taken by other atfork
> > > handlers that were registered before the malloc one.
> >
> > No other locks taken there could impede forward process in libc. The
> > only reason malloc is special here is because we allowed the
> > application to redefine a family of functions used by libc. For
> > MT-fork with malloc inside libc, we do the malloc_atfork code last so
> > that the lock isn't held while other libc components' locks are being
> > taken. But we can't start taking libc-internal locks until the
> > application atfork handlers run, or the application could see a
> > deadlocked libc state (e.g. if something in the atfork handlers used
> > time functions, maybe to log time of fork, or gettext functions, maybe
> > to print a localized message, etc.).
> >
> > > i don't think there is a clean solution to this. i
> > > think using mmap is ugly (uless there is some reason
> > > to prefer that other than the atfork issue).
> >
> > When the size is exactly a 4k page, it's preferable in terms of memory
> > usage to use mmap instead of malloc, except on wacky archs with large
> > pages (which I don't think it makes sense to memory-usage-optimize
> > for; they're inherently and unfixably memory-inefficient).
> >
> > But for the most part, yes, it's ugly and I don't want to make such a
> > change.
> >
> > > > - atexit: uses calloc to allocate more handler slots of the builtin 32
> > > >   are exhausted. Could reasonably be changed to just mmap a whole page
> > > >   of slots in this case.
> >
> > Not sure on this. Since it's only used in the extremely rare case
> > where a huge number of atexit handlers are registered, it's probably
> > nicer to use mmap anyway -- it avoids linking a malloc that will
> > almost surely never be used by atfork.
> >
> > > > - dlerror: the lock is just for a queue of buffers to be freed on
> > > >   future calls, since they can't be freed at thread exit time because
> > > >   the calling context (thread that's "already exited") is not valid to
> > > >   call application code, and malloc might be replaced. one plausible
> > > >   solution here is getting rid of the free queue hack (and thus the
> > > >   lock) entirely and instead calling libc's malloc/free via dlsym
> > > >   rather than using the potentially-replaced symbol. but this would
> > > >   not work for static linking (same dlerror is used even though dlopen
> > > >   always fails; in future it may work) so it's probably not a good
> > > >   approach. mmap is really not a good option here because it's
> > > >   excessive mem usage. It's probably possible to just repeatedly
> > > >   unlock/relock around performing each free so that only one lock is
> > > >   held at once.
> >
> > I think the repeated unlock/relock works fine here, but it would also
> > work to just null out the list in the child (i.e. never free any
> > buffers that were queued to be freed in the parent before fork). Fork
> > of MT process is inherently leaky anyway (you can never free thread
> > data from the other parent threads). I'm not sure which approach is
> > nicer. The repeated unlock/relock is less logically invasive I think.
> >
> > > > - gettext: bindtextdomain calls calloc while holding the lock on list
> > > >   of bindings. It could drop the lock, allocate, retake it, recheck
> > > >   for an existing binding, and free in that case, but this is
> > > >   undesirable because it introduces a dependency on free in
> > > >   static-linked programs. Otherwise all memory gettext allocates is
> > > >   permanent. Because of this we could just mmap an area and bump
> > > >   allocate it, but that's wasteful because most programs will only use
> > > >   one tiny binding. We could also just leak on the rare possibility of
> > > >   concurrent binding allocations; the number of such leaks is bounded
> > > >   by nthreads*ndomains, and we could make it just nthreads by keeping
> > > >   and reusing abandoned ones.
> >
> > Thoughts here?
> >
> > > > - sem_open: a one-time calloc of global semtab takes place with the
> > > >   lock held. On 32-bit archs this table is exactly 4k; on 64-bit it's
> > > >   6k. So it seems very reasonable to just mmap instead of calloc.
> >
> > This should almost surely just be changed to mmap. With mallocng, an
> > early malloc(4k) will take over 5k, and this table is permanent so the
> > excess will never be released. mmap avoids that entirely.
> >
> > > > - timezone: The tz core allocates memory to remember the last-seen
> > > >   value of TZ env var to react if it changes. Normally it's small, so
> > > >   perhaps we could just use a small (e.g. 32 byte) static buffer and
> > > >   replace it with a whole mmapped page if a value too large for that
> > > >   is seen.
> > > > 
> > > > Also, somehow I failed to find one of the important locks MT-fork
> > > > needs to be taking: locale_map.c has a lock for the records of mapped
> > > > locales. Allocation also takes place with it held, and for the same
> > > > reason as gettext it really shouldn't be changed to allocate
> > > > differently. It could possibly do the allocation without the lock held
> > > > though and leak it (or save it for reuse later if needed) when another
> > > > thread races to load the locale.
> > > 
> > > yeah this sounds problematic.
> > > 
> > > if malloc interposers want to do something around fork
> > > then libc may need to expose some better api than atfork.
> >
> > Most of them use existing pthread_atfork if they're intended to be
> > usable with MT-fork. I don't think inventing a new musl-specific API
> > is a solution here. Their pthread_atfork approach already fully works
> > with glibc because glibc *doesn't* make anything but malloc work in
> > the MT-forked child. All the other subsystems mentioned here will
> > deadlock or blow up in the child with glibc, but only with 0.01%
> > probability since it's rare for them to be under hammering at fork
> > time.
> >
> > One solution you might actually like: getting rid of
> > application-provided-malloc use inside libc. This could be achieved by
> > making malloc a thin wrapper for __libc_malloc or whatever, which
> > could be called by everything in libc that doesn't actually have a
> > contract to return "as-if-by-malloc" memory. Only a few functions like
> > getdelim would be left still calling malloc.
> 
> This code block in glob() uses strdup(), which I'd assume would have to
> use the application provided malloc. Wouldn't that have to be worked
> around somehow?
> 
> 	if (*pat) {
> 		char *p = strdup(pat);
> 		if (!p) return GLOB_NOSPACE;
> 		buf[0] = 0;
> 		size_t pos = 0;
> 		char *s = p;
> 		if ((flags & (GLOB_TILDE | GLOB_TILDE_CHECK)) && *p == '~')
> 			error = expand_tilde(&s, buf, &pos);
> 		if (!error)
> 			error = do_glob(buf, pos, 0, s, flags, errfunc, &tail);
> 		free(p);
> 	}

It could either be left using public malloc (imo fine since this is
not an "internal component of libc" but generic library code with no
tie-in to libc) or use of strdup could be replaced with a trivial
alternate version that uses __libc_malloc instead. My leaning would be
towards the former -- only using libc malloc in places where calling
the application-provided malloc could lead to recursive locking of
libc-internal locks (because the caller already holds a libc-internal
lock) or other "inconsistent state" issues (like dlerror buffers at
pthread_exit time).

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.