musl - Re: dlsym(handle) may search in unrelated libraries

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190226150733.GQ23599@brightrain.aerifal.cx>
Date: Tue, 26 Feb 2019 10:07:33 -0500
From: Rich Felker <dalias@...c.org>
To: Alexey Izbyshev <izbyshev@...ras.ru>
Cc: musl@...ts.openwall.com
Subject: Re: dlsym(handle) may search in unrelated libraries

On Sat, Feb 09, 2019 at 08:03:47PM -0500, Rich Felker wrote:
> On Sun, Feb 10, 2019 at 01:53:02AM +0300, Alexey Izbyshev wrote:
> > (Replying to https://www.openwall.com/lists/musl/2019/02/07/7)
> > 
> > >On Thu, Feb 07, 2019 at 07:36:38PM +0100, Markus Wichmann wrote:
> > >> On Thu, Feb 07, 2019 at 11:54:47AM -0500, Rich Felker wrote:
> > >> > I'm not sure if it makes sense to build both of these lists at the
> > >> > same time. We currently try to avoid allocating dependency lists for
> > >> > libraries loaded at startup in order not to allocate and setup data
> > >> > that might never be used, defering until if/when dlopen is called on
> > >> > them. I've wanted to do the ctor order walk without allocating a
> > >> > dependency list, but I don't know a good way to do so. Note that the
> > >> > code that runs ctors cannot allocate because, at the time it runs,
> > >> > dlopen has already "committed" the load and can't back it out. It has
> > >> > to have all resources it needs for the walk precommitted.
> > >>
> > >> Depth first means stack. Means recursion or explicit stack. Explicit is
> > >> going to be hard, considering there is no known limit to dependency
> > >> nesting depth. We could argue that the user better make their stack size
> > >> ulimit large enough for the main thread, but I hazard a guess that
> > >> nobody expects dlopen() to use overly much stack in other threads.
> > >>
> > >> Alright, what's the algorithm here?
> > >>
> > >> init(lib):
> > >>     if (!lib.inited):
> > >>         foreach d in lib.deps init(d)
> > >>         start_init(lib)
> > >>         lib.inited = 1
> > >>
> > >> That about right? Because that means we need a stack as high as the
> > >> nesting depth of dependencies. We can get a pessimistic estimation from
> > >> the number of deps loaded, but that may be way too high (see below).
> > >
> > >Yes, but you can also avoid recursion just by looping to the deepest
> > >dependency with !inited, then going back to the root. For a one-time
> > >operation at dlopen-time or program-start time, the quadratic search
> > >for each !inited seems unlikely to be a problem:
> > >
> > >> > Since the beginning of a list of breadth-first dependencies is just
> > >> > the direct dependencies, if we recorded the number of direct
> > >> > dependencies for each dso, the breadth-first lists could be used to
> > >> > perform a depth-first walk to run ctors; this can be made
> > >> > non-recursive, at the cost of quadratic time but with trivially-tiny
> > >^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >> > constant factor, by simply restarting at the root of the tree after
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >> > each node and finding the deepest not-yet-constructed dso. This
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > We'd need breadth-first lists for all dsos to implement DFS as you
> > suggest because a single breadth-first list for the root is not
> > enough to reconstruct all edges of the dependency graph even if we
> > know the number of direct dependencies for each dso (we need to
> > process each edge for DFS to work correctly). Or am I
> > misunderstanding your suggestion?
> 
> I think you misunderstood. I was talking about having the full BFS
> list for each dso, but tagging each one with the count of entries
> which are direct dependencies, so that the corresponding initial
> subarray can be used for DFS.
> 
> > If we're willing to add some fields to struct dso to implement the
> > correct ctor order, I'd suggest to use a non-recursive DFS with an
> > explicit stack instead. We'd need just two fields in each stack slot
> > (dso and the number of processed DT_NEEDED entries), so the memory
> > overhead is low compared to sizeof(struct dso). The stack can be
> > preallocated before "committing" dlopen(). As Markus said, the upper
> > bound is the number of dsos loaded with this dlopen() call because
> > all other dsos are already constructed, so we don't need to visit
> > them. This stack can be freed immediately after ctors are run. We
> > already have dso->constructed to mark visited dsos, and we already
> > use another list for finalization (dso->fini_next), and that list
> > would be built naturally by DFS.
> 
> I think it might get more complicated with concurrent and recursive
> calls to dlopen (e.g. dlopen from a ctor). Right now the
> init_fini_lock prevents a dlopen from another thread from completing
> as long as ctors are still running from a previous call, except
> recursively (in the same thread) thanks to a recursive mutex. Really,
> though, we should never be holding libc locks while application code
> runs, so the lock should be released each time a ctor is run and
> re-taken after it finishes.
> 
> I'd considered preallocating the explicit stack structure as part of
> the dsos, which is problematic because multiple callers might need to
> use it. However if the explicit stack were allocated by and just tied
> to the dlopen call frame instance, I think it should work. There may
> be deps that were already loaded from another thread's dlopen, though,
> but which were not yet constructed, so the bound is probably not the
> number of newly-loaded dsos but the total number of deps or the max
> depth of deps. This is probably each to record from building the BFS
> list.

I've gotten back to thinking about this, and here's a rough idea of
the algorithm I'm thinking of using:

Load-direct-deps operation for a single DSO:
- Count DT_NEEDED entries, store number of direct deps.
- Allocate storage for direct deps list.
- Call load_library for each direct dep, save pointer to it.

Load-deps operation for newly dlopen'd DSO:
- Start with all DSOs unmarked
- Use the deps list as a queue. For each entry in queue:
  - If the DSO is marked, skip it.
  - If the DSO doesn't have a deps list, perform load-direct-deps.
  - Append the initial n_direct_deps entries to the queue.
  - Mark the DSO.
- When end of queue is reached, finish and unmark all DSOs.

This should leave us with direct-dep lists for all DSOs, usable for
dependency-order ctor execution, and a non-redundant (currently, the
list can be highly redundant and thus inefficient to dlsym) BFS deps
list for the dlopen'd DSO. The direct-dep lists for DSOs not
explicitly opened also form the initial segment of any future BFS deps
list needed when dlopen is called on them.

The above is entirely non-recursive and uses no temp storage; the data
is generated in-place. Walking the direct-deps lists for ctor
execution can either use recursion or the "restart" approach I've
described before.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.