Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 26 Feb 2019 10:07:33 -0500
From: Rich Felker <dalias@...c.org>
To: Alexey Izbyshev <izbyshev@...ras.ru>
Cc: musl@...ts.openwall.com
Subject: Re: dlsym(handle) may search in unrelated libraries

On Sat, Feb 09, 2019 at 08:03:47PM -0500, Rich Felker wrote:
> On Sun, Feb 10, 2019 at 01:53:02AM +0300, Alexey Izbyshev wrote:
> > (Replying to https://www.openwall.com/lists/musl/2019/02/07/7)
> > 
> > >On Thu, Feb 07, 2019 at 07:36:38PM +0100, Markus Wichmann wrote:
> > >> On Thu, Feb 07, 2019 at 11:54:47AM -0500, Rich Felker wrote:
> > >> > I'm not sure if it makes sense to build both of these lists at the
> > >> > same time. We currently try to avoid allocating dependency lists for
> > >> > libraries loaded at startup in order not to allocate and setup data
> > >> > that might never be used, defering until if/when dlopen is called on
> > >> > them. I've wanted to do the ctor order walk without allocating a
> > >> > dependency list, but I don't know a good way to do so. Note that the
> > >> > code that runs ctors cannot allocate because, at the time it runs,
> > >> > dlopen has already "committed" the load and can't back it out. It has
> > >> > to have all resources it needs for the walk precommitted.
> > >>
> > >> Depth first means stack. Means recursion or explicit stack. Explicit is
> > >> going to be hard, considering there is no known limit to dependency
> > >> nesting depth. We could argue that the user better make their stack size
> > >> ulimit large enough for the main thread, but I hazard a guess that
> > >> nobody expects dlopen() to use overly much stack in other threads.
> > >>
> > >> Alright, what's the algorithm here?
> > >>
> > >> init(lib):
> > >>     if (!lib.inited):
> > >>         foreach d in lib.deps init(d)
> > >>         start_init(lib)
> > >>         lib.inited = 1
> > >>
> > >> That about right? Because that means we need a stack as high as the
> > >> nesting depth of dependencies. We can get a pessimistic estimation from
> > >> the number of deps loaded, but that may be way too high (see below).
> > >
> > >Yes, but you can also avoid recursion just by looping to the deepest
> > >dependency with !inited, then going back to the root. For a one-time
> > >operation at dlopen-time or program-start time, the quadratic search
> > >for each !inited seems unlikely to be a problem:
> > >
> > >> > Since the beginning of a list of breadth-first dependencies is just
> > >> > the direct dependencies, if we recorded the number of direct
> > >> > dependencies for each dso, the breadth-first lists could be used to
> > >> > perform a depth-first walk to run ctors; this can be made
> > >> > non-recursive, at the cost of quadratic time but with trivially-tiny
> > >^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >> > constant factor, by simply restarting at the root of the tree after
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >> > each node and finding the deepest not-yet-constructed dso. This
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > We'd need breadth-first lists for all dsos to implement DFS as you
> > suggest because a single breadth-first list for the root is not
> > enough to reconstruct all edges of the dependency graph even if we
> > know the number of direct dependencies for each dso (we need to
> > process each edge for DFS to work correctly). Or am I
> > misunderstanding your suggestion?
> 
> I think you misunderstood. I was talking about having the full BFS
> list for each dso, but tagging each one with the count of entries
> which are direct dependencies, so that the corresponding initial
> subarray can be used for DFS.
> 
> > If we're willing to add some fields to struct dso to implement the
> > correct ctor order, I'd suggest to use a non-recursive DFS with an
> > explicit stack instead. We'd need just two fields in each stack slot
> > (dso and the number of processed DT_NEEDED entries), so the memory
> > overhead is low compared to sizeof(struct dso). The stack can be
> > preallocated before "committing" dlopen(). As Markus said, the upper
> > bound is the number of dsos loaded with this dlopen() call because
> > all other dsos are already constructed, so we don't need to visit
> > them. This stack can be freed immediately after ctors are run. We
> > already have dso->constructed to mark visited dsos, and we already
> > use another list for finalization (dso->fini_next), and that list
> > would be built naturally by DFS.
> 
> I think it might get more complicated with concurrent and recursive
> calls to dlopen (e.g. dlopen from a ctor). Right now the
> init_fini_lock prevents a dlopen from another thread from completing
> as long as ctors are still running from a previous call, except
> recursively (in the same thread) thanks to a recursive mutex. Really,
> though, we should never be holding libc locks while application code
> runs, so the lock should be released each time a ctor is run and
> re-taken after it finishes.
> 
> I'd considered preallocating the explicit stack structure as part of
> the dsos, which is problematic because multiple callers might need to
> use it. However if the explicit stack were allocated by and just tied
> to the dlopen call frame instance, I think it should work. There may
> be deps that were already loaded from another thread's dlopen, though,
> but which were not yet constructed, so the bound is probably not the
> number of newly-loaded dsos but the total number of deps or the max
> depth of deps. This is probably each to record from building the BFS
> list.

I've gotten back to thinking about this, and here's a rough idea of
the algorithm I'm thinking of using:

Load-direct-deps operation for a single DSO:
- Count DT_NEEDED entries, store number of direct deps.
- Allocate storage for direct deps list.
- Call load_library for each direct dep, save pointer to it.

Load-deps operation for newly dlopen'd DSO:
- Start with all DSOs unmarked
- Use the deps list as a queue. For each entry in queue:
  - If the DSO is marked, skip it.
  - If the DSO doesn't have a deps list, perform load-direct-deps.
  - Append the initial n_direct_deps entries to the queue.
  - Mark the DSO.
- When end of queue is reached, finish and unmark all DSOs.

This should leave us with direct-dep lists for all DSOs, usable for
dependency-order ctor execution, and a non-redundant (currently, the
list can be highly redundant and thus inefficient to dlsym) BFS deps
list for the dlopen'd DSO. The direct-dep lists for DSOs not
explicitly opened also form the initial segment of any future BFS deps
list needed when dlopen is called on them.

The above is entirely non-recursive and uses no temp storage; the data
is generated in-place. Walking the direct-deps lists for ctor
execution can either use recursion or the "restart" approach I've
described before.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.