musl - Re: [PATCH 3/3] crt: add dcrt1, with support for locating the dynamic loader at runtime

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190428161203.GU23599@brightrain.aerifal.cx>
Date: Sun, 28 Apr 2019 12:12:03 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH 3/3] crt: add dcrt1, with support for locating the
 dynamic loader at runtime

On Sat, Apr 27, 2019 at 05:51:17PM -0500, Rodger Combs wrote:
> > On Apr 27, 2019, at 12:19, Rich Felker <dalias@...c.org> wrote:
> > 
> > On Fri, Apr 26, 2019 at 08:13:29PM -0500, Rodger Combs wrote:
> >> diff --git a/crt/dcrt1.c b/crt/dcrt1.c
> >> new file mode 100644
> >> index 0000000..47c6dc2
> >> --- /dev/null
> >> +++ b/crt/dcrt1.c
> >> @@ -0,0 +1,362 @@
> >> +#define SYSCALL_NO_TLS
> >> +
> >> +#include <elf.h>
> >> +#include <errno.h>
> >> +#include <fcntl.h>
> >> +#include <features.h>
> >> +#include <libgen.h>
> >> +#include <sys/mman.h>
> >> +#include <string.h>
> >> +#include <unistd.h>
> >> +#include "atomic.h"
> >> +#include "dynlink.h"
> >> +#include "syscall.h"
> >> +
> >> +extern weak hidden const size_t _DYNAMIC[];
> >> +
> >> +int main();
> >> +weak void _init();
> >> +weak void _fini();
> >> +weak _Noreturn int __libc_start_main(int (*)(), int, char **,
> >> +	void (*)(), void(*)(), void(*)());
> >> +
> >> +#define DLSTART_PROLOGUE __libc_start_main(main, argc, argv, _init, _fini, 0);
> >> +
> >> +#define START "_start"
> >> +#define _dlstart_c _start_c
> >> +#include "../ldso/dlstart.c"
> >> +
> >> +struct dso {
> >> +	unsigned char *base;
> >> +	struct fdpic_loadmap *loadmap;
> >> +	size_t *dynv;
> >> +	Phdr *phdr;
> >> +	int phnum;
> >> +	size_t phentsize;
> >> +	unsigned char *map;
> >> +	size_t map_len;
> >> +};
> > 
> > There is no need for making a dummy strict dso or exposing struct dso
> > to this file. The code to be taken/shared from map_library just needs
> > to be refactored so that the part that writes things into struct dso
> > remains in dynlink.c. Or, since we're punting on fdpic support for
> > this for now, a simplified version of map_library without fdpic
> > support (it gets quite simple then) could perhaps just be put here for
> > now, with the intent of refactoring to unify later.
> 
> There are enough relevant values here that it seems to warrant using
> a struct (rather than a bunch of individual pointer args), and
> fairly little of map_library had to be disabled (basically just the
> stuff dealing with TLS and GNU relro/stack stuff) to avoid accesses
> to other members. I suppose those portions could be factored out
> fairly easily, and then this minimal struct could be made a member
> of the dso struct? That just seemed like it'd require a fair bit
> more change to the existing dynlink code than this.

I think you're not understanding what I mean by "refactor". The
majority of accesses to *dso in map_library are to do things that have
nothing to do with mapping the ELF file into memory, but instead are
just work that was put into map_library because the dynamic linker
needed something later and there was already a loop iterating over
where it could be found in map_library.

The exceptions I see are dso->loadmap (for fdpic, which is needed to
record where the individual segments were loaded), and the program
header stuff (needed to use the mapped library). That leaves the
outputs as essentially: phdr,phnum,phent,map,map_len. (base is also an
output, but can be derived from phdrs+map. For FDPIC, it's essentially
an array of map,map_len.)

The factorization I think might make sense is renaming the existing
map_library to map_elf, and reducing it to just mapping the file
according to the load segments and providing the above output.
map_library, in dynlink.c, could then call it and additionally fill in
the fields of struct dso, so that the existing callers in dynlink.c
work as-is.

I'm not sure what the ideal way to do FDPIC loadmaps here is. It's
probably something like returning a loadmap rather than a map address
on FDPIC targets.

> >> +static char *crt_getenv(const char *name, char **environ)
> >> +{
> >> +	size_t l = crt_strchrnul(name, '=') - name;
> >> +	if (l && !name[l] && environ)
> >> +		for (char **e = environ; *e; e++)
> >> +			if (!crt_strncmp(name, *e, l) && l[*e] == '=')
> >> +				return *e + l+1;
> >> +	return 0;
> >> +}
> > 
> > I really question the use of environment here at all, both from a
> > standpoint of simplicity/least-surprise, and from a standpoint of
> > intended usage case. If you're on a non-musl system, the environment
> > is likely to contain settings that wouldn't be appropriate to musl
> > anyway, meaing you probably need to remove them for this to work at
> > all, and then if you really need to force a path other than the
> > builtin relative one for a particular musl-linked binary, you might as
> > well just invoke ldso as a command, making the override local to that
> > invocation rather than inherited across exec of other programs.
> 
> My goal is to make locating the dynamic loader work the same way as
> locating any other library you load, including the ability to
> override that path via LD_LIBRARY_PATH for testing, interception,
> etc. Having a dedicated var for it means you can set it and have a
> program and all of its child processes use an alternate loader when
> testing something, or as a way to adjust the loader path after build
> (if patchelf or similar tools don't work well for you).
> 
> > 
> > If you have compelling reasons you want to search the environment,
> > let's discuss it, but if it's just gratuitous "maybe it will be
> > useful", let's apply YAGNI.
> > 
> >> +static void get_rpaths(const char **rpath, const char **runpath, const size_t *dyn)
> >> +{
> >> +	/* DT_STRTAB is pre-relocated for us by dlstart */
> >> +	const char *strings = (void*)dyn[DT_STRTAB];
> >> +
> >> +	if (dyn[0] & (1 << DT_RPATH))
> >> +		*rpath = strings + dyn[DT_RPATH];
> >> +	if (dyn[0] & (1 << DT_RUNPATH))
> >> +		*runpath = strings + dyn[DT_RUNPATH];
> >> +}
> > 
> > musl does not honor the legacy difference in search order between
> > rpath and runpath (see dynlink.c), but if wwe don't search via
> > environment here it doesn't matter anyway.
> 
> It was easy enough to implement, so I did ¯\_(ツ)_/¯ but if we
> explicitly don't want to implement the old behavior I can remove it.

Indeed, it's explicitly not wanted. The old behavior was only left
with glibc for backwards-compatibility with existing stuff using it,
but with musl we didn't have that.

> >> +static size_t find_linker(char *outbuf, size_t bufsize, const char *this_path, size_t thisl, const size_t *dyn, char **environ, int secure)
> >> +{
> >> +	const char *paths[3] = {0}; // rpath, envpath, runpath
> >> +	size_t i;
> >> +	int fd;
> >> +
> >> +	if (thisl)
> >> +		thisl--;
> >> +	while (thisl > 1 && this_path[thisl] == '/')
> >> +		thisl--;
> >> +	while (thisl > 0 && this_path[thisl] != '/')
> >> +		thisl--;
> >> +
> >> +	if (!secure) {
> >> +		const char *envpath = crt_getenv("LD_LOADER_PATH", environ);
> >> +		if (envpath) {
> >> +			size_t envlen = crt_strlen(envpath);
> >> +			if (envlen < bufsize) {
> >> +				crt_memcpy(outbuf, envpath, envlen + 1);
> >> +				return envlen + 1;
> >> +			}
> >> +		}
> >> +	}
> >> +
> >> +	get_rpaths(&paths[0], &paths[2], dyn);
> >> +
> >> +	paths[1] = secure ? NULL : crt_getenv("LD_LIBRARY_PATH", environ);
> >> +
> >> +	for (i = 0; i < 3; i++) {
> >> +		int relative = 0;
> >> +		const char *p = paths[i];
> >> +		char *o = outbuf;
> >> +		if (!p)
> >> +			continue;
> >> +		for (;;) {
> >> +			if (!crt_strncmp(p, "$ORIGIN", 7) ||
> >> +					!crt_strncmp(p, "${ORIGIN}", 9)) {
> >> +				relative = 1;
> >> +				if (o + thisl + 1 < outbuf + bufsize) {
> >> +					crt_memcpy(o, this_path, thisl);
> >> +					o += thisl;
> >> +				} else {
> >> +					o = outbuf + bufsize - 1;
> >> +				}
> >> +				p += (p[1] == '{' ? 9 : 7);
> >> +			} else if (*p == ':' || !*p) {
> >> +#define LDSO_FILENAME "ld-musl-" LDSO_ARCH ".so.1"
> >> +				relative |= outbuf[0] != '/';
> >> +				if ((!secure || !relative) && o + sizeof(LDSO_FILENAME) + 1 < outbuf + bufsize) {
> >> +					*o++ = '/';
> >> +					crt_memcpy(o, LDSO_FILENAME, sizeof(LDSO_FILENAME));
> >> +					if (!access(outbuf, R_OK | X_OK))
> >> +						return (o + sizeof(LDSO_FILENAME)) - outbuf;
> >> +				}
> >> +				if (!*p)
> >> +					break;
> >> +				relative = 0;
> >> +				o = outbuf;
> >> +				p++;
> >> +			} else {
> >> +				if (o < outbuf + bufsize)
> >> +					*o++ = *p;
> >> +				p++;
> >> +			}
> >> +		}
> >> +	}
> >> +
> >> +	// Didn't find a usable loader anywhere, so try the hardcoded path :shrug:
> >> +	crt_memcpy(outbuf, LDSO_PATHNAME, sizeof(LDSO_PATHNAME));
> >> +	return sizeof(LDSO_PATHNAME);
> >> +}
> > 
> > Including fallbacks that alter behavior is probably a bad idea
> > (failing preferable). If the changes in your other patch to dynlink.c
> > are needed to make it work with dcrt1, then it won't necessarily even
> > work to load a (possibly older) system-wide musl ldso; however if
> > that's the case I Think we should fix it.
> 
> This behavior is analogous to dynlink.c's behavior when it searches
> rpath for a file and can't find it there. It uses "/etc/ld-musl-"
> LDSO_ARCH ".path" and falls back on "/lib:/usr/local/lib:/usr/lib"
> if that's unavailable and I'm just using a single hardcoded path,
> but beyond that I don't see this as significantly conceptually
> different. It also allows executables built with this to work on
> systems with musl installed even without having the vendored copy
> on-hand, which makes it easier to deploy a single binary that can
> use a vendored copy if provided, but can be packaged without it for
> systems with musl.
> 
> The changes aren't needed for dcrt1 when the app is run directly;
> they're needed when running a dcrt app (or other static PIE app) via
> ldso on the command line.
> 
> > 
> >> +static void final_start_c(long *p)
> >> +{
> >> +	int argc = p[0];
> >> +	char **argv = (void *)(p+1);
> >> +	__libc_start_main(main, argc, argv, _init, _fini, 0);
> >> +}
> >> +
> >> +hidden _Noreturn void __dls2(unsigned char *base, size_t *p, size_t *dyn)
> >> +{
> >> +	int argc = p[0];
> >> +	char **argv = (void *)(p+1);
> >> +	int fd;
> >> +	int secure;
> >> +	int prot = PROT_READ;
> >> +	struct dso loader;
> >> +	Ehdr *loader_hdr;
> >> +	Phdr *new_hdr;
> >> +	void *entry;
> >> +	char this_path[2*NAME_MAX+2] = {0};
> >> +	size_t thisl;
> >> +	char linker_path[2*NAME_MAX+2] = {0};
> > 
> > 2*NAME_MAX is a fairly short arbitrary limit. Why not PATH_MAX, or a
> > VLA that adapts up to PATH_MAX (to avoid expanding stack when longer
> > is not needed)?
> 
> I got this limit from load_library; if we think it should be larger,
> we should change it in both places.

What I think you're missing here and in several other places where
you're citing something as analogous to what ldso does is that they're
not on equal standing. Anything that goes into crt1 is logic that's
permanently baked into every application linked with it. On the other
hand, anything in ldso can be changed in the future *with the changes
automatically picked up by any dynamic-linked application*, including
not only upstream changes but site-local changes if needed.

This also applies to things like search based on environment. Any
*policy* logic in ldso is something that can be changed if needed. Any
policy logic in crt1 is baked-in policy that can't be changed.

> I don't think there's much use in doing VLAs here (at least, not if
> they're going to have fixed maximums that are reasonably low), since
> we're going to CRTJMP out of this anyway.

The idea is just avoiding dirtying 2+ extra pages of memory at startup
for no reason. If the program never uses significant amounts of stack,
that's possibly an 8k increase in footprint for the entire process
lifetime.

> 
> > 
> >> +	size_t linker_len;
> >> +	size_t i;
> >> +	size_t aux[AUX_CNT];
> >> +	size_t *auxv;
> >> +	char **environ = argv + argc + 1;
> >> +
> >> +	/* Find aux vector just past environ[] and use it to initialize
> >> +	* global data that may be needed before we can make syscalls. */
> >> +	for (i = argc + 1; argv[i]; i++);
> >> +	auxv = (void *)(argv + i + 1);
> >> +	decode_vec(auxv, aux, AUX_CNT);
> >> +	secure = ((aux[0] & 0x7800) != 0x7800 || aux[AT_UID] != aux[AT_EUID]
> >> +		|| aux[AT_GID] != aux[AT_EGID] || aux[AT_SECURE]);
> > 
> > At this point we can just abort if secure != 0. There is unbounded
> > attack surface trying to load a (possibly relative) ldso with elevated
> > privileges.
> 
> No more so than dynlink.c normally has when loading other SOs. Like
> there, I don't follow $ORIGIN in secure mode, and additionally here
> I don't handle relative-to-cwd paths in secure mode. I don't see a
> problem with allowing a load from an absolute rpath, or from the
> hardcoded path, using this mechanism, though.

In that case, you're not getting any advantages over the kernel doing
the loading of the interpreter, and possibly disadvantages.

> Basically, I'm intending for this to be a feature that you could
> just turn on in your linker flags for everything you build, and get
> the functionality in the cases where you want it, at no significant
> cost in those where you don't.

But there *are* significant costs, and some of the things you want to
do make these costs much higher, and thereby significantly more
inappropriate to do globally (e.g. for a musl-based distro). These
costs are in the areas of:

- Size. Probably not relevant except to the real size purists.
- Complexity.
- Baked-in policy.
- Baked-in bug (and for suid, attack) surface.

I'm not sure yet where the right balance to be had is.

> >> +	thisl = readlink("/proc/self/exe", this_path, sizeof this_path);
> > 
> > We might consider whether use of AT_EXECFN (possibly with additional
> > resolution of symlinks etc) would be better. It would avoid dependency
> > on /proc, making it possible to run dcrt1 binaries at early boot or
> > inside chroots, but does have different semantics that might be less
> > (or maybe more?) desirable.
> > 
> 
> Hmmmm, so it's possible to get relative paths that way (e.g.
> ../test-prog), but I suppose that might be fine? But I don't know
> how useful this would really be, since dynlink.c depends on /proc
> for rpath anyway.

Yes, relative paths should be fine, but unless you resolve the final
component (filename), you end up getting a location relative to the
symlink invoked, rather than relative to the actual binary file. This
essentially makes it impossible to make executable symlinks to DNI
binaries (you'd have to ensure that you also make a symlink to the
ldso at the right relative location). So I think of AT_EXECFN were
used, we'd need to do symlink resolution on the final component
(repeating until a non-symlink was found). Note that this is also
racy, but the race is not really a problem as long as you're not doing
it for suid.

> > Instead it should be something like (pseudocode):
> > 
> > 	if (relocs_already_done) goto done:
> > 	// ...
> > done:
> > 	stage2_func dls2;
> > 	GETFUNCSYM(&dls2, __dls2, base+dyn[DT_PLTGOT]);
> > 	dls2((void *)base, sp);
> > 	
> > This does not require any DLSTART_PROLOGUE macro that's specific to
> > dcrt1; the exact same thing works in rcrt1 (static pie) and will also
> > make it safe to invoke static pie programs via ldso (seems pointless,
> > but useful when you don't realize the binary is static).
> 
> Alright, simple enough; just added a static int to __dls2 that I set
> to 1 before CRTJMP-ing to ldso,

That sounds like the perfect method. Nothing fancy or hackish.

> and had this same aux[AT_PHDR] check
> in dlstart goto past the relocs. It still needs an ifdef, since
> otherwise it'd trigger in the loader itself.

I still wonder if there's a better way to do this. I'll let you know
if I think of other ideas.

> > The dcrt1 code in dls2 (which can vary between ldso, rcrt1, and dcrt1)
> > then has to do its own test to tell which code path it needs to take,
> > but that's fine. General principle here: repeating small amounts of
> > work is better than increasing dependency between stages/layers. Same
> > applies to passing dyn into dls2 -- doing so prevents the call from
> > being a tail call, and keeps _dlstart_c's stack frame around despite
> > being done with it.
> 
> Re-parsing dynv is simple enough, but I need to get at it somehow
> (this function gets it from _start). I suppose I could read the
> _DYNAMIC symbol in C if you prefer?

Yes, that's simpler.

> The stack frame sticking around doesn't really matter, since we're
> going to CRTJMP out of this anyway.

But it happens for ldso's instanct of dlstart_c too, meaning you're
more likely to allocate an extra page during dynamic linking.

> >> +	/* For convenience in dcrt1, we relocate STRTAB here (as with FDPIC) */
> >> +	dyn[DT_STRTAB] += base;
> > 
> > Just do it where you need it.
> 
> It's simple enough in the non-FDPIC case; I just did this here for
> consistency with FDPIC, where this is significantly more complex. I
> can drop it if we want to handle FDPIC differently long-term.

I wouldn't really call it consistency -- in the FDPIC code path, it's
relocating the pointers in the dyn[] table because two of them (symbol
table address and GOT address) are needed for performing the FDPIC
relocations.

> >> 	/* MIPS uses an ugly packed form for GOT relocations. Since we
> >> 	 * can't make function calls yet and the code is tiny anyway,
> >> @@ -144,5 +163,5 @@ hidden void _dlstart_c(size_t *sp, size_t *dynv)
> >> 
> >> 	stage2_func dls2;
> >> 	GETFUNCSYM(&dls2, __dls2, base+dyn[DT_PLTGOT]);
> >> -	dls2((void *)base, sp);
> >> +	dls2((void *)base, sp, dyn);
> > 
> > See above.
> 
> I'll hold off on re-sending this series until we've decided what to
> do about the questions above; once that's done, I'll apply whatever
> we decide on, and also write up some more useful commit messages.

Sounds good.

> Most of your comments from your other mail are addressed (or at
> least discussed) above; the one about rewriting auxv/phdrs is not:
> 
> - Both linux and freebsd's linux emu layer always insert entries for
> all the auxv entries we care about (their values just might be
> zero), so we don't need to worry about them potentially being
> missing (unless there's some other implementation we care about that
> I need to check; maybe Microsoft's?). The aux vector is on the
> stack, so we don't need to worry about it being read-only or
> something like that.

Yes, my only concern was whether it's sized such that it can be done
in-place. Would be nice to confirm that it works for MS's loader too.

> glibc actually already rewrites values in there
> (when invoked directly), so there's precedent for doing so.

Very good to know. Are the changes you proposed in dynlink.c to
rewrite auxv in alignment with what glibc does?

> - We make an anonymous mapping in which to rewrite the program
> headers, rather than doing it in-place. The only particularly odd
> bit is that we take the difference between the effectively-random
> anonymous mapping address and the actual kernel-mapped base address,
> which may be negative (or very large). In practice, this works fine
> with musl's loader as-is (also happens to work with glibc's), and
> both do an unsigned subtract with no bounds check, so I think we're
> fine here as well.

Yes, I think that's fine. BTW if we had a reasonable bound on size of
program headers, we could just put the array in drct1.o's .bss and
only mmap if that bound were exceeded. It'd save PAGE_SIZE-epsilon
bytes of memory and save a syscall.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.