musl - Re: Re: Moving forward with sh2/nommu

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150602164547.GH17573@brightrain.aerifal.cx>
Date: Tue, 2 Jun 2015 12:45:47 -0400
From: Rich Felker <dalias@...ifal.cx>
To: Rob Landley <rob@...dley.net>
Cc: musl@...ts.openwall.com
Subject: Re: Re: Moving forward with sh2/nommu

On Tue, Jun 02, 2015 at 01:09:47AM -0500, Rob Landley wrote:
> >> > 2. Kernel insists on having a stack size set in the PT_GNU_STACK
> >> >   program header; if it's 0 (the default ld produces) then execve
> >> >   fails. It should just provide a default, probably 128k (equal to
> >> >   MMU-ful Linux).
> 
> MMU-ful linux preallocates 0k and then demand faults in pages. MMU-ful
> almost never has to worry about memory fragmentation because it can
> remap _and_ move physical pages around (if nothing else, evict through
> swap).

MMU-ful commits 128k + epsilon (typically ends up being 136k), dirties
only whatever you use, and reserves a range the size of RLIMIT_STACK.
Of course on NOMMU these 3 states are not distinct; reserved,
committed, and dirty are all the same.

> >> Nooooo.  8k.  uClinux programs cannot depend on a huge stack, because that
> >> means each instance needs to kmalloc() a huge block of memory.  That is
> >> bad, but it leads to failure to load because of fragmentation (not being
> >> able to find contiguous memory blocks for all those stacks).
> >
> > My view here was just that the default, which none was specified while
> > building the program, should be something "safe". Failed execve
> > ("oops, need to use the right -Wl,-z,stack-size=XXX") is a lot easier
> > to diagnose than a stack overflow that clobbers the program code with
> > stack objects. Right now the default is "always fails to load" because
> > the kernel explicitly rejects any request for a default.
> 
> I note that Rich was probably saying he wants the default at 128k for
> ELF, not for FDPIC. That said, I'm not sure you can have a big enough
> warning sign about vanilla elf being crappy in that case.

This is unrelated to binary format, so no. It's purely a matter of
making it possible for apps to work when they're built without adding
extra CFLAGS or running extra commands to set a stack size for the
binary. My view here is that an application which was not specifically
written for NOMMU should run (or fail with a meaningful error like
ENOMEM) after compiling it with ./configure && make or equivalent
(i.e. without additional custom CFLAGS that would require
application-specific knowledge). Getting it working optimally (size,
memory usage, speed, features, etc.) in your particular environment
might require more work, of course.

Current behavior is that apps with stacksize==0 fail to run at all;
the kernel gives a mysterious error from execve (ENOEXEC?) and then
the shell tries to run the binary as a shell script. Once you
explicitly set a size, it runs with the size you asked for or fails
with ENOMEM.

Setting a small default would be much worse than the current behavior;
rather than getting errors from execve as if the binary were an
unrecognized format, you'd get massive memory corruption likely to end
with bringing down the kernel -- the stack overwrites data/code as it
expands down, then whatever got written over top of the code gets
executed.

> There's 2 things to balance here: if it doesn't "just work" then
> people are frustrated getting their programs to run, but if the
> defaults are horrible for scalability people will go "nommu is crap,
> we can't use it" without ever learning what's actually _wrong_. (It's
> a lot easier to get people to fix obvious breakage than to performance
> tune something that becomes untenable after the fact. How big the
> performance hit has to be before requiring --no-i-really-mean-it on
> the command line is an open question, but this is up there.)
> 
> Of the two ("just works" but is horrible, breaks for trivial things
> until you understand why), if they can't get "hello world" to work we
> can probably get them to read a very small HOWTO, so they at least
> know fixed stack size is an _issue_ in this context. (We're already in
> "fork does not work" territory. We are not in Kansas anymore, if you
> try to fake kansas _too_ hard you're doing developers a disservice.)

Note that for non-shared binfmts (bFLT and non-FDPIC ELF) you already
need a relatively large contiguous memory region for the program's
text+data. So I don't think a moderate-sized stack is in the realm of
"works but horrible". Rather it's on the same scale as use of
non-shareable text -- perfectly fine if you're just going to have a
few processes running after system boot, but preventing scaling. As
long as both these issues are documented (stack size and non-shared
text) I don't see it as giving a bad first impression to NOMMU users.

> That said, annotating every package in the build is silly. Probably
> there should be an enviornment variable or something that can set this
> default for entire builds, and something to actually _measure_ stack
> usage after a run would be awesome. (The kernel has checkstack.pl for
> example?) And the elf2flt command line option for setting stack size
> was just _awkward_, no idea what you've done for your fdpic binaries
> but the traditional UI for this is horrible.

There kind of is an environment variable, LDFLAGS. Using
LDFLAGS=-Wl,-z,stack-size=$SIZE will make 

> >> >   Unfortunately I suspect fixing this might be controversial since
> >> >   there may be existing binaries using brk that can't fall back to
> >> >   mmap.
> >>
> >> No, look at what I did in uClibc for brk().  I think it fails, and everything
> >> in the past depends on that.
> >
> > OK. So do you think it would be safe/acceptable to make brk always
> > fail in the kernel?
> 
> Fork does, this is _less_ intrusive than that.

Yes but fork failing isn't a 'regression' versus older kernel
versions. Personally I don't consider this a regression but a bugfix
-- brk was returning non-usable memory and doing so in such a way that
it couldn't report failure -- but I'm not sure the kernel folks will
agree.

> >> >   These numbers
> >> >   indicate "6 arguments"; there is no good reason to encode the
> >> >   number of arguments in the trap number, so we might as well just
> >> >   always use the "6 argument" code which is what the variadic
> >> >   syscall() has to use anyway. User code should then aim to use the
> >> >   correct value (22 or 38) for the model it's running on (SH3/4 or
> >> >   SH2) for compatibility with old kernels, but will still run safely
> >> >   on new kernels if it detects wrong.
> >>
> >> I say drop backward compatibility.
> >
> > That generally goes against kernel stability principles and seems
> > unlikely to be acceptable upstream.
> 
> I think Jeff means software-level backward compatibility with sh2
> wouldn't be missed. (Linux wasn't big on that architecture in the
> first place, and buildroot dropped support for it a release or two
> back.) Losing backward compatibility with sh4 would make us look
> really bad, especially since musl already supports it, but I don't
> think he's suggesting breaking sh4.
> 
> If we make sh2 binaries accept sh4 syscalls, we can call it a new
> architecture variant. (Which we actually _are_ with sh2j.) The kernel
> patch to make that work would be tiny and the hardware doesn't change.

If you mean making new sh2 binaries use the sh3/4 syscall ABI (traps
0x10-0x16 instead of 0x20-0x26) then that's a problem for being able
to support sh2a, which uses 2 of those traps for hardware exceptions.
Having division errors cause a random syscall to get invoked would not
be fun.

If on the other hand we just use only 0x16, the 6-argument form, then
it should safely avoid such clashes.

> >> > 2. Neither binutils nor gcc accepts "sh2eb-linux" as a target. Trying
> >> >   to hack it in got me a little-endian toolchain. I'm currently just
> >> >   using "sheb" and -m2 to use sh2 instructions that aren't in sh1.
> >>
> >> That is why you really want to configure —target=sh2-uclinux (or make
> >> sh2-linux do the right thing).  SH2 I think is always Big...
> >
> > Well GCC seems to consider the plain sh2-linux target as
> > little-endian.
> 
> They're crazy and broken about a lot of stuff. Sega Saturn was big
> endian. (Saturn was to sh2 what dreamcast was to sh4.)
> 
> I also note that gcc 4.2.1 and binutils 2.17 parsed sh2eb. Current
> stuff not doing so is a regression.

This should probably be reported. Are you sure it worked on unpatched
versions of the above?

> >> > 3. The complex math functions cause ICE in all gcc versions I've tried
> >> >   targetting SH2. For now we can just remove src/complex from musl,
> >> >   but that's a hack. The cause of this bug needs to be found and
> >> >   fixed in GCC.
> 
> Can I get a test program I can build and try with Aboriginal's toolchain?

I don't have a minimal test case. Here is the error:

src/complex/casin.c: In function 'casin':
src/complex/casin.c:16:1: error: unable to find a register to spill in class 'SIBCALL_REGS'
src/complex/casin.c:16:1: error: this is the insn:
...

It goes away if this file is built as non-PIC, and I never experienced
problems with sh4 with fpu, so I'm pretty sure the ICE is an
intersection of soft-float and PIC code generation. It's probably
related to compound literals too.

Adding this to musl's makefile is a temporary workaround:

$(patsubst %.c,%.o,$(wildcard src/complex/*.c)): CFLAGS += -fno-PIC -fno-PIE

> >> > musl issues:
> >> >
> >> > 1. We need runtime detection for the right trap number to use for
> >> >   syscalls. Right now I've got the trap numbers hard-coded for SH2 in
> >> >   my local tree.
> >>
> >> I don’t agree.  Just rationalise it.  Why can SH3 and above not use the
> >> same traps as SH2?
> >
> > Because the kernel syscall interface is a stable API. Even if not for
> > that, unilaterally deciding to change the interface does not instill
> > confidence in the architecture as a stable target.
> 
> And because the perception out there in linux-land is that sh4 was a
> real (if stale) processor and sh2 wasn't, so breaking sh4 to suit sh2
> _before_ we've established our new open hardware thing as actually
> viable will get the door slammed on us so hard...

Agree 100%. This can't be stressed enough.

> > OTOH if we could change to using the SH2 trap range as the default and
> > just keep the old SH3/4 range as a 'backwards compatibility' thing on
> > SH3/4 hardware, I think that might be an acceptable solution too.
> > Existing SH3/4 binaries are never going to run on SH2 anyway.
> 
> QEMU supports sh4 right now. If sh2 supported sh4 traps we _might_ be
> able to run some sh2 code on qemu-sh4 and/or qemu-system-sh4. (But
> then I dunno what the issues there are, I need to sit down and fight
> with it now that elf2flt isn't blocking me.)

qemu-sh4 supports either version of the traps. :-) This actually
helped a lot with early-stage testing of my sh2 binaries.

> >> Don’t understand why we want to change personality?  More info?
> >
> > There's one unexpected place where the kernel has to know whether
> > you're doing FDPIC or not. The sigaction syscall takes a function
> > pointer for the signal handler, and on FDPIC, this is a pointer to the
> > function descriptor containing the GOT pointer and actual code
> > address. So if the kernel has loaded non-FDPIC ELF via the FDPIC ELF
> > loader, it will have switched personality to FDPIC to treat the signal
> > handler pointer specially. And then when you give it an actual
> > function address instead of a function descriptor, it reads a GOT
> > address and code address from the first 8 bytes of the function code,
> > and blows up.
> >
> > If we patch the FDPIC ELF loaded to support normal ELF files (this
> > should be roughly a 10-line patch) then it would never set the FDPIC
> > personality for them to begin with, and no hacks to set it back would
> > be needed.
> 
> Code/rodata  segment sharing is actually really _nice_ for nommu
> systems. It would be nice if we could get that to work at some point.

Absolutely.

> And then there's that XIP stuff that two different ELC presentations
> used for Cortex-M, the videos of which are now up at
> http://elinux.org/ELC_2015_Presentations

XIP should in principle be possible with any binary that can share
code on NOMMU, but I'm not clear on whether the kernel supports it for
all of them.

> (I refer to the talks from Jim Huang and Vitaly Wool.)
> 
> My point is, running contorted but technically valid ELF on nommu is
> just a starting point, we eventually want to go beyond that to take
> advantage of stuff only FDPIC can do.

Agreed.

> >> > 5. The brk workaround I'm doing now can't be upstreamed without a
> >> >   reliable runtime way to distinguish nommu. To put it in malloc.c
> >> >   this would have to be a cross-arch solution. What might make more
> >> >   sense is putting it in syscall_arch.h for sh
> >>
> >> No, this is a general nommu problem.  It will also appear on ARM,
> >> ColdFire, an BlackFin, which are important targets for MUSL.
> >
> > Right, but I don't want to hard-code it for these archs either. In
> > principle it should be possible to run an i386 binary on a nommu i386
> > setup, and if the (hypothetical) kernel had a dangerour/broken brk
> > there too, if also needs to be blacklisted. So what I'm looking for,
> > if the kernel can't/won't just remove brk support on nommu, is a way
> > to detect nommu/broken-brk and prevent it from being used. One
> > simple/stupid way is:
> >
> > if ((size_t)&local_var - (size_t)cur_brk < MIN_DIST_TO_STACK)
> >         // turn off brk support
> >
> > This would ban use of brk on any system where there's a risk of brk
> > extending up into the stack.
> 
> Or you can check your ELF header flags and see if you've got the fdpic bit set.

Well that's a big hack (hard to do, esp in static binaries, where you
don't have the machinery to find your own headers easily from malloc),
and wouldn't help once we get binfmt_elf_fdpic to load non-FDPIC ELF
files.

I think a variant of the above check is actually what should go into
musl since it's a valid check for any system (even MMU-ful ones) and
catches a dangerous condition where brk should not be used.

> >> >   where we already
> >> >   have to check for SH2 to determine the right trap number; the
> >> >   inline syscall code can just do if (nr==SYS_brk&&IS_SH2) return 0;
> >>
> >> I think a look at uClibc is in order.  I made it always return failure,
> >> after playing with having it return results from malloc().  It’s one of
> >> two things that we don’t do (or do poorly) with nommu, the other is the
> >> clone() and form() family being highly restricted.
> >
> > It's so broken I think it should just be fixed on the kernel side.
> 
> I don't know what you mean by "fixed" here. linux/nommu.c currently has:

That code is not the problem. The problem is in binfmt_elf_fdpic.c:

	current->mm->brk = current->mm->start_brk;
	current->mm->context.end_brk = current->mm->start_brk;
	current->mm->context.end_brk +=
		(stack_size > PAGE_SIZE) ? (stack_size - PAGE_SIZE) : 0;
	current->mm->start_stack = current->mm->start_brk + stack_size;

It sets up a brk zone that overlaps with the stack.

In principle you could have a separate brk zone that works, but it
would require reserving a big contiguous memory range (just like for
stack) which is something you don't want on NOMMU. It's much better
for brk just to fail so mmap gets used instead.

> /*
>  *  sys_brk() for the most part doesn't need the global kernel
>  *  lock, except when an application is doing something nasty
>  *  like trying to un-brk an area that has already been mapped
>  *  to a regular file.  in this case, the unmapping will need
>  *  to invoke file system routines that need the global lock.
>  */
> SYSCALL_DEFINE1(brk, unsigned long, brk)
> {
>         struct mm_struct *mm = current->mm;
> 
>         if (brk < mm->start_brk || brk > mm->context.end_brk)
>                 return mm->brk;
> 
>         if (mm->brk == brk)
>                 return mm->brk;
> 
>         /*
>          * Always allow shrinking brk
>          */
>         if (brk <= mm->brk) {
>                 mm->brk = brk;
>                 return brk;
>         }
> 
>         /*
>          * Ok, looks good - let it rip.
>          */
>         flush_icache_range(mm->brk, brk);
>         return mm->brk = brk;
> }
> 
> If that should be replaced with "return -ENOSYS" we can submit a patch...

I'd rather have the NOMMU binfmt loaders set end_brk=start_brk, but it
wouldn't hurt to do both. At least the code that sets up the brk zone
to overlap with the stack needs to be removed. That's just an
explosion (possibly physical, depending on your application) waiting
to happen.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.