musl - Re: Moving forward with sh2/nommu

Message-ID: <20150601151107.GA20759@brightrain.aerifal.cx>
Date: Mon, 1 Jun 2015 12:04:33 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Cc: Rob Landley <rob@...dley.net>
Subject: Re: Moving forward with sh2/nommu

On Mon, Jun 01, 2015 at 01:19:32AM -0500, Rob Landley wrote:
> FYI, Jeff's response.
> 
> We REALLY need to get this on the mailing list.
> 
> Rob

OK, done.

> ---------- Forwarded message ----------
> From: D. Jeff Dionne
> [...]
> 
> 
> On May 31, 2015, at 3:19 PM, Rob Landley <rob@...dley.net> wrote:
> >
> > FYI
> >
> >
> > ---------- Forwarded message ----------
> > From: Rich Felker
> > [...]
> >
> > Here's a summary of the issues we need to work through to get a modern
> > SH2/nommu-targeted musl/toolchain out of the proof-of-concept stage
> > and to the point where it's something people can use roughly 'out of
> > the box':
> >
> > Kernel issues:
> >
> > 1. Kernel should support loading plain ELF directly, unmodified. Right
> >   now I'm writing 0x81 to byte 38 of the header to make a "non-FDPIC
> >   FDPIC ELF binary", which works, but I have to make a personality()
> >   syscall at startup to switch back (this matters to kernel signal
> >   handling) and it's just highly inconvenient/ugly.
> >
> >   Despite plain ELF being suitable for NOMMU, the loader
> >   implementation in binfmt_elf.c depends pretty heavily on MMU. The
> >   one in binfmt_elf_fdpic.c can work on either. The easiest way
> >   forward is to make it so that binfmt_elf_fdpic.c does not insist on
> >   having the FDPIC flags in the ELF header on NOMMU targets (where it
> >   won't conflict with binfmt_elf.c since that loader isn't usable).
> 
> ‘Suitable’ depends on what you want.  ‘Works’, but slowly/inefficiently
> because of the necessary fixups, is perhaps also accurate.

Can I get you some sample binaries to examine?
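
For reference, the header hack from item 1 amounts to roughly the
following. This is only a sketch: it assumes a 32-bit big-endian ELF
image already in memory, and the function names are made up for
illustration. In the ELF32 header e_flags starts at offset 36, so byte
38 holds bits 8-15 of the flags word.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not part of any real tool. */
enum { ELF32_EHDR_SIZE = 52, FLAG_BYTE_OFFSET = 38, FLAG_BYTE_VALUE = 0x81 };

/* Marks a 32-bit big-endian ELF image as a "non-FDPIC FDPIC" binary by
 * writing 0x81 to byte 38 of the header, as described in item 1.
 * Returns 0 on success, -1 if the buffer is not a plausible image. */
int mark_fdpic_flag_byte(unsigned char *image, size_t len)
{
	if (len < ELF32_EHDR_SIZE) return -1;
	/* e_ident: magic, then EI_CLASS (1 = 32-bit), EI_DATA (2 = big-endian) */
	if (image[0] != 0x7f || image[1] != 'E' || image[2] != 'L' || image[3] != 'F')
		return -1;
	if (image[4] != 1 || image[5] != 2) return -1;
	image[FLAG_BYTE_OFFSET] = FLAG_BYTE_VALUE;
	return 0;
}

/* Reads the big-endian e_flags word (header offset 36) for inspection. */
uint32_t read_e_flags_be(const unsigned char *image)
{
	return (uint32_t)image[36] << 24 | (uint32_t)image[37] << 16
	     | (uint32_t)image[38] << 8  | (uint32_t)image[39];
}
```

On an otherwise zero flags word this leaves e_flags reading 0x8100.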

> > 2. Kernel insists on having a stack size set in the PT_GNU_STACK
> >   program header; if it's 0 (the default ld produces) then execve
> >   fails. It should just provide a default, probably 128k (equal to
> >   MMU-ful Linux).
> 
> Nooooo.  8k.  uClinux programs cannot depend on a huge stack, because that
> means each instance needs to kmalloc() a huge block of memory.  That is
> bad in itself, but it also leads to failure to load because of fragmentation
> (not being able to find contiguous memory blocks for all those stacks).

My view here was just that the default, when none was specified while
building the program, should be something "safe". Failed execve
("oops, need to use the right -Wl,-z,stack-size=XXX") is a lot easier
to diagnose than a stack overflow that clobbers the program code with
stack objects. Right now the default is "always fails to load" because
the kernel explicitly rejects any request for a default.
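
The policy I have in mind is small enough to sketch. The names and the
128k figure below are placeholders taken from this discussion, not
existing kernel code:

```c
/* Placeholder default; this thread debates 128k versus 8k. */
#define DEFAULT_STACK_SIZE (128UL * 1024)

/* requested: the stack size recorded in PT_GNU_STACK, or 0 if the
 * program was linked without -Wl,-z,stack-size=XXX.
 * Returns the size the loader would allocate under this policy. */
unsigned long choose_stack_size(unsigned long requested)
{
	return requested ? requested : DEFAULT_STACK_SIZE;
}
```

Whatever the default ends up being, an explicit request always wins.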

> > 3. Kernel uses the stack for brk too, growing brk from the opposite
> >   end. This is horribly buggy/dangerous. Just dummying out brk to
> >   always-fail is what should be done, but if it can't be done on the
> >   kernel side musl can do it instead (that's what I'm doing now).
> 
> He is starting to understand what nommu is about, yes.
> 
> >   Unfortunately I suspect fixing this might be controversial since
> >   there may be existing binaries using brk that can't fall back to
> >   mmap.
> 
> No, look at what I did in uClibc for brk().  I think it fails, and everything
> in the past depends on that.

OK. So do you think it would be safe/acceptable to make brk always
fail in the kernel? As long as making it fail is left to userspace,
the userspace code has to know at runtime whether it's running on
nommu or not so it can make brk fail. (I'm assuming my goal of having
binaries that can run on both/either.)

> > 4. Syscall trap numbers differ on SH2 vs SH3/4. Presumably the reason
> >   is that these two SH2A hardware traps overlap with the syscall
> >   range used by SH3/4 ABI:
> >
> >        #  define TRAP_DIVZERO_ERROR  17
> >        #  define TRAP_DIVOVF_ERROR   18
> 
> No, 2A is actually the -newest- SH.  This is just gratuitous breakage, and it’s
> really unfortunate.
> 
> >   The path forward I'd like to see is deprecating everything but trap
> >   numbers 22 and 38, which, as far as I can tell, are safe for both
> >   the SH2 and SH3/4 kernel to treat as sys calls.
> 
> Kawasaki-san?  Thoughts?
> 
> >   These numbers
> >   indicate "6 arguments"; there is no good reason to encode the
> >   number of arguments in the trap number, so we might as well just
> >   always use the "6 argument" code which is what the variadic
> >   syscall() has to use anyway. User code should then aim to use the
> >   correct value (22 or 38) for the model it's running on (SH3/4 or
> >   SH2) for compatibility with old kernels, but will still run safely
> >   on new kernels if it detects wrong.
> 
> I say drop backward compatibility.

That generally goes against kernel stability principles and seems
unlikely to be acceptable upstream.
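
To make the numbering concrete: both ABIs fold the argument count into
the trap number, so 22 and 38 are just the "6 argument" entries of each
range. The base values 16 and 32 below are inferred from those two
numbers, and all the names are illustrative only.

```c
/* Illustrative constants: 22 - 6 and 38 - 6. */
enum { TRAP_BASE_SH34 = 16, TRAP_BASE_SH2 = 32, MAX_SYSCALL_ARGS = 6 };

/* Old-style encoding: the trap number depends on the argument count. */
int trap_for_nargs(int base, int nargs)
{
	return base + nargs;
}

/* Proposed scheme: always use the 6-argument trap for the detected model. */
int syscall_trap(int is_sh2)
{
	return trap_for_nargs(is_sh2 ? TRAP_BASE_SH2 : TRAP_BASE_SH34,
	                      MAX_SYSCALL_ARGS);
}

/* Nonzero if a trap number falls in the SH3/4 syscall range as modeled
 * here, which is how the SH2A divide traps (17, 18) end up colliding. */
int in_sh34_syscall_range(int trap)
{
	return trap >= TRAP_BASE_SH34 && trap <= TRAP_BASE_SH34 + MAX_SYSCALL_ARGS;
}
```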

> > Toolchain issues:
> >
> > 1. We need static-PIE (with or without TEXTRELs) which gcc does not
> >   support out of the box. I have complex command lines that produce
> >   static-PIE, and I have specfile based recipes to convert a normal
> >   toolchain to produce (either optionally or by default) static-PIE,
> >   but these recipes conflict with using the same toolchain to build
> >   the kernel. If static-PIE were integrated properly upstream that
> >   would not be an issue.
> 
> I’d like to see -exactly- what the position independence code generator
> is doing in all cases (there are some interesting ones).  Embedded systems
> really do need to count every cycle, and while I’m good with making things
> as rational and standard as possible, if the function overhead is 50 cycles
> and blasts the iCache taking a trip through the trampoline, I think we
> need to reconsider.  Some serious benchmarking is also in order.  bFLT does
> not have any overhead at run time, which is why people still use it over
> FD-PIC on a lot of platforms...

There's no trampoline (I assume you mean PLT?) for static-PIE. Since
the linker resolves all symbolic references at ld-time, it never
generates PLT thunks; instead, all the calls using relative @PLT
addresses turn into what you would get with @PCREL addresses. So in
terms of calls, the main difference versus non-PIC is that you get a
braf/bsrf with a relative address instead of a jmp/jsr with an
absolute address, and in principle this leads to fewer relocations
('fixups') at runtime.

However, if PIC is too expensive for other reasons, there's no
fundamental reason it has to be used. This is what I meant above by
"with or without TEXTRELs". You're free to compile without -fPIE or
-fPIC, then link as [static-]PIE, and what you'll end up with is
runtime relocations in the .text segment. Unlike on systems with MMU,
this is not a problem because (1) it's not shareable anyway, so you're
not preventing sharing, and (2) there's no memory protection against
writes to .text anyway, so you're not sacrificing memory protection.
I believe doing it this way gets you the same results (even in terms
of number/type of relocations) that you would get with non-shareable
bFLT.
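
As a rough picture of what those runtime fixups involve, here is a
sketch of applying base-relative relocations to a loaded image, whether
they land in .data or .text. The struct and function names are
simplified stand-ins I made up, not the real ELF types or musl's
startup code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a RELA-style relative relocation entry. */
struct rel_fixup {
	uint32_t offset; /* where in the image to patch */
	uint32_t addend; /* link-time address of the target */
};

/* Patch each slot with load_base + addend. On nommu nothing stops
 * these writes from landing in .text. Returns -1 on a bad offset. */
int apply_relative_fixups(unsigned char *image, size_t image_len,
                          uint32_t load_base,
                          const struct rel_fixup *rels, size_t nrels)
{
	for (size_t i = 0; i < nrels; i++) {
		uint32_t value = load_base + rels[i].addend;
		if (rels[i].offset > image_len ||
		    image_len - rels[i].offset < sizeof value)
			return -1;
		/* memcpy avoids alignment and aliasing problems */
		memcpy(image + rels[i].offset, &value, sizeof value);
	}
	return 0;
}
```

The cost is one pass over the relocation table at load time; after
that the code runs with absolute addresses and no per-call overhead.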

> > 2. Neither binutils nor gcc accepts "sh2eb-linux" as a target. Trying
> >   to hack it in got me a little-endian toolchain. I'm currently just
> >   using "sheb" and -m2 to use sh2 instructions that aren't in sh1.
> 
> That is why you really want to configure --target=sh2-uclinux (or make
> sh2-linux do the right thing).  SH2 I think is always Big...

Well GCC seems to consider the plain sh2-linux target as
little-endian.

I think a lot of the issue here is that, to you, sh2 means a particular
hardware architecture, whereas to the GCC developers and to me, sh2
means just the ISA, and from a non-kernel/non-baremetal perspective,
just the userspace part of the ISA, which is a subset of the sh3/4
ISAs. So when gcc is generating code for "sh2-linux", it's treating it
as using all the usual linux-sh conventions (default endianness,
psABI, etc.) but restricted to the sh2 instruction set (no later
instructions).

> > 3. The complex math functions cause ICE in all gcc versions I've tried
> >   targeting SH2. For now we can just remove src/complex from musl,
> >   but that's a hack. The cause of this bug needs to be found and
> >   fixed in GCC.
> 
> Does it happen in 4.5.2 or the Code Sorcery chain?  We need complex.

I'm not sure.

> > musl issues:
> >
> > 1. We need runtime detection for the right trap number to use for
> >   syscalls. Right now I've got the trap numbers hard-coded for SH2 in
> >   my local tree.
> 
> I don’t agree.  Just rationalise it.  Why can SH3 and above not use the
> same traps as SH2?

Because the kernel syscall interface is a stable API. Even if not for
that, unilaterally deciding to change the interface does not instill
confidence in the architecture as a stable target.

OTOH if we could change to using the SH2 trap range as the default and
just keep the old SH3/4 range as a 'backwards compatibility' thing on
SH3/4 hardware, I think that might be an acceptable solution too.
Existing SH3/4 binaries are never going to run on SH2 anyway.

> > 2. We need additional runtime detection options for atomics: interrupt
> >   masking for plain SH2, and the new CAS instruction for SH2J.
> 
> Dunno how to do that, actually.  There is no processor ID register in the
> older implementations.  Kawasaki-san?

The runtime detection should be possible via AT_HWCAP and/or
AT_PLATFORM provided by the kernel; no need for hardware instructions
to do it.
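
A minimal sketch of what that could look like from userspace, using
getauxval() from <sys/auxv.h>. The hwcap bit is purely hypothetical
(no such flag is defined for this today as far as I know), and musl
itself would more likely read the aux vector it already has at startup
rather than call getauxval().

```c
#include <sys/auxv.h>

/* Hypothetical bit: an assumption for illustration only. */
#define SKETCH_HWCAP_CAS (1UL << 0)

enum atomic_model { ATOMIC_IRQ_MASK, ATOMIC_CAS_INSN };

/* Pure decision function so it can be tested without real hardware. */
enum atomic_model pick_atomic_model(unsigned long hwcap)
{
	return (hwcap & SKETCH_HWCAP_CAS) ? ATOMIC_CAS_INSN : ATOMIC_IRQ_MASK;
}

/* Query the aux vector once at startup. getauxval() returns 0 if the
 * kernel did not supply AT_HWCAP, which falls back to masking. */
enum atomic_model detect_atomic_model(void)
{
	return pick_atomic_model(getauxval(AT_HWCAP));
}
```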

> > 3. We need sh/vfork.s since the default vfork.c just uses fork, which
> >   won't work. I have a version locally but it doesn't make sense to
> >   commit without runtime trap number selection.
> 
> K.  But see above
> 
> > 4. As long as we're using the FDPIC ELF header flag to get
> >   binfmt_elf_fdpic.c to load binaries, the startup code needs to call
> >   the personality() syscall to switch back. I have a local hack for
> >   doing this in rcrt1.o which is probably not worth upstreaming if we
> >   can just make the kernel do it right.
> 
> Don’t understand why we want to change personality?  More info?

There's one unexpected place where the kernel has to know whether
you're doing FDPIC or not. The sigaction syscall takes a function
pointer for the signal handler, and on FDPIC, this is a pointer to the
function descriptor containing the GOT pointer and actual code
address. So if the kernel has loaded non-FDPIC ELF via the FDPIC ELF
loader, it will have switched personality to FDPIC to treat the signal
handler pointer specially. And then when you give it an actual
function address instead of a function descriptor, it reads a GOT
address and code address from the first 8 bytes of the function code,
and blows up.
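
To illustrate the failure mode, here is a sketch of the kernel
treating a handler pointer as a descriptor. The word order (code
address first, GOT second) and all names are my assumptions for the
example; the bytes in the usage note are arbitrary stand-ins for
machine code.

```c
#include <stdint.h>

/* Assumed layout of an FDPIC function descriptor: two 32-bit words. */
struct fdpic_desc {
	uint32_t entry; /* actual code address */
	uint32_t got;   /* GOT pointer for the function's module */
};

static uint32_t load_be32(const unsigned char *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
	     | (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

/* What an FDPIC-personality kernel effectively does with a handler
 * pointer: read 8 bytes from it and treat them as a descriptor. */
struct fdpic_desc read_handler_as_desc(const unsigned char *handler)
{
	struct fdpic_desc d;
	d.entry = load_be32(handler);
	d.got = load_be32(handler + 4);
	return d;
}
```

If handler points at real code whose first bytes are, say,
2f 86 2f 96 2f a6 4f 22, the kernel would try to jump to 0x2f862f96
with 0x2fa64f22 as the GOT pointer.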

If we patch the FDPIC ELF loader to support normal ELF files (this
should be roughly a 10-line patch) then it would never set the FDPIC
personality for them to begin with, and no hacks to set it back would
be needed.

> > 5. The brk workaround I'm doing now can't be upstreamed without a
> >   reliable runtime way to distinguish nommu. To put it in malloc.c
> >   this would have to be a cross-arch solution. What might make more
> >   sense is putting it in syscall_arch.h for sh
> 
> No, this is a general nommu problem.  It will also appear on ARM,
> ColdFire, and Blackfin, which are important targets for MUSL.

Right, but I don't want to hard-code it for these archs either. In
principle it should be possible to run an i386 binary on a nommu i386
setup, and if the (hypothetical) kernel had a dangerous/broken brk
there too, it also needs to be blacklisted. So what I'm looking for,
if the kernel can't/won't just remove brk support on nommu, is a way
to detect nommu/broken-brk and prevent it from being used. One
simple/stupid way is:

if ((size_t)&local_var - (size_t)cur_brk < MIN_DIST_TO_STACK)
	// turn off brk support

This would ban use of brk on any system where there's a risk of brk
extending up into the stack.
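
Spelled out as a testable function, the check could look like this.
The threshold is a placeholder I picked for the sketch, and the
function name is made up; picking a real MIN_DIST_TO_STACK is an open
question.

```c
#include <stdint.h>

/* Placeholder threshold, chosen only for illustration. */
#define MIN_DIST_TO_STACK (8UL * 1024 * 1024)

/* Returns nonzero if the current break sits close enough below the
 * given stack address that growing it could run into the stack.
 * The subtraction is unsigned, so a break above the stack wraps to a
 * huge distance and is treated as safe, same as the one-liner above. */
int brk_too_close_to_stack(uintptr_t stack_addr, uintptr_t cur_brk)
{
	return stack_addr - cur_brk < MIN_DIST_TO_STACK;
}
```

A caller would pass the address of a local variable, cast to
uintptr_t, along with the current break, and stop using brk if the
result is nonzero.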

> >   where we already
> >   have to check for SH2 to determine the right trap number; the
> >   inline syscall code can just do if (nr==SYS_brk&&IS_SH2) return 0;
> 
> I think a look at uClibc is in order.  I made it always return failure,
> after playing with having it return results from malloc().  It’s one of
> two things that we don’t do (or do poorly) with nommu, the other is the
> clone() and fork() family being highly restricted.

It's so broken I think it should just be fixed on the kernel side.

Rich