Date: Tue, 2 Jun 2015 01:09:47 -0500
From: Rob Landley <rob@...dley.net>
To: Rich Felker <dalias@...ifal.cx>
Cc: musl@...ts.openwall.com
Subject: Re: Moving forward with sh2/nommu

On Mon, Jun 1, 2015 at 11:04 AM, Rich Felker <dalias@...ifal.cx> wrote:
> On Mon, Jun 01, 2015 at 01:19:32AM -0500, Rob Landley wrote:
>> FYI, Jeff's response.
>>
>> We REALLY need to get this on the mailing list.
>>
>> Rob
>
> OK, done.

Actually I was hoping you and Jeff could repost your respective bits, but eh.


>> > 2. Kernel insists on having a stack size set in the PT_GNU_STACK
>> >   program header; if it's 0 (the default ld produces) then execve
>> >   fails. It should just provide a default, probably 128k (equal to
>> >   MMU-ful Linux).

MMU-ful Linux preallocates 0k and then demand faults in pages. MMU-ful
almost never has to worry about memory fragmentation because it can
remap _and_ move physical pages around (if nothing else, evict through
swap).

Nommu needs a dedicated contiguous allocation up front, even if most
of it is wasted, in a system that's _very_ prone to fragmentation,
meaning things like "true" can fail well before you're in OOM killer
territory.

It's not the same at all.

>> Nooooo.  8k.  uClinux programs cannot depend on a huge stack, because that
>> means each instance needs to kmalloc() a huge block of memory.  That is
>> bad, but it leads to failure to load because of fragmentation (not being
>> able to find contiguous memory blocks for all those stacks).
>
> My view here was just that the default, when none was specified while
> building the program, should be something "safe". Failed execve
> ("oops, need to use the right -Wl,-z,stack-size=XXX") is a lot easier
> to diagnose than a stack overflow that clobbers the program code with
> stack objects. Right now the default is "always fails to load" because
> the kernel explicitly rejects any request for a default.

I note that Rich was probably saying he wants the default at 128k for
ELF, not for FDPIC. That said, I'm not sure you can have a big enough
warning sign about vanilla elf being crappy in that case.

There's 2 things to balance here: if it doesn't "just work" then
people are frustrated getting their programs to run, but if the
defaults are horrible for scalability people will go "nommu is crap,
we can't use it" without ever learning what's actually _wrong_. (It's
a lot easier to get people to fix obvious breakage than to performance
tune something that becomes untenable after the fact. How big the
performance hit has to be before requiring --no-i-really-mean-it on
the command line is an open question, but this is up there.)

Of the two ("just works" but is horrible, versus breaks for trivial
things until you understand why): if they can't get "hello world" to
work we can probably get them to read a very small HOWTO, so they at
least know fixed stack size is an _issue_ in this context. (We're
already in "fork does not work" territory. We are not in Kansas
anymore; if you try to fake Kansas _too_ hard you're doing developers
a disservice.)

That said, annotating every package in the build is silly. There
should probably be an environment variable or something that can set
this default for entire builds, and something to actually _measure_
stack usage after a run would be awesome. (The kernel has
checkstack.pl, for example.) And the elf2flt command line option for
setting stack size was just _awkward_; no idea what you've done for
your fdpic binaries, but the traditional UI for this is horrible.

>> >   Unfortunately I suspect fixing this might be controversial since
>> >   there may be existing binaries using brk that can't fall back to
>> >   mmap.
>>
>> No, look at what I did in uClibc for brk().  I think it fails, and everything
>> in the past depends on that.
>
> OK. So do you think it would be safe/acceptable to make brk always
> fail in the kernel?

Fork already fails; this is _less_ intrusive than that.

> As long as making it fail is left to userspace,
> the userspace code has to know at runtime whether it's running on
> nommu or not so it can make brk fail. (I'm assuming my goal of having
> binaries that can run on both/either.)

Gotta make a certain amount of historical usage work if we're to wean
people off their weird bespoke builds. Not sure the right answer here.
Then again a patch to make future kernels do this right and then
depending on that is a problem that will solve itself in time. (We no
longer care about 2.4. I've generally used 7 years as a rule of thumb
for "that's too old to care about without a reason", and that gets us
back to around 2.6.25 at the moment...)

>> > 4. Syscall trap numbers differ on SH2 vs SH3/4. Presumably the reason
>> >   is that these two SH2A hardware traps overlap with the syscall
>> >   range used by SH3/4 ABI:
>> >
>> >        #  define TRAP_DIVZERO_ERROR  17
>> >        #  define TRAP_DIVOVF_ERROR   18
>>
>> No, 2A is actually the -newest- SH.  This is just gratuitous breakage, and it’s
>> really unfortunate.
>>
>> >   The path forward I'd like to see is deprecating everything but trap
>> >   numbers 22 and 38, which, as far as I can tell, are safe for both
>> >   the SH2 and SH3/4 kernel to treat as sys calls.
>>
>> Kawasaki-san?  Thoughts?
>>
>> >   These numbers
>> >   indicate "6 arguments"; there is no good reason to encode the
>> >   number of arguments in the trap number, so we might as well just
>> >   always use the "6 argument" code which is what the variadic
>> >   syscall() has to use anyway. User code should then aim to use the
>> >   correct value (22 or 38) for the model it's running on (SH3/4 or
>> >   SH2) for compatibility with old kernels, but will still run safely
>> >   on new kernels if it detects wrong.
>>
>> I say drop backward compatibility.
>
> That generally goes against kernel stability principles and seems
> unlikely to be acceptable upstream.

I think Jeff means software-level backward compatibility with sh2
wouldn't be missed. (Linux wasn't big on that architecture in the
first place, and buildroot dropped support for it a release or two
back.) Losing backward compatibility with sh4 would make us look
really bad, especially since musl already supports it, but I don't
think he's suggesting breaking sh4.

If we make sh2 binaries accept sh4 syscalls, we can call it a new
architecture variant. (Which we actually _are_ with sh2j.) The kernel
patch to make that work would be tiny and the hardware doesn't change.

Given that the qemu guys implemented "qemu-system-sh4" instead of
"qemu-system-sh", the current perception is that superh _is_ sh4. Not
being compatible with sh4 probably has higher overhead than not being
compatible with historical linux for sh2.

>> > Toolchain issues:
>> >
>> > 1. We need static-PIE (with or without TEXTRELs) which gcc does not
>> >   support out of the box. I have complex command lines that produce
>> >   static-PIE, and I have specfile based recipes to convert a normal
>> >   toolchain to produce (either optionally or by default) static-PIE,
>> >   but these recipes conflict with using the same toolchain to build
>> >   the kernel. If static-PIE were integrated properly upstream that
>> >   would not be an issue.
>>
>> I’d like to see -exactly- what the position independence code generator
>> is doing in all cases (there are some interesting ones).  Embedded systems
>> really do need to count every cycle, and while I’m good with making things
>> as rational and standard as possible, if the function overhead is 50 cycles
>> and blasts the iCache taking a trip through the trampoline, I think we
>> need to reconsider.  Some serious benchmarking is also in order.  bFLT does
>> not have any overhead at run time, which is why people still use it over
>> FD-PIC on a lot of platforms...
>
> There's no trampoline (I assume you mean PLT?) for static-PIE. Since
> the linker resolves all symbolic references at ld-time, it never
> generates PLT thunks; instead, all the calls using relative @PLT
> addresses turn into what you would get with @PCREL addresses. So in
> terms of calls, the main difference versus non-PIC is that you get a
> braf/bsrf with a relative address instead of a jmp/jsr with an
> absolute address, and in principle this leads to fewer relocations
> ('fixups') at runtime.
>
> However, if PIC is too expensive for other reasons, there's no
> fundamental reason it has to be used. This is what I meant above by
> "with or without TEXTRELs". You're free to compile without -fPIE or
> -fPIC, then link as [static-]PIE, and what you'll end up with is
> runtime relocations in the .text segment. Unlike on systems with MMU,
> this is not a problem because (1) it's not shareable anyway, so you're
> not preventing sharing, and (2) there's no memory protection against
> writes to .text anyway, so you're not sacrificing memory protection.
> I believe doing it this way gets you the same results (even in terms
> of number/type of relocations) that you would get with non-shareable
> bFLT.

Poke me on irc and I'll see what I can scrounge up. (Frantically
preparing for Thursday's talk about how we open sourced our code and
VHDL, beating our stuff into uploadable shape and documenting
everything. :)

>> > 2. Neither binutils nor gcc accepts "sh2eb-linux" as a target. Trying
>> >   to hack it in got me a little-endian toolchain. I'm currently just
>> >   using "sheb" and -m2 to use sh2 instructions that aren't in sh1.
>>
>> That is why you really want to configure --target=sh2-uclinux (or make
>> sh2-linux do the right thing).  SH2 I think is always Big...
>
> Well GCC seems to consider the plain sh2-linux target as
> little-endian.

They're crazy and broken about a lot of stuff. Sega Saturn was big
endian. (Saturn was to sh2 what dreamcast was to sh4.)

I also note that gcc 4.2.1 and binutils 2.17 parsed sh2eb. Current
stuff not doing so is a regression.

> I think a lot of the issue here is that, to you, sh2 means particular
> hardware architecture, whereas to the GCC developers and to me, sh2
> means just the ISA, and from a non-kernel/non-baremetal perspective,
> just the userspace part of the ISA, which is a subset of the sh3/4
> ISAs.

Extra bit of fun:

When Renesas was implementing the ELF spec, they used a version that
had been translated into Japanese. The translation program switched
codepages, meaning _ and . got swapped. (This is why the superh
prefixes are borked, the developers accurately implemented the
documentation they had.)

That said, "sh2eb-elf" and "sh2eb-unknown-linux" are different targets
with different ELF prefixes, and our ROM bootloader code was written
for the ELF one. (I looked at porting it and it's really painful and
intrusive, and non-linux code tends to be written for -elf toolchains
instead of -linux toolchains in general.)

Meaning I have to build _both_ toolchains, and use one for the
hardware build (which includes the ROM bootloader code) and one for
the kernel and userspace builds. (We boot vmlinux, and the bootloader
that parses the ELF cares about the prefixes; yes, the bootloader that
has to build with one set of prefixes expects to parse code built with
the _other_ set. Don't get me started.)

So when you say "what gcc developers think" the answer is "they
don't". It's an inconsistent mix of random historical crap and we have
to make the best of it. I'd like to try to be the least amount of
crazy we can going forward, please.

> So when gcc is generating code for "sh2-linux", it's treating it
> as using all the usual linux-sh conventions (default endianness,
> psABI, etc.) but restricted to the sh2 instruction set (no later
> instructions).

There really _aren't_ usual "linux-sh" conventions; there's the
perception that all the world's a <strike>vax</strike> sh4, and
everybody who doesn't think that is largely still using gcc 3.4,
because that never stopped working and the new stuff breaks every
third release. (It's a chronic problem in the embedded world: getting
people to upgrade off of whatever they first got working, let alone
interact with upstream via anything other than an initial
smash-and-grab, hightailing it to the hideout, and staying silent
until the statute of limitations runs out. And that's _without_
factoring a language barrier into it.)

>> > 3. The complex math functions cause ICE in all gcc versions I've tried
>> >   targeting SH2. For now we can just remove src/complex from musl,
>> >   but that's a hack. The cause of this bug needs to be found and
>> >   fixed in GCC.

Can I get a test program I can build and try with Aboriginal's toolchain?

>> Does it happen in 4.5.2 or the Code Sorcery chain?  We need complex.
>
> I'm not sure.

He's referring to
http://sourcery.mentor.com/public/gnu_toolchain/sh-linux-gnu/renesas-2011.03-36-sh-uclinux.src.tar.bz2
and renesas-2011.03-36-sh-uclinux-i686-pc-linux-gnu.tar.bz2 built from
that, both of which were there last month but seem to have gone down.
Grrr. And of course mentor graphics put a robots.txt to block
archive.org because they're SUCH an open source company it exudes from
their pores.

Right, I threw both on landley.net for the moment, probably take 'em
down again this weekend. (It's GPL code and that's the corresponding
source, there you go.)

Anyway, buildroot used to use this stuff to build toolchains (ala
http://git.busybox.net/buildroot/commit/?id=29efac3c23df9431375f26d1b240627f604f42ca)
but there was serious whack-a-mole in tracking where they moved it
this week (http://git.buildroot.net/buildroot/commit/?id=27404dad33a8f9068faa8be72916ed47f905b5e6)
so...

Building from source with vanilla is _so_ much nicer...

>> > musl issues:
>> >
>> > 1. We need runtime detection for the right trap number to use for
>> >   syscalls. Right now I've got the trap numbers hard-coded for SH2 in
>> >   my local tree.
>>
>> I don’t agree.  Just rationalise it.  Why can SH3 and above not use the
>> same traps as SH2?
>
> Because the kernel syscall interface is a stable API. Even if not for
> that, unilaterally deciding to change the interface does not instill
> confidence in the architecture as a stable target.

And because the perception out there in linux-land is that sh4 was a
real (if stale) processor and sh2 wasn't, so breaking sh4 to suit sh2
_before_ we've established our new open hardware thing as actually
viable will get the door slammed on us so hard...

> OTOH if we could change to using the SH2 trap range as the default and
> just keep the old SH3/4 range as a 'backwards compatibility' thing on
> SH3/4 hardware, I think that might be an acceptable solution too.
> Existing SH3/4 binaries are never going to run on SH2 anyway.

QEMU supports sh4 right now. If sh2 supported sh4 traps we _might_ be
able to run some sh2 code on qemu-sh4 and/or qemu-system-sh4. (But
then I dunno what the issues there are, I need to sit down and fight
with it now that elf2flt isn't blocking me.)

>> Don’t understand why we want to change personality?  More info?
>
> There's one unexpected place where the kernel has to know whether
> you're doing FDPIC or not. The sigaction syscall takes a function
> pointer for the signal handler, and on FDPIC, this is a pointer to the
> function descriptor containing the GOT pointer and actual code
> address. So if the kernel has loaded non-FDPIC ELF via the FDPIC ELF
> loader, it will have switched personality to FDPIC to treat the signal
> handler pointer specially. And then when you give it an actual
> function address instead of a function descriptor, it reads a GOT
> address and code address from the first 8 bytes of the function code,
> and blows up.
>
> If we patch the FDPIC ELF loader to support normal ELF files (this
> should be roughly a 10-line patch) then it would never set the FDPIC
> personality for them to begin with, and no hacks to set it back would
> be needed.

Code/rodata segment sharing is actually really _nice_ for nommu
systems. It would be good to get that working at some point.

And then there's that XIP stuff that two different ELC presentations
used for Cortex-M, the videos of which are now up at
http://elinux.org/ELC_2015_Presentations

(I refer to the talks from Jim Huang and Vitaly Wool.)

My point is, running contorted but technically valid ELF on nommu is
just a starting point, we eventually want to go beyond that to take
advantage of stuff only FDPIC can do.

>> > 5. The brk workaround I'm doing now can't be upstreamed without a
>> >   reliable runtime way to distinguish nommu. To put it in malloc.c
>> >   this would have to be a cross-arch solution. What might make more
>> >   sense is putting it in syscall_arch.h for sh
>>
>> No, this is a general nommu problem.  It will also appear on ARM,
>> ColdFire, and Blackfin, which are important targets for musl.
>
> Right, but I don't want to hard-code it for these archs either. In
> principle it should be possible to run an i386 binary on a nommu i386
> setup, and if the (hypothetical) kernel had a dangerous/broken brk
> there too, it also needs to be blacklisted. So what I'm looking for,
> if the kernel can't/won't just remove brk support on nommu, is a way
> to detect nommu/broken-brk and prevent it from being used. One
> simple/stupid way is:
>
> if ((size_t)&local_var - (size_t)cur_brk < MIN_DIST_TO_STACK)
>         // turn off brk support
>
> This would ban use of brk on any system where there's a risk of brk
> extending up into the stack.

Or you can check your ELF header flags and see if you've got the fdpic bit set.

>> >   where we already
>> >   have to check for SH2 to determine the right trap number; the
>> >   inline syscall code can just do if (nr==SYS_brk&&IS_SH2) return 0;
>>
>> I think a look at uClibc is in order.  I made it always return failure,
>> after playing with having it return results from malloc().  It’s one of
>> two things that we don’t do (or do poorly) with nommu, the other is the
>> clone() and fork() family being highly restricted.
>
> It's so broken I think it should just be fixed on the kernel side.

I don't know what you mean by "fixed" here. The kernel's mm/nommu.c currently has:

/*
 *  sys_brk() for the most part doesn't need the global kernel
 *  lock, except when an application is doing something nasty
 *  like trying to un-brk an area that has already been mapped
 *  to a regular file.  in this case, the unmapping will need
 *  to invoke file system routines that need the global lock.
 */
SYSCALL_DEFINE1(brk, unsigned long, brk)
{
        struct mm_struct *mm = current->mm;

        if (brk < mm->start_brk || brk > mm->context.end_brk)
                return mm->brk;

        if (mm->brk == brk)
                return mm->brk;

        /*
         * Always allow shrinking brk
         */
        if (brk <= mm->brk) {
                mm->brk = brk;
                return brk;
        }

        /*
         * Ok, looks good - let it rip.
         */
        flush_icache_range(mm->brk, brk);
        return mm->brk = brk;
}

If that should be replaced with "return -ENOSYS" we can submit a patch...

Rob
