musl - Re: [PATCH] Wasm support patch 3 (the actual arch/wasm)

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <HE1PR0502MB3883F387337FDA773F7E93BDE70B0@HE1PR0502MB3883.eurprd05.prod.outlook.com>
Date: Fri, 15 Dec 2017 17:02:47 +0000
From: Nicholas Wilson <nicholas.wilson@...lvnc.com>
To: "musl@...ts.openwall.com" <musl@...ts.openwall.com>
Subject: Re: [PATCH] Wasm support patch 3 (the actual arch/wasm)

Hi,

Thank you for this feedback, it was helpful. It was also put in my spam bin, hence the late reply :(

I've moved the patch into a GitHub branch while I'm working on it, so the latest version of the patch is now here: https://github.com/NWilson/musl/pull/1

On 29 November 2017 20:36, Rich Felker wrote:
>> +#define a_ctz_64 a_ctz_64
>> +static inline int a_ctz_64(uint64_t x)
>> +{
>> +  return __builtin_ctzll(x);
>> +}
>> +
> Unsure about whether this is good; you can just omit them and the
> higher-level header provides default definitions. I think clang even
> optimizes them to a single logical insn.

If Clang can optimise the DeBruijn thing to the native instruction, that's magic :)
Maybe it's target-specific though. I tested with the a_ctz_64 definition from atomic.h, and found that unfortunately it's *not* optimised with -O2 for Wasm.

Perhaps Musl should default to using the intrinsic for all platforms that don't provide it, since the intrinsic's fallback ought to be as good as Musl's.

Anyway, using the builtin works for now, and it's the only way I've found to get the fast instruction to be used.

> Also note that coding style is to indent with tabs not spaces.
Fixed :)

>> +#define a_and_64 a_and_64
>> +static inline void a_and_64(volatile uint64_t *p, uint64_t v)
>> +{
>> +  // TODO use a WebAssembly CAS builtin, when those arrive with the threads feature
>> +  //__atomic_fetch_and(p, v, __ATOMIC_SEQ_CST);
>> +  *p &= v;
>> +}
> These need to be non-dummy at some point but we can discuss that
> later. However you shouldn't define dummy versions of all of them like
> this. Either just define dummy a_ll and a_sc, or dummy a_cas. Let the
> higher level framework take care of all the rest unless you really
> have a working, better-optimized version of them like x86 does.

I put a placeholder in wasm/atomic_arch.h for each atomic instruction in the Wasm atomics proposal, as a reminder and to show what's coming. Multithreaded Wasm support will come sometime in 2018, for now it's impossible to have races or threads, so the placeholders are equivalent to what would be generated via a_cas.

>> +#define a_crash a_crash
>> +static inline void a_crash()
>> +{
>> +  // This generates the Wasm "unreachable" instruction which traps when reached
>> +  __builtin_unreachable();
>> +}
> Shouldn't it be __builtin_trap()? __builtin_unreachable allows the
> compiler to assume the code path is not reachable, which is exactly
> the opposite of what we want.

Good point, __builtin_unreachable is emitted as a trap on Wasm (has the same effect), but you're right the C frontend will probably treat them differently with respect to inlining and branch elimination. __builtin_trap won't cause the code leading to it to be culled.

Fixed.

> Generally we don't do 32/64-bit archs together as one arch with
> #ifdefs but as separate ones.

We might find that we can get away with it on Wasm though? Apart from the size of long and void* changing, they are pretty much architectures (like running a CPU in 32/64 bit "mode").

I could take out the Wasm64 support for now, given that it's hypothetical and the toolchain's not done yet. I just thought it would save time later if the headers were 64-bit ready.

>> +TYPEDEF struct { union { int __i[9]; volatile int __vi[9]; unsigned __s[9]; } __u; } pthread_attr_t;
>> +TYPEDEF struct { union { int __i[6]; volatile int __vi[6]; volatile void *volatile __p[6]; } __u; } pthread_mutex_t;
>> +TYPEDEF struct { union { int __i[6]; volatile int __vi[6]; volatile void *volatile __p[6]; } __u; } mtx_t;
>> +TYPEDEF struct { union { int __i[12]; volatile int __vi[12]; void *__p[12]; } __u; } pthread_cond_t;
>> +TYPEDEF struct { union { int __i[12]; volatile int __vi[12]; void *__p[12]; } __u; } cnd_t;
>> +TYPEDEF struct { union { int __i[8]; volatile int __vi[8]; void *__p[8]; } __u; } pthread_rwlock_t;
>> +TYPEDEF struct { union { int __i[5]; volatile int __vi[5]; void *__p[5]; } __u; } pthread_barrier_t;
> ..despite these being in the arch headers, musl policy actually
> requires all 32-bit archs and all 64-bit archs to have the same
> pthread type sizes/definitions. Using the 32-bit definitions on a
> 64-bit arch will not work because of how pthread_impl.h lays out the
> actual usage of the slots; even if it did for some of them it wouldn't
> be future-proof.

Interesting, I had wondered what the rationale was. It looks like most of the slots are unused? I can't work out why the 64-bit structures actually need to be bigger than the 32-bit ones.

In any case these are just stubs for now - threading won't work on Wasm. It's easier to put some dummy defs in the headers than to hack out all the source files which use threading.

>> +++ b/arch/wasm/bits/float.h
> These should be defined explicitly to the correct values so that there
> are not silently varying arch parameters resulting in undocumentedly
> incompatible ABIs. I would assume they're the same as double, no?

Fair enough, I've replaced the float constants with hardcoded values.

Slightly surprisingly to me, long double seems to be 80-bit precision, like on ARM64. I haven't really thought about how it should work, I've not been involved with that part of the compiler frontend. float/double are normal IEEE on Wasm, so presumably we've got the compiler to emulate higher-precision.

>> diff --git a/arch/wasm/bits/limits.h b/arch/wasm/bits/limits.h
>> +#define _POSIX_V6_LP64_OFF64  1
>> +#define _POSIX_V7_LP64_OFF64  1
> These differ by 32/64-bit.

Thanks! Dan Gohman (Mozilla) spotted it too, hence the patch for x32 the other day.

>> diff --git a/arch/wasm/bits/setjmp.h b/arch/wasm/bits/setjmp.h
>> +typedef unsigned long long __jmp_buf[8];
> Are you sure that makes sense? What does the wasm register file look
> like? What registers are call-saved?

setjmp doesn't make sense for Wasm. We don't support setjmp, and Wasm doesn't even have any registers at all. It's a stack-based architecture, where instructions push/pop data to the stack for subsequent instructions to consume.

All the things that don't make sense for Wasm just have dummy definitions, to keep the build working. In this case, signal.c includes setjmp.h, which in turn depends on __jmp_buf being defined. Any actual calls to setjmp will fail to link though, because Wasm doesn't define it.

It's not ideal, but putting these dummies in the arch/wasm headers seems nicer than messing with code outside the wasm directories.

> Why is uc_stack omitted?

Ditto for arch/wasm/bits/signal.h, where I've got a comment explaining the definitions are dummies. All the members of struct __ucontext that are not referenced by the platform-independent code (only used by arch-specific files) can be omitted from Wasm.

>> diff --git a/arch/wasm/bits/syscall.h.in b/arch/wasm/bits/syscall.h.in
>> +#define SYS_accept4 syscall_accept4
> They probably need casts to an arithmetic type and need to be in a
> reserved namespace (like __syscall_accept4, etc.).

Yes, I've tweaked it now so that the syscall casts work well. I think it's quite a satisfactory solution for the static syscalls now.

>> diff --git a/arch/wasm/crt_arch.h b/arch/wasm/crt_arch.h
>> new file mode 100644
>> index 00000000..e69de29b
> If this is empty, how does program entry point get created? How does
> __libc_start_main get called?

It's not empty now :) I've added two entrypoints:
* The standard _start entrypoint is there for people who want to run run "exit(main(...))" straight away. We think that very few people will actually want this however.
* In src/internal/wasm/start_wasm.c I've added a function "_start_wasm" which initialises stdio and calls pre-main constructors, then simply returns. We're expecting that this (or something like it) will become the default entrypoint for the Wasm linker.

Typical usage for Wasm will be calling into the (compiled) WebAssembly module from JavaScript. The Wasm module will sit there doing nothing until you call a function from it, which then becomes the bottom of the stack and we enter into the Wasm code. When that invocation returns, the heap and all the global variables etc are still there in memory, the Wasm module just sits doing nothing (no execution stack) until the next time you call into it from JavaScript.

The linker arranges for _start_wasm to run automatically at the point when the browser first loads the module, before any other code is executed.

The Wasm linker is currently being reworked not to use the .init_array section anymore, this may need some tweaking soon. Future support will be available exclusively via the @llvm.global.ctors intrinsic - this is the mechanism that C++ constructors and __attribute__((ctor)) functions hook into.

So - although I thought it was sorted last week, it's now back in flux.

>> diff --git a/arch/wasm/pthread_arch.h b/arch/wasm/pthread_arch.h
>> +static inline struct pthread *__pthread_self(void) { return pthread_self(); }
> This is a circular definition. You need code to load the thread
> pointer from whatever part of the register file it's stored in.

Hm, you're right, I just inherited this from some Emscripten code.

There are no registers or threads, I'll have to implement this using a global.

>> diff --git a/arch/wasm/reloc.h b/arch/wasm/reloc.h
>> +#define CRTJMP(pc,sp) __builtin_unreachable()
> Needs newline. This is probably not actually important without dynamic
> linking.

Newline added. Making it unreachable seems a good compromise - attempts to use the dynamic linker won't work, but no special work needs to be done to adapt the existing code to Wasm yet.

>> +* Due to the type-safe nature of Wasm linkage, syscalls cannot actually be
>> +  variadic if defined externally.  Any syscalls called with a variable argument
>> +  count, and not provided here by Musl, could be fixed on a case-by-case basis
>> +  as needed.
>> \ No newline at end of file
> If variadic is a problem you could make them just always take 6 fixed
> long args and make the syscall_arch.h machinery just pass dummy zeros
> for the unused slots.

Thanks, that's exactly what I ended up doing. :)

> As noted before I think it makes sense to just drop brk. It's not
> needed.

I've done that, and implemented mmap instead.

>> +// Wasm doesn't *yet* have futex(), but it's being planned as part of the
>> +// threaded-Wasm support in Spring 2018.
>> +//
>> +// For now, Wasm is single-threaded and we simply assert that the lock is not
>> +// held, and abort if a wait would be required (assume it's a corrupted lock).
> I think this is probably a bogus assumption; you can't assume anything
> about how futex is being used to implement locks. Note that a
> perfectly valid implementation of futex is just one that always
> returns success; it will simply result in spinning at 100% cpu load
> waiting for the event to happen rather than going to sleep.

I'd argue it is valid. In Wasm, there's only one thread of execution, so if the condition isn't set now then it won't ever be. It doesn't matter that futex might have spurious wakeups - on "normal" archs futex will wait, so if a single-threaded arch like Wasm finds itself trying to wait on condition that can't ever happen, then it's reasonable to abort.

It's certainly more useful for debugging that causing a spin at 100% CPU.

Thanks very much for your feedback Rich, I'll try to keep maintaining my branch and keeping you updated on it, if you're happy with the idea that Wasm might be an accepted architecture (when the code's right).

Nick
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.