musl - Re: TLSDESC register-preserving mess

Message-ID: <20181010023546.GM17110@brightrain.aerifal.cx>
Date: Tue, 9 Oct 2018 22:35:46 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: TLSDESC register-preserving mess

On Tue, Oct 09, 2018 at 09:26:20PM -0400, Rich Felker wrote:
> I've run across a bit of a problem in how the TLSDESC calling
> conventions work. In the case where the needed DTV slot is not yet
> filled in for the calling thread, the dynamic TLSDESC function needs
> to call into C code that obtains the memory that was previously
> reserved for it, initializes it (involving memcpy/memset), and fills
> in the DTV entry for it. This requires saving and restoring any
> call-clobbered registers that might be used by C code.
> 
> Because the operation involves memcpy/memset, it's not just
> theoretically possible but likely that vector registers could be used.
> As written, the aarch64 and arm asm save and restore float/vector
> registers around the call, but I don't think they're future-proof
> against ISA extensions that add more such registers; if libc were
> built to use such a future ISA level, the asm we have now would be
> unsafe. The i386 and x86_64 tlsdesc asm do not presently do anything
> to save float/vector registers, and doing so would involve lots of
> hwcap mess to figure out which ones are present. I think it would also
> fail to be future-proof. Fortunately, i386 and x86_64 both provide
> non-vector asm implementations of memcpy and memset, making it less
> likely that any vector registers would be used in these code paths,
> but still not impossible. It's also a hidden constraint, that things
> only work because of the asm implementation details.
> 
> Unfortunately making a future-proof solution is really hard; this is a
> consequence of the TLSDESC ABI and the way register file extensions
> get done by cpu vendors.
> 
> One approach would be generating a fully-flattened version of
> __tls_get_new for each arch that uses TLSDESC, via gcc -S, and
> committing the output into the project as a source file.
> Unfortunately, this involves atomics whose definitions vary by ISA
> level on arm, so I think that makes it a no-go. Obviously it's also
> really ugly.
> 
> Another approach is to depend on the compiler having flags that can be
> used to build for a profile that only allows GPRs (no vector regs,
> etc.), and building __tls_get_new as its own source file using these
> flags. This is not the sort of tooling requirement I like, since it
> abandons the principle of working with an arbitrary compiler with
> minimal GNU C features.
> 
> The only approach I know that doesn't involve any tooling is having
> the dynamic TLSDESC function raise a signal when it's missing the DTV
> slot it needs. This delegates the responsibility for awareness of what
> registers need saving to the kernel, which already must be aware in
> order to perform context switching (you inherently can't run a binary
> that uses new registers on an old kernel that's not aware of them).
> This approach is nice in that it's entirely arch-agnostic, and works
> for all present and future archs and ISA/register-file extensions. The
> easy approach would just nab another SIGRTx as an
> implementation-internal signal, so that all the asm would need to do
> is a tkill syscall. Multiplexing on another signal should be possible
> but makes for more complexity and I'm not sure there's any real
> benefit.
> 
> My leaning is to go with the signal solution.

An alternate approach being proposed on #musl that I might like better
is getting rid of __tls_get_new entirely, having the DTV for all
existing threads updated at dlopen time. This requires either a
__synccall with no failure path (which we don't have) or adding a
linked list of threads. The non-__synccall approach also requires the
SYS_membarrier syscall (Linux 4.3) and emulation of it as a fallback
(which can be done via signals if you have a list of threads).
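
A rough sketch of the linked-list variant, under stated assumptions: the
struct layout, the fixed MAX_MODULES bound, and the names thread_register,
tls_install_all and tls_addr are all hypothetical and are not musl's actual
code. Locking of the thread list and the barrier that would make the new
slot visible to running threads (the membarrier step above) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MODULES 8   /* hypothetical fixed DTV size, for brevity */

/* Hypothetical, simplified thread descriptor; not musl's real layout. */
struct thread {
    struct thread *next;     /* linked list of all live threads */
    void *dtv[MAX_MODULES];  /* dtv[id] = this thread's TLS block for module id */
};

static struct thread *thread_list;

/* Put a thread on the global list (locking omitted in this sketch). */
static void thread_register(struct thread *t)
{
    memset(t->dtv, 0, sizeof t->dtv);
    t->next = thread_list;
    thread_list = t;
}

/* Lookup once every slot is filled eagerly: no "is the slot there yet?"
 * check and no call into C helper code that might clobber registers. */
static void *tls_addr(struct thread *self, size_t id, size_t off)
{
    return (char *)self->dtv[id] + off;
}

/* dlopen-time step: give every existing thread its own initialized copy
 * of the module's TLS (len bytes of image, zero-filled up to size).
 * Returns 0 on success, or -1 on allocation failure with nothing installed. */
static int tls_install_all(size_t id, const void *image, size_t len, size_t size)
{
    struct thread *t, *u;

    assert(id < MAX_MODULES && size > 0 && len <= size);
    for (t = thread_list; t; t = t->next) {
        void *p = calloc(1, size);
        if (!p) goto fail;
        memcpy(p, image, len);
        t->dtv[id] = p;
    }
    return 0;
fail:
    /* Undo the slots filled before the failure so dlopen can fail cleanly. */
    for (u = thread_list; u != t; u = u->next) {
        free(u->dtv[id]);
        u->dtv[id] = NULL;
    }
    return -1;
}
```

With two registered threads, tls_install_all(1, &init, sizeof init, size)
leaves each thread with a distinct block, and tls_addr(self, 1, 0) returns
it directly.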

Aside from solving the tlsdesc clobber issue, what I like about this
approach is that it removes all branches from __tls_get_addr and the
dynamic tlsdesc function; they just *always succeed in the hot path*.
It also makes it easier to facilitate recovery of memory allocated for
dynamic TLS if we want to -- it no longer has to be a shared block
doled out to threads via a_fetch_add, so each thread could get its own
malloc and then be able to free it at exit.
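
To illustrate the memory-recovery point, here is a minimal sketch
contrasting the two allocation styles. Everything in it is an assumption
made for illustration: struct shared_pool, pool_claim, own_alloc and
own_release are hypothetical names, and C11 atomic_fetch_add stands in for
musl's internal a_fetch_add.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Shared-block style: one region reserved up front, carved into
 * fixed-size pieces that threads claim on demand. */
struct shared_pool {
    unsigned char *base;   /* start of the reserved region */
    size_t per_thread;     /* size of each piece */
    size_t count;          /* number of pieces reserved */
    atomic_size_t next;    /* index of the next unclaimed piece */
};

/* Claim the next piece. The result is an interior pointer into the
 * shared region, so it cannot be handed to free() individually. */
static void *pool_claim(struct shared_pool *p)
{
    size_t i = atomic_fetch_add(&p->next, 1);
    assert(i < p->count);   /* the reservation is assumed to cover all threads */
    return p->base + i * p->per_thread;
}

/* Per-thread style: each thread's block is its own allocation... */
static void *own_alloc(size_t size)
{
    return calloc(1, size);
}

/* ...so it can simply be released when that thread exits. */
static void own_release(void *block)
{
    free(block);
}
```

In the first style a piece stays tied up for the lifetime of the whole
region even after its thread is gone; in the second, own_release at thread
exit returns the memory right away.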

Rich