Date: Tue, 9 Oct 2018 22:35:46 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: TLSDESC register-preserving mess

On Tue, Oct 09, 2018 at 09:26:20PM -0400, Rich Felker wrote:
> I've run across a bit of a problem in how the TLSDESC calling
> conventions work. In the case where the needed DTV slot is not yet
> filled in for the calling thread, the dynamic TLSDESC function needs
> to call into C code that obtains the memory that was previously
> reserved for it, initializes it (involving memcpy/memset), and fills
> in the DTV entry for it. This requires saving and restoring any
> call-clobbered registers that might be used by C code.
> 
> Because the operation involves memcpy/memset, it's not just
> theoretically possible but likely that vector registers could be used.
> As written, the aarch64 and arm asm save and restore float/vector
> registers around the call, but I don't think they're future-proof
> against ISA extensions that add more such registers; if libc were
> built to use such a future ISA level, the asm we have now would be
> unsafe. The i386 and x86_64 tlsdesc asm do not presently do anything
> to save float/vector registers, and doing so would involve lots of
> hwcap mess to figure out which ones are present. I think it would also
> fail to be future-proof. Fortunately, i386 and x86_64 both provide
> non-vector asm implementations of memcpy and memset, making it less
> likely that any vector registers would be used in these code paths,
> but still not impossible. It's also a hidden constraint, that things
> only work because of the asm implementation details.
> 
> Unfortunately making a future-proof solution is really hard; this is a
> consequence of the TLSDESC ABI and the way register file extensions
> get done by cpu vendors.
> 
> One approach would be generating a fully-flattened version of
> __tls_get_new for each arch that uses TLSDESC, via gcc -S, and
> committing the output into the project as a source file.
> Unfortunately, this involves atomics whose definitions vary by ISA
> level on arm, so I think that makes it a no-go. Obviously it's also
> really ugly.
> 
> Another approach is to depend on the compiler having flags that can be
> used to build for a profile that only allows GPRs (no vector regs,
> etc.), and building __tls_get_new as its own source file using these
> flags. This is not the sort of tooling requirement I like, since it
> abandons the principle of working with an arbitrary compiler with
> minimal GNU C features.
> 
> The only approach I know that doesn't involve any tooling is having
> the dynamic TLSDESC function raise a signal when it's missing the DTV
> slot it needs. This delegates the responsibility for awareness of what
> registers need saving to the kernel, which already must be aware in
> order to perform context switching (you inherently can't run a binary
> that uses new registers on an old kernel that's not aware of them).
> This approach is nice in that it's entirely arch-agnostic, and works
> for all present and future archs and ISA/register-file extensions. The
> easy approach would just nab another SIGRTx as an
> implementation-internal signal, so that all the asm would need to do
> is a tkill syscall. Multiplexing on another signal should be possible
> but makes for more complexity and I'm not sure there's any real
> benefit.
> 
> My leaning is to go with the signal solution.

An alternate approach being proposed on #musl that I might like better
is getting rid of __tls_get_new entirely, having the DTV for all
existing threads updated at dlopen time. This requires either a
__synccall with no failure path (which we don't have) or adding a
linked list of threads. The non-__synccall approach also requires the
SYS_membarrier syscall (available since Linux 4.3) and an emulation of
it as a fallback (which can be done via signals if you have a list of
threads).
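
As a rough illustration of the membarrier-with-fallback idea, here is a
minimal sketch, assuming a Linux target. The helper name
barrier_all_threads and the SKETCH_ constant are hypothetical and not
musl code; the command value 1 is what <linux/membarrier.h> defines as
MEMBARRIER_CMD_SHARED. The signal-based emulation itself is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Value of MEMBARRIER_CMD_SHARED in the Linux UAPI header
 * <linux/membarrier.h>, spelled out so the sketch builds without
 * kernel headers installed. */
#define SKETCH_MEMBARRIER_CMD_SHARED 1

/* Hypothetical helper (not musl code): ask the kernel to execute a
 * memory barrier on all running threads.
 * Returns 0 on success, or -1 if the syscall is missing or refused,
 * in which case the caller would have to fall back to an emulation,
 * e.g. signalling every thread on a list of threads. */
static int barrier_all_threads(void)
{
#ifdef SYS_membarrier
	if (syscall(SYS_membarrier, SKETCH_MEMBARRIER_CMD_SHARED, 0) == 0)
		return 0;
#endif
	return -1;
}
```

A caller would try barrier_all_threads() first and only take the slower
signal path when it returns -1, so kernels older than 4.3 keep working.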

Aside from solving the tlsdesc clobber issue, what I like about this
approach is that it removes all branches from __tls_get_addr and the
dynamic tlsdesc function; they just *always succeed in the hot path*.
It also makes it easier to facilitate recovery of memory allocated for
dynamic TLS if we want to -- it no longer has to be a shared block
doled out to threads via a_fetch_add, so each thread could get its own
malloc and then be able to free it at exit.
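
To make the shape of this concrete, here is a toy model in C, not the
actual musl implementation. All names (toy_thread, toy_tls_get_addr,
toy_install_tls, toy_thread_exit) and the MAX_MODULES bound are
assumptions made for illustration. It deliberately ignores the
synchronization with running threads that the __synccall or membarrier
step would have to provide.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MODULES 8 /* arbitrary bound for this sketch */

/* Toy stand-in for a thread descriptor; not musl's struct pthread. */
struct toy_thread {
	struct toy_thread *next;  /* linked list of existing threads */
	void *dtv[MAX_MODULES];   /* dtv[m]: this thread's block for module m */
};

/* Lookup when every slot is filled at dlopen time: there is no
 * "slot missing" branch, just an index and an add. */
static void *toy_tls_get_addr(struct toy_thread *self, size_t mod, size_t off)
{
	return (char *)self->dtv[mod] + off;
}

/* At "dlopen" time: give each existing thread its own malloc'd copy
 * of the module's TLS image. Returns 0 on success; on allocation
 * failure, undoes the slots filled so far and returns -1. */
static int toy_install_tls(struct toy_thread *threads, size_t mod,
                           const void *image, size_t len)
{
	struct toy_thread *t, *u;

	assert(mod < MAX_MODULES && len > 0);
	for (t = threads; t; t = t->next) {
		void *p = malloc(len);
		if (!p) {
			for (u = threads; u != t; u = u->next) {
				free(u->dtv[mod]);
				u->dtv[mod] = 0;
			}
			return -1;
		}
		memcpy(p, image, len);
		t->dtv[mod] = p;
	}
	return 0;
}

/* At thread exit: each block is an ordinary allocation, so it can
 * simply be freed. */
static void toy_thread_exit(struct toy_thread *t)
{
	size_t i;

	for (i = 0; i < MAX_MODULES; i++) {
		free(t->dtv[i]);
		t->dtv[i] = 0;
	}
}
```

The point of the sketch is only that the failure path moves to dlopen
time, where toy_install_tls can report an error, while the lookup
itself never has to allocate or call into C code that might clobber
registers.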

Rich
