musl - Re: tough choice on thread pointer initialization issue

Message-ID: <20120210170015.GH146@brightrain.aerifal.cx>
Date: Fri, 10 Feb 2012 12:00:15 -0500
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: tough choice on thread pointer initialization issue

On Fri, Feb 10, 2012 at 02:42:52PM +0400, Solar Designer wrote:
> On Fri, Feb 10, 2012 at 02:58:18AM -0500, Rich Felker wrote:
> > On Fri, Feb 10, 2012 at 11:40:02AM +0400, Solar Designer wrote:
> > > approach 4: initialize the thread pointer register to zero at program
> > > startup.  Then before its use, just check it for non-zero instead of
> > > checking a global flag.  (I presume that you're doing the latter now,
> > > based on your description of the problem.)
> > 
> > Well this is definitely feasible but it makes more arch-specific code,
> > since examining the thread register needs asm.
> 
> But accessing it needs asm anyway, no?

Well it's a little bit more complicated, but yes. The code that would
need fixing to use this approach is the "lazy init already completed"
version (__pthread_self() macro/inline) which intentionally has no
references to the init code to avoid pulling it into static bins that
don't want the extra code and data. This code would have to check that
the register is valid, and if not, back up registers (since the asm is
specified not to clobber any in the normal case) on the stack and make
a function call to a function that would get the proper value for the
thread register and fix it. This could be done without too much bloat
using some nonstandard calling convention hacks (putting the local
register save/restore code in the callee rather than caller) I
suppose. I don't claim any of this is impossible or even necessarily
too difficult to make approach 4 feasible, but it does at least
require some thinking about it.
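
As a rough illustration of the control flow approach 4 implies (not
musl's actual code), here is a portable C sketch. The thread pointer
register is modeled by an ordinary variable, and every name in it
(thread_desc, fake_tp, read_tp, init_tp_slow, sketch_pthread_self) is
hypothetical. A real version would read the register with
arch-specific inline asm and would have to preserve registers around
the slow-path call as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread descriptor; the real structure is much larger. */
struct thread_desc {
    struct thread_desc *self;
};

static struct thread_desc main_thread;

/* Stand-in for the thread pointer register, zero at program startup.
 * This is an assumption made for illustration only. */
static uintptr_t fake_tp;

/* Counts slow-path calls so the behavior can be observed. */
static int init_calls;

/* A real implementation would use inline asm here. */
static uintptr_t read_tp(void)
{
    return fake_tp;
}

/* Slow path: compute the proper thread pointer value and install it. */
static struct thread_desc *init_tp_slow(void)
{
    assert(fake_tp == 0); /* must only run while still uninitialized */
    main_thread.self = &main_thread;
    fake_tp = (uintptr_t)&main_thread;
    init_calls++;
    return &main_thread;
}

/* Fast path: test the register itself instead of a global flag. */
static inline struct thread_desc *sketch_pthread_self(void)
{
    uintptr_t tp = read_tp();
    if (!tp)
        return init_tp_slow();
    return (struct thread_desc *)tp;
}
```

Note that the fast path in this sketch references the init function
directly, which is exactly the dependency the existing
__pthread_self() avoids, so it shows the check but not a solution to
the static linking concern.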

> "Potentially" is the keyword here, but anyway I just did some testing on
> a Core 2'ish CPU (E5420), and the instruction is very fast.  I put 1000
> instances of:
> 
> movl %ss,%eax
> testl %eax,%eax
> jz 0
> 
> one after another, and this runs in just 1 cycle per the three
> instructions above (so 1000 cycles for the 1000 instances total).
> Of course, there are data dependencies here, so there's probably a
> bypass involved and the instructions are grouped differently, like:

OK, looks like performance is not an issue, at least on your hardware.

> In actual code (where you won't have repeats like this), you might want
> to insert some instructions between the mov and the test (if you can).

Unfortunately the only useful thing would be dereferencing %gs:0,
which is not a good idea if the test/jz hasn't completed... In pure
asm more could be done, but this code is only used in inline asm in C,
not in pure asm functions. After all the ugly arch-specific bugs in
glibc's asm thread primitives, I really don't want to go there.. :-)

> All of these execute in 1000 cycles total as well.  With "w" forms of
> the instructions there are extra prefixes, so I think these should
> better be avoided, even though there's no slowdown from them on this CPU.

I had no idea it was even valid to use the non-w-prefix forms with
segment registers. Learned something new. Are the high bits just
discarded (when writing) and zeroed (when reading)?

Rich
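
On the closing question: as far as I know, Intel's documentation of
MOV says that a 32-bit read of a segment register zeroes the upper 16
bits on Pentium Pro and later processors (they are undefined on
earlier 32-bit ones), and that a write uses only the low 16 bits of
the source. The sketch below, which assumes GNU C inline asm on x86
and uses hypothetical helper names, checks the read side on the
machine it runs on.

```c
#include <assert.h>
#include <stdint.h>

/* Read the %ss selector with a 32-bit mov. The register is preloaded
 * with all ones so any upper bits the CPU left untouched would show. */
static uint32_t read_ss_32(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t v = 0xffffffffu;
    __asm__ volatile ("movl %%ss, %0" : "+r"(v));
    return v;
#else
    return 0; /* not x86: placeholder so the sketch still compiles */
#endif
}

/* A selector is 16 bits wide, so bits 16..31 should be clear. */
static int upper_bits_clear(uint32_t v)
{
    return (v >> 16) == 0;
}

/* Example use: aborts if the CPU left stale upper bits in place. */
static void check_ss_read(void)
{
    assert(upper_bits_clear(read_ss_32()));
}
```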