Date: Fri, 30 Sep 2016 06:56:15 +0200
From: Markus Wichmann
Subject: Re: Model specific optimizations?

On Thu, Sep 29, 2016 at 02:13:36PM -0400, Rich Felker wrote:
> On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote:
> > [...]
> On Linux it's supposed to be the kernel which detects availability of
> features (either by feature-specific cpu flags or translating a model
> to flags) but I don't see anything for fsqrt on ppc. :-( How/why did
> they botch this?

Maybe it's a new extension? I only know version 2.2 of the PowerPC Book.

Or maybe it goes back to the single core thing. (Only the 970 supports
it, and that's pretty new.) Or maybe Linux kernel developers aren't
interested in this problem, because a manual sqrt exists, and if need
be, anyone can just implement the Babylonian method for speed. On PPC,
it can be implemented in a loop consisting of four instructions, namely:

; .rodata
half: .double 0.5
; assuming positive finite argument
; if that can't be assumed, go through memory to inspect argument
fmr 1, 0    ; yes, halving the exponent would be a better estimate
; requires going through memory, though
lfd 2, half(13)
li 0, 6 ; or more for more accuracy
mtctr 0

1:  ; fr0 = x, fr1 = a
    fdiv 3, 1, 0    ; fr3 = a/x
    fadd 3, 3, 0    ; fr3 = x + a/x
    fmul 0, 3, 2    ; fr0 = 0.5(x + a/x)
    bdnz 1b
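
For readers who don't speak PPC assembly, the same iteration could be sketched in C roughly like this. The function name and the iteration-count parameter are made up for illustration; like the assembly, it assumes a positive finite argument and uses the argument itself as the first estimate.

```c
/* Babylonian (Heron) iteration: x <- 0.5 * (x + a/x).
 * Sketch only: assumes a is positive and finite. */
double babylonian_sqrt(double a, int iterations)
{
    double x = a;   /* crude initial estimate, the argument itself */

    for (int i = 0; i < iterations; i++)
        x = 0.5 * (x + a / x);  /* average x with a/x */

    return x;
}
```

With this starting estimate, a fixed count of six is only enough for arguments reasonably close to 1. Larger inputs need more iterations or a better first guess, which is what the remark about halving the exponent is getting at.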

So maybe there wasn't a lot of need for the hardware sqrt.

> > Well, yes, I was just throwing shit at a wall to see what sticks. We
> > could also move the function pointer dispatch into a pthread_once block
> > or something. I don't know if any caches need to be cleared then or not.
> pthread_once/call_once would be the nice clean abstraction to use, but
> it's mildly to considerably more expensive, currently involving a full
> barrier. There's a nice technical report on how that can be eliminated
> but it requires TLS, which is also expensive on some archs. In cases
> like this where there's no state other than the function pointer,
> relaxed atomics can simply be used on the reading end and then they're
> always fast.
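
A minimal sketch of what the relaxed-atomics variant might look like with C11 atomics. Only the name sqrtfn comes from this thread; sqrt_soft, sqrt_hw, cpu_has_fsqrt, sqrt_resolve and sqrt_dispatch are hypothetical stand-ins, and this is not how musl does anything today.

```c
#include <stdatomic.h>

typedef double (*sqrt_fn)(double);

/* Hypothetical stand-ins: a real libc would have a proper software
 * sqrt, an fsqrt-based one, and some way to query the CPU. */
static double sqrt_soft(double a)
{
    double x = a;               /* assumes positive finite a */
    for (int i = 0; i < 8; i++)
        x = 0.5 * (x + a / x);
    return x;
}
static double sqrt_hw(double a) { return sqrt_soft(a); }
static int cpu_has_fsqrt(void) { return 0; }

static double sqrt_resolve(double a);

/* Starts out pointing at the resolver; replaced on the first call. */
static _Atomic(sqrt_fn) sqrtfn = sqrt_resolve;

static double sqrt_resolve(double a)
{
    sqrt_fn f = cpu_has_fsqrt() ? sqrt_hw : sqrt_soft;
    /* Relaxed is enough: the pointer is the only shared state, and
     * every value it can hold is a valid function. */
    atomic_store_explicit(&sqrtfn, f, memory_order_relaxed);
    return f(a);
}

double sqrt_dispatch(double a)
{
    sqrt_fn f = atomic_load_explicit(&sqrtfn, memory_order_relaxed);
    return f(a);
}
```

If two threads race on the first call, both run the resolver and both store the same pointer, so the worst case is some duplicated detection work.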

Hmmm... not on PPC, though. TLS on Linux PPC just uses r2 as TLS
pointer. So the entire thing could be used almost as-is by making sqrtfn

> > So any PowerPC implementation is free to include it or not.
> > There are a lot of optional features, and if the gas people made a
> > different subarch for each combination of them, they'd be here all day.
> They've actually done that for some archs...

That actually made me check if they did it here, but thankfully not. gas
assembles the instruction without flags, without warning, and without a
note or anything on the output file.

> Anyway, I would have no objection right away to doing a patch like
> this that's decided at compile-time based on predefined macros set by
> -march. For runtime choice I think we need to discuss motivation. Are
> you trying to do a powerpc-based distro where you need a universal
> that works optimally on various models? Or would just
> compiling for the right -march meet your needs?
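
The compile-time variant could look something like the sketch below. It assumes that GCC predefines _ARCH_PPCSQ when the selected CPU has the optional fsqrt instruction, which should be checked against the compiler in use; sqrt_fallback and sqrt_select are hypothetical names, and the fallback is only a stand-in for a real software sqrt.

```c
/* Stand-in for the generic software sqrt; assumes positive finite a. */
static double sqrt_fallback(double a)
{
    double x = a;
    for (int i = 0; i < 8; i++)
        x = 0.5 * (x + a / x);
    return x;
}

double sqrt_select(double a)
{
#if defined(_ARCH_PPCSQ)
    /* Assumption: this macro means fsqrt is available on the target. */
    __asm__ ("fsqrt %0, %1" : "=d"(a) : "d"(a));
    return a;
#else
    return sqrt_fallback(a);
#endif
}
```

The choice is made entirely by the preprocessor, so there is no runtime cost, but a binary built this way only runs on CPUs that have the instruction.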

Just idle musings. I was reading sqrt.c, which has a flowerbox saying
"Use hardware sqrt if available" and recalled that there is a hardware
sqrt on PPC and started doing research from there. And that ended up in
the OP.

> Rich

