musl - Re: Further dynamic linker optimizations

Message-ID: <20150707185505.GI1173@brightrain.aerifal.cx>
Date: Tue, 7 Jul 2015 14:55:05 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Further dynamic linker optimizations

On Tue, Jul 07, 2015 at 09:39:09PM +0300, Alexander Monakov wrote:
> On Tue, 30 Jun 2015, Rich Felker wrote:
> 
> > Discussion on #musl with Timo Teräs has produced the following
> > results:
> > 
> > - Moving bloom filter size to struct dso gives 5% improvement in clang
> >   (built as 110 .so's) start time, simply because of a reduction of
> >   number of instructions in the hot path. So I think we should apply
> >   that patch.
> 
> I think most of the improvement here actually comes from fewer cache misses.
> As a result, I think we should take this idea further and shuffle struct dso a
> little bit so that fields accessed in the hot find_sym loop are packed
> together, if possible.

I'm not entirely convinced; the 5% seems consistent with the number of
instructions in the code path. Can you confirm this with cache miss
measurements? Or just by showing better timings after reordering data
for cache locality? Note that the head of struct dso has to remain fixed
(it's gdb ABI :/) but the rest is free to change.
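
To make the layout idea concrete, here is a minimal sketch of what
"fixed head, hot fields packed together" could look like. The struct
and every field name in it are hypothetical and only loosely modeled
on the discussion; it is not musl's actual struct dso. The helper
reports how many bytes the hot group spans, which on a typical 64-bit
ABI should fit within a single 64-byte cache line.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; not musl's struct dso. */
struct dso_sketch {
    /* Head: order must stay fixed because debuggers read it. */
    unsigned char *base;
    char *name;
    size_t *dynv;
    struct dso_sketch *next, *prev;

    /* Hot group: fields the symbol lookup loop reads for every dso. */
    uint32_t *ghashtab;
    uint32_t ghash_mask;      /* bloom filter size - 1, cached in the struct */
    void *syms;
    char *strings;
    unsigned char global;

    /* Cold group: only needed at load/unload time. */
    char *rpath;
    size_t tls_len, tls_size;
    char buf[256];
};

/* Number of bytes from the first hot field through the last one. */
size_t hot_span(void)
{
    return offsetof(struct dso_sketch, global) + sizeof(unsigned char)
         - offsetof(struct dso_sketch, ghashtab);
}
```

Whether this kind of grouping helps in practice is exactly what the
cache miss measurements would have to show.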

> > - The whole outer for loop in find_sym is the hot path for
> >   performance. As such, eliminating the lazy calculation of gnu_hash
> >   and simply doing it before the loop should be a measurable win, just
> >   by removing the if (!ghm) branch.
> 
> On a related note, it's possible to avoid calculating sysv hash, if gnu-hash
> is enabled system-wide, by not setting 'global' flag on the vdso item (as
> mentioned on IRC in your conversation with Timo).

Yes, and I think this sounds like a worthwhile approach. Seeing
timings for it would be great. :-)
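
For reference, a sketch of the two hash functions involved and of
computing them once before the loop rather than lazily inside it. The
hash functions follow the usual GNU and SysV ELF definitions;
hash_once, struct sym_hashes, and the need_sysv flag are hypothetical
names for illustration, not the dynamic linker's code.

```c
#include <stdint.h>

/* GNU hash: start at 5381, h = h*33 + c, truncated to 32 bits. */
uint32_t gnu_hash(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t h = 5381;
    for (; *p; p++)
        h = h * 33 + *p;
    return h;
}

/* SysV ELF hash as specified in the ELF gABI. */
uint32_t sysv_hash(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t h = 0, g;
    for (; *p; p++) {
        h = (h << 4) + *p;
        g = h & 0xf0000000;
        if (g)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

struct sym_hashes {
    uint32_t gnu;
    uint32_t sysv;   /* left as 0 when not needed */
};

/* Compute hashes up front so the per-dso loop has no "computed yet?"
 * branch. need_sysv is a hypothetical flag meaning that some dso in
 * the search chain has only a SysV hash table. */
struct sym_hashes hash_once(const char *name, int need_sysv)
{
    struct sym_hashes h;
    h.gnu = gnu_hash(name);
    h.sysv = need_sysv ? sysv_hash(name) : 0;
    return h;
}
```

If no dso in the chain needs the SysV table, the second hash is never
computed at all, which is the saving being discussed.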

> > - Even the check if (!dso->global) continue; has nontrivial cost.
> >   Since I want to replace this representation with a separate
> >   linked-list chain for global dsos anyway (for other reasons) I think
> >   that's worth prioritizing for performance too.
> 
> I'm curious what the other reasons are? :)

Depending on the answer to an open question I posted to the Austin
Group list (sorry, I can't get the archives to work to provide a
link), changes may be needed for semantic correctness. It's easier to
describe the issue with code. Compile the attached test case with the
following commands:

gcc -shared -fPIC -DLIB -o libA.so dlorder.c
gcc -shared -fPIC -DLIB -o libB.so dlorder.c
gcc -o dlorder dlorder.c

On musl it prints two different addresses (the subsequent dlopen with
RTLD_GLOBAL changes the definition of a symbol), which I think is
wrong, but I haven't yet checked what other implementations do.
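
On the performance side of the same item, the difference between
filtering on a flag and walking a dedicated chain can be sketched
roughly as below. The struct, the global_next link, and both function
names are hypothetical; this only illustrates the shape of the two
loops and is not a proposed patch.

```c
#include <stddef.h>

/* Hypothetical, pared-down dso record for illustration. */
struct dso_node {
    struct dso_node *next;         /* chain of every loaded dso */
    struct dso_node *global_next;  /* chain linking only global dsos */
    int global;
    int id;
};

/* Flag-based walk: every dso is touched, non-global ones are skipped. */
int visit_with_flag(const struct dso_node *head)
{
    int visited = 0;
    for (const struct dso_node *p = head; p; p = p->next) {
        if (!p->global)
            continue;
        visited++;                 /* symbol lookup would happen here */
    }
    return visited;
}

/* Chain-based walk: only global dsos are ever touched, no branch. */
int visit_with_chain(const struct dso_node *global_head)
{
    int visited = 0;
    for (const struct dso_node *p = global_head; p; p = p->global_next)
        visited++;                 /* symbol lookup would happen here */
    return visited;
}
```

Both loops visit the same set of dsos; the second just never loads or
tests the ones that are not global.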

> > - The strength-reduction of remainder operations does not seem to
> >   provide worthwhile benefits yet, simply because so little of the
> >   overall time is spent on the division/remainder.
> 
> On IRC we noted that on AArch64 it's slower than native div/mod on our
> microbenchmark, and on ARM the speedup is smaller than expected.  My testing
> on x86 indicates that it's not profitable in the dynamic linker (not sure
> why).

Agreed, but I think we do know why it's not profitable: at least in
the cases tested, the time spent on remainders is negligible anyway.
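
For anyone unfamiliar with the technique, one well-known way to
strength-reduce a remainder by a divisor that is fixed per table is to
precompute a 64-bit reciprocal and replace the division with two
multiplications. The sketch below shows that general method; it is
not necessarily the variant that was benchmarked, the function names
are made up, and it assumes a compiler with unsigned __int128 (GCC or
clang on 64-bit targets).

```c
#include <stdint.h>

/* Precompute M = floor((2^64 - 1) / d) + 1 for a divisor d > 0.
 * This would be done once per hash table, e.g. at load time. */
uint64_t mod_magic(uint32_t d)
{
    return UINT64_MAX / d + 1;
}

/* Compute a % d using only multiplications, given M = mod_magic(d). */
uint32_t fast_mod(uint32_t a, uint64_t M, uint32_t d)
{
    uint64_t low = M * a;   /* low 64 bits of M*a, i.e. frac(a/d) scaled */
    return (uint32_t)(((unsigned __int128)low * d) >> 64);
}
```

In a lookup this would stand in for something like hash % nbuckets,
with the magic value stored next to the bucket count.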

Rich

View attachment "dlorder.c" of type "text/plain" (368 bytes)