musl - Re: Further dynamic linker optimizations

Message-ID: <20150708084816.6d557b73@vostro>
Date: Wed, 8 Jul 2015 08:48:16 +0300
From: Timo Teras <timo.teras@....fi>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: Further dynamic linker optimizations

On Tue, 7 Jul 2015 14:55:05 -0400
Rich Felker <dalias@...c.org> wrote:

> On Tue, Jul 07, 2015 at 09:39:09PM +0300, Alexander Monakov wrote:
> > On Tue, 30 Jun 2015, Rich Felker wrote:
> > 
> > > Discussion on #musl with Timo Teräs has produced the following
> > > results:
> > > 
> > > - Moving bloom filter size to struct dso gives 5% improvement in
> > >   clang (built as 110 .so's) start time, simply because of a
> > >   reduction of number of instructions in the hot path. So I think
> > >   we should apply that patch.
> > 
> > I think most of the improvement here actually comes from fewer
> > cache misses. As a result, I think we should take this idea further
> > and shuffle struct dso a little bit so that fields accessed in the
> > hot find_sym loop are packed together, if possible.
> 
> I'm not entirely convinced; the 5% seems consistent with the number of
> instructions in the code path. Can you confirm this with cache miss
> measurements? Or just by obtaining better timings reordering data for
> cache locality? Note that the head of struct dso has to remain fixed
> (it's gdb ABI :/) but the rest is free to change.

I used cachegrind and callgrind to benchmark. In my case there was no
change in the number of cache misses; the speed-up came purely from
running fewer instructions on the hot path.

Though, I ran this on an i7 with a lot of cache. Cache misses could
become an issue on smaller CPUs. But I suspect the bloom filter is doing
a good enough job of keeping cache usage at sensible levels.
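
For readers who have not looked at the filter: below is a minimal
sketch of a GNU-hash style bloom filter test. It is not the musl code.
The struct bloom layout and the bloom_add / bloom_maybe_contains names
are made up for illustration, and the two-bits-per-symbol scheme is
written from the usual description of the format. The point is only
that a miss is decided by reading a single word per library.

```c
#include <stddef.h>
#include <stdint.h>

/* GNU-style symbol hash: h = h*33 + c, starting from 5381. */
static uint32_t gnu_hash(const char *s0)
{
	const unsigned char *s = (const unsigned char *)s0;
	uint32_t h = 5381;
	for (; *s; s++)
		h = h * 33 + *s;
	return h;
}

#define WORD_BITS (8 * sizeof(size_t))

/* Hypothetical stand-in for the per-library filter data. */
struct bloom {
	size_t *words;    /* filter words */
	uint32_t nwords;  /* number of words, must be a power of two */
	uint32_t shift2;  /* shift used to derive the second bit */
};

/* Set the two bits that represent hash h. */
static void bloom_add(struct bloom *b, uint32_t h)
{
	size_t *w = &b->words[(h / WORD_BITS) & (b->nwords - 1)];
	*w |= (size_t)1 << (h % WORD_BITS);
	*w |= (size_t)1 << ((h >> b->shift2) % WORD_BITS);
}

/* Returns 0 if the symbol is definitely absent, 1 if it may be present. */
static int bloom_maybe_contains(const struct bloom *b, uint32_t h)
{
	size_t w = b->words[(h / WORD_BITS) & (b->nwords - 1)];
	if (!(w & ((size_t)1 << (h % WORD_BITS))))
		return 0;
	if (!(w & ((size_t)1 << ((h >> b->shift2) % WORD_BITS))))
		return 0;
	return 1;
}
```

A lookup would call bloom_maybe_contains() first and only walk the
library's hash chains when it returns 1.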

> > > - The whole outer for loop in find_sym is the hot path for
> > >   performance. As such, eliminating the lazy calculation of
> > >   gnu_hash and simply doing it before the loop should be a
> > >   measurable win, just by removing the if (!ghm) branch.
> > 
> > On a related note, it's possible to avoid calculating sysv hash, if
> > gnu-hash is enabled system-wide, by not setting 'global' flag on
> > the vdso item (as mentioned on IRC in your conversation with Timo).
> 
> Yes, and I think this sounds like a worthwhile approach. Seeing
> timings for it would be great. :-)

I gave the numbers earlier on IRC. But on the same i7 box, running
"clang --version", which has 100+ DT_NEEDED entries, removing the vdso
and thus sysv hashing made a difference on the order of tens of
milliseconds. (I wonder how it would perform if we calculated both the
sysv and gnu hashes at the same time.)
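
To make that last idea concrete, here is a rough, unmeasured sketch of
computing both hashes in a single pass over the symbol name. The
struct sym_hashes type and the hash_both name are hypothetical. The
two reference functions are the usual textbook forms of the hashes,
included only so the combined loop can be checked against them.

```c
#include <stdint.h>

/* Reference sysv (ELF) hash, result limited to 28 bits. */
static uint32_t sysv_hash(const char *s0)
{
	const unsigned char *s = (const unsigned char *)s0;
	uint32_t h = 0;
	while (*s) {
		h = 16 * h + *s++;
		h ^= (h >> 24) & 0xf0;
	}
	return h & 0xfffffff;
}

/* Reference GNU hash: h = h*33 + c, starting from 5381. */
static uint32_t gnu_hash(const char *s0)
{
	const unsigned char *s = (const unsigned char *)s0;
	uint32_t h = 5381;
	for (; *s; s++)
		h = h * 33 + *s;
	return h;
}

/* Hypothetical result type holding both hashes of one name. */
struct sym_hashes {
	uint32_t sysv;
	uint32_t gnu;
};

/* Compute both hashes while reading the string only once. */
static struct sym_hashes hash_both(const char *s0)
{
	const unsigned char *s = (const unsigned char *)s0;
	uint32_t sysv = 0, gnu = 5381;
	struct sym_hashes r;
	for (; *s; s++) {
		sysv = 16 * sysv + *s;
		sysv ^= (sysv >> 24) & 0xf0;
		gnu = gnu * 33 + *s;
	}
	r.sysv = sysv & 0xfffffff;
	r.gnu = gnu;
	return r;
}
```

Whether this beats skipping the sysv hash entirely would need to be
measured; it only saves the second walk over the string.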

Removing the 'global' flag test and making the gnu-hash calculation
unconditional together gave a further measurable speed-up, around 5-10
milliseconds.
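
As an illustration of the "unconditional gnu-hash" change, the sketch
below contrasts a lazy and an eager variant of a lookup loop. It is a
toy model, not find_sym: struct lib, its fields, and the find_lazy /
find_eager names are invented, and each "library" defines exactly one
symbol so the loop body stays short.

```c
#include <stddef.h>
#include <stdint.h>

/* GNU-style symbol hash: h = h*33 + c, starting from 5381. */
static uint32_t gnu_hash(const char *s0)
{
	const unsigned char *s = (const unsigned char *)s0;
	uint32_t h = 5381;
	for (; *s; s++)
		h = h * 33 + *s;
	return h;
}

/* Hypothetical, heavily simplified library record. */
struct lib {
	int has_gnu_table;  /* nonzero if the library has a GNU hash table */
	uint32_t defined;   /* hash of the one symbol this toy library defines */
};

/* Lazy variant: the hash is computed on first use, so every
 * iteration pays for the "already computed?" test. */
static int find_lazy(const struct lib *libs, size_t n, const char *name)
{
	uint32_t gh = 0;
	int have_gh = 0;
	for (size_t i = 0; i < n; i++) {
		if (!libs[i].has_gnu_table)
			continue;
		if (!have_gh) {
			gh = gnu_hash(name);
			have_gh = 1;
		}
		if (libs[i].defined == gh)
			return (int)i;
	}
	return -1;
}

/* Eager variant: hash once before the loop; one branch fewer per
 * iteration, at the cost of hashing even if no library needs it. */
static int find_eager(const struct lib *libs, size_t n, const char *name)
{
	const uint32_t gh = gnu_hash(name);
	for (size_t i = 0; i < n; i++) {
		if (!libs[i].has_gnu_table)
			continue;
		if (libs[i].defined == gh)
			return (int)i;
	}
	return -1;
}
```

Both variants return the same index; the eager one simply has less
work inside the loop, which matters when the loop runs over 100+
libraries for every symbol.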

For reference, "time clang --version" on my Intel(R) Core(TM) i7-4510U:
- current musl release: ~160 ms
- current git master: ~90 ms
- ghashmask added: ~83 ms
- sysv hash calc removed: ~77 ms
- global test removed, unconditional gnu-hash: ~71 ms

As another reference, "clang --version" currently takes about 3 seconds
on a Wandboard ARM box. But I have no numbers for the speed-up on that
box.

Thanks,
Timo