Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251205183051.GN3520958@port70.net>
Date: Fri, 5 Dec 2025 19:30:51 +0100
From: Szabolcs Nagy <nsz@...t70.net>
To: Fabian Rast <fabian.rast@....de>
Cc: musl@...ts.openwall.com
Subject: Re: [PATCH] ldso: skip gnu hash calculation if precomputed

* Fabian Rast <fabian.rast@....de> [2025-12-05 18:37:57 +0100]:
> >> the gnu hash table only stores the upper 31 bits of the 32 bit wide
> >> hashes. the lsb is used to indicate the end of a chain.
> >> this missing bit can be guessed for the cost of traversing two chains
> >> in the worst case.
> >
> > you mean two chains per loaded dso?
> >
> > then i'd expect the optimization to be only enabled
> > if dso-index-in-symbol-lookup-order < N.
> 
> I don't quite understand what you mean by this.
...
> Searching a hashtable bucket is not that expensive, in my experience they
> are rather small.
> For example here are the number of buckets for each length in libLLVM.so:
> length: 1: 924 buckets
> length: 2: 1429 buckets
> length: 3: 1918 buckets
> length: 4: 1895 buckets
> length: 5: 1517 buckets
> length: 6: 1045 buckets
...

you have to do this for every loaded dso that's
before the dso being relocated in symbol lookup
order.

we know the relocated dso has a definition for
the symbol since it was in the hash table. so
we expect every single earlier dso to fail the
symbol lookup: symbol interposition is rare.
the point of the hash is to make such lookup
failure fast. but you made it twice as slow on
average.

eventually the gain from not computing the hash
is dominated by caculating 2*n bloom filters
(and potentially chasing twice as many hash chains)

> > how can ghm<<1 == 0 happen?
> 
> Suppose that `gh % (8*sizeof(size_t))` is 63, so ghm will be 0x8000000000000000.

we establised that gh = hash & ~1

> > but would be good to know how many dsos (and relocs etc) are involved.
> 
> These are the dsos involved:
> ```
> $ ~/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 --list ./bin/clang
> 
> /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 (0x7f4d20ed9000)
> libclang-cpp.so.22.0git => /home/fr/src/llvm-project/build/lib/libclang-cpp.so.22.0git (0x7f4d1d400000)
> libLLVM.so.22.0git => /home/fr/src/llvm-project/build/lib/libLLVM.so.22.0git (0x7f4d18e00000)
> libstdc++.so.6 => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/libstdc++.so.6 (0x7f4d18a00000)
> libgcc_s.so.1 => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/libgcc_s.so.1 (0x7f4d20ea6000)
> libc.so => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 (0x7f4d20ed9000)
> ```

that's not a lot of dso, i'd try binaries like:

/usr/bin/ffmpeg
/usr/bin/mpv
/usr/bin/gsx
/usr/lib/libreoffice/program/soffice.bin
/usr/lib/libwebkit2gtk-*
/usr/lib/firefox/libxul.so

i.e. something with n indirect dso deps, where
n is large, so 2*n bloom filter checks may
outgrow the gnu_hash computation for a short
string.

> 
> Symbol lookups: used `gh_precomp` path / total calls to find_sym2
>     10189/14955

i guess what matters is how expensive the
hash lookup per dso vs gnu_hash eval is

> 
> Approximations for symbol etc count using readelf:
> Undefined symbols:
> readelf --dyn-syms libclang-cpp.so.22.0git | grep UND -c
> 
> defined symbols:
> readelf --dyn-syms libclang-cpp.so.22.0git | grep UND -cv
> 
> i exclude relative relocations from the count because they never need a symbol
> lookup and are irrelevant.
> My libraries have relr compressed relative relocations - i counted using a text
> editor and do not have a shell one liner.
> 
> libclang has 2204 undefined and 37831 defined symbols and 4723 non-relative relocations.
> libLLVM has 340 undefined and 39300 defined symbols and 5479 non-relative relocations.
> 
> Best Regards,
> Fabian Rast


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.