musl - Re: [PATCH v2] ldso: skip gnu hash calculation if precomputed

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251207184546.GP3520958@port70.net>
Date: Sun, 7 Dec 2025 19:45:46 +0100
From: Szabolcs Nagy <nsz@...t70.net>
To: Fabian Rast <fabian.rast@....de>
Cc: musl@...ts.openwall.com
Subject: Re: [PATCH v2] ldso: skip gnu hash calculation if precomputed

* Fabian Rast <fabian.rast@....de> [2025-12-06 18:31:06 +0100]:
> calculating the hashes of symbol names can take a significant amount of
> time while applying relocations. however, symbol lookups for
> symbols that are themselves part of the gnu hashtable in the object
> being relocated, the hashvalue is already present in that hashtable
> and can be efficiently retrieved using the symbol index.
> 
> the gnu hash table only stores the upper 31 bits of the 32 bit wide
> hashes. the lsb is used to indicate the end of a chain.
> this missing bit can be inferred from the position of the symbol index
> relative to the two possible chains.
> ---
>  ldso/dynlink.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/ldso/dynlink.c b/ldso/dynlink.c
> index 44d9b181..6dee7a97 100644
> --- a/ldso/dynlink.c
> +++ b/ldso/dynlink.c
> @@ -320,10 +320,23 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
>  #if defined(__GNUC__)
>  __attribute__((always_inline))
>  #endif
> -static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
> +static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps, struct dso *dso2, int sym_index)
>  {
> -	uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
> -	size_t ghm = 1ul << gh % (8*sizeof(size_t));
> +	uint32_t h = 0, gh, gho, *ght;
> +	size_t ghm;
> +
> +	if (dso2 && (ght = dso2->ghashtab) && sym_index >= ght[1] && ght[0] > 1) {
> +		uint32_t nbuckets = ght[0], *buckets = ght + 4 + ght[2] * 2;

other part of the code uses *sizeof(size_t)/4 instead of *2.

> +		uint32_t h0 = buckets[nbuckets + sym_index - ght[1]] & ~1u, h1 = h0|1;
> +		uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
> +		gh = i0 < i1 && sym_index < i1
> +		  || i0 > i1 && sym_index >= i0
> +		  || !i1 ? h0 : h1;

yeah, this became more expensive than i expected.

> +	} else gh = gnu_hash(s);
> +
> +	gho = gh / (8*sizeof(size_t));
> +	ghm = 1ul << gh % (8*sizeof(size_t));
> +
>  	struct symdef def = {0};
>  	struct dso **deps = use_deps ? dso->deps : 0;
>  	for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
...
> There some edgecases that i considered:
> 
> the gnu hashtable might only have a single bucket, in which case
> this approach cant work. (the bloom filter idea could still work here though)
> 
> In this case the hash is just calculated normally.
> 
> 
> The hashtable might contain empty chains (buckets[h % nbuckets] == 0)
> 
> It is impossible for both considered buckets to be empty. If one bucket is empty
> the correct hash must be the one indexing the other bucket.
> 
> 
> The relative position of the two chains (which one is first) is undefined.
> The chain indexed by h1 can be before the chain indexed by h0.
> Even if the buckets array is sorted, it can be that
> h0 % nbuckets == nbuckets - 1, and h1 % nbuckets == 0.
> 
> > uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
> I used two modulo operations because it was slightly faster than
> branching on (h0 % nbuckets)+1 == nbuckets. I can't confidently explain this...

yeah this is a lot of code, big cpus have
enough resources to do two % out of order
while other things happen, smaller cpus
might struggle here..

> avglen.sh
> ```
> eu-readelf --dyn-syms $1 | awk '{ total_len += length($8); total += 1 } END { print total_len/total }'
> ```
> 
> sh avglen.sh /usr/lib/libLLVM.so.21.1
> 84.3108
> sh avglen.sh /usr/lib/libclang-cpp.so.21.1
> 88.9683
> sh avglen.sh /usr/lib/firefox/libxul.so
> 24.2836
> sh avglen.sh /usr/lib/libwebkit2gtk-4.1.so.0.17.6
> 37.898
> 
> To me it looks like the very long symbol names in libclang and libLLVM are
> the cause for the speedups in the clang example, while there are only barely
> measurable/no speedups with other libraries.

yeah, name mangling..

for better stats, check indirect deps too and
only symbols referenced by relocs and count
locally defined syms that we actually optimize:

echo '/usr/bin/clang
/usr/lib/firefox/libxul.so
/usr/lib/libwebkit2gtk-4.1.so.0
/usr/bin/mpv
/usr/bin/ffmpeg
/usr/bin/gsx
/usr/lib/libreoffice/program/soffice.bin' |while read y
do
echo $y
/lib/ld-musl-x86_64.so.1 --list $y|awk '$2~/=>/{print $3}'|while read x
do
readelf --dyn-sym -rW $x |awk '
$3~/^R_/&&$5{sub(/@.*/,"");a[$5]=length($5)}
$5~/GLOBAL|WEAK/&&$7!="UND"&&$8{sub(/@.*/,"");b[$8]="opt"}
END{for(i in a) print a[i],b[i]}'
done|sort -n|awk '
/opt/{a[n++]=$1;s+=$1}
END{
p=int(100*n/NR);n1=int(n*.25);n2=int(n*.5);n3=int(n*.75)
print "opt",p"%","avg",s/n,"q1",a[n1],"med",a[n2],"q3",a[n3]
}'

/usr/bin/clang
opt 87% avg 63.064 q1 33 med 50 q3 84
/usr/lib/firefox/libxul.so
opt 51% avg 30.4272 q1 19 med 25 q3 33
/usr/lib/libwebkit2gtk-4.1.so.0
opt 59% avg 46.349 q1 23 med 33 q3 54
/usr/bin/mpv
opt 66% avg 47.1374 q1 22 med 34 q3 58
/usr/bin/ffmpeg
opt 60% avg 38.9045 q1 19 med 28 q3 48
/usr/bin/gsx
opt 52% avg 29.1147 q1 18 med 24 q3 32
/usr/lib/libreoffice/program/soffice.bin
opt 37% avg 30.5672 q1 19 med 25 q3 35

i think you need high % of syms where
the optimization is applied (to be relevant)
and long syms as well as low cache miss
rate when accessing the local gnu hash tab
(to improve perf).

> = clang
>     1.24 ± 0.38 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
> = gsx
>     1.03 ± 0.32 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/gsx
> = ffmpeg
>     1.05 ± 0.29 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/ffmpeg
> = mpv
>     1.03 ± 0.12 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/mpv
> = libreoffice
>     1.02 ± 0.16 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/libreoffice/program/soffice.bin
> = /usr/lib/libwebkit2gtk-*
>     1.05 ± 0.13 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/libwebkit2gtk-4.1.so.0
> = firefox
>     1.03 ± 0.29 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/firefox/libxul.so
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.