musl - Re: [PATCH v2] ldso: skip gnu hash calculation if precomputed

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <DF4W5AQL60YY.2N7SZII14YLAQ@tum.de>
Date: Mon, 22 Dec 2025 17:37:37 +0100
From: "Fabian Rast" <fabian.rast@....de>
To: "Rich Felker" <dalias@...c.org>
Cc: <musl@...ts.openwall.com>
Subject: Re: [PATCH v2] ldso: skip gnu hash calculation if
 precomputed

On Tue Dec 9, 2025 at 2:27 AM CET, Rich Felker wrote:
> Since the reference-only symbols aren't in the hashtable, I don't
> think there's anything useful you can do with this.

I am not sure I understand this correctly. I think that it is possible
to use the bloom filter to infer the missing hash bit just like you
proposed in the case that a lookup is performed for a symbol that is
also defined by the current dso.
If useful is supposed to refer to actual usefulness for musl in particular,
I agree that this optimization might not be useful because I consider it
a hack - the gnu hash table surely was never supposed to be used in this
way.

Just for completeness I implemented pretty much your initial proposal:
> One way to do this would be to use the bloom filter from the dso
> making the symbol reference: if only one of the two candidates passes
> it, we know which candidate was correct. I would expect this to happen
> most of the time, if the bloom filter is actually doing anything
> useful to begin with.

As people previously pointed out in the discussion, one of the two bits
tested per hash is shared between the two candidate hashes h0 and h1, which
means in the best case only one bit has to be checked, in the worst case
two bits are checked and the hash has to be calculated anyways.
Szabolcs Nagy predicted a 70%-80% success rate for this method.
This holds up in some practical examples:

- clang
Total 48253 bloom method applicable 63.3% (30549), success rate: 70.0% (21381) total: 44.3%
- gsx
Total 30320 bloom method applicable 58.3% (17665), success rate: 74.0% (13068) total: 43.1%
- ffmpeg
Total 33351 bloom method applicable 60.3% (20111), success rate: 73.6% (14811) total: 44.4%
- mpv
Total 67330 bloom method applicable 58.6% (39427), success rate: 71.4% (28170) total: 41.8%
- libreoffice
Total 154576 bloom method applicable 27.3% (42206), success rate: 75.9% (32037) total: 20.7%
- webkit2gtk
Total 137514 bloom method applicable 34.6% (47560), success rate: 75.1% (35734) total: 26.0%
- libxul
Total 30188 bloom method applicable 44.6% (13455), success rate: 75.3% (10126) total: 33.5%

The remaining benchmarks show similar results to the version with the two divides.
This optimization is most effective in clang and less in the other examples.

master -> precomp bloom
/usr/bin/clang
cycles:	 32551695 (0.66) -> 26122400 (0.65) -19.75%
instructions:	 58840422 (0.01) -> 47172812 (0.01) -19.83%
ref-cycles:	 23629302 (1.22) -> 19315295 (1.23) -18.26%
duration_time:	 12734736 (1.24) -> 10485433 (1.23) -17.66%

/usr/bin/gsx
cycles:	 45676684 (0.31) -> 43229878 (0.33) -5.36%
instructions:	 56504823 (0.01) -> 40754248 (0.02) -27.87%
ref-cycles:	 33672982 (0.98) -> 31472548 (0.87) -6.53%
duration_time:	 18184999 (1.0) -> 17036436 (0.88) -6.32%

/usr/bin/ffmpeg
cycles:	 72039503 (0.37) -> 67931365 (0.25) -5.7%
instructions:	 86761463 (0.01) -> 58112029 (0.01) -33.02%
ref-cycles:	 49514189 (1.04) -> 46358979 (1.01) -6.37%
duration_time:	 26200043 (1.04) -> 24564555 (1.02) -6.24%

/usr/bin/mpv
cycles:	 218613993 (0.24) -> 211474537 (0.13) -3.27%
instructions:	 265004308 (0.0) -> 161926673 (0.0) -38.9%
ref-cycles:	 133443129 (0.62) -> 128415393 (0.45) -3.77%
duration_time:	 68599776 (0.63) -> 66095647 (0.46) -3.65%

/usr/lib/libreoffice/program/soffice.bin
cycles:	 175438767 (0.18) -> 170710689 (0.17) -2.7%
instructions:	 266312008 (0.0) -> 199613021 (0.0) -25.05%
ref-cycles:	 110087107 (0.5) -> 106251019 (0.5) -3.48%
duration_time:	 56843553 (0.5) -> 54922004 (0.5) -3.38%

/usr/lib/libwebkit2gtk-4.1.so.0
cycles:	 234214481 (0.16) -> 224880118 (0.11) -3.99%
instructions:	 329852233 (0.0) -> 233217195 (0.0) -29.3%
ref-cycles:	 120347198 (1.09) -> 135856350 (0.6) 12.89%
duration_time:	 62053249 (1.08) -> 69948475 (0.59) 12.72%

/usr/lib/firefox/libxul.so
cycles:	 59222837 (0.27) -> 56319892 (0.27) -4.9%
instructions:	 72834785 (0.01) -> 53201233 (0.01) -26.96%
ref-cycles:	 40414897 (1.02) -> 38522864 (0.92) -4.68%
duration_time:	 21500525 (1.01) -> 20588222 (0.91) -4.24%

Interestingly, there is a big regression in perfomance loading libwebkit2gtk
that i cannot explain...

On my cpu (AMD Ryzen AI 9 365) the version with the two divides performs a
little bit better than this version:
Benchmark 1 (1471 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-bloom-install/lib/ /home/fr/src/musl/precomp-bloom-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.76ms ±  584us    4.69ms … 8.84ms         16 ( 1%)        0%
  peak_rss           12.7MB ±  101KB    12.3MB … 12.8MB         20 ( 1%)        0%
  cpu_cycles         9.18M  ±  747K     7.64M  … 12.2M          70 ( 5%)        0%
  instructions       14.9M  ±  919      14.9M  … 14.9M          48 ( 3%)        0%
  cache_references    469K  ± 4.10K      458K  …  489K          13 ( 1%)        0%
  cache_misses       82.9K  ± 1.48K     79.8K  … 90.7K         109 ( 7%)        0%
  branch_misses      80.8K  ±  781      79.3K  … 92.1K           4 ( 0%)        0%
Benchmark 2 (1533 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/precomp-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.49ms ±  656us    3.93ms … 8.82ms         58 ( 4%)        ⚡-  3.9% ±  0.7%
  peak_rss           12.7MB ±  101KB    12.3MB … 12.8MB         22 ( 1%)          -  0.0% ±  0.1%
  cpu_cycles         8.95M  ± 1.11M     7.22M  … 14.3M         129 ( 8%)        ⚡-  2.6% ±  0.7%
  instructions       14.7M  ±  857      14.7M  … 14.7M          46 ( 3%)        ⚡-  1.2% ±  0.0%
  cache_references    468K  ± 5.24K      455K  …  493K           8 ( 1%)          -  0.2% ±  0.1%
  cache_misses       83.0K  ± 1.67K     79.4K  … 89.3K          38 ( 2%)          +  0.1% ±  0.1%
  branch_misses      76.9K  ±  580      75.5K  … 78.8K           7 ( 0%)        ⚡-  4.9% ±  0.1%

I have attached the patch for reference.
Best Regards,

Fabian Rast


View attachment "v3-0001-ldso-skip-gnu-hash-calculation-if-precomputed.patch" of type "text/x-patch" (3484 bytes)

Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.