Message-Id: <DETAFVGQBPOG.3VHZKZZX5WH7H@tum.de>
Date: Tue, 09 Dec 2025 02:17:41 +0100
From: "Fabian Rast" <fabian.rast@....de>
To: <musl@...ts.openwall.com>
Subject: [PATCH] ldso: skip repeated symbol lookups for sorted relocations
when relocations are sorted by symbol index (-z combreloc),
we can remember the previous relocation's symbol and skip doing the
lookup again for the next relocation on the same symbol.
copy relocations are an exception: they need to resolve to
a different definition for the same symbol than regular relocations.
---
ldso/dynlink.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/ldso/dynlink.c b/ldso/dynlink.c
index 715948f4..0c9b739a 100644
--- a/ldso/dynlink.c
+++ b/ldso/dynlink.c
@@ -385,7 +385,7 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
 	const char *name;
 	void *ctx;
 	int type;
-	int sym_index;
+	int sym_index, prev_sym_index = 0;
 	struct symdef def;
 	size_t *reloc_addr;
 	size_t sym_val;
@@ -423,13 +423,19 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
 
 		sym_index = R_SYM(rel[1]);
 		if (sym_index) {
-			sym = syms + sym_index;
-			name = strings + sym->st_name;
-			ctx = type==REL_COPY ? head->syms_next : head;
-			def = (sym->st_info>>4) == STB_LOCAL
-				? (struct symdef){ .dso = dso, .sym = sym }
-				: find_sym(ctx, name, type==REL_PLT);
-			if (!def.sym) def = get_lfs64(name);
+			if (sym_index != prev_sym_index || type == REL_COPY) {
+				sym = syms + sym_index;
+				name = strings + sym->st_name;
+				if (type == REL_COPY)
+					ctx = head->syms_next, prev_sym_index = 0;
+				else
+					ctx = head, prev_sym_index = sym_index;
+				def = (sym->st_info>>4) == STB_LOCAL
+					? (struct symdef){ .dso = dso, .sym = sym }
+					: find_sym(ctx, name, type==REL_PLT);
+				if (!def.sym) def = get_lfs64(name);
+			}
+
 			if (!def.sym && (sym->st_shndx != SHN_UNDEF
 			    || sym->st_info>>4 != STB_WEAK)) {
 				if (dso->lazy && (type==REL_PLT || type==REL_GOT)) {
--
2.52.0
From the hash precomputation thread:
> I think modern linkers sort relocations by the symbol referenced,
> which allows you to bypass the lookup if the reference is the same as
> the previous relocation. We do not take advantage of this in the
> do_relocs loop at all, but we could. That would probably give far more
> performance boost than eliding the hashing and might even make eliding
> the hashing pointless.
Yes, as far as I know this was originally introduced
by Jakub Jelinek for prelink and was then integrated into linkers.
(https://people.redhat.com/jakub/prelink.pdf Section 2 mentions that
"-z combreloc is the default in GNU linker versions 2.13 and later.")
I have implemented this optimization before and considered sending the patch
for it, but didn't because the improvement was smaller than the hashing
thing in my benchmark:
My build of clang goes from 5230us to 5031us so ~4% improvement.
It is consistently faster with the optimization, but the speedup varies
between 1% and 5% from run to run...
I have also measured using poop:
Benchmark 1 (1410 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/master-install/lib/ /home/fr/src/musl/master-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.06ms ±  810us    4.96ms … 9.41ms          1 ( 0%)        0%
  peak_rss           12.7MB ±  106KB    12.2MB … 12.8MB         21 ( 1%)        0%
  cpu_cycles         10.6M  ±  756K     9.04M  … 14.4M          19 ( 1%)        0%
  instructions       19.3M  ±  839      19.3M  … 19.3M          43 ( 3%)        0%
  cache_references    461K  ± 2.69K      452K  …  481K          21 ( 1%)        0%
  cache_misses       84.0K  ± 1.29K     81.6K  … 90.4K         106 ( 8%)        0%
  branch_misses      79.1K  ±  766      77.6K  … 83.4K          12 ( 1%)        0%
Benchmark 2 (1469 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/combreloc-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.79ms ±  796us    4.52ms … 9.24ms          1 ( 0%)        ⚡-  3.9% ±  0.8%
  peak_rss           12.7MB ±  105KB    12.2MB … 12.8MB         18 ( 1%)          +  0.1% ±  0.1%
  cpu_cycles         9.98M  ±  713K     8.60M  … 14.0M           9 ( 1%)        ⚡-  5.7% ±  0.5%
  instructions       17.0M  ±  843      17.0M  … 17.0M          40 ( 3%)        ⚡- 11.8% ±  0.0%
  cache_references    459K  ± 2.44K      452K  …  475K          29 ( 2%)          -  0.4% ±  0.0%
  cache_misses       84.2K  ± 1.19K     81.4K  … 89.3K          53 ( 4%)          +  0.3% ±  0.1%
  branch_misses      77.9K  ±  736      76.3K  … 81.2K           9 ( 1%)        ⚡-  1.5% ±  0.1%
Of course I should have benchmarked with other examples as well.
Here are some stats for how often symbol lookups could be skipped (Alpine container):
clang: skip/total: 23042/48853 (47.2%)
gsx: skip/total: 3493/30846 (11.3%)
ffmpeg: skip/total: 6678/33874 (19.7%)
mpv: skip/total: 21370/67805 (31.5%)
libreoffice: skip/total: 90069/155058 (58.1%)
webkit2gtk: skip/total: 79498/137980 (57.6%)
libxul: skip/total: 3954/30715 (12.9%)
I think these somewhat match the stats for the "repeat" optimization
that Szabolcs Nagy described.
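For illustration, the skip counts can be modelled outside the dynamic linker.
This is only a minimal sketch, not the instrumentation used for the numbers
above: it assumes that "total" counts the relocations that reference a symbol,
and the count_skips helper, the string relocation types and the sample list
are made up for the example. It mirrors the condition from the patch,
including that a copy relocation always does a lookup and resets the
remembered index.
```
REL_COPY = "COPY"  # stand-in for the REL_COPY relocation type (hypothetical encoding)

def count_skips(relocs):
    """Return (skipped, total) lookups for a list of (sym_index, type) pairs."""
    prev_sym_index = 0
    skipped = total = 0
    for sym_index, rtype in relocs:
        if sym_index == 0:
            continue  # no symbol referenced, so there is no lookup to skip
        total += 1
        if sym_index != prev_sym_index or rtype == REL_COPY:
            # lookup needed; a copy relocation must not seed the cache
            prev_sym_index = 0 if rtype == REL_COPY else sym_index
        else:
            skipped += 1
    return skipped, total

# made-up relocation list, sorted by symbol index as with -z combreloc
relocs = [(3, "GOT"), (3, "SYMBOLIC"), (3, "COPY"), (3, "GOT"),
          (0, "RELATIVE"), (5, "PLT"), (5, "PLT")]
skipped, total = count_skips(relocs)
print(f"skip/total: {skipped}/{total} ({skipped / total:.1%})")
```
In this sample the lookup after the copy relocation is repeated even though
the symbol index is unchanged, which is the exception described in the commit
message.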
= Benchmarks
clang:
1.15 ± 0.23 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
gsx:
1.05 ± 0.22 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/gsx
ffmpeg:
1.11 ± 0.22 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/ffmpeg
mpv:
1.01 ± 0.16 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/mpv
libreoffice: *master is faster!*
1.01 ± 0.14 times faster than env LD_LIBRARY_PATH=/tmp/combreloc-install/lib /tmp/combreloc-install/lib/libc.so --list /usr/lib/libreoffice/program/soffice.bin
webkit2gtk: *master is faster!*
1.01 ± 0.11 times faster than env LD_LIBRARY_PATH=/tmp/combreloc-install/lib /tmp/combreloc-install/lib/libc.so --list /usr/lib/libwebkit2gtk-4.1.so.0
libxul:
1.01 ± 0.17 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/firefox/libxul.so
Better data with perf (thanks for the suggestion, Alexander Monakov!):
master -> combreloc
/usr/bin/clang
cycles: 30756242000000 (0.51) -> 25105284000000 (0.35) -18.37%
instructions: 58826349000000 (0.01) -> 40690318000000 (0.01) -30.83%
ref-cycles: 20650588000000 (1.25) -> 16854803000000 (1.1) -18.38%
duration_time: 11001741000000 (1.26) -> 9044572000000 (1.1) -17.79%
/usr/bin/gsx
cycles: 45646868000000 (0.16) -> 43899115000000 (0.19) -3.83%
instructions: 56570118000000 (0.01) -> 51588050000000 (0.01) -8.81%
ref-cycles: 30311893000000 (0.77) -> 29877390000000 (0.81) -1.43%
duration_time: 16171347000000 (0.79) -> 15971631000000 (0.84) -1.23%
/usr/bin/ffmpeg
cycles: 68478419000000 (0.26) -> 61979185000000 (0.15) -9.49%
instructions: 86857608000000 (0.0) -> 72735172000000 (0.01) -16.26%
ref-cycles: 45155884000000 (0.82) -> 40688151000000 (0.71) -9.89%
duration_time: 23782076000000 (0.83) -> 21543194000000 (0.71) -9.41%
/usr/bin/mpv
cycles: 211430891000000 (0.09) -> 180386318000000 (0.16) -14.68%
instructions: 265138940000000 (0.0) -> 190818661000000 (0.0) -28.03%
ref-cycles: 121681650000000 (0.5) -> 110154718000000 (0.35) -9.47%
duration_time: 62568064000000 (0.5) -> 56879553000000 (0.35) -9.09%
/usr/lib/libreoffice/program/soffice.bin
cycles: 175200872000000 (0.12) -> 136981740000000 (0.76) -21.81%
instructions: 266321002000000 (0.0) -> 142871044000000 (0.0) -46.35%
ref-cycles: 87984124000000 (0.71) -> 88218185000000 (0.79) 0.27%
duration_time: 45507507000000 (0.71) -> 45735531000000 (0.78) 0.5%
/usr/lib/libwebkit2gtk-4.1.so.0
cycles: 235878354000000 (0.09) -> 177998371000000 (0.08) -24.54%
instructions: 329902619000000 (0.0) -> 190358140000000 (0.0) -42.3%
ref-cycles: 109348194000000 (0.52) -> 108782087000000 (0.34) -0.52%
duration_time: 56442314000000 (0.52) -> 56263194000000 (0.34) -0.32%
/usr/lib/firefox/libxul.so
cycles: 60077791000000 (0.14) -> 56767883000000 (0.14) -5.51%
instructions: 72867821000000 (0.01) -> 65733365000000 (0.01) -9.79%
ref-cycles: 39997925000000 (0.69) -> 38364591000000 (0.65) -4.08%
duration_time: 21149401000000 (0.7) -> 20278557000000 (0.66) -4.12%
The times seem to be roughly consistent with the hyperfine measurements,
except for mpv; I am not sure what happened there.
It looks like this optimization has a big impact on instructions executed,
which may not always translate into duration. E.g. libreoffice has a big
improvement in instructions executed, but is not measurably faster.
The improvement in instructions roughly correlates with the skip ratio, which
is reassuring.
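To put a rough number on that correlation, here is a small sketch that
computes the Pearson coefficient from the figures posted in this mail. It
assumes the skip statistics and the perf runs refer to the same binaries; the
pearson helper is written out by hand only to keep the example self-contained.
```
from math import sqrt

# (skip ratio in %, reduction in instructions in %), copied from the stats above
data = {
    "clang":       (47.2, 30.83),
    "gsx":         (11.3, 8.81),
    "ffmpeg":      (19.7, 16.26),
    "mpv":         (31.5, 28.03),
    "libreoffice": (58.1, 46.35),
    "webkit2gtk":  (57.6, 42.3),
    "libxul":      (12.9, 9.79),
}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

skip, instr = zip(*data.values())
r = pearson(skip, instr)
print(f"Pearson r = {r:.3f}")
```
With only seven data points this is indicative at best.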
I think the clang in the Alpine container works a bit better than my own
build, because my build only links against libllvm and libclang,
but the Alpine build links at runtime against some more libraries:
/lib/ld-musl-x86_64.so.1 (0x7fb5b43ee000)
libclang-cpp.so.21.1 => /usr/lib/llvm21/lib/libclang-cpp.so.21.1 (0x7fb5aea9a000)
libLLVM.so.21.1 => /usr/lib/llvm21/lib/libLLVM.so.21.1 (0x7fb5a3f6b000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x7fb5a3cb9000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x7fb5a3c8d000)
libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7fb5b43ee000)
libffi.so.8 => /usr/lib/libffi.so.8 (0x7fb5a3c83000)
libz.so.1 => /usr/lib/libz.so.1 (0x7fb5a3c68000)
libzstd.so.1 => /usr/lib/libzstd.so.1 (0x7fb5a3bb9000)
libxml2.so.2 => /usr/lib/libxml2.so.2 (0x7fb5a3ab2000)
liblzma.so.5 => /usr/lib/liblzma.so.5 (0x7fb5a3a79000)
perf script for reference:
```
import subprocess
import json

def perfstat(cmdline, rep=500):
    # run the command `rep` times; the JSON records are read from stderr, one per line
    p = subprocess.run(['perf', 'stat', '-r', str(rep), '-e', 'cycles,instructions,ref-cycles,duration_time', '-j'] + cmdline, capture_output=True, check=True)
    return map(json.loads, p.stderr.decode().rstrip().split('\n'))

def b(prog):
    print(f"\n{prog}")
    p = ['--list', prog]
    # m = record from master, c = record from combreloc, paired per event
    for m, c in zip(perfstat(['/tmp/master-install/lib/libc.so']+p), perfstat(['/tmp/combreloc-install/lib/libc.so']+p)):
        # counter-value is a string; the '.' is dropped to get an integer
        v1 = int(''.join(m["counter-value"].split('.')))
        v2 = int(''.join(c["counter-value"].split('.')))
        d = round(((v2 - v1)/v1) * 100, 2)
        print(f"{m['event']}:\t", v1, f"({m['variance']})", "->", v2, f"({c['variance']}) {d}%")

print("master -> combreloc")
for x in [f"/usr/bin/{x}" for x in ["clang", "gsx", "ffmpeg", "mpv"]] + [f"/usr/lib/{x}" for x in ["libreoffice/program/soffice.bin", "libwebkit2gtk-4.1.so.0", "firefox/libxul.so"]]:
    b(x)
```
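A note on the magnitude of the numbers above: the script removes the '.' from
counter-value before converting it to an integer. The sketch below assumes
that the field is a decimal string with six fractional digits (which would
explain the trailing zeros in the output); the two sample records are
hypothetical and only shaped like what the script parses. Under that
assumption the absolute values are scaled by 10^6, while the percentage
change is unaffected.
```
import json

# hypothetical records; the values are chosen to match the clang cycles line above
master = json.loads('{"counter-value": "30756242.000000", "event": "cycles"}')
patched = json.loads('{"counter-value": "25105284.000000", "event": "cycles"}')

def strip_dot(rec):
    """What the script does: drop the '.' and parse the digits as an integer."""
    return int("".join(rec["counter-value"].split(".")))

def as_float(rec):
    """Alternative: parse the decimal string directly."""
    return float(rec["counter-value"])

def pct(v1, v2):
    return round((v2 - v1) / v1 * 100, 2)

print(strip_dot(master), as_float(master))
print(pct(strip_dot(master), strip_dot(patched)),
      pct(as_float(master), as_float(patched)))
```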