musl - [PATCH] ldso: skip repeated symbol lookups for sorted relocations

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <DETAFVGQBPOG.3VHZKZZX5WH7H@tum.de>
Date: Tue, 09 Dec 2025 02:17:41 +0100
From: "Fabian Rast" <fabian.rast@....de>
To: <musl@...ts.openwall.com>
Subject: [PATCH] ldso: skip repeated symbol lookups for sorted relocations

when relocations are sorted by symbol index (-z combreloc),
we can remember the previous relocations symbol and skip doing the
lookup again for the next relocation on the same symbol.

an exception to this are copy relocations that need to resolve to
a different definition for the same symbol than regular relocations.
---
 ldso/dynlink.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/ldso/dynlink.c b/ldso/dynlink.c
index 715948f4..0c9b739a 100644
--- a/ldso/dynlink.c
+++ b/ldso/dynlink.c
@@ -385,7 +385,7 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
 	const char *name;
 	void *ctx;
 	int type;
-	int sym_index;
+	int sym_index, prev_sym_index = 0;
 	struct symdef def;
 	size_t *reloc_addr;
 	size_t sym_val;
@@ -423,13 +423,19 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
 
 		sym_index = R_SYM(rel[1]);
 		if (sym_index) {
-			sym = syms + sym_index;
-			name = strings + sym->st_name;
-			ctx = type==REL_COPY ? head->syms_next : head;
-			def = (sym->st_info>>4) == STB_LOCAL
-				? (struct symdef){ .dso = dso, .sym = sym }
-				: find_sym(ctx, name, type==REL_PLT);
-			if (!def.sym) def = get_lfs64(name);
+			if (sym_index != prev_sym_index || type == REL_COPY) {
+				sym = syms + sym_index;
+				name = strings + sym->st_name;
+				if (type == REL_COPY)
+					ctx = head->syms_next, prev_sym_index = 0;
+				else
+					ctx = head, prev_sym_index = sym_index;
+				def = (sym->st_info>>4) == STB_LOCAL
+					? (struct symdef){ .dso = dso, .sym = sym }
+					: find_sym(ctx, name, type==REL_PLT);
+				if (!def.sym) def = get_lfs64(name);
+			}
+
 			if (!def.sym && (sym->st_shndx != SHN_UNDEF
 			    || sym->st_info>>4 != STB_WEAK)) {
 				if (dso->lazy && (type==REL_PLT || type==REL_GOT)) {
-- 
2.52.0

>From the hash precomputation thread:

> I think modern linkers sort relocations by the symbol referenced,
> which allows you to bypass the lookup if the reference is the same as
> the previous relocation. We do not take advantage of this in the
> do_relocs loop at all, but we could. That would probably give far more
> performance boost than eliding the hashing and might even make eliding
> the hashing pointless.

Yes, this was as far as i know originally introduced
by Jakub Jelinek in prelink, which was then integrated into linkers.
(https://people.redhat.com/jakub/prelink.pdf Section 2 mentions that
"-z combreloc is the default in GNU linker versions 2.13 and later.")

I have implemented this optimization before and considered sending the patch
for it, but didn't because the improvement was smaller than the hashing
thing in my benchmark:
My build of clang goes from 5230us to 5031us so ~4% improvement.
It is consistently faster than with the optimization, but the speedup varies
between 1% to 5% between runs...

I have also measured using poop:
Benchmark 1 (1410 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/master-install/lib/ /home/fr/src/musl/master-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.06ms ±  810us    4.96ms … 9.41ms          1 ( 0%)        0%
  peak_rss           12.7MB ±  106KB    12.2MB … 12.8MB         21 ( 1%)        0%
  cpu_cycles         10.6M  ±  756K     9.04M  … 14.4M          19 ( 1%)        0%
  instructions       19.3M  ±  839      19.3M  … 19.3M          43 ( 3%)        0%
  cache_references    461K  ± 2.69K      452K  …  481K          21 ( 1%)        0%
  cache_misses       84.0K  ± 1.29K     81.6K  … 90.4K         106 ( 8%)        0%
  branch_misses      79.1K  ±  766      77.6K  … 83.4K          12 ( 1%)        0%
Benchmark 2 (1469 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/combreloc-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.79ms ±  796us    4.52ms … 9.24ms          1 ( 0%)        ⚡-  3.9% ±  0.8%
  peak_rss           12.7MB ±  105KB    12.2MB … 12.8MB         18 ( 1%)          +  0.1% ±  0.1%
  cpu_cycles         9.98M  ±  713K     8.60M  … 14.0M           9 ( 1%)        ⚡-  5.7% ±  0.5%
  instructions       17.0M  ±  843      17.0M  … 17.0M          40 ( 3%)        ⚡- 11.8% ±  0.0%
  cache_references    459K  ± 2.44K      452K  …  475K          29 ( 2%)          -  0.4% ±  0.0%
  cache_misses       84.2K  ± 1.19K     81.4K  … 89.3K          53 ( 4%)          +  0.3% ±  0.1%
  branch_misses      77.9K  ±  736      76.3K  … 81.2K           9 ( 1%)        ⚡-  1.5% ±  0.1%


Of course I should have benchmarked with other examples as well..

Here are some stats for how often symbol lookups could be skipped: (alpine container)
clang:       skip/total: 23042/48853 (47.2%)
gsx:         skip/total: 3493/30846 (11.3%)
ffmpeg:      skip/total: 6678/33874 (19.7%)
mpv:         skip/total: 21370/67805 (31.5%)
libreoffice: skip/total: 90069/155058 (58.1%)
webkit2gtk:  skip/total: 79498/137980 (57.6%)
libxul:      skip/total: 3954/30715 (12.9%)

I think these somewhat match the stats for the "repeat" optimization
that Szabolcs Nagy described.

= Benchmarks
clang:
1.15 ± 0.23 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
gsx:
1.05 ± 0.22 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/gsx
ffmpeg:
1.11 ± 0.22 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/ffmpeg
mpv:
1.01 ± 0.16 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/mpv
libreoffice: *master is faster!*
1.01 ± 0.14 times faster than env LD_LIBRARY_PATH=/tmp/combreloc-install/lib /tmp/combreloc-install/lib/libc.so --list /usr/lib/libreoffice/program/soffice.bin
webkit2gtk: *master is faster!*
1.01 ± 0.11 times faster than env LD_LIBRARY_PATH=/tmp/combreloc-install/lib /tmp/combreloc-install/lib/libc.so --list /usr/lib/libwebkit2gtk-4.1.so.0
libxul:
1.01 ± 0.17 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/firefox/libxul.so

Better data with perf: (thanks for the suggestion, Alexander Monakov!):
master -> combreloc
/usr/bin/clang
cycles:	 30756242000000 (0.51) -> 25105284000000 (0.35) -18.37%
instructions:	 58826349000000 (0.01) -> 40690318000000 (0.01) -30.83%
ref-cycles:	 20650588000000 (1.25) -> 16854803000000 (1.1) -18.38%
duration_time:	 11001741000000 (1.26) -> 9044572000000 (1.1) -17.79%

/usr/bin/gsx
cycles:	 45646868000000 (0.16) -> 43899115000000 (0.19) -3.83%
instructions:	 56570118000000 (0.01) -> 51588050000000 (0.01) -8.81%
ref-cycles:	 30311893000000 (0.77) -> 29877390000000 (0.81) -1.43%
duration_time:	 16171347000000 (0.79) -> 15971631000000 (0.84) -1.23%

/usr/bin/ffmpeg
cycles:	 68478419000000 (0.26) -> 61979185000000 (0.15) -9.49%
instructions:	 86857608000000 (0.0) -> 72735172000000 (0.01) -16.26%
ref-cycles:	 45155884000000 (0.82) -> 40688151000000 (0.71) -9.89%
duration_time:	 23782076000000 (0.83) -> 21543194000000 (0.71) -9.41%

/usr/bin/mpv
cycles:	 211430891000000 (0.09) -> 180386318000000 (0.16) -14.68%
instructions:	 265138940000000 (0.0) -> 190818661000000 (0.0) -28.03%
ref-cycles:	 121681650000000 (0.5) -> 110154718000000 (0.35) -9.47%
duration_time:	 62568064000000 (0.5) -> 56879553000000 (0.35) -9.09%

/usr/lib/libreoffice/program/soffice.bin
cycles:	 175200872000000 (0.12) -> 136981740000000 (0.76) -21.81%
instructions:	 266321002000000 (0.0) -> 142871044000000 (0.0) -46.35%
ref-cycles:	 87984124000000 (0.71) -> 88218185000000 (0.79) 0.27%
duration_time:	 45507507000000 (0.71) -> 45735531000000 (0.78) 0.5%

/usr/lib/libwebkit2gtk-4.1.so.0
cycles:	 235878354000000 (0.09) -> 177998371000000 (0.08) -24.54%
instructions:	 329902619000000 (0.0) -> 190358140000000 (0.0) -42.3%
ref-cycles:	 109348194000000 (0.52) -> 108782087000000 (0.34) -0.52%
duration_time:	 56442314000000 (0.52) -> 56263194000000 (0.34) -0.32%

/usr/lib/firefox/libxul.so
cycles:	 60077791000000 (0.14) -> 56767883000000 (0.14) -5.51%
instructions:	 72867821000000 (0.01) -> 65733365000000 (0.01) -9.79%
ref-cycles:	 39997925000000 (0.69) -> 38364591000000 (0.65) -4.08%
duration_time:	 21149401000000 (0.7) -> 20278557000000 (0.66) -4.12%

The times seem to be roughly consistent with hyperfine measurements,
except for mpv, i am not sure what happened there.
It looks like this optimization has a big impact on instructions executed,
which may not alwyas translate into duration. E.g. libreoffice has a big
improvement in instructions executed, but is not measurably faster.
The improvement in instructions roughly correlates with the skip ratio, which
is reassuring.

I think the clang in the alpine container works a better than my own
build, because my build only links against libllvm and libclang,
but the alpine build links at runtime against some more libraries
/lib/ld-musl-x86_64.so.1 (0x7fb5b43ee000)
libclang-cpp.so.21.1 => /usr/lib/llvm21/lib/libclang-cpp.so.21.1 (0x7fb5aea9a000)
libLLVM.so.21.1 => /usr/lib/llvm21/lib/libLLVM.so.21.1 (0x7fb5a3f6b000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x7fb5a3cb9000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x7fb5a3c8d000)
libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x7fb5b43ee000)
libffi.so.8 => /usr/lib/libffi.so.8 (0x7fb5a3c83000)
libz.so.1 => /usr/lib/libz.so.1 (0x7fb5a3c68000)
libzstd.so.1 => /usr/lib/libzstd.so.1 (0x7fb5a3bb9000)
libxml2.so.2 => /usr/lib/libxml2.so.2 (0x7fb5a3ab2000)
liblzma.so.5 => /usr/lib/liblzma.so.5 (0x7fb5a3a79000)

perf script for reference:
```
import subprocess
import json

def perfstat(cmdline, rep=500):
    p = subprocess.run(['perf', 'stat', '-r', str(rep), '-e', 'cycles,instructions,ref-cycles,duration_time', '-j'] + cmdline, capture_output=True, check=True)
    return map(json.loads, p.stderr.decode().rstrip().split('\n'))

def b(prog):
    print(f"\n{prog}")
    p = ['--list', prog]
    for a, b in zip(perfstat(['/tmp/master-install/lib/libc.so']+p), perfstat(['/tmp/combreloc-install/lib/libc.so']+p)):
        v1 = int(''.join(a["counter-value"].split('.')))
        v2 = int(''.join(b["counter-value"].split('.')))
        d = round(((v2 - v1)/v1) * 100, 2)
        print(f"{a['event']}:\t", v1, f"({a['variance']})", "->", v2, f"({b['variance']}) {d}%")

print("master -> combreloc")
for x in [f"/usr/bin/{x}" for x in ["clang", "gsx", "ffmpeg", "mpv"]] + [f"/usr/lib/{x}" for x in ["libreoffice/program/soffice.bin", "libwebkit2gtk-4.1.so.0", "firefox/libxul.so"]]: b(x)
```

Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.