musl - Re: using builtins within musl


Message-ID: <20260613121036.GM3520958@port70.net>
Date: Sat, 13 Jun 2026 14:10:36 +0200
From: Szabolcs Nagy <nsz@...t70.net>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: using builtins within musl

* Rich Felker <dalias@...c.org> [2026-06-12 22:10:00 -0400]:
> On Sat, Jun 13, 2026 at 03:18:42AM +0200, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@...c.org> [2026-06-12 19:11:12 -0400]:
> > > On Fri, Jun 12, 2026 at 11:43:41PM +0200, Szabolcs Nagy wrote:
> > > > builtins are supposed to improve code generation and are most useful
> > > > when a library call can be lowered to a few instructions. e.g. memcpy and
> > > > memset with a fixed small size can be a few load/store/move ops.
> > > > 
> > > > unfortunately gcc on x86_64 makes a mess of code like
> > > > 
> > > >  if (n < 64) __builtin_memcpy(d,s,n);
> > > >  if (n < 64) __builtin_memset(p,0,n);
> > > > 
> > > > libc.so .text change:
> > > >     arch   diff   size
> > > >   x86_64: +4525 667424
> > > >  riscv64:  +724 613171
> > > >  aarch64:  -432 679747
> > > >      arm:  -152 658809
> > > 
> > > Can you give a brief summary of what gcc does such a bad job of on
> > > x86_64? Does it inline something with a bunch of branching cases for
> > > different sizes or something? The results on the other archs don't
> > > look so bad.
> > 
> > x86_64 inlines more, i assume it is fast, but not the
> > best for size, e.g. try on godbolt:
> > 
> > void foo(char *d, char *s, long n)
> > {
> >     if (n < 64) __builtin_memcpy(d,s,n);
> > }
> 
> OK, I see. It looks like this behavior is controlled by the stringop
> options documented at
> https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
> 
> There may be a way to tell it to stop being stupid.
> 
> It looks like -mstringop-strategy=libcall suppresses all dynamic-n
> inlining but still inlines constant-n. I have no idea why that isn't
> default below -O3.

note: dynamic size inlining happens even at -O0 on gcc-16
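
for anyone who wants to reproduce this, the dynamic-size pattern can be put
in a standalone file. this is only a sketch: foo is the function quoted
above, while bar and small_ops_selfcheck are hypothetical names added here
for illustration, and the check only exercises non-negative sizes.

```c
#include <assert.h>
#include <string.h>

/* foo as quoted in the thread: n is only known at run time */
void foo(char *d, char *s, long n)
{
    if (n < 64) __builtin_memcpy(d, s, n);
}

/* hypothetical counterpart for the memset pattern */
void bar(char *p, long n)
{
    if (n < 64) __builtin_memset(p, 0, n);
}

/* hypothetical self-check: behaviour must be the same however
   the compiler chooses to expand the builtins */
int small_ops_selfcheck(void)
{
    char src[100], dst[100];
    memset(src, 'a', sizeof src);
    memset(dst, 'b', sizeof dst);

    foo(dst, src, 10);              /* copies 10 bytes */
    assert(dst[0] == 'a' && dst[9] == 'a' && dst[10] == 'b');

    foo(dst, src, 64);              /* 64 is not < 64: no copy */
    assert(dst[63] == 'b');

    bar(dst, 5);                    /* zeroes 5 bytes */
    assert(dst[0] == 0 && dst[4] == 0 && dst[5] == 'a');

    bar(dst, 100);                  /* too large: untouched */
    assert(dst[99] == 'b');
    return 0;
}
```

compiling this with e.g. gcc -O2 -S, with and without
-mstringop-strategy=libcall, and comparing the assembly for foo and bar
should show the inlining difference; the exact output depends on the gcc
version and target.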

with -mstringop-strategy=libcall the memcpy+memset patch adds only +162
bytes of .text compared to master on x86_64, however the relocs change like

  +1 -42 memcpy across 22 tu
 +30 -11 memset across 33 tu

the new relocs are from code like

 struct z {
   int i;
   char a[256];
 } x, y;
 void foo() {
   x = (struct z){.i=42}; // memset
   y = x; // memcpy
 }

without -mstringop-strategy=libcall, struct init was done via "rep stos"
and struct copy via "rep movs".

with it, they are done via reg load/store for small structs and
memset/memcpy calls for large ones, so the option affects fixed size ops
too, not just dynamic ones.
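
a compilable version of the struct example, as a sketch: the function is
named init_and_copy here (a hypothetical name, it is foo in the snippet
above) and only checks that the result is the same whichever expansion
the compiler picks.

```c
#include <assert.h>

struct z {
    int i;
    char a[256];
} x, y;

/* same body as foo in the snippet above */
void init_and_copy(void)
{
    x = (struct z){.i = 42};   /* zero-fills a[]: candidate for memset */
    y = x;                     /* 260-byte struct copy: candidate for memcpy */
}

/* hypothetical self-check */
int init_and_copy_selfcheck(void)
{
    x.a[10] = 'q';             /* dirty the target first */
    y.a[255] = 'q';
    init_and_copy();
    assert(x.i == 42 && x.a[10] == 0);
    assert(y.i == 42 && y.a[0] == 0 && y.a[255] == 0);
    return 0;
}
```

building it with gcc -O2 -c, with and without -mstringop-strategy=libcall,
and running readelf -r on the object file is one way to see whether memset
and memcpy relocations appear; results may differ between gcc versions.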