Message-ID: <ecaaa8bc-1b67-43d3-8efe-cadf6c42fea4@linaro.org>
Date: Fri, 25 Jul 2025 09:27:12 -0300
From: Adhemerval Zanella Netto <adhemerval.zanella@...aro.org>
To: libc-coord@...ts.openwall.com, Sam James <sam@...too.org>, Schrodinger ZHU Yifan <i@...yi.fan>, "Jason A. Donenfeld" <Jason@...c4.com>
Subject: Re: vdso getrandom is slower than syscall on Zen5

On 24/07/25 21:35, Sam James wrote:
> Schrodinger ZHU Yifan <i@...yi.fan> writes:
>
>> Hi,
>>
>> In some experiments, I found it interesting that when getrandom is used to generate a long
>> sequence of random bytes, the vDSO path is actually slower than the syscall. This phenomenon
>> depends on the microarchitecture, and it is stable on Zen5. Is it because the kernel-space
>> getrandom does not use ChaCha20? I am not very familiar with the kernel-space implementation.
>>
>> Benchmark results are available at: https://github.com/bytecodealliance/rustix/issues/1185
>>
>
> I'd probably file this as a glibc bug (or maybe bring it up on
> libc-alpha and CC zx2c4). libc-coord is for coordination between
> different libc implementations on interfaces and such.
>

This is something I noted in a presentation I did about the vDSO work [1], and it comes down to:

1. The vDSO was designed to optimize small buffer sizes, where the syscall overhead dominates. For instance, when we added arc4random to glibc (which back then just issued the syscall), we immediately received a bug report that it added a significant overhead on an NTP server [2].

2. The vDSO symbol provided by the kernel has limited optimization compared to the ones used internally by other Linux crypto subsystems. This is mainly because the vDSO has additional security constraints, like avoiding stack spilling to prevent leaking state information.

3. It also aims to support multiple chips without adding much maintenance burden.
This means that for x86_64 it uses a basic SSE2 variant that operates on one block per iteration, unlike the kernel crypto subsystem, which has SSSE3/AVX2/AVX-512VL optimized variants that operate on multiple blocks at a time.

It would be possible to work around this in userland/libc by testing the buffer size and issuing the syscall directly; but it would be better to add such an optimization to the kernel itself, since it has all the information about the implementation and its performance characteristics. One possible optimization would be to add a multi-block variant (like chacha_4block_xor_neon for arm64); and, for x86_64, variants for x86_64-v3 (AVX2) or x86_64-v4 (AVX-512). The kernel supports the alternatives mechanism in the vDSO, so it should be feasible.

What I strongly do not recommend is emulating the vDSO mechanism in userland, as suggested in the bug report [3]. This was briefly implemented in glibc, and Jason has written extensively about why it has multiple security drawbacks.

[1] https://www.youtube.com/watch?v=orko_lbi5FA&list=PLKZSArYQptsODycGiE0XZdVovzAwYNwtK&index=36
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=29437
[3] https://github.com/bytecodealliance/rustix/issues/1185