Message-ID: <ecaaa8bc-1b67-43d3-8efe-cadf6c42fea4@linaro.org>
Date: Fri, 25 Jul 2025 09:27:12 -0300
From: Adhemerval Zanella Netto <adhemerval.zanella@...aro.org>
To: libc-coord@...ts.openwall.com, Sam James <sam@...too.org>,
Schrodinger ZHU Yifan <i@...yi.fan>, "Jason A. Donenfeld" <Jason@...c4.com>
Subject: Re: vdso getrandom is slower than syscall on Zen5
On 24/07/25 21:35, Sam James wrote:
> Schrodinger ZHU Yifan <i@...yi.fan> writes:
>
>> Hi,
>>
>> In some experiments, I find it interesting that if the vDSO getrandom is used to generate a long
>> sequence of random bytes, it is actually slower than the getrandom syscall. This phenomenon depends
>> on the microarchitecture and it is stable on Zen5. Is it because the kernel-space getrandom does not
>> use ChaCha20? I am not very familiar with the kernel-space implementation.
>>
>> Benchmark results are available at: https://github.com/bytecodealliance/rustix/issues/1185
>>
>
> I'd probably file this as a glibc bug (or maybe bring it up on
> libc-alpha and CC zx2c4). libc-coord is for coordination between
> different libc impls on interfaces and such.
>
This is something I noted in a presentation I did about the vDSO work [1],
and it is due to the following:
1. The vDSO was designed to optimize small buffer sizes, where the syscall
overhead dominates. For instance, when we added arc4random to glibc
(which back then just issued the syscall) we immediately received a bug
report that it had a significant impact on an NTP server [2].
2. The vDSO symbol provided by the kernel has limited optimization
compared to the code used internally by other Linux crypto subsystems.
This is mainly because the vDSO has additional security constraints,
like avoiding stack spilling to prevent leaking state information.
3. It also aims to support multiple chips without adding much additional
maintenance burden. This means that for x86_64 it uses a basic SSE2 variant
that processes one block per iteration, unlike the kernel crypto subsystem,
which has SSSE3/AVX2/AVX-512VL optimized variants that operate on multiple
blocks at a time (a sketch of the difference follows below).
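To illustrate the third point, here is a minimal sketch of how the outer loop
differs between a one-block-per-iteration implementation and a multi-block one.
This is not the actual kernel or vDSO code: chacha20_block and chacha20_4block
are hypothetical primitives, and a real implementation would also have to honor
the no-leak constraints mentioned above.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHACHA_BLOCK_SIZE 64

/* Hypothetical keystream primitives; each advances the block counter
   in state[12] by the number of blocks it produces.  */
void chacha20_block (uint32_t state[16], uint8_t out[CHACHA_BLOCK_SIZE]);
void chacha20_4block (uint32_t state[16], uint8_t out[4 * CHACHA_BLOCK_SIZE]);

/* vDSO-style loop on x86_64: one block per iteration.  */
static void
fill_one_block (uint32_t state[16], uint8_t *buf, size_t len)
{
  uint8_t block[CHACHA_BLOCK_SIZE];
  while (len > 0)
    {
      chacha20_block (state, block);
      size_t n = len < CHACHA_BLOCK_SIZE ? len : CHACHA_BLOCK_SIZE;
      memcpy (buf, block, n);
      buf += n;
      len -= n;
    }
}

/* Crypto-subsystem-style loop: four blocks per iteration, which is
   what the wider SIMD variants make profitable for large buffers.  */
static void
fill_four_blocks (uint32_t state[16], uint8_t *buf, size_t len)
{
  uint8_t blocks[4 * CHACHA_BLOCK_SIZE];
  while (len >= sizeof blocks)
    {
      chacha20_4block (state, blocks);
      memcpy (buf, blocks, sizeof blocks);
      buf += sizeof blocks;
      len -= sizeof blocks;
    }
  fill_one_block (state, buf, len);  /* Handle the tail.  */
}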
It would be possible to work around this in userland/libc, by testing the
buffer size and issuing the syscall directly for large requests; but it would
be better to add such an optimization to the kernel itself, since it has all
the information about the implementation and performance characteristics.
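A rough sketch of what such a userland dispatch could look like follows.
Here vdso_getrandom is a hypothetical wrapper around the kernel's
__vdso_getrandom entry point (which also takes opaque-state arguments),
and the 256-byte cutoff is arbitrary; the real crossover point is
microarchitecture dependent and would need benchmarking.

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper around the vDSO getrandom entry point.  */
extern ssize_t vdso_getrandom (void *buf, size_t len, unsigned int flags);

/* Hypothetical cutoff; the real crossover is microarch dependent.  */
#define VDSO_CUTOFF 256

ssize_t
my_getrandom (void *buf, size_t len, unsigned int flags)
{
  /* Large requests: go straight to the syscall so the kernel can use
     its wider SIMD ChaCha20 implementations.  */
  if (len > VDSO_CUTOFF)
    return syscall (SYS_getrandom, buf, len, flags);

  /* Small requests: the vDSO path avoids the syscall overhead.  */
  return vdso_getrandom (buf, len, flags);
}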
One possible optimization would be to add a multi-block variant (like
chacha_4block_xor_neon for arm64); and, for x86_64, some variant for
x86_64-v3 (AVX2) or x86_64-v4 (AVX-512). The kernel supports the alternatives
mechanism in the vDSO, so it should be feasible.
What I strongly do not recommend is to emulate the vDSO mechanism in userland,
as suggested in the bug report [3]. This was briefly implemented in glibc,
and Jason has written extensively about why it has multiple security drawbacks.
[1] https://www.youtube.com/watch?v=orko_lbi5FA&list=PLKZSArYQptsODycGiE0XZdVovzAwYNwtK&index=36
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=29437
[3] https://github.com/bytecodealliance/rustix/issues/1185