Message-ID: <ecaaa8bc-1b67-43d3-8efe-cadf6c42fea4@linaro.org>
Date: Fri, 25 Jul 2025 09:27:12 -0300
From: Adhemerval Zanella Netto <adhemerval.zanella@...aro.org>
To: libc-coord@...ts.openwall.com, Sam James <sam@...too.org>, Schrodinger ZHU Yifan <i@...yi.fan>, "Jason A. Donenfeld" <Jason@...c4.com>
Subject: Re: vdso getrandom is slower than syscall on Zen5

On 24/07/25 21:35, Sam James wrote:
> Schrodinger ZHU Yifan <i@...yi.fan> writes:
>
>> Hi,
>>
>> In some experiments, I found it interesting that when getrandom is used to generate a long
>> sequence of random bytes, the vDSO path is actually slower than the syscall. This phenomenon
>> depends on the microarchitecture, and it is stable on Zen5. Is it because the kernel-space
>> getrandom does not use ChaCha20? I am not very familiar with the kernel-space implementation.
>>
>> Benchmark results are available at: https://github.com/bytecodealliance/rustix/issues/1185
>>
>
> I'd probably file this as a glibc bug (or maybe bring it up on
> libc-alpha and CC zx2c4). libc-coord is for coordination between
> different libc implementations on interfaces and such.
>

This is something I noted in a presentation I did about the vDSO work [1], and it comes down to:

1. The vDSO was designed to optimize small buffer sizes, where the syscall overhead dominates. For instance, when we added arc4random to glibc (which back then just issued the syscall), we immediately received a bug report that it added a significant overhead on an NTP server [2].

2. The vDSO symbol provided by the kernel has limited optimization compared to the ones used internally by other Linux crypto subsystems. This is mainly because the vDSO has additional security constraints, like avoiding stack spilling to prevent leaking state information.

3. It also aims to support multiple chips without adding much maintenance burden.
This means that for x86_64 it uses a basic SSE2 variant that operates on one block per iteration, unlike the kernel crypto subsystem, which has SSSE3/AVX2/AVX-512VL optimized variants that operate on multiple blocks at a time.

It would be possible to work around this in userland/libc by testing the buffer size and issuing the syscall directly; but it would be better to add such an optimization to the kernel itself, since it has all the information about the implementation and its performance characteristics. One possible optimization would be to add a multi-block variant (like chacha_4block_xor_neon for arm64); and, for x86_64, variants for x86_64-v3 (AVX2) or x86_64-v4 (AVX-512). The kernel supports the alternatives mechanism in the vDSO, so it should be feasible.

What I strongly do not recommend is emulating the vDSO mechanism in userland, as suggested in the bug report [3]. This was briefly implemented in glibc, and Jason has written extensively about why it has multiple security drawbacks.

[1] https://www.youtube.com/watch?v=orko_lbi5FA&list=PLKZSArYQptsODycGiE0XZdVovzAwYNwtK&index=36
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=29437
[3] https://github.com/bytecodealliance/rustix/issues/1185