Message-ID: <a65dfd2c-b435-4d83-89d0-abc8002db7c7@efficios.com>
Date: Fri, 12 Sep 2025 15:26:22 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Thomas Gleixner <tglx@...utronix.de>,
	LKML <linux-kernel@...r.kernel.org>
Cc: Peter Zilstra <peterz@...radead.org>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Boqun Feng <boqun.feng@...il.com>,
	Jonathan Corbet <corbet@....net>,
	Prakash Sangappa <prakash.sangappa@...cle.com>,
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Arnd Bergmann <arnd@...db.de>,
	linux-arch@...r.kernel.org,
	Florian Weimer <fweimer@...hat.com>,
	"carlos@...hat.com" <carlos@...hat.com>,
	libc-coord@...ts.openwall.com
Subject: Re: [patch 00/12] rseq: Implement time slice extension mechanism

[ For those just CC'd on this thread, the discussion is about time
  slice extension for userspace critical sections. We are specifically
  discussing the kernel ABI we plan to expose to userspace. ]

On 2025-09-12 12:31, Thomas Gleixner wrote:
> On Fri, Sep 12 2025 at 08:33, Mathieu Desnoyers wrote:
>> On 2025-09-11 16:18, Thomas Gleixner wrote:
>>> It receives SIGSEGV because that means that it did not follow the rules
>>> and stuck an arbitrary syscall into the critical section.
>>
>> Not following the rules could also be done by just looping for a long
>> time in userspace within or after the critical section, in which case
>> the timer should catch it.
>
> It's pretty much impossible to tell for the kernel without more
> overhead, whether that's actually a violation of the rules or not.
>
> The operation after the grant can be interrupted (without resulting in
> scheduling), which is out of control of the task which got the extension
> granted.
>
> The timer is there to ensure that there is an upper bound to the grant
> independent of the actual reason.
If the worst side-effect of this feature is that the slice extension is
not granted when users misbehave, IMHO this would increase the likelihood
of adoption compared to failure modes that end up killing the offending
processes.

> Going through a different syscall is an obvious deviation from the rule.

AFAIU, the grant is cleared when a signal handler is delivered, which
makes it OK for signals to issue system calls even if they are nested on
top of a granted extension critical section.

> As far I understood the earlier discussions, scheduler folks want to
> enforce that because of PREEMPT_NONE semantics, where a randomly chosen
> syscall might not result in an immediate reschedule because the work,
> which needs to be done takes arbitrary time to complete.
>
> Though that's arguably not much different from
>
>	syscall()
>	  -> tick -> NEED_RESCHED
>	  do_tons_of_work();
>	  exit_to_user()
>	    schedule();
>
> except that in the slice extension case, the latency increases by the
> slice extension time.
>
> If we allow arbitrary syscalls to terminate the grant, then we need to
> stick an immediate schedule() into the syscall entry work function. We'd
> still need the separate yield() syscall to provide a side effect free
> way of termination.
>
> I have no strong opinions either way. Peter?

If it happens to not be too bothersome to allow arbitrary system calls
to act as an implicit rseq_slice_yield() rather than result in a
segmentation fault, I think it would make this feature more widely
adopted.

Another scenario I have in mind is a userspace critical section that
would typically benefit from slice extension, but seldom requires
issuing a system call. In C and higher level languages, that can be
very much outside of the user's control, such as accessing a
global-dynamic TLS variable located within a global-dynamic shared
object, which can trigger memory allocation under the hood on first
access.
Handling a syscall within a granted extension by killing the process
will likely relegate this feature to niche use-cases.

>>>>	rseq->slice_request = true;	/* WRITE_ONCE() */
>>>>	barrier();
>>>>	critical_section();
>>>>	barrier();
>>>>	rseq->slice_request = false;	/* WRITE_ONCE() */
>>>>	if (rseq->slice_grant)		/* READ_ONCE() */
>>>>		rseq_slice_yield();
>>>
>>> That should work as it's strictly CPU local. Good point, now that you
>>> said it it's obvious :)
>>>
>>> Let me rework it accordingly.
>>
>> I have two questions wrt ABI here:
>>
>> 1) Do we expect the slice requests to be done from C and higher level
>>    languages or only from assembly ?
>
> It doesn't matter as long as the ordering is guaranteed.

OK, so I understand that you intend to target higher level languages as
well, which makes my second point (nesting) relevant.

>> 2) Slice requests are a good fit for locking. Locking typically
>>    has nesting ability.
>>
>>    We should consider making the slice request ABI an 8-bit
>>    or 16-bit nesting counter to allow nesting of its users.
>
> Making request a counter requires to keep request set when the
> extension is granted. So the states would be:
>
>	request		granted
>	   0		   0		Neutral
>	  >0		   0		Requested
>	 >=0		   1		Granted

Yes.

> That should work.
>
> Though I'm not really convinced that unconditionally embedding it into
> random locking primitives is the right thing to do.

Me neither. I wonder what would be a good approach to integrate this
with locking APIs. Here are a few ideas, some worse than others:

- Extend pthread_mutexattr_t to set whether the mutex should be
  slice-extended. Downside: if a mutex has some long and some short
  critical sections, it's really a one-size-fits-all decision for all
  critical sections of that mutex.

- Extend pthread_mutex_lock/trylock with new APIs to allow specifying
  whether slice-extension is needed for the upcoming critical section.
- Just let the pthread_mutex_lock caller explicitly request the slice
  extension *after* grabbing the lock. Downside: this opens a window of
  a few instructions where preemption can happen and slice extension
  would have been useful. Should we care ?

> The extension makes only sense, when the actual critical section is
> small and likely to complete within the extension time, which is usually
> only true for highly optimized code and not for general usage, where the
> lock held section is arbitrary long and might even result in syscalls
> even if the critical section itself does not have an obvious explicit
> syscall embedded:
>
>	lock(a)
>	  lock(b)	<- Contention results in syscall

Nested locking is another scenario where _typically_ we'd want the slice
extension for the outer lock if it is expected to be a short critical
section, and sometimes hit futex while the extension is granted and
clear the grant if this happens, without killing the process.

> Same applies for library functions within a critical section.

Yes.

> That then immediately conflicts with the yield mechanism rules, because
> the extension could have been granted _before_ the syscall happens, so
> we'd have to remove that restriction too.

Yes.

> That said, we can make the ABI a counter and split the slice control
> word into two u16. So the decision function would be:
>
>	get_usr(ctrl);
>	if (!ctrl.request)
>		return;
>	....
>	ctrl.granted = 1;
>	put_usr(ctrl);
>
> Along with documentation why this should only be used nested when you
> know what you are doing.

Yes. This would turn the end of the critical section into a
decrement-and-test-for-zero. It's only when the request counter
decrements back to zero that userspace should handle the granted flag
and yield.

>> 3) Slice requests are also a good fit for rseq critical sections.
>>
>>    Of course someone could explicitly increment/decrement the
>>    slice request counter before/after the rseq critical sections, but
>>    I think we could do better there and integrate this directly within
>>    the struct rseq_cs as a new critical section flag. Basically, a
>>    critical section with this new RSEQ_CS_SLICE_REQUEST flag (or
>>    better name) set within its descriptor flags would behave as if
>>    the slice request counter is non-zero when preempted, without
>>    requiring any extra instruction on the fast path. The only
>>    added overhead would be a check of the rseq->slice_grant flag
>>    when exiting the critical section to conditionally issue
>>    rseq_slice_yield().
>
> Plus checking first whether rseq->slice.request is actually zero,
> i.e. whether the rseq critical section was the outermost one. If not,
> you cannot invoke the yield even if granted is true, right?

Right.

> But mixing state spaces is not really a good idea at all. Let's not go
> there.

I agree, let's keep this (3) for later if there is a strong use-case
justifying the complexity.

What is important right now though is to figure out the behavior with
respect to an ongoing rseq critical section when a time slice extension
is granted: is the rseq critical section aborted, or does it keep going
on return to userspace ?

> Also you'd make checking of rseq_cs unconditional, which means extra
> work in the grant decision function as it would then have to do:
>
>	if (!usr->slice.ctrl.request) {
>		if (!usr->rseq_cs)
>			return;
>		if (!valid_ptr(usr->rseq_cs))
>			goto die;
>		if (!within(regs->ip, usr->rseq_cs.start_ip, usr->rseq_cs.offset))
>			return;
>		if (!(usr->rseq_cs.flags & REQUEST))
>			return;
>	}
>
> IOW, we'd copy half of the rseq cs handling into that code.
>
> Can we please keep it independent and simple?

Of course.
So in summary, here is my current understanding:

- It would be good to support nested slice-extension requests,

- It would be preferable to allow arbitrary system calls to cancel an
  ongoing slice-extension grant rather than kill the process, if we
  want the slice-extension feature to be useful outside of niche
  use-cases.

Thoughts ?

Thanks,

Mathieu

> Thanks,
>
>         tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com