Message ID | 20230830184958.2333078-8-ankur.a.arora@oracle.com (mailing list archive)
---|---
State | New
Series | x86/clear_huge_page: multi-page clearing
On Wed, Aug 30, 2023 at 11:49:56AM -0700, Ankur Arora wrote: > +#ifdef TIF_RESCHED_ALLOW > +/* > + * allow_resched() .. disallow_resched() demarcate a preemptible section. > + * > + * Used around primitives where it might not be convenient to periodically > + * call cond_resched(). > + */ > +static inline void allow_resched(void) > +{ > + might_sleep(); > + set_tsk_thread_flag(current, TIF_RESCHED_ALLOW); So the might_sleep() ensures we're not currently having preemption disabled; but there's nothing that ensures we don't do stupid things like: allow_resched(); spin_lock(); ... spin_unlock(); disallow_resched(); Which on a PREEMPT_COUNT=n build will cause preemption while holding the spinlock. I think something like the below will cause sufficient warnings to avoid growing patterns like that. Index: linux-2.6/kernel/sched/core.c =================================================================== --- linux-2.6.orig/kernel/sched/core.c +++ linux-2.6/kernel/sched/core.c @@ -5834,6 +5834,13 @@ void preempt_count_add(int val) { #ifdef CONFIG_DEBUG_PREEMPT /* + * Disabling preemption under TIF_RESCHED_ALLOW doesn't + * work for PREEMPT_COUNT=n builds. + */ + if (WARN_ON(resched_allowed())) + return; + + /* * Underflow? */ if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
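For reference, not all of the counterpart helpers referred to throughout the rest of the thread are quoted above; reconstructed from the quoted fragments and the later hunks in this thread, they presumably look roughly like this -- a sketch, not the literal patch:

	#ifdef TIF_RESCHED_ALLOW
	/* Opt the current task in to preemption from the irq-exit path. */
	static inline void allow_resched(void)
	{
		might_sleep();		/* must not already be atomic */
		set_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
	}

	/* End of the opt-in region. */
	static inline void disallow_resched(void)
	{
		clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
	}

	/* Original form, before the fixes discussed below. */
	static __always_inline bool resched_allowed(void)
	{
		return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
	}
	#endif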
On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote: > > Which on a PREEMPT_COUNT=n build will cause preemption while holding the > spinlock. I think something like the below will cause sufficient > warnings to avoid growing patterns like that. Hmm. I don't think that warning is valid. Disabling preemption is actually fine if it's done in an interrupt, iow if we have allow_resched(); -> irq happens spin_lock(); // Ok and should *not* complain ... spin_unlock(); <- irq return (and preemption) which actually makes me worry about the nested irq case, because this would *not* be ok: allow_resched(); -> irq happens -> *nested* irq happens <- nested irq return (and preemption) ie the allow_resched() needs to still honor the irq count, and a nested irq return obviously must not cause any preemption. I've lost sight of the original patch series, and I assume / hope that the above isn't actually an issue, but exactly because I've lost sight of the original patches and only have this one in my mailbox I wanted to check. Linus
On Fri, Sep 08, 2023 at 10:15:07AM -0700, Linus Torvalds wrote: > On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote: > > > > Which on a PREEMPT_COUNT=n build will cause preemption while holding the > > spinlock. I think something like the below will cause sufficient > > warnings to avoid growing patterns like that. > > Hmm. I don't think that warning is valid. > > Disabling preemption is actually fine if it's done in an interrupt, > iow if we have > > allow_resched(); > -> irq happens > spin_lock(); // Ok and should *not* complain > ... > spin_unlock(); > <- irq return (and preemption) Indeed. > > which actually makes me worry about the nested irq case, because this > would *not* be ok: > > allow_resched(); > -> irq happens > -> *nested* irq happens > <- nested irq return (and preemption) > > ie the allow_resched() needs to still honor the irq count, and a > nested irq return obviously must not cause any preemption. I think we killed nested interrupts a fair number of years ago, but I'll recheck -- but not today, sleep is imminent.
On Fri, 8 Sept 2023 at 15:50, Peter Zijlstra <peterz@infradead.org> wrote: > > > > which actually makes me worry about the nested irq case, because this > > would *not* be ok: > > > > allow_resched(); > > -> irq happens > > -> *nested* irq happens > > <- nested irq return (and preemption) > > > > ie the allow_resched() needs to still honor the irq count, and a > > nested irq return obviously must not cause any preemption. > > I think we killed nested interrupts a fair number of years ago, but I'll > recheck -- but not today, sleep is imminent. I don't think it has to be an interrupt. I think the TIF_ALLOW_RESCHED thing needs to look out for any nested exception (ie only ever trigger if it's returning to the kernel "task" stack). Because I could easily see us wanting to do "I'm doing a big user copy, it should do TIF_ALLOW_RESCHED, and I don't have preemption on", and then instead of that first "irq happens", you have "page fault happens" instead. And inside that page fault handling you may well have critical sections (like a spinlock) that are fine - but the fact that the "process context" had TIF_ALLOW_RESCHED most certainly does *not* mean that the page fault handler can reschedule. Maybe it already does. As mentioned, I lost sight of the patch series, even though I saw it originally (and liked it - only realizing on your complaint that it might be more dangerous than I thought). Basically, the "allow resched" should be a marker for a single context level only. Kind of like a register state bit that gets saved on the exception stack. Not an "anything happening within this process is now preemptible". I'm hoping Ankur will just pipe in and say "of course I already implemented it that way, see XYZ". Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Fri, 8 Sept 2023 at 00:03, Peter Zijlstra <peterz@infradead.org> wrote: >> >> Which on a PREEMPT_COUNT=n build will cause preemption while holding the >> spinlock. I think something like the below will cause sufficient >> warnings to avoid growing patterns like that. > > Hmm. I don't think that warning is valid. > > Disabling preemption is actually fine if it's done in an interrupt, > iow if we have > > allow_resched(); > -> irq happens > spin_lock(); // Ok and should *not* complain > ... > spin_unlock(); > <- irq return (and preemption) > > which actually makes me worry about the nested irq case, because this > would *not* be ok: > > allow_resched(); > -> irq happens > -> *nested* irq happens > <- nested irq return (and preemption) > > ie the allow_resched() needs to still honor the irq count, and a > nested irq return obviously must not cause any preemption. IIUC, this should be equivalent to: 01 allow_resched(); 02 -> irq happens 03 preempt_count_add(HARDIRQ_OFFSET); 04 -> nested irq happens 05 preempt_count_add(HARDIRQ_OFFSET); 06 07 preempt_count_sub(HARDIRQ_OFFSET); 08 <- nested irq return 09 preempt_count_sub(HARDIRQ_OFFSET); So, even if there were nested interrupts, then the !preempt_count() check in raw_irqentry_exit_cond_resched() should ensure that no preemption happens until after line 09. > I've lost sight of the original patch series, and I assume / hope that > the above isn't actually an issue, but exactly because I've lost sight > of the original patches and only have this one in my mailbox I wanted > to check. Yeah, sorry about that. The irqentry_exit_allow_resched() is pretty much this: +void irqentry_exit_allow_resched(void) +{ + if (resched_allowed()) + raw_irqentry_exit_cond_resched(); +} So, as long as raw_irqentry_exit_cond_resched() won't allow early preemption, having allow_resched() set shouldn't either. -- ankur
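For readers without kernel/entry/common.c at hand, the !preempt_count() gate referred to here sits in the common irq-exit path and looks roughly like this (paraphrased from mainline of that era, not part of the patch series):

	void raw_irqentry_exit_cond_resched(void)
	{
		if (!preempt_count()) {
			/* Sanity check RCU and the thread stack */
			rcu_irq_exit_check_preempt();
			if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
				WARN_ON_ONCE(!on_thread_stack());
			if (need_resched())
				preempt_schedule_irq();
		}
	}

With the outer interrupt's HARDIRQ_OFFSET still accounted at line 08 above, preempt_count() is non-zero and the nested irq return cannot preempt.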
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Fri, 8 Sept 2023 at 15:50, Peter Zijlstra <peterz@infradead.org> wrote: >> > >> > which actually makes me worry about the nested irq case, because this >> > would *not* be ok: >> > >> > allow_resched(); >> > -> irq happens >> > -> *nested* irq happens >> > <- nested irq return (and preemption) >> > >> > ie the allow_resched() needs to still honor the irq count, and a >> > nested irq return obviously must not cause any preemption. >> >> I think we killed nested interrupts a fair number of years ago, but I'll >> recheck -- but not today, sleep is imminent. > > I don't think it has to be an interrupt. I think the TIF_ALLOW_RESCHED > thing needs to look out for any nested exception (ie only ever trigger > if it's returning to the kernel "task" stack). > > Because I could easily see us wanting to do "I'm going a big user > copy, it should do TIF_ALLOW_RESCHED, and I don't have preemption on", > and then instead of that first "irq happens", you have "page fault > happens" instead. > > And inside that page fault handling you may well have critical > sections (like a spinlock) that is fine - but the fact that the > "process context" had TIF_ALLOW_RESCHED most certainly does *not* mean > that the page fault handler can reschedule. > > Maybe it already does. As mentioned, I lost sight of the patch series, > even though I saw it originally (and liked it - only realizing on your > complaint that it migth be more dangerous than I thought). > > Basically, the "allow resched" should be a marker for a single context > level only. Kind of like a register state bit that gets saved on the > exception stack. Not a "anything happening within this process is now > preemptible". Yeah, exactly. Though, not even a single context level, but a flag attached to a single context at the process level only. Using preempt_count() == 0 as the preemption boundary. However, this has a problem with the PREEMPT_COUNT=n case because that doesn't have a preemption boundary. In the example that Peter gave: allow_resched(); spin_lock(); -> irq happens <- irq returns ---> preemption happens spin_unlock(); disallow_resched(); So, here the !preempt_count() clause in raw_irqentry_exit_cond_resched() won't protect us. My thinking was to restrict allow_resched() to be used only around primitive operations. But, I couldn't think of any way to enforce that. I think the warning in preempt_count_add() as Peter suggested upthread is a good idea. But, that's only for CONFIG_DEBUG_PREEMPT. -- ankur
On Fri, Sep 08, 2023 at 11:39:47PM -0700, Ankur Arora wrote: > Yeah, exactly. Though, not even a single context level, but a flag > attached to a single context at the process level only. Using > preempt_count() == 0 as the preemption boundary. > > However, this has a problem with the PREEMPT_COUNT=n case because that > doesn't have a preemption boundary. So, with a little sleep, the nested exception/interrupt case should be good, irqentry_enter() / irqentry_nmi_enter() unconditionally increment preempt_count with HARDIRQ_OFFSET / NMI_OFFSET. So while regular preempt_{dis,en}able() will turn into a NOP, the entry code *will* continue to increment preempt_count.
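The reason the entry-code accounting survives PREEMPT_COUNT=n is the field layout of preempt_count: only the low preemption byte stops being maintained, while hardirq/NMI nesting is always tracked. Roughly, per include/linux/preempt.h (masks quoted from memory, so treat the exact values as approximate):

	/*
	 *   PREEMPT_MASK:         0x000000ff   preempt_disable() nesting
	 *   SOFTIRQ_MASK:         0x0000ff00   softirq nesting / BH-disable
	 *   HARDIRQ_MASK:         0x000f0000   hardirq nesting (HARDIRQ_OFFSET)
	 *   NMI_MASK:             0x00f00000   NMI nesting (NMI_OFFSET)
	 *   PREEMPT_NEED_RESCHED: 0x80000000   folded NEED_RESCHED hint
	 *
	 * The entry code adds HARDIRQ_OFFSET/NMI_OFFSET on every kernel, so
	 * in_interrupt() and !preempt_count() checks keep working even when
	 * PREEMPT_COUNT=n turns preempt_disable()/preempt_enable() into no-ops.
	 */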
On Fri, Sep 08, 2023 at 10:30:57PM -0700, Ankur Arora wrote: > > which actually makes me worry about the nested irq case, because this > > would *not* be ok: > > > > allow_resched(); > > -> irq happens > > -> *nested* irq happens > > <- nested irq return (and preemption) > > > > ie the allow_resched() needs to still honor the irq count, and a > > nested irq return obviously must not cause any preemption. > > IIUC, this should be equivalent to: > > 01 allow_resched(); > 02 -> irq happens > 03 preempt_count_add(HARDIRQ_OFFSET); > 04 -> nested irq happens > 05 preempt_count_add(HARDIRQ_OFFSET); > 06 > 07 preempt_count_sub(HARDIRQ_OFFSET); > 08 <- nested irq return > 09 preempt_count_sub(HARDIRQ_OFFSET); > > So, even if there were nested interrupts, then the !preempt_count() > check in raw_irqentry_exit_cond_resched() should ensure that no > preemption happens until after line 09. Yes, this.
Peter Zijlstra <peterz@infradead.org> writes: > On Fri, Sep 08, 2023 at 11:39:47PM -0700, Ankur Arora wrote: > >> Yeah, exactly. Though, not even a single context level, but a flag >> attached to a single context at the process level only. Using >> preempt_count() == 0 as the preemption boundary. >> >> However, this has a problem with the PREEMPT_COUNT=n case because that >> doesn't have a preemption boundary. > > So, with a little sleep, the nested exception/interrupt case should be > good, irqenrty_enter() / irqentry_nmi_enter() unconditionally increment > preempt_count with HARDIRQ_OFFSET / NMI_OFFSET. > > So while regular preempt_{dis,en}able() will turn into a NOP, the entry > code *will* continue to increment preempt_count. Right, I was talking about the regular preempt_disable()/_enable() that will turn into a NOP with PREEMPT_COUNT=n. Actually, let me reply to the mail where you had described this case. -- ankur
Peter Zijlstra <peterz@infradead.org> writes: > On Wed, Aug 30, 2023 at 11:49:56AM -0700, Ankur Arora wrote: > >> +#ifdef TIF_RESCHED_ALLOW >> +/* >> + * allow_resched() .. disallow_resched() demarcate a preemptible section. >> + * >> + * Used around primitives where it might not be convenient to periodically >> + * call cond_resched(). >> + */ >> +static inline void allow_resched(void) >> +{ >> + might_sleep(); >> + set_tsk_thread_flag(current, TIF_RESCHED_ALLOW); > > So the might_sleep() ensures we're not currently having preemption > disabled; but there's nothing that ensures we don't do stupid things > like: > > allow_resched(); > spin_lock(); > ... > spin_unlock(); > disallow_resched(); > > Which on a PREEMPT_COUNT=n build will cause preemption while holding the > spinlock. I think something like the below will cause sufficient > warnings to avoid growing patterns like that. Yeah, I agree this is a problem. I'll expand on the comment above allow_resched() detailing this scenario. > Index: linux-2.6/kernel/sched/core.c > =================================================================== > --- linux-2.6.orig/kernel/sched/core.c > +++ linux-2.6/kernel/sched/core.c > @@ -5834,6 +5834,13 @@ void preempt_count_add(int val) > { > #ifdef CONFIG_DEBUG_PREEMPT > /* > + * Disabling preemption under TIF_RESCHED_ALLOW doesn't > + * work for PREEMPT_COUNT=n builds. > + */ > + if (WARN_ON(resched_allowed())) > + return; > + > + /* > * Underflow? > */ > if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0))) And, maybe something like this to guard against __this_cpu_read() etc: diff --git a/lib/smp_processor_id.c b/lib/smp_processor_id.c index a2bb7738c373..634788f16e9e 100644 --- a/lib/smp_processor_id.c +++ b/lib/smp_processor_id.c @@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2) { int this_cpu = raw_smp_processor_id(); + if (unlikely(resched_allowed())) + goto out_error; + if (likely(preempt_count())) goto out; @@ -33,6 +36,7 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2) if (system_state < SYSTEM_SCHEDULING) goto out; +out_error: /* * Avoid recursion: */ -- ankur
On Sat, 9 Sept 2023 at 13:16, Ankur Arora <ankur.a.arora@oracle.com> wrote: > > > + if (WARN_ON(resched_allowed())) > > + return; > > And, maybe something like this to guard against __this_cpu_read() > etc: > > +++ b/lib/smp_processor_id.c > @@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2) > { > int this_cpu = raw_smp_processor_id(); > > + if (unlikely(resched_allowed())) > + goto out_error; Again, both of those checks are WRONG. They'll error out even in exceptions / interrupts, when we have a preempt count already from the exception itself. So testing "resched_allowed()" that only tests the TIF_RESCHED_ALLOW bit is wrong, wrong, wrong. These situations aren't errors if we already had a preemption count for other reasons. Only trying to disable preemption when in process context (while TIF_RESCHED_ALLOW) is a problem. Your patch is missing the check for "are we in a process context" part. Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, 9 Sept 2023 at 13:16, Ankur Arora <ankur.a.arora@oracle.com> wrote: >> >> > + if (WARN_ON(resched_allowed())) >> > + return; >> >> And, maybe something like this to guard against __this_cpu_read() >> etc: >> >> +++ b/lib/smp_processor_id.c >> @@ -13,6 +13,9 @@ unsigned int check_preemption_disabled(const char *what1, const char *what2) >> { >> int this_cpu = raw_smp_processor_id(); >> >> + if (unlikely(resched_allowed())) >> + goto out_error; > > Again, both of those checks are WRONG. > > They'll error out even in exceptions / interrupts, when we have a > preempt count already from the exception itself. > > So testing "resched_allowed()" that only tests the TIF_RESCHED_ALLOW > bit is wrong, wrong, wrong. Yeah, you are right. I think we can keep these checks, but with this fixed definition of resched_allowed(). This might be better: --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void) static __always_inline bool resched_allowed(void) { - return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); + return unlikely(!preempt_count() && + test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); } Ankur > These situations aren't errors if we already had a preemption count > for other reasons. Only trying to disable preemption when in process > context (while TIF_RESCHED_ALLOW) is a problem. Your patch is missing > the check for "are we in a process context" part. > > Linus
On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote: > > I think we can keep these checks, but with this fixed definition of > resched_allowed(). This might be better: > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void) > > static __always_inline bool resched_allowed(void) > { > - return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); > + return unlikely(!preempt_count() && > + test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); > } I'm not convinced (at all) that the preempt count is the right thing. It works for interrupts, yes, because interrupts will increment the preempt count even on non-preempt kernels (since the preempt count is also the interrupt context level). But what about any synchronous trap handling? In other words, just something like a page fault? A page fault doesn't increment the preemption count (and in fact many page faults _will_ obviously re-schedule as part of waiting for IO). A page fault can *itself* say "feel free to preempt me", and that's one thing. But a page fault can also *interrupt* something that said "feel free to preempt me", and that's a completely *different* thing. So I feel like the "tsk_thread_flag" was sadly completely the wrong place to add this bit to, and the wrong place to test it in. What we really want is "current kernel entry context". So the right thing to do would basically be to put it in the stack frame at kernel entry - whether that kernel entry was a system call (which is doing some big copy that should be preemptible without us having to add "cond_resched()" in places), or is a page fault (which will also do things like big page clearings for hugepages). And we don't have that, do we? We have "on_thread_stack()", which checks for "are we on the system call stack". But that doesn't work for page faults. PeterZ - I feel like I might be either very confused, or missing something. You probably go "Duh, Linus, you're off on one of your crazy tangents, and none of this is relevant because..." Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote: >> >> I think we can keep these checks, but with this fixed definition of >> resched_allowed(). This might be better: >> >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void) >> >> static __always_inline bool resched_allowed(void) >> { >> - return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); >> + return unlikely(!preempt_count() && >> + test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); >> } > > I'm not convinced (at all) that the preempt count is the right thing. > > It works for interrupts, yes, because interrupts will increment the > preempt count even on non-preempt kernels (since the preempt count is > also the interrupt context level). > > But what about any synchronous trap handling? > > In other words, just something like a page fault? A page fault doesn't > increment the preemption count (and in fact many page faults _will_ > obviously re-schedule as part of waiting for IO). > > A page fault can *itself* say "feel free to preempt me", and that's one thing. > > But a page fault can also *interrupt* something that said "feel free to > preempt me", and that's a completely *different* thing. > > So I feel like the "tsk_thread_flag" was sadly completely the wrong > place to add this bit to, and the wrong place to test it in. What we > really want is "current kernel entry context". So, what we want allow_resched() to say is: feel free to reschedule if in a reschedulable context. The problem with doing that with an allow_resched tsk_thread_flag is that the flag is really only valid while it is executing in the context in which it was set. And, trying to validate the flag by checking the preempt_count() makes it pretty fragile, given that now we are tying it with the specifics of whether the handling of arbitrary interrupts bumps up the preempt_count() or not. > So the right thing to do would basically be to put it in the stack > frame at kernel entry - whether that kernel entry was a system call > (which is doing some big copy that should be preemptible without us > having to add "cond_resched()" in places), or is a page fault (which > will also do things like big page clearings for hugepages) Seems to me that associating an allow_resched flag with the stack also has a similar issue. Couldn't the context level change while we are on the same stack? I guess the problem is that allow_resched()/disallow_resched() really need to demarcate a section of code having some property, but instead set up state that has much wider scope. Maybe code that allows resched can be in a new .section ".text.resched" or whatever, and we could use something like this as a check: int resched_allowed(regs) { return !preempt_count() && in_resched_function(regs->rip); } (allow_resched()/disallow_resched() shouldn't be needed except for debug checks.)
We still need the !preempt_count() check, but now both the conditions in the test express two orthogonal ideas: - !preempt_count(): preemption is safe in the current context - in_resched_function(regs->rip): okay to reschedule here So in this example, it should allow scheduling inside both the clear_page_reschedulable() calls: -> page_fault() clear_page_reschedulable(); -> page_fault() clear_page_reschedulable(); Here though, rescheduling could happen only in the first call to clear_page_reschedulable(): -> page_fault() clear_page_reschedulable(); -> hardirq() -> page_fault() clear_page_reschedulable(); Does that make any sense, or am I just talking through my hat? -- ankur
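A minimal sketch of that section-plus-check idea; the section name, linker symbols and helper names below are hypothetical, used only to illustrate the shape, and are not part of the posted series:

	/* Collect opt-in functions into their own text section. */
	#define __resched_text	__section(".text.resched")

	/* Bounds of that section (a vmlinux.lds.h addition is implied). */
	extern char __start_text_resched[], __end_text_resched[];

	static inline bool in_resched_function(unsigned long ip)
	{
		return ip >= (unsigned long)__start_text_resched &&
		       ip <  (unsigned long)__end_text_resched;
	}

	/* The irq-exit test described above: the interrupted context is
	 * preemptible and the interrupted IP lies in an opt-in region. */
	static inline bool resched_allowed_at(struct pt_regs *regs)
	{
		return !preempt_count() &&
		       in_resched_function(instruction_pointer(regs));
	}

A clear_page_reschedulable() would then just be tagged __resched_text, with no allow_resched()/disallow_resched() calls around its callers.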
On Sun, 10 Sept 2023 at 03:01, Ankur Arora <ankur.a.arora@oracle.com> wrote: > > Seems to me that associating an allow_resched flag with the stack also > has a similar issue. Couldn't the context level change while we are on the > same stack? On x86-64 no, but in other situations yes. > I guess the problem is that allow_resched()/disallow_resched() really > need to demarcate a section of code having some property, but instead > set up state that has much wider scope. > > Maybe code that allows resched can be in a new .section ".text.resched" > or whatever, and we could use something like this as a check: Yes. I'm starting to think that the only sane solution is to limit cases that can do this a lot, and the "instruction pointer region" approach would certainly work. At the same time I really hate that, because I was hoping we'd be able to use this to not have so many of those annoying and random "cond_resched()" calls. I literally merged another batch of "random stupid crap" an hour ago: commit 3d7d72a34e05 ("x86/sgx: Break up long non-preemptible delays in sgx_vepc_release()") literally just adds manual 'cond_resched()' calls in random places. I was hoping that we'd have some generic way to deal with this where we could just say "this thing is reschedulable", and get rid of - or at least not increasingly add to - the cond_resched() mess. Of course, that was probably always unrealistic, and those random cond_resched() calls we just added probably couldn't just be replaced by "you can reschedule me" simply because the functions quite possibly end up taking some lock hidden in one of the xa_xyz() functions. For the _particular_ case of "give me a preemptible big memory copy / clear", the section model seems fine. It's just that we do have quite a bit of code where we can end up with long loops that want that cond_resched() too that I was hoping we'd _also_ be able to solve. Linus
On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote: > I was hoping that we'd have some generic way to deal with this where > we could just say "this thing is reschedulable", and get rid of - or > at least not increasingly add to - the cond_resched() mess. Isn't that called PREEMPT=y ? That tracks precisely all the constraints required to know when/if we can preempt. The whole voluntary preempt model is basically the traditional co-operative preemption model and that fully relies on manual yields. The problem with the REP prefix (and Xen hypercalls) is that they're long running instructions and it becomes fundamentally impossible to put a cond_resched() in. > Yes. I'm starting to think that the only sane solution is to > limit cases that can do this a lot, and the "instruction pointer > region" approach would certainly work. From a code locality / I-cache POV, I think a sorted list of (non overlapping) ranges might be best.
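For illustration, the lookup over such a sorted, non-overlapping table could be a plain binary search, much like the kernel's exception-table search; the table name and how it would be generated (objtool, linker script, ...) are assumptions of this sketch:

	struct resched_range {
		unsigned long start;
		unsigned long end;	/* exclusive */
	};

	/* Sorted by ->start, non-overlapping, built at link/boot time. */
	extern const struct resched_range resched_ranges[];
	extern const unsigned int nr_resched_ranges;

	static bool ip_in_resched_range(unsigned long ip)
	{
		unsigned int lo = 0, hi = nr_resched_ranges;

		while (lo < hi) {
			unsigned int mid = lo + (hi - lo) / 2;
			const struct resched_range *r = &resched_ranges[mid];

			if (ip < r->start)
				hi = mid;
			else if (ip >= r->end)
				lo = mid + 1;
			else
				return true;
		}
		return false;
	}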
On 11/09/2023 4:04 pm, Peter Zijlstra wrote: > On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote: > >> I was hoping that we'd have some generic way to deal with this where >> we could just say "this thing is reschedulable", and get rid of - or >> at least not increasingly add to - the cond_resched() mess. > Isn't that called PREEMPT=y ? That tracks precisely all the constraints > required to know when/if we can preempt. > > The whole voluntary preempt model is basically the traditional > co-operative preemption model and that fully relies on manual yields. > > The problem with the REP prefix (and Xen hypercalls) is that > they're long running instructions and it becomes fundamentally > impossible to put a cond_resched() in. Any VMM - Xen isn't special here. And if we're talking about instructions, then CPUID, GETSEC and ENCL{S,U} and plenty of {RD,WR}MSRs are in a similar category, being effectively blocking RPC operations to something else in the platform. The Xen evtchn upcall logic in Linux does cond_resched() when possible, i.e. long-running hypercalls issued with interrupts enabled can reschedule if an interrupt occurs, which is pretty close to how REP works too. ~Andrew
On Sat, 9 Sep 2023 21:35:54 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@oracle.com> wrote: > > > > I think we can keep these checks, but with this fixed definition of > > resched_allowed(). This might be better: > > > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void) > > > > static __always_inline bool resched_allowed(void) > > { > > - return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); > > + return unlikely(!preempt_count() && > > + test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); > > } > > I'm not convinced (at all) that the preempt count is the right thing. > > It works for interrupts, yes, because interrupts will increment the > preempt count even on non-preempt kernels (since the preempt count is > also the interrupt context level). > > But what about any synchronous trap handling? > > In other words, just something like a page fault? A page fault doesn't > increment the preemption count (and in fact many page faults _will_ > obviously re-schedule as part of waiting for IO). I wonder if we should make it a rule to not allow page faults when RESCHED_ALLOW is set? Yeah, we can preempt in page faults, but that's not what the allow_resched() is about. Since the main purpose of that function, according to the change log, is for kernel threads. Do kernel threads page fault? (perhaps for vmalloc? but do we take locks in those cases?). That is, treat allow_resched() like preempt_disable(). If we page fault with "preempt_disable()" we usually complain about that (unless we do some magic with *_nofault() functions). Then we could just add checks in the page fault handlers to see if allow_resched() is set, and if so, complain about it like we do with preempt_disable in the might_fault() function. -- Steve
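A rough sketch of that rule, with a hypothetical helper name and placement (whether it would live in the arch fault handler, in handle_mm_fault(), or in the might_fault() machinery is exactly the open question):

	/*
	 * Debug check: treat a kernel-mode page fault with TIF_RESCHED_ALLOW
	 * set like faulting with preemption disabled -- warn once and drop
	 * the flag so the rest of the fault handling stays safe.
	 */
	static inline void debug_fault_resched_allowed(struct pt_regs *regs)
	{
		if (!IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP))
			return;

		if (!user_mode(regs) && WARN_ON_ONCE(resched_allowed()))
			clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW);
	}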
Peter Zijlstra <peterz@infradead.org> writes: > On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote: > >> I was hoping that we'd have some generic way to deal with this where >> we could just say "this thing is reschedulable", and get rid of - or >> at least not increasingly add to - the cond_resched() mess. > > Isn't that called PREEMPT=y ? That tracks precisely all the constraints > required to know when/if we can preempt. > > The whole voluntary preempt model is basically the traditional > co-operative preemption model and that fully relies on manual yields. Yeah, but as Linus says, this means a lot of code is just full of cond_resched(). For instance, a loop in process_huge_page() uses this pattern: for (...) { cond_resched(); clear_page(i); cond_resched(); clear_page(j); } > The problem with the REP prefix (and Xen hypercalls) is that > they're long running instructions and it becomes fundamentally > impossible to put a cond_resched() in. > >> Yes. I'm starting to think that the only sane solution is to >> limit cases that can do this a lot, and the "instruction pointer >> region" approach would certainly work. > > From a code locality / I-cache POV, I think a sorted list of > (non overlapping) ranges might be best. Yeah, agreed. There are a few problems with doing that though. I was thinking of using a check of this kind to schedule out when it is executing in this "reschedulable" section: !preempt_count() && in_resched_function(regs->rip); For preemption=full, this should mostly work. For preemption=voluntary, though, this'll only work with out-of-line locks, not if the lock is inlined. (Both should have problems with __this_cpu_* and the like, but maybe we can handwave that away with sparse/objtool etc.) How expensive would it be to always have PREEMPT_COUNT=y? -- ankur
On Mon, 11 Sept 2023 at 09:48, Steven Rostedt <rostedt@goodmis.org> wrote: > > I wonder if we should make it a rule to not allow page faults when > RESCHED_ALLOW is set? I really think that user copies might actually be one of the prime targets. Right now we special-case big user copies - see for example copy_chunked_from_user(). But that's an example of exactly the problem this code has - we literally make more complex - and objectively *WORSE* - code just to deal with "I want this to be interruptible". So yes, we could limit RESCHED_ALLOW to not allow page faults, but big user copies literally are one of the worst problems. Another example of this is just plain read/write. It's not a problem in practice right now, because large pages are effectively never used. But just imagine what happens once filemap_read() actually does big folios? Do you really want this code: copied = copy_folio_to_iter(folio, offset, bytes, iter); to forever use the artificial chunking it does now? And yes, right now it will still do things in one-page chunks in copy_page_to_iter(). It doesn't even have cond_resched() - it's currently in the caller, in filemap_read(). But just think about possible futures. Now, one option really is to do what I think PeterZ kind of alluded to - start deprecating PREEMPT_VOLUNTARY and PREEMPT_NONE entirely. Except we've actually been *adding* to this whole mess, rather than removing it. So we have actively *expanded* on that preemption choice with PREEMPT_DYNAMIC. That's actually reasonably recent, implying that distros really want to still have the option. And it seems like it's actually server people who want the "no preemption" (and presumably avoid all the preempt count stuff entirely - it's not necessarily the *preemption* that is the cost, it's the incessant preempt count updates). Linus
On Mon, 11 Sept 2023 at 13:50, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Except we've actually been *adding* to this whole mess, rather than > removing it. So we have actively *expanded* on that preemption choice > with PREEMPT_DYNAMIC. Actually, that config option makes no sense. It makes the cond_resched() behavior conditional with a static call. But all the *real* overhead is still there and unconditional (ie all the preempt count updates and the "did it go down to zero and we need to check" code). That just seems stupid. It seems to have all the overhead of a preemptible kernel, just not doing the preemption. So I must be mis-reading this, or just missing something important. The real cost seems to be PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both will end up with that, and thus both cases will have all the spinlock preempt count stuff. There must be some non-preempt_count cost that people worry about. Or maybe I'm just mis-reading the Kconfig stuff entirely. That's possible, because this seems *so* pointless to me. Somebody please hit me with a clue-bat to the noggin. Linus
On Mon, 11 Sep 2023 13:50:53 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > And it seems like it's actually server people who want the "no > preemption" (and presumably avoid all the preempt count stuff entirely > - it's not necessarily the *preemption* that is the cost, it's the > incessant preempt count updates) I'm sure there's some overhead with the preemption itself. With the meltdown/spectre mitigations, going into and out of the kernel does add some more overhead. And finishing a system call before being preempted may give some performance benefits for some micro benchmark out there. Going out on a crazy idea, I wonder if we could get the compiler to help us here. As all preempt disabled locations are static, and as for functions, they can be called with preemption enabled or disabled. Would it be possible for the compiler to mark all locations that need preemption disabled? If a function is called in a preempt disabled section and also called in a preempt enabled section, it could make two versions of the function (one where preemption is disabled and one where it is enabled). Then all we would need is a look up table to know if preemption is safe or not by looking at the instruction pointer. Yes, I know this is kind of a wild idea, but I do believe it is possible. The compiler wouldn't need to know of the concept of "preemption", just a "make this location special, and keep functions called by that location special and duplicate them if they are called outside of this special section". ;-) -- Steve
Steven Rostedt <rostedt@goodmis.org> writes: > On Mon, 11 Sep 2023 13:50:53 -0700 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> And it seems like it's actually server people who want the "no >> preemption" (and presumably avoid all the preempt count stuff entirely >> - it's not necessarily the *preemption* that is the cost, it's the >> incessant preempt count updates) > > I'm sure there's some overhead with the preemption itself. With the > meltdown/spectre mitigations, going into and out of the kernel does add some > more overhead. And finishing a system call before being preempted may give > some performance benefits for some micro benchmark out there. > > Going out on a crazy idea, I wonder if we could get the compiler to help us > here. As all preempt disabled locations are static, and as for functions, > they can be called with preemption enabled or disabled. Would it be > possible for the compiler to mark all locations that need preemption disabled? An even crazier version of that idea would be to have preempt_disable/enable() demarcate regions, and the compiler put all of the preemption-disabled regions out-of-line in a special section. Seems to me that we could then do away with preempt_enable/disable()? (Ignoring the preempt_count used in hardirq etc.) This would allow preemption always, unless executing in the preemption-disabled section. Though I don't have any intuition for how much extra call overhead this would add. Ankur > If a function is called in a preempt disabled section and also called in a > preempt enabled section, it could make two versions of the function (one > where preemption is disabled and one where it is enabled). > > Then all we would need is a look up table to know if preemption is safe or > not by looking at the instruction pointer. > > Yes, I know this is kind of a wild idea, but I do believe it is possible. > > The compiler wouldn't need to know of the concept of "preemption", just a > "make this location special, and keep functions called by that location > special and duplicate them if they are called outside of this special > section". > > ;-) > > -- Steve
On Mon, 11 Sep 2023 16:10:31 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote: > An even crazier version of that idea would be to have > preempt_disable/enable() demarcate regions, and the compiler putting all > of the preemption disabled region out-of-line to a special section. > Seems to me, that then we could do away to preempt_enable/disable()? > (Ignoring the preempt_count used in hardirq etc.) > I thought about this too, but wasn't sure if it would be easier or harder to implement. This would still require the duplicate functions (which I guess would be the most difficult part). > This would allow preemption always, unless executing in the > preemption-disabled section. > > Though I don't have any intuition for how much extra call overhead this > would add. I don't think this version would have as high of an overhead. You would get a direct jump (which isn't bad as all speculation knows exactly where to look), and it would improve the look up. No table, just a simple range check. -- Steve
On Mon, Sep 11, 2023 at 01:50:53PM -0700, Linus Torvalds wrote: > Another example of this this is just plain read/write. It's not a > problem in practice right now, because large pages are effectively > never used. > > But just imagine what happens once filemap_read() actually does big folios? > > Do you really want this code: > > copied = copy_folio_to_iter(folio, offset, bytes, iter); > > to forever use the artificial chunking it does now? > > And yes, right now it will still do things in one-page chunks in > copy_page_to_iter(). It doesn't even have cond_resched() - it's > currently in the caller, in filemap_read(). Ah, um. If you take a look in fs/iomap/buffered-io.c, you'll see ... iomap_write_iter: size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER; struct folio *folio; bytes = min(chunk - offset, iov_iter_count(i)); if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) { copied = copy_folio_from_iter_atomic(folio, offset, bytes, i); So we do still cond_resched(), but we might go up to PMD_SIZE between calls. This is new code in 6.6 so it hasn't seen use by too many users yet, but it's certainly bigger than the 16 pages used by copy_chunked_from_user(). I honestly hadn't thought about preemption latency.
On Mon, Sep 11, 2023 at 02:16:18PM -0700, Linus Torvalds wrote: > On Mon, 11 Sept 2023 at 13:50, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Except we've actually been *adding* to this whole mess, rather than > > removing it. So we have actively *expanded* on that preemption choice > > with PREEMPT_DYNAMIC. > > Actually, that config option makes no sense. > > It makes the cond_resched() behavior conditional with a static call. > > But all the *real* overhead is still there and unconditional (ie all > the preempt count updates and the "did it go down to zero and we need > to check" code). > > That just seems stupid. It seems to have all the overhead of a > preemptible kernel, just not doing the preemption. > > So I must be mis-reading this, or just missing something important. > > The real cost seems to be > > PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT > > and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both > will end up with that, and thus both cases will have all the spinlock > preempt count stuff. > > There must be some non-preempt_count cost that people worry about. > > Or maybe I'm just mis-reading the Kconfig stuff entirely. That's > possible, because this seems *so* pointless to me. > > Somebody please hit me with a clue-bat to the noggin. Well, I was about to reply to your previous email explaining this, but this one time I did read more email.. Yes, PREEMPT_DYNAMIC has all the preempt count twiddling and only nops out the schedule()/cond_resched() calls where appropriate. This work was done by a distro (SuSE) and if they're willing to ship this I'm thinking the overheads are acceptable to them. For a significant number of workloads the real overhead is the extra preemptions themselves more than the counting -- but yes, the counting is measurable, but probably in the noise compared to some of the other horrible things we have done the past years. Anyway, if distros are fine shipping with PREEMPT_DYNAMIC, then yes, deleting the other options is definitely an option.
* Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Sep 11, 2023 at 02:16:18PM -0700, Linus Torvalds wrote: > > On Mon, 11 Sept 2023 at 13:50, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > Except we've actually been *adding* to this whole mess, rather than > > > removing it. So we have actively *expanded* on that preemption choice > > > with PREEMPT_DYNAMIC. > > > > Actually, that config option makes no sense. > > > > It makes the sched_cond() behavior conditional with a static call. > > > > But all the *real* overhead is still there and unconditional (ie all > > the preempt count updates and the "did it go down to zero and we need > > to check" code). > > > > That just seems stupid. It seems to have all the overhead of a > > preemptible kernel, just not doing the preemption. > > > > So I must be mis-reading this, or just missing something important. > > > > The real cost seems to be > > > > PREEMPT_BUILD -> PREEMPTION -> PREEMPT_COUNT > > > > and PREEMPT vs PREEMPT_DYNAMIC makes no difference to that, since both > > will end up with that, and thus both cases will have all the spinlock > > preempt count stuff. > > > > There must be some non-preempt_count cost that people worry about. > > > > Or maybe I'm just mis-reading the Kconfig stuff entirely. That's > > possible, because this seems *so* pointless to me. > > > > Somebody please hit me with a clue-bat to the noggin. > > Well, I was about to reply to your previous email explaining this, but > this one time I did read more email.. > > Yes, PREEMPT_DYNAMIC has all the preempt count twiddling and only nops > out the schedule()/cond_resched() calls where appropriate. > > This work was done by a distro (SuSE) and if they're willing to ship this > I'm thinking the overheads are acceptable to them. > > For a significant number of workloads the real overhead is the extra > preepmtions themselves more than the counting -- but yes, the counting is > measurable, but probably in the noise compared to other some of the other > horrible things we have done the past years. > > Anyway, if distros are fine shipping with PREEMPT_DYNAMIC, then yes, > deleting the other options are definitely an option. Yes, so my understanding is that distros generally worry more about macro-overhead, for example material changes to a random subset of key benchmarks that specific enterprise customers care about, and distros are not nearly as sensitive about micro-overhead that preempt_count() maintenance causes. PREEMPT_DYNAMIC is basically a reflection of that: the desire to have only a single kernel image, but a boot-time toggle to differentiate between desktop and server loads and have CONFIG_PREEMPT (desktop) but also PREEMPT_VOLUNTARY behavior (server). There's also the view that PREEMPT kernels are a bit more QA-friendly, because atomic code sequences are much better defined & enforced via kernel warnings. Without preempt_count we only have irqs-off warnings, that are only a small fraction of all critical sections in the kernel. Ideally we'd be able to patch out most of the preempt_count maintenance overhead too - OTOH these days it's little more than noise on most CPUs, considering the kind of horrible security-workaround overhead we have on almost all x86 CPU types ... :-/ Thanks, Ingo
On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote: > > Peter Zijlstra <peterz@infradead.org> writes: > > > On Sun, Sep 10, 2023 at 11:32:32AM -0700, Linus Torvalds wrote: > > > >> I was hoping that we'd have some generic way to deal with this where > >> we could just say "this thing is reschedulable", and get rid of - or > >> at least not increasingly add to - the cond_resched() mess. > > > > Isn't that called PREEMPT=y ? That tracks precisely all the constraints > > required to know when/if we can preempt. > > > > The whole voluntary preempt model is basically the traditional > > co-operative preemption model and that fully relies on manual yields. > > Yeah, but as Linus says, this means a lot of code is just full of > cond_resched(). For instance a loop the process_huge_page() uses > this pattern: > > for (...) { > cond_resched(); > clear_page(i); > > cond_resched(); > clear_page(j); > } Yeah, that's what co-operative preemption gets you. > > The problem with the REP prefix (and Xen hypercalls) is that > > they're long running instructions and it becomes fundamentally > > impossible to put a cond_resched() in. > > > >> Yes. I'm starting to think that that the only sane solution is to > >> limit cases that can do this a lot, and the "instruciton pointer > >> region" approach would certainly work. > > > > From a code locality / I-cache POV, I think a sorted list of > > (non overlapping) ranges might be best. > > Yeah, agreed. There are a few problems with doing that though. > > I was thinking of using a check of this kind to schedule out when > it is executing in this "reschedulable" section: > !preempt_count() && in_resched_function(regs->rip); > > For preemption=full, this should mostly work. > For preemption=voluntary, though this'll only work with out-of-line > locks, not if the lock is inlined. > > (Both, should have problems with __this_cpu_* and the like, but > maybe we can handwave that away with sparse/objtool etc.) So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges thing, and then only search the range when TIF flag is set. And I'm thinking it might be a good idea to have objtool validate the range only contains simple instructions, the moment it contains control flow I'm thinking it's too complicated. > How expensive would be always having PREEMPT_COUNT=y? Effectively I think that is true today. At the very least Debian and SuSE (I can't find a RHEL .config in a hurry but I would think they too) ship with PREEMPT_DYNAMIC=y. Mel, I'm sure you ran numbers at the time (you always do), what if any was the measured overhead from PREEMPT_DYNAMIC vs 'regular' voluntary preemption?
On Tue, Sep 12, 2023 at 10:26:06AM +0200 Peter Zijlstra wrote: > On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote: > > > > How expensive would be always having PREEMPT_COUNT=y? > > Effectively I think that is true today. At the very least Debian and > SuSE (I can't find a RHEL .config in a hurry but I would think they too) > ship with PREEMPT_DYNAMIC=y. > Yes, RHEL too. Cheers, Phil --
On Tue, Sep 12, 2023 at 10:26:06AM +0200, Peter Zijlstra wrote: > > How expensive would be always having PREEMPT_COUNT=y? > > Effectively I think that is true today. At the very least Debian and > SuSE (I can't find a RHEL .config in a hurry but I would think they too) > ship with PREEMPT_DYNAMIC=y. $ grep PREEMPT uek-rpm/ol9/config-x86_64 # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_HAVE_PREEMPT_DYNAMIC=y CONFIG_PREEMPT_NOTIFIERS=y CONFIG_DRM_I915_PREEMPT_TIMEOUT=640 # CONFIG_PREEMPTIRQ_DELAY_TEST is not set $ grep PREEMPT uek-rpm/ol9/config-aarch64 # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_PREEMPT_NOTIFIERS=y # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
On Mon, 11 Sept 2023 at 20:27, Matthew Wilcox <willy@infradead.org> wrote: > > So we do still cond_resched(), but we might go up to PMD_SIZE > between calls. This is new code in 6.6 so it hasn't seen use by too > many users yet, but it's certainly bigger than the 16 pages used by > copy_chunked_from_user(). I honestly hadn't thought about preemption > latency. The thing about cond_resched() is that you literally won't get anybody who complains until the big page case is common enough that it hits special people. This is also a large part of why I dislike cond_resched() a lot. It's not just that it's sprinkled randomly in our code-base, it's that it's *found* and added so randomly. Some developers will look at code and say "this may be a long loop" and add it without any numbers. It's rare, but it happens. And other than that it usually is something like the RT people who have the latency trackers, and one particular load that they use for testing. Oh well. Enough kvetching. I'm not happy about it, but in the end it's a small annoyance, not a big issue. Linus
On Mon, 11 Sept 2023 at 15:20, Steven Rostedt <rostedt@goodmis.org> wrote: > > Going out on a crazy idea, I wonder if we could get the compiler to help us > here. As all preempt disabled locations are static, and as for functions, > they can be called with preemption enabled or disabled. Would it be > possible for the compiler to mark all locations that need preemption disabled? Definitely not. Those preempt-disabled areas aren't static, for one thing. Any time you take any exception in kernel space, your exception handling is all dynamically preemptible or not, possibly depending on architecture details. Yes, most exception handlers then have magic rules: page faults won't get past a particular point if they happened while not preemptible, for example. And interrupts will disable preemption themselves. But we have a ton of code that runs lots of subtle code in exception handlers that is very architecture-dependent, whether it is things like unaligned fixups, or instruction rewriting things for dynamic calls, or a lot of very grotty code. Most (all?) of it could probably be made to be non-preemptible, but it's a lot of code for a lot of architectures, and it's not the trivial kind. And that's ignoring all the code that is run in just regular process context with no exceptions that is sometimes run under spinlocks, and sometimes not. There's a *lot* of it. Think something as trivial as memcpy(), but also kmalloc() or any number of stuff that is just random support code that can be used from atomic (non-preemptible) context. So even if we could rely on magic compiler support that doesn't exist - which we can't - it's not even *remotely* as static as you seem to think it is. Linus
On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote: > On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote: >> > The problem with the REP prefix (and Xen hypercalls) is that >> > they're long running instructions and it becomes fundamentally >> > impossible to put a cond_resched() in. >> > >> >> Yes. I'm starting to think that that the only sane solution is to >> >> limit cases that can do this a lot, and the "instruciton pointer >> >> region" approach would certainly work. >> > >> > From a code locality / I-cache POV, I think a sorted list of >> > (non overlapping) ranges might be best. >> >> Yeah, agreed. There are a few problems with doing that though. >> >> I was thinking of using a check of this kind to schedule out when >> it is executing in this "reschedulable" section: >> !preempt_count() && in_resched_function(regs->rip); >> >> For preemption=full, this should mostly work. >> For preemption=voluntary, though this'll only work with out-of-line >> locks, not if the lock is inlined. >> >> (Both, should have problems with __this_cpu_* and the like, but >> maybe we can handwave that away with sparse/objtool etc.) > > So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges > thing, and then only search the range when TIF flag is set. > > And I'm thinking it might be a good idea to have objtool validate the > range only contains simple instructions, the moment it contains control > flow I'm thinking it's too complicated. Can we take a step back and look at the problem from a scheduling perspective? The basic operation of a non-preemptible kernel is time slice scheduling, which means that a task can run more or less undisturbed for a full time slice once it gets on the CPU unless it schedules away voluntary via a blocking operation. This works pretty well as long as everything runs in userspace as the preemption points in the return to user space path are independent of the preemption model. These preemption points handle both time slice exhaustion and priority based preemption. With PREEMPT=NONE these are the only available preemption points. That means that kernel code can run more or less indefinitely until it schedules out or returns to user space, which is obviously not possible for kernel threads. To prevent starvation the kernel gained voluntary preemption points, i.e. cond_resched(), which has to be added manually to code as a developer sees fit. Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as additional preemption points. might_resched() utilizes the existing might_sched() debug points, which are in code paths which might block on a contended resource. These debug points are mostly in core and infrastructure code and are in code paths which can block anyway. The only difference is that they allow preemption even when the resource is uncontended. Additionally we have PREEMPT=FULL which utilizes every zero transition of preeempt_count as a potential preemption point. Now we have the situation of long running data copies or data clear operations which run fully in hardware, but can be interrupted. As the interrupt return to kernel mode does not preempt in the NONE and VOLUNTARY cases, new workarounds emerged. Mostly by defining a data chunk size and adding cond_reched() again. That's ugly and does not work for long lasting hardware operations so we ended up with the suggestion of TIF_ALLOW_RESCHED to work around that. But again this needs to be manually annotated in the same way as a IP range based preemption scheme requires annotation. TBH. 
I detest all of this. Both cond_resched() and might_sleep/sched() are completely random mechanisms as seen from time slice operation and the data chunk based mechanism is just heuristics which works as good as heuristics tend to work. allow_resched() is not any different and IP based preemption mechanism are not going to be any better. The approach here is: Prevent the scheduler to make decisions and then mitigate the fallout with heuristics. That's just backwards as it moves resource control out of the scheduler into random code which has absolutely no business to do resource control. We have the reverse issue observed in PREEMPT_RT. The fact that spinlock held sections became preemtible caused even more preemption activity than on a PREEMPT=FULL kernel. The worst side effect of that was extensive lock contention. The way how we addressed that was to add a lazy preemption mode, which tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to preempt tasks which all belong to the SCHED_OTHER scheduling class. This works pretty well and gains back a massive amount of performance for the non-realtime throughput oriented tasks without affecting the schedulability of real-time tasks at all. IOW, it does not take control away from the scheduler. It cooperates with the scheduler and leaves the ultimate decisions to it. I think we can do something similar for the problem at hand, which avoids most of these heuristic horrors and control boundary violations. The main issue is that long running operations do not honour the time slice and we work around that with cond_resched() and now have ideas with this new TIF bit and IP ranges. None of that is really well defined in respect to time slices. In fact its not defined at all versus any aspect of scheduling behaviour. What about the following: 1) Keep preemption count and the real preemption points enabled unconditionally. That's not more overhead than the current DYNAMIC_PREEMPT mechanism as long as the preemption count does not go to zero, i.e. the folded NEED_RESCHED bit stays set. From earlier experiments I know that the overhead of preempt_count is minimal and only really observable with micro benchmarks. Otherwise it ends up in the noise as long as the slow path is not taken. I did a quick check comparing a plain inc/dec pair vs. the DYMANIC_PREEMPT inc/dec_and_test+NOOP mechanism and the delta is in the non-conclusive noise. 20 years ago this was a real issue because we did not have: - the folding of NEED_RESCHED into the preempt count - the cacheline optimizations which make the preempt count cache pretty much always cache hot - the hardware was way less capable I'm not saying that preempt_count is completely free today as it obviously adds more text and affects branch predictors, but as the major distros ship with DYNAMIC_PREEMPT enabled it is obviously an acceptable and tolerable tradeoff. 2) When the scheduler wants to set NEED_RESCHED due it sets NEED_RESCHED_LAZY instead which is only evaluated in the return to user space preemption points. As NEED_RESCHED_LAZY is not folded into the preemption count the preemption count won't become zero, so the task can continue until it hits return to user space. That preserves the existing behaviour. 3) When the scheduler tick observes that the time slice is exhausted, then it folds the NEED_RESCHED bit into the preempt count which causes the real preemption points to actually preempt including the return from interrupt to kernel path. 
That even allows the scheduler to enforce preemption for e.g. RT class tasks without changing anything else. I'm pretty sure that this gets rid of cond_resched(), which is an impressive list of instances:

  ./drivers   392
  ./fs        318
  ./mm        189
  ./kernel    184
  ./arch       95
  ./net        83
  ./include    46
  ./lib        36
  ./crypto     16
  ./sound      16
  ./block      11
  ./io_uring   13
  ./security   11
  ./ipc         3

That list clearly documents that the majority of these cond_resched() invocations are in code which neither should care about nor should have any influence on the core scheduling decision machinery. I think it's worth a try as it just fits into the existing preemption scheme, solves the issue of long running kernel functions, prevents invalid preemption and can utilize the existing instrumentation and debug infrastructure. Most importantly it gives control back to the scheduler and does not make it depend on the mercy of cond_resched(), allow_resched() or whatever heuristics sprinkled all over the kernel. To me this makes a lot of sense, but I might be on the completely wrong track. So feel free to tell me that I'm completely nuts and/or just not seeing the obvious. Thanks, tglx
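To make the mechanics of the proposal above concrete, here is a minimal standalone C model of the two-bit scheme: ordinary wakeups only set a lazy flag that is looked at on return to user space, and the tick folds the real NEED_RESCHED bit into the preempt counter once the slice is gone. The flag names, the folding layout and the helpers are assumptions for illustration only, not the eventual kernel implementation.

/*
 * Standalone model of the NEED_RESCHED / NEED_RESCHED_LAZY split
 * proposed above.  All names and the folding layout are hypothetical;
 * this only illustrates the control flow, it is not kernel code.
 */
#include <stdio.h>

#define TIF_NEED_RESCHED	(1u << 0)
#define TIF_NEED_RESCHED_LAZY	(1u << 1)	/* hypothetical new bit */
#define PREEMPT_NEED_RESCHED	(1u << 31)	/* "folded" into the count */

struct task {
	unsigned int tif_flags;
	unsigned int preempt_count;	/* nesting count + folded bit */
	unsigned int slice_left;	/* ticks remaining in the slice */
};

/* Ordinary wakeup: only ask for preemption at return to user space. */
static void resched_curr_lazy(struct task *t)
{
	t->tif_flags |= TIF_NEED_RESCHED_LAZY;
}

/* Slice exhausted (or an urgent task woke up): arm the real preemption
 * points by folding the bit into the preempt counter as well. */
static void resched_curr(struct task *t)
{
	t->tif_flags |= TIF_NEED_RESCHED;
	t->preempt_count |= PREEMPT_NEED_RESCHED;
}

static void scheduler_tick(struct task *t)
{
	if (t->slice_left && --t->slice_left == 0)
		resched_curr(t);		/* step 3 of the proposal */
}

int main(void)
{
	struct task t = { .slice_left = 2 };

	resched_curr_lazy(&t);	/* ordinary wakeup: lazy bit only */
	scheduler_tick(&t);	/* slice not exhausted yet */
	scheduler_tick(&t);	/* now it is: NEED_RESCHED gets folded */

	printf("tif=%#x preempt_count=%#x\n", t.tif_flags, t.preempt_count);
	return 0;
}

The point of the split is visible in the last line: the in-kernel preemption points only look at the folded bit, so a lazily marked task keeps running in the kernel until it either returns to user space or exhausts its slice.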
On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote: > > What about the following: > > 1) Keep preemption count and the real preemption points enabled > unconditionally. Well, it's certainly the simplest solution, and gets rid of not just the 'rep string' issue, but gets rid of all the cond_resched() hackery entirely. > 20 years ago this was a real issue because we did not have: > > - the folding of NEED_RESCHED into the preempt count > > - the cacheline optimizations which make the preempt count cache > pretty much always cache hot > > - the hardware was way less capable > > I'm not saying that preempt_count is completely free today as it > obviously adds more text and affects branch predictors, but as the > major distros ship with DYNAMIC_PREEMPT enabled it is obviously an > acceptable and tolerable tradeoff. Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY approach isn't actually used, and is only causing pain. > 2) When the scheduler wants to set NEED_RESCHED due it sets > NEED_RESCHED_LAZY instead which is only evaluated in the return to > user space preemption points. Is this just to try to emulate the existing PREEMPT_NONE behavior? If the new world order is that the time slice is always honored, then the "this might be a latency issue" goes away. Good. And we'd also get better coverage for the *debug* aim of "might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on PREEMPT_COUNT always existing. But because the latency argument is gone, the "might_resched()" should then just be removed entirely from "might_sleep()", so that might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing. That argues for your suggestion too, since we had a performance issue due to "might_sleep()" _not_ being just a debug thing, and pointlessly causing a reschedule in a place where reschedules were _allowed_, but certainly much less than optimal. Which then caused that fairly recent commit 4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()"). However, that does bring up an issue: even with full preemption, there are certainly places where we are *allowed* to schedule (when the preempt count is zero), but there are also some places that are *better* than other places to schedule (for example, when we don't hold any other locks). So, I do think that if we just decide to go "let's just always be preemptible", we might still have points in the kernel where preemption might be *better* than in others points. But none of might_resched(), might_sleep() _or_ cond_resched() are necessarily that kind of "this is a good point" thing. They come from a different background. So what I think what you are saying is that we'd have the following situation: - scheduling at "return to user space" is presumably always a good thing. A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or whatever) would cover that, and would give us basically the existing CONFIG_PREEMPT_NONE behavior. So a config variable (either compile-time with PREEMPT_NONE or a dynamic one with DYNAMIC_PREEMPT set to none) would make any external wakeup only set that bit. And then a "fully preemptible low-latency desktop" would set the preempt-count bit too. - but the "timeslice over" case would always set the preempt-count-bit, regardless of any config, and would guarantee that we have reasonable latencies. This all makes cond_resched() (and might_resched()) pointless, and they can just go away. 
Then the question becomes whether we'd want to introduce a *new* concept, which is a "if you are going to schedule, do it now rather than later, because I'm taking a lock, and while it's a preemptible lock, I'd rather not sleep while holding this resource". I suspect we want to avoid that for now, on the assumption that it's hopefully not a problem in practice (the recently addressed problem with might_sleep() was that it actively *moved* the scheduling point to a bad place, not that scheduling could happen there, so instead of optimizing scheduling, it actively pessimized it). But I thought I'd mention it. Anyway, I'm definitely not opposed. We'd get rid of a config option that is presumably not very widely used, and we'd simplify a lot of issues, and get rid of all these badly defined "cond_preempt()" things. Linus
On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote: > On preempt_model_none() or preempt_model_voluntary() configurations > rescheduling of kernel threads happens only when they allow it, and > only at explicit preemption points, via calls to cond_resched() or > similar. > > That leaves out contexts where it is not convenient to periodically > call cond_resched() -- for instance when executing a potentially long > running primitive (such as REP; STOSB.) > So I said this not too long ago in the context of Xen PV, but maybe it's time to ask it in general: Why do we support anything other than full preempt? I can think of two reasons, neither of which I think is very good:

1. Once upon a time, tracking preempt state was expensive. But we fixed that.

2. Folklore suggests that there's a latency vs throughput tradeoff, and serious workloads, for some definition of serious, want throughput, so they should run without full preemption.

I think #2 is a bit silly. If you want throughput, and you're busy waiting for a CPU that wants to run you, but it's not because it's running some low-priority non-preemptible thing (because preempt is set to none or voluntary), you're not getting throughput. If you want to keep some I/O resource busy to get throughput, but you have excessive latency getting scheduled, you don't get throughput. If the actual problem is that there's a workload that performs better when scheduling is delayed (which preempt=none and preempt=voluntary do, essentially at random), then maybe someone should identify that workload and fix the scheduler. So maybe we should just very strongly encourage everyone to run with full preempt and simplify the kernel?
* Thomas Gleixner <tglx@linutronix.de> wrote: > Additionally we have PREEMPT=FULL which utilizes every zero transition > of preeempt_count as a potential preemption point. Just to complete this nice new entry to Documentation/sched/: in PREEMPT=FULL there's also IRQ-return driven preemption of kernel-mode code, at almost any instruction boundary the hardware allows, in addition to the preemption driven by regular zero transition of preempt_count in syscall/kthread code. Thanks, Ingo
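As a quick illustration of the zero-transition preemption point Ingo refers to, here is a tiny standalone model: preemption can only happen when the nesting count drops back to zero and a reschedule is pending. It is a simplified sketch, not the real preempt_enable() implementation, and the IRQ-return driven preemption Ingo adds to the picture is simply the same check performed on the way out of an interrupt that hit kernel code.

/* Standalone model of the PREEMPT=FULL rule: any preempt_count
 * zero transition with a pending reschedule is a preemption point. */
#include <stdbool.h>
#include <stdio.h>

static unsigned int preempt_count;
static bool need_resched;

static void schedule(void)
{
	puts("schedule()");
	need_resched = false;
}

static void preempt_disable(void)
{
	preempt_count++;
}

static void preempt_enable(void)
{
	/* the zero transition: the potential preemption point */
	if (--preempt_count == 0 && need_resched)
		schedule();
}

int main(void)
{
	preempt_disable();
	preempt_disable();
	need_resched = true;	/* e.g. an interrupt woke a task */
	preempt_enable();	/* count 2 -> 1: no preemption yet */
	preempt_enable();	/* count 1 -> 0: preempts here under FULL */
	return 0;
}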
* Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > What about the following: > > > > 1) Keep preemption count and the real preemption points enabled > > unconditionally. > > Well, it's certainly the simplest solution, and gets rid of not just > the 'rep string' issue, but gets rid of all the cond_resched() hackery > entirely. > > > 20 years ago this was a real issue because we did not have: > > > > - the folding of NEED_RESCHED into the preempt count > > > > - the cacheline optimizations which make the preempt count cache > > pretty much always cache hot > > > > - the hardware was way less capable > > > > I'm not saying that preempt_count is completely free today as it > > obviously adds more text and affects branch predictors, but as the > > major distros ship with DYNAMIC_PREEMPT enabled it is obviously an > > acceptable and tolerable tradeoff. > > Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most > distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY > approach isn't actually used, and is only causing pain. The macro-behavior of NONE/VOLUNTARY is still used & relied upon in server distros - and that's the behavior that enterprise distros truly cared about. Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the 'noise' category for all major distros I'd say. And that's what Thomas's proposal achieves: keep the nicely execution-batched NONE/VOLUNTARY scheduling behavior for SCHED_OTHER tasks, while having the latency advantages of fully-preemptible kernel code for RT and critical tasks. So I'm fully on board with this. It would reduce the number of preemption variants to just two: regular kernel and PREEMPT_RT. Yummie! > > 2) When the scheduler wants to set NEED_RESCHED due it sets > > NEED_RESCHED_LAZY instead which is only evaluated in the return to > > user space preemption points. > > Is this just to try to emulate the existing PREEMPT_NONE behavior? Yes: I'd guesstimate that the batching caused by timeslice-laziness that is naturally part of NONE/VOLUNTARY resolves ~90%+ of observable macro-performance regressions between NONE/VOLUNTARY and PREEMPT/RT. > If the new world order is that the time slice is always honored, then the > "this might be a latency issue" goes away. Good. > > And we'd also get better coverage for the *debug* aim of "might_sleep()" > and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on PREEMPT_COUNT always > existing. > > But because the latency argument is gone, the "might_resched()" should > then just be removed entirely from "might_sleep()", so that might_sleep() > would *only* be that DEBUG_ATOMIC_SLEEP thing. Correct. And that's even a minor code generation advantage, as we wouldn't have these additional hundreds of random/statistical preemption checks. > That argues for your suggestion too, since we had a performance issue due > to "might_sleep()" _not_ being just a debug thing, and pointlessly > causing a reschedule in a place where reschedules were _allowed_, but > certainly much less than optimal. > > Which then caused that fairly recent commit 4542057e18ca ("mm: avoid > 'might_sleep()' in get_mmap_lock_carefully()"). 4542057e18ca is arguably kind of a workaround though - and with the preempt_count + NEED_RESCHED_LAZY approach we'd have both the latency advantages *and* the execution-batching performance advantages of NONE/VOLUNTARY that 4542057e18ca exposed. 
> However, that does bring up an issue: even with full preemption, there > are certainly places where we are *allowed* to schedule (when the preempt > count is zero), but there are also some places that are *better* than > other places to schedule (for example, when we don't hold any other > locks). > > So, I do think that if we just decide to go "let's just always be > preemptible", we might still have points in the kernel where preemption > might be *better* than in others points. So in the broadest sense we have 3 stages of pending preemption: NEED_RESCHED_LAZY NEED_RESCHED_SOON NEED_RESCHED_NOW And we'd transition: - from 0 -> SOON when an eligible task is woken up, - from LAZY -> SOON when current timeslice is exhausted, - from SOON -> NOW when no locks/resources are held. [ With a fast-track for RT or other urgent tasks to enter NOW immediately. ] On the regular kernels it's probably not worth modeling the SOON/NOW split, as we'd have to track the depth of sleeping locks as well, which we don't do right now. On PREEMPT_RT the SOON/NOW distinction possibly makes sense, as there we are aware of locking depth already and it would be relatively cheap to check for it on natural 0-preempt_count boundaries. > But none of might_resched(), might_sleep() _or_ cond_resched() are > necessarily that kind of "this is a good point" thing. They come from a > different background. Correct, they come from two sources: - They are hundreds of points that we know are 'technically correct' preemption points, and they break up ~90% of long latencies by brute force & chance. - Explicitly identified problem points that added a cond_resched() or its equivalent. These are rare and also tend to bitrot, because *removing* them is always more risky than adding them, so they tend to accumulate. > So what I think what you are saying is that we'd have the following > situation: > > - scheduling at "return to user space" is presumably always a good thing. > > A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or > whatever) would cover that, and would give us basically the existing > CONFIG_PREEMPT_NONE behavior. > > So a config variable (either compile-time with PREEMPT_NONE or a > dynamic one with DYNAMIC_PREEMPT set to none) would make any external > wakeup only set that bit. > > And then a "fully preemptible low-latency desktop" would set the > preempt-count bit too. I'd even argue that we only need two preemption modes, and that 'fully preemptible low-latency desktop' is an artifact of poor latencies on PREEMPT_NONE. Ie. in the long run - after a careful period of observing performance regressions and other dragons - we'd only have *two* preemption modes left: !PREEMPT_RT # regular kernel. Single default behavior. PREEMPT_RT=y # -rt kernel, because rockets, satellites & cars matter. Any other application level preemption preferences can be expressed via scheduling policies & priorities. Nothing else. We don't need PREEMPT_DYNAMIC, PREEMPT_VOLUNTARY or PREEMPT_NONE in any of their variants, probably not even as runtime knobs. People who want shorter timeslices can set shorter timeslices, and people who want immediate preemption of certain tasks can manage priorities. > - but the "timeslice over" case would always set the preempt-count-bit, > regardless of any config, and would guarantee that we have reasonable > latencies. Yes. Probably a higher nice-priority task becoming runnable would cause immediate preemption too, in addition to RT tasks. Ie. 
the execution batching would be for same-prio groups of SCHED_OTHER tasks. > This all makes cond_resched() (and might_resched()) pointless, and > they can just go away. Yep. > Then the question becomes whether we'd want to introduce a *new* concept, > which is a "if you are going to schedule, do it now rather than later, > because I'm taking a lock, and while it's a preemptible lock, I'd rather > not sleep while holding this resource". Something close to this concept is naturally available on PREEMPT_RT kernels, which only use a single central lock primitive (rt_mutex), but it would have to be added explicitly for regular kernels. We could do the following intermediate step:

 - Remove all the random cond_resched() points such as might_sleep()

 - Turn all explicit cond_resched() points into 'ideal point to reschedule'.

 - Maybe even rename it from cond_resched() to resched_point(), to signal the somewhat different role.

While cond_resched() and resched_point() are not 100% matches, they are close enough, as most existing cond_resched() points were added to places that cause the least amount of disruption with held resources. But I think it would be better to add resched_point() as a new API, and add it to places where there's a performance benefit. Clean slate, documentation, and all that. > I suspect we want to avoid that for now, on the assumption that it's > hopefully not a problem in practice (the recently addressed problem with > might_sleep() was that it actively *moved* the scheduling point to a bad > place, not that scheduling could happen there, so instead of optimizing > scheduling, it actively pessimized it). But I thought I'd mention it. > > Anyway, I'm definitely not opposed. We'd get rid of a config option that > is presumably not very widely used, and we'd simplify a lot of issues, > and get rid of all these badly defined "cond_preempt()" things. I think we can get rid of *all* the preemption model Kconfig knobs, except PREEMPT_RT. :-) Thanks, Ingo
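Ingo's three-stage idea can be written down as a small state machine; the sketch below is a standalone model that only encodes the transitions listed above. The SOON/NOW split in particular is hypothetical, since the kernel does not track sleeping-lock depth for this purpose today.

/* Standalone model of the LAZY/SOON/NOW staging discussed above.
 * Purely illustrative; only a LAZY bit exists in the -rt patches,
 * the SOON/NOW distinction is hypothetical. */
#include <stdio.h>

enum resched_stage {
	RESCHED_NONE,
	RESCHED_LAZY,	/* honoured at return to user space only */
	RESCHED_SOON,	/* slice exhausted or eligible task woken */
	RESCHED_NOW,	/* no sleeping locks held: preempt immediately */
};

struct task {
	enum resched_stage stage;
	unsigned int slice_left;
	unsigned int locks_held;	/* depth of sleeping locks */
};

static void wakeup_other_task(struct task *t)
{
	if (t->stage < RESCHED_SOON)
		t->stage = RESCHED_SOON;	/* 0 -> SOON on wakeup */
}

static void tick(struct task *t)
{
	if (t->slice_left && --t->slice_left == 0 &&
	    t->stage == RESCHED_LAZY)
		t->stage = RESCHED_SOON;	/* LAZY -> SOON on exhaustion */
}

static void maybe_promote(struct task *t)
{
	if (t->stage == RESCHED_SOON && !t->locks_held)
		t->stage = RESCHED_NOW;		/* SOON -> NOW when unlocked */
}

int main(void)
{
	struct task t = { .stage = RESCHED_LAZY, .slice_left = 1, .locks_held = 1 };

	tick(&t);		/* slice gone: LAZY -> SOON */
	maybe_promote(&t);	/* still holding a lock: stays SOON */
	t.locks_held = 0;
	maybe_promote(&t);	/* SOON -> NOW */
	printf("stage = %d\n", t.stage);
	return 0;
}

On a regular kernel only the first two stages would exist; RESCHED_SOON then simply means the folded NEED_RESCHED bit.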
* Ingo Molnar <mingo@kernel.org> wrote: > > Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in most > > distros does speak for just admitting that the PREEMPT_NONE / VOLUNTARY > > approach isn't actually used, and is only causing pain. > > The macro-behavior of NONE/VOLUNTARY is still used & relied upon in > server distros - and that's the behavior that enterprise distros truly > cared about. > > Micro-overhead of NONE/VOLUNTARY vs. FULL is nonzero but is in the > 'noise' category for all major distros I'd say. > > And that's what Thomas's proposal achieves: keep the nicely > execution-batched NONE/VOLUNTARY scheduling behavior for SCHED_OTHER > tasks, while having the latency advantages of fully-preemptible kernel > code for RT and critical tasks. > > So I'm fully on board with this. It would reduce the number of preemption > variants to just two: regular kernel and PREEMPT_RT. Yummie! As an additional side note: with various changes such as EEVDF the scheduler is a lot less preemption-happy these days, without wrecking latencies & timeslice distribution. So in principle we might not even need the NEED_RESCHED_LAZY extra bit, which -rt uses as a kind of additional layer to make sure they don't change scheduling policy. Ie. a modern scheduler might have mooted much of this change: 4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()") ... because now we'll only reschedule on timeslice exhaustion, or if a task comes in with a big deadline deficit. And even the deadline-deficit wakeup preemption can be turned off further with: $ echo NO_WAKEUP_PREEMPTION > /debug/sched/features And we are considering making that the default behavior for same-prio tasks - basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which should be quite similar to what NEED_RESCHED_LAZY achieves on -rt. Thanks, Ingo
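For comparison with what Ingo describes, the existing userspace-visible knob that comes closest is SCHED_BATCH, which already trades wakeup preemption for batching. A minimal example of opting the calling task into it on a Linux/glibc system:

/* Put the calling task into SCHED_BATCH; its wakeup-preemption
 * behaviour is the closest existing analogue to what NEED_RESCHED_LAZY
 * gives SCHED_OTHER tasks on -rt. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };  /* must be 0 for SCHED_BATCH */

	if (sched_setscheduler(0, SCHED_BATCH, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("policy is now %d (SCHED_BATCH)\n", sched_getscheduler(0));
	return 0;
}

The same effect can be had from the shell with chrt --batch 0 <command>.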
On Mon, Sep 18 2023 at 20:21, Andy Lutomirski wrote: > On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote: > Why do we support anything other than full preempt? I can think of > two reasons, neither of which I think is very good: > > 1. Once upon a time, tracking preempt state was expensive. But we fixed that. > > 2. Folklore suggests that there's a latency vs throughput tradeoff, > and serious workloads, for some definition of serious, want > throughput, so they should run without full preemption. It's absolutely not folklore. Run to completion has well known benefits as it avoids contention and avoids the overhead of scheduling for a large number of scenarios. We've seen that painfully in PREEMPT_RT before we came up with the concept of lazy preemption for throughput oriented tasks. Thanks, tglx
* Thomas Gleixner <tglx@linutronix.de> wrote: > On Mon, Sep 18 2023 at 20:21, Andy Lutomirski wrote: > > On Wed, Aug 30, 2023, at 11:49 AM, Ankur Arora wrote: > > > Why do we support anything other than full preempt? I can think of > > two reasons, neither of which I think is very good: > > > > 1. Once upon a time, tracking preempt state was expensive. But we fixed that. > > > > 2. Folklore suggests that there's a latency vs throughput tradeoff, > > and serious workloads, for some definition of serious, want > > throughput, so they should run without full preemption. > > It's absolutely not folklore. Run to completion is has well known > benefits as it avoids contention and avoids the overhead of scheduling > for a large amount of scenarios. > > We've seen that painfully in PREEMPT_RT before we came up with the > concept of lazy preemption for throughput oriented tasks. Yeah, for a large majority of workloads reduction in preemption increases batching and improves cache locality. Most scalability-conscious enterprise users want longer timeslices & better cache locality, not shorter timeslices with spread out cache use. There's microbenchmarks that fit mostly in cache that benefit if work is immediately processed by freshly woken tasks - but that's not true for most workloads with a substantial real-life cache footprint. Thanks, Ingo
Linus! On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote: > On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@linutronix.de> wrote: >> 2) When the scheduler wants to set NEED_RESCHED due it sets >> NEED_RESCHED_LAZY instead which is only evaluated in the return to >> user space preemption points. > > Is this just to try to emulate the existing PREEMPT_NONE behavior? To some extent yes. > If the new world order is that the time slice is always honored, then > the "this might be a latency issue" goes away. Good. That's the point. > And we'd also get better coverage for the *debug* aim of > "might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on > PREEMPT_COUNT always existing. > > But because the latency argument is gone, the "might_resched()" should > then just be removed entirely from "might_sleep()", so that > might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing. True. And this gives the scheduler the flexibility to enforce preemption under certain conditions, e.g. when a task with RT scheduling class or a task with a sporadic event handler is woken up. That's what VOLUNTARY tries to achieve with all the might_sleep()/might_resched() magic. > That argues for your suggestion too, since we had a performance issue > due to "might_sleep()" _not_ being just a debug thing, and pointlessly > causing a reschedule in a place where reschedules were _allowed_, but > certainly much less than optimal. > > Which then caused that fairly recent commit 4542057e18ca ("mm: avoid > 'might_sleep()' in get_mmap_lock_carefully()"). Awesome. > However, that does bring up an issue: even with full preemption, there > are certainly places where we are *allowed* to schedule (when the > preempt count is zero), but there are also some places that are > *better* than other places to schedule (for example, when we don't > hold any other locks). > > So, I do think that if we just decide to go "let's just always be > preemptible", we might still have points in the kernel where > preemption might be *better* than in others points. > > But none of might_resched(), might_sleep() _or_ cond_resched() are > necessarily that kind of "this is a good point" thing. They come from > a different background. They are subject to subsystem/driver specific preferences and therefore biased towards a certain usage scenario, which is not necessarily to the benefit of everyone else. > So what I think what you are saying is that we'd have the following situation: > > - scheduling at "return to user space" is presumably always a good thing. > > A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or > whatever) would cover that, and would give us basically the existing > CONFIG_PREEMPT_NONE behavior. > > So a config variable (either compile-time with PREEMPT_NONE or a > dynamic one with DYNAMIC_PREEMPT set to none) would make any external > wakeup only set that bit. > > And then a "fully preemptible low-latency desktop" would set the > preempt-count bit too. Correct. > - but the "timeslice over" case would always set the > preempt-count-bit, regardless of any config, and would guarantee that > we have reasonable latencies. Yes. That's the reasoning. > This all makes cond_resched() (and might_resched()) pointless, and > they can just go away. 
:) So the decision matrix would be:

                 Ret2user    Ret2kernel    PreemptCnt=0
  NEED_RESCHED      Y            Y              Y
  LAZY_RESCHED      Y            N              N

That is completely independent of the preemption model and the differentiation of the preemption models happens solely at the scheduler level: PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time slice where it sets NEED_RESCHED. PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class tasks or sporadic event tasks sets NEED_RESCHED too. PREEMPT_FULL always sets NEED_RESCHED like today. We should be able to merge the PREEMPT_NONE/VOLUNTARY behaviour so that we only end up with two variants or even subsume PREEMPT_FULL into that model because that's what is closer to the RT LAZY preempt behaviour, which has two goals:

  1) Make low latency guarantees for RT workloads

  2) Preserve the throughput for non-RT workloads

But in any case this decision happens solely in the core scheduler code and nothing outside of it needs to be changed. So we not only get rid of the cond/might_resched() muck, we also get rid of the static_call/static_key machinery which drives PREEMPT_DYNAMIC. The only place which still needs that runtime tweaking is the scheduler itself. Though it just occurred to me that there are dragons lurking:

  arch/alpha/Kconfig:    select ARCH_NO_PREEMPT
  arch/hexagon/Kconfig:  select ARCH_NO_PREEMPT
  arch/m68k/Kconfig:     select ARCH_NO_PREEMPT if !COLDFIRE
  arch/um/Kconfig:       select ARCH_NO_PREEMPT

So we have four architectures which refuse to enable preemption points, i.e. the only model they allow is NONE and they rely on cond_resched() for breaking large computations. But they support PREEMPT_COUNT, so we might get away with a reduced preemption point coverage:

                 Ret2user    Ret2kernel    PreemptCnt=0
  NEED_RESCHED      Y            N              Y
  LAZY_RESCHED      Y            N              N

i.e. the only difference is that Ret2kernel is not a preemption point. That's where the scheduler tick enforcement of the time slice happens. It still might work out good enough and if not then it should not be real rocket science to add that Ret2kernel preemption point to cure it. > Then the question becomes whether we'd want to introduce a *new* > concept, which is a "if you are going to schedule, do it now rather > than later, because I'm taking a lock, and while it's a preemptible > lock, I'd rather not sleep while holding this resource". > > I suspect we want to avoid that for now, on the assumption that it's > hopefully not a problem in practice (the recently addressed problem > with might_sleep() was that it actively *moved* the scheduling point > to a bad place, not that scheduling could happen there, so instead of > optimizing scheduling, it actively pessimized it). But I thought I'd > mention it. I think we want to avoid that completely and if this becomes an issue, we rather be smart about it at the core level. It's trivial enough to have a per task counter which tells whether a preemptible lock is held (or about to be acquired) or not. Then the scheduler can take that hint into account and decide to grant a timeslice extension once in the expectation that the task leaves the lock held section soonish and either returns to user space or schedules out. It still can enforce it later on. We really want to let the scheduler decide and rather give it proper hints at the conceptual level instead of letting developers make random decisions which might work well for a particular use case and completely suck for the rest. I think we wasted enough time already on those. > Anyway, I'm definitely not opposed.
We'd get rid of a config option > that is presumably not very widely used, and we'd simplify a lot of > issues, and get rid of all these badly defined "cond_preempt()" > things. Hmm. Didn't I promise a year ago that I won't do further large scale cleanups and simplifications beyond printk. Maybe I get away this time with just suggesting it. :) Thanks, tglx
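The decision matrix above maps directly onto a small predicate; here it is as a standalone model, with hypothetical bit names, just to make explicit which pending-reschedule bit each class of preemption point honours.

/*
 * Standalone encoding of the decision matrix above.  The bit names and
 * the site enum are illustrative only; the table itself is taken from
 * the mail.
 */
#include <stdbool.h>
#include <stdio.h>

#define NEED_RESCHED	(1u << 0)
#define LAZY_RESCHED	(1u << 1)

enum resched_site {
	RET2USER,		/* return to user space */
	RET2KERNEL,		/* return from interrupt to kernel */
	PREEMPT_CNT_ZERO,	/* preempt_count() dropping to zero */
};

static bool should_resched(enum resched_site site, unsigned int pending)
{
	switch (site) {
	case RET2USER:
		/* NEED_RESCHED: Y, LAZY_RESCHED: Y */
		return pending & (NEED_RESCHED | LAZY_RESCHED);
	case RET2KERNEL:
	case PREEMPT_CNT_ZERO:
		/* NEED_RESCHED: Y, LAZY_RESCHED: N */
		return pending & NEED_RESCHED;
	}
	return false;
}

int main(void)
{
	printf("lazy bit at ret2kernel:       %d\n",
	       should_resched(RET2KERNEL, LAZY_RESCHED));
	printf("lazy bit at ret2user:         %d\n",
	       should_resched(RET2USER, LAZY_RESCHED));
	printf("need bit at preempt_count==0: %d\n",
	       should_resched(PREEMPT_CNT_ZERO, NEED_RESCHED));
	return 0;
}

The reduced coverage for the ARCH_NO_PREEMPT architectures would simply turn the RET2KERNEL case into an unconditional false.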
On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote: > Though it just occured to me that there are dragons lurking: > > arch/alpha/Kconfig: select ARCH_NO_PREEMPT > arch/hexagon/Kconfig: select ARCH_NO_PREEMPT > arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE > arch/um/Kconfig: select ARCH_NO_PREEMPT Sounds like three-and-a-half architectures which could be queued up for removal right behind ia64 ... I suspect none of these architecture maintainers have any idea there's a problem. Look at commit 87a4c375995e and the discussion in https://lore.kernel.org/lkml/20180724175646.3621-1-hch@lst.de/ Let's cc those maintainers so they can remove this and fix whatever breaks.
Ingo! On Tue, Sep 19 2023 at 10:03, Ingo Molnar wrote: > * Linus Torvalds <torvalds@linux-foundation.org> wrote: >> Then the question becomes whether we'd want to introduce a *new* concept, >> which is a "if you are going to schedule, do it now rather than later, >> because I'm taking a lock, and while it's a preemptible lock, I'd rather >> not sleep while holding this resource". > > Something close to this concept is naturally available on PREEMPT_RT > kernels, which only use a single central lock primitive (rt_mutex), but it > would have be added explicitly for regular kernels. > > We could do the following intermediate step: > > - Remove all the random cond_resched() points such as might_sleep() > - Turn all explicit cond_resched() points into 'ideal point to reschedule'. > > - Maybe even rename it from cond_resched() to resched_point(), to signal > the somewhat different role. > > While cond_resched() and resched_point() are not 100% matches, they are > close enough, as most existing cond_resched() points were added to places > that cause the least amount of disruption with held resources. > > But I think it would be better to add resched_point() as a new API, and add > it to places where there's a performance benefit. Clean slate, > documentation, and all that. Let's not go there. You just replace one magic mushroom with a different flavour. We want to get rid of them completely. The whole point is to let the scheduler decide and give it enough information to make informed decisions. So with the LAZY scheme in effect, there is no real reason to have these extra points and I'd rather add task::sleepable_locks_held and do that accounting in the relevant lock/unlock paths. Based on that the scheduler can decide whether it grants a time slice expansion or just says no. That's extremely cheap and well defined. You can document the hell out of resched_point(), but it won't be any different from the existing ones and always subject to personal preference and goals, and it's going to be sprinkled all over the place just like the existing ones. So where is the gain? Thanks, tglx
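A standalone sketch of the hint Thomas describes, assuming hypothetical field and helper names: the lock paths keep a per-task count of held sleepable locks, and the tick uses it to grant at most one slice extension before preemption is enforced.

/* Standalone model of the "sleepable locks held" hint.  The field and
 * helper names are made up for illustration; the accounting would live
 * in the real lock/unlock paths. */
#include <stdbool.h>
#include <stdio.h>

struct task {
	unsigned int sleepable_locks_held;
	bool slice_extended;		/* one extension, then we preempt */
};

static void sleepable_lock_acquire(struct task *t)
{
	t->sleepable_locks_held++;
}

static void sleepable_lock_release(struct task *t)
{
	t->sleepable_locks_held--;
}

/* Called when the tick finds the time slice exhausted. */
static bool grant_slice_extension(struct task *t)
{
	if (t->sleepable_locks_held && !t->slice_extended) {
		t->slice_extended = true;	/* let it leave the locked section */
		return true;
	}
	return false;				/* enforce preemption now */
}

int main(void)
{
	struct task t = { 0 };

	sleepable_lock_acquire(&t);
	printf("extension granted: %d\n", grant_slice_extension(&t)); /* 1 */
	printf("extension granted: %d\n", grant_slice_extension(&t)); /* 0: only once */
	sleepable_lock_release(&t);
	return 0;
}

Compared with sprinkling resched_point() calls around, the decision stays entirely inside the scheduler; the lock code only exports a fact it already knows.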
Hi Willy, On Tue, Sep 19, 2023 at 3:01 PM Matthew Wilcox <willy@infradead.org> wrote: > On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote: > > Though it just occured to me that there are dragons lurking: > > > > arch/alpha/Kconfig: select ARCH_NO_PREEMPT > > arch/hexagon/Kconfig: select ARCH_NO_PREEMPT > > arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE > > arch/um/Kconfig: select ARCH_NO_PREEMPT > > Sounds like three-and-a-half architectures which could be queued up for > removal right behind ia64 ... > > I suspect none of these architecture maintainers have any idea there's a > problem. Look at commit 87a4c375995e and the discussion in > https://lore.kernel.org/lkml/20180724175646.3621-1-hch@lst.de/ These links don't really point out there is a grave problem? > Let's cc those maintainers so they can remove this and fix whatever > breaks. Gr{oetje,eeting}s, Geert
On Tue, 2023-09-19 at 14:00 +0100, Matthew Wilcox wrote: > On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote: > > Though it just occured to me that there are dragons lurking: > > > > arch/alpha/Kconfig: select ARCH_NO_PREEMPT > > arch/hexagon/Kconfig: select ARCH_NO_PREEMPT > > arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE > > arch/um/Kconfig: select ARCH_NO_PREEMPT > > Sounds like three-and-a-half architectures which could be queued up for > removal right behind ia64 ... The agreement to kill off ia64 wasn't an invitation to kill off other stuff that people are still working on! Can we please not do this? Thanks, Adrian
On Tue, Sep 19, 2023 at 03:37:24PM +0200, John Paul Adrian Glaubitz wrote: > On Tue, 2023-09-19 at 14:00 +0100, Matthew Wilcox wrote: > > On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote: > > > Though it just occured to me that there are dragons lurking: > > > > > > arch/alpha/Kconfig: select ARCH_NO_PREEMPT > > > arch/hexagon/Kconfig: select ARCH_NO_PREEMPT > > > arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE > > > arch/um/Kconfig: select ARCH_NO_PREEMPT > > > > Sounds like three-and-a-half architectures which could be queued up for > > removal right behind ia64 ... > > The agreement to kill off ia64 wasn't an invitation to kill off other stuff > that people are still working on! Can we please not do this? If you're working on one of them, then surely it's a simple matter of working on adding CONFIG_PREEMPT support :-)
On Tue, Sep 19 2023 at 10:43, Ingo Molnar wrote: > * Ingo Molnar <mingo@kernel.org> wrote: > Ie. a modern scheduler might have mooted much of this change: > > 4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()") > > ... because now we'll only reschedule on timeslice exhaustion, or if a task > comes in with a big deadline deficit. > > And even the deadline-deficit wakeup preemption can be turned off further > with: > > $ echo NO_WAKEUP_PREEMPTION > /debug/sched/features > > And we are considering making that the default behavior for same-prio tasks > - basically turn same-prio SCHED_OTHER tasks into SCHED_BATCH - which > should be quite similar to what NEED_RESCHED_LAZY achieves on -rt. I don't think that you can get rid of NEED_RESCHED_LAZY for !RT because there is a clear advantage of having the return to user preemption point. It spares having the kernel/user transition just to get the task back via the timeslice interrupt. I experimented with that on RT and the result was definitely worse. We surely can revisit that, but I'd really start with the straightforward mappable LAZY bit approach and if experimentation turns out to provide good enough results by not setting that bit at all, then we still can do so without changing anything except the core scheduler decision logic. It's again a cheap thing due to the way the return to user TIF handling works:

	ti_work = read_thread_flags();
	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
		ti_work = exit_to_user_mode_loop(regs, ti_work);

TIF_LAZY_RESCHED is part of EXIT_TO_USER_MODE_WORK, so the non-work case does not become more expensive than today. If any of the bits is set, then the slowpath won't get measurably different performance whether the bit is evaluated or not in exit_to_user_mode_loop(). As we really want TIF_LAZY_RESCHED for RT, we just keep all of this consistent in terms of code and make it purely a scheduler decision whether it utilizes it or not. As a consequence PREEMPT_RT is no longer special in that regard and the main RT difference becomes the lock substitution and forced interrupt threading. For the magic 'spare me the extra conditional' optimization of exit_to_user_mode_loop() if LAZY can be optimized out for !RT because the scheduler is sooo clever (which I doubt), we can just use the same approach as for other TIF bits and define them to 0 :) So let's start consistent and optimize on top if really required. Thanks, tglx
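The 'define them to 0' trick at the end is easy to model. In the sketch below all names are local placeholders standing in for the real kernel entry definitions, and the point is only that a zero mask lets the compiler drop both the test and the handling branch on configurations that do not use the lazy bit.

/*
 * Standalone model of the "define the TIF bit to 0" approach.  The
 * names mirror the generic entry code but are placeholders, not the
 * real definitions; CONFIG_PREEMPT_LAZY here is a stand-in for
 * whatever the eventual config switch would be.
 */
#include <stdio.h>

#define _TIF_NEED_RESCHED	(1ul << 3)
#ifdef CONFIG_PREEMPT_LAZY
# define _TIF_LAZY_RESCHED	(1ul << 4)
#else
# define _TIF_LAZY_RESCHED	0ul	/* the check compiles away */
#endif

#define EXIT_TO_USER_MODE_WORK	(_TIF_NEED_RESCHED | _TIF_LAZY_RESCHED)

static unsigned long exit_to_user_mode_loop(unsigned long ti_work)
{
	while (ti_work & EXIT_TO_USER_MODE_WORK) {
		if (ti_work & (_TIF_NEED_RESCHED | _TIF_LAZY_RESCHED)) {
			/* schedule() would run here */
			ti_work &= ~(_TIF_NEED_RESCHED | _TIF_LAZY_RESCHED);
		}
	}
	return ti_work;
}

int main(void)
{
	printf("remaining work: %#lx\n",
	       exit_to_user_mode_loop(_TIF_NEED_RESCHED));
	return 0;
}

With the bit defined to 0 both masks degenerate to _TIF_NEED_RESCHED, which is exactly the 'no extra conditional' property being discussed.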
On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote: > > The agreement to kill off ia64 wasn't an invitation to kill off other stuff > > that people are still working on! Can we please not do this? > > If you're working on one of them, then surely it's a simple matter of > working on adding CONFIG_PREEMPT support :-) As Geert pointed out, I'm not seeing anything particularly problematic with the architectures lacking CONFIG_PREEMPT at the moment. This seems to be more something about organizing KConfig files. I find it a bit unfair that maintainers of architectures that have huge companies behind them use their manpower to urge less popular architectures for removal just because they don't have 150 people working on the port so they can keep up with design changes quickly. Adrian
On Tue, Sep 19, 2023 at 03:48:09PM +0200, John Paul Adrian Glaubitz wrote: > On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote: > > > The agreement to kill off ia64 wasn't an invitation to kill off other stuff > > > that people are still working on! Can we please not do this? > > > > If you're working on one of them, then surely it's a simple matter of > > working on adding CONFIG_PREEMPT support :-) > > As Geert poined out, I'm not seeing anything particular problematic with the > architectures lacking CONFIG_PREEMPT at the moment. This seems to be more > something about organizing KConfig files. The plan in the parent thread is to remove PREEMPT_NONE and PREEMPT_VOLUNTARY and only keep PREEMPT_FULL. > I find it a bit unfair that maintainers of architectures that have huge companies > behind them use their manpower to urge less popular architectures for removal just > because they don't have 150 people working on the port so they can keep up with > design changes quickly. PREEMPT isn't something new. Also, I don't think the arch part for actually supporting it is particularly hard, mostly it is sticking the preempt_schedule_irq() call in return from interrupt code path. If you convert the arch to generic-entry (a much larger undertaking) then you get this for free.
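What Peter describes amounts to roughly the check below, shown as a standalone model with stubs standing in for the real kernel primitives; the actual hook lives in the architecture's interrupt-return path, or comes for free from the generic entry code.

/* Rough standalone model of the preemption point an architecture has
 * to provide on return from interrupt to kernel mode.  The stubs
 * stand in for the real preempt_count()/need_resched()/
 * preempt_schedule_irq() primitives. */
#include <stdbool.h>
#include <stdio.h>

static unsigned int preempt_count_val;
static bool need_resched_flag;

static unsigned int preempt_count(void)	{ return preempt_count_val; }
static bool need_resched(void)		{ return need_resched_flag; }

static void preempt_schedule_irq(void)
{
	puts("preempt_schedule_irq()");
	need_resched_flag = false;
}

/* Called on the interrupt-return path when going back to kernel mode. */
static void irqentry_exit_to_kernel_mode(void)
{
	if (!preempt_count() && need_resched())
		preempt_schedule_irq();
}

int main(void)
{
	need_resched_flag = true;

	preempt_count_val = 1;		/* irq hit a preempt-disabled region */
	irqentry_exit_to_kernel_mode();	/* must not preempt */

	preempt_count_val = 0;
	irqentry_exit_to_kernel_mode();	/* preempts here */
	return 0;
}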
On Tue, Sep 19 2023 at 15:48, John Paul Adrian Glaubitz wrote: > On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote: >> > The agreement to kill off ia64 wasn't an invitation to kill off other stuff >> > that people are still working on! Can we please not do this? >> >> If you're working on one of them, then surely it's a simple matter of >> working on adding CONFIG_PREEMPT support :-) > > As Geert poined out, I'm not seeing anything particular problematic with the > architectures lacking CONFIG_PREEMPT at the moment. This seems to be more > something about organizing KConfig files. > > I find it a bit unfair that maintainers of architectures that have huge companies > behind them use their manpower to urge less popular architectures for removal just > because they don't have 150 people working on the port so they can keep up with > design changes quickly. I don't urge for removal. I just noticed that these four architectures lack PREEMPT support. The only thing which is missing is the actual preemption point in the return to kernel code path. But otherwise it should just work, which I obviously can't confirm :) Even without that preemption point it should build and boot. There might be some minor latency issues when that preemption point is not there, but adding it is not rocket science either. It's probably about 10 lines of ASM code, if at all. Though not adding that might cause a blocking issue for the rework of the whole preemption logic in order to remove the sprinkled around cond_resched() muck or force us to maintain some nasty workaround just for the benefit of a few stragglers. So I can make the same argument the other way around, that it's unjustified that some architectures which are just supported for nostalgia throw roadblocks into kernel development. If my ALPHA foo weren't very close to zero, I'd write that ASM hack myself, but that's going to cost more of my and your time than it's worth the trouble. Hmm. I could delegate that to Linus, he might still remember :) Thanks, tglx
On 19/09/2023 14:42, Peter Zijlstra wrote: > On Tue, Sep 19, 2023 at 03:37:24PM +0200, John Paul Adrian Glaubitz wrote: >> On Tue, 2023-09-19 at 14:00 +0100, Matthew Wilcox wrote: >>> On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote: >>>> Though it just occured to me that there are dragons lurking: >>>> >>>> arch/alpha/Kconfig: select ARCH_NO_PREEMPT >>>> arch/hexagon/Kconfig: select ARCH_NO_PREEMPT >>>> arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE >>>> arch/um/Kconfig: select ARCH_NO_PREEMPT >>> >>> Sounds like three-and-a-half architectures which could be queued up for >>> removal right behind ia64 ... >> >> The agreement to kill off ia64 wasn't an invitation to kill off other stuff >> that people are still working on! Can we please not do this? > > If you're working on one of them, then surely it's a simple matter of > working on adding CONFIG_PREEMPT support :-) In the case of UML adding preempt will be quite difficult. I looked at this a few years back. At the same time it is used for kernel test and other stuff. It is not exactly abandonware on a CPU found in archaeological artifacts of past civilizations like ia64.
On Tue, 2023-09-19 at 16:16 +0200, Peter Zijlstra wrote: > > I find it a bit unfair that maintainers of architectures that have huge companies > > behind them use their manpower to urge less popular architectures for removal just > > because they don't have 150 people working on the port so they can keep up with > > design changes quickly. > > PREEMPT isn't something new. Also, I don't think the arch part for > actually supporting it is particularly hard, mostly it is sticking the > preempt_schedule_irq() call in return from interrupt code path. > > If you convert the arch to generic-entry (a much larger undertaking) > then you get this for free. If the conversion isn't hard, why is the first reflex the urge to remove an architecture instead of offering advice on how to get the conversion done? Adrian
On Tue, Sep 19, 2023 at 04:24:48PM +0200, John Paul Adrian Glaubitz wrote: > If the conversion isn't hard, why is the first reflex the urge to remove an architecture > instead of offering advise how to get the conversion done? Because PREEMPT has been around since before 2005 (cc19ca86a023 created Kconfig.preempt and I don't need to go back further than that to make my point), and you haven't done the work yet. Clearly it takes the threat of removal to get some kind of motion.
On September 19, 2023 7:17:04 AM PDT, Thomas Gleixner <tglx@linutronix.de> wrote: >On Tue, Sep 19 2023 at 15:48, John Paul Adrian Glaubitz wrote: >> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote: >>> > The agreement to kill off ia64 wasn't an invitation to kill off other stuff >>> > that people are still working on! Can we please not do this? >>> >>> If you're working on one of them, then surely it's a simple matter of >>> working on adding CONFIG_PREEMPT support :-) >> >> As Geert poined out, I'm not seeing anything particular problematic with the >> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more >> something about organizing KConfig files. >> >> I find it a bit unfair that maintainers of architectures that have huge companies >> behind them use their manpower to urge less popular architectures for removal just >> because they don't have 150 people working on the port so they can keep up with >> design changes quickly. > >I don't urge for removal. I just noticed that these four architectures >lack PREEMPT support. The only thing which is missing is the actual >preemption point in the return to kernel code path. > >But otherwise it should just work, which I obviously can't confirm :) > >Even without that preemption point it should build and boot. There might >be some minor latency issues when that preemption point is not there, >but adding it is not rocket science either. It's probably about 10 lines >of ASM code, if at all. > >Though not adding that might cause a blocking issue for the rework of >the whole preemption logic in order to remove the sprinkled around >cond_resched() muck or force us to maintain some nasty workaround just >for the benefit of a few stranglers. > >So I can make the same argument the other way around, that it's >unjustified that some architectures which are just supported for >nostalgia throw roadblocks into kernel developemnt. > >If my ALPHA foo wouldn't be very close to zero, I'd write that ASM hack >myself, but that's going to cost more of my and your time than it's >worth the trouble, > >Hmm. I could delegate that to Linus, he might still remember :) > >Thanks, > > tglx Does *anyone* actually run Alpha at this point?
On Tue, Sep 19, 2023 at 10:51 AM H. Peter Anvin <hpa@zytor.com> wrote: > > On September 19, 2023 7:17:04 AM PDT, Thomas Gleixner <tglx@linutronix.de> wrote: > >On Tue, Sep 19 2023 at 15:48, John Paul Adrian Glaubitz wrote: > >> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote: > >>> > The agreement to kill off ia64 wasn't an invitation to kill off other stuff > >>> > that people are still working on! Can we please not do this? > >>> > >>> If you're working on one of them, then surely it's a simple matter of > >>> working on adding CONFIG_PREEMPT support :-) > >> > >> As Geert poined out, I'm not seeing anything particular problematic with the > >> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more > >> something about organizing KConfig files. > >> > >> I find it a bit unfair that maintainers of architectures that have huge companies > >> behind them use their manpower to urge less popular architectures for removal just > >> because they don't have 150 people working on the port so they can keep up with > >> design changes quickly. > > > >I don't urge for removal. I just noticed that these four architectures > >lack PREEMPT support. The only thing which is missing is the actual > >preemption point in the return to kernel code path. > > > >But otherwise it should just work, which I obviously can't confirm :) > > > >Even without that preemption point it should build and boot. There might > >be some minor latency issues when that preemption point is not there, > >but adding it is not rocket science either. It's probably about 10 lines > >of ASM code, if at all. > > > >Though not adding that might cause a blocking issue for the rework of > >the whole preemption logic in order to remove the sprinkled around > >cond_resched() muck or force us to maintain some nasty workaround just > >for the benefit of a few stranglers. > > > >So I can make the same argument the other way around, that it's > >unjustified that some architectures which are just supported for > >nostalgia throw roadblocks into kernel developemnt. > > > >If my ALPHA foo wouldn't be very close to zero, I'd write that ASM hack > >myself, but that's going to cost more of my and your time than it's > >worth the trouble, > > > >Hmm. I could delegate that to Linus, he might still remember :) > > > >Thanks, > > > > tglx > > Does *anyone* actually run Alpha at this point? I do, as part of maintaining the Gentoo distribution for Alpha. I'm listed in MAINTAINERS, but really only so I can collect patches send them to Linus after testing. I don't have copious amounts of free time to be proactive in kernel development and it's also not really my area of expertise so I'm nowhere near effective at it. I would be happy to test any patches sent my way (but I acknowledge that writing these patches wouldn't be high on anyone's priority list, etc) (A video my friend Ian and I made about a particularly large AlphaServer I have in my basement, in case anyone is interested: https://www.youtube.com/watch?v=z658a8Js5qg)
On Tue, Sep 19 2023 at 15:21, Anton Ivanov wrote: > On 19/09/2023 14:42, Peter Zijlstra wrote: >> If you're working on one of them, then surely it's a simple matter of >> working on adding CONFIG_PREEMPT support :-) > > In the case of UML adding preempt will be quite difficult. I looked at > this a few years back. What's so difficult about it? Thanks, tglx
On 19/09/2023 16:17, Thomas Gleixner wrote: > On Tue, Sep 19 2023 at 15:21, Anton Ivanov wrote: >> On 19/09/2023 14:42, Peter Zijlstra wrote: >>> If you're working on one of them, then surely it's a simple matter of >>> working on adding CONFIG_PREEMPT support :-) >> In the case of UML adding preempt will be quite difficult. I looked at >> this a few years back. > What's so difficult about it? It's been a while. I remember that I dropped it at the time, but do not remember the full details. There was some stuff related to FP state and a few other issues I ran into while rewriting the interrupt controller. Some of it may be resolved by now as we are using host cpu flags, etc. I can give it another go :) > > Thanks, > > tglx >
On Tue, 19 Sep 2023 15:32:05 +0100 Matthew Wilcox <willy@infradead.org> wrote: > On Tue, Sep 19, 2023 at 04:24:48PM +0200, John Paul Adrian Glaubitz wrote: > > If the conversion isn't hard, why is the first reflex the urge to remove an architecture > > instead of offering advise how to get the conversion done? > > Because PREEMPT has been around since before 2005 (cc19ca86a023 created > Kconfig.preempt and I don't need to go back further than that to make my > point), and you haven't done the work yet. Clearly it takes the threat > of removal to get some kind of motion. Or the use case of a preempt kernel on said arch has never been a request. Just because it was available doesn't necessarily mean it's required. Please, let's not jump to threats of removal just to get a feature in. Simply ask first. I didn't see anyone reaching out to the maintainers asking for this as it will be needed for a new feature that will likely make maintaining said arch easier. Everything is still in brainstorming mode. -- Steve
----- Ursprüngliche Mail ----- > Von: "anton ivanov" <anton.ivanov@cambridgegreys.com> > It's been a while. I remember that I dropped it at the time, but do not remember > the full details. > > There was some stuff related to FP state and a few other issues I ran into while > rewriting the interrupt controller. Some of it may be resolved by now as we are > using host cpu flags, etc. I remember also having a hacky but working version almost 10 years ago. It was horribly slow because of the extra scheduler rounds. But yes, if PREEMPT will be a must-have feature we'll have to try again. Thanks, //richard
On 19/09/2023 17:22, Richard Weinberger wrote: > ----- Ursprüngliche Mail ----- >> Von: "anton ivanov" <anton.ivanov@cambridgegreys.com> >> It's been a while. I remember that I dropped it at the time, but do not remember >> the full details. >> >> There was some stuff related to FP state and a few other issues I ran into while >> rewriting the interrupt controller. Some of it may be resolved by now as we are >> using host cpu flags, etc. > > I remember also having a hacky but working version almost 10 years ago. > It was horrible slow because of the extra scheduler rounds. > But yes, if PREEMPT will be a must-have feature we'll have to try again. We will need proper fpu primitives for starters that's for sure. fpu_star/end in UML are presently NOOP. Some of the default spinlocks and other stuff which we pick up from generic may need to change as well. This is off the top of my head and something which we can fix straight away. I will send some patches to the mailing list tomorrow or on Thu. A. > > Thanks, > //richard
Hi,
[del]
> Does *anyone* actually run Alpha at this point?
Yes, at least I'm still trying to keep my boxes running from time to time.
CU,
Uli
On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> wrote: > > As Geert poined out, I'm not seeing anything particular problematic with the > architectures lacking CONFIG_PREEMPT at the moment. This seems to be more > something about organizing KConfig files. It can definitely be problematic. Not the Kconfig file part, and not the preempt count part itself. But the fact that it has never been used and tested means that there might be tons of "this architecture code knows it's not preemptible, because this architecture doesn't support preemption". So you may have basic architecture code that simply doesn't have the "preempt_disable()/enable()" pairs that it needs. PeterZ mentioned the generic entry code, which does this for the entry path. But it actually goes much deeper: just do a git grep preempt_disable arch/x86/kernel and then do the same for some other architectures. Looking at alpha, for example, there *are* hits for it, so at least some of the code there clearly *tries* to do it. But does it cover all the required parts? If it's never been tested, I'd be surprised if it's all just ready to go. I do think we'd need to basically continue to support ARCH_NO_PREEMPT - and such architectures might end up with the worst-case latencies of only scheduling at return to user space. Linus
On Tue, Sep 19 2023 at 17:41, Anton Ivanov wrote: > On 19/09/2023 17:22, Richard Weinberger wrote: >> ----- Ursprüngliche Mail ----- >>> Von: "anton ivanov" <anton.ivanov@cambridgegreys.com> >>> It's been a while. I remember that I dropped it at the time, but do not remember >>> the full details. >>> >>> There was some stuff related to FP state and a few other issues I ran into while >>> rewriting the interrupt controller. Some of it may be resolved by now as we are >>> using host cpu flags, etc. >> >> I remember also having a hacky but working version almost 10 years ago. >> It was horrible slow because of the extra scheduler rounds. Which can be completely avoided as the proposed change will have the preemption points, but they are only utilized when preempt FULL is enabled (at boot or runtime). So the behaviour can still be like preempt NONE, but with a twist to get rid of the cond_resched()/might_resched() and other heuristic approaches to prevent starvation by long running functions. That twist needs the preemption points. See https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx >> But yes, if PREEMPT will be a must-have feature we'll have to try again. > > We will need proper fpu primitives for starters that's for > sure. fpu_star/end in UML are presently NOOP. > > Some of the default spinlocks and other stuff which we pick up from > generic may need to change as well. > > This is off the top of my head and something which we can fix straight > away. I will send some patches to the mailing list tomorrow or on Thu. I think it does not have to be perfect. UM is far from perfect in mimicing a real kernel. The main point is that it provides the preempt counter in the first place and some minimal amount of preemption points aside of those which come with the preempt_enable() machinery for free. Thanks, tglx
Hi Linus! On Tue, 2023-09-19 at 10:25 -0700, Linus Torvalds wrote: > On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz > <glaubitz@physik.fu-berlin.de> wrote: > > > > As Geert poined out, I'm not seeing anything particular problematic with the > > architectures lacking CONFIG_PREEMPT at the moment. This seems to be more > > something about organizing KConfig files. > > It can definitely be problematic. > > Not the Kconfig file part, and not the preempt count part itself. > > But the fact that it has never been used and tested means that there > might be tons of "this architecture code knows it's not preemptible, > because this architecture doesn't support preemption". > > So you may have basic architecture code that simply doesn't have the > "preempt_disable()/enable()" pairs that it needs. > > PeterZ mentioned the generic entry code, which does this for the entry > path. But it actually goes much deeper: just do a > > git grep preempt_disable arch/x86/kernel > > and then do the same for some other architectures. > > Looking at alpha, for example, there *are* hits for it, so at least > some of the code there clearly *tries* to do it. But does it cover all > the required parts? If it's never been tested, I'd be surprised if > it's all just ready to go. Thanks for the detailed explanation. > I do think we'd need to basically continue to support ARCH_NO_PREEMPT > - and such architectures migth end up with the worst-cast latencies of > only scheduling at return to user space. Great to hear, thank you. And, yes, eventually I would be happy to help get alpha and m68k converted. Adrian
On Tue, Sep 19 2023 at 10:25, Linus Torvalds wrote: > On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz > <glaubitz@physik.fu-berlin.de> wrote: >> >> As Geert poined out, I'm not seeing anything particular problematic with the >> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more >> something about organizing KConfig files. > > It can definitely be problematic. > > Not the Kconfig file part, and not the preempt count part itself. > > But the fact that it has never been used and tested means that there > might be tons of "this architecture code knows it's not preemptible, > because this architecture doesn't support preemption". > > So you may have basic architecture code that simply doesn't have the > "preempt_disable()/enable()" pairs that it needs. > > PeterZ mentioned the generic entry code, which does this for the entry > path. But it actually goes much deeper: just do a > > git grep preempt_disable arch/x86/kernel > > and then do the same for some other architectures. > > Looking at alpha, for example, there *are* hits for it, so at least > some of the code there clearly *tries* to do it. But does it cover all > the required parts? If it's never been tested, I'd be surprised if > it's all just ready to go. > > I do think we'd need to basically continue to support ARCH_NO_PREEMPT > - and such architectures migth end up with the worst-cast latencies of > only scheduling at return to user space. The only thing these architectures should gain is the preempt counter itself, but yes the extra preemption points are not mandatory to have, i.e. we simply do not enable them for the nostalgia club. The removal of cond_resched() might cause latencies, but then I doubt that these museum pieces are used for real work :) Thanks, tglx
On Tue, 19 Sep 2023 20:31:50 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > The removal of cond_resched() might cause latencies, but then I doubt > that these museus pieces are used for real work :) We could simply leave the cond_resched() around but defined as nops for everything but the "nostalgia club" to keep them from having any regressions. -- Steve
On Tue, 19 Sept 2023 at 11:37, Steven Rostedt <rostedt@goodmis.org> wrote: > > We could simply leave the cond_resched() around but defined as nops for > everything but the "nostalgia club" to keep them from having any regressions. I doubt the nostalgia club cares about some latencies (that are usually only noticeable under extreme loads anyway). And if they do, maybe that would make somebody sit down and look into doing it right. So I think keeping it around would actually be both useless and counter-productive. Linus
Thomas Gleixner <tglx@linutronix.de> writes: > On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote: >> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote: >>> > The problem with the REP prefix (and Xen hypercalls) is that >>> > they're long running instructions and it becomes fundamentally >>> > impossible to put a cond_resched() in. >>> > >>> >> Yes. I'm starting to think that that the only sane solution is to >>> >> limit cases that can do this a lot, and the "instruciton pointer >>> >> region" approach would certainly work. >>> > >>> > From a code locality / I-cache POV, I think a sorted list of >>> > (non overlapping) ranges might be best. >>> >>> Yeah, agreed. There are a few problems with doing that though. >>> >>> I was thinking of using a check of this kind to schedule out when >>> it is executing in this "reschedulable" section: >>> !preempt_count() && in_resched_function(regs->rip); >>> >>> For preemption=full, this should mostly work. >>> For preemption=voluntary, though this'll only work with out-of-line >>> locks, not if the lock is inlined. >>> >>> (Both, should have problems with __this_cpu_* and the like, but >>> maybe we can handwave that away with sparse/objtool etc.) >> >> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges >> thing, and then only search the range when TIF flag is set. >> >> And I'm thinking it might be a good idea to have objtool validate the >> range only contains simple instructions, the moment it contains control >> flow I'm thinking it's too complicated. > > Can we take a step back and look at the problem from a scheduling > perspective? > > The basic operation of a non-preemptible kernel is time slice > scheduling, which means that a task can run more or less undisturbed for > a full time slice once it gets on the CPU unless it schedules away > voluntary via a blocking operation. > > This works pretty well as long as everything runs in userspace as the > preemption points in the return to user space path are independent of > the preemption model. > > These preemption points handle both time slice exhaustion and priority > based preemption. > > With PREEMPT=NONE these are the only available preemption points. > > That means that kernel code can run more or less indefinitely until it > schedules out or returns to user space, which is obviously not possible > for kernel threads. > > To prevent starvation the kernel gained voluntary preemption points, > i.e. cond_resched(), which has to be added manually to code as a > developer sees fit. > > Later we added PREEMPT=VOLUNTARY which utilizes might_resched() as > additional preemption points. might_resched() utilizes the existing > might_sched() debug points, which are in code paths which might block on > a contended resource. These debug points are mostly in core and > infrastructure code and are in code paths which can block anyway. The > only difference is that they allow preemption even when the resource is > uncontended. > > Additionally we have PREEMPT=FULL which utilizes every zero transition > of preeempt_count as a potential preemption point. > > Now we have the situation of long running data copies or data clear > operations which run fully in hardware, but can be interrupted. As the > interrupt return to kernel mode does not preempt in the NONE and > VOLUNTARY cases, new workarounds emerged. Mostly by defining a data > chunk size and adding cond_reched() again. 
> > That's ugly and does not work for long lasting hardware operations so we > ended up with the suggestion of TIF_ALLOW_RESCHED to work around > that. But again this needs to be manually annotated in the same way as a > IP range based preemption scheme requires annotation. > > TBH. I detest all of this. > > Both cond_resched() and might_sleep/sched() are completely random > mechanisms as seen from time slice operation and the data chunk based > mechanism is just heuristics which works as good as heuristics tend to > work. allow_resched() is not any different and IP based preemption > mechanism are not going to be any better. Agreed. I was looking at how to add resched sections etc, and in addition to the randomness the choice of where exactly to add it seemed to be quite fuzzy. A recipe for future kruft. > The approach here is: Prevent the scheduler to make decisions and then > mitigate the fallout with heuristics. > > That's just backwards as it moves resource control out of the scheduler > into random code which has absolutely no business to do resource > control. > > We have the reverse issue observed in PREEMPT_RT. The fact that spinlock > held sections became preemtible caused even more preemption activity > than on a PREEMPT=FULL kernel. The worst side effect of that was > extensive lock contention. > > The way how we addressed that was to add a lazy preemption mode, which > tries to preserve the PREEMPT=FULL behaviour when the scheduler wants to > preempt tasks which all belong to the SCHED_OTHER scheduling class. This > works pretty well and gains back a massive amount of performance for the > non-realtime throughput oriented tasks without affecting the > schedulability of real-time tasks at all. IOW, it does not take control > away from the scheduler. It cooperates with the scheduler and leaves the > ultimate decisions to it. > > I think we can do something similar for the problem at hand, which > avoids most of these heuristic horrors and control boundary violations. > > The main issue is that long running operations do not honour the time > slice and we work around that with cond_resched() and now have ideas > with this new TIF bit and IP ranges. > > None of that is really well defined in respect to time slices. In fact > its not defined at all versus any aspect of scheduling behaviour. > > What about the following: > > 1) Keep preemption count and the real preemption points enabled > unconditionally. That's not more overhead than the current > DYNAMIC_PREEMPT mechanism as long as the preemption count does not > go to zero, i.e. the folded NEED_RESCHED bit stays set. > > From earlier experiments I know that the overhead of preempt_count > is minimal and only really observable with micro benchmarks. > Otherwise it ends up in the noise as long as the slow path is not > taken. > > I did a quick check comparing a plain inc/dec pair vs. the > DYMANIC_PREEMPT inc/dec_and_test+NOOP mechanism and the delta is > in the non-conclusive noise. > > 20 years ago this was a real issue because we did not have: > > - the folding of NEED_RESCHED into the preempt count > > - the cacheline optimizations which make the preempt count cache > pretty much always cache hot > > - the hardware was way less capable > > I'm not saying that preempt_count is completely free today as it > obviously adds more text and affects branch predictors, but as the > major distros ship with DYNAMIC_PREEMPT enabled it is obviously an > acceptable and tolerable tradeoff. 
> > 2) When the scheduler wants to set NEED_RESCHED due it sets > NEED_RESCHED_LAZY instead which is only evaluated in the return to > user space preemption points. > > As NEED_RESCHED_LAZY is not folded into the preemption count the > preemption count won't become zero, so the task can continue until > it hits return to user space. > > That preserves the existing behaviour. > > 3) When the scheduler tick observes that the time slice is exhausted, > then it folds the NEED_RESCHED bit into the preempt count which > causes the real preemption points to actually preempt including > the return from interrupt to kernel path. Right, and currently we check cond_resched() all the time in expectation that something might need a resched. Folding it in with the scheduler determining when next preemption happens seems to make a lot of sense to me. Thanks Ankur > That even allows the scheduler to enforce preemption for e.g. RT > class tasks without changing anything else. > > I'm pretty sure that this gets rid of cond_resched(), which is an > impressive list of instances: > > ./drivers 392 > ./fs 318 > ./mm 189 > ./kernel 184 > ./arch 95 > ./net 83 > ./include 46 > ./lib 36 > ./crypto 16 > ./sound 16 > ./block 11 > ./io_uring 13 > ./security 11 > ./ipc 3 > > That list clearly documents that the majority of these > cond_resched() invocations is in code which neither should care > nor should have any influence on the core scheduling decision > machinery. > > I think it's worth a try as it just fits into the existing preemption > scheme, solves the issue of long running kernel functions, prevents > invalid preemption and can utilize the existing instrumentation and > debug infrastructure. > > Most importantly it gives control back to the scheduler and does not > make it depend on the mercy of cond_resched(), allow_resched() or > whatever heuristics sprinkled all over the kernel. > To me this makes a lot of sense, but I might be on the completely wrong > track. Se feel free to tell me that I'm completely nuts and/or just not > seeing the obvious. > > Thanks, > > tglx -- ankur
On Tue, Sep 19 2023 at 11:52, Linus Torvalds wrote: > On Tue, 19 Sept 2023 at 11:37, Steven Rostedt <rostedt@goodmis.org> wrote: >> >> We could simply leave the cond_resched() around but defined as nops for >> everything but the "nostalgia club" to keep them from having any regressions. > > I doubt the nostalgia club cares about some latencies (that are > usually only noticeable under extreme loads anyway). > > And if they do, maybe that would make somebody sit down and look into > doing it right. > > So I think keeping it around would actually be both useless and > counter-productive. Amen to that.
* Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, Sep 19 2023 at 10:25, Linus Torvalds wrote: > > On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz > > <glaubitz@physik.fu-berlin.de> wrote: > >> > >> As Geert poined out, I'm not seeing anything particular problematic with the > >> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more > >> something about organizing KConfig files. > > > > It can definitely be problematic. > > > > Not the Kconfig file part, and not the preempt count part itself. > > > > But the fact that it has never been used and tested means that there > > might be tons of "this architecture code knows it's not preemptible, > > because this architecture doesn't support preemption". > > > > So you may have basic architecture code that simply doesn't have the > > "preempt_disable()/enable()" pairs that it needs. > > > > PeterZ mentioned the generic entry code, which does this for the entry > > path. But it actually goes much deeper: just do a > > > > git grep preempt_disable arch/x86/kernel > > > > and then do the same for some other architectures. > > > > Looking at alpha, for example, there *are* hits for it, so at least > > some of the code there clearly *tries* to do it. But does it cover all > > the required parts? If it's never been tested, I'd be surprised if > > it's all just ready to go. > > > > I do think we'd need to basically continue to support ARCH_NO_PREEMPT > > - and such architectures migth end up with the worst-cast latencies of > > only scheduling at return to user space. > > The only thing these architectures should gain is the preempt counter > itself, [...] And if any of these machines are still used, there's the small benefit of preempt_count increasing debuggability of scheduling in supposedly preempt-off sections that were ignored silently previously, as most of these architectures do not even enable CONFIG_DEBUG_ATOMIC_SLEEP=y in their defconfigs: $ for ARCH in alpha hexagon m68k um; do git grep DEBUG_ATOMIC_SLEEP arch/$ARCH; done $ Plus the efficiency of CONFIG_DEBUG_ATOMIC_SLEEP=y is much reduced on non-PREEMPT kernels to begin with: it will basically only detect scheduling in hardirqs-off critical sections. So IMHO there's a distinct debuggability & robustness plus in enabling the preemption count on all architectures, even if they don't or cannot use the rescheduling points. > [...] but yes the extra preemption points are not mandatory to have, i.e. > we simply do not enable them for the nostalgia club. > > The removal of cond_resched() might cause latencies, but then I doubt > that these museus pieces are used for real work :) I'm not sure we should initially remove *explicit* legacy cond_resched() points, except from high-freq paths where they hurt - and of course remove them from might_sleep(). Thanks, Ingo
* Steven Rostedt <rostedt@goodmis.org> wrote: > On Tue, 19 Sep 2023 20:31:50 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: > > > The removal of cond_resched() might cause latencies, but then I doubt > > that these museus pieces are used for real work :) > > We could simply leave the cond_resched() around but defined as nops for > everything but the "nostalgia club" to keep them from having any regressions. That's not a good idea IMO, it's an invitation for bitrot at an accelerated rate, turning cond_resched() meaningless very quickly. We should remove cond_resched() - but probably not as the first step. They are conceptually independent of NEED_RESCHED_LAZY and we don't *have to* remove them straight away. By removing cond_resched() separately there's an easily bisectable point to blame for any longer latencies on legacy platforms, should any of them still be used with recent kernels. Thanks, Ingo
On Tue, Sep 19 2023 at 10:25, Linus Torvalds wrote: > PeterZ mentioned the generic entry code, which does this for the entry > path. But it actually goes much deeper: just do a > > git grep preempt_disable arch/x86/kernel > > and then do the same for some other architectures. > > Looking at alpha, for example, there *are* hits for it, so at least > some of the code there clearly *tries* to do it. But does it cover all > the required parts? If it's never been tested, I'd be surprised if > it's all just ready to go. Interestingly enough m68k has zero instances, but it supports PREEMPT on the COLDFIRE subarchitecture...
From: Linus Torvalds > Sent: 19 September 2023 18:25 > > On Tue, 19 Sept 2023 at 06:48, John Paul Adrian Glaubitz > <glaubitz@physik.fu-berlin.de> wrote: > > > > As Geert poined out, I'm not seeing anything particular problematic with the > > architectures lacking CONFIG_PREEMPT at the moment. This seems to be more > > something about organizing KConfig files. > > It can definitely be problematic. > > Not the Kconfig file part, and not the preempt count part itself. > > But the fact that it has never been used and tested means that there > might be tons of "this architecture code knows it's not preemptible, > because this architecture doesn't support preemption". Do distros even build x86 kernels with PREEMPT_FULL? I know I've had issues with massive latencies caused by the graphics driver forcing write-backs of all the framebuffer memory. (I think it is a failed attempt to fix a temporary display corruption.) OTOH SMP support and CONFIG_RT will test most of the code. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Thomas Gleixner <tglx@linutronix.de> writes: > So the decision matrix would be: > > Ret2user Ret2kernel PreemptCnt=0 > > NEED_RESCHED Y Y Y > LAZY_RESCHED Y N N > > That is completely independent of the preemption model and the > differentiation of the preemption models happens solely at the scheduler > level: This is relatively minor, but do we need two flags? Seems to me we can get to the same decision matrix by letting the scheduler fold into the preempt-count based on current preemption model. > PREEMPT_NONE sets only LAZY_RESCHED unless it needs to enforce the time > slice where it sets NEED_RESCHED. PREEMPT_NONE sets up TIF_NEED_RESCHED. For the time-slice expiry case, also fold into preempt-count. > PREEMPT_VOLUNTARY extends the NONE model so that the wakeup of RT class > tasks or sporadic event tasks sets NEED_RESCHED too. PREEMPT_VOLUNTARY sets up TIF_NEED_RESCHED and also folds it for the RT/sporadic tasks. > PREEMPT_FULL always sets NEED_RESCHED like today. Always fold the TIF_NEED_RESCHED into the preempt-count. > We should be able merge the PREEMPT_NONE/VOLUNTARY behaviour so that we > only end up with two variants or even subsume PREEMPT_FULL into that > model because that's what is closer to the RT LAZY preempt behaviour, > which has two goals: > > 1) Make low latency guarantees for RT workloads > > 2) Preserve the throughput for non-RT workloads > > But in any case this decision happens solely in the core scheduler code > and nothing outside of it needs to be changed. > > So we not only get rid of the cond/might_resched() muck, we also get rid > of the static_call/static_key machinery which drives PREEMPT_DYNAMIC. > The only place which still needs that runtime tweaking is the scheduler > itself. True. The dynamic preemption could just become a scheduler tunable. > Though it just occured to me that there are dragons lurking: > > arch/alpha/Kconfig: select ARCH_NO_PREEMPT > arch/hexagon/Kconfig: select ARCH_NO_PREEMPT > arch/m68k/Kconfig: select ARCH_NO_PREEMPT if !COLDFIRE > arch/um/Kconfig: select ARCH_NO_PREEMPT > > So we have four architectures which refuse to enable preemption points, > i.e. the only model they allow is NONE and they rely on cond_resched() > for breaking large computations. > > But they support PREEMPT_COUNT, so we might get away with a reduced > preemption point coverage: > > Ret2user Ret2kernel PreemptCnt=0 > > NEED_RESCHED Y N Y > LAZY_RESCHED Y N N So from the discussion in the other thread, for the ARCH_NO_PREEMPT configs that don't support preemption, we probably need a fourth preemption model, say PREEMPT_UNSAFE. These could use only the Ret2user preemption points and just fall back to the !PREEMPT_COUNT primitives. Thanks -- ankur
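To illustrate the per-model behaviour discussed above, a rough sketch of how the wakeup-side choice could stay entirely inside the scheduler core (illustrative only; resched_on_wakeup() is a made-up name, resched_curr_lazy() comes from the PoC later in the thread, and the RT/sporadic condition is reduced to a plain rt_task() check):

/*
 * Sketch: the preemption model only influences which reschedule flavour
 * the scheduler requests; nothing outside the scheduler needs to know.
 */
static void resched_on_wakeup(struct rq *rq, struct task_struct *wakee)
{
	/* FULL: always request immediate preemption, as today. */
	if (preempt_model_full()) {
		resched_curr(rq);
		return;
	}

	/* VOLUNTARY: wakeups of RT class / sporadic event tasks get the real bit. */
	if (preempt_model_voluntary() && rt_task(wakee)) {
		resched_curr(rq);
		return;
	}

	/* NONE (and everything else): lazy, honoured at return to user space. */
	resched_curr_lazy(rq);
}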
On 19/09/2023 15:16, Peter Zijlstra wrote: > On Tue, Sep 19, 2023 at 03:48:09PM +0200, John Paul Adrian Glaubitz wrote: >> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote: >>>> The agreement to kill off ia64 wasn't an invitation to kill off other stuff >>>> that people are still working on! Can we please not do this? >>> >>> If you're working on one of them, then surely it's a simple matter of >>> working on adding CONFIG_PREEMPT support :-) >> >> As Geert poined out, I'm not seeing anything particular problematic with the >> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more >> something about organizing KConfig files. > > The plan in the parent thread is to remove PREEMPT_NONE and > PREEMPT_VOLUNTARY and only keep PREEMPT_FULL. > >> I find it a bit unfair that maintainers of architectures that have huge companies >> behind them use their manpower to urge less popular architectures for removal just >> because they don't have 150 people working on the port so they can keep up with >> design changes quickly. > > PREEMPT isn't something new. Also, I don't think the arch part for > actually supporting it is particularly hard, mostly it is sticking the > preempt_schedule_irq() call in return from interrupt code path. That calls local_irq_enable() which does various signal related/irq pending work on UML. That in turn does not like being invoked again (as you may have already been invoked out of that) in the IRQ return path. So it is likely to end up being slightly more difficult than that for UML - it will need to be wrapped so it can be invoked from the "host" side signal code as well as invoked with some additional checks to avoid making a hash out of the IRQ handling. It may be necessary to modify some of the existing reentrancy prevention logic in the signal handlers as well and change it to make use of the preempt count instead of its own flags/counters. > > If you convert the arch to generic-entry (a much larger undertaking) > then you get this for free.
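For reference, the arch hook Peter mentions ("sticking the preempt_schedule_irq() call in the return from interrupt code path") amounts to something like the following (a simplified sketch of the generic pattern as it would live in an architecture's interrupt return path, not UML code; the UML-specific signal/IRQ reentrancy handling described above would have to be wrapped around it):

/*
 * Sketch: on return from interrupt to kernel mode, preempt if the
 * interrupted context is preemptible and a reschedule is pending.
 * Interrupts are still disabled here; preempt_schedule_irq() expects that.
 */
static void irq_exit_maybe_preempt(struct pt_regs *regs)
{
	if (user_mode(regs))
		return;		/* the return-to-user path does its own TIF work */

	if (!preempt_count() && need_resched())
		preempt_schedule_irq();
}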
On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote: > Thomas Gleixner <tglx@linutronix.de> writes: > >> So the decision matrix would be: >> >> Ret2user Ret2kernel PreemptCnt=0 >> >> NEED_RESCHED Y Y Y >> LAZY_RESCHED Y N N >> >> That is completely independent of the preemption model and the >> differentiation of the preemption models happens solely at the scheduler >> level: > > This is relatively minor, but do we need two flags? Seems to me we > can get to the same decision matrix by letting the scheduler fold > into the preempt-count based on current preemption model. You still need the TIF flags because there is no way to do remote modification of preempt count. The preempt count folding is an optimization which simplifies the preempt_enable logic: if (--preempt_count && need_resched()) schedule() to if (--preempt_count) schedule() i.e. a single conditional instead of two. The lazy bit is only evaluated in: 1) The return to user path 2) need_resched() In neither case is preempt_count involved. So it does not buy us anything. We might revisit that later, but for simplicity's sake the extra TIF bit is way simpler. Premature optimization is the enemy of correctness. >> We should be able merge the PREEMPT_NONE/VOLUNTARY behaviour so that we >> only end up with two variants or even subsume PREEMPT_FULL into that >> model because that's what is closer to the RT LAZY preempt behaviour, >> which has two goals: >> >> 1) Make low latency guarantees for RT workloads >> >> 2) Preserve the throughput for non-RT workloads >> >> But in any case this decision happens solely in the core scheduler code >> and nothing outside of it needs to be changed. >> >> So we not only get rid of the cond/might_resched() muck, we also get rid >> of the static_call/static_key machinery which drives PREEMPT_DYNAMIC. >> The only place which still needs that runtime tweaking is the scheduler >> itself. > > True. The dynamic preemption could just become a scheduler tunable. That's the point. >> But they support PREEMPT_COUNT, so we might get away with a reduced >> preemption point coverage: >> >> Ret2user Ret2kernel PreemptCnt=0 >> >> NEED_RESCHED Y N Y >> LAZY_RESCHED Y N N > > So from the discussion in the other thread, for the ARCH_NO_PREEMPT > configs that don't support preemption, we probably need a fourth > preemption model, say PREEMPT_UNSAFE. As discussed they wont really notice the latency issues because the museum pieces are not used for anything crucial and for UM that's the least of the correctness worries. So no, we don't need yet another knob. We keep them chucking along and if they really want they can adopt to the new world order. :) Thanks, tglx
On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote: > On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote: >> Anyway, I'm definitely not opposed. We'd get rid of a config option >> that is presumably not very widely used, and we'd simplify a lot of >> issues, and get rid of all these badly defined "cond_preempt()" >> things. > > Hmm. Didn't I promise a year ago that I won't do further large scale > cleanups and simplifications beyond printk. > > Maybe I get away this time with just suggesting it. :) Maybe not. As I'm inveterate curious, I sat down and figured out how that might look like. To some extent I really curse my curiosity as the amount of macro maze, config options and convoluted mess behind all these preempt mechanisms is beyond disgusting. Find below a PoC which implements that scheme. It's not even close to correct, but it builds, boots and survives lightweight testing. I did not even try to look into time-slice enforcement, but I really want to share this for illustration and for others to experiment. This keeps all the existing mechanisms in place and introduces a new config knob in the preemption model Kconfig switch: PREEMPT_AUTO If selected it builds a CONFIG_PREEMPT kernel, which disables the cond_resched() machinery and switches the fair scheduler class to use the NEED_PREEMPT_LAZY bit by default, i.e. it should be pretty close to the preempt NONE model except that cond_resched() is a NOOP and I did not validate the time-slice enforcement. The latter should be a no-brainer to figure out and fix if required. For run-time switching to the FULL preemption model, which always uses TIF_NEED_RESCHED, you need to enable CONFIG_SCHED_DEBUG and then you can enable "FULL" via: echo FORCE_NEED_RESCHED >/sys/kernel/debug/sched/features and switch back to some sort of "NONE" via echo NO_FORCE_NEED_RESCHED >/sys/kernel/debug/sched/features It seems to work as expected for a simple hackbench -l 10000 run: NO_FORCE_NEED_RESCHED FORCE_NEED_RESCHED schedule() [1] 3646163 2701641 preemption 12554 927856 total 3658717 3629497 [1] is voluntary schedule() _AND_ schedule() from return to user space. I did not get around to accounting them separately yet, but for a quick check this clearly shows that this "works" as advertised. Of course this needs way more analysis than this quick PoC+check, but you get the idea. Contrary to other hot off the press hacks, I'm pretty sure it won't destroy your hard-disk, but I won't recommend that you deploy it on your alarm-clock as it might make you miss the bus. 
If this concept holds, which I'm pretty convinced of by now, then this is an opportunity to trade ~3000 lines of unholy hacks for about 100-200 lines of understandable code :) Thanks, tglx --- arch/x86/Kconfig | 1 arch/x86/include/asm/thread_info.h | 2 + drivers/acpi/processor_idle.c | 2 - include/linux/entry-common.h | 2 - include/linux/entry-kvm.h | 2 - include/linux/sched.h | 18 +++++++++++----- include/linux/sched/idle.h | 8 +++---- include/linux/thread_info.h | 19 +++++++++++++++++ kernel/Kconfig.preempt | 12 +++++++++- kernel/entry/common.c | 2 - kernel/sched/core.c | 41 ++++++++++++++++++++++++------------- kernel/sched/fair.c | 10 ++++----- kernel/sched/features.h | 2 + kernel/sched/idle.c | 3 -- kernel/sched/sched.h | 1 kernel/trace/trace.c | 2 - 16 files changed, 91 insertions(+), 36 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -898,14 +898,14 @@ static inline void hrtick_rq_init(struct #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG) /* - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG, + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG, * this avoids any races wrt polling state changes and thereby avoids * spurious IPIs. */ -static inline bool set_nr_and_not_polling(struct task_struct *p) +static inline bool set_nr_and_not_polling(struct task_struct *p, int nr_bit) { struct thread_info *ti = task_thread_info(p); - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG); + return !(fetch_or(&ti->flags, 1 << nr_bit) & _TIF_POLLING_NRFLAG); } /* @@ -931,9 +931,9 @@ static bool set_nr_if_polling(struct tas } #else -static inline bool set_nr_and_not_polling(struct task_struct *p) +static inline bool set_nr_and_not_polling(struct task_struct *p, int nr_bit) { - set_tsk_need_resched(p); + set_tsk_thread_flag(p, nr_bit); return true; } @@ -1038,28 +1038,42 @@ void wake_up_q(struct wake_q_head *head) * might also involve a cross-CPU call to trigger the scheduler on * the target CPU. */ -void resched_curr(struct rq *rq) +static void __resched_curr(struct rq *rq, int nr_bit) { struct task_struct *curr = rq->curr; int cpu; lockdep_assert_rq_held(rq); - if (test_tsk_need_resched(curr)) + if (test_tsk_need_resched_type(curr, nr_bit)) return; cpu = cpu_of(rq); if (cpu == smp_processor_id()) { - set_tsk_need_resched(curr); - set_preempt_need_resched(); + set_tsk_thread_flag(curr, nr_bit); + if (nr_bit == TIF_NEED_RESCHED) + set_preempt_need_resched(); return; } - if (set_nr_and_not_polling(curr)) - smp_send_reschedule(cpu); - else + if (set_nr_and_not_polling(curr, nr_bit)) { + if (nr_bit == TIF_NEED_RESCHED) + smp_send_reschedule(cpu); + } else { trace_sched_wake_idle_without_ipi(cpu); + } +} + +void resched_curr(struct rq *rq) +{ + __resched_curr(rq, TIF_NEED_RESCHED); +} + +void resched_curr_lazy(struct rq *rq) +{ + __resched_curr(rq, sched_feat(FORCE_NEED_RESCHED) ? 
+ TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY); } void resched_cpu(int cpu) @@ -1132,7 +1146,7 @@ static void wake_up_idle_cpu(int cpu) if (cpu == smp_processor_id()) return; - if (set_nr_and_not_polling(rq->idle)) + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED)) smp_send_reschedule(cpu); else trace_sched_wake_idle_without_ipi(cpu); @@ -8872,7 +8886,6 @@ static void __init preempt_dynamic_init( WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \ return preempt_dynamic_mode == preempt_dynamic_##mode; \ } \ - EXPORT_SYMBOL_GPL(preempt_model_##mode) PREEMPT_MODEL_ACCESSOR(none); PREEMPT_MODEL_ACCESSOR(voluntary); --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -59,6 +59,11 @@ enum syscall_work_bit { #include <asm/thread_info.h> +#ifndef CONFIG_PREEMPT_AUTO +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED +#endif + #ifdef __KERNEL__ #ifndef arch_set_restart_data @@ -185,6 +190,13 @@ static __always_inline bool tif_need_res (unsigned long *)(¤t_thread_info()->flags)); } +static __always_inline bool tif_need_resched_lazy(void) +{ + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && + arch_test_bit(TIF_NEED_RESCHED_LAZY, + (unsigned long *)(¤t_thread_info()->flags)); +} + #else static __always_inline bool tif_need_resched(void) @@ -193,6 +205,13 @@ static __always_inline bool tif_need_res (unsigned long *)(¤t_thread_info()->flags)); } +static __always_inline bool tif_need_resched_lazy(void) +{ + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && + test_bit(TIF_NEED_RESCHED_LAZY, + (unsigned long *)(¤t_thread_info()->flags)); +} + #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */ #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -11,6 +11,9 @@ config PREEMPT_BUILD select PREEMPTION select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK +config HAVE_PREEMPT_AUTO + bool + choice prompt "Preemption Model" default PREEMPT_NONE @@ -67,6 +70,13 @@ config PREEMPT embedded system with latency requirements in the milliseconds range. 
+config PREEMPT_AUTO + bool "Automagic preemption mode with runtime tweaking support" + depends on HAVE_PREEMPT_AUTO + select PREEMPT_BUILD + help + Add some sensible blurb here + config PREEMPT_RT bool "Fully Preemptible Kernel (Real-Time)" depends on EXPERT && ARCH_SUPPORTS_RT @@ -95,7 +105,7 @@ config PREEMPTION config PREEMPT_DYNAMIC bool "Preemption behaviour defined on boot" - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY select PREEMPT_BUILD default y if HAVE_PREEMPT_DYNAMIC_CALL --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -60,7 +60,7 @@ #define EXIT_TO_USER_MODE_WORK \ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \ - ARCH_EXIT_TO_USER_MODE_WORK) + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK) /** * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs --- a/include/linux/entry-kvm.h +++ b/include/linux/entry-kvm.h @@ -18,7 +18,7 @@ #define XFER_TO_GUEST_MODE_WORK \ (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \ - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK) + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK) struct kvm_vcpu; --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l local_irq_enable_exit_to_user(ti_work); - if (ti_work & _TIF_NEED_RESCHED) + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) schedule(); if (ti_work & _TIF_UPROBE) --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) SCHED_FEAT(LATENCY_WARN, false) SCHED_FEAT(HZ_BW, true) + +SCHED_FEAT(FORCE_NEED_RESCHED, false) --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void); extern void reweight_task(struct task_struct *p, int prio); extern void resched_curr(struct rq *rq); +extern void resched_curr_lazy(struct rq *rq); extern void resched_cpu(int cpu); extern struct rt_bandwidth def_rt_bandwidth; --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla update_ti_thread_flag(task_thread_info(tsk), flag, value); } -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) { return test_and_set_ti_thread_flag(task_thread_info(tsk), flag); } -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) { return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag); } -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag) +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag) { return test_ti_thread_flag(task_thread_info(tsk), flag); } @@ -2069,13 +2069,21 @@ static inline void set_tsk_need_resched( static inline void clear_tsk_need_resched(struct task_struct *tsk) { clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED); + if (IS_ENABLED(CONFIG_PREEMPT_AUTO)) + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY); } -static inline int test_tsk_need_resched(struct task_struct *tsk) +static inline bool test_tsk_need_resched(struct task_struct *tsk) { return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); } +static inline 
bool test_tsk_need_resched_type(struct task_struct *tsk, + int nr_bit) +{ + return unlikely(test_tsk_thread_flag(tsk, 1 << nr_bit)); +} + /* * cond_resched() and cond_resched_lock(): latency reduction via * explicit rescheduling in places that are safe. The return @@ -2252,7 +2260,7 @@ static inline int rwlock_needbreak(rwloc static __always_inline bool need_resched(void) { - return unlikely(tif_need_resched()); + return unlikely(tif_need_resched_lazy() || tif_need_resched()); } /* --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -985,7 +985,7 @@ static void update_deadline(struct cfs_r * The task has consumed its request, reschedule. */ if (cfs_rq->nr_running > 1) { - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); clear_buddies(cfs_rq, se); } } @@ -5267,7 +5267,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc * validating it and just reschedule. */ if (queued) { - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); return; } /* @@ -5413,7 +5413,7 @@ static void __account_cfs_rq_runtime(str * hierarchy can be throttled */ if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); } static __always_inline @@ -5673,7 +5673,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf /* Determine whether we need to wake up potentially idle CPU: */ if (rq->curr == rq->idle && rq->cfs.nr_running) - resched_curr(rq); + resched_curr_lazy(rq); } #ifdef CONFIG_SMP @@ -8073,7 +8073,7 @@ static void check_preempt_wakeup(struct return; preempt: - resched_curr(rq); + resched_curr_lazy(rq); } #ifdef CONFIG_SMP --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -108,7 +108,7 @@ static const struct dmi_system_id proces */ static void __cpuidle acpi_safe_halt(void) { - if (!tif_need_resched()) { + if (!need_resched()) { raw_safe_halt(); raw_local_irq_disable(); } --- a/include/linux/sched/idle.h +++ b/include/linux/sched/idle.h @@ -63,7 +63,7 @@ static __always_inline bool __must_check */ smp_mb__after_atomic(); - return unlikely(tif_need_resched()); + return unlikely(need_resched()); } static __always_inline bool __must_check current_clr_polling_and_test(void) @@ -76,7 +76,7 @@ static __always_inline bool __must_check */ smp_mb__after_atomic(); - return unlikely(tif_need_resched()); + return unlikely(need_resched()); } #else @@ -85,11 +85,11 @@ static inline void __current_clr_polling static inline bool __must_check current_set_polling_and_test(void) { - return unlikely(tif_need_resched()); + return unlikely(need_resched()); } static inline bool __must_check current_clr_polling_and_test(void) { - return unlikely(tif_need_resched()); + return unlikely(need_resched()); } #endif --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p ct_cpuidle_enter(); raw_local_irq_enable(); - while (!tif_need_resched() && - (cpu_idle_force_poll || tick_check_broadcast_expired())) + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired())) cpu_relax(); raw_local_irq_disable(); --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2720,7 +2720,7 @@ unsigned int tracing_gen_ctx_irq_test(un if (softirq_count() >> (SOFTIRQ_SHIFT + 1)) trace_flags |= TRACE_FLAG_BH_OFF; - if (tif_need_resched()) + if (need_resched()) trace_flags |= TRACE_FLAG_NEED_RESCHED; if (test_preempt_need_resched()) trace_flags |= TRACE_FLAG_PREEMPT_RESCHED; --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -271,6 +271,7 @@ config X86 select HAVE_STATIC_CALL select 
HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL select HAVE_PREEMPT_DYNAMIC_CALL + select HAVE_PREEMPT_AUTO select HAVE_RSEQ select HAVE_RUST if X86_64 select HAVE_SYSCALL_TRACEPOINTS --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -83,6 +83,7 @@ struct thread_info { #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ #define TIF_SSBD 5 /* Speculative store bypass disable */ +#define TIF_NEED_RESCHED_LAZY 6 /* Lazy rescheduling */ #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ @@ -106,6 +107,7 @@ struct thread_info { #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) #define _TIF_SSBD (1 << TIF_SSBD) +#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY) #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) #define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH) #define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
On Wed, Sep 20 2023 at 22:51, Thomas Gleixner wrote: > On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote: > > The preempt count folding is an optimization which simplifies the > preempt_enable logic: > > if (--preempt_count && need_resched()) > schedule() > to > if (--preempt_count) > schedule() That should be (!(--preempt_count... in both cases of course :)
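A small stand-alone demonstration of the folding trick being corrected here (plain user-space C, purely illustrative; the kernel keeps the folded bit inside the per-CPU preempt count and the details live in the arch preempt.h):

#include <stdio.h>

/*
 * The top bit is kept *set* while no reschedule is needed (inverted sense),
 * so a single "did the count reach zero?" test covers both "preemption
 * depth dropped to zero" and "reschedule pending".
 */
#define NEED_RESCHED_FOLD 0x80000000u

static unsigned int preempt_count = NEED_RESCHED_FOLD;	/* depth 0, nothing pending */

static void set_need_resched(void)   { preempt_count &= ~NEED_RESCHED_FOLD; }
static void clear_need_resched(void) { preempt_count |=  NEED_RESCHED_FOLD; }
static void preempt_disable(void)    { preempt_count++; }

static void preempt_enable(void)
{
	/* One conditional: zero only if depth is 0 AND a resched is pending. */
	if (!--preempt_count)
		printf("-> schedule()\n");
}

int main(void)
{
	preempt_disable();
	set_need_resched();	/* e.g. the tick folded the bit in */
	preempt_enable();	/* depth 0 and resched pending: prints "-> schedule()" */

	clear_need_resched();	/* what schedule() itself would do */
	preempt_disable();
	preempt_enable();	/* nothing pending: no call */
	return 0;
}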
Thomas Gleixner <tglx@linutronix.de> writes: > On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote: >> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote: >>> Anyway, I'm definitely not opposed. We'd get rid of a config option >>> that is presumably not very widely used, and we'd simplify a lot of >>> issues, and get rid of all these badly defined "cond_preempt()" >>> things. >> >> Hmm. Didn't I promise a year ago that I won't do further large scale >> cleanups and simplifications beyond printk. >> >> Maybe I get away this time with just suggesting it. :) > > Maybe not. As I'm inveterate curious, I sat down and figured out how > that might look like. > > To some extent I really curse my curiosity as the amount of macro maze, > config options and convoluted mess behind all these preempt mechanisms > is beyond disgusting. > > Find below a PoC which implements that scheme. It's not even close to > correct, but it builds, boots and survives lightweight testing. Whew, that was electric. I had barely managed to sort through some of the config maze. From a quick look this is pretty much how you described it. > I did not even try to look into time-slice enforcement, but I really want > to share this for illustration and for others to experiment. > > This keeps all the existing mechanisms in place and introduces a new > config knob in the preemption model Kconfig switch: PREEMPT_AUTO > > If selected it builds a CONFIG_PREEMPT kernel, which disables the > cond_resched() machinery and switches the fair scheduler class to use > the NEED_PREEMPT_LAZY bit by default, i.e. it should be pretty close to > the preempt NONE model except that cond_resched() is a NOOP and I did > not validate the time-slice enforcement. The latter should be a > no-brainer to figure out and fix if required. Yeah, let me try this out. Thanks Ankur
Thomas Gleixner <tglx@linutronix.de> writes: > On Wed, Sep 20 2023 at 07:22, Ankur Arora wrote: >> Thomas Gleixner <tglx@linutronix.de> writes: >> >>> So the decision matrix would be: >>> >>> Ret2user Ret2kernel PreemptCnt=0 >>> >>> NEED_RESCHED Y Y Y >>> LAZY_RESCHED Y N N >>> >>> That is completely independent of the preemption model and the >>> differentiation of the preemption models happens solely at the scheduler >>> level: >> >> This is relatively minor, but do we need two flags? Seems to me we >> can get to the same decision matrix by letting the scheduler fold >> into the preempt-count based on current preemption model. > > You still need the TIF flags because there is no way to do remote > modification of preempt count. Yes, agreed. In my version, I was envisaging that the remote cpu always only sets up TIF_NEED_RESCHED and then we decide which one we want at the preemption point. Anyway, I see what you meant in your PoC. >>> But they support PREEMPT_COUNT, so we might get away with a reduced >>> preemption point coverage: >>> >>> Ret2user Ret2kernel PreemptCnt=0 >>> >>> NEED_RESCHED Y N Y >>> LAZY_RESCHED Y N N >> >> So from the discussion in the other thread, for the ARCH_NO_PREEMPT >> configs that don't support preemption, we probably need a fourth >> preemption model, say PREEMPT_UNSAFE. > > As discussed they wont really notice the latency issues because the > museum pieces are not used for anything crucial and for UM that's the > least of the correctness worries. > > So no, we don't need yet another knob. We keep them chucking along and > if they really want they can adopt to the new world order. :) Will they chuckle along, or die trying ;)? I grepped for "preempt_enable|preempt_disable" for all the archs and hexagon and m68k don't seem to do any explicit accounting at all. (Though, neither do nios2 and openrisc, and both csky and microblaze only do it in the tlbflush path.) arch/hexagon 0 arch/m68k 0 arch/nios2 0 arch/openrisc 0 arch/csky 3 arch/microblaze 3 arch/um 4 arch/riscv 8 arch/arc 14 arch/parisc 15 arch/arm 16 arch/sparc 16 arch/xtensa 19 arch/sh 21 arch/alpha 23 arch/ia64 27 arch/loongarch 53 arch/arm64 54 arch/s390 91 arch/mips 115 arch/x86 146 arch/powerpc 201 My concern is given that we preempt on timeslice expiration for all three preemption models, we could end up preempting at an unsafe location. Still, not the most pressing of problems. Thanks -- ankur
On Wed, Sep 20 2023 at 17:57, Ankur Arora wrote: > Thomas Gleixner <tglx@linutronix.de> writes: >> Find below a PoC which implements that scheme. It's not even close to >> correct, but it builds, boots and survives lightweight testing. > > Whew, that was electric. I had barely managed to sort through some of > the config maze. > From a quick look this is pretty much how you described it. Unsurpringly I spent at least 10x the time to describe it than to hack it up. IOW, I had done the analysis before I offered the idea and before I changed a single line of code. The tools I used for that are git-grep, tags, paper, pencil, accrued knowledge and patience, i.e. nothing even close to rocket science. Converting the analysis into code was mostly a matter of brain dumping the analysis and adherence to accrued methodology. What's electric about that? I might be missing some meaning of 'electric' which is not covered by my mostly Webster restricted old-school understanding of the english language :) >> I did not even try to look into time-slice enforcement, but I really want >> to share this for illustration and for others to experiment. >> >> This keeps all the existing mechanisms in place and introduces a new >> config knob in the preemption model Kconfig switch: PREEMPT_AUTO >> >> If selected it builds a CONFIG_PREEMPT kernel, which disables the >> cond_resched() machinery and switches the fair scheduler class to use >> the NEED_PREEMPT_LAZY bit by default, i.e. it should be pretty close to >> the preempt NONE model except that cond_resched() is a NOOP and I did >> not validate the time-slice enforcement. The latter should be a >> no-brainer to figure out and fix if required. > > Yeah, let me try this out. That's what I hoped for :) Thanks, tglx
On Wed, Sep 20 2023 at 17:58, Ankur Arora wrote: > Thomas Gleixner <tglx@linutronix.de> writes: >> So no, we don't need yet another knob. We keep them chucking along and >> if they really want they can adopt to the new world order. :) > > Will they chuckle along, or die trying ;)? Either way is fine :) > I grepped for "preempt_enable|preempt_disable" for all the archs and > hexagon and m68k don't seem to do any explicit accounting at all. > (Though, neither do nios2 and openrisc, and both csky and microblaze > only do it in the tlbflush path.) > > arch/hexagon 0 > arch/m68k 0 ... > arch/s390 91 > arch/mips 115 > arch/x86 146 > arch/powerpc 201 > > My concern is given that we preempt on timeslice expiration for all > three preemption models, we could end up preempting at an unsafe > location. As I said in my reply to Linus, that count is not really conclusive. arch/m68k has a count of 0 and supports PREEMPT for the COLDFIRE sub-architecture and I know for sure that at some point in the past PREEMPT_RT was supported on COLDFIRE with minimal changes to the architecture code. That said, I'm pretty sure that quite some of these preempt_disable/enable pairs in arch/* are subject to voodoo programming, but that's a different problem to analyze. > Still, not the most pressing of problems. Exactly :) Thanks, tglx
Thomas Gleixner <tglx@linutronix.de> writes: > On Wed, Sep 20 2023 at 17:57, Ankur Arora wrote: >> Thomas Gleixner <tglx@linutronix.de> writes: >>> Find below a PoC which implements that scheme. It's not even close to >>> correct, but it builds, boots and survives lightweight testing. >> >> Whew, that was electric. I had barely managed to sort through some of >> the config maze. >> From a quick look this is pretty much how you described it. > > Unsurpringly I spent at least 10x the time to describe it than to hack > it up. > > IOW, I had done the analysis before I offered the idea and before I > changed a single line of code. The tools I used for that are git-grep, > tags, paper, pencil, accrued knowledge and patience, i.e. nothing even > close to rocket science. > > Converting the analysis into code was mostly a matter of brain dumping > the analysis and adherence to accrued methodology. > > What's electric about that? Hmm, so I /think/ I was going for something like electric current taking the optimal path, with a side meaning of electrifying. Though, I guess electron flow is a quantum mechanical, so that would really try all possible paths, which means the analogy doesn't quite fit. Let me substitute greased lightning for electric :D. > I might be missing some meaning of 'electric' which is not covered by my > mostly Webster restricted old-school understanding of the english language :) > >>> I did not even try to look into time-slice enforcement, but I really want >>> to share this for illustration and for others to experiment. >>> >>> This keeps all the existing mechanisms in place and introduces a new >>> config knob in the preemption model Kconfig switch: PREEMPT_AUTO >>> >>> If selected it builds a CONFIG_PREEMPT kernel, which disables the >>> cond_resched() machinery and switches the fair scheduler class to use >>> the NEED_PREEMPT_LAZY bit by default, i.e. it should be pretty close to >>> the preempt NONE model except that cond_resched() is a NOOP and I did >>> not validate the time-slice enforcement. The latter should be a >>> no-brainer to figure out and fix if required. >> >> Yeah, let me try this out. > > That's what I hoped for :) :). Quick update: it hasn't eaten my hard disk yet. Both the "none" and "full" variants are stable with kbuild. Next: time-slice validation, any fixes and then maybe alarm-clock deployments. And, then if you are okay with it, I'll cleanup/structure your patches together with all the other preemption cleanups we discussed into an RFC series. (One thing I can't wait to measure is how many cond_resched() calls and associated dynamic instructions do we not execute now. Not because I think it really matters for performance -- though it might on low IPC archs, but just it'll be a relief seeing the cond_resched() gone for real.) -- ankur
On Tue, Sep 19, 2023, at 10:16, Peter Zijlstra wrote: > On Tue, Sep 19, 2023 at 03:48:09PM +0200, John Paul Adrian Glaubitz wrote: >> On Tue, 2023-09-19 at 15:42 +0200, Peter Zijlstra wrote: >> > > The agreement to kill off ia64 wasn't an invitation to kill off other stuff >> > > that people are still working on! Can we please not do this? >> > >> > If you're working on one of them, then surely it's a simple matter of >> > working on adding CONFIG_PREEMPT support :-) >> >> As Geert poined out, I'm not seeing anything particular problematic with the >> architectures lacking CONFIG_PREEMPT at the moment. This seems to be more >> something about organizing KConfig files. > > The plan in the parent thread is to remove PREEMPT_NONE and > PREEMPT_VOLUNTARY and only keep PREEMPT_FULL. ... > > PREEMPT isn't something new. Also, I don't think the arch part for > actually supporting it is particularly hard, mostly it is sticking the > preempt_schedule_irq() call in return from interrupt code path. > > If you convert the arch to generic-entry (a much larger undertaking) > then you get this for free. I checked the default configurations for both in-kernel targets and general-purpose distros and was surprised to learn that very few actually turn on full preemption by default: - All distros I looked at (rhel, debian, opensuse) use PREEMPT_VOLUNTARY by default, though they usually also set PREEMPT_DYNAMIC to let users override it at boot time. - The majority (220) of all defconfig files in the kernel don't select any preemption options, and just get PREEMPT_NONE automatically. This includes the generic configs for armv7, s390 and mips. - A small number (24) set PREEMPT_VOLUNTARY, but this notably includes x86 and ppc64. x86 is the only one of those that sets PREEMPT_DYNAMIC - CONFIG_PREEMPT=y (full preemption) is used on 89 defconfigs, including arm64 and a lot of the older arm32, arc and mips platforms. If we want to have a chance of removing both PREEMPT_NONE and PREEMPT_VOLUNTARY, I think we should start with changing the defaults first, so defconfigs that don't specify anything else get PREEMPT=y, and distros that use PREEMPT_VOLUNTARY use it in the absence of a command line argument. If that doesn't cause too many regressions, the next step might be to hide the choice under CONFIG_EXPERT until m68k and alpha no longer require PREEMPT_NONE. Arnd
On Wed, 20 Sep 2023 21:16:17 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote: > > On Wed, Sep 20 2023 at 17:57, Ankur Arora wrote: > >> Thomas Gleixner <tglx@linutronix.de> writes: > >>> Find below a PoC which implements that scheme. It's not even close to > >>> correct, but it builds, boots and survives lightweight testing. > >> > >> Whew, that was electric. I had barely managed to sort through some of > >> the config maze. > >> From a quick look this is pretty much how you described it. > > > > What's electric about that? > > Hmm, so I /think/ I was going for something like electric current taking > the optimal path, with a side meaning of electrifying. > > Though, I guess electron flow is a quantum mechanical, so that would > really try all possible paths, which means the analogy doesn't quite > fit. > > Let me substitute greased lightning for electric :D. "It's electrifying!" ;-) https://www.youtube.com/watch?v=7oKPYe53h78 -- Steve
Ok, I like this. That said, this part of it: On Wed, 20 Sept 2023 at 16:58, Thomas Gleixner <tglx@linutronix.de> wrote: > > -void resched_curr(struct rq *rq) > +static void __resched_curr(struct rq *rq, int nr_bit) > [...] > - set_tsk_need_resched(curr); > - set_preempt_need_resched(); > + set_tsk_thread_flag(curr, nr_bit); > + if (nr_bit == TIF_NEED_RESCHED) > + set_preempt_need_resched(); feels really hacky. I think that instead of passing a random TIF bit around, it should just pass a "lazy or not" value around. Then you make the TIF bit be some easily computable thing (eg something like #define TIF_RESCHED(lazy) (TIF_NEED_RESCHED + (lazy)) or whatever), and write the above conditional as if (!lazy) set_preempt_need_resched(); so that it all *does* the same thing, but the code makes it clear about what the logic is. Because honestly, without having been part of this thread, I would look at that if (nr_bit == TIF_NEED_RESCHED) set_preempt_need_resched(); and I'd be completely lost. It doesn't make conceptual sense, I feel. So I'd really like the source code to be more directly expressing the *intent* of the code, not be so centered around the implementation detail. Put another way: I think we can make the compiler turn the intent into the implementation, and I'd rather *not* have us humans have to infer the intent from the implementation. That said - I think as a proof of concept and "look, with this we get the expected scheduling event counts", that patch is perfect. I think you more than proved the concept. Linus
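A minimal sketch of the shape Linus is asking for (illustrative only; it assumes the lazy bit sits directly above TIF_NEED_RESCHED, whereas the follow-up patch below passes an arch-provided offset instead, and the cross-CPU/IPI handling of the real resched_curr() is left out):

/* Express the intent ("lazy or not") and derive the TIF bit from it. */
#define TIF_RESCHED(lazy)	(TIF_NEED_RESCHED + (lazy))

/* Local-CPU case only; the real code also kicks remote CPUs via IPI. */
static void __resched_curr(struct rq *rq, bool lazy)
{
	struct task_struct *curr = rq->curr;

	if (test_tsk_thread_flag(curr, TIF_RESCHED(lazy)))
		return;

	set_tsk_thread_flag(curr, TIF_RESCHED(lazy));

	/* Only a real (non-lazy) request is folded into preempt_count. */
	if (!lazy)
		set_preempt_need_resched();
}

static void resched_curr(struct rq *rq)      { __resched_curr(rq, false); }
static void resched_curr_lazy(struct rq *rq) { __resched_curr(rq, true); }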
Linus! On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote: > Ok, I like this. Thanks! > That said, this part of it: > On Wed, 20 Sept 2023 at 16:58, Thomas Gleixner <tglx@linutronix.de> wrote: > Because honestly, without having been part of this thread, I would look at that > > if (nr_bit == TIF_NEED_RESCHED) > set_preempt_need_resched(); > > and I'd be completely lost. It doesn't make conceptual sense, I feel. > > So I'd really like the source code to be more directly expressing the > *intent* of the code, not be so centered around the implementation > detail. > > Put another way: I think we can make the compiler turn the intent into > the implementation, and I'd rather *not* have us humans have to infer > the intent from the implementation. No argument about that. I didn't like it either, but at 10PM ... > That said - I think as a proof of concept and "look, with this we get > the expected scheduling event counts", that patch is perfect. I think > you more than proved the concept. There is certainly quite some analyis work to do to make this a one to one replacement. With a handful of benchmarks the PoC (tweaked with some obvious fixes) is pretty much on par with the current mainline variants (NONE/FULL), but the memtier benchmark makes a massive dent. It sports a whopping 10% regression with the LAZY mode versus the mainline NONE model. Non-LAZY and FULL behave unsurprisingly in the same way. That benchmark is really sensitive to the preemption model. With current mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20% performance drop versus preempt=NONE. I have no clue what's going on there yet, but that shows that there is obviously quite some work ahead to get this sorted. Though I'm pretty convinced by now that this is the right direction and well worth the effort which needs to be put into that. Thanks, tglx
On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote: > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote: >> That said - I think as a proof of concept and "look, with this we get >> the expected scheduling event counts", that patch is perfect. I think >> you more than proved the concept. > > There is certainly quite some analyis work to do to make this a one to > one replacement. > > With a handful of benchmarks the PoC (tweaked with some obvious fixes) > is pretty much on par with the current mainline variants (NONE/FULL), > but the memtier benchmark makes a massive dent. > > It sports a whopping 10% regression with the LAZY mode versus the mainline > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way. > > That benchmark is really sensitive to the preemption model. With current > mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20% > performance drop versus preempt=NONE. That 20% was a tired pilot error. The real number is in the 5% ballpark. > I have no clue what's going on there yet, but that shows that there is > obviously quite some work ahead to get this sorted. It took some head scratching to figure that out. The initial fix broke the handling of the hog issue, i.e. the problem that Ankur tried to solve, but I hacked up a "solution" for that too. With that the memtier benchmark is roughly back to the mainline numbers, but my throughput benchmark know how is pretty close to zero, so that should be looked at by people who actually understand these things. Likewise the hog prevention is just at the PoC level and clearly beyond my knowledge of scheduler details: It unconditionally forces a reschedule when the looping task is not responding to a lazy reschedule request before the next tick. IOW it forces a reschedule on the second tick, which is obviously different from the cond_resched()/might_sleep() behaviour. The changes vs. the original PoC aside of the bug and thinko fixes: 1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the lazy preempt bit as the trace_entry::flags field is full already. That obviously breaks the tracer ABI, but if we go there then this needs to be fixed. Steven? 2) debugfs file to validate that loops can be force preempted w/o cond_resched() The usage is: # taskset -c 1 bash # echo 1 > /sys/kernel/debug/sched/hog & # echo 1 > /sys/kernel/debug/sched/hog & # echo 1 > /sys/kernel/debug/sched/hog & top shows ~33% CPU for each of the hogs and tracing confirms that the crude hack in the scheduler tick works: bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr bash-4561 [001] dlh2. 2253.389199: resched_curr <-__update_curr The 'l' instead of the usual 'N' reflects that the lazy resched bit is set. That makes __update_curr() invoke resched_curr() instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED and folds it into preempt_count so that preemption happens at the next possible point, i.e. either in return from interrupt or at the next preempt_enable(). 
That's as much as I wanted to demonstrate and I'm not going to spend more cycles on it as I have already too many other things on flight and the resulting scheduler woes are clearly outside of my expertice. Though definitely I'm putting a permanent NAK in place for any attempts to duct tape the preempt=NONE model any further by sprinkling more cond*() and whatever warts around. Thanks, tglx --- arch/x86/Kconfig | 1 arch/x86/include/asm/thread_info.h | 6 ++-- drivers/acpi/processor_idle.c | 2 - include/linux/entry-common.h | 2 - include/linux/entry-kvm.h | 2 - include/linux/sched.h | 12 +++++--- include/linux/sched/idle.h | 8 ++--- include/linux/thread_info.h | 24 +++++++++++++++++ include/linux/trace_events.h | 8 ++--- kernel/Kconfig.preempt | 17 +++++++++++- kernel/entry/common.c | 4 +- kernel/entry/kvm.c | 2 - kernel/sched/core.c | 51 +++++++++++++++++++++++++------------ kernel/sched/debug.c | 19 +++++++++++++ kernel/sched/fair.c | 46 ++++++++++++++++++++++----------- kernel/sched/features.h | 2 + kernel/sched/idle.c | 3 -- kernel/sched/sched.h | 1 kernel/trace/trace.c | 2 + kernel/trace/trace_output.c | 16 ++++++++++- 20 files changed, 171 insertions(+), 57 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG) /* - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG, + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG, * this avoids any races wrt polling state changes and thereby avoids * spurious IPIs. */ -static inline bool set_nr_and_not_polling(struct task_struct *p) +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) { struct thread_info *ti = task_thread_info(p); - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG); + + return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG); } /* @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas for (;;) { if (!(val & _TIF_POLLING_NRFLAG)) return false; - if (val & _TIF_NEED_RESCHED) + if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) return true; if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED)) break; @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas } #else -static inline bool set_nr_and_not_polling(struct task_struct *p) +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) { - set_tsk_need_resched(p); + set_tsk_thread_flag(p, tif_bit); return true; } @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head) * might also involve a cross-CPU call to trigger the scheduler on * the target CPU. 
*/ -void resched_curr(struct rq *rq) +static void __resched_curr(struct rq *rq, int lazy) { + int cpu, tif_bit = TIF_NEED_RESCHED + lazy; struct task_struct *curr = rq->curr; - int cpu; lockdep_assert_rq_held(rq); - if (test_tsk_need_resched(curr)) + if (unlikely(test_tsk_thread_flag(curr, tif_bit))) return; cpu = cpu_of(rq); if (cpu == smp_processor_id()) { - set_tsk_need_resched(curr); - set_preempt_need_resched(); + set_tsk_thread_flag(curr, tif_bit); + if (!lazy) + set_preempt_need_resched(); return; } - if (set_nr_and_not_polling(curr)) - smp_send_reschedule(cpu); - else + if (set_nr_and_not_polling(curr, tif_bit)) { + if (!lazy) + smp_send_reschedule(cpu); + } else { trace_sched_wake_idle_without_ipi(cpu); + } +} + +void resched_curr(struct rq *rq) +{ + __resched_curr(rq, 0); +} + +void resched_curr_lazy(struct rq *rq) +{ + int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ? + TIF_NEED_RESCHED_LAZY_OFFSET : 0; + + if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED))) + return; + + __resched_curr(rq, lazy); } void resched_cpu(int cpu) @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu) if (cpu == smp_processor_id()) return; - if (set_nr_and_not_polling(rq->idle)) + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED)) smp_send_reschedule(cpu); else trace_sched_wake_idle_without_ipi(cpu); @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init( WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \ return preempt_dynamic_mode == preempt_dynamic_##mode; \ } \ - EXPORT_SYMBOL_GPL(preempt_model_##mode) PREEMPT_MODEL_ACCESSOR(none); PREEMPT_MODEL_ACCESSOR(voluntary); --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -59,6 +59,16 @@ enum syscall_work_bit { #include <asm/thread_info.h> +#ifdef CONFIG_PREEMPT_AUTO +# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY +# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY +# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED) +#else +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED +# define TIF_NEED_RESCHED_LAZY_OFFSET 0 +#endif + #ifdef __KERNEL__ #ifndef arch_set_restart_data @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res (unsigned long *)(¤t_thread_info()->flags)); } +static __always_inline bool tif_need_resched_lazy(void) +{ + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && + arch_test_bit(TIF_NEED_RESCHED_LAZY, + (unsigned long *)(¤t_thread_info()->flags)); +} + #else static __always_inline bool tif_need_resched(void) @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res (unsigned long *)(¤t_thread_info()->flags)); } +static __always_inline bool tif_need_resched_lazy(void) +{ + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && + test_bit(TIF_NEED_RESCHED_LAZY, + (unsigned long *)(¤t_thread_info()->flags)); +} + #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */ #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -11,6 +11,13 @@ config PREEMPT_BUILD select PREEMPTION select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK +config PREEMPT_BUILD_AUTO + bool + select PREEMPT_BUILD + +config HAVE_PREEMPT_AUTO + bool + choice prompt "Preemption Model" default PREEMPT_NONE @@ -67,9 +74,17 @@ config PREEMPT embedded system with latency requirements in the milliseconds range. 
+config PREEMPT_AUTO + bool "Automagic preemption mode with runtime tweaking support" + depends on HAVE_PREEMPT_AUTO + select PREEMPT_BUILD_AUTO + help + Add some sensible blurb here + config PREEMPT_RT bool "Fully Preemptible Kernel (Real-Time)" depends on EXPERT && ARCH_SUPPORTS_RT + select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO select PREEMPTION help This option turns the kernel into a real-time kernel by replacing @@ -95,7 +110,7 @@ config PREEMPTION config PREEMPT_DYNAMIC bool "Preemption behaviour defined on boot" - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY select PREEMPT_BUILD default y if HAVE_PREEMPT_DYNAMIC_CALL --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -60,7 +60,7 @@ #define EXIT_TO_USER_MODE_WORK \ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \ - ARCH_EXIT_TO_USER_MODE_WORK) + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK) /** * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs --- a/include/linux/entry-kvm.h +++ b/include/linux/entry-kvm.h @@ -18,7 +18,7 @@ #define XFER_TO_GUEST_MODE_WORK \ (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \ - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK) + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK) struct kvm_vcpu; --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l local_irq_enable_exit_to_user(ti_work); - if (ti_work & _TIF_NEED_RESCHED) + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) schedule(); if (ti_work & _TIF_UPROBE) @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void rcu_irq_exit_check_preempt(); if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) WARN_ON_ONCE(!on_thread_stack()); - if (need_resched()) + if (test_tsk_need_resched(current)) preempt_schedule_irq(); } } --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) SCHED_FEAT(LATENCY_WARN, false) SCHED_FEAT(HZ_BW, true) + +SCHED_FEAT(FORCE_NEED_RESCHED, false) --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void); extern void reweight_task(struct task_struct *p, int prio); extern void resched_curr(struct rq *rq); +extern void resched_curr_lazy(struct rq *rq); extern void resched_cpu(int cpu); extern struct rt_bandwidth def_rt_bandwidth; --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla update_ti_thread_flag(task_thread_info(tsk), flag, value); } -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) { return test_and_set_ti_thread_flag(task_thread_info(tsk), flag); } -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) { return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag); } -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag) +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag) { return test_ti_thread_flag(task_thread_info(tsk), flag); } @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched( static inline void 
clear_tsk_need_resched(struct task_struct *tsk) { clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED); + if (IS_ENABLED(CONFIG_PREEMPT_AUTO)) + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY); } -static inline int test_tsk_need_resched(struct task_struct *tsk) +static inline bool test_tsk_need_resched(struct task_struct *tsk) { return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); } @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc static __always_inline bool need_resched(void) { - return unlikely(tif_need_resched()); + return unlikely(tif_need_resched_lazy() || tif_need_resched()); } /* --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i * this is probably good enough. */ -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick) { + struct rq *rq = rq_of(cfs_rq); + if ((s64)(se->vruntime - se->deadline) < 0) return; @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r /* * The task has consumed its request, reschedule. */ - if (cfs_rq->nr_running > 1) { - resched_curr(rq_of(cfs_rq)); - clear_buddies(cfs_rq, se); + if (cfs_rq->nr_running < 2) + return; + + if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) { + resched_curr(rq); + } else { + /* Did the task ignore the lazy reschedule request? */ + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) + resched_curr(rq); + else + resched_curr_lazy(rq); } + clear_buddies(cfs_rq, se); } #include "pelt.h" @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf /* * Update the current task's runtime statistics. */ -static void update_curr(struct cfs_rq *cfs_rq) +static void __update_curr(struct cfs_rq *cfs_rq, bool tick) { struct sched_entity *curr = cfs_rq->curr; u64 now = rq_clock_task(rq_of(cfs_rq)); @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c schedstat_add(cfs_rq->exec_clock, delta_exec); curr->vruntime += calc_delta_fair(delta_exec, curr); - update_deadline(cfs_rq, curr); + update_deadline(cfs_rq, curr, tick); update_min_vruntime(cfs_rq); if (entity_is_task(curr)) { @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c account_cfs_rq_runtime(cfs_rq, delta_exec); } +static inline void update_curr(struct cfs_rq *cfs_rq) +{ + __update_curr(cfs_rq, false); +} + static void update_curr_fair(struct rq *rq) { update_curr(cfs_rq_of(&rq->curr->se)); @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc /* * Update run-time statistics of the 'current'. */ - update_curr(cfs_rq); + __update_curr(cfs_rq, true); /* * Ensure that runnable average is periodically updated. @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc * validating it and just reschedule. 
*/ if (queued) { - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); return; } /* @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str * hierarchy can be throttled */ if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) - resched_curr(rq_of(cfs_rq)); + resched_curr_lazy(rq_of(cfs_rq)); } static __always_inline @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf /* Determine whether we need to wake up potentially idle CPU: */ if (rq->curr == rq->idle && rq->cfs.nr_running) - resched_curr(rq); + resched_curr_lazy(rq); } #ifdef CONFIG_SMP @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq if (delta < 0) { if (task_current(rq, p)) - resched_curr(rq); + resched_curr_lazy(rq); return; } hrtick_start(rq, delta); @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct * prevents us from potentially nominating it as a false LAST_BUDDY * below. */ - if (test_tsk_need_resched(curr)) + if (need_resched()) return; /* Idle tasks are by definition preempted by non-idle tasks. */ @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct return; preempt: - resched_curr(rq); + resched_curr_lazy(rq); } #ifdef CONFIG_SMP @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct */ if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 && __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE)) - resched_curr(rq); + resched_curr_lazy(rq); } /* @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct */ if (task_current(rq, p)) { if (p->prio > oldprio) - resched_curr(rq); + resched_curr_lazy(rq); } else check_preempt_curr(rq, p, 0); } --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -108,7 +108,7 @@ static const struct dmi_system_id proces */ static void __cpuidle acpi_safe_halt(void) { - if (!tif_need_resched()) { + if (!need_resched()) { raw_safe_halt(); raw_local_irq_disable(); } --- a/include/linux/sched/idle.h +++ b/include/linux/sched/idle.h @@ -63,7 +63,7 @@ static __always_inline bool __must_check */ smp_mb__after_atomic(); - return unlikely(tif_need_resched()); + return unlikely(need_resched()); } static __always_inline bool __must_check current_clr_polling_and_test(void) @@ -76,7 +76,7 @@ static __always_inline bool __must_check */ smp_mb__after_atomic(); - return unlikely(tif_need_resched()); + return unlikely(need_resched()); } #else @@ -85,11 +85,11 @@ static inline void __current_clr_polling static inline bool __must_check current_set_polling_and_test(void) { - return unlikely(tif_need_resched()); + return unlikely(need_resched()); } static inline bool __must_check current_clr_polling_and_test(void) { - return unlikely(tif_need_resched()); + return unlikely(need_resched()); } #endif --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p ct_cpuidle_enter(); raw_local_irq_enable(); - while (!tif_need_resched() && - (cpu_idle_force_poll || tick_check_broadcast_expired())) + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired())) cpu_relax(); raw_local_irq_disable(); --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un if (tif_need_resched()) trace_flags |= TRACE_FLAG_NEED_RESCHED; + if (tif_need_resched_lazy()) + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY; if (test_preempt_need_resched()) trace_flags |= TRACE_FLAG_PREEMPT_RESCHED; return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) | --- a/arch/x86/Kconfig +++ 
b/arch/x86/Kconfig @@ -271,6 +271,7 @@ config X86 select HAVE_STATIC_CALL select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL select HAVE_PREEMPT_DYNAMIC_CALL + select HAVE_PREEMPT_AUTO select HAVE_RSEQ select HAVE_RUST if X86_64 select HAVE_SYSCALL_TRACEPOINTS --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -81,8 +81,9 @@ struct thread_info { #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */ #define TIF_SIGPENDING 2 /* signal pending */ #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ -#define TIF_SSBD 5 /* Speculative store bypass disable */ +#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */ +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/ +#define TIF_SSBD 6 /* Speculative store bypass disable */ #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ @@ -104,6 +105,7 @@ struct thread_info { #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) +#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY) #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) #define _TIF_SSBD (1 << TIF_SSBD) #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc return -EINTR; } - if (ti_work & _TIF_NEED_RESCHED) + if (ti_work & (_TIF_NEED_RESCHED | TIF_NEED_RESCHED_LAZY)) schedule(); if (ti_work & _TIF_NOTIFY_RESUME) --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un enum trace_flag_type { TRACE_FLAG_IRQS_OFF = 0x01, - TRACE_FLAG_IRQS_NOSUPPORT = 0x02, - TRACE_FLAG_NEED_RESCHED = 0x04, + TRACE_FLAG_NEED_RESCHED = 0x02, + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04, TRACE_FLAG_HARDIRQ = 0x08, TRACE_FLAG_SOFTIRQ = 0x10, TRACE_FLAG_PREEMPT_RESCHED = 0x20, @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags) { - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); + return tracing_gen_ctx_irq_test(0); } static inline unsigned int tracing_gen_ctx(void) { - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); + return tracing_gen_ctx_irq_test(0); } #endif --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' : (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : bh_off ? 'b' : - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' : + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 
'X' : '.'; - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED)) { + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: + need_resched = 'B'; + break; case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED: need_resched = 'N'; break; + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: + need_resched = 'L'; + break; + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY: + need_resched = 'b'; + break; case TRACE_FLAG_NEED_RESCHED: need_resched = 'n'; break; + case TRACE_FLAG_NEED_RESCHED_LAZY: + need_resched = 'l'; + break; case TRACE_FLAG_PREEMPT_RESCHED: need_resched = 'p'; break; --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -333,6 +333,23 @@ static const struct file_operations sche .release = seq_release, }; +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + unsigned long end = jiffies + 60 * HZ; + + for (; time_before(jiffies, end) && !signal_pending(current);) + cpu_relax(); + + return cnt; +} + +static const struct file_operations sched_hog_fops = { + .write = sched_hog_write, + .open = simple_open, + .llseek = default_llseek, +}; + static struct dentry *debugfs_sched; static __init int sched_init_debug(void) @@ -374,6 +391,8 @@ static __init int sched_init_debug(void) debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); + debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops); + return 0; } late_initcall(sched_init_debug);
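To summarize the core of the PoC in one place: the slice-expiry path
above only escalates from the lazy to the real reschedule request when
the running task has ignored the lazy bit for a full tick. The following
is a condensed, illustrative restatement of that decision -- the names
mirror the patch, but the function itself is a sketch; the real logic
lives inline in update_deadline() and __resched_curr() above.

	/*
	 * Illustrative sketch only, not part of the patch: what happens
	 * when the current task has consumed its slice.
	 */
	static void sketch_slice_expired(struct rq *rq, struct task_struct *curr,
					 bool tick)
	{
		if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) ||
		    sched_feat(FORCE_NEED_RESCHED)) {
			/* Classic behaviour: demand immediate preemption. */
			resched_curr(rq);
			return;
		}

		if (tick && test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY)) {
			/*
			 * The lazy request from a previous tick was ignored
			 * (the hog case). Escalate: resched_curr() sets
			 * TIF_NEED_RESCHED and folds it into preempt_count,
			 * so preemption happens at the next irq return or
			 * preempt_enable().
			 */
			resched_curr(rq);
		} else {
			/*
			 * First request is lazy: no preempt_count folding and
			 * no IPI; the task runs on until it returns to user
			 * space, schedules, or ignores the bit for a tick.
			 */
			resched_curr_lazy(rq);
		}
	}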
On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote:
>> Then the question becomes whether we'd want to introduce a *new*
>> concept, which is a "if you are going to schedule, do it now rather
>> than later, because I'm taking a lock, and while it's a preemptible
>> lock, I'd rather not sleep while holding this resource".
>>
>> I suspect we want to avoid that for now, on the assumption that it's
>> hopefully not a problem in practice (the recently addressed problem
>> with might_sleep() was that it actively *moved* the scheduling point
>> to a bad place, not that scheduling could happen there, so instead of
>> optimizing scheduling, it actively pessimized it). But I thought I'd
>> mention it.
>
> I think we want to avoid that completely and if this becomes an issue,
> we rather be smart about it at the core level.
>
> It's trivial enough to have a per task counter which tells whether a
> preemtible lock is held (or about to be acquired) or not. Then the
> scheduler can take that hint into account and decide to grant a
> timeslice extension once in the expectation that the task leaves the
> lock held section soonish and either returns to user space or schedules
> out. It still can enforce it later on.
>
> We really want to let the scheduler decide and rather give it proper
> hints at the conceptual level instead of letting developers make random
> decisions which might work well for a particular use case and completely
> suck for the rest. I think we wasted enough time already on those.

Finally I realized why cond_resched() & et al. are so disgusting. They
are scope-less and just a random spot which someone decided to be a good
place to reschedule.

But in fact the really relevant measure is scope. Full preemption is
scope based:

     preempt_disable();
     do_stuff();
     preempt_enable();

which also nests properly:

     preempt_disable();
     do_stuff()
       preempt_disable();
       do_other_stuff();
       preempt_enable();
     preempt_enable();

cond_resched() cannot nest and is obviously scope-less.

The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only
pretends to be scoped.

As Peter pointed out it does not properly nest with other mechanisms and
it cannot even nest in itself because it is boolean.

The worst thing about it is that it is semantically reverse to the
established model of preempt_disable()/enable(),
i.e. allow_resched()/disallow_resched().

So instead of giving the scheduler a hint about 'this might be a good
place to preempt', providing proper scope would make way more sense:

     preempt_lazy_disable();
     do_stuff();
     preempt_lazy_enable();

That would be the obvious and semantically consistent counterpart to the
existing preemption control primitives with proper nesting support.

might_sleep(), which is in all the lock acquire functions or your
variant of hint (resched better now before I take the lock) are the
wrong place.

     hint();
     lock();
     do_stuff();
     unlock();

hint() might schedule and when the task comes back schedule immediately
again because the lock is contended. hint() does again not have scope
and might be meaningless or even counterproductive if called in a deeper
callchain. Proper scope based hints avoid that.

     preempt_lazy_disable();
     lock();
     do_stuff();
     unlock();
     preempt_lazy_enable();

That's way better because it describes the scope and the task will
either schedule out in lock() on contention or provide a sensible lazy
preemption point in preempt_lazy_enable().
It also nests properly:

     preempt_lazy_disable();
     lock(A);
     do_stuff()
       preempt_lazy_disable();
       lock(B);
       do_other_stuff();
       unlock(B);
       preempt_lazy_enable();
     unlock(A);
     preempt_lazy_enable();

So in this case it does not matter whether do_stuff() is invoked from a
lock held section or not. The scope which defines the throughput
relevant hint to the scheduler is correct in any case.

Contrary to preempt_disable(), the lazy variant prevents neither
scheduling nor preemption, but it is an understandable, properly
nestable mechanism.

I seriously hope to avoid it altogether :)

Thanks,

        tglx
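As a side note for readers following the thread: a minimal sketch of
what such nestable primitives could look like, assuming a per-task
nesting counter. The function names come from the mail above; the
lazy_preempt_hint field and the bodies are purely illustrative and not
part of any posted patch.

	/*
	 * Illustrative sketch, not a posted patch. Unlike preempt_disable()
	 * this neither disables scheduling nor preemption; it only tells the
	 * scheduler that the task would prefer to finish this section before
	 * being preempted.
	 */
	static inline void preempt_lazy_disable(void)
	{
		current->lazy_preempt_hint++;	/* hypothetical task field */
		barrier();
	}

	static inline void preempt_lazy_enable(void)
	{
		barrier();
		if (!--current->lazy_preempt_hint && tif_need_resched_lazy()) {
			/*
			 * Outermost hinted section ends and a lazy reschedule
			 * was deferred meanwhile: this is the "sensible lazy
			 * preemption point" the mail talks about (assuming a
			 * sleepable context here).
			 */
			schedule();
		}
	}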
On Sun, Sep 24 2023 at 00:50, Thomas Gleixner wrote:
> On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote:
> That's way better because it describes the scope and the task will
> either schedule out in lock() on contention or provide a sensible lazy
> preemption point in preempt_lazy_enable(). It also nests properly:
>
>      preempt_lazy_disable();
>      lock(A);
>      do_stuff()
>        preempt_lazy_disable();
>        lock(B);
>        do_other_stuff();
>        unlock(B);
>        preempt_lazy_enable();
>      unlock(A);
>      preempt_lazy_enable();
>
> So in this case it does not matter whether do_stuff() is invoked from a
> lock held section or not. The scope which defines the throughput
> relevant hint to the scheduler is correct in any case.

Which also means that automatically injecting it into the lock
primitives suddenly makes sense, in the same way as the implicit
preempt_disable() in the rw/spinlock primitives does.

Thanks,

        tglx
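A sketch of what that injection could look like for a sleeping lock,
under the same assumptions as the sketch above: the preempt_lazy_*()
helpers are hypothetical, mutex_lock()/mutex_unlock() are the real API,
and the wrappers themselves are illustrative rather than proposed code.

	/*
	 * Illustrative only: fold the lazy hint into the lock primitives,
	 * analogous to the implicit preempt_disable() that spin_lock()
	 * performs on !PREEMPT_RT.
	 */
	static inline void mutex_lock_lazy(struct mutex *lock)
	{
		preempt_lazy_disable();		/* hypothetical, see above */
		mutex_lock(lock);		/* may still sleep on contention */
	}

	static inline void mutex_unlock_lazy(struct mutex *lock)
	{
		mutex_unlock(lock);
		preempt_lazy_enable();		/* lazy preemption point */
	}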
On Sun, Sep 24, 2023 at 12:50:43AM +0200, Thomas Gleixner wrote:
> cond_resched() cannot nest and is obviously scope-less.
>
> The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only
> pretends to be scoped.
>
> As Peter pointed out it does not properly nest with other mechanisms and
> it cannot even nest in itself because it is boolean.

We can nest a single bit without turning it into a counter -- we
do this for memalloc_nofs_save() for example. Simply return the
current value of the bit, and pass it to _restore(). e.g.
xfs_prepare_ioend():

	/*
	 * We can allocate memory here while doing writeback on behalf of
	 * memory reclaim. To avoid memory allocation deadlocks set the
	 * task-wide nofs context for the following operations.
	 */
	nofs_flag = memalloc_nofs_save();

	/* Convert CoW extents to regular */
	if (!status && (ioend->io_flags & IOMAP_F_SHARED)) {
		status = xfs_reflink_convert_cow(XFS_I(ioend->io_inode),
				ioend->io_offset, ioend->io_size);
	}

	memalloc_nofs_restore(nofs_flag);

I like your other approach better, but just in case anybody starts
worrying about turning a bit into a counter, there's no need to do
that.
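For reference, the save/restore pair Willy refers to is tiny. This is
roughly what include/linux/sched/mm.h does today (simplified, comments
added); returning the old value of the flag and restoring it afterwards
is the whole trick that lets a single task flag nest:

	static inline unsigned int memalloc_nofs_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

		current->flags |= PF_MEMALLOC_NOFS;	/* set the bit ... */
		return flags;		/* ... and hand back the previous value */
	}

	static inline void memalloc_nofs_restore(unsigned int flags)
	{
		/* Put the old value back; nested save/restore pairs just work. */
		current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
	}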
On Sun, Sep 24 2023 at 08:19, Matthew Wilcox wrote:
> On Sun, Sep 24, 2023 at 12:50:43AM +0200, Thomas Gleixner wrote:
>> cond_resched() cannot nest and is obviously scope-less.
>>
>> The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only
>> pretends to be scoped.
>>
>> As Peter pointed out it does not properly nest with other mechanisms and
>> it cannot even nest in itself because it is boolean.
>
> We can nest a single bit without turning it into a counter -- we
> do this for memalloc_nofs_save() for example. Simply return the
> current value of the bit, and pass it to _restore().

Right.

That works, but the reverse logic still does not make sense:

     allow_resched();
     ....
     spin_lock();

while

     resched_now_is_suboptimal();
     ...
     spin_lock();

works.

Thanks,

        tglx
On Sun, Sep 24, 2023 at 09:55:52AM +0200, Thomas Gleixner wrote:
> On Sun, Sep 24 2023 at 08:19, Matthew Wilcox wrote:
> > On Sun, Sep 24, 2023 at 12:50:43AM +0200, Thomas Gleixner wrote:
> >> cond_resched() cannot nest and is obviously scope-less.
> >>
> >> The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only
> >> pretends to be scoped.
> >>
> >> As Peter pointed out it does not properly nest with other mechanisms and
> >> it cannot even nest in itself because it is boolean.
> >
> > We can nest a single bit without turning it into a counter -- we
> > do this for memalloc_nofs_save() for example. Simply return the
> > current value of the bit, and pass it to _restore().
>
> Right.
>
> That works, but the reverse logic still does not make sense:
>
>      allow_resched();
>      ....
>      spin_lock();
>
> while
>
>      resched_now_is_suboptimal();
>      ...
>      spin_lock();
>
> works.

Oh, indeed. I had in mind

	state = resched_now_is_suboptimal();
	spin_lock();
	...
	spin_unlock();
	resched_might_be_optimal_again(state);

... or we could bundle it up ...

	state = spin_lock_resched_disable();
	...
	spin_unlock_resched_restore(state);
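A sketch of how the bundled variant could look if it reused the
memalloc_nofs-style save/restore shown earlier. Everything here with
"resched" in its name, including the PF_RESCHED_ALLOW flag, is made up
for illustration; spin_lock()/spin_unlock() are the real API, and a real
interface would need the lock argument that the mail elides.

	/* Hypothetical helpers following the memalloc_nofs pattern. */
	static inline unsigned int resched_allow_save_and_clear(void)
	{
		unsigned int old = current->flags & PF_RESCHED_ALLOW;

		current->flags &= ~PF_RESCHED_ALLOW;
		return old;
	}

	static inline void resched_allow_restore(unsigned int old)
	{
		current->flags = (current->flags & ~PF_RESCHED_ALLOW) | old;
	}

	/* The bundled lock primitives from the mail, sketched. */
	static inline unsigned int spin_lock_resched_disable(spinlock_t *lock)
	{
		unsigned int old = resched_allow_save_and_clear();

		spin_lock(lock);
		return old;
	}

	static inline void spin_unlock_resched_restore(spinlock_t *lock,
						       unsigned int old)
	{
		spin_unlock(lock);
		resched_allow_restore(old);
	}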
Thomas Gleixner <tglx@linutronix.de> writes: > On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote: >> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote: >>> Then the question becomes whether we'd want to introduce a *new* >>> concept, which is a "if you are going to schedule, do it now rather >>> than later, because I'm taking a lock, and while it's a preemptible >>> lock, I'd rather not sleep while holding this resource". >>> >>> I suspect we want to avoid that for now, on the assumption that it's >>> hopefully not a problem in practice (the recently addressed problem >>> with might_sleep() was that it actively *moved* the scheduling point >>> to a bad place, not that scheduling could happen there, so instead of >>> optimizing scheduling, it actively pessimized it). But I thought I'd >>> mention it. >> >> I think we want to avoid that completely and if this becomes an issue, >> we rather be smart about it at the core level. >> >> It's trivial enough to have a per task counter which tells whether a >> preemtible lock is held (or about to be acquired) or not. Then the >> scheduler can take that hint into account and decide to grant a >> timeslice extension once in the expectation that the task leaves the >> lock held section soonish and either returns to user space or schedules >> out. It still can enforce it later on. >> >> We really want to let the scheduler decide and rather give it proper >> hints at the conceptual level instead of letting developers make random >> decisions which might work well for a particular use case and completely >> suck for the rest. I think we wasted enough time already on those. > > Finally I realized why cond_resched() & et al. are so disgusting. They > are scope-less and just a random spot which someone decided to be a good > place to reschedule. > > But in fact the really relevant measure is scope. Full preemption is > scope based: > > preempt_disable(); > do_stuff(); > preempt_enable(); > > which also nests properly: > > preempt_disable(); > do_stuff() > preempt_disable(); > do_other_stuff(); > preempt_enable(); > preempt_enable(); > > cond_resched() cannot nest and is obviously scope-less. That's true. Though, I would argue that another way to look at cond_resched() might be that it summarizes two kinds of state. First, the timer/resched activity that might cause you to schedule. The second, as an annotation from the programmer summarizing their understanding of the state of the execution stack and that there are no resources held across the current point. The second is, as you say, hard to get right -- because there's no clear definition of what it means for us to get it right, resulting in random placement of cond_rescheds() until latency improves. In any case this summary of execution state is done better by just always tracking preemption scope. > The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only > pretends to be scoped. > > As Peter pointed out it does not properly nest with other mechanisms and > it cannot even nest in itself because it is boolean. > > The worst thing about it is that it is semantically reverse to the > established model of preempt_disable()/enable(), > i.e. allow_resched()/disallow_resched(). Can't disagree with that. In part it was that way because I was trying to provide an alternative to cond_resched() while executing in a particular preemptible scope -- except for not actually having any real notion of scoping. 
> > So instead of giving the scheduler a hint about 'this might be a good > place to preempt', providing proper scope would make way more sense: > > preempt_lazy_disable(); > do_stuff(); > preempt_lazy_enable(); > > That would be the obvious and semantically consistent counterpart to the > existing preemption control primitives with proper nesting support. > > might_sleep(), which is in all the lock acquire functions or your > variant of hint (resched better now before I take the lock) are the > wrong place. > > hint(); > lock(); > do_stuff(); > unlock(); > > hint() might schedule and when the task comes back schedule immediately > again because the lock is contended. hint() does again not have scope > and might be meaningless or even counterproductive if called in a deeper > callchain. Perhaps another problem is that some of these hints are useful for two different things: as an annotation about the state of execution, and also as a hint to the scheduler. For instance, this fix that Linus pointed to a few days ago: 4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()"). is using might_sleep() in the first sense. Thanks -- ankur
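To make the "two senses" concrete: might_sleep() today is roughly the
combination of a context-debugging annotation and a voluntary-preemption
hint. The sketch below is simplified from include/linux/kernel.h; the
real definitions carry additional debug and preempt-dynamic plumbing.

	/*
	 * Simplified restatement, not a verbatim copy:
	 *
	 *  - __might_sleep() is the first sense: a debug assertion that
	 *    sleeping is legal in this context. It never schedules itself.
	 *  - might_resched() is the second sense: a scheduling hint, which
	 *    expands to cond_resched() under voluntary preemption and to
	 *    nothing under PREEMPT_NONE and full preemption.
	 */
	#define might_sleep()					\
		do {						\
			__might_sleep(__FILE__, __LINE__);	\
			might_resched();			\
		} while (0)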
On Sat, 23 Sep 2023 03:11:05 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > Though definitely I'm putting a permanent NAK in place for any attempts > to duct tape the preempt=NONE model any further by sprinkling more > cond*() and whatever warts around. Well, until we have this fix in, we will still need to sprinkle those around when they are triggering watchdog timeouts. I just had to add one recently due to a timeout report :-( > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un > > if (tif_need_resched()) > trace_flags |= TRACE_FLAG_NEED_RESCHED; > + if (tif_need_resched_lazy()) > + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY; > if (test_preempt_need_resched()) > trace_flags |= TRACE_FLAG_PREEMPT_RESCHED; > return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) | > --- a/include/linux/trace_events.h > +++ b/include/linux/trace_events.h > @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un > > enum trace_flag_type { > TRACE_FLAG_IRQS_OFF = 0x01, > - TRACE_FLAG_IRQS_NOSUPPORT = 0x02, I never cared for that NOSUPPORT flag. It's from 2008 and only used by archs that do not support irq tracing (aka lockdep). I'm fine with dropping it and just updating the user space libraries (which will no longer see it not supported, but that's fine with me). > - TRACE_FLAG_NEED_RESCHED = 0x04, > + TRACE_FLAG_NEED_RESCHED = 0x02, > + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04, Is LAZY only used for PREEMPT_NONE? Or do we use it for CONFIG_PREEMPT? Because, NEED_RESCHED is known, and moving that to bit 2 will break user space. Having LAZY replace the IRQS_NOSUPPORT will cause the least "breakage". -- Steve > TRACE_FLAG_HARDIRQ = 0x08, > TRACE_FLAG_SOFTIRQ = 0x10, > TRACE_FLAG_PREEMPT_RESCHED = 0x20, > @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c > > static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags) > { > - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); > + return tracing_gen_ctx_irq_test(0); > } > static inline unsigned int tracing_gen_ctx(void) > { > - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); > + return tracing_gen_ctx_irq_test(0); > } > #endif > > --- a/kernel/trace/trace_output.c > +++ b/kernel/trace/trace_output.c > @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq > (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' : > (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : > bh_off ? 'b' : > - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' : > + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' : > '.'; > > - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | > + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | > TRACE_FLAG_PREEMPT_RESCHED)) { > + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: > + need_resched = 'B'; > + break; > case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED: > need_resched = 'N'; > break; > + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: > + need_resched = 'L'; > + break; > + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY: > + need_resched = 'b'; > + break; > case TRACE_FLAG_NEED_RESCHED: > need_resched = 'n'; > break; > + case TRACE_FLAG_NEED_RESCHED_LAZY: > + need_resched = 'l'; > + break; > case TRACE_FLAG_PREEMPT_RESCHED: > need_resched = 'p'; > break;
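To make the user-space concern concrete: tooling that decodes
trace_entry::flags hard-codes the historical bit values, so moving
NEED_RESCHED from 0x04 to 0x02 silently changes what existing parsers
report. A toy decoder illustrating the breakage (the bit values are the
ones discussed above; the program itself is only an illustration):

	#include <stdio.h>

	/* Historical layout, baked into existing user-space parsers. */
	#define OLD_TRACE_FLAG_IRQS_NOSUPPORT	0x02
	#define OLD_TRACE_FLAG_NEED_RESCHED	0x04

	/* Layout from the PoC patch, which reshuffles the low bits. */
	#define NEW_TRACE_FLAG_NEED_RESCHED		0x02
	#define NEW_TRACE_FLAG_NEED_RESCHED_LAZY	0x04

	int main(void)
	{
		/* A record written by a kernel using the PoC layout, with
		 * only the real NEED_RESCHED bit set. */
		unsigned char flags = NEW_TRACE_FLAG_NEED_RESCHED;

		/* An old parser misreads it as "irq state not supported"
		 * and sees no pending reschedule at all. */
		printf("old parser: need_resched=%c irqs-nosupport=%c\n",
		       (flags & OLD_TRACE_FLAG_NEED_RESCHED) ? 'y' : 'n',
		       (flags & OLD_TRACE_FLAG_IRQS_NOSUPPORT) ? 'y' : 'n');
		return 0;
	}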
On Mon, Oct 02 2023 at 10:15, Steven Rostedt wrote:
> On Sat, 23 Sep 2023 03:11:05 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> Though definitely I'm putting a permanent NAK in place for any attempts
>> to duct tape the preempt=NONE model any further by sprinkling more
>> cond*() and whatever warts around.
>
> Well, until we have this fix in, we will still need to sprinkle those
> around when they are triggering watchdog timeouts. I just had to add one
> recently due to a timeout report :-(

cond_resched() sure. But not new flavours of it, like the
[dis]allow_resched() which sparked this discussion.

>> -	TRACE_FLAG_NEED_RESCHED		= 0x04,
>> +	TRACE_FLAG_NEED_RESCHED		= 0x02,
>> +	TRACE_FLAG_NEED_RESCHED_LAZY	= 0x04,
>
> Is LAZY only used for PREEMPT_NONE? Or do we use it for CONFIG_PREEMPT?
> Because, NEED_RESCHED is known, and moving that to bit 2 will break user
> space. Having LAZY replace the IRQS_NOSUPPORT will cause the least
> "breakage".

Either way works for me.

Thanks,

        tglx
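The least-breakage option Steve describes keeps every existing flag at
its historical value and only lets the lazy bit take over the retired
NOSUPPORT value. A sketch of that layout, for comparison with the hunk
in the PoC patch above (illustrative, not a posted patch; flags above
0x20 are unchanged and omitted):

	enum trace_flag_type {
		TRACE_FLAG_IRQS_OFF		= 0x01,
		TRACE_FLAG_NEED_RESCHED_LAZY	= 0x02,	/* was IRQS_NOSUPPORT */
		TRACE_FLAG_NEED_RESCHED		= 0x04,	/* unchanged for user space */
		TRACE_FLAG_HARDIRQ		= 0x08,
		TRACE_FLAG_SOFTIRQ		= 0x10,
		TRACE_FLAG_PREEMPT_RESCHED	= 0x20,
	};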
Hi Thomas,

On Tue, Sep 19, 2023 at 9:57 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> Though it just occured to me that there are dragons lurking:
>
> arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> arch/um/Kconfig:        select ARCH_NO_PREEMPT
>
> So we have four architectures which refuse to enable preemption points,
> i.e. the only model they allow is NONE and they rely on cond_resched()
> for breaking large computations.

Looks like there is a fifth one hidden: although openrisc does not
select ARCH_NO_PREEMPT, it does not call preempt_schedule_irq() or
select GENERIC_ENTRY?

Gr{oetje,eeting}s,

                        Geert
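For context on Geert's observation: kernel preemption needs a hook on
the path that returns from an interrupt to kernel code. With the generic
entry code that hook is raw_irqentry_exit_cond_resched() (visible in the
patch above); an architecture that neither selects GENERIC_ENTRY nor
open-codes an equivalent in its own exit path never preempts kernel
code, whatever the preemption model says. A condensed restatement of
what that hook boils down to (illustrative; the real function carries
extra RCU and stack sanity checks):

	static void sketch_irqexit_to_kernel(void)
	{
		if (IS_ENABLED(CONFIG_PREEMPTION) &&
		    !preempt_count() && need_resched())
			preempt_schedule_irq();	/* called with irqs disabled */
	}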
Hi Willy,

On Tue, Sep 19, 2023 at 3:01 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Sep 19, 2023 at 02:30:59PM +0200, Thomas Gleixner wrote:
> > Though it just occured to me that there are dragons lurking:
> >
> > arch/alpha/Kconfig:     select ARCH_NO_PREEMPT
> > arch/hexagon/Kconfig:   select ARCH_NO_PREEMPT
> > arch/m68k/Kconfig:      select ARCH_NO_PREEMPT if !COLDFIRE
> > arch/um/Kconfig:        select ARCH_NO_PREEMPT
>
> Sounds like three-and-a-half architectures which could be queued up for
> removal right behind ia64 ...
>
> I suspect none of these architecture maintainers have any idea there's a
> problem. Look at commit 87a4c375995e and the discussion in
> https://lore.kernel.org/lkml/20180724175646.3621-1-hch@lst.de/
>
> Let's cc those maintainers so they can remove this and fix whatever
> breaks.

Looks like your scare tactics are working ;-)

[PATCH/RFC] m68k: Add full preempt support
https://lore.kernel.org/all/7858a184cda66e0991fd295c711dfed7e4d1248c.1696603287.git.geert@linux-m68k.org

Gr{oetje,eeting}s,

                        Geert
On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote:
> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote:
> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote:
> >> That said - I think as a proof of concept and "look, with this we get
> >> the expected scheduling event counts", that patch is perfect. I think
> >> you more than proved the concept.
> >
> > There is certainly quite some analysis work to do to make this a
> > one-to-one replacement.
> >
> > With a handful of benchmarks the PoC (tweaked with some obvious fixes)
> > is pretty much on par with the current mainline variants (NONE/FULL),
> > but the memtier benchmark makes a massive dent.
> >
> > It sports a whopping 10% regression with the LAZY mode versus the mainline
> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way.
> >
> > That benchmark is really sensitive to the preemption model. With current
> > mainline (PREEMPT_DYNAMIC enabled) the preempt=FULL model has ~20%
> > performance drop versus preempt=NONE.
>
> That 20% was a tired pilot error. The real number is in the 5% ballpark.
>
> > I have no clue what's going on there yet, but that shows that there is
> > obviously quite some work ahead to get this sorted.
>
> It took some head scratching to figure that out. The initial fix broke
> the handling of the hog issue, i.e. the problem that Ankur tried to
> solve, but I hacked up a "solution" for that too.
>
> With that the memtier benchmark is roughly back to the mainline numbers,
> but my throughput benchmark know-how is pretty close to zero, so that
> should be looked at by people who actually understand these things.
>
> Likewise the hog prevention is just at the PoC level and clearly beyond
> my knowledge of scheduler details: It unconditionally forces a
> reschedule when the looping task is not responding to a lazy reschedule
> request before the next tick. IOW it forces a reschedule on the second
> tick, which is obviously different from the cond_resched()/might_sleep()
> behaviour.
>
> The changes vs. the original PoC aside from the bug and thinko fixes:
>
>   1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the
>      lazy preempt bit as the trace_entry::flags field is full already.
>
>      That obviously breaks the tracer ABI, but if we go there then
>      this needs to be fixed. Steven?
>
>   2) debugfs file to validate that loops can be force preempted w/o
>      cond_resched()
>
>      The usage is:
>
>         # taskset -c 1 bash
>         # echo 1 > /sys/kernel/debug/sched/hog &
>         # echo 1 > /sys/kernel/debug/sched/hog &
>         # echo 1 > /sys/kernel/debug/sched/hog &
>
>      top shows ~33% CPU for each of the hogs and tracing confirms that
>      the crude hack in the scheduler tick works:
>
>        bash-4559    [001] dlh2.  2253.331202: resched_curr <-__update_curr
>        bash-4560    [001] dlh2.  2253.340199: resched_curr <-__update_curr
>        bash-4561    [001] dlh2.  2253.346199: resched_curr <-__update_curr
>        bash-4559    [001] dlh2.  2253.353199: resched_curr <-__update_curr
>        bash-4561    [001] dlh2.  2253.358199: resched_curr <-__update_curr
>        bash-4560    [001] dlh2.  2253.370202: resched_curr <-__update_curr
>        bash-4559    [001] dlh2.  2253.378198: resched_curr <-__update_curr
>        bash-4561    [001] dlh2.  2253.389199: resched_curr <-__update_curr
>
>      The 'l' instead of the usual 'N' reflects that the lazy resched
>      bit is set. That makes __update_curr() invoke resched_curr()
>      instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED
>      and folds it into preempt_count so that preemption happens at the
>      next possible point, i.e. either in return from interrupt or at
>      the next preempt_enable().

Belatedly calling out some RCU issues. Nothing fatal, just a
(surprisingly) few adjustments that will need to be made. The key thing
to note is that from RCU's viewpoint, with this change, all kernels are
preemptible, though rcu_read_lock() readers remain non-preemptible.

With that:

1.	As an optimization, given that preempt_count() would always give
	good information, the scheduling-clock interrupt could sense RCU
	readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the
	IPI handlers for expedited grace periods. A nice optimization.
	Except that...

2.	The quiescent-state-forcing code currently relies on the presence
	of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix would
	be to do resched_cpu() more quickly, but some workloads might not
	love the additional IPIs. Another approach is to do #1 above to
	replace the quiescent states from cond_resched() with
	scheduler-tick-interrupt-sensed quiescent states. Plus...

3.	For nohz_full CPUs that run for a long time in the kernel, there
	are no scheduling-clock interrupts. RCU reaches for the
	resched_cpu() hammer a few jiffies into the grace period. And it
	sets the ->rcu_urgent_qs flag so that the holdout CPU's
	interrupt-entry code will re-enable its scheduling-clock interrupt
	upon receiving the resched_cpu() IPI.

	So nohz_full CPUs should be OK as far as RCU is concerned. Other
	subsystems might have other opinions.

4.	As another optimization, kvfree_rcu() could unconditionally check
	preempt_count() to sense a clean environment suitable for memory
	allocation.

5.	Kconfig files with "select TASKS_RCU if PREEMPTION" must instead
	say "select TASKS_RCU". This means that the #else in
	include/linux/rcupdate.h that defines TASKS_RCU in terms of
	vanilla RCU must go. There might be some fallout if something
	fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
	and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
	rcu_tasks_classic_qs() to do something useful.

6.	You might think that RCU Tasks (as opposed to RCU Tasks Trace or
	RCU Tasks Rude) would need those pesky cond_resched() calls to
	stick around. The reason is that RCU Tasks readers are ended only
	by voluntary context switches. This means that although a
	preemptible infinite loop in the kernel won't inconvenience a
	real-time task (nor a non-real-time task for all that long), and
	won't delay grace periods for the other flavors of RCU, it would
	indefinitely delay an RCU Tasks grace period.

	However, RCU Tasks grace periods seem to be finite in preemptible
	kernels today, so they should remain finite in limited-preemptible
	kernels tomorrow. Famous last words...

7.	RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice any
	algorithmic difference from this change.

8.	As has been noted elsewhere, in this new limited-preemption mode
	of operation, rcu_read_lock() readers remain preemptible. This
	means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.

9.	The rcu_preempt_depth() macro could do something useful in
	limited-preemption kernels. Its current lack of ability in
	CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past.

10.	The cond_resched_rcu() function must remain because we still have
	non-preemptible rcu_read_lock() readers.

11.	My guess is that the IPVS_EST_TICK_CHAINS heuristic remains
	unchanged, but I must defer to the include/net/ip_vs.h people.

12.	I need to check with the BPF folks on the BPF verifier's
	definition of BTF_ID(func, rcu_read_unlock_strict).

13.
The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner() function might have some redundancy across the board instead of just on CONFIG_PREEMPT_RCU=y. Or might not. 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function might need to do something for non-preemptible RCU to make up for the lack of cond_resched() calls. Maybe just drop the "IS_ENABLED()" and execute the body of the current "if" statement unconditionally. 15. I must defer to others on the mm/pgtable-generic.c file's #ifdef that depends on CONFIG_PREEMPT_RCU. While in the area, I noted that KLP seems to depend on cond_resched(), but on this I must defer to the KLP people. I am sure that I am missing something, but I have not yet seen any show-stoppers. Just some needed adjustments. Thoughts? Thanx, Paul > That's as much as I wanted to demonstrate and I'm not going to spend > more cycles on it as I have already too many other things on flight and > the resulting scheduler woes are clearly outside of my expertice. > > Though definitely I'm putting a permanent NAK in place for any attempts > to duct tape the preempt=NONE model any further by sprinkling more > cond*() and whatever warts around. > > Thanks, > > tglx > --- > arch/x86/Kconfig | 1 > arch/x86/include/asm/thread_info.h | 6 ++-- > drivers/acpi/processor_idle.c | 2 - > include/linux/entry-common.h | 2 - > include/linux/entry-kvm.h | 2 - > include/linux/sched.h | 12 +++++--- > include/linux/sched/idle.h | 8 ++--- > include/linux/thread_info.h | 24 +++++++++++++++++ > include/linux/trace_events.h | 8 ++--- > kernel/Kconfig.preempt | 17 +++++++++++- > kernel/entry/common.c | 4 +- > kernel/entry/kvm.c | 2 - > kernel/sched/core.c | 51 +++++++++++++++++++++++++------------ > kernel/sched/debug.c | 19 +++++++++++++ > kernel/sched/fair.c | 46 ++++++++++++++++++++++----------- > kernel/sched/features.h | 2 + > kernel/sched/idle.c | 3 -- > kernel/sched/sched.h | 1 > kernel/trace/trace.c | 2 + > kernel/trace/trace_output.c | 16 ++++++++++- > 20 files changed, 171 insertions(+), 57 deletions(-) > > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct > > #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG) > /* > - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG, > + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG, > * this avoids any races wrt polling state changes and thereby avoids > * spurious IPIs. 
> */ > -static inline bool set_nr_and_not_polling(struct task_struct *p) > +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) > { > struct thread_info *ti = task_thread_info(p); > - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG); > + > + return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG); > } > > /* > @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas > for (;;) { > if (!(val & _TIF_POLLING_NRFLAG)) > return false; > - if (val & _TIF_NEED_RESCHED) > + if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) > return true; > if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED)) > break; > @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas > } > > #else > -static inline bool set_nr_and_not_polling(struct task_struct *p) > +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) > { > - set_tsk_need_resched(p); > + set_tsk_thread_flag(p, tif_bit); > return true; > } > > @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head) > * might also involve a cross-CPU call to trigger the scheduler on > * the target CPU. > */ > -void resched_curr(struct rq *rq) > +static void __resched_curr(struct rq *rq, int lazy) > { > + int cpu, tif_bit = TIF_NEED_RESCHED + lazy; > struct task_struct *curr = rq->curr; > - int cpu; > > lockdep_assert_rq_held(rq); > > - if (test_tsk_need_resched(curr)) > + if (unlikely(test_tsk_thread_flag(curr, tif_bit))) > return; > > cpu = cpu_of(rq); > > if (cpu == smp_processor_id()) { > - set_tsk_need_resched(curr); > - set_preempt_need_resched(); > + set_tsk_thread_flag(curr, tif_bit); > + if (!lazy) > + set_preempt_need_resched(); > return; > } > > - if (set_nr_and_not_polling(curr)) > - smp_send_reschedule(cpu); > - else > + if (set_nr_and_not_polling(curr, tif_bit)) { > + if (!lazy) > + smp_send_reschedule(cpu); > + } else { > trace_sched_wake_idle_without_ipi(cpu); > + } > +} > + > +void resched_curr(struct rq *rq) > +{ > + __resched_curr(rq, 0); > +} > + > +void resched_curr_lazy(struct rq *rq) > +{ > + int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ? 
> + TIF_NEED_RESCHED_LAZY_OFFSET : 0; > + > + if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED))) > + return; > + > + __resched_curr(rq, lazy); > } > > void resched_cpu(int cpu) > @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu) > if (cpu == smp_processor_id()) > return; > > - if (set_nr_and_not_polling(rq->idle)) > + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED)) > smp_send_reschedule(cpu); > else > trace_sched_wake_idle_without_ipi(cpu); > @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init( > WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \ > return preempt_dynamic_mode == preempt_dynamic_##mode; \ > } \ > - EXPORT_SYMBOL_GPL(preempt_model_##mode) > > PREEMPT_MODEL_ACCESSOR(none); > PREEMPT_MODEL_ACCESSOR(voluntary); > --- a/include/linux/thread_info.h > +++ b/include/linux/thread_info.h > @@ -59,6 +59,16 @@ enum syscall_work_bit { > > #include <asm/thread_info.h> > > +#ifdef CONFIG_PREEMPT_AUTO > +# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY > +# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY > +# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED) > +#else > +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED > +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED > +# define TIF_NEED_RESCHED_LAZY_OFFSET 0 > +#endif > + > #ifdef __KERNEL__ > > #ifndef arch_set_restart_data > @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res > (unsigned long *)(¤t_thread_info()->flags)); > } > > +static __always_inline bool tif_need_resched_lazy(void) > +{ > + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && > + arch_test_bit(TIF_NEED_RESCHED_LAZY, > + (unsigned long *)(¤t_thread_info()->flags)); > +} > + > #else > > static __always_inline bool tif_need_resched(void) > @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res > (unsigned long *)(¤t_thread_info()->flags)); > } > > +static __always_inline bool tif_need_resched_lazy(void) > +{ > + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && > + test_bit(TIF_NEED_RESCHED_LAZY, > + (unsigned long *)(¤t_thread_info()->flags)); > +} > + > #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */ > > #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES > --- a/kernel/Kconfig.preempt > +++ b/kernel/Kconfig.preempt > @@ -11,6 +11,13 @@ config PREEMPT_BUILD > select PREEMPTION > select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK > > +config PREEMPT_BUILD_AUTO > + bool > + select PREEMPT_BUILD > + > +config HAVE_PREEMPT_AUTO > + bool > + > choice > prompt "Preemption Model" > default PREEMPT_NONE > @@ -67,9 +74,17 @@ config PREEMPT > embedded system with latency requirements in the milliseconds > range. 
> > +config PREEMPT_AUTO > + bool "Automagic preemption mode with runtime tweaking support" > + depends on HAVE_PREEMPT_AUTO > + select PREEMPT_BUILD_AUTO > + help > + Add some sensible blurb here > + > config PREEMPT_RT > bool "Fully Preemptible Kernel (Real-Time)" > depends on EXPERT && ARCH_SUPPORTS_RT > + select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO > select PREEMPTION > help > This option turns the kernel into a real-time kernel by replacing > @@ -95,7 +110,7 @@ config PREEMPTION > > config PREEMPT_DYNAMIC > bool "Preemption behaviour defined on boot" > - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT > + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO > select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY > select PREEMPT_BUILD > default y if HAVE_PREEMPT_DYNAMIC_CALL > --- a/include/linux/entry-common.h > +++ b/include/linux/entry-common.h > @@ -60,7 +60,7 @@ > #define EXIT_TO_USER_MODE_WORK \ > (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ > _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \ > - ARCH_EXIT_TO_USER_MODE_WORK) > + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK) > > /** > * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs > --- a/include/linux/entry-kvm.h > +++ b/include/linux/entry-kvm.h > @@ -18,7 +18,7 @@ > > #define XFER_TO_GUEST_MODE_WORK \ > (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \ > - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK) > + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK) > > struct kvm_vcpu; > > --- a/kernel/entry/common.c > +++ b/kernel/entry/common.c > @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l > > local_irq_enable_exit_to_user(ti_work); > > - if (ti_work & _TIF_NEED_RESCHED) > + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) > schedule(); > > if (ti_work & _TIF_UPROBE) > @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void > rcu_irq_exit_check_preempt(); > if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) > WARN_ON_ONCE(!on_thread_stack()); > - if (need_resched()) > + if (test_tsk_need_resched(current)) > preempt_schedule_irq(); > } > } > --- a/kernel/sched/features.h > +++ b/kernel/sched/features.h > @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) > SCHED_FEAT(LATENCY_WARN, false) > > SCHED_FEAT(HZ_BW, true) > + > +SCHED_FEAT(FORCE_NEED_RESCHED, false) > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void); > extern void reweight_task(struct task_struct *p, int prio); > > extern void resched_curr(struct rq *rq); > +extern void resched_curr_lazy(struct rq *rq); > extern void resched_cpu(int cpu); > > extern struct rt_bandwidth def_rt_bandwidth; > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla > update_ti_thread_flag(task_thread_info(tsk), flag, value); > } > > -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) > +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) > { > return test_and_set_ti_thread_flag(task_thread_info(tsk), flag); > } > > -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) > +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) > { > return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag); > } > > -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag) > +static inline bool 
test_tsk_thread_flag(struct task_struct *tsk, int flag) > { > return test_ti_thread_flag(task_thread_info(tsk), flag); > } > @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched( > static inline void clear_tsk_need_resched(struct task_struct *tsk) > { > clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED); > + if (IS_ENABLED(CONFIG_PREEMPT_AUTO)) > + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY); > } > > -static inline int test_tsk_need_resched(struct task_struct *tsk) > +static inline bool test_tsk_need_resched(struct task_struct *tsk) > { > return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); > } > @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc > > static __always_inline bool need_resched(void) > { > - return unlikely(tif_need_resched()); > + return unlikely(tif_need_resched_lazy() || tif_need_resched()); > } > > /* > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq > * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i > * this is probably good enough. > */ > -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) > +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick) > { > + struct rq *rq = rq_of(cfs_rq); > + > if ((s64)(se->vruntime - se->deadline) < 0) > return; > > @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r > /* > * The task has consumed its request, reschedule. > */ > - if (cfs_rq->nr_running > 1) { > - resched_curr(rq_of(cfs_rq)); > - clear_buddies(cfs_rq, se); > + if (cfs_rq->nr_running < 2) > + return; > + > + if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) { > + resched_curr(rq); > + } else { > + /* Did the task ignore the lazy reschedule request? */ > + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) > + resched_curr(rq); > + else > + resched_curr_lazy(rq); > } > + clear_buddies(cfs_rq, se); > } > > #include "pelt.h" > @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf > /* > * Update the current task's runtime statistics. > */ > -static void update_curr(struct cfs_rq *cfs_rq) > +static void __update_curr(struct cfs_rq *cfs_rq, bool tick) > { > struct sched_entity *curr = cfs_rq->curr; > u64 now = rq_clock_task(rq_of(cfs_rq)); > @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c > schedstat_add(cfs_rq->exec_clock, delta_exec); > > curr->vruntime += calc_delta_fair(delta_exec, curr); > - update_deadline(cfs_rq, curr); > + update_deadline(cfs_rq, curr, tick); > update_min_vruntime(cfs_rq); > > if (entity_is_task(curr)) { > @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c > account_cfs_rq_runtime(cfs_rq, delta_exec); > } > > +static inline void update_curr(struct cfs_rq *cfs_rq) > +{ > + __update_curr(cfs_rq, false); > +} > + > static void update_curr_fair(struct rq *rq) > { > update_curr(cfs_rq_of(&rq->curr->se)); > @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc > /* > * Update run-time statistics of the 'current'. > */ > - update_curr(cfs_rq); > + __update_curr(cfs_rq, true); > > /* > * Ensure that runnable average is periodically updated. > @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc > * validating it and just reschedule. 
> */ > if (queued) { > - resched_curr(rq_of(cfs_rq)); > + resched_curr_lazy(rq_of(cfs_rq)); > return; > } > /* > @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str > * hierarchy can be throttled > */ > if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) > - resched_curr(rq_of(cfs_rq)); > + resched_curr_lazy(rq_of(cfs_rq)); > } > > static __always_inline > @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf > > /* Determine whether we need to wake up potentially idle CPU: */ > if (rq->curr == rq->idle && rq->cfs.nr_running) > - resched_curr(rq); > + resched_curr_lazy(rq); > } > > #ifdef CONFIG_SMP > @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq > > if (delta < 0) { > if (task_current(rq, p)) > - resched_curr(rq); > + resched_curr_lazy(rq); > return; > } > hrtick_start(rq, delta); > @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct > * prevents us from potentially nominating it as a false LAST_BUDDY > * below. > */ > - if (test_tsk_need_resched(curr)) > + if (need_resched()) > return; > > /* Idle tasks are by definition preempted by non-idle tasks. */ > @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct > return; > > preempt: > - resched_curr(rq); > + resched_curr_lazy(rq); > } > > #ifdef CONFIG_SMP > @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct > */ > if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 && > __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE)) > - resched_curr(rq); > + resched_curr_lazy(rq); > } > > /* > @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct > */ > if (task_current(rq, p)) { > if (p->prio > oldprio) > - resched_curr(rq); > + resched_curr_lazy(rq); > } else > check_preempt_curr(rq, p, 0); > } > --- a/drivers/acpi/processor_idle.c > +++ b/drivers/acpi/processor_idle.c > @@ -108,7 +108,7 @@ static const struct dmi_system_id proces > */ > static void __cpuidle acpi_safe_halt(void) > { > - if (!tif_need_resched()) { > + if (!need_resched()) { > raw_safe_halt(); > raw_local_irq_disable(); > } > --- a/include/linux/sched/idle.h > +++ b/include/linux/sched/idle.h > @@ -63,7 +63,7 @@ static __always_inline bool __must_check > */ > smp_mb__after_atomic(); > > - return unlikely(tif_need_resched()); > + return unlikely(need_resched()); > } > > static __always_inline bool __must_check current_clr_polling_and_test(void) > @@ -76,7 +76,7 @@ static __always_inline bool __must_check > */ > smp_mb__after_atomic(); > > - return unlikely(tif_need_resched()); > + return unlikely(need_resched()); > } > > #else > @@ -85,11 +85,11 @@ static inline void __current_clr_polling > > static inline bool __must_check current_set_polling_and_test(void) > { > - return unlikely(tif_need_resched()); > + return unlikely(need_resched()); > } > static inline bool __must_check current_clr_polling_and_test(void) > { > - return unlikely(tif_need_resched()); > + return unlikely(need_resched()); > } > #endif > > --- a/kernel/sched/idle.c > +++ b/kernel/sched/idle.c > @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p > ct_cpuidle_enter(); > > raw_local_irq_enable(); > - while (!tif_need_resched() && > - (cpu_idle_force_poll || tick_check_broadcast_expired())) > + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired())) > cpu_relax(); > raw_local_irq_disable(); > > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un > > if (tif_need_resched()) > trace_flags |= TRACE_FLAG_NEED_RESCHED; 
> + if (tif_need_resched_lazy()) > + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY; > if (test_preempt_need_resched()) > trace_flags |= TRACE_FLAG_PREEMPT_RESCHED; > return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) | > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -271,6 +271,7 @@ config X86 > select HAVE_STATIC_CALL > select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL > select HAVE_PREEMPT_DYNAMIC_CALL > + select HAVE_PREEMPT_AUTO > select HAVE_RSEQ > select HAVE_RUST if X86_64 > select HAVE_SYSCALL_TRACEPOINTS > --- a/arch/x86/include/asm/thread_info.h > +++ b/arch/x86/include/asm/thread_info.h > @@ -81,8 +81,9 @@ struct thread_info { > #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */ > #define TIF_SIGPENDING 2 /* signal pending */ > #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ > -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ > -#define TIF_SSBD 5 /* Speculative store bypass disable */ > +#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */ > +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/ > +#define TIF_SSBD 6 /* Speculative store bypass disable */ > #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ > #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ > #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ > @@ -104,6 +105,7 @@ struct thread_info { > #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) > #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) > #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) > +#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY) > #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) > #define _TIF_SSBD (1 << TIF_SSBD) > #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) > --- a/kernel/entry/kvm.c > +++ b/kernel/entry/kvm.c > @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc > return -EINTR; > } > > - if (ti_work & _TIF_NEED_RESCHED) > + if (ti_work & (_TIF_NEED_RESCHED | TIF_NEED_RESCHED_LAZY)) > schedule(); > > if (ti_work & _TIF_NOTIFY_RESUME) > --- a/include/linux/trace_events.h > +++ b/include/linux/trace_events.h > @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un > > enum trace_flag_type { > TRACE_FLAG_IRQS_OFF = 0x01, > - TRACE_FLAG_IRQS_NOSUPPORT = 0x02, > - TRACE_FLAG_NEED_RESCHED = 0x04, > + TRACE_FLAG_NEED_RESCHED = 0x02, > + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04, > TRACE_FLAG_HARDIRQ = 0x08, > TRACE_FLAG_SOFTIRQ = 0x10, > TRACE_FLAG_PREEMPT_RESCHED = 0x20, > @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c > > static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags) > { > - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); > + return tracing_gen_ctx_irq_test(0); > } > static inline unsigned int tracing_gen_ctx(void) > { > - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); > + return tracing_gen_ctx_irq_test(0); > } > #endif > > --- a/kernel/trace/trace_output.c > +++ b/kernel/trace/trace_output.c > @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq > (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' : > (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : > bh_off ? 'b' : > - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' : > + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 
'X' : > '.'; > > - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | > + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | > TRACE_FLAG_PREEMPT_RESCHED)) { > + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: > + need_resched = 'B'; > + break; > case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED: > need_resched = 'N'; > break; > + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: > + need_resched = 'L'; > + break; > + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY: > + need_resched = 'b'; > + break; > case TRACE_FLAG_NEED_RESCHED: > need_resched = 'n'; > break; > + case TRACE_FLAG_NEED_RESCHED_LAZY: > + need_resched = 'l'; > + break; > case TRACE_FLAG_PREEMPT_RESCHED: > need_resched = 'p'; > break; > --- a/kernel/sched/debug.c > +++ b/kernel/sched/debug.c > @@ -333,6 +333,23 @@ static const struct file_operations sche > .release = seq_release, > }; > > +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf, > + size_t cnt, loff_t *ppos) > +{ > + unsigned long end = jiffies + 60 * HZ; > + > + for (; time_before(jiffies, end) && !signal_pending(current);) > + cpu_relax(); > + > + return cnt; > +} > + > +static const struct file_operations sched_hog_fops = { > + .write = sched_hog_write, > + .open = simple_open, > + .llseek = default_llseek, > +}; > + > static struct dentry *debugfs_sched; > > static __init int sched_init_debug(void) > @@ -374,6 +391,8 @@ static __init int sched_init_debug(void) > > debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); > > + debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops); > + > return 0; > } > late_initcall(sched_init_debug); >
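To make the escalation path in the patch above easier to follow, here is a minimal stand-alone sketch of the two-level reschedule request: resched_curr_lazy() only sets the lazy bit, and a later tick that still finds that bit set (the test_tsk_thread_flag() check in update_deadline()) escalates to a real TIF_NEED_RESCHED. The names mirror the patch, but this is plain userspace C with the TIF_* constants collapsed to bit masks and an invented main() loop, not kernel code:

	/*
	 * Stand-alone model of the lazy/forced reschedule escalation:
	 * resched_curr_lazy() only sets the lazy bit; the next tick that
	 * still finds the lazy bit set escalates to a real
	 * TIF_NEED_RESCHED. TIF_* are collapsed to bit masks here for
	 * brevity (in the kernel they are bit numbers with separate
	 * _TIF_* masks).
	 */
	#include <stdio.h>

	#define TIF_NEED_RESCHED	0x1u
	#define TIF_NEED_RESCHED_LAZY	0x2u

	static unsigned int tif_flags;	/* models current's thread flags */

	static void resched_curr(void)
	{
		/* The real request: honored at the next preemption point. */
		tif_flags |= TIF_NEED_RESCHED;
	}

	static void resched_curr_lazy(void)
	{
		/* Lazy request: only acted upon at return to user space. */
		if (!(tif_flags & TIF_NEED_RESCHED))
			tif_flags |= TIF_NEED_RESCHED_LAZY;
	}

	/* Models the update_deadline() path once the slice is used up. */
	static void tick_deadline_expired(void)
	{
		if (tif_flags & TIF_NEED_RESCHED_LAZY)
			resched_curr();		/* lazy request was ignored */
		else
			resched_curr_lazy();
	}

	int main(void)
	{
		for (int t = 1; t <= 2; t++) {
			tick_deadline_expired();
			printf("tick %d: lazy=%d forced=%d\n", t,
			       !!(tif_flags & TIF_NEED_RESCHED_LAZY),
			       !!(tif_flags & TIF_NEED_RESCHED));
		}
		return 0;
	}

The first tick only requests lazy preemption; the second tick, seeing the request ignored, forces the reschedule, which is the behaviour the hog test below is meant to demonstrate.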
Paul E. McKenney <paulmck@kernel.org> writes: > On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote: >> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote: >> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote: >> >> That said - I think as a proof of concept and "look, with this we get >> >> the expected scheduling event counts", that patch is perfect. I think >> >> you more than proved the concept. >> > >> > There is certainly quite some analyis work to do to make this a one to >> > one replacement. >> > >> > With a handful of benchmarks the PoC (tweaked with some obvious fixes) >> > is pretty much on par with the current mainline variants (NONE/FULL), >> > but the memtier benchmark makes a massive dent. >> > >> > It sports a whopping 10% regression with the LAZY mode versus the mainline >> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way. >> > >> > That benchmark is really sensitive to the preemption model. With current >> > mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20% >> > performance drop versus preempt=NONE. >> >> That 20% was a tired pilot error. The real number is in the 5% ballpark. >> >> > I have no clue what's going on there yet, but that shows that there is >> > obviously quite some work ahead to get this sorted. >> >> It took some head scratching to figure that out. The initial fix broke >> the handling of the hog issue, i.e. the problem that Ankur tried to >> solve, but I hacked up a "solution" for that too. >> >> With that the memtier benchmark is roughly back to the mainline numbers, >> but my throughput benchmark know how is pretty close to zero, so that >> should be looked at by people who actually understand these things. >> >> Likewise the hog prevention is just at the PoC level and clearly beyond >> my knowledge of scheduler details: It unconditionally forces a >> reschedule when the looping task is not responding to a lazy reschedule >> request before the next tick. IOW it forces a reschedule on the second >> tick, which is obviously different from the cond_resched()/might_sleep() >> behaviour. >> >> The changes vs. the original PoC aside of the bug and thinko fixes: >> >> 1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the >> lazy preempt bit as the trace_entry::flags field is full already. >> >> That obviously breaks the tracer ABI, but if we go there then >> this needs to be fixed. Steven? >> >> 2) debugfs file to validate that loops can be force preempted w/o >> cond_resched() >> >> The usage is: >> >> # taskset -c 1 bash >> # echo 1 > /sys/kernel/debug/sched/hog & >> # echo 1 > /sys/kernel/debug/sched/hog & >> # echo 1 > /sys/kernel/debug/sched/hog & >> >> top shows ~33% CPU for each of the hogs and tracing confirms that >> the crude hack in the scheduler tick works: >> >> bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr >> bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr >> bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr >> bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr >> bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr >> bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr >> bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr >> bash-4561 [001] dlh2. 2253.389199: resched_curr <-__update_curr >> >> The 'l' instead of the usual 'N' reflects that the lazy resched >> bit is set. That makes __update_curr() invoke resched_curr() >> instead of the lazy variant. 
resched_curr() sets TIF_NEED_RESCHED >> and folds it into preempt_count so that preemption happens at the >> next possible point, i.e. either in return from interrupt or at >> the next preempt_enable(). > > Belatedly calling out some RCU issues. Nothing fatal, just a > (surprisingly) few adjustments that will need to be made. The key thing > to note is that from RCU's viewpoint, with this change, all kernels > are preemptible, though rcu_read_lock() readers remain non-preemptible. Yeah, in Thomas' patch CONFIG_PREEMPTION=y and preemption models none/voluntary/full are just scheduler tweaks on top of that. And, so this would always have PREEMPT_RCU=y. So, shouldn't rcu_read_lock() readers be preemptible? (An alternate configuration might be: config PREEMPT_NONE select PREEMPT_COUNT config PREEMPT_FULL select PREEMPTION This probably allows for more configuration flexibility across archs? Would allow for TREE_RCU=y, for instance. That said, so far I've only been working with PREEMPT_RCU=y.) > With that: > > 1. As an optimization, given that preempt_count() would always give > good information, the scheduling-clock interrupt could sense RCU > readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the > IPI handlers for expedited grace periods. A nice optimization. > Except that... > > 2. The quiescent-state-forcing code currently relies on the presence > of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix > would be to do resched_cpu() more quickly, but some workloads > might not love the additional IPIs. Another approach to do #1 > above to replace the quiescent states from cond_resched() with > scheduler-tick-interrupt-sensed quiescent states. Right, the call to rcu_all_qs(). Just to see if I have it straight, something like this for PREEMPT_RCU=n kernels? if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) rcu_all_qs(); (Masked because PREEMPT_NONE might not do any folding for NEED_RESCHED_LAZY in the tick.) Though the comment around rcu_all_qs() mentions that rcu_all_qs() reports a quiescent state only if urgently needed. Given that the tick executes less frequently than calls to cond_resched(), could we just always report instead? Or I'm completely on the wrong track? if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) { preempt_disable(); rcu_qs(); preempt_enable(); } On your point about the preempt_count() being dependable, there's a wrinkle. As Linus mentions in https://lore.kernel.org/lkml/CAHk-=wgUimqtF7PqFfRw4Ju5H1KYkp6+8F=hBz7amGQ8GaGKkA@mail.gmail.com/, that might not be true for architectures that define ARCH_NO_PREEMPT. My plan was to limit those archs to do preemption only at user space boundary but there are almost certainly RCU implications that I missed. > Plus... > > 3. For nohz_full CPUs that run for a long time in the kernel, > there are no scheduling-clock interrupts. RCU reaches for > the resched_cpu() hammer a few jiffies into the grace period. > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's > interrupt-entry code will re-enable its scheduling-clock interrupt > upon receiving the resched_cpu() IPI. > > So nohz_full CPUs should be OK as far as RCU is concerned. > Other subsystems might have other opinions. Ah, that's what I thought from my reading of the RCU comments. Good to have that confirmed. Thanks. > 4. As another optimization, kvfree_rcu() could unconditionally > check preempt_count() to sense a clean environment suitable for > memory allocation. Had missed this completely. Could you elaborate? > 5. 
Kconfig files with "select TASKS_RCU if PREEMPTION" must > instead say "select TASKS_RCU". This means that the #else > in include/linux/rcupdate.h that defines TASKS_RCU in terms of > vanilla RCU must go. There might be be some fallout if something > fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y, > and expects call_rcu_tasks(), synchronize_rcu_tasks(), or > rcu_tasks_classic_qs() do do something useful. Ack. > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace > or RCU Tasks Rude) would need those pesky cond_resched() calls > to stick around. The reason is that RCU Tasks readers are ended > only by voluntary context switches. This means that although a > preemptible infinite loop in the kernel won't inconvenience a > real-time task (nor an non-real-time task for all that long), > and won't delay grace periods for the other flavors of RCU, > it would indefinitely delay an RCU Tasks grace period. > > However, RCU Tasks grace periods seem to be finite in preemptible > kernels today, so they should remain finite in limited-preemptible > kernels tomorrow. Famous last words... > > 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice > any algorithmic difference from this change. So, essentially, as long as RCU tasks eventually, in the fullness of time, call schedule(), removing cond_resched() shouldn't have any effect :). > 8. As has been noted elsewhere, in this new limited-preemption > mode of operation, rcu_read_lock() readers remain preemptible. > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain. Ack. > 9. The rcu_preempt_depth() macro could do something useful in > limited-preemption kernels. Its current lack of ability in > CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past. > > 10. The cond_resched_rcu() function must remain because we still > have non-preemptible rcu_read_lock() readers. For configurations with PREEMPT_RCU=n? Yes, agreed. Though it need only be this, right?: static inline void cond_resched_rcu(void) { #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU) rcu_read_unlock(); rcu_read_lock(); #endif } > 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains > unchanged, but I must defer to the include/net/ip_vs.h people. > > 12. I need to check with the BPF folks on the BPF verifier's > definition of BTF_ID(func, rcu_read_unlock_strict). > > 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner() > function might have some redundancy across the board instead > of just on CONFIG_PREEMPT_RCU=y. Or might not. I don't think I understand any of these well enough to comment. Will Cc the relevant folks when I send out the RFC. > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function > might need to do something for non-preemptible RCU to make > up for the lack of cond_resched() calls. Maybe just drop the > "IS_ENABLED()" and execute the body of the current "if" statement > unconditionally. Aah, yes this is a good idea. Thanks. > 15. I must defer to others on the mm/pgtable-generic.c file's > #ifdef that depends on CONFIG_PREEMPT_RCU. > > While in the area, I noted that KLP seems to depend on cond_resched(), > but on this I must defer to the KLP people. Yeah, as part of this work, I ended up unhooking most of the KLP hooks in cond_resched() and of course, cond_resched() itself. Will poke the livepatching people. > I am sure that I am missing something, but I have not yet seen any > show-stoppers. Just some needed adjustments. Appreciate this detailed list. 
Makes me think that everything might not go up in smoke after all! Thanks Ankur > Thoughts? > > Thanx, Paul > >> That's as much as I wanted to demonstrate and I'm not going to spend >> more cycles on it as I have already too many other things on flight and >> the resulting scheduler woes are clearly outside of my expertice. >> >> Though definitely I'm putting a permanent NAK in place for any attempts >> to duct tape the preempt=NONE model any further by sprinkling more >> cond*() and whatever warts around. >> >> Thanks, >> >> tglx >> --- >> arch/x86/Kconfig | 1 >> arch/x86/include/asm/thread_info.h | 6 ++-- >> drivers/acpi/processor_idle.c | 2 - >> include/linux/entry-common.h | 2 - >> include/linux/entry-kvm.h | 2 - >> include/linux/sched.h | 12 +++++--- >> include/linux/sched/idle.h | 8 ++--- >> include/linux/thread_info.h | 24 +++++++++++++++++ >> include/linux/trace_events.h | 8 ++--- >> kernel/Kconfig.preempt | 17 +++++++++++- >> kernel/entry/common.c | 4 +- >> kernel/entry/kvm.c | 2 - >> kernel/sched/core.c | 51 +++++++++++++++++++++++++------------ >> kernel/sched/debug.c | 19 +++++++++++++ >> kernel/sched/fair.c | 46 ++++++++++++++++++++++----------- >> kernel/sched/features.h | 2 + >> kernel/sched/idle.c | 3 -- >> kernel/sched/sched.h | 1 >> kernel/trace/trace.c | 2 + >> kernel/trace/trace_output.c | 16 ++++++++++- >> 20 files changed, 171 insertions(+), 57 deletions(-) >> >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct >> >> #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG) >> /* >> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG, >> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG, >> * this avoids any races wrt polling state changes and thereby avoids >> * spurious IPIs. >> */ >> -static inline bool set_nr_and_not_polling(struct task_struct *p) >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) >> { >> struct thread_info *ti = task_thread_info(p); >> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG); >> + >> + return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG); >> } >> >> /* >> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas >> for (;;) { >> if (!(val & _TIF_POLLING_NRFLAG)) >> return false; >> - if (val & _TIF_NEED_RESCHED) >> + if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) >> return true; >> if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED)) >> break; >> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas >> } >> >> #else >> -static inline bool set_nr_and_not_polling(struct task_struct *p) >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) >> { >> - set_tsk_need_resched(p); >> + set_tsk_thread_flag(p, tif_bit); >> return true; >> } >> >> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head) >> * might also involve a cross-CPU call to trigger the scheduler on >> * the target CPU. 
>> */ >> -void resched_curr(struct rq *rq) >> +static void __resched_curr(struct rq *rq, int lazy) >> { >> + int cpu, tif_bit = TIF_NEED_RESCHED + lazy; >> struct task_struct *curr = rq->curr; >> - int cpu; >> >> lockdep_assert_rq_held(rq); >> >> - if (test_tsk_need_resched(curr)) >> + if (unlikely(test_tsk_thread_flag(curr, tif_bit))) >> return; >> >> cpu = cpu_of(rq); >> >> if (cpu == smp_processor_id()) { >> - set_tsk_need_resched(curr); >> - set_preempt_need_resched(); >> + set_tsk_thread_flag(curr, tif_bit); >> + if (!lazy) >> + set_preempt_need_resched(); >> return; >> } >> >> - if (set_nr_and_not_polling(curr)) >> - smp_send_reschedule(cpu); >> - else >> + if (set_nr_and_not_polling(curr, tif_bit)) { >> + if (!lazy) >> + smp_send_reschedule(cpu); >> + } else { >> trace_sched_wake_idle_without_ipi(cpu); >> + } >> +} >> + >> +void resched_curr(struct rq *rq) >> +{ >> + __resched_curr(rq, 0); >> +} >> + >> +void resched_curr_lazy(struct rq *rq) >> +{ >> + int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ? >> + TIF_NEED_RESCHED_LAZY_OFFSET : 0; >> + >> + if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED))) >> + return; >> + >> + __resched_curr(rq, lazy); >> } >> >> void resched_cpu(int cpu) >> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu) >> if (cpu == smp_processor_id()) >> return; >> >> - if (set_nr_and_not_polling(rq->idle)) >> + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED)) >> smp_send_reschedule(cpu); >> else >> trace_sched_wake_idle_without_ipi(cpu); >> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init( >> WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \ >> return preempt_dynamic_mode == preempt_dynamic_##mode; \ >> } \ >> - EXPORT_SYMBOL_GPL(preempt_model_##mode) >> >> PREEMPT_MODEL_ACCESSOR(none); >> PREEMPT_MODEL_ACCESSOR(voluntary); >> --- a/include/linux/thread_info.h >> +++ b/include/linux/thread_info.h >> @@ -59,6 +59,16 @@ enum syscall_work_bit { >> >> #include <asm/thread_info.h> >> >> +#ifdef CONFIG_PREEMPT_AUTO >> +# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY >> +# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY >> +# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED) >> +#else >> +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED >> +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED >> +# define TIF_NEED_RESCHED_LAZY_OFFSET 0 >> +#endif >> + >> #ifdef __KERNEL__ >> >> #ifndef arch_set_restart_data >> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res >> (unsigned long *)(¤t_thread_info()->flags)); >> } >> >> +static __always_inline bool tif_need_resched_lazy(void) >> +{ >> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && >> + arch_test_bit(TIF_NEED_RESCHED_LAZY, >> + (unsigned long *)(¤t_thread_info()->flags)); >> +} >> + >> #else >> >> static __always_inline bool tif_need_resched(void) >> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res >> (unsigned long *)(¤t_thread_info()->flags)); >> } >> >> +static __always_inline bool tif_need_resched_lazy(void) >> +{ >> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && >> + test_bit(TIF_NEED_RESCHED_LAZY, >> + (unsigned long *)(¤t_thread_info()->flags)); >> +} >> + >> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */ >> >> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES >> --- a/kernel/Kconfig.preempt >> +++ b/kernel/Kconfig.preempt >> @@ -11,6 +11,13 @@ config PREEMPT_BUILD >> select PREEMPTION >> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK >> >> +config 
PREEMPT_BUILD_AUTO >> + bool >> + select PREEMPT_BUILD >> + >> +config HAVE_PREEMPT_AUTO >> + bool >> + >> choice >> prompt "Preemption Model" >> default PREEMPT_NONE >> @@ -67,9 +74,17 @@ config PREEMPT >> embedded system with latency requirements in the milliseconds >> range. >> >> +config PREEMPT_AUTO >> + bool "Automagic preemption mode with runtime tweaking support" >> + depends on HAVE_PREEMPT_AUTO >> + select PREEMPT_BUILD_AUTO >> + help >> + Add some sensible blurb here >> + >> config PREEMPT_RT >> bool "Fully Preemptible Kernel (Real-Time)" >> depends on EXPERT && ARCH_SUPPORTS_RT >> + select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO >> select PREEMPTION >> help >> This option turns the kernel into a real-time kernel by replacing >> @@ -95,7 +110,7 @@ config PREEMPTION >> >> config PREEMPT_DYNAMIC >> bool "Preemption behaviour defined on boot" >> - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT >> + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO >> select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY >> select PREEMPT_BUILD >> default y if HAVE_PREEMPT_DYNAMIC_CALL >> --- a/include/linux/entry-common.h >> +++ b/include/linux/entry-common.h >> @@ -60,7 +60,7 @@ >> #define EXIT_TO_USER_MODE_WORK \ >> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ >> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \ >> - ARCH_EXIT_TO_USER_MODE_WORK) >> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK) >> >> /** >> * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs >> --- a/include/linux/entry-kvm.h >> +++ b/include/linux/entry-kvm.h >> @@ -18,7 +18,7 @@ >> >> #define XFER_TO_GUEST_MODE_WORK \ >> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \ >> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK) >> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK) >> >> struct kvm_vcpu; >> >> --- a/kernel/entry/common.c >> +++ b/kernel/entry/common.c >> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l >> >> local_irq_enable_exit_to_user(ti_work); >> >> - if (ti_work & _TIF_NEED_RESCHED) >> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) >> schedule(); >> >> if (ti_work & _TIF_UPROBE) >> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void >> rcu_irq_exit_check_preempt(); >> if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) >> WARN_ON_ONCE(!on_thread_stack()); >> - if (need_resched()) >> + if (test_tsk_need_resched(current)) >> preempt_schedule_irq(); >> } >> } >> --- a/kernel/sched/features.h >> +++ b/kernel/sched/features.h >> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) >> SCHED_FEAT(LATENCY_WARN, false) >> >> SCHED_FEAT(HZ_BW, true) >> + >> +SCHED_FEAT(FORCE_NEED_RESCHED, false) >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void); >> extern void reweight_task(struct task_struct *p, int prio); >> >> extern void resched_curr(struct rq *rq); >> +extern void resched_curr_lazy(struct rq *rq); >> extern void resched_cpu(int cpu); >> >> extern struct rt_bandwidth def_rt_bandwidth; >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla >> update_ti_thread_flag(task_thread_info(tsk), flag, value); >> } >> >> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) >> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) >> { >> return test_and_set_ti_thread_flag(task_thread_info(tsk), 
flag); >> } >> >> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) >> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) >> { >> return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag); >> } >> >> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag) >> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag) >> { >> return test_ti_thread_flag(task_thread_info(tsk), flag); >> } >> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched( >> static inline void clear_tsk_need_resched(struct task_struct *tsk) >> { >> clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED); >> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO)) >> + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY); >> } >> >> -static inline int test_tsk_need_resched(struct task_struct *tsk) >> +static inline bool test_tsk_need_resched(struct task_struct *tsk) >> { >> return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); >> } >> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc >> >> static __always_inline bool need_resched(void) >> { >> - return unlikely(tif_need_resched()); >> + return unlikely(tif_need_resched_lazy() || tif_need_resched()); >> } >> >> /* >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq >> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i >> * this is probably good enough. >> */ >> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) >> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick) >> { >> + struct rq *rq = rq_of(cfs_rq); >> + >> if ((s64)(se->vruntime - se->deadline) < 0) >> return; >> >> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r >> /* >> * The task has consumed its request, reschedule. >> */ >> - if (cfs_rq->nr_running > 1) { >> - resched_curr(rq_of(cfs_rq)); >> - clear_buddies(cfs_rq, se); >> + if (cfs_rq->nr_running < 2) >> + return; >> + >> + if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) { >> + resched_curr(rq); >> + } else { >> + /* Did the task ignore the lazy reschedule request? */ >> + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) >> + resched_curr(rq); >> + else >> + resched_curr_lazy(rq); >> } >> + clear_buddies(cfs_rq, se); >> } >> >> #include "pelt.h" >> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf >> /* >> * Update the current task's runtime statistics. >> */ >> -static void update_curr(struct cfs_rq *cfs_rq) >> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick) >> { >> struct sched_entity *curr = cfs_rq->curr; >> u64 now = rq_clock_task(rq_of(cfs_rq)); >> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c >> schedstat_add(cfs_rq->exec_clock, delta_exec); >> >> curr->vruntime += calc_delta_fair(delta_exec, curr); >> - update_deadline(cfs_rq, curr); >> + update_deadline(cfs_rq, curr, tick); >> update_min_vruntime(cfs_rq); >> >> if (entity_is_task(curr)) { >> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c >> account_cfs_rq_runtime(cfs_rq, delta_exec); >> } >> >> +static inline void update_curr(struct cfs_rq *cfs_rq) >> +{ >> + __update_curr(cfs_rq, false); >> +} >> + >> static void update_curr_fair(struct rq *rq) >> { >> update_curr(cfs_rq_of(&rq->curr->se)); >> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc >> /* >> * Update run-time statistics of the 'current'. 
>> */ >> - update_curr(cfs_rq); >> + __update_curr(cfs_rq, true); >> >> /* >> * Ensure that runnable average is periodically updated. >> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc >> * validating it and just reschedule. >> */ >> if (queued) { >> - resched_curr(rq_of(cfs_rq)); >> + resched_curr_lazy(rq_of(cfs_rq)); >> return; >> } >> /* >> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str >> * hierarchy can be throttled >> */ >> if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) >> - resched_curr(rq_of(cfs_rq)); >> + resched_curr_lazy(rq_of(cfs_rq)); >> } >> >> static __always_inline >> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf >> >> /* Determine whether we need to wake up potentially idle CPU: */ >> if (rq->curr == rq->idle && rq->cfs.nr_running) >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> } >> >> #ifdef CONFIG_SMP >> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq >> >> if (delta < 0) { >> if (task_current(rq, p)) >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> return; >> } >> hrtick_start(rq, delta); >> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct >> * prevents us from potentially nominating it as a false LAST_BUDDY >> * below. >> */ >> - if (test_tsk_need_resched(curr)) >> + if (need_resched()) >> return; >> >> /* Idle tasks are by definition preempted by non-idle tasks. */ >> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct >> return; >> >> preempt: >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> } >> >> #ifdef CONFIG_SMP >> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct >> */ >> if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 && >> __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE)) >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> } >> >> /* >> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct >> */ >> if (task_current(rq, p)) { >> if (p->prio > oldprio) >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> } else >> check_preempt_curr(rq, p, 0); >> } >> --- a/drivers/acpi/processor_idle.c >> +++ b/drivers/acpi/processor_idle.c >> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces >> */ >> static void __cpuidle acpi_safe_halt(void) >> { >> - if (!tif_need_resched()) { >> + if (!need_resched()) { >> raw_safe_halt(); >> raw_local_irq_disable(); >> } >> --- a/include/linux/sched/idle.h >> +++ b/include/linux/sched/idle.h >> @@ -63,7 +63,7 @@ static __always_inline bool __must_check >> */ >> smp_mb__after_atomic(); >> >> - return unlikely(tif_need_resched()); >> + return unlikely(need_resched()); >> } >> >> static __always_inline bool __must_check current_clr_polling_and_test(void) >> @@ -76,7 +76,7 @@ static __always_inline bool __must_check >> */ >> smp_mb__after_atomic(); >> >> - return unlikely(tif_need_resched()); >> + return unlikely(need_resched()); >> } >> >> #else >> @@ -85,11 +85,11 @@ static inline void __current_clr_polling >> >> static inline bool __must_check current_set_polling_and_test(void) >> { >> - return unlikely(tif_need_resched()); >> + return unlikely(need_resched()); >> } >> static inline bool __must_check current_clr_polling_and_test(void) >> { >> - return unlikely(tif_need_resched()); >> + return unlikely(need_resched()); >> } >> #endif >> >> --- a/kernel/sched/idle.c >> +++ b/kernel/sched/idle.c >> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p >> ct_cpuidle_enter(); >> >> raw_local_irq_enable(); >> - while (!tif_need_resched() && >> - 
(cpu_idle_force_poll || tick_check_broadcast_expired())) >> + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired())) >> cpu_relax(); >> raw_local_irq_disable(); >> >> --- a/kernel/trace/trace.c >> +++ b/kernel/trace/trace.c >> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un >> >> if (tif_need_resched()) >> trace_flags |= TRACE_FLAG_NEED_RESCHED; >> + if (tif_need_resched_lazy()) >> + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY; >> if (test_preempt_need_resched()) >> trace_flags |= TRACE_FLAG_PREEMPT_RESCHED; >> return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) | >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -271,6 +271,7 @@ config X86 >> select HAVE_STATIC_CALL >> select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL >> select HAVE_PREEMPT_DYNAMIC_CALL >> + select HAVE_PREEMPT_AUTO >> select HAVE_RSEQ >> select HAVE_RUST if X86_64 >> select HAVE_SYSCALL_TRACEPOINTS >> --- a/arch/x86/include/asm/thread_info.h >> +++ b/arch/x86/include/asm/thread_info.h >> @@ -81,8 +81,9 @@ struct thread_info { >> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */ >> #define TIF_SIGPENDING 2 /* signal pending */ >> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ >> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ >> -#define TIF_SSBD 5 /* Speculative store bypass disable */ >> +#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */ >> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/ >> +#define TIF_SSBD 6 /* Speculative store bypass disable */ >> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ >> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ >> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ >> @@ -104,6 +105,7 @@ struct thread_info { >> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) >> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) >> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) >> +#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY) >> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) >> #define _TIF_SSBD (1 << TIF_SSBD) >> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) >> --- a/kernel/entry/kvm.c >> +++ b/kernel/entry/kvm.c >> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc >> return -EINTR; >> } >> >> - if (ti_work & _TIF_NEED_RESCHED) >> + if (ti_work & (_TIF_NEED_RESCHED | TIF_NEED_RESCHED_LAZY)) >> schedule(); >> >> if (ti_work & _TIF_NOTIFY_RESUME) >> --- a/include/linux/trace_events.h >> +++ b/include/linux/trace_events.h >> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un >> >> enum trace_flag_type { >> TRACE_FLAG_IRQS_OFF = 0x01, >> - TRACE_FLAG_IRQS_NOSUPPORT = 0x02, >> - TRACE_FLAG_NEED_RESCHED = 0x04, >> + TRACE_FLAG_NEED_RESCHED = 0x02, >> + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04, >> TRACE_FLAG_HARDIRQ = 0x08, >> TRACE_FLAG_SOFTIRQ = 0x10, >> TRACE_FLAG_PREEMPT_RESCHED = 0x20, >> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c >> >> static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags) >> { >> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); >> + return tracing_gen_ctx_irq_test(0); >> } >> static inline unsigned int tracing_gen_ctx(void) >> { >> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); >> + return tracing_gen_ctx_irq_test(0); >> } >> #endif >> >> --- a/kernel/trace/trace_output.c >> +++ b/kernel/trace/trace_output.c >> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq >> (entry->flags & 
TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' : >> (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : >> bh_off ? 'b' : >> - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' : >> + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' : >> '.'; >> >> - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | >> + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | >> TRACE_FLAG_PREEMPT_RESCHED)) { >> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: >> + need_resched = 'B'; >> + break; >> case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED: >> need_resched = 'N'; >> break; >> + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: >> + need_resched = 'L'; >> + break; >> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY: >> + need_resched = 'b'; >> + break; >> case TRACE_FLAG_NEED_RESCHED: >> need_resched = 'n'; >> break; >> + case TRACE_FLAG_NEED_RESCHED_LAZY: >> + need_resched = 'l'; >> + break; >> case TRACE_FLAG_PREEMPT_RESCHED: >> need_resched = 'p'; >> break; >> --- a/kernel/sched/debug.c >> +++ b/kernel/sched/debug.c >> @@ -333,6 +333,23 @@ static const struct file_operations sche >> .release = seq_release, >> }; >> >> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf, >> + size_t cnt, loff_t *ppos) >> +{ >> + unsigned long end = jiffies + 60 * HZ; >> + >> + for (; time_before(jiffies, end) && !signal_pending(current);) >> + cpu_relax(); >> + >> + return cnt; >> +} >> + >> +static const struct file_operations sched_hog_fops = { >> + .write = sched_hog_write, >> + .open = simple_open, >> + .llseek = default_llseek, >> +}; >> + >> static struct dentry *debugfs_sched; >> >> static __init int sched_init_debug(void) >> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void) >> >> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); >> >> + debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops); >> + >> return 0; >> } >> late_initcall(sched_init_debug); >> -- ankur
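The hog trace quoted earlier in the thread ("dlh2." with an 'l' in the need-resched position) is easier to read with the extended latency-format letters in mind. The stand-alone helper below reproduces the decoding the patch adds to trace_print_lat_fmt(); the flag values match the patched enum trace_flag_type, while the helper function and the demo main() are only illustrative:

	/*
	 * Decoder for the need-resched letter in the trace latency format
	 * as extended by the PoC: 'n'/'N' are the classic immediate
	 * request, 'l'/'L' the lazy one, 'b'/'B' both bits at once.
	 */
	#include <stdio.h>

	#define TRACE_FLAG_NEED_RESCHED	0x02
	#define TRACE_FLAG_NEED_RESCHED_LAZY	0x04
	#define TRACE_FLAG_PREEMPT_RESCHED	0x20

	static char need_resched_char(unsigned int flags)
	{
		switch (flags & (TRACE_FLAG_NEED_RESCHED |
				 TRACE_FLAG_NEED_RESCHED_LAZY |
				 TRACE_FLAG_PREEMPT_RESCHED)) {
		case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
		     TRACE_FLAG_PREEMPT_RESCHED:
			return 'B';
		case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
			return 'N';
		case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
			return 'L';
		case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
			return 'b';
		case TRACE_FLAG_NEED_RESCHED:
			return 'n';
		case TRACE_FLAG_NEED_RESCHED_LAZY:
			return 'l';
		case TRACE_FLAG_PREEMPT_RESCHED:
			return 'p';
		default:
			return '.';
		}
	}

	int main(void)
	{
		/* The 'l' seen in the hog trace: only the lazy bit is set. */
		printf("%c\n", need_resched_char(TRACE_FLAG_NEED_RESCHED_LAZY));
		return 0;
	}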
Paul! On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: > Belatedly calling out some RCU issues. Nothing fatal, just a > (surprisingly) few adjustments that will need to be made. The key thing > to note is that from RCU's viewpoint, with this change, all kernels > are preemptible, though rcu_read_lock() readers remain > non-preemptible. Why? Either I'm confused or you or both of us :) With this approach the kernel is by definition fully preemptible, which means means rcu_read_lock() is preemptible too. That's pretty much the same situation as with PREEMPT_DYNAMIC. For throughput sake this fully preemptible kernel provides a mechanism to delay preemption for SCHED_OTHER tasks, i.e. instead of setting NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY. That means the preemption points in preempt_enable() and return from interrupt to kernel will not see NEED_RESCHED and the tasks can run to completion either to the point where they call schedule() or when they return to user space. That's pretty much what PREEMPT_NONE does today. The difference to NONE/VOLUNTARY is that the explicit cond_resched() points are not longer required because the scheduler can preempt the long running task by setting NEED_RESCHED instead. That preemption might be suboptimal in some cases compared to cond_resched(), but from my initial experimentation that's not really an issue. > With that: > > 1. As an optimization, given that preempt_count() would always give > good information, the scheduling-clock interrupt could sense RCU > readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the > IPI handlers for expedited grace periods. A nice optimization. > Except that... > > 2. The quiescent-state-forcing code currently relies on the presence > of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix > would be to do resched_cpu() more quickly, but some workloads > might not love the additional IPIs. Another approach to do #1 > above to replace the quiescent states from cond_resched() with > scheduler-tick-interrupt-sensed quiescent states. Right. The tick can see either the lazy resched bit "ignored" or some magic "RCU needs a quiescent state" and force a reschedule. > Plus... > > 3. For nohz_full CPUs that run for a long time in the kernel, > there are no scheduling-clock interrupts. RCU reaches for > the resched_cpu() hammer a few jiffies into the grace period. > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's > interrupt-entry code will re-enable its scheduling-clock interrupt > upon receiving the resched_cpu() IPI. You can spare the IPI by setting NEED_RESCHED on the remote CPU which will cause it to preempt. > So nohz_full CPUs should be OK as far as RCU is concerned. > Other subsystems might have other opinions. > > 4. As another optimization, kvfree_rcu() could unconditionally > check preempt_count() to sense a clean environment suitable for > memory allocation. Correct. All the limitations of preempt count being useless are gone. > 5. Kconfig files with "select TASKS_RCU if PREEMPTION" must > instead say "select TASKS_RCU". This means that the #else > in include/linux/rcupdate.h that defines TASKS_RCU in terms of > vanilla RCU must go. There might be be some fallout if something > fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y, > and expects call_rcu_tasks(), synchronize_rcu_tasks(), or > rcu_tasks_classic_qs() do do something useful. In the end there is no CONFIG_PREEMPT_XXX anymore. 
The only knob remaining would be CONFIG_PREEMPT_RT, which should be renamed to CONFIG_RT or such as it does not really change the preemption model itself. RT just reduces the preemption disabled sections with the lock conversions, forced interrupt threading and some more. > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace > or RCU Tasks Rude) would need those pesky cond_resched() calls > to stick around. The reason is that RCU Tasks readers are ended > only by voluntary context switches. This means that although a > preemptible infinite loop in the kernel won't inconvenience a > real-time task (nor an non-real-time task for all that long), > and won't delay grace periods for the other flavors of RCU, > it would indefinitely delay an RCU Tasks grace period. > > However, RCU Tasks grace periods seem to be finite in preemptible > kernels today, so they should remain finite in limited-preemptible > kernels tomorrow. Famous last words... That's an issue which you have today with preempt FULL, right? So if it turns out to be a problem then it's not a problem of the new model. > 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice > any algorithmic difference from this change. > > 8. As has been noted elsewhere, in this new limited-preemption > mode of operation, rcu_read_lock() readers remain preemptible. > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain. Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no? > 9. The rcu_preempt_depth() macro could do something useful in > limited-preemption kernels. Its current lack of ability in > CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past. Correct. > 10. The cond_resched_rcu() function must remain because we still > have non-preemptible rcu_read_lock() readers. Where? > 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains > unchanged, but I must defer to the include/net/ip_vs.h people. *blink* > 12. I need to check with the BPF folks on the BPF verifier's > definition of BTF_ID(func, rcu_read_unlock_strict). > > 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner() > function might have some redundancy across the board instead > of just on CONFIG_PREEMPT_RCU=y. Or might not. > > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function > might need to do something for non-preemptible RCU to make > up for the lack of cond_resched() calls. Maybe just drop the > "IS_ENABLED()" and execute the body of the current "if" statement > unconditionally. Again. There is no non-preemtible RCU with this model, unless I'm missing something important here. > 15. I must defer to others on the mm/pgtable-generic.c file's > #ifdef that depends on CONFIG_PREEMPT_RCU. All those ifdefs should die :) > While in the area, I noted that KLP seems to depend on cond_resched(), > but on this I must defer to the KLP people. Yeah, KLP needs some thoughts, but that's not rocket science to fix IMO. > I am sure that I am missing something, but I have not yet seen any > show-stoppers. Just some needed adjustments. Right. If it works out as I think it can work out the main adjustments are to remove a large amount of #ifdef maze and related gunk :) Thanks, tglx
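A toy illustration of the tick-driven quiescent-state check discussed here, i.e. the preempt_count() test Ankur sketches and Thomas confirms the tick can use: with the preempt counter always maintained, the tick can treat the interrupted context as an RCU quiescent state when nothing but the folded NEED_RESCHED bit is set. The flat count below is a simplification (the real check would also have to mask out the hardirq/softirq nesting added on interrupt entry), and the whole harness is invented, not the asm-generic/preempt.h layout:

	/*
	 * Toy model: report a quiescent state from the tick only when the
	 * interrupted context holds no preemption or lock-related count,
	 * i.e. at most the folded NEED_RESCHED bit is set.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	#define PREEMPT_NEED_RESCHED	0x80000000u	/* modeled fold-in bit */

	static unsigned int count;			/* models preempt_count() */

	static bool tick_sees_quiescent_state(void)
	{
		return (count & ~PREEMPT_NEED_RESCHED) == 0;
	}

	int main(void)
	{
		printf("plain task context:  %d\n", tick_sees_quiescent_state());

		count += 1;				/* preempt_disable() */
		printf("preemption disabled: %d\n", tick_sees_quiescent_state());
		count -= 1;				/* preempt_enable() */

		count |= PREEMPT_NEED_RESCHED;		/* resched requested */
		printf("resched bit only:    %d\n", tick_sees_quiescent_state());
		return 0;
	}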
On Wed, 18 Oct 2023 15:16:12 +0200
Thomas Gleixner <tglx@linutronix.de> wrote:

> > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function
> > might need to do something for non-preemptible RCU to make
> > up for the lack of cond_resched() calls. Maybe just drop the
> > "IS_ENABLED()" and execute the body of the current "if" statement
> > unconditionally.

Right. I'm guessing you are talking about this code:

	/*
	 * In some cases, notably when running on a nohz_full CPU with
	 * a stopped tick PREEMPT_RCU has no way to account for QSs.
	 * This will eventually cause unwarranted noise as PREEMPT_RCU
	 * will force preemption as the means of ending the current
	 * grace period. We avoid this problem by calling
	 * rcu_momentary_dyntick_idle(), which performs a zero duration
	 * EQS allowing PREEMPT_RCU to end the current grace period.
	 * This call shouldn't be wrapped inside an RCU critical
	 * section.
	 *
	 * Note that in non PREEMPT_RCU kernels QSs are handled through
	 * cond_resched()
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
		if (!disable_irq)
			local_irq_disable();

		rcu_momentary_dyntick_idle();

		if (!disable_irq)
			local_irq_enable();
	}

	/*
	 * For the non-preemptive kernel config: let threads runs, if
	 * they so wish, unless set not do to so.
	 */
	if (!disable_irq && !disable_preemption)
		cond_resched();

If everything becomes PREEMPT_RCU, then the above should be able to be
turned into just:

	if (!disable_irq)
		local_irq_disable();

	rcu_momentary_dyntick_idle();

	if (!disable_irq)
		local_irq_enable();

And no cond_resched() is needed.

> 
> Again. There is no non-preemtible RCU with this model, unless I'm
> missing something important here.

Daniel?

-- Steve
On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: > Paul! > > On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: > > Belatedly calling out some RCU issues. Nothing fatal, just a > > (surprisingly) few adjustments that will need to be made. The key thing > > to note is that from RCU's viewpoint, with this change, all kernels > > are preemptible, though rcu_read_lock() readers remain > > non-preemptible. > > Why? Either I'm confused or you or both of us :) Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock() as preempt_enable() in this approach? I certainly hope so, as RCU priority boosting would be a most unwelcome addition to many datacenter workloads. > With this approach the kernel is by definition fully preemptible, which > means means rcu_read_lock() is preemptible too. That's pretty much the > same situation as with PREEMPT_DYNAMIC. Please, just no!!! Please note that the current use of PREEMPT_DYNAMIC with preempt=none avoids preempting RCU read-side critical sections. This means that the distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption of RCU readers in environments expecting no preemption. > For throughput sake this fully preemptible kernel provides a mechanism > to delay preemption for SCHED_OTHER tasks, i.e. instead of setting > NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY. > > That means the preemption points in preempt_enable() and return from > interrupt to kernel will not see NEED_RESCHED and the tasks can run to > completion either to the point where they call schedule() or when they > return to user space. That's pretty much what PREEMPT_NONE does today. > > The difference to NONE/VOLUNTARY is that the explicit cond_resched() > points are not longer required because the scheduler can preempt the > long running task by setting NEED_RESCHED instead. > > That preemption might be suboptimal in some cases compared to > cond_resched(), but from my initial experimentation that's not really an > issue. I am not (repeat NOT) arguing for keeping cond_resched(). I am instead arguing that the less-preemptible variants of the kernel should continue to avoid preempting RCU read-side critical sections. > > With that: > > > > 1. As an optimization, given that preempt_count() would always give > > good information, the scheduling-clock interrupt could sense RCU > > readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the > > IPI handlers for expedited grace periods. A nice optimization. > > Except that... > > > > 2. The quiescent-state-forcing code currently relies on the presence > > of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix > > would be to do resched_cpu() more quickly, but some workloads > > might not love the additional IPIs. Another approach to do #1 > > above to replace the quiescent states from cond_resched() with > > scheduler-tick-interrupt-sensed quiescent states. > > Right. The tick can see either the lazy resched bit "ignored" or some > magic "RCU needs a quiescent state" and force a reschedule. Good, thank you for confirming. > > Plus... > > > > 3. For nohz_full CPUs that run for a long time in the kernel, > > there are no scheduling-clock interrupts. RCU reaches for > > the resched_cpu() hammer a few jiffies into the grace period. > > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's > > interrupt-entry code will re-enable its scheduling-clock interrupt > > upon receiving the resched_cpu() IPI. 
> 
> You can spare the IPI by setting NEED_RESCHED on the remote CPU which
> will cause it to preempt.

That is not sufficient for nohz_full CPUs executing in userspace, which
won't see that NEED_RESCHED until they either take an interrupt or do a
system call.  And applications often work hard to prevent nohz_full CPUs
from doing either.

Please note that if the holdout CPU really is a nohz_full CPU executing
in userspace, RCU will see this courtesy of context tracking and will
therefore avoid ever IPIing it.  The IPIs only happen if a nohz_full CPU
ends up executing for a long time in the kernel, which is an error
condition for the nohz_full use cases that I am aware of.

> >     So nohz_full CPUs should be OK as far as RCU is concerned.
> >     Other subsystems might have other opinions.
> > 
> > 4.  As another optimization, kvfree_rcu() could unconditionally
> >     check preempt_count() to sense a clean environment suitable for
> >     memory allocation.
> 
> Correct. All the limitations of preempt count being useless are gone.

Woo-hoo!!!  And that is of course a very attractive property of this.

> > 5.  Kconfig files with "select TASKS_RCU if PREEMPTION" must
> >     instead say "select TASKS_RCU".  This means that the #else
> >     in include/linux/rcupdate.h that defines TASKS_RCU in terms of
> >     vanilla RCU must go.  There might be be some fallout if something
> >     fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y,
> >     and expects call_rcu_tasks(), synchronize_rcu_tasks(), or
> >     rcu_tasks_classic_qs() do do something useful.
> 
> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob
> remaining would be CONFIG_PREEMPT_RT, which should be renamed to
> CONFIG_RT or such as it does not really change the preemption
> model itself. RT just reduces the preemption disabled sections with the
> lock conversions, forced interrupt threading and some more.

Again, please, no.

There are situations where we still need rcu_read_lock() and
rcu_read_unlock() to be preempt_disable() and preempt_enable(),
respectively.  Those can be cases selected only by Kconfig option, not
available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y.

> > 6.  You might think that RCU Tasks (as opposed to RCU Tasks Trace
> >     or RCU Tasks Rude) would need those pesky cond_resched() calls
> >     to stick around.  The reason is that RCU Tasks readers are ended
> >     only by voluntary context switches.  This means that although a
> >     preemptible infinite loop in the kernel won't inconvenience a
> >     real-time task (nor an non-real-time task for all that long),
> >     and won't delay grace periods for the other flavors of RCU,
> >     it would indefinitely delay an RCU Tasks grace period.
> > 
> >     However, RCU Tasks grace periods seem to be finite in preemptible
> >     kernels today, so they should remain finite in limited-preemptible
> >     kernels tomorrow.  Famous last words...
> 
> That's an issue which you have today with preempt FULL, right? So if it
> turns out to be a problem then it's not a problem of the new model.

Agreed, and hence my last three lines of text above.  Plus the guy who
requested RCU Tasks said that it was OK for its grace periods to take
a long time, and I am holding Steven Rostedt to that.  ;-)

> > 7.  RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice
> >     any algorithmic difference from this change.
> > 
> > 8.  As has been noted elsewhere, in this new limited-preemption
> >     mode of operation, rcu_read_lock() readers remain preemptible.
> >     This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain.
> 
> Why?
You fundamentally have a preemptible kernel with PREEMPT_RCU, no? That is in fact the problem. Preemption can be good, but it is possible to have too much of a good thing, and preemptible RCU read-side critical sections definitely is in that category for some important workloads. ;-) > > 9. The rcu_preempt_depth() macro could do something useful in > > limited-preemption kernels. Its current lack of ability in > > CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past. > > Correct. > > > 10. The cond_resched_rcu() function must remain because we still > > have non-preemptible rcu_read_lock() readers. > > Where? In datacenters. > > 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains > > unchanged, but I must defer to the include/net/ip_vs.h people. > > *blink* No argument here. ;-) > > 12. I need to check with the BPF folks on the BPF verifier's > > definition of BTF_ID(func, rcu_read_unlock_strict). > > > > 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner() > > function might have some redundancy across the board instead > > of just on CONFIG_PREEMPT_RCU=y. Or might not. > > > > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function > > might need to do something for non-preemptible RCU to make > > up for the lack of cond_resched() calls. Maybe just drop the > > "IS_ENABLED()" and execute the body of the current "if" statement > > unconditionally. > > Again. There is no non-preemtible RCU with this model, unless I'm > missing something important here. And again, there needs to be non-preemptible RCU with this model. > > 15. I must defer to others on the mm/pgtable-generic.c file's > > #ifdef that depends on CONFIG_PREEMPT_RCU. > > All those ifdefs should die :) Like all things, they will eventually. ;-) > > While in the area, I noted that KLP seems to depend on cond_resched(), > > but on this I must defer to the KLP people. > > Yeah, KLP needs some thoughts, but that's not rocket science to fix IMO. Not rocket science, just KLP science, which I am happy to defer to the KLP people. > > I am sure that I am missing something, but I have not yet seen any > > show-stoppers. Just some needed adjustments. > > Right. If it works out as I think it can work out the main adjustments > are to remove a large amount of #ifdef maze and related gunk :) Just please don't remove the #ifdef gunk that is still needed! Thanx, Paul
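
[ A minimal, self-contained sketch of the NEED_RESCHED_LAZY behaviour
  Thomas describes in the text quoted above: return-to-user honours either
  flag, while the in-kernel preemption points react only to the eager one.
  The helper names and bit values below are illustrative and are not taken
  from the actual patch. ]

#include <stdbool.h>

/* Illustrative mask values; the real patch reuses per-arch TIF_* bits. */
#define _TIF_NEED_RESCHED	(1UL << 3)
#define _TIF_NEED_RESCHED_LAZY	(1UL << 4)

/* Return to user space: either flag is reason enough to call schedule(). */
static inline bool exit_to_user_wants_resched(unsigned long ti_flags)
{
	return ti_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
}

/*
 * preempt_enable() and return-from-interrupt-to-kernel: only the eager
 * flag triggers preemption, so a SCHED_OTHER task that was only tagged
 * lazily keeps running until it schedules or returns to user space.
 */
static inline bool in_kernel_wants_resched(unsigned long ti_flags)
{
	return ti_flags & _TIF_NEED_RESCHED;
}

If the lazily tagged task ignores the hint for too long, the scheduler
tick escalates by setting the eager flag, which is the hog handling
visible in the resched_curr()/resched_curr_lazy() hunks quoted later in
this thread.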
On Wed, 18 Oct 2023 10:19:53 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock()
> as preempt_enable() in this approach?  I certainly hope so, as RCU
> priority boosting would be a most unwelcome addition to many datacenter
> workloads.
> 
> > With this approach the kernel is by definition fully preemptible, which
> > means means rcu_read_lock() is preemptible too. That's pretty much the
> > same situation as with PREEMPT_DYNAMIC.
> 
> Please, just no!!!

Note, when I first read Thomas's proposal, I figured that Paul would no
longer get to brag that:

  "In CONFIG_PREEMPT_NONE, rcu_read_lock() and rcu_read_unlock() are
   simply nops!"

But instead, they would be:

static void rcu_read_lock(void)
{
	preempt_disable();
}

static void rcu_read_unlock(void)
{
	preempt_enable();
}

as it was mentioned that today's preempt_disable() is fast and not an issue
like it was in older kernels.

That would mean that there will still be a "non preempt" version of RCU.

The preempt version of RCU adds a lot more logic when scheduling out in an
RCU critical section, which I can envision not all workloads would want
around. Adding "preempt_disable()" is now low overhead, but adding the RCU
logic to handle preemption isn't as lightweight as that.

Not to mention the logic to boost those threads that were preempted and
being starved for some time.

> > > 6.  You might think that RCU Tasks (as opposed to RCU Tasks Trace
> > >     or RCU Tasks Rude) would need those pesky cond_resched() calls
> > >     to stick around.  The reason is that RCU Tasks readers are ended
> > >     only by voluntary context switches.  This means that although a
> > >     preemptible infinite loop in the kernel won't inconvenience a
> > >     real-time task (nor an non-real-time task for all that long),
> > >     and won't delay grace periods for the other flavors of RCU,
> > >     it would indefinitely delay an RCU Tasks grace period.
> > > 
> > >     However, RCU Tasks grace periods seem to be finite in preemptible
> > >     kernels today, so they should remain finite in limited-preemptible
> > >     kernels tomorrow.  Famous last words...
> > 
> > That's an issue which you have today with preempt FULL, right? So if it
> > turns out to be a problem then it's not a problem of the new model.
> 
> Agreed, and hence my last three lines of text above.  Plus the guy who
> requested RCU Tasks said that it was OK for its grace periods to take
> a long time, and I am holding Steven Rostedt to that.  ;-)

Matters what your definition of "long time" is ;-)

-- Steve
On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote: > >> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote: > >> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote: > >> >> That said - I think as a proof of concept and "look, with this we get > >> >> the expected scheduling event counts", that patch is perfect. I think > >> >> you more than proved the concept. > >> > > >> > There is certainly quite some analyis work to do to make this a one to > >> > one replacement. > >> > > >> > With a handful of benchmarks the PoC (tweaked with some obvious fixes) > >> > is pretty much on par with the current mainline variants (NONE/FULL), > >> > but the memtier benchmark makes a massive dent. > >> > > >> > It sports a whopping 10% regression with the LAZY mode versus the mainline > >> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way. > >> > > >> > That benchmark is really sensitive to the preemption model. With current > >> > mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20% > >> > performance drop versus preempt=NONE. > >> > >> That 20% was a tired pilot error. The real number is in the 5% ballpark. > >> > >> > I have no clue what's going on there yet, but that shows that there is > >> > obviously quite some work ahead to get this sorted. > >> > >> It took some head scratching to figure that out. The initial fix broke > >> the handling of the hog issue, i.e. the problem that Ankur tried to > >> solve, but I hacked up a "solution" for that too. > >> > >> With that the memtier benchmark is roughly back to the mainline numbers, > >> but my throughput benchmark know how is pretty close to zero, so that > >> should be looked at by people who actually understand these things. > >> > >> Likewise the hog prevention is just at the PoC level and clearly beyond > >> my knowledge of scheduler details: It unconditionally forces a > >> reschedule when the looping task is not responding to a lazy reschedule > >> request before the next tick. IOW it forces a reschedule on the second > >> tick, which is obviously different from the cond_resched()/might_sleep() > >> behaviour. > >> > >> The changes vs. the original PoC aside of the bug and thinko fixes: > >> > >> 1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the > >> lazy preempt bit as the trace_entry::flags field is full already. > >> > >> That obviously breaks the tracer ABI, but if we go there then > >> this needs to be fixed. Steven? > >> > >> 2) debugfs file to validate that loops can be force preempted w/o > >> cond_resched() > >> > >> The usage is: > >> > >> # taskset -c 1 bash > >> # echo 1 > /sys/kernel/debug/sched/hog & > >> # echo 1 > /sys/kernel/debug/sched/hog & > >> # echo 1 > /sys/kernel/debug/sched/hog & > >> > >> top shows ~33% CPU for each of the hogs and tracing confirms that > >> the crude hack in the scheduler tick works: > >> > >> bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr > >> bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr > >> bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr > >> bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr > >> bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr > >> bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr > >> bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr > >> bash-4561 [001] dlh2. 
2253.389199: resched_curr <-__update_curr > >> > >> The 'l' instead of the usual 'N' reflects that the lazy resched > >> bit is set. That makes __update_curr() invoke resched_curr() > >> instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED > >> and folds it into preempt_count so that preemption happens at the > >> next possible point, i.e. either in return from interrupt or at > >> the next preempt_enable(). > > > > Belatedly calling out some RCU issues. Nothing fatal, just a > > (surprisingly) few adjustments that will need to be made. The key thing > > to note is that from RCU's viewpoint, with this change, all kernels > > are preemptible, though rcu_read_lock() readers remain non-preemptible. > > Yeah, in Thomas' patch CONFIG_PREEMPTION=y and preemption models > none/voluntary/full are just scheduler tweaks on top of that. And, so > this would always have PREEMPT_RCU=y. So, shouldn't rcu_read_lock() > readers be preemptible? > > (An alternate configuration might be: > config PREEMPT_NONE > select PREEMPT_COUNT > > config PREEMPT_FULL > select PREEMPTION > > This probably allows for more configuration flexibility across archs? > Would allow for TREE_RCU=y, for instance. That said, so far I've only > been working with PREEMPT_RCU=y.) Then this is a bug that needs to be fixed. We need a way to make RCU readers non-preemptible. > > With that: > > > > 1. As an optimization, given that preempt_count() would always give > > good information, the scheduling-clock interrupt could sense RCU > > readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the > > IPI handlers for expedited grace periods. A nice optimization. > > Except that... > > > > 2. The quiescent-state-forcing code currently relies on the presence > > of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix > > would be to do resched_cpu() more quickly, but some workloads > > might not love the additional IPIs. Another approach to do #1 > > above to replace the quiescent states from cond_resched() with > > scheduler-tick-interrupt-sensed quiescent states. > > Right, the call to rcu_all_qs(). Just to see if I have it straight, > something like this for PREEMPT_RCU=n kernels? > > if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) > rcu_all_qs(); > > (Masked because PREEMPT_NONE might not do any folding for > NEED_RESCHED_LAZY in the tick.) > > Though the comment around rcu_all_qs() mentions that rcu_all_qs() > reports a quiescent state only if urgently needed. Given that the tick > executes less frequently than calls to cond_resched(), could we just > always report instead? Or I'm completely on the wrong track? > > if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) { > preempt_disable(); > rcu_qs(); > preempt_enable(); > } > > On your point about the preempt_count() being dependable, there's a > wrinkle. As Linus mentions in > https://lore.kernel.org/lkml/CAHk-=wgUimqtF7PqFfRw4Ju5H1KYkp6+8F=hBz7amGQ8GaGKkA@mail.gmail.com/, > that might not be true for architectures that define ARCH_NO_PREEMPT. > > My plan was to limit those archs to do preemption only at user space boundary > but there are almost certainly RCU implications that I missed. 
Just add this to the "if" condition of the CONFIG_PREEMPT_RCU=n version of rcu_flavor_sched_clock_irq(): || !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) Resulting in something like this: ------------------------------------------------------------------------ static void rcu_flavor_sched_clock_irq(int user) { if (user || rcu_is_cpu_rrupt_from_idle() || !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) { /* * Get here if this CPU took its interrupt from user * mode or from the idle loop, and if this is not a nested * interrupt, or if the interrupt is from a preemptible * region of the kernel. In this case, the CPU is in a * quiescent state, so note it. * * No memory barrier is required here because rcu_qs() * references only CPU-local variables that other CPUs * neither access nor modify, at least not while the * corresponding CPU is online. */ rcu_qs(); } } ------------------------------------------------------------------------ > > Plus... > > > > 3. For nohz_full CPUs that run for a long time in the kernel, > > there are no scheduling-clock interrupts. RCU reaches for > > the resched_cpu() hammer a few jiffies into the grace period. > > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's > > interrupt-entry code will re-enable its scheduling-clock interrupt > > upon receiving the resched_cpu() IPI. > > > > So nohz_full CPUs should be OK as far as RCU is concerned. > > Other subsystems might have other opinions. > > Ah, that's what I thought from my reading of the RCU comments. Good to > have that confirmed. Thanks. > > > 4. As another optimization, kvfree_rcu() could unconditionally > > check preempt_count() to sense a clean environment suitable for > > memory allocation. > > Had missed this completely. Could you elaborate? It is just an optimization. But the idea is to use less restrictive GFP_ flags in add_ptr_to_bulk_krc_lock() when the caller's context allows it. Add Uladzislau on CC for his thoughts. > > 5. Kconfig files with "select TASKS_RCU if PREEMPTION" must > > instead say "select TASKS_RCU". This means that the #else > > in include/linux/rcupdate.h that defines TASKS_RCU in terms of > > vanilla RCU must go. There might be be some fallout if something > > fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y, > > and expects call_rcu_tasks(), synchronize_rcu_tasks(), or > > rcu_tasks_classic_qs() do do something useful. > > Ack. > > > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace > > or RCU Tasks Rude) would need those pesky cond_resched() calls > > to stick around. The reason is that RCU Tasks readers are ended > > only by voluntary context switches. This means that although a > > preemptible infinite loop in the kernel won't inconvenience a > > real-time task (nor an non-real-time task for all that long), > > and won't delay grace periods for the other flavors of RCU, > > it would indefinitely delay an RCU Tasks grace period. > > > > However, RCU Tasks grace periods seem to be finite in preemptible > > kernels today, so they should remain finite in limited-preemptible > > kernels tomorrow. Famous last words... > > > > 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice > > any algorithmic difference from this change. > > So, essentially, as long as RCU tasks eventually, in the fullness of > time, call schedule(), removing cond_resched() shouldn't have any > effect :). Almost. 
SRCU and RCU Tasks Trace have explicit read-side state changes that the corresponding grace-period code can detect, one way or another, and thus is not dependent on reschedules. RCU Tasks Rude does explicit reschedules on all CPUs (hence "Rude"), and thus doesn't have to care about whether or not other things do reschedules. > > 8. As has been noted elsewhere, in this new limited-preemption > > mode of operation, rcu_read_lock() readers remain preemptible. > > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain. > > Ack. > > > 9. The rcu_preempt_depth() macro could do something useful in > > limited-preemption kernels. Its current lack of ability in > > CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past. > > > > 10. The cond_resched_rcu() function must remain because we still > > have non-preemptible rcu_read_lock() readers. > > For configurations with PREEMPT_RCU=n? Yes, agreed. Though it need > only be this, right?: > > static inline void cond_resched_rcu(void) > { > #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU) > rcu_read_unlock(); > > rcu_read_lock(); > #endif > } There is a good chance that it will also need to do an explicit rcu_all_qs(). The problem is that there is an extremely low probability that the scheduling clock interrupt will hit that space between the rcu_read_unlock() and rcu_read_lock(). But either way, not a showstopper. > > 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains > > unchanged, but I must defer to the include/net/ip_vs.h people. > > > > 12. I need to check with the BPF folks on the BPF verifier's > > definition of BTF_ID(func, rcu_read_unlock_strict). > > > > 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner() > > function might have some redundancy across the board instead > > of just on CONFIG_PREEMPT_RCU=y. Or might not. > > I don't think I understand any of these well enough to comment. Will > Cc the relevant folks when I send out the RFC. Sounds like a plan to me! ;-) > > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function > > might need to do something for non-preemptible RCU to make > > up for the lack of cond_resched() calls. Maybe just drop the > > "IS_ENABLED()" and execute the body of the current "if" statement > > unconditionally. > > Aah, yes this is a good idea. Thanks. > > > 15. I must defer to others on the mm/pgtable-generic.c file's > > #ifdef that depends on CONFIG_PREEMPT_RCU. > > > > While in the area, I noted that KLP seems to depend on cond_resched(), > > but on this I must defer to the KLP people. > > Yeah, as part of this work, I ended up unhooking most of the KLP > hooks in cond_resched() and of course, cond_resched() itself. > Will poke the livepatching people. Again, sounds like a plan to me! > > I am sure that I am missing something, but I have not yet seen any > > show-stoppers. Just some needed adjustments. > > Appreciate this detailed list. Makes me think that everything might > not go up in smoke after all! C'mon, Ankur, if it doesn't go up in smoke at some point, you just aren't trying hard enough! ;-) Thanx, Paul > Thanks > Ankur > > > Thoughts? > > > > Thanx, Paul > > > >> That's as much as I wanted to demonstrate and I'm not going to spend > >> more cycles on it as I have already too many other things on flight and > >> the resulting scheduler woes are clearly outside of my expertice. 
> >> > >> Though definitely I'm putting a permanent NAK in place for any attempts > >> to duct tape the preempt=NONE model any further by sprinkling more > >> cond*() and whatever warts around. > >> > >> Thanks, > >> > >> tglx > >> --- > >> arch/x86/Kconfig | 1 > >> arch/x86/include/asm/thread_info.h | 6 ++-- > >> drivers/acpi/processor_idle.c | 2 - > >> include/linux/entry-common.h | 2 - > >> include/linux/entry-kvm.h | 2 - > >> include/linux/sched.h | 12 +++++--- > >> include/linux/sched/idle.h | 8 ++--- > >> include/linux/thread_info.h | 24 +++++++++++++++++ > >> include/linux/trace_events.h | 8 ++--- > >> kernel/Kconfig.preempt | 17 +++++++++++- > >> kernel/entry/common.c | 4 +- > >> kernel/entry/kvm.c | 2 - > >> kernel/sched/core.c | 51 +++++++++++++++++++++++++------------ > >> kernel/sched/debug.c | 19 +++++++++++++ > >> kernel/sched/fair.c | 46 ++++++++++++++++++++++----------- > >> kernel/sched/features.h | 2 + > >> kernel/sched/idle.c | 3 -- > >> kernel/sched/sched.h | 1 > >> kernel/trace/trace.c | 2 + > >> kernel/trace/trace_output.c | 16 ++++++++++- > >> 20 files changed, 171 insertions(+), 57 deletions(-) > >> > >> --- a/kernel/sched/core.c > >> +++ b/kernel/sched/core.c > >> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct > >> > >> #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG) > >> /* > >> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG, > >> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG, > >> * this avoids any races wrt polling state changes and thereby avoids > >> * spurious IPIs. > >> */ > >> -static inline bool set_nr_and_not_polling(struct task_struct *p) > >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) > >> { > >> struct thread_info *ti = task_thread_info(p); > >> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG); > >> + > >> + return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG); > >> } > >> > >> /* > >> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas > >> for (;;) { > >> if (!(val & _TIF_POLLING_NRFLAG)) > >> return false; > >> - if (val & _TIF_NEED_RESCHED) > >> + if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) > >> return true; > >> if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED)) > >> break; > >> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas > >> } > >> > >> #else > >> -static inline bool set_nr_and_not_polling(struct task_struct *p) > >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) > >> { > >> - set_tsk_need_resched(p); > >> + set_tsk_thread_flag(p, tif_bit); > >> return true; > >> } > >> > >> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head) > >> * might also involve a cross-CPU call to trigger the scheduler on > >> * the target CPU. 
> >> */ > >> -void resched_curr(struct rq *rq) > >> +static void __resched_curr(struct rq *rq, int lazy) > >> { > >> + int cpu, tif_bit = TIF_NEED_RESCHED + lazy; > >> struct task_struct *curr = rq->curr; > >> - int cpu; > >> > >> lockdep_assert_rq_held(rq); > >> > >> - if (test_tsk_need_resched(curr)) > >> + if (unlikely(test_tsk_thread_flag(curr, tif_bit))) > >> return; > >> > >> cpu = cpu_of(rq); > >> > >> if (cpu == smp_processor_id()) { > >> - set_tsk_need_resched(curr); > >> - set_preempt_need_resched(); > >> + set_tsk_thread_flag(curr, tif_bit); > >> + if (!lazy) > >> + set_preempt_need_resched(); > >> return; > >> } > >> > >> - if (set_nr_and_not_polling(curr)) > >> - smp_send_reschedule(cpu); > >> - else > >> + if (set_nr_and_not_polling(curr, tif_bit)) { > >> + if (!lazy) > >> + smp_send_reschedule(cpu); > >> + } else { > >> trace_sched_wake_idle_without_ipi(cpu); > >> + } > >> +} > >> + > >> +void resched_curr(struct rq *rq) > >> +{ > >> + __resched_curr(rq, 0); > >> +} > >> + > >> +void resched_curr_lazy(struct rq *rq) > >> +{ > >> + int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ? > >> + TIF_NEED_RESCHED_LAZY_OFFSET : 0; > >> + > >> + if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED))) > >> + return; > >> + > >> + __resched_curr(rq, lazy); > >> } > >> > >> void resched_cpu(int cpu) > >> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu) > >> if (cpu == smp_processor_id()) > >> return; > >> > >> - if (set_nr_and_not_polling(rq->idle)) > >> + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED)) > >> smp_send_reschedule(cpu); > >> else > >> trace_sched_wake_idle_without_ipi(cpu); > >> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init( > >> WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \ > >> return preempt_dynamic_mode == preempt_dynamic_##mode; \ > >> } \ > >> - EXPORT_SYMBOL_GPL(preempt_model_##mode) > >> > >> PREEMPT_MODEL_ACCESSOR(none); > >> PREEMPT_MODEL_ACCESSOR(voluntary); > >> --- a/include/linux/thread_info.h > >> +++ b/include/linux/thread_info.h > >> @@ -59,6 +59,16 @@ enum syscall_work_bit { > >> > >> #include <asm/thread_info.h> > >> > >> +#ifdef CONFIG_PREEMPT_AUTO > >> +# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY > >> +# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY > >> +# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED) > >> +#else > >> +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED > >> +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED > >> +# define TIF_NEED_RESCHED_LAZY_OFFSET 0 > >> +#endif > >> + > >> #ifdef __KERNEL__ > >> > >> #ifndef arch_set_restart_data > >> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res > >> (unsigned long *)(¤t_thread_info()->flags)); > >> } > >> > >> +static __always_inline bool tif_need_resched_lazy(void) > >> +{ > >> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && > >> + arch_test_bit(TIF_NEED_RESCHED_LAZY, > >> + (unsigned long *)(¤t_thread_info()->flags)); > >> +} > >> + > >> #else > >> > >> static __always_inline bool tif_need_resched(void) > >> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res > >> (unsigned long *)(¤t_thread_info()->flags)); > >> } > >> > >> +static __always_inline bool tif_need_resched_lazy(void) > >> +{ > >> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && > >> + test_bit(TIF_NEED_RESCHED_LAZY, > >> + (unsigned long *)(¤t_thread_info()->flags)); > >> +} > >> + > >> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */ > >> > >> #ifndef 
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES > >> --- a/kernel/Kconfig.preempt > >> +++ b/kernel/Kconfig.preempt > >> @@ -11,6 +11,13 @@ config PREEMPT_BUILD > >> select PREEMPTION > >> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK > >> > >> +config PREEMPT_BUILD_AUTO > >> + bool > >> + select PREEMPT_BUILD > >> + > >> +config HAVE_PREEMPT_AUTO > >> + bool > >> + > >> choice > >> prompt "Preemption Model" > >> default PREEMPT_NONE > >> @@ -67,9 +74,17 @@ config PREEMPT > >> embedded system with latency requirements in the milliseconds > >> range. > >> > >> +config PREEMPT_AUTO > >> + bool "Automagic preemption mode with runtime tweaking support" > >> + depends on HAVE_PREEMPT_AUTO > >> + select PREEMPT_BUILD_AUTO > >> + help > >> + Add some sensible blurb here > >> + > >> config PREEMPT_RT > >> bool "Fully Preemptible Kernel (Real-Time)" > >> depends on EXPERT && ARCH_SUPPORTS_RT > >> + select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO > >> select PREEMPTION > >> help > >> This option turns the kernel into a real-time kernel by replacing > >> @@ -95,7 +110,7 @@ config PREEMPTION > >> > >> config PREEMPT_DYNAMIC > >> bool "Preemption behaviour defined on boot" > >> - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT > >> + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO > >> select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY > >> select PREEMPT_BUILD > >> default y if HAVE_PREEMPT_DYNAMIC_CALL > >> --- a/include/linux/entry-common.h > >> +++ b/include/linux/entry-common.h > >> @@ -60,7 +60,7 @@ > >> #define EXIT_TO_USER_MODE_WORK \ > >> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ > >> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \ > >> - ARCH_EXIT_TO_USER_MODE_WORK) > >> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK) > >> > >> /** > >> * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs > >> --- a/include/linux/entry-kvm.h > >> +++ b/include/linux/entry-kvm.h > >> @@ -18,7 +18,7 @@ > >> > >> #define XFER_TO_GUEST_MODE_WORK \ > >> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \ > >> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK) > >> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK) > >> > >> struct kvm_vcpu; > >> > >> --- a/kernel/entry/common.c > >> +++ b/kernel/entry/common.c > >> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l > >> > >> local_irq_enable_exit_to_user(ti_work); > >> > >> - if (ti_work & _TIF_NEED_RESCHED) > >> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) > >> schedule(); > >> > >> if (ti_work & _TIF_UPROBE) > >> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void > >> rcu_irq_exit_check_preempt(); > >> if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) > >> WARN_ON_ONCE(!on_thread_stack()); > >> - if (need_resched()) > >> + if (test_tsk_need_resched(current)) > >> preempt_schedule_irq(); > >> } > >> } > >> --- a/kernel/sched/features.h > >> +++ b/kernel/sched/features.h > >> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) > >> SCHED_FEAT(LATENCY_WARN, false) > >> > >> SCHED_FEAT(HZ_BW, true) > >> + > >> +SCHED_FEAT(FORCE_NEED_RESCHED, false) > >> --- a/kernel/sched/sched.h > >> +++ b/kernel/sched/sched.h > >> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void); > >> extern void reweight_task(struct task_struct *p, int prio); > >> > >> extern void resched_curr(struct rq *rq); > >> +extern void resched_curr_lazy(struct rq *rq); > >> extern void resched_cpu(int cpu); > >> > >> extern struct rt_bandwidth 
def_rt_bandwidth; > >> --- a/include/linux/sched.h > >> +++ b/include/linux/sched.h > >> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla > >> update_ti_thread_flag(task_thread_info(tsk), flag, value); > >> } > >> > >> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) > >> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) > >> { > >> return test_and_set_ti_thread_flag(task_thread_info(tsk), flag); > >> } > >> > >> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) > >> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) > >> { > >> return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag); > >> } > >> > >> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag) > >> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag) > >> { > >> return test_ti_thread_flag(task_thread_info(tsk), flag); > >> } > >> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched( > >> static inline void clear_tsk_need_resched(struct task_struct *tsk) > >> { > >> clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED); > >> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO)) > >> + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY); > >> } > >> > >> -static inline int test_tsk_need_resched(struct task_struct *tsk) > >> +static inline bool test_tsk_need_resched(struct task_struct *tsk) > >> { > >> return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); > >> } > >> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc > >> > >> static __always_inline bool need_resched(void) > >> { > >> - return unlikely(tif_need_resched()); > >> + return unlikely(tif_need_resched_lazy() || tif_need_resched()); > >> } > >> > >> /* > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq > >> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i > >> * this is probably good enough. > >> */ > >> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) > >> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick) > >> { > >> + struct rq *rq = rq_of(cfs_rq); > >> + > >> if ((s64)(se->vruntime - se->deadline) < 0) > >> return; > >> > >> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r > >> /* > >> * The task has consumed its request, reschedule. > >> */ > >> - if (cfs_rq->nr_running > 1) { > >> - resched_curr(rq_of(cfs_rq)); > >> - clear_buddies(cfs_rq, se); > >> + if (cfs_rq->nr_running < 2) > >> + return; > >> + > >> + if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) { > >> + resched_curr(rq); > >> + } else { > >> + /* Did the task ignore the lazy reschedule request? */ > >> + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) > >> + resched_curr(rq); > >> + else > >> + resched_curr_lazy(rq); > >> } > >> + clear_buddies(cfs_rq, se); > >> } > >> > >> #include "pelt.h" > >> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf > >> /* > >> * Update the current task's runtime statistics. 
> >> */ > >> -static void update_curr(struct cfs_rq *cfs_rq) > >> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick) > >> { > >> struct sched_entity *curr = cfs_rq->curr; > >> u64 now = rq_clock_task(rq_of(cfs_rq)); > >> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c > >> schedstat_add(cfs_rq->exec_clock, delta_exec); > >> > >> curr->vruntime += calc_delta_fair(delta_exec, curr); > >> - update_deadline(cfs_rq, curr); > >> + update_deadline(cfs_rq, curr, tick); > >> update_min_vruntime(cfs_rq); > >> > >> if (entity_is_task(curr)) { > >> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c > >> account_cfs_rq_runtime(cfs_rq, delta_exec); > >> } > >> > >> +static inline void update_curr(struct cfs_rq *cfs_rq) > >> +{ > >> + __update_curr(cfs_rq, false); > >> +} > >> + > >> static void update_curr_fair(struct rq *rq) > >> { > >> update_curr(cfs_rq_of(&rq->curr->se)); > >> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc > >> /* > >> * Update run-time statistics of the 'current'. > >> */ > >> - update_curr(cfs_rq); > >> + __update_curr(cfs_rq, true); > >> > >> /* > >> * Ensure that runnable average is periodically updated. > >> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc > >> * validating it and just reschedule. > >> */ > >> if (queued) { > >> - resched_curr(rq_of(cfs_rq)); > >> + resched_curr_lazy(rq_of(cfs_rq)); > >> return; > >> } > >> /* > >> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str > >> * hierarchy can be throttled > >> */ > >> if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) > >> - resched_curr(rq_of(cfs_rq)); > >> + resched_curr_lazy(rq_of(cfs_rq)); > >> } > >> > >> static __always_inline > >> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf > >> > >> /* Determine whether we need to wake up potentially idle CPU: */ > >> if (rq->curr == rq->idle && rq->cfs.nr_running) > >> - resched_curr(rq); > >> + resched_curr_lazy(rq); > >> } > >> > >> #ifdef CONFIG_SMP > >> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq > >> > >> if (delta < 0) { > >> if (task_current(rq, p)) > >> - resched_curr(rq); > >> + resched_curr_lazy(rq); > >> return; > >> } > >> hrtick_start(rq, delta); > >> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct > >> * prevents us from potentially nominating it as a false LAST_BUDDY > >> * below. > >> */ > >> - if (test_tsk_need_resched(curr)) > >> + if (need_resched()) > >> return; > >> > >> /* Idle tasks are by definition preempted by non-idle tasks. 
*/ > >> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct > >> return; > >> > >> preempt: > >> - resched_curr(rq); > >> + resched_curr_lazy(rq); > >> } > >> > >> #ifdef CONFIG_SMP > >> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct > >> */ > >> if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 && > >> __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE)) > >> - resched_curr(rq); > >> + resched_curr_lazy(rq); > >> } > >> > >> /* > >> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct > >> */ > >> if (task_current(rq, p)) { > >> if (p->prio > oldprio) > >> - resched_curr(rq); > >> + resched_curr_lazy(rq); > >> } else > >> check_preempt_curr(rq, p, 0); > >> } > >> --- a/drivers/acpi/processor_idle.c > >> +++ b/drivers/acpi/processor_idle.c > >> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces > >> */ > >> static void __cpuidle acpi_safe_halt(void) > >> { > >> - if (!tif_need_resched()) { > >> + if (!need_resched()) { > >> raw_safe_halt(); > >> raw_local_irq_disable(); > >> } > >> --- a/include/linux/sched/idle.h > >> +++ b/include/linux/sched/idle.h > >> @@ -63,7 +63,7 @@ static __always_inline bool __must_check > >> */ > >> smp_mb__after_atomic(); > >> > >> - return unlikely(tif_need_resched()); > >> + return unlikely(need_resched()); > >> } > >> > >> static __always_inline bool __must_check current_clr_polling_and_test(void) > >> @@ -76,7 +76,7 @@ static __always_inline bool __must_check > >> */ > >> smp_mb__after_atomic(); > >> > >> - return unlikely(tif_need_resched()); > >> + return unlikely(need_resched()); > >> } > >> > >> #else > >> @@ -85,11 +85,11 @@ static inline void __current_clr_polling > >> > >> static inline bool __must_check current_set_polling_and_test(void) > >> { > >> - return unlikely(tif_need_resched()); > >> + return unlikely(need_resched()); > >> } > >> static inline bool __must_check current_clr_polling_and_test(void) > >> { > >> - return unlikely(tif_need_resched()); > >> + return unlikely(need_resched()); > >> } > >> #endif > >> > >> --- a/kernel/sched/idle.c > >> +++ b/kernel/sched/idle.c > >> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p > >> ct_cpuidle_enter(); > >> > >> raw_local_irq_enable(); > >> - while (!tif_need_resched() && > >> - (cpu_idle_force_poll || tick_check_broadcast_expired())) > >> + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired())) > >> cpu_relax(); > >> raw_local_irq_disable(); > >> > >> --- a/kernel/trace/trace.c > >> +++ b/kernel/trace/trace.c > >> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un > >> > >> if (tif_need_resched()) > >> trace_flags |= TRACE_FLAG_NEED_RESCHED; > >> + if (tif_need_resched_lazy()) > >> + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY; > >> if (test_preempt_need_resched()) > >> trace_flags |= TRACE_FLAG_PREEMPT_RESCHED; > >> return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) | > >> --- a/arch/x86/Kconfig > >> +++ b/arch/x86/Kconfig > >> @@ -271,6 +271,7 @@ config X86 > >> select HAVE_STATIC_CALL > >> select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL > >> select HAVE_PREEMPT_DYNAMIC_CALL > >> + select HAVE_PREEMPT_AUTO > >> select HAVE_RSEQ > >> select HAVE_RUST if X86_64 > >> select HAVE_SYSCALL_TRACEPOINTS > >> --- a/arch/x86/include/asm/thread_info.h > >> +++ b/arch/x86/include/asm/thread_info.h > >> @@ -81,8 +81,9 @@ struct thread_info { > >> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */ > >> #define TIF_SIGPENDING 2 /* signal pending */ > >> 
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */ > >> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ > >> -#define TIF_SSBD 5 /* Speculative store bypass disable */ > >> +#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */ > >> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/ > >> +#define TIF_SSBD 6 /* Speculative store bypass disable */ > >> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ > >> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ > >> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ > >> @@ -104,6 +105,7 @@ struct thread_info { > >> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) > >> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) > >> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) > >> +#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY) > >> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) > >> #define _TIF_SSBD (1 << TIF_SSBD) > >> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) > >> --- a/kernel/entry/kvm.c > >> +++ b/kernel/entry/kvm.c > >> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc > >> return -EINTR; > >> } > >> > >> - if (ti_work & _TIF_NEED_RESCHED) > >> + if (ti_work & (_TIF_NEED_RESCHED | TIF_NEED_RESCHED_LAZY)) > >> schedule(); > >> > >> if (ti_work & _TIF_NOTIFY_RESUME) > >> --- a/include/linux/trace_events.h > >> +++ b/include/linux/trace_events.h > >> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un > >> > >> enum trace_flag_type { > >> TRACE_FLAG_IRQS_OFF = 0x01, > >> - TRACE_FLAG_IRQS_NOSUPPORT = 0x02, > >> - TRACE_FLAG_NEED_RESCHED = 0x04, > >> + TRACE_FLAG_NEED_RESCHED = 0x02, > >> + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04, > >> TRACE_FLAG_HARDIRQ = 0x08, > >> TRACE_FLAG_SOFTIRQ = 0x10, > >> TRACE_FLAG_PREEMPT_RESCHED = 0x20, > >> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c > >> > >> static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags) > >> { > >> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); > >> + return tracing_gen_ctx_irq_test(0); > >> } > >> static inline unsigned int tracing_gen_ctx(void) > >> { > >> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); > >> + return tracing_gen_ctx_irq_test(0); > >> } > >> #endif > >> > >> --- a/kernel/trace/trace_output.c > >> +++ b/kernel/trace/trace_output.c > >> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq > >> (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' : > >> (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : > >> bh_off ? 'b' : > >> - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' : > >> + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 
'X' : > >> '.'; > >> > >> - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | > >> + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | > >> TRACE_FLAG_PREEMPT_RESCHED)) { > >> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: > >> + need_resched = 'B'; > >> + break; > >> case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED: > >> need_resched = 'N'; > >> break; > >> + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: > >> + need_resched = 'L'; > >> + break; > >> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY: > >> + need_resched = 'b'; > >> + break; > >> case TRACE_FLAG_NEED_RESCHED: > >> need_resched = 'n'; > >> break; > >> + case TRACE_FLAG_NEED_RESCHED_LAZY: > >> + need_resched = 'l'; > >> + break; > >> case TRACE_FLAG_PREEMPT_RESCHED: > >> need_resched = 'p'; > >> break; > >> --- a/kernel/sched/debug.c > >> +++ b/kernel/sched/debug.c > >> @@ -333,6 +333,23 @@ static const struct file_operations sche > >> .release = seq_release, > >> }; > >> > >> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf, > >> + size_t cnt, loff_t *ppos) > >> +{ > >> + unsigned long end = jiffies + 60 * HZ; > >> + > >> + for (; time_before(jiffies, end) && !signal_pending(current);) > >> + cpu_relax(); > >> + > >> + return cnt; > >> +} > >> + > >> +static const struct file_operations sched_hog_fops = { > >> + .write = sched_hog_write, > >> + .open = simple_open, > >> + .llseek = default_llseek, > >> +}; > >> + > >> static struct dentry *debugfs_sched; > >> > >> static __init int sched_init_debug(void) > >> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void) > >> > >> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); > >> > >> + debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops); > >> + > >> return 0; > >> } > >> late_initcall(sched_init_debug); > >> > > > -- > ankur
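
[ Combining item 10 above with Paul's note that the bare unlock/lock pair
  leaves only a tiny window for the scheduling-clock interrupt to observe,
  the cond_resched_rcu() for kernels that keep non-preemptible readers
  might end up looking roughly like the sketch below.  This assumes
  rcu_all_qs() stays available for CONFIG_PREEMPT_RCU=n builds, as it is
  today; it is a sketch of the idea discussed here, not the actual
  implementation. ]

static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
	rcu_read_unlock();
#ifndef CONFIG_PREEMPT_RCU
	/* Don't rely on the tick hitting the tiny unlock/lock window. */
	rcu_all_qs();
#endif
	rcu_read_lock();
#endif
}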
On Wed, Oct 18, 2023 at 10:31:46AM -0400, Steven Rostedt wrote: > On Wed, 18 Oct 2023 15:16:12 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: > > > > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function > > > might need to do something for non-preemptible RCU to make > > > up for the lack of cond_resched() calls. Maybe just drop the > > > "IS_ENABLED()" and execute the body of the current "if" statement > > > unconditionally. > > Right. > > I'm guessing you are talking about this code: > > /* > * In some cases, notably when running on a nohz_full CPU with > * a stopped tick PREEMPT_RCU has no way to account for QSs. > * This will eventually cause unwarranted noise as PREEMPT_RCU > * will force preemption as the means of ending the current > * grace period. We avoid this problem by calling > * rcu_momentary_dyntick_idle(), which performs a zero duration > * EQS allowing PREEMPT_RCU to end the current grace period. > * This call shouldn't be wrapped inside an RCU critical > * section. > * > * Note that in non PREEMPT_RCU kernels QSs are handled through > * cond_resched() > */ > if (IS_ENABLED(CONFIG_PREEMPT_RCU)) { > if (!disable_irq) > local_irq_disable(); > > rcu_momentary_dyntick_idle(); > > if (!disable_irq) > local_irq_enable(); > } That is indeed the place! > /* > * For the non-preemptive kernel config: let threads runs, if > * they so wish, unless set not do to so. > */ > if (!disable_irq && !disable_preemption) > cond_resched(); > > > > If everything becomes PREEMPT_RCU, then the above should be able to be > turned into just: > > if (!disable_irq) > local_irq_disable(); > > rcu_momentary_dyntick_idle(); > > if (!disable_irq) > local_irq_enable(); > > And no cond_resched() is needed. Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that run_osnoise() is running in kthread context with preemption and everything else enabled (am I right?), then the change you suggest should work fine. > > Again. There is no non-preemtible RCU with this model, unless I'm > > missing something important here. > > Daniel? But very happy to defer to Daniel. ;-) Thanx, Paul
On Wed, Oct 18, 2023 at 01:41:07PM -0400, Steven Rostedt wrote: > On Wed, 18 Oct 2023 10:19:53 -0700 > "Paul E. McKenney" <paulmck@kernel.org> wrote: > > > > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock() > > as preempt_enable() in this approach? I certainly hope so, as RCU > > priority boosting would be a most unwelcome addition to many datacenter > > workloads. > > > > > With this approach the kernel is by definition fully preemptible, which > > > means means rcu_read_lock() is preemptible too. That's pretty much the > > > same situation as with PREEMPT_DYNAMIC. > > > > Please, just no!!! > > Note, when I first read Thomas's proposal, I figured that Paul would no > longer get to brag that: > > "In CONFIG_PREEMPT_NONE, rcu_read_lock() and rcu_read_unlock() are simply > nops!" I will still be able to brag that in a fully non-preemptible environment, rcu_read_lock() and rcu_read_unlock() are simply no-ops. It will just be that the Linux kernel will no longer be such an environment. For the moment, anyway, there is still userspace RCU along with a few other instances of zero-cost RCU readers. ;-) > But instead, they would be: > > static void rcu_read_lock(void) > { > preempt_disable(); > } > > static void rcu_read_unlock(void) > { > preempt_enable(); > } > > as it was mentioned that today's preempt_disable() is fast and not an issue > like it was in older kernels. And they are already defined as you show above in rcupdate.h, albeit with leading underscores on the function names. > That would mean that there will still be a "non preempt" version of RCU. That would be very good! > As the preempt version of RCU adds a lot more logic when scheduling out in > an RCU critical section, that I can envision not all workloads would want > around. Adding "preempt_disable()" is now low overhead, but adding the RCU > logic to handle preemption isn't as lightweight as that. > > Not to mention the logic to boost those threads that were preempted and > being starved for some time. Exactly, thank you! > > > > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace > > > > or RCU Tasks Rude) would need those pesky cond_resched() calls > > > > to stick around. The reason is that RCU Tasks readers are ended > > > > only by voluntary context switches. This means that although a > > > > preemptible infinite loop in the kernel won't inconvenience a > > > > real-time task (nor an non-real-time task for all that long), > > > > and won't delay grace periods for the other flavors of RCU, > > > > it would indefinitely delay an RCU Tasks grace period. > > > > > > > > However, RCU Tasks grace periods seem to be finite in preemptible > > > > kernels today, so they should remain finite in limited-preemptible > > > > kernels tomorrow. Famous last words... > > > > > > That's an issue which you have today with preempt FULL, right? So if it > > > turns out to be a problem then it's not a problem of the new model. > > > > Agreed, and hence my last three lines of text above. Plus the guy who > > requested RCU Tasks said that it was OK for its grace periods to take > > a long time, and I am holding Steven Rostedt to that. ;-) > > Matters what your definition of "long time" is ;-) If RCU Tasks grace-period latency has been acceptable in preemptible kernels (including all CONFIG_PREEMPT_DYNAMIC=y kernels), your definition of "long" is sufficiently short. ;-) Thanx, Paul
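
[ For reference, the !CONFIG_PREEMPT_RCU definitions Paul is pointing at
  look roughly like this in include/linux/rcupdate.h; reproduced from
  memory of recent kernels, so treat the details (in particular the
  strict-grace-period hook) as approximate. ]

static inline void __rcu_read_lock(void)
{
	preempt_disable();
}

static inline void __rcu_read_unlock(void)
{
	preempt_enable();
	if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
		rcu_read_unlock_strict();
}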
On Wed, 18 Oct 2023 10:55:02 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> > If everything becomes PREEMPT_RCU, then the above should be able to be
> > turned into just:
> > 
> > 	if (!disable_irq)
> > 		local_irq_disable();
> > 
> > 	rcu_momentary_dyntick_idle();
> > 
> > 	if (!disable_irq)
> > 		local_irq_enable();
> > 
> > And no cond_resched() is needed.
> 
> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
> run_osnoise() is running in kthread context with preemption and everything
> else enabled (am I right?), then the change you suggest should work fine.

There's a user space option that lets you run that loop with preemption and/or
interrupts disabled.

> 
> > > Again. There is no non-preemtible RCU with this model, unless I'm
> > > missing something important here.
> > 
> > Daniel?
> 
> But very happy to defer to Daniel.  ;-)

But Daniel could also correct me ;-)

-- Steve
On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote:
> On Wed, 18 Oct 2023 10:55:02 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > > If everything becomes PREEMPT_RCU, then the above should be able to be
> > > turned into just:
> > > 
> > > 	if (!disable_irq)
> > > 		local_irq_disable();
> > > 
> > > 	rcu_momentary_dyntick_idle();
> > > 
> > > 	if (!disable_irq)
> > > 		local_irq_enable();
> > > 
> > > And no cond_resched() is needed.
> > 
> > Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that
> > run_osnoise() is running in kthread context with preemption and everything
> > else enabled (am I right?), then the change you suggest should work fine.
> 
> There's a user space option that lets you run that loop with preemption and/or
> interrupts disabled.

Ah, thank you.  Then as long as this function is not expecting an RCU
reader to span that call to rcu_momentary_dyntick_idle(), all is well.
This is a kthread, so there cannot be something else expecting an RCU
reader to span that call.

> > > > Again. There is no non-preemtible RCU with this model, unless I'm
> > > > missing something important here.
> > > 
> > > Daniel?
> > 
> > But very happy to defer to Daniel.  ;-)
> 
> But Daniel could also correct me ;-)

If he figures out a way that it is broken, he gets to fix it.  ;-)

							Thanx, Paul
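
[ A small illustration of the constraint Paul restates above, namely that
  an RCU reader must not span the zero-duration EQS.  The calls are real
  kernel APIs; the snippet only shows the ordering requirement. ]

	rcu_read_lock();
	rcu_momentary_dyntick_idle();	/* BAD: splits the reader */
	rcu_read_unlock();

	rcu_momentary_dyntick_idle();	/* OK: outside any reader */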
Paul E. McKenney <paulmck@kernel.org> writes: > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: >> Paul! >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: >> > Belatedly calling out some RCU issues. Nothing fatal, just a >> > (surprisingly) few adjustments that will need to be made. The key thing >> > to note is that from RCU's viewpoint, with this change, all kernels >> > are preemptible, though rcu_read_lock() readers remain >> > non-preemptible. >> >> Why? Either I'm confused or you or both of us :) > > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock() > as preempt_enable() in this approach? I certainly hope so, as RCU > priority boosting would be a most unwelcome addition to many datacenter > workloads. No, in this approach, PREEMPT_AUTO selects PREEMPTION and thus PREEMPT_RCU so rcu_read_lock/unlock() would touch the rcu_read_lock_nesting. Which is identical to what PREEMPT_DYNAMIC does. >> With this approach the kernel is by definition fully preemptible, which >> means means rcu_read_lock() is preemptible too. That's pretty much the >> same situation as with PREEMPT_DYNAMIC. > > Please, just no!!! > > Please note that the current use of PREEMPT_DYNAMIC with preempt=none > avoids preempting RCU read-side critical sections. This means that the > distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption > of RCU readers in environments expecting no preemption. Ah. So, though PREEMPT_DYNAMIC with preempt=none runs with PREEMPT_RCU, preempt=none stubs out the actual preemption via __preempt_schedule. Okay, I see what you are saying. (Side issue: but this means that even for PREEMPT_DYNAMIC preempt=none, _cond_resched() doesn't call rcu_all_qs().) >> For throughput sake this fully preemptible kernel provides a mechanism >> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting >> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY. >> >> That means the preemption points in preempt_enable() and return from >> interrupt to kernel will not see NEED_RESCHED and the tasks can run to >> completion either to the point where they call schedule() or when they >> return to user space. That's pretty much what PREEMPT_NONE does today. >> >> The difference to NONE/VOLUNTARY is that the explicit cond_resched() >> points are not longer required because the scheduler can preempt the >> long running task by setting NEED_RESCHED instead. >> >> That preemption might be suboptimal in some cases compared to >> cond_resched(), but from my initial experimentation that's not really an >> issue. > > I am not (repeat NOT) arguing for keeping cond_resched(). I am instead > arguing that the less-preemptible variants of the kernel should continue > to avoid preempting RCU read-side critical sections. [ snip ] >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to >> CONFIG_RT or such as it does not really change the preemption >> model itself. RT just reduces the preemption disabled sections with the >> lock conversions, forced interrupt threading and some more. > > Again, please, no. > > There are situations where we still need rcu_read_lock() and > rcu_read_unlock() to be preempt_disable() and preempt_enable(), > repectively. Those can be cases selected only by Kconfig option, not > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. 
As far as non-preemptible RCU read-side critical sections are concerned, are the following two configs roughly similar, or not?

- PREEMPT_DYNAMIC=y, PREEMPT_RCU, preempt=none (rcu_read_lock/unlock() do not manipulate preempt_count, but preempt_schedule() is stubbed out)

- PREEMPT_NONE=y, TREE_RCU (rcu_read_lock/unlock() do manipulate preempt_count)

>> > I am sure that I am missing something, but I have not yet seen any >> > show-stoppers. Just some needed adjustments. >> >> Right. If it works out as I think it can work out the main adjustments >> are to remove a large amount of #ifdef maze and related gunk :) > > Just please don't remove the #ifdef gunk that is still needed! Always the hard part :). Thanks -- ankur
On Wed, Oct 18, 2023 at 01:15:28PM -0700, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: > >> Paul! > >> > >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: > >> > Belatedly calling out some RCU issues. Nothing fatal, just a > >> > (surprisingly) few adjustments that will need to be made. The key thing > >> > to note is that from RCU's viewpoint, with this change, all kernels > >> > are preemptible, though rcu_read_lock() readers remain > >> > non-preemptible. > >> > >> Why? Either I'm confused or you or both of us :) > > > > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock() > > as preempt_enable() in this approach? I certainly hope so, as RCU > > priority boosting would be a most unwelcome addition to many datacenter > > workloads. > > No, in this approach, PREEMPT_AUTO selects PREEMPTION and thus > PREEMPT_RCU so rcu_read_lock/unlock() would touch the > rcu_read_lock_nesting. Which is identical to what PREEMPT_DYNAMIC does. Understood. And we need some way to build a kernel such that RCU read-side critical sections are non-preemptible. This is a hard requirement that is not going away anytime soon. > >> With this approach the kernel is by definition fully preemptible, which > >> means means rcu_read_lock() is preemptible too. That's pretty much the > >> same situation as with PREEMPT_DYNAMIC. > > > > Please, just no!!! > > > > Please note that the current use of PREEMPT_DYNAMIC with preempt=none > > avoids preempting RCU read-side critical sections. This means that the > > distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption > > of RCU readers in environments expecting no preemption. > > Ah. So, though PREEMPT_DYNAMIC with preempt=none runs with PREEMPT_RCU, > preempt=none stubs out the actual preemption via __preempt_schedule. > > Okay, I see what you are saying. More to the point, currently, you can build with CONFIG_PREEMPT_DYNAMIC=n and CONFIG_PREEMPT_NONE=y and have non-preemptible RCU read-side critical sections. > (Side issue: but this means that even for PREEMPT_DYNAMIC preempt=none, > _cond_resched() doesn't call rcu_all_qs().) I have no idea if anyone runs with CONFIG_PREEMPT_DYNAMIC=y and preempt=none. We don't do so. ;-) > >> For throughput sake this fully preemptible kernel provides a mechanism > >> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting > >> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY. > >> > >> That means the preemption points in preempt_enable() and return from > >> interrupt to kernel will not see NEED_RESCHED and the tasks can run to > >> completion either to the point where they call schedule() or when they > >> return to user space. That's pretty much what PREEMPT_NONE does today. > >> > >> The difference to NONE/VOLUNTARY is that the explicit cond_resched() > >> points are not longer required because the scheduler can preempt the > >> long running task by setting NEED_RESCHED instead. > >> > >> That preemption might be suboptimal in some cases compared to > >> cond_resched(), but from my initial experimentation that's not really an > >> issue. > > > > I am not (repeat NOT) arguing for keeping cond_resched(). I am instead > > arguing that the less-preemptible variants of the kernel should continue > > to avoid preempting RCU read-side critical sections. > > [ snip ] > > >> In the end there is no CONFIG_PREEMPT_XXX anymore. 
The only knob > >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to > >> CONFIG_RT or such as it does not really change the preemption > >> model itself. RT just reduces the preemption disabled sections with the > >> lock conversions, forced interrupt threading and some more. > > > > Again, please, no. > > > > There are situations where we still need rcu_read_lock() and > > rcu_read_unlock() to be preempt_disable() and preempt_enable(), > > repectively. Those can be cases selected only by Kconfig option, not > > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. > > As far as non-preemptible RCU read-side critical sections are concerned, > are the current > - PREEMPT_DYNAMIC=y, PREEMPT_RCU, preempt=none config > (rcu_read_lock/unlock() do not manipulate preempt_count, but do > stub out preempt_schedule()) > - and PREEMPT_NONE=y, TREE_RCU config (rcu_read_lock/unlock() manipulate > preempt_count)? > > roughly similar or no? No. There is still considerable exposure to preemptible-RCU code paths, for example, when current->rcu_read_unlock_special.b.blocked is set. > >> > I am sure that I am missing something, but I have not yet seen any > >> > show-stoppers. Just some needed adjustments. > >> > >> Right. If it works out as I think it can work out the main adjustments > >> are to remove a large amount of #ifdef maze and related gunk :) > > > > Just please don't remove the #ifdef gunk that is still needed! > > Always the hard part :). Hey, we wouldn't want to insult your intelligence by letting you work on too easy of a problem! ;-) Thanx, Paul
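For readers following along, the distinction being debated boils down to roughly the following. This is a heavily simplified sketch with illustrative names; the real implementations live in include/linux/rcupdate.h and kernel/rcu/tree_plugin.h and carry additional debugging, strict-grace-period, and deferred-quiescent-state handling.

/* CONFIG_PREEMPT_RCU=n: a reader is simply a non-preemptible region. */
static inline void rcu_read_lock_sketch(void)
{
	preempt_disable();
}

static inline void rcu_read_unlock_sketch(void)
{
	preempt_enable();
}

/*
 * CONFIG_PREEMPT_RCU=y: a reader only tracks nesting; preemption stays
 * enabled, and a reader that actually is preempted gets queued (and
 * possibly boosted), with the cleanup done at the outermost unlock.
 */
static inline void rcu_preempt_read_lock_sketch(void)
{
	current->rcu_read_lock_nesting++;
	barrier();
}

static inline void rcu_preempt_read_unlock_sketch(void)
{
	barrier();
	if (--current->rcu_read_lock_nesting == 0 &&
	    unlikely(READ_ONCE(current->rcu_read_unlock_special.s)))
		rcu_read_unlock_special(current);	/* blocked-reader cleanup */
}

This is why Paul points out that rcu_read_unlock_special.b.blocked keeps the preemptible-RCU code paths in play whenever preemption of readers is merely avoided by policy rather than prevented by preempt_count.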
On Wed, Oct 18 2023 at 10:51, Paul E. McKenney wrote: > On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote: Can you folks please trim your replies. It's annoying to scroll through hundreds of quoted lines to figure out that nothing is there. >> This probably allows for more configuration flexibility across archs? >> Would allow for TREE_RCU=y, for instance. That said, so far I've only >> been working with PREEMPT_RCU=y.) > > Then this is a bug that needs to be fixed. We need a way to make > RCU readers non-preemptible. Why?
On Thu, Oct 19, 2023 at 12:53:05AM +0200, Thomas Gleixner wrote: > On Wed, Oct 18 2023 at 10:51, Paul E. McKenney wrote: > > On Wed, Oct 18, 2023 at 05:09:46AM -0700, Ankur Arora wrote: > > Can you folks please trim your replies. It's annoying to scroll > through hundreds of quoted lines to figure out that nothing is there. > > >> This probably allows for more configuration flexibility across archs? > >> Would allow for TREE_RCU=y, for instance. That said, so far I've only > >> been working with PREEMPT_RCU=y.) > > > > Then this is a bug that needs to be fixed. We need a way to make > > RCU readers non-preemptible. > > Why? So that we don't get tail latencies from preempted RCU readers that result in memory-usage spikes on systems that have good and sufficient quantities of memory, but which do not have enough memory to tolerate readers being preempted. Thanx, Paul
Paul! On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote: > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: >> > Belatedly calling out some RCU issues. Nothing fatal, just a >> > (surprisingly) few adjustments that will need to be made. The key thing >> > to note is that from RCU's viewpoint, with this change, all kernels >> > are preemptible, though rcu_read_lock() readers remain >> > non-preemptible. >> >> Why? Either I'm confused or you or both of us :) > > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock() > as preempt_enable() in this approach? I certainly hope so, as RCU > priority boosting would be a most unwelcome addition to many datacenter > workloads. Sure, but that's an orthogonal problem, really. >> With this approach the kernel is by definition fully preemptible, which >> means means rcu_read_lock() is preemptible too. That's pretty much the >> same situation as with PREEMPT_DYNAMIC. > > Please, just no!!! > > Please note that the current use of PREEMPT_DYNAMIC with preempt=none > avoids preempting RCU read-side critical sections. This means that the > distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption > of RCU readers in environments expecting no preemption. It does not _avoid_ it, it simply _prevents_ it by not preempting in preempt_enable() and on return from interrupt so whatever sets NEED_RESCHED has to wait for a voluntary invocation of schedule(), cond_resched() or return to user space. But under the hood RCU is fully preemptible and the boosting logic is active, but it does not have an effect until one of those preemption points is reached, which makes the boosting moot. >> For throughput sake this fully preemptible kernel provides a mechanism >> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting >> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY. >> >> That means the preemption points in preempt_enable() and return from >> interrupt to kernel will not see NEED_RESCHED and the tasks can run to >> completion either to the point where they call schedule() or when they >> return to user space. That's pretty much what PREEMPT_NONE does today. >> >> The difference to NONE/VOLUNTARY is that the explicit cond_resched() >> points are not longer required because the scheduler can preempt the >> long running task by setting NEED_RESCHED instead. >> >> That preemption might be suboptimal in some cases compared to >> cond_resched(), but from my initial experimentation that's not really an >> issue. > > I am not (repeat NOT) arguing for keeping cond_resched(). I am instead > arguing that the less-preemptible variants of the kernel should continue > to avoid preempting RCU read-side critical sections. That's the whole point of the lazy mechanism: It avoids (repeat AVOIDS) preemption of any kernel code as much as it can by _not_ setting NEED_RESCHED. The only difference is that it does not _prevent_ it like preempt=none does. It will preempt when NEED_RESCHED is set. Now the question is when will NEED_RESCHED be set? 1) If the preempting task belongs to a scheduling class above SCHED_OTHER This is a PoC implementation detail. The lazy mechanism can be extended to any other scheduling class w/o a big effort. 
I deliberately did not do that because: A) I'm lazy B) More importantly I wanted to demonstrate that as long as there are only SCHED_OTHER tasks involved there is no forced (via NEED_RESCHED) preemption unless the to be preempted task ignores the lazy resched request, which proves that cond_resched() can be avoided. At the same time such a kernel allows a RT task to preempt at any time. 2) If the to be preempted task does not react within a certain time frame (I used a full tick in my PoC) on the NEED_RESCHED_LAZY request, which is the prerequisite to get rid of cond_resched() and related muck. That's obviously mandatory for getting rid of cond_resched() and related muck, no? I concede that there are a lot of details to be discussed before we get there, but I don't see a real show stopper yet. The important point is that the details are basically boiling down to policy decisions in the scheduler which are aided by hints from the programmer. As I said before we might end up with something like preempt_me_not_if_not_absolutely_required(); .... preempt_me_I_dont_care(); (+/- name bike shedding) to give the scheduler a better understanding of the context. Something like that has distinct advantages over the current situation with all the cond_resched() muck: 1) It is clearly scope based 2) It is properly nesting 3) It can be easily made implicit for existing scope constructs like rcu_read_lock/unlock() or regular locking mechanisms. The important point is that at the very end the scheduler has the ultimate power to say: "Not longer Mr. Nice Guy" without the risk of any random damage due to the fact that preemption count is functional, which makes your life easier as well as you admitted already. But that does not mean you can eat the cake and still have it. :) That said, I completely understand your worries about the consequences, but please take the step back and look at it from a conceptual point of view. The goal is to replace the hard coded (Kconfig or DYNAMIC) policy mechanisms with a flexible scheduler controlled policy mechanism. That allows you to focus on one consolidated model and optimize that for particular policy scenarios instead of dealing with optimizing the hell out of hardcoded policies which force you to come up with horrible workaround for each of them. Of course the policies have to be defined (scheduling classes affected depending on model, hint/annotation meaning etc.), but that's way more palatable than what we have now. Let me give you a simple example: Right now the only way out on preempt=none when a rogue code path which lacks a cond_resched() fails to release the CPU is a big fat stall splat and a hosed machine. I rather prefer to have the fully controlled hammer ready which keeps the machine usable and the situation debuggable. You still can yell in dmesg, but that again is a flexible policy decision and not hard coded by any means. >> > 3. For nohz_full CPUs that run for a long time in the kernel, >> > there are no scheduling-clock interrupts. RCU reaches for >> > the resched_cpu() hammer a few jiffies into the grace period. >> > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's >> > interrupt-entry code will re-enable its scheduling-clock interrupt >> > upon receiving the resched_cpu() IPI. >> >> You can spare the IPI by setting NEED_RESCHED on the remote CPU which >> will cause it to preempt. > > That is not sufficient for nohz_full CPUs executing in userspace, That's not what I was talking about. You said: >> > 3. 
For nohz_full CPUs that run for a long time in the kernel, ^^^^^^ Duh! I did not realize that you meant user space. For user space there is zero difference to the current situation. Once the task is out in user space it's out of RCU side critical sections, so that's obiously not a problem. As I said: I might be confused. :) >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to >> CONFIG_RT or such as it does not really change the preemption >> model itself. RT just reduces the preemption disabled sections with the >> lock conversions, forced interrupt threading and some more. > > Again, please, no. > > There are situations where we still need rcu_read_lock() and > rcu_read_unlock() to be preempt_disable() and preempt_enable(), > repectively. Those can be cases selected only by Kconfig option, not > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. Why are you so fixated on making everything hardcoded instead of making it a proper policy decision problem. See above. >> > 8. As has been noted elsewhere, in this new limited-preemption >> > mode of operation, rcu_read_lock() readers remain preemptible. >> > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain. >> >> Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no? > > That is in fact the problem. Preemption can be good, but it is possible > to have too much of a good thing, and preemptible RCU read-side critical > sections definitely is in that category for some important workloads. ;-) See above. >> > 10. The cond_resched_rcu() function must remain because we still >> > have non-preemptible rcu_read_lock() readers. >> >> Where? > > In datacenters. See above. >> > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function >> > might need to do something for non-preemptible RCU to make >> > up for the lack of cond_resched() calls. Maybe just drop the >> > "IS_ENABLED()" and execute the body of the current "if" statement >> > unconditionally. >> >> Again. There is no non-preemtible RCU with this model, unless I'm >> missing something important here. > > And again, there needs to be non-preemptible RCU with this model. See above. Thanks, tglx
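A minimal sketch of the policy Thomas describes in points 1) and 2) above, assuming a hypothetical TIF_NEED_RESCHED_LAZY flag with set/test helpers (the PoC's names may differ); it shows where the decisions sit, not the PoC code itself.

/* Wakeup path: decide how urgently to ask for a reschedule. */
static void resched_curr_sketch(struct rq *rq, bool preemptor_above_sched_other)
{
	struct task_struct *curr = rq->curr;

	if (preemptor_above_sched_other) {
		/* RT/DL and friends: force immediate preemption, as today. */
		set_tsk_need_resched(curr);		/* TIF_NEED_RESCHED */
		return;
	}

	/*
	 * SCHED_OTHER: ask politely.  preempt_enable() and the
	 * return-from-interrupt path do not act on the lazy bit, so the
	 * current task keeps running until it schedules, returns to user
	 * space, or the tick escalates below.
	 */
	set_tsk_need_resched_lazy(curr);		/* hypothetical helper */
}

/*
 * Scheduler tick: a task that ignored the lazy request for a full tick
 * gets the real thing, which is what makes cond_resched() removable.
 */
static void sched_tick_escalate_sketch(struct task_struct *curr)
{
	if (test_tsk_need_resched_lazy(curr))		/* hypothetical helper */
		set_tsk_need_resched(curr);
}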
On 10/18/23 20:13, Paul E. McKenney wrote: > On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote: >> On Wed, 18 Oct 2023 10:55:02 -0700 >> "Paul E. McKenney" <paulmck@kernel.org> wrote: >> >>>> If everything becomes PREEMPT_RCU, then the above should be able to be >>>> turned into just: >>>> >>>> if (!disable_irq) >>>> local_irq_disable(); >>>> >>>> rcu_momentary_dyntick_idle(); >>>> >>>> if (!disable_irq) >>>> local_irq_enable(); >>>> >>>> And no cond_resched() is needed. >>> >>> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that >>> run_osnoise() is running in kthread context with preemption and everything >>> else enabled (am I right?), then the change you suggest should work fine. >> >> There's a user space option that lets you run that loop with preemption and/or >> interrupts disabled. > > Ah, thank you. Then as long as this function is not expecting an RCU > reader to span that call to rcu_momentary_dyntick_idle(), all is well. > This is a kthread, so there cannot be something else expecting an RCU > reader to span that call. Sorry for the delay, this thread is quite long (and I admit I should be paying attention to it). It seems that you both figure it out without me anyways. This piece of code is preemptive unless a config is set to disable irq or preemption (as steven mentioned). That call is just a ping to RCU to say that things are fine. So Steven's suggestion should work. >>>>> Again. There is no non-preemtible RCU with this model, unless I'm >>>>> missing something important here. >>>> >>>> Daniel? >>> >>> But very happy to defer to Daniel. ;-) >> >> But Daniel could also correct me ;-) > > If he figures out a way that it is broken, he gets to fix it. ;-) It works for me, keep in the loop for the patches and I can test and adjust osnoise accordingly. osnoise should not be a reason to block more important things like this patch set, and we can find a way out in the osnoise tracer side. (I might need an assistance from rcu people, but I know I can count on them :-). Thanks! -- Daniel > Thanx, Paul
On Thu, Oct 19, 2023 at 02:37:23PM +0200, Daniel Bristot de Oliveira wrote: > On 10/18/23 20:13, Paul E. McKenney wrote: > > On Wed, Oct 18, 2023 at 02:00:35PM -0400, Steven Rostedt wrote: > >> On Wed, 18 Oct 2023 10:55:02 -0700 > >> "Paul E. McKenney" <paulmck@kernel.org> wrote: > >> > >>>> If everything becomes PREEMPT_RCU, then the above should be able to be > >>>> turned into just: > >>>> > >>>> if (!disable_irq) > >>>> local_irq_disable(); > >>>> > >>>> rcu_momentary_dyntick_idle(); > >>>> > >>>> if (!disable_irq) > >>>> local_irq_enable(); > >>>> > >>>> And no cond_resched() is needed. > >>> > >>> Even given that CONFIG_PREEMPT_RCU=n still exists, the fact that > >>> run_osnoise() is running in kthread context with preemption and everything > >>> else enabled (am I right?), then the change you suggest should work fine. > >> > >> There's a user space option that lets you run that loop with preemption and/or > >> interrupts disabled. > > > > Ah, thank you. Then as long as this function is not expecting an RCU > > reader to span that call to rcu_momentary_dyntick_idle(), all is well. > > This is a kthread, so there cannot be something else expecting an RCU > > reader to span that call. > > Sorry for the delay, this thread is quite long (and I admit I should be paying > attention to it). > > It seems that you both figure it out without me anyways. This piece of > code is preemptive unless a config is set to disable irq or preemption (as > steven mentioned). That call is just a ping to RCU to say that things > are fine. > > So Steven's suggestion should work. Very good! > >>>>> Again. There is no non-preemtible RCU with this model, unless I'm > >>>>> missing something important here. > >>>> > >>>> Daniel? > >>> > >>> But very happy to defer to Daniel. ;-) > >> > >> But Daniel could also correct me ;-) > > > > If he figures out a way that it is broken, he gets to fix it. ;-) > > It works for me, keep in the loop for the patches and I can test and > adjust osnoise accordingly. osnoise should not be a reason to block more > important things like this patch set, and we can find a way out in > the osnoise tracer side. (I might need an assistance from rcu > people, but I know I can count on them :-). For good or for bad, we will be here. ;-) Thanx, Paul
Thomas! On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote: > Paul! > > On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote: > > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: > >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: > >> > Belatedly calling out some RCU issues. Nothing fatal, just a > >> > (surprisingly) few adjustments that will need to be made. The key thing > >> > to note is that from RCU's viewpoint, with this change, all kernels > >> > are preemptible, though rcu_read_lock() readers remain > >> > non-preemptible. > >> > >> Why? Either I'm confused or you or both of us :) > > > > Isn't rcu_read_lock() defined as preempt_disable() and rcu_read_unlock() > > as preempt_enable() in this approach? I certainly hope so, as RCU > > priority boosting would be a most unwelcome addition to many datacenter > > workloads. > > Sure, but that's an orthogonal problem, really. Orthogonal, parallel, skew, whatever, it and its friends still need to be addressed. > >> With this approach the kernel is by definition fully preemptible, which > >> means means rcu_read_lock() is preemptible too. That's pretty much the > >> same situation as with PREEMPT_DYNAMIC. > > > > Please, just no!!! > > > > Please note that the current use of PREEMPT_DYNAMIC with preempt=none > > avoids preempting RCU read-side critical sections. This means that the > > distro use of PREEMPT_DYNAMIC has most definitely *not* tested preemption > > of RCU readers in environments expecting no preemption. > > It does not _avoid_ it, it simply _prevents_ it by not preempting in > preempt_enable() and on return from interrupt so whatever sets > NEED_RESCHED has to wait for a voluntary invocation of schedule(), > cond_resched() or return to user space. A distinction without a difference. ;-) > But under the hood RCU is fully preemptible and the boosting logic is > active, but it does not have an effect until one of those preemption > points is reached, which makes the boosting moot. And for many distros, this appears to be just fine, not that I personally know of anyone running large numbers of systems in production with kernels built with CONFIG_PREEMPT_DYNAMIC=y and booted with preempt=none. And let's face it, if you want exactly the same binary to support both modes, you are stuck with the fully-preemptible implementation of RCU. But we should not make a virtue of such a distro's necessity. And some of us are not afraid to build our own kernels, which allows us to completely avoid the added code required to make RCU read-side critical sections be preemptible. > >> For throughput sake this fully preemptible kernel provides a mechanism > >> to delay preemption for SCHED_OTHER tasks, i.e. instead of setting > >> NEED_RESCHED the scheduler sets NEED_RESCHED_LAZY. > >> > >> That means the preemption points in preempt_enable() and return from > >> interrupt to kernel will not see NEED_RESCHED and the tasks can run to > >> completion either to the point where they call schedule() or when they > >> return to user space. That's pretty much what PREEMPT_NONE does today. > >> > >> The difference to NONE/VOLUNTARY is that the explicit cond_resched() > >> points are not longer required because the scheduler can preempt the > >> long running task by setting NEED_RESCHED instead. > >> > >> That preemption might be suboptimal in some cases compared to > >> cond_resched(), but from my initial experimentation that's not really an > >> issue. > > > > I am not (repeat NOT) arguing for keeping cond_resched(). 
I am instead > > arguing that the less-preemptible variants of the kernel should continue > > to avoid preempting RCU read-side critical sections. > > That's the whole point of the lazy mechanism: > > It avoids (repeat AVOIDS) preemption of any kernel code as much as it > can by _not_ setting NEED_RESCHED. > > The only difference is that it does not _prevent_ it like > preempt=none does. It will preempt when NEED_RESCHED is set. > > Now the question is when will NEED_RESCHED be set? > > 1) If the preempting task belongs to a scheduling class above > SCHED_OTHER > > This is a PoC implementation detail. The lazy mechanism can be > extended to any other scheduling class w/o a big effort. > > I deliberately did not do that because: > > A) I'm lazy > > B) More importantly I wanted to demonstrate that as long as > there are only SCHED_OTHER tasks involved there is no forced > (via NEED_RESCHED) preemption unless the to be preempted task > ignores the lazy resched request, which proves that > cond_resched() can be avoided. > > At the same time such a kernel allows a RT task to preempt at > any time. > > 2) If the to be preempted task does not react within a certain time > frame (I used a full tick in my PoC) on the NEED_RESCHED_LAZY > request, which is the prerequisite to get rid of cond_resched() > and related muck. > > That's obviously mandatory for getting rid of cond_resched() and > related muck, no? Keeping firmly in mind that there are no cond_resched() calls within RCU read-side critical sections, sure. Or, if you prefer, any such calls are bugs. And agreed, outside of atomic contexts (in my specific case, including RCU readers), there does eventually need to be a preemption. > I concede that there are a lot of details to be discussed before we get > there, but I don't see a real show stopper yet. Which is what I have been saying as well, at least as long as we can have a way of building a kernel with a non-preemptible build of RCU. And not just a preemptible RCU in which the scheduler (sometimes?) refrains from preempting the RCU read-side critical sections, but really only having the CONFIG_PREEMPT_RCU=n code built. Give or take the needs of the KLP guys, but again, I must defer to them. > The important point is that the details are basically boiling down to > policy decisions in the scheduler which are aided by hints from the > programmer. > > As I said before we might end up with something like > > preempt_me_not_if_not_absolutely_required(); > .... > preempt_me_I_dont_care(); > > (+/- name bike shedding) to give the scheduler a better understanding of > the context. > > Something like that has distinct advantages over the current situation > with all the cond_resched() muck: > > 1) It is clearly scope based > > 2) It is properly nesting > > 3) It can be easily made implicit for existing scope constructs like > rcu_read_lock/unlock() or regular locking mechanisms. You know, I was on board with throwing cond_resched() overboard (again, give or take whatever KLP might need) when I first read of this in that LWN article. You therefore cannot possibly gain anything by continuing to sell it to me, and, worse yet, you might provoke an heretofore-innocent bystander into pushing some bogus but convincing argument against. ;-) Yes, there are risks due to additional state space exposed by the additional preemption. However, at least some of this is already covered by quite a few people running preemptible kernels. 
There will be some not covered, given our sensitivity to low-probability bugs, but there should also be some improvements in tail latency. The process of getting the first cond_resched()-free kernel deployed will therefore likely be a bit painful, but overall the gains should be worth the pain. > The important point is that at the very end the scheduler has the > ultimate power to say: "Not longer Mr. Nice Guy" without the risk of any > random damage due to the fact that preemption count is functional, which > makes your life easier as well as you admitted already. But that does > not mean you can eat the cake and still have it. :) Which is exactly why I need rcu_read_lock() to map to preempt_disable() and rcu_read_unlock() to preempt_enable(). ;-) > That said, I completely understand your worries about the consequences, > but please take the step back and look at it from a conceptual point of > view. Conceptual point of view? That sounds suspiciously academic. Who are you and what did you do with the real Thomas Gleixner? ;-) But yes, consequences are extremely important, as always. > The goal is to replace the hard coded (Kconfig or DYNAMIC) policy > mechanisms with a flexible scheduler controlled policy mechanism. Are you saying that CONFIG_PREEMPT_RT will also be selected at boot time and/or via debugfs? > That allows you to focus on one consolidated model and optimize that > for particular policy scenarios instead of dealing with optimizing the > hell out of hardcoded policies which force you to come up with > horrible workaround for each of them. > > Of course the policies have to be defined (scheduling classes affected > depending on model, hint/annotation meaning etc.), but that's way more > palatable than what we have now. Let me give you a simple example: > > Right now the only way out on preempt=none when a rogue code path > which lacks a cond_resched() fails to release the CPU is a big fat > stall splat and a hosed machine. > > I rather prefer to have the fully controlled hammer ready which keeps > the machine usable and the situation debuggable. > > You still can yell in dmesg, but that again is a flexible policy > decision and not hard coded by any means. And I have agreed from my first read of that LWN article that allowing preemption of code where preempt_count()=0 is a good thing. The only thing that I am pushing back on is specifially your wish to always be running the CONFIG_PREEMPT_RCU=y RCU code. Yes, that is what single-binary distros will do, just as they do now. But again, some of us are happy to build our own kernels. There might be other things that I should be pushing back on, but that is all that I am aware of right now. ;-) > >> > 3. For nohz_full CPUs that run for a long time in the kernel, > >> > there are no scheduling-clock interrupts. RCU reaches for > >> > the resched_cpu() hammer a few jiffies into the grace period. > >> > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's > >> > interrupt-entry code will re-enable its scheduling-clock interrupt > >> > upon receiving the resched_cpu() IPI. > >> > >> You can spare the IPI by setting NEED_RESCHED on the remote CPU which > >> will cause it to preempt. > > > > That is not sufficient for nohz_full CPUs executing in userspace, > > That's not what I was talking about. You said: > > >> > 3. For nohz_full CPUs that run for a long time in the kernel, > ^^^^^^ > Duh! I did not realize that you meant user space. For user space there > is zero difference to the current situation. 
Once the task is out in > user space it's out of RCU side critical sections, so that's obiously > not a problem. > > As I said: I might be confused. :) And I might well also be confused. Here is my view for nohz_full CPUs:

o Running in userspace. RCU will ignore them without disturbing the CPU, courtesy of context tracking. As you say, there is no way (absent extremely strange sidechannel attacks) to have a kernel RCU read-side critical section here. These CPUs will ignore NEED_RESCHED until they exit usermode one way or another. This exit will usually be supplied by the scheduler's wakeup IPI for the newly awakened task. But just setting NEED_RESCHED without otherwise getting the CPU's full attention won't have any effect.

o Running in the kernel entry/exit code. RCU will ignore them without disturbing the CPU, courtesy of context tracking. Unlike usermode, you can type rcu_read_lock(), but if you do, lockdep will complain bitterly. Assuming the time in the kernel is sharply bounded, as it usually will be, these CPUs will respond to NEED_RESCHED in a timely manner. For longer times in the kernel, please see below.

o Running in the kernel in deep idle, that is, where RCU is not watching. RCU will ignore them without disturbing the CPU, courtesy of context tracking. As with the entry/exit code, you can type rcu_read_lock(), but if you do, lockdep will complain bitterly. The exact response to NEED_RESCHED depends on the type of idle loop, with (as I understand it) polling idle loops responding quickly and other idle loops needing some event to wake up the CPU. This event is typically an IPI, as is the case when the scheduler wakes up a task on the CPU in question.

o Running in other parts of the kernel, but with scheduling clock interrupt enabled. The next scheduling clock interrupt will take care of both RCU and NEED_RESCHED. Give or take policy decisions, as you say above.

o Running in other parts of the kernel, but with scheduling clock interrupt disabled. If there is a grace period waiting on this CPU, RCU will eventually set a flag and invoke resched_cpu(), which will get the CPU's attention via an IPI and will also turn the scheduling clock interrupt back on. I believe that a wakeup from the scheduler has the same effect, and that it uses an IPI to get the CPU's attention when needed, but it has been one good long time since I traced out all the details.

However, given that there is to be no cond_resched(), setting NEED_RESCHED without doing something like an IPI to get that CPU's attention will still not be guaranteed to have any effect, just as with the nohz_full CPU executing in userspace, correct? Did I miss anything? > >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob > >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to > >> CONFIG_RT or such as it does not really change the preemption > >> model itself. RT just reduces the preemption disabled sections with the > >> lock conversions, forced interrupt threading and some more. > > > > Again, please, no. > > > > There are situations where we still need rcu_read_lock() and > > rcu_read_unlock() to be preempt_disable() and preempt_enable(), > > repectively. Those can be cases selected only by Kconfig option, not > > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. > > Why are you so fixated on making everything hardcoded instead of making > it a proper policy decision problem. See above. Because I am one of the people who will bear the consequences.
In that same vein, why are you so opposed to continuing to provide the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code is already in place, is extremely well tested, and you need to handle preempt_disable()/preeempt_enable() regions of code in any case. What is the real problem here? > >> > 8. As has been noted elsewhere, in this new limited-preemption > >> > mode of operation, rcu_read_lock() readers remain preemptible. > >> > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain. > >> > >> Why? You fundamentally have a preemptible kernel with PREEMPT_RCU, no? > > > > That is in fact the problem. Preemption can be good, but it is possible > > to have too much of a good thing, and preemptible RCU read-side critical > > sections definitely is in that category for some important workloads. ;-) > > See above. > > >> > 10. The cond_resched_rcu() function must remain because we still > >> > have non-preemptible rcu_read_lock() readers. > >> > >> Where? > > > > In datacenters. > > See above. > > >> > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function > >> > might need to do something for non-preemptible RCU to make > >> > up for the lack of cond_resched() calls. Maybe just drop the > >> > "IS_ENABLED()" and execute the body of the current "if" statement > >> > unconditionally. > >> > >> Again. There is no non-preemtible RCU with this model, unless I'm > >> missing something important here. > > > > And again, there needs to be non-preemptible RCU with this model. > > See above. And back at you with all three instances of "See above". ;-) Thanx, Paul
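For the last item in the list above (scheduling clock interrupt disabled), the hammer being referred to looks roughly like the following. This is loosely modeled on RCU's forced-quiescent-state handling; the function name, the gp_start parameter, and the threshold are illustrative rather than exact.

/* Sketch: poke a holdout nohz_full CPU whose scheduling-clock tick is off. */
static void rcu_poke_holdout_sketch(int cpu, unsigned long gp_start)
{
	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

	if (time_after(jiffies, gp_start + jiffies_to_sched_qs)) {
		/*
		 * Interrupt entry on the holdout CPU sees this flag, turns
		 * the scheduling-clock interrupt back on, and reports a
		 * quiescent state soon thereafter.
		 */
		WRITE_ONCE(rdp->rcu_urgent_qs, true);
		resched_cpu(cpu);	/* the IPI is what gets the CPU's attention */
	}
}

Which is consistent with Paul's point that, once cond_resched() is gone, merely setting NEED_RESCHED on such a CPU is not guaranteed to do anything; something still has to interrupt it.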
On Thu, Oct 19, 2023 at 12:13:31PM -0700, Paul E. McKenney wrote: > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote: > > On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote: > > > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: > > >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: [ . . . ] > > >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob > > >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to > > >> CONFIG_RT or such as it does not really change the preemption > > >> model itself. RT just reduces the preemption disabled sections with the > > >> lock conversions, forced interrupt threading and some more. > > > > > > Again, please, no. > > > > > > There are situations where we still need rcu_read_lock() and > > > rcu_read_unlock() to be preempt_disable() and preempt_enable(), > > > repectively. Those can be cases selected only by Kconfig option, not > > > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. > > > > Why are you so fixated on making everything hardcoded instead of making > > it a proper policy decision problem. See above. > > Because I am one of the people who will bear the consequences. > > In that same vein, why are you so opposed to continuing to provide > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code > is already in place, is extremely well tested, and you need to handle > preempt_disable()/preeempt_enable() regions of code in any case. What is > the real problem here? I should hasten to add that from a conceptual viewpoint, I do support the eventual elimination of CONFIG_PREEMPT_RCU=n code, but with emphasis on the word "eventual". Although preemptible RCU is plenty reliable if you are running only a few thousand servers (and maybe even a few tens of thousands), it has some improving to do before I will be comfortable recommending its use in a large-scale datacenters. And yes, I know about Android deployments. But those devices tend to spend very little time in the kernel, in fact, many of them tend to spend very little time powered up. Plus they tend to have relatively few CPUs, at least by 2020s standards. So it takes a rather large number of Android devices to impose the same stress on the kernel that is imposed by a single mid-sized server. And we are working on making preemptible RCU more reliable. One nice change over the past 5-10 years is that more people are getting serious about digging into the RCU code, testing it, and reporting and fixing the resulting bugs. I am also continuing to make rcutorture more vicious, and of course I am greatly helped by the easier availability of hardware with which to test RCU. If this level of activity continues for another five years, then maybe preemptible RCU will be ready for large datacenter deployments. But I am guessing that you had something in mind in addition to code consolidation. Thanx, Paul
Paul E. McKenney <paulmck@kernel.org> writes: > Thomas! > > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote: >> Paul! >> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote: >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to >> >> CONFIG_RT or such as it does not really change the preemption >> >> model itself. RT just reduces the preemption disabled sections with the >> >> lock conversions, forced interrupt threading and some more. >> > >> > Again, please, no. >> > >> > There are situations where we still need rcu_read_lock() and >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(), >> > repectively. Those can be cases selected only by Kconfig option, not >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. >> >> Why are you so fixated on making everything hardcoded instead of making >> it a proper policy decision problem. See above. > > Because I am one of the people who will bear the consequences. > > In that same vein, why are you so opposed to continuing to provide > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code > is already in place, is extremely well tested, and you need to handle > preempt_disable()/preeempt_enable() regions of code in any case. What is > the real problem here? I have a somewhat related question. What ties PREEMPTION=y to PREEMPT_RCU=y? I see e72aeafc66 ("rcu: Remove prompt for RCU implementation") from 2015, stating that the only possible choice for PREEMPTION=y kernels is PREEMPT_RCU=y: The RCU implementation is chosen based on PREEMPT and SMP config options and is not really a user-selectable choice. This commit removes the menu entry, given that there is not much point in calling something a choice when there is in fact no choice.. The TINY_RCU, TREE_RCU, and PREEMPT_RCU Kconfig options continue to be selected based solely on the values of the PREEMPT and SMP options. As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly stronger forward progress guarantees with respect to rcu readers (in that they can't be preempted.) So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something obvious there. Thanks -- ankur
On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > Thomas! > > > > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote: > >> Paul! > >> > >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote: > >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: > >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: > >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob > >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to > >> >> CONFIG_RT or such as it does not really change the preemption > >> >> model itself. RT just reduces the preemption disabled sections with the > >> >> lock conversions, forced interrupt threading and some more. > >> > > >> > Again, please, no. > >> > > >> > There are situations where we still need rcu_read_lock() and > >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(), > >> > repectively. Those can be cases selected only by Kconfig option, not > >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. > >> > >> Why are you so fixated on making everything hardcoded instead of making > >> it a proper policy decision problem. See above. > > > > Because I am one of the people who will bear the consequences. > > > > In that same vein, why are you so opposed to continuing to provide > > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code > > is already in place, is extremely well tested, and you need to handle > > preempt_disable()/preeempt_enable() regions of code in any case. What is > > the real problem here? > > I have a somewhat related question. What ties PREEMPTION=y to PREEMPT_RCU=y? This Kconfig block in kernel/rcu/Kconfig:

------------------------------------------------------------------------

config PREEMPT_RCU
	bool
	default y if PREEMPTION
	select TREE_RCU
	help
	  This option selects the RCU implementation that is designed
	  for very large SMP systems with hundreds or thousands of CPUs,
	  but for which real-time response is also required. It also
	  scales down nicely to smaller systems.

	  Select this option if you are unsure.

------------------------------------------------------------------------

There is no prompt string after the "bool", so it is not user-settable. Therefore, it is driven directly off of the value of PREEMPTION, taking the global default of "n" if PREEMPTION is not set and "y" otherwise. You could change the second line to read:

	bool "Go ahead! Make my day!"

or preferably something more helpful. This change would allow a preemptible kernel to be built with non-preemptible RCU and vice versa, as used to be the case long ago. However, it might be way better to drive the choice from some other Kconfig option and leave out the prompt string. > I see e72aeafc66 ("rcu: Remove prompt for RCU implementation") from > 2015, stating that the only possible choice for PREEMPTION=y kernels > is PREEMPT_RCU=y: > > The RCU implementation is chosen based on PREEMPT and SMP config options > and is not really a user-selectable choice. This commit removes the > menu entry, given that there is not much point in calling something a > choice when there is in fact no choice.. The TINY_RCU, TREE_RCU, and > PREEMPT_RCU Kconfig options continue to be selected based solely on the > values of the PREEMPT and SMP options. The main point of this commit was to reduce testing effort and sysadm confusion by removing choices that were not necessary back then.
> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly > stronger forward progress guarantees with respect to rcu readers (in > that they can't be preempted.) TREE_RCU=y is absolutely required if you want a kernel to run on a system with more than one CPU, and for that matter, if you want preemptible RCU, even on a single-CPU system. > So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something > obvious there. If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you can run any combination: PREEMPTION && PREEMPT_RCU: This is what we use today for preemptible kernels, so this works just fine (famous last words). PREEMPTION && !PREEMPT_RCU: A preemptible kernel with non-preemptible RCU, so that rcu_read_lock() is preempt_disable() and rcu_read_unlock() is preempt_enable(). This should just work, except for the fact that cond_resched() disappears, which stymies some of RCU's forward-progress mechanisms. And this was the topic of our earlier discussion on this thread. The fixes should not be too hard. Of course, this has not been either tested or used for at least eight years, so there might be some bitrot. If so, I will of course be happy to help fix it. !PREEMPTION && PREEMPT_RCU: A non-preemptible kernel with preemptible RCU. Although this particular combination of Kconfig options has not been tested for at least eight years, giving a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none kernel boot parameter gets you pretty close. Again, there is likely to be some bitrot somewhere, but way fewer bits to rot than for PREEMPTION && !PREEMPT_RCU. Outside of the current CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this combination, but if there is a need and if it is broken, I will be happy to help fix it. !PREEMPTION && !PREEMPT_RCU: A non-preemptible kernel with non-preemptible RCU, which is what we use today for non-preemptible kernels built with CONFIG_PREEMPT_DYNAMIC=n. So to repeat those famous last works, this works just fine. Does that help, or am I missing the point of your question? Thanx, Paul
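The "stymies some of RCU's forward-progress mechanisms" remark for the PREEMPTION && !PREEMPT_RCU combination refers to the quiescent-state report that currently rides along with cond_resched(); a simplified sketch of the upstream helper:

int cond_resched_sketch(void)
{
	if (should_resched(0)) {
		preempt_schedule_common();
		return 1;
	}
#ifndef CONFIG_PREEMPT_RCU
	rcu_all_qs();	/* the QS report that disappears along with cond_resched() */
#endif
	return 0;
}

This also explains the side note earlier in the thread that a PREEMPT_DYNAMIC=y, preempt=none kernel never reaches rcu_all_qs() from _cond_resched(): such a kernel is built with PREEMPT_RCU, so the call is compiled out.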
Paul E. McKenney <paulmck@kernel.org> writes: > On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote: >> >> Paul E. McKenney <paulmck@kernel.org> writes: >> >> > Thomas! >> > >> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote: >> >> Paul! >> >> >> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote: >> >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: >> >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: >> >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob >> >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to >> >> >> CONFIG_RT or such as it does not really change the preemption >> >> >> model itself. RT just reduces the preemption disabled sections with the >> >> >> lock conversions, forced interrupt threading and some more. >> >> > >> >> > Again, please, no. >> >> > >> >> > There are situations where we still need rcu_read_lock() and >> >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(), >> >> > repectively. Those can be cases selected only by Kconfig option, not >> >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. >> >> >> >> Why are you so fixated on making everything hardcoded instead of making >> >> it a proper policy decision problem. See above. >> > >> > Because I am one of the people who will bear the consequences. >> > >> > In that same vein, why are you so opposed to continuing to provide >> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code >> > is already in place, is extremely well tested, and you need to handle >> > preempt_disable()/preeempt_enable() regions of code in any case. What is >> > the real problem here? >> [ snip ] >> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly >> stronger forward progress guarantees with respect to rcu readers (in >> that they can't be preempted.) > > TREE_RCU=y is absolutely required if you want a kernel to run on a system > with more than one CPU, and for that matter, if you want preemptible RCU, > even on a single-CPU system. > >> So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something >> obvious there. > > If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you > can run any combination: Sorry, yes I did. Should have said "can PREEMPTION=y run with, (TREE_RCU=y, PREEMPT_RCU=n). > PREEMPTION && PREEMPT_RCU: This is what we use today for preemptible > kernels, so this works just fine (famous last words). > > PREEMPTION && !PREEMPT_RCU: A preemptible kernel with non-preemptible > RCU, so that rcu_read_lock() is preempt_disable() and > rcu_read_unlock() is preempt_enable(). This should just work, > except for the fact that cond_resched() disappears, which > stymies some of RCU's forward-progress mechanisms. And this > was the topic of our earlier discussion on this thread. The > fixes should not be too hard. > > Of course, this has not been either tested or used for at least > eight years, so there might be some bitrot. If so, I will of > course be happy to help fix it. > > > !PREEMPTION && PREEMPT_RCU: A non-preemptible kernel with preemptible > RCU. Although this particular combination of Kconfig > options has not been tested for at least eight years, giving > a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none > kernel boot parameter gets you pretty close. Again, there is > likely to be some bitrot somewhere, but way fewer bits to rot > than for PREEMPTION && !PREEMPT_RCU. 
Outside of the current > CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this > combination, but if there is a need and if it is broken, I will > be happy to help fix it. > > !PREEMPTION && !PREEMPT_RCU: A non-preemptible kernel with non-preemptible > RCU, which is what we use today for non-preemptible kernels built > with CONFIG_PREEMPT_DYNAMIC=n. So to repeat those famous last > works, this works just fine. > > Does that help, or am I missing the point of your question? It does indeed. What I was going for, is that this series (or, at least my adaptation of TGLX's PoC) wants to keep CONFIG_PREEMPTION in spirit, while doing away with it as a compile-time config option. That it does, as TGLX mentioned upthread, by moving all of the policy to the scheduler, which can be tuned by user-space (via sched-features.) So, my question was in response to this: >> > In that same vein, why are you so opposed to continuing to provide >> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code >> > is already in place, is extremely well tested, and you need to handle >> > preempt_disable()/preeempt_enable() regions of code in any case. What is >> > the real problem here? Based on your response the (PREEMPT_RCU=n, TREE_RCU=y) configuration seems to be eminently usable with this configuration. (Or maybe I'm missed the point of that discussion.) On a related note, I had started rcutorture on a (PREEMPTION=y, PREEMPT_RCU=n, TREE_RCU=y) kernel some hours ago. Nothing broken (yet!). -- ankur
On Fri, Oct 20, 2023 at 06:05:21PM -0700, Ankur Arora wrote: > > Paul E. McKenney <paulmck@kernel.org> writes: > > > On Fri, Oct 20, 2023 at 03:56:38PM -0700, Ankur Arora wrote: > >> > >> Paul E. McKenney <paulmck@kernel.org> writes: > >> > >> > Thomas! > >> > > >> > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote: > >> >> Paul! > >> >> > >> >> On Wed, Oct 18 2023 at 10:19, Paul E. McKenney wrote: > >> >> > On Wed, Oct 18, 2023 at 03:16:12PM +0200, Thomas Gleixner wrote: > >> >> >> On Tue, Oct 17 2023 at 18:03, Paul E. McKenney wrote: > >> >> >> In the end there is no CONFIG_PREEMPT_XXX anymore. The only knob > >> >> >> remaining would be CONFIG_PREEMPT_RT, which should be renamed to > >> >> >> CONFIG_RT or such as it does not really change the preemption > >> >> >> model itself. RT just reduces the preemption disabled sections with the > >> >> >> lock conversions, forced interrupt threading and some more. > >> >> > > >> >> > Again, please, no. > >> >> > > >> >> > There are situations where we still need rcu_read_lock() and > >> >> > rcu_read_unlock() to be preempt_disable() and preempt_enable(), > >> >> > repectively. Those can be cases selected only by Kconfig option, not > >> >> > available in kernels compiled with CONFIG_PREEMPT_DYNAMIC=y. > >> >> > >> >> Why are you so fixated on making everything hardcoded instead of making > >> >> it a proper policy decision problem. See above. > >> > > >> > Because I am one of the people who will bear the consequences. > >> > > >> > In that same vein, why are you so opposed to continuing to provide > >> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code > >> > is already in place, is extremely well tested, and you need to handle > >> > preempt_disable()/preeempt_enable() regions of code in any case. What is > >> > the real problem here? > >> > > [ snip ] > > >> As far as I can tell (which isn't all that far), TREE_RCU=y makes strictly > >> stronger forward progress guarantees with respect to rcu readers (in > >> that they can't be preempted.) > > > > TREE_RCU=y is absolutely required if you want a kernel to run on a system > > with more than one CPU, and for that matter, if you want preemptible RCU, > > even on a single-CPU system. > > > >> So, can PREEMPTION=y run with, say TREE_RCU=y? Or maybe I'm missing something > >> obvious there. > > > > If you meant to ask about PREEMPTION and PREEMPT_RCU, in theory, you > > can run any combination: > > Sorry, yes I did. Should have said "can PREEMPTION=y run with, (TREE_RCU=y, > PREEMPT_RCU=n). > > > PREEMPTION && PREEMPT_RCU: This is what we use today for preemptible > > kernels, so this works just fine (famous last words). > > > > PREEMPTION && !PREEMPT_RCU: A preemptible kernel with non-preemptible > > RCU, so that rcu_read_lock() is preempt_disable() and > > rcu_read_unlock() is preempt_enable(). This should just work, > > except for the fact that cond_resched() disappears, which > > stymies some of RCU's forward-progress mechanisms. And this > > was the topic of our earlier discussion on this thread. The > > fixes should not be too hard. > > > > Of course, this has not been either tested or used for at least > > eight years, so there might be some bitrot. If so, I will of > > course be happy to help fix it. > > > > > > !PREEMPTION && PREEMPT_RCU: A non-preemptible kernel with preemptible > > RCU. 
Although this particular combination of Kconfig > > options has not been tested for at least eight years, giving > > a kernel built with CONFIG_PREEMPT_DYNAMIC=y the preempt=none > > kernel boot parameter gets you pretty close. Again, there is > > likely to be some bitrot somewhere, but way fewer bits to rot > > than for PREEMPTION && !PREEMPT_RCU. Outside of the current > > CONFIG_PREEMPT_DYNAMIC=y case, I don't see the need for this > > combination, but if there is a need and if it is broken, I will > > be happy to help fix it. > > > > !PREEMPTION && !PREEMPT_RCU: A non-preemptible kernel with non-preemptible > > RCU, which is what we use today for non-preemptible kernels built > > with CONFIG_PREEMPT_DYNAMIC=n. So to repeat those famous last > > works, this works just fine. > > > > Does that help, or am I missing the point of your question? > > It does indeed. What I was going for, is that this series (or, at > least my adaptation of TGLX's PoC) wants to keep CONFIG_PREEMPTION > in spirit, while doing away with it as a compile-time config option. > > That it does, as TGLX mentioned upthread, by moving all of the policy > to the scheduler, which can be tuned by user-space (via sched-features.) > > So, my question was in response to this: > > >> > In that same vein, why are you so opposed to continuing to provide > >> > the ability to build a kernel with CONFIG_PREEMPT_RCU=n? This code > >> > is already in place, is extremely well tested, and you need to handle > >> > preempt_disable()/preeempt_enable() regions of code in any case. What is > >> > the real problem here? > > Based on your response the (PREEMPT_RCU=n, TREE_RCU=y) configuration > seems to be eminently usable with this configuration. > > (Or maybe I'm missed the point of that discussion.) > > On a related note, I had started rcutorture on a (PREEMPTION=y, PREEMPT_RCU=n, > TREE_RCU=y) kernel some hours ago. Nothing broken (yet!). Thank you, and here is hoping! ;-) Thanx, Paul
Paul! On Thu, Oct 19 2023 at 12:13, Paul E. McKenney wrote: > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote: >> The important point is that at the very end the scheduler has the >> ultimate power to say: "Not longer Mr. Nice Guy" without the risk of any >> random damage due to the fact that preemption count is functional, which >> makes your life easier as well as you admitted already. But that does >> not mean you can eat the cake and still have it. :) > > Which is exactly why I need rcu_read_lock() to map to preempt_disable() > and rcu_read_unlock() to preempt_enable(). ;-) After reading back in the thread, I think we greatly talked past each other mostly due to the different expectations and the resulting dependencies which seem to be hardwired into our brains. I'm pleading guilty as charged as I failed completely to read your initial statement "The key thing to note is that from RCU's viewpoint, with this change, all kernels are preemptible, though rcu_read_lock() readers remain non-preemptible." with that in mind and instead of dissecting it properly I committed the fallacy of stating exactly the opposite, which obviously reflects only the point of view I'm coming from. With a fresh view, this turns out to be a complete non-problem because there is no semantical dependency between the preemption model and the RCU flavour. The unified kernel preemption model has the following properties: 1) It provides full preemptive multitasking. 2) Preemptability is limited by implicit and explicit mechanisms. 3) The ability to avoid overeager preemption for SCHED_OTHER tasks via the PREEMPT_LAZY mechanism. This emulates the NONE/VOLUNTARY preemption models which semantically provide collaborative multitasking. This emulation is not breaking the semantical properties of full preemptive multitasking because the scheduler still has the ability to enforce immediate preemption under consideration of #2. Which in turn is a prerequiste for removing the semantically ill-defined cond/might_resched() constructs. The compile time selectable RCU flavour (preemptible/non-preemptible) is not imposing a semantical change on this unified preemption model. The selection of the RCU flavour is solely affecting the preemptability (#2 above). Selecting non-preemptible RCU reduces preemptability by adding an implicit restriction via mapping rcu_read_lock() to preempt_disable(). IOW, the current upstream enforcement of RCU_PREEMPT=n when PREEMPTION=n is only enforced by the the lack of the full preempt counter in PREEMPTION=n configs. Once the preemption counter is always enabled this hardwired dependency goes away. Even PREEMPT_DYNAMIC should just work with RCU_PREEMPT=n today because with PREEMPT_DYNAMIC the preemption counter is unconditionally available. So that makes these hardwired dependencies go away in practice and hopefully soon from our mental models too :) RT will keep its hard dependency on RCU_PREEMPT in the same way it depends hard on forced interrupt threading and other minor details to enable the spinlock substitution. >> That said, I completely understand your worries about the consequences, >> but please take the step back and look at it from a conceptual point of >> view. > > Conceptual point of view? That sounds suspiciously academic. Hehehe. > Who are you and what did you do with the real Thomas Gleixner? ;-) The point I'm trying to make is not really academic, it comes from a very practical point of view. 
As you know for almost two decades I'm mostly busy with janitoring and mopping up the kernel. A major takeaway from this eclectic experience is that there is a tendency to implement very specialized solutions for different classes of use cases. The reasons to do so in the first place: 1) Avoid breaking the existing and established solutions: E.g. the initial separation of x8664 and i386 2) Enforcement due to dependencies on mechanisms, which are considered "harmful" for particular use cases E.g. Preemptible RCU, which is separate also due to #1 3) Because we can and something is sooo special You probably remember the full day we both spent in a room with SoC people to make them understand that their SoCs are not so special at all. :) So there are perfectly valid reasons (#1, #2) to separate things, but we really need to go back from time to time and think hard about the question whether a particular separation is still justified. This is especially true when dependencies or prerequisites change. But in many cases we just keep going, take the separation as set in stone forever and add features and workarounds on all ends without rethinking whether we could unify these things for the better. The real bad thing about this is that the more we add to the separation the harder consolidation or unification becomes. Granted that my initial take of consolidating on preemptible RCU might be too brisk or too naive, but I still think that with the prospect of an unified preemption model it's at least worth to have a very close look at this question. Not asking such questions or dismissing them upfront is a real danger for the long term sustainability and maintainability of the kernel in my opinion. Especially when the few people who actively "janitor" these things are massively outnumbered by people who indulge in specialization. :) That said, the real Thomas Gleixner and his grumpy self are still there, just slightly tired of handling the slurry brush all day long :) Thanks, tglx
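The "ask nicely first, escalate if ignored" behaviour that properties #2 and #3 above amount to can be sketched roughly as follows. This is loosely modeled on the update_deadline() hunk from the PREEMPT_AUTO-based patch that appears later in the thread; rq_selected_needs_force() is a purely illustrative stand-in for whatever policy check decides that the waiting task may not be deferred:

/* Illustrative sketch of lazy vs. forced preemption; not actual scheduler code. */
static void demand_resched(struct rq *rq, bool tick)
{
	if (rq_selected_needs_force(rq)) {	/* hypothetical helper, e.g. an RT/DL waiter */
		resched_curr(rq);		/* "no longer Mr. Nice Guy": folded into preempt count */
		return;
	}

	/* SCHED_OTHER: request preemption lazily first ... */
	if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY))
		resched_curr(rq);		/* ... force it if the request was ignored for a full tick */
	else
		resched_curr_lazy(rq);		/* honoured at the next return to user space */
}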
On Tue, 19 Sep 2023 01:42:03 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > 2) When the scheduler wants to set NEED_RESCHED due it sets > NEED_RESCHED_LAZY instead which is only evaluated in the return to > user space preemption points. > > As NEED_RESCHED_LAZY is not folded into the preemption count the > preemption count won't become zero, so the task can continue until > it hits return to user space. > > That preserves the existing behaviour. I'm looking into extending this concept to user space and to VMs. I'm calling this the "extended scheduler time slice" (ESTS pronounced "estis") The ideas is this. Have VMs/user space share a memory region with the kernel that is per thread/vCPU. This would be registered via a syscall or ioctl on some defined file or whatever. Then, when entering user space / VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set, it checks if the thread has this memory region and a special bit in it is set, and if it does, it does not schedule. It will treat it like a long kernel system call. The kernel will then set another bit in the shared memory region that will tell user space / VM that the kernel wanted to schedule, but is allowing it to finish its critical section. When user space / VM is done with the critical section, it will check the bit that may be set by the kernel and if it is set, it should do a sched_yield() or VMEXIT so that the kernel can now schedule it. What about DOS you say? It's no different than running a long system call. No task can run forever. It's not a "preempt disable", it's just "give me some more time". A "NEED_RESCHED" will always schedule, just like a kernel system call that takes a long time. The goal is to allow user space to get out of critical sections that we know can cause problems if they get preempted. Usually it's a user space / VM lock is held or maybe a VM interrupt handler that needs to wake up a task on another vCPU. If we are worried about abuse, we could even punish tasks that don't call sched_yield() by the time its extended time slice is taken. Even without that punishment, if we have EEVDF, this extension will make it less eligible the next time around. The goal is to prevent a thread / vCPU being preempted while holding a lock or resource that other threads / vCPUs will want. That is, prevent contention, as that's usually the biggest issue with performance in user space and VMs. I'm going to work on a POC, and see if I can get some benchmarks on how much this could help tasks like databases and VMs in general. -- Steve
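A minimal user-space sketch of the handshake described above follows. Everything here is hypothetical at this stage: the registration mechanism, the flag layout (bit 0 "please extend me", bit 1 "kernel wanted to schedule"), and the names are placeholders; the PoC later in the thread settles on an mmap of a sysfs file.

/* Hypothetical user-space side of the "extended time slice" handshake. */
#include <stdatomic.h>
#include <sched.h>

struct ests_map {				/* one page shared with the kernel, per thread */
	_Atomic unsigned long flags;		/* bit 0: please extend my time slice
						 * bit 1: kernel deferred a reschedule for us */
};

static _Thread_local struct ests_map *ests;	/* set up via the (hypothetical) registration call */

static inline void critical_begin(void)
{
	if (ests)
		atomic_store_explicit(&ests->flags, 1UL, memory_order_relaxed);
}

static inline void critical_end(void)
{
	unsigned long prev;

	if (!ests)
		return;
	prev = atomic_exchange(&ests->flags, 0UL);
	if (prev & 2UL)				/* the kernel let us run on; give the CPU back now */
		sched_yield();
}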
On Tue, Oct 24, 2023 at 02:15:25PM +0200, Thomas Gleixner wrote: > Paul! > > On Thu, Oct 19 2023 at 12:13, Paul E. McKenney wrote: > > On Thu, Oct 19, 2023 at 02:21:35AM +0200, Thomas Gleixner wrote: > >> The important point is that at the very end the scheduler has the > >> ultimate power to say: "Not longer Mr. Nice Guy" without the risk of any > >> random damage due to the fact that preemption count is functional, which > >> makes your life easier as well as you admitted already. But that does > >> not mean you can eat the cake and still have it. :) > > > > Which is exactly why I need rcu_read_lock() to map to preempt_disable() > > and rcu_read_unlock() to preempt_enable(). ;-) > > After reading back in the thread, I think we greatly talked past each > other mostly due to the different expectations and the resulting > dependencies which seem to be hardwired into our brains. > > I'm pleading guilty as charged as I failed completely to read your > initial statement > > "The key thing to note is that from RCU's viewpoint, with this change, > all kernels are preemptible, though rcu_read_lock() readers remain > non-preemptible." > > with that in mind and instead of dissecting it properly I committed the > fallacy of stating exactly the opposite, which obviously reflects only > the point of view I'm coming from. > > With a fresh view, this turns out to be a complete non-problem because > there is no semantical dependency between the preemption model and the > RCU flavour. Agreed, and been there and done that myself, as you well know! ;-) > The unified kernel preemption model has the following properties: > > 1) It provides full preemptive multitasking. > > 2) Preemptability is limited by implicit and explicit mechanisms. > > 3) The ability to avoid overeager preemption for SCHED_OTHER tasks via > the PREEMPT_LAZY mechanism. > > This emulates the NONE/VOLUNTARY preemption models which > semantically provide collaborative multitasking. > > This emulation is not breaking the semantical properties of full > preemptive multitasking because the scheduler still has the ability > to enforce immediate preemption under consideration of #2. > > Which in turn is a prerequiste for removing the semantically > ill-defined cond/might_resched() constructs. > > The compile time selectable RCU flavour (preemptible/non-preemptible) is > not imposing a semantical change on this unified preemption model. > > The selection of the RCU flavour is solely affecting the preemptability > (#2 above). Selecting non-preemptible RCU reduces preemptability by > adding an implicit restriction via mapping rcu_read_lock() > to preempt_disable(). > > IOW, the current upstream enforcement of RCU_PREEMPT=n when PREEMPTION=n > is only enforced by the the lack of the full preempt counter in > PREEMPTION=n configs. Once the preemption counter is always enabled this > hardwired dependency goes away. > > Even PREEMPT_DYNAMIC should just work with RCU_PREEMPT=n today because > with PREEMPT_DYNAMIC the preemption counter is unconditionally > available. > > So that makes these hardwired dependencies go away in practice and > hopefully soon from our mental models too :) The real reason for tying RCU_PREEMPT to PREEMPTION back in the day was that there were no real-world uses of RCU_PREEMPT not matching PREEMPTION, so those combinations were ruled out in order to reduce the number of rcutorture scenarios. 
But now it appears that we do have a use case for PREEMPTION=y and RCU_PREEMPT=n, plus I have access to way more test hardware, so that the additional rcutorture scenarios are less of a testing burden. > RT will keep its hard dependency on RCU_PREEMPT in the same way it > depends hard on forced interrupt threading and other minor details to > enable the spinlock substitution. "other minor details". ;-) Making PREEMPT_RT select RCU_PREEMPT makes sense to me! > >> That said, I completely understand your worries about the consequences, > >> but please take the step back and look at it from a conceptual point of > >> view. > > > > Conceptual point of view? That sounds suspiciously academic. > > Hehehe. > > > Who are you and what did you do with the real Thomas Gleixner? ;-) > > The point I'm trying to make is not really academic, it comes from a > very practical point of view. As you know for almost two decades I'm > mostly busy with janitoring and mopping up the kernel. > > A major takeaway from this eclectic experience is that there is a > tendency to implement very specialized solutions for different classes > of use cases. > > The reasons to do so in the first place: > > 1) Avoid breaking the existing and established solutions: > > E.g. the initial separation of x8664 and i386 > > 2) Enforcement due to dependencies on mechanisms, which are > considered "harmful" for particular use cases > > E.g. Preemptible RCU, which is separate also due to #1 > > 3) Because we can and something is sooo special > > You probably remember the full day we both spent in a room with SoC > people to make them understand that their SoCs are not so special at > all. :) 4) Because we don't see a use for a given combination, and we want to keep test time down to a dull roar, as noted above. > So there are perfectly valid reasons (#1, #2) to separate things, but we > really need to go back from time to time and think hard about the > question whether a particular separation is still justified. This is > especially true when dependencies or prerequisites change. > > But in many cases we just keep going, take the separation as set in > stone forever and add features and workarounds on all ends without > rethinking whether we could unify these things for the better. The real > bad thing about this is that the more we add to the separation the > harder consolidation or unification becomes. > > Granted that my initial take of consolidating on preemptible RCU might > be too brisk or too naive, but I still think that with the prospect of > an unified preemption model it's at least worth to have a very close > look at this question. > > Not asking such questions or dismissing them upfront is a real danger > for the long term sustainability and maintainability of the kernel in my > opinion. Especially when the few people who actively "janitor" these > things are massively outnumbered by people who indulge in > specialization. :) Longer term, I do agree in principle with the notion of simplifying the Linux-kernel RCU implementation by eliminating the PREEMPT_RCU=n code. In the near term practice, here are the reasons for holding off on this consolidation: 1. Preemptible RCU needs more work for datacenter deployments, as mentioned earlier. I also reiterate that if you only have a few thousand (or maybe even a few tens of thousand) servers, preemptible RCU will be just fine for you. Give or take the safety criticality of your application. 2. 
RCU priority boosting has not yet been really tested and tuned for systems that are adequately but not generously endowed with memory. Boost too soon and you needlessly burn cycles and preempt important tasks. Boost too late and it is OOM for you! 3. To the best of my knowledge, the scheduler doesn't take memory footprint into account. In particular, if a long-running RCU reader is preempted in a memory-naive fashion, all we gain is turning a potentially unimportant latency outlier into a definitely important OOM. 4. There are probably a few gotchas that I haven't thought of or that I am forgetting. More likely, more than a few. As always! But to your point, yes, these are things that we should be able to do something about, given appropriate time and effort. My guess is five years, with the long pole being reliability. Preemptible RCU has been gone through line by line recently, which is an extremely good thing and an extremely welcome change from past practice, but that is just a start. That effort was getting people familiar with the code, and should not be mistaken for a find-lots-of-bugs review session, let alone a find-all-bugs review session. > That said, the real Thomas Gleixner and his grumpy self are still there, > just slightly tired of handling the slurry brush all day long :) Whew!!! Good to hear that the real Thomas Gleixner is still with us!!! ;-) Thanx, Paul
On Tue, 24 Oct 2023 10:34:26 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > I'm going to work on a POC, and see if I can get some benchmarks on how > much this could help tasks like databases and VMs in general. And that was much easier than I thought it would be. It also shows some great results! I started with Thomas's PREEMPT_AUTO.patch from the rt-devel tree: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/PREEMPT_AUTO.patch?h=v6.6-rc6-rt10-patches So you need to select: CONFIG_PREEMPT_AUTO The below is my proof of concept patch. It still has debugging in it, and I'm sure the interface will need to be changed. There's now a new file: /sys/kernel/extend_sched Attached is a program that tests it. It mmaps that file, with: struct extend_map { unsigned long flags; }; static __thread struct extend_map *extend_map; That is, there's this structure for every thread. It's assigned with: fd = open("/sys/kernel/extend_sched", O_RDWR); extend_map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); I don't actually like this interface, as it wastes a full page for just two bits :-p Anyway, to tell the kernel to "extend" the time slice if possible because it's in a critical section, we have: static void extend(void) { if (!extend_map) return; extend_map->flags = 1; } And to say that's it's done: static void unextend(void) { unsigned long prev; if (!extend_map) return; prev = xchg(&extend_map->flags, 0); if (prev & 2) sched_yield(); } So, bit 1 is for user space to tell the kernel "please extend me", and bit two is for the kernel to tell user space "OK, I extended you, but call sched_yield() when done". This test program creates 1 + number of CPUs threads, that run in a loop for 5 seconds. Each thread will grab a user space spin lock (not a futex, but just shared memory). Before grabbing the lock it will call "extend()", if it fails to grab the lock, it calls "unextend()" and spins on the lock until its free, where it will try again. Then after it gets the lock, it will update a counter, and release the lock, calling "unextend()" as well. Then it will spin on the counter until it increments again to allow another task to get into the critical section. With the init of the extend_map disabled and it doesn't use the extend code, it ends with: Ran for 3908165 times Total wait time: 33.965654 I can give you stdev and all that too, but the above is pretty much the same after several runs. After enabling the extend code, it has: Ran for 4829340 times Total wait time: 32.635407 It was able to get into the critical section almost 1 million times more in those 5 seconds! That's a 23% improvement! The wait time for getting into the critical section also dropped by the total of over a second (4% improvement). I ran a traceeval tool on it (still work in progress, but I can post when it's done), and with the following trace, and the writes to trace-marker (tracefs_printf) trace-cmd record -e sched_switch ./extend-sched It showed that without the extend, each task was preempted while holding the lock around 200 times. With the extend, only one task was ever preempted while holding the lock, and it only happened once! Below is my patch (with debugging and on top of Thomas's PREEMPT_AUTO.patch): Attached is the program I tested it with. 
It uses libtracefs to write to the trace_marker file, but if you don't want to build it with libtracefs: gcc -o extend-sched extend-sched.c `pkg-config --libs --cflags libtracefs` -lpthread You can just do: grep -v tracefs extend-sched.c > extend-sched-notracefs.c And build that. -- Steve diff --git a/include/linux/sched.h b/include/linux/sched.h index 9b13b7d4f1d3..fb540dd0dec0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -740,6 +740,10 @@ struct kmap_ctrl { #endif }; +struct extend_map { + long flags; +}; + struct task_struct { #ifdef CONFIG_THREAD_INFO_IN_TASK /* @@ -802,6 +806,8 @@ struct task_struct { unsigned int core_occupation; #endif + struct extend_map *extend_map; + #ifdef CONFIG_CGROUP_SCHED struct task_group *sched_task_group; #endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index c1f706038637..21d0e4d81d33 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -147,17 +147,32 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { } static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work) { + unsigned long ignore_mask; + /* * Before returning to user space ensure that all pending work * items have been completed. */ while (ti_work & EXIT_TO_USER_MODE_WORK) { + ignore_mask = 0; local_irq_enable_exit_to_user(ti_work); - if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) + if (ti_work & _TIF_NEED_RESCHED) { schedule(); + } else if (ti_work & _TIF_NEED_RESCHED_LAZY) { + if (!current->extend_map || + !(current->extend_map->flags & 1)) { + schedule(); + } else { + trace_printk("Extend!\n"); + /* Allow to leave with NEED_RESCHED_LAZY still set */ + ignore_mask |= _TIF_NEED_RESCHED_LAZY; + current->extend_map->flags |= 2; + } + } + if (ti_work & _TIF_UPROBE) uprobe_notify_resume(regs); @@ -184,6 +199,7 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, tick_nohz_user_enter_prepare(); ti_work = read_thread_flags(); + ti_work &= ~ignore_mask; } /* Return the latest work state for arch_exit_to_user_mode() */ diff --git a/kernel/exit.c b/kernel/exit.c index edb50b4c9972..ddf89ec9ab62 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -906,6 +906,13 @@ void __noreturn do_exit(long code) if (tsk->io_context) exit_io_context(tsk); + if (tsk->extend_map) { + unsigned long addr = (unsigned long)tsk->extend_map; + + virt_to_page(addr)->mapping = NULL; + free_page(addr); + } + if (tsk->splice_pipe) free_pipe_info(tsk->splice_pipe); diff --git a/kernel/fork.c b/kernel/fork.c index 3b6d20dfb9a8..da2214082d25 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1166,6 +1166,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) tsk->wake_q.next = NULL; tsk->worker_private = NULL; + tsk->extend_map = NULL; + kcov_task_init(tsk); kmsan_task_create(tsk); kmap_local_fork(tsk); diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 976092b7bd45..297061cfa08d 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -32,3 +32,4 @@ obj-y += core.o obj-y += fair.o obj-y += build_policy.o obj-y += build_utility.o +obj-y += extend.o diff --git a/kernel/sched/extend.c b/kernel/sched/extend.c new file mode 100644 index 000000000000..a632e1a8f57b --- /dev/null +++ b/kernel/sched/extend.c @@ -0,0 +1,90 @@ +#include <linux/kobject.h> +#include <linux/pagemap.h> +#include <linux/sysfs.h> +#include <linux/init.h> + +#ifdef CONFIG_SYSFS +static ssize_t extend_sched_read(struct file *file, struct kobject *kobj, + struct bin_attribute *bin_attr, + char *buf, 
loff_t off, size_t len) +{ + static const char output[] = "Extend scheduling time slice\n"; + + printk("%s:%d\n", __func__, __LINE__); + if (off >= sizeof(output)) + return 0; + + strscpy(buf, output + off, len); + return min((ssize_t)len, sizeof(output) - off - 1); +} + +static ssize_t extend_sched_write(struct file *file, struct kobject *kobj, + struct bin_attribute *bin_attr, + char *buf, loff_t off, size_t len) +{ + printk("%s:%d\n", __func__, __LINE__); + return -EINVAL; +} + +static vm_fault_t extend_sched_mmap_fault(struct vm_fault *vmf) +{ + vm_fault_t ret = VM_FAULT_SIGBUS; + + trace_printk("%s:%d\n", __func__, __LINE__); + /* Only has one page */ + if (vmf->pgoff || !current->extend_map) + return ret; + + vmf->page = virt_to_page(current->extend_map); + + get_page(vmf->page); + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void extend_sched_mmap_open(struct vm_area_struct *vma) +{ + printk("%s:%d\n", __func__, __LINE__); + WARN_ON(!current->extend_map); +} + +static const struct vm_operations_struct extend_sched_vmops = { + .open = extend_sched_mmap_open, + .fault = extend_sched_mmap_fault, +}; + +static int extend_sched_mmap(struct file *file, struct kobject *kobj, + struct bin_attribute *attr, + struct vm_area_struct *vma) +{ + if (current->extend_map) + return -EBUSY; + + current->extend_map = page_to_virt(alloc_page(GFP_USER | __GFP_ZERO)); + if (!current->extend_map) + return -ENOMEM; + + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP | VM_MAYWRITE, 0); + vma->vm_ops = &extend_sched_vmops; + + return 0; +} + +static struct bin_attribute extend_sched_attr = { + .attr = { + .name = "extend_sched", + .mode = 0777, + }, + .read = &extend_sched_read, + .write = &extend_sched_write, + .mmap = &extend_sched_mmap, +}; + +static __init int extend_init(void) +{ + return sysfs_create_bin_file(kernel_kobj, &extend_sched_attr); +} +late_initcall(extend_init); +#endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 700b140ac1bb..17ca22e80384 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -993,9 +993,10 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool resched_curr(rq); } else { /* Did the task ignore the lazy reschedule request? */ - if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) { + trace_printk("Force resched?\n"); resched_curr(rq); - else + } else resched_curr_lazy(rq); } clear_buddies(cfs_rq, se);
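The test program itself is attached to the original mail and is not reproduced in this archive. Based purely on the description above (one thread per CPU plus one, a shared user-space spin lock, a shared counter, a 5 second run), its core loop presumably looks something like the rough sketch below, reusing the extend()/unextend() helpers shown in the mail; time_is_up() is a stand-in for the 5-second cutoff:

/* Rough reconstruction of the described benchmark loop; not the actual attachment. */
static _Atomic int lock;
static _Atomic long counter;

static void *worker(void *arg)
{
	long *hits = arg;			/* per-thread count of critical-section entries */

	while (!time_is_up()) {
		extend();			/* "please let me finish this critical section" */
		while (atomic_exchange(&lock, 1)) {
			unextend();		/* lost the race: don't sit on an extension */
			while (atomic_load(&lock) && !time_is_up())
				;		/* spin until the lock looks free */
			extend();
		}
		long seen = ++counter;		/* the critical section */
		atomic_store(&lock, 0);
		unextend();			/* may sched_yield() if the kernel asked for it */
		(*hits)++;
		while (atomic_load(&counter) == seen && !time_is_up())
			;			/* let another thread take its turn */
	}
	return NULL;
}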
On (23/10/24 10:34), Steven Rostedt wrote: > On Tue, 19 Sep 2023 01:42:03 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: > > > 2) When the scheduler wants to set NEED_RESCHED due it sets > > NEED_RESCHED_LAZY instead which is only evaluated in the return to > > user space preemption points. > > > > As NEED_RESCHED_LAZY is not folded into the preemption count the > > preemption count won't become zero, so the task can continue until > > it hits return to user space. > > > > That preserves the existing behaviour. > > I'm looking into extending this concept to user space and to VMs. > > I'm calling this the "extended scheduler time slice" (ESTS pronounced "estis") > > The ideas is this. Have VMs/user space share a memory region with the > kernel that is per thread/vCPU. This would be registered via a syscall or > ioctl on some defined file or whatever. Then, when entering user space / > VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set, it > checks if the thread has this memory region and a special bit in it is > set, and if it does, it does not schedule. It will treat it like a long > kernel system call. > > The kernel will then set another bit in the shared memory region that will > tell user space / VM that the kernel wanted to schedule, but is allowing it > to finish its critical section. When user space / VM is done with the > critical section, it will check the bit that may be set by the kernel and > if it is set, it should do a sched_yield() or VMEXIT so that the kernel can > now schedule it. > > What about DOS you say? It's no different than running a long system call. > No task can run forever. It's not a "preempt disable", it's just "give me > some more time". A "NEED_RESCHED" will always schedule, just like a kernel > system call that takes a long time. The goal is to allow user space to get > out of critical sections that we know can cause problems if they get > preempted. Usually it's a user space / VM lock is held or maybe a VM > interrupt handler that needs to wake up a task on another vCPU. > > If we are worried about abuse, we could even punish tasks that don't call > sched_yield() by the time its extended time slice is taken. Even without > that punishment, if we have EEVDF, this extension will make it less > eligible the next time around. > > The goal is to prevent a thread / vCPU being preempted while holding a lock > or resource that other threads / vCPUs will want. That is, prevent > contention, as that's usually the biggest issue with performance in user > space and VMs. I think some time ago we tried to check guest's preempt count on each vm-exit and we'd vm-enter if guest exited from a critical section (those that bump preempt count) so that it can hopefully finish whatever is was going to do and vmexit again. We didn't look into covering guest's RCU read-side critical sections. Can you educate me, is your PoC significantly different from guest preempt count check?
On Thu, 26 Oct 2023 16:50:16 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote: > > The goal is to prevent a thread / vCPU being preempted while holding a lock > > or resource that other threads / vCPUs will want. That is, prevent > > contention, as that's usually the biggest issue with performance in user > > space and VMs. > > I think some time ago we tried to check guest's preempt count on each vm-exit > and we'd vm-enter if guest exited from a critical section (those that bump > preempt count) so that it can hopefully finish whatever is was going to > do and vmexit again. We didn't look into covering guest's RCU read-side > critical sections. > > Can you educate me, is your PoC significantly different from guest preempt > count check? No, it's probably very similar. Just the mechanism to allow it to run longer may be different. -- Steve
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index d63b02940747..fc6f4121b412 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -100,6 +100,7 @@ struct thread_info { #define TIF_BLOCKSTEP 25 /* set when we want DEBUGCTLMSR_BTF */ #define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */ #define TIF_ADDR32 29 /* 32-bit address space on 64 bits */ +#define TIF_RESCHED_ALLOW 30 /* reschedule if needed */ #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) @@ -122,6 +123,7 @@ struct thread_info { #define _TIF_BLOCKSTEP (1 << TIF_BLOCKSTEP) #define _TIF_LAZY_MMU_UPDATES (1 << TIF_LAZY_MMU_UPDATES) #define _TIF_ADDR32 (1 << TIF_ADDR32) +#define _TIF_RESCHED_ALLOW (1 << TIF_RESCHED_ALLOW) /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW_BASE \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 177b3f3676ef..4dd3d91d990f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2245,6 +2245,36 @@ static __always_inline bool need_resched(void) return unlikely(tif_need_resched()); } +#ifdef TIF_RESCHED_ALLOW +/* + * allow_resched() .. disallow_resched() demarcate a preemptible section. + * + * Used around primitives where it might not be convenient to periodically + * call cond_resched(). + */ +static inline void allow_resched(void) +{ + might_sleep(); + set_tsk_thread_flag(current, TIF_RESCHED_ALLOW); +} + +static inline void disallow_resched(void) +{ + clear_tsk_thread_flag(current, TIF_RESCHED_ALLOW); +} + +static __always_inline bool resched_allowed(void) +{ + return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW)); +} + +#else +static __always_inline bool resched_allowed(void) +{ + return false; +} +#endif /* TIF_RESCHED_ALLOW */ + /* * Wrappers for p->thread_info->cpu access. No-op on UP. */
On preempt_model_none() or preempt_model_voluntary() configurations, rescheduling of kernel threads happens only when they allow it, and only at explicit preemption points, via calls to cond_resched() or similar. That leaves out contexts where it is not convenient to call cond_resched() periodically -- for instance, when executing a potentially long-running primitive (such as REP; STOSB). This means that we either suffer high scheduling latency or avoid certain constructs. Define TIF_RESCHED_ALLOW to demarcate such sections. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> --- arch/x86/include/asm/thread_info.h | 2 ++ include/linux/sched.h | 30 ++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+)
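As a usage illustration (not part of the patch), the pattern the changelog has in mind would look roughly like the sketch below; clear_pages_rep() is a placeholder name for whichever REP; STOSB based clearing helper the rest of the series adds:

/* Sketch of the intended usage; clear_pages_rep() is a placeholder name. */
static void clear_huge_page_resched(void *addr, unsigned int npages)
{
	/*
	 * Only legal in contexts where sleeping is allowed --
	 * allow_resched() begins with might_sleep() to catch misuse.
	 */
	allow_resched();
	clear_pages_rep(addr, npages);	/* long-running REP; STOSB, now preemptible */
	disallow_resched();
}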