Message ID | 20240925232425.2763385-2-ankur.a.arora@oracle.com (mailing list archive)
---|---
State | New, archived
Series | Enable haltpoll on arm64
On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c > index 9b6d90a72601..fc1204426158 100644 > --- a/drivers/cpuidle/poll_state.c > +++ b/drivers/cpuidle/poll_state.c > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, > > raw_local_irq_enable(); > if (!current_set_polling_and_test()) { > - unsigned int loop_count = 0; > u64 limit; > > limit = cpuidle_poll_time(drv, dev); > > while (!need_resched()) { > - cpu_relax(); > - if (loop_count++ < POLL_IDLE_RELAX_COUNT) > - continue; > - > - loop_count = 0; > + unsigned int loop_count = 0; > if (local_clock_noinstr() - time_start > limit) { > dev->poll_time_limit = true; > break; > } > + > + smp_cond_load_relaxed(&current_thread_info()->flags, > + VAL & _TIF_NEED_RESCHED || > + loop_count++ >= POLL_IDLE_RELAX_COUNT); The above is not guaranteed to make progress if _TIF_NEED_RESCHED is never set. With the event stream enabled on arm64, the WFE will eventually be woken up, loop_count incremented and the condition would become true. However, the smp_cond_load_relaxed() semantics require that a different agent updates the variable being waited on, not the waiting CPU updating it itself. Also note that the event stream can be disabled on arm64 on the kernel command line. Does the code above break any other architecture? I'd say if you want something like this, better introduce a new smp_cond_load_timeout() API. The above looks like a hack that may only work on arm64 when the event stream is enabled. A generic option is udelay() (on arm64 it would use WFE/WFET by default). Not sure how important it is for poll_idle() but the downside of udelay() is that it won't be able to also poll need_resched() while waiting for the timeout. If this matters, you could instead make smaller udelay() calls. Yet another problem, I don't know how energy efficient udelay() is on x86 vs cpu_relax(). So maybe an smp_cond_load_timeout() would be better, implemented with cpu_relax() generically and the arm64 would use LDXR, WFE and rely on the event stream (or fall back to cpu_relax() if the event stream is disabled).
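For reference, a minimal sketch of the generic smp_cond_load_timeout() idea floated above — the name, the signature and the exact timeout semantics here are assumptions for illustration, not an existing kernel API:

/*
 * Hypothetical generic fallback: poll the variable with cpu_relax() and
 * re-evaluate a caller-supplied deadline expression on every iteration.
 * Architectures with a cheaper wait primitive (e.g. arm64 LDXR + WFE)
 * would supply their own definition.
 */
#ifndef smp_cond_load_relaxed_timeout
#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
		if (!(time_check_expr))		/* deadline expired */	\
			break;						\
	}								\
	(typeof(*ptr))VAL;						\
})
#endif

With something along these lines, poll_idle() could wait via smp_cond_load_relaxed_timeout(&current_thread_info()->flags, VAL & _TIF_NEED_RESCHED, local_clock_noinstr() - time_start <= limit), making the timeout part of the primitive's contract rather than a side effect of the condition expression.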
On Tue, 15 Oct 2024, Catalin Marinas wrote: > > + unsigned int loop_count = 0; > > if (local_clock_noinstr() - time_start > limit) { > > dev->poll_time_limit = true; > > break; > > } > > + > > + smp_cond_load_relaxed(&current_thread_info()->flags, > > + VAL & _TIF_NEED_RESCHED || > > + loop_count++ >= POLL_IDLE_RELAX_COUNT); > > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is > never set. With the event stream enabled on arm64, the WFE will > eventually be woken up, loop_count incremented and the condition would > become true. However, the smp_cond_load_relaxed() semantics require that > a different agent updates the variable being waited on, not the waiting > CPU updating it itself. Also note that the event stream can be disabled > on arm64 on the kernel command line. Setting of need_resched() from another processor involves sending an IPI after that was set. I don't think we need to smp_cond_load_relaxed since the IPI will cause an event. For ARM a WFE would be sufficient.
On Tue, Oct 15, 2024 at 09:42:56AM -0700, Christoph Lameter (Ampere) wrote: > On Tue, 15 Oct 2024, Catalin Marinas wrote: > > > + unsigned int loop_count = 0; > > > if (local_clock_noinstr() - time_start > limit) { > > > dev->poll_time_limit = true; > > > break; > > > } > > > + > > > + smp_cond_load_relaxed(&current_thread_info()->flags, > > > + VAL & _TIF_NEED_RESCHED || > > > + loop_count++ >= POLL_IDLE_RELAX_COUNT); > > > > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is > > never set. With the event stream enabled on arm64, the WFE will > > eventually be woken up, loop_count incremented and the condition would > > become true. However, the smp_cond_load_relaxed() semantics require that > > a different agent updates the variable being waited on, not the waiting > > CPU updating it itself. Also note that the event stream can be disabled > > on arm64 on the kernel command line. > > Setting of need_resched() from another processor involves sending an IPI > after that was set. I don't think we need to smp_cond_load_relaxed since > the IPI will cause an event. For ARM a WFE would be sufficient. I'm not worried about the need_resched() case, even without an IPI it would still work. The loop_count++ side of the condition is supposed to time out in the absence of a need_resched() event. You can't do an smp_cond_load_*() on a variable that's only updated by the waiting CPU. Nothing guarantees to wake it up to update the variable (the event stream on arm64, yes, but that's generic code).
On Tue, 15 Oct 2024, Catalin Marinas wrote: > > Setting of need_resched() from another processor involves sending an IPI > > after that was set. I don't think we need to smp_cond_load_relaxed since > > the IPI will cause an event. For ARM a WFE would be sufficient. > > I'm not worried about the need_resched() case, even without an IPI it > would still work. > > The loop_count++ side of the condition is supposed to time out in the > absence of a need_resched() event. You can't do an smp_cond_load_*() on > a variable that's only updated by the waiting CPU. Nothing guarantees to > wake it up to update the variable (the event stream on arm64, yes, but > that's generic code). Hmm... I have WFET implementation here without smp_cond modelled after the delay() implementation on ARM64 (but it's not generic and there is an additional patch required to make this work. Intermediate patch attached) From: Christoph Lameter (Ampere) <cl@gentwo.org> Subject: [Haltpoll] Implement waiting using WFET Use WFET if the hardware supports it to implement a wait until something happens to wake up the cpu. If WFET is not available then use the stream event source to periodically wake up until an event happens or the timeout expires. The smp_cond_wait() is not necessary because the scheduler will create an event on the targeted cpu by sending an IPI. Without cond_wait we can simply take the basic approach from the delay() function and customize it a bit. Signed-off-by: Christoph Lameter <cl@linux.com> --- drivers/cpuidle/poll_state.c | 43 +++++++++++++++++------------------- 1 file changed, 20 insertions(+), 23 deletions(-) Index: linux/drivers/cpuidle/poll_state.c =================================================================== --- linux.orig/drivers/cpuidle/poll_state.c +++ linux/drivers/cpuidle/poll_state.c @@ -5,48 +5,41 @@ #include <linux/cpuidle.h> #include <linux/sched.h> -#include <linux/sched/clock.h> #include <linux/sched/idle.h> - -#ifdef CONFIG_ARM64 -/* - * POLL_IDLE_RELAX_COUNT determines how often we check for timeout - * while polling for TIF_NEED_RESCHED in thread_info->flags. - * - * Set this to a low value since arm64, instead of polling, uses a - * event based mechanism. 
- */ -#define POLL_IDLE_RELAX_COUNT 1 -#else -#define POLL_IDLE_RELAX_COUNT 200 -#endif +#include <clocksource/arm_arch_timer.h> static int __cpuidle poll_idle(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { - u64 time_start; - - time_start = local_clock_noinstr(); + const cycles_t start = get_cycles(); dev->poll_time_limit = false; raw_local_irq_enable(); if (!current_set_polling_and_test()) { - u64 limit; - limit = cpuidle_poll_time(drv, dev); + const cycles_t end = start + ARCH_TIMER_NSECS_TO_CYCLES(cpuidle_poll_time(drv, dev)); while (!need_resched()) { - unsigned int loop_count = 0; - if (local_clock_noinstr() - time_start > limit) { - dev->poll_time_limit = true; - break; - } - smp_cond_load_relaxed(&current_thread_info()->flags, - VAL & _TIF_NEED_RESCHED || - loop_count++ >= POLL_IDLE_RELAX_COUNT); + if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { + + /* We can power down for a configurable interval while waiting */ + while (!need_resched() && get_cycles() < end) + wfet(end); + + } else if (arch_timer_evtstrm_available()) { + const cycles_t timer_period = ARCH_TIMER_USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); + + /* Wake up periodically during evstream events */ + while (!need_resched() && get_cycles() + timer_period <= end) + wfe(); + } } + + /* In case time is not up yet due to coarse time intervals above */ + while (!need_resched() && get_cycles() < end) + cpu_relax(); } raw_local_irq_disable(); From: Christoph Lameter (Ampere) <cl@linux.com> Move the conversion from time to cycles of arch_timer into arch_timer.h. Add nsec conversion since we will need that soon. Signed-off-by: Christoph Lameter <cl@linux.com> Index: linux/arch/arm64/lib/delay.c =================================================================== --- linux.orig/arch/arm64/lib/delay.c +++ linux/arch/arm64/lib/delay.c @@ -15,14 +15,6 @@ #include <clocksource/arm_arch_timer.h> -#define USECS_TO_CYCLES(time_usecs) \ - xloops_to_cycles((time_usecs) * 0x10C7UL) - -static inline unsigned long xloops_to_cycles(unsigned long xloops) -{ - return (xloops * loops_per_jiffy * HZ) >> 32; -} - void __delay(unsigned long cycles) { cycles_t start = get_cycles(); @@ -39,7 +31,7 @@ void __delay(unsigned long cycles) wfet(end); } else if (arch_timer_evtstrm_available()) { const cycles_t timer_evt_period = - USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); + ARCH_TIMER_USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US); while ((get_cycles() - start + timer_evt_period) < cycles) wfe(); @@ -52,7 +44,7 @@ EXPORT_SYMBOL(__delay); inline void __const_udelay(unsigned long xloops) { - __delay(xloops_to_cycles(xloops)); + __delay(arch_timer_xloops_to_cycles(xloops)); } EXPORT_SYMBOL(__const_udelay); Index: linux/include/clocksource/arm_arch_timer.h =================================================================== --- linux.orig/include/clocksource/arm_arch_timer.h +++ linux/include/clocksource/arm_arch_timer.h @@ -90,6 +90,19 @@ extern u64 (*arch_timer_read_counter)(vo extern struct arch_timer_kvm_info *arch_timer_get_kvm_info(void); extern bool arch_timer_evtstrm_available(void); +#include <linux/delay.h> + +static inline unsigned long arch_timer_xloops_to_cycles(unsigned long xloops) +{ + return (xloops * loops_per_jiffy * HZ) >> 32; +} + +#define ARCH_TIMER_USECS_TO_CYCLES(time_usecs) \ + arch_timer_xloops_to_cycles((time_usecs) * 0x10C7UL) + +#define ARCH_TIMER_NSECS_TO_CYCLES(time_nsecs) \ + arch_timer_xloops_to_cycles((time_nsecs) * 0x5UL) + #else static inline u32 arch_timer_get_rate(void)
On Tue, Oct 15, 2024 at 10:17:13AM -0700, Christoph Lameter (Ampere) wrote: > On Tue, 15 Oct 2024, Catalin Marinas wrote: > > > Setting of need_resched() from another processor involves sending an IPI > > > after that was set. I don't think we need to smp_cond_load_relaxed since > > > the IPI will cause an event. For ARM a WFE would be sufficient. > > > > I'm not worried about the need_resched() case, even without an IPI it > > would still work. > > > > The loop_count++ side of the condition is supposed to time out in the > > absence of a need_resched() event. You can't do an smp_cond_load_*() on > > a variable that's only updated by the waiting CPU. Nothing guarantees to > > wake it up to update the variable (the event stream on arm64, yes, but > > that's generic code). > > Hmm... I have WFET implementation here without smp_cond modelled after > the delay() implementation on ARM64 (but it's not generic and there is > an additional patch required to make this work. Intermediate patch > attached) At least one additional patch ;). But yeah, I suggested hiding all this behind something like smp_cond_load_timeout() which would wait on current_thread_info()->flags but with a timeout. The arm64 implementation would follow some of the logic in __delay(). Others may simply poll with cpu_relax(). Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and rely on need_resched() and some new delay/cpu_relax() API that waits for a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout() which on arm64 it's just a simplified version of __delay() without the 'while' loops.
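A sketch of the cpu_relax_timeout() idea for arm64, assuming the helper name and modelling it loosely on __delay(); WFET and WFE both terminate early on any wakeup event, which covers the IPI case being discussed:

/*
 * Hypothetical cpu_relax_timeout() for arm64: a single bounded wait
 * step, terminated early by any event (e.g. an IPI). The caller loops
 * on need_resched() and the deadline, mirroring __delay() but without
 * its 'while' loops.
 */
static inline void cpu_relax_timeout(u64 deadline_cycles)
{
	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT))
		wfet(deadline_cycles);	/* wait for event or deadline */
	else if (arch_timer_evtstrm_available())
		wfe();			/* event stream bounds the wait */
	else
		cpu_relax();		/* plain spin fallback */
}

poll_idle() would then run something like while (!need_resched() && get_cycles() < deadline_cycles) cpu_relax_timeout(deadline_cycles);, re-checking both conditions after every wakeup.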
Catalin Marinas <catalin.marinas@arm.com> writes: > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c >> index 9b6d90a72601..fc1204426158 100644 >> --- a/drivers/cpuidle/poll_state.c >> +++ b/drivers/cpuidle/poll_state.c >> @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, >> >> raw_local_irq_enable(); >> if (!current_set_polling_and_test()) { >> - unsigned int loop_count = 0; >> u64 limit; >> >> limit = cpuidle_poll_time(drv, dev); >> >> while (!need_resched()) { >> - cpu_relax(); >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT) >> - continue; >> - >> - loop_count = 0; >> + unsigned int loop_count = 0; >> if (local_clock_noinstr() - time_start > limit) { >> dev->poll_time_limit = true; >> break; >> } >> + >> + smp_cond_load_relaxed(&current_thread_info()->flags, >> + VAL & _TIF_NEED_RESCHED || >> + loop_count++ >= POLL_IDLE_RELAX_COUNT); > > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is > never set. With the event stream enabled on arm64, the WFE will > eventually be woken up, loop_count incremented and the condition would > become true. That makes sense. > However, the smp_cond_load_relaxed() semantics require that > a different agent updates the variable being waited on, not the waiting > CPU updating it itself. Right. And, that seems to work well with the semantics of WFE. And, the event stream (if enabled) has a side effect that allows the exit from the loop. > Also note that the event stream can be disabled > on arm64 on the kernel command line. Yes, that's a good point. In patch-11 I tried to address that aspect by only allowing haltpoll to be force loaded. But, I guess your point is that it's not just haltpoll that has a problem, but also regular polling -- and maybe the right thing to do would be to disable polling if the event stream is disabled. > Does the code above break any other architecture? I (and others) have so far tested x86, ARM64 (with/without the event stream), and I believe riscv. I haven't seen any obvious breakage. But, that's probably because most of the time somebody would have set TIF_NEED_RESCHED. > I'd say if you want > something like this, better introduce a new smp_cond_load_timeout() > API. The above looks like a hack that may only work on arm64 when the > event stream is enabled. I had a preliminary version of smp_cond_load_relaxed_timeout() here: https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/ Even with an smp_cond_load_timeout(), we would need to fall back to something like the above for uarchs without WFxT. > A generic option is udelay() (on arm64 it would use WFE/WFET by > default). Not sure how important it is for poll_idle() but the downside > of udelay() is that it won't be able to also poll need_resched() while > waiting for the timeout. If this matters, you could instead make smaller > udelay() calls. Yet another problem, I don't know how energy efficient > udelay() is on x86 vs cpu_relax(). > > So maybe an smp_cond_load_timeout() would be better, implemented with > cpu_relax() generically and the arm64 would use LDXR, WFE and rely on > the event stream (or fall back to cpu_relax() if the event stream is > disabled). Yeah, something like that might work. -- ankur
Catalin Marinas <catalin.marinas@arm.com> writes: > On Tue, Oct 15, 2024 at 10:17:13AM -0700, Christoph Lameter (Ampere) wrote: >> On Tue, 15 Oct 2024, Catalin Marinas wrote: >> > > Setting of need_resched() from another processor involves sending an IPI >> > > after that was set. I don't think we need to smp_cond_load_relaxed since >> > > the IPI will cause an event. For ARM a WFE would be sufficient. >> > >> > I'm not worried about the need_resched() case, even without an IPI it >> > would still work. >> > >> > The loop_count++ side of the condition is supposed to time out in the >> > absence of a need_resched() event. You can't do an smp_cond_load_*() on >> > a variable that's only updated by the waiting CPU. Nothing guarantees to >> > wake it up to update the variable (the event stream on arm64, yes, but >> > that's generic code). >> >> Hmm... I have WFET implementation here without smp_cond modelled after >> the delay() implementation on ARM64 (but it's not generic and there is >> an additional patch required to make this work. Intermediate patch >> attached) > > At least one additional patch ;). But yeah, I suggested hiding all this > behind something like smp_cond_load_timeout() which would wait on > current_thread_info()->flags but with a timeout. The arm64 > implementation would follow some of the logic in __delay(). Others may > simply poll with cpu_relax(). > > Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and > rely on need_resched() and some new delay/cpu_relax() API that waits for > a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout() > which on arm64 it's just a simplified version of __delay() without the > 'while' loops. AFAICT when polling (which we are since poll_idle() calls current_set_polling_and_test()), the scheduler will elide the IPI by remotely setting the need-resched bit via set_nr_if_polling(). Once we stop polling then the scheduler should take the IPI path because call_function_single_prep_ipi() will fail. -- ankur
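For context, a condensed sketch of the set_nr_if_polling() logic referred to above (simplified from kernel/sched/core.c; instrumentation elided):

/*
 * Atomically set TIF_NEED_RESCHED on a remote task and report whether
 * the IPI can be elided because the target is polling its thread flags.
 */
static bool set_nr_if_polling(struct task_struct *p)
{
	struct thread_info *ti = task_thread_info(p);
	typeof(ti->flags) val = READ_ONCE(ti->flags);

	for (;;) {
		if (!(val & _TIF_POLLING_NRFLAG))
			return false;	/* not polling: caller sends the IPI */
		if (val & _TIF_NEED_RESCHED)
			return true;	/* already pending */
		if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED))
			return true;	/* the store itself wakes the poller */
	}
}

On arm64 that final store is exactly what generates the event that wakes a waiter parked in LDXR/WFE on those flags.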
On Tue, 15 Oct 2024, Ankur Arora wrote: > > Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and > > rely on need_resched() and some new delay/cpu_relax() API that waits for > > a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout() > > which on arm64 it's just a simplified version of __delay() without the > > 'while' loops. > > AFAICT when polling (which we are since poll_idle() calls > current_set_polling_and_test()), the scheduler will elide the IPI > by remotely setting the need-resched bit via set_nr_if_polling(). The scheduler runs on multiple cores. The core on which we are running this code puts the core into a wait state so the scheduler does not run on this core at all during the wait period. The other cores may run scheduler functions and set the need_resched bit for the core where we are currently waiting. The other core will wake our core up by sending an IPI. The IPI will invoke a scheduler function on our core and the WFE will continue. > Once we stop polling then the scheduler should take the IPI path > because call_function_single_prep_ipi() will fail. The IPI stops the polling. IPI is an interrupt.
Here is a patch that keeps the cpuidle stuff generic but allows an override by arm64. From: Christoph Lameter (Ampere) <cl@linux.com> Subject: Revise cpu poll idle to make full use of wfet() and wfe() ARM64 has instructions that can wait for an event and timeouts. Clean up the code in drivers/cpuidle/ to wait until the end of a period and allow the override of the handling of the waiting by an architecture. Provide an optimized wait function for arm64. Signed-off-by: Christoph Lameter <cl@linux.com> Index: linux/arch/arm64/lib/delay.c =================================================================== --- linux.orig/arch/arm64/lib/delay.c +++ linux/arch/arm64/lib/delay.c @@ -12,6 +12,8 @@ #include <linux/kernel.h> #include <linux/module.h> #include <linux/timex.h> +#include <linux/sched/clock.h> +#include <linux/cpuidle.h> #include <clocksource/arm_arch_timer.h> @@ -67,3 +69,27 @@ void __ndelay(unsigned long nsecs) __const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */ } EXPORT_SYMBOL(__ndelay); + +void cpuidle_wait_for_resched_with_timeout(u64 end) +{ + u64 start; + + while (!need_resched() && (start = local_clock_noinstr()) < end) { + + if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { + + /* Processor supports waiting for a specified period */ + wfet(xloops_to_cycles((end - start) * 0x5UL)); + + } else + if (arch_timer_evtstrm_available() && start + ARCH_TIMER_EVT_STREAM_PERIOD_US * 1000 < end) { + + /* We can wait until a periodic event occurs */ + wfe(); + + } else + /* Need to spin until the end */ + cpu_relax(); + } +} + Index: linux/drivers/cpuidle/poll_state.c =================================================================== --- linux.orig/drivers/cpuidle/poll_state.c +++ linux/drivers/cpuidle/poll_state.c @@ -8,35 +8,29 @@ #include <linux/sched/clock.h> #include <linux/sched/idle.h> -#define POLL_IDLE_RELAX_COUNT 200 +__weak void cpuidle_wait_for_resched_with_timeout(u64 end) +{ + while (!need_resched() && local_clock_noinstr() < end) { + cpu_relax(); + } +} static int __cpuidle poll_idle(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { - u64 time_start; - - time_start = local_clock_noinstr(); + u64 time_start = local_clock_noinstr(); + u64 time_end = time_start + cpuidle_poll_time(drv, dev); dev->poll_time_limit = false; raw_local_irq_enable(); if (!current_set_polling_and_test()) { - unsigned int loop_count = 0; - u64 limit; - limit = cpuidle_poll_time(drv, dev); + cpuidle_wait_for_resched_with_timeout(time_end); + + if (!need_resched()) + dev->poll_time_limit = true; - while (!need_resched()) { - cpu_relax(); - if (loop_count++ < POLL_IDLE_RELAX_COUNT) - continue; - - loop_count = 0; - if (local_clock_noinstr() - time_start > limit) { - dev->poll_time_limit = true; - break; - } - } } raw_local_irq_disable(); Index: linux/include/linux/cpuidle.h =================================================================== --- linux.orig/include/linux/cpuidle.h +++ linux/include/linux/cpuidle.h @@ -202,6 +202,9 @@ extern int cpuidle_play_dead(void); extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev); static inline struct cpuidle_device *cpuidle_get_device(void) {return __this_cpu_read(cpuidle_devices); } + +extern __weak void cpuidle_wait_for_resched_with_timeout(u64); + #else static inline void disable_cpuidle(void) { } static inline bool cpuidle_not_available(struct cpuidle_driver *drv,
On 2024/10/16 5:32 AM, Ankur Arora wrote: > > Catalin Marinas <catalin.marinas@arm.com> writes: > >> On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: >>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c >>> index 9b6d90a72601..fc1204426158 100644 >>> --- a/drivers/cpuidle/poll_state.c >>> +++ b/drivers/cpuidle/poll_state.c >>> @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, >>> >>> raw_local_irq_enable(); >>> if (!current_set_polling_and_test()) { >>> - unsigned int loop_count = 0; >>> u64 limit; >>> >>> limit = cpuidle_poll_time(drv, dev); >>> >>> while (!need_resched()) { >>> - cpu_relax(); >>> - if (loop_count++ < POLL_IDLE_RELAX_COUNT) >>> - continue; >>> - >>> - loop_count = 0; >>> + unsigned int loop_count = 0; >>> if (local_clock_noinstr() - time_start > limit) { >>> dev->poll_time_limit = true; >>> break; >>> } >>> + >>> + smp_cond_load_relaxed(&current_thread_info()->flags, >>> + VAL & _TIF_NEED_RESCHED || >>> + loop_count++ >= POLL_IDLE_RELAX_COUNT); >> >> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is >> never set. With the event stream enabled on arm64, the WFE will >> eventually be woken up, loop_count incremented and the condition would >> become true. > > That makes sense. > >> However, the smp_cond_load_relaxed() semantics require that >> a different agent updates the variable being waited on, not the waiting >> CPU updating it itself. > > Right. And, that seems to work well with the semantics of WFE. And, > the event stream (if enabled) has a side effect that allows the exit > from the loop. > >> Also note that the event stream can be disabled >> on arm64 on the kernel command line. > > Yes, that's a good point. In patch-11 I tried to address that aspect > by only allowing haltpoll to be force loaded. > > But, I guess your point is that it's not just haltpoll that has a problem, > but also regular polling -- and maybe the right thing to do would be to > disable polling if the event stream is disabled. > >> Does the code above break any other architecture? > > I (and others) have so far tested x86, ARM64 (with/without the > event stream), and I believe riscv. I haven't seen any obvious > breakage. But, that's probably because most of the time somebody would > have set TIF_NEED_RESCHED. > >> I'd say if you want >> something like this, better introduce a new smp_cond_load_timeout() >> API. The above looks like a hack that may only work on arm64 when the >> event stream is enabled. > > I had a preliminary version of smp_cond_load_relaxed_timeout() here: > https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/ > > Even with an smp_cond_load_timeout(), we would need to fall back to > something like the above for uarchs without WFxT. > >> A generic option is udelay() (on arm64 it would use WFE/WFET by >> default). Not sure how important it is for poll_idle() but the downside >> of udelay() is that it won't be able to also poll need_resched() while >> waiting for the timeout. If this matters, you could instead make smaller >> udelay() calls. Yet another problem, I don't know how energy efficient >> udelay() is on x86 vs cpu_relax(). >> >> So maybe an smp_cond_load_timeout() would be better, implemented with >> cpu_relax() generically and the arm64 would use LDXR, WFE and rely on >> the event stream (or fall back to cpu_relax() if the event stream is >> disabled). > > Yeah, something like that might work. Yeah, I like the idea of an smp_cond_load_timeout method.
If the generic smp_cond_load_timeout method does not meet the requirement, the architecture can define its own. And this keeps modifications to the original code logic small. Regards Bibo Mao > > -- > ankur >
Christoph Lameter (Ampere) <cl@gentwo.org> writes: > On Tue, 15 Oct 2024, Ankur Arora wrote: > >> > Alternatively, if we get an IPI anyway, we can avoid smp_cond_load() and >> > rely on need_resched() and some new delay/cpu_relax() API that waits for >> > a timeout or an IPI, whichever comes first. E.g. cpu_relax_timeout() >> > which on arm64 it's just a simplified version of __delay() without the >> > 'while' loops. >> >> AFAICT when polling (which we are since poll_idle() calls >> current_set_polling_and_test()), the scheduler will elide the IPI >> by remotely setting the need-resched bit via set_nr_if_polling(). > > The scheduler runs on multiple cores. The core on which we are > running this code puts the core into a wait state so the scheduler does > not run on this core at all during the wait period. Yes. > The other cores may run scheduler functions and set the need_resched bit > for the core where we are currently waiting. Yes. > The other core will wake our core up by sending an IPI. The IPI will > invoke a scheduler function on our core and the WFE will continue. Why? The target core is not sleeping. It is *polling* on a memory address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell it that a need-resched bit is set. >> Once we stop polling then the scheduler should take the IPI path >> because call_function_single_prep_ipi() will fail. > > The IPI stops the polling. IPI is an interrupt. Yes an IPI is an interrupt. And, since the target is polling there's no need for an interrupt to inform it that a memory address on which it is polling has changed. resched_curr() is a good example. It only sends the resched IPI if the target is not polling. resched_curr() { ... if (set_nr_and_not_polling(curr)) smp_send_reschedule(cpu); else trace_sched_wake_idle_without_ipi(cpu); } -- ankur
On Tue, Oct 15, 2024 at 03:40:33PM -0700, Christoph Lameter (Ampere) wrote: > Index: linux/arch/arm64/lib/delay.c > =================================================================== > --- linux.orig/arch/arm64/lib/delay.c > +++ linux/arch/arm64/lib/delay.c > @@ -12,6 +12,8 @@ > #include <linux/kernel.h> > #include <linux/module.h> > #include <linux/timex.h> > +#include <linux/sched/clock.h> > +#include <linux/cpuidle.h> > > #include <clocksource/arm_arch_timer.h> > > @@ -67,3 +69,27 @@ void __ndelay(unsigned long nsecs) > __const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */ > } > EXPORT_SYMBOL(__ndelay); > + > +void cpuidle_wait_for_resched_with_timeout(u64 end) > +{ > + u64 start; > + > + while (!need_resched() && (start = local_clock_noinstr()) < end) { > + > + if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) { > + > + /* Processor supports waiting for a specified period */ > + wfet(xloops_to_cycles((end - start) * 0x5UL)); > + > + } else > + if (arch_timer_evtstrm_available() && start + ARCH_TIMER_EVT_STREAM_PERIOD_US * 1000 < end) { > + > + /* We can wait until a periodic event occurs */ > + wfe(); > + > + } else > + /* Need to spin until the end */ > + cpu_relax(); > + } > +} The behaviour above is slightly different from the current poll_idle() implementation. The above is more like poll every timeout period rather than continuously poll until either the need_resched() condition is true _or_ the timeout expired. From Ankur's email, an IPI may not happen so we don't have any guarantee that WFET will wake up before the timeout. The only way for WFE/WFET to wake up on need_resched() is to use LDXR to arm the exclusive monitor. That's what smp_cond_load_relaxed() does. If you only need the behaviour proposed above, you might as well go for udelay() directly. Otherwise I think we need to revisit Ankur's smp_cond_load_timeout() proposal from earlier this year.
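To illustrate the LDXR/WFE pairing mentioned above, here is a simplified 64-bit rendering of the __cmpwait() helper underlying smp_cond_load_relaxed() (the real code in arch/arm64/include/asm/cmpxchg.h is generated for all access sizes via macros):

/*
 * LDXR arms the exclusive monitor, so a store to *ptr by another CPU
 * generates the event that wakes the WFE; an interrupt or the event
 * stream wakes it as well.
 */
static inline void __cmpwait_relaxed_u64(volatile u64 *ptr, u64 val)
{
	u64 tmp;

	asm volatile(
	"	sevl\n"				/* prime the event register */
	"	wfe\n"				/* consume any stale event */
	"	ldxr	%[tmp], %[v]\n"		/* load and arm the monitor */
	"	eor	%[tmp], %[tmp], %[val]\n"
	"	cbnz	%[tmp], 1f\n"		/* value already changed */
	"	wfe\n"				/* wait for store/IPI/evtstrm */
	"1:"
	: [tmp] "=&r" (tmp), [v] "+Q" (*(u64 *)ptr)
	: [val] "r" (val));
}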
On Tue, Oct 15, 2024 at 02:32:00PM -0700, Ankur Arora wrote: > Catalin Marinas <catalin.marinas@arm.com> writes: > > I'd say if you want > > something like this, better introduce a new smp_cond_load_timeout() > > API. The above looks like a hack that may only work on arm64 when the > > event stream is enabled. > > I had a preliminary version of smp_cond_load_relaxed_timeout() here: > https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/ > > Even with an smp_cond_load_timeout(), we would need to fallback to > something like the above for uarchs without WFxT. Yes, the default/generic implementation would use cpu_relax(). For the arm64 one, some combination of __cmpwait() and __delay() with the fallback to cpu_relax().
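Combining the two, the arm64 flavour could look roughly like the sketch below — in the spirit of Ankur's earlier posting, with the deadline re-checked on every wakeup. Note the assumption, flagged throughout this thread, that something (the event stream, a WFxT-based wait, or an interrupt) bounds each individual sleep:

/*
 * Hypothetical arm64 smp_cond_load_relaxed_timeout(): wait on *ptr with
 * LDXR/WFE via __cmpwait_relaxed() and re-evaluate the deadline each
 * time the CPU wakes up (store to *ptr, interrupt, or event stream).
 */
#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		if (!(time_check_expr))		/* deadline expired */	\
			break;						\
		__cmpwait_relaxed(__PTR, VAL);				\
	}								\
	(typeof(*ptr))VAL;						\
})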
On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote: > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c > > index 9b6d90a72601..fc1204426158 100644 > > --- a/drivers/cpuidle/poll_state.c > > +++ b/drivers/cpuidle/poll_state.c > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, > > > > raw_local_irq_enable(); > > if (!current_set_polling_and_test()) { > > - unsigned int loop_count = 0; > > u64 limit; > > > > limit = cpuidle_poll_time(drv, dev); > > > > while (!need_resched()) { > > - cpu_relax(); > > - if (loop_count++ < POLL_IDLE_RELAX_COUNT) > > - continue; > > - > > - loop_count = 0; > > + unsigned int loop_count = 0; > > if (local_clock_noinstr() - time_start > limit) { > > dev->poll_time_limit = true; > > break; > > } > > + > > + smp_cond_load_relaxed(&current_thread_info()->flags, > > + VAL & _TIF_NEED_RESCHED || > > + loop_count++ >= POLL_IDLE_RELAX_COUNT); > > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is > never set. With the event stream enabled on arm64, the WFE will > eventually be woken up, loop_count incremented and the condition would > become true. However, the smp_cond_load_relaxed() semantics require that > a different agent updates the variable being waited on, not the waiting > CPU updating it itself. Also note that the event stream can be disabled > on arm64 on the kernel command line. Alternatively could we condition arch_haltpoll_want() on arch_timer_evtstrm_available(), like v7? > > Does the code above break any other architecture? I'd say if you want > something like this, better introduce a new smp_cond_load_timeout() > API. The above looks like a hack that may only work on arm64 when the > event stream is enabled. > > A generic option is udelay() (on arm64 it would use WFE/WFET by > default). Not sure how important it is for poll_idle() but the downside > of udelay() is that it won't be able to also poll need_resched() while > waiting for the timeout. If this matters, you could instead make smaller > udelay() calls. Yet another problem, I don't know how energy efficient > udelay() is on x86 vs cpu_relax(). > > So maybe an smp_cond_load_timeout() would be better, implemented with > cpu_relax() generically and the arm64 would use LDXR, WFE and rely on > the event stream (or fall back to cpu_relax() if the event stream is > disabled). > > -- > Catalin
Okanovic, Haris <harisokn@amazon.com> writes: > On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote: >> On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: >> > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c >> > index 9b6d90a72601..fc1204426158 100644 >> > --- a/drivers/cpuidle/poll_state.c >> > +++ b/drivers/cpuidle/poll_state.c >> > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, >> > >> > raw_local_irq_enable(); >> > if (!current_set_polling_and_test()) { >> > - unsigned int loop_count = 0; >> > u64 limit; >> > >> > limit = cpuidle_poll_time(drv, dev); >> > >> > while (!need_resched()) { >> > - cpu_relax(); >> > - if (loop_count++ < POLL_IDLE_RELAX_COUNT) >> > - continue; >> > - >> > - loop_count = 0; >> > + unsigned int loop_count = 0; >> > if (local_clock_noinstr() - time_start > limit) { >> > dev->poll_time_limit = true; >> > break; >> > } >> > + >> > + smp_cond_load_relaxed(&current_thread_info()->flags, >> > + VAL & _TIF_NEED_RESCHED || >> > + loop_count++ >= POLL_IDLE_RELAX_COUNT); >> >> The above is not guaranteed to make progress if _TIF_NEED_RESCHED is >> never set. With the event stream enabled on arm64, the WFE will >> eventually be woken up, loop_count incremented and the condition would >> become true. However, the smp_cond_load_relaxed() semantics require that >> a different agent updates the variable being waited on, not the waiting >> CPU updating it itself. Also note that the event stream can be disabled >> on arm64 on the kernel command line. > > Alternatively could we condition arch_haltpoll_want() on > arch_timer_evtstrm_available(), like v7? Yes, I'm thinking of staging it somewhat like that. First an smp_cond_load_relaxed() which gets rid of this issue, followed by one based on smp_cond_load_relaxed_timeout(). That said, conditioning just arch_haltpoll_want() won't suffice since what Catalin pointed out affects all users of poll_idle(), not just haltpoll. Right now there's only haltpoll but there are future users like zhenglifeng with a patch for acpi-idle here: https://lore.kernel.org/all/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/ >> Does the code above break any other architecture? I'd say if you want >> something like this, better introduce a new smp_cond_load_timeout() >> API. The above looks like a hack that may only work on arm64 when the >> event stream is enabled. >> >> A generic option is udelay() (on arm64 it would use WFE/WFET by >> default). Not sure how important it is for poll_idle() but the downside >> of udelay() is that it won't be able to also poll need_resched() while >> waiting for the timeout. If this matters, you could instead make smaller >> udelay() calls. Yet another problem, I don't know how energy efficient >> udelay() is on x86 vs cpu_relax(). >> >> So maybe an smp_cond_load_timeout() would be better, implemented with >> cpu_relax() generically and the arm64 would use LDXR, WFE and rely on >> the event stream (or fall back to cpu_relax() if the event stream is >> disabled). >> >> -- >> Catalin -- ankur
On Wed, 2024-10-16 at 10:04 -0700, Ankur Arora wrote: > Okanovic, Haris <harisokn@amazon.com> writes: > > > On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote: > > > > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: > > > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c > > > > index 9b6d90a72601..fc1204426158 100644 > > > > --- a/drivers/cpuidle/poll_state.c > > > > +++ b/drivers/cpuidle/poll_state.c > > > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, > > > > > > > > raw_local_irq_enable(); > > > > if (!current_set_polling_and_test()) { > > > > - unsigned int loop_count = 0; > > > > u64 limit; > > > > > > > > limit = cpuidle_poll_time(drv, dev); > > > > > > > > while (!need_resched()) { > > > > - cpu_relax(); > > > > - if (loop_count++ < POLL_IDLE_RELAX_COUNT) > > > > - continue; > > > > - > > > > - loop_count = 0; > > > > + unsigned int loop_count = 0; > > > > if (local_clock_noinstr() - time_start > limit) { > > > > dev->poll_time_limit = true; > > > > break; > > > > } > > > > + > > > > + smp_cond_load_relaxed(&current_thread_info()->flags, > > > > + VAL & _TIF_NEED_RESCHED || > > > > + loop_count++ >= POLL_IDLE_RELAX_COUNT); > > > > > > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is > > > never set. With the event stream enabled on arm64, the WFE will > > > eventually be woken up, loop_count incremented and the condition would > > > become true. However, the smp_cond_load_relaxed() semantics require that > > > a different agent updates the variable being waited on, not the waiting > > > CPU updating it itself. Also note that the event stream can be disabled > > > on arm64 on the kernel command line. > > > > Alternatively could we condition arch_haltpoll_want() on > > arch_timer_evtstrm_available(), like v7? > > Yes, I'm thinking of staging it somewhat like that. First an > smp_cond_load_relaxed() which gets rid of this issue, followed by > one based on smp_cond_load_relaxed_timeout(). > > That said, conditioning just arch_haltpoll_want() won't suffice since > what Catalin pointed out affects all users of poll_idle(), not just > haltpoll. The only other users I see today are apm_init() and acpi_processor_setup_cstates(), both in the x86 path. Perhaps not ideal, but should be sufficient. > > Right now there's only haltpoll but there are future users like > zhenglifeng with a patch for acpi-idle here: > > https://lore.kernel.org/all/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/ > > > > Does the code above break any other architecture? I'd say if you want > > > something like this, better introduce a new smp_cond_load_timeout() > > > API. The above looks like a hack that may only work on arm64 when the > > > event stream is enabled. > > > > > > A generic option is udelay() (on arm64 it would use WFE/WFET by > > > default). Not sure how important it is for poll_idle() but the downside > > > of udelay() is that it won't be able to also poll need_resched() while > > > waiting for the timeout. If this matters, you could instead make smaller > > > udelay() calls. 
Yet another problem, I don't know how energy efficient > > > udelay() is on x86 vs cpu_relax(). > > > > > > So maybe an smp_cond_load_timeout() would be better, implemented with > > > cpu_relax() generically and the arm64 would use LDXR, WFE and rely on > > > the event stream (or fall back to cpu_relax() if the event stream is > > > disabled). > > > > > > -- > > > Catalin > > > -- > ankur
On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote: > On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote: > > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: > > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c > > > index 9b6d90a72601..fc1204426158 100644 > > > --- a/drivers/cpuidle/poll_state.c > > > +++ b/drivers/cpuidle/poll_state.c > > > @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, > > > > > > raw_local_irq_enable(); > > > if (!current_set_polling_and_test()) { > > > - unsigned int loop_count = 0; > > > u64 limit; > > > > > > limit = cpuidle_poll_time(drv, dev); > > > > > > while (!need_resched()) { > > > - cpu_relax(); > > > - if (loop_count++ < POLL_IDLE_RELAX_COUNT) > > > - continue; > > > - > > > - loop_count = 0; > > > + unsigned int loop_count = 0; > > > if (local_clock_noinstr() - time_start > limit) { > > > dev->poll_time_limit = true; > > > break; > > > } > > > + > > > + smp_cond_load_relaxed(&current_thread_info()->flags, > > > + VAL & _TIF_NEED_RESCHED || > > > + loop_count++ >= POLL_IDLE_RELAX_COUNT); > > > > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is > > never set. With the event stream enabled on arm64, the WFE will > > eventually be woken up, loop_count incremented and the condition would > > become true. However, the smp_cond_load_relaxed() semantics require that > > a different agent updates the variable being waited on, not the waiting > > CPU updating it itself. Also note that the event stream can be disabled > > on arm64 on the kernel command line. > > Alternatively could we condition arch_haltpoll_want() on > arch_timer_evtstrm_available(), like v7? No. The problem is about the smp_cond_load_relaxed() semantics - it can't wait on a variable that's only updated in its exit condition. We need a new API for this, especially since we are changing generic code here (even if it was arm64 code only, I'd still object to such smp_cond_load_*() constructs).
On Wed, 16 Oct 2024, Ankur Arora wrote: > > The other core will wake our core up by sending an IPI. The IPI will > > invoke a scheduler function on our core and the WFE will continue. > > Why? The target core is not sleeping. It is *polling* on a memory > address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell > it that a need-resched bit is set. The IPI is sent to interrupt the process that is not sleeping. This is done so the busy processor can reschedule the currently running process and respond to the event. It does not matter if the core is "sleeping" or not.
On Wed, 16 Oct 2024, Catalin Marinas wrote: > The behaviour above is slightly different from the current poll_idle() > implementation. The above is more like poll every timeout period rather > than continuously poll until either the need_resched() condition is true > _or_ the timeout expired. From Ankur's email, an IPI may not happen so > we don't have any guarantee that WFET will wake up before the timeout. > The only way for WFE/WFET to wake up on need_resched() is to use LDXR to > arm the exclusive monitor. That's what smp_cond_load_relaxed() does. Sorry no. The IPI will cause the WFE to continue immediately and not wait till the end of the timeout period.
On Thu, Oct 17, 2024 at 09:56:13AM -0700, Christoph Lameter (Ampere) wrote: > On Wed, 16 Oct 2024, Catalin Marinas wrote: > > The behaviour above is slightly different from the current poll_idle() > > implementation. The above is more like poll every timeout period rather > > than continuously poll until either the need_resched() condition is true > > _or_ the timeout expired. From Ankur's email, an IPI may not happen so > > we don't have any guarantee that WFET will wake up before the timeout. > > The only way for WFE/WFET to wake up on need_resched() is to use LDXR to > > arm the exclusive monitor. That's what smp_cond_load_relaxed() does. > > Sorry no. The IPI will cause the WFE to continue immediately and not wait > till the end of the timeout period. *If* there is an IPI. The scheduler is not really my area but some functions like wake_up_idle_cpu() seem to elide the IPI if TIF_POLLING_NRFLAG is set. But even if we had an IPI, it still feels like abusing the semantics of smp_cond_load_relaxed() when relying on it to increment a variable in the condition check as a result of some unrelated wake-up event. This API is meant to wait for a condition on a single variable. It cannot wait on multiple variables and especially not one it updates itself (even if it happens to work on arm64 under certain conditions). My strong preference would be to revive the smp_cond_load_timeout() proposal from Ankur earlier in the year.
Christoph Lameter (Ampere) <cl@gentwo.org> writes: > On Wed, 16 Oct 2024, Ankur Arora wrote: > >> > The other core will wake our core up by sending an IPI. The IPI will >> > invoke a scheduler function on our core and the WFE will continue. >> >> Why? The target core is not sleeping. It is *polling* on a memory >> address (on arm64, via LDXR; WFE). Ergo an IPI is not needed to tell >> it that a need-resched bit is set. > > The IPI is sent to interrupt the process that is not sleeping. This is > done so the busy processor can reschedule the currently running process > and respond to the event. The scheduler treats idle specially (if the architecture defines TIF_POLLING_NRFLAG). There's also the sched_wake_idle_without_ipi tracepoint for this path. $ sudo perf stat -e sched:sched_wake_idle_without_ipi perf bench sched pipe # Running 'sched/pipe' benchmark: # Executed 1000000 pipe operations between two processes Total time: 5.173 [sec] 5.173613 usecs/op 193288 ops/sec Performance counter stats for 'perf bench sched pipe': 1,992,368 sched:sched_wake_idle_without_ipi 5.178976487 seconds time elapsed 0.396076000 seconds user 6.999566000 seconds sys -- ankur
Catalin Marinas <catalin.marinas@arm.com> writes: > On Thu, Oct 17, 2024 at 09:56:13AM -0700, Christoph Lameter (Ampere) wrote: >> On Wed, 16 Oct 2024, Catalin Marinas wrote: >> > The behaviour above is slightly different from the current poll_idle() >> > implementation. The above is more like poll every timeout period rather >> > than continuously poll until either the need_resched() condition is true >> > _or_ the timeout expired. From Ankur's email, an IPI may not happen so >> > we don't have any guarantee that WFET will wake up before the timeout. >> > The only way for WFE/WFET to wake up on need_resched() is to use LDXR to >> > arm the exclusive monitor. That's what smp_cond_load_relaxed() does. >> >> Sorry no. The IPI will cause the WFE to continue immediately and not wait >> till the end of the timeout period. > > *If* there is an IPI. The scheduler is not really my area but some > functions like wake_up_idle_cpu() seem to elide the IPI if > TIF_POLLING_NRFLAG is set. > > But even if we had an IPI, it still feels like abusing the semantics of > smp_cond_load_relaxed() when relying on it to increment a variable in > the condition check as a result of some unrelated wake-up event. This > API is meant to wait for a condition on a single variable. It cannot > wait on multiple variables and especially not one it updates itself > (even if it happens to work on arm64 under certain conditions). Yeah that makes sense. smp_cond_load_relaxed() uses two separate side-effects to make sure things work: the event-stream and the increment in the conditional. I do want to thrash out smp_cond_load_timeout() a bit more but let me reply to your other mail for that. > My strong preference would be to revive the smp_cond_load_timeout() > proposal from Ankur earlier in the year. Ack that. -- ankur
Catalin Marinas <catalin.marinas@arm.com> writes: > On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote: >> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote: >> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: >> > > + smp_cond_load_relaxed(&current_thread_info()->flags, >> > > + VAL & _TIF_NEED_RESCHED || >> > > + loop_count++ >= POLL_IDLE_RELAX_COUNT); >> > >> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is >> > never set. With the event stream enabled on arm64, the WFE will >> > eventually be woken up, loop_count incremented and the condition would >> > become true. However, the smp_cond_load_relaxed() semantics require that >> > a different agent updates the variable being waited on, not the waiting >> > CPU updating it itself. Also note that the event stream can be disabled >> > on arm64 on the kernel command line. >> >> Alternatively could we condition arch_haltpoll_want() on >> arch_timer_evtstrm_available(), like v7? > > No. The problem is about the smp_cond_load_relaxed() semantics - it > can't wait on a variable that's only updated in its exit condition. We > need a new API for this, especially since we are changing generic code > here (even if it was arm64 code only, I'd still object to such > smp_cond_load_*() constructs). Right. The problem is that smp_cond_load_relaxed() used in this context depends on the event-stream side effect when the interface does not encode those semantics anywhere. So, a smp_cond_load_timeout() like in [1] that continues to depend on the event-stream is better because it explicitly accounts for the side effect from the timeout. This would cover both the WFxT and the event-stream case. The part I'm a little less sure about is the case where WFxT and the event-stream are absent. As you said earlier, for that case on arm64, we use either short __delay() calls or spin in cpu_relax(), both of which are essentially the same thing. Now on x86 cpu_relax() is quite optimal. The spec explicitly recommends it and from my measurement a loop doing "while (!cond) cpu_relax()" gets an IPC of something like 0.1 or similar. On my arm64 systems however the same loop gets an IPC of 2. Now this likely varies greatly but seems like it would run pretty hot some of the time. So maybe the right thing to do would be to keep smp_cond_load_timeout() but only allow polling if WFxT or event-stream is enabled. And enhance cpuidle_poll_state_init() to fail if the above condition is not met. Does that make sense? Thanks Ankur [1] https://lore.kernel.org/lkml/87edae3a1x.fsf@oracle.com/
On Thu, Oct 17, 2024 at 03:47:31PM -0700, Ankur Arora wrote: > Catalin Marinas <catalin.marinas@arm.com> writes: > > On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote: > >> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote: > >> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: > >> > > + smp_cond_load_relaxed(&current_thread_info()->flags, > >> > > + VAL & _TIF_NEED_RESCHED || > >> > > + loop_count++ >= POLL_IDLE_RELAX_COUNT); > >> > > >> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is > >> > never set. With the event stream enabled on arm64, the WFE will > >> > eventually be woken up, loop_count incremented and the condition would > >> > become true. However, the smp_cond_load_relaxed() semantics require that > >> > a different agent updates the variable being waited on, not the waiting > >> > CPU updating it itself. Also note that the event stream can be disabled > >> > on arm64 on the kernel command line. > >> > >> Alternatively could we condition arch_haltpoll_want() on > >> arch_timer_evtstrm_available(), like v7? > > > > No. The problem is about the smp_cond_load_relaxed() semantics - it > > can't wait on a variable that's only updated in its exit condition. We > > need a new API for this, especially since we are changing generic code > > here (even if it was arm64 code only, I'd still object to such > > smp_cond_load_*() constructs). > > Right. The problem is that smp_cond_load_relaxed() used in this context > depends on the event-stream side effect when the interface does not > encode those semantics anywhere. > > So, a smp_cond_load_timeout() like in [1] that continues to depend on > the event-stream is better because it explicitly accounts for the side > effect from the timeout. > > This would cover both the WFxT and the event-stream case. Indeed. > The part I'm a little less sure about is the case where WFxT and the > event-stream are absent. > > As you said earlier, for that case on arm64, we use either short > __delay() calls or spin in cpu_relax(), both of which are essentially > the same thing. Something derived from __delay(), not exactly this function. We can't use it directly as we also want it to wake up if an event is generated as a result of a memory write (like the current smp_cond_load()). > Now on x86 cpu_relax() is quite optimal. The spec explicitly recommends > it and from my measurement a loop doing "while (!cond) cpu_relax()" gets > an IPC of something like 0.1 or similar. > > On my arm64 systems however the same loop gets an IPC of 2. Now this > likely varies greatly but seems like it would run pretty hot some of > the time. For the cpu_relax() fall-back, it wouldn't be any worse than the current poll_idle() code, though I guess in this instance we'd not enable idle polling. I expect the event stream to be on in all production deployments. The reason we have a way to disable it is for testing. We've had hardware errata in the past where the event on spin_unlock doesn't cross the cluster boundary. We'd not notice because of the event stream. > So maybe the right thing to do would be to keep smp_cond_load_timeout() > but only allow polling if WFxT or event-stream is enabled. And enhance > cpuidle_poll_state_init() to fail if the above condition is not met. We could do this as well. 
Well, it might be simpler to do this in haltpoll_want(), backed by an arch_haltpoll_want() function. I assume we want poll_idle() to wake up as soon as a task becomes available. Otherwise we could have just used udelay() for some fraction of cpuidle_poll_time() instead of cpu_relax().
Catalin Marinas <catalin.marinas@arm.com> writes: > On Thu, Oct 17, 2024 at 03:47:31PM -0700, Ankur Arora wrote: >> Catalin Marinas <catalin.marinas@arm.com> writes: >> > On Wed, Oct 16, 2024 at 03:13:33PM +0000, Okanovic, Haris wrote: >> >> On Tue, 2024-10-15 at 13:04 +0100, Catalin Marinas wrote: >> >> > On Wed, Sep 25, 2024 at 04:24:15PM -0700, Ankur Arora wrote: >> >> > > + smp_cond_load_relaxed(&current_thread_info()->flags, >> >> > > + VAL & _TIF_NEED_RESCHED || >> >> > > + loop_count++ >= POLL_IDLE_RELAX_COUNT); >> >> > >> >> > The above is not guaranteed to make progress if _TIF_NEED_RESCHED is >> >> > never set. With the event stream enabled on arm64, the WFE will >> >> > eventually be woken up, loop_count incremented and the condition would >> >> > become true. However, the smp_cond_load_relaxed() semantics require that >> >> > a different agent updates the variable being waited on, not the waiting >> >> > CPU updating it itself. Also note that the event stream can be disabled >> >> > on arm64 on the kernel command line. >> >> >> >> Alternatively could we condition arch_haltpoll_want() on >> >> arch_timer_evtstrm_available(), like v7? >> > >> > No. The problem is about the smp_cond_load_relaxed() semantics - it >> > can't wait on a variable that's only updated in its exit condition. We >> > need a new API for this, especially since we are changing generic code >> > here (even if it was arm64 code only, I'd still object to such >> > smp_cond_load_*() constructs). >> >> Right. The problem is that smp_cond_load_relaxed() used in this context >> depends on the event-stream side effect when the interface does not >> encode those semantics anywhere. >> >> So, a smp_cond_load_timeout() like in [1] that continues to depend on >> the event-stream is better because it explicitly accounts for the side >> effect from the timeout. >> >> This would cover both the WFxT and the event-stream case. > > Indeed. > >> The part I'm a little less sure about is the case where WFxT and the >> event-stream are absent. >> >> As you said earlier, for that case on arm64, we use either short >> __delay() calls or spin in cpu_relax(), both of which are essentially >> the same thing. > Something derived from __delay(), not exactly this function. We can't > use it directly as we also want it to wake up if an event is generated > as a result of a memory write (like the current smp_cond_load()). > >> Now on x86 cpu_relax() is quite optimal. The spec explicitly recommends >> it and from my measurement a loop doing "while (!cond) cpu_relax()" gets >> an IPC of something like 0.1 or similar. >> >> On my arm64 systems however the same loop gets an IPC of 2. Now this >> likely varies greatly but seems like it would run pretty hot some of >> the time. > > For the cpu_relax() fall-back, it wouldn't be any worse than the current > poll_idle() code, though I guess in this instance we'd not enable idle > polling. > > I expect the event stream to be on in all production deployments. The > reason we have a way to disable it is for testing. We've had hardware > errata in the past where the event on spin_unlock doesn't cross the > cluster boundary. We'd not notice because of the event stream. Ah, interesting. Thanks, that helps. >> So maybe the right thing to do would be to keep smp_cond_load_timeout() >> but only allow polling if WFxT or event-stream is enabled. And enhance >> cpuidle_poll_state_init() to fail if the above condition is not met. > > We could do this as well. 
Maybe hide this behind another function like > arch_has_efficient_smp_cond_load_timeout() (well, some shorter name), > checked somewhere in or on the path to cpuidle_poll_state_init(). Well, > it might be simpler to do this in haltpoll_want(), backed by an > arch_haltpoll_want() function. Yeah, checking in arch_haltpoll_want() would mean that we can leave all the cpuidle_poll_state_init() call sites unchanged. However, I suspect that even acpi-idle on arm64 might end up using poll_idle() (as this patch tries to do: https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/). So, let me try doing it both ways to see which one is simpler. Given that the event-stream can be assumed to be always-on it might just be more straightforward to fall back to cpu_relax() in that edge case. > I assume we want poll_idle() to wake up as soon as a task becomes > available. Otherwise we could have just used udelay() for some fraction > of cpuidle_poll_time() instead of cpu_relax(). Yeah, agreed. Thanks -- ankur
On Fri, Oct 18, 2024 at 12:00:34PM -0700, Ankur Arora wrote: > Catalin Marinas <catalin.marinas@arm.com> writes: > > On Thu, Oct 17, 2024 at 03:47:31PM -0700, Ankur Arora wrote: > >> So maybe the right thing to do would be to keep smp_cond_load_timeout() > >> but only allow polling if WFxT or event-stream is enabled. And enhance > >> cpuidle_poll_state_init() to fail if the above condition is not met. > > > > We could do this as well. Maybe hide this behind another function like > > arch_has_efficient_smp_cond_load_timeout() (well, some shorter name), > > checked somewhere in or on the path to cpuidle_poll_state_init(). Well, > > it might be simpler to do this in haltpoll_want(), backed by an > > arch_haltpoll_want() function. > > Yeah, checking in arch_haltpoll_want() would mean that we can leave all > the cpuidle_poll_state_init() call sites unchanged. > > However, I suspect that even acpi-idle on arm64 might end up using > poll_idle() (as this patch tries to do: > https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/). > > So, let me try doing it both ways to see which one is simpler. > Given that the event-stream can be assumed to be always-on it might just > be more straightforward to fall back to cpu_relax() in that edge case. I agree, let's go with the simplest. If one has some strong case for running with the event stream disabled and idle polling becomes too energy inefficient, we can revisit and add some run-time checks.
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c index 9b6d90a72601..fc1204426158 100644 --- a/drivers/cpuidle/poll_state.c +++ b/drivers/cpuidle/poll_state.c @@ -21,21 +21,20 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, raw_local_irq_enable(); if (!current_set_polling_and_test()) { - unsigned int loop_count = 0; u64 limit; limit = cpuidle_poll_time(drv, dev); while (!need_resched()) { - cpu_relax(); - if (loop_count++ < POLL_IDLE_RELAX_COUNT) - continue; - - loop_count = 0; + unsigned int loop_count = 0; if (local_clock_noinstr() - time_start > limit) { dev->poll_time_limit = true; break; } + + smp_cond_load_relaxed(&current_thread_info()->flags, + VAL & _TIF_NEED_RESCHED || + loop_count++ >= POLL_IDLE_RELAX_COUNT); } } raw_local_irq_disable();