Message ID: 1707982910-27680-8-git-send-email-mihai.carabas@oracle.com (mailing list archive)
State:      New, archived
Series:     [v4,1/8] x86: Move ARCH_HAS_CPU_RELAX to arch
Hi,

> Subject: [PATCH v4 7/8] cpuidle/poll_state: replace cpu_relax with smp_cond_load_relaxed
>
> cpu_relax on ARM64 does a simple "yield". Thus we replace it with
> smp_cond_load_relaxed which basically does a "wfe".
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
> ---
>  drivers/cpuidle/poll_state.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..1e45be906e72 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -13,6 +13,7 @@
>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>                                struct cpuidle_driver *drv, int index)
>  {
> +       unsigned long ret;
>         u64 time_start;
>
>         time_start = local_clock_noinstr();
> @@ -26,12 +27,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
>
>         limit = cpuidle_poll_time(drv, dev);
>
> -       while (!need_resched()) {
> -               cpu_relax();
> -               if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> -                       continue;
> -
> +       for (;;) {
>                 loop_count = 0;
> +
> +               ret = smp_cond_load_relaxed(&current_thread_info()->flags,
> +                                           VAL & _TIF_NEED_RESCHED ||
> +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> +
> +               if (!(ret & _TIF_NEED_RESCHED))
> +                       break;

Should this be "if (ret & _TIF_NEED_RESCHED)", since we want to break
here if the flag is set, or have I misunderstood?

Regards,
Tomohiro

> +
>                 if (local_clock_noinstr() - time_start > limit) {
>                         dev->poll_time_limit = true;
>                         break;
> --
> 1.8.3.1

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Tomohiro Misono (Fujitsu) <misono.tomohiro@fujitsu.com> writes:

> Hi,
>
> > Subject: [PATCH v4 7/8] cpuidle/poll_state: replace cpu_relax with smp_cond_load_relaxed
> >
> > cpu_relax on ARM64 does a simple "yield". Thus we replace it with
> > smp_cond_load_relaxed which basically does a "wfe".
> > [...]
> > -       while (!need_resched()) {
> > -               cpu_relax();
> > -               if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> > -                       continue;
> > -
> > +       for (;;) {
> >                 loop_count = 0;
> > +
> > +               ret = smp_cond_load_relaxed(&current_thread_info()->flags,
> > +                                           VAL & _TIF_NEED_RESCHED ||
> > +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
> > +
> > +               if (!(ret & _TIF_NEED_RESCHED))
> > +                       break;
>
> Should this be "if (ret & _TIF_NEED_RESCHED)" since we want to break here
> if the flag is set, or am I misunderstood?

Yeah, you are right. The check is inverted. I'll be re-spinning this
series. Will fix.

Though, it probably makes sense to just keep the original
"while (!need_resched())" check.

Thanks for the review.

--
ankur
On Thu, 2024-02-15 at 09:41 +0200, Mihai Carabas wrote:
> cpu_relax on ARM64 does a simple "yield". Thus we replace it with
> smp_cond_load_relaxed which basically does a "wfe".
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
> ---
>  drivers/cpuidle/poll_state.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> [...]
> -       while (!need_resched()) {
> -               cpu_relax();
> -               if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> -                       continue;
> -
> +       for (;;) {
>                 loop_count = 0;
> +
> +               ret = smp_cond_load_relaxed(&current_thread_info()->flags,
> +                                           VAL & _TIF_NEED_RESCHED ||
> +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);

Is it necessary to repeat this 200 times with a wfe poll? Does kvm not
implement a timeout period? Could you make it configurable?

This patch improves certain workloads on AWS Graviton instances as
well, but blocks up to 6ms in 200 * 30us increments before going to
wfi, which is a bit excessive.

> +
> +               if (!(ret & _TIF_NEED_RESCHED))
> +                       break;
> +
>                 if (local_clock_noinstr() - time_start > limit) {
>                         dev->poll_time_limit = true;
>                         break;
Okanovic, Haris <harisokn@amazon.com> writes:

> On Thu, 2024-02-15 at 09:41 +0200, Mihai Carabas wrote:
>> cpu_relax on ARM64 does a simple "yield". Thus we replace it with
>> smp_cond_load_relaxed which basically does a "wfe".
>> [...]
>> +               ret = smp_cond_load_relaxed(&current_thread_info()->flags,
>> +                                           VAL & _TIF_NEED_RESCHED ||
>> +                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
>
> Is it necessary to repeat this 200 times with a wfe poll?

The POLL_IDLE_RELAX_COUNT is there because on x86 each cpu_relax()
iteration is much shorter. With WFE, it makes less sense.

> Does kvm not implement a timeout period?

Not yet, but it does become more useful after a WFE haltpoll is
available on ARM64.

Haltpoll does have a timeout, which you should be able to tune via
/sys/module/haltpoll/parameters/, but that, of course, won't help here.

> Could you make it configurable? This patch improves certain workloads
> on AWS Graviton instances as well, but blocks up to 6ms in 200 * 30us
> increments before going to wfi, which is a bit excessive.

Yeah, this looks like a problem. We could solve it by making it an
architectural parameter. Though I worry about ARM platforms with much
smaller default timeouts.

The other possibility is using WFET in the primitive, but then we have
that dependency and that's a bigger change.

Will address this in the next version.

Thanks for pointing this out.

--
ankur
On Fri, 2024-04-05 at 16:14 -0700, Ankur Arora wrote:
> Okanovic, Haris <harisokn@amazon.com> writes:
>
> > On Thu, 2024-02-15 at 09:41 +0200, Mihai Carabas wrote:
> > > cpu_relax on ARM64 does a simple "yield". Thus we replace it with
> > > smp_cond_load_relaxed which basically does a "wfe".
> > > [...]
> >
> > Is it necessary to repeat this 200 times with a wfe poll?
>
> The POLL_IDLE_RELAX_COUNT is there because on x86 each cpu_relax()
> iteration is much shorter.
>
> With WFE, it makes less sense.
>
> > Does kvm not implement a timeout period?
>
> Not yet, but it does become more useful after a WFE haltpoll is
> available on ARM64.

Note that kvm conditionally traps WFE and WFI based on the number of
host CPU tasks. VMs will sometimes see hardware behavior - potentially
polling for a long time before entering WFI.

https://elixir.bootlin.com/linux/latest/source/arch/arm64/kvm/arm.c#L459

> Haltpoll does have a timeout, which you should be able to tune via
> /sys/module/haltpoll/parameters/ but that, of course, won't help here.
>
> > Could you make it configurable? This patch improves certain workloads
> > on AWS Graviton instances as well, but blocks up to 6ms in 200 * 30us
> > increments before going to wfi, which is a bit excessive.
>
> Yeah, this looks like a problem. We could solve it by making it an
> architectural parameter. Though I worry about ARM platforms with
> much smaller default timeouts.
> The other possibility is using WFET in the primitive, but then we
> have that dependency and that's a bigger change.

See arm64's delay() for inspiration:

https://elixir.bootlin.com/linux/v6.9-rc2/source/arch/arm64/lib/delay.c#L26

> Will address this in the next version.
>
> Thanks for pointing this out.
>
> --
> ankur
Okanovic, Haris <harisokn@amazon.com> writes:

> On Fri, 2024-04-05 at 16:14 -0700, Ankur Arora wrote:
>> Okanovic, Haris <harisokn@amazon.com> writes:
>> [...]
>>> Does kvm not implement a timeout period?
>>
>> Not yet, but it does become more useful after a WFE haltpoll is
>> available on ARM64.
>
> Note that kvm conditionally traps WFE and WFI based on the number of
> host CPU tasks. VMs will sometimes see hardware behavior - potentially
> polling for a long time before entering WFI.
>
> https://elixir.bootlin.com/linux/latest/source/arch/arm64/kvm/arm.c#L459

Yeah. There was a discussion on this:
https://lore.kernel.org/lkml/871qc6qufy.fsf@oracle.com/

>> Haltpoll does have a timeout, which you should be able to tune via
>> /sys/module/haltpoll/parameters/ but that, of course, won't help here.
>>
>>> Could you make it configurable? This patch improves certain workloads
>>> on AWS Graviton instances as well, but blocks up to 6ms in 200 * 30us
>>> increments before going to wfi, which is a bit excessive.
>>
>> Yeah, this looks like a problem. We could solve it by making it an
>> architectural parameter. Though I worry about ARM platforms with
>> much smaller default timeouts.
>> The other possibility is using WFET in the primitive, but then we
>> have that dependency and that's a bigger change.
>
> See arm64's delay() for inspiration:
>
> https://elixir.bootlin.com/linux/v6.9-rc2/source/arch/arm64/lib/delay.c#L26

Sure, that part is straightforward enough. However, this will need a
fallback for the case when WFET is not available. And, because this
path is used on x86, we need a cross-platform smp_cond*timeout().
Though, given that the x86 version is based on cpu_relax(), that could
just fold the sched_clock() check in.

Maybe another place to do this would be by KVM forcing a WFE timeout.
Arguably that is needed regardless of whether we use a
smp_cond*timeout() or not.

--
ankur
On Mon, 2024-04-08 at 11:46 -0700, Ankur Arora wrote:
> Okanovic, Haris <harisokn@amazon.com> writes:
> [...]
> > See arm64's delay() for inspiration:
> >
> > https://elixir.bootlin.com/linux/v6.9-rc2/source/arch/arm64/lib/delay.c#L26
>
> Sure, that part is straight-forward enough. However, this will need a
> fallback the case when WFET is not available. And, because this path
> is used on x86, so we need a cross platform smp_cond*timeout().
> Though given that the x86 version is based on cpu_relax() then that
> could just fold the sched_clock() check in.

I was trying to point out how delay() handles different configurations:
it prefers WFET when available, falls back to WFE when the event stream
is available, and finally falls back to cpu_relax() as a last resort.
The same logic can apply here. The x86 case can always use the
cpu_relax() fallback, for the same behavior as smp_cond_load_relaxed().

Re your concern about "ARM platforms with much smaller default
timeouts": you could do something different when arch_timer_get_rate()
is too small. Although I'm not sure this is a huge concern, given that
delay() doesn't seem to care in the WFE case.

--
Haris Okanovic

> Maybe another place to do this would be by KVM forcing a WFE timeout.
> Arguably that is needed regardless of whether we use a
> smp_cond*timeout() or not.
>
> --
> ankur
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..1e45be906e72 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -13,6 +13,7 @@
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
                               struct cpuidle_driver *drv, int index)
 {
+       unsigned long ret;
        u64 time_start;

        time_start = local_clock_noinstr();
@@ -26,12 +27,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,

        limit = cpuidle_poll_time(drv, dev);

-       while (!need_resched()) {
-               cpu_relax();
-               if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-                       continue;
-
+       for (;;) {
                loop_count = 0;
+
+               ret = smp_cond_load_relaxed(&current_thread_info()->flags,
+                                           VAL & _TIF_NEED_RESCHED ||
+                                           loop_count++ >= POLL_IDLE_RELAX_COUNT);
+
+               if (!(ret & _TIF_NEED_RESCHED))
+                       break;
+
                if (local_clock_noinstr() - time_start > limit) {
                        dev->poll_time_limit = true;
                        break;
cpu_relax on ARM64 does a simple "yield". Thus we replace it with
smp_cond_load_relaxed which basically does a "wfe".

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
---
 drivers/cpuidle/poll_state.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)