Message ID | 1700488898-12431-8-git-send-email-mihai.carabas@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/7] x86: Move ARCH_HAS_CPU_RELAX to arch | expand |
On Mon, 20 Nov 2023, Mihai Carabas wrote: > cpu_relax on ARM64 does a simple "yield". Thus we replace it with > smp_cond_load_relaxed which basically does a "wfe". Well it clears events first (which requires the first WFE) and then does a WFE waiting for any events if no events were pending. WFE does not cause a VMEXIT? Or does the inner loop of smp_cond_load_relaxed now do 2x VMEXITS? KVM ARM64 code seems to indicate that WFE causes a VMEXIT. See kvm_handle_wfx().
La 22.11.2023 22:51, Christoph Lameter a scris: > > On Mon, 20 Nov 2023, Mihai Carabas wrote: > >> cpu_relax on ARM64 does a simple "yield". Thus we replace it with >> smp_cond_load_relaxed which basically does a "wfe". > > Well it clears events first (which requires the first WFE) and then > does a WFE waiting for any events if no events were pending. > > WFE does not cause a VMEXIT? Or does the inner loop of > smp_cond_load_relaxed now do 2x VMEXITS? > > KVM ARM64 code seems to indicate that WFE causes a VMEXIT. See > kvm_handle_wfx(). In KVM ARM64 the WFE traping is dynamic: it is enabled only if there are more tasks waiting on the same core (e.g. on an oversubscribed system). In arch/arm64/kvm/arm.c: 457 >-------if (single_task_running()) 458 >------->-------vcpu_clear_wfx_traps(vcpu); 459 >-------else 460 >------->-------vcpu_set_wfx_traps(vcpu); This of course can be improved by having a knob where you can completly disable wfx traping by your needs, but I left this as another subject to tackle.
On Wed, 22 Nov 2023, Mihai Carabas wrote: > La 22.11.2023 22:51, Christoph Lameter a scris: >> >> On Mon, 20 Nov 2023, Mihai Carabas wrote: >> >>> cpu_relax on ARM64 does a simple "yield". Thus we replace it with >>> smp_cond_load_relaxed which basically does a "wfe". >> >> Well it clears events first (which requires the first WFE) and then does a >> WFE waiting for any events if no events were pending. >> >> WFE does not cause a VMEXIT? Or does the inner loop of >> smp_cond_load_relaxed now do 2x VMEXITS? >> >> KVM ARM64 code seems to indicate that WFE causes a VMEXIT. See >> kvm_handle_wfx(). > > In KVM ARM64 the WFE traping is dynamic: it is enabled only if there are more > tasks waiting on the same core (e.g. on an oversubscribed system). > > In arch/arm64/kvm/arm.c: > > 457 >-------if (single_task_running()) > 458 >------->-------vcpu_clear_wfx_traps(vcpu); > 459 >-------else > 460 >------->-------vcpu_set_wfx_traps(vcpu); Ahh. Cool did not know about that. But still: Lots of VMEXITs once the load has to be shared. > This of course can be improved by having a knob where you can completly > disable wfx traping by your needs, but I left this as another subject to > tackle. kvm_arch_vcpu_load() looks strange. On the one hand we pass a cpu number into it and then we use functions that only work if we are running on that cpu? It would be better to use smp_processor_id() in the function and not pass the cpu number to it.
Christoph Lameter (Ampere) <cl@linux.com> writes: > On Wed, 22 Nov 2023, Mihai Carabas wrote: > >> La 22.11.2023 22:51, Christoph Lameter a scris: >>> On Mon, 20 Nov 2023, Mihai Carabas wrote: >>> >>>> cpu_relax on ARM64 does a simple "yield". Thus we replace it with >>>> smp_cond_load_relaxed which basically does a "wfe". >>> Well it clears events first (which requires the first WFE) and then does a >>> WFE waiting for any events if no events were pending. >>> WFE does not cause a VMEXIT? Or does the inner loop of >>> smp_cond_load_relaxed now do 2x VMEXITS? >>> KVM ARM64 code seems to indicate that WFE causes a VMEXIT. See >>> kvm_handle_wfx(). >> >> In KVM ARM64 the WFE traping is dynamic: it is enabled only if there are more >> tasks waiting on the same core (e.g. on an oversubscribed system). >> >> In arch/arm64/kvm/arm.c: >> >> 457 >-------if (single_task_running()) >> 458 >------->-------vcpu_clear_wfx_traps(vcpu); >> 459 >-------else >> 460 >------->-------vcpu_set_wfx_traps(vcpu); > > Ahh. Cool did not know about that. But still: Lots of VMEXITs once the load has > to be shared. Yeah, anytime there's more than one runnable process. Another, more critical place where we will vmexit is the qspinlock slowpath which uses smp_cond_load. >> This of course can be improved by having a knob where you can completly >> disable wfx traping by your needs, but I left this as another subject to >> tackle. Probably needs to be adaptive since we use WFE in error paths as well (for instance to park the CPU.) Ankur
On Mon, Nov 20, 2023 at 04:01:38PM +0200, Mihai Carabas wrote: > cpu_relax on ARM64 does a simple "yield". Thus we replace it with > smp_cond_load_relaxed which basically does a "wfe". > > Suggested-by: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> > --- > drivers/cpuidle/poll_state.c | 14 +++++++++----- > 1 file changed, 9 insertions(+), 5 deletions(-) > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c > index 9b6d90a72601..440cd713e39a 100644 > --- a/drivers/cpuidle/poll_state.c > +++ b/drivers/cpuidle/poll_state.c > @@ -26,12 +26,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, > > limit = cpuidle_poll_time(drv, dev); > > - while (!need_resched()) { > - cpu_relax(); > - if (loop_count++ < POLL_IDLE_RELAX_COUNT) > - continue; > - > + for (;;) { > loop_count = 0; > + > + smp_cond_load_relaxed(¤t_thread_info()->flags, > + (VAL & _TIF_NEED_RESCHED) || > + (loop_count++ >= POLL_IDLE_RELAX_COUNT)); > + > + if (loop_count < POLL_IDLE_RELAX_COUNT) > + break; > + > if (local_clock_noinstr() - time_start > limit) { > dev->poll_time_limit = true; > break; Doesn't this make ARCH_HAS_CPU_RELAX a complete misnomer? Will
La 11.12.2023 13:46, Will Deacon a scris: > On Mon, Nov 20, 2023 at 04:01:38PM +0200, Mihai Carabas wrote: >> cpu_relax on ARM64 does a simple "yield". Thus we replace it with >> smp_cond_load_relaxed which basically does a "wfe". >> >> Suggested-by: Peter Zijlstra <peterz@infradead.org> >> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> >> --- >> drivers/cpuidle/poll_state.c | 14 +++++++++----- >> 1 file changed, 9 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c >> index 9b6d90a72601..440cd713e39a 100644 >> --- a/drivers/cpuidle/poll_state.c >> +++ b/drivers/cpuidle/poll_state.c >> @@ -26,12 +26,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, >> >> limit = cpuidle_poll_time(drv, dev); >> >> - while (!need_resched()) { >> - cpu_relax(); >> - if (loop_count++ < POLL_IDLE_RELAX_COUNT) >> - continue; >> - >> + for (;;) { >> loop_count = 0; >> + >> + smp_cond_load_relaxed(¤t_thread_info()->flags, >> + (VAL & _TIF_NEED_RESCHED) || >> + (loop_count++ >= POLL_IDLE_RELAX_COUNT)); >> + >> + if (loop_count < POLL_IDLE_RELAX_COUNT) >> + break; >> + >> if (local_clock_noinstr() - time_start > limit) { >> dev->poll_time_limit = true; >> break; > Doesn't this make ARCH_HAS_CPU_RELAX a complete misnomer? This controls the build of poll_state.c and the generic definition of smp_cond_load_relaxed (used by x86) is using cpu_relax(). Do you propose other approach here? > > Will
On Sun, Jan 28, 2024 at 11:22:50PM +0200, Mihai Carabas wrote: > La 11.12.2023 13:46, Will Deacon a scris: > > On Mon, Nov 20, 2023 at 04:01:38PM +0200, Mihai Carabas wrote: > > > cpu_relax on ARM64 does a simple "yield". Thus we replace it with > > > smp_cond_load_relaxed which basically does a "wfe". > > > > > > Suggested-by: Peter Zijlstra <peterz@infradead.org> > > > Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> > > > --- > > > drivers/cpuidle/poll_state.c | 14 +++++++++----- > > > 1 file changed, 9 insertions(+), 5 deletions(-) > > > > > > diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c > > > index 9b6d90a72601..440cd713e39a 100644 > > > --- a/drivers/cpuidle/poll_state.c > > > +++ b/drivers/cpuidle/poll_state.c > > > @@ -26,12 +26,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, > > > limit = cpuidle_poll_time(drv, dev); > > > - while (!need_resched()) { > > > - cpu_relax(); > > > - if (loop_count++ < POLL_IDLE_RELAX_COUNT) > > > - continue; > > > - > > > + for (;;) { > > > loop_count = 0; > > > + > > > + smp_cond_load_relaxed(¤t_thread_info()->flags, > > > + (VAL & _TIF_NEED_RESCHED) || > > > + (loop_count++ >= POLL_IDLE_RELAX_COUNT)); > > > + > > > + if (loop_count < POLL_IDLE_RELAX_COUNT) > > > + break; > > > + > > > if (local_clock_noinstr() - time_start > limit) { > > > dev->poll_time_limit = true; > > > break; > > Doesn't this make ARCH_HAS_CPU_RELAX a complete misnomer? > > This controls the build of poll_state.c and the generic definition of > smp_cond_load_relaxed (used by x86) is using cpu_relax(). Do you propose > other approach here? Give it a better name? Having ARCH_HAS_CPU_RELAX control a piece of code that doesn't use cpu_relax() doesn't make sense to me. Will
>>>> cpu_relax on ARM64 does a simple "yield". Thus we replace it with >>>> smp_cond_load_relaxed which basically does a "wfe". >>>> >>>> Suggested-by: Peter Zijlstra <peterz@infradead.org> >>>> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> >>>> --- >>>> drivers/cpuidle/poll_state.c | 14 +++++++++----- >>>> 1 file changed, 9 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c >>>> index 9b6d90a72601..440cd713e39a 100644 >>>> --- a/drivers/cpuidle/poll_state.c >>>> +++ b/drivers/cpuidle/poll_state.c >>>> @@ -26,12 +26,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, >>>> limit = cpuidle_poll_time(drv, dev); >>>> - while (!need_resched()) { >>>> - cpu_relax(); >>>> - if (loop_count++ < POLL_IDLE_RELAX_COUNT) >>>> - continue; >>>> - >>>> + for (;;) { >>>> loop_count = 0; >>>> + >>>> + smp_cond_load_relaxed(¤t_thread_info()->flags, >>>> + (VAL & _TIF_NEED_RESCHED) || >>>> + (loop_count++ >= POLL_IDLE_RELAX_COUNT)); >>>> + >>>> + if (loop_count < POLL_IDLE_RELAX_COUNT) >>>> + break; >>>> + >>>> if (local_clock_noinstr() - time_start > limit) { >>>> dev->poll_time_limit = true; >>>> break; >>> Doesn't this make ARCH_HAS_CPU_RELAX a complete misnomer? >> This controls the build of poll_state.c and the generic definition of >> smp_cond_load_relaxed (used by x86) is using cpu_relax(). Do you propose >> other approach here? > Give it a better name? Having ARCH_HAS_CPU_RELAX control a piece of code > that doesn't use cpu_relax() doesn't make sense to me. The generic code for smp_cond_load_relaxed is using cpu_relax and this one is used on x86 - so ARCH_HAS_CPU_RELAX is a prerequisite on x86 when using haltpoll. Only on ARM64 this is overwritten. Moreover ARCH_HAS_CPU_RELAX is controlling the function definition for cpuidle_poll_state_init (this is how it was originally designed). Thanks, Mihai
Mihai Carabas <mihai.carabas@oracle.com> writes: >>>>> cpu_relax on ARM64 does a simple "yield". Thus we replace it with >>>>> smp_cond_load_relaxed which basically does a "wfe". >>>>> >>>>> Suggested-by: Peter Zijlstra <peterz@infradead.org> >>>>> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> >>>>> --- >>>>> drivers/cpuidle/poll_state.c | 14 +++++++++----- >>>>> 1 file changed, 9 insertions(+), 5 deletions(-) >>>>> >>>>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c >>>>> index 9b6d90a72601..440cd713e39a 100644 >>>>> --- a/drivers/cpuidle/poll_state.c >>>>> +++ b/drivers/cpuidle/poll_state.c >>>>> @@ -26,12 +26,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, >>>>> limit = cpuidle_poll_time(drv, dev); >>>>> - while (!need_resched()) { >>>>> - cpu_relax(); >>>>> - if (loop_count++ < POLL_IDLE_RELAX_COUNT) >>>>> - continue; >>>>> - >>>>> + for (;;) { >>>>> loop_count = 0; >>>>> + >>>>> + smp_cond_load_relaxed(¤t_thread_info()->flags, >>>>> + (VAL & _TIF_NEED_RESCHED) || >>>>> + (loop_count++ >= POLL_IDLE_RELAX_COUNT)); >>>>> + >>>>> + if (loop_count < POLL_IDLE_RELAX_COUNT) >>>>> + break; >>>>> + >>>>> if (local_clock_noinstr() - time_start > limit) { >>>>> dev->poll_time_limit = true; >>>>> break; >>>> Doesn't this make ARCH_HAS_CPU_RELAX a complete misnomer? >>> This controls the build of poll_state.c and the generic definition of >>> smp_cond_load_relaxed (used by x86) is using cpu_relax(). Do you propose >>> other approach here? >> Give it a better name? Having ARCH_HAS_CPU_RELAX control a piece of code >> that doesn't use cpu_relax() doesn't make sense to me. > > The generic code for smp_cond_load_relaxed is using cpu_relax and this one is > used on x86 - so ARCH_HAS_CPU_RELAX is a prerequisite on x86 when using > haltpoll. Only on ARM64 this is overwritten. Moreover ARCH_HAS_CPU_RELAX is > controlling the function definition for cpuidle_poll_state_init (this is how it > was originally designed). I suspect Will's point is that the term ARCH_HAS_CPU_RELAX doesn't make a whole lot of sense when we are only indirectly using cpu_relax() in the series. Also, all archs define cpu_relax() (though some as just a barrier()) so ARCH_HAS_CPU_RELAX . Maybe an arch can instead just opt into polling in idle? Perhaps something like this trivial patch: -- diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5edec175b9bf..d80c98c64fd4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -367,7 +367,7 @@ config ARCH_MAY_HAVE_PC_FDC config GENERIC_CALIBRATE_DELAY def_bool y -config ARCH_HAS_CPU_RELAX +config ARCH_WANTS_IDLE_POLL def_bool y config ARCH_HIBERNATION_POSSIBLE diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 55437f5e0c3a..6a0a1f16a5c3 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -36,7 +36,7 @@ #include <asm/cpu.h> #endif -#define ACPI_IDLE_STATE_START (IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX) ? 1 : 0) +#define ACPI_IDLE_STATE_START (IS_ENABLED(CONFIG_ARCH_WANTS_IDLE_POLL) ? 1 : 0) static unsigned int max_cstate __read_mostly = ACPI_PROCESSOR_MAX_POWER; module_param(max_cstate, uint, 0400); @@ -787,7 +787,7 @@ static int acpi_processor_setup_cstates(struct acpi_processor *pr) if (max_cstate == 0) max_cstate = 1; - if (IS_ENABLED(CONFIG_ARCH_HAS_CPU_RELAX)) { + if (IS_ENABLED(CONFIG_ARCH_WANTS_IDLE_POLL)) { cpuidle_poll_state_init(drv); count = 1; } else { diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile index d103342b7cfc..23f48d99f0f2 100644 --- a/drivers/cpuidle/Makefile +++ b/drivers/cpuidle/Makefile @@ -7,7 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o obj-$(CONFIG_DT_IDLE_GENPD) += dt_idle_genpd.o -obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o +obj-$(CONFIG_ARCH_WANTS_IDLE_POLL) += poll_state.o obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o ################################################################################## diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index 3183aeb7f5b4..53e55a91d55d 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -275,7 +275,7 @@ static inline void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, } #endif -#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX) +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_WANTS_IDLE_POLL) void cpuidle_poll_state_init(struct cpuidle_driver *drv); #else static inline void cpuidle_poll_state_init(struct cpuidle_driver *drv) {}
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c index 9b6d90a72601..440cd713e39a 100644 --- a/drivers/cpuidle/poll_state.c +++ b/drivers/cpuidle/poll_state.c @@ -26,12 +26,16 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, limit = cpuidle_poll_time(drv, dev); - while (!need_resched()) { - cpu_relax(); - if (loop_count++ < POLL_IDLE_RELAX_COUNT) - continue; - + for (;;) { loop_count = 0; + + smp_cond_load_relaxed(¤t_thread_info()->flags, + (VAL & _TIF_NEED_RESCHED) || + (loop_count++ >= POLL_IDLE_RELAX_COUNT)); + + if (loop_count < POLL_IDLE_RELAX_COUNT) + break; + if (local_clock_noinstr() - time_start > limit) { dev->poll_time_limit = true; break;
cpu_relax on ARM64 does a simple "yield". Thus we replace it with smp_cond_load_relaxed which basically does a "wfe". Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> --- drivers/cpuidle/poll_state.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-)