Message ID | 1498130534-26568-2-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal (mailing list archive)
State | New, archived
On Thu, 22 Jun 2017, root wrote:

> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,10 @@
>  #include <asm/desc.h>
>  #include <asm/prctl.h>
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +unsigned long poll_threshold_ns;
> +#endif
> +
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>  }
>  #endif
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +void arch_cpu_idle_poll(void)
> +{
> +	ktime_t start, cur, stop;
> +
> +	if (poll_threshold_ns) {
> +		start = cur = ktime_get();
> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
> +		do {
> +			if (need_resched())
> +				break;
> +			cur = ktime_get();
> +		} while (ktime_before(cur, stop));
> +	}
> +}
> +#endif

Aside from the whole approach being debatable, what is the reason to make
this depend on CONFIG_HYPERVISOR_GUEST and to move it into x86? If the
mechanism is worthwhile, then it should go into the generic code and not
into x86. There is absolutely nothing x86-specific in that patch.

Also, the CONFIG_HYPERVISOR_GUEST dependency is silly. Distro kernels ship
with CONFIG_HYPERVISOR_GUEST=y, so this also takes effect on bare metal.

Thanks,

	tglx
On 2017/6/22 22:23, Thomas Gleixner wrote:
> On Thu, 22 Jun 2017, root wrote:
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -39,6 +39,10 @@
>>  #include <asm/desc.h>
>>  #include <asm/prctl.h>
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +unsigned long poll_threshold_ns;
>> +#endif
>> +
>>  /*
>>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
>> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +void arch_cpu_idle_poll(void)
>> +{
>> +	ktime_t start, cur, stop;
>> +
>> +	if (poll_threshold_ns) {
>> +		start = cur = ktime_get();
>> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
>> +		do {
>> +			if (need_resched())
>> +				break;
>> +			cur = ktime_get();
>> +		} while (ktime_before(cur, stop));
>> +	}
>> +}
>> +#endif
>
> Aside from the whole approach being debatable, what is the reason to make
> this depend on CONFIG_HYPERVISOR_GUEST and to move it into x86? If the
> mechanism is worthwhile, then it should go into the generic code and not
> into x86. There is absolutely nothing x86-specific in that patch.
>
> Also, the CONFIG_HYPERVISOR_GUEST dependency is silly. Distro kernels ship
> with CONFIG_HYPERVISOR_GUEST=y, so this also takes effect on bare metal.

You are right. As Paolo suggested, I will integrate it with the
paravirtualization code.

> Thanks,
>
> tglx
On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
> From: Yang Zhang <yang.zhang.wz@gmail.com>
>
> This patch introduces a new mechanism to poll for a while before
> entering the idle state.
>
> David presented the problem with message-passing workloads on current
> KVM VMs at KVM Forum. There is also existing work to improve performance
> in KVM, like halt polling on the host side. But we still have 4 MSR
> writes and an HLT vmexit when going into halt idle, which introduces a
> lot of latency.
>
> Halt polling in KVM provides the capability to not schedule out a VCPU
> when it is the only task on its pCPU. Unlike that, this patch lets a
> VCPU poll for a while if there is no work inside the VCPU, to eliminate
> heavy vmexits when entering and leaving idle. The potential impact is
> that it costs more CPU cycles, since we are polling, and it may impact
> other tasks waiting on the same physical CPU in the host.

I wonder whether you considered doing this in an idle driver. I have a
prototype patch combining this with mwait within the guest - I can post
it if you are interested.

> Here is the data I get when running the benchmark contextswitch
> (https://github.com/tsuna/contextswitch)
>
> before patch:
> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>
> after patch:
> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>
> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> ---
>  Documentation/sysctl/kernel.txt | 10 ++++++++++
>  arch/x86/kernel/process.c       | 21 +++++++++++++++++++++
>  include/linux/kernel.h          |  3 +++
>  kernel/sched/idle.c             |  3 +++
>  kernel/sysctl.c                 |  9 +++++++++
>  5 files changed, 46 insertions(+)
>
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..4e71bfe 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_threshold_ns           [ X86 only ]
>  - powersave-nap               [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
>
>  ==============================================================
>
> +poll_threshold_ns: (X86 only)
> +
> +This parameter controls the maximum wait time to poll before going into
> +the real idle state. The default value of 0 means "do not poll". It is
> +recommended to set a non-zero value when running latency-bound workloads
> +in a VM.
> +
> +==============================================================
> +
>  powersave-nap: (PPC only)
>
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0bb8842..6361783 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,10 @@
>  #include <asm/desc.h>
>  #include <asm/prctl.h>
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +unsigned long poll_threshold_ns;
> +#endif
> +
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>  }
>  #endif
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +void arch_cpu_idle_poll(void)
> +{
> +	ktime_t start, cur, stop;
> +
> +	if (poll_threshold_ns) {
> +		start = cur = ktime_get();
> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
> +		do {
> +			if (need_resched())
> +				break;
> +			cur = ktime_get();
> +		} while (ktime_before(cur, stop));
> +	}
> +}
> +#endif
> +
>  void arch_cpu_idle_enter(void)
>  {
>  	tsc_verify_tsc_adjust(false);
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 13bc08a..04cf774 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -460,6 +460,9 @@ extern __scanf(2, 0)
>  extern int sysctl_panic_on_stackoverflow;
>
>  extern bool crash_kexec_post_notifiers;
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +extern unsigned long poll_threshold_ns;
> +#endif
>
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 2a25a9e..e789f99 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>  }
>
>  /* Weak implementations for optional arch specific functions */
> +void __weak arch_cpu_idle_poll(void) { }
>  void __weak arch_cpu_idle_prepare(void) { }
>  void __weak arch_cpu_idle_enter(void) { }
>  void __weak arch_cpu_idle_exit(void) { }
> @@ -219,6 +220,8 @@ static void do_idle(void)
>  	 */
>
>  	__current_set_polling();
> +	arch_cpu_idle_poll();
> +
>  	tick_nohz_idle_enter();
>
>  	while (!need_resched()) {
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 4dfba1a..9174d57 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>  		.extra2		= &one,
>  	},
>  #endif
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +	{
> +		.procname	= "halt_poll_threshold",
> +		.data		= &poll_threshold_ns,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +#endif
>  	{ }
>  };
>
> --
> 1.8.3.1
On 2017/8/16 12:04, Michael S. Tsirkin wrote:
> On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering the idle state.
>>
>> David presented the problem with message-passing workloads on current
>> KVM VMs at KVM Forum. There is also existing work to improve performance
>> in KVM, like halt polling on the host side. But we still have 4 MSR
>> writes and an HLT vmexit when going into halt idle, which introduces a
>> lot of latency.
>>
>> Halt polling in KVM provides the capability to not schedule out a VCPU
>> when it is the only task on its pCPU. Unlike that, this patch lets a
>> VCPU poll for a while if there is no work inside the VCPU, to eliminate
>> heavy vmexits when entering and leaving idle. The potential impact is
>> that it costs more CPU cycles, since we are polling, and it may impact
>> other tasks waiting on the same physical CPU in the host.
>
> I wonder whether you considered doing this in an idle driver. I have a
> prototype patch combining this with mwait within the guest - I can post
> it if you are interested.

I am not sure mwait can solve this problem, but yes, if you have a
prototype patch I can test it. Also, I am working on the next version
with a better approach; I will post it ASAP.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..4e71bfe 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_threshold_ns           [ X86 only ]
 - powersave-nap               [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
 
 ==============================================================
 
+poll_threshold_ns: (X86 only)
+
+This parameter used to control the max wait time to poll before going
+into real idle state. By default, the values is 0 means don't poll.
+It is recommended to change the value to non-zero if running latency-bound
+workloads in VM.
+
+==============================================================
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0bb8842..6361783 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,10 @@
 #include <asm/desc.h>
 #include <asm/prctl.h>
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+unsigned long poll_threshold_ns;
+#endif
+
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
  * no more per-task TSS's. The TSS size is kept cacheline-aligned
@@ -313,6 +317,23 @@ static inline void play_dead(void)
 }
 #endif
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+void arch_cpu_idle_poll(void)
+{
+	ktime_t start, cur, stop;
+
+	if (poll_threshold_ns) {
+		start = cur = ktime_get();
+		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
+		do {
+			if (need_resched())
+				break;
+			cur = ktime_get();
+		} while (ktime_before(cur, stop));
+	}
+}
+#endif
+
 void arch_cpu_idle_enter(void)
 {
 	tsc_verify_tsc_adjust(false);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 13bc08a..04cf774 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -460,6 +460,9 @@ extern __scanf(2, 0)
 extern int sysctl_panic_on_stackoverflow;
 
 extern bool crash_kexec_post_notifiers;
+#ifdef CONFIG_HYPERVISOR_GUEST
+extern unsigned long poll_threshold_ns;
+#endif
 
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2a25a9e..e789f99 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,8 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
+	arch_cpu_idle_poll();
+
 	tick_nohz_idle_enter();
 
 	while (!need_resched()) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a..9174d57 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_HYPERVISOR_GUEST
+	{
+		.procname	= "halt_poll_threshold",
+		.data		= &poll_threshold_ns,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
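[Editorial note: for completeness, this is how the knob carried by this patch would be exercised at runtime. The sysctl table registers the entry as halt_poll_threshold while the documentation hunk calls it poll_threshold_ns, so the exact name depends on which of the two a later version keeps. The commands below are illustrative, require root, and only work on a kernel with this patch applied.]

```shell
# Poll for up to 200 microseconds before entering real idle (0 = never poll).
sysctl -w kernel.halt_poll_threshold=200000

# The same knob via procfs, plus reading the current value back:
echo 200000 > /proc/sys/kernel/halt_poll_threshold
cat /proc/sys/kernel/halt_poll_threshold
```

As with KVM's host-side halt polling, the value is a latency/CPU trade-off: larger thresholds shave wakeup latency for message-passing workloads at the cost of burning guest CPU that the host could have scheduled elsewhere.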