Message ID | 1498130534-26568-2-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal (mailing list archive)
State | New, archived
On Thu, 22 Jun 2017, root wrote:

> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,10 @@
>  #include <asm/desc.h>
>  #include <asm/prctl.h>
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +unsigned long poll_threshold_ns;
> +#endif
> +
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>  }
>  #endif
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +void arch_cpu_idle_poll(void)
> +{
> +	ktime_t start, cur, stop;
> +
> +	if (poll_threshold_ns) {
> +		start = cur = ktime_get();
> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
> +		do {
> +			if (need_resched())
> +				break;
> +			cur = ktime_get();
> +		} while (ktime_before(cur, stop));
> +	}
> +}
> +#endif

Aside from the whole approach being debatable, what is the reason to make
this depend on CONFIG_HYPERVISOR_GUEST and to move it into x86? If the
mechanism is worthwhile, then it should go into the generic code and not
into x86. There is absolutely nothing x86-specific in that patch.

Also, the CONFIG_HYPERVISOR_GUEST dependency is silly. Distro kernels ship
with CONFIG_HYPERVISOR_GUEST=y, so this also takes effect on bare metal.

Thanks,

	tglx
On 2017/6/22 22:23, Thomas Gleixner wrote:
> On Thu, 22 Jun 2017, root wrote:
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -39,6 +39,10 @@
>>  #include <asm/desc.h>
>>  #include <asm/prctl.h>
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +unsigned long poll_threshold_ns;
>> +#endif
>> +
>>  /*
>>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
>> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HYPERVISOR_GUEST
>> +void arch_cpu_idle_poll(void)
>> +{
>> +	ktime_t start, cur, stop;
>> +
>> +	if (poll_threshold_ns) {
>> +		start = cur = ktime_get();
>> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
>> +		do {
>> +			if (need_resched())
>> +				break;
>> +			cur = ktime_get();
>> +		} while (ktime_before(cur, stop));
>> +	}
>> +}
>> +#endif
>
> Aside from the whole approach being debatable, what is the reason to make
> this depend on CONFIG_HYPERVISOR_GUEST and to move it into x86? If the
> mechanism is worthwhile, then it should go into the generic code and not
> into x86. There is absolutely nothing x86-specific in that patch.
>
> Also, the CONFIG_HYPERVISOR_GUEST dependency is silly. Distro kernels ship
> with CONFIG_HYPERVISOR_GUEST=y, so this also takes effect on bare metal.

You are right. As Paolo suggested, I will integrate it with the
paravirtualization code.

> Thanks,
>
> tglx
On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
> From: Yang Zhang <yang.zhang.wz@gmail.com>
>
> This patch introduces a new mechanism to poll for a while before
> entering the idle state.
>
> David presented the problem with message-passing workloads on current
> KVM VMs at KVM Forum. There is also existing work to improve performance
> in KVM, like halt polling on the host side. But we still have 4 MSR
> writes and an HLT vmexit when going into halt idle, which introduces a
> lot of latency.
>
> Halt polling in KVM provides the capability to not schedule out a VCPU
> when it is the only task on its pCPU. Unlike that, this patch lets a
> VCPU poll for a while if there is no work inside the VCPU, to eliminate
> heavy vmexits when entering and leaving idle. The potential impact is
> that it costs more CPU cycles, since we are polling, and it may impact
> other tasks waiting on the same physical CPU in the host.

I wonder whether you considered doing this in an idle driver. I have a
prototype patch combining this with mwait within the guest - I can post
it if you are interested.

> Here is the data I get when running the benchmark contextswitch
> (https://github.com/tsuna/contextswitch)
>
> before patch:
> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
>
> after patch:
> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
>
> Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
> ---
>  Documentation/sysctl/kernel.txt | 10 ++++++++++
>  arch/x86/kernel/process.c       | 21 +++++++++++++++++++++
>  include/linux/kernel.h          |  3 +++
>  kernel/sched/idle.c             |  3 +++
>  kernel/sysctl.c                 |  9 +++++++++
>  5 files changed, 46 insertions(+)
>
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..4e71bfe 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_threshold_ns           [ X86 only ]
>  - powersave-nap               [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
>
>  ==============================================================
>
> +poll_threshold_ns: (X86 only)
> +
> +This parameter controls the maximum wait time to poll before going into
> +the real idle state. The default value of 0 means "do not poll". It is
> +recommended to set a non-zero value when running latency-bound workloads
> +in a VM.
> +
> +==============================================================
> +
>  powersave-nap: (PPC only)
>
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0bb8842..6361783 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,10 @@
>  #include <asm/desc.h>
>  #include <asm/prctl.h>
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +unsigned long poll_threshold_ns;
> +#endif
> +
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>  }
>  #endif
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +void arch_cpu_idle_poll(void)
> +{
> +	ktime_t start, cur, stop;
> +
> +	if (poll_threshold_ns) {
> +		start = cur = ktime_get();
> +		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
> +		do {
> +			if (need_resched())
> +				break;
> +			cur = ktime_get();
> +		} while (ktime_before(cur, stop));
> +	}
> +}
> +#endif
> +
>  void arch_cpu_idle_enter(void)
>  {
>  	tsc_verify_tsc_adjust(false);
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 13bc08a..04cf774 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -460,6 +460,9 @@ extern __scanf(2, 0)
>  extern int sysctl_panic_on_stackoverflow;
>
>  extern bool crash_kexec_post_notifiers;
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +extern unsigned long poll_threshold_ns;
> +#endif
>
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 2a25a9e..e789f99 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>  }
>
>  /* Weak implementations for optional arch specific functions */
> +void __weak arch_cpu_idle_poll(void) { }
>  void __weak arch_cpu_idle_prepare(void) { }
>  void __weak arch_cpu_idle_enter(void) { }
>  void __weak arch_cpu_idle_exit(void) { }
> @@ -219,6 +220,8 @@ static void do_idle(void)
>  	 */
>
>  	__current_set_polling();
> +	arch_cpu_idle_poll();
> +
>  	tick_nohz_idle_enter();
>
>  	while (!need_resched()) {
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 4dfba1a..9174d57 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>  		.extra2		= &one,
>  	},
>  #endif
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +	{
> +		.procname	= "halt_poll_threshold",
> +		.data		= &poll_threshold_ns,
> +		.maxlen		= sizeof(unsigned long),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +#endif
>  	{ }
>  };
>
> --
> 1.8.3.1
On 2017/8/16 12:04, Michael S. Tsirkin wrote:
> On Thu, Jun 22, 2017 at 11:22:13AM +0000, root wrote:
>> From: Yang Zhang <yang.zhang.wz@gmail.com>
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering the idle state.
>>
>> David presented the problem with message-passing workloads on current
>> KVM VMs at KVM Forum. There is also existing work to improve performance
>> in KVM, like halt polling on the host side. But we still have 4 MSR
>> writes and an HLT vmexit when going into halt idle, which introduces a
>> lot of latency.
>>
>> Halt polling in KVM provides the capability to not schedule out a VCPU
>> when it is the only task on its pCPU. Unlike that, this patch lets a
>> VCPU poll for a while if there is no work inside the VCPU, to eliminate
>> heavy vmexits when entering and leaving idle. The potential impact is
>> that it costs more CPU cycles, since we are polling, and it may impact
>> other tasks waiting on the same physical CPU in the host.
>
> I wonder whether you considered doing this in an idle driver. I have a
> prototype patch combining this with mwait within the guest - I can post
> it if you are interested.

I am not sure mwait can solve this problem, but yes, if you have a
prototype patch I can test it. Also, I am working on the next version
with a better approach; I will post it ASAP.
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..4e71bfe 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_threshold_ns           [ X86 only ]
 - powersave-nap               [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
 
 ==============================================================
 
+poll_threshold_ns: (X86 only)
+
+This parameter used to control the max wait time to poll before going
+into real idle state. By default, the values is 0 means don't poll.
+It is recommended to change the value to non-zero if running latency-bound
+workloads in VM.
+
+==============================================================
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0bb8842..6361783 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,10 @@
 #include <asm/desc.h>
 #include <asm/prctl.h>
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+unsigned long poll_threshold_ns;
+#endif
+
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
  * no more per-task TSS's. The TSS size is kept cacheline-aligned
@@ -313,6 +317,23 @@ static inline void play_dead(void)
 }
 #endif
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+void arch_cpu_idle_poll(void)
+{
+	ktime_t start, cur, stop;
+
+	if (poll_threshold_ns) {
+		start = cur = ktime_get();
+		stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
+		do {
+			if (need_resched())
+				break;
+			cur = ktime_get();
+		} while (ktime_before(cur, stop));
+	}
+}
+#endif
+
 void arch_cpu_idle_enter(void)
 {
 	tsc_verify_tsc_adjust(false);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 13bc08a..04cf774 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -460,6 +460,9 @@ extern __scanf(2, 0)
 extern int sysctl_panic_on_stackoverflow;
 
 extern bool crash_kexec_post_notifiers;
+#ifdef CONFIG_HYPERVISOR_GUEST
+extern unsigned long poll_threshold_ns;
+#endif
 
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2a25a9e..e789f99 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,8 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
+	arch_cpu_idle_poll();
+
 	tick_nohz_idle_enter();
 
 	while (!need_resched()) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a..9174d57 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_HYPERVISOR_GUEST
+	{
+		.procname	= "halt_poll_threshold",
+		.data		= &poll_threshold_ns,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 	{ }
 };
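[Editorial note: for completeness, this is how the knob carried by this patch would be exercised at runtime. The sysctl table registers the entry as halt_poll_threshold while the documentation hunk calls it poll_threshold_ns, so the exact name depends on which of the two a later version keeps. The commands below are illustrative, require root, and only work on a kernel with this patch applied.]

```shell
# Poll for up to 200 microseconds before entering real idle (0 = never poll).
sysctl -w kernel.halt_poll_threshold=200000

# The same knob via procfs, plus reading the current value back:
echo 200000 > /proc/sys/kernel/halt_poll_threshold
cat /proc/sys/kernel/halt_poll_threshold
```

As with KVM's host-side halt polling, the value is a latency/CPU trade-off: larger thresholds shave wakeup latency for message-passing workloads at the cost of burning guest CPU that the host could have scheduled elsewhere.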