[rfc,4/6] sched: cfs: add bpf hooks to control wakeup and tick preemption

Message ID 20210916162451.709260-5-guro@fb.com (mailing list archive)
State RFC
Delegated to: BPF
Series Scheduler BPF

Checks

Context                  Check    Description
netdev/tree_selection    success  Not a local patch
bpf/vmtest-bpf-PR        fail     merge-conflict
bpf/vmtest-bpf-next      success  VM_Test
bpf/vmtest-bpf-next-PR   success  PR summary

Commit Message

Roman Gushchin Sept. 16, 2021, 4:24 p.m. UTC
This patch adds 3 hooks to control wakeup and tick preemption:
  cfs_check_preempt_tick
  cfs_check_preempt_wakeup
  cfs_wakeup_preempt_entity

The first one allows forcing or suppressing a preemption from the tick
context. An obvious use case is minimizing the number of involuntary
context switches, and the associated latency penalty, by
(conditionally) giving tasks or task groups an extended execution
slice. It can be used instead of tweaking
sysctl_sched_min_granularity.
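
The return-value convention of the tick hook (negative suppresses, positive
forces, zero defers to the default logic) can be modeled in plain userspace C.
This is only a sketch under stated assumptions: tick_hook_fn,
extended_slice_hook, tick_should_resched and the 10 ms slice are hypothetical
names and values chosen for illustration, and the default path is reduced to
the single delta_exec > ideal_runtime comparison rather than the full
check_preempt_tick() logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook type: <0 suppress, >0 force, 0 defer to the default. */
typedef int (*tick_hook_fn)(uint64_t delta_exec_ns);

/* Hypothetical policy: protect the running task for up to 10 ms. */
static int extended_slice_hook(uint64_t delta_exec_ns)
{
	const uint64_t extended_slice_ns = 10ULL * 1000 * 1000;

	return delta_exec_ns < extended_slice_ns ? -1 : 0;
}

/*
 * Simplified userspace model of the tick decision.
 * Returns true when a reschedule would be requested.
 */
static bool tick_should_resched(tick_hook_fn hook, uint64_t delta_exec_ns,
				uint64_t ideal_runtime_ns)
{
	if (hook) {
		int ret = hook(delta_exec_ns);

		if (ret < 0)
			return false;	/* hook suppresses preemption */
		if (ret > 0)
			return true;	/* hook forces preemption */
	}

	/* Hook absent or undecided: fall back to the default comparison. */
	return delta_exec_ns > ideal_runtime_ns;
}
```

With a 3 ms ideal runtime, a task that has run for 5 ms would be preempted by
the default path but is left running when extended_slice_hook is installed;
once it passes 10 ms the hook returns 0 and the default comparison applies
again.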

The second one is called from the wakeup preemption code and allows
redefining whether a newly woken task should preempt the currently
running task. This is useful for minimizing the number of preemptions
of latency-sensitive tasks. To some extent it's a more flexible analog
of sysctl_sched_wakeup_granularity.

The third one is similar, but it tweaks the wakeup_preempt_entity()
function, which is called not only from the wakeup context but also
from pick_next_task(), so it can also influence the decision on which
task runs next.
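
The third hook differs from the first two in that any non-zero return value
is passed straight back as the result of wakeup_preempt_entity(). A minimal
userspace sketch of that behavior follows. The struct entity type, the
latency_sensitive flag, the protect_latency_sensitive policy and the explicit
gran parameter are assumptions made for illustration; the kernel computes the
granularity itself and operates on struct sched_entity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of interest in a sched entity. */
struct entity {
	int64_t vruntime;
	int latency_sensitive;	/* illustrative flag, not a kernel field */
};

/* Hypothetical hook type: a non-zero return overrides the default result. */
typedef int (*entity_hook_fn)(const struct entity *curr,
			      const struct entity *se);

/* Hypothetical policy: never preempt a latency-sensitive current entity. */
static int protect_latency_sensitive(const struct entity *curr,
				     const struct entity *se)
{
	(void)se;
	return curr->latency_sensitive ? -1 : 0;
}

/*
 * Userspace model: returns 1 if se should preempt curr, -1 if it should
 * not, and 0 if the vruntime difference is within the granularity.
 */
static int wakeup_preempt_entity_model(entity_hook_fn hook,
				       const struct entity *curr,
				       const struct entity *se, int64_t gran)
{
	int64_t vdiff = curr->vruntime - se->vruntime;

	if (hook) {
		int ret = hook(curr, se);

		if (ret)
			return ret;	/* hook decided; skip the default */
	}

	if (vdiff <= 0)
		return -1;
	if (vdiff > gran)
		return 1;
	return 0;
}
```

One consequence of treating zero as "no opinion" is that a hook written this
way cannot force the intermediate result 0; it can only push the decision to
1 or -1, or leave it to the default comparison.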

It's open for discussion whether we need both of these hooks or only
one of them: the second is more powerful, but depends more on the
current implementation. In any case, bpf hooks are not an ABI, so it's
not a deal breaker.

The idea of the wakeup_preempt_entity hook belongs to Rik van Riel. He
also contributed a lot to the whole patchset by providing his ideas,
recommendations and feedback on earlier (non-public) versions.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/bpf_sched.h       |  1 +
 include/linux/sched_hook_defs.h |  4 +++-
 kernel/sched/fair.c             | 27 +++++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 1 deletion(-)

Comments

Barry Song Oct. 1, 2021, 3:35 a.m. UTC | #1
On Fri, Sep 17, 2021 at 4:36 AM Roman Gushchin <guro@fb.com> wrote:
>
> This patch adds 3 hooks to control wakeup and tick preemption:
>   cfs_check_preempt_tick
>   cfs_check_preempt_wakeup
>   cfs_wakeup_preempt_entity
>
> The first one allows forcing or suppressing a preemption from the tick
> context. An obvious use case is minimizing the number of involuntary
> context switches, and the associated latency penalty, by
> (conditionally) giving tasks or task groups an extended execution
> slice. It can be used instead of tweaking
> sysctl_sched_min_granularity.
>
> The second one is called from the wakeup preemption code and allows
> redefining whether a newly woken task should preempt the currently
> running task. This is useful for minimizing the number of preemptions
> of latency-sensitive tasks. To some extent it's a more flexible analog
> of sysctl_sched_wakeup_granularity.

This reminds me of Mel's recent work which might be relevant:
sched/fair: Scale wakeup granularity relative to nr_running
https://lore.kernel.org/lkml/20210920142614.4891-3-mgorman@techsingularity.net/

>
> The third one is similar, but it tweaks the wakeup_preempt_entity()
> function, which is called not only from the wakeup context but also
> from pick_next_task(), so it can also influence the decision on which
> task runs next.
>
> It's open for discussion whether we need both of these hooks or only
> one of them: the second is more powerful, but depends more on the
> current implementation. In any case, bpf hooks are not an ABI, so it's
> not a deal breaker.

I am also curious whether a similar hook could benefit the
newidle_balance/sched_migration_cost
tuning discussed in this thread:
https://lore.kernel.org/lkml/ef3b3e55-8be9-595f-6d54-886d13a7e2fd@hisilicon.com/

It seems those static values are not universal; different topologies might
need different settings. But dynamically tuning them in the kernel seems to
be extremely difficult.

>
> The idea of the wakeup_preempt_entity hook belongs to Rik van Riel. He
> also contributed a lot to the whole patchset by providing his ideas,
> recommendations and feedback on earlier (non-public) versions.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/bpf_sched.h       |  1 +
>  include/linux/sched_hook_defs.h |  4 +++-
>  kernel/sched/fair.c             | 27 +++++++++++++++++++++++++++
>  3 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
> index 6e773aecdff7..5c238aeb853c 100644
> --- a/include/linux/bpf_sched.h
> +++ b/include/linux/bpf_sched.h
> @@ -40,6 +40,7 @@ static inline RET bpf_sched_##NAME(__VA_ARGS__)       \
>  {                                              \
>         return DEFAULT;                         \
>  }
> +#include <linux/sched_hook_defs.h>
>  #undef BPF_SCHED_HOOK
>
>  static inline bool bpf_sched_enabled(void)
> diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
> index 14344004e335..f075b32698cd 100644
> --- a/include/linux/sched_hook_defs.h
> +++ b/include/linux/sched_hook_defs.h
> @@ -1,2 +1,4 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> -BPF_SCHED_HOOK(int, 0, dummy, void)
> +BPF_SCHED_HOOK(int, 0, cfs_check_preempt_tick, struct sched_entity *curr, unsigned long delta_exec)
> +BPF_SCHED_HOOK(int, 0, cfs_check_preempt_wakeup, struct task_struct *curr, struct task_struct *p)
> +BPF_SCHED_HOOK(int, 0, cfs_wakeup_preempt_entity, struct sched_entity *curr, struct sched_entity *se)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..35ea8911b25c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -21,6 +21,7 @@
>   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>   */
>  #include "sched.h"
> +#include <linux/bpf_sched.h>
>
>  /*
>   * Targeted preemption latency for CPU-bound tasks:
> @@ -4447,6 +4448,16 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>
>         ideal_runtime = sched_slice(cfs_rq, curr);
>         delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> +
> +       if (bpf_sched_enabled()) {
> +               int ret = bpf_sched_cfs_check_preempt_tick(curr, delta_exec);
> +
> +               if (ret < 0)
> +                       return;
> +               else if (ret > 0)
> +                       resched_curr(rq_of(cfs_rq));
> +       }
> +
>         if (delta_exec > ideal_runtime) {
>                 resched_curr(rq_of(cfs_rq));
>                 /*
> @@ -7083,6 +7094,13 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>  {
>         s64 gran, vdiff = curr->vruntime - se->vruntime;
>
> +       if (bpf_sched_enabled()) {
> +               int ret = bpf_sched_cfs_wakeup_preempt_entity(curr, se);
> +
> +               if (ret)
> +                       return ret;
> +       }
> +
>         if (vdiff <= 0)
>                 return -1;
>
> @@ -7168,6 +7186,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>             likely(!task_has_idle_policy(p)))
>                 goto preempt;
>
> +       if (bpf_sched_enabled()) {
> +               int ret = bpf_sched_cfs_check_preempt_wakeup(current, p);
> +
> +               if (ret < 0)
> +                       return;
> +               else if (ret > 0)
> +                       goto preempt;
> +       }
> +
>         /*
>          * Batch and idle tasks do not preempt non-idle tasks (their preemption
>          * is driven by the tick):
> --
> 2.31.1
>

Thanks
barry

Roman Gushchin Oct. 2, 2021, 12:13 a.m. UTC | #2
On Fri, Oct 01, 2021 at 04:35:58PM +1300, Barry Song wrote:
> On Fri, Sep 17, 2021 at 4:36 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > This patch adds 3 hooks to control wakeup and tick preemption:
> >   cfs_check_preempt_tick
> >   cfs_check_preempt_wakeup
> >   cfs_wakeup_preempt_entity
> >
> > The first one allows forcing or suppressing a preemption from the tick
> > context. An obvious use case is minimizing the number of involuntary
> > context switches, and the associated latency penalty, by
> > (conditionally) giving tasks or task groups an extended execution
> > slice. It can be used instead of tweaking
> > sysctl_sched_min_granularity.
> >
> > The second one is called from the wakeup preemption code and allows
> > redefining whether a newly woken task should preempt the currently
> > running task. This is useful for minimizing the number of preemptions
> > of latency-sensitive tasks. To some extent it's a more flexible analog
> > of sysctl_sched_wakeup_granularity.
> 
> This reminds me of Mel's recent work which might be relevant:
> sched/fair: Scale wakeup granularity relative to nr_running
> https://lore.kernel.org/lkml/20210920142614.4891-3-mgorman@techsingularity.net/

Oh, this is interesting, thank you for the link! This is a perfect example
of a case where bpf can be useful if the change is considered too
specialized to be accepted into the mainline code.

> 
> >
> > The third one is similar, but it tweaks the wakeup_preempt_entity()
> > function, which is called not only from the wakeup context but also
> > from pick_next_task(), so it can also influence the decision on which
> > task runs next.
> >
> > It's open for discussion whether we need both of these hooks or only
> > one of them: the second is more powerful, but depends more on the
> > current implementation. In any case, bpf hooks are not an ABI, so it's
> > not a deal breaker.
> 
> I am also curious whether a similar hook could benefit the
> newidle_balance/sched_migration_cost
> tuning discussed in this thread:
> https://lore.kernel.org/lkml/ef3b3e55-8be9-595f-6d54-886d13a7e2fd@hisilicon.com/
> 
> It seems those static values are not universal; different topologies might
> need different settings. But dynamically tuning them in the kernel seems to
> be extremely difficult.

Absolutely! I'm already playing with newidle_balance (no specific results yet).
And sched_migration_cost is likely a good target too!

Thanks!

Patch

diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
index 6e773aecdff7..5c238aeb853c 100644
--- a/include/linux/bpf_sched.h
+++ b/include/linux/bpf_sched.h
@@ -40,6 +40,7 @@  static inline RET bpf_sched_##NAME(__VA_ARGS__)	\
 {						\
 	return DEFAULT;				\
 }
+#include <linux/sched_hook_defs.h>
 #undef BPF_SCHED_HOOK
 
 static inline bool bpf_sched_enabled(void)
diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
index 14344004e335..f075b32698cd 100644
--- a/include/linux/sched_hook_defs.h
+++ b/include/linux/sched_hook_defs.h
@@ -1,2 +1,4 @@ 
 /* SPDX-License-Identifier: GPL-2.0 */
-BPF_SCHED_HOOK(int, 0, dummy, void)
+BPF_SCHED_HOOK(int, 0, cfs_check_preempt_tick, struct sched_entity *curr, unsigned long delta_exec)
+BPF_SCHED_HOOK(int, 0, cfs_check_preempt_wakeup, struct task_struct *curr, struct task_struct *p)
+BPF_SCHED_HOOK(int, 0, cfs_wakeup_preempt_entity, struct sched_entity *curr, struct sched_entity *se)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..35ea8911b25c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -21,6 +21,7 @@ 
  *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
 #include "sched.h"
+#include <linux/bpf_sched.h>
 
 /*
  * Targeted preemption latency for CPU-bound tasks:
@@ -4447,6 +4448,16 @@  check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_check_preempt_tick(curr, delta_exec);
+
+		if (ret < 0)
+			return;
+		else if (ret > 0)
+			resched_curr(rq_of(cfs_rq));
+	}
+
 	if (delta_exec > ideal_runtime) {
 		resched_curr(rq_of(cfs_rq));
 		/*
@@ -7083,6 +7094,13 @@  wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_wakeup_preempt_entity(curr, se);
+
+		if (ret)
+			return ret;
+	}
+
 	if (vdiff <= 0)
 		return -1;
 
@@ -7168,6 +7186,15 @@  static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	    likely(!task_has_idle_policy(p)))
 		goto preempt;
 
+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_check_preempt_wakeup(current, p);
+
+		if (ret < 0)
+			return;
+		else if (ret > 0)
+			goto preempt;
+	}
+
 	/*
 	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 	 * is driven by the tick):
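
The include/linux/bpf_sched.h hunk defines BPF_SCHED_HOOK so that each entry
in sched_hook_defs.h expands to a static inline stub returning DEFAULT, then
includes the definitions file and undefines the macro. The single-file sketch
below reproduces that expansion pattern in userspace. SCHED_HOOK_LIST is a
hypothetical stand-in for the separate definitions header, used here only so
the example compiles as one file; the hook signatures are copied from the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque forward declarations; the stubs never dereference them. */
struct sched_entity;
struct task_struct;

/* Hypothetical stand-in for sched_hook_defs.h: one entry per hook. */
#define SCHED_HOOK_LIST(X)						\
	X(int, 0, cfs_check_preempt_tick,				\
	  struct sched_entity *curr, unsigned long delta_exec)		\
	X(int, 0, cfs_check_preempt_wakeup,				\
	  struct task_struct *curr, struct task_struct *p)		\
	X(int, 0, cfs_wakeup_preempt_entity,				\
	  struct sched_entity *curr, struct sched_entity *se)

/* Each entry becomes a stub named bpf_sched_<NAME> that returns DEFAULT. */
#define BPF_SCHED_HOOK(RET, DEFAULT, NAME, ...)				\
static inline RET bpf_sched_##NAME(__VA_ARGS__)			\
{									\
	return DEFAULT;							\
}
SCHED_HOOK_LIST(BPF_SCHED_HOOK)
#undef BPF_SCHED_HOOK
```

Because every stub returns 0, and 0 means "defer to the default logic" at all
three call sites in kernel/sched/fair.c, callers of the stubs see unchanged
scheduler behavior.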