| Message ID | 20210916162451.709260-5-guro@fb.com (mailing list archive) |
|---|---|
| State | RFC |
| Delegated to | BPF |
| Series | Scheduler BPF |
| Context | Check | Description |
|---|---|---|
| netdev/tree_selection | success | Not a local patch |
| bpf/vmtest-bpf-PR | fail | merge-conflict |
| bpf/vmtest-bpf-next | success | VM_Test |
| bpf/vmtest-bpf-next-PR | success | PR summary |
On Fri, Sep 17, 2021 at 4:36 AM Roman Gushchin <guro@fb.com> wrote:
>
> This patch adds 3 hooks to control wakeup and tick preemption:
>   cfs_check_preempt_tick
>   cfs_check_preempt_wakeup
>   cfs_wakeup_preempt_entity
>
> The first one makes it possible to force or suppress a preemption from
> the tick context. An obvious usage example is to minimize the number of
> non-voluntary context switches and decrease the associated latency
> penalty by (conditionally) providing tasks or task groups an extended
> execution slice. It can be used instead of tweaking
> sysctl_sched_min_granularity.
>
> The second one is called from the wakeup preemption code and makes it
> possible to redefine whether a newly woken task should preempt the
> execution of the current task. This is useful to minimize the number of
> preemptions of latency-sensitive tasks. To some extent it's a more
> flexible analog of sysctl_sched_wakeup_granularity.

This reminds me of Mel's recent work, which might be relevant:
sched/fair: Scale wakeup granularity relative to nr_running
https://lore.kernel.org/lkml/20210920142614.4891-3-mgorman@techsingularity.net/

> The third one is similar, but it tweaks the wakeup_preempt_entity()
> function, which is called not only from the wakeup context but also
> from pick_next_task(), and so can influence the decision on which task
> will run next.
>
> It's a place for discussion whether we need both of these hooks or only
> one of them: the second is more powerful, but depends more on the
> current implementation. In any case, bpf hooks are not an ABI, so it's
> not a deal breaker.

I am also curious whether a similar hook could benefit the
newidle_balance/sched_migration_cost tuning discussed in this thread:
https://lore.kernel.org/lkml/ef3b3e55-8be9-595f-6d54-886d13a7e2fd@hisilicon.com/

It seems those static values are not universal: different topologies
might need different settings, but dynamically tuning them in the
kernel seems to be extremely difficult.

> The idea of the wakeup_preempt_entity hook belongs to Rik van Riel. He
> also contributed a lot to the whole patchset by providing ideas,
> recommendations, and feedback for earlier (non-public) versions.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/bpf_sched.h       |  1 +
>  include/linux/sched_hook_defs.h |  4 +++-
>  kernel/sched/fair.c             | 27 +++++++++++++++++++++++++++
>  3 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
> index 6e773aecdff7..5c238aeb853c 100644
> --- a/include/linux/bpf_sched.h
> +++ b/include/linux/bpf_sched.h
> @@ -40,6 +40,7 @@ static inline RET bpf_sched_##NAME(__VA_ARGS__) \
>  { \
>  	return DEFAULT; \
>  }
> +#include <linux/sched_hook_defs.h>
>  #undef BPF_SCHED_HOOK
>
>  static inline bool bpf_sched_enabled(void)
> diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
> index 14344004e335..f075b32698cd 100644
> --- a/include/linux/sched_hook_defs.h
> +++ b/include/linux/sched_hook_defs.h
> @@ -1,2 +1,4 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> -BPF_SCHED_HOOK(int, 0, dummy, void)
> +BPF_SCHED_HOOK(int, 0, cfs_check_preempt_tick, struct sched_entity *curr, unsigned long delta_exec)
> +BPF_SCHED_HOOK(int, 0, cfs_check_preempt_wakeup, struct task_struct *curr, struct task_struct *p)
> +BPF_SCHED_HOOK(int, 0, cfs_wakeup_preempt_entity, struct sched_entity *curr, struct sched_entity *se)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..35ea8911b25c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -21,6 +21,7 @@
>   *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
>   */
>  #include "sched.h"
> +#include <linux/bpf_sched.h>
>
>  /*
>   * Targeted preemption latency for CPU-bound tasks:
> @@ -4447,6 +4448,16 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>
>  	ideal_runtime = sched_slice(cfs_rq, curr);
>  	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> +
> +	if (bpf_sched_enabled()) {
> +		int ret = bpf_sched_cfs_check_preempt_tick(curr, delta_exec);
> +
> +		if (ret < 0)
> +			return;
> +		else if (ret > 0)
> +			resched_curr(rq_of(cfs_rq));
> +	}
> +
>  	if (delta_exec > ideal_runtime) {
>  		resched_curr(rq_of(cfs_rq));
>  		/*
> @@ -7083,6 +7094,13 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
>  {
>  	s64 gran, vdiff = curr->vruntime - se->vruntime;
>
> +	if (bpf_sched_enabled()) {
> +		int ret = bpf_sched_cfs_wakeup_preempt_entity(curr, se);
> +
> +		if (ret)
> +			return ret;
> +	}
> +
>  	if (vdiff <= 0)
>  		return -1;
>
> @@ -7168,6 +7186,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>  	    likely(!task_has_idle_policy(p)))
>  		goto preempt;
>
> +	if (bpf_sched_enabled()) {
> +		int ret = bpf_sched_cfs_check_preempt_wakeup(current, p);
> +
> +		if (ret < 0)
> +			return;
> +		else if (ret > 0)
> +			goto preempt;
> +	}
> +
>  	/*
>  	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
>  	 * is driven by the tick):
> --
> 2.31.1

Thanks
barry
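Mel's patch scales the wakeup granularity with runqueue load inside the kernel itself; with the hooks from this series, roughly the same policy could live in a BPF program instead. Below is a minimal, untested sketch against the cfs_wakeup_preempt_entity hook. The SEC() name, the BPF_PROG-style context access, the nr_running threshold, and the reliance on CONFIG_FAIR_GROUP_SCHED (for the se->cfs_rq back-pointer) are all assumptions for illustration, not something this patch defines.

```c
// SPDX-License-Identifier: GPL-2.0
/*
 * Sketch only: suppress wakeup preemption on a busy runqueue, loosely
 * in the spirit of "scale wakeup granularity relative to nr_running".
 * The section name, program type and threshold are assumptions.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

SEC("sched/cfs_wakeup_preempt_entity")
int BPF_PROG(scale_wakeup_gran, struct sched_entity *curr,
	     struct sched_entity *se)
{
	unsigned int nr_running;

	if (!curr)
		return 0;	/* 0: keep the default CFS decision */

	/* Requires CONFIG_FAIR_GROUP_SCHED for the cfs_rq back-pointer. */
	nr_running = BPF_CORE_READ(curr, cfs_rq, nr_running);

	/* On a busy runqueue, return -1 ("do not preempt") so the
	 * current task is more likely to run out its slice. */
	if (nr_running > 8)
		return -1;

	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```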
On Fri, Oct 01, 2021 at 04:35:58PM +1300, Barry Song wrote:
> On Fri, Sep 17, 2021 at 4:36 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > This patch adds 3 hooks to control wakeup and tick preemption:
> >   cfs_check_preempt_tick
> >   cfs_check_preempt_wakeup
> >   cfs_wakeup_preempt_entity
> >
> > The first one makes it possible to force or suppress a preemption
> > from the tick context. An obvious usage example is to minimize the
> > number of non-voluntary context switches and decrease the associated
> > latency penalty by (conditionally) providing tasks or task groups an
> > extended execution slice. It can be used instead of tweaking
> > sysctl_sched_min_granularity.
> >
> > The second one is called from the wakeup preemption code and makes
> > it possible to redefine whether a newly woken task should preempt
> > the execution of the current task. This is useful to minimize the
> > number of preemptions of latency-sensitive tasks. To some extent
> > it's a more flexible analog of sysctl_sched_wakeup_granularity.
>
> This reminds me of Mel's recent work, which might be relevant:
> sched/fair: Scale wakeup granularity relative to nr_running
> https://lore.kernel.org/lkml/20210920142614.4891-3-mgorman@techsingularity.net/

Oh, this is interesting, thank you for the link! This is a perfect
example of a case where bpf can be useful if a change is considered
too special to be accepted into the mainline code.

> > The third one is similar, but it tweaks the wakeup_preempt_entity()
> > function, which is called not only from the wakeup context but also
> > from pick_next_task(), and so can influence the decision on which
> > task will run next.
> >
> > It's a place for discussion whether we need both of these hooks or
> > only one of them: the second is more powerful, but depends more on
> > the current implementation. In any case, bpf hooks are not an ABI,
> > so it's not a deal breaker.
>
> I am also curious whether a similar hook could benefit the
> newidle_balance/sched_migration_cost tuning discussed in this thread:
> https://lore.kernel.org/lkml/ef3b3e55-8be9-595f-6d54-886d13a7e2fd@hisilicon.com/
>
> It seems those static values are not universal: different topologies
> might need different settings, but dynamically tuning them in the
> kernel seems to be extremely difficult.

Absolutely! I'm already playing with newidle_balance (no specific
results yet). And sched_migration_cost is likely a good target too!

Thanks!
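Neither of those tuning points has a hook in the posted series. Purely as a hypothetical sketch of where this could go, such hooks would presumably follow the same BPF_SCHED_HOOK(ret, default, name, args...) pattern the patch establishes in sched_hook_defs.h; the names and signatures below are invented for illustration:

```c
/* Hypothetical, not part of this series: possible declarations for
 * balancing-related tuning hooks, following the existing
 * BPF_SCHED_HOOK(ret, default, name, args...) pattern. */
BPF_SCHED_HOOK(int, 0, cfs_newidle_balance, struct rq *this_rq)
BPF_SCHED_HOOK(unsigned int, 0, sched_migration_cost, struct task_struct *p, int src_cpu, int dst_cpu)
```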
diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
index 6e773aecdff7..5c238aeb853c 100644
--- a/include/linux/bpf_sched.h
+++ b/include/linux/bpf_sched.h
@@ -40,6 +40,7 @@ static inline RET bpf_sched_##NAME(__VA_ARGS__) \
 { \
 	return DEFAULT; \
 }
+#include <linux/sched_hook_defs.h>
 #undef BPF_SCHED_HOOK
 
 static inline bool bpf_sched_enabled(void)
diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
index 14344004e335..f075b32698cd 100644
--- a/include/linux/sched_hook_defs.h
+++ b/include/linux/sched_hook_defs.h
@@ -1,2 +1,4 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-BPF_SCHED_HOOK(int, 0, dummy, void)
+BPF_SCHED_HOOK(int, 0, cfs_check_preempt_tick, struct sched_entity *curr, unsigned long delta_exec)
+BPF_SCHED_HOOK(int, 0, cfs_check_preempt_wakeup, struct task_struct *curr, struct task_struct *p)
+BPF_SCHED_HOOK(int, 0, cfs_wakeup_preempt_entity, struct sched_entity *curr, struct sched_entity *se)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..35ea8911b25c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -21,6 +21,7 @@
  *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
  */
 #include "sched.h"
+#include <linux/bpf_sched.h>
 
 /*
  * Targeted preemption latency for CPU-bound tasks:
@@ -4447,6 +4448,16 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 
 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_check_preempt_tick(curr, delta_exec);
+
+		if (ret < 0)
+			return;
+		else if (ret > 0)
+			resched_curr(rq_of(cfs_rq));
+	}
+
 	if (delta_exec > ideal_runtime) {
 		resched_curr(rq_of(cfs_rq));
 		/*
@@ -7083,6 +7094,13 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;
 
+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_wakeup_preempt_entity(curr, se);
+
+		if (ret)
+			return ret;
+	}
+
 	if (vdiff <= 0)
 		return -1;
 
@@ -7168,6 +7186,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	    likely(!task_has_idle_policy(p)))
 		goto preempt;
 
+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_check_preempt_wakeup(current, p);
+
+		if (ret < 0)
+			return;
+		else if (ret > 0)
+			goto preempt;
+	}
+
 	/*
 	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 	 * is driven by the tick):
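All three call sites above share one return-value protocol: a negative value suppresses the preemption, a positive value forces it, and zero falls through to the existing CFS logic. As a hypothetical illustration of that protocol, the sketch below shields one latency-sensitive task from wakeup preemption; the SEC() name, the attach mechanics, and the protected_pid knob are assumptions, not part of the patch.

```c
// SPDX-License-Identifier: GPL-2.0
/*
 * Hypothetical sketch for the cfs_check_preempt_wakeup hook: shield
 * one latency-sensitive task (identified by pid) from wakeup
 * preemption. SEC() naming and the protected_pid knob are assumptions
 * for illustration only.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

const volatile pid_t protected_pid;	/* set by the loader before load */

SEC("sched/cfs_check_preempt_wakeup")
int BPF_PROG(protect_task, struct task_struct *curr, struct task_struct *p)
{
	/* Return-value protocol at the call site in check_preempt_wakeup():
	 *   < 0  suppress the preemption
	 *   > 0  force the preemption
	 *     0  fall through to the default CFS logic
	 */
	if (curr->pid == protected_pid)
		return -1;	/* never preempt the protected task */
	if (p->pid == protected_pid)
		return 1;	/* let the protected task preempt others */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```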
This patch adds 3 hooks to control wakeup and tick preemption:
  cfs_check_preempt_tick
  cfs_check_preempt_wakeup
  cfs_wakeup_preempt_entity

The first one makes it possible to force or suppress a preemption from
the tick context. An obvious usage example is to minimize the number of
non-voluntary context switches and decrease the associated latency
penalty by (conditionally) providing tasks or task groups an extended
execution slice. It can be used instead of tweaking
sysctl_sched_min_granularity.

The second one is called from the wakeup preemption code and makes it
possible to redefine whether a newly woken task should preempt the
execution of the current task. This is useful to minimize the number of
preemptions of latency-sensitive tasks. To some extent it's a more
flexible analog of sysctl_sched_wakeup_granularity.

The third one is similar, but it tweaks the wakeup_preempt_entity()
function, which is called not only from the wakeup context but also
from pick_next_task(), and so can influence the decision on which task
will run next.

It's a place for discussion whether we need both of these hooks or only
one of them: the second is more powerful, but depends more on the
current implementation. In any case, bpf hooks are not an ABI, so it's
not a deal breaker.

The idea of the wakeup_preempt_entity hook belongs to Rik van Riel. He
also contributed a lot to the whole patchset by providing ideas,
recommendations, and feedback for earlier (non-public) versions.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/bpf_sched.h       |  1 +
 include/linux/sched_hook_defs.h |  4 +++-
 kernel/sched/fair.c             | 27 +++++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 1 deletion(-)
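To make the sysctl_sched_min_granularity point concrete, here is a hypothetical sketch for the first hook: it grants tasks an extended slice by suppressing tick preemption until delta_exec crosses a custom threshold. The 20 ms value and the SEC() name are illustrative assumptions.

```c
// SPDX-License-Identifier: GPL-2.0
/*
 * Hypothetical sketch for the cfs_check_preempt_tick hook: suppress
 * tick preemption until delta_exec (in ns) crosses a custom threshold,
 * instead of raising sysctl_sched_min_granularity globally. The 20 ms
 * value and SEC() name are illustrative assumptions.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define EXTENDED_SLICE_NS (20ULL * 1000 * 1000)	/* 20 ms, arbitrary */

SEC("sched/cfs_check_preempt_tick")
int BPF_PROG(extend_slice, struct sched_entity *curr, unsigned long delta_exec)
{
	/* < 0 suppresses the resched, > 0 forces it, 0 defers to CFS. */
	if (delta_exec < EXTENDED_SLICE_NS)
		return -1;

	return 0;	/* past the threshold, let CFS decide as usual */
}

char LICENSE[] SEC("license") = "GPL";
```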