Message ID | 20230804093804.47039-2-zhouchuyi@bytedance.com (mailing list archive) |
---|---
State | RFC |
Series | mm: Select victim using bpf_select_task
Context | Check | Description |
---|---|---
netdev/tree_selection | success | Not a local patch, async |
bpf/vmtest-bpf-next-PR | fail | PR summary |
bpf/vmtest-bpf-next-VM_Test-1 | success | Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }} |
bpf/vmtest-bpf-next-VM_Test-2 | success | Logs for ShellCheck |
bpf/vmtest-bpf-next-VM_Test-3 | fail | Logs for build for aarch64 with gcc |
bpf/vmtest-bpf-next-VM_Test-4 | fail | Logs for build for s390x with gcc |
bpf/vmtest-bpf-next-VM_Test-5 | fail | Logs for build for x86_64 with gcc |
bpf/vmtest-bpf-next-VM_Test-6 | fail | Logs for build for x86_64 with llvm-16 |
bpf/vmtest-bpf-next-VM_Test-7 | success | Logs for set-matrix |
bpf/vmtest-bpf-next-VM_Test-8 | success | Logs for veristat |
On Fri 04-08-23 17:38:03, Chuyi Zhou wrote:
> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
> takes oc and current iterating task as parameters and returns a result
> indicating which one is selected by bpf program.
>
> Although bpf_select_task is used to bypass the default method, there are
> some existing rules should be obeyed. Specifically, we skip these
> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
> consider tasks with lowest score returned by oom_badness except it was
> caused by OOM_SCORE_ADJ_MIN.

Is this really necessary? I do get why we need to preserve the
OOM_SCORE_ADJ_* semantic for the in-kernel oom selection logic, but why
should an arbitrary oom policy care? Look at it from the point of view of
an arbitrary user space based policy. It just picks a task or memcg and
kills tasks by sending the SIG_KILL (or maybe SIG_TERM first) signal.
oom_score constraints will not prevent anybody from doing that.

tsk_is_oom_victim (and MMF_OOM_SKIP) is a slightly different case but not
too much. The primary motivation is to prevent new oom victims while there
is one already being killed. This is a reasonable heuristic, especially
with the async oom reclaim (oom_reaper). It also reduces the amount of oom
emergency memory reserves used to some degree, but since those are not
absolute this is no longer the primary motivation. _But_ I can imagine
that some policies might be much more aggressive and allow selecting new
victims if the preexisting ones are not being killed in time.

oom_unkillable_task is a general sanity check so it should remain in
place.

I am not really sure about oom_task_origin. That is just a very weird case
and I guess it wouldn't hurt to keep it in the generic path.

All that being said, I think we want something like the following (very
much pseudo-code). I have no idea what the proper way to define BPF hooks
is though, so help from the BPF maintainers would be more than handy.
---
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 00982b133dc1..9f1743ee2b28 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -190,10 +190,6 @@ static inline bool trigger_all_cpu_backtrace(void)
 {
 	return false;
 }
-static inline bool trigger_allbutself_cpu_backtrace(void)
-{
-	return false;
-}
 static inline bool trigger_cpumask_backtrace(struct cpumask *mask)
 {
 	return false;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..c9e04be52700 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -317,6 +317,22 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
 		goto next;
 
+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then select it.
+	 */
+	if (oom_task_origin(task)) {
+		points = LONG_MAX;
+		goto select;
+	}
+
+	switch (bpf_oom_evaluate_task(task, oc, &points)) {
+	case -EOPNOTSUPP: break;	/* No BPF policy */
+	case -EBUSY: goto abort;	/* abort search process */
+	case 0: goto next;		/* ignore process */
+	default: goto select;		/* note the task */
+	}
+
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -329,15 +345,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
-	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
-	 */
-	if (oom_task_origin(task)) {
-		points = LONG_MAX;
-		goto select;
-	}
-
 	points = oom_badness(task, oc->totalpages);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
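For illustration, here is a minimal BPF-side sketch of a policy using the
bpf_oom_evaluate_task() convention from the pseudo-code above. Everything
below is an assumption for the sake of the example: the hook name comes
from the pseudo-code, the fmod_ret attachment mirrors the mechanism used
by the posted patch, and the "sacrifice tasks from one designated cgroup
first" policy is made up. Note that with fmod_ret a return value of 0 lets
the in-kernel stub run (effectively "no BPF opinion, use the default
policy"), so the "ignore process" case of the switch above would still
need an agreed dispatch convention.

/* oom_policy.bpf.c - illustrative sketch only */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* cgroup id whose tasks should be sacrificed first; set by the loader */
const volatile __u64 victim_cg_id;

SEC("fmod_ret/bpf_oom_evaluate_task")
int BPF_PROG(pick_in_cgroup, struct task_struct *task,
	     struct oom_control *oc, long *points)
{
	/* cgroup v2 id of the task's cgroup */
	__u64 cg_id = task->cgroups->dfl_cgrp->kn->id;

	if (cg_id == victim_cg_id)
		return 1;	/* non-zero, non-error: "note the task" */

	return 0;		/* no opinion: fall back to the default policy */
}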
On 04/08/2023 10:38, Chuyi Zhou wrote:
> This patch adds a new hook bpf_select_task in oom_evaluate_task. It

bpf_select_task() feels like too generic a name - bpf_oom_select_task()
might make the context clearer.

I'd also suggest adding a documentation patch for a new
Documentation/bpf/oom.rst or whatever to describe how it is all supposed
to work.

> takes oc and current iterating task as parameters and returns a result
> indicating which one is selected by bpf program.
>
> Although bpf_select_task is used to bypass the default method, there are
> some existing rules should be obeyed. Specifically, we skip these
> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
> consider tasks with lowest score returned by oom_badness except it was
> caused by OOM_SCORE_ADJ_MIN.
>
> If we attach a prog to the hook, the interface is enabled only when we have
> successfully chosen at least one valid candidate in previous iteraion. This
> is to avoid that we find nothing if bpf program rejects all tasks.
>

I don't know anything about OOM mechanisms, so maybe it's just me, but I
found this confusing. Relying on the previous iteration to control
current iteration behaviour seems risky - even if BPF found a victim in
iteration N, it's no guarantee it will in iteration N+1.

Naively I would have thought the right answer here would be to honour
the choice OOM would have made (in the absence of BPF execution) for
cases where BPF did not select a victim. Is that sort of scheme
workable? Does that make sense from the mm side, or would we actually
want to fall back to

	pr_warn("Out of memory and no killable processes...\n");

...if BPF didn't select a process?

The danger here seems to be that the current non-BPF mechanism seems to
be guaranteed to find a chosen victim, but delegating to BPF is not. So
what is the right behaviour for such cases from the mm perspective?

(One thing that would probably be worth doing from the BPF side would be
to add a tracepoint to mark the scenario where nothing was chosen for
OOM kill via BPF; this would allow BPF programs to catch the fact that
their OOM selection mechanisms didn't work.)

Alan

> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 50 insertions(+), 7 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 612b5597d3af..aec4c55ed49a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
[...]
> @@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
>
>  static int __init oom_init(void)
>  {
> +	int err;
>  	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
>  #ifdef CONFIG_SYSCTL
>  	register_sysctl_init("vm", vm_oom_kill_table);
>  #endif

probably worth having #ifdef CONFIG_BPF or similar here..

> +	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
> +	if (err)
> +		pr_warn("error while registering oom fmodret entrypoints: %d", err);
>  	return 0;
>  }
>  subsys_initcall(oom_init)
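As a concrete illustration of the tracepoint Alan suggests, a hypothetical
definition could look roughly like the following. The event name and the
fields are assumptions made up for this sketch, not part of the posted
patch; it would simply mark an OOM invocation where the BPF policy
declined to pick anything.

/* include/trace/events/oom.h - hypothetical sketch only */
TRACE_EVENT(bpf_oom_no_victim,

	TP_PROTO(struct oom_control *oc),

	TP_ARGS(oc),

	TP_STRUCT__entry(
		__field(unsigned long, totalpages)
		__field(int, order)
	),

	TP_fast_assign(
		__entry->totalpages = oc->totalpages;
		__entry->order = oc->order;
	),

	TP_printk("order=%d totalpages=%lu: no victim chosen by BPF policy",
		  __entry->order, __entry->totalpages)
);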
Hello,

On 2023/8/4 19:29, Michal Hocko wrote:
> On Fri 04-08-23 17:38:03, Chuyi Zhou wrote:
>> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
>> takes oc and current iterating task as parameters and returns a result
>> indicating which one is selected by bpf program.
>>
>> Although bpf_select_task is used to bypass the default method, there are
>> some existing rules should be obeyed. Specifically, we skip these
>> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
>> consider tasks with lowest score returned by oom_badness except it was
>> caused by OOM_SCORE_ADJ_MIN.
>
> Is this really necessary? I do get why we need to preserve
> OOM_SCORE_ADJ_* semantic for in-kernel oom selection logic but why
> should an arbitrary oom policy care. Look at it from an arbitrary user
> space based policy. It just picks a task or memcg and kills tasks by
> sending SIG_KILL (or maybe SIG_TERM first) signal. oom_score constraints
> will not prevent anybody from doing that.

Sorry, some of my expressions may have misled you. I do agree that the bpf
interface should bypass the current OOM_SCORE_ADJ_* logic. What I meant to
say is that bpf can select a task even if it has OOM_SCORE_ADJ_MIN set.

[...]
> +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> +	case -EOPNOTSUPP: break; /* No BPF policy */
> +	case -EBUSY: goto abort; /* abort search process */
> +	case 0: goto next; /* ignore process */
> +	default: goto select; /* note the task */
> +	}

Why do we need to change the *points* value if we do not care about
oom_badness? Is it used to record some state? If so, we could record it
through a bpf map.

[...]

Thanks for your advice, I'm very glad to follow your suggestions for the
next version of development.
On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
[...]
> > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > +	case -EOPNOTSUPP: break; /* No BPF policy */
> > +	case -EBUSY: goto abort; /* abort search process */
> > +	case 0: goto next; /* ignore process */
> > +	default: goto select; /* note the task */
> > +	}
>
> Why we need to change the *points* value if we do not care about oom_badness
> ? Is it used to record some state? If so, we could record it through bpf
> map.

Strictly speaking we do not need to. That would require BPF to keep the
state internally. Many will do I suppose but we have to keep track of
the victim so that the oom killer knows what to kill so I thought that
it doesn't hurt to keep track of an abstract concept of points as well.
If you think this is not needed then oc->points could be always 0 for
bpf selected victims. The value is not used anyway in the proposed
scheme.

Btw. we will need another hook or metadata for the reporting side of
things. Generally dump_header() to know what has been the selection
policy.
Hello,

On 2023/8/4 19:34, Alan Maguire wrote:
> On 04/08/2023 10:38, Chuyi Zhou wrote:
>> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
>
> bpf_select_task() feels like too generic a name - bpf_oom_select_task()
> might make the context clearer.
>
> I'd also suggest adding a documentation patch for a new
> Documentation/bpf/oom.rst or whatever to describe how it is all supposed
> to work.

Got it, I will add it in the next version.

>> takes oc and current iterating task as parameters and returns a result
>> indicating which one is selected by bpf program.
>>
>> Although bpf_select_task is used to bypass the default method, there are
>> some existing rules should be obeyed. Specifically, we skip these
>> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
>> consider tasks with lowest score returned by oom_badness except it was
>> caused by OOM_SCORE_ADJ_MIN.
>>
>> If we attach a prog to the hook, the interface is enabled only when we have
>> successfully chosen at least one valid candidate in previous iteraion. This
>> is to avoid that we find nothing if bpf program rejects all tasks.
>>
>
> I don't know anything about OOM mechanisms, so maybe it's just me, but I
> found this confusing. Relying on the previous iteration to control
> current iteration behaviour seems risky - even if BPF found a victim in
> iteration N, it's no guarantee it will in iteration N+1.
>

The current kernel's OOM actually works like this:

1. If we first find a valid candidate victim A in iteration N, we record
it in oc->chosen.

2. In iteration N+1, N+2, ..., we just compare oc->chosen with the current
iterating task. Suppose we think the current task B is better than
oc->chosen (A); then we set oc->chosen = B and do not consider A anymore.

IIUC, most policies work like this. We just need to find the *most*
suitable victim. Normally, if in the current iteration we drop A and
select B, we do not consider A anymore.

> Naively I would have thought the right answer here would be to honour
> the choice OOM would have made (in the absence of BPF execution) for
> cases where BPF did not select a victim. Is that sort of scheme
> workable? Does that make sense from the mm side, or would we actually
> want to fall back to
>
> 	pr_warn("Out of memory and no killable processes...\n");
>
> ...if BPF didn't select a process?
>

My major concern was whether we should fully trust the correctness of the
BPF program from the user, since some OOMs may invoke a kernel panic if we
find nothing. Actually, the current non-BPF mechanism is also not
guaranteed to find a chosen victim (if the user sets an inappropriate
oom_score_adj, we may find nothing). It seems both you and Michal think we
should honour the default logic of OOM here and not add something
additional to prevent BPF from finding nothing.

> The danger here seems to be that the current non-BPF mechanism seems to
> be guaranteed to find a chosen victim, but delegating to BPF is not. So
> what is the right behaviour for such cases from the mm perspective?
>
> (One thing that would probably be worth doing from the BPF side would be
> to add a tracepoint to mark the scenario where nothing was chosen for
> OOM kill via BPF; this would allow BPF programs to catch the fact that
> their OOM selection mechanisms didn't work.)

Nice idea, maybe we could add this tracepoint when we finish victim
selection? In this way, we could easily catch the selection result in BPF
programs even if we successfully find something.

>
> Alan
>
>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>> ---
>>  mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 50 insertions(+), 7 deletions(-)
>>
[...]
>> @@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
>>
>>  static int __init oom_init(void)
>>  {
>> +	int err;
>>  	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
>>  #ifdef CONFIG_SYSCTL
>>  	register_sysctl_init("vm", vm_oom_kill_table);
>>  #endif
>
> probably worth having #ifdef CONFIG_BPF or similar here..

I see. Thanks for the reminder. register_btf_fmodret_id_set is controlled
by CONFIG_BPF_SYSCALL, so maybe we can add #ifdef CONFIG_BPF_SYSCALL here.

Thanks for your advice, I'm very glad to follow your suggestions for the
next version of development.

>
>> +	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
>> +	if (err)
>> +		pr_warn("error while registering oom fmodret entrypoints: %d", err);
>>  	return 0;
>>  }
>>  subsys_initcall(oom_init)
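For reference, a minimal sketch of the guard being discussed, assuming the
registration stays in oom_init() as in the posted patch:

/* mm/oom_kill.c, oom_init() - sketch of the CONFIG_BPF_SYSCALL guard.
 * The 'int err;' declaration would need the same guard (or __maybe_unused)
 * to avoid an unused-variable warning when BPF is disabled.
 */
#ifdef CONFIG_BPF_SYSCALL
	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
	if (err)
		pr_warn("error while registering oom fmodret entrypoints: %d\n", err);
#endif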
On 2023/8/4 21:34, Michal Hocko wrote:
> On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> [...]
>>> +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
>>> +	case -EOPNOTSUPP: break; /* No BPF policy */
>>> +	case -EBUSY: goto abort; /* abort search process */
>>> +	case 0: goto next; /* ignore process */
>>> +	default: goto select; /* note the task */
>>> +	}
>>
>> Why we need to change the *points* value if we do not care about oom_badness
>> ? Is it used to record some state? If so, we could record it through bpf
>> map.
>
> Strictly speaking we do not need to. That would require BPF to keep the
> state internally. Many will do I suppose but we have to keep track of
> the victim so that the oom killer knows what to kill so I thought that
> it doesn't hurt to keep track of an abstract concept of points as well.
> If you think this is not needed then oc->points could be always 0 for
> bpf selected victims. The value is not used anyway in the proposed
> scheme.
>
> Btw. we will need another hook or metadata for the reporting side of
> things. Generally dump_header() to know what has been the selection
> policy.
>

OK. Maybe an integer like policy_type is enough to distinguish different
policies, with the default method being zero. Or we can let BPF return a
string like policy_name.

Which one should I start implementing in the next version? Do you have a
better idea?

Thanks.
On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> On 2023/8/4 21:34, Michal Hocko wrote:
> > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > [...]
[...]
> > Btw. we will need another hook or metadata for the reporting side of
> > things. Generally dump_header() to know what has been the selection
> > policy.
> >
> OK. Maybe an integer like policy_type is enough to distinguish different
> policies, with the default method being zero. Or we can let BPF return a
> string like policy_name.
>
> Which one should I start implementing in the next version? Do you have a
> better idea?

String seems to be more descriptive.
On Sat 05-08-23 07:55:56, Chuyi Zhou wrote:
> Hello,
>
> On 2023/8/4 19:34, Alan Maguire wrote:
[...]
> > I don't know anything about OOM mechanisms, so maybe it's just me, but I
> > found this confusing. Relying on the previous iteration to control
> > current iteration behaviour seems risky - even if BPF found a victim in
> > iteration N, it's no guarantee it will in iteration N+1.
> >
> The current kernel's OOM actually works like this:
>
> 1. if we first find a valid candidate victim A in iteration N, we would
> record it in oc->chosen.
>
> 2. In iteration N + 1, N+2..., we just compare oc->chosen with the current
> iterating task. Suppose we think current task B is better than
> oc->chosen(A), we would set oc->chosen = B and we would not consider A
> anymore.
>
> IIUC, most policy works like this. We just need to find the *most* suitable
> victim. Normally, if in current iteration we drop A and select B, we would
> not consider A anymore.

Yes, we iterate over all tasks in the specific oom domain (all tasks for
the global case and all members of the memcg tree for a hard limit oom).
The in-tree oom policy has to iterate all tasks to achieve some of its
goals (like preventing overkilling while the previously selected victim is
still on the way out). Also oom_score_adj might change the final decision,
so you really have to check all eligible tasks.

I can imagine a BPF based policy could be less constrained and, as Roman
suggested, have pre-selected victims on standby. I do not see a problem
with having a break-like mode, similar to the current abort without
canceling an already noted victim.
On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
[...]
> > > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > +	case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > +	case -EBUSY: goto abort; /* abort search process */
> > > > > +	case 0: goto next; /* ignore process */
> > > > > +	default: goto select; /* note the task */
> > > > > +	}

To be honest, I can't say I like it. IMO it's not really using the full
bpf potential and is too attached to the current oom implementation.

First, I'm a bit concerned about the implicit restrictions we apply to bpf
programs which will be executed potentially thousands of times under very
heavy memory pressure. We will need to make sure that they don't allocate
(much) memory, don't take any locks which might deadlock with other memory
allocations, etc. It will potentially require hard restrictions on what
these programs can and can't do, and this is something that the bpf
community will have to maintain long-term.

Second, if we're introducing bpf here (which I'm not yet convinced of),
IMO we should use it in a more generic and expressive way. Instead of
adding hooks into the existing oom killer implementation, we can call a
bpf program before invoking the in-kernel oom killer and let it do
whatever it takes to free some memory. E.g. we can provide it with an API
to kill individual tasks as well as all tasks in a cgroup. This approach
is more generic and will allow solving certain problems which can't be
solved by the current oom killer, e.g. deleting files from a tmpfs instead
of killing tasks.

So I think the alternative approach is to provide some sort of an
interface to pre-select oom victims in advance. E.g. on the memcg level it
can look like:

  echo PID >> memory.oom.victim_proc

If the list is empty, the default oom killer is invoked. If there are
tasks, the first one is killed on OOM. A similar interface can exist to
choose between sibling cgroups:

  echo CGROUP_NAME >> memory.oom.victim_cgroup

This is just a rough idea.

Thanks!
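A very rough sketch of the "call a bpf program before invoking the
in-kernel oom killer" alternative Roman describes above. The hook name
bpf_handle_oom() and its return convention are assumptions made up for
illustration only; how an attached program would actually free memory
(killing tasks, dropping tmpfs files, etc.) is left open and would need
dedicated kfuncs.

/* mm/oom_kill.c - hypothetical pre-OOM entry point (sketch only) */
__weak noinline int bpf_handle_oom(struct oom_control *oc)
{
	return 0;	/* 0: not handled, fall back to the default policy */
}

/* somewhere early in out_of_memory(), before select_bad_process(): */
static bool oom_try_bpf_policy(struct oom_control *oc)
{
	/* non-zero means a BPF policy claims to have handled the event */
	return bpf_handle_oom(oc) > 0;
}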
On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
[...]
> To be honest, I can't say I like it. IMO it's not really using the full bpf
> potential and is too attached to the current oom implementation.

TBH I am not sure we are able to come up with an interface that would use
the full BPF potential at this stage, and I strongly believe that we
should start with something that is good enough.

> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> which will be executed potentially thousands of times under very heavy memory
> pressure. We will need to make sure that they don't allocate (much) memory, don't
> take any locks which might deadlock with other memory allocations etc.
> It will potentially require hard restrictions on what these programs can and can't
> do and this is something that the bpf community will have to maintain long-term.

Right, BPF callbacks operating under OOM situations will be really
constrained, but this is more or less by definition. Isn't it?

> Second, if we're introducing bpf here (which I'm not yet convinced of),
> IMO we should use it in a more generic and expressive way.
> Instead of adding hooks into the existing oom killer implementation, we can call
> a bpf program before invoking the in-kernel oom killer and let it do whatever
> it takes to free some memory. E.g. we can provide it with an API to kill individual
> tasks as well as all tasks in a cgroup.
> This approach is more generic and will allow solving certain problems which
> can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> instead of killing tasks.

The aim of this proposal is to lift any heavy lifting stemming from
iterating tasks or cgroups which those BPF programs might need to make a
decision. There are other ways of course, e.g. providing this iteration
functionality as library functions, but my BPF experience is too limited
to say how easy that is.

> So I think the alternative approach is to provide some sort of an interface to
> pre-select oom victims in advance. E.g. on the memcg level it can look like:
>
>   echo PID >> memory.oom.victim_proc

This is just a terrible interface TBH. Pids are very volatile objects.
At the time the oom killer reads this pid it might be a completely
different process.

> If the list is empty, the default oom killer is invoked.
> If there are tasks, the first one is killed on OOM.
> A similar interface can exist to choose between sibling cgroups:
>
>   echo CGROUP_NAME >> memory.oom.victim_cgroup

Slightly less volatile but not much better either.

> This is just a rough idea.

I am pretty sure that both policies could be implemented by the proposed
BPF interface though, if you want something like that.
On Tue, Aug 08, 2023 at 10:18:39AM +0200, Michal Hocko wrote:
> On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
[...]
> > First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> > which will be executed potentially thousands of times under very heavy memory
> > pressure. We will need to make sure that they don't allocate (much) memory, don't
> > take any locks which might deadlock with other memory allocations etc.
> > It will potentially require hard restrictions on what these programs can and can't
> > do and this is something that the bpf community will have to maintain long-term.
>
> Right, BPF callbacks operating under OOM situations will be really
> constrained but this is more or less by definition. Isn't it?

What do you mean?

In general, the bpf community is trying to make it as generic as possible,
adding more and more features. Bpf programs are not as constrained as they
were when it all started.

[...]
> > So I think the alternative approach is to provide some sort of an interface to
> > pre-select oom victims in advance. E.g. on the memcg level it can look like:
> >
> >   echo PID >> memory.oom.victim_proc
>
> This is just a terrible interface TBH. Pids are very volatile objects.
> At the time the oom killer reads this pid it might be a completely
> different process.

Well, we already have the cgroup.procs interface, which works ok.
Obviously if the task is dead (or is actually killed as a result of oom),
its pid is removed from the list.

> > If the list is empty, the default oom killer is invoked.
> > If there are tasks, the first one is killed on OOM.
> > A similar interface can exist to choose between sibling cgroups:
> >
> >   echo CGROUP_NAME >> memory.oom.victim_cgroup
>
> Slightly less volatile but not much better either.
>
> > This is just a rough idea.
>
> I am pretty sure that both policies could be implemented by the proposed
> BPF interface though, if you want something like that.

As I said, I'm pretty concerned about how reliable (and effective) it will
be. I'm not convinced that executing a generic bpf program from the oom
context is safe (and we're talking about executing it potentially
thousands of times). If we're going this way, we need an explicit
acknowledgement from the bpf community and a long-term agreement on how
we'll keep things safe.

It would also be nice to come up with some practical examples of bpf
programs. What are meaningful scenarios which can be covered with the
proposed approach and are not covered now with oom_score_adj?

Thanks!
On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
> On Tue, Aug 08, 2023 at 10:18:39AM +0200, Michal Hocko wrote:
[...]
> > Right, BPF callbacks operating under OOM situations will be really
> > constrained but this is more or less by definition. Isn't it?
>
> What do you mean?

Callbacks cannot depend on any direct or indirect memory allocations.
Dependencies on any sleeping locks (again directly or indirectly) are not
allowed, just to name the most important ones.

> In general, the bpf community is trying to make it as generic as possible,
> adding more and more features. Bpf programs are not as constrained as they
> were when it all started.

Are the above ones somehow carved into BPF in general?

[...]
> > This is just a terrible interface TBH. Pids are very volatile objects.
> > At the time the oom killer reads this pid it might be a completely
> > different process.
>
> Well, we already have the cgroup.procs interface, which works ok.
> Obviously if the task is dead (or is actually killed as a result of oom),
> its pid is removed from the list.

Right, but writing the pid into the file has an immediate effect and
recycled pid issues would be rare unless the pid space is mostly depleted.
You are proposing an interface where the pid would be consumed in a
potentially very distant future. Such an approach would only work if the
pid is auto-removed, and then you would need a notification mechanism to
replace it by something else.

[...]
> As I said, I'm pretty concerned about how reliable (and effective) it will
> be. I'm not convinced that executing a generic bpf program from the oom
> context is safe (and we're talking about executing it potentially
> thousands of times). If we're going this way, we need an explicit
> acknowledgement from the bpf community and a long-term agreement on how
> we'll keep things safe.

I do agree with that.

> It would also be nice to come up with some practical examples of bpf
> programs. What are meaningful scenarios which can be covered with the
> proposed approach and are not covered now with oom_score_adj?

Agreed here as well. This RFC serves the purpose of brainstorming on all
of this.

There is a fundamental question whether we need BPF for this task in the
first place. Are there any huge advantages to exporting the callback and
allowing a kernel module to hook into it?
On 8/9/23 3:53 PM, Michal Hocko wrote:
> On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
>> It would also be nice to come up with some practical examples of bpf
>> programs. What are meaningful scenarios which can be covered with the
>> proposed approach and are not covered now with oom_score_adj?
>
> Agreed here as well. This RFC serves the purpose of brainstorming on all
> of this.
>
> There is a fundamental question whether we need BPF for this task in the
> first place. Are there any huge advantages to exporting the callback and
> allowing a kernel module to hook into it?

The ancient oom-killer largely depends on memory usage when choosing
victims, which might not fit the needs of modern scenarios. It's common
nowadays that multiple workloads (tenants) with different 'priorities'
run together, and the decisions made by the oom-killer don't always
obey the service level agreements.

The oom_score_adj only adjusts the usage-based decisions, so it can hardly
be translated into a 'priority' semantic. How can we properly configure it
given that we don't know how much memory the workloads will use? It's
really hard for a static strategy to deal with dynamic provisioning. IMHO
the oom_score_adj is just another demon.

Reworking the oom-killer's internal algorithm or patching some random
metrics may satisfy the immediate needs, but for the next 10 years? I
doubt it. So I think we do need the flexibility to bypass the legacy
usage-based algorithm, through bpf or pre-select interfaces.

Regards,
Abel
>>>> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
>>>> which will be executed potentially thousands of times under very heavy memory
>>>> pressure. We will need to make sure that they don't allocate (much) memory, don't
>>>> take any locks which might deadlock with other memory allocations etc.
>>>> It will potentially require hard restrictions on what these programs can and can't
>>>> do and this is something that the bpf community will have to maintain long-term.
>>>
>>> Right, BPF callbacks operating under OOM situations will be really
>>> constrained but this is more or less by definition. Isn't it?
>>
>> What do you mean?
>
> Callbacks cannot depend on any direct or indirect memory allocations.
> Dependencies on any sleeping locks (again directly or indirectly) are not
> allowed, just to name the most important ones.
>
>> In general, the bpf community is trying to make it as generic as possible,
>> adding more and more features. Bpf programs are not as constrained as they
>> were when it all started.

bpf supports different running contexts. For example, only a non-sleepable
bpf prog is allowed to run at the NIC driver. A sleepable bpf prog is only
allowed to run at some bpf_lsm hooks that are known to be safe for calling
blocking bpf-helpers/kfuncs. From the bpf side, it ensures a non-sleepable
bpf prog cannot do things that may block.

fwiw, Dave has recently proposed something for iterating the task vma
(https://lore.kernel.org/bpf/20230810183513.684836-4-davemarchevsky@fb.com/).
Potentially, a similar iterator can be created for a bpf program to iterate
cgroups and tasks.
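A sketch of the direction Martin mentions: instead of hooking the
in-kernel scan, a task iterator program could pre-compute a preferred
victim that a user-space agent (or a future in-kernel consumer) reads
back. The map layout and the "largest mapped footprint" rule below are
made up for illustration; only the iter/task program type itself is
existing BPF infrastructure.

/* task_scan.bpf.c - illustrative sketch only */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct victim {
	__u32 pid;
	__u64 total_vm;
};

/* single slot holding the current best candidate */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct victim);
} chosen SEC(".maps");

SEC("iter/task")
int scan_tasks(struct bpf_iter__task *ctx)
{
	struct task_struct *task = ctx->task;
	struct victim *v;
	__u32 key = 0;

	if (!task || !task->mm)
		return 0;

	v = bpf_map_lookup_elem(&chosen, &key);
	if (!v)
		return 0;

	/* toy policy: remember the task with the largest mapped footprint */
	if (task->mm->total_vm > v->total_vm) {
		v->pid = task->tgid;
		v->total_vm = task->mm->total_vm;
	}
	return 0;
}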
Hello,

On 2023/8/9 15:53, Michal Hocko wrote:
> On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
[...]
>> It would also be nice to come up with some practical examples of bpf
>> programs. What are meaningful scenarios which can be covered with the
>> proposed approach and are not covered now with oom_score_adj?

Just like Abel said, the oom_score_adj only adjusts the memory usage-based
decisions, and it's hard to translate it into other semantics. We see that
some userspace oom-killers like oomd have implemented policies based on
other semantics (e.g., memory growth, priority, psi pressure, etc.) which
can be useful in some specific scenarios.

> Agreed here as well. This RFC serves the purpose of brainstorming on all
> of this.
>
> There is a fundamental question whether we need BPF for this task in the
> first place. Are there any huge advantages to exporting the callback and
> allowing a kernel module to hook into it?

If we export the callback to a kernel module and hook into it, we still
have the same problems (e.g., allocating much memory). Just like Martin
said, at least BPF supports some basic running contexts and some unsafe
behavior is restricted.
On Thu, Aug 10, 2023 at 12:41:01PM -0700, Martin KaFai Lau wrote:
[...]
> bpf supports different running contexts. For example, only a non-sleepable
> bpf prog is allowed to run at the NIC driver. A sleepable bpf prog is only
> allowed to run at some bpf_lsm hooks that are known to be safe for calling
> blocking bpf-helpers/kfuncs. From the bpf side, it ensures a non-sleepable
> bpf prog cannot do things that may block.

Yeah, you're right: non-sleepable bpf should be ok here.

> fwiw, Dave has recently proposed something for iterating the task vma
> (https://lore.kernel.org/bpf/20230810183513.684836-4-davemarchevsky@fb.com/).
> Potentially, a similar iterator can be created for a bpf program to iterate
> cgroups and tasks.

Yes, it looks like a much better approach than adding a hook into the
existing iteration over all tasks.

Thanks!
On Thu, Aug 10, 2023 at 12:00:36PM +0800, Abel Wu wrote:
> On 8/9/23 3:53 PM, Michal Hocko wrote:
[...]
> The ancient oom-killer largely depends on memory usage when choosing
> victims, which might not fit the needs of modern scenarios. It's common
> nowadays that multiple workloads (tenants) with different 'priorities'
> run together, and the decisions made by the oom-killer don't always
> obey the service level agreements.
>
> The oom_score_adj only adjusts the usage-based decisions, so it can hardly
> be translated into a 'priority' semantic. How can we properly configure it
> given that we don't know how much memory the workloads will use? It's
> really hard for a static strategy to deal with dynamic provisioning. IMHO
> the oom_score_adj is just another demon.
>
> Reworking the oom-killer's internal algorithm or patching some random
> metrics may satisfy the immediate needs, but for the next 10 years? I
> doubt it. So I think we do need the flexibility to bypass the legacy
> usage-based algorithm, through bpf or pre-select interfaces.

I agree in general, but I wouldn't call the existing implementation legacy
or obsolete. It's all about trade-offs.

The goal of the existing implementation is to guarantee forward progress
without killing any processes prematurely. And it does that relatively
well.

Userspace oom killers (e.g. oomd) on top of PSI were initially created to
solve the problem of memory thrashing: having a system which is barely
making anything useful, but not stuck enough for the OOM killer to kick
in. But they were also able to provide much better flexibility. The
downside: they can't be as reliable as the in-kernel OOM killer.

Bpf or a pre-select interface can in theory glue them together: make sure
that a user has the flexibility to choose the OOM victim without
compromising on reliability.

A pre-select interface could be preferable if all the logic is already
implemented in userspace, but might be slightly less accurate if some
statistics (e.g. memory usage) are used for the determination of the
victim. The bpf approach will require re-implementing the logic, but is
potentially more powerful due to fast access to a lot of kernel data.

Thanks!
[Still catching up with older threads. Sorry for the late reply]

On Mon 14-08-23 19:25:08, Chuyi Zhou wrote:
> Hello,
>
> On 2023/8/9 15:53, Michal Hocko wrote:
> > On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
[...]
> > > It would also be nice to come up with some practical examples of bpf
> > > programs. What are meaningful scenarios which can be covered with the
> > > proposed approach and are not covered now with oom_score_adj?
>
> Just like Abel said, the oom_score_adj only adjusts the memory usage-based
> decisions, and it's hard to translate it into other semantics. We see that
> some userspace oom-killers like oomd have implemented policies based on
> other semantics (e.g., memory growth, priority, psi pressure, etc.) which
> can be useful in some specific scenarios.

Sure, I guess we do not really need to discuss that oom_score_adj is not a
great fit ;) We want to have practical (read non-toy) oom policies that
are useful as a PoC though.

> > Agreed here as well. This RFC serves the purpose of brainstorming on all
> > of this.
> >
> > There is a fundamental question whether we need BPF for this task in the
> > first place. Are there any huge advantages to exporting the callback and
> > allowing a kernel module to hook into it?
>
> If we export the callback to a kernel module and hook into it, we still
> have the same problems (e.g., allocating much memory). Just like Martin
> said, at least BPF supports some basic running contexts and some unsafe
> behavior is restricted.

I do not follow. A kernel module has access to any existing kernel
interface without any need to BPF it. So what exactly is the strength of
BPF over a kernel module hook? I am pretty sure there are some (many?),
but it is really important to be explicit about those before we make any
decision.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..aec4c55ed49a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@
  * kernel subsystems and hints as to where to find out what things do.
  */
 
+#include <linux/bpf.h>
 #include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/err.h>
@@ -210,6 +211,16 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
 	if (!p)
 		return LONG_MIN;
 
+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then set points to LONG_MAX.
+	 * It will be selected unless we keep oc->chosen through bpf interface.
+	 */
+	if (oom_task_origin(p)) {
+		task_unlock(p);
+		return LONG_MAX;
+	}
+
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
 	 * unkillable or have been already oom reaped or the are in
@@ -305,8 +316,30 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
+enum bpf_select_ret {
+	BPF_SELECT_DISABLE,
+	BPF_SELECT_TASK,
+	BPF_SELECT_CHOSEN,
+};
+
+__weak noinline int bpf_select_task(struct oom_control *oc,
+				struct task_struct *task, long badness_points)
+{
+	return BPF_SELECT_DISABLE;
+}
+
+BTF_SET8_START(oom_bpf_fmodret_ids)
+BTF_ID_FLAGS(func, bpf_select_task)
+BTF_SET8_END(oom_bpf_fmodret_ids)
+
+static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
+	.owner = THIS_MODULE,
+	.set   = &oom_bpf_fmodret_ids,
+};
+
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
+	enum bpf_select_ret bpf_ret = BPF_SELECT_DISABLE;
 	struct oom_control *oc = arg;
 	long points;
 
@@ -329,17 +362,23 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
+	points = oom_badness(task, oc->totalpages);
+
 	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
+	 * Do not consider tasks with lowest score value except it was caused
+	 * by OOM_SCORE_ADJ_MIN. Give these tasks a chance to be selected by
+	 * bpf interface.
 	 */
-	if (oom_task_origin(task)) {
-		points = LONG_MAX;
+	if (points == LONG_MIN && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+		goto next;
+
+	if (oc->chosen)
+		bpf_ret = bpf_select_task(oc, task, points);
+
+	if (bpf_ret == BPF_SELECT_TASK)
 		goto select;
-	}
 
-	points = oom_badness(task, oc->totalpages);
-	if (points == LONG_MIN || points < oc->chosen_points)
+	if (bpf_ret == BPF_SELECT_CHOSEN || points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
 select:
@@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
 
 static int __init oom_init(void)
 {
+	int err;
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
 #ifdef CONFIG_SYSCTL
 	register_sysctl_init("vm", vm_oom_kill_table);
 #endif
+	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
+	if (err)
+		pr_warn("error while registering oom fmodret entrypoints: %d", err);
 	return 0;
 }
 subsys_initcall(oom_init)
This patch adds a new hook bpf_select_task in oom_evaluate_task. It takes
oc and the current iterating task as parameters and returns a result
indicating which one is selected by the bpf program.

Although bpf_select_task is used to bypass the default method, there are
some existing rules that should be obeyed. Specifically, we skip the
"unkillable" tasks (e.g., kthread, MMF_OOM_SKIP, in_vfork()). So we do not
consider tasks with the lowest score returned by oom_badness except when
it was caused by OOM_SCORE_ADJ_MIN.

If we attach a prog to the hook, the interface is enabled only when we
have successfully chosen at least one valid candidate in a previous
iteration. This is to avoid finding nothing if the bpf program rejects all
tasks.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 50 insertions(+), 7 deletions(-)
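To make the interface above more concrete, here is a minimal sketch of a
program for this hook. It assumes bpf_select_task() is attached via
fmod_ret (the patch registers the function with
register_btf_fmodret_id_set()); the return values mirror enum
bpf_select_ret in mm/oom_kill.c, while the "never pick tasks from one
critical cgroup" policy is made up for illustration.

/* select_task.bpf.c - illustrative sketch only */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* mirror the values of enum bpf_select_ret in mm/oom_kill.c; drop these
 * if your vmlinux.h already carries the enum */
#define OOM_SELECT_DISABLE	0	/* fall back to the default comparison */
#define OOM_SELECT_TASK		1	/* take this task as the new oc->chosen */
#define OOM_SELECT_CHOSEN	2	/* keep the previously chosen task */

/* cgroup id that must never be picked over the current choice; set by loader */
const volatile __u64 critical_cg_id;

SEC("fmod_ret/bpf_select_task")
int BPF_PROG(select_task, struct oom_control *oc, struct task_struct *task,
	     long badness_points)
{
	/* cgroup v2 id of the task being evaluated */
	__u64 cg_id = task->cgroups->dfl_cgrp->kn->id;

	if (cg_id == critical_cg_id)
		return OOM_SELECT_CHOSEN;	/* keep whatever was chosen before */

	return OOM_SELECT_TASK;			/* prefer the currently iterated task */
}

Returning OOM_SELECT_DISABLE (0) would let the in-kernel weak stub run,
which also returns BPF_SELECT_DISABLE, so the default comparison by
badness points would apply.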