
[RFC,1/2] mm, oom: Introduce bpf_select_task

Message ID 20230804093804.47039-2-zhouchuyi@bytedance.com (mailing list archive)
State RFC
Series mm: Select victim using bpf_select_task

Checks

Context Check Description
netdev/tree_selection success Not a local patch, async
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat

Commit Message

Chuyi Zhou Aug. 4, 2023, 9:38 a.m. UTC
This patch adds a new hook, bpf_select_task, in oom_evaluate_task(). It
takes oc and the task currently being iterated as parameters and returns
a result indicating which one is selected by the BPF program.

Although bpf_select_task is used to bypass the default method, there are
some existing rules that should be obeyed. Specifically, we skip
"unkillable" tasks (e.g., kthreads, MMF_OOM_SKIP, in_vfork()). So we do
not consider tasks with the lowest score returned by oom_badness() unless
it was caused by OOM_SCORE_ADJ_MIN.

If we attach a prog to the hook, the interface is enabled only once we
have successfully chosen at least one valid candidate in a previous
iteration. This avoids finding nothing when the BPF program rejects all
tasks.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 50 insertions(+), 7 deletions(-)

Comments

Michal Hocko Aug. 4, 2023, 11:29 a.m. UTC | #1
On Fri 04-08-23 17:38:03, Chuyi Zhou wrote:
> This patch adds a new hook, bpf_select_task, in oom_evaluate_task(). It
> takes oc and the task currently being iterated as parameters and returns
> a result indicating which one is selected by the BPF program.
> 
> Although bpf_select_task is used to bypass the default method, there are
> some existing rules that should be obeyed. Specifically, we skip
> "unkillable" tasks (e.g., kthreads, MMF_OOM_SKIP, in_vfork()). So we do
> not consider tasks with the lowest score returned by oom_badness() unless
> it was caused by OOM_SCORE_ADJ_MIN.

Is this really necessary? I do get why we need to preserve the
OOM_SCORE_ADJ_* semantics for the in-kernel oom selection logic, but why
should an arbitrary oom policy care? Look at it from the point of view of
an arbitrary user-space based policy. It just picks a task or memcg and
kills tasks by sending SIGKILL (or maybe SIGTERM first). oom_score
constraints will not prevent anybody from doing that.

tsk_is_oom_victim (and MMF_OOM_SKIP) is a slightly different case, but
not too much. The primary motivation is to prevent selecting new oom
victims while there is one already being killed. This is a reasonable
heuristic, especially with the async oom reclaim (oom_reaper). It also
reduces the amount of oom emergency memory reserves used to some degree,
but since those are not absolute this is no longer the primary
motivation. _But_ I can imagine that some policies might be much more
aggressive and allow selecting new victims if preexisting ones are not
being killed in time.

oom_unkillable_task is a general sanity check so it should remain in
place.

I am not really sure about oom_task_origin. That is just a very weird
case and I guess it wouldn't hurt to keep it in the generic path.

All that being said, I think we want something like the following (very
much pseudo-code). I have no idea what the proper way to define BPF
hooks is, though, so help from the BPF maintainers would be more than
handy
---
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 00982b133dc1..9f1743ee2b28 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -190,10 +190,6 @@ static inline bool trigger_all_cpu_backtrace(void)
 {
 	return false;
 }
-static inline bool trigger_allbutself_cpu_backtrace(void)
-{
-	return false;
-}
 static inline bool trigger_cpumask_backtrace(struct cpumask *mask)
 {
 	return false;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..c9e04be52700 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -317,6 +317,22 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
 		goto next;
 
+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then select it.
+	 */
+	if (oom_task_origin(task)) {
+		points = LONG_MAX;
+		goto select;
+	}
+
+	switch (bpf_oom_evaluate_task(task, oc, &points)) {
+		case -EOPNOTSUPP: break; /* No BPF policy */
+		case -EBUSY: goto abort; /* abort search process */
+		case 0: goto next; /* ignore process */
+		default: goto select; /* note the task */
+	}
+
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -329,15 +345,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
-	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
-	 */
-	if (oom_task_origin(task)) {
-		points = LONG_MAX;
-		goto select;
-	}
-
 	points = oom_badness(task, oc->totalpages);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
Alan Maguire Aug. 4, 2023, 11:34 a.m. UTC | #2
On 04/08/2023 10:38, Chuyi Zhou wrote:
> This patch adds a new hook, bpf_select_task, in oom_evaluate_task(). It

bpf_select_task() feels like too generic a name - bpf_oom_select_task()
might make the context clearer.

I'd also suggest adding a documentation patch for a new
Documentation/bpf/oom.rst or whatever to describe how it is all supposed
to work.

> takes oc and the task currently being iterated as parameters and returns
> a result indicating which one is selected by the BPF program.
> 
> Although bpf_select_task is used to bypass the default method, there are
> some existing rules that should be obeyed. Specifically, we skip
> "unkillable" tasks (e.g., kthreads, MMF_OOM_SKIP, in_vfork()). So we do
> not consider tasks with the lowest score returned by oom_badness() unless
> it was caused by OOM_SCORE_ADJ_MIN.
> 
> If we attach a prog to the hook, the interface is enabled only once we
> have successfully chosen at least one valid candidate in a previous
> iteration. This avoids finding nothing when the BPF program rejects all
> tasks.
> 

I don't know anything about OOM mechanisms, so maybe it's just me, but I
found this confusing. Relying on the previous iteration to control
current iteration behaviour seems risky - even if BPF found a victim in
iteration N, it's no guarantee it will in iteration N+1.

Naively I would have thought the right answer here would be to honour
the choice OOM would have made (in the absence of BPF execution) for
cases where BPF did not select a victim. Is that sort of scheme
workable? Does that make sense from the mm side, or would we actually
want to fall back to

	pr_warn("Out of memory and no killable processes...\n");

...if BPF didn't select a process?

The danger here seems to be that the current non-BPF mechanism seems to
be guaranteed to find a chosen victim, but delegating to BPF is not. So
what is the right behaviour for such cases from the mm perspective?

(One thing that would probably be worth doing from the BPF side would be
to add a tracepoint to mark the scenario where nothing was chosen for
OOM kill via BPF; this would allow BPF programs to catch the fact that
their OOM selection mechanisms didn't work.)

Alan

> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 50 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 612b5597d3af..aec4c55ed49a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -18,6 +18,7 @@
>   *  kernel subsystems and hints as to where to find out what things do.
>   */
>  
> +#include <linux/bpf.h>
>  #include <linux/oom.h>
>  #include <linux/mm.h>
>  #include <linux/err.h>
> @@ -210,6 +211,16 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
>  	if (!p)
>  		return LONG_MIN;
>  
> +	/*
> +	 * If task is allocating a lot of memory and has been marked to be
> +	 * killed first if it triggers an oom, then set points to LONG_MAX.
> +	 * It will be selected unless we keep oc->chosen through bpf interface.
> +	 */
> +	if (oom_task_origin(p)) {
> +		task_unlock(p);
> +		return LONG_MAX;
> +	}
> +
>  	/*
>  	 * Do not even consider tasks which are explicitly marked oom
>  	 * unkillable or have been already oom reaped or the are in
> @@ -305,8 +316,30 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>  	return CONSTRAINT_NONE;
>  }
>  
> +enum bpf_select_ret {
> +	BPF_SELECT_DISABLE,
> +	BPF_SELECT_TASK,
> +	BPF_SELECT_CHOSEN,
> +};
> +
> +__weak noinline int bpf_select_task(struct oom_control *oc,
> +				struct task_struct *task, long badness_points)
> +{
> +	return BPF_SELECT_DISABLE;
> +}
> +
> +BTF_SET8_START(oom_bpf_fmodret_ids)
> +BTF_ID_FLAGS(func, bpf_select_task)
> +BTF_SET8_END(oom_bpf_fmodret_ids)
> +
> +static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &oom_bpf_fmodret_ids,
> +};
> +
>  static int oom_evaluate_task(struct task_struct *task, void *arg)
>  {
> +	enum bpf_select_ret bpf_ret = BPF_SELECT_DISABLE;
>  	struct oom_control *oc = arg;
>  	long points;
>  
> @@ -329,17 +362,23 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  		goto abort;
>  	}
>  
> +	points = oom_badness(task, oc->totalpages);
> +
>  	/*
> -	 * If task is allocating a lot of memory and has been marked to be
> -	 * killed first if it triggers an oom, then select it.
> +	 * Do not consider tasks with lowest score value except it was caused
> +	 * by OOM_SCORE_ADJ_MIN. Give these tasks a chance to be selected by
> +	 * bpf interface.
>  	 */
> -	if (oom_task_origin(task)) {
> -		points = LONG_MAX;
> +	if (points == LONG_MIN && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +		goto next;
> +
> +	if (oc->chosen)
> +		bpf_ret = bpf_select_task(oc, task, points);
> +
> +	if (bpf_ret == BPF_SELECT_TASK)
>  		goto select;
> -	}
>  
> -	points = oom_badness(task, oc->totalpages);
> -	if (points == LONG_MIN || points < oc->chosen_points)
> +	if (bpf_ret == BPF_SELECT_CHOSEN || points == LONG_MIN || points < oc->chosen_points)
>  		goto next;
>  
>  select:
> @@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
>  
>  static int __init oom_init(void)
>  {
> +	int err;
>  	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
>  #ifdef CONFIG_SYSCTL
>  	register_sysctl_init("vm", vm_oom_kill_table);
>  #endif

probably worth having #ifdef CONFIG_BPF or similar here..

> +	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
> +	if (err)
> +		pr_warn("error while registering oom fmodret entrypoints: %d", err);
>  	return 0;
>  }
>  subsys_initcall(oom_init)
Chuyi Zhou Aug. 4, 2023, 1:15 p.m. UTC | #3
Hello,

在 2023/8/4 19:29, Michal Hocko 写道:
> On Fri 04-08-23 17:38:03, Chuyi Zhou wrote:
>> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
>> takes oc and current iterating task as parameters and returns a result
>> indicating which one is selected by bpf program.
>>
>> Although bpf_select_task is used to bypass the default method, there are
>> some existing rules should be obeyed. Specifically, we skip these
>> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
>> consider tasks with lowest score returned by oom_badness except it was
>> caused by OOM_SCORE_ADJ_MIN.
> 
> Is this really necessary? I do get why we need to preserve the
> OOM_SCORE_ADJ_* semantics for the in-kernel oom selection logic, but why
> should an arbitrary oom policy care? Look at it from the point of view
> of an arbitrary user-space based policy. It just picks a task or memcg
> and kills tasks by sending SIGKILL (or maybe SIGTERM first). oom_score
> constraints will not prevent anybody from doing that.

Sorry, some of my expressions may have misled you.

I do agree the bpf interface should bypass the current OOM_SCORE_ADJ_*
logic. What I meant to say is that bpf can select a task even if it was
set to OOM_SCORE_ADJ_MIN.

> 
> tsk_is_oom_victim (and MMF_OOM_SKIP) is a slightly different case, but
> not too much. The primary motivation is to prevent selecting new oom
> victims while there is one already being killed. This is a reasonable
> heuristic, especially with the async oom reclaim (oom_reaper). It also
> reduces the amount of oom emergency memory reserves used to some degree,
> but since those are not absolute this is no longer the primary
> motivation. _But_ I can imagine that some policies might be much more
> aggressive and allow selecting new victims if preexisting ones are not
> being killed in time.
> 
> oom_unkillable_task is a general sanity check so it should remain in
> place.
> 
> I am not really sure about oom_task_origin. That is just a very weird
> case and I guess it wouldn't hurt to keep it in the generic path.
> 
> All that being said, I think we want something like the following (very
> much pseudo-code). I have no idea what the proper way to define BPF
> hooks is, though, so help from the BPF maintainers would be more than
> handy
> ---
> diff --git a/include/linux/nmi.h b/include/linux/nmi.h
> index 00982b133dc1..9f1743ee2b28 100644
> --- a/include/linux/nmi.h
> +++ b/include/linux/nmi.h
> @@ -190,10 +190,6 @@ static inline bool trigger_all_cpu_backtrace(void)
>   {
>   	return false;
>   }
> -static inline bool trigger_allbutself_cpu_backtrace(void)
> -{
> -	return false;
> -}
>   static inline bool trigger_cpumask_backtrace(struct cpumask *mask)
>   {
>   	return false;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 612b5597d3af..c9e04be52700 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -317,6 +317,22 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>   	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
>   		goto next;
>   
> +	/*
> +	 * If task is allocating a lot of memory and has been marked to be
> +	 * killed first if it triggers an oom, then select it.
> +	 */
> +	if (oom_task_origin(task)) {
> +		points = LONG_MAX;
> +		goto select;
> +	}
> +
> +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> +		case -EOPNOTSUPP: break; /* No BPF policy */
> +		case -EBUSY: goto abort; /* abort search process */
> +		case 0: goto next; /* ignore process */
> +		default: goto select; /* note the task */
> +	}

Why do we need to change the *points* value if we do not care about
oom_badness? Is it used to record some state? If so, we could record it
through a bpf map.

> +
>   	/*
>   	 * This task already has access to memory reserves and is being killed.
>   	 * Don't allow any other task to have access to the reserves unless
> @@ -329,15 +345,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>   		goto abort;
>   	}
>   
> -	/*
> -	 * If task is allocating a lot of memory and has been marked to be
> -	 * killed first if it triggers an oom, then select it.
> -	 */
> -	if (oom_task_origin(task)) {
> -		points = LONG_MAX;
> -		goto select;
> -	}
> -
>   	points = oom_badness(task, oc->totalpages);
>   	if (points == LONG_MIN || points < oc->chosen_points)
>   		goto next;
Thanks for your advice. I'm very glad to follow your suggestions in the
next version.
Michal Hocko Aug. 4, 2023, 1:34 p.m. UTC | #4
On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
[...]
> > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > +		case -EBUSY: goto abort; /* abort search process */
> > +		case 0: goto next; /* ignore process */
> > +		default: goto select; /* note the task */
> > +	}
> 
> Why do we need to change the *points* value if we do not care about
> oom_badness? Is it used to record some state? If so, we could record it
> through a bpf map.

Strictly speaking we do not need to. That would require BPF to keep the
state internally. Many programs will, I suppose, but we have to keep
track of the victim so that the oom killer knows what to kill, so I
thought that it doesn't hurt to keep track of an abstract concept of
points as well. If you think this is not needed then oc->points could
always be 0 for bpf-selected victims. The value is not used anyway in
the proposed scheme.

Btw. we will need another hook or metadata for the reporting side of
things. Generally, dump_header() needs to know what the selection policy
has been.
Chuyi Zhou Aug. 4, 2023, 11:55 p.m. UTC | #5
Hello,

在 2023/8/4 19:34, Alan Maguire 写道:
> On 04/08/2023 10:38, Chuyi Zhou wrote:
>> This patch adds a new hook bpf_select_task in oom_evaluate_task. It
> 
> bpf_select_task() feels like too generic a name - bpf_oom_select_task()
> might make the context clearer.
> 
> I'd also suggest adding a documentation patch for a new
> Documentation/bpf/oom.rst or whatever to describe how it is all supposed
> to work.

Got it, I will add it in the next version.
> 
>> takes oc and current iterating task as parameters and returns a result
>> indicating which one is selected by bpf program.
>>
>> Although bpf_select_task is used to bypass the default method, there are
>> some existing rules should be obeyed. Specifically, we skip these
>> "unkillable" tasks(e.g., kthread, MMF_OOM_SKIP, in_vfork()).So we do not
>> consider tasks with lowest score returned by oom_badness except it was
>> caused by OOM_SCORE_ADJ_MIN.
>>
>> If we attach a prog to the hook, the interface is enabled only when we have
>> successfully chosen at least one valid candidate in previous iteraion. This
>> is to avoid that we find nothing if bpf program rejects all tasks.
>>
> 
> I don't know anything about OOM mechanisms, so maybe it's just me, but I
> found this confusing. Relying on the previous iteration to control
> current iteration behaviour seems risky - even if BPF found a victim in
> iteration N, it's no guarantee it will in iteration N+1.
>
The current kernel OOM killer actually works like this:

1. If we first find a valid candidate victim A in iteration N, we record
it in oc->chosen.

2. In iterations N+1, N+2, ..., we just compare oc->chosen with the task
currently being iterated. Suppose we think the current task B is better
than oc->chosen (A); we then set oc->chosen = B and do not consider A
anymore.

IIUC, most policies work like this. We just need to find the *most*
suitable victim. Normally, if in the current iteration we drop A and
select B, we do not consider A anymore.

> Naively I would have thought the right answer here would be to honour
> the choice OOM would have made (in the absence of BPF execution) for
> cases where BPF did not select a victim. Is that sort of scheme
> workable? Does that make sense from the mm side, or would we actually
> want to fall back to
> 
> 	pr_warn("Out of memory and no killable processes...\n");
> 
> ...if BPF didn't select a process?
> 
My major concern was whether we should fully trust the correctness of the
BPF program from the user, since an OOM may invoke a kernel panic if we
find nothing. Actually, the current non-BPF mechanism is also not
guaranteed to find a chosen victim (if the user sets an inappropriate
oom_score_adj, we may find nothing).

It seems both you and Michal think we should honour the default logic of
OOM here and not add something additional to prevent BPF from finding
nothing.

> The danger here seems to be that the current non-BPF mechanism seems to
> be guaranteed to find a chosen victim, but delegating to BPF is not. So
> what is the right behaviour for such cases from the mm perspective?
> 

> (One thing that would probably be worth doing from the BPF side would be
> to add a tracepoint to mark the scenario where nothing was chosen for
> OOM kill via BPF; this would allow BPF programs to catch the fact that
> their OOM selection mechanisms didn't work.)

Nice idea. Maybe we could add this tracepoint when we finish victim
selection? That way, we could easily catch the selection result in BPF
programs even if we successfully find something.
> 
> Alan
> 
>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>> ---
>>   mm/oom_kill.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 50 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 612b5597d3af..aec4c55ed49a 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -18,6 +18,7 @@
>>    *  kernel subsystems and hints as to where to find out what things do.
>>    */
>>   
>> +#include <linux/bpf.h>
>>   #include <linux/oom.h>
>>   #include <linux/mm.h>
>>   #include <linux/err.h>
>> @@ -210,6 +211,16 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
>>   	if (!p)
>>   		return LONG_MIN;
>>   
>> +	/*
>> +	 * If task is allocating a lot of memory and has been marked to be
>> +	 * killed first if it triggers an oom, then set points to LONG_MAX.
>> +	 * It will be selected unless we keep oc->chosen through bpf interface.
>> +	 */
>> +	if (oom_task_origin(p)) {
>> +		task_unlock(p);
>> +		return LONG_MAX;
>> +	}
>> +
>>   	/*
>>   	 * Do not even consider tasks which are explicitly marked oom
>>   	 * unkillable or have been already oom reaped or the are in
>> @@ -305,8 +316,30 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>>   	return CONSTRAINT_NONE;
>>   }
>>   
>> +enum bpf_select_ret {
>> +	BPF_SELECT_DISABLE,
>> +	BPF_SELECT_TASK,
>> +	BPF_SELECT_CHOSEN,
>> +};
>> +
>> +__weak noinline int bpf_select_task(struct oom_control *oc,
>> +				struct task_struct *task, long badness_points)
>> +{
>> +	return BPF_SELECT_DISABLE;
>> +}
>> +
>> +BTF_SET8_START(oom_bpf_fmodret_ids)
>> +BTF_ID_FLAGS(func, bpf_select_task)
>> +BTF_SET8_END(oom_bpf_fmodret_ids)
>> +
>> +static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
>> +	.owner = THIS_MODULE,
>> +	.set   = &oom_bpf_fmodret_ids,
>> +};
>> +
>>   static int oom_evaluate_task(struct task_struct *task, void *arg)
>>   {
>> +	enum bpf_select_ret bpf_ret = BPF_SELECT_DISABLE;
>>   	struct oom_control *oc = arg;
>>   	long points;
>>   
>> @@ -329,17 +362,23 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>>   		goto abort;
>>   	}
>>   
>> +	points = oom_badness(task, oc->totalpages);
>> +
>>   	/*
>> -	 * If task is allocating a lot of memory and has been marked to be
>> -	 * killed first if it triggers an oom, then select it.
>> +	 * Do not consider tasks with lowest score value except it was caused
>> +	 * by OOM_SCORE_ADJ_MIN. Give these tasks a chance to be selected by
>> +	 * bpf interface.
>>   	 */
>> -	if (oom_task_origin(task)) {
>> -		points = LONG_MAX;
>> +	if (points == LONG_MIN && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
>> +		goto next;
>> +
>> +	if (oc->chosen)
>> +		bpf_ret = bpf_select_task(oc, task, points);
>> +
>> +	if (bpf_ret == BPF_SELECT_TASK)
>>   		goto select;
>> -	}
>>   
>> -	points = oom_badness(task, oc->totalpages);
>> -	if (points == LONG_MIN || points < oc->chosen_points)
>> +	if (bpf_ret == BPF_SELECT_CHOSEN || points == LONG_MIN || points < oc->chosen_points)
>>   		goto next;
>>   
>>   select:
>> @@ -732,10 +771,14 @@ static struct ctl_table vm_oom_kill_table[] = {
>>   
>>   static int __init oom_init(void)
>>   {
>> +	int err;
>>   	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
>>   #ifdef CONFIG_SYSCTL
>>   	register_sysctl_init("vm", vm_oom_kill_table);
>>   #endif
> 
> probably worth having #ifdef CONFIG_BPF or similar here..
I see, thanks for the reminder.
register_btf_fmodret_id_set() is controlled by CONFIG_BPF_SYSCALL, so
maybe we can add #ifdef CONFIG_BPF_SYSCALL here.

Thanks for your advice. I'm very glad to follow your suggestions in the
next version.
> 
>> +	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
>> +	if (err)
>> +		pr_warn("error while registering oom fmodret entrypoints: %d", err);
>>   	return 0;
>>   }
>>   subsys_initcall(oom_init)
Chuyi Zhou Aug. 7, 2023, 2:21 a.m. UTC | #6
在 2023/8/4 21:34, Michal Hocko 写道:
> On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> [...]
>>> +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
>>> +		case -EOPNOTSUPP: break; /* No BPF policy */
>>> +		case -EBUSY: goto abort; /* abort search process */
>>> +		case 0: goto next; /* ignore process */
>>> +		default: goto select; /* note the task */
>>> +	}
>>
>> Why we need to change the *points* value if we do not care about oom_badness
>> ? Is it used to record some state? If so, we could record it through bpf
>> map.
> 
> Strictly speaking we do not need to. That would require BPF to keep the
> state internally. Many will do I suppose but we have to keep track of
> the victim so that the oom killer knows what to kill so I thought that
> it doesn't hurt to keep track of an abstract concept of points as well.
> If you think this is not needed then oc->points could be always 0 for
> bpf selected victims. The value is not used anyway in the proposed
> scheme.
> 
> Btw. we will need another hook or metadata for the reporting side of
> things. Generally dump_header() to know what has been the selection
> policy.
> 
OK. Maybe an integer like policy_type is enough to distinguish different
policies, with the default method being zero. Or we can let BPF return a
string like policy_name.

Which one should I start implementing in the next version? Do you have a
better idea?

Thanks.
Michal Hocko Aug. 7, 2023, 7:04 a.m. UTC | #7
On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> 
> 
> 在 2023/8/4 21:34, Michal Hocko 写道:
> > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > [...]
> > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > > > +		case -EBUSY: goto abort; /* abort search process */
> > > > +		case 0: goto next; /* ignore process */
> > > > +		default: goto select; /* note the task */
> > > > +	}
> > > 
> > > Why we need to change the *points* value if we do not care about oom_badness
> > > ? Is it used to record some state? If so, we could record it through bpf
> > > map.
> > 
> > Strictly speaking we do not need to. That would require BPF to keep the
> > state internally. Many will do I suppose but we have to keep track of
> > the victim so that the oom killer knows what to kill so I thought that
> > it doesn't hurt to keep track of an abstract concept of points as well.
> > If you think this is not needed then oc->points could be always 0 for
> > bpf selected victims. The value is not used anyway in the proposed
> > scheme.
> > 
> > Btw. we will need another hook or metadata for the reporting side of
> > things. Generally dump_header() to know what has been the selection
> > policy.
> > 
> OK. Maybe an integer like policy_type is enough to distinguish different
> policies, with the default method being zero. Or we can let BPF return a
> string like policy_name.
> 
> Which one should I start implementing in the next version? Do you have a
> better idea?

String seems to be more descriptive.
Michal Hocko Aug. 7, 2023, 8:32 a.m. UTC | #8
On Sat 05-08-23 07:55:56, Chuyi Zhou wrote:
> Hello,
> 
> 在 2023/8/4 19:34, Alan Maguire 写道:
[...]
> > I don't know anything about OOM mechanisms, so maybe it's just me, but I
> > found this confusing. Relying on the previous iteration to control
> > current iteration behaviour seems risky - even if BPF found a victim in
> > iteration N, it's no guarantee it will in iteration N+1.
> > 
> The current kernel OOM killer actually works like this:
> 
> 1. If we first find a valid candidate victim A in iteration N, we record
> it in oc->chosen.
> 
> 2. In iterations N+1, N+2, ..., we just compare oc->chosen with the task
> currently being iterated. Suppose we think the current task B is better
> than oc->chosen (A); we then set oc->chosen = B and do not consider A
> anymore.
> 
> IIUC, most policies work like this. We just need to find the *most*
> suitable victim. Normally, if in the current iteration we drop A and
> select B, we do not consider A anymore.

Yes, we iterate over all tasks in the specific oom domain (all tasks for
a global oom and all members of the memcg tree for a hard limit oom). The
in-tree oom policy has to iterate over all tasks to achieve some of its
goals (like preventing overkilling while the previously selected victim
is still on the way out). Also, oom_score_adj might change the final
decision, so you really have to check all eligible tasks.

I can imagine a BPF based policy could be less constrained and, as Roman
suggested, have pre-selected victims on standby. I do not see a problem
with having a break-like mode, similar to the current abort, without
canceling an already noted victim.
Roman Gushchin Aug. 7, 2023, 5:28 p.m. UTC | #9
On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > 
> > 
> > 在 2023/8/4 21:34, Michal Hocko 写道:
> > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > [...]
> > > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > +		case -EBUSY: goto abort; /* abort search process */
> > > > > +		case 0: goto next; /* ignore process */
> > > > > +		default: goto select; /* note the task */
> > > > > +	}

To be honest, I can't say I like it. IMO it's not really using the full bpf
potential and is too attached to the current oom implementation.

First, I'm a bit concerned about the implicit restrictions we apply to
bpf programs which will be executed potentially thousands of times under
very heavy memory pressure. We will need to make sure that they don't
allocate (much) memory, don't take any locks which might deadlock with
other memory allocations, etc. It will potentially require hard
restrictions on what these programs can and can't do, and this is
something that the bpf community will have to maintain long-term.

Second, if we're introducing bpf here (which I'm not yet convinced we
should), IMO we should use it in a more generic and expressive way.
Instead of adding hooks into the existing oom killer implementation, we
can call a bpf program before invoking the in-kernel oom killer and let
it do whatever it takes to free some memory. E.g. we can provide it with
an API to kill individual tasks as well as all tasks in a cgroup.
This approach is more generic and will allow solving certain problems
which can't be solved by the current oom killer, e.g. deleting files
from a tmpfs instead of killing tasks.

So I think the alternative approach is to provide some sort of an interface to
pre-select oom victims in advance. E.g. on memcg level it can look like:

echo PID >> memory.oom.victim_proc

If the list is empty, the default oom killer is invoked.
If there are tasks, the first one is killed on OOM.
A similar interface can exist to choose between sibling cgroups:

echo CGROUP_NAME >> memory.oom.victim_cgroup

This is just a rough idea.

Thanks!
Michal Hocko Aug. 8, 2023, 8:18 a.m. UTC | #10
On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> > On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > > 
> > > 
> > > 在 2023/8/4 21:34, Michal Hocko 写道:
> > > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > > [...]
> > > > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > > +		case -EBUSY: goto abort; /* abort search process */
> > > > > > +		case 0: goto next; /* ignore process */
> > > > > > +		default: goto select; /* note the task */
> > > > > > +	}
> 
> To be honest, I can't say I like it. IMO it's not really using the full bpf
> potential and is too attached to the current oom implementation.

TBH I am not sure we are able to come up with an interface that would
use the full BPF potential at this stage, and I strongly believe that we
should start with something that is good enough.

> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> which will be executed potentially thousands times under a very heavy memory
> pressure. We will need to make sure that they don't allocate (much) memory, don't
> take any locks which might deadlock with other memory allocations etc.
> It will potentially require hard restrictions on what these programs can and can't
> do and this is something that the bpf community will have to maintain long-term.

Right, BPF callbacks operating under OOM situations will be really
constrained but this is more or less by definition. Isn't it?

> Second, if we're introducing bpf here (which I'm not yet convinced),
> IMO we should use it in a more generic and expressive way.
> Instead of adding hooks into the existing oom killer implementation, we can call
> a bpf program before invoking the in-kernel oom killer and let it do whatever
> it takes to free some memory. E.g. we can provide it with an API to kill individual
> tasks as well as all tasks in a cgroup.
> This approach is more generic and will allow to solve certain problems which
> can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> instead of killing tasks.

The aim of this proposal is to take over the heavy lifting, namely
iterating tasks or cgroups, which those BPF programs would otherwise need
to do to make a decision. There are other ways of course, e.g. providing
this iteration functionality as library functions, but my BPF experience
is too limited to say how easy that is.

> So I think the alternative approach is to provide some sort of an interface to
> pre-select oom victims in advance. E.g. on memcg level it can look like:
> 
> echo PID >> memory.oom.victim_proc

this is just a terrible interface TBH. Pids are very volatile objects.
By the time the oom killer reads this pid it might refer to a completely
different process.

> If the list is empty, the default oom killer is invoked.
> If there are tasks, the first one is killed on OOM.
> A similar interface can exist to choose between sibling cgroups:
> 
> echo CGROUP_NAME >> memory.oom.victim_cgroup

Slightly less volatile but not much better either.

> This is just a rough idea.

I am pretty sure that both policies could be implemented by the proposed
BPF interface though, if you want something like that.
Roman Gushchin Aug. 8, 2023, 9:41 p.m. UTC | #11
On Tue, Aug 08, 2023 at 10:18:39AM +0200, Michal Hocko wrote:
> On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> > On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> > > On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > > > 
> > > > 
> > > > 在 2023/8/4 21:34, Michal Hocko 写道:
> > > > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > > > [...]
> > > > > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > > > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > > > +		case -EBUSY: goto abort; /* abort search process */
> > > > > > > +		case 0: goto next; /* ignore process */
> > > > > > > +		default: goto select; /* note the task */
> > > > > > > +	}
> > 
> > To be honest, I can't say I like it. IMO it's not really using the full bpf
> > potential and is too attached to the current oom implementation.
> 
> TBH I am not sure we are able to come up with an interface that would
> use the full BPF potential at this stage and I strongly believe that we
> should start by something that is good enough.
> 
> > First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> > which will be executed potentially thousands times under a very heavy memory
> > pressure. We will need to make sure that they don't allocate (much) memory, don't
> > take any locks which might deadlock with other memory allocations etc.
> > It will potentially require hard restrictions on what these programs can and can't
> > do and this is something that the bpf community will have to maintain long-term.
> 
> Right, BPF callbacks operating under OOM situations will be really
> constrained but this is more or less by definition. Isn't it?

What do you mean?
In general, the bpf community is trying to make it as generic as possible
and is adding more and more features. Bpf programs are not as constrained
as they were when it all started.

> 
> > Second, if we're introducing bpf here (which I'm not yet convinced),
> > IMO we should use it in a more generic and expressive way.
> > Instead of adding hooks into the existing oom killer implementation, we can call
> > a bpf program before invoking the in-kernel oom killer and let it do whatever
> > it takes to free some memory. E.g. we can provide it with an API to kill individual
> > tasks as well as all tasks in a cgroup.
> > This approach is more generic and will allow to solve certain problems which
> > can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> > instead of killing tasks.
> 
> The aim of this proposal is to lift any heavy lifting stemming from
> iterating tasks or cgroups which those BPF might need to make a
> decision. There are other ways of course and provide this iteration
> functionality as library functions but my BPF experience is very limited
> to say how easy is that.
> 
> > So I think the alternative approach is to provide some sort of an interface to
> > pre-select oom victims in advance. E.g. on memcg level it can look like:
> > 
> > echo PID >> memory.oom.victim_proc
> 
> this is just a terrible interface TBH. Pids are very volatile objects.
> At the time oom killer reads this pid it might be a completely different
> process.

Well, we already have the cgroup.procs interface, which works ok.
Obviously if the task is dead (or is actually killed as a result of oom),
its pid is removed from the list.

> 
> > If the list is empty, the default oom killer is invoked.
> > If there are tasks, the first one is killed on OOM.
> > A similar interface can exist to choose between sibling cgroups:
> > 
> > echo CGROUP_NAME >> memory.oom.victim_cgroup
> 
> Slightly less volatile but not much better either.
> 
> > This is just a rough idea.
> 
> I am pretty sure that both policies could be implemented by the proposed
> BPF interface though if you want something like that.

As I said, I'm pretty concerned about how reliable (and effective) it will be.
I'm not convinced that executing a generic bpf program from the oom context
is safe (and we're talking about executing it potentially thousands of times).
If we're going this way, we need an explicit acknowledgement from the bpf
community and a long-term agreement on how we'll keep things safe.

It would be also nice to come up with some practical examples of bpf programs.
What are meaningful scenarios which can be covered with the proposed approach
and are not covered now with oom_score_adj.

Thanks!
Michal Hocko Aug. 9, 2023, 7:53 a.m. UTC | #12
On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
> On Tue, Aug 08, 2023 at 10:18:39AM +0200, Michal Hocko wrote:
> > On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> > > On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> > > > On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > > > > 
> > > > > 
> > > > > 在 2023/8/4 21:34, Michal Hocko 写道:
> > > > > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > > > > [...]
> > > > > > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > > > > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > > > > +		case -EBUSY: goto abort; /* abort search process */
> > > > > > > > +		case 0: goto next; /* ignore process */
> > > > > > > > +		default: goto select; /* note the task */
> > > > > > > > +	}
> > > 
> > > To be honest, I can't say I like it. IMO it's not really using the full bpf
> > > potential and is too attached to the current oom implementation.
> > 
> > TBH I am not sure we are able to come up with an interface that would
> > use the full BPF potential at this stage and I strongly believe that we
> > should start by something that is good enough.
> > 
> > > First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> > > which will be executed potentially thousands times under a very heavy memory
> > > pressure. We will need to make sure that they don't allocate (much) memory, don't
> > > take any locks which might deadlock with other memory allocations etc.
> > > It will potentially require hard restrictions on what these programs can and can't
> > > do and this is something that the bpf community will have to maintain long-term.
> > 
> > Right, BPF callbacks operating under OOM situations will be really
> > constrained but this is more or less by definition. Isn't it?
> 
> What do you mean?

Callbacks cannot depend on any direct or indirect memory allocations.
Dependencies on any sleeping locks (again, directly or indirectly) are not
allowed, just to name the most important ones.

> In general, the bpf community is trying to make it as generic as possible and
> adding new and new features. Bpf programs are not as constrained as they were
> when it's all started.

Are the above ones somehow carved into BPF in general?
 
> > > Second, if we're introducing bpf here (which I'm not yet convinced),
> > > IMO we should use it in a more generic and expressive way.
> > > Instead of adding hooks into the existing oom killer implementation, we can call
> > > a bpf program before invoking the in-kernel oom killer and let it do whatever
> > > it takes to free some memory. E.g. we can provide it with an API to kill individual
> > > tasks as well as all tasks in a cgroup.
> > > This approach is more generic and will allow to solve certain problems which
> > > can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> > > instead of killing tasks.
> > 
> > The aim of this proposal is to lift any heavy lifting stemming from
> > iterating tasks or cgroups which those BPF might need to make a
> > decision. There are other ways of course and provide this iteration
> > functionality as library functions but my BPF experience is very limited
> > to say how easy is that.
> > 
> > > So I think the alternative approach is to provide some sort of an interface to
> > > pre-select oom victims in advance. E.g. on memcg level it can look like:
> > > 
> > > echo PID >> memory.oom.victim_proc
> > 
> > this is just a terrible interface TBH. Pids are very volatile objects.
> > At the time oom killer reads this pid it might be a completely different
> > process.
> 
> Well, we already have cgroup.procs interface, which works ok.
> Obviously if the task is dead (or is actually killed in a result of oom),
> its pid is removed from the list.

Right, but writing the pid into the file has an immediate effect, and
pid recycling issues would be rare unless the pid space is mostly
depleted. You are proposing an interface where the pid would be consumed
in a potentially very distant future. Such an approach would only work if
the pid is auto-removed, and then you need a notification mechanism to
replace it by something else.
 
> > > If the list is empty, the default oom killer is invoked.
> > > If there are tasks, the first one is killed on OOM.
> > > A similar interface can exist to choose between sibling cgroups:
> > > 
> > > echo CGROUP_NAME >> memory.oom.victim_cgroup
> > 
> > Slightly less volatile but not much better either.
> > 
> > > This is just a rough idea.
> > 
> > I am pretty sure that both policies could be implemented by the proposed
> > BPF interface though if you want something like that.
> 
> As I said, I'm pretty concerned about how reliable (and effective) it will be.
> I'm not convinced that executing a generic bpf program from the oom context
> is safe (and we're talking about executing it potentially thousands of times).
> If we're going this way, we need an explicit acknowledge from the bpf
> community and a long-term agreement on how we'll keep things safe.

I do agree with that.

> It would be also nice to come up with some practical examples of bpf programs.
> What are meaningful scenarios which can be covered with the proposed approach
> and are not covered now with oom_score_adj.

Agreed here as well. This RFC serves purpose of brainstorming on all of
this.

There is a fundamental question whether we need BPF for this task in the
first place. Are there any huge advantages to export the callback and
allow a kernel module to hook into it?
Abel Wu Aug. 10, 2023, 4 a.m. UTC | #13
On 8/9/23 3:53 PM, Michal Hocko wrote:
> On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
>> It would be also nice to come up with some practical examples of bpf programs.
>> What are meaningful scenarios which can be covered with the proposed approach
>> and are not covered now with oom_score_adj.
> 
> Agreed here as well. This RFC serves purpose of brainstorming on all of
> this.
> 
> There is a fundamental question whether we need BPF for this task in the
> first place. Are there any huge advantages to export the callback and
> allow a kernel module to hook into it?

The ancient oom-killer largely depends on memory usage when choosing
victims, which might not fit the needs of modern scenarios. It's common
nowadays that multiple workloads (tenants) with different 'priorities'
run together, and the decisions made by the oom-killer don't always
obey the service level agreements.

The oom_score_adj knob only adjusts the usage-based decisions, so it
can hardly be translated into a 'priority' semantic. How can we properly
configure it given that we don't know how much memory the workloads
will use? It's really hard for a static strategy to deal with dynamic
provisioning. IMHO oom_score_adj is just another demon.

Reworking the oom-killer's internal algorithm or patching some random
metrics may satisfy the immediate needs, but for the next 10 years? I
doubt it. So I think we do need the flexibility to bypass the legacy
usage-based algorithm, through bpf or pre-select interfaces.

Regards,
	Abel
Martin KaFai Lau Aug. 10, 2023, 7:41 p.m. UTC | #14
>>>> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
>>>> which will be executed potentially thousands times under a very heavy memory
>>>> pressure. We will need to make sure that they don't allocate (much) memory, don't
>>>> take any locks which might deadlock with other memory allocations etc.
>>>> It will potentially require hard restrictions on what these programs can and can't
>>>> do and this is something that the bpf community will have to maintain long-term.
>>>
>>> Right, BPF callbacks operating under OOM situations will be really
>>> constrained but this is more or less by definition. Isn't it?
>>
>> What do you mean?
> 
> Callbacks cannot depend on any direct or indirect memory allocations.
> Dependencies on any sleeping locks (again directly or indirectly) is not
> allowed just to name the most important ones.
> 
>> In general, the bpf community is trying to make it as generic as possible and
>> adding new and new features. Bpf programs are not as constrained as they were
>> when it's all started.

bpf supports different running contexts. For example, only a non-sleepable bpf
prog is allowed to run at the NIC driver. A sleepable bpf prog is only allowed
to run at some bpf_lsm hooks that are known to be safe to call blocking
bpf-helpers/kfuncs. From the bpf side, this ensures that a non-sleepable bpf
prog cannot do things that may block.

fwiw, Dave has recently proposed something for iterating the task vma 
(https://lore.kernel.org/bpf/20230810183513.684836-4-davemarchevsky@fb.com/). 
Potentially, a similar iterator can be created for a bpf program to iterate 
cgroups and tasks.
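As a very rough sketch (this needs a kernel with BTF plus libbpf to build
and load, and both the "victim_pid" map and the trivial highest-pid policy
are made up for illustration, not a proposed ABI), a pre-selection program
built on the existing task iterator could look something like:

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch only: uses the existing bpf task iterator (SEC("iter/task")) to
 * scan tasks and record a candidate victim in a map that a pre-OOM hook
 * could later consult. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} victim_pid SEC(".maps");

SEC("iter/task")
int pick_victim(struct bpf_iter__task *ctx)
{
	struct task_struct *task = ctx->task;
	__u32 key = 0, *cur;

	if (!task)
		return 0;	/* end of iteration */

	cur = bpf_map_lookup_elem(&victim_pid, &key);
	if (cur && (__u32)task->pid > *cur)
		*cur = task->pid;  /* hypothetical policy: remember highest pid */
	return 0;
}

char _license[] SEC("license") = "GPL";
```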
Chuyi Zhou Aug. 14, 2023, 11:25 a.m. UTC | #15
Hello,

在 2023/8/9 15:53, Michal Hocko 写道:
> On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
>> On Tue, Aug 08, 2023 at 10:18:39AM +0200, Michal Hocko wrote:
>>> On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
>>>> On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
>>>>> On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
>>>>>>
>>>>>>
>>>>>> 在 2023/8/4 21:34, Michal Hocko 写道:
>>>>>>> On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
>>>>>>> [...]
>>>>>>>>> +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
>>>>>>>>> +		case -EOPNOTSUPP: break; /* No BPF policy */
>>>>>>>>> +		case -EBUSY: goto abort; /* abort search process */
>>>>>>>>> +		case 0: goto next; /* ignore process */
>>>>>>>>> +		default: goto select; /* note the task */
>>>>>>>>> +	}
>>>>
>>>> To be honest, I can't say I like it. IMO it's not really using the full bpf
>>>> potential and is too attached to the current oom implementation.
>>>
>>> TBH I am not sure we are able to come up with an interface that would
>>> use the full BPF potential at this stage and I strongly believe that we
>>> should start by something that is good enough.
>>>
>>>> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
>>>> which will be executed potentially thousands times under a very heavy memory
>>>> pressure. We will need to make sure that they don't allocate (much) memory, don't
>>>> take any locks which might deadlock with other memory allocations etc.
>>>> It will potentially require hard restrictions on what these programs can and can't
>>>> do and this is something that the bpf community will have to maintain long-term.
>>>
>>> Right, BPF callbacks operating under OOM situations will be really
>>> constrained but this is more or less by definition. Isn't it?
>>
>> What do you mean?
> 
> Callbacks cannot depend on any direct or indirect memory allocations.
> Dependencies on any sleeping locks (again directly or indirectly) is not
> allowed just to name the most important ones.
> 
>> In general, the bpf community is trying to make it as generic as possible and
>> adding new and new features. Bpf programs are not as constrained as they were
>> when it's all started.
> 
> Are the above ones somehow carved into BPF in general?
>   
>>>> Second, if we're introducing bpf here (which I'm not yet convinced),
>>>> IMO we should use it in a more generic and expressive way.
>>>> Instead of adding hooks into the existing oom killer implementation, we can call
>>>> a bpf program before invoking the in-kernel oom killer and let it do whatever
>>>> it takes to free some memory. E.g. we can provide it with an API to kill individual
>>>> tasks as well as all tasks in a cgroup.
>>>> This approach is more generic and will allow to solve certain problems which
>>>> can't be solved by the current oom killer, e.g. deleting files from a tmpfs
>>>> instead of killing tasks.
>>>
>>> The aim of this proposal is to lift any heavy lifting stemming from
>>> iterating tasks or cgroups which those BPF might need to make a
>>> decision. There are other ways of course and provide this iteration
>>> functionality as library functions but my BPF experience is very limited
>>> to say how easy is that.
>>>
>>>> So I think the alternative approach is to provide some sort of an interface to
>>>> pre-select oom victims in advance. E.g. on memcg level it can look like:
>>>>
>>>> echo PID >> memory.oom.victim_proc
>>>
>>> this is just a terrible interface TBH. Pids are very volatile objects.
>>> At the time oom killer reads this pid it might be a completely different
>>> process.
>>
>> Well, we already have cgroup.procs interface, which works ok.
>> Obviously if the task is dead (or is actually killed in a result of oom),
>> its pid is removed from the list.
> 
> Right, but writing the pid into the file has an immediate effect and
> recycle pid issues would be rare unless the pid space is mostly
> depleted. You are proposing an interface where the pid would be consumed
> in potentially very distant future. Such an approach would only work if
> the pid is auto-removed and then you need a notification mechanism to
> replace it by something else.
>   
>>>> If the list is empty, the default oom killer is invoked.
>>>> If there are tasks, the first one is killed on OOM.
>>>> A similar interface can exist to choose between sibling cgroups:
>>>>
>>>> echo CGROUP_NAME >> memory.oom.victim_cgroup
>>>
>>> Slightly less volatile but not much better either.
>>>
>>>> This is just a rough idea.
>>>
>>> I am pretty sure that both policies could be implemented by the proposed
>>> BPF interface though if you want something like that.
>>
>> As I said, I'm pretty concerned about how reliable (and effective) it will be.
>> I'm not convinced that executing a generic bpf program from the oom context
>> is safe (and we're talking about executing it potentially thousands of times).
>> If we're going this way, we need an explicit acknowledge from the bpf
>> community and a long-term agreement on how we'll keep things safe.
> 
> I do agree with that.
> 
>> It would be also nice to come up with some practical examples of bpf programs.
>> What are meaningful scenarios which can be covered with the proposed approach
>> and are not covered now with oom_score_adj.
> 
Just like Abel said, oom_score_adj only adjusts the memory usage-based
decisions, and it's hard to translate it into other semantics. We have
seen that userspace oom-killers like oomd have implemented policies based
on other semantics (e.g., memory growth, priority, psi pressure, etc.)
which can be useful in some specific scenarios.

> Agreed here as well. This RFC serves purpose of brainstorming on all of
> this.
> 
> There is a fundamental question whether we need BPF for this task in the
> first place. Are there any huge advantages to export the callback and
> allow a kernel module to hook into it?

If we export the callback to a kernel module and hook into it,
we still have the same problems (e.g., allocating too much memory). Just
like Martin said, at least BPF supports some basic running-context checks,
so some unsafe behavior is restricted.
Roman Gushchin Aug. 15, 2023, 7:03 p.m. UTC | #16
On Thu, Aug 10, 2023 at 12:41:01PM -0700, Martin KaFai Lau wrote:
> > > > > First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> > > > > which will be executed potentially thousands times under a very heavy memory
> > > > > pressure. We will need to make sure that they don't allocate (much) memory, don't
> > > > > take any locks which might deadlock with other memory allocations etc.
> > > > > It will potentially require hard restrictions on what these programs can and can't
> > > > > do and this is something that the bpf community will have to maintain long-term.
> > > > 
> > > > Right, BPF callbacks operating under OOM situations will be really
> > > > constrained but this is more or less by definition. Isn't it?
> > > 
> > > What do you mean?
> > 
> > Callbacks cannot depend on any direct or indirect memory allocations.
> > Dependencies on any sleeping locks (again directly or indirectly) is not
> > allowed just to name the most important ones.
> > 
> > > In general, the bpf community is trying to make it as generic as possible and
> > > adding new and new features. Bpf programs are not as constrained as they were
> > > when it's all started.
> 
> bpf supports different running context. For example, only non-sleepable bpf
> prog is allowed to run at the NIC driver. A sleepable bpf prog is only
> allowed to run at some bpf_lsm hooks that is known to be safe to call
> blocking bpf-helper/kfunc. From the bpf side, it ensures a non-sleepable bpf
> prog cannot do things that may block.

Yeah, you're right: non-sleepable bpf should be ok here.

> 
> fwiw, Dave has recently proposed something for iterating the task vma
> (https://lore.kernel.org/bpf/20230810183513.684836-4-davemarchevsky@fb.com/).
> Potentially, a similar iterator can be created for a bpf program to iterate
> cgroups and tasks.

Yes, it looks like a much better approach rather than adding a hook into
the existing iteration over all tasks.

Thanks!
Roman Gushchin Aug. 15, 2023, 7:52 p.m. UTC | #17
On Thu, Aug 10, 2023 at 12:00:36PM +0800, Abel Wu wrote:
> On 8/9/23 3:53 PM, Michal Hocko wrote:
> > On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
> > > It would be also nice to come up with some practical examples of bpf programs.
> > > What are meaningful scenarios which can be covered with the proposed approach
> > > and are not covered now with oom_score_adj.
> > 
> > Agreed here as well. This RFC serves purpose of brainstorming on all of
> > this.
> > 
> > There is a fundamental question whether we need BPF for this task in the
> > first place. Are there any huge advantages to export the callback and
> > allow a kernel module to hook into it?
> 
> The ancient oom-killer largely depends on memory usage when choosing
> victims, which might not fit the need of modern scenarios. It's common
> nowadays that multiple workloads (tenants) with different 'priorities'
> run together, and the decisions made by the oom-killer doesn't always
> obey the service level agreements.
> 
> While the oom_score_adj only adjusts the usage-based decisions, so it
> can hardly be translated into 'priority' semantic. How can we properly
> configure it given that we don't know how much memory the workloads
> will use? It's really hard for a static strategy to deal with dynamic
> provision. IMHO the oom_score_adj is just another demon.
> 
> Reworking the oom-killer's internal algorithm or patching some random
> metrics may satisfy the immediate needs, but for the next 10 years? I
> doubt it. So I think we do need the flexibility to bypass the legacy
> usage-based algorithm, through bpf or pre-select interfaces.

I agree in general, but I wouldn't call the existing implementation a legacy
or obsolete. It's all about trade-offs. The goal of the existing implementation
is to guarantee the forward progress without killing any processes prematurely.
And it does it relatively well.

Userspace oom killers (e.g. oomd) on top of PSI were initially created to
solve the problem of memory thrashing: a system which is barely getting
anything useful done, but not stuck enough for the OOM killer to kick in.
But they were also able to provide much better flexibility. The downside:
they can't be as reliable as the in-kernel OOM killer.

Bpf or a pre-select interface can in theory glue them together: make sure
that a user has the flexibility to choose the OOM victim without
compromising on reliability. The pre-select interface could be preferable
if all the logic is already implemented in userspace, but might be slightly
less accurate if some statistics (e.g. memory usage) are used for the
determination of the victim. The bpf approach will require re-implementing
the logic, but is potentially more powerful due to fast access to a lot of
kernel data.

Thanks!
Michal Hocko Aug. 22, 2023, 12:42 p.m. UTC | #18
[Still catching up with older threads. Sorry for the late reply]
On Mon 14-08-23 19:25:08, Chuyi Zhou wrote:
> Hello,
> 
> 在 2023/8/9 15:53, Michal Hocko 写道:
> > On Tue 08-08-23 14:41:17, Roman Gushchin wrote:
[...]
> > > It would be also nice to come up with some practical examples of bpf programs.
> > > What are meaningful scenarios which can be covered with the proposed approach
> > > and are not covered now with oom_score_adj.
> > 
> Just like Abel said, the oom_score_adj only adjusts the memory usage-based
> decisions, and it's hard to be translated into other semantics. We see that
> some userspace oom-killer like oomd has implemented policies based on other
> semantics (e.g., memory growth, priority, psi pressure, etc.) which can be
> useful in some specific scenario.

Sure, I guess we do not really need to discuss that oom_score_adj is not
a great fit ;) We want to have practical (read: non-toy) oom policies that
are useful as a PoC though.

> > Agreed here as well. This RFC serves purpose of brainstorming on all of
> > this.
> > 
> > There is a fundamental question whether we need BPF for this task in the
> > first place. Are there any huge advantages to export the callback and
> > allow a kernel module to hook into it?
> 
> If we export the callback to a kernel module and hook into it,
> We still have the same problems (e.g., allocating much memory). Just like
> Martin said, at least BPF supports some basic running context and some
> unsafe behavior is restricted.

I do not follow. A kernel module has access to any existing kernel
interfaces without any need to BPF them. So what exactly is the strength
of BPF over a kernel module hook? I am pretty sure there are some
strengths (many?) but it is really important to be explicit about them
before we make any decision.
Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 612b5597d3af..aec4c55ed49a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -18,6 +18,7 @@ 
  *  kernel subsystems and hints as to where to find out what things do.
  */
 
+#include <linux/bpf.h>
 #include <linux/oom.h>
 #include <linux/mm.h>
 #include <linux/err.h>
@@ -210,6 +211,16 @@  long oom_badness(struct task_struct *p, unsigned long totalpages)
 	if (!p)
 		return LONG_MIN;
 
+	/*
+	 * If task is allocating a lot of memory and has been marked to be
+	 * killed first if it triggers an oom, then set points to LONG_MAX.
+	 * It will be selected unless we keep oc->chosen through bpf interface.
+	 */
+	if (oom_task_origin(p)) {
+		task_unlock(p);
+		return LONG_MAX;
+	}
+
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
 	 * unkillable or have been already oom reaped or the are in
@@ -305,8 +316,30 @@  static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
+enum bpf_select_ret {
+	BPF_SELECT_DISABLE,
+	BPF_SELECT_TASK,
+	BPF_SELECT_CHOSEN,
+};
+
+__weak noinline int bpf_select_task(struct oom_control *oc,
+				struct task_struct *task, long badness_points)
+{
+	return BPF_SELECT_DISABLE;
+}
+
+BTF_SET8_START(oom_bpf_fmodret_ids)
+BTF_ID_FLAGS(func, bpf_select_task)
+BTF_SET8_END(oom_bpf_fmodret_ids)
+
+static const struct btf_kfunc_id_set oom_bpf_fmodret_set = {
+	.owner = THIS_MODULE,
+	.set   = &oom_bpf_fmodret_ids,
+};
+
 static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
+	enum bpf_select_ret bpf_ret = BPF_SELECT_DISABLE;
 	struct oom_control *oc = arg;
 	long points;
 
@@ -329,17 +362,23 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}
 
+	points = oom_badness(task, oc->totalpages);
+
 	/*
-	 * If task is allocating a lot of memory and has been marked to be
-	 * killed first if it triggers an oom, then select it.
+	 * Do not consider tasks with lowest score value except it was caused
+	 * by OOM_SCORE_ADJ_MIN. Give these tasks a chance to be selected by
+	 * bpf interface.
 	 */
-	if (oom_task_origin(task)) {
-		points = LONG_MAX;
+	if (points == LONG_MIN && task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
+		goto next;
+
+	if (oc->chosen)
+		bpf_ret = bpf_select_task(oc, task, points);
+
+	if (bpf_ret == BPF_SELECT_TASK)
 		goto select;
-	}
 
-	points = oom_badness(task, oc->totalpages);
-	if (points == LONG_MIN || points < oc->chosen_points)
+	if (bpf_ret == BPF_SELECT_CHOSEN || points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
 select:
@@ -732,10 +771,14 @@  static struct ctl_table vm_oom_kill_table[] = {
 
 static int __init oom_init(void)
 {
+	int err;
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
 #ifdef CONFIG_SYSCTL
 	register_sysctl_init("vm", vm_oom_kill_table);
 #endif
+	err = register_btf_fmodret_id_set(&oom_bpf_fmodret_set);
+	if (err)
+		pr_warn("error while registering oom fmodret entrypoints: %d", err);
 	return 0;
 }
 subsys_initcall(oom_init)