[2/2] mm/madvise: add process_madvise MADV_DONTNEED support

Message ID: 20201124053943.1684874-3-surenb@google.com
State: New, archived
Series: userspace memory reaping using process_madvise

Commit Message

Suren Baghdasaryan Nov. 24, 2020, 5:39 a.m. UTC
In modern systems it's not unusual to have a system component that monitors
memory conditions and is tasked with keeping system memory pressure under
control. One way to accomplish that is to kill non-essential processes to
free up memory for more important ones. Examples of this are Facebook's
OOM killer daemon, oomd, and Android's low memory killer daemon, lmkd.
For such a system component it's important to be able to free memory
quickly and efficiently. Unfortunately, the time a process takes to free
up its memory after receiving a SIGKILL can vary based on the state of
the process (uninterruptible sleep), its size, and the OPP level of the
core the process is running on.
In such situations it is desirable to be able to free up the memory of the
process being killed in a more controlled way.
Enable MADV_DONTNEED to be used with process_madvise when applied to a
dying process to reclaim its memory. This would allow userspace system
components like oomd and lmkd to free memory of the target process in
a more predictable way.
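
For illustration, here is a rough userspace sketch of how a daemon such as
lmkd or oomd might use the interface proposed above. It is a sketch under
stated assumptions, not code from this series: kill_and_reap() is a
hypothetical helper, the raw syscall numbers are fallbacks for older libc
headers, and passing a NULL iovec assumes the "entire address space" mode
that this series describes. On a kernel without this patch the final call is
expected to fail.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback syscall numbers for older libc headers (assumed values). */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440
#endif

/*
 * Hypothetical helper: SIGKILL the target through a pidfd, then ask the
 * kernel to drop the dying process's memory with MADV_DONTNEED.
 * Returns 0 on success or -1 with errno set.
 */
static long kill_and_reap(pid_t pid)
{
	long ret;
	int saved_errno;
	int pidfd = (int)syscall(__NR_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;

	ret = syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
	if (ret == 0) {
		/* NULL iovec: assumes the series' whole-address-space mode. */
		ret = syscall(__NR_process_madvise, pidfd, NULL, 0UL,
			      MADV_DONTNEED, 0U);
	}

	/* Preserve the errno of the failing call across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

The signal is sent first because the patch only allows destructive advice on
a process that is already dying.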

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

Comments

Oleg Nesterov Nov. 24, 2020, 1:42 p.m. UTC | #1
On 11/23, Suren Baghdasaryan wrote:
>
> +	if (madvise_destructive(behavior)) {
> +		/* Allow destructive madvise only on a dying processes */
> +		if (!signal_group_exit(task->signal)) {

signal_group_exit(task) is true if this task execs and kills other threads,
see the comment above this helper.

I think you need !(task->signal->flags & SIGNAL_GROUP_EXIT).

Oleg.
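
To make the distinction above concrete, here is a small userspace model of
the two checks. It is only a sketch: struct signal_model and struct
task_model are simplified, hypothetical stand-ins for the kernel structures,
and signal_group_exit_model() reflects my understanding of the helper at the
time of this thread (group exit flag set, or an exec in progress tracked via
group_exit_task).

```c
#include <stdbool.h>
#include <stddef.h>

#define SIGNAL_GROUP_EXIT 0x00000004u /* group exit in progress */

/* Hypothetical stand-in for struct task_struct. */
struct task_model {
	int pid;
};

/* Hypothetical, simplified stand-in for struct signal_struct. */
struct signal_model {
	unsigned int flags;
	struct task_model *group_exit_task; /* non-NULL while exec kills siblings */
};

/* Models the helper used in the patch: true for a group exit OR an exec. */
static bool signal_group_exit_model(const struct signal_model *sig)
{
	return (sig->flags & SIGNAL_GROUP_EXIT) || sig->group_exit_task != NULL;
}

/* Models the stricter check suggested above: only a real group exit. */
static bool group_is_exiting_model(const struct signal_model *sig)
{
	return (sig->flags & SIGNAL_GROUP_EXIT) != 0;
}
```

In this model a thread group in the middle of exec passes the first check
but not the second, which is the case the patch should not treat as dying.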
Suren Baghdasaryan Nov. 24, 2020, 4:42 p.m. UTC | #2
On Tue, Nov 24, 2020 at 5:42 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 11/23, Suren Baghdasaryan wrote:
> >
> > +     if (madvise_destructive(behavior)) {
> > +             /* Allow destructive madvise only on a dying processes */
> > +             if (!signal_group_exit(task->signal)) {
>
> signal_group_exit(task) is true if this task execs and kills other threads,
> see the comment above this helper.
>
> I think you need !(task->signal->flags & SIGNAL_GROUP_EXIT).

I see. Thanks for the feedback, Oleg. I'll test and fix it in the next version.

>
> Oleg.
>
Jann Horn Dec. 8, 2020, 11:40 p.m. UTC | #3
On Tue, Nov 24, 2020 at 6:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> In modern systems it's not unusual to have a system component monitoring
> memory conditions of the system and tasked with keeping system memory
> pressure under control. One way to accomplish that is to kill
> non-essential processes to free up memory for more important ones.
> Examples of this are Facebook's OOM killer daemon called oomd and
> Android's low memory killer daemon called lmkd.
> For such system component it's important to be able to free memory
> quickly and efficiently. Unfortunately the time process takes to free
> up its memory after receiving a SIGKILL might vary based on the state
> of the process (uninterruptible sleep), size and OPP level of the core
> the process is running.
> In such situation it is desirable to be able to free up the memory of the
> process being killed in a more controlled way.
> Enable MADV_DONTNEED to be used with process_madvise when applied to a
> dying process to reclaim its memory. This would allow userspace system
> components like oomd and lmkd to free memory of the target process in
> a more predictable way.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
[...]
> @@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
>                 goto release_task;
>         }
>
> +       if (madvise_destructive(behavior)) {
> +               /* Allow destructive madvise only on a dying processes */
> +               if (!signal_group_exit(task->signal)) {
> +                       ret = -EINVAL;
> +                       goto release_mm;
> +               }

Technically Linux allows processes to share mm_struct without being in
the same thread group, so I'm not sure whether this check is good
enough? AFAICS the normal OOM killer deals with this case by letting
__oom_kill_process() always kill all tasks that share the mm_struct.
Suren Baghdasaryan Dec. 8, 2020, 11:59 p.m. UTC | #4
On Tue, Dec 8, 2020 at 3:40 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Nov 24, 2020 at 6:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > In modern systems it's not unusual to have a system component monitoring
> > memory conditions of the system and tasked with keeping system memory
> > pressure under control. One way to accomplish that is to kill
> > non-essential processes to free up memory for more important ones.
> > Examples of this are Facebook's OOM killer daemon called oomd and
> > Android's low memory killer daemon called lmkd.
> > For such system component it's important to be able to free memory
> > quickly and efficiently. Unfortunately the time process takes to free
> > up its memory after receiving a SIGKILL might vary based on the state
> > of the process (uninterruptible sleep), size and OPP level of the core
> > the process is running.
> > In such situation it is desirable to be able to free up the memory of the
> > process being killed in a more controlled way.
> > Enable MADV_DONTNEED to be used with process_madvise when applied to a
> > dying process to reclaim its memory. This would allow userspace system
> > components like oomd and lmkd to free memory of the target process in
> > a more predictable way.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> [...]
> > @@ -1239,6 +1256,23 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> >                 goto release_task;
> >         }
> >
> > +       if (madvise_destructive(behavior)) {
> > +               /* Allow destructive madvise only on a dying processes */
> > +               if (!signal_group_exit(task->signal)) {
> > +                       ret = -EINVAL;
> > +                       goto release_mm;
> > +               }
>
> Technically Linux allows processes to share mm_struct without being in
> the same thread group, so I'm not sure whether this check is good
> enough? AFAICS the normal OOM killer deals with this case by letting
> __oom_kill_process() always kill all tasks that share the mm_struct.

Thanks for the comment, Jann.
You are right. I think replacing !signal_group_exit(task->signal) with
task_will_free_mem(task) would address both your and Oleg's comments.
IIUC, task_will_free_mem() calls __task_will_free_mem() on the task
itself and on all processes sharing the mm_struct, ensuring that they
are all dying.
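
The idea of requiring every user of a shared mm_struct to be dying can be
sketched as a small userspace model. This is not the kernel's
task_will_free_mem(); struct mm_model, struct proc_model and
all_mm_users_dying() are hypothetical names, and the real function handles
more cases (thread groups, OOM flags) than this simplified version.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct. */
struct mm_model {
	int id;
};

/* Hypothetical stand-in for a process: which mm it uses, and whether it is dying. */
struct proc_model {
	const struct mm_model *mm;
	bool dying;
};

/*
 * Returns true only if the target is dying and every process in the table
 * that shares the target's mm is dying as well.
 */
static bool all_mm_users_dying(const struct proc_model *procs, size_t n,
			       const struct proc_model *target)
{
	size_t i;

	if (!target->mm || !target->dying)
		return false;

	for (i = 0; i < n; i++) {
		if (procs[i].mm != target->mm)
			continue; /* unrelated address space */
		if (!procs[i].dying)
			return false; /* a live user would lose its memory */
	}
	return true;
}
```

With a check of this shape, a dying process whose address space is still
used by a live process in another thread group would be rejected.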
Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 1aa074a46524..11306534369e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -29,6 +29,7 @@ 
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/oom.h>
 
 #include <asm/tlb.h>
 
@@ -995,6 +996,18 @@  process_madvise_behavior_valid(int behavior)
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_DONTNEED:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static bool madvise_destructive(int behavior)
+{
+	switch (behavior) {
+	case MADV_DONTNEED:
+	case MADV_FREE:
 		return true;
 	default:
 		return false;
@@ -1006,6 +1019,10 @@  static bool can_range_madv_lru_vma(struct vm_area_struct *vma, int behavior)
 	if (!can_madv_lru_vma(vma))
 		return false;
 
+	/* For destructive madvise skip shared file-backed VMAs */
+	if (madvise_destructive(behavior))
+		return vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED);
+
 	return true;
 }
 
@@ -1239,6 +1256,23 @@  SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto release_task;
 	}
 
+	if (madvise_destructive(behavior)) {
+		/* Allow destructive madvise only on a dying process */
+		if (!signal_group_exit(task->signal)) {
+			ret = -EINVAL;
+			goto release_mm;
+		}
+		/* Ensure no competition with OOM-killer to avoid contention */
+		if (unlikely(mm_is_oom_victim(mm)) ||
+		    unlikely(test_bit(MMF_OOM_SKIP, &mm->flags))) {
+			/* Already being reclaimed */
+			ret = 0;
+			goto release_mm;
+		}
+		/* Mark mm as unstable */
+		set_bit(MMF_UNSTABLE, &mm->flags);
+	}
+
 	/*
 	 * For range madvise only the entire address space is supported for now
 	 * and input iovec is ignored.
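
As a reading aid for the can_range_madv_lru_vma() hunk above, the VMA filter
it adds can be modelled in userspace as follows. This is a sketch only:
struct vma_model and destructive_madvise_allowed() are hypothetical, the
anonymous field stands in for vma_is_anonymous(), and VM_SHARED is assumed to
carry the kernel's value.

```c
#include <stdbool.h>

#define VM_SHARED 0x00000008UL /* assumed to match the kernel's flag value */

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct vma_model {
	bool anonymous;         /* what vma_is_anonymous() would report */
	unsigned long vm_flags;
};

/*
 * Mirrors the added condition: destructive advice applies to anonymous
 * VMAs and to private file-backed VMAs, and skips shared file-backed ones.
 */
static bool destructive_madvise_allowed(const struct vma_model *vma)
{
	return vma->anonymous || !(vma->vm_flags & VM_SHARED);
}
```

Only the shared file-backed case is skipped, since pages there may still be
in use by other processes mapping the same file.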