
[v6] oom_kill.c: futex: Don't OOM reap the VMA containing the robust_list_head

Message ID 20220407184254.3612387-1-npache@redhat.com (mailing list archive)

Commit Message

Nico Pache April 7, 2022, 6:42 p.m. UTC
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
be targeted by the oom reaper. This mapping is used to store the futex
robust list head; the kernel does not keep a copy of the robust list and
instead references a userspace address to maintain the robustness during
a process death. A race can occur between exit_mm and the oom reaper that
allows the oom reaper to free the memory of the futex robust list before
the exit path has handled the futex death:

    CPU1                               CPU2
------------------------------------------------------------------------
    page_fault
    do_exit "signal"
    wake_oom_reaper
                                        oom_reaper
                                        oom_reap_task_mm (invalidates mm)
    exit_mm
    exit_mm_release
    futex_exit_release
    futex_cleanup
    exit_robust_list
    get_user (EFAULT- can't access memory)

If get_user() returns -EFAULT, the kernel will be unable to recover the
waiters on the robust_list, leaving userspace mutexes hung indefinitely.

Use the robust_list address stored in the kernel to skip the VMA that holds
it, allowing a successful futex_cleanup.

Theoretically a failure can still occur if there are locks mapped as
PRIVATE|ANON; however, the robust futexes are a best-effort approach.
This patch only strengthens that best-effort.

The following case can still fail:
robust head (skipped) -> private lock (reaped) -> shared lock (skipped)

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

[1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph von Recklinghausen <crecklin@redhat.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Co-developed-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/oom.h |  3 ++-
 mm/mmap.c           |  3 ++-
 mm/oom_kill.c       | 14 +++++++++++---
 3 files changed, 15 insertions(+), 5 deletions(-)

Comments

Andrew Morton April 7, 2022, 8:18 p.m. UTC | #1
On Thu,  7 Apr 2022 14:42:54 -0400 Nico Pache <npache@redhat.com> wrote:

> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> be targeted by the oom reaper. This mapping is used to store the futex
> robust list head; the kernel does not keep a copy of the robust list and
> instead references a userspace address to maintain the robustness during
> a process death. A race can occur between exit_mm and the oom reaper that
> allows the oom reaper to free the memory of the futex robust list before
> the exit path has handled the futex death:
> 
>     CPU1                               CPU2
> ------------------------------------------------------------------------
>     page_fault
>     do_exit "signal"
>     wake_oom_reaper
>                                         oom_reaper
>                                         oom_reap_task_mm (invalidates mm)
>     exit_mm
>     exit_mm_release
>     futex_exit_release
>     futex_cleanup
>     exit_robust_list
>     get_user (EFAULT- can't access memory)
> 
> If the get_user EFAULT's, the kernel will be unable to recover the
> waiters on the robust_list, leaving userspace mutexes hung indefinitely.
> 
> Use the robust_list address stored in the kernel to skip the VMA that holds
> it, allowing a successful futex_cleanup.
> 
> Theoretically a failure can still occur if there are locks mapped as
> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
> This patch only strengthens that best-effort.
> 
> The following case can still fail:
> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
> 
> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

Should this fix be backported into -stable kernels?

> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -106,7 +106,8 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -bool __oom_reap_task_mm(struct mm_struct *mm);
> +bool __oom_reap_task_mm(struct mm_struct *mm, struct robust_list_head
> +		__user *robust_list);

Should explicitly include futex.h

>  long oom_badness(struct task_struct *p,
>  		unsigned long totalpages);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3aa839f81e63..c14fe6f8e9a5 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3126,7 +3126,8 @@ void exit_mmap(struct mm_struct *mm)
>  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
>  		 * __oom_reap_task_mm() will not block.
>  		 */
> -		(void)__oom_reap_task_mm(mm);
> +		(void)__oom_reap_task_mm(mm, current->robust_list);
> +
>  		set_bit(MMF_OOM_SKIP, &mm->flags);
>  	}
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ec38194f8e1..727cfc3bd284 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -509,9 +509,11 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> -bool __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm, struct robust_list_head
> +		__user *robust_list)
>  {

It's pretty sad to make such a low-level function aware of futex
internals.  How about making it a more general `void *skip_area'?

>  	struct vm_area_struct *vma;
> +	unsigned long head = (unsigned long) robust_list;
>  	bool ret = true;
>  
>  	/*
> @@ -526,6 +528,11 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>  
> +		if (vma->vm_start <= head && vma->vm_end > head) {

This check as you have it is making assumptions about the length of the
area at *robust_list and about that area's relation to the area
represented by the vma.

So if this is to be made more generic, we'd also need skip_area_len so
we can perform a full overlap check.

I dunno, maybe not worth it at this time, what do others think.

But the special-casing in here is pretty painful.
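For concreteness, the full overlap test such a generic skip_area would need might look like the following standalone sketch (names are hypothetical, not kernel code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the generic check discussed above -- names are hypothetical.
 * A VMA [vm_start, vm_end) must be skipped iff it intersects the region
 * [skip_start, skip_start + skip_len); the patch's containment test
 * (vm_start <= head && vm_end > head) is the skip_len == 1 special case.
 */
static bool vma_overlaps_skip_area(uintptr_t vm_start, uintptr_t vm_end,
				   uintptr_t skip_start, size_t skip_len)
{
	return vm_start < skip_start + skip_len && vm_end > skip_start;
}
```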

> +			pr_info("oom_reaper: skipping vma, contains robust_list");
> +			continue;
> +		}
> +
>  		/*
>  		 * Only anonymous pages have a good chance to be dropped
>  		 * without additional steps which we cannot afford as we
> @@ -587,7 +594,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  	trace_start_task_reaping(tsk->pid);
>  
>  	/* failed to reap part of the address space. Try again later */
> -	ret = __oom_reap_task_mm(mm);
> +	ret = __oom_reap_task_mm(mm, tsk->robust_list);
>  	if (!ret)
>  		goto out_finish;
>  
> @@ -1190,7 +1197,8 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
>  	 * possible change in exit_mmap is seen
>  	 */
> -	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
> +	if (!test_bit(MMF_OOM_SKIP, &mm->flags) &&
> +			!__oom_reap_task_mm(mm, p->robust_list))
>  		ret = -EAGAIN;
>  	mmap_read_unlock(mm);
>  
> -- 
> 2.35.1
Nico Pache April 7, 2022, 9:52 p.m. UTC | #2
Hi Andrew!

On 4/7/22 16:18, Andrew Morton wrote:
> On Thu,  7 Apr 2022 14:42:54 -0400 Nico Pache <npache@redhat.com> wrote:
> 
>> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
>> be targeted by the oom reaper. This mapping is used to store the futex
>> robust list head; the kernel does not keep a copy of the robust list and
>> instead references a userspace address to maintain the robustness during
>> a process death. A race can occur between exit_mm and the oom reaper that
>> allows the oom reaper to free the memory of the futex robust list before
>> the exit path has handled the futex death:
>>
>>     CPU1                               CPU2
>> ------------------------------------------------------------------------
>>     page_fault
>>     do_exit "signal"
>>     wake_oom_reaper
>>                                         oom_reaper
>>                                         oom_reap_task_mm (invalidates mm)
>>     exit_mm
>>     exit_mm_release
>>     futex_exit_release
>>     futex_cleanup
>>     exit_robust_list
>>     get_user (EFAULT- can't access memory)
>>
>> If the get_user EFAULT's, the kernel will be unable to recover the
>> waiters on the robust_list, leaving userspace mutexes hung indefinitely.
>>
>> Use the robust_list address stored in the kernel to skip the VMA that holds
>> it, allowing a successful futex_cleanup.
>>
>> Theoretically a failure can still occur if there are locks mapped as
>> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
>> This patch only strengthens that best-effort.
>>
>> The following case can still fail:
>> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
>>
>> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
> 
> Should this fix be backported into -stable kernels?

Yes, I believe so. This is caused by the commit marked under 'Fixes:', which is
in the stable branches.
> 
>> --- a/include/linux/oom.h
>> +++ b/include/linux/oom.h
>> @@ -106,7 +106,8 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
>>  	return 0;
>>  }
>>  
>> -bool __oom_reap_task_mm(struct mm_struct *mm);
>> +bool __oom_reap_task_mm(struct mm_struct *mm, struct robust_list_head
>> +		__user *robust_list);
> 
> Should explicitly include futex.h
Good point. On second thought, I think we also need to surround some of the
changes with an #ifdef CONFIG_FUTEX; current->robust_list is undefined if that
config option is turned off.
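A minimal standalone sketch of the kind of CONFIG_FUTEX guard meant here, using a mock task_struct so it compiles outside the kernel (all names illustrative, not the final patch):

```c
#include <stddef.h>

/* Mock of the kernel struct, for illustration only; in the real kernel
 * the robust_list member exists only when CONFIG_FUTEX=y. */
struct task_struct {
	int pid;	/* filler so the mock is valid C without CONFIG_FUTEX */
#ifdef CONFIG_FUTEX
	void *robust_list;
#endif
};

/* Hypothetical helper: resolve the area the reaper must skip. With
 * futexes compiled out there is no robust list, so nothing is skipped. */
static inline void *oom_reap_skip_area(struct task_struct *tsk)
{
#ifdef CONFIG_FUTEX
	return tsk->robust_list;
#else
	(void)tsk;
	return NULL;
#endif
}
```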
> 
>>  long oom_badness(struct task_struct *p,
>>  		unsigned long totalpages);
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 3aa839f81e63..c14fe6f8e9a5 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3126,7 +3126,8 @@ void exit_mmap(struct mm_struct *mm)
>>  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
>>  		 * __oom_reap_task_mm() will not block.
>>  		 */
>> -		(void)__oom_reap_task_mm(mm);
>> +		(void)__oom_reap_task_mm(mm, current->robust_list);
>> +
>>  		set_bit(MMF_OOM_SKIP, &mm->flags);
>>  	}
>>  
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 7ec38194f8e1..727cfc3bd284 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -509,9 +509,11 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>>  static struct task_struct *oom_reaper_list;
>>  static DEFINE_SPINLOCK(oom_reaper_lock);
>>  
>> -bool __oom_reap_task_mm(struct mm_struct *mm)
>> +bool __oom_reap_task_mm(struct mm_struct *mm, struct robust_list_head
>> +		__user *robust_list)
>>  {
> 
> It's pretty sad to make such a low-level function aware of futex
> internals.  How about making it a more general `void *skip_area'?
Yes, we can make this change. My concern is that the caller may now have to
cast the type: __oom_reap_task_mm(mm, (void *)current->robust_list). But I
doubt that is a big concern.

> 
>>  	struct vm_area_struct *vma;
>> +	unsigned long head = (unsigned long) robust_list;
>>  	bool ret = true;
>>  
>>  	/*
>> @@ -526,6 +528,11 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>>  			continue;
>>  
>> +		if (vma->vm_start <= head && vma->vm_end > head) {
> 
> This check as you have it is making assumptions about the length of the
> area at *robust_list and about that area's relation to the area
> represented by the vma.
> 
> So if this is to be made more generic, we'd also need skip_area_len so
> we can perform a full overlap check.
I'm not sure I follow here. Can a single mmap call span multiple VMAs? The
address is part of the pthread_t struct, which is mmapped by the userspace
code. We are simply looking for that VMA and skipping the reaping of it. The
check does not try to find the individual locks (which are allocated separately
and chained on a linked list); it only prevents the reaping of the
robust_list_head (part of pthread_t), which stores the start of that list. If
some of the locks are private (shared locks are not reaped), we may run into a
case where this still fails; however, we haven't been able to reproduce that.

Thanks for the review :) I will make the changes required to be more
config-independent.

Cheers,
-- Nico
> 
> I dunno, maybe not worth it at this time, what do others think.
> 
> But the special-casing in here is pretty painful.
> 
>> +			pr_info("oom_reaper: skipping vma, contains robust_list");
>> +			continue;
>> +		}
>> +
>>  		/*
>>  		 * Only anonymous pages have a good chance to be dropped
>>  		 * without additional steps which we cannot afford as we
>> @@ -587,7 +594,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>>  	trace_start_task_reaping(tsk->pid);
>>  
>>  	/* failed to reap part of the address space. Try again later */
>> -	ret = __oom_reap_task_mm(mm);
>> +	ret = __oom_reap_task_mm(mm, tsk->robust_list);
>>  	if (!ret)
>>  		goto out_finish;
>>  
>> @@ -1190,7 +1197,8 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>>  	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
>>  	 * possible change in exit_mmap is seen
>>  	 */
>> -	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
>> +	if (!test_bit(MMF_OOM_SKIP, &mm->flags) &&
>> +			!__oom_reap_task_mm(mm, p->robust_list))
>>  		ret = -EAGAIN;
>>  	mmap_read_unlock(mm);
>>  
>> -- 
>> 2.35.1
>
Andrew Morton April 7, 2022, 10:32 p.m. UTC | #3
On Thu, 7 Apr 2022 17:52:31 -0400 Nico Pache <npache@redhat.com> wrote:

> >>
> >> The following case can still fail:
> >> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
> >>
> >> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
> > 
> > Should this fix be backported into -stable kernels?
> 
> Yes I believe so. This is caused by the commit marked under 'Fixes:' which is in
> stable branch.

OK.  The MM team don't like the promiscuous backporting of things which
we didn't ask to be backported.  So -stable maintainers have been
trained (we hope) to only backport things which we explicitly marked
cc:stable.

> >> --- a/include/linux/oom.h
> >> +++ b/include/linux/oom.h
> >> @@ -106,7 +106,8 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
> >>  	return 0;
> >>  }
> >>  
> >> -bool __oom_reap_task_mm(struct mm_struct *mm);
> >> +bool __oom_reap_task_mm(struct mm_struct *mm, struct robust_list_head
> >> +		__user *robust_list);
> > 
> > Should explicitly include futex.h
> Good point. On second thought I think we also need to surround some of the
> changes with a ifdef CONFIG_FUTEX. current->robust_list is undefined if we turn
> that config option off.

Ah.

> > 
> >>  long oom_badness(struct task_struct *p,
> >>  		unsigned long totalpages);
> >> diff --git a/mm/mmap.c b/mm/mmap.c
> >> index 3aa839f81e63..c14fe6f8e9a5 100644
> >> --- a/mm/mmap.c
> >> +++ b/mm/mmap.c
> >> @@ -3126,7 +3126,8 @@ void exit_mmap(struct mm_struct *mm)
> >>  		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
> >>  		 * __oom_reap_task_mm() will not block.
> >>  		 */
> >> -		(void)__oom_reap_task_mm(mm);
> >> +		(void)__oom_reap_task_mm(mm, current->robust_list);
> >> +
> >>  		set_bit(MMF_OOM_SKIP, &mm->flags);
> >>  	}
> >>  
> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >> index 7ec38194f8e1..727cfc3bd284 100644
> >> --- a/mm/oom_kill.c
> >> +++ b/mm/oom_kill.c
> >> @@ -509,9 +509,11 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> >>  static struct task_struct *oom_reaper_list;
> >>  static DEFINE_SPINLOCK(oom_reaper_lock);
> >>  
> >> -bool __oom_reap_task_mm(struct mm_struct *mm)
> >> +bool __oom_reap_task_mm(struct mm_struct *mm, struct robust_list_head
> >> +		__user *robust_list)
> >>  {
> > 
> > It's pretty sad to make such a low-level function aware of futex
> > internals.  How about making it a more general `void *skip_area'?
> Yes we can make this change. My concern is that the caller may now have to cast
> the type: __oom_reap_task_mm(mm_struct, (void*) current->robust_list). But I
> doubt that is a big concern.

No cast needed - the compiler will happily cast an anything* to a void*.
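A toy sketch of that implicit conversion (the struct here is a stand-in, not the kernel's definition):

```c
#include <stddef.h>

/* Toy stand-in for the kernel's struct robust_list_head. */
struct robust_list_head { struct robust_list_head *next; };

/* The `void *skip_area` form under discussion: any object pointer
 * converts to void * implicitly in C, so callers need no cast. */
static void *record_skip_area(void *skip_area)
{
	return skip_area;
}

static struct robust_list_head head;

/* No (void *) cast at the call site: */
static void *demo(void)
{
	return record_skip_area(&head);
}
```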

> > 
> >>  	struct vm_area_struct *vma;
> >> +	unsigned long head = (unsigned long) robust_list;
> >>  	bool ret = true;
> >>  
> >>  	/*
> >> @@ -526,6 +528,11 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
> >>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >>  			continue;
> >>  
> >> +		if (vma->vm_start <= head && vma->vm_end > head) {
> > 
> > This check as you have it is making assumptions about the length of the
> > area at *robust_list and about that area's relation to the area
> > represented by the vma.
> > 
> > So if this is to be made more generic, we'd also need skip_area_len so
> > we can perform a full overlap check.
> Im not sure I follow here. Can a single MMAP call span multiple VMAs? The
> address would be part of the pthread_t struct which is mmapped by the userspace
> code. We are simply looking for that VMA and skipping the oom of it. It does not
> try to find the individual locks (allocated separately and represented on a
> LinkedList), it just prevents the reaping of the robust_list_head (part of
> pthread_t) which stores the start of this LL. If some of the locks are private
> (shared locks are not reaped) we may run into a case where this still fails;
> however, we haven't been able to reproduce this.

Well I was thinking that if it were to become a generic "ignore this
region" then the code would need to be taught to skip any vma which had
any form of overlap with that region.  Which sounds like way overdesign
until there's a demonstrated need.

Patch

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..580c95a0541d 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -106,7 +106,8 @@  static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
+bool __oom_reap_task_mm(struct mm_struct *mm, struct robust_list_head
+		__user *robust_list);
 
 long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..c14fe6f8e9a5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3126,7 +3126,8 @@  void exit_mmap(struct mm_struct *mm)
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
 		 */
-		(void)__oom_reap_task_mm(mm);
+		(void)__oom_reap_task_mm(mm, current->robust_list);
+
 		set_bit(MMF_OOM_SKIP, &mm->flags);
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..727cfc3bd284 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -509,9 +509,11 @@  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm, struct robust_list_head
+		__user *robust_list)
 {
 	struct vm_area_struct *vma;
+	unsigned long head = (unsigned long) robust_list;
 	bool ret = true;
 
 	/*
@@ -526,6 +528,11 @@  bool __oom_reap_task_mm(struct mm_struct *mm)
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
+		if (vma->vm_start <= head && vma->vm_end > head) {
+			pr_info("oom_reaper: skipping vma, contains robust_list");
+			continue;
+		}
+
 		/*
 		 * Only anonymous pages have a good chance to be dropped
 		 * without additional steps which we cannot afford as we
@@ -587,7 +594,7 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	trace_start_task_reaping(tsk->pid);
 
 	/* failed to reap part of the address space. Try again later */
-	ret = __oom_reap_task_mm(mm);
+	ret = __oom_reap_task_mm(mm, tsk->robust_list);
 	if (!ret)
 		goto out_finish;
 
@@ -1190,7 +1197,8 @@  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
 	 * possible change in exit_mmap is seen
 	 */
-	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
+	if (!test_bit(MMF_OOM_SKIP, &mm->flags) &&
+			!__oom_reap_task_mm(mm, p->robust_list))
 		ret = -EAGAIN;
 	mmap_read_unlock(mm);