
[v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper

Message ID 20220318033621.626006-1-npache@redhat.com (mailing list archive)
State New
Series [v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper

Commit Message

Nico Pache March 18, 2022, 3:36 a.m. UTC
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
be targeted by the oom reaper. This mapping is used to store the futex
robust list; the kernel does not keep a copy of the robust list and instead
references a userspace address to maintain the robustness during a process
death. A race can occur between exit_mm and the oom reaper that allows
the oom reaper to free the memory of the futex robust list before the
exit path has handled the futex death:

    CPU1                               CPU2
------------------------------------------------------------------------
    page_fault
    out_of_memory
    do_exit "signal"
    wake_oom_reaper
					oom_reaper
                        		oom_reap_task_mm (invalidates mm)
    exit_mm
    exit_mm_release
    futex_exit_release
    futex_cleanup
    exit_robust_list
    get_user (EFAULT- can't access memory)

While in the oom reaper thread, we must handle the futex cleanup without
sleeping. To achieve this, add a boolean argument `try` to
futex_cleanup_begin(). This will control whether or not we use a trylock.
Also introduce try_futex_exit_release(), which will utilize the trylock
version of futex_cleanup_begin(). Also call kthread_use_mm in this context
to ensure the get_user call in futex_cleanup() does not return -EFAULT.

If the robust list exists, and the mutex_trylock fails, we prevent the OOM
reaper from concurrently reaping the mappings. If the dying task_struct
does not contain a pointer in tsk->robust_list, we can assume there was
either never one setup for this task struct, or the exit path's call to
futex_cleanup() has properly handled the futex death, and we can safely
reap this memory.
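
For reference, the two new helpers take roughly this shape (a sketch
inferred from this changelog and the diffstat below; the exact definitions
are in the diff):

	/* sketch: true if the task has registered a robust list */
	static inline bool task_has_robust_list(struct task_struct *tsk)
	{
		return tsk->robust_list != NULL;
	}

	/* sketch: trylock-based variant of futex_exit_release() */
	bool try_futex_exit_release(struct task_struct *tsk);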

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

[1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph von Recklinghausen <crecklin@redhat.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andre Almeida <andrealmeid@collabora.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Co-developed-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/futex.h | 14 ++++++++++++++
 kernel/futex/core.c   | 35 +++++++++++++++++++++++++++++++----
 mm/oom_kill.c         | 13 +++++++++++++
 3 files changed, 58 insertions(+), 4 deletions(-)

Comments

Michal Hocko March 21, 2022, 8:55 a.m. UTC | #1
On Thu 17-03-22 21:36:21, Nico Pache wrote:
> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> be targeted by the oom reaper. This mapping is used to store the futex
> robust list; the kernel does not keep a copy of the robust list and instead
> references a userspace address to maintain the robustness during a process
> death. A race can occur between exit_mm and the oom reaper that allows
> the oom reaper to free the memory of the futex robust list before the
> exit path has handled the futex death:
> 
>     CPU1                               CPU2
> ------------------------------------------------------------------------
>     page_fault
>     out_of_memory
>     do_exit "signal"
>     wake_oom_reaper
> 					oom_reaper
>                         		oom_reap_task_mm (invalidates mm)
>     exit_mm
>     exit_mm_release
>     futex_exit_release
>     futex_cleanup
>     exit_robust_list
>     get_user (EFAULT- can't access memory)

I still think it is useful to be explicit about the consequences of the
EFAULT here. You may want to mention that a failing get_user in this
path results in a hang because nobody is woken up when the current
holder of the lock terminates.

> While in the oom reaper thread, we must handle the futex cleanup without
> sleeping. To achieve this, add a boolean argument `try` to
> futex_cleanup_begin(). This will control whether or not we use a trylock.
> Also introduce try_futex_exit_release(), which will utilize the trylock
> version of futex_cleanup_begin(). Also call kthread_use_mm in this context
> to ensure the get_user call in futex_cleanup() does not return -EFAULT.

This alone is not sufficient. get_user can sleep in the #PF handler path
(e.g. by waiting for swap in). Or is there any guarantee that the page
is never swapped out? If we cannot rule out #PF then this is not a
viable way to address the problem I am afraid.

[...]
> @@ -587,6 +588,18 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  		goto out_unlock;
>  	}
>  
> +	/* We can't reap a process holding a robust_list; the pthread
> +	 * struct is allocated in userspace using PRIVATE | ANONYMOUS
> +	 * memory which when reaped before futex_cleanup() can leave
> +	 * the waiting process stuck. Try to perform the futex_cleanup,
> +	 * and if unsuccessful, skip the OOM reaping.
> +	 */
> +	if (task_has_robust_list(tsk) && !try_futex_exit_release(tsk)) {
> +		trace_skip_task_reaping(tsk->pid);
> +		pr_info("oom_reaper: skipping task as it contains a robust list");
> +		goto out_finish;
> +	}
> +
>  	trace_start_task_reaping(tsk->pid);
>  
>  	/* failed to reap part of the address space. Try again later */

Please also note that this all is done after mmap_lock has been already
taken so a page fault could deadlock on the mmap_lock.

The more I am thinking about this the more I am getting convinced that
we should rather approach this differently and skip over vmas which can
be holding the list. Have you considered this option?
Nico Pache March 21, 2022, 10:45 p.m. UTC | #2
On 3/21/22 02:55, Michal Hocko wrote:
> On Thu 17-03-22 21:36:21, Nico Pache wrote:
>> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
>> be targeted by the oom reaper. This mapping is used to store the futex
>> robust list; the kernel does not keep a copy of the robust list and instead
>> references a userspace address to maintain the robustness during a process
>> death. A race can occur between exit_mm and the oom reaper that allows
>> the oom reaper to free the memory of the futex robust list before the
>> exit path has handled the futex death:
>>
>>     CPU1                               CPU2
>> ------------------------------------------------------------------------
>>     page_fault
>>     out_of_memory
>>     do_exit "signal"
>>     wake_oom_reaper
>> 					oom_reaper
>>                         		oom_reap_task_mm (invalidates mm)
>>     exit_mm
>>     exit_mm_release
>>     futex_exit_release
>>     futex_cleanup
>>     exit_robust_list
>>     get_user (EFAULT- can't access memory)
> 
> I still think it is useful to be explicit about the consequences of the
> EFAULT here. Did you want to mention that a failing get_user in this
> path would result in a hang because nobody is woken up when the current
> holder of the lock terminates.

Sounds good! You make a good point -- we had that in all the other versions, but
I forgot to include it in this commit log.
> 
>> While in the oom reaper thread, we must handle the futex cleanup without
>> sleeping. To achieve this, add a boolean argument `try` to
>> futex_cleanup_begin(). This will control whether or not we use a trylock.
>> Also introduce try_futex_exit_release(), which will utilize the trylock
>> version of futex_cleanup_begin(). Also call kthread_use_mm in this context
>> to ensure the get_user call in futex_cleanup() does not return -EFAULT.
> 
> This alone is not sufficient. get_user can sleep in the #PF handler path
> (e.g. by waiting for swap in). Or is there any guarantee that the page
> is never swapped out? If we cannot rule out #PF then this is not a
> viable way to address the problem I am afraid.
>
> Please also note that this all is done after mmap_lock has been already
> taken so a page fault could deadlock on the mmap_lock.
>
I don't think we can guarantee that page is not swapped out. Good catch, I was
concerned when I saw the 'might_fault' in get_user, but I wasn't fully sure of
its consequences. I'm still learning the labyrinth that is the MM space, so
thanks for the context!

> The more I am thinking about this the more I am getting convinced that
> we should rather approach this differently and skip over vmas which can
> be holding the list. Have you considered this option?

We've discussed it and it seems very doable, but we haven't attempted to
implement it yet. I'll give it a shot and see what I can come up with!

Thank you for your feedback and reviews :)
-- Nico
Nico Pache March 22, 2022, 1:53 a.m. UTC | #3
On 3/21/22 18:42, Davidlohr Bueso wrote:
> On Mon, 21 Mar 2022, Michal Hocko wrote:
> 
>> The more I am thinking about this the more I am getting convinced that
>> we should rather approach this differently and skip over vmas which can
>> be holding the list. Have you considered this option?
> 
> While I tend to agree with this over a hacky trylock approach, I cannot
> help but think that v3 was the right thing to do, at least conceptually.
Yeah, conceptually the V3 was the first correct patch. It could use some slight
cleanup with a wrapper like in this v5 (has_robust_list), and instead of
returning it should set MMF_OOM_SKIP.
> Robust futex users here care enough about dealing with crashes while holding
> a lock that they sacrifice the performance of regular futexes. So the OOM
> killer should not cause this very thing. I went through previous threads
> but other than the user base (which I don't think would be very large
> just because of the performance implications), was there any other reason
> to not just set MMF_OOM_SKIP upon a robust list?
We could proceed with the V3 approach; however, if we are able to find a complete
solution that keeps both functionalities (Concurrent OOM Reaping & Robust Futex)
working, I don't see why we wouldn't go for it.

If we can't find a good/reliable way to check if the vma contains the robust
list then I think we should just skip the OOM like in the v3.

Cheers,
-- Nico

> 
> Thanks,
> Davidlohr
>
Nico Pache March 22, 2022, 3:09 a.m. UTC | #4
On 3/21/22 20:57, Davidlohr Bueso wrote:
> On Mon, 21 Mar 2022, Nico Pache wrote:
> 
>> We could proceed with the V3 approach; however, if we are able to find a complete
>> solution that keeps both functionalities (Concurrent OOM Reaping & Robust Futex)
>> working, I don't see why we wouldn't go for it.
> 
> Because semantically killing the process is, imo, the wrong thing to do. My
> performance argument before however is bogus as the overhead of robust futexes
> is pretty negligible within the lifetime of a lock. That said, the users still
> have good(?) reasons for not wanting the lock holder to crash on them.

From my understanding, the whole point of the robust futex is to allow forward
progress in an application in which the lock holder CAN crash/exit/oom. So
semantically nothing is wrong with killing the futex holder... the whole point
of the robustness is to handle these cases. We just have a case where the oom
killer is racing with said handling of the futex, invalidating the memory before
the exit path (handle_futex_death) can awake one of the other waiters.

-- Nico
> 
> Thanks,
> Davidlohr
>
Michal Hocko March 22, 2022, 8:26 a.m. UTC | #5
On Mon 21-03-22 19:57:24, Davidlohr Bueso wrote:
> On Mon, 21 Mar 2022, Nico Pache wrote:
> 
> > We could proceed with the V3 approach; however, if we are able to find a complete
> > solution that keeps both functionalities (Concurrent OOM Reaping & Robust Futex)
> > working, I don't see why we wouldn't go for it.
> 
> Because semantically killing the process is, imo, the wrong thing to do.

I am not sure I follow. The task has been killed by the oom killer. All
we are discussing here is how to preserve the robust list metadata
stored in the memory which is normally unmapped by the oom_reaper to
guarantee further progress.

I can see we have 4 potential solutions:
1) do not oom_reap oom victims with robust futex metadata in anonymous
   memory. Easy enough but it could lead to excessive oom killing in
   case the victim gets stuck in the kernel and cannot terminate.
2) clean up robust list from the oom_reaper context. Seems tricky due to
   #PF handling from the oom_reaper context which would need to be
   non-blocking
3) filter vmas which contain robust list. Simple check for the vma range
   (see the sketch after this list)
4) internally mark vmas which have to preserve the state during
   oom_reaping. Futex code would somehow have to mark those mappings.
   While more generic solution. I am not sure this is a practical
   approach.
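
A minimal sketch of what #3 could look like (vma_covers_robust_list() is a
hypothetical helper; a complete version would also have to consider compat
tasks and every thread of the process):

	static bool vma_covers_robust_list(struct vm_area_struct *vma,
					   struct task_struct *tsk)
	{
		unsigned long head = (unsigned long)tsk->robust_list;

		return head >= vma->vm_start && head < vma->vm_end;
	}

	/* in the oom_reap_task_mm() vma walk: */
	if (vma_covers_robust_list(vma, tsk))
		continue;	/* preserve the robust list metadata */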
Thomas Gleixner March 22, 2022, 3:17 p.m. UTC | #6
On Tue, Mar 22 2022 at 09:26, Michal Hocko wrote:
> On Mon 21-03-22 19:57:24, Davidlohr Bueso wrote:
>> On Mon, 21 Mar 2022, Nico Pache wrote:
>> 
>> > We could proceed with the V3 approach; however, if we are able to find a complete
>> > solution that keeps both functionalities (Concurrent OOM Reaping & Robust Futex)
>> > working, I don't see why we wouldn't go for it.

See below.

>> Because semantically killing the process is, imo, the wrong thing to do.
>
> I am not sure I follow. The task has been killed by the oom killer. All
> we are discussing here is how to preserve the robust list metadata
> stored in the memory which is normally unmapped by the oom_reaper to
> guarantee further progress.
>
> I can see we have 4 potential solutions:
> 1) do not oom_reap oom victims with robust futex metadata in anonymous
>    memory. Easy enough but it could lead to excessive oom killing in
>    case the victim gets stuck in the kernel and cannot terminate.
> 2) clean up robust list from the oom_reaper context. Seems tricky due to
>    #PF handling from the oom_reaper context which would need to be
>    non-blocking
> 3) filter vmas which contain robust list. Simple check for the vma
>    range
> 4) internally mark vmas which have to preserve the state during
>    oom_reaping. Futex code would somehow have to mark those mappings.
>    While more generic solution. I am not sure this is a practical
>    approach. 

And all of that is based on wishful thinking, really. Let me explain.

The task::robust_list pointer is set unconditionally by NPTL for every
thread of a process. It points to the 'list head' which is in the
TLS. But this does not tell whether the task holds a robust futex or
not. That's evaluated in the futex exit handling code.
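
For reference, the user-space ABI involved (from
include/uapi/linux/futex.h; the per-thread head is what the
set_robust_list() syscall registers):

	struct robust_list {
		struct robust_list __user *next;
	};

	struct robust_list_head {
		struct robust_list list;
		long futex_offset;
		struct robust_list __user *list_op_pending;
	};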

So solution #1 will prevent oom reaping completely simply because the
pointer is set on every user space task.

Solutions #2 and #3 are incomplete and just awful hacks which cure one
particular case: A single threaded process. Why?

The chosen oom reaper victim is a process, so what does it help to check
or cleanup the robust list for _ONE_ thread? Nothing because the other
threads can hold robust futexes and then run into the same problem.

Aside of that you seem to believe that the robust list head in the TLS
is the only part which is relevant. That's wrong. The list head is
either NULL or points to the innermost pthread_mutex which is held by a
task. Now look at this example:

  TLS:robust_list -> mutex2 -> mutex1

mutex1 is the shared one which needs to be released so that other
processes can make progress. mutex2 is a process private one which
resides in a different VMA. So now if you filter the robust list and
refuse to reap the TLS VMA, what prevents the other VMA from being
reaped? If that's reaped then mutex1 is not reachable.

Now vs. cleaning up the robust list from the oom reaper context. That
should be doable with a lot of care, but the proposed patch is not even
close to a solution. It's simply broken.

> -static void futex_cleanup_begin(struct task_struct *tsk)
> +static bool futex_cleanup_begin(struct task_struct *tsk, bool try)
>  {
>  	/*
>  	 * Prevent various race issues against a concurrent incoming waiter
> @@ -1055,7 +1056,12 @@ static void futex_cleanup_begin(struct task_struct *tsk)
>  	 * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in
>  	 * attach_to_pi_owner().
>  	 */
> -	mutex_lock(&tsk->futex_exit_mutex);
> +	if (try) {
> +		if (!mutex_trylock(&tsk->futex_exit_mutex))
> +			return false;
> +	} else {
> +		mutex_lock(&tsk->futex_exit_mutex);
> +	}

That conditional locking is disgusting.
  
>  void futex_exit_release(struct task_struct *tsk)
>  {
> -	futex_cleanup_begin(tsk);
> +	futex_cleanup_begin(tsk, false);

If the task already cleaned up the robust list then this will roll back
tsk->futex_state from FUTEX_STATE_DEAD to FUTEX_STATE_EXITING. Sigh...

> +	futex_cleanup(tsk);
> +	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
> +}
> +
> +/* Try to perform the futex_cleanup and return true if successful.

This is not a proper multi line comment.

     /*
      * Multi line comments look like this:
      *
      * Properly formatted.
      *
      * Don't try to use the network comment style
      * on anything outside of networking.
      */

> + * Designed to be called from the context of the OOM Reaper.

Let's talk about design later.

> + */
> +bool try_futex_exit_release(struct task_struct *tsk)
> +{
> +	if (!futex_cleanup_begin(tsk, true))
> +		return false;
> +
> +	/* We are calling this from the context of a kthread. We need to
> +	 * instruct the kthread to use the address space of the given mm
> +	 * so the get_user won't return -EFAULT.

How is this preventing get_user() or any other operation on the task's
user memory to return -EFAULT? Not at all. Any user access can fail and
return -EFAULT. Comments are there to explain things not to create
confusion.

> +	 */
> +	kthread_use_mm(tsk->mm);
>  	futex_cleanup(tsk);

But aside of that. How is this supposed to work correctly?

oom_reaper()
  oom_reap_task()
    oom_reap_task_mm()
      mmap_read_trylock(mm) <- Succeeds
        try_futex_exit_release()
          use_mm()
          futex_cleanup()
            get_user() -> #PF

#PF
  if (!mmap_read_trylock(mm)) {

So here the problem starts. The trylock can succeed or not, depending
on the contention state of mm::mmap_lock.

So in case the trylock fails because there is a writer waiting, then it
runs into this:

     if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
        ....
        return;
     }

This condition evaluates to false because get_user() has an
exception table entry. So this proceeds and does:
     
     mmap_read_lock(mm);

which is a full dead lock.

But even if the trylock succeeds then this runs into the full fault
path, which is not correct either for pretty obvious reasons.

I assume that's all part of the design, right?

But the real questions here are:

   Why are we doing this remote reaping at all?

   What is the condition that a task which is killed with a fatal signal
   does not reach do_exit() and cleans up itself?

If the answer is "because", then we should rather make sure that this
gets fixed.

If there is a legitimate reason why a task cannot handle a fatal signal,
then yes the oom reaper might be necessary, but unleashing the oom
reaper unconditionally is simply a bad idea and results in the problem
which this is trying to paper over.

The oom reaper should be the last resort IMO and not racing against the
killed task in the first place. IOW, give the task some time to clean
itself up and if that fails and it is truly stuck and unable to do so,
then reap the mm. But that should be the rare case and then the stuck
futex should be the least of our worries.

Thanks,

        tglx
Michal Hocko March 22, 2022, 4:36 p.m. UTC | #7
On Tue 22-03-22 16:17:05, Thomas Gleixner wrote:
> On Tue, Mar 22 2022 at 09:26, Michal Hocko wrote:
> > On Mon 21-03-22 19:57:24, Davidlohr Bueso wrote:
> >> On Mon, 21 Mar 2022, Nico Pache wrote:
> >> 
> >> > We could proceed with the V3 approach; however, if we are able to find a complete
> >> > solution that keeps both functionalities (Concurrent OOM Reaping & Robust Futex)
> >> > working, I don't see why we wouldn't go for it.
> 
> See below.
> 
> >> Because semantically killing the process is, imo, the wrong thing to do.
> >
> > I am not sure I follow. The task has been killed by the oom killer. All
> > we are discussing here is how to preserve the robust list metadata
> > stored in the memory which is normally unmapped by the oom_reaper to
> guarantee further progress.
> >
> > I can see we have 4 potential solutions:
> > 1) do not oom_reap oom victims with robust futex metadata in anonymous
> >    memory. Easy enough but it could lead to excessive oom killing in
> >    case the victim gets stuck in the kernel and cannot terminate.
> > 2) clean up robust list from the oom_reaper context. Seems tricky due to
> >    #PF handling from the oom_reaper context which would need to be
> >    non-blocking
> > 3) filter vmas which contain robust list. Simple check for the vma
> >    range
> > 4) internally mark vmas which have to preserve the state during
> >    oom_reaping. Futex code would somehow have to mark those mappings.
> >    While more generic solution. I am not sure this is a practical
> >    approach. 
> 
> And all of that is based on wishful thinking, really. Let me explain.
> 
> The task::robust_list pointer is set unconditionally by NPTL for every
> thread of a process. It points to the 'list head' which is in the
> TLS. But this does not tell whether the task holds a robust futex or
> not. That's evaluated in the futex exit handling code.
> 
> So solution #1 will prevent oom reaping completely simply because the
> pointer is set on every user space task.

This is really what I was worried about.

> Solutions #2 and #3 are incomplete and just awful hacks which cure one
> particular case: A single threaded process. Why?
> 
> The chosen oom reaper victim is a process, so what does it help to check
> or cleanup the robust list for _ONE_ thread? Nothing because the other
> threads can hold robust futexes and then run into the same problem.
> 
> Aside of that you seem to believe that the robust list head in the TLS
> is the only part which is relevant. That's wrong. The list head is
> either NULL or points to the innermost pthread_mutex which is held by a
> task. Now look at this example:
> 
>   TLS:robust_list -> mutex2 -> mutex1
> 
> mutex1 is the shared one which needs to be released so that other
> processes can make progress. mutex2 is a process private one which
> resides in a different VMA. So now if you filter the robust list and
> refuse to reap the TLS VMA, what prevents the other VMA from being
> reaped? If that's reaped then mutex1 is not reachable.
> 
> Now vs. cleaning up the robust list from the oom reaper context. That
> should be doable with a lot of care, but the proposed patch is not even
> close to a solution. It's simply broken.

My knowledge about robust futexes and how they are implemented is close
to zero. My thinking has been that the primary reason for the lockup
reported initially is that the oom_reaper will corrupt the memory which
is backing the list of locks to be woken up. So if we rule out the
oom_reaper for that memory then the exit path can do its proper cleanup.
Is that assumption completely off?

I cannot really comment on the futex specific parts of your response but
the deadlock on the mmap_lock has been already pointed out, thanks for
confirming that.

[...]

> But the real questions here are:
> 
>    Why are we doing this remote reaping at all?

Does aac453635549 ("mm, oom: introduce oom reaper") help? In short this
is to guarantee that the system is able to make forward progress under
OOM. Sending SIGKILL to the oom victim is not sufficient, really. Tasks
can be blocked inside the kernel for an indefinite amount of time. E.g. being
blocked waiting on memory transitively over locks. Reclaiming the
anonymous memory from the killed process will allow us to free up some
memory while the oom victim is blocked, and allow it to move forward and
eventually die and do the proper cleanup. We are focusing on the
anonymous memory because, under the assumption that such memory is private
to the process and the process is dead, the freed memory is not
really visible any more.

>    What is the condition that a task which is killed with a fatal signal
>    does not reach do_exit() and cleans up itself?
> 
> If the answer is "because", then we should rather make sure that this
> gets fixed.

While some places can be handled by changing uninterruptible waiting to
killable there are places which are not really fixable, e.g. lock
chain dependency which leads to memory allocation.

> If there is a legitimate reason why a task cannot handle a fatal signal,
> then yes the oom reaper might be necessary, but unleashing the oom
> reaper unconditionally is simply a bad idea and results in the problem
> which this is trying to paper over.
> 
> The oom reaper should be the last resort IMO and not racing against the
> killed task in the first place. IOW, give the task some time to clean
> itself up and if that fails and it is truly stuck and unable to do so,
> then reap the mm. But that should be the rare case and then the stuck
> futex should be the least of our worries.

Yes, the oom_reaper is the last resort indeed. It is true that it can
fire later but I do not see how this would solve this particular
problem. It is fundamentally wrong to reap the memory which the exit
path really needs to do a proper clean up.

And just to be clear, this is clearly a bug in the oom_reaper per se.
Originally I thought that relaxing the locking (using trylock and
retry/bail out on failure) would help but as I've learned earlier this
day this is not really possible because of #PF at least. The most self
contained solution would be to skip over vmas which are backing the
robust list which would allow the regular exit path to do the proper
cleanup.

Thanks!
Thomas Gleixner March 22, 2022, 10:43 p.m. UTC | #8
Michal,

On Tue, Mar 22 2022 at 17:36, Michal Hocko wrote:
> On Tue 22-03-22 16:17:05, Thomas Gleixner wrote:
>> The task::robust_list pointer is set unconditionally by NPTL for every
>> thread of a process. It points to the 'list head' which is in the
>> TLS. But this does not tell whether the task holds a robust futex or
>> not. That's evaluated in the futex exit handling code.
>> 
>> So solution #1 will prevent oom reaping completely simply because the
>> pointer is set on every user space task.
>
> This is really what I was worried about.
>
>> Solutions #2 and #3 are incomplete and just awful hacks which cure one
>> particular case: A single threaded process. Why?
>> 
>> The chosen oom reaper victim is a process, so what does it help to check
>> or cleanup the robust list for _ONE_ thread? Nothing because the other
>> threads can hold robust futexes and then run into the same problem.
>> 
>> Aside of that you seem to believe that the robust list head in the TLS
>> is the only part which is relevant. That's wrong. The list head is
>> either NULL or points to the innermost pthread_mutex which is held by a
>> task. Now look at this example:
>> 
>>   TLS:robust_list -> mutex2 -> mutex1
>> 
>> mutex1 is the shared one which needs to be released so that other
>> processes can make progress. mutex2 is a process private one which
>> resides in a different VMA. So now if you filter the robust list and
>> refuse to reap the TLS VMA, what prevents the other VMA from being
>> reaped? If that's reaped then mutex1 is not reachable.
>> 
>> Now vs. cleaning up the robust list from the oom reaper context. That
>> should be doable with a lot of care, but the proposed patch is not even
>> close to a solution. It's simply broken.
>
> My knowledge about robust futexes and how they are implemented is close
> to zero. My thinking has been that the primary reason for the lockup
> reported initially is that the oom_reaper will corrupt the memory which
> is backing the list of locks to be woken up. So if we rule out the
> oom_reaper for that memory then the exit path can do its proper cleanup.
> Is that assumption completely off?

All the kernel knows is which VMA contains a robust list head. But it
does not know where the actual futexes/mutexes reside which might be
held by a task. Here is the lock chain example again:

      TLS:robust_list -> mutex2 -> mutex1
VMA   Known              Unknown   Shared

So if TLS and mutex2 sit in two different VMAs and the mutex2 VMA gets
reaped then mutex1 cannot be reached anymore, which in consequence means
that it cannot be cleaned up.

> I cannot really comment on the futex specific parts of your response but
> the deadlock on the mmap_lock has been already pointed out, thanks for
> confirming that.

It just occurred to me that doing the cleanup from a kernel thread is
completely broken vs. PI futexes, which are also cleaned up by
futex_cleanup().

The PI exit cleanup code _cannot_ be handled by a foreign task at
all. It will nicely explode in the rtmutex code when it tries to release
the kernel side rtmutex which represents the held and contended user
space futex.

Cleaning up futexes from a kthread is not going to work at least not
without creating yet another pile of horrible hacks in the futex and
rtmutex code. We have enough of them already, so no.

Just for the case that someone thinks we make that a special case for
non-PI futexes: No, that's not going to happen.

> [...]
>
>> But the real questions here are:
>> 
>>    Why are we doing this remote reaping at all?
>
> Does aac453635549 ("mm, oom: introduce oom reaper") help? In short this
> is to guarantee that the system is able to make forward progress under
> OOM. Sending SIGKILL to the oom victim is not sufficient, really. Tasks
> can be blocked inside the kernel for an indefinite amount of time. E.g. being
> blocked waiting on memory transitively over locks. Reclaiming the
> anonymous memory from the killed process will allow us to free up some
> memory while the oom victim is blocked, and allow it to move forward and
> eventually die and do the proper cleanup. We are focusing on the
> anonymous memory because, under the assumption that such memory is private
> to the process and the process is dead, the freed memory is not
> really visible any more.

Which is correct. But then you have to bite the bullet and accept that
the futexes cannot be recovered.

>>    What is the condition that a task which is killed with a fatal signal
>>    does not reach do_exit() and cleans up itself?
>> 
>> If the answer is "because", then we should rather make sure that this
>> gets fixed.
>
> While some places can be handled by changing uninterruptible waiting to
> killable there are places which are not really fixable, e.g. lock
> chain dependency which leads to memory allocation.

I'm not following. Which lock chain dependency causes memory allocation?

Also aren't oom killed tasks allowed to allocate from emergency buffers
in order to clean themselves up? That's at least what the comments in the
oom code claim.

Aside of that. Do you have call chains which show in which situation
such a stuck task is?

>> If there is a legitimate reason why a task cannot handle a fatal signal,
>> then yes the oom reaper might be necessary, but unleashing the oom
>> reaper unconditionally is simply a bad idea and results in the problem
>> which this is trying to paper over.
>> 
>> The oom reaper should be the last resort IMO and not racing against the
>> killed task in the first place. IOW, give the task some time to clean
>> itself up and if that fails and it is truly stuck and unable to do so,
>> then reap the mm. But that should be the rare case and then the stuck
>> futex should be the least of our worries.
>
> Yes, the oom_reaper is the last resort indeed. It is true that it can
> fire later but I do not see how this would solve this particular
> problem.

Well, it at least solves the problem which is described in the
changelog. Because that problem clearly is a race between the woken up
oom reaper and the task which killed itself in #PF due to OOM.

Of course it won't solve the problem of tasks which are stuck forever.

But right now the oom reaper thread is immediately woken after sending
SIGKILL and the same is true if the oom killer targets a process which
is already exiting.

IOW, the current implementation is enforcing the race of the oom reaper
vs. the exiting and/or killed process. With a quite high probability the
reaper is going to win.

You can easily validate that by doing:

wake_oom_reaper(task)
   task->reap_time = jiffies + HZ;
   queue_task(task);
   wakeup(reaper);

and then:

oom_reap_task(task)
    now = READ_ONCE(jiffies);
    if (time_before(now, task->reap_time))
        schedule_timeout_idle(task->reap_time - now);

before trying to actually reap the mm.

That will prevent the enforced race in most cases and allow the exiting
and/or killed processes to clean themselves up. Not pretty, but it should
reduce the chance of the reaper to win the race with the exiting and/or
killed process significantly.

It's not going to work when the problem is combined with a heavy VM
overload situation which keeps a guest (or one/some of its vCPUs) away
from being scheduled. See below for a discussion of guarantees.

If it failed to do so when the sleep returns, then you still can reap
it.

> It is fundamentally wrong to reap the memory which the exit path
> really needs to do a proper clean up.

Depends on the guarantees you make. If you say preventing OOM starvation
at the end is more important than a stale futex, then it's fine as long
as it is properly documented.

That said, the robust list is no guarantee. It's a best effort approach
which works well most of the time at least for the "normal" issues where
a task holding a futex dies unexpectedly. But there is no guarantee that
it works under all circumstances, e.g. OOM.

Sorry Nico, but your understanding

>> On Mon, Mar 21 2022 at 21:09, Nico Pache wrote:
>> From my understanding, the whole point of the robust futex is to allow forward
>> progress in an application in which the lock holder CAN
>> crash/exit/oom.

is based on wishful thinking. There is absolutely no guarantee made by
robust futexes. Why?

Because the kernel can neither make a guarantee vs. user space
controlled content nor vs. dealing with user space stupidity, e.g. a
runaway memory hog.

This is _NOT_ what robust list is about. Robust list was introduced to
handle the unfortunate case where a task holding a futex dies
unexpectedly.

Extending this to OOM and expecting it to work under all circumstances
is really wishful thinking.

>> So semantically nothing is wrong with killing the futex holder... the
>> whole point of the robustness is to handle these cases.

Wrong. The whole point of robust lists is to handle the "normal" case
gracefully. A process being OOM killed is _NOT_ in the "normal"
category.

Neither is it "normal" that a VM is scheduled out long enough to miss a
1 second deadline. That might be considered normal by cloud folks, but
that's absolute not normal from an OS POV. Again, that's not a OS
problem, that's an operator/admin problem.

>> We just have a case where the oom killer is racing with said handling
>> of the futex, invalidating the memory before the exit path
>> (handle_futex_death) can awake one of the other waiters.

That case is just a symptom of the overall problem and no, we are not
curing symptoms. Especially not by introducing problems which did not
exist before the "cure".

If the kernel has only the choice of making no progress at all or
leaving a stale futex around occasionally, then the latter is an
unfortunate collateral damage, but not the end of the world.

The enforced race of the oom reaper is not collateral damage, that's
simply stupid. And that's the root cause for the particular symptom,
which needs to be addressed exactly there and nowhere else.

If that does not cover _all_ corner cases, then so be it. You _cannot_
solve that problem completely ever.

As I said above there is a final choice between not making progress at
all and a rare stale futex. It's not rocket science to pick the right
one.

> And just to be clear, this is clearly a bug in the oom_reaper per se.
> Originally I thought that relaxing the locking (using trylock and
> retry/bail out on failure) would help but as I've learned earlier this
> day this is not really possible because of #PF at least. The most self
> contained solution would be to skip over vmas which are backing the
> robust list which would allow the regular exit path to do the proper
> cleanup.

That's not sufficient because you have to guarantee that the relevant
shared futex is accessible. See the lock chain example above.

OTOH, in combination with delaying the reaper skipping the VMAs, which
contain a robust list head, might be good enough for the trivial
cases where the shared futex is the one to which the robust list head
points to.

Emphasis on VMAs. You need to evaluate every task's robust list pointer
of the process not only the one of the task which gets queued for
reaping.

So let me summarize what I think needs to be done in priority order:

 #1 Delay the oom reaper so the normal case of a process being able to
    exit is not subject to a pointless race to death.

 #2 If #1 does not result in the desired outcome, reap the mm (which is
    what we have now).

 #3 If it's expected that #2 will allow the stuck process to make
    progress on the way towards cleanup, then do not reap any VMA
    containing a robust list head of a thread which belongs to the
    exiting and/or killed process.

The remaining cases, i.e. the lock chain example I pointed out above or
the stuck forever task are going to be rare and fall under the
collateral damage and no guarantee rule.

Thanks,

        tglx
---
P.S.: I so hoped that after my first encounter with the oom killer
      almost two decades ago I wouldn't have to deal with this horror
      again. Bah!
Michal Hocko March 23, 2022, 9:17 a.m. UTC | #9
Let me skip over futex part which I need to digest and only focus on the
oom side of the things for clarification.

On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
[...]
> > While some places can be handled by changing uninterruptible waiting to
> > killable there are places which are not really fixable, e.g. lock
> > chain dependency which leads to memory allocation.
> 
> I'm not following. Which lock chain dependency causes memory allocation?

Consider an oom victim is blocked on a lock or waiting for an event to
happen but the lock holder is stuck allocating or the wake up depends on
an allocation. Many sleeping locks are doing GFP_KERNEL allocations.
 
> Also aren't oom killed tasks allowed to allocate from emergency buffers
> in order to clean themself up? That's at least what the comments in the
> oom code claim.

Yes, this is the case. And this is a slightly easier scenario. A dip
into memory reserves could help to make forward progress indeed. But
as mentioned above the victim can be doing something else entirely.

> Aside of that. Do you have call chains which show in which situation
> such a stuck task is?

I do remember Tetsuo was really good at triggering those issues. It was
no rocket science. Essentially a memory stress test hammering FS code
and usually triggering dependencies of oom victims either on the lock or
wait queue which is stuck allocating.

I have managed to push details about specific locks or traces out of my
brain but just from a high level look, checking how many GFP_KERNEL
allocations are done from inside a mutex or semaphore shared among
processes, the dependency chain is simply unavoidable.

[...]

> You can easily validate that by doing:
> 
> wake_oom_reaper(task)
>    task->reap_time = jiffies + HZ;
>    queue_task(task);
>    wakeup(reaper);
> 
> and then:
> 
> oom_reap_task(task)
>     now = READ_ONCE(jiffies);
>     if (time_before(now, task->reap_time))
>         schedule_timeout_idle(task->reap_time - now);
> 
> before trying to actually reap the mm.
> 
> That will prevent the enforced race in most cases and allow the exiting
> and/or killed processes to clean themselves up. Not pretty, but it should
> reduce the chance of the reaper to win the race with the exiting and/or
> killed process significantly.
> 
> It's not going to work when the problem is combined with a heavy VM
> overload situation which keeps a guest (or one/some of its vCPUs) away
> from being scheduled. See below for a discussion of guarantees.
> 
> If it failed to do so when the sleep returns, then you still can reap
> it.

Yes, this is certainly an option. Please note that the oom_reaper is not
the only way to trigger this. process_mrelease syscall performs the same
operation from the userspace. Arguably process_mrelease could be used
sanely/correctly because the userspace oom killer can do pro-cleanup
steps before going to final SIGKILL & process_mrelease. One way would be
to send SIGTERM in the first step and allow the victim to perform its
cleanup.

> > It is fundamentally wrong to reap the memory which the exit path
> > really needs to do a proper clean up.
> 
> Depends on the guarantees you make. If you say preventing OOM starvation
> at the end is more important than a stale futex, then it's fine as long
> as it is properly documented.

Yes, this is the case we have made on process core dumps. They are
simply incomplete for oom victims. I thought that handling robust futex
lists requires some guarantees. More on that later

> That said, the robust list is no guarantee. It's a best effort approach
> which works well most of the time at least for the "normal" issues where
> a task holding a futex dies unexpectedly. But there is no guarantee that
> it works under all circumstances, e.g. OOM.

OK, so this is an important note. I am all fine with documenting this
restriction. It is not like oom victims couldn't cause other disruptions
by leaving inconsistent/stale state behind.
 
> Sorry Nico, but your understanding
> 
> >> On Mon, Mar 21 2022 at 21:09, Nico Pache wrote:
> >> From my understanding, the whole point of the robust futex is to allow forward
> >> progress in an application in which the lock holder CAN
> >> crash/exit/oom.
> 
> is based on wishful thinking. There is absolutely no guarantee made by
> robust futexes. Why?
> 
> Because the kernel can neither make a guarantee vs. user space
> controlled content nor vs. dealing with user space stupidity, e.g. a
> runaway memory hog.
> 
> This is _NOT_ what robust list is about. Robust list was introduced to
> handle the unfortunate case where a task holding a futex dies
> unexpectedly.
> 
> Extending this to OOM and expecting it to work under all circumstances
> is really wishful thinking.
> 
> >> So semantically nothing is wrong with killing the futex holder... the
> >> whole point of the robustness is to handle these cases.
> 
> Wrong. The whole point of robust lists is to handle the "normal" case
> gracefully. A process being OOM killed is _NOT_ in the "normal"
> category.
> 
> Neither is it "normal" that a VM is scheduled out long enough to miss a
> 1 second deadline. That might be considered normal by cloud folks, but
> that's absolute not normal from an OS POV. Again, that's not a OS
> problem, that's an operator/admin problem.

Thanks for this clarification. I would tend to agree. Following the
previous example, oom victims can leave inconsistent state behind
which can influence other processes. I am wondering what kind of
expectations about the lock protected state we can make when the holder
of the lock has been interrupted at any random place in the critical
section.

[...]
> > And just to be clear, this is clearly a bug in the oom_reaper per se.
> > Originally I thought that relaxing the locking (using trylock and
> > retry/bail out on failure) would help but as I've learned earlier this
> > day this is not really possible because of #PF at least. The most self
> > contained solution would be to skip over vmas which are backing the
> > robust list which would allow the regular exit path to do the proper
> > cleanup.
> 
> That's not sufficient because you have to guarantee that the relevant
> shared futex is accessible. See the lock chain example above.

Yeah, my previous understanding was that the whole linked list lives in
the single mapping and we can just look at their addresses.

> OTOH, in combination with delaying the reaper skipping the VMAs, which
> contain a robust list head, might be good enough for the trivial
> cases where the shared futex is the one to which the robust list head
> points to.
> 
> Emphasis on VMAs. You need to evaluate every task's robust list pointer
> of the process not only the one of the task which gets queued for
> reaping.
> 
> So let me summarize what I think needs to be done in priority order:
> 
>  #1 Delay the oom reaper so the normal case of a process being able to
>     exit is not subject to a pointless race to death.
> 
>  #2 If #1 does not result in the desired outcome, reap the mm (which is
>     what we have now).
> 
>  #3 If it's expected that #2 will allow the stuck process to make
>     progress on the way towards cleanup, then do not reap any VMA
>     containing a robust list head of a thread which belongs to the
>     exiting and/or killed process.
> 
> The remaining cases, i.e. the lock chain example I pointed out above or
> the stuck forever task are going to be rare and fall under the
> collateral damage and no guarantee rule.

I do agree that delaying oom_reaper wake up is the simplest thing to do
at this stage and it could catch up most of the failures. We still have
process_mrelease syscall case but I guess we can document this as a
caveat into the man page.

Thanks!

> Thanks,
> 
>         tglx
> ---
> P.S.: I so hoped that after my first encounter with the oom killer
>       almost two decades ago I wouldn't have to deal with this horror
>       again. Bah!

heh, a nice place full of horrors and hard to see dependencies
Thomas Gleixner March 23, 2022, 10:30 a.m. UTC | #10
Michal!

On Wed, Mar 23 2022 at 10:17, Michal Hocko wrote:
> Let me skip over futex part which I need to digest and only focus on the
> oom side of the things for clarification.

The most important thing to know about futexes is: They are cursed.

> On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
> [...]
>> > While some places can be handled by changing uninterruptible waiting to
>> > killable there are places which are not really fixable, e.g. lock
>> > chain dependency which leads to memory allocation.
>> 
>> I'm not following. Which lock chain dependency causes memory allocation?
>
> Consider an oom victim is blocked on a lock or waiting for an event to
> happen but the lock holder is stuck allocating or the wake up depends on
> an allocation. Many sleeping locks are doing GFP_KERNEL allocations.

Fair enough.
  
>> That will prevent the enforced race in most cases and allow the exiting
>> and/or killed processes to clean themselves up. Not pretty, but it should
>> reduce the chance of the reaper to win the race with the exiting and/or
>> killed process significantly.
>> 
>> It's not going to work when the problem is combined with a heavy VM
>> overload situation which keeps a guest (or one/some of its vCPUs) away
>> from being scheduled. See below for a discussion of guarantees.
>> 
>> If it failed to do so when the sleep returns, then you still can reap
>> it.
>
> Yes, this is certainly an option. Please note that the oom_reaper is not
> the only way to trigger this. process_mrelease syscall performs the same
> operation from the userspace. Arguably process_mrelease could be used
> sanely/correctly because the userspace oom killer can do proactive cleanup
> steps before going to final SIGKILL & process_mrelease. One way would be
> to send SIGTERM in the first step and allow the victim to perform its
> cleanup.

A potential staged approach would be:

  Send SIGTERM
  wait some time
  Send SIGKILL
  wait some time
  sys_process_mrelease()

Needs proper documentation though.
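
From the user-space side that could look like this (a sketch using the raw
syscalls; no error handling, and the wait times are arbitrary):

	#include <signal.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	static void staged_kill(pid_t pid)
	{
		int pidfd = syscall(SYS_pidfd_open, pid, 0);

		syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0);
		sleep(1);	/* let the victim run its own cleanup */
		syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
		sleep(1);	/* let the exit path handle futex death */
		syscall(SYS_process_mrelease, pidfd, 0);
		close(pidfd);
	}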

>> That said, the robust list is no guarantee. It's a best effort approach
>> which works well most of the time at least for the "normal" issues where
>> a task holding a futex dies unexpectedly. But there is no guarantee that
>> it works under all circumstances, e.g. OOM.
>
> OK, so this is an important note. I am all fine by documenting this
> restriction. It is not like oom victims couldn't cause other disruptions
> by leaving inconsistent/stale state behind.

Correct. Futexes are a small part of the overall damage.

>> Wrong. The whole point of robust lists is to handle the "normal" case
>> gracefully. A process being OOM killed is _NOT_ in the "normal"
>> category.
>> 
>> Neither is it "normal" that a VM is scheduled out long enough to miss a
>> 1 second deadline. That might be considered normal by cloud folks, but
>> that's absolute not normal from an OS POV. Again, that's not a OS
>> problem, that's an operator/admin problem.
>
> Thanks for this clarification. I would tend to agree. Following the
> previous example, oom victims can leave inconsistent state behind
> which can influence other processes. I am wondering what kind of
> expectations about the lock protected state we can make when the holder
> of the lock has been interrupted at any random place in the critical
> section.

Right. If a futex is released by the exit cleanup, it is marked with
FUTEX_OWNER_DIED. User space is supposed to handle this.

pthread_mutex_lock() returns EOWNERDEAD to the caller if the owner died
bit is set. It's the callers responsibility to deal with potentially
corrupted or inconsistent state.
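
In user space that looks roughly like the following (a minimal sketch; real
code would validate the protected data before marking it consistent):

	#include <pthread.h>
	#include <errno.h>

	static pthread_mutex_t m;

	static void robust_init(void)
	{
		pthread_mutexattr_t attr;

		pthread_mutexattr_init(&attr);
		pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
		pthread_mutex_init(&m, &attr);
	}

	static void robust_lock(void)
	{
		if (pthread_mutex_lock(&m) == EOWNERDEAD) {
			/* previous owner died; repair state, then: */
			pthread_mutex_consistent(&m);
		}
	}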

>> So let me summarize what I think needs to be done in priority order:
>> 
>>  #1 Delay the oom reaper so the normal case of a process being able to
>>     exit is not subject to a pointless race to death.
>> 
>>  #2 If #1 does not result in the desired outcome, reap the mm (which is
>>     what we have now).
>> 
>>  #3 If it's expected that #2 will allow the stuck process to make
>>     progress on the way towards cleanup, then do not reap any VMA
>>     containing a robust list head of a thread which belongs to the
>>     exiting and/or killed process.
>> 
>> The remaining cases, i.e. the lock chain example I pointed out above or
>> the stuck forever task are going to be rare and fall under the
>> collateral damage and no guarantee rule.
>
> I do agree that delaying oom_reaper wake up is the simplest thing to do
> at this stage and it could catch up most of the failures. We still have
> process_mrelease syscall case but I guess we can document this as a
> caveat into the man page.

Yes. The user space oom killer should better know what it is doing :)

Thanks,

        tglx
Peter Zijlstra March 23, 2022, 11:11 a.m. UTC | #11
On Wed, Mar 23, 2022 at 10:17:28AM +0100, Michal Hocko wrote:

> > Neither is it "normal" that a VM is scheduled out long enough to miss a
> > 1 second deadline. That might be considered normal by cloud folks, but
> > that's absolute not normal from an OS POV. Again, that's not a OS
> > problem, that's an operator/admin problem.
> 
> Thanks for this clarification. I would tend to agree. Following the
> previous example, oom victims can leave inconsistent state behind
> which can influence other processes. I am wondering what kind of
> expectations about the lock protected state we can make when the holder
> of the lock has been interrupted at any random place in the critical
> section.

Right, this is why the new owner gets the OWNER_DIED bit so it can see
something really dodgy happened.

Getting that means it needs to validate state consistency or just print
a nice error and fully terminate things.

So robust futexes:

 - rely on userspace to maintain a linked list of held locks,

 - rely on lock acquire to check OWNER_DIED and handle state
   inconsistency.

If userspace manages to screw up either one of those, it's game over.
Nothing we can do about it.

Software really has to be built to deal with this, it doesn't magically
work (IOW, in 99% of the cases it just doesn't work right).

> [...]
> > > And just to be clear, this is clearly a bug in the oom_reaper per se.
> > > Originally I thought that relaxing the locking (using trylock and
> > > retry/bail out on failure) would help but as I've learned earlier this
> > > day this is not really possible because of #PF at least. The most self
> > > contained solution would be to skip over vmas which are backing the
> > > robust list which would allow the regular exit path to do the proper
> > > cleanup.
> > 
> > That's not sufficient because you have to guarantee that the relevant
> > shared futex is accessible. See the lock chain example above.
> 
> Yeah, my previous understanding was that the whole linked list lives in
> the single mapping and we can just look at their addresses.

Nope; shared futexes live in shared memory and as such the robust_list
entry must live there too. That is, the robust_list entry is embedded in
the lock itself along the lines of:

struct robust_mutex {
	u32 futex;
	struct robust_list list;
};

and then you register the robust_list_head with:

 .futex_offset = offsetof(struct robust_mutex, futex) -
		 offsetof(struct robust_mutex, list);

or somesuch (glibc does all this). And the locks themselves are spread
all over the place.
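
And the registration itself is a single syscall per thread; roughly what
glibc does at thread start (a sketch, reusing the robust_mutex layout
above):

	#include <stddef.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/futex.h>

	static __thread struct robust_list_head head;

	static void register_robust_list(void)
	{
		head.list.next = &head.list;	/* empty list points at itself */
		head.futex_offset = offsetof(struct robust_mutex, futex) -
				    offsetof(struct robust_mutex, list);
		head.list_op_pending = NULL;
		syscall(SYS_set_robust_list, &head, sizeof(head));
	}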
Michal Hocko March 30, 2022, 9:18 a.m. UTC | #12
Nico,

On Wed 23-03-22 10:17:29, Michal Hocko wrote:
> Let me skip over futex part which I need to digest and only focus on the
> oom side of the things for clarification.
> 
> On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
[...]
> > You can easily validate that by doing:
> > 
> > wake_oom_reaper(task)
> >    task->reap_time = jiffies + HZ;
> >    queue_task(task);
> >    wakeup(reaper);
> > 
> > and then:
> > 
> > oom_reap_task(task)
> >     now = READ_ONCE(jiffies);
> >     if (time_before(now, task->reap_time))
> >         schedule_timeout_idle(task->reap_time - now);
> > 
> > before trying to actually reap the mm.
> > 
> > That will prevent the enforced race in most cases and allow the exiting
> > and/or killed processes to clean themselves up. Not pretty, but it should
> > reduce the chance of the reaper to win the race with the exiting and/or
> > killed process significantly.
> > 
> > It's not going to work when the problem is combined with a heavy VM
> > overload situation which keeps a guest (or one/some of its vCPUs) away
> > from being scheduled. See below for a discussion of guarantees.
> > 
> > If it failed to do so when the sleep returns, then you still can reap
> > it.
> 
> Yes, this is certainly an option. Please note that the oom_reaper is not
> the only way to trigger this. process_mrelease syscall performs the same
> operation from the userspace. Arguably process_mrelease could be used
> sanely/correctly because the userspace oom killer can do proactive cleanup
> steps before going to final SIGKILL & process_mrelease. One way would be
> to send SIGTERM in the first step and allow the victim to perform its
> cleanup.

are you working on another version of the fix/workaround based on the
discussion so far?

I guess the most reasonable way forward is to rework
oom_reaper processing to be delayed. This can be done either by a
delayed wake up or, as Thomas suggests above, by postponing the
processing. I think the delayed wakeup would be _slightly_ easier to
handle because the queue can contain many tasks to be reaped.

More specifically something like delayed work but we cannot rely on
the WQ here. I guess we do not have any delayed wait queue interface
but the same trick with the timer should be applicable here as well.
exit_mmap would then cancel the timer after __oom_reap_task_mm is done.
Actually the timer could be canceled after mmu_notifier_release already
but this shouldn't make much of a difference.
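
Something along those lines, as a hypothetical sketch (it assumes a new
timer embedded in task_struct and reuses the existing reaper queue):

	#define OOM_REAPER_DELAY	(2 * HZ)

	static void wake_oom_reaper_fn(struct timer_list *timer)
	{
		struct task_struct *tsk = container_of(timer, struct task_struct,
						       oom_reaper_timer);

		/*
		 * Add tsk to oom_reaper_list and wake oom_reaper_th,
		 * as the current wake_oom_reaper() does.
		 */
	}

	static void wake_oom_reaper(struct task_struct *tsk)
	{
		timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper_fn, 0);
		tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
		add_timer(&tsk->oom_reaper_timer);
	}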
Nico Pache March 30, 2022, 6:18 p.m. UTC | #13
On 3/30/22 03:18, Michal Hocko wrote:
> Nico,
> 
> On Wed 23-03-22 10:17:29, Michal Hocko wrote:
>> Let me skip over futex part which I need to digest and only focus on the
>> oom side of the things for clarification.
>>
>> On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
> [...]
>>> You can easily validate that by doing:
>>>
>>> wake_oom_reaper(task)
>>>    task->reap_time = jiffies + HZ;
>>>    queue_task(task);
>>>    wakeup(reaper);
>>>
>>> and then:
>>>
>>> oom_reap_task(task)
>>>     now = READ_ONCE(jiffies);
>>>     if (time_before(now, task->reap_time))
>>>         schedule_timeout_idle(task->reap_time - now);
>>>
>>> before trying to actually reap the mm.
>>>
>>> That will prevent the enforced race in most cases and allow the exiting
>>> and/or killed processes to clean themselves up. Not pretty, but it should
>>> reduce the chance of the reaper winning the race with the exiting and/or
>>> killed process significantly.
>>>
>>> It's not going to work when the problem is combined with a heavy VM
>>> overload situation which keeps a guest (or one/some of its vCPUs) away
>>> from being scheduled. See below for a discussion of guarantees.
>>>
>>> If it failed to do so when the sleep returns, then you still can reap
>>> it.
>>
>> Yes, this is certainly an option. Please note that the oom_reaper is not
>> the only way to trigger this. process_mrelease syscall performs the same
>> operation from the userspace. Arguably process_mrelease could be used
>> sanely/correctly because the userspace oom killer can do pre-cleanup
>> steps before going to final SIGKILL & process_mrelease. One way would be
>> to send SIGTERM in the first step and allow the victim to perform its
>> cleanup.
> 
> are you working on another version of the fix/workaround based on the
> discussion so far?

We are indeed! Sorry for the delay; we've been taking the time to do our due
diligence on some of the claims made. We are also spending time rewriting the
reproducer to include more test cases that Thomas brought up.

I'll summarize here, and reply to the original emails in more detail....

Firstly, we have implemented & tested the VMA skipping... it does fix our case.
Thomas brought up a few good points about the robust list head and the potential
waiters being in different VMAs; however, I think it's a moot point, given that
the locks will only be reaped if allocated as ((private|anon)|| !shared).

If the locks are private|anon, then no two processes share the lock, and
OOMing will stop all the pthreads operating on this address space (no need
to wake the waiters).

If it's a shared lock, then the VMA will not be reaped and the dying process
can still recover the stuck waiter in the other process. We do, however, need
to skip the robust_list_head VMA, as that is in private|anon memory and is
subject to the race between the exit path and the OOM reaper described in our
previous commit logs.

We are trying to create a scenario in our reproducer that will break the VMA
skipping case, but have been unsuccessful so far.
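
To make the discussion concrete, the skipping amounts to roughly the
following sketch (vma_has_robust_head() is our naming for illustration;
this is not the exact patch):

static bool vma_has_robust_head(struct vm_area_struct *vma,
				struct task_struct *tsk)
{
	unsigned long head = (unsigned long)tsk->robust_list;

	/* a compat_robust_list would need the same treatment */
	return head >= vma->vm_start && head < vma->vm_end;
}

/* in __oom_reap_task_mm()'s walk over the address space: */
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (vma_has_robust_head(vma, tsk))
			continue;	/* preserve the robust list head */
		/* ... existing reaping logic ... */
	}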

> 
> I guess the most reasonable way forward is to rework
> oom_reaper processing to be delayed. This can be either done by a
> delayed wake up or as Thomas suggests above by postponing the
> processing. I think the delayed wakeup would be _slightly_ easier to
> handle because the queue can contain many tasks to be reaped.
> 
> More specifically, something like delayed work, but we cannot rely on
> the WQ here. I guess we do not have any delayed wait queue interface,
> but the same trick with a timer should be applicable here as well.
> exit_mmap would then cancel the timer after __oom_reap_task_mm is done.
> Actually the timer could be canceled after mmu_notifier_release already
> but this shouldn't make much of a difference.

We have not tried implementing the delayed work yet, but we see value in that
solution IFF we are able to break the VMA-skipping case. In that case I believe
we would combine the two approaches to further improve the robustness of the
futex, allowing for fewer 'best-effort' recovery bailouts.

Cheers,
-- Nico
Nico Pache March 30, 2022, 9:36 p.m. UTC | #14
On 3/30/22 12:18, Nico Pache wrote:
> 
> 
> On 3/30/22 03:18, Michal Hocko wrote:
>> Nico,
>>
>> On Wed 23-03-22 10:17:29, Michal Hocko wrote:
>>> Let me skip over futex part which I need to digest and only focus on the
>>> oom side of the things for clarification.
>>>
>>> On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
>> [...]
>>>> You can easily validate that by doing:
>>>>
>>>> wake_oom_reaper(task)
>>>>    task->reap_time = jiffies + HZ;
>>>>    queue_task(task);
>>>>    wakeup(reaper);
>>>>
>>>> and then:
>>>>
>>>> oom_reap_task(task)
>>>>     now = READ_ONCE(jiffies);
>>>>     if (time_before(now, task->reap_time))
>>>>         schedule_timeout_idle(task->reap_time - now);
>>>>
>>>> before trying to actually reap the mm.
>>>>
>>>> That will prevent the enforced race in most cases and allow the exiting
>>>> and/or killed processes to clean themselves up. Not pretty, but it should
>>>> reduce the chance of the reaper winning the race with the exiting and/or
>>>> killed process significantly.
>>>>
>>>> It's not going to work when the problem is combined with a heavy VM
>>>> overload situation which keeps a guest (or one/some of its vCPUs) away
>>>> from being scheduled. See below for a discussion of guarantees.
>>>>
>>>> If it failed to do so when the sleep returns, then you still can reap
>>>> it.
>>>
>>> Yes, this is certainly an option. Please note that the oom_reaper is not
>>> the only way to trigger this. process_mrelease syscall performs the same
>>> operation from the userspace. Arguably process_mrelease could be used
>>> sanely/correctly because the userspace oom killer can do pre-cleanup
>>> steps before going to final SIGKILL & process_mrelease. One way would be
>>> to send SIGTERM in the first step and allow the victim to perform its
>>> cleanup.
>>
>> are you working on another version of the fix/workaround based on the
>> discussion so far?
> 
> We are indeed! Sorry for the delay; we've been taking the time to do our due
> diligence on some of the claims made. We are also spending time rewriting the
> reproducer to include more test cases that Thomas brought up.
> 
> I'll summarize here, and reply to the original emails in more detail....
> 
> Firstly, we have implemented & tested the VMA skipping... it does fix our case.
> Thomas brought up a few good points about the robust list head and the potential
> waiters being in different VMAs; however, I think it's a moot point, given that
> the locks will only be reaped if allocated as ((private|anon)|| !shared).

Sorry... not completely moot.

As Thomas pointed out, a robust list with the following structure will probably
fail to recover its waiters:

TLS (robust head, skip)* --> private lock (reaped) --> shared lock (not reaped)

We are working on getting a test case with multiple locks and mixed mapping
types to prove this.
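
A sketch of what such a test case might look like, using the standard
pthread robust-mutex API (the part that drives the victim into the OOM
killer is omitted):

#include <pthread.h>
#include <sys/mman.h>

static pthread_mutex_t private_m;	/* private|anon memory, reapable */
static pthread_mutex_t *shared_m;	/* MAP_SHARED mapping, not reaped */

static void lock_chain(void)
{
	pthread_mutexattr_t pa, sa;

	shared_m = mmap(NULL, sizeof(*shared_m), PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	pthread_mutexattr_init(&sa);
	pthread_mutexattr_setrobust(&sa, PTHREAD_MUTEX_ROBUST);
	pthread_mutexattr_setpshared(&sa, PTHREAD_PROCESS_SHARED);
	pthread_mutex_init(shared_m, &sa);

	pthread_mutexattr_init(&pa);
	pthread_mutexattr_setrobust(&pa, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(&private_m, &pa);

	/*
	 * Lock the shared mutex first, then the private one: glibc
	 * prepends to the robust list, so it now reads
	 * head -> private_m -> shared_m, matching the chain above.
	 * A second process blocking on shared_m while this one is
	 * OOM-killed shows whether the waiter is recovered.
	 */
	pthread_mutex_lock(shared_m);
	pthread_mutex_lock(&private_m);
}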

Skipping the robust list head VMA will be beneficial in cases where the robust
list is full of shared locks:

TLS (robust head, skip)* --> shared lock(not reaped) --> shared lock(not reaped)

-- Nico
Nico Pache April 6, 2022, 5:22 p.m. UTC | #15
On 3/22/22 09:17, Thomas Gleixner wrote:
> On Tue, Mar 22 2022 at 09:26, Michal Hocko wrote:
>> On Mon 21-03-22 19:57:24, Davidlohr Bueso wrote:
>>> On Mon, 21 Mar 2022, Nico Pache wrote:
>>>
>>>> We could proceed with the V3 approach; however if we are able to find a complete
>>>> solution that keeps both functionalities (Concurrent OOM Reaping & Robust Futex)
>>>> working, I don't see why we wouldn't go for it.
> 
> See below.
> 
>>> Because semantically killing the process is, imo, the wrong thing to do.
>>
>> I am not sure I follow. The task has been killed by the oom killer. All
>> we are discussing here is how to preserve the robust list metadata
>> stored in the memory which is normally unmapped by the oom_reaper to
>> guarantee further progress.
>>
>> I can see we have 4 potential solutions:
>> 1) do not oom_reap oom victims with robust futex metadata in anonymous
>>    memory. Easy enough but it could lead to excessive oom killing in
>>    case the victim gets stuck in the kernel and cannot terminate.
>> 2) clean up robust list from the oom_reaper context. Seems tricky due to
>>    #PF handling from the oom_reaper context which would need to be
>>    non-blocking
>> 3) filter vmas which contain robust list. Simple check for the vma
>>    range
>> 4) internally mark vmas which have to preserve the state during
>>    oom_reaping. Futex code would somehow have to mark those mappings.
>>    While a more generic solution, I am not sure this is a practical
>>    approach.
> 
> And all of that is based on wishful thinking, really. Let me explain.
> 
> The task::robust_list pointer is set unconditionally by NPTL for every
> thread of a process. It points to the 'list head' which is in the
> TLS. But this does not tell whether the task holds a robust futex or
> not. That's evaluated in the futex exit handling code.

Ah, thanks for pointing that out. So yes, skipping the OOM reap if the task has
a robust list is not really ideal, as any process using pthreads would be skipped.

Would it be logical to change this so that we are no longer making this syscall
unconditionally? I still agree that skipping the OOM isn't very logical if we
have a better solution available. Is it set unconditionally so that users don't
have to do it dynamically when they enable the robustness?
If this is the case, it may be too much of a headache to implement.

> 
> So solution #1 will prevent oom reaping completely simply because the
> pointer is set on every user space task.
Every userspace task that implements a pthread.
> 
> Solutions #2 and #3 are incomplete and just awful hacks which cure one
> particular case: A single threaded process. Why?
> 
> The chosen oom reaper victim is a process, so what does it help to check
> or cleanup the robust list for _ONE_ thread? Nothing because the other
> threads can hold robust futexes and then run into the same problem.
> 
> Aside of that you seem to believe that the robust list head in the TLS
> is the only part which is relevant. That's wrong. The list head is
> either NULL or points to the innermost pthread_mutex which is held by a
> task. Now look at this example:
> 
>   TLS:robust_list -> mutex2 -> mutex1
> 
> mutex1 is the shared one which needs to be released so that other
> processes can make progress. mutex2 is a process private one which
> resides in a different VMA. So now if you filter the robust list and
> refuse to reap the TLS VMA, what prevents the other VMA from being
> reaped? If that's reaped then mutex1 is not reachable.

This is an interesting case... So skipping the robust_head VMA would solve
the case where all the locks are shared, which is an improvement over the
current implementation.

We have been trying to modify our reproducer to create the case described here,
but so far have been unsuccessful.
> 
> Now vs. cleaning up the robust list from the oom reaper context. That
> should be doable with a lot of care, but the proposed patch is not even
> close to a solution. It's simply broken.
> 
>> -static void futex_cleanup_begin(struct task_struct *tsk)
>> +static bool futex_cleanup_begin(struct task_struct *tsk, bool try)
>>  {
>>  	/*
>>  	 * Prevent various race issues against a concurrent incoming waiter
>> @@ -1055,7 +1056,12 @@ static void futex_cleanup_begin(struct task_struct *tsk)
>>  	 * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in
>>  	 * attach_to_pi_owner().
>>  	 */
>> -	mutex_lock(&tsk->futex_exit_mutex);
>> +	if (try) {
>> +		if (!mutex_trylock(&tsk->futex_exit_mutex))
>> +			return false;
>> +	} else {
>> +		mutex_lock(&tsk->futex_exit_mutex);
>> +	}
> 
> That conditional locking is disgusting.
>   
>>  void futex_exit_release(struct task_struct *tsk)
>>  {
>> -	futex_cleanup_begin(tsk);
>> +	futex_cleanup_begin(tsk, false);
> 
> If the task already cleaned up the robust list then this will roll back
> tsk->futex_state from FUTEX_STATE_DEAD to FUTEX_STATE_EXITING. Sigh...
> 
>> +	futex_cleanup(tsk);
>> +	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
>> +}
>> +
>> +/* Try to perform the futex_cleanup and return true if successful.
> 
> This is not a proper multi line comment.
> 
>      /*
>       * Multi line comments look like this:
>       *
>       * Properly formatted.
>       *
>       * Don't try to use the network comment style
>       * on anything outside of networking.
>       */
> 
>> + * Designed to be called from the context of the OOM Reaper.
> 
> Let's talk about design later.
> 
>> + */
>> +bool try_futex_exit_release(struct task_struct *tsk)
>> +{
>> +	if (!futex_cleanup_begin(tsk, true))
>> +		return false;
>> +
>> +	/* We are calling this from the context of a kthread. We need to
>> +	 * instruct the kthread to use the address space of the given mm
>> +	 * so the get_user won't return -EFAULT.
> 
> How is this preventing get_user() or any other operation on the task's
> user memory from returning -EFAULT? Not at all. Any user access can fail and
Without kthread_use_mm() the kthread cannot access the mm's memory, and the
get_user() calls in futex_cleanup() are guaranteed to fail... I left that
comment to avoid confusion.
> return -EFAULT. Comments are there to explain things not to create
> confusion.
> 
>> +	 */
>> +	kthread_use_mm(tsk->mm);
>>  	futex_cleanup(tsk);
> 
> But aside of that. How is this supposed to work correctly?
> 
> oom_reaper()
>   oom_reap_task()
>     oom_reap_task_mm()
>       mmap_read_trylock(mm) <- Succeeds
>         try_futex_exit_release()
>           use_mm()
>           futex_cleanup()
>             get_user() -> #PF
> 
> #PF
>   if (!mmap_read_trylock(mm)) {
> 
> So here the problem starts. The trylock can succeed or not, depending
> on the contention state of mm::mmap_lock.
> 
> So in case the trylock fails because there is a writer waiting, then it
> runs into this:
> 
>      if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
>         ....
>         return;
>      }
> 
> This condition evaluates to false because get_user() has an
> exception table entry. So this proceeds and does:
>      
>      mmap_read_lock(mm);
> 
> which is a full dead lock.
> 
> But even if the trylock succeeds then this runs into the full fault
> path, which is not correct either for pretty obvious reasons.
> 
> I assume that's all part of the design, right?

Yeah all of this makes the solution pretty useless in its current state. Thanks
for pointing that out in detail.

> But the real questions here are:
> 
>    Why are we doing this remote reaping at all?
> 
>    What is the condition that a task which is killed with a fatal signal
>    does not reach do_exit() and cleans up itself?
> 
> If the answer is "because", then we should rather make sure that this
> gets fixed.
> 
> If there is a legitimate reason why a task cannot handle a fatal signal,
> then yes the oom reaper might be necessary, but unleashing the oom
> reaper unconditionally is simply a bad idea and results in the problem
> which this is trying to paper over.
> 
> The oom reaper should be the last resort IMO and not racing against the
> killed task in the first place. IOW, give the task some time to clean
> itself up and if that fails and it is truly stuck and unable to do so,
> then reap the mm. But that should be the rare case and then the stuck
> futex should be the least of our worries.
> 
> Thanks,
> 
>         tglx

Thanks for the review!

Given our inability to reproduce the TLS -> private mutex -> shared mutex case,
we are going to continue forward with the VMA skipping, as that should at least
clear up the cases where all the locks are shared.

Cheers,
-- Nico
Nico Pache April 6, 2022, 5:36 p.m. UTC | #16
On 4/6/22 11:22, Nico Pache wrote:
> 
> 
> On 3/22/22 09:17, Thomas Gleixner wrote:
>> On Tue, Mar 22 2022 at 09:26, Michal Hocko wrote:
>>> On Mon 21-03-22 19:57:24, Davidlohr Bueso wrote:
>>>> On Mon, 21 Mar 2022, Nico Pache wrote:
>>>>
>>>>> We could proceed with the V3 approach; however if we are able to find a complete
>>>>> solution that keeps both functionalities (Concurrent OOM Reaping & Robust Futex)
>>>>> working, I don't see why we wouldn't go for it.
>>
>> See below.
>>
>>>> Because semantically killing the process is, imo, the wrong thing to do.
>>>
>>> I am not sure I follow. The task has been killed by the oom killer. All
>>> we are discussing here is how to preserve the robust list metadata
>>> stored in the memory which is normally unmapped by the oom_reaper to
>>> guarantee further progress.
>>>
>>> I can see we have 4 potential solutions:
>>> 1) do not oom_reap oom victims with robust futex metadata in anonymous
>>>    memory. Easy enough but it could lead to excessive oom killing in
>>>    case the victim gets stuck in the kernel and cannot terminate.
>>> 2) clean up robust list from the oom_reaper context. Seems tricky due to
>>>    #PF handling from the oom_reaper context which would need to be
>>>    non-blocking
>>> 3) filter vmas which contain robust list. Simple check for the vma
>>>    range
>>> 4) internally mark vmas which have to preserve the state during
>>>    oom_reaping. Futex code would somehow have to mark those mappings.
>>>    While a more generic solution, I am not sure this is a practical
>>>    approach.
>>
>> And all of that is based on wishful thinking, really. Let me explain.
>>
>> The task::robust_list pointer is set unconditionally by NPTL for every
>> thread of a process. It points to the 'list head' which is in the
>> TLS. But this does not tell whether the task holds a robust futex or
>> not. That's evaluated in the futex exit handling code.
> 
> Ah, thanks for pointing that out. So yes, skipping the OOM reap if the task has
> a robust list is not really ideal, as any process using pthreads would be skipped.
> 
> Would it be logical to change this so that we are no longer making this syscall
> unconditionally? I still agree that skipping the OOM isn't very logical if we
> have a better solution available. Is it set unconditionally so that users don't
> have to do it dynamically when they enable the robustness?
> If this is the case, it may be too much of a headache to implement.
> 
>>
>> So solution #1 will prevent oom reaping completely simply because the
>> pointer is set on every user space task.
> Every userspace task that implements a pthread.
I stand corrected... Joel just showed me that it is indeed set on every
userspace task.

Not entirely sure why... but I'm wrong here.
>>
>> Solutions #2 and #3 are incomplete and just awful hacks which cure one
>> particular case: A single threaded process. Why?
>>
>> The chosen oom reaper victim is a process, so what does it help to check
>> or cleanup the robust list for _ONE_ thread? Nothing because the other
>> threads can hold robust futexes and then run into the same problem.
>>
>> Aside of that you seem to believe that the robust list head in the TLS
>> is the only part which is relevant. That's wrong. The list head is
>> either NULL or points to the innermost pthread_mutex which is held by a
>> task. Now look at this example:
>>
>>   TLS:robust_list -> mutex2 -> mutex1
>>
>> mutex1 is the shared one which needs to be released so that other
>> processes can make progress. mutex2 is a process private one which
>> resides in a different VMA. So now if you filter the robust list and
>> refuse to reap the TLS VMA, what prevents the other VMA from being
>> reaped? If that's reaped then mutex1 is not reachable.
> 
> This is an interesting case... So skipping the robust_head VMA would solve
> the case where all the locks are shared, which is an improvement over the
> current implementation.
> 
> We have been trying to modify our reproducer to create the case described here,
> but so far have been unsuccessful.
>>
>> Now vs. cleaning up the robust list from the oom reaper context. That
>> should be doable with a lot of care, but the proposed patch is not even
>> close to a solution. It's simply broken.
>>
>>> -static void futex_cleanup_begin(struct task_struct *tsk)
>>> +static bool futex_cleanup_begin(struct task_struct *tsk, bool try)
>>>  {
>>>  	/*
>>>  	 * Prevent various race issues against a concurrent incoming waiter
>>> @@ -1055,7 +1056,12 @@ static void futex_cleanup_begin(struct task_struct *tsk)
>>>  	 * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in
>>>  	 * attach_to_pi_owner().
>>>  	 */
>>> -	mutex_lock(&tsk->futex_exit_mutex);
>>> +	if (try) {
>>> +		if (!mutex_trylock(&tsk->futex_exit_mutex))
>>> +			return false;
>>> +	} else {
>>> +		mutex_lock(&tsk->futex_exit_mutex);
>>> +	}
>>
>> That conditional locking is disgusting.
>>   
>>>  void futex_exit_release(struct task_struct *tsk)
>>>  {
>>> -	futex_cleanup_begin(tsk);
>>> +	futex_cleanup_begin(tsk, false);
>>
>> If the task already cleaned up the robust list then this will roll back
>> tsk->futex_state from FUTEX_STATE_DEAD to FUTEX_STATE_EXITING. Sigh...
>>
>>> +	futex_cleanup(tsk);
>>> +	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
>>> +}
>>> +
>>> +/* Try to perform the futex_cleanup and return true if successful.
>>
>> This is not a proper multi line comment.
>>
>>      /*
>>       * Multi line comments look like this:
>>       *
>>       * Properly formatted.
>>       *
>>       * Don't try to use the network comment style
>>       * on anything outside of networking.
>>       */
>>
>>> + * Designed to be called from the context of the OOM Reaper.
>>
>> Let's talk about design later.
>>
>>> + */
>>> +bool try_futex_exit_release(struct task_struct *tsk)
>>> +{
>>> +	if (!futex_cleanup_begin(tsk, true))
>>> +		return false;
>>> +
>>> +	/* We are calling this from the context of a kthread. We need to
>>> +	 * instruct the kthread to use the address space of the given mm
>>> +	 * so the get_user won't return -EFAULT.
>>
>> How is this preventing get_user() or any other operation on the task's
>> user memory from returning -EFAULT? Not at all. Any user access can fail and
> Without kthread_use_mm() the kthread cannot access the mm's memory, and the
> get_user() calls in futex_cleanup() are guaranteed to fail... I left that
> comment to avoid confusion.
>> return -EFAULT. Comments are there to explain things not to create
>> confusion.
>>
>>> +	 */
>>> +	kthread_use_mm(tsk->mm);
>>>  	futex_cleanup(tsk);
>>
>> But aside of that. How is this supposed to work correctly?
>>
>> oom_reaper()
>>   oom_reap_task()
>>     oom_reap_task_mm()
>>       mmap_read_trylock(mm) <- Succeeds
>>         try_futex_exit_release()
>>           use_mm()
>>           futex_cleanup()
>>             get_user() -> #PF
>>
>> #PF
>>   if (!mmap_read_trylock(mm)) {
>>
>> So here the problem starts. The trylock can succeed or not, depending
>> on the contention state of mm::mmap_lock.
>>
>> So in case the trylock fails because there is a writer waiting, then it
>> runs into this:
>>
>>      if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
>>         ....
>>         return;
>>      }
>>
>> This condition evaluates to false because get_user() has an
>> exception table entry. So this proceeds and does:
>>      
>>      mmap_read_lock(mm);
>>
>> which is a full dead lock.
>>
>> But even if the trylock succeeds then this runs into the full fault
>> path, which is not correct either for pretty obvious reasons.
>>
>> I assume that's all part of the design, right?
> 
> Yeah all of this makes the solution pretty useless in its current state. Thanks
> for pointing that out in detail.
> 
>> But the real questions here are:
>>
>>    Why are we doing this remote reaping at all?
>>
>>    What is the condition that a task which is killed with a fatal signal
>>    does not reach do_exit() and cleans up itself?
>>
>> If the answer is "because", then we should rather make sure that this
>> gets fixed.
>>
>> If there is a legitimate reason why a task cannot handle a fatal signal,
>> then yes the oom reaper might be necessary, but unleashing the oom
>> reaper unconditionally is simply a bad idea and results in the problem
>> which this is trying to paper over.
>>
>> The oom reaper should be the last resort IMO and not racing against the
>> killed task in the first place. IOW, give the task some time to clean
>> itself up and if that fails and it is truly stuck and unable to do so,
>> then reap the mm. But that should be the rare case and then the stuck
>> futex should be the least of our worries.
>>
>> Thanks,
>>
>>         tglx
> 
> Thanks for the review!
> 
> Given our inability to reproduce the TLS -> private mutex -> shared mutex case,
> we are going to continue forward with the VMA skipping, as that should at least
> clear up the cases where all the locks are shared.
> 
> Cheers,
> -- Nico
>
diff mbox series

Patch

diff --git a/include/linux/futex.h b/include/linux/futex.h
index b70df27d7e85..64d6e89294ac 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -71,9 +71,21 @@  static inline void futex_init_task(struct task_struct *tsk)
 	mutex_init(&tsk->futex_exit_mutex);
 }
 
+static inline bool task_has_robust_list(struct task_struct *tsk)
+{
+	bool robust = false;
+
+	robust = tsk->robust_list != NULL;
+#ifdef CONFIG_COMPAT
+	robust |= tsk->compat_robust_list != NULL;
+#endif
+	return robust;
+}
+
 void futex_exit_recursive(struct task_struct *tsk);
 void futex_exit_release(struct task_struct *tsk);
 void futex_exec_release(struct task_struct *tsk);
+bool try_futex_exit_release(struct task_struct *tsk);
 
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
@@ -82,6 +94,8 @@  static inline void futex_init_task(struct task_struct *tsk) { }
 static inline void futex_exit_recursive(struct task_struct *tsk) { }
 static inline void futex_exit_release(struct task_struct *tsk) { }
 static inline void futex_exec_release(struct task_struct *tsk) { }
+static inline bool task_has_robust_list(struct task_struct *tsk) { return false; }
+static inline bool try_futex_exit_release(struct task_struct *tsk) { return true; }
 static inline long do_futex(u32 __user *uaddr, int op, u32 val,
 			    ktime_t *timeout, u32 __user *uaddr2,
 			    u32 val2, u32 val3)
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index 51dd822a8060..81aa60ce1ed6 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -37,6 +37,7 @@ 
 #include <linux/memblock.h>
 #include <linux/fault-inject.h>
 #include <linux/slab.h>
+#include <linux/kthread.h>
 
 #include "futex.h"
 #include "../locking/rtmutex_common.h"
@@ -1047,7 +1048,7 @@  void futex_exit_recursive(struct task_struct *tsk)
 	tsk->futex_state = FUTEX_STATE_DEAD;
 }
 
-static void futex_cleanup_begin(struct task_struct *tsk)
+static bool futex_cleanup_begin(struct task_struct *tsk, bool try)
 {
 	/*
 	 * Prevent various race issues against a concurrent incoming waiter
@@ -1055,7 +1056,12 @@  static void futex_cleanup_begin(struct task_struct *tsk)
 	 * tsk->futex_exit_mutex when it observes FUTEX_STATE_EXITING in
 	 * attach_to_pi_owner().
 	 */
-	mutex_lock(&tsk->futex_exit_mutex);
+	if (try) {
+		if (!mutex_trylock(&tsk->futex_exit_mutex))
+			return false;
+	} else {
+		mutex_lock(&tsk->futex_exit_mutex);
+	}
 
 	/*
 	 * Switch the state to FUTEX_STATE_EXITING under tsk->pi_lock.
@@ -1071,6 +1077,7 @@  static void futex_cleanup_begin(struct task_struct *tsk)
 	raw_spin_lock_irq(&tsk->pi_lock);
 	tsk->futex_state = FUTEX_STATE_EXITING;
 	raw_spin_unlock_irq(&tsk->pi_lock);
+	return true;
 }
 
 static void futex_cleanup_end(struct task_struct *tsk, int state)
@@ -1096,7 +1103,7 @@  void futex_exec_release(struct task_struct *tsk)
 	 * futex is held on exec(), this provides at least as much state
 	 * consistency protection which is possible.
 	 */
-	futex_cleanup_begin(tsk);
+	futex_cleanup_begin(tsk, false);
 	futex_cleanup(tsk);
 	/*
 	 * Reset the state to FUTEX_STATE_OK. The task is alive and about
@@ -1107,9 +1114,29 @@  void futex_exec_release(struct task_struct *tsk)
 
 void futex_exit_release(struct task_struct *tsk)
 {
-	futex_cleanup_begin(tsk);
+	futex_cleanup_begin(tsk, false);
+	futex_cleanup(tsk);
+	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
+}
+
+/* Try to perform the futex_cleanup and return true if successful.
+ * Designed to be called from the context of the OOM Reaper.
+ */
+bool try_futex_exit_release(struct task_struct *tsk)
+{
+	if (!futex_cleanup_begin(tsk, true))
+		return false;
+
+	/* We are calling this from the context of a kthread. We need to
+	 * instruct the kthread to use the address space of the given mm
+	 * so the get_user won't return -EFAULT.
+	 */
+	kthread_use_mm(tsk->mm);
 	futex_cleanup(tsk);
+	kthread_unuse_mm(tsk->mm);
+
 	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
+	return true;
 }
 
 static int __init futex_init(void)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 832fb330376e..f7834c53d874 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,7 @@ 
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/futex.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -587,6 +588,18 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto out_unlock;
 	}
 
+	/* We can't reap a process holding a robust_list; the pthread
+	 * struct is allocated in userspace using PRIVATE | ANONYMOUS
+	 * memory which when reaped before futex_cleanup() can leave
+	 * the waiting process stuck. Try to perform the futex_cleanup,
+	 * and if unsuccessful, skip the OOM reaping.
+	 */
+	if (task_has_robust_list(tsk) && !try_futex_exit_release(tsk)) {
+		trace_skip_task_reaping(tsk->pid);
+		pr_info("oom_reaper: skipping task as it contains a robust list");
+		goto out_finish;
+	}
+
 	trace_start_task_reaping(tsk->pid);
 
 	/* failed to reap part of the address space. Try again later */