
[v8] oom_kill.c: futex: Don't OOM reap the VMA containing the robust_list_head

Message ID 20220408032809.3696798-1-npache@redhat.com (mailing list archive)
State New

Commit Message

Nico Pache April 8, 2022, 3:28 a.m. UTC
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
be targeted by the oom reaper. This mapping is used to store the futex
robust list head; the kernel does not keep a copy of the robust list and
instead references a userspace address to maintain the robustness during
a process death. A race can occur between exit_mm and the oom reaper that
allows the oom reaper to free the memory of the futex robust list before
the exit path has handled the futex death:

    CPU1                               CPU2
------------------------------------------------------------------------
    page_fault
    do_exit "signal"
    wake_oom_reaper
                                        oom_reaper
                                        oom_reap_task_mm (invalidates mm)
    exit_mm
    exit_mm_release
    futex_exit_release
    futex_cleanup
    exit_robust_list
    get_user (EFAULT- can't access memory)

If the get_user EFAULTs, the kernel will be unable to recover the
waiters on the robust_list, leaving userspace mutexes hung indefinitely.
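
For reference, the list being walked lives entirely in userspace; the kernel
only keeps the head pointer that the task registered via set_robust_list(2).
From include/uapi/linux/futex.h:

struct robust_list {
	struct robust_list *next;
};

struct robust_list_head {
	struct robust_list list;
	long futex_offset;
	struct robust_list *list_op_pending;
};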

Use the robust_list address stored in the kernel to skip the VMA that holds
it, allowing a successful futex_cleanup.

Theoretically a failure can still occur if there are locks mapped as
PRIVATE|ANON; however, the robust futexes are a best-effort approach.
This patch only strengthens that best-effort.

The following case can still fail:
robust head (skipped) -> private lock (reaped) -> shared lock (skipped)

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

[1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph von Recklinghausen <crecklin@redhat.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: stable@kernel.org
Co-developed-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 include/linux/oom.h |  2 +-
 mm/mmap.c           |  8 +++++++-
 mm/oom_kill.c       | 31 ++++++++++++++++++++++++++++---
 3 files changed, 36 insertions(+), 5 deletions(-)

Comments

Peter Zijlstra April 8, 2022, 8:15 a.m. UTC | #1
On Thu, Apr 07, 2022 at 11:28:09PM -0400, Nico Pache wrote:
> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> be targeted by the oom reaper. This mapping is used to store the futex
> robust list head; the kernel does not keep a copy of the robust list and
> instead references a userspace address to maintain the robustness during
> a process death. A race can occur between exit_mm and the oom reaper that
> allows the oom reaper to free the memory of the futex robust list before
> the exit path has handled the futex death:
> 
>     CPU1                               CPU2
> ------------------------------------------------------------------------
>     page_fault
>     do_exit "signal"
>     wake_oom_reaper
>                                         oom_reaper
>                                         oom_reap_task_mm (invalidates mm)
>     exit_mm
>     exit_mm_release
>     futex_exit_release
>     futex_cleanup
>     exit_robust_list
>     get_user (EFAULT- can't access memory)
> 
> If the get_user EFAULTs, the kernel will be unable to recover the
> waiters on the robust_list, leaving userspace mutexes hung indefinitely.
> 
> Use the robust_list address stored in the kernel to skip the VMA that holds
> it, allowing a successful futex_cleanup.
> 
> Theoretically a failure can still occur if there are locks mapped as
> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
> This patch only strengthens that best-effort.
> 
> The following case can still fail:
> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)

This is still all sorts of confused.. it's a list head, the entries can
be in any random other VMA. You must not remove *any* user memory before
doing the robust thing. Not removing the VMA that contains the head is
pointless in the extreme.

Did you not read the previous discussion?
Thomas Gleixner April 8, 2022, 8:37 a.m. UTC | #2
On Fri, Apr 08 2022 at 10:15, Peter Zijlstra wrote:
> On Thu, Apr 07, 2022 at 11:28:09PM -0400, Nico Pache wrote:
>> Theoretically a failure can still occur if there are locks mapped as
>> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
>> This patch only strengthens that best-effort.
>> 
>> The following case can still fail:
>> robust head (skipped) -> private lock (reaped) -> shared lock
>> (skipped)
>
> This is still all sorts of confused.. it's a list head, the entries can
> be in any random other VMA. You must not remove *any* user memory before
> doing the robust thing. Not removing the VMA that contains the head is
> pointless in the extreme.
>
> Did you not read the previous discussion?

Aside from that, we all agreed that an oom-killed task should be given
time to clean up after itself instead of being brute-force cleaned up
immediately; the latter is the real problem here. Can we fix that first
before adding broken heuristics?

Thanks,

        tglx
Nico Pache April 8, 2022, 8:41 a.m. UTC | #3
On 4/8/22 04:15, Peter Zijlstra wrote:
> On Thu, Apr 07, 2022 at 11:28:09PM -0400, Nico Pache wrote:
>> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
>> be targeted by the oom reaper. This mapping is used to store the futex
>> robust list head; the kernel does not keep a copy of the robust list and
>> instead references a userspace address to maintain the robustness during
>> a process death. A race can occur between exit_mm and the oom reaper that
>> allows the oom reaper to free the memory of the futex robust list before
>> the exit path has handled the futex death:
>>
>>     CPU1                               CPU2
>> ------------------------------------------------------------------------
>>     page_fault
>>     do_exit "signal"
>>     wake_oom_reaper
>>                                         oom_reaper
>>                                         oom_reap_task_mm (invalidates mm)
>>     exit_mm
>>     exit_mm_release
>>     futex_exit_release
>>     futex_cleanup
>>     exit_robust_list
>>     get_user (EFAULT- can't access memory)
>>
>> If the get_user EFAULTs, the kernel will be unable to recover the
>> waiters on the robust_list, leaving userspace mutexes hung indefinitely.
>>
>> Use the robust_list address stored in the kernel to skip the VMA that holds
>> it, allowing a successful futex_cleanup.
>>
>> Theoretically a failure can still occur if there are locks mapped as
>> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
>> This patch only strengthens that best-effort.
>>
>> The following case can still fail:
>> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
> 
> This is still all sorts of confused.. it's a list head, the entries can
> be in any random other VMA. You must not remove *any* user memory before
> doing the robust thing. Not removing the VMA that contains the head is
> pointless in the extreme.
Not sure how it's pointless if it fixes all the different reproducers we've
written for it. As for the private lock case we stated here, we haven't been able
to reproduce it, but I could see how it can be a potential issue (which is why
it's noted).
> 
> Did you not read the previous discussion?

I did... That's why I added the blurb about best-effort and the case that can
theoretically still fail.

The oom reaper doesn't reap shared memory, but WITHOUT this change it was
reaping the head (PRIVATE|ANON), which is needed to reach those shared
mappings; so shared locks are safe with this added change.

If a user implements private locks, they can only be used within that process.
If that process is OOMing, then it doesn't really matter what happens to the
futexes... all of the threads running under that process will terminate anyway.

Perhaps I'm misunderstanding you.

-- Nico
Nico Pache April 8, 2022, 8:52 a.m. UTC | #4
On 4/8/22 04:37, Thomas Gleixner wrote:
> On Fri, Apr 08 2022 at 10:15, Peter Zijlstra wrote:
>> On Thu, Apr 07, 2022 at 11:28:09PM -0400, Nico Pache wrote:
>>> Theoretically a failure can still occur if there are locks mapped as
>>> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
>>> This patch only strengthens that best-effort.
>>>
>>> The following case can still fail:
>>> robust head (skipped) -> private lock (reaped) -> shared lock
>>> (skipped)
>>
>> This is still all sorts of confused.. it's a list head, the entries can
>> be in any random other VMA. You must not remove *any* user memory before
>> doing the robust thing. Not removing the VMA that contains the head is
>> pointless in the extreme.
>>
>> Did you not read the previous discussion?
> 
> Aside from that, we all agreed that an oom-killed task should be given
> time to clean up after itself instead of being brute-force cleaned up
> immediately; the latter is the real problem here. Can we fix that first
> before adding broken heuristics?
We've tried multiple approaches to reproduce the case you are talking about with
no success...

Why make a change for something that we can't reproduce when we are sure this
works for all the cases we've attempted?

I also don't see how this is a broken heuristic... If anything, adding a delay is
broken. How do we ensure the delay is long enough for the exit to clean up the
futexes? In a heavily contended CPU with high memory pressure the delay may also
lead to other processes unnecessarily OOMing.

Cheers,
-- Nico

> 
> Thanks,
> 
>         tglx
> 
>
Michal Hocko April 8, 2022, 9:36 a.m. UTC | #5
On Fri 08-04-22 04:52:33, Nico Pache wrote:
[...]
> In a heavily contended CPU with high memory pressure the delay may also
> lead to other processes unnecessarily OOMing.

Let me just comment on this part because there is likely a confusion
involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
because the oom killing is throttled by the existence of a preexisting
OOM victim. In other words, as long as there is an alive victim no
further victims are selected and the oom killer backs off. The
oom_reaper will hide the alive oom victim after it is processed.
The longer the delay, the longer an oom victim can block
further progress, but it cannot really cause unnecessary OOMing.
Nico Pache April 8, 2022, 9:40 a.m. UTC | #6
On 4/8/22 05:36, Michal Hocko wrote:
> On Fri 08-04-22 04:52:33, Nico Pache wrote:
> [...]
>> In a heavily contended CPU with high memory pressure the delay may also
>> lead to other processes unnecessarily OOMing.
> 
> Let me just comment on this part because there is likely a confusion
> involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
> because the oom killing is throttled by the existence of a preexisting
> OOM victim. In other words, as long as there is an alive victim no
> further victims are selected and the oom killer backs off. The
> oom_reaper will hide the alive oom victim after it is processed.
> The longer the delay, the longer an oom victim can block
> further progress, but it cannot really cause unnecessary OOMing.
Is it not the case that if we delay an OOM, the amount of available memory stays
limited and other processes that are allocating memory can become OOM candidates?

-- Nico
Michal Hocko April 8, 2022, 9:59 a.m. UTC | #7
On Fri 08-04-22 05:40:09, Nico Pache wrote:
> 
> 
> On 4/8/22 05:36, Michal Hocko wrote:
> > On Fri 08-04-22 04:52:33, Nico Pache wrote:
> > [...]
> >> In a heavily contended CPU with high memory pressure the delay may also
> >> lead to other processes unnecessarily OOMing.
> > 
> > Let me just comment on this part because there is likely a confusion
> > involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
> > because the oom killing is throttled by the existence of a preexisting
> > OOM victim. In other words, as long as there is an alive victim no
> > further victims are selected and the oom killer backs off. The
> > oom_reaper will hide the alive oom victim after it is processed.
> > The longer the delay, the longer an oom victim can block
> > further progress, but it cannot really cause unnecessary OOMing.
> Is it not the case that if we delay an OOM, the amount of available memory stays
> limited and other processes that are allocating memory can become OOM candidates?

No. Have a look at oom_evaluate_task (tsk_is_oom_victim check).
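
For reference, the relevant branch in oom_evaluate_task() looks roughly like
this (trimmed from the v5.17-era mm/oom_kill.c):

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;	/* keep the current victim, pick no new one */
	}

As long as an alive victim has not been marked MMF_OOM_SKIP, the scan aborts
and no additional victim is selected.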
Nico Pache April 8, 2022, 10:36 a.m. UTC | #8
On 4/8/22 05:59, Michal Hocko wrote:
> On Fri 08-04-22 05:40:09, Nico Pache wrote:
>>
>>
>> On 4/8/22 05:36, Michal Hocko wrote:
>>> On Fri 08-04-22 04:52:33, Nico Pache wrote:
>>> [...]
>>>> In a heavily contended CPU with high memory pressure the delay may also
>>>> lead to other processes unnecessarily OOMing.
>>>
>>> Let me just comment on this part because there is likely a confusion
>>> involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
>>> because the oom killing is throttled by the existence of a preexisting
>>> OOM victim. In other words, as long as there is an alive victim no
>>> further victims are selected and the oom killer backs off. The
>>> oom_reaper will hide the alive oom victim after it is processed.
>>> The longer the delay, the longer an oom victim can block
>>> further progress, but it cannot really cause unnecessary OOMing.
>> Is it not the case that if we delay an OOM, the amount of available memory stays
>> limited and other processes that are allocating memory can become OOM candidates?
> 
> No. Have a look at oom_evaluate_task (tsk_is_oom_victim check).
Ok I see.

Doesn't the delay then allow the system to run into the following case more easily?
pr_warn("Out of memory and no killable processes...\n");
panic("System is deadlocked on memory\n");

If the system can't select another OOM candidate, the oom_reaper is delayed, and
the exit is blocked, then we panic.

-- Nico
Michal Hocko April 8, 2022, 10:51 a.m. UTC | #9
On Fri 08-04-22 06:36:40, Nico Pache wrote:
> 
> 
> On 4/8/22 05:59, Michal Hocko wrote:
> > On Fri 08-04-22 05:40:09, Nico Pache wrote:
> >>
> >>
> >> On 4/8/22 05:36, Michal Hocko wrote:
> >>> On Fri 08-04-22 04:52:33, Nico Pache wrote:
> >>> [...]
> >>>> In a heavily contended CPU with high memory pressure the delay may also
> >>>> lead to other processes unnecessarily OOMing.
> >>>
> >>> Let me just comment on this part because there is likely a confusion
> >>> involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
> >>> because the oom killing is throttled by the existence of a preexisting
> >>> OOM victim. In other words, as long as there is an alive victim no
> >>> further victims are selected and the oom killer backs off. The
> >>> oom_reaper will hide the alive oom victim after it is processed.
> >>> The longer the delay, the longer an oom victim can block
> >>> further progress, but it cannot really cause unnecessary OOMing.
> >> Is it not the case that if we delay an OOM, the amount of available memory stays
> >> limited and other processes that are allocating memory can become OOM candidates?
> > 
> > No. Have a look at oom_evaluate_task (tsk_is_oom_victim check).
> Ok I see.
> 
> Doesn't the delay then allow the system to run into the following case more easily?
> pr_warn("Out of memory and no killable processes...\n");
> panic("System is deadlocked on memory\n");

No. Aborting the oom victim search (above mentioned) will cause
out_of_memory to bail out and return to the page allocator. As I've said
the only problem with delaying the oom_reaper is that _iff_ the oom
victim cannot terminate (because it is stuck somewhere in the kernel)
on its own then the oom situation (be it global, cpuset or memcg) will
take longer, so allocating tasks will not be able to make forward
progress.
Nico Pache April 8, 2022, 11:26 a.m. UTC | #10
On 4/8/22 06:51, Michal Hocko wrote:
> On Fri 08-04-22 06:36:40, Nico Pache wrote:
>>
>>
>> On 4/8/22 05:59, Michal Hocko wrote:
>>> On Fri 08-04-22 05:40:09, Nico Pache wrote:
>>>>
>>>>
>>>> On 4/8/22 05:36, Michal Hocko wrote:
>>>>> On Fri 08-04-22 04:52:33, Nico Pache wrote:
>>>>> [...]
>>>>>> In a heavily contended CPU with high memory pressure the delay may also
>>>>>> lead to other processes unnecessarily OOMing.
>>>>>
>>>>> Let me just comment on this part because there is likely a confusion
>>>>> involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
>>>>> because the oom killing is throttled by the existence of a preexisting
>>>>> OOM victim. In other words, as long as there is an alive victim no
>>>>> further victims are selected and the oom killer backs off. The
>>>>> oom_reaper will hide the alive oom victim after it is processed.
>>>>> The longer the delay, the longer an oom victim can block
>>>>> further progress, but it cannot really cause unnecessary OOMing.
>>>> Is it not the case that if we delay an OOM, the amount of available memory stays
>>>> limited and other processes that are allocating memory can become OOM candidates?
>>>
>>> No. Have a look at oom_evaluate_task (tsk_is_oom_victim check).
>> Ok I see.
>>
>> Doesn't the delay then allow the system to run into the following case more easily?
>> pr_warn("Out of memory and no killable processes...\n");
>> panic("System is deadlocked on memory\n");
> 
> No. Aborting the oom victim search (above mentioned) will cause
> out_of_memory to bail out and return to the page allocator. 
Ok I see that now. I did my bit math incorrectly the first time around. I
thought abort led to the !oc->chosen case.

> the only problem with delaying the oom_reaper is that _iff_ the oom
> victim cannot terminate (because it is stuck somewhere in the kernel)
> on its own then the oom situation (be it global, cpuset or memcg) will
> take longer, so allocating tasks will not be able to make forward
> progress.
Ok, so if I understand that correctly, delaying can have some ugly effects and
kinda breaks the initial purpose of the OOM reaper?

I personally don't like the delay approach, especially if we have a better one
we know is working and that doesn't add regressions.

If someone can prove to me the private lock case, I'd be more willing to bite.

Thanks for all the OOM context :)
-- Nico
Michal Hocko April 8, 2022, 11:48 a.m. UTC | #11
On Fri 08-04-22 07:26:07, Nico Pache wrote:
[...]
> Ok, so if I understand that correctly, delaying can have some ugly effects and
> kinda breaks the initial purpose of the OOM reaper?

No, not really. The primary objective of the oom_reaper is to _guarantee_
forward progress. It is not really meant to be an optimization to
respond to the oom killer faster. The reason the oom_reaper is kicked
off right away is because that was the simplest implementation.

> I personally don't like the delay approach, especially if we have a better one
> we know is working and that doesn't add regressions.

Well, I would say that handling the futex case more gracefully would be
preferable, but my understanding is that this is not all that easy. I am
far from being a futex expert so I will leave that up to Thomas and Peter.

On the other hand delaying oom_reaper is rather straightforward and I do
not think there is a big risk of regressions. Any QoS during OOM is
simply out of the window and the main purpose of the reaper will be
preserved with a timeout as well. I also do agree with Thomas that this
would cover 99% of cases.

> If someone can prove to me the private lock case, I'd be more willing to bite.
> 
> Thanks for all the OOM context :)

Welcome. The oom handling is a maze and it is really easy to miss all
the subtleties and conflicting requirements that apply here.
Thomas Gleixner April 8, 2022, 1:54 p.m. UTC | #12
On Fri, Apr 08 2022 at 04:41, Nico Pache wrote:
> On 4/8/22 04:15, Peter Zijlstra wrote:
>>>
>>> The following case can still fail:
>>> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
>> 
>> This is still all sorts of confused.. it's a list head, the entries can
>> be in any random other VMA. You must not remove *any* user memory before
>> doing the robust thing. Not removing the VMA that contains the head is
>> pointless in the extreme.
>> Not sure how it's pointless if it fixes all the different reproducers we've
>> written for it. As for the private lock case we stated here, we haven't been able
>> to reproduce it, but I could see how it can be a potential issue (which is why
>> it's noted).

The below reproduces the problem nicely, i.e. the lock() in the parent
times out. So why would the OOM killer fail to cause the same problem
when it reaps the private anon mapping where the private futex sits?

If you revert the lock order in the child the robust muck works.

Thanks,

        tglx
---
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/mman.h>

static char n[4096];

int main(void)
{
	pthread_mutexattr_t mat_s, mat_p;
	pthread_mutex_t *mut_s, *mut_p;
	pthread_barrierattr_t ba;
	pthread_barrier_t *b;
	struct timespec to;
	void *pri, *shr;
	int r;

	shr = mmap(NULL, sizeof(n), PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	pthread_mutexattr_init(&mat_s);
	pthread_mutexattr_setrobust(&mat_s, PTHREAD_MUTEX_ROBUST);
	mut_s = shr;
	pthread_mutex_init(mut_s, &mat_s);

	pthread_barrierattr_init(&ba);
	pthread_barrierattr_setpshared(&ba, PTHREAD_PROCESS_SHARED);
	b = shr + 1024;
	pthread_barrier_init(b, &ba, 2);

	if (!fork()) {
		pri = mmap(NULL, 1<<20, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		pthread_mutexattr_init(&mat_p);
		pthread_mutexattr_setpshared(&mat_p, PTHREAD_PROCESS_PRIVATE);
		pthread_mutexattr_setrobust(&mat_p, PTHREAD_MUTEX_ROBUST);
		mut_p = pri;
		pthread_mutex_init(mut_p, &mat_p);

		// With lock order s, p parent gets timeout
		// With lock order p, s parent gets owner died
		pthread_mutex_lock(mut_s);
		pthread_mutex_lock(mut_p);
		// Remove unmap and lock order does not matter
		munmap(pri, sizeof(n));
		pthread_barrier_wait(b);
		printf("child gone\n");
	} else {
		pthread_barrier_wait(b);
		printf("parent lock\n");
		clock_gettime(CLOCK_REALTIME, &to);
		to.tv_sec += 1;
		r = pthread_mutex_timedlock(mut_s, &to);
		printf("parent lock returned: %s\n", strerror(r));
	}
	return 0;
}
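
(For completeness: the demonstrator builds with gcc -pthread. With the posted
lock order the parent's pthread_mutex_timedlock() fails with ETIMEDOUT, i.e.
"Connection timed out", because the child's exit-time robust list walk faults
at the unmapped mut_p before it can reach and mark mut_s.)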
kernel test robot April 8, 2022, 2:41 p.m. UTC | #13
Hi Nico,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]
[also build test WARNING on linus/master v5.18-rc1 next-20220408]
[cannot apply to linux/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Nico-Pache/oom_kill-c-futex-Don-t-OOM-reap-the-VMA-containing-the-robust_list_head/20220408-112952
base:   https://github.com/hnaz/linux-mm master
config: x86_64-randconfig-s022 (https://download.01.org/0day-ci/archive/20220408/202204082258.E7EPbAYz-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.2.0-19) 11.2.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/intel-lab-lkp/linux/commit/70c1e2a404ac47b7c9b8bd1e9c4d3e72f19e6c62
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nico-Pache/oom_kill-c-futex-Don-t-OOM-reap-the-VMA-containing-the-robust_list_head/20220408-112952
        git checkout 70c1e2a404ac47b7c9b8bd1e9c4d3e72f19e6c62
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> mm/oom_kill.c:606:21: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected void *robust_list @@     got struct robust_list_head [noderef] __user *robust_list @@
   mm/oom_kill.c:606:21: sparse:     expected void *robust_list
   mm/oom_kill.c:606:21: sparse:     got struct robust_list_head [noderef] __user *robust_list
   mm/oom_kill.c: note: in included file (through include/linux/rculist.h, include/linux/sched/signal.h, include/linux/oom.h):
   include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in 'find_lock_task_mm' - wrong count at exit
   mm/oom_kill.c:222:28: sparse: sparse: context imbalance in 'oom_badness' - unexpected unlock
   include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in 'dump_task' - unexpected unlock
   include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in '__oom_kill_process' - unexpected unlock
   mm/oom_kill.c:1243:21: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected void *robust_list @@     got struct robust_list_head [noderef] __user *robust_list @@
   mm/oom_kill.c:1243:21: sparse:     expected void *robust_list
   mm/oom_kill.c:1243:21: sparse:     got struct robust_list_head [noderef] __user *robust_list
   mm/oom_kill.c:1232:20: sparse: sparse: context imbalance in '__se_sys_process_mrelease' - unexpected unlock
--
>> mm/mmap.c:3138:21: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected void *robust_list @@     got struct robust_list_head [noderef] __user *robust_list @@
   mm/mmap.c:3138:21: sparse:     expected void *robust_list
   mm/mmap.c:3138:21: sparse:     got struct robust_list_head [noderef] __user *robust_list

vim +606 mm/oom_kill.c

   575	
   576	/*
   577	 * Reaps the address space of the given task.
   578	 *
   579	 * Returns true on success and false if none or part of the address space
   580	 * has been reclaimed and the caller should retry later.
   581	 */
   582	static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
   583	{
   584		bool ret = true;
   585		void *robust_list = NULL;
   586	
   587		if (!mmap_read_trylock(mm)) {
   588			trace_skip_task_reaping(tsk->pid);
   589			return false;
   590		}
   591	
   592		/*
   593		 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
   594		 * work on the mm anymore. The check for MMF_OOM_SKIP must run
   595		 * under mmap_lock for reading because it serializes against the
   596		 * mmap_write_lock();mmap_write_unlock() cycle in exit_mmap().
   597		 */
   598		if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
   599			trace_skip_task_reaping(tsk->pid);
   600			goto out_unlock;
   601		}
   602	
   603		trace_start_task_reaping(tsk->pid);
   604	
   605	#ifdef CONFIG_FUTEX
 > 606		robust_list = tsk->robust_list;
   607	#endif
   608		/* failed to reap part of the address space. Try again later */
   609		ret = __oom_reap_task_mm(mm, robust_list);
   610		if (!ret)
   611			goto out_finish;
   612	
   613		pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
   614				task_pid_nr(tsk), tsk->comm,
   615				K(get_mm_counter(mm, MM_ANONPAGES)),
   616				K(get_mm_counter(mm, MM_FILEPAGES)),
   617				K(get_mm_counter(mm, MM_SHMEMPAGES)));
   618	out_finish:
   619		trace_finish_task_reaping(tsk->pid);
   620	out_unlock:
   621		mmap_read_unlock(mm);
   622	
   623		return ret;
   624	}
   625
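
The sparse complaint is about assigning a struct robust_list_head __user *
to a plain void *, which drops the address-space annotation. A minimal way to
keep sparse happy (an illustrative sketch, not part of the posted patch) would
be to carry the address as an unsigned long from the start:

	unsigned long robust_list = 0;

#ifdef CONFIG_FUTEX
	robust_list = (unsigned long)tsk->robust_list;
#endif
	/* __oom_reap_task_mm() would take an unsigned long instead of void * */
	ret = __oom_reap_task_mm(mm, robust_list);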
Joel Savitz April 8, 2022, 4:13 p.m. UTC | #14
> ---
> #include <errno.h>
> #include <fcntl.h>
> #include <pthread.h>
> #include <time.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
>
> #include <sys/types.h>
> #include <sys/mman.h>
>
> static char n[4096];
>
> int main(void)
> {
>         pthread_mutexattr_t mat_s, mat_p;
>         pthread_mutex_t *mut_s, *mut_p;
>         pthread_barrierattr_t ba;
>         pthread_barrier_t *b;
>         struct timespec to;
>         void *pri, *shr;
>         int r;
>
>         shr = mmap(NULL, sizeof(n), PROT_READ | PROT_WRITE,
>                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>
>         pthread_mutexattr_init(&mat_s);
>         pthread_mutexattr_setrobust(&mat_s, PTHREAD_MUTEX_ROBUST);
>         mut_s = shr;
>         pthread_mutex_init(mut_s, &mat_s);
>
>         pthread_barrierattr_init(&ba);
>         pthread_barrierattr_setpshared(&ba, PTHREAD_PROCESS_SHARED);
>         b = shr + 1024;
>         pthread_barrier_init(b, &ba, 2);
>
>         if (!fork()) {
>                 pri = mmap(NULL, 1<<20, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>                 pthread_mutexattr_init(&mat_p);
>                 pthread_mutexattr_setpshared(&mat_p, PTHREAD_PROCESS_PRIVATE);
>                 pthread_mutexattr_setrobust(&mat_p, PTHREAD_MUTEX_ROBUST);
One thing I don't understand is what kind of sane use case relies on
robust futex for a process-private lock?
Is there a purpose to a lock being on the robust list if there are no
other processes that must be woken in case the holder process is
killed?
If this usage serves no purpose besides causing races during oom, we
should discourage this use, perhaps by adding a note on the manpage.

>                 mut_p = pri;
>                 pthread_mutex_init(mut_p, &mat_p);
>
>                 // With lock order s, p parent gets timeout
>                 // With lock order p, s parent gets owner died
>                 pthread_mutex_lock(mut_s);
>                 pthread_mutex_lock(mut_p);
>                 // Remove unmap and lock order does not matter
>                 munmap(pri, sizeof(n));
>                 pthread_barrier_wait(b);
>                 printf("child gone\n");
>         } else {
>                 pthread_barrier_wait(b);
>                 printf("parent lock\n");
>                 clock_gettime(CLOCK_REALTIME, &to);
>                 to.tv_sec += 1;
>                 r = pthread_mutex_timedlock(mut_s, &to);
>                 printf("parent lock returned: %s\n", strerror(r));
>         }
>         return 0;
> }
>
Thomas Gleixner April 8, 2022, 9:41 p.m. UTC | #15
On Fri, Apr 08 2022 at 12:13, Joel Savitz wrote:
>>         if (!fork()) {
>>                 pri = mmap(NULL, 1<<20, PROT_READ | PROT_WRITE,
>>                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>                 pthread_mutexattr_init(&mat_p);
>>                 pthread_mutexattr_setpshared(&mat_p, PTHREAD_PROCESS_PRIVATE);
>>                 pthread_mutexattr_setrobust(&mat_p, PTHREAD_MUTEX_ROBUST);
> One thing I don't understand is what kind of sane use case relies on
> robust futex for a process-private lock?
> Is there a purpose to a lock being on the robust list if there are no
> other processes that must be woken in case the holder process is
> killed?

Ever heard about the concept of multi threading?
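
A minimal sketch of the point: a thread that exits while holding a
process-private robust mutex still relies on the kernel's exit-time robust
list walk so that a sibling thread sees EOWNERDEAD instead of deadlocking:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m;

static void *holder(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&m);
	pthread_exit(NULL);		/* dies while holding the lock */
}

int main(void)
{
	pthread_mutexattr_t a;
	pthread_t t;

	pthread_mutexattr_init(&a);
	/* pshared defaults to PTHREAD_PROCESS_PRIVATE */
	pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(&m, &a);

	pthread_create(&t, NULL, holder, NULL);
	pthread_join(t, NULL);

	/* returns EOWNERDEAD instead of blocking forever */
	if (pthread_mutex_lock(&m) == EOWNERDEAD) {
		pthread_mutex_consistent(&m);
		printf("recovered lock from dead thread\n");
	}
	pthread_mutex_unlock(&m);
	return 0;
}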

> If this usage serves no purpose besides causing races during oom, we
> should discourage this use, perhaps by adding a note on the manpage.

This usage does not cause races during oom. It would not even cause races
if it were silly, which it is not, except for the demonstrator
above. The keyword here is *demonstrator*.

The oom killer itself causes the race because it starts reaping the VMAs
without granting the target time to terminate. This needs to be fixed in
the first place, period.

If the target can't terminate because it is stuck then yes, there will
be fallout where a robust futex cannot be released, but that's something
which cannot be solved at all.

I'm really tired of this by now. Several people explained at great
length the shortcomings of your 'cure the symptom' approach, showed you
that the "impossible to reproduce" problem is real, and told you very
explicitly what the proper solution is.

So instead of sitting down and really tackling the root cause, all you
can come up with is to post the same 'cure the symptom' muck over and
over and then, when debunked, grasp at straws.

Coming back to your original question.

What's the difference between a process shared and a process private
futex in the context of a multi threaded process?

  - The process shared must obviously have a shared mapping

  - The process private has no need for a shared mapping because
    all threads share the same address space.

What do they have in common?

  - All of them are threads in the kernel POV

  - All of them care about the unexpected exit/death of some other
    thread vs. locking

So why would a process private robust mutex be any different from a
process shared one?

I'm sure you can answer that question yourself by now.

Thanks,

        tglx
Michal Hocko April 11, 2022, 6:48 a.m. UTC | #16
On Fri 08-04-22 23:41:11, Thomas Gleixner wrote:
[...]
> Coming back to your original question.
> 
> What's the difference between a process shared and a process private
> futex in the context of a multi threaded process?
> 
>   - The process shared must obviously have a shared mapping
> 
>   - The process private has no need for a shared mapping because
>     all threads share the same address space.
> 
> What do they have in common?
> 
>   - All of them are threads in the kernel POV
> 
>   - All of them care about the unexpected exit/death of some other
>     thread vs. locking
> 
> So why would a process private robust mutex be any different from a
> process shared one?

Purely from the OOM POV they are slightly different because the OOM
killer always kills all threads which share the mm with the selected
victim (with the exception of the global init - see __oom_kill_process).
Note that this includes threads which do not share signal
handling.
So clobbering private locks shouldn't be observable to an alive thread
unless I am missing something.

On the other hand I do agree that delayed oom_reaper execution is a
reasonable workaround and the most simplistic one. If I understand your
example code then we would need to evaluate the whole robust list and
that is simply not feasible because that would require a #PF in general
case.

HTH
Thomas Gleixner April 11, 2022, 7:47 a.m. UTC | #17
Michal,

On Mon, Apr 11 2022 at 08:48, Michal Hocko wrote:
> On Fri 08-04-22 23:41:11, Thomas Gleixner wrote:
>> So why would a process private robust mutex be any different from a
>> process shared one?
>
> Purely from the OOM POV they are slightly different because the OOM
> killer always kills all threads which share the mm with the selected
> victim (with the exception of the global init - see __oom_kill_process).
> Note that this includes threads which do not share signal
> handling.
> So clobbering private locks shouldn't be observable to an alive thread
> unless I am missing something.

Yes, it kills everything, but the reaper also reaps non-shared VMAs. So
if the process private futex sits in a reaped VMA the shared one becomes
unreachable.

> On the other hand I do agree that delayed oom_reaper execution is a
> reasonable workaround and the most simplistic one.

I think it's more than a workaround. It's a reasonable expectation that
the kernel side of the user space threads can mop up the mess the user
space part created. So even if one of N threads is stuck in a place
where it can't, then N-1 can still reach do_exit() and mop their mess
up.

The oom reaper is the last resort to resolve the situation in case of a
stuck task. No?

> If I understand your example code then we would need to evaluate the
> whole robust list and that is simply not feasible because that would
> require a #PF in the general case.

Right. The robust list exit code does the user access with pagefaults
disabled and if it fails, it terminates the list walk. Bad luck :)
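
For context, the walk in question looks roughly like this (simplified from
the exit_robust_list() of that era; the PI and list_op_pending details are
trimmed):

	struct robust_list_head __user *head = curr->robust_list;
	struct robust_list __user *entry, *next_entry;
	unsigned int limit = ROBUST_LIST_LIMIT, pi, next_pi;
	unsigned long futex_offset;
	int rc;

	/* Both fetches fail once the VMA holding the head is reaped */
	if (fetch_robust_entry(&entry, &head->list.next, &pi))
		return;
	if (get_user(futex_offset, &head->futex_offset))
		return;

	while (entry != &head->list) {
		/* Fetch the next entry before waking any waiters */
		rc = fetch_robust_entry(&next_entry, &entry->next, &next_pi);
		if (handle_futex_death((void __user *)entry + futex_offset,
				       curr, pi))
			return;
		if (rc)
			return;	/* fetch faulted: the rest of the list is lost */
		entry = next_entry;
		pi = next_pi;
		if (!--limit)
			break;
	}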

Thanks,

        tglx
Michal Hocko April 11, 2022, 9:08 a.m. UTC | #18
On Mon 11-04-22 09:47:14, Thomas Gleixner wrote:
> Michal,
> 
> On Mon, Apr 11 2022 at 08:48, Michal Hocko wrote:
> > On Fri 08-04-22 23:41:11, Thomas Gleixner wrote:
> >> So why would a process private robust mutex be any different from a
> >> process shared one?
> >
> > Purely from the OOM POV they are slightly different because the OOM
> > killer always kills all threads which share the mm with the selected
> > victim (with the exception of the global init - see __oom_kill_process).
> > Note that this includes threads which do not share signal
> > handling.
> > So clobbering private locks shouldn't be observable to an alive thread
> > unless I am missing something.
> 
> Yes, it kills everything, but the reaper also reaps non-shared VMAs. So
> if the process private futex sits in a reaped VMA the shared one becomes
> unreachable.
> 
> > On the other hand I do agree that delayed oom_reaper execution is a
> > reasonable workaround and the most simplistic one.
> 
> I think it's more than a workaround. It's a reasonable expectation that
> the kernel side of the user space threads can mop up the mess the user
> space part created. So even if one of N threads is stuck in a place
> where it can't, then N-1 can still reach do_exit() and mop their mess
> up.
> 
> The oom reaper is the last resort to resolve the situation in case of a
> stuck task. No?

Yes, I keep saying workaround because it really doesn't address the
underlying issue, which is that the oom_reaper clobbers something it
shouldn't. A full fix from my POV would be making the oom_reaper code
more aware of the futex usage. But this is not really viable.

Btw. this is what I have in my local tree. It hasn't seen any testing but
it might be a good start to make it a full patch. I have decided to use
a timer rather than juggling tasks on the oom_reaper list because the
initial implementation looked uglier. I will try to find some time to
finish that but if Nico or others beat me to it I won't complain.
Also I absolutely do not insist on the timer approach.
--- 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff6901dcb06d..528806ad6e6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1446,6 +1446,7 @@ struct task_struct {
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
 	struct task_struct		*oom_reaper_list;
+	struct timer_list		oom_reaper_timer;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..be6d65ead7ec 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -632,7 +632,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 
-	/* Drop a reference taken by wake_oom_reaper */
+	/* Drop a reference taken by queue_oom_reaper */
 	put_task_struct(tsk);
 }
 
@@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
 		struct task_struct *tsk = NULL;
 
 		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
+		spin_lock_irq(&oom_reaper_lock);
 		if (oom_reaper_list != NULL) {
 			tsk = oom_reaper_list;
 			oom_reaper_list = tsk->oom_reaper_list;
 		}
-		spin_unlock(&oom_reaper_lock);
+		spin_unlock_irq(&oom_reaper_lock);
 
 		if (tsk)
 			oom_reap_task(tsk);
@@ -658,22 +658,50 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper_fn(struct timer_list *timer)
 {
-	/* mm is already queued? */
-	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
-		return;
+	struct task_struct *tsk = container_of(timer, struct task_struct, oom_reaper_timer);
+	struct mm_struct *mm = tsk->signal->oom_mm;
+	unsigned long flags;
 
-	get_task_struct(tsk);
+	/* The victim managed to terminate on its own - see exit_mmap */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+		put_task_struct(tsk);
+		return;
+	}
 
-	spin_lock(&oom_reaper_lock);
+	spin_lock_irqsave(&oom_reaper_lock, flags);
 	tsk->oom_reaper_list = oom_reaper_list;
 	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
+	spin_unlock_irqrestore(&oom_reaper_lock, flags);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
 }
 
+/*
+ * Give OOM victims some head room to exit themselves. If they do not exit
+ * on their own the oom reaper is invoked.
+ * The timeout is basically arbitrary and there is no best value to use.
+ * The longer it will be the longer the worst case scenario OOM can
+ * take. The smaller the timeout the more likely the oom_reaper can get
+ * in the way and release resources which could be needed during the
+ * exit path - e.g. futex robust lists can sit in the anonymous memory
+ * which could be reaped and the exit path won't be able to let waiters
+ * know the holding task has terminated.
+ */
+#define OOM_REAPER_DELAY (2*HZ)
+static void queue_oom_reaper(struct task_struct *tsk)
+{
+	/* mm is already queued? */
+	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+		return;
+
+	get_task_struct(tsk);
+	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper_fn, 0);
+	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	add_timer(&tsk->oom_reaper_timer);
+}
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -681,7 +709,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif /* CONFIG_MMU */
@@ -932,7 +960,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	rcu_read_unlock();
 
 	if (can_oom_reap)
-		wake_oom_reaper(victim);
+		queue_oom_reaper(victim);
 
 	mmdrop(mm);
 	put_task_struct(victim);
@@ -968,7 +996,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	task_lock(victim);
 	if (task_will_free_mem(victim)) {
 		mark_oom_victim(victim);
-		wake_oom_reaper(victim);
+		queue_oom_reaper(victim);
 		task_unlock(victim);
 		put_task_struct(victim);
 		return;
@@ -1067,7 +1095,7 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		queue_oom_reaper(current);
 		return true;
 	}
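
(A note on the sketch: wake_oom_reaper_fn() now runs from the timer callback,
i.e. in softirq context, which is why oom_reaper_lock grows the _irq/_irqsave
variants; the reaper thread has to block interrupts while holding the lock so
it cannot deadlock against the callback.)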
Nico Pache April 11, 2022, 11:51 p.m. UTC | #19
On 4/8/22 09:54, Thomas Gleixner wrote:
> On Fri, Apr 08 2022 at 04:41, Nico Pache wrote:
>> On 4/8/22 04:15, Peter Zijlstra wrote:
>>>>
>>>> The following case can still fail:
>>>> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
>>>
>>> This is still all sorts of confused.. it's a list head, the entries can
>>> be in any random other VMA. You must not remove *any* user memory before
>>> doing the robust thing. Not removing the VMA that contains the head is
>>> pointless in the extreme.
>> Not sure how it's pointless if it fixes all the different reproducers we've
>> written for it. As for the private lock case we stated here, we haven't been able
>> to reproduce it, but I could see how it can be a potential issue (which is why
>> it's noted).
> 
> The below reproduces the problem nicely, i.e. the lock() in the parent
> times out. So why would the OOM killer fail to cause the same problem
> when it reaps the private anon mapping where the private futex sits?
> 
> If you revert the lock order in the child the robust muck works.

Thanks for the reproducer Thomas :)

I think I need to re-up my knowledge around COW and how it affects that stack.
There are increased oddities when you add the pthread library that I can't fully
wrap my head around at the moment.

My confusion lies in how the parent/child share a robust list here, but they
obviously do. In my mind the mut_s would be different in the child/parent after
the fork and pthread_mutex_init (and friends) are done in the child.

Thanks!
-- Nico
> 
> Thanks,
> 
>         tglx
> ---
> #include <errno.h>
> #include <fcntl.h>
> #include <pthread.h>
> #include <time.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> 
> #include <sys/types.h>
> #include <sys/mman.h>
> 
> static char n[4096];
> 
> int main(void)
> {
> 	pthread_mutexattr_t mat_s, mat_p;
> 	pthread_mutex_t *mut_s, *mut_p;
> 	pthread_barrierattr_t ba;
> 	pthread_barrier_t *b;
> 	struct timespec to;
> 	void *pri, *shr;
> 	int r;
> 
> 	shr = mmap(NULL, sizeof(n), PROT_READ | PROT_WRITE,
> 		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> 
> 	pthread_mutexattr_init(&mat_s);
> 	pthread_mutexattr_setrobust(&mat_s, PTHREAD_MUTEX_ROBUST);
> 	mut_s = shr;
> 	pthread_mutex_init(mut_s, &mat_s);
> 
> 	pthread_barrierattr_init(&ba);
> 	pthread_barrierattr_setpshared(&ba, PTHREAD_PROCESS_SHARED);
> 	b = shr + 1024;
> 	pthread_barrier_init(b, &ba, 2);
> 
> 	if (!fork()) {
> 		pri = mmap(NULL, 1<<20, PROT_READ | PROT_WRITE,
> 			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 		pthread_mutexattr_init(&mat_p);
> 		pthread_mutexattr_setpshared(&mat_p, PTHREAD_PROCESS_PRIVATE);
> 		pthread_mutexattr_setrobust(&mat_p, PTHREAD_MUTEX_ROBUST);
> 		mut_p = pri;
> 		pthread_mutex_init(mut_p, &mat_p);
> 
> 		// With lock order s, p parent gets timeout
> 		// With lock order p, s parent gets owner died
> 		pthread_mutex_lock(mut_s);
> 		pthread_mutex_lock(mut_p);
> 		// Remove unmap and lock order does not matter
> 		munmap(pri, sizeof(n));
> 		pthread_barrier_wait(b);
> 		printf("child gone\n");
> 	} else {
> 		pthread_barrier_wait(b);
> 		printf("parent lock\n");
> 		clock_gettime(CLOCK_REALTIME, &to);
> 		to.tv_sec += 1;
> 		r = pthread_mutex_timedlock(mut_s, &to);
> 		printf("parent lock returned: %s\n", strerror(r));
> 	}
> 	return 0;
> }
>
Nico Pache April 12, 2022, 12:02 a.m. UTC | #20
On 4/11/22 05:08, Michal Hocko wrote:
> On Mon 11-04-22 09:47:14, Thomas Gleixner wrote:
>> Michal,
>>
>> On Mon, Apr 11 2022 at 08:48, Michal Hocko wrote:
>>> On Fri 08-04-22 23:41:11, Thomas Gleixner wrote:
>>>> So why would a process private robust mutex be any different from a
>>>> process shared one?
>>>
>>> Purely from the OOM POV they are slightly different because the OOM
>>> killer always kills all threads which share the mm with the selected
>>> victim (with the exception of the global init - see __oom_kill_process).
>>> Note that this includes threads which do not share signal
>>> handling.
>>> So clobbering private locks shouldn't be observable to an alive thread
>>> unless I am missing something.
>>
>> Yes, it kills everything, but the reaper also reaps non-shared VMAs. So
>> if the process private futex sits in a reaped VMA the shared one becomes
>> unreachable.
>>
>>> On the other hand I do agree that delayed oom_reaper execution is a
>>> reasonable workaround and the most simplistic one.
>>
>> I think it's more than a workaround. It's a reasonable expectation that
>> the kernel side of the user space threads can mop up the mess the user
>> space part created. So even if one of N threads is stuck in a place
>> where it can't, then N-1 can still reach do_exit() and mop their mess
>> up.
>>
>> The oom reaper is the last resort to resolve the situation in case of a
>> stuck task. No?
> 
> Yes, I keep saying workaround because it really doesn't address the
> underlying issue, which is that the oom_reaper clobbers something it
> shouldn't. A full fix from my POV would be making the oom_reaper code
> more aware of the futex usage. But this is not really viable.
This is *kinda* what this approach is doing, but as Thomas has pointed out, it
has its shortcomings. Additionally, it has just come to my attention that this
solution does not cover the compat robust list... so there is yet another
shortcoming.
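
(For reference, the compat head is a separate field: under CONFIG_COMPAT a
task carries both a robust_list and a compat_robust_list pointer in
task_struct, so the skip logic would have to track two areas; roughly, and
only as an illustration:

	void *robust_list = NULL, *compat_list = NULL;

#ifdef CONFIG_FUTEX
	robust_list = tsk->robust_list;
#ifdef CONFIG_COMPAT
	/* registered by 32-bit tasks via the compat variant of set_robust_list() */
	compat_list = tsk->compat_robust_list;
#endif
#endif

and __oom_reap_task_mm() would then have to skip both VMAs.)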
> 
> Btw. this is what I have in my local tree. It hasn't seen any testing but
> it might be a good start to make it a full patch. I have decided to use
> a timer rather than juggling tasks on the oom_reaper list because the
> initial implementation looked uglier. I will try to find some time to
> finish that but if Nico or others beat me to it I won't complain.
> Also I absolutely do not insist on the timer approach.
> [...]

I will spend tomorrow working on the delay solution and testing it. Thanks for
starting it :)

I appreciate the comments and help from everyone who has participated! I'm
sorry if any misunderstandings were had; it's not our intention to upset anyone,
but rather to learn and work out a solution for the problem we are facing.

Best,
-- Nico
Thomas Gleixner April 12, 2022, 4:20 p.m. UTC | #21
On Mon, Apr 11 2022 at 19:51, Nico Pache wrote:
> On 4/8/22 09:54, Thomas Gleixner wrote:
>> The below reproduces the problem nicely, i.e. the lock() in the parent
>> times out. So why would the OOM killer fail to cause the same problem
>> when it reaps the private anon mapping where the private futex sits?
>> 
>> If you revert the lock order in the child the robust muck works.
>
> Thanks for the reproducer Thomas :)
>
> I think I need to re-up my knowledge around COW and how it affects
> that stack. There are increased oddities when you add the pthread
> library that I can't fully wrap my head around at the moment.

The pthread library functions are just a convenience so I did not have to
hand code the futex and robust list handling.

> My confusion lies in how the parent/child share a robust list here, but they
> obviously do. In my mind the mut_s would be different in the child/parent after
> the fork and pthread_mutex_init (and friends) are done in the child.

They don't share a robust list, each thread has its own.

The shared mutex mut_s is initialized in the parent before fork and it's
the same address in the child and it's not COWed because the mapping is
MAP_SHARED.

The child allocates private memory and initializes the private mutex in
that private mapping.

So now the child does:

   pthread_mutex_lock(mut_s);

That's the mutex in the memory shared with the parent. After that the
child's robust list head points to mut_s::robust_list.

Now it does:

   pthread_mutex_lock(mut_p);

after that the child's robust list head points to mut_p::robust_list and
mut_p::robust_list points to mut_s::robust_list.

So now the child unmaps the private memory and exits.

The kernel tries to walk the robust list pointer and faults when trying
to access mut_p. End of walk and mut_s stays locked.
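
In other words, the child's chain at exit time is:

	child->robust_list
	      |
	      v
	mut_p::robust_list	(private mapping, unmapped: get_user() faults)
	      |
	      v
	mut_s::robust_list	(shared mapping, never reached, stays locked)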

So now think about the OOM case. The killed process has a shared mapping
with some other unrelated process (file, shmem) where mut_s sits.

It gets killed after:
		pthread_mutex_lock(mut_s);
		pthread_mutex_lock(mut_p);

So the OOM reaper rips the VMA which contains mut_p and therefore breaks
the chain which is necessary to reach mut_s.

See?

Thanks,

        tglx
Nico Pache April 12, 2022, 5:03 p.m. UTC | #22
On 4/12/22 12:20, Thomas Gleixner wrote:
> On Mon, Apr 11 2022 at 19:51, Nico Pache wrote:
>> On 4/8/22 09:54, Thomas Gleixner wrote:
>>> The below reproduces the problem nicely, i.e. the lock() in the parent
>>> times out. So why would the OOM killer fail to cause the same problem
>>> when it reaps the private anon mapping where the private futex sits?
>>>
>>> If you revert the lock order in the child the robust muck works.
>>
>> Thanks for the reproducer Thomas :)
>>
>> I think I need to re-up my knowledge around COW and how it affects
>> that stack. There are increased oddities when you add the pthread
>> library that I can't fully wrap my head around at the moment.
> 
> The pthread library functions are just a convenience so I did not have to
> hand code the futex and robust list handling.
> 
>> My confusion lies in how the parent/child share a robust list here, but they
>> obviously do. In my mind the mut_s would be different in the child/parent after
>> the fork and pthread_mutex_init (and friends) are done in the child.
> 
> They don't share a robust list, each thread has its own.
> 
> The shared mutex mut_s is initialized in the parent before fork and it's
> the same address in the child and it's not COWed because the mapping is
> MAP_SHARED.
> 
> The child allocates private memory and initializes the private mutex in
> that private mapping.
> 
> So now the child does:
> 
>    pthread_mutex_lock(mut_s);
> 
> That's the mutex in the memory shared with the parent. After that the
> child's robust list head points to mut_s::robust_list.
> 
> Now it does:
> 
>    pthread_mutex_lock(mut_p);
> 
> after that the child's robust list head points to mut_p::robust_list and
> mut_p::robust_list points to mut_s::robust_list.
> 
> So now the child unmaps the private memory and exits.
> 
> The kernel tries to walk the robust list pointer and faults when trying
> to access mut_p. End of walk and mut_s stays locked.
> 
> So now think about the OOM case. The killed process has a shared mapping
> with some other unrelated process (file, shmem) where mut_s sits.
> 
> It gets killed after:
> 		pthread_mutex_lock(mut_s);
> 		pthread_mutex_lock(mut_p);
> 
> So the OOM reaper rips the VMA which contains mut_p and therefore breaks
> the chain which is necessary to reach mut_s.
> 
> See?
Yes, thank you for the detailed explanation; the missing piece just clicked in
my head :)

Cheers,
-- Nico
> 
> Thanks,
> 
>         tglx
> 
> 
>
Nico Pache April 13, 2022, 4 p.m. UTC | #23
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ec38194f8e1..be6d65ead7ec 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -632,7 +632,7 @@ static void oom_reap_task(struct task_struct *tsk)
>  	 */
>  	set_bit(MMF_OOM_SKIP, &mm->flags);
>  
> -	/* Drop a reference taken by wake_oom_reaper */
> +	/* Drop a reference taken by queue_oom_reaper */
>  	put_task_struct(tsk);
>  }
>  
> @@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
>  		struct task_struct *tsk = NULL;
>  
>  		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
> -		spin_lock(&oom_reaper_lock);
> +		spin_lock_irq(&oom_reaper_lock);
>  		if (oom_reaper_list != NULL) {
>  			tsk = oom_reaper_list;
>  			oom_reaper_list = tsk->oom_reaper_list;
>  		}
> -		spin_unlock(&oom_reaper_lock);
> +		spin_unlock_irq(&oom_reaper_lock);
>  
>  		if (tsk)
>  			oom_reap_task(tsk);
> @@ -658,22 +658,50 @@ static int oom_reaper(void *unused)
>  	return 0;
>  }
>  
> -static void wake_oom_reaper(struct task_struct *tsk)
> +static void wake_oom_reaper_fn(struct timer_list *timer)
>  {
> -	/* mm is already queued? */
> -	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
> -		return;
> +	struct task_struct *tsk = container_of(timer, struct task_struct, oom_reaper_timer);
> +	struct mm_struct *mm = tsk->signal->oom_mm;
> +	unsigned long flags;
>  
> -	get_task_struct(tsk);
> +	/* The victim managed to terminate on its own - see exit_mmap */
> +	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
> +		put_task_struct(tsk);
> +		return;
> +	}
>  
> -	spin_lock(&oom_reaper_lock);
> +	spin_lock_irqsave(&oom_reaper_lock, flags);
>  	tsk->oom_reaper_list = oom_reaper_list;
>  	oom_reaper_list = tsk;
> -	spin_unlock(&oom_reaper_lock);
> +	spin_unlock_irqrestore(&oom_reaper_lock, flags);
>  	trace_wake_reaper(tsk->pid);
>  	wake_up(&oom_reaper_wait);
>  }
>  
> +/*
> + * Give OOM victims some head room to exit themselves. If they do not exit
> + * on their own the oom reaper is invoked.
> + * The timeout is basically arbitrary and there is no best value to use.
> + * The longer it will be the longer the worst case scenario OOM can
> + * take. The smaller the timeout the more likely the oom_reaper can get
> + * in the way and release resources which could be needed during the
> + * exit path - e.g. futex robust lists can sit in the anonymous memory
> + * which could be reaped and the exit path won't be able to let waiters
> + * know the holding task has terminated.
> + */
> +#define OOM_REAPER_DELAY (2*HZ)
> +static void queue_oom_reaper(struct task_struct *tsk)
> +{
> +	/* mm is already queued? */
> +	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
> +		return;
> +
> +	get_task_struct(tsk);
> +	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper_fn, 0);
> +	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	add_timer(&tsk->oom_reaper_timer);
> +}
> +
>  static int __init oom_init(void)
>  {
>  	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> @@ -681,7 +709,7 @@ static int __init oom_init(void)
>  }
>  subsys_initcall(oom_init)
>  #else
> -static inline void wake_oom_reaper(struct task_struct *tsk)
> +static inline void queue_oom_reaper(struct task_struct *tsk)
>  {
>  }
>  #endif /* CONFIG_MMU */
> @@ -932,7 +960,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  	rcu_read_unlock();
>  
>  	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> +		queue_oom_reaper(victim);
>  
>  	mmdrop(mm);
>  	put_task_struct(victim);
> @@ -968,7 +996,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	task_lock(victim);
>  	if (task_will_free_mem(victim)) {
>  		mark_oom_victim(victim);
> -		wake_oom_reaper(victim);
> +		queue_oom_reaper(victim);
>  		task_unlock(victim);
>  		put_task_struct(victim);
>  		return;
> @@ -1067,7 +1095,7 @@ bool out_of_memory(struct oom_control *oc)
>  	 */
>  	if (task_will_free_mem(current)) {
>  		mark_oom_victim(current);
> -		wake_oom_reaper(current);
> +		queue_oom_reaper(current);
>  		return true;
>  	}

Thanks for the code, Michal! It does seem to fix our issue. I will post it after
I finish running it through a few more test cases and our internal testing suites.

Cheers,
-- Nico

Patch

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..46355ff97abb 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -106,7 +106,7 @@  static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
+bool __oom_reap_task_mm(struct mm_struct *mm, void *skip_area);
 
 long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..6dee7ea60a13 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3109,6 +3109,11 @@  void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	void *robust_list = NULL;
+
+#ifdef CONFIG_FUTEX
+	robust_list = current->robust_list;
+#endif
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
@@ -3126,7 +3131,8 @@  void exit_mmap(struct mm_struct *mm)
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
 		 */
-		(void)__oom_reap_task_mm(mm);
+		(void)__oom_reap_task_mm(mm, robust_list);
+
 		set_bit(MMF_OOM_SKIP, &mm->flags);
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..2a7443a64909 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -509,9 +509,10 @@  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm, void *skip_area)
 {
 	struct vm_area_struct *vma;
+	unsigned long skip_vma = (unsigned long) skip_area;
 	bool ret = true;
 
 	/*
@@ -526,6 +527,20 @@  bool __oom_reap_task_mm(struct mm_struct *mm)
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
+#ifdef CONFIG_FUTEX
+		/*
+		 * The OOM reaper runs concurrently with do_exit.
+		 * The robust_list_head is stored in userspace and is required
+		 * by the exit path to recover the robust futex waiters.
+		 * Skip the VMA that contains the robust_list to allow for
+		 * proper cleanup.
+		 */
+		if (vma->vm_start <= skip_vma && vma->vm_end > skip_vma) {
+			pr_info("oom_reaper: skipping vma, contains robust_list\n");
+			continue;
+		}
+#endif
+
 		/*
 		 * Only anonymous pages have a good chance to be dropped
 		 * without additional steps which we cannot afford as we
@@ -567,6 +582,7 @@  bool __oom_reap_task_mm(struct mm_struct *mm)
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
+	void *robust_list = NULL;
 
 	if (!mmap_read_trylock(mm)) {
 		trace_skip_task_reaping(tsk->pid);
@@ -586,8 +602,11 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
+#ifdef CONFIG_FUTEX
+	robust_list = tsk->robust_list;
+#endif
 	/* failed to reap part of the address space. Try again later */
-	ret = __oom_reap_task_mm(mm);
+	ret = __oom_reap_task_mm(mm, robust_list);
 	if (!ret)
 		goto out_finish;
 
@@ -1149,6 +1168,7 @@  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	unsigned int f_flags;
 	bool reap = false;
 	long ret = 0;
+	void *robust_list = NULL;
 
 	if (flags)
 		return -EINVAL;
@@ -1186,11 +1206,16 @@  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		ret = -EINTR;
 		goto drop_mm;
 	}
+
+#ifdef CONFIG_FUTEX
+	robust_list = p->robust_list;
+#endif
 	/*
 	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
 	 * possible change in exit_mmap is seen
 	 */
-	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
+	if (!test_bit(MMF_OOM_SKIP, &mm->flags) &&
+			!__oom_reap_task_mm(mm, robust_list))
 		ret = -EAGAIN;
 	mmap_read_unlock(mm);