| Message ID | 20220408032809.3696798-1-npache@redhat.com (mailing list archive) |
|---|---|
| State | New |
| Series | [v8] oom_kill.c: futex: Don't OOM reap the VMA containing the robust_list_head |
On Thu, Apr 07, 2022 at 11:28:09PM -0400, Nico Pache wrote:
> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> be targeted by the oom reaper. This mapping is used to store the futex
> robust list head; the kernel does not keep a copy of the robust list and
> instead references a userspace address to maintain the robustness during
> a process death. A race can occur between exit_mm and the oom reaper that
> allows the oom reaper to free the memory of the futex robust list before
> the exit path has handled the futex death:
>
>     CPU1                             CPU2
>     ------------------------------------------------------------------------
>     page_fault
>     do_exit "signal"
>     wake_oom_reaper
>                                      oom_reaper
>                                      oom_reap_task_mm (invalidates mm)
>     exit_mm
>     exit_mm_release
>     futex_exit_release
>     futex_cleanup
>     exit_robust_list
>     get_user (EFAULT- can't access memory)
>
> If the get_user EFAULT's, the kernel will be unable to recover the
> waiters on the robust_list, leaving userspace mutexes hung indefinitely.
>
> Use the robust_list address stored in the kernel to skip the VMA that holds
> it, allowing a successful futex_cleanup.
>
> Theoretically a failure can still occur if there are locks mapped as
> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
> This patch only strengthens that best-effort.
>
> The following case can still fail:
> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)

This is still all sorts of confused.. it's a list head, the entries can
be in any random other VMA. You must not remove *any* user memory before
doing the robust thing. Not removing the VMA that contains the head is
pointless in the extreme.

Did you not read the previous discussion?
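For context, here is a minimal userspace sketch of the structure Peter is describing. Only the address of the robust_list_head is registered with the kernel (via the set_robust_list() syscall); the entries it chains to can sit in arbitrary other VMAs. The struct lock layout and the two-mapping setup below are illustrative assumptions for this sketch, not glibc's actual layout:

#define _GNU_SOURCE
#include <linux/futex.h>        /* struct robust_list, struct robust_list_head */
#include <sys/syscall.h>
#include <sys/mman.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical lock word preceded by its intrusive list link. */
struct lock {
        struct robust_list link;   /* chains into the per-thread robust list */
        unsigned int futex;        /* holds the owner TID while locked */
};

int main(void)
{
        /* Two lock words in two unrelated anonymous VMAs. */
        struct lock *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct lock *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        static struct robust_list_head head;

        head.futex_offset = offsetof(struct lock, futex);
        head.list_op_pending = NULL;

        /*
         * head -> a -> b -> head: the kernel only knows &head; the VMAs
         * holding a and b are reachable solely by walking the chain.
         */
        head.list.next = &a->link;
        a->link.next = &b->link;
        b->link.next = &head.list;

        a->futex = (unsigned int)gettid();   /* pretend both locks are held */
        b->futex = (unsigned int)gettid();

        /* On exit the kernel walks the chain and flags each held word. */
        return syscall(SYS_set_robust_list, &head, sizeof(head));
}

Reaping the VMA holding 'a' severs the chain even though the head itself survives, which is why skipping only the head's VMA does not help.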
On Fri, Apr 08 2022 at 10:15, Peter Zijlstra wrote:
> On Thu, Apr 07, 2022 at 11:28:09PM -0400, Nico Pache wrote:
>> Theoretically a failure can still occur if there are locks mapped as
>> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
>> This patch only strengthens that best-effort.
>>
>> The following case can still fail:
>> robust head (skipped) -> private lock (reaped) -> shared lock
>> (skipped)
>
> This is still all sorts of confused.. it's a list head, the entries can
> be in any random other VMA. You must not remove *any* user memory before
> doing the robust thing. Not removing the VMA that contains the head is
> pointless in the extreme.
>
> Did you not read the previous discussion?

Aside of that, we all agreed that the real problem here is the brute-force
cleanup: an oom-killed task should be given time to clean up itself instead
of being cleaned up immediately. Can we fix that first before adding broken
heuristics?

Thanks,

        tglx
On 4/8/22 04:15, Peter Zijlstra wrote:
> On Thu, Apr 07, 2022 at 11:28:09PM -0400, Nico Pache wrote:
>> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
>> be targeted by the oom reaper.
>> [...]
>> The following case can still fail:
>> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
>
> This is still all sorts of confused.. it's a list head, the entries can
> be in any random other VMA. You must not remove *any* user memory before
> doing the robust thing. Not removing the VMA that contains the head is
> pointless in the extreme.

Not sure how it's pointless if it fixes all the different reproducers we've
written for it. As for the private lock case we stated here, we haven't been
able to reproduce it, but I can see how it could be a potential issue (which
is why it's noted).

> Did you not read the previous discussion?

I did... That's why I added the blurb about best-effort and the case that can
theoretically still fail. The oom reaper doesn't reap shared memory, but
WITHOUT this change it was reaping the head (PRIVATE|ANON), which is needed to
get to those shared mappings; so shared locks are safe with this added change.

If a user implements private locks, they can only be used within that process.
If that process is OOMing, then it doesn't really matter what happens to the
futexes... all of the threads running under that process will terminate
anyway.

Perhaps I'm misunderstanding you.

-- Nico
On 4/8/22 04:37, Thomas Gleixner wrote:
> On Fri, Apr 08 2022 at 10:15, Peter Zijlstra wrote:
>> On Thu, Apr 07, 2022 at 11:28:09PM -0400, Nico Pache wrote:
>>> Theoretically a failure can still occur if there are locks mapped as
>>> PRIVATE|ANON; however, the robust futexes are a best-effort approach.
>>> This patch only strengthens that best-effort.
>>>
>>> The following case can still fail:
>>> robust head (skipped) -> private lock (reaped) -> shared lock
>>> (skipped)
>>
>> This is still all sorts of confused.. it's a list head, the entries can
>> be in any random other VMA. You must not remove *any* user memory before
>> doing the robust thing. Not removing the VMA that contains the head is
>> pointless in the extreme.
>>
>> Did you not read the previous discussion?
>
> Aside of that, we all agreed that the real problem here is the brute-force
> cleanup: an oom-killed task should be given time to clean up itself instead
> of being cleaned up immediately. Can we fix that first before adding broken
> heuristics?

We've tried multiple approaches to reproduce the case you are talking about
with no success... Why make a change for something that we can't reproduce
when we are sure this works for all the cases we've attempted?

I also don't see how this is a broken heuristic... If anything, adding a
delay is broken. How do we ensure the delay is long enough for the exit to
clean up the futexes? On a heavily contended CPU with high memory pressure,
the delay may also lead to other processes unnecessarily OOMing.

Cheers,
-- Nico
On Fri 08-04-22 04:52:33, Nico Pache wrote:
[...]
> On a heavily contended CPU with high memory pressure, the delay may also
> lead to other processes unnecessarily OOMing.

Let me just comment on this part because there is likely a confusion
involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
because the oom killing is throttled by the existence of a preexisting OOM
victim. In other words, as long as there is an alive victim, no further
victims are selected and the oom killer backs off. The oom_reaper will hide
the alive oom victim after it is processed. The longer the delay, the longer
an oom victim can block further progress, but it cannot really cause
unnecessary OOMing.
On 4/8/22 05:36, Michal Hocko wrote:
> On Fri 08-04-22 04:52:33, Nico Pache wrote:
> [...]
>> On a heavily contended CPU with high memory pressure, the delay may also
>> lead to other processes unnecessarily OOMing.
>
> Let me just comment on this part because there is likely a confusion
> involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
> because the oom killing is throttled by the existence of a preexisting OOM
> victim. In other words, as long as there is an alive victim, no further
> victims are selected and the oom killer backs off. The oom_reaper will hide
> the alive oom victim after it is processed. The longer the delay, the
> longer an oom victim can block further progress, but it cannot really
> cause unnecessary OOMing.

Is it not the case that if we delay an OOM, the amount of available memory
stays limited and other processes that are allocating memory can become OOM
candidates?

-- Nico
On Fri 08-04-22 05:40:09, Nico Pache wrote:
> On 4/8/22 05:36, Michal Hocko wrote:
>> On Fri 08-04-22 04:52:33, Nico Pache wrote:
>> [...]
>> Let me just comment on this part because there is likely a confusion
>> involved. Delaying the oom_reaper _cannot_ lead to additional OOM killing
>> because the oom killing is throttled by the existence of a preexisting
>> OOM victim.
>> [...]
> Is it not the case that if we delay an OOM, the amount of available memory
> stays limited and other processes that are allocating memory can become
> OOM candidates?

No. Have a look at oom_evaluate_task (the tsk_is_oom_victim check).
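For reference, the check Michal points at, abridged from oom_evaluate_task() in mm/oom_kill.c (5.18-era sources; details may differ between versions): an alive victim that has not yet been marked MMF_OOM_SKIP aborts the whole victim search instead of letting another task be selected.

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
        struct oom_control *oc = arg;

        if (oom_unkillable_task(task))
                goto next;
        /* ... cpuset/memcg eligibility checks elided ... */

        /*
         * This task already has access to memory reserves and is being
         * killed. Don't allow any other task to have access to the reserves
         * unless the task has MMF_OOM_SKIP because chances that it would
         * release any memory is quite low.
         */
        if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                        goto next;
                goto abort;
        }
        /* ... badness scoring and victim selection elided ... */
next:
        return 0;
abort:
        if (oc->chosen)
                put_task_struct(oc->chosen);
        oc->chosen = (void *)-1UL;  /* sentinel: stop the search, kill nobody new */
        return 1;
}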
On 4/8/22 05:59, Michal Hocko wrote:
> On Fri 08-04-22 05:40:09, Nico Pache wrote:
>> [...]
>> Is it not the case that if we delay an OOM, the amount of available
>> memory stays limited and other processes that are allocating memory can
>> become OOM candidates?
>
> No. Have a look at oom_evaluate_task (the tsk_is_oom_victim check).

Ok, I see.

Doesn't the delay then allow the system to run into the following case more
easily?

        pr_warn("Out of memory and no killable processes...\n");
        panic("System is deadlocked on memory\n");

If the system can't select another OOM candidate, the oom_reaper is delayed,
and the exit is blocked, then we panic.

-- Nico
On Fri 08-04-22 06:36:40, Nico Pache wrote:
> On 4/8/22 05:59, Michal Hocko wrote:
>> [...]
>> No. Have a look at oom_evaluate_task (the tsk_is_oom_victim check).
> Ok, I see.
>
> Doesn't the delay then allow the system to run into the following case
> more easily?
>
>         pr_warn("Out of memory and no killable processes...\n");
>         panic("System is deadlocked on memory\n");

No. Aborting the oom victim search (mentioned above) will cause
out_of_memory to bail out and return to the page allocator. As I've said,
the only problem with delaying the oom_reaper is that _iff_ the oom victim
cannot terminate on its own (because it is stuck somewhere in the kernel),
then the oom situation (be it global, cpuset or memcg) will take longer, so
allocating tasks will not be able to make forward progress.
On 4/8/22 06:51, Michal Hocko wrote:
> On Fri 08-04-22 06:36:40, Nico Pache wrote:
>> [...]
>> Doesn't the delay then allow the system to run into the following case
>> more easily?
>>
>>         pr_warn("Out of memory and no killable processes...\n");
>>         panic("System is deadlocked on memory\n");
>
> No. Aborting the oom victim search (mentioned above) will cause
> out_of_memory to bail out and return to the page allocator.

Ok, I see that now. I did my bit math incorrectly the first time around; I
thought abort led to the !oc->chosen case.

> The only problem with delaying the oom_reaper is that _iff_ the oom victim
> cannot terminate on its own (because it is stuck somewhere in the kernel),
> then the oom situation (be it global, cpuset or memcg) will take longer,
> so allocating tasks will not be able to make forward progress.

Ok, so if I understand that correctly, delaying can have some ugly effects
and kinda breaks the initial purpose of the OOM reaper?

I personally don't like the delay approach, especially if we have a better
one we know is working and that doesn't add regressions. If someone can
prove the private lock case to me, I'd be more willing to bite.

Thanks for all the OOM context :)

-- Nico
On Fri 08-04-22 07:26:07, Nico Pache wrote:
[...]
> Ok, so if I understand that correctly, delaying can have some ugly effects
> and kinda breaks the initial purpose of the OOM reaper?

No, not really. The primary objective of the oom_reaper is to _guarantee_
forward progress. It is not really meant to be an optimization to respond to
the oom killer faster. The reason the oom_reaper is kicked off right away is
that this was the simplest implementation.

> I personally don't like the delay approach, especially if we have a better
> one we know is working and that doesn't add regressions.

Well, I would say that handling the futex case more gracefully would be
preferable, but my understanding is that this is not all that easy. I am far
from being a futex expert, so I will leave that up to Thomas and Peter. On
the other hand, delaying the oom_reaper is rather straightforward, and I do
not think there is a big risk of regressions. Any QoS during OOM is simply
out of the window, and the main purpose of the reaper will be preserved with
a timeout as well. I also do agree with Thomas that this would cover 99% of
cases.

> If someone can prove the private lock case to me, I'd be more willing to
> bite.
>
> Thanks for all the OOM context :)

Welcome. The oom handling is a maze, and it is really easy to miss all the
subtlety and the conflicting requirements that are applied here.
On Fri, Apr 08 2022 at 04:41, Nico Pache wrote:
> On 4/8/22 04:15, Peter Zijlstra wrote:
>>> The following case can still fail:
>>> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
>>
>> This is still all sorts of confused.. it's a list head, the entries can
>> be in any random other VMA. You must not remove *any* user memory before
>> doing the robust thing. Not removing the VMA that contains the head is
>> pointless in the extreme.
> Not sure how it's pointless if it fixes all the different reproducers
> we've written for it. As for the private lock case we stated here, we
> haven't been able to reproduce it, but I can see how it could be a
> potential issue (which is why it's noted).

The below reproduces the problem nicely, i.e. the lock() in the parent times
out. So why would the OOM killer fail to cause the same problem when it
reaps the private anon mapping where the private futex sits?

If you revert the lock order in the child the robust muck works.

Thanks,

        tglx
---
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <time.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/mman.h>

static char n[4096];

int main(void)
{
        pthread_mutexattr_t mat_s, mat_p;
        pthread_mutex_t *mut_s, *mut_p;
        pthread_barrierattr_t ba;
        pthread_barrier_t *b;
        struct timespec to;
        void *pri, *shr;
        int r;

        shr = mmap(NULL, sizeof(n), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        pthread_mutexattr_init(&mat_s);
        pthread_mutexattr_setrobust(&mat_s, PTHREAD_MUTEX_ROBUST);
        mut_s = shr;
        pthread_mutex_init(mut_s, &mat_s);

        pthread_barrierattr_init(&ba);
        pthread_barrierattr_setpshared(&ba, PTHREAD_PROCESS_SHARED);
        b = shr + 1024;
        pthread_barrier_init(b, &ba, 2);

        if (!fork()) {
                pri = mmap(NULL, 1<<20, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                pthread_mutexattr_init(&mat_p);
                pthread_mutexattr_setpshared(&mat_p, PTHREAD_PROCESS_PRIVATE);
                pthread_mutexattr_setrobust(&mat_p, PTHREAD_MUTEX_ROBUST);
                mut_p = pri;
                pthread_mutex_init(mut_p, &mat_p);

                // With lock order s, p parent gets timeout
                // With lock order p, s parent gets owner died
                pthread_mutex_lock(mut_s);
                pthread_mutex_lock(mut_p);
                // Remove unmap and lock order does not matter
                munmap(pri, sizeof(n));
                pthread_barrier_wait(b);
                printf("child gone\n");
        } else {
                pthread_barrier_wait(b);
                printf("parent lock\n");
                clock_gettime(CLOCK_REALTIME, &to);
                to.tv_sec += 1;
                r = pthread_mutex_timedlock(mut_s, &to);
                printf("parent lock returned: %s\n", strerror(r));
        }
        return 0;
}
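For reference, the reproducer builds with a plain pthread link; the output below assumes glibc, where strerror(ETIMEDOUT) is "Connection timed out" (the first two lines can appear in either order, and the file name is arbitrary):

    $ gcc -o robust-repro robust-repro.c -pthread
    $ ./robust-repro
    child gone
    parent lock
    parent lock returned: Connection timed out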
Hi Nico,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]
[also build test WARNING on linus/master v5.18-rc1 next-20220408]
[cannot apply to linux/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Nico-Pache/oom_kill-c-futex-Don-t-OOM-reap-the-VMA-containing-the-robust_list_head/20220408-112952
base:   https://github.com/hnaz/linux-mm master
config: x86_64-randconfig-s022 (https://download.01.org/0day-ci/archive/20220408/202204082258.E7EPbAYz-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.2.0-19) 11.2.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/intel-lab-lkp/linux/commit/70c1e2a404ac47b7c9b8bd1e9c4d3e72f19e6c62
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nico-Pache/oom_kill-c-futex-Don-t-OOM-reap-the-VMA-containing-the-robust_list_head/20220408-112952
        git checkout 70c1e2a404ac47b7c9b8bd1e9c4d3e72f19e6c62
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> mm/oom_kill.c:606:21: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected void *robust_list @@ got struct robust_list_head [noderef] __user *robust_list @@
   mm/oom_kill.c:606:21: sparse:     expected void *robust_list
   mm/oom_kill.c:606:21: sparse:     got struct robust_list_head [noderef] __user *robust_list
   mm/oom_kill.c: note: in included file (through include/linux/rculist.h, include/linux/sched/signal.h, include/linux/oom.h):
   include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in 'find_lock_task_mm' - wrong count at exit
   mm/oom_kill.c:222:28: sparse: sparse: context imbalance in 'oom_badness' - unexpected unlock
   include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in 'dump_task' - unexpected unlock
   include/linux/rcupdate.h:726:9: sparse: sparse: context imbalance in '__oom_kill_process' - unexpected unlock
   mm/oom_kill.c:1243:21: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected void *robust_list @@ got struct robust_list_head [noderef] __user *robust_list @@
   mm/oom_kill.c:1243:21: sparse:     expected void *robust_list
   mm/oom_kill.c:1243:21: sparse:     got struct robust_list_head [noderef] __user *robust_list
   mm/oom_kill.c:1232:20: sparse: sparse: context imbalance in '__se_sys_process_mrelease' - unexpected unlock
--
>> mm/mmap.c:3138:21: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected void *robust_list @@ got struct robust_list_head [noderef] __user *robust_list @@
   mm/mmap.c:3138:21: sparse:     expected void *robust_list
   mm/mmap.c:3138:21: sparse:     got struct robust_list_head [noderef] __user *robust_list

vim +606 mm/oom_kill.c

   575
   576	/*
   577	 * Reaps the address space of the give task.
   578	 *
   579	 * Returns true on success and false if none or part of the address space
   580	 * has been reclaimed and the caller should retry later.
   581	 */
   582	static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
   583	{
   584		bool ret = true;
   585		void *robust_list = NULL;
   586
   587		if (!mmap_read_trylock(mm)) {
   588			trace_skip_task_reaping(tsk->pid);
   589			return false;
   590		}
   591
   592		/*
   593		 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
   594		 * work on the mm anymore. The check for MMF_OOM_SKIP must run
   595		 * under mmap_lock for reading because it serializes against the
   596		 * mmap_write_lock();mmap_write_unlock() cycle in exit_mmap().
   597		 */
   598		if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
   599			trace_skip_task_reaping(tsk->pid);
   600			goto out_unlock;
   601		}
   602
   603		trace_start_task_reaping(tsk->pid);
   604
   605	#ifdef CONFIG_FUTEX
 > 606		robust_list = tsk->robust_list;
   607	#endif
   608		/* failed to reap part of the address space. Try again later */
   609		ret = __oom_reap_task_mm(mm, robust_list);
   610		if (!ret)
   611			goto out_finish;
   612
   613		pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
   614			task_pid_nr(tsk), tsk->comm,
   615			K(get_mm_counter(mm, MM_ANONPAGES)),
   616			K(get_mm_counter(mm, MM_FILEPAGES)),
   617			K(get_mm_counter(mm, MM_SHMEMPAGES)));
   618	out_finish:
   619		trace_finish_task_reaping(tsk->pid);
   620	out_unlock:
   621		mmap_read_unlock(mm);
   622
   623		return ret;
   624	}
   625
> ---
> #include <errno.h>
> #include <fcntl.h>
> #include <pthread.h>
> [...]
>         if (!fork()) {
>                 pri = mmap(NULL, 1<<20, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>                 pthread_mutexattr_init(&mat_p);
>                 pthread_mutexattr_setpshared(&mat_p, PTHREAD_PROCESS_PRIVATE);
>                 pthread_mutexattr_setrobust(&mat_p, PTHREAD_MUTEX_ROBUST);

One thing I don't understand is what kind of sane use case relies on robust
futex for a process-private lock? Is there a purpose to a lock being on the
robust list if there are no other processes that must be woken in case the
holder process is killed?

If this usage serves no purpose besides causing races during oom, we should
discourage this use, perhaps by adding a note on the manpage.

>                 mut_p = pri;
>                 pthread_mutex_init(mut_p, &mat_p);
> [...]
On Fri, Apr 08 2022 at 12:13, Joel Savitz wrote:
>> if (!fork()) {
>>         pri = mmap(NULL, 1<<20, PROT_READ | PROT_WRITE,
>>                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>         pthread_mutexattr_init(&mat_p);
>>         pthread_mutexattr_setpshared(&mat_p, PTHREAD_PROCESS_PRIVATE);
>>         pthread_mutexattr_setrobust(&mat_p, PTHREAD_MUTEX_ROBUST);
> One thing I don't understand is what kind of sane use case relies on
> robust futex for a process-private lock?
> Is there a purpose to a lock being on the robust list if there are no
> other processes that must be woken in case the holder process is
> killed?

Ever heard about the concept of multi threading?

> If this usage serves no purpose besides causing races during oom, we
> should discourage this use, perhaps by adding a note on the manpage.

This usage does not cause races during oom. It would not even cause races if
it were silly, which it is not, except for the demonstrator above. The
keyword here is *demonstrator*.

The oom killer itself causes the race because it starts reaping the VMAs
without granting the target time to terminate. This needs to be fixed in the
first place, period. If the target can't terminate because it is stuck, then
yes, there will be fallout where a robust futex cannot be released, but
that's something which cannot be solved at all.

I'm really tired of this by now. Several people explained in great length
the shortcomings of your 'cure the symptom' approach, showed you that the
"impossible to reproduce" problem is real, and told you very explicitly what
the proper solution is.

So instead of sitting down and really tackling the root cause, all you can
come up with is to post the same 'cure the symptom' muck over and over and
then, once debunked, grasp for straws.

Coming back to your original question. What's the difference between a
process shared and a process private futex in the context of a multi
threaded process?

  - The process shared one must obviously have a shared mapping

  - The process private one has no need for a shared mapping because
    all threads share the same address space.

What do they have in common?

  - All of them are threads in the kernel's POV

  - All of them care about the unexpected exit/death of some other
    thread vs. locking

So why would a process private robust mutex be any different from a process
shared one? I'm sure you can answer that question yourself by now.

Thanks,

        tglx
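A minimal sketch of the multi-threaded case Thomas describes: a PTHREAD_PROCESS_PRIVATE robust mutex whose holder *thread* dies is recovered by a sibling thread via EOWNERDEAD, and that recovery only works because the kernel walked the dead thread's robust list on exit. This is an illustration (build with -pthread), not part of the posted patch:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t mut;

static void *holder(void *arg)
{
        pthread_mutex_lock(&mut);
        return NULL;            /* thread dies while holding the lock */
}

int main(void)
{
        pthread_mutexattr_t ma;
        pthread_t t;
        int r;

        pthread_mutexattr_init(&ma);
        pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_PRIVATE);
        pthread_mutexattr_setrobust(&ma, PTHREAD_MUTEX_ROBUST);
        pthread_mutex_init(&mut, &ma);

        pthread_create(&t, NULL, holder, NULL);
        pthread_join(t, NULL);

        /*
         * The kernel marked the futex word FUTEX_OWNER_DIED when the holder
         * thread exited, so this returns EOWNERDEAD ("Owner died") instead
         * of hanging forever.
         */
        r = pthread_mutex_lock(&mut);
        printf("lock returned: %s\n", strerror(r));
        if (r == EOWNERDEAD) {
                pthread_mutex_consistent(&mut);
                pthread_mutex_unlock(&mut);
        }
        return 0;
}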
On Fri 08-04-22 23:41:11, Thomas Gleixner wrote:
[...]
> Coming back to your original question. What's the difference between a
> process shared and a process private futex in the context of a multi
> threaded process?
>
>   - The process shared one must obviously have a shared mapping
>
>   - The process private one has no need for a shared mapping because
>     all threads share the same address space.
>
> What do they have in common?
>
>   - All of them are threads in the kernel's POV
>
>   - All of them care about the unexpected exit/death of some other
>     thread vs. locking
>
> So why would a process private robust mutex be any different from a
> process shared one?

Purely from the OOM POV they are slightly different, because the OOM killer
always kills all threads which share the mm with the selected victim (with
the exception of the global init - see __oom_kill_process). Note that this
includes those threads which are not sharing signal handling. So clobbering
private locks shouldn't be observable to an alive thread, unless I am
missing something.

On the other hand, I do agree that delayed oom_reaper execution is a
reasonable workaround and the most simplistic one. If I understand your
example code, then we would need to evaluate the whole robust list, and that
is simply not feasible because it would require a #PF in the general case.

HTH
Michal,

On Mon, Apr 11 2022 at 08:48, Michal Hocko wrote:
> On Fri 08-04-22 23:41:11, Thomas Gleixner wrote:
>> So why would a process private robust mutex be any different from a
>> process shared one?
>
> Purely from the OOM POV they are slightly different, because the OOM
> killer always kills all threads which share the mm with the selected
> victim (with the exception of the global init - see __oom_kill_process).
> Note that this includes those threads which are not sharing signal
> handling.
> So clobbering private locks shouldn't be observable to an alive thread,
> unless I am missing something.

Yes, it kills everything, but the reaper also reaps non-shared VMAs. So if
the process private futex sits in a reaped VMA, the shared one becomes
unreachable.

> On the other hand, I do agree that delayed oom_reaper execution is a
> reasonable workaround and the most simplistic one.

I think it's more than a workaround. It's a reasonable expectation that the
kernel side of the user space threads can mop up the mess the user space
part created. So even if one of N threads is stuck in a place where it
can't, then the other N-1 can still reach do_exit() and mop their mess up.

The oom reaper is the last resort to resolve the situation in case of a
stuck task. No?

> If I understand your example code, then we would need to evaluate the
> whole robust list, and that is simply not feasible because it would
> require a #PF in the general case.

Right. The robust list exit code does the user access with pagefaults
disabled, and if it fails, it terminates the list walk. Bad luck :)

Thanks,

        tglx
On Mon 11-04-22 09:47:14, Thomas Gleixner wrote:
> Michal,
>
> On Mon, Apr 11 2022 at 08:48, Michal Hocko wrote:
> [...]
>> On the other hand, I do agree that delayed oom_reaper execution is a
>> reasonable workaround and the most simplistic one.
>
> I think it's more than a workaround. It's a reasonable expectation that
> the kernel side of the user space threads can mop up the mess the user
> space part created. So even if one of N threads is stuck in a place where
> it can't, then the other N-1 can still reach do_exit() and mop their mess
> up.
>
> The oom reaper is the last resort to resolve the situation in case of a
> stuck task. No?

Yes, I keep saying workaround because it really doesn't address the
underlying issue, which is that the oom_reaper clobbers something it
shouldn't be touching. A full fix from my POV would be making the oom_reaper
code more aware of the futex usage, but that is not really viable.

Btw. this is what I have in my local tree. It hasn't seen any testing, but
it might be a good start to make it a full patch. I have decided to use a
timer rather than juggling tasks on the oom_reaper list because the initial
implementation looked uglier. I will try to find some time to finish that,
but if Nico or others beat me to it I won't complain. Also, I absolutely do
not insist on the timer approach.

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff6901dcb06d..528806ad6e6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1446,6 +1446,7 @@ struct task_struct {
 	int pagefault_disabled;
 #ifdef CONFIG_MMU
 	struct task_struct *oom_reaper_list;
+	struct timer_list oom_reaper_timer;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct *stack_vm_area;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..be6d65ead7ec 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -632,7 +632,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	 */
 	set_bit(MMF_OOM_SKIP, &mm->flags);
 
-	/* Drop a reference taken by wake_oom_reaper */
+	/* Drop a reference taken by queue_oom_repaer */
 	put_task_struct(tsk);
 }
 
@@ -644,12 +644,12 @@ static int oom_reaper(void *unused)
 		struct task_struct *tsk = NULL;
 
 		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
-		spin_lock(&oom_reaper_lock);
+		spin_lock_irq(&oom_reaper_lock);
 		if (oom_reaper_list != NULL) {
 			tsk = oom_reaper_list;
 			oom_reaper_list = tsk->oom_reaper_list;
 		}
-		spin_unlock(&oom_reaper_lock);
+		spin_unlock_irq(&oom_reaper_lock);
 
 		if (tsk)
 			oom_reap_task(tsk);
@@ -658,22 +658,50 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper_fn(struct timer_list *timer)
 {
-	/* mm is already queued? */
-	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
-		return;
+	struct task_struct *tsk = container_of(timer, struct task_struct, oom_reaper_timer);
+	struct mm_struct *mm = tsk->signal->oom_mm;
+	unsigned long flags;
 
-	get_task_struct(tsk);
+	/* The victim managed to terminate on its own - see exit_mmap */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+		put_task_struct(tsk);
+		return;
+	}
 
-	spin_lock(&oom_reaper_lock);
+	spin_lock_irqsave(&oom_reaper_lock, flags);
 	tsk->oom_reaper_list = oom_reaper_list;
 	oom_reaper_list = tsk;
-	spin_unlock(&oom_reaper_lock);
+	spin_unlock_irqrestore(&oom_reaper_lock, flags);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
 }
 
+/*
+ * Give OOM victims some head room to exit themselves. If they do not exit
+ * on their own the oom reaper is invoked.
+ * The timeout is basically arbitrary and there is no best value to use.
+ * The longer it will be the longer the worst case scenario OOM can
+ * take. The smaller the timeout the more likely the oom_reaper can get
+ * into the way and release resources which could be needed during the
+ * exit path - e.g. futex robust lists can sit in the anonymous memory
+ * which could be reaped and the exit path won't be able to let waiters
+ * know the holding task has terminated.
+ */
+#define OOM_REAPER_DELAY (2*HZ)
+static void queue_oom_repaer(struct task_struct *tsk)
+{
+	/* mm is already queued? */
+	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
+		return;
+
+	get_task_struct(tsk);
+	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper_fn, 0);
+	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	add_timer(&tsk->oom_reaper_timer);
+}
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
@@ -681,7 +709,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_repaer(struct task_struct *tsk)
 {
 }
 #endif /* CONFIG_MMU */
@@ -932,7 +960,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	rcu_read_unlock();
 
 	if (can_oom_reap)
-		wake_oom_reaper(victim);
+		queue_oom_repaer(victim);
 
 	mmdrop(mm);
 	put_task_struct(victim);
@@ -968,7 +996,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	task_lock(victim);
 	if (task_will_free_mem(victim)) {
 		mark_oom_victim(victim);
-		wake_oom_reaper(victim);
+		queue_oom_repaer(victim);
 		task_unlock(victim);
 		put_task_struct(victim);
 		return;
@@ -1067,7 +1095,7 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		queue_oom_repaer(current);
 		return true;
 	}
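For comparison, the same grace period could also be expressed with a delayed work item instead of an open-coded timer. The sketch below is only that, a sketch: the oom_reaper_dwork field is hypothetical (not part of the posted patch), and the callback would then run in process context rather than timer context, which would make the irq-safe lock conversion above unnecessary.

#include <linux/workqueue.h>

#define OOM_REAPER_DELAY (2 * HZ)

static void oom_reaper_work_fn(struct work_struct *work)
{
	/* 'oom_reaper_dwork' is an assumed task_struct field for this sketch. */
	struct task_struct *tsk = container_of(to_delayed_work(work),
					       struct task_struct,
					       oom_reaper_dwork);

	/* Same escape hatch as the timer version: victim exited on its own. */
	if (test_bit(MMF_OOM_SKIP, &tsk->signal->oom_mm->flags)) {
		put_task_struct(tsk);
		return;
	}
	/* ... otherwise splice tsk onto oom_reaper_list and wake the reaper ... */
}

static void queue_oom_reaper(struct task_struct *tsk)
{
	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
		return;

	get_task_struct(tsk);
	INIT_DELAYED_WORK(&tsk->oom_reaper_dwork, oom_reaper_work_fn);
	schedule_delayed_work(&tsk->oom_reaper_dwork, OOM_REAPER_DELAY);
}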
On 4/8/22 09:54, Thomas Gleixner wrote:
> On Fri, Apr 08 2022 at 04:41, Nico Pache wrote:
>> On 4/8/22 04:15, Peter Zijlstra wrote:
>>>> The following case can still fail:
>>>> robust head (skipped) -> private lock (reaped) -> shared lock (skipped)
>>>
>>> This is still all sorts of confused.. it's a list head, the entries can
>>> be in any random other VMA. You must not remove *any* user memory before
>>> doing the robust thing. Not removing the VMA that contains the head is
>>> pointless in the extreme.
>> Not sure how it's pointless if it fixes all the different reproducers
>> we've written for it. As for the private lock case we stated here, we
>> haven't been able to reproduce it, but I can see how it could be a
>> potential issue (which is why it's noted).
>
> The below reproduces the problem nicely, i.e. the lock() in the parent
> times out. So why would the OOM killer fail to cause the same problem
> when it reaps the private anon mapping where the private futex sits?
>
> If you revert the lock order in the child the robust muck works.

Thanks for the reproducer, Thomas :)

I think I need to re-up my knowledge around COW and how it affects the
stack. There are increased oddities when you add the pthread library that I
can't fully wrap my head around at the moment.

My confusion lies in how the parent/child share a robust list here, but they
obviously do. In my mind the mut_s would be different in the child/parent
after the fork, and pthread_mutex_init (and friends) are done in the child.

Thanks!
-- Nico

> Thanks,
>
>         tglx
> ---
> [...]
On 4/11/22 05:08, Michal Hocko wrote:
> On Mon 11-04-22 09:47:14, Thomas Gleixner wrote:
>> [...]
>> The oom reaper is the last resort to resolve the situation in case of a
>> stuck task. No?
>
> Yes, I keep saying workaround because it really doesn't address the
> underlying issue, which is that the oom_reaper clobbers something it
> shouldn't be touching. A full fix from my POV would be making the
> oom_reaper code more aware of the futex usage, but that is not really
> viable.

This is *kinda* what this approach is doing, but as Thomas has pointed out,
it has its shortcomings. Additionally, it has just come to my attention that
this solution does not cover the compat robust list... so there is yet
another shortcoming.

> Btw. this is what I have in my local tree. It hasn't seen any testing, but
> it might be a good start to make it a full patch. I have decided to use a
> timer rather than juggling tasks on the oom_reaper list because the
> initial implementation looked uglier. I will try to find some time to
> finish that, but if Nico or others beat me to it I won't complain. Also,
> I absolutely do not insist on the timer approach.
> [...]

I will spend tomorrow working on the delay solution and testing it. Thanks
for starting it :)

I appreciate the comments and help from everyone that has participated! I'm
sorry if any misunderstandings were had; it's not our intention to upset
anyone, but rather to learn and work out a solution for the problem we are
facing.

Best,
-- Nico
On Mon, Apr 11 2022 at 19:51, Nico Pache wrote:
> On 4/8/22 09:54, Thomas Gleixner wrote:
>> The below reproduces the problem nicely, i.e. the lock() in the parent
>> times out. So why would the OOM killer fail to cause the same problem
>> when it reaps the private anon mapping where the private futex sits?
>>
>> If you revert the lock order in the child the robust muck works.
>
> Thanks for the reproducer, Thomas :)
>
> I think I need to re-up my knowledge around COW and how it affects the
> stack. There are increased oddities when you add the pthread library that
> I can't fully wrap my head around at the moment.

The pthread library functions are just convenience so I did not have to hand
code the futex and robust list handling.

> My confusion lies in how the parent/child share a robust list here, but
> they obviously do. In my mind the mut_s would be different in the
> child/parent after the fork, and pthread_mutex_init (and friends) are done
> in the child.

They don't share a robust list; each thread has its own.

The shared mutex mut_s is initialized in the parent before fork. It's the
same address in the child, and it's not COWed because the mapping is
MAP_SHARED.

The child allocates private memory and initializes the private mutex in that
private mapping.

So now the child does:

    pthread_mutex_lock(mut_s);

That's the mutex in the memory shared with the parent. After that, the
child's robust list head points to mut_s::robust_list.

Now it does:

    pthread_mutex_lock(mut_p);

after which the child's robust list head points to mut_p::robust_list, and
mut_p::robust_list points to mut_s::robust_list.

So now the child unmaps the private memory and exits.

The kernel tries to walk the robust list pointer and faults when trying to
access mut_p. End of walk, and mut_s stays locked.

So now think about the OOM case. The killed process has a shared mapping
with some other unrelated process (file, shmem) where mut_s sits.

It gets killed after:

    pthread_mutex_lock(mut_s);
    pthread_mutex_lock(mut_p);

So the OOM reaper rips the VMA which contains mut_p and therefore breaks the
chain which is necessary to reach mut_s.

See?

Thanks,

        tglx
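The exit-time walk Thomas describes, heavily simplified from exit_robust_list() in kernel/futex/core.c (the real code masks a PI bit out of each pointer, handles the list_op_pending entry, and caps the number of iterations; the handle_futex_death() call is shown with simplified flags): the first userspace access that faults ends the walk, stranding every lock behind the unreachable entry.

static void exit_robust_list(struct task_struct *curr)
{
	struct robust_list_head __user *head = curr->robust_list;
	struct robust_list __user *entry, *next;
	unsigned long futex_offset;

	/* Both reads fault if the reaper already unmapped this memory. */
	if (get_user(entry, &head->list.next))
		return;
	if (get_user(futex_offset, &head->futex_offset))
		return;

	while (entry != &head->list) {
		/* A single unreachable entry ends the whole walk... */
		if (get_user(next, &entry->next))
			return;
		/*
		 * ...otherwise mark the lock word FUTEX_OWNER_DIED and wake
		 * one waiter.
		 */
		handle_futex_death((void __user *)entry + futex_offset,
				   curr, false, false);
		entry = next;
	}
}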
On 4/12/22 12:20, Thomas Gleixner wrote:
> On Mon, Apr 11 2022 at 19:51, Nico Pache wrote:
> [...]
> So now think about the OOM case. The killed process has a shared mapping
> with some other unrelated process (file, shmem) where mut_s sits.
>
> It gets killed after:
>
>     pthread_mutex_lock(mut_s);
>     pthread_mutex_lock(mut_p);
>
> So the OOM reaper rips the VMA which contains mut_p and therefore breaks
> the chain which is necessary to reach mut_s.
>
> See?

Yes, thank you for the detailed explanation; the missing piece just clicked
in my head :)

Cheers,
-- Nico
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ec38194f8e1..be6d65ead7ec 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> [...]

Thanks for the code, Michal -- it does seem to fix our issue! I will post it
after I finish running it through a few more test cases and our internal
testing suites.

Cheers,
-- Nico
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..46355ff97abb 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -106,7 +106,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
+bool __oom_reap_task_mm(struct mm_struct *mm, void *skip_area);
 
 long oom_badness(struct task_struct *p, unsigned long totalpages);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..6dee7ea60a13 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3109,6 +3109,11 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	void *robust_list = NULL;
+
+#ifdef CONFIG_FUTEX
+	robust_list = current->robust_list;
+#endif
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
@@ -3126,7 +3131,8 @@ void exit_mmap(struct mm_struct *mm)
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
 		 */
-		(void)__oom_reap_task_mm(mm);
+		(void)__oom_reap_task_mm(mm, robust_list);
+
 		set_bit(MMF_OOM_SKIP, &mm->flags);
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..2a7443a64909 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -509,9 +509,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm, void *skip_area)
 {
 	struct vm_area_struct *vma;
+	unsigned long skip_vma = (unsigned long) skip_area;
 	bool ret = true;
 
 	/*
@@ -526,6 +527,20 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
+#ifdef CONFIG_FUTEX
+		/*
+		 * The OOM reaper runs concurrently with do_exit.
+		 * The robust_list_head is stored in userspace and is required
+		 * by the exit path to recover the robust futex waiters.
+		 * Skip the VMA that contains the robust_list to allow for
+		 * proper cleanup.
+		 */
+		if (vma->vm_start <= skip_vma && vma->vm_end > skip_vma) {
+			pr_info("oom_reaper: skipping vma, contains robust_list");
+			continue;
+		}
+#endif
+
 		/*
 		 * Only anonymous pages have a good chance to be dropped
 		 * without additional steps which we cannot afford as we
@@ -567,6 +582,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
+	void *robust_list = NULL;
 
 	if (!mmap_read_trylock(mm)) {
 		trace_skip_task_reaping(tsk->pid);
@@ -586,8 +602,11 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
+#ifdef CONFIG_FUTEX
+	robust_list = tsk->robust_list;
+#endif
 	/* failed to reap part of the address space. Try again later */
-	ret = __oom_reap_task_mm(mm);
+	ret = __oom_reap_task_mm(mm, robust_list);
 	if (!ret)
 		goto out_finish;
 
@@ -1149,6 +1168,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	unsigned int f_flags;
 	bool reap = false;
 	long ret = 0;
+	void *robust_list = NULL;
 
 	if (flags)
 		return -EINVAL;
@@ -1186,11 +1206,16 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		ret = -EINTR;
 		goto drop_mm;
 	}
+
+#ifdef CONFIG_FUTEX
+	robust_list = p->robust_list;
+#endif
 	/*
 	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
 	 * possible change in exit_mmap is seen
 	 */
-	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
+	if (!test_bit(MMF_OOM_SKIP, &mm->flags) &&
+	    !__oom_reap_task_mm(mm, robust_list))
 		ret = -EAGAIN;
 	mmap_read_unlock(mm);
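For completeness, the per-thread head address this patch reads from tsk->robust_list is also observable from userspace via the get_robust_list() syscall (pid 0 means the calling thread); glibc registers a head for every thread at startup, so this small sketch prints a pointer into the pthread TCB:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        struct robust_list_head *head;
        size_t len;

        /* Ask the kernel which head address it has on file for us. */
        if (syscall(SYS_get_robust_list, 0, &head, &len))
                perror("get_robust_list");
        else
                printf("robust list head: %p (len %zu)\n", (void *)head, len);
        return 0;
}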