Message ID | alpine.DEB.2.21.1806141339580.4543@chino.kir.corp.google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu 14-06-18 13:42:59, David Rientjes wrote: > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > it cannot reap an mm. This can happen for a variety of reasons, > including: > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > - when the mm has blockable mmu notifiers that could cause the oom reaper > to stall indefinitely, > > but we can also add a third when the oom reaper can "reap" an mm but doing > so is unlikely to free any amount of memory: > > - when the mm's memory is fully mlocked. > > When all memory is mlocked, the oom reaper will not be able to free any > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > unmap and free its memory in exit_mmap() and subsequent oom victims are > chosen unnecessarily. This is trivial to reproduce if all eligible > processes on the system have mlocked their memory: the oom killer calls > panic() even though forward progress can be made. > > This is the same issue where the exit path sets MMF_OOM_SKIP before > unmapping memory and additional processes can be chosen unnecessarily > because the oom killer is racing with exit_mmap(). > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > a true oom livelock in progress, it never gets set and no additional > killing is possible. > > To fix this, this patch introduces a per-mm reaping timeout, initially set > at 10s. It requires that the oom reaper's list becomes a properly linked > list so that other mm's may be reaped while waiting for an mm's timeout to > expire. > > This replaces the current timeouts in the oom reaper: (1) when trying to > grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2) > a HZ sleep if there are blockable mmu notifiers. It extends it with > timeout to allow an oom victim to reach exit_mmap() before choosing > additional processes unnecessarily. > > The exit path will now set MMF_OOM_SKIP only after all memory has been > freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to > determine when it can race with the oom reaper. > > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has > lapsed because it can no longer guarantee forward progress. > > The reaping timeout is intentionally set for a substantial amount of time > since oom livelock is a very rare occurrence and it's better to optimize > for preventing additional (unnecessary) oom killing than a scenario that > is much more unlikely. > > Signed-off-by: David Rientjes <rientjes@google.com> Nacked-by: Michal Hocko <mhocko@suse.com> as already explained elsewhere in this email thread. > --- > Note: I understand there is an objection based on timeout based delays. > This is currently the only possible way to avoid oom killing important > processes completely unnecessarily. If the oom reaper can someday free > all memory, including mlocked memory and those mm's with blockable mmu > notifiers, and is guaranteed to always be able to grab mm->mmap_sem, > this can be removed. I do not believe any such guarantee is possible > and consider the massive killing of additional processes unnecessarily > to be a regression introduced by the oom reaper and its very quick > setting of MMF_OOM_SKIP to allow additional processes to be oom killed. If you find oom reaper more harmful than useful I would be willing to ack a command line option to disable it. Especially when you keep claiming that the lockups are not really happening in your environment.
Other than that I've already pointed to a more robust solution. If you are reluctant to try it out I will do, but introducing a timeout is just papering over the real problem. Maybe we will not reach the state that _all_ the memory is reapable but we definitely should try to make as much as possible to be reapable and I do not see any fundamental problems in that direction.
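For context on what such an opt-out might look like, here is a minimal sketch of a kernel command line toggle for the reaper. Both the `oom_reaper=off` parameter and the `oom_reaper_disabled` flag are invented for illustration; nothing like this exists in the tree, it only shows the shape of what is being offered above.

```c
/*
 * Hypothetical sketch only: the parameter name and the flag are made up
 * to illustrate the "command line option to disable it" mentioned above.
 */
static bool oom_reaper_disabled __read_mostly;

static int __init oom_reaper_param(char *str)
{
	if (str && !strcmp(str, "off"))
		oom_reaper_disabled = true;
	return 1;
}
__setup("oom_reaper=", oom_reaper_param);

static void wake_oom_reaper(struct task_struct *tsk)
{
	if (oom_reaper_disabled)
		return;		/* leave the victim to exit on its own */
	/* ... existing queueing of tsk for the oom_reaper kthread ... */
}
```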
On Fri, 15 Jun 2018, Michal Hocko wrote: > > Signed-off-by: David Rientjes <rientjes@google.com> > > Nacked-by: Michal Hocko <mhocko@suse.com> > as already explained elsewhere in this email thread. > I don't find this to be surprising, but I'm not sure that it actually matters if you won't fix a regression that you introduced. Tetsuo initially found this issue and presented a similar solution, so I think his feedback on this is more important since it would fix a problem for him as well. > > --- > > Note: I understand there is an objection based on timeout based delays. > > This is currently the only possible way to avoid oom killing important > > processes completely unnecessarily. If the oom reaper can someday free > > all memory, including mlocked memory and those mm's with blockable mmu > > notifiers, and is guaranteed to always be able to grab mm->mmap_sem, > > this can be removed. I do not believe any such guarantee is possible > > and consider the massive killing of additional processes unnecessarily > > to be a regression introduced by the oom reaper and its very quick > > setting of MMF_OOM_SKIP to allow additional processes to be oom killed. > > If you find oom reaper more harmful than useful I would be willing to > ack a command line option to disable it. Especially when you keep > claiming that the lockups are not really happening in your environment. > There's no need to disable it; we simply need to ensure that it doesn't set MMF_OOM_SKIP too early, which my patch does. We also need to avoid setting MMF_OOM_SKIP in exit_mmap() until after all memory has been freed, i.e. after free_pgtables(). I'd be happy to make this timeout configurable, however, and default it to perhaps one second as the blockable mmu notifier timeout in your own code does. I find it somewhat sad that we'd need a sysctl for this, but if that will appease you and it will help to move this into -mm then we can do that. > Other than that I've already pointed to a more robust solution. If you > are reluctant to try it out I will do, but introducing a timeout is just > papering over the real problem. Maybe we will not reach the state that > _all_ the memory is reapable but we definitely should try to make as > much as possible to be reapable and I do not see any fundamental > problems in that direction. You introduced the timeout already; I'm sure you realized yourself that the oom reaper sets MMF_OOM_SKIP much too early. Trying to grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between is a timeout. If there are blockable mmu notifiers, your code puts the oom reaper to sleep for HZ before setting MMF_OOM_SKIP, which is a timeout. This patch moves the timeout to reaching exit_mmap() where we actually free all memory possible and still allow for additional oom killing if there is a very rare oom livelock. You haven't provided any data that suggests oom livelocking isn't a very rare event and that we need to respond immediately by randomly killing more and more processes rather than wait a bounded period of time to allow for forward progress to be made. I have constantly provided data showing oom livelock in our fleet is extremely rare, less than 0.04% of the time. Yet your solution is to kill many processes so this 0.04% is fast. The reproducer on powerpc is very simple. Do an mmap() and mlock() the length. Fork one 120MB process that does that and two 60MB processes that do that in a 128MB memcg.
[ 402.064375] Killed process 17024 (a.out) total-vm:134080kB, anon-rss:122032kB, file-rss:1600kB
[ 402.107521] Killed process 17026 (a.out) total-vm:64448kB, anon-rss:44736kB, file-rss:1600kB

Completely reproducible and completely unnecessary: two processes are killed pointlessly when the first oom kill would have been successful. Killing processes is important; optimizing for the 0.04% of cases of true oom livelock by insisting everybody tolerate excessive oom killing is not. If you have data to suggest the 0.04% is higher, please present it. I'd be interested in any data you have that suggests it's higher and has even 1/1,000,000th the oom occurrence rate that I have shown. It's inappropriate to merge code that oom kills many processes unnecessarily when one happens to be mlocked or have blockable mmu notifiers or when mm->mmap_sem can't be grabbed fast enough but forward progress is actually being made. It's a regression, and it impacts real users. Insisting that we fix the problem you introduced by making all mmu notifiers unblockable, ensuring mlocked memory can always be reaped, and guaranteeing mm->mmap_sem can always be grabbed within a second is irresponsible.
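A userspace sketch of the reproducer described above, reconstructed from the email (the original test program is not part of the thread): mmap() and mlock() a given amount of anonymous memory and hold it. Running one 120MB instance and two 60MB instances inside a memcg with a 128MB limit should produce the back-to-back kills shown in the log.

```c
/* repro.c: mlock a given number of MB of anonymous memory and hold it. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t len = (argc > 1 ? atol(argv[1]) : 120) << 20;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED || mlock(p, len)) {
		perror("mmap/mlock");
		return 1;
	}
	memset(p, 1, len);	/* make sure every page is resident and charged */
	pause();		/* sit on the mlocked memory until oom killed */
	return 0;
}
```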
On Thu, 14 Jun 2018 13:42:59 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > it cannot reap an mm. This can happen for a variety of reasons, > including: > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > - when the mm has blockable mmu notifiers that could cause the oom reaper > to stall indefinitely, Maybe we should have more than one oom reaper thread? I assume the probability of the oom reaper thread blocking on an mmu notifier is small, so perhaps just dive in and hope for the best. If the oom reaper gets stuck then there's another thread ready to take over. And revisit the decision to use a kernel thread instead of workqueues. > but we can also add a third when the oom reaper can "reap" an mm but doing > so is unlikely to free any amount of memory: > > - when the mm's memory is fully mlocked. > > When all memory is mlocked, the oom reaper will not be able to free any > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > unmap and free its memory in exit_mmap() and subsequent oom victims are > chosen unnecessarily. This is trivial to reproduce if all eligible > processes on the system have mlocked their memory: the oom killer calls > panic() even though forward progress can be made. > > This is the same issue where the exit path sets MMF_OOM_SKIP before > unmapping memory and additional processes can be chosen unnecessarily > because the oom killer is racing with exit_mmap(). So what's actually happening here. A process has a large amount of mlocked memory, it has been oom-killed and it is in the process of releasing its memory and exiting, yes? If so, why does this task set MMF_OOM_SKIP on itself? Why aren't we just patiently waiting for its attempt to release memory? > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > a true oom livelock in progress, it never gets set and no additional > killing is possible. I guess that's my answer. What causes this livelock? Process looping in alloc_pages while holding a lock the oom victim wants? > To fix this, this patch introduces a per-mm reaping timeout, initially set > at 10s. It requires that the oom reaper's list becomes a properly linked > list so that other mm's may be reaped while waiting for an mm's timeout to > expire. > > This replaces the current timeouts in the oom reaper: (1) when trying to > grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2) > a HZ sleep if there are blockable mmu notifiers. It extends it with > timeout to allow an oom victim to reach exit_mmap() before choosing > additional processes unnecessarily. > > The exit path will now set MMF_OOM_SKIP only after all memory has been > freed, so additional oom killing is justified, That seems sensible, but why set MMF_OOM_SKIP at all? > and rely on MMF_UNSTABLE to > determine when it can race with the oom reaper. > > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has > lapsed because it can no longer guarantee forward progress. > > The reaping timeout is intentionally set for a substantial amount of time > since oom livelock is a very rare occurrence and it's better to optimize > for preventing additional (unnecessary) oom killing than a scenario that > is much more unlikely. What happened to the old idea of permitting the task which is blocking the oom victim to access additional reserves?
Come to that, what happened to the really really old Andrea idea of not looping in the page allocator anyway? Return NULL instead... I dunno, I'm thrashing around here. We seem to be piling mess on top of mess and then being surprised that the result is a mess. > +#ifdef CONFIG_MMU > + /* When to give up on oom reaping this mm */ > + unsigned long reap_timeout; "timeout" implies "interval". To me, anyway. This is an absolute time, so something like reap_time would be clearer. Along with a comment explaining that the units are in jiffies. > +#endif > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS > pgtable_t pmd_huge_pte; /* protected by page_table_lock */ > #endif > diff --git a/include/linux/sched.h b/include/linux/sched.h > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1163,7 +1163,7 @@ struct task_struct { > #endif > int pagefault_disabled; > #ifdef CONFIG_MMU > - struct task_struct *oom_reaper_list; > + struct list_head oom_reap_list; Can we have a comment explaining its locking. > #endif > #ifdef CONFIG_VMAP_STACK > struct vm_struct *stack_vm_area; > > ... > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm) > if (unlikely(mm_is_oom_victim(mm))) { > /* > * Manually reap the mm to free as much memory as possible. > - * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard > - * this mm from further consideration. Taking mm->mmap_sem for > - * write after setting MMF_OOM_SKIP will guarantee that the oom > - * reaper will not run on this mm again after mmap_sem is > - * dropped. > + * Then, set MMF_UNSTABLE to avoid racing with the oom reaper. > + * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will > + * guarantee that the oom reaper will not run on this mm again > + * after mmap_sem is dropped. Comment should explain *why* we don't want the reaper to run on this mm again. > > ... >
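To make the livelock asked about above concrete, here is a schematic (kernel-style C, not a compilable unit) of the classic case: one task loops in the allocator while holding a lock that the oom victim needs before it can exit and free its memory. The task and lock names are invented for illustration.

```c
#include <linux/mutex.h>
#include <linux/gfp.h>

static DEFINE_MUTEX(shared_lock);

/* Some allocating path, e.g. a filesystem or writeback context. */
static void allocating_task(void)
{
	struct page *page;

	mutex_lock(&shared_lock);
	/*
	 * Small-order GFP_KERNEL allocations effectively never fail: under
	 * oom this loops in the page allocator (and may invoke the oom
	 * killer) while shared_lock is still held.
	 */
	page = alloc_page(GFP_KERNEL);
	if (page)
		__free_page(page);
	mutex_unlock(&shared_lock);
}

/* The oom victim: its exit path needs shared_lock before it can free memory. */
static void oom_victim_exit_path(void)
{
	mutex_lock(&shared_lock);	/* blocks behind allocating_task() */
	/* ... release resources, eventually reach exit_mmap() ... */
	mutex_unlock(&shared_lock);
}
```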
On Fri 15-06-18 16:15:39, David Rientjes wrote: [...] > I'd be happy to make this timeout configurable, however, and default > it to perhaps one second as the blockable mmu notifier timeout in your own > code does. I find it somewhat sad that we'd need a sysctl for this, but > if that will appease you and it will help to move this into -mm then we > can do that. No. This has been nacked in the past and I do not see anything different from back then. > > Other than that I've already pointed to a more robust solution. If you > > are reluctant to try it out I will do, but introducing a timeout is just > > papering over the real problem. Maybe we will not reach the state that > > _all_ the memory is reapable but we definitely should try to make as > > much as possible to be reapable and I do not see any fundamental > > problems in that direction. > > You introduced the timeout already; I'm sure you realized yourself that > the oom reaper sets MMF_OOM_SKIP much too early. Trying to grab > mm->mmap_sem 10 times in a row with HZ/10 sleeps in between is a timeout. Yes, it is. And it is a timeout based on some feedback. The lock is held; let's retry later but not retry forever. We can do the same with blockable mmu notifiers. We are currently giving up right away. I was proposing to add a can_sleep parameter to mmu_notifier_invalidate_range_start() and return EAGAIN if it would block. This would allow us to simply retry on EAGAIN, like we do for the mmap_sem. [...] > The reproducer on powerpc is very simple. Do an mmap() and mlock() the > length. Fork one 120MB process that does that and two 60MB processes that > do that in a 128MB memcg. And again, to solve this we just need to teach oom_reaper to handle mlocked memory. There shouldn't be any fundamental reason why this would be impossible AFAICS. A timeout is not a solution! [...] > It's inappropriate to merge code that oom kills many processes > unnecessarily when one happens to be mlocked or have blockable mmu > notifiers or when mm->mmap_sem can't be grabbed fast enough but forward > progress is actually being made. It's a regression, and it impacts real > users. Insisting that we fix the problem you introduced by making all mmu > notifiers unblockable and mlocked memory can always be reaped and > mm->mmap_sem can always be grabbed within a second is irresponsible. Well, a lack of real world bug reports doesn't really back your story here. I have asked about non-artificial workloads suffering and your responses were quite nonspecific to say the least. And I do insist on coming up with a reasonable solution rather than random hacks. Jeez, the oom killer was full of these. As I've said, if you are not willing to work on a proper solution, I will, but my nack holds for this patch until we see no other way around existing and real world problems.
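A rough sketch of the direction described above, under the assumption of a hypothetical non-blocking variant of mmu_notifier_invalidate_range_start() (the _nonblock name and its exact signature are invented here): the reaper would retry on EAGAIN the same way it already retries a contended mmap_sem, instead of sleeping for HZ and giving up.

```c
/*
 * Sketch only: mmu_notifier_invalidate_range_start_nonblock() does not
 * exist as such; it stands in for a can_sleep/blockable-aware notifier
 * start that returns -EAGAIN rather than sleeping.
 */
static bool oom_reap_range_nonblocking(struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end))
		return false;	/* a notifier would have blocked; re-queue and retry */

	/* ... tear down page tables for [start, end) as the reaper does ... */

	mmu_notifier_invalidate_range_end(mm, start, end);
	return true;
}
```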
On Mon 18-06-18 17:27:33, Andrew Morton wrote: > On Thu, 14 Jun 2018 13:42:59 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > > it cannot reap an mm. This can happen for a variety of reasons, > > including: > > > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > > > - when the mm has blockable mmu notifiers that could cause the oom reaper > > to stall indefinitely, > > Maybe we should have more than one oom reaper thread? I assume the > probability of the oom reaper thread blocking on an mmu notifier is > small, so perhaps just dive in and hope for the best. If the oom > reaper gets stuck then there's another thread ready to take over. And > revisit the decision to use a kernel thread instead of workqueues. Well, I think that having more threads would be wasteful for a rare event like the oom. Creating one on demand could be tricky because we are under strong memory pressure at the time and a new thread costs some resources. > > but we can also add a third when the oom reaper can "reap" an mm but doing > > so is unlikely to free any amount of memory: > > > > - when the mm's memory is fully mlocked. > > > > When all memory is mlocked, the oom reaper will not be able to free any > > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can > > unmap and free its memory in exit_mmap() and subsequent oom victims are > > chosen unnecessarily. This is trivial to reproduce if all eligible > > processes on the system have mlocked their memory: the oom killer calls > > panic() even though forward progress can be made. > > > > This is the same issue where the exit path sets MMF_OOM_SKIP before > > unmapping memory and additional processes can be chosen unnecessarily > > because the oom killer is racing with exit_mmap(). > > So what's actually happening here. A process has a large amount of > mlocked memory, it has been oom-killed and it is in the process of > releasing its memory and exiting, yes? > > If so, why does this task set MMF_OOM_SKIP on itself? Why aren't we > just patiently waiting for its attempt to release memory? Because the oom victim is not guaranteed to proceed to exit and release its own memory. The OOM reaper jumps in and skips over mlocked ranges because they require the page lock, and that again cannot be taken from the oom reaper path (the lock might be held while doing an allocation). This in turn means that the oom_reaper doesn't free mlocked memory before it sets MMF_OOM_SKIP, which will allow a new oom victim to be selected. At the time we merged the oom reaper this wasn't seen as a major issue because tasks usually do not consume a lot of mlocked memory and there is always some other memory to tear down and help to relieve the memory pressure. mlockall() oom victims were deemed unlikely because they need a large rlimit and as such should be trusted, and therefore quite safe from runaways. But there was definitely a plan to make mlocked memory reapable. So time to do it finally. > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > > a true oom livelock in progress, it never gets set and no additional > > killing is possible. > > I guess that's my answer. What causes this livelock? Process looping > in alloc_pages while holding a lock the oom victim wants? Yes. > > To fix this, this patch introduces a per-mm reaping timeout, initially set > > at 10s.
It requires that the oom reaper's list becomes a properly linked > > list so that other mm's may be reaped while waiting for an mm's timeout to > > expire. > > > > This replaces the current timeouts in the oom reaper: (1) when trying to > > grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2) > > a HZ sleep if there are blockable mmu notifiers. It extends it with > > timeout to allow an oom victim to reach exit_mmap() before choosing > > additional processes unnecessarily. > > > > The exit path will now set MMF_OOM_SKIP only after all memory has been > > freed, so additional oom killing is justified, > > That seems sensible, but why set MMF_OOM_SKIP at all? MMF_OOM_SKIP is a way to say that the task should be skipped during OOM victim evaluation. > > and rely on MMF_UNSTABLE to > > determine when it can race with the oom reaper. > > > > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has > > lapsed because it can no longer guarantee forward progress. > > > > The reaping timeout is intentionally set for a substantial amount of time > > since oom livelock is a very rare occurrence and it's better to optimize > > for preventing additional (unnecessary) oom killing than a scenario that > > is much more unlikely. > > What happened to the old idea of permitting the task which is blocking > the oom victim to access additional reserves? How do you find such a task? > Come to that, what happened to the really really old Andrea idea of not > looping in the page allocator anyway? Return NULL instead... Nacked by Linus because too-small-to-fail is a long-term semantic that cannot change easily. We do not have any way to audit syscall paths to not return ENOMEM when inappropriate. > I dunno, I'm thrashing around here. We seem to be piling mess on top > of mess and then being surprised that the result is a mess. Are we? The current oom_reaper certainly has some shortcomings that are addressable. We have started simple to cover most cases and move on with more complex heuristics based on real life bug reports. But we _do_ have a quite straightforward feedback-based algorithm to reclaim oom victims. This is a solid ground for future development. Something we never had before. So I am really wondering what all the mess is about.
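For readers wondering where the mlock limitation comes from, this is roughly the per-VMA filter the reaper applies, condensed from the 4.17-era __oom_reap_task_mm() (the function name and details here are paraphrased, not a literal excerpt): mlocked ranges are skipped entirely, which is why a fully mlocked victim yields nothing.

```c
/* Condensed, approximate view of the reap loop's per-VMA filtering. */
static void reap_sketch(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* mlocked, hugetlb and PFN-mapped ranges are never touched */
		if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP))
			continue;
		/*
		 * Only anonymous or private memory can be dropped without
		 * further work (no writeback, no page lock needed).
		 */
		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
			/* ... unmap the range, as MADV_DONTNEED would ... */
		}
	}
}
```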
On Mon, 18 Jun 2018, Andrew Morton wrote: > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if > > it cannot reap an mm. This can happen for a variety of reasons, > > including: > > > > - the inability to grab mm->mmap_sem in a sufficient amount of time, > > > > - when the mm has blockable mmu notifiers that could cause the oom reaper > > to stall indefinitely, > > Maybe we should have more than one oom reaper thread? I assume the > probability of the oom reaper thread blocking on an mmu notifier is > small, so perhaps just dive in and hope for the best. If the oom > reaper gets stuck then there's another thread ready to take over. And > revisit the decision to use a kernel thread instead of workqueues. > I'm not sure that we need more than one thread, per se, but we need the ability to operate on more than one oom victim while deciding whether one victim can be reaped or not. The current implementation only processes one victim at a time: it tries to grab mm->mmap_sem, it sleeps, retries, sleeps, etc. We need to try other oom victims (we do parallel memcg oom stress testing, and the oom reaper can uncharge memory to a hierarchy that prevents livelock as well), which my patch does. > So what's actually happening here. A process has a large amount of > mlocked memory, it has been oom-killed and it is in the process of > releasing its memory and exiting, yes? > That's one failure mode, yes, and it can happen in three possible ways: - the oom reaper immediately sets MMF_OOM_SKIP because it tried to free memory and completely failed, so it actually declares this as a success and sets MMF_OOM_SKIP assuming memory was freed, which it wasn't, - to avoid CVE-2018-1000200 exit_mmap() must set MMF_OOM_SKIP before doing munlock_vma_pages_all() which the oom reaper uses to determine if it can safely operate on a vma, so the exit path sets MMF_OOM_SKIP before any possible memory freeing as well, and - the previous iteration of the oom reaper, which set MMF_OOM_SKIP between unmap_vmas() and free_pgtables(), suffered from the same problem for large amounts of virtual memory, whereas a subsequent oom kill could have been prevented if free_pgtables() could have completed. My patch fixes all these issues because MMF_OOM_SKIP only gets set either after free_pgtables(), i.e. when no additional memory freeing is possible through exit_mmap(), or by the oom reaper after a process has failed to exit for 10s. I will patch this to make the timeout configurable. I use the existing MMF_UNSTABLE to determine if the oom reaper can safely operate on vmas of the mm. > If so, why does this task set MMF_OOM_SKIP on itself? Why aren't we > just patiently waiting for its attempt to release memory? > That's what my patch does, yes: it needs to wait to ensure forward progress is not being made before setting MMF_OOM_SKIP and allowing all other processes on the system to be oom killed. Taken to an extreme, imagine a single large mlocked process or one with a blockable mmu notifier taking up almost all memory on a machine. If there is a memory leak, it will be oom killed the same as it always has been. The difference now is that the machine panics because MMF_OOM_SKIP is set with no memory freed and the oom killer finds no more eligible processes, so its only alternative is panicking. > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is > > a true oom livelock in progress, it never gets set and no additional > > killing is possible. > > I guess that's my answer. What causes this livelock?
Process looping > in alloc_pages while holding a lock the oom victim wants? > That's one way, yes; the other is charging memory in the mem cgroup path while holding a mutex the victim wants. If additional kmem starts being charged to mem cgroup hierarchies and the oom killer is called synchronously in the charge path (there is no fault path to unwind to), which has been discussed, this problem will become much more prolific. > > The exit path will now set MMF_OOM_SKIP only after all memory has been > > freed, so additional oom killing is justified, > > That seems sensible, but why set MMF_OOM_SKIP at all? > The oom reaper will eventually need to set it if it's actually livelocked, which happens extremely rarely in practice, because the oom reaper was unable to free memory such that an allocator holding our mutex could successfully allocate. It sets it immediately now for mlocked processes (it doesn't realize it didn't free a single page). It retries 10 times to grab mm->mmap_sem and sets it after one second if it fails. If it has a blockable mmu notifier it sleeps for a second and sets it. I'm replacing all the current timeouts with a per-mm timeout and volunteering to make it configurable so that it can be disabled or set to 10s as preferred by us because we are tired of every process getting oom killed pointlessly. I'll suggest a default of 1s to match the timeouts currently implemented in the oom reaper and generalize them to be per-mm. > > and rely on MMF_UNSTABLE to > > determine when it can race with the oom reaper. > > > > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has > > lapsed because it can no longer guarantee forward progress. > > > > The reaping timeout is intentionally set for a substantial amount of time > > since oom livelock is a very rare occurrence and it's better to optimize > > for preventing additional (unnecessary) oom killing than a scenario that > > is much more unlikely. > > What happened to the old idea of permitting the task which is blocking > the oom victim to access additional reserves? > That is an alternative to the oom reaper and worked quite successfully for us. We'd detect when a process was looping endlessly waiting for the same victim to exit and then grant it access to additional reserves, specifically to detect oom livelock scenarios. The oom reaper should theoretically make this extremely rare since it normally can free *some* memory so we aren't oom anymore and allocators holding mutexes can succeed. > > +#ifdef CONFIG_MMU > > + /* When to give up on oom reaping this mm */ > > + unsigned long reap_timeout; > > "timeout" implies "interval". To me, anyway. This is an absolute > time, so something like reap_time would be clearer. Along with a > comment explaining that the units are in jiffies. > Ack. > > +#endif > > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS > > pgtable_t pmd_huge_pte; /* protected by page_table_lock */ > > #endif > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -1163,7 +1163,7 @@ struct task_struct { > > #endif > > int pagefault_disabled; > > #ifdef CONFIG_MMU > > - struct task_struct *oom_reaper_list; > > + struct list_head oom_reap_list; > > Can we have a comment explaining its locking. > Ok. > > #endif > > #ifdef CONFIG_VMAP_STACK > > struct vm_struct *stack_vm_area; > > > > ...
> > > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm) > > if (unlikely(mm_is_oom_victim(mm))) { > > /* > > * Manually reap the mm to free as much memory as possible. > > - * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard > > - * this mm from further consideration. Taking mm->mmap_sem for > > - * write after setting MMF_OOM_SKIP will guarantee that the oom > > - * reaper will not run on this mm again after mmap_sem is > > - * dropped. > > + * Then, set MMF_UNSTABLE to avoid racing with the oom reaper. > > + * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will > > + * guarantee that the oom reaper will not run on this mm again > > + * after mmap_sem is dropped. > > Comment should explain *why* we don't want the reaper to run on this mm > again. > Sounds good.
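Before the diff itself, a condensed view of the ordering the patch establishes in exit_mmap() for an oom victim, as I read it (paraphrased, not a literal excerpt): MMF_UNSTABLE fences off the reaper early, while MMF_OOM_SKIP, which licenses further kills, only appears once free_pgtables() has finished.

```c
void exit_mmap_ordering_sketch(struct mm_struct *mm)
{
	if (mm_is_oom_victim(mm)) {
		__oom_reap_task_mm(mm);			/* reap what we can, under oom_lock */
		set_bit(MMF_UNSTABLE, &mm->flags);	/* tell the oom reaper to keep off */
		down_write(&mm->mmap_sem);		/* wait out a racing reaper */
		up_write(&mm->mmap_sem);
	}

	/* ... unmap_vmas(), free_pgtables(), tlb_finish_mmu() ... */

	set_bit(MMF_OOM_SKIP, &mm->flags);	/* only now may another victim be chosen */
}
```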
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -449,6 +449,10 @@ struct mm_struct { #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_mm *mmu_notifier_mm; #endif +#ifdef CONFIG_MMU + /* When to give up on oom reaping this mm */ + unsigned long reap_timeout; +#endif #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS pgtable_t pmd_huge_pte; /* protected by page_table_lock */ #endif diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1163,7 +1163,7 @@ struct task_struct { #endif int pagefault_disabled; #ifdef CONFIG_MMU - struct task_struct *oom_reaper_list; + struct list_head oom_reap_list; #endif #ifdef CONFIG_VMAP_STACK struct vm_struct *stack_vm_area; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -835,6 +835,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) tsk->fail_nth = 0; #endif +#ifdef CONFIG_MMU + INIT_LIST_HEAD(&tsk->oom_reap_list); +#endif + return tsk; free_stack: diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm) if (unlikely(mm_is_oom_victim(mm))) { /* * Manually reap the mm to free as much memory as possible. - * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard - * this mm from further consideration. Taking mm->mmap_sem for - * write after setting MMF_OOM_SKIP will guarantee that the oom - * reaper will not run on this mm again after mmap_sem is - * dropped. + * Then, set MMF_UNSTABLE to avoid racing with the oom reaper. + * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will + * guarantee that the oom reaper will not run on this mm again + * after mmap_sem is dropped. 
* * Nothing can be holding mm->mmap_sem here and the above call * to mmu_notifier_release(mm) ensures mmu notifier callbacks in @@ -3077,7 +3076,7 @@ void exit_mmap(struct mm_struct *mm) __oom_reap_task_mm(mm); mutex_unlock(&oom_lock); - set_bit(MMF_OOM_SKIP, &mm->flags); + set_bit(MMF_UNSTABLE, &mm->flags); down_write(&mm->mmap_sem); up_write(&mm->mmap_sem); } @@ -3105,6 +3104,7 @@ void exit_mmap(struct mm_struct *mm) unmap_vmas(&tlb, vma, 0, -1); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb, 0, -1); + set_bit(MMF_OOM_SKIP, &mm->flags); /* * Walk the list again, actually closing and freeing it, diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -476,7 +476,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm) */ static struct task_struct *oom_reaper_th; static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait); -static struct task_struct *oom_reaper_list; +static LIST_HEAD(oom_reaper_list); static DEFINE_SPINLOCK(oom_reaper_lock); void __oom_reap_task_mm(struct mm_struct *mm) @@ -519,10 +519,8 @@ void __oom_reap_task_mm(struct mm_struct *mm) } } -static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) +static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) { - bool ret = true; - /* * We have to make sure to not race with the victim exit path * and cause premature new oom victim selection: @@ -540,9 +538,8 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) mutex_lock(&oom_lock); if (!down_read_trylock(&mm->mmap_sem)) { - ret = false; trace_skip_task_reaping(tsk->pid); - goto unlock_oom; + goto out_oom; } /* @@ -551,69 +548,81 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) * TODO: we really want to get rid of this ugly hack and make sure that * notifiers cannot block for unbounded amount of time */ - if (mm_has_blockable_invalidate_notifiers(mm)) { - up_read(&mm->mmap_sem); - schedule_timeout_idle(HZ); - goto unlock_oom; - } + if (mm_has_blockable_invalidate_notifiers(mm)) + goto out_mm; /* - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't - * work on the mm anymore. The check for MMF_OOM_SKIP must run + * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't + * work on the mm anymore. The check for MMF_UNSTABLE must run * under mmap_sem for reading because it serializes against the * down_write();up_write() cycle in exit_mmap(). 
*/ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { - up_read(&mm->mmap_sem); + if (test_bit(MMF_UNSTABLE, &mm->flags)) { trace_skip_task_reaping(tsk->pid); - goto unlock_oom; + goto out_mm; } trace_start_task_reaping(tsk->pid); - __oom_reap_task_mm(mm); + trace_finish_task_reaping(tsk->pid); pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(tsk), tsk->comm, K(get_mm_counter(mm, MM_ANONPAGES)), K(get_mm_counter(mm, MM_FILEPAGES)), K(get_mm_counter(mm, MM_SHMEMPAGES))); +out_mm: up_read(&mm->mmap_sem); - - trace_finish_task_reaping(tsk->pid); -unlock_oom: +out_oom: mutex_unlock(&oom_lock); - return ret; } -#define MAX_OOM_REAP_RETRIES 10 static void oom_reap_task(struct task_struct *tsk) { - int attempts = 0; struct mm_struct *mm = tsk->signal->oom_mm; - /* Retry the down_read_trylock(mmap_sem) a few times */ - while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm)) - schedule_timeout_idle(HZ/10); + /* + * If this mm has either been fully unmapped, or the oom reaper has + * given up on it, nothing left to do except drop the refcount. + */ + if (test_bit(MMF_OOM_SKIP, &mm->flags)) + goto drop; - if (attempts <= MAX_OOM_REAP_RETRIES || - test_bit(MMF_OOM_SKIP, &mm->flags)) - goto done; + /* + * If this mm has already been reaped, doing so again will not likely + * free additional memory. + */ + if (!test_bit(MMF_UNSTABLE, &mm->flags)) + oom_reap_task_mm(tsk, mm); - pr_info("oom_reaper: unable to reap pid:%d (%s)\n", - task_pid_nr(tsk), tsk->comm); - debug_show_all_locks(); + if (time_after_eq(jiffies, mm->reap_timeout)) { + if (!test_bit(MMF_OOM_SKIP, &mm->flags)) { + pr_info("oom_reaper: unable to reap pid:%d (%s)\n", + task_pid_nr(tsk), tsk->comm); + debug_show_all_locks(); -done: - tsk->oom_reaper_list = NULL; + /* + * Reaping has failed for the timeout period, so give up + * and allow additional processes to be oom killed. + */ + set_bit(MMF_OOM_SKIP, &mm->flags); + } + goto drop; + } - /* - * Hide this mm from OOM killer because it has been either reaped or - * somebody can't call up_write(mmap_sem). - */ - set_bit(MMF_OOM_SKIP, &mm->flags); + if (test_bit(MMF_OOM_SKIP, &mm->flags)) + goto drop; - /* Drop a reference taken by wake_oom_reaper */ + /* Enqueue to be reaped again */ + spin_lock(&oom_reaper_lock); + list_add_tail(&tsk->oom_reap_list, &oom_reaper_list); + spin_unlock(&oom_reaper_lock); + + schedule_timeout_idle(HZ/10); + return; + +drop: + /* Drop the reference taken by wake_oom_reaper */ put_task_struct(tsk); } @@ -622,11 +631,13 @@ static int oom_reaper(void *unused) while (true) { struct task_struct *tsk = NULL; - wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL); + wait_event_freezable(oom_reaper_wait, + !list_empty(&oom_reaper_list)); spin_lock(&oom_reaper_lock); - if (oom_reaper_list != NULL) { - tsk = oom_reaper_list; - oom_reaper_list = tsk->oom_reaper_list; + if (!list_empty(&oom_reaper_list)) { + tsk = list_entry(oom_reaper_list.next, + struct task_struct, oom_reap_list); + list_del(&tsk->oom_reap_list); } spin_unlock(&oom_reaper_lock); @@ -637,17 +648,22 @@ static int oom_reaper(void *unused) return 0; } +/* How long to wait to oom reap an mm before selecting another process */ +#define OOM_REAP_TIMEOUT_MSECS (10 * 1000) static void wake_oom_reaper(struct task_struct *tsk) { - /* tsk is already queued? */ - if (tsk == oom_reaper_list || tsk->oom_reaper_list) + /* + * Set the reap timeout; if it's already set, the mm is enqueued and + * this tsk can be ignored. 
+ */ + if (cmpxchg(&tsk->signal->oom_mm->reap_timeout, 0UL, + jiffies + msecs_to_jiffies(OOM_REAP_TIMEOUT_MSECS))) return; get_task_struct(tsk); spin_lock(&oom_reaper_lock); - tsk->oom_reaper_list = oom_reaper_list; - oom_reaper_list = tsk; + list_add(&tsk->oom_reap_list, &oom_reaper_list); spin_unlock(&oom_reaper_lock); trace_wake_reaper(tsk->pid); wake_up(&oom_reaper_wait);
The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if it cannot reap an mm. This can happen for a variety of reasons, including:

 - the inability to grab mm->mmap_sem in a sufficient amount of time,

 - when the mm has blockable mmu notifiers that could cause the oom reaper to stall indefinitely,

but we can also add a third when the oom reaper can "reap" an mm but doing so is unlikely to free any amount of memory:

 - when the mm's memory is fully mlocked.

When all memory is mlocked, the oom reaper will not be able to free any substantial amount of memory. It sets MMF_OOM_SKIP before the victim can unmap and free its memory in exit_mmap() and subsequent oom victims are chosen unnecessarily. This is trivial to reproduce if all eligible processes on the system have mlocked their memory: the oom killer calls panic() even though forward progress can be made.

This is the same issue where the exit path sets MMF_OOM_SKIP before unmapping memory and additional processes can be chosen unnecessarily because the oom killer is racing with exit_mmap().

We can't simply defer setting MMF_OOM_SKIP, however, because if there is a true oom livelock in progress, it never gets set and no additional killing is possible.

To fix this, this patch introduces a per-mm reaping timeout, initially set at 10s. It requires that the oom reaper's list becomes a properly linked list so that other mm's may be reaped while waiting for an mm's timeout to expire.

This replaces the current timeouts in the oom reaper: (1) when trying to grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2) a HZ sleep if there are blockable mmu notifiers. It extends it with a timeout to allow an oom victim to reach exit_mmap() before choosing additional processes unnecessarily.

The exit path will now set MMF_OOM_SKIP only after all memory has been freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to determine when it can race with the oom reaper.

The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has lapsed because it can no longer guarantee forward progress.

The reaping timeout is intentionally set for a substantial amount of time since oom livelock is a very rare occurrence and it's better to optimize for preventing additional (unnecessary) oom killing than a scenario that is much more unlikely.

Signed-off-by: David Rientjes <rientjes@google.com>
---
Note: I understand there is an objection based on timeout based delays. This is currently the only possible way to avoid oom killing important processes completely unnecessarily. If the oom reaper can someday free all memory, including mlocked memory and those mm's with blockable mmu notifiers, and is guaranteed to always be able to grab mm->mmap_sem, this can be removed. I do not believe any such guarantee is possible and consider the massive killing of additional processes unnecessarily to be a regression introduced by the oom reaper and its very quick setting of MMF_OOM_SKIP to allow additional processes to be oom killed.

 include/linux/mm_types.h | 4 ++
 include/linux/sched.h | 2 +-
 kernel/fork.c | 4 ++
 mm/mmap.c | 12 ++---
 mm/oom_kill.c | 112 ++++++++++++++++++++++-----------------
 5 files changed, 79 insertions(+), 55 deletions(-)