| Message ID | 1561807474-10317-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive) |
|---|---|
| State | New, archived |
| Series | mm: mempolicy: don't select exited threads as OOM victims |
On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
> with dying leader and live threads) as eligible. But it is possible that
> all of live threads are still ineligible.
>
> Since has_intersects_mems_allowed() returns true as soon as one of threads
> is considered eligible, mempolicy_nodemask_intersects() needs to consider
> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
> threads by checking whether mm is NULL.

Ok, this makes sense. For this change
Acked-by: Michal Hocko <mhocko@suse.com>

> While at it, since mempolicy_nodemask_intersects() is called by only
> has_intersects_mems_allowed(), it is guaranteed that mask != NULL.

Well, I am not really sure. It's true that mempolicy_nodemask_intersects
is currently only used by the OOM path and never with mask == NULL, but
the function seems to be more generic and handling a NULL mask seems
reasonable to me. This is not a hot path, so an additional check wouldn't
be harmful, right?

> BTW, are there processes where some of threads use MPOL_{BIND,INTERLEAVE}
> and the rest do not use MPOL_{BIND,INTERLEAVE} ? If no, we can use
> find_lock_task_mm() instead of for_each_thread() for mask != NULL case
> in has_intersects_mems_allowed().

I am afraid that mempolicy is allowed to be per thread, which is quite
ugly, and I am afraid we cannot change that right now.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/mempolicy.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 01600d8..938f0a0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1974,11 +1974,10 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
>  					const nodemask_t *mask)
>  {
>  	struct mempolicy *mempolicy;
> -	bool ret = true;
> +	bool ret;
>
> -	if (!mask)
> -		return ret;
>  	task_lock(tsk);
> +	ret = tsk->mm;
>  	mempolicy = tsk->mempolicy;
>  	if (!mempolicy)
>  		goto out;
> --
> 1.8.3.1
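[For context on the mask != NULL point above: the sole caller walks every thread of the candidate process and asks either the mempolicy helper or the cpuset helper. The sketch below is a simplified illustration of that caller (named oom_cpuset_eligible() in the tree the later hunks in this thread are against, has_intersects_mems_allowed() in older kernels); memcg handling and other details are abridged and may differ from the exact source.]

/*
 * Simplified sketch of the only caller of mempolicy_nodemask_intersects();
 * not verbatim kernel source.
 */
static bool oom_cpuset_eligible(struct task_struct *start,
				struct oom_control *oc)
{
	struct task_struct *tsk;
	bool ret = false;
	const nodemask_t *mask = oc->nodemask;

	rcu_read_lock();
	for_each_thread(start, tsk) {
		if (mask) {
			/*
			 * Mempolicy-constrained OOM: mask is non-NULL on
			 * this branch, which is the "guaranteed that
			 * mask != NULL" argument made above.
			 */
			ret = mempolicy_nodemask_intersects(tsk, mask);
		} else {
			/* Otherwise only tsk's cpuset mems are checked. */
			ret = cpuset_mems_allowed_intersects(current, tsk);
		}
		/* One eligible thread makes the whole process eligible. */
		if (ret)
			break;
	}
	rcu_read_unlock();

	return ret;
}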
On 2019/07/01 20:17, Michal Hocko wrote:
> On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
>> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
>> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
>> with dying leader and live threads) as eligible. But it is possible that
>> all of live threads are still ineligible.
>>
>> Since has_intersects_mems_allowed() returns true as soon as one of threads
>> is considered eligible, mempolicy_nodemask_intersects() needs to consider
>> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
>> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
>> threads by checking whether mm is NULL.
>
> Ok, this makes sense. For this change
> Acked-by: Michal Hocko <mhocko@suse.com>
>

But I realized that this patch was too optimistic. We need to wait for mm-less
threads until MMF_OOM_SKIP is set if the process was already an OOM victim. If
we fail to allow the process to reach MMF_OOM_SKIP test, the process will be
ignored by the OOM killer as soon as all threads pass mm = NULL at exit_mm(), for
has_intersects_mems_allowed() returns false unless MPOL_{BIND,INTERLEAVE} is used.

Well, the problem is that exited threads prematurely set mempolicy = NULL.
Since bitmap memory for cpuset_mems_allowed_intersects() path is freed when
__put_task_struct() is called, mempolicy memory for mempolicy_nodemask_intersects()
path should be freed as well when __put_task_struct() is called?

diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7..02a60ea 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -897,7 +897,6 @@ void __noreturn do_exit(long code)
 	exit_tasks_rcu_start();
 	exit_notify(tsk, group_dead);
 	proc_exit_connector(tsk);
-	mpol_put_task_policy(tsk);
 #ifdef CONFIG_FUTEX
 	if (unlikely(current->pi_state_cache))
 		kfree(current->pi_state_cache);
diff --git a/kernel/fork.c b/kernel/fork.c
index 6166790..c17e436 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -726,6 +726,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);

+	mpol_put_task_policy(tsk);
 	cgroup_free(tsk);
 	task_numa_free(tsk);
 	security_task_free(tsk);
On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> On 2019/07/01 20:17, Michal Hocko wrote:
> > On Sat 29-06-19 20:24:34, Tetsuo Handa wrote:
> >> Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
> >> mempolicy_nodemask_intersects() considers exited threads (e.g. a process
> >> with dying leader and live threads) as eligible. But it is possible that
> >> all of live threads are still ineligible.
> >>
> >> Since has_intersects_mems_allowed() returns true as soon as one of threads
> >> is considered eligible, mempolicy_nodemask_intersects() needs to consider
> >> exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
> >> before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
> >> threads by checking whether mm is NULL.
> >
> > Ok, this makes sense. For this change
> > Acked-by: Michal Hocko <mhocko@suse.com>
> >
>
> But I realized that this patch was too optimistic. We need to wait for mm-less
> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.

If the process is an oom victim then _all_ threads are so as well
because that is the address space property. And we already do check that
before reaching oom_badness IIRC. So what is the actual problem you are
trying to solve here?

> If
> we fail to allow the process to reach MMF_OOM_SKIP test, the process will be
> ignored by the OOM killer as soon as all threads pass mm = NULL at exit_mm(), for
> has_intersects_mems_allowed() returns false unless MPOL_{BIND,INTERLEAVE} is used.
>
> Well, the problem is that exited threads prematurely set mempolicy = NULL.
> Since bitmap memory for cpuset_mems_allowed_intersects() path is freed when
> __put_task_struct() is called, mempolicy memory for mempolicy_nodemask_intersects()
> path should be freed as well when __put_task_struct() is called?

I am sorry but I have a hard time understanding what the actual user
visible problem is here.
On 2019/07/01 22:17, Michal Hocko wrote:
> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
>> But I realized that this patch was too optimistic. We need to wait for mm-less
>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
>
> If the process is an oom victim then _all_ threads are so as well
> because that is the address space property. And we already do check that
> before reaching oom_badness IIRC. So what is the actual problem you are
> trying to solve here?

I'm talking about behavioral change after tsk became an OOM victim.

If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
false because mempolicy_nodemask_intersects() returned false because all thread's
mm became NULL (despite tsk->signal->oom_mm != NULL).

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;

	if (oom_unkillable_task(task, NULL, oc->nodemask))
		goto next;

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;
	}
}
On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> On 2019/07/01 22:17, Michal Hocko wrote:
> > On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> >> But I realized that this patch was too optimistic. We need to wait for mm-less
> >> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> >
> > If the process is an oom victim then _all_ threads are so as well
> > because that is the address space property. And we already do check that
> > before reaching oom_badness IIRC. So what is the actual problem you are
> > trying to solve here?
>
> I'm talking about behavioral change after tsk became an OOM victim.
>
> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> false because mempolicy_nodemask_intersects() returned false because all thread's
> mm became NULL (despite tsk->signal->oom_mm != NULL).

OK, I finally got your point. It was not clear that you are referring to
the code _after_ the patch you are proposing. You are indeed right that
this would have a side effect that an additional victim could be
selected even though the current process hasn't terminated yet. Sigh,
another example of how subtle the whole thing is, so I retract my Ack and
request a real life example of where this matters before we think about
a proper fix and make the code even more complex.
On 2019/07/01 22:48, Michal Hocko wrote:
> On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
>> On 2019/07/01 22:17, Michal Hocko wrote:
>>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
>>>> But I realized that this patch was too optimistic. We need to wait for mm-less
>>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
>>>
>>> If the process is an oom victim then _all_ threads are so as well
>>> because that is the address space property. And we already do check that
>>> before reaching oom_badness IIRC. So what is the actual problem you are
>>> trying to solve here?
>>
>> I'm talking about behavioral change after tsk became an OOM victim.
>>
>> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
>> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
>> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
>> false because mempolicy_nodemask_intersects() returned false because all thread's
>> mm became NULL (despite tsk->signal->oom_mm != NULL).
>
> OK, I finally got your point. It was not clear that you are referring to
> the code _after_ the patch you are proposing. You are indeed right that
> this would have a side effect that an additional victim could be
> selected even though the current process hasn't terminated yet. Sigh,
> another example of how subtle the whole thing is, so I retract my Ack and
> request a real life example of where this matters before we think about
> a proper fix and make the code even more complex.
>

Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
do_exit() to __put_task_struct() ? That change will (if it is safe to do)
prevent exited threads from setting mempolicy = NULL (and confusing
mempolicy_nodemask_intersects() due to mempolicy == NULL).
On Mon 01-07-19 22:56:12, Tetsuo Handa wrote:
> On 2019/07/01 22:48, Michal Hocko wrote:
> > On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> >> On 2019/07/01 22:17, Michal Hocko wrote:
> >>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> >>>> But I realized that this patch was too optimistic. We need to wait for mm-less
> >>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> >>>
> >>> If the process is an oom victim then _all_ threads are so as well
> >>> because that is the address space property. And we already do check that
> >>> before reaching oom_badness IIRC. So what is the actual problem you are
> >>> trying to solve here?
> >>
> >> I'm talking about behavioral change after tsk became an OOM victim.
> >>
> >> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> >> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> >> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> >> false because mempolicy_nodemask_intersects() returned false because all thread's
> >> mm became NULL (despite tsk->signal->oom_mm != NULL).
> >
> > OK, I finally got your point. It was not clear that you are referring to
> > the code _after_ the patch you are proposing. You are indeed right that
> > this would have a side effect that an additional victim could be
> > selected even though the current process hasn't terminated yet. Sigh,
> > another example of how subtle the whole thing is, so I retract my Ack and
> > request a real life example of where this matters before we think about
> > a proper fix and make the code even more complex.
> >
>
> Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
> do_exit() to __put_task_struct() ? That change will (if it is safe to do)
> prevent exited threads from setting mempolicy = NULL (and confusing
> mempolicy_nodemask_intersects() due to mempolicy == NULL).

I am sorry but I would have to study it much more and I am not convinced
the time spent on it would be well spent.
On Mon 01-07-19 16:04:34, Michal Hocko wrote:
> On Mon 01-07-19 22:56:12, Tetsuo Handa wrote:
> > On 2019/07/01 22:48, Michal Hocko wrote:
> > > On Mon 01-07-19 22:38:58, Tetsuo Handa wrote:
> > >> On 2019/07/01 22:17, Michal Hocko wrote:
> > >>> On Mon 01-07-19 22:04:22, Tetsuo Handa wrote:
> > >>>> But I realized that this patch was too optimistic. We need to wait for mm-less
> > >>>> threads until MMF_OOM_SKIP is set if the process was already an OOM victim.
> > >>>
> > >>> If the process is an oom victim then _all_ threads are so as well
> > >>> because that is the address space property. And we already do check that
> > >>> before reaching oom_badness IIRC. So what is the actual problem you are
> > >>> trying to solve here?
> > >>
> > >> I'm talking about behavioral change after tsk became an OOM victim.
> > >>
> > >> If tsk->signal->oom_mm != NULL, we have to wait for MMF_OOM_SKIP even if
> > >> tsk->mm == NULL. Otherwise, the OOM killer selects next OOM victim as soon as
> > >> oom_unkillable_task() returned true because has_intersects_mems_allowed() returned
> > >> false because mempolicy_nodemask_intersects() returned false because all thread's
> > >> mm became NULL (despite tsk->signal->oom_mm != NULL).
> > >
> > > OK, I finally got your point. It was not clear that you are referring to
> > > the code _after_ the patch you are proposing. You are indeed right that
> > > this would have a side effect that an additional victim could be
> > > selected even though the current process hasn't terminated yet. Sigh,
> > > another example of how subtle the whole thing is, so I retract my Ack and
> > > request a real life example of where this matters before we think about
> > > a proper fix and make the code even more complex.
> > >
> >
> > Instead of checking for mm != NULL, can we move mpol_put_task_policy() from
> > do_exit() to __put_task_struct() ? That change will (if it is safe to do)
> > prevent exited threads from setting mempolicy = NULL (and confusing
> > mempolicy_nodemask_intersects() due to mempolicy == NULL).
>
> I am sorry but I would have to study it much more and I am not convinced
> the time spent on it would be well spent.

Thinking about it some more it seems that we can go with your original
fix if we also reorder oom_evaluate_task

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f719b64741d6..e5feb0f72e3b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	struct oom_control *oc = arg;
 	unsigned long points;

-	if (oom_unkillable_task(task, NULL, oc->nodemask))
-		goto next;
-
 	/*
 	 * This task already has access to memory reserves and is being killed.
 	 * Don't allow any other task to have access to the reserves unless
@@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto abort;
 	}

+	if (oom_unkillable_task(task, NULL, oc->nodemask))
+		goto next;
+
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.

I do not see any strong reason to keep the current ordering. OOM victim
check is trivial so it shouldn't add a visible overhead for few
unkillable tasks that we might encounter.
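[A rough sketch of the head of oom_evaluate_task() with the reorder above applied may make its effect easier to see; this is an abridged illustration, not verbatim kernel source, and the rest of the function is unchanged.]

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;
	unsigned long points;

	/*
	 * The OOM-victim check now runs before the eligibility check, so a
	 * victim whose threads already passed exit_mm() (mm == NULL) still
	 * blocks selection of a further victim until MMF_OOM_SKIP is set.
	 */
	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;
	}

	/* Eligibility (the nodemask/cpuset check) only comes second. */
	if (oom_unkillable_task(task, NULL, oc->nodemask))
		goto next;

	/* ... badness evaluation continues as before ... */
}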
On 2019/07/01 23:16, Michal Hocko wrote:
> Thinking about it some more it seems that we can go with your original
> fix if we also reorder oom_evaluate_task
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f719b64741d6..e5feb0f72e3b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  	struct oom_control *oc = arg;
>  	unsigned long points;
>
> -	if (oom_unkillable_task(task, NULL, oc->nodemask))
> -		goto next;
> -
>  	/*
>  	 * This task already has access to memory reserves and is being killed.
>  	 * Don't allow any other task to have access to the reserves unless
> @@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
>  		goto abort;
>  	}
>
> +	if (oom_unkillable_task(task, NULL, oc->nodemask))
> +		goto next;
> +
>  	/*
>  	 * If task is allocating a lot of memory and has been marked to be
>  	 * killed first if it triggers an oom, then select it.
>
> I do not see any strong reason to keep the current ordering. OOM victim
> check is trivial so it shouldn't add a visible overhead for few
> unkillable tasks that we might encounter.
>

Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
(because an OOM victim in a different OOM context will hit "goto abort;" path).

Thinking again, I think that the same problem exists for mask == NULL path
as long as "a process with dying leader and live threads" is possible. Then,
fixing up after has_intersects_mems_allowed()/cpuset_mems_allowed_intersects()
judged that some thread is eligible is better.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d1c9c4e..43e499e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -109,8 +109,23 @@ static bool oom_cpuset_eligible(struct task_struct *start,
 			 */
 			ret = cpuset_mems_allowed_intersects(current, tsk);
 		}
-		if (ret)
-			break;
+		if (ret) {
+			/*
+			 * Exclude dead threads as ineligible when selecting
+			 * an OOM victim. But include dead threads as eligible
+			 * when waiting for OOM victims to get MMF_OOM_SKIP.
+			 *
+			 * Strictly speaking, tsk->mm should be checked under
+			 * task lock because cpuset_mems_allowed_intersects()
+			 * does not take task lock. But racing with exit_mm()
+			 * is not fatal. Thus, use cheaper barrier rather than
+			 * strict task lock.
+			 */
+			smp_rmb();
+			if (tsk->mm || tsk_is_oom_victim(tsk))
+				break;
+			ret = false;
+		}
 	}
 	rcu_read_unlock();
On Tue 02-07-19 22:19:27, Tetsuo Handa wrote:
> On 2019/07/01 23:16, Michal Hocko wrote:
> > Thinking about it some more it seems that we can go with your original
> > fix if we also reorder oom_evaluate_task
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index f719b64741d6..e5feb0f72e3b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -318,9 +318,6 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >  	struct oom_control *oc = arg;
> >  	unsigned long points;
> >
> > -	if (oom_unkillable_task(task, NULL, oc->nodemask))
> > -		goto next;
> > -
> >  	/*
> >  	 * This task already has access to memory reserves and is being killed.
> >  	 * Don't allow any other task to have access to the reserves unless
> > @@ -333,6 +330,9 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >  		goto abort;
> >  	}
> >
> > +	if (oom_unkillable_task(task, NULL, oc->nodemask))
> > +		goto next;
> > +
> >  	/*
> >  	 * If task is allocating a lot of memory and has been marked to be
> >  	 * killed first if it triggers an oom, then select it.
> >
> > I do not see any strong reason to keep the current ordering. OOM victim
> > check is trivial so it shouldn't add a visible overhead for few
> > unkillable tasks that we might encounter.
> >
>
> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
> (because an OOM victim in a different OOM context will hit "goto abort;" path).

You are right. Considering that we now have a guarantee of forward
progress, this should be tolerable (a victim in a disjoint numaset
will go away and the other one can go ahead and trigger its own OOM).

> Thinking again, I think that the same problem exists for mask == NULL path
> as long as "a process with dying leader and live threads" is possible. Then,
> fixing up after has_intersects_mems_allowed()/cpuset_mems_allowed_intersects()
> judged that some thread is eligible is better.

This is getting more and more hairy for something that is not really
clear to be an actual problem. Don't you think?

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d1c9c4e..43e499e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -109,8 +109,23 @@ static bool oom_cpuset_eligible(struct task_struct *start,
>  			 */
>  			ret = cpuset_mems_allowed_intersects(current, tsk);
>  		}
> -		if (ret)
> -			break;
> +		if (ret) {
> +			/*
> +			 * Exclude dead threads as ineligible when selecting
> +			 * an OOM victim. But include dead threads as eligible
> +			 * when waiting for OOM victims to get MMF_OOM_SKIP.
> +			 *
> +			 * Strictly speaking, tsk->mm should be checked under
> +			 * task lock because cpuset_mems_allowed_intersects()
> +			 * does not take task lock. But racing with exit_mm()
> +			 * is not fatal. Thus, use cheaper barrier rather than
> +			 * strict task lock.
> +			 */
> +			smp_rmb();
> +			if (tsk->mm || tsk_is_oom_victim(tsk))
> +				break;
> +			ret = false;
> +		}
>  	}
>  	rcu_read_unlock();
>
On 2019/07/02 22:51, Michal Hocko wrote:
>>> I do not see any strong reason to keep the current ordering. OOM victim
>>> check is trivial so it shouldn't add a visible overhead for few
>>> unkillable tasks that we might encounter.
>>>
>>
>> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
>> (because an OOM victim in a different OOM context will hit "goto abort;" path).
>
> You are right. Considering that we now have a guarantee of forward
> progress, this should be tolerable (a victim in a disjoint numaset
> will go away and the other one can go ahead and trigger its own OOM).

But it might take a very long period before MMF_OOM_SKIP is set by the OOM reaper
or exit_mmap(). Until MMF_OOM_SKIP is set, OOM events from a disjoint numaset can't
make forward progress.
On Wed 03-07-19 06:26:55, Tetsuo Handa wrote:
> On 2019/07/02 22:51, Michal Hocko wrote:
> >>> I do not see any strong reason to keep the current ordering. OOM victim
> >>> check is trivial so it shouldn't add a visible overhead for few
> >>> unkillable tasks that we might encounter.
> >>>
> >>
> >> Yes if we can tolerate that there can be only one OOM victim for !memcg OOM events
> >> (because an OOM victim in a different OOM context will hit "goto abort;" path).
> >
> > You are right. Considering that we now have a guarantee of forward
> > progress, this should be tolerable (a victim in a disjoint numaset
> > will go away and the other one can go ahead and trigger its own OOM).
>
> But it might take a very long period before MMF_OOM_SKIP is set by the OOM reaper
> or exit_mmap(). Until MMF_OOM_SKIP is set, OOM events from a disjoint numaset can't
> make forward progress.

If that is a concern then I would stick with the current status quo until
we see the issue reported by real workloads.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01600d8..938f0a0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1974,11 +1974,10 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 					const nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
-	bool ret = true;
+	bool ret;

-	if (!mask)
-		return ret;
 	task_lock(tsk);
+	ret = tsk->mm;
 	mempolicy = tsk->mempolicy;
 	if (!mempolicy)
 		goto out;
Since mpol_put_task_policy() in do_exit() sets mempolicy = NULL,
mempolicy_nodemask_intersects() considers exited threads (e.g. a process
with dying leader and live threads) as eligible. But it is possible that
all of live threads are still ineligible.

Since has_intersects_mems_allowed() returns true as soon as one of threads
is considered eligible, mempolicy_nodemask_intersects() needs to consider
exited threads as ineligible. Since exit_mm() in do_exit() sets mm = NULL
before mpol_put_task_policy() sets mempolicy = NULL, we can exclude exited
threads by checking whether mm is NULL.

While at it, since mempolicy_nodemask_intersects() is called by only
has_intersects_mems_allowed(), it is guaranteed that mask != NULL.

BTW, are there processes where some of threads use MPOL_{BIND,INTERLEAVE}
and the rest do not use MPOL_{BIND,INTERLEAVE} ? If no, we can use
find_lock_task_mm() instead of for_each_thread() for mask != NULL case
in has_intersects_mems_allowed().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/mempolicy.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
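[For reference, a sketch of how mempolicy_nodemask_intersects() reads with the hunk above applied; the mempolicy mode handling between the shown lines and the out label is unchanged and elided here, so this is an illustration rather than verbatim source.]

bool mempolicy_nodemask_intersects(struct task_struct *tsk,
					const nodemask_t *mask)
{
	struct mempolicy *mempolicy;
	bool ret;

	task_lock(tsk);
	/*
	 * tsk->mm is NULL once the thread has passed exit_mm(), so an
	 * exited thread starts out ineligible; the pointer value converts
	 * to the bool result.
	 */
	ret = tsk->mm;
	mempolicy = tsk->mempolicy;
	if (!mempolicy)
		goto out;

	/* ... mempolicy mode handling (MPOL_BIND/MPOL_INTERLEAVE etc.) unchanged ... */
out:
	task_unlock(tsk);
	return ret;
}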