Message ID: 1522322870-4335-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive)
State:      New, archived
On Thu, 29 Mar 2018 20:27:50 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> Theoretically it is possible that an mm_struct with 60000+ vmas loops
> while potentially allocating memory, with mm->mmap_sem held for write
> by the current thread. Unless I overlooked that fatal_signal_pending()
> is somewhere in the loop, this is bad if the current thread was
> selected as an OOM victim, for the current thread will continue
> allocations using memory reserves while the OOM reaper is unable to
> reclaim memory.

All of which implies to me that this patch fixes a problem which is not
known to exist!

> But there is no point in continuing the loop from the beginning if the
> current thread is killed. If there were __GFP_KILLABLE (or something
> like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
> to all allocations inside the loop. But since we don't have such a
> flag, this patch uses a fatal_signal_pending() check inside the loop.

Dumb question: if a thread has been oom-killed and then tries to
allocate memory, should the page allocator just fail the allocation
attempt?  I suppose there are all sorts of reasons why not :(

In which case, yes, setting a new
PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
tidy enough solution.  It would be a bit sad to add another test in the
hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
already.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			continue;
>  		}
>  		charge = 0;
> +		if (fatal_signal_pending(current)) {
> +			retval = -EINTR;
> +			goto out;
> +		}
>  		if (mpnt->vm_flags & VM_ACCOUNT) {
>  			unsigned long len = vma_pages(mpnt);

I think a comment explaining why we're doing this would help.

Better would be to add a new function, "current_is_oom_killed()" or
such, which becomes self-documenting.  Because there are other reasons
why a task may have a fatal signal pending.
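A minimal sketch of the helper Andrew is suggesting. The name
current_is_oom_killed() is his; the body is an assumption built on the
existing tsk_is_oom_victim() test, not code from any tree:

	#include <linux/oom.h>
	#include <linux/sched.h>

	/*
	 * Hypothetical helper along the lines Andrew proposes: the name
	 * documents *why* we bail out, where a bare fatal_signal_pending()
	 * check does not.  The tsk_is_oom_victim()-based body is an
	 * assumption; it is the existing test for "this task was selected
	 * by the OOM killer".
	 */
	static inline bool current_is_oom_killed(void)
	{
		return tsk_is_oom_victim(current);
	}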
On Thu 29-03-18 14:30:03, Andrew Morton wrote:
[...]
> Dumb question: if a thread has been oom-killed and then tries to
> allocate memory, should the page allocator just fail the allocation
> attempt?  I suppose there are all sorts of reasons why not :(

We give those tasks access to memory reserves to move on (see
oom_reserves_allowed) and fail allocation if reserves do not help:

	if (tsk_is_oom_victim(current) &&
	    (alloc_flags == ALLOC_OOM ||
	     (gfp_mask & __GFP_NOMEMALLOC)))
		goto nopage;

So we...

> In which case, yes, setting a new
> PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> tidy enough solution.  It would be a bit sad to add another test in the
> hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> already.

... do not need this.
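For reference, the oom_reserves_allowed() helper Michal points at looked
roughly like this in mm/page_alloc.c of the period (reconstructed from
memory; a sketch, not an authoritative quote of any particular release):

	static bool oom_reserves_allowed(struct task_struct *tsk)
	{
		if (!tsk_is_oom_victim(tsk))
			return false;

		/*
		 * !MMU has no OOM reaper, so give access to memory
		 * reserves only to the thread with TIF_MEMDIE set.
		 */
		if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
			return false;

		return true;
	}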
Michal Hocko wrote:
> On Thu 29-03-18 14:30:03, Andrew Morton wrote:
> [...]
> > Dumb question: if a thread has been oom-killed and then tries to
> > allocate memory, should the page allocator just fail the allocation
> > attempt?  I suppose there are all sorts of reasons why not :(
>
> We give those tasks access to memory reserves to move on (see
> oom_reserves_allowed) and fail allocation if reserves do not help:
>
> 	if (tsk_is_oom_victim(current) &&
> 	    (alloc_flags == ALLOC_OOM ||
> 	     (gfp_mask & __GFP_NOMEMALLOC)))
> 		goto nopage;
>
> So we...
>
> > In which case, yes, setting a new
> > PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> > tidy enough solution.  It would be a bit sad to add another test in the
> > hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> > already.
>
> ... do not need this.

Excuse me?  But that check is after

	/* Reclaim has failed us, start killing things */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

which means that tsk_is_oom_victim(current) && alloc_flags == ALLOC_OOM
threads can still trigger the OOM killer as soon as the OOM reaper sets
MMF_OOM_SKIP.
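For context, the two fragments quoted in this exchange sit adjacent in
__alloc_pages_slowpath(); abridged from mm/page_alloc.c of that era
(roughly v4.16), the ordering Tetsuo is describing is:

	/* Reclaim has failed us, start killing things */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

	/* Avoid allocations with no watermarks from looping endlessly */
	if (tsk_is_oom_victim(current) &&
	    (alloc_flags == ALLOC_OOM ||
	     (gfp_mask & __GFP_NOMEMALLOC)))
		goto nopage;

That is, the OOM killer can be invoked once more before an OOM victim
reaches the bail-out, which is the window Tetsuo's objection rests on.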
On Tue 03-04-18 20:32:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 29-03-18 14:30:03, Andrew Morton wrote:
> > [...]
> > > Dumb question: if a thread has been oom-killed and then tries to
> > > allocate memory, should the page allocator just fail the allocation
> > > attempt?  I suppose there are all sorts of reasons why not :(
> >
> > We give those tasks access to memory reserves to move on (see
> > oom_reserves_allowed) and fail allocation if reserves do not help:
> >
> > 	if (tsk_is_oom_victim(current) &&
> > 	    (alloc_flags == ALLOC_OOM ||
> > 	     (gfp_mask & __GFP_NOMEMALLOC)))
> > 		goto nopage;
> >
> > So we...
> >
> > > In which case, yes, setting a new
> > > PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> > > tidy enough solution.  It would be a bit sad to add another test in the
> > > hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> > > already.
> >
> > ... do not need this.
>
> Excuse me?  But that check is after
>
> 	/* Reclaim has failed us, start killing things */
> 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> 	if (page)
> 		goto got_pg;
>
> which means that tsk_is_oom_victim(current) && alloc_flags == ALLOC_OOM
> threads can still trigger the OOM killer as soon as the OOM reaper sets
> MMF_OOM_SKIP.

Races are possible and I do not see them as critical _right now_.  If
that turns out not to be the case, we can think of a more robust way.
The thing is that we already have a bail out for OOM victims.
On Thu, Mar 29, 2018 at 02:30:03PM -0700, Andrew Morton wrote:
> > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >  			continue;
> >  		}
> >  		charge = 0;
> > +		if (fatal_signal_pending(current)) {
> > +			retval = -EINTR;
> > +			goto out;
> > +		}
> >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> >  			unsigned long len = vma_pages(mpnt);
>
> I think a comment explaining why we're doing this would help.
>
> Better would be to add a new function, "current_is_oom_killed()" or
> such, which becomes self-documenting.  Because there are other reasons
> why a task may have a fatal signal pending.

I disagree that we need a comment here, or to create an alias.  Someone
who knows nothing of the oom-killer (like, er, me) reading that code sees
"Oh, we're checking for fatal signals here.  I guess it doesn't make sense
to continue forking a process if it's already received a fatal signal."

One might speculate about the causes of the fatal signal having been
received and settle on reasons which make sense even without thinking
of the OOM case.  Because it's why it was introduced, I always think
about a task blocked on a dead NFS mount.  If it's multithreaded and
one of the threads called fork() while another thread was blocked on a
page fault, and the dup_mmap() had to wait for the page fault to finish
... that would make some kind of sense.
On Tue 03-04-18 04:58:57, Matthew Wilcox wrote:
> On Thu, Mar 29, 2018 at 02:30:03PM -0700, Andrew Morton wrote:
> > > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >  			continue;
> > >  		}
> > >  		charge = 0;
> > > +		if (fatal_signal_pending(current)) {
> > > +			retval = -EINTR;
> > > +			goto out;
> > > +		}
> > >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> > >  			unsigned long len = vma_pages(mpnt);
> >
> > I think a comment explaining why we're doing this would help.
> >
> > Better would be to add a new function, "current_is_oom_killed()" or
> > such, which becomes self-documenting.  Because there are other reasons
> > why a task may have a fatal signal pending.
>
> I disagree that we need a comment here, or to create an alias.  Someone
> who knows nothing of the oom-killer (like, er, me) reading that code sees
> "Oh, we're checking for fatal signals here.  I guess it doesn't make sense
> to continue forking a process if it's already received a fatal signal."
>
> One might speculate about the causes of the fatal signal having been
> received and settle on reasons which make sense even without thinking
> of the OOM case.  Because it's why it was introduced, I always think
> about a task blocked on a dead NFS mount.  If it's multithreaded and
> one of the threads called fork() while another thread was blocked on a
> page fault, and the dup_mmap() had to wait for the page fault to finish
> ... that would make some kind of sense.

I completely agree.  If the check is really correct then it should be
pretty much self-explanatory, like many other checks.  There is
absolutely nothing oom-specific in there.  If a check _is_ oom-specific
then there is something fishy going on.
diff --git a/kernel/fork.c b/kernel/fork.c
index 1e8c9a7..38d5baa 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			continue;
 		}
 		charge = 0;
+		if (fatal_signal_pending(current)) {
+			retval = -EINTR;
+			goto out;
+		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
Theoretically it is possible that an mm_struct with 60000+ vmas loops
while potentially allocating memory, with mm->mmap_sem held for write
by the current thread.  Unless I overlooked that fatal_signal_pending()
is somewhere in the loop, this is bad if the current thread was
selected as an OOM victim, for the current thread will continue
allocations using memory reserves while the OOM reaper is unable to
reclaim memory.

But there is no point in continuing the loop from the beginning if the
current thread is killed.  If there were __GFP_KILLABLE (or something
like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
to all allocations inside the loop.  But since we don't have such a
flag, this patch uses a fatal_signal_pending() check inside the loop.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 kernel/fork.c | 4 ++++
 1 file changed, 4 insertions(+)
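For illustration, the scoped interface the changelog alludes to might
look like the sketch below.  Neither __GFP_KILLABLE nor
memalloc_killable_save()/memalloc_killable_restore() exists upstream;
every identifier here is a hypothetical name modeled on the existing
memalloc_nofs_save()/memalloc_nofs_restore() pair:

	/*
	 * Hypothetical: inside the scope, every allocation would behave
	 * as if __GFP_KILLABLE were set, failing promptly once a fatal
	 * signal is pending instead of dipping into memory reserves.
	 */
	unsigned int killable_flags = memalloc_killable_save();
	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
		/* ... existing dup_mmap() loop body, allocations and all ... */
	}
	memalloc_killable_restore(killable_flags);

With such an interface, the single fatal_signal_pending() check this
patch adds to the loop would be unnecessary.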