Message ID | 2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: fix page leak with multiple threads mapping the same page | expand |
On Tue, Jul 05, 2022 at 04:00:36PM -0400, Josef Bacik wrote: > We have an application with a lot of threads that use a shared mmap > backed by tmpfs mounted with -o huge=within_size. This application > started leaking loads of huge pages when we upgraded to a recent kernel. > > Using the page ref tracepoints and a BPF program written by Tejun Heo we > were able to determine that these pages would have multiple refcounts > from the page fault path, but when it came to unmap time we wouldn't > drop the number of refs we had added from the faults. > > I wrote a reproducer that mmap'ed a file backed by tmpfs with -o > huge=always, and then spawned 20 threads all looping faulting random > offsets in this map, while using madvise(MADV_DONTNEED) randomly for > huge page aligned ranges. This very quickly reproduced the problem. > > The problem here is that we check for the case that we have multiple > threads faulting in a range that was previously unmapped. One thread > maps the PMD, the other thread loses the race and then returns 0. > However at this point we already have the page, and we are no longer > putting this page into the processes address space, and so we leak the > page. We actually did the correct thing prior to f9ce0be71d1f, however > it looks like Kirill copied what we do in the anonymous page case. In > the anonymous page case we don't yet have a page, so we don't have to > drop a reference on anything. Previously we did the correct thing for > file based faults by returning VM_FAULT_NOPAGE so we correctly drop the > reference on the page we faulted in. > > Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable() > case, this makes us drop the ref on the page properly, and now my > reproducer no longer leaks the huge pages. > > Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org> > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > Signed-off-by: Rik van Riel <riel@surriel.com> > Signed-off-by: Chris Mason <clm@fb.com> Cc: stable@ ? > --- > mm/memory.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 7a089145cad4..f10724d7dca3 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) > > /* See comment in handle_pte_fault() */ > if (pmd_devmap_trans_unstable(vmf->pmd)) > - return 0; > + return VM_FAULT_NOPAGE; Comment update would be nice. Other instances of pmd_devmap_trans_unstable() return 0 in the fault path. Explanation would be helpful. Otherwise, Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
On Thu, 2022-07-07 at 01:46 +0300, Kirill A. Shutemov wrote: > On Tue, Jul 05, 2022 at 04:00:36PM -0400, Josef Bacik wrote: > > > > Fix this by returning VM_FAULT_NOPAGE in the > > pmd_devmap_trans_unstable() > > case, this makes us drop the ref on the page properly, and now my > > reproducer no longer leaks the huge pages. > > > > Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() > > codepaths") > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > > Signed-off-by: Rik van Riel <riel@surriel.com> > > Signed-off-by: Chris Mason <clm@fb.com> > > Cc: stable@ Yes, it should. How do we send a patch to stable@ after the start of the thread? > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) > > > > /* See comment in handle_pte_fault() */ > > if (pmd_devmap_trans_unstable(vmf->pmd)) > > - return 0; > > + return VM_FAULT_NOPAGE; > > Comment update would be nice. > > Other instances of pmd_devmap_trans_unstable() return 0 in the fault > path. > Explanation would be helpful. > The explanation is that by the time we get to finish_fault, we already have a reference on a page, and we need to ensure that reference gets released by the caller. VM_FAULT_NOPAGE is one of the ways to indicate that the page should be freed.
On Wed, 06 Jul 2022 20:42:56 -0400 Rik van Riel <riel@surriel.com> wrote: > On Thu, 2022-07-07 at 01:46 +0300, Kirill A. Shutemov wrote: > > On Tue, Jul 05, 2022 at 04:00:36PM -0400, Josef Bacik wrote: > > > > > > Fix this by returning VM_FAULT_NOPAGE in the > > > pmd_devmap_trans_unstable() > > > case, this makes us drop the ref on the page properly, and now my > > > reproducer no longer leaks the huge pages. > > > > > > Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() > > > codepaths") > > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > > Signed-off-by: Chris Mason <clm@fb.com> > > > > Cc: stable@ > > Yes, it should. How do we send a patch to stable@ > after the start of the thread? cc'ing the stable list doesn't have much affect, afaik. What Kirill means is that we should add cc:stable to this patch's changelog. I did that yesterday. > > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) > > > > > > /* See comment in handle_pte_fault() */ > > > if (pmd_devmap_trans_unstable(vmf->pmd)) > > > - return 0; > > > + return VM_FAULT_NOPAGE; > > > > Comment update would be nice. > > > > Other instances of pmd_devmap_trans_unstable() return 0 in the fault > > path. > > Explanation would be helpful. > > > The explanation is that by the time we get to > finish_fault, we already have a reference on a > page, and we need to ensure that reference > gets released by the caller. > > VM_FAULT_NOPAGE is one of the ways to indicate > that the page should be freed. I think Kirill meant this should be added to the code comment!
On Thu, 2022-07-07 at 17:58 -0700, Andrew Morton wrote: > On Wed, 06 Jul 2022 20:42:56 -0400 Rik van Riel <riel@surriel.com> > wrote: > > > > > The explanation is that by the time we get to > > finish_fault, we already have a reference on a > > page, and we need to ensure that reference > > gets released by the caller. > > > > VM_FAULT_NOPAGE is one of the ways to indicate > > that the page should be freed. > > I think Kirill meant this should be added to the code comment! OK, I finally got around to looking at that today, and it seems like the comment at the top of finish_fault() already has everything I wanted to write down. Is there anything else we should add here? /** * finish_fault - finish page fault once we have prepared the page to fault * * @vmf: structure describing the fault * * This function handles all that is needed to finish a page fault once the * page to fault in is prepared. It handles locking of PTEs, inserts PTE for * given page, adds reverse page mapping, handles memcg charges and LRU * addition. * * The function expects the page to be locked and on success it consumes a * reference of a page being mapped (for the PTE which maps it). * * Return: %0 on success, %VM_FAULT_ code in case of error. */ vm_fault_t finish_fault(struct vm_fault *vmf) {
diff --git a/mm/memory.c b/mm/memory.c index 7a089145cad4..f10724d7dca3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4371,7 +4371,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf) /* See comment in handle_pte_fault() */ if (pmd_devmap_trans_unstable(vmf->pmd)) - return 0; + return VM_FAULT_NOPAGE; vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);