Message ID | 20201209163950.8494-2-will@kernel.org (mailing list archive)
---|---
State | New, archived
Series | Create 'old' ptes for faultaround mappings on arm64 with hardware access flag
On Wed, Dec 9, 2020 at 8:40 AM Will Deacon <will@kernel.org> wrote:
>
> @@ -3978,8 +3994,17 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
>
>  	/* check if the page fault is solved */
>  	vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
> -	if (!pte_none(*vmf->pte))
> -		ret = VM_FAULT_NOPAGE;
> +	if (pte_none(*vmf->pte))
> +		goto out_unlock;
> +
> +	if (vmf->flags & FAULT_FLAG_PREFAULT_OLD) {
> +		pte_t pte = pte_mkyoung(*vmf->pte);
> +		if (ptep_set_access_flags(vmf->vma, address, vmf->pte, pte, 0))
> +			update_mmu_cache(vmf->vma, address, vmf->pte);
> +	}

Oh, please dear God no.

First you incorrectly set it old, and then you conditionally make it
young again and as a result force an atomic rmw update and another TLB
flush for no good reason.

Just make sure that FAULT_FLAG_PREFAULT_OLD never sets the *actual*
address to old.

And yes, that probably means that you need to change "alloc_set_pte()"
to actually pass in the real address, and leave "vmf->address" alone -
so that it can know which ones are prefaulted and which one is real,
but that sounds like a good idea anyway.

Then you can just make alloc_set_pte() do the right thing in the first
place, instead of doing this nasty "let's do it wrong and fix it up
later" horror.

              Linus
On Wed, Dec 09, 2020 at 09:58:12AM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 8:40 AM Will Deacon <will@kernel.org> wrote:
> >
> > @@ -3978,8 +3994,17 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
> >
> >  	/* check if the page fault is solved */
> >  	vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
> > -	if (!pte_none(*vmf->pte))
> > -		ret = VM_FAULT_NOPAGE;
> > +	if (pte_none(*vmf->pte))
> > +		goto out_unlock;
> > +
> > +	if (vmf->flags & FAULT_FLAG_PREFAULT_OLD) {
> > +		pte_t pte = pte_mkyoung(*vmf->pte);
> > +		if (ptep_set_access_flags(vmf->vma, address, vmf->pte, pte, 0))
> > +			update_mmu_cache(vmf->vma, address, vmf->pte);
> > +	}
>
> Oh, please dear God no.
>
> First you incorrectly set it old, and then you conditionally make it
> young again and as a result force an atomic rmw update and another TLB
> flush for no good reason.

There shouldn't be a TLB flush here, but I agree that it would have to
go and nobble the hash for PowerPC if they wanted to enable this.

> Just make sure that FAULT_FLAG_PREFAULT_OLD never sets the *actual*
> address to old.
>
> And yes, that probably means that you need to change "alloc_set_pte()"
> to actually pass in the real address, and leave "vmf->address" alone -
> so that it can know which ones are prefaulted and which one is real,
> but that sounds like a good idea anyway.

Right, I deliberately avoided that based on the feedback from Jan on an
older version [1], but I can certainly look at it again.

> Then you can just make alloc_set_pte() do the right thing in the first
> place, instead of doing this nasty "let's do it wrong and fix it up
> later" horror.

I'll have a crack at this in v2.

Cheers,

Will

[1] https://patchwork.kernel.org/project/linux-arm-kernel/patch/1511845670-12133-1-git-send-email-vinmenon@codeaurora.org/
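For context, the direction being discussed would look roughly like the sketch
below. This is only an illustration, not the code that was merged: the extra
'addr' parameter is the change Linus is asking for, and the
arch_wants_old_prefaulted_pte() hook name is an assumption borrowed from later
revisions of this series rather than something established in this thread
(rmap and counter updates from the real helper are elided).

        /*
         * Hedged sketch: create the surrounding prefaulted PTEs old and the
         * actually-faulting PTE young up front, so there is nothing to "fix
         * up" (and no extra atomic access-flags update) afterwards.
         */
        static void do_set_pte_sketch(struct vm_fault *vmf, struct page *page,
                                      unsigned long addr)
        {
                struct vm_area_struct *vma = vmf->vma;
                bool write = vmf->flags & FAULT_FLAG_WRITE;
                bool prefault = addr != vmf->address;   /* faultaround entry? */
                pte_t entry;

                flush_icache_page(vma, page);
                entry = mk_pte(page, vma->vm_page_prot);

                if (prefault && arch_wants_old_prefaulted_pte())
                        entry = pte_mkold(entry);       /* speculative mapping: old */
                else
                        entry = pte_sw_mkyoung(entry);  /* the real fault: young */

                if (write)
                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);

                /* rmap/mm counter updates elided in this sketch */
                set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
        }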
On Wed, Dec 9, 2020 at 10:40 AM Will Deacon <will@kernel.org> wrote:
>
> > And yes, that probably means that you need to change "alloc_set_pte()"
> > to actually pass in the real address, and leave "vmf->address" alone -
> > so that it can know which ones are prefaulted and which one is real,
> > but that sounds like a good idea anyway.
>
> Right, I deliberately avoided that based on the feedback from Jan on an
> older version [1], but I can certainly look at it again.

Side note: I absolutely detest alloc_set_pte() in general, and it would
be very good to try to just change the rules of that function entirely.

The alloc_set_pte() function is written to be some generic "iterate
over ptes setting them" function, but it has exactly two users, and
none of those users really want the semantics it gives them, I feel.
And the semantics that alloc_set_pte() does have actually *hurt* them.

In particular, it made it a nightmare to read what do_fault_around()
does: it does that odd

        if (pmd_none(*vmf->pmd)) {
                vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);

and then it calls ->map_pages() (which is always filemap_map_pages(),
except for xfs, where it is also always filemap_map_pages() but it
takes a lock first).

And it did that prealloc_pte, because that's what alloc_set_pte()
needs to be atomic - and it needs to be atomic because it's all done
under the RCU lock.

So essentially, the _major_ user of alloc_set_pte() requires atomicity
- but that's not at all obvious when you actually read alloc_set_pte()
itself, and in fact when you read it, you see that

        if (!vmf->pte) {
                ret = pte_alloc_one_map(vmf);

which can block. Except it won't block if we have vmf->prealloc_pte,
because then it will just take that instead.

In other words, ALL THAT IS COMPLETE GARBAGE. And it's written to be
as obtuse as humanly possible, and there is no way in hell anybody
understands it by just looking at the code - you have to follow all
these odd quirks back to see what's going on.

So it would be much better to get rid of alloc_set_pte() entirely, and
move the allocation and the pte locking into the caller, and just
clean it up (because it turns out the callers have their own models
for allocation anyway).

And then each of the callers would be a lot more understandable,
instead of having this insanely subtle "I need to be atomic, so I will
pre-allocate in one place, and then take the RCU lock in another
place, in order to call a _third_ place that is only atomic because of
that first step".

The other user of alloc_set_pte() is "finish_fault()", and honestly,
that's what alloc_set_pte() was written for, and why alloc_set_pte()
does all those things that filemap_map_pages() doesn't even want.

But while that other user does want what alloc_set_pte() does, there's
no downside to just moving those things into the caller. It would once
again just clarify things to make the allocation and the setting be
separate operations - even in that place where it doesn't have the
same kind of very subtle behavior with pre-allocation and atomicity.

I think the problem with alloc_set_pte() is that it has had many
different people working on it over the years, and Kirill massaged
that old use to fit the new pre-alloc use. Before the pre-faulting, I
think both cases were much closer to each other.

So I'm not really blaming anybody here, but the ugly and very
hard-to-follow rules seem to come from historical behavior that was
massaged over time to "just work" rather than have that function be
made more logical.

Kirill - I think you know this code best. Would you be willing to look
at splitting out that "allocate and map" from alloc_set_pte() and into
the callers?

At that point, I think the current very special and odd
do_fault_around() pre-allocation could be made into just a _regular_
"allocate the pmd if it doesn't exist". And then the pte locking could
be moved into filemap_map_pages(), and suddenly the semantics and
rules around all that would be a whole lot more obvious.

Pretty please?

              Linus
On Wed, Dec 09, 2020 at 11:04:13AM -0800, Linus Torvalds wrote:
> In particular, it made it a nightmare to read what do_fault_around()
> does: it does that odd
>
>         if (pmd_none(*vmf->pmd)) {
>                 vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
>
> and then it calls ->map_pages() (which is always filemap_map_pages(),
> except for xfs, where it is also always filemap_map_pages() but it
> takes a lock first).

... which is wrong. Dave's paranoia around other people introducing
bugs into XFS made him do this, but we should revert
cd647d5651c0b0deaa26c1acb9e1789437ba9bc7. Those operations he's
worried about are protected by the page lock. If a filesystem has put
an Uptodate page into the page cache, the rest of the kernel can read
it without telling the filesystem.
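For reference, the XFS wrapper being discussed is, roughly, the one that
commit cd647d5651c0 added to fs/xfs/xfs_file.c: it takes the XFS MMAPLOCK in
shared mode around the generic helper. Treat the code below as an approximate
reconstruction for context rather than a verbatim quote of that commit.

        /* Approximate sketch of xfs_filemap_map_pages() from that commit. */
        STATIC void
        xfs_filemap_map_pages(
                struct vm_fault         *vmf,
                pgoff_t                 start_pgoff,
                pgoff_t                 end_pgoff)
        {
                struct inode            *inode = file_inode(vmf->vma->vm_file);

                xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);   /* the extra lock */
                filemap_map_pages(vmf, start_pgoff, end_pgoff);
                xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
        }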
On Wed, Dec 9, 2020 at 12:32 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> If a filesystem has put an Uptodate page into the page cache, the
> rest of the kernel can read it without telling the filesystem.

XFS does the same thing for xfs_file_read_iter() too.

Not that I disagree with you - when you mmap a file, once it's mapped
you see the data without any lock anyway. So it's all kinds of
pointless to serialize the page fault, because that's simply not
relevant. The lock will be gone by the time the user actually sees the
data.

But hey, the XFS people have their opinions.

              Linus
On Wed, Dec 09, 2020 at 11:04:13AM -0800, Linus Torvalds wrote: > > Kirill - I think you know this code best. Would you be willing to look > at splitting out that "allocate and map" from alloc_set_pte() and into > the callers? See lightly tested patch below. Is it something you had in mind? I don't like map_set_pte() name. Any suggestions? > At that point, I think the current very special and odd > do_fault_around() pre-allocation could be made into just a _regular_ > "allocate the pmd if it doesn't exist". And then the pte locking could > be moved into filemap_map_pages(), and suddenly the semantics and > rules around all that would be a whole lot more obvious. No. It would stop faultaround code from mapping huge pages. We had to defer pte page table mapping until we know we don't have huge pages in page cache. diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..6b1bfa0fd488 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -535,7 +535,7 @@ struct vm_fault { */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * map_set_pte() from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. @@ -972,7 +972,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t map_set_pte(struct vm_fault *vmf, struct page *page); vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif diff --git a/mm/filemap.c b/mm/filemap.c index 331f4261d723..4446ae6d6d09 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2884,7 +2884,7 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) + if (map_set_pte(vmf, page)) goto unlock; unlock_page(head); goto next; diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..1db1000767fc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See the comment in map_set_pte() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3641,55 +3641,6 @@ static int pmd_devmap_trans_unstable(pmd_t *pmd) return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); } -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. 
- * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. - */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). - */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3769,45 +3720,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. - * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +static void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3741,56 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); +} - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); +vm_fault_t map_set_pte(struct vm_fault *vmf, struct page *page) +{ + struct vm_area_struct *vma = vmf->vma; + vm_fault_t ret; + + if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (!vmf->pte) { + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(vma->vm_mm); + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + + /* + * If a huge pmd materialized under us just retry later. Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. 
All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return VM_FAULT_NOPAGE; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + } + + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vmf->vma, vmf->address, vmf->pte); return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3808,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? */ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3822,34 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -4340,7 +4320,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* See comment in map_set_pte() */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Thu, Dec 10, 2020 at 7:08 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> See lightly tested patch below. Is it something you had in mind?

This is closer, in that at least it removes the ostensibly blocking
allocation (that can't happen) from the prefault path.

But the main issue remains:

> > At that point, I think the current very special and odd
> > do_fault_around() pre-allocation could be made into just a _regular_
> > "allocate the pmd if it doesn't exist". And then the pte locking could
> > be moved into filemap_map_pages(), and suddenly the semantics and
> > rules around all that would be a whole lot more obvious.
>
> No. It would stop faultaround code from mapping huge pages. We had to
> defer pte page table mapping until we know we don't have huge pages in
> page cache.

Can we please move that part to the callers too - possibly with a
separate helper function?

Because the real issue remains: as long as the map_set_pte() function
takes the pte lock, the caller cannot rely on it. And the
filemap_map_pages() code really would like to rely on it.

Because if the lock is taken there *above* the loop - or even in the
loop iteration at the top - the code can now do things that rely on "I
know I hold the page table lock".

In particular, we can get rid of that very very expensive page
locking. Which is the reason I know about the horrid current issue of
"pre-allocate in one place, lock in another, and know we are atomic in
a third place".

Because I had to walk down these paths and realize that "this loop is
run under the page table lock, EXCEPT for the first iteration, where
it's taken by the first time we do that non-allocating
alloc_set_pte()".

See?

              Linus
On Mon, Dec 14, 2020 at 8:07 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Here it is. Still barely tested.

Ok, from looking at the patch (not applying it and looking at the end
result), I think the locking - at least for the filemap_map_pages()
case - is a lot easier to understand.

So you seem to have fixed the thing I personally found most confusing.
Thanks.

> I expected to hate it more, but it looks reasonable. Opencoded
> xas_for_each() smells bad, but...

I think the open-coded xas_for_each() per se isn't a problem, but I
agree that the startup condition is a bit ugly. And I'm actually
personally more confused by why xas_retry() is needed here, but not in
many other places. That is perhaps more obvious now that it shows up
twice.

Adding Willy to the cc in case he has comments on that, and can
explain it to me in small words.

[ https://lore.kernel.org/lkml/20201214160724.ewhjqoi32chheone@box/
  for context ]

And I actually think it might make even more sense if you moved more
of the pmd handling into "filemap_map_pages_pmd()". Now it's a bit
odd, with filemap_map_pages() containing part of the pmd handling, and
then part being in filemap_map_pages_pmd().

Could we have a "filemap_map_pmd()" that does it all, and then the
filemap_map_pages() logic would be more along the lines of

        if (filemap_map_pmd(vmf, xas)) {
                rcu_read_unlock();
                return;
        }

        ... then handle pte's ...

Hmm? There may be some shared state thing why you didn't do it; the
above is mostly a syntactic issue.

Thanks,
              Linus
On Mon, Dec 14, 2020 at 09:54:06AM -0800, Linus Torvalds wrote:
> > I expected to hate it more, but it looks reasonable. Opencoded
> > xas_for_each() smells bad, but...
>
> I think the open-coded xas_for_each() per se isn't a problem, but I
> agree that the startup condition is a bit ugly. And I'm actually
> personally more confused by why xas_retry() is needed here, but not in
> many other places. That is perhaps more obvious now that it shows up
> twice.
>
> Adding Willy to the cc in case he has comments on that, and can
> explain it to me in small words.
>
> [ https://lore.kernel.org/lkml/20201214160724.ewhjqoi32chheone@box/
>   for context ]

The xas_retry() is something I now regret, but haven't got annoyed
enough by yet to fix (also, other projects). It originated in the
radix tree, where we would get a radix_tree_node and then iterate over
it in header macros. If we're holding the rcu_read_lock() and somebody
else deletes an entry, leaving the entry at index 0 as the only index
in the tree, we tell the RCU readers to rewalk the tree from the top
by putting a retry entry in place of the real entry.

It's not entirely clear to me now why we did that. Just leave the
entry alone and the RCU walkers will see it; the rest of the node is
empty.

As to why we need to do this in some places and not others: you can
only see a retry entry if you're only protected by the RCU lock. If
you're protected by the spinlock, you can't see any nodes which
contain retry entries.

But I now think we should just get rid of retry entries. Maybe I'm
missing a good reason to keep them.
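To make the rule concrete ("retry entries are only visible to lockless
walkers"), a minimal RCU-side XArray lookup loop looks roughly like the sketch
below; the function name is made up for illustration, and the speculative
reference/reload dance from the real fault path is elided.

        /*
         * Minimal sketch of an RCU-protected XArray walk. Retry entries can
         * only be seen here because we hold just rcu_read_lock(); a walker
         * holding the xa_lock never observes one. On xas_retry() the walk
         * state is reset, so the next iteration restarts from the top.
         */
        static struct page *lookup_sketch(struct address_space *mapping,
                                          pgoff_t start, pgoff_t end)
        {
                XA_STATE(xas, &mapping->i_pages, start);
                struct page *page;

                rcu_read_lock();
                xas_for_each(&xas, page, end) {
                        if (xas_retry(&xas, page))
                                continue;       /* raced with node deletion: rewalk */
                        if (xa_is_value(page))
                                continue;       /* shadow/swap entry, not a page */
                        /* ... page_cache_get_speculative() + xas_reload() here ... */
                        break;
                }
                rcu_read_unlock();
                return page;
        }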
On Mon, Dec 14, 2020 at 09:54:06AM -0800, Linus Torvalds wrote: > Could we have a "filemap_map_pmd()" that does it all, and then the > filemap_map_pages() logic would be more along the lines of > > if (filemap_map_pmd(vmf, xas)) { > rcu_read_unlock(); > return; > } > > ... then handle pte's ... > > Hmm? If this looks fine, I'll submit a proper patch. diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..2825153ad0d6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. + */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..a7e21cff4ede 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page, + struct xa_state *xas) +{ + struct vm_area_struct *vma = vmf->vma; + struct address_space *mapping = vma->vm_file->f_mapping; + + /* Huge page is mapped? No need to proceed. 
*/ + if (pmd_trans_huge(*vmf->pmd)) + return true; + + if (xa_is_value(page)) + goto nohuge; + + if (!pmd_none(*vmf->pmd)) + goto nohuge; + + if (!PageTransHuge(page) || PageLocked(page)) + goto nohuge; + + if (!page_cache_get_speculative(page)) + goto nohuge; + + if (page != xas_reload(xas)) + goto unref; + + if (!PageTransHuge(page)) + goto unref; + + if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page)) + goto unref; + + if (!trylock_page(page)) + goto unref; + + if (page->mapping != mapping || !PageUptodate(page)) + goto unlock; + + if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE)) + goto unlock; + + do_set_pmd(vmf, page); + unlock_page(page); + return true; +unlock: + unlock_page(page); +unref: + put_page(page); +nohuge: + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(vma->vm_mm); + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return true; + + return false; +} + void filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff) { - struct file *file = vmf->vma->vm_file; + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; unsigned long max_idx; @@ -2843,20 +2908,36 @@ void filemap_map_pages(struct vm_fault *vmf, unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { + head = xas_find(&xas, end_pgoff); + if (!head) { + rcu_read_unlock(); + return; + } + + while (xas_retry(&xas, head)) + head = xas_next_entry(&xas, end_pgoff); + + if (filemap_map_pmd(vmf, head, &xas)) { + rcu_read_unlock(); + return; + } + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + + for (; head; head = xas_next_entry(&xas, end_pgoff)) { if (xas_retry(&xas, head)) continue; if (xa_is_value(head)) - goto next; - + continue; /* * Check for a locked page first, as a speculative * reference may adversely influence page migration. */ if (PageLocked(head)) - goto next; + continue; if (!page_cache_get_speculative(head)) - goto next; + continue; /* Has the page moved or been split? */ if (unlikely(head != xas_reload(&xas))) @@ -2884,19 +2965,18 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) - goto unlock; + if (pte_none(*vmf->pte)) + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + continue; unlock: unlock_page(head); skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. 
*/ - if (pmd_trans_huge(*vmf->pmd)) - break; } + pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); } diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..96d62774096a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See the comment in map_set_pte() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. - */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). - */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. 
- * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? */ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); if (!pte_none(*vmf->pte)) ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); out: vmf->address = address; vmf->pte = NULL; @@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. 
Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Wed, Dec 16, 2020 at 9:07 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> If this looks fine, I'll submit a proper patch.

That patch looks good to me. It would be good if somebody else looked
it through - maybe I like it just because I got to pee in the snow and
make my mark. But I think that filemap_map_pages() now looks a lot
more understandable, and having that pte_offset_map_lock() outside the
loop should be good.

In particular, it will make it much easier to then make arguments
about things like

 - increase the page mapping count
 - check that the page is not locked
 - check that the page still has a mapping

and we can say "we know we hold the page table lock, and we increased
the page mapping count, and we saw that the page still had a mapping
and was not locked after that".

And that means that with some trivial memory ordering rules, we can
know that any concurrent "truncate()" that got the page lock after we
checked it will get held up at

        if (page_mapped(page)) {
                unsigned int nr = thp_nr_pages(page);
                unmap_mapping_pages(mapping, page->index, nr, false);
        }

in truncate_cleanup_page(), and that means that we can safely map the
page _without_ ever actually even doing a "trylock_page()" at all.

Before this patch of yours, that logic would have been broken, because
the page table lock wasn't actually held for the whole first iteration
of that loop in filemap_map_pages().

And getting rid of the trylock_page() in that path gets rid of _the_
major page lock case for at least one class of loads.

So I like the patch. But I would _really_ like for somebody else to
look at it, and look at my thinking above. Because maybe I'm missing
something.

              Linus
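The ordering argument can be sketched in code roughly as follows. This is
purely illustrative of the sequence described above - the helper name is
invented, the back-out path is elided, and it is not the patch under review or
what was eventually merged; everything is assumed to run under vmf->ptl.

        /*
         * Illustrative sketch only: map first (raising the mapcount), check
         * the page state second, with the page table lock held throughout.
         */
        static bool map_page_ordering_sketch(struct vm_fault *vmf,
                                             struct page *page,
                                             struct address_space *mapping)
        {
                /* 1. Install the pte: this raises the page's mapcount. */
                do_set_pte(vmf, page);

                /*
                 * 2. Only afterwards check the page state. Any truncate that
                 *    takes the page lock after these checks will observe
                 *    page_mapped() and go through unmap_mapping_pages(),
                 *    which needs our page table lock and so waits for us -
                 *    hence no trylock_page() on the fault side.
                 */
                if (page->mapping != mapping || !PageUptodate(page) ||
                    PageLocked(page)) {
                        /* back out: the just-installed pte would be zapped here */
                        return false;
                }
                return true;
        }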
On Wed, Dec 16, 2020 at 10:41:36AM -0800, Linus Torvalds wrote:
> On Wed, Dec 16, 2020 at 9:07 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > If this looks fine, I'll submit a proper patch.
>
> That patch looks good to me.
>
> It would be good if somebody else looked it through - maybe I like it
> just because I got to pee in the snow and make my mark. But I think
> that filemap_map_pages() now looks a lot more understandable, and
> having that pte_offset_map_lock() outside the loop should be good.

It's worth noting that after the change, in the worst-case scenario we
will have an additional ref/unref and lock/unlock of the page if we
get deep enough into filemap_map_pmd() but fail to map the page.

Also, if the range doesn't have a mappable page we would set up a page
table in the PMD entry. It means we cannot have a huge page mapped
there later. It may be a bummer: getting the page table out of the
page table tree requires mmap_write_lock().

We also take ptl for cold page cache. It may increase ptl contention,
but it should be negligible with split-ptl.
On Thu, Dec 17, 2020 at 2:54 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Also, if the range doesn't have a mappable page we would set up a page
> table in the PMD entry. It means we cannot have a huge page mapped
> there later. It may be a bummer: getting the page table out of the
> page table tree requires mmap_write_lock().
>
> We also take ptl for cold page cache. It may increase ptl contention,
> but it should be negligible with split-ptl.

Both good points. I doubt the second one is really noticeable, since
if it isn't cached you're going to have all the costs of actually
getting the page, but the first one sounds fairly fundamental.

But I think both issues could be easily fixed by doing that
"xa_is_value()" check and checking for 'head' being non-NULL early.
In fact, maybe that should be done as part of that early setup loop.

So that early code that is now

+       head = xas_find(&xas, end_pgoff);
+       if (!head) {
+               rcu_read_unlock();
+               return;
+       }
+
+       while (xas_retry(&xas, head))
+               head = xas_next_entry(&xas, end_pgoff);

could/should perhaps be something more along the lines of

+       head = xas_find(&xas, end_pgoff);
+       for (; ; head = xas_next_entry(&xas, end_pgoff)) {
+               if (!head) {
+                       rcu_read_unlock();
+                       return;
+               }
+               if (likely(!xas_retry(&xas, head)))
+                       break;
+       }

instead. So that if we don't find any cached entries, we won't do
anything, rather than take locks and then not do anything.

Then that second loop very naturally becomes a "do { } while ()" one.

Hmm?

              Linus
On Thu, Dec 17, 2020 at 10:22:33AM -0800, Linus Torvalds wrote: > + head = xas_find(&xas, end_pgoff); > + for (; ; head = xas_next_entry(&xas, end_pgoff)) { > + if (!head) { > + rcu_read_unlock(); > + return; > + } > + if (likely(!xas_retry(&xas, head)) > + break; > + } > > instead. So that if we don't find any cached entries, we won't do > anything, rather than take locks and then not do anything. This should do. See below. > Then that second loop very naturally becomes a "do { } while ()" one. I don't see it. I haven't found a reasonable way to rework it do-while. diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..2825153ad0d6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. + */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..04975682296d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page, + struct xa_state *xas) +{ + struct vm_area_struct *vma = vmf->vma; + struct address_space *mapping = vma->vm_file->f_mapping; + + /* Huge page is mapped? No need to proceed. 
*/ + if (pmd_trans_huge(*vmf->pmd)) + return true; + + if (xa_is_value(page)) + goto nohuge; + + if (!pmd_none(*vmf->pmd)) + goto nohuge; + + if (!PageTransHuge(page) || PageLocked(page)) + goto nohuge; + + if (!page_cache_get_speculative(page)) + goto nohuge; + + if (page != xas_reload(xas)) + goto unref; + + if (!PageTransHuge(page)) + goto unref; + + if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page)) + goto unref; + + if (!trylock_page(page)) + goto unref; + + if (page->mapping != mapping || !PageUptodate(page)) + goto unlock; + + if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE)) + goto unlock; + + do_set_pmd(vmf, page); + unlock_page(page); + return true; +unlock: + unlock_page(page); +unref: + put_page(page); +nohuge: + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(vma->vm_mm); + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return true; + + return false; +} + void filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff) { - struct file *file = vmf->vma->vm_file; + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; unsigned long max_idx; @@ -2843,20 +2908,37 @@ void filemap_map_pages(struct vm_fault *vmf, unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { + head = xas_find(&xas, end_pgoff); + for (; ; head = xas_next_entry(&xas, end_pgoff)) { + if (!head) { + rcu_read_unlock(); + return; + } + if (likely(!xas_retry(&xas, head))) + break; + } + + if (filemap_map_pmd(vmf, head, &xas)) { + rcu_read_unlock(); + return; + } + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + + for (; head; head = xas_next_entry(&xas, end_pgoff)) { if (xas_retry(&xas, head)) continue; if (xa_is_value(head)) - goto next; - + continue; /* * Check for a locked page first, as a speculative * reference may adversely influence page migration. */ if (PageLocked(head)) - goto next; + continue; if (!page_cache_get_speculative(head)) - goto next; + continue; /* Has the page moved or been split? */ if (unlikely(head != xas_reload(&xas))) @@ -2884,19 +2966,18 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) - goto unlock; + if (pte_none(*vmf->pte)) + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + continue; unlock: unlock_page(head); skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. 
*/ - if (pmd_trans_huge(*vmf->pmd)) - break; } + pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); } diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..96d62774096a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See the comment in map_set_pte() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. - */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). - */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. 
- * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? */ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); if (!pte_none(*vmf->pte)) ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); out: vmf->address = address; vmf->pte = NULL; @@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. 
Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Fri, Dec 18, 2020 at 3:04 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> This should do. See below.

Looks fine.

> > Then that second loop very naturally becomes a "do { } while ()" one.
>
> I don't see it. I haven't found a reasonable way to rework it do-while.

Now that you return early for the "head == NULL" case, this loop:

        for (; head; head = xas_next_entry(&xas, end_pgoff)) {
                [...]
        }

very naturally becomes

        do {
                [...]
        } while ((head = xas_next_entry(&xas, end_pgoff)) != NULL);

because the initial test for 'head' being NULL is no longer needed,
and thus it's a lot more logical to just test it at the end of the
loop when we update it.

No? Maybe I'm missing something silly.

              Linus
On Fri, Dec 18, 2020 at 10:56:55AM -0800, Linus Torvalds wrote: > No? Okay, but we only win the NULL check. xas_retry() and xa_is_value() has to be repeated in the beginning of the loop. From b4f4c0d32e654b8459a1e439a453373499b8946a Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Date: Sat, 19 Dec 2020 15:19:23 +0300 Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> --- include/linux/mm.h | 8 +- include/linux/pgtable.h | 11 +++ mm/filemap.c | 109 +++++++++++++++++++++++---- mm/memory.c | 162 ++++++++++++---------------------------- 4 files changed, 158 insertions(+), 132 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..2825153ad0d6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. 
+ */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..a9b21ec52e73 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page, + struct xa_state *xas) +{ + struct vm_area_struct *vma = vmf->vma; + struct address_space *mapping = vma->vm_file->f_mapping; + + /* Huge page is mapped? No need to proceed. */ + if (pmd_trans_huge(*vmf->pmd)) + return true; + + if (xa_is_value(page)) + goto nohuge; + + if (!pmd_none(*vmf->pmd)) + goto nohuge; + + if (!PageTransHuge(page) || PageLocked(page)) + goto nohuge; + + if (!page_cache_get_speculative(page)) + goto nohuge; + + if (page != xas_reload(xas)) + goto unref; + + if (!PageTransHuge(page)) + goto unref; + + if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page)) + goto unref; + + if (!trylock_page(page)) + goto unref; + + if (page->mapping != mapping || !PageUptodate(page)) + goto unlock; + + if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE)) + goto unlock; + + do_set_pmd(vmf, page); + unlock_page(page); + return true; +unlock: + unlock_page(page); +unref: + put_page(page); +nohuge: + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(vma->vm_mm); + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return true; + + return false; +} + void filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff) { - struct file *file = vmf->vma->vm_file; + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; unsigned long max_idx; @@ -2843,20 +2908,37 @@ void filemap_map_pages(struct vm_fault *vmf, unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { + head = xas_find(&xas, end_pgoff); + for (; ; head = xas_next_entry(&xas, end_pgoff)) { + if (!head) { + rcu_read_unlock(); + return; + } + if (likely(!xas_retry(&xas, head))) + break; + } + + if (filemap_map_pmd(vmf, head, &xas)) { + rcu_read_unlock(); + return; + } + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + + do { if (xas_retry(&xas, head)) continue; if (xa_is_value(head)) - goto next; - + continue; /* * Check for a locked page first, as a speculative * reference may adversely influence page migration. */ if (PageLocked(head)) - goto next; + continue; if (!page_cache_get_speculative(head)) - goto next; + continue; /* Has the page moved or been split? 
*/ if (unlikely(head != xas_reload(&xas))) @@ -2884,19 +2966,18 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) - goto unlock; + if (pte_none(*vmf->pte)) + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + continue; unlock: unlock_page(head); skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. */ - if (pmd_trans_huge(*vmf->pmd)) - break; - } + } while ((head = xas_next_entry(&xas, end_pgoff)) != NULL); + pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); } diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..96d62774096a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See the comment in map_set_pte() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. - */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). 
- */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. - * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? 
*/ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); if (!pte_none(*vmf->pte)) ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); out: vmf->address = address; vmf->pte = NULL; @@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Sat, Dec 19, 2020 at 4:41 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > Okay, but we only win the NULL check. xas_retry() and xa_is_value() has to > be repeated in the beginning of the loop. Yeah. I wonder if it might make sense to have a "xas_next_entry_rcu()" function that does something like while ((entry = xas_next_entry(&xas, end)) != NULL) { if (xas_retry(entry) || xa_is_value(entry)) continue; return entry; } return NULL; but that's a question for Willy, and independent of this patch. I like the patch, and I'll think about it a bit more, but I might decide to just apply it to get this thing over with. Otherwise it should probably go into -mm for more testing. Linus
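Written out with the calling conventions used in the patches above (xas_retry() takes the xa_state as well as the entry), the helper Linus is wondering about might look roughly like the sketch below; xas_next_entry_rcu() is hypothetical and not an existing xarray interface.

        static inline void *xas_next_entry_rcu(struct xa_state *xas,
                                               unsigned long end)
        {
                void *entry;

                /* Skip retry markers and shadow/value entries, as the
                 * callers in filemap_map_pages() currently do by hand. */
                while ((entry = xas_next_entry(xas, end)) != NULL) {
                        if (xas_retry(xas, entry) || xa_is_value(entry))
                                continue;
                        return entry;
                }
                return NULL;
        }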
On Sat, Dec 19, 2020 at 4:41 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > @@ -2884,19 +2966,18 @@ void filemap_map_pages(struct vm_fault *vmf, > if (vmf->pte) > vmf->pte += xas.xa_index - last_pgoff; > last_pgoff = xas.xa_index; > - if (alloc_set_pte(vmf, page)) > - goto unlock; > + if (pte_none(*vmf->pte)) > + do_set_pte(vmf, page); > + /* no need to invalidate: a not-present page won't be cached */ > + update_mmu_cache(vma, vmf->address, vmf->pte); > unlock_page(head); > - goto next; > + continue; This can't be right. Look at what happens if "pte_none()" is not true.. It won't install the new pte, but it also won't drop the ref to the page. So I think it needs to be - if (alloc_set_pte(vmf, page)) + if (!pte_none(*vmf->pte)) goto unlock; + do_set_pte(vmf, page); instead, so that the "if somebody else already filled the page table" case gets handled right. Hmm? Linus
On Sat, Dec 19, 2020 at 12:34:17PM -0800, Linus Torvalds wrote: > On Sat, Dec 19, 2020 at 4:41 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > @@ -2884,19 +2966,18 @@ void filemap_map_pages(struct vm_fault *vmf, > > if (vmf->pte) > > vmf->pte += xas.xa_index - last_pgoff; > > last_pgoff = xas.xa_index; > > - if (alloc_set_pte(vmf, page)) > > - goto unlock; > > + if (pte_none(*vmf->pte)) > > + do_set_pte(vmf, page); > > + /* no need to invalidate: a not-present page won't be cached */ > > + update_mmu_cache(vma, vmf->address, vmf->pte); > > unlock_page(head); > > - goto next; > > + continue; > > This can't be right. > > Look at what happens if "pte_none()" is not true.. It won't install > the new pte, but it also won't drop the ref to the page. Ouch. > So I think it needs to be > > - if (alloc_set_pte(vmf, page)) > + if (!pte_none(*vmf->pte)) > goto unlock; > + do_set_pte(vmf, page); > > instead, so that the "if somebody else already filled the page table" > case gets handled right. Updated patch is below. From 0ec1bc1fe95587350ac4f4c866d6482383740b36 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Date: Sat, 19 Dec 2020 15:19:23 +0300 Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> --- include/linux/mm.h | 8 +- include/linux/pgtable.h | 11 +++ mm/filemap.c | 109 +++++++++++++++++++++++---- mm/memory.c | 162 ++++++++++++---------------------------- 4 files changed, 159 insertions(+), 131 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..2825153ad0d6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). 
this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. + */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..f8fdbe079375 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page, + struct xa_state *xas) +{ + struct vm_area_struct *vma = vmf->vma; + struct address_space *mapping = vma->vm_file->f_mapping; + + /* Huge page is mapped? No need to proceed. */ + if (pmd_trans_huge(*vmf->pmd)) + return true; + + if (xa_is_value(page)) + goto nohuge; + + if (!pmd_none(*vmf->pmd)) + goto nohuge; + + if (!PageTransHuge(page) || PageLocked(page)) + goto nohuge; + + if (!page_cache_get_speculative(page)) + goto nohuge; + + if (page != xas_reload(xas)) + goto unref; + + if (!PageTransHuge(page)) + goto unref; + + if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page)) + goto unref; + + if (!trylock_page(page)) + goto unref; + + if (page->mapping != mapping || !PageUptodate(page)) + goto unlock; + + if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE)) + goto unlock; + + do_set_pmd(vmf, page); + unlock_page(page); + return true; +unlock: + unlock_page(page); +unref: + put_page(page); +nohuge: + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(vma->vm_mm); + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return true; + + return false; +} + void filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff) { - struct file *file = vmf->vma->vm_file; + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; unsigned long max_idx; @@ -2843,20 +2908,37 @@ void filemap_map_pages(struct vm_fault *vmf, unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { + head = xas_find(&xas, end_pgoff); + for (; ; head = xas_next_entry(&xas, end_pgoff)) { + if (!head) { + rcu_read_unlock(); + return; + } + if (likely(!xas_retry(&xas, head))) + break; + } + + if (filemap_map_pmd(vmf, head, &xas)) { + rcu_read_unlock(); + return; + } + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + + do { if (xas_retry(&xas, head)) continue; if (xa_is_value(head)) - goto next; - + continue; /* * Check for a locked page first, as a speculative * reference may adversely influence page migration. */ if (PageLocked(head)) - goto next; + continue; if (!page_cache_get_speculative(head)) - goto next; + continue; /* Has the page moved or been split? 
*/ if (unlikely(head != xas_reload(&xas))) @@ -2884,19 +2966,20 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) + if (!pte_none(*vmf->pte)) goto unlock; + + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + continue; unlock: unlock_page(head); skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. */ - if (pmd_trans_huge(*vmf->pmd)) - break; - } + } while ((head = xas_next_entry(&xas, end_pgoff)) != NULL); + pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); } diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..96d62774096a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See the comment in map_set_pte() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. - */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). 
- */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. - * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? 
*/ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); if (!pte_none(*vmf->pte)) ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); out: vmf->address = address; vmf->pte = NULL; @@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Tue, 22 Dec 2020, Kirill A. Shutemov wrote: > > Updated patch is below. > > From 0ec1bc1fe95587350ac4f4c866d6482383740b36 Mon Sep 17 00:00:00 2001 > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> > Date: Sat, 19 Dec 2020 15:19:23 +0300 > Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths > > alloc_set_pte() has two users with different requirements: in the > faultaround code, it called from an atomic context and PTE page table > has to be preallocated. finish_fault() can sleep and allocate page table > as needed. > > PTL locking rules are also strange, hard to follow and overkill for > finish_fault(). > > Let's untangle the mess. alloc_set_pte() has gone now. All locking is > explicit. > > The price is some code duplication to handle huge pages in faultaround > path, but it should be fine, having overall improvement in readability. > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> It's not ready yet. I won't pretend to have reviewed, but I did try applying and running with it: mostly it seems to work fine, but turned out to be leaking huge pages (with vmstat's thp_split_page_failed growing bigger and bigger as page reclaim cannot get rid of them). Aside from the actual bug, filemap_map_pmd() seems suboptimal at present: comments below (plus one comment in do_anonymous_page()). > diff --git a/mm/filemap.c b/mm/filemap.c > index 0b2067b3c328..f8fdbe079375 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) > } > EXPORT_SYMBOL(filemap_fault); > > +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page, > + struct xa_state *xas) > +{ > + struct vm_area_struct *vma = vmf->vma; > + struct address_space *mapping = vma->vm_file->f_mapping; > + > + /* Huge page is mapped? No need to proceed. */ > + if (pmd_trans_huge(*vmf->pmd)) > + return true; > + > + if (xa_is_value(page)) > + goto nohuge; I think it would be easier to follow if filemap_map_pages() never passed this an xa_is_value(page): probably just skip them in its initial xas_next_entry() loop. > + > + if (!pmd_none(*vmf->pmd)) > + goto nohuge; Then at nohuge it unconditionally takes pmd_lock(), finds !pmd_none, and unlocks again: unnecessary overhead I believe we did not have before. > + > + if (!PageTransHuge(page) || PageLocked(page)) > + goto nohuge; So if PageTransHuge, but someone else temporarily holds PageLocked, we insert a page table at nohuge, sadly preventing it from being mapped here later by huge pmd. > + > + if (!page_cache_get_speculative(page)) > + goto nohuge; > + > + if (page != xas_reload(xas)) > + goto unref; > + > + if (!PageTransHuge(page)) > + goto unref; > + > + if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page)) > + goto unref; > + > + if (!trylock_page(page)) > + goto unref; > + > + if (page->mapping != mapping || !PageUptodate(page)) > + goto unlock; > + > + if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE)) > + goto unlock; > + > + do_set_pmd(vmf, page); Here is the source of the huge page leak: do_set_pmd() can fail (and we would do better to have skipped most of its failure cases long before getting this far). 
It worked without leaking once I patched it: - do_set_pmd(vmf, page); - unlock_page(page); - return true; + if (do_set_pmd(vmf, page) == 0) { + unlock_page(page); + return true; + } > + unlock_page(page); > + return true; > +unlock: > + unlock_page(page); > +unref: > + put_page(page); > +nohuge: > + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); > + if (likely(pmd_none(*vmf->pmd))) { > + mm_inc_nr_ptes(vma->vm_mm); > + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); > + vmf->prealloc_pte = NULL; > + } > + spin_unlock(vmf->ptl); I think it's a bit weird to hide this page table insertion inside filemap_map_pmd() (I guess you're thinking that this function deals with pmd level, but I'd find it easier to have a filemap_map_huge() dealing with the huge mapping). Better to do it on return into filemap_map_pages(); maybe filemap_map_pmd() or filemap_map_huge() would then need to return vm_fault_t rather than bool, I didn't try. > + > + /* See comment in handle_pte_fault() */ > + if (pmd_devmap_trans_unstable(vmf->pmd)) > + return true; > + > + return false; > +} ... > diff --git a/mm/memory.c b/mm/memory.c > index c48f8df6e502..96d62774096a 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > if (pte_alloc(vma->vm_mm, vmf->pmd)) > return VM_FAULT_OOM; > > - /* See the comment in pte_alloc_one_map() */ > + /* See the comment in map_set_pte() */ No, no such function: probably should be like the others and say /* See comment in handle_pte_fault() */ Hugh
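As a minimal sketch of the first suggestion above, filemap_map_pages() could skip retry and shadow/value entries in its initial lookup, so that filemap_map_pmd() only ever sees a real page; Kirill's next revision below folds this, together with the speculative-reference stabilization, into a next_page() helper.

        /* Sketch only: find the first entry worth mapping before deciding
         * between the huge-pmd path and the pte path. */
        head = xas_find(&xas, end_pgoff);
        while (head && (xas_retry(&xas, head) || xa_is_value(head)))
                head = xas_next_entry(&xas, end_pgoff);
        if (!head) {
                rcu_read_unlock();
                return;
        }
        /* From here on, head is a plain page and filemap_map_pmd() does
         * not need an xa_is_value() escape hatch. */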
On Wed, Dec 23, 2020 at 08:04:54PM -0800, Hugh Dickins wrote: > It's not ready yet. Thanks for the feedback. It's very helpful. The patch below should address the problems you've found. The new helper next_page() returns a stablized page, so filemap_map_pmd() can clearly decide if we should set up a page table or a huge page. The leak is fixed too. Could you give it a try? From 0a6a3a1c920d04f55bcb202b585c85a4494b2ae0 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Date: Sat, 19 Dec 2020 15:19:23 +0300 Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> --- include/linux/mm.h | 8 +- include/linux/pgtable.h | 11 +++ mm/filemap.c | 147 ++++++++++++++++++++++++++---------- mm/memory.c | 162 ++++++++++++---------------------------- 4 files changed, 171 insertions(+), 157 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..2825153ad0d6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. 
+ */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..069b15778b4a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,50 +2832,117 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) +{ + struct mm_struct *mm = vmf->vma->vm_mm; + + /* Huge page is mapped? No need to proceed. */ + if (pmd_trans_huge(*vmf->pmd)) + return true; + + if (pmd_none(*vmf->pmd) && + PageTransHuge(page) && + do_set_pmd(vmf, page)) { + unlock_page(page); + return true; + } + + if (pmd_none(*vmf->pmd)) { + vmf->ptl = pmd_lock(mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(mm); + pmd_populate(mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) { + unlock_page(page); + put_page(page); + return true; + } + + return false; +} + +static struct page *next_page(struct page *page, struct vm_fault *vmf, + struct xa_state *xas, pgoff_t end_pgoff) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + unsigned long max_idx; + + if (!page) + page = xas_find(xas, end_pgoff); + else + page = xas_next_entry(xas, end_pgoff); + + do { + if (!page) + return NULL; + if (xas_retry(xas, page)) + continue; + if (xa_is_value(page)) + continue; + if (PageLocked(page)) + continue; + if (!page_cache_get_speculative(page)) + continue; + /* Has the page moved or been split? */ + if (unlikely(page != xas_reload(xas))) + goto skip; + if (!PageUptodate(page) || PageReadahead(page)) + goto skip; + if (PageHWPoison(page)) + goto skip; + if (!trylock_page(page)) + goto skip; + if (page->mapping != mapping) + goto unlock; + if (!PageUptodate(page)) + goto unlock; + max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); + if (xas->xa_index >= max_idx) + goto unlock; + return page; +unlock: + unlock_page(page); +skip: + put_page(page); + } while ((page = xas_next_entry(xas, end_pgoff)) != NULL); + + return NULL; +} + void filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff) { - struct file *file = vmf->vma->vm_file; + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; - unsigned long max_idx; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { - if (xas_retry(&xas, head)) - continue; - if (xa_is_value(head)) - goto next; + head = next_page(NULL, vmf, &xas, end_pgoff); + if (!head) { + rcu_read_unlock(); + return; + } - /* - * Check for a locked page first, as a speculative - * reference may adversely influence page migration. - */ - if (PageLocked(head)) - goto next; - if (!page_cache_get_speculative(head)) - goto next; + if (filemap_map_pmd(vmf, head)) { + rcu_read_unlock(); + return; + } - /* Has the page moved or been split? 
*/ - if (unlikely(head != xas_reload(&xas))) - goto skip; + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + do { page = find_subpage(head, xas.xa_index); - - if (!PageUptodate(head) || - PageReadahead(page) || - PageHWPoison(page)) - goto skip; - if (!trylock_page(head)) - goto skip; - - if (head->mapping != mapping || !PageUptodate(head)) - goto unlock; - - max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); - if (xas.xa_index >= max_idx) + if (PageHWPoison(page)) goto unlock; if (mmap_miss > 0) @@ -2884,19 +2952,20 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) + + if (!pte_none(*vmf->pte)) goto unlock; + + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + continue; unlock: unlock_page(head); -skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. */ - if (pmd_trans_huge(*vmf->pmd)) - break; - } + } while ((head = next_page(head, vmf, &xas, end_pgoff)) != NULL); + pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); } diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..98d340b45557 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See comment in handle_pte_fault() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. 
- */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). - */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. - * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? 
*/ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); if (!pte_none(*vmf->pte)) ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); out: vmf->address = address; vmf->pte = NULL; @@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Fri, Dec 25, 2020 at 3:31 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > The new helper next_page() returns a stablized page, so filemap_map_pmd() > can clearly decide if we should set up a page table or a huge page. I really like that next_page() abstraction, my only comment is that I think it should be renamed as "next_stable_page()" or something, and then this part: + if (!page) + page = xas_find(xas, end_pgoff); + else + page = xas_next_entry(xas, end_pgoff); should be in the caller. Then just have two helper functions like 'first_map_page()' and 'next_map_page()' which just do next_stable_page(xas_find(xas, end_pgoff)) and next_stable_page(xas_next_entry(xas, end_pgoff)) respectively. Because not only does that get rid of the "if (page)" test, I think it would make things a bit clearer. When I read the patch first, the initial "next_page()" call confused me. But maybe I'm just grasping at straws. Even in this format, I think it's a nice cleanup and makes for more understandable code. Linus
On Sat, Dec 26, 2020 at 09:57:13AM -0800, Linus Torvalds wrote: > Because not only does that get rid of the "if (page)" test, I think it > would make things a bit clearer. When I read the patch first, the > initial "next_page()" call confused me. Agreed. Here we go: From d12dea4abe94dbc24b7945329b191ad7d29e213a Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Date: Sat, 19 Dec 2020 15:19:23 +0300 Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> --- include/linux/mm.h | 8 +- include/linux/pgtable.h | 11 +++ mm/filemap.c | 157 ++++++++++++++++++++++++++++---------- mm/memory.c | 162 ++++++++++++---------------------------- 4 files changed, 181 insertions(+), 157 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..2825153ad0d6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. 
+ */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..3a92aaa59b9b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,50 +2832,127 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) +{ + struct mm_struct *mm = vmf->vma->vm_mm; + + /* Huge page is mapped? No need to proceed. */ + if (pmd_trans_huge(*vmf->pmd)) + return true; + + if (pmd_none(*vmf->pmd) && + PageTransHuge(page) && + do_set_pmd(vmf, page)) { + unlock_page(page); + return true; + } + + if (pmd_none(*vmf->pmd)) { + vmf->ptl = pmd_lock(mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(mm); + pmd_populate(mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) { + unlock_page(page); + put_page(page); + return true; + } + + return false; +} + +static struct page *next_stable_page(struct page *page, struct vm_fault *vmf, + struct xa_state *xas, pgoff_t end_pgoff) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + unsigned long max_idx; + + do { + if (!page) + return NULL; + if (xas_retry(xas, page)) + continue; + if (xa_is_value(page)) + continue; + if (PageLocked(page)) + continue; + if (!page_cache_get_speculative(page)) + continue; + /* Has the page moved or been split? 
*/ + if (unlikely(page != xas_reload(xas))) + goto skip; + if (!PageUptodate(page) || PageReadahead(page)) + goto skip; + if (PageHWPoison(page)) + goto skip; + if (!trylock_page(page)) + goto skip; + if (page->mapping != mapping) + goto unlock; + if (!PageUptodate(page)) + goto unlock; + max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); + if (xas->xa_index >= max_idx) + goto unlock; + return page; +unlock: + unlock_page(page); +skip: + put_page(page); + } while ((page = xas_next_entry(xas, end_pgoff)) != NULL); + + return NULL; +} + +static inline struct page *first_map_page(struct vm_fault *vmf, + struct xa_state *xas, + pgoff_t end_pgoff) +{ + return next_stable_page(xas_find(xas, end_pgoff), vmf, xas, end_pgoff); +} + +static inline struct page *next_map_page(struct vm_fault *vmf, + struct xa_state *xas, + pgoff_t end_pgoff) +{ + return next_stable_page(xas_next_entry(xas, end_pgoff), + vmf, xas, end_pgoff); +} + void filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff) { - struct file *file = vmf->vma->vm_file; + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; - unsigned long max_idx; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { - if (xas_retry(&xas, head)) - continue; - if (xa_is_value(head)) - goto next; + head = first_map_page(vmf, &xas, end_pgoff); + if (!head) { + rcu_read_unlock(); + return; + } - /* - * Check for a locked page first, as a speculative - * reference may adversely influence page migration. - */ - if (PageLocked(head)) - goto next; - if (!page_cache_get_speculative(head)) - goto next; + if (filemap_map_pmd(vmf, head)) { + rcu_read_unlock(); + return; + } - /* Has the page moved or been split? */ - if (unlikely(head != xas_reload(&xas))) - goto skip; + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + do { page = find_subpage(head, xas.xa_index); - - if (!PageUptodate(head) || - PageReadahead(page) || - PageHWPoison(page)) - goto skip; - if (!trylock_page(head)) - goto skip; - - if (head->mapping != mapping || !PageUptodate(head)) - goto unlock; - - max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); - if (xas.xa_index >= max_idx) + if (PageHWPoison(page)) goto unlock; if (mmap_miss > 0) @@ -2884,19 +2962,20 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) + + if (!pte_none(*vmf->pte)) goto unlock; + + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + continue; unlock: unlock_page(head); -skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. 
*/ - if (pmd_trans_huge(*vmf->pmd)) - break; - } + } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL); + pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); } diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..98d340b45557 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See comment in handle_pte_fault() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. - */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). - */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3769,45 +3709,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. 
If needed, the function allocates page table or use pre-allocated. - * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3730,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3749,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? */ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3763,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -3980,7 +3902,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); if (!pte_none(*vmf->pte)) ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); out: vmf->address = address; vmf->pte = NULL; @@ -4340,7 +4261,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. 
Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Sat, 26 Dec 2020, Kirill A. Shutemov wrote: > On Sat, Dec 26, 2020 at 09:57:13AM -0800, Linus Torvalds wrote: > > Because not only does that get rid of the "if (page)" test, I think it > > would make things a bit clearer. When I read the patch first, the > > initial "next_page()" call confused me. > > Agreed. Here we go: > > From d12dea4abe94dbc24b7945329b191ad7d29e213a Mon Sep 17 00:00:00 2001 > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> > Date: Sat, 19 Dec 2020 15:19:23 +0300 > Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths > > alloc_set_pte() has two users with different requirements: in the > faultaround code, it called from an atomic context and PTE page table > has to be preallocated. finish_fault() can sleep and allocate page table > as needed. > > PTL locking rules are also strange, hard to follow and overkill for > finish_fault(). > > Let's untangle the mess. alloc_set_pte() has gone now. All locking is > explicit. > > The price is some code duplication to handle huge pages in faultaround > path, but it should be fine, having overall improvement in readability. > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Hold on. I guess this one will suffer from the same bug as the previous. I was about to report back, after satisfactory overnight testing of that version - provided that one big little bug is fixed: --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2919,7 +2919,7 @@ static bool filemap_map_pmd(struct vm_fa if (pmd_none(*vmf->pmd) && PageTransHuge(page) && - do_set_pmd(vmf, page)) { + do_set_pmd(vmf, page) == 0) { unlock_page(page); return true; } (Yes, you can write that as !do_set_pmd(vmf, page), and maybe I'm odd, but even though it's very common, I have a personal aversion to using "!' on a positive-sounding function that returns 0 for success.) I'll give the new patch a try now, but with that fix added in. Without it, I got "Bad page" on compound_mapcount on file THP pages - but I run with a BUG() inside of bad_page() so I cannot miss them: I did not look to see what the eventual crash or page leak would look like without that. Hugh
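For reference, the convention Hugh is objecting to: do_set_pmd() returns a vm_fault_t, where 0 means the huge mapping was installed and a non-zero code (VM_FAULT_FALLBACK in the common case) means the caller has to fall back to PTEs. A bare truth test therefore inverts the meaning. A minimal sketch of the two spellings, using only the names from the hunk above:

	/* buggy: "truthy" means do_set_pmd() *failed* (e.g. VM_FAULT_FALLBACK),
	 * so a fallback is treated as a successfully mapped huge page */
	if (pmd_none(*vmf->pmd) && PageTransHuge(page) && do_set_pmd(vmf, page)) {
		unlock_page(page);
		return true;
	}

	/* fixed: only a return value of 0 means the huge page is now mapped */
	if (pmd_none(*vmf->pmd) && PageTransHuge(page) &&
	    do_set_pmd(vmf, page) == 0) {
		unlock_page(page);
		return true;
	}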
I was going to just apply this patch, because I like it so much, but then I decided to take one last look, and: On Sat, Dec 26, 2020 at 12:43 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) > +{ > + struct mm_struct *mm = vmf->vma->vm_mm; > + > + /* Huge page is mapped? No need to proceed. */ > + if (pmd_trans_huge(*vmf->pmd)) > + return true; doesn't this cause us to leak a locked page? I get the feeling that every single "return true" case here should always unlock the page and - with the exception of a successful do_set_pmd() - do a "put_page()". Which kind of argues that we should just do it in the caller (and get an extra ref in the do_set_pmd() case, so that the caller can always do if (filemap_map_pmd(..)) { unlock_page(page); put_page(page); rcu_read_unlock(); return; } andf then there are no odd cases inside that filemap_map_pmd() function. Hmm? Other than that, I really find it all much more legible. Of course, if I'm wrong about the above, that just proves that I'm missing something and it wasn't so legible after all.. Linus
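Spelled out, the caller-side shape Linus is suggesting would look roughly like the sketch below. This is only the alternative he floats here (Kirill's fixup later in the thread keeps the unlock/put inside filemap_map_pmd() instead), and it assumes filemap_map_pmd() takes an extra page reference in the successful do_set_pmd() case so the caller can always drop one:

	rcu_read_lock();
	head = first_map_page(vmf, &xas, end_pgoff);
	if (!head) {
		rcu_read_unlock();
		return;
	}

	/* Every "handled" return from filemap_map_pmd() leaves the page locked
	 * and with one reference held, so the cleanup lives in one place. */
	if (filemap_map_pmd(vmf, head)) {
		unlock_page(head);
		put_page(head);
		rcu_read_unlock();
		return;
	}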
On Sat, Dec 26, 2020 at 1:04 PM Hugh Dickins <hughd@google.com> wrote: > > > Hold on. I guess this one will suffer from the same bug as the previous. > I was about to report back, after satisfactory overnight testing of that > version - provided that one big little bug is fixed: > > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2919,7 +2919,7 @@ static bool filemap_map_pmd(struct vm_fa > > if (pmd_none(*vmf->pmd) && > PageTransHuge(page) && > - do_set_pmd(vmf, page)) { > + do_set_pmd(vmf, page) == 0) { > unlock_page(page); > return true; > } I missed that entirely, because when just reading the patch it looks fine and I didn't look at what do_set_pmd() function returns outside the patch. And maybe it would be better to write it as if (pmd_none(*vmf->pmd) && PageTransHuge(page)) { vm_fault_t ret = do_set_pmd(vmf, page); if (!ret) { ... instead to make it a bit more explicit about how that return value is a vm_fault_t there... And see my other email about how I suspect there is still a leak in that patch for the previous test-case. Linus
On Sat, Dec 26, 2020 at 11:43:35PM +0300, Kirill A. Shutemov wrote: > +static struct page *next_stable_page(struct page *page, struct vm_fault *vmf, > + struct xa_state *xas, pgoff_t end_pgoff) A "stable page" means one that doesn't currently have an outstanding write (see wait_for_stable_page()). It's really "Next trivially insertable page". I might call it "next_uptodate_page()" and gloss over the other reasons this might fail to return a page.
On Sat, Dec 26, 2020 at 01:16:09PM -0800, Linus Torvalds wrote: > On Sat, Dec 26, 2020 at 1:04 PM Hugh Dickins <hughd@google.com> wrote: > > > > > > Hold on. I guess this one will suffer from the same bug as the previous. > > I was about to report back, after satisfactory overnight testing of that > > version - provided that one big little bug is fixed: > > > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -2919,7 +2919,7 @@ static bool filemap_map_pmd(struct vm_fa > > > > if (pmd_none(*vmf->pmd) && > > PageTransHuge(page) && > > - do_set_pmd(vmf, page)) { > > + do_set_pmd(vmf, page) == 0) { > > unlock_page(page); > > return true; > > } > > I missed that entirely, because when just reading the patch it looks > fine and I didn't look at what do_set_pmd() function returns outside > the patch. > > And maybe it would be better to write it as > > if (pmd_none(*vmf->pmd) && PageTransHuge(page)) { > vm_fault_t ret = do_set_pmd(vmf, page); > if (!ret) { > ... > > instead to make it a bit more explicit about how that return value is > a vm_fault_t there... > > And see my other email about how I suspect there is still a leak in > that patch for the previous test-case. Ughh... Here's the fixup I have so far. It doesn't blow up immediately, but please take a closer look. Who knows what stupid mistake I did this time. :/ diff --git a/mm/filemap.c b/mm/filemap.c index 3a92aaa59b9b..c4b374678e7d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2837,16 +2837,21 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) struct mm_struct *mm = vmf->vma->vm_mm; /* Huge page is mapped? No need to proceed. */ - if (pmd_trans_huge(*vmf->pmd)) - return true; - - if (pmd_none(*vmf->pmd) && - PageTransHuge(page) && - do_set_pmd(vmf, page)) { + if (pmd_trans_huge(*vmf->pmd)) { unlock_page(page); + put_page(page); return true; } + if (pmd_none(*vmf->pmd) && PageTransHuge(page)) { + vm_fault_t ret = do_set_pmd(vmf, page); + if (!ret) { + /* The page is mapped successfully, reference consumed. */ + unlock_page(page); + return true; + } + } + if (pmd_none(*vmf->pmd)) { vmf->ptl = pmd_lock(mm, vmf->pmd); if (likely(pmd_none(*vmf->pmd))) { @@ -2867,7 +2872,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) return false; } -static struct page *next_stable_page(struct page *page, struct vm_fault *vmf, +static struct page *next_uptodate_page(struct page *page, struct vm_fault *vmf, struct xa_state *xas, pgoff_t end_pgoff) { struct address_space *mapping = vmf->vma->vm_file->f_mapping; @@ -2914,15 +2919,16 @@ static inline struct page *first_map_page(struct vm_fault *vmf, struct xa_state *xas, pgoff_t end_pgoff) { - return next_stable_page(xas_find(xas, end_pgoff), vmf, xas, end_pgoff); + return next_uptodate_page(xas_find(xas, end_pgoff), + vmf, xas, end_pgoff); } static inline struct page *next_map_page(struct vm_fault *vmf, struct xa_state *xas, pgoff_t end_pgoff) { - return next_stable_page(xas_next_entry(xas, end_pgoff), - vmf, xas, end_pgoff); + return next_uptodate_page(xas_next_entry(xas, end_pgoff), + vmf, xas, end_pgoff); } void filemap_map_pages(struct vm_fault *vmf,
On Sun, 27 Dec 2020, Kirill A. Shutemov wrote: > > Here's the fixup I have so far. It doesn't blow up immediately, but please > take a closer look. Who knows what stupid mistake I did this time. :/ It's been running fine on x86_64 for a couple of hours (but of course my testing is deficient, in not detecting the case Linus spotted). But I just thought I'd try it on i386 (hadn't tried previous versions) and this has a new disappointment: crashes when booting, in the "check if the page fault is solved" in do_fault_around(). I imagine a highmem issue with kmap of the pte address, but I'm reporting now before looking into it further (but verified that current linux.git i386 boots up fine). Maybe easily fixed: but does suggest this needs exposure in linux-next. Hugh
On Sat, 26 Dec 2020, Hugh Dickins wrote: > On Sun, 27 Dec 2020, Kirill A. Shutemov wrote: > > > > Here's the fixup I have so far. It doesn't blow up immediately, but please > > take a closer look. Who knows what stupid mistake I did this time. :/ > > It's been running fine on x86_64 for a couple of hours (but of course > my testing is deficient, in not detecting the case Linus spotted). > > But I just thought I'd try it on i386 (hadn't tried previous versions) > and this has a new disappointment: crashes when booting, in the "check > if the page fault is solved" in do_fault_around(). I imagine a highmem > issue with kmap of the pte address, but I'm reporting now before looking > into it further (but verified that current linux.git i386 boots up fine). This patch (like its antecedents) moves the pte_unmap_unlock() from after do_fault_around()'s "check if the page fault is solved" into filemap_map_pages() itself (which apparently does not NULLify vmf->pte after unmapping it, which is poor, but good for revealing this issue). That looks cleaner, but of course there was a very good reason for its original positioning. Maybe you want to change the ->map_pages prototype, to pass down the requested address too, so that it can report whether the requested address was resolved or not. Or it could be left to __do_fault(), or even to a repeated fault; but those would be less efficient. > > Maybe easily fixed: but does suggest this needs exposure in linux-next. > > Hugh
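Concretely, the prototype change Hugh is suggesting here is the one Kirill implements further down the thread: ->map_pages() is handed the originally requested address and returns a vm_fault_t, so it can report VM_FAULT_NOPAGE when it managed to map the faulting address, and do_fault_around() no longer has to re-walk the page table afterwards (and so no longer needs the pte map and lock to survive the call). Sketch, matching the later patch:

	/* include/linux/mm.h */
	vm_fault_t (*map_pages)(struct vm_fault *vmf, unsigned long address,
				pgoff_t start_pgoff, pgoff_t end_pgoff);

	/* mm/filemap.c, inside the per-page loop of filemap_map_pages() */
	if (vmf->address == address)
		ret = VM_FAULT_NOPAGE;	/* the requested address itself got mapped */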
On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins <hughd@google.com> wrote: > > This patch (like its antecedents) moves the pte_unmap_unlock() from > after do_fault_around()'s "check if the page fault is solved" into > filemap_map_pages() itself (which apparently does not NULLify vmf->pte > after unmapping it, which is poor, but good for revealing this issue). > That looks cleaner, but of course there was a very good reason for its > original positioning. Good catch. > Maybe you want to change the ->map_pages prototype, to pass down the > requested address too, so that it can report whether the requested > address was resolved or not. Or it could be left to __do_fault(), > or even to a repeated fault; but those would be less efficient. Let's keep the old really odd "let's unlock in the caller" for now, and minimize the changes. Adding a big big comment at the end of filemap_map_pages() to note the odd delayed page table unlocking. Here's an updated patch that combines Kirill's original patch, his additional incremental patch, and the fix for the pte lock oddity into one thing. Does this finally pass your testing? Linus
On Sun, 27. Dec 11:38, Linus Torvalds wrote: > On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins <hughd@google.com> wrote: > > > > This patch (like its antecedents) moves the pte_unmap_unlock() from > > after do_fault_around()'s "check if the page fault is solved" into > > filemap_map_pages() itself (which apparently does not NULLify vmf->pte > > after unmapping it, which is poor, but good for revealing this issue). > > That looks cleaner, but of course there was a very good reason for its > > original positioning. > > Good catch. > > > Maybe you want to change the ->map_pages prototype, to pass down the > > requested address too, so that it can report whether the requested > > address was resolved or not. Or it could be left to __do_fault(), > > or even to a repeated fault; but those would be less efficient. > > Let's keep the old really odd "let's unlock in the caller" for now, > and minimize the changes. > > Adding a big big comment at the end of filemap_map_pages() to note the > odd delayed page table unlocking. > > Here's an updated patch that combines Kirill's original patch, his > additional incremental patch, and the fix for the pte lock oddity into > one thing. > > Does this finally pass your testing? > > Linus Hello together, when i try to build this patch, i got the following error: CC arch/x86/kernel/cpu/mce/threshold.o mm/memory.c:3716:19: error: static declaration of ‘do_set_pmd’ follows non-static declaration static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) ^~~~~~~~~~ In file included from mm/memory.c:43: ./include/linux/mm.h:984:12: note: previous declaration of ‘do_set_pmd’ was here vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); ^~~~~~~~~~ make[3]: *** [scripts/Makefile.build:279: mm/memory.o] Error 1 make[2]: *** [Makefile:1805: mm] Error 2 make[2]: *** Waiting for unfinished jobs.... CC arch/x86/kernel/cpu/mce/therm_throt.o Best regards Damian > From 4d221d934d112aa40c3f4978460be098fc9ce831 Mon Sep 17 00:00:00 2001 > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> > Date: Sat, 19 Dec 2020 15:19:23 +0300 > Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths > > alloc_set_pte() has two users with different requirements: in the > faultaround code, it called from an atomic context and PTE page table > has to be preallocated. finish_fault() can sleep and allocate page table > as needed. > > PTL locking rules are also strange, hard to follow and overkill for > finish_fault(). > > Let's untangle the mess. alloc_set_pte() has gone now. All locking is > explicit. > > The price is some code duplication to handle huge pages in faultaround > path, but it should be fine, having overall improvement in readability. > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > --- > include/linux/mm.h | 8 +- > include/linux/pgtable.h | 11 +++ > mm/filemap.c | 168 ++++++++++++++++++++++++++++++---------- > mm/memory.c | 161 +++++++++++--------------------------- > 4 files changed, 192 insertions(+), 156 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 5299b90a6c40..c0643a0ad5ff 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -535,8 +535,8 @@ struct vm_fault { > * is not NULL, otherwise pmd. > */ > pgtable_t prealloc_pte; /* Pre-allocated pte page table. > - * vm_ops->map_pages() calls > - * alloc_set_pte() from atomic context. > + * vm_ops->map_pages() sets up a page > + * table from from atomic context. 
> * do_fault_around() pre-allocates > * page table to avoid allocation from > * atomic context. > @@ -981,7 +981,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) > return pte; > } > > -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); > +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); > +void do_set_pte(struct vm_fault *vmf, struct page *page); > + > vm_fault_t finish_fault(struct vm_fault *vmf); > vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); > #endif > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 8fcdfa52eb4b..36eb748f3c97 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -1314,6 +1314,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) > #endif > } > > +/* > + * the ordering of these checks is important for pmds with _page_devmap set. > + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check > + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly > + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. > + */ > +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) > +{ > + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); > +} > + > #ifndef CONFIG_NUMA_BALANCING > /* > * Technically a PTE can be PROTNONE even when not doing NUMA balancing but > diff --git a/mm/filemap.c b/mm/filemap.c > index 5c9d564317a5..dbc2eda92a53 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -42,6 +42,7 @@ > #include <linux/psi.h> > #include <linux/ramfs.h> > #include <linux/page_idle.h> > +#include <asm/pgalloc.h> > #include "internal.h" > > #define CREATE_TRACE_POINTS > @@ -2911,50 +2912,133 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) > } > EXPORT_SYMBOL(filemap_fault); > > +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) > +{ > + struct mm_struct *mm = vmf->vma->vm_mm; > + > + /* Huge page is mapped? No need to proceed. */ > + if (pmd_trans_huge(*vmf->pmd)) { > + unlock_page(page); > + put_page(page); > + return true; > + } > + > + if (pmd_none(*vmf->pmd) && PageTransHuge(page)) { > + vm_fault_t ret = do_set_pmd(vmf, page); > + if (!ret) { > + /* The page is mapped successfully, reference consumed. */ > + unlock_page(page); > + return true; > + } > + } > + > + if (pmd_none(*vmf->pmd)) { > + vmf->ptl = pmd_lock(mm, vmf->pmd); > + if (likely(pmd_none(*vmf->pmd))) { > + mm_inc_nr_ptes(mm); > + pmd_populate(mm, vmf->pmd, vmf->prealloc_pte); > + vmf->prealloc_pte = NULL; > + } > + spin_unlock(vmf->ptl); > + } > + > + /* See comment in handle_pte_fault() */ > + if (pmd_devmap_trans_unstable(vmf->pmd)) { > + unlock_page(page); > + put_page(page); > + return true; > + } > + > + return false; > +} > + > +static struct page *next_uptodate_page(struct page *page, struct vm_fault *vmf, > + struct xa_state *xas, pgoff_t end_pgoff) > +{ > + struct address_space *mapping = vmf->vma->vm_file->f_mapping; > + unsigned long max_idx; > + > + do { > + if (!page) > + return NULL; > + if (xas_retry(xas, page)) > + continue; > + if (xa_is_value(page)) > + continue; > + if (PageLocked(page)) > + continue; > + if (!page_cache_get_speculative(page)) > + continue; > + /* Has the page moved or been split? 
*/ > + if (unlikely(page != xas_reload(xas))) > + goto skip; > + if (!PageUptodate(page) || PageReadahead(page)) > + goto skip; > + if (PageHWPoison(page)) > + goto skip; > + if (!trylock_page(page)) > + goto skip; > + if (page->mapping != mapping) > + goto unlock; > + if (!PageUptodate(page)) > + goto unlock; > + max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); > + if (xas->xa_index >= max_idx) > + goto unlock; > + return page; > +unlock: > + unlock_page(page); > +skip: > + put_page(page); > + } while ((page = xas_next_entry(xas, end_pgoff)) != NULL); > + > + return NULL; > +} > + > +static inline struct page *first_map_page(struct vm_fault *vmf, > + struct xa_state *xas, > + pgoff_t end_pgoff) > +{ > + return next_uptodate_page(xas_find(xas, end_pgoff), > + vmf, xas, end_pgoff); > +} > + > +static inline struct page *next_map_page(struct vm_fault *vmf, > + struct xa_state *xas, > + pgoff_t end_pgoff) > +{ > + return next_uptodate_page(xas_next_entry(xas, end_pgoff), > + vmf, xas, end_pgoff); > +} > + > void filemap_map_pages(struct vm_fault *vmf, > pgoff_t start_pgoff, pgoff_t end_pgoff) > { > - struct file *file = vmf->vma->vm_file; > + struct vm_area_struct *vma = vmf->vma; > + struct file *file = vma->vm_file; > struct address_space *mapping = file->f_mapping; > pgoff_t last_pgoff = start_pgoff; > - unsigned long max_idx; > XA_STATE(xas, &mapping->i_pages, start_pgoff); > struct page *head, *page; > unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); > > rcu_read_lock(); > - xas_for_each(&xas, head, end_pgoff) { > - if (xas_retry(&xas, head)) > - continue; > - if (xa_is_value(head)) > - goto next; > + head = first_map_page(vmf, &xas, end_pgoff); > + if (!head) { > + rcu_read_unlock(); > + return; > + } > > - /* > - * Check for a locked page first, as a speculative > - * reference may adversely influence page migration. > - */ > - if (PageLocked(head)) > - goto next; > - if (!page_cache_get_speculative(head)) > - goto next; > + if (filemap_map_pmd(vmf, head)) { > + rcu_read_unlock(); > + return; > + } > > - /* Has the page moved or been split? */ > - if (unlikely(head != xas_reload(&xas))) > - goto skip; > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, > + vmf->address, &vmf->ptl); > + do { > page = find_subpage(head, xas.xa_index); > - > - if (!PageUptodate(head) || > - PageReadahead(page) || > - PageHWPoison(page)) > - goto skip; > - if (!trylock_page(head)) > - goto skip; > - > - if (head->mapping != mapping || !PageUptodate(head)) > - goto unlock; > - > - max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); > - if (xas.xa_index >= max_idx) > + if (PageHWPoison(page)) > goto unlock; > > if (mmap_miss > 0) > @@ -2964,19 +3048,25 @@ void filemap_map_pages(struct vm_fault *vmf, > if (vmf->pte) > vmf->pte += xas.xa_index - last_pgoff; > last_pgoff = xas.xa_index; > - if (alloc_set_pte(vmf, page)) > + > + if (!pte_none(*vmf->pte)) > goto unlock; > + > + do_set_pte(vmf, page); > + /* no need to invalidate: a not-present page won't be cached */ > + update_mmu_cache(vma, vmf->address, vmf->pte); > unlock_page(head); > - goto next; > + continue; > unlock: > unlock_page(head); > -skip: > put_page(head); > -next: > - /* Huge page is mapped? No need to proceed. */ > - if (pmd_trans_huge(*vmf->pmd)) > - break; > - } > + } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL); > + > + /* > + * NOTE! We return with the pte still locked! It is unlocked > + * by do_fault_around() after it has tested whether the target > + * address got filled in. 
> + */ > rcu_read_unlock(); > WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); > } > diff --git a/mm/memory.c b/mm/memory.c > index 7d608765932b..07a408c7d38b 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3501,7 +3501,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > if (pte_alloc(vma->vm_mm, vmf->pmd)) > return VM_FAULT_OOM; > > - /* See the comment in pte_alloc_one_map() */ > + /* See comment in handle_pte_fault() */ > if (unlikely(pmd_trans_unstable(vmf->pmd))) > return 0; > > @@ -3641,66 +3641,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) > return ret; > } > > -/* > - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. > - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check > - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly > - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. > - */ > -static int pmd_devmap_trans_unstable(pmd_t *pmd) > -{ > - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); > -} > - > -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) > -{ > - struct vm_area_struct *vma = vmf->vma; > - > - if (!pmd_none(*vmf->pmd)) > - goto map_pte; > - if (vmf->prealloc_pte) { > - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); > - if (unlikely(!pmd_none(*vmf->pmd))) { > - spin_unlock(vmf->ptl); > - goto map_pte; > - } > - > - mm_inc_nr_ptes(vma->vm_mm); > - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); > - spin_unlock(vmf->ptl); > - vmf->prealloc_pte = NULL; > - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { > - return VM_FAULT_OOM; > - } > -map_pte: > - /* > - * If a huge pmd materialized under us just retry later. Use > - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of > - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge > - * under us and then back to pmd_none, as a result of MADV_DONTNEED > - * running immediately after a huge pmd fault in a different thread of > - * this mm, in turn leading to a misleading pmd_trans_huge() retval. > - * All we have to ensure is that it is a regular pmd that we can walk > - * with pte_offset_map() and we can do that through an atomic read in > - * C, which is what pmd_trans_unstable() provides. > - */ > - if (pmd_devmap_trans_unstable(vmf->pmd)) > - return VM_FAULT_NOPAGE; > - > - /* > - * At this point we know that our vmf->pmd points to a page of ptes > - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() > - * for the duration of the fault. If a racing MADV_DONTNEED runs and > - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still > - * be valid and we will re-check to make sure the vmf->pte isn't > - * pte_none() under vmf->ptl protection when we return to > - * alloc_set_pte(). 
> - */ > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, > - &vmf->ptl); > - return 0; > -} > - > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > static void deposit_prealloc_pte(struct vm_fault *vmf) > { > @@ -3715,7 +3655,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) > vmf->prealloc_pte = NULL; > } > > -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) > +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) > { > struct vm_area_struct *vma = vmf->vma; > bool write = vmf->flags & FAULT_FLAG_WRITE; > @@ -3780,45 +3720,11 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) > } > #endif > > -/** > - * alloc_set_pte - setup new PTE entry for given page and add reverse page > - * mapping. If needed, the function allocates page table or use pre-allocated. > - * > - * @vmf: fault environment > - * @page: page to map > - * > - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on > - * return. > - * > - * Target users are page handler itself and implementations of > - * vm_ops->map_pages. > - * > - * Return: %0 on success, %VM_FAULT_ code in case of error. > - */ > -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) > +void do_set_pte(struct vm_fault *vmf, struct page *page) > { > struct vm_area_struct *vma = vmf->vma; > bool write = vmf->flags & FAULT_FLAG_WRITE; > pte_t entry; > - vm_fault_t ret; > - > - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { > - ret = do_set_pmd(vmf, page); > - if (ret != VM_FAULT_FALLBACK) > - return ret; > - } > - > - if (!vmf->pte) { > - ret = pte_alloc_one_map(vmf); > - if (ret) > - return ret; > - } > - > - /* Re-check under ptl */ > - if (unlikely(!pte_none(*vmf->pte))) { > - update_mmu_tlb(vma, vmf->address, vmf->pte); > - return VM_FAULT_NOPAGE; > - } > > flush_icache_page(vma, page); > entry = mk_pte(page, vma->vm_page_prot); > @@ -3835,14 +3741,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) > page_add_file_rmap(page, false); > } > set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); > - > - /* no need to invalidate: a not-present page won't be cached */ > - update_mmu_cache(vma, vmf->address, vmf->pte); > - > - return 0; > } > > - > /** > * finish_fault - finish page fault once we have prepared the page to fault > * > @@ -3860,12 +3760,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) > */ > vm_fault_t finish_fault(struct vm_fault *vmf) > { > + struct vm_area_struct *vma = vmf->vma; > struct page *page; > - vm_fault_t ret = 0; > + vm_fault_t ret; > > /* Did we COW the page? 
*/ > - if ((vmf->flags & FAULT_FLAG_WRITE) && > - !(vmf->vma->vm_flags & VM_SHARED)) > + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) > page = vmf->cow_page; > else > page = vmf->page; > @@ -3874,13 +3774,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf) > * check even for read faults because we might have lost our CoWed > * page > */ > - if (!(vmf->vma->vm_flags & VM_SHARED)) > - ret = check_stable_address_space(vmf->vma->vm_mm); > - if (!ret) > - ret = alloc_set_pte(vmf, page); > - if (vmf->pte) > - pte_unmap_unlock(vmf->pte, vmf->ptl); > - return ret; > + if (!(vma->vm_flags & VM_SHARED)) > + ret = check_stable_address_space(vma->vm_mm); > + if (ret) > + return ret; > + > + if (pmd_none(*vmf->pmd)) { > + if (PageTransCompound(page)) { > + ret = do_set_pmd(vmf, page); > + if (ret != VM_FAULT_FALLBACK) > + return ret; > + } > + > + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) > + return VM_FAULT_OOM; > + } > + > + /* See comment in handle_pte_fault() */ > + if (pmd_devmap_trans_unstable(vmf->pmd)) > + return 0; > + > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, > + vmf->address, &vmf->ptl); > + /* Re-check under ptl */ > + if (likely(pte_none(*vmf->pte))) > + do_set_pte(vmf, page); > + > + update_mmu_tlb(vma, vmf->address, vmf->pte); > + pte_unmap_unlock(vmf->pte, vmf->ptl); > + return 0; > } > > static unsigned long fault_around_bytes __read_mostly = > @@ -4351,7 +4273,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) > */ > vmf->pte = NULL; > } else { > - /* See comment in pte_alloc_one_map() */ > + /* > + * If a huge pmd materialized under us just retry later. Use > + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead > + * of pmd_trans_huge() to ensure the pmd didn't become > + * pmd_trans_huge under us and then back to pmd_none, as a > + * result of MADV_DONTNEED running immediately after a huge pmd > + * fault in a different thread of this mm, in turn leading to a > + * misleading pmd_trans_huge() retval. All we have to ensure is > + * that it is a regular pmd that we can walk with > + * pte_offset_map() and we can do that through an atomic read > + * in C, which is what pmd_trans_unstable() provides. > + */ > if (pmd_devmap_trans_unstable(vmf->pmd)) > return 0; > /* > -- > 2.29.2.157.g1d47791a39 >
On Sun, 27 Dec 2020, Damian Tometzki wrote: > On Sun, 27. Dec 11:38, Linus Torvalds wrote: > > On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins <hughd@google.com> wrote: > > > > > > This patch (like its antecedents) moves the pte_unmap_unlock() from > > > after do_fault_around()'s "check if the page fault is solved" into > > > filemap_map_pages() itself (which apparently does not NULLify vmf->pte > > > after unmapping it, which is poor, but good for revealing this issue). > > > That looks cleaner, but of course there was a very good reason for its > > > original positioning. > > > > Good catch. > > > > > Maybe you want to change the ->map_pages prototype, to pass down the > > > requested address too, so that it can report whether the requested > > > address was resolved or not. Or it could be left to __do_fault(), > > > or even to a repeated fault; but those would be less efficient. > > > > Let's keep the old really odd "let's unlock in the caller" for now, > > and minimize the changes. > > > > Adding a big big comment at the end of filemap_map_pages() to note the > > odd delayed page table unlocking. > > > > Here's an updated patch that combines Kirill's original patch, his > > additional incremental patch, and the fix for the pte lock oddity into > > one thing. > > > > Does this finally pass your testing? Yes, this one passes my testing on x86_64 and on i386. But... > > > > Linus > Hello together, > > when i try to build this patch, i got the following error: > > CC arch/x86/kernel/cpu/mce/threshold.o > mm/memory.c:3716:19: error: static declaration of ‘do_set_pmd’ follows non-static declaration > static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) > ^~~~~~~~~~ > In file included from mm/memory.c:43: > ./include/linux/mm.h:984:12: note: previous declaration of ‘do_set_pmd’ was here > vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); > ^~~~~~~~~~ > make[3]: *** [scripts/Makefile.build:279: mm/memory.o] Error 1 > make[2]: *** [Makefile:1805: mm] Error 2 > make[2]: *** Waiting for unfinished jobs.... > CC arch/x86/kernel/cpu/mce/therm_throt.o ... Damian very helpfully reports that it does not build when CONFIG_TRANSPARENT_HUGEPAGE is not set, since the "static " has not been removed from the alternative definition of do_set_pmd(). And its BUILD_BUG() becomes invalid once it's globally available. You don't like unnecessary BUG()s, and I don't like returning success there: VM_FAULT_FALLBACK seems best. --- a/mm/memory.c +++ b/mm/memory.c @@ -3713,10 +3713,9 @@ out: return ret; } #else -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { - BUILD_BUG(); - return 0; + return VM_FAULT_FALLBACK; } #endif (I'm also a wee bit worried by filemap.c's +#include <asm/pgalloc.h>: that's the kind of thing that might turn out not to work on some arch.) Hugh
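For completeness, the pieces that have to line up for Damian's configuration: do_set_pmd() is now declared unconditionally in <linux/mm.h> (that is the "previous declaration" in the build error), so the !CONFIG_TRANSPARENT_HUGEPAGE definition must lose its "static" and, per Hugh's fixup, return VM_FAULT_FALLBACK rather than hitting BUILD_BUG():

	/* include/linux/mm.h */
	vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);

	/* mm/memory.c, #else branch of CONFIG_TRANSPARENT_HUGEPAGE */
	vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
	{
		/* No THP support: tell callers to fall back to the PTE path. */
		return VM_FAULT_FALLBACK;
	}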
On Sun, Dec 27, 2020 at 2:35 PM Hugh Dickins <hughd@google.com> wrote: > > Yes, this one passes my testing on x86_64 and on i386. Ok, good. > But... > > ... Damian very helpfully reports that it does not build when > CONFIG_TRANSPARENT_HUGEPAGE is not set, since the "static " has > not been removed from the alternative definition of do_set_pmd(). Ok, your fix for that folded in, and here's yet another version. Damian, hopefully your configuration builds and works too now. But obviously it's too late for this merge window that I'm just about to close, so I hope Andrew can pick this up - I hope it's now in good enough shape to go into -mm and then we can do it next merge window or something. Andrew? Linus
On Sun, Dec 27, 2020 at 3:12 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ok, your fix for that folded in, and here's yet another version. Still not good. I don't know what happened, but the change of - vm_fault_t ret = 0; + vm_fault_t ret; is very very wrong. The next user is + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; so now 'ret' will potentially be used uninitialized (although this is the kind of thing that a compiler might almost accidentally end up fixing - with a single dominating assignment, I could imagine the compiler moving the test to that assignment and thus "fixing" the code without really even meaning to). I think Kirill was intending to move the "if (ret)" up into the path that sets it, IOW something like + if (!(vma->vm_flags & VM_SHARED)) { + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + } instead. But that patch as-is is broken. Kirill? Linus
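Put side by side, the problem and the fix Linus is pointing at (a sketch of finish_fault()'s opening; only the !VM_SHARED branch ever assigns 'ret', so the unconditional test can read an uninitialized value):

	vm_fault_t ret;

	/* broken: 'ret' is only written on the !VM_SHARED path */
	if (!(vma->vm_flags & VM_SHARED))
		ret = check_stable_address_space(vma->vm_mm);
	if (ret)			/* may test an uninitialized vm_fault_t */
		return ret;

	/* fixed: test it on the same path that assigns it */
	if (!(vma->vm_flags & VM_SHARED)) {
		ret = check_stable_address_space(vma->vm_mm);
		if (ret)
			return ret;
	}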
On Sun, Dec 27, 2020 at 11:38:22AM -0800, Linus Torvalds wrote: > On Sat, Dec 26, 2020 at 6:38 PM Hugh Dickins <hughd@google.com> wrote: > > > > This patch (like its antecedents) moves the pte_unmap_unlock() from > > after do_fault_around()'s "check if the page fault is solved" into > > filemap_map_pages() itself (which apparently does not NULLify vmf->pte > > after unmapping it, which is poor, but good for revealing this issue). > > That looks cleaner, but of course there was a very good reason for its > > original positioning. > > Good catch. > > > Maybe you want to change the ->map_pages prototype, to pass down the > > requested address too, so that it can report whether the requested > > address was resolved or not. Or it could be left to __do_fault(), > > or even to a repeated fault; but those would be less efficient. > > Let's keep the old really odd "let's unlock in the caller" for now, > and minimize the changes. I did what Hugh proposed and it got clear to my eyes. It gets somewhat large, but take a look. The patch below incorporates all fixups from the thread. From 3496fc6bde3f6ed0c98d811dda02465bf86fdb90 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Date: Sat, 19 Dec 2020 15:19:23 +0300 Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> --- fs/xfs/xfs_file.c | 7 +- include/linux/mm.h | 12 +-- include/linux/pgtable.h | 11 +++ mm/filemap.c | 172 +++++++++++++++++++++++++++--------- mm/memory.c | 191 +++++++++++----------------------------- 5 files changed, 206 insertions(+), 187 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5b0f93f73837..040fa64291ed 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1319,17 +1319,20 @@ xfs_filemap_pfn_mkwrite( return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); } -static void +static vm_fault_t xfs_filemap_map_pages( struct vm_fault *vmf, + unsigned long address, pgoff_t start_pgoff, pgoff_t end_pgoff) { struct inode *inode = file_inode(vmf->vma->vm_file); + vm_fault_t ret; xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - filemap_map_pages(vmf, start_pgoff, end_pgoff); + ret = filemap_map_pages(vmf, address, start_pgoff, end_pgoff); xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + return ret; } static const struct vm_operations_struct xfs_file_vm_ops = { diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..4e1d2945a71a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. 
@@ -562,7 +562,7 @@ struct vm_operations_struct { vm_fault_t (*fault)(struct vm_fault *vmf); vm_fault_t (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size); - void (*map_pages)(struct vm_fault *vmf, + vm_fault_t (*map_pages)(struct vm_fault *vmf, unsigned long address, pgoff_t start_pgoff, pgoff_t end_pgoff); unsigned long (*pagesize)(struct vm_area_struct * area); @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif @@ -2621,7 +2623,7 @@ extern void truncate_inode_pages_final(struct address_space *); /* generic vm_area_ops exported for stackable file systems */ extern vm_fault_t filemap_fault(struct vm_fault *vmf); -extern void filemap_map_pages(struct vm_fault *vmf, +extern vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address, pgoff_t start_pgoff, pgoff_t end_pgoff); extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. + */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..c5da09f3f363 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,50 +2832,134 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); -void filemap_map_pages(struct vm_fault *vmf, +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) +{ + struct mm_struct *mm = vmf->vma->vm_mm; + + /* Huge page is mapped? No need to proceed. */ + if (pmd_trans_huge(*vmf->pmd)) { + unlock_page(page); + put_page(page); + return true; + } + + if (pmd_none(*vmf->pmd) && PageTransHuge(page)) { + vm_fault_t ret = do_set_pmd(vmf, page); + if (!ret) { + /* The page is mapped successfully, reference consumed. 
*/ + unlock_page(page); + return true; + } + } + + if (pmd_none(*vmf->pmd)) { + vmf->ptl = pmd_lock(mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(mm); + pmd_populate(mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) { + unlock_page(page); + put_page(page); + return true; + } + + return false; +} + +static struct page *next_uptodate_page(struct page *page, struct vm_fault *vmf, + struct xa_state *xas, pgoff_t end_pgoff) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + unsigned long max_idx; + + do { + if (!page) + return NULL; + if (xas_retry(xas, page)) + continue; + if (xa_is_value(page)) + continue; + if (PageLocked(page)) + continue; + if (!page_cache_get_speculative(page)) + continue; + /* Has the page moved or been split? */ + if (unlikely(page != xas_reload(xas))) + goto skip; + if (!PageUptodate(page) || PageReadahead(page)) + goto skip; + if (PageHWPoison(page)) + goto skip; + if (!trylock_page(page)) + goto skip; + if (page->mapping != mapping) + goto unlock; + if (!PageUptodate(page)) + goto unlock; + max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); + if (xas->xa_index >= max_idx) + goto unlock; + return page; +unlock: + unlock_page(page); +skip: + put_page(page); + } while ((page = xas_next_entry(xas, end_pgoff)) != NULL); + + return NULL; +} + +static inline struct page *first_map_page(struct vm_fault *vmf, + struct xa_state *xas, + pgoff_t end_pgoff) +{ + return next_uptodate_page(xas_find(xas, end_pgoff), + vmf, xas, end_pgoff); +} + +static inline struct page *next_map_page(struct vm_fault *vmf, + struct xa_state *xas, + pgoff_t end_pgoff) +{ + return next_uptodate_page(xas_next_entry(xas, end_pgoff), + vmf, xas, end_pgoff); +} + +vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address, pgoff_t start_pgoff, pgoff_t end_pgoff) { - struct file *file = vmf->vma->vm_file; + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; - unsigned long max_idx; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); + vm_fault_t ret = 0; rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { - if (xas_retry(&xas, head)) - continue; - if (xa_is_value(head)) - goto next; + head = first_map_page(vmf, &xas, end_pgoff); + if (!head) { + rcu_read_unlock(); + return 0; + } - /* - * Check for a locked page first, as a speculative - * reference may adversely influence page migration. - */ - if (PageLocked(head)) - goto next; - if (!page_cache_get_speculative(head)) - goto next; + if (filemap_map_pmd(vmf, head)) { + rcu_read_unlock(); + return VM_FAULT_NOPAGE; + } - /* Has the page moved or been split? 
*/ - if (unlikely(head != xas_reload(&xas))) - goto skip; + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + do { page = find_subpage(head, xas.xa_index); - - if (!PageUptodate(head) || - PageReadahead(page) || - PageHWPoison(page)) - goto skip; - if (!trylock_page(head)) - goto skip; - - if (head->mapping != mapping || !PageUptodate(head)) - goto unlock; - - max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); - if (xas.xa_index >= max_idx) + if (PageHWPoison(page)) goto unlock; if (mmap_miss > 0) @@ -2884,21 +2969,28 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) + + if (!pte_none(*vmf->pte)) goto unlock; + + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + + /* The fault is handled */ + if (vmf->address == address) + ret = VM_FAULT_NOPAGE; + continue; unlock: unlock_page(head); -skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. */ - if (pmd_trans_huge(*vmf->pmd)) - break; - } + } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL); + pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); + + return ret; } EXPORT_SYMBOL(filemap_map_pages); diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..247abfcd60d8 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See comment in handle_pte_fault() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. 
- */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). - */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3762,52 +3702,17 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) return ret; } #else -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { - BUILD_BUG(); - return 0; + return VM_FAULT_FALLBACK; } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. - * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3729,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3748,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? 
*/ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3762,35 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -3938,7 +3859,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) pgoff_t start_pgoff = vmf->pgoff; pgoff_t end_pgoff; int off; - vm_fault_t ret = 0; nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT; mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK; @@ -3960,31 +3880,11 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) if (pmd_none(*vmf->pmd)) { vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm); if (!vmf->prealloc_pte) - goto out; + return VM_FAULT_OOM; smp_wmb(); /* See comment in __pte_alloc() */ } - vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff); - - /* Huge page is mapped? Page fault is solved */ - if (pmd_trans_huge(*vmf->pmd)) { - ret = VM_FAULT_NOPAGE; - goto out; - } - - /* ->map_pages() haven't done anything useful. Cold page cache? */ - if (!vmf->pte) - goto out; - - /* check if the page fault is solved */ - vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); - if (!pte_none(*vmf->pte)) - ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); -out: - vmf->address = address; - vmf->pte = NULL; - return ret; + return vmf->vma->vm_ops->map_pages(vmf, address, start_pgoff, end_pgoff); } static vm_fault_t do_read_fault(struct vm_fault *vmf) @@ -4340,7 +4240,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Sun, Dec 27, 2020 at 03:40:36PM -0800, Linus Torvalds wrote: > I think Kirill was intending to move the "if (ret)" up into the path > that sets it, IOW something like > > + if (!(vma->vm_flags & VM_SHARED)) { > + ret = check_stable_address_space(vma->vm_mm); > + if (ret) > + return ret; > + } > > instead. But that patch as-is is broken. > > Kirill? Right. Once again, patch is below. From d50eff470de025f602711ef546aa87d0da1727f5 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Date: Sat, 19 Dec 2020 15:19:23 +0300 Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> --- fs/xfs/xfs_file.c | 7 +- include/linux/mm.h | 12 +-- include/linux/pgtable.h | 11 +++ mm/filemap.c | 172 ++++++++++++++++++++++++++--------- mm/memory.c | 192 +++++++++++----------------------------- 5 files changed, 207 insertions(+), 187 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5b0f93f73837..040fa64291ed 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1319,17 +1319,20 @@ xfs_filemap_pfn_mkwrite( return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); } -static void +static vm_fault_t xfs_filemap_map_pages( struct vm_fault *vmf, + unsigned long address, pgoff_t start_pgoff, pgoff_t end_pgoff) { struct inode *inode = file_inode(vmf->vma->vm_file); + vm_fault_t ret; xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - filemap_map_pages(vmf, start_pgoff, end_pgoff); + ret = filemap_map_pages(vmf, address, start_pgoff, end_pgoff); xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + return ret; } static const struct vm_operations_struct xfs_file_vm_ops = { diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..4e1d2945a71a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. 
@@ -562,7 +562,7 @@ struct vm_operations_struct { vm_fault_t (*fault)(struct vm_fault *vmf); vm_fault_t (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size); - void (*map_pages)(struct vm_fault *vmf, + vm_fault_t (*map_pages)(struct vm_fault *vmf, unsigned long address, pgoff_t start_pgoff, pgoff_t end_pgoff); unsigned long (*pagesize)(struct vm_area_struct * area); @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif @@ -2621,7 +2623,7 @@ extern void truncate_inode_pages_final(struct address_space *); /* generic vm_area_ops exported for stackable file systems */ extern vm_fault_t filemap_fault(struct vm_fault *vmf); -extern void filemap_map_pages(struct vm_fault *vmf, +extern vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address, pgoff_t start_pgoff, pgoff_t end_pgoff); extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. + */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..c5da09f3f363 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,50 +2832,134 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); -void filemap_map_pages(struct vm_fault *vmf, +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) +{ + struct mm_struct *mm = vmf->vma->vm_mm; + + /* Huge page is mapped? No need to proceed. */ + if (pmd_trans_huge(*vmf->pmd)) { + unlock_page(page); + put_page(page); + return true; + } + + if (pmd_none(*vmf->pmd) && PageTransHuge(page)) { + vm_fault_t ret = do_set_pmd(vmf, page); + if (!ret) { + /* The page is mapped successfully, reference consumed. 
*/ + unlock_page(page); + return true; + } + } + + if (pmd_none(*vmf->pmd)) { + vmf->ptl = pmd_lock(mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(mm); + pmd_populate(mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) { + unlock_page(page); + put_page(page); + return true; + } + + return false; +} + +static struct page *next_uptodate_page(struct page *page, struct vm_fault *vmf, + struct xa_state *xas, pgoff_t end_pgoff) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + unsigned long max_idx; + + do { + if (!page) + return NULL; + if (xas_retry(xas, page)) + continue; + if (xa_is_value(page)) + continue; + if (PageLocked(page)) + continue; + if (!page_cache_get_speculative(page)) + continue; + /* Has the page moved or been split? */ + if (unlikely(page != xas_reload(xas))) + goto skip; + if (!PageUptodate(page) || PageReadahead(page)) + goto skip; + if (PageHWPoison(page)) + goto skip; + if (!trylock_page(page)) + goto skip; + if (page->mapping != mapping) + goto unlock; + if (!PageUptodate(page)) + goto unlock; + max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); + if (xas->xa_index >= max_idx) + goto unlock; + return page; +unlock: + unlock_page(page); +skip: + put_page(page); + } while ((page = xas_next_entry(xas, end_pgoff)) != NULL); + + return NULL; +} + +static inline struct page *first_map_page(struct vm_fault *vmf, + struct xa_state *xas, + pgoff_t end_pgoff) +{ + return next_uptodate_page(xas_find(xas, end_pgoff), + vmf, xas, end_pgoff); +} + +static inline struct page *next_map_page(struct vm_fault *vmf, + struct xa_state *xas, + pgoff_t end_pgoff) +{ + return next_uptodate_page(xas_next_entry(xas, end_pgoff), + vmf, xas, end_pgoff); +} + +vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address, pgoff_t start_pgoff, pgoff_t end_pgoff) { - struct file *file = vmf->vma->vm_file; + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; - unsigned long max_idx; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); + vm_fault_t ret = 0; rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { - if (xas_retry(&xas, head)) - continue; - if (xa_is_value(head)) - goto next; + head = first_map_page(vmf, &xas, end_pgoff); + if (!head) { + rcu_read_unlock(); + return 0; + } - /* - * Check for a locked page first, as a speculative - * reference may adversely influence page migration. - */ - if (PageLocked(head)) - goto next; - if (!page_cache_get_speculative(head)) - goto next; + if (filemap_map_pmd(vmf, head)) { + rcu_read_unlock(); + return VM_FAULT_NOPAGE; + } - /* Has the page moved or been split? 
*/ - if (unlikely(head != xas_reload(&xas))) - goto skip; + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + do { page = find_subpage(head, xas.xa_index); - - if (!PageUptodate(head) || - PageReadahead(page) || - PageHWPoison(page)) - goto skip; - if (!trylock_page(head)) - goto skip; - - if (head->mapping != mapping || !PageUptodate(head)) - goto unlock; - - max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); - if (xas.xa_index >= max_idx) + if (PageHWPoison(page)) goto unlock; if (mmap_miss > 0) @@ -2884,21 +2969,28 @@ void filemap_map_pages(struct vm_fault *vmf, if (vmf->pte) vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) + + if (!pte_none(*vmf->pte)) goto unlock; + + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + + /* The fault is handled */ + if (vmf->address == address) + ret = VM_FAULT_NOPAGE; + continue; unlock: unlock_page(head); -skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. */ - if (pmd_trans_huge(*vmf->pmd)) - break; - } + } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL); + pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); + + return ret; } EXPORT_SYMBOL(filemap_map_pages); diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..e51638b92e7c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See comment in handle_pte_fault() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. 
- */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). - */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3762,52 +3702,17 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) return ret; } #else -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { - BUILD_BUG(); - return 0; + return VM_FAULT_FALLBACK; } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. - * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3729,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3748,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? 
*/ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,13 +3762,36 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); - return ret; + if (!(vma->vm_flags & VM_SHARED)) { + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + } + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } static unsigned long fault_around_bytes __read_mostly = @@ -3938,7 +3860,6 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) pgoff_t start_pgoff = vmf->pgoff; pgoff_t end_pgoff; int off; - vm_fault_t ret = 0; nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT; mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK; @@ -3960,31 +3881,11 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) if (pmd_none(*vmf->pmd)) { vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm); if (!vmf->prealloc_pte) - goto out; + return VM_FAULT_OOM; smp_wmb(); /* See comment in __pte_alloc() */ } - vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff); - - /* Huge page is mapped? Page fault is solved */ - if (pmd_trans_huge(*vmf->pmd)) { - ret = VM_FAULT_NOPAGE; - goto out; - } - - /* ->map_pages() haven't done anything useful. Cold page cache? */ - if (!vmf->pte) - goto out; - - /* check if the page fault is solved */ - vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); - if (!pte_none(*vmf->pte)) - ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); -out: - vmf->address = address; - vmf->pte = NULL; - return ret; + return vmf->vma->vm_ops->map_pages(vmf, address, start_pgoff, end_pgoff); } static vm_fault_t do_read_fault(struct vm_fault *vmf) @@ -4340,7 +4241,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
On Sun, Dec 27, 2020 at 3:48 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > I did what Hugh proposed and it got clear to my eyes. It gets somewhat > large, but take a look. Ok, it's not that much bigger, and the end result is certainly much clearer wrt locking. So that last version of yours with the fix for the uninitialized 'ret' variable looks good to me. Of course, I've said that before, and have been wrong. So ... Linus
On Sun, 27 Dec 2020, Linus Torvalds wrote: > On Sun, Dec 27, 2020 at 3:48 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > I did what Hugh proposed and it got clear to my eyes. It gets somewhat > > large, but take a look. > > Ok, it's not that much bigger, and the end result is certainly much > clearer wrt locking. > > So that last version of yours with the fix for the uninitialized 'ret' > variable looks good to me. > > Of course, I've said that before, and have been wrong. So ... And guess what... it's broken. I folded it into testing rc1: segfault on cc1, systemd "Journal file corrupted, rotating", seen on more than one machine. I've backed it out, rc1 itself seems fine, I'll leave rc1 under load overnight, then come back to the faultaround patch tomorrow; won't glance at it tonight, but maybe Kirill will guess what's wrong. Hugh
On Sun, Dec 27, 2020 at 10:43:44PM -0800, Hugh Dickins wrote: > On Sun, 27 Dec 2020, Linus Torvalds wrote: > > On Sun, Dec 27, 2020 at 3:48 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > > > I did what Hugh proposed and it got clear to my eyes. It gets somewhat > > > large, but take a look. > > > > Ok, it's not that much bigger, and the end result is certainly much > > clearer wrt locking. > > > > So that last version of yours with the fix for the uninitialized 'ret' > > variable looks good to me. > > > > Of course, I've said that before, and have been wrong. So ... > > And guess what... it's broken. > > I folded it into testing rc1: segfault on cc1, systemd > "Journal file corrupted, rotating", seen on more than one machine. > > I've backed it out, rc1 itself seems fine, I'll leave rc1 under > load overnight, then come back to the faultaround patch tomorrow; > won't glance at it tonight, but maybe Kirill will guess what's wrong. So far I only found one more pin leak and always-true check. I don't see how can it lead to crash or corruption. Keep looking. diff --git a/mm/filemap.c b/mm/filemap.c index c5da09f3f363..87671284de62 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2966,8 +2966,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address, mmap_miss--; vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT; - if (vmf->pte) - vmf->pte += xas.xa_index - last_pgoff; + vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; if (!pte_none(*vmf->pte)) diff --git a/mm/memory.c b/mm/memory.c index e51638b92e7c..829f5056dd1c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3785,13 +3785,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf) vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + ret = 0; /* Re-check under ptl */ if (likely(pte_none(*vmf->pte))) do_set_pte(vmf, page); + else + ret = VM_FAULT_NOPAGE; update_mmu_tlb(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); - return 0; + return ret; } static unsigned long fault_around_bytes __read_mostly =
On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > So far I only found one more pin leak and always-true check. I don't see > how can it lead to crash or corruption. Keep looking. Well, I noticed that the nommu.c version of filemap_map_pages() needs fixing, but that's obviously not the case Hugh sees. No,m I think the problem is the pte_unmap_unlock(vmf->pte, vmf->ptl); at the end of filemap_map_pages(). Why? Because we've been updating vmf->pte as we go along: vmf->pte += xas.xa_index - last_pgoff; and I think that by the time we get to that "pte_unmap_unlock()", vmf->pte potentially points to past the edge of the page directory. I think that is the bug that Hugh sees - simply because we get entirely confused about the page table locking. And it would match the latest change, which was all about moving that unlock from the caller to filemap_map_pages(), and now it's missing the pte fixup.. I personally think it's wrong to update vmf->pte at all. We should just have a local 'ptep' pointer that we update as we walk along. But that requires another change to the calling convention, namely to "do_set_pte()". Also, considering how complicated this patch is getting, I think it might be good to try to split it up a bit. In particular, I think the calling convention change for "filemap_map_pages()" could be done first and independently. And then as a second step, move the VM_FAULT_NOPAGE and "pte_unmap_lock()" from the callers to filemap_map_pages(). And then only as the final step, do that nice re-organization with first_map_page/next_map_page() and moving the locking from alloc_set_pte() into filemap_map_pages().. How does that sound? Anyway, Hugh, if it's about overshooting the pte pointer, maybe this absolutely horrendous and disgusting patch fixes it without the above kinds of more extensive cleanups. Worth testing, perhaps, even if it's too ugly for words? Linus
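For reference, the loop shape being suggested here -- take the PTL once, leave vmf->pte pointing at the slot that was actually locked, and derive a scratch pointer per iteration -- would look roughly like the sketch below. It is only a shape, written against the names in the patch above (page locking and refcount handling elided); base_address stands for the address of the first slot in the range, and do_set_pte() would have to grow address/pte arguments for this to work.

	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, base_address, &vmf->ptl);
	do {
		unsigned long addr = base_address +
			((xas.xa_index - start_pgoff) << PAGE_SHIFT);
		pte_t *ptep = vmf->pte + (xas.xa_index - start_pgoff);

		page = find_subpage(head, xas.xa_index);
		if (pte_none(*ptep))
			do_set_pte(vmf, page);	/* hypothetically: do_set_pte(vmf, page, addr, ptep) */
	} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
	pte_unmap_unlock(vmf->pte, vmf->ptl);	/* exactly the entry that was locked */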
On Mon, Dec 28, 2020 at 10:47 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I personally think it's wrong to update vmf->pte at all. We should > just have a local 'ptep' pointer that we update as we walk along. But > that requires another change to the calling convention, namely to > "do_set_pte()". Actually, I think we should not use do_set_pte() at all. About half of do_set_pte() is about the FAULT_FLAG_WRITE case, which the fault-around code never has set (and would be wrong if it did). So I think do_set_pte() should be made local to mm/memory.c, and the filemap_map_pages() code should do it's own simplified version that just does the non-writable case, and that just gets passed the address and the pte pointer. At that point, there would no longer be any need to update the address/pte fields in the vmf struct, and in fact I think it could be made a "const" pointer in this cal chain. This is all just from looking at the code, I haven't tried to write a patch to do this, so I might be missing some case. Linus
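To make that concrete, the non-writable half of do_set_pte() is small; a helper local to mm/filemap.c could look something like the sketch below. The name is made up, and -- as comes up later in the thread -- the awkward part is the rss accounting, since the _fast counter helpers are private to mm/memory.c, which is why this sketch falls back to the plain inc_mm_counter().

	static void filemap_map_one_pte(struct vm_area_struct *vma, unsigned long addr,
					pte_t *ptep, struct page *page)
	{
		pte_t entry = mk_pte(page, vma->vm_page_prot);

		flush_icache_page(vma, page);
		entry = pte_sw_mkyoung(entry);
		/* faultaround never has FAULT_FLAG_WRITE, so no mkwrite/mkdirty here */
		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
		page_add_file_rmap(page, false);
		set_pte_at(vma->vm_mm, addr, ptep, entry);
	}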
On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote: > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > So far I only found one more pin leak and always-true check. I don't see > > how can it lead to crash or corruption. Keep looking. > > Well, I noticed that the nommu.c version of filemap_map_pages() needs > fixing, but that's obviously not the case Hugh sees. > > No,m I think the problem is the > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > at the end of filemap_map_pages(). > > Why? > > Because we've been updating vmf->pte as we go along: > > vmf->pte += xas.xa_index - last_pgoff; > > and I think that by the time we get to that "pte_unmap_unlock()", > vmf->pte potentially points to past the edge of the page directory. Well, if it's true we have bigger problem: we set up an pte entry without relevant PTL. But I *think* we should be fine here: do_fault_around() limits start_pgoff and end_pgoff to stay within the page table. It made mw looking at the code around pte_unmap_unlock() and I think that the bug is that we have to reset vmf->address and NULLify vmf->pte once we are done with faultaround: diff --git a/mm/memory.c b/mm/memory.c index 829f5056dd1c..405f5c73ce3e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3794,6 +3794,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf) update_mmu_tlb(vma, vmf->address, vmf->pte); pte_unmap_unlock(vmf->pte, vmf->ptl); + vmf->address = address; + vmf->pte = NULL; return ret; }
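Spelling out why the window cannot cross a page table (assuming 4K pages and PTRS_PER_PTE = 512, i.e. one PTE page per 2MB-aligned block): do_fault_around() aligns the start of the window down to the fault-around size, clamps it up to vma->vm_start, and then caps end_pgoff at start_pgoff minus the start's slot within its table plus PTRS_PER_PTE - 1. For example, with nr_pages = 64, a fault at slot 500 of its table and vm_start at slot 480: the aligned start would be slot 448, the vm_start clamp moves it to 480, and the cap yields a range of slots 480..511 (32 pages) instead of 480..543 -- the walk stops exactly at the last slot of the table the PTL was taken on, so the vmf->pte arithmetic in the map_pages loop never leaves that page table.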
On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote: > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote: > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > > > So far I only found one more pin leak and always-true check. I don't see > > > how can it lead to crash or corruption. Keep looking. > > > > Well, I noticed that the nommu.c version of filemap_map_pages() needs > > fixing, but that's obviously not the case Hugh sees. > > > > No,m I think the problem is the > > > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > at the end of filemap_map_pages(). > > > > Why? > > > > Because we've been updating vmf->pte as we go along: > > > > vmf->pte += xas.xa_index - last_pgoff; > > > > and I think that by the time we get to that "pte_unmap_unlock()", > > vmf->pte potentially points to past the edge of the page directory. > > Well, if it's true we have bigger problem: we set up an pte entry without > relevant PTL. > > But I *think* we should be fine here: do_fault_around() limits start_pgoff > and end_pgoff to stay within the page table. > > It made mw looking at the code around pte_unmap_unlock() and I think that > the bug is that we have to reset vmf->address and NULLify vmf->pte once we > are done with faultaround: > > diff --git a/mm/memory.c b/mm/memory.c Ugh.. Wrong place. Need to sleep. I'll look into your idea tomorrow. diff --git a/mm/filemap.c b/mm/filemap.c index 87671284de62..e4daab80ed81 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address, } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL); pte_unmap_unlock(vmf->pte, vmf->ptl); rcu_read_unlock(); + vmf->address = address; + vmf->pte = NULL; WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); return ret;
On Mon, Dec 28, 2020 at 2:05 PM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > But I *think* we should be fine here: do_fault_around() limits start_pgoff > and end_pgoff to stay within the page table. Yeah, but I was thinking it would then update ->pte to just past the edge. But looking at that logic, you're right - it will update ->pte and ->address only just before installing the pte, so it will never go _to_ the edge, it will always stay inside. So scratch my suspicion. It looked promising mainly because that ->pte pointer update was one of the things that changed when you instead compared against the address. Linus
Got it at last, sorry it's taken so long. On Tue, 29 Dec 2020, Kirill A. Shutemov wrote: > On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote: > > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote: > > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > > > > > > > So far I only found one more pin leak and always-true check. I don't see > > > > how can it lead to crash or corruption. Keep looking. Those mods look good in themselves, but, as you expected, made no difference to the corruption I was seeing. > > > > > > Well, I noticed that the nommu.c version of filemap_map_pages() needs > > > fixing, but that's obviously not the case Hugh sees. > > > > > > No,m I think the problem is the > > > > > > pte_unmap_unlock(vmf->pte, vmf->ptl); > > > > > > at the end of filemap_map_pages(). > > > > > > Why? > > > > > > Because we've been updating vmf->pte as we go along: > > > > > > vmf->pte += xas.xa_index - last_pgoff; > > > > > > and I think that by the time we get to that "pte_unmap_unlock()", > > > vmf->pte potentially points to past the edge of the page directory. > > > > Well, if it's true we have bigger problem: we set up an pte entry without > > relevant PTL. > > > > But I *think* we should be fine here: do_fault_around() limits start_pgoff > > and end_pgoff to stay within the page table. Yes, Linus's patch had made no difference, the map_pages loop is safe in that respect. > > > > It made mw looking at the code around pte_unmap_unlock() and I think that > > the bug is that we have to reset vmf->address and NULLify vmf->pte once we > > are done with faultaround: > > > > diff --git a/mm/memory.c b/mm/memory.c > > Ugh.. Wrong place. Need to sleep. > > I'll look into your idea tomorrow. > > diff --git a/mm/filemap.c b/mm/filemap.c > index 87671284de62..e4daab80ed81 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address, > } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL); > pte_unmap_unlock(vmf->pte, vmf->ptl); > rcu_read_unlock(); > + vmf->address = address; > + vmf->pte = NULL; > WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); > > return ret; > -- And that made no (noticeable) difference either. But at last I realized, it's absolutely on the right track, but missing the couple of early returns at the head of filemap_map_pages(): add --- a/mm/filemap.c +++ b/mm/filemap.c @@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f rcu_read_lock(); head = first_map_page(vmf, &xas, end_pgoff); - if (!head) { - rcu_read_unlock(); - return 0; - } + if (!head) + goto out; if (filemap_map_pmd(vmf, head)) { - rcu_read_unlock(); - return VM_FAULT_NOPAGE; + ret = VM_FAULT_NOPAGE; + goto out; } vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, @@ -3066,9 +3064,9 @@ unlock: put_page(head); } while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL); pte_unmap_unlock(vmf->pte, vmf->ptl); +out: rcu_read_unlock(); vmf->address = address; - vmf->pte = NULL; WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); return ret; -- and then the corruption is fixed. It seems miraculous that the machines even booted with that bad vmf->address going to __do_fault(): maybe that tells us what a good job map_pages does most of the time. You'll see I've tried removing the "vmf->pte = NULL;" there. 
I did criticize earlier that vmf->pte was being left set, but was either thinking back to some earlier era of mm/memory.c, or else confusing with vmf->prealloc_pte, which is NULLed when consumed: I could not find anywhere in mm/memory.c which now needs vmf->pte to be cleared, and I seem to run fine without it (even on i386 HIGHPTE). So, the mystery is solved; but I don't think any of these patches should be applied. Without thinking through Linus's suggestions re do_set_pte() in particular, I do think this map_pages interface is too ugly, and given us lots of trouble: please take your time to go over it all again, and come up with a cleaner patch. I've grown rather jaded, and questioning the value of the rework: I don't think I want to look at or test another for a week or so. Hugh
On Mon, Dec 28, 2020 at 01:58:40PM -0800, Linus Torvalds wrote: > On Mon, Dec 28, 2020 at 10:47 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > I personally think it's wrong to update vmf->pte at all. We should > > just have a local 'ptep' pointer that we update as we walk along. But > > that requires another change to the calling convention, namely to > > "do_set_pte()". > > Actually, I think we should not use do_set_pte() at all. > > About half of do_set_pte() is about the FAULT_FLAG_WRITE case, which > the fault-around code never has set (and would be wrong if it did). > > So I think do_set_pte() should be made local to mm/memory.c, and the > filemap_map_pages() code should do it's own simplified version that > just does the non-writable case, and that just gets passed the address > and the pte pointer. Well, do_set_pte() also does rss accounting and _fast version of it is local to mm/memory.c. It makes it cumbersome to opencode do_set_pte() in filemap_map_pages(). We need to make fast rss acounting visible outside memory.c or have some kind of batching in filemap_map_pages(). > At that point, there would no longer be any need to update the > address/pte fields in the vmf struct, and in fact I think it could be > made a "const" pointer in this cal chain. Unfortunately, we would still need to NULLify vmf->prealloc_pte once it's consumed. It kills idea with const. > This is all just from looking at the code, I haven't tried to write a > patch to do this, so I might be missing some case. I reworked do_fault_around() so it doesn't touch vmf->address and pass original address down to ->map_pages(). No need in the new argument in ->map_pages. filemap_map_pages() calculates address based on pgoff of the page handled. From 526b30b06d237cb1ed2fd51df0e4a6203df7bfa0 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Date: Sat, 19 Dec 2020 15:19:23 +0300 Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Signed-off-by: Kirill A. 
Shutemov <kirill.shutemov@linux.intel.com> --- fs/xfs/xfs_file.c | 6 +- include/linux/mm.h | 12 ++- include/linux/pgtable.h | 11 +++ mm/filemap.c | 177 ++++++++++++++++++++++++++--------- mm/memory.c | 199 ++++++++++++---------------------------- 5 files changed, 213 insertions(+), 192 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5b0f93f73837..111fe73bb8a7 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1319,17 +1319,19 @@ xfs_filemap_pfn_mkwrite( return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); } -static void +static vm_fault_t xfs_filemap_map_pages( struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff) { struct inode *inode = file_inode(vmf->vma->vm_file); + vm_fault_t ret; xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - filemap_map_pages(vmf, start_pgoff, end_pgoff); + ret = filemap_map_pages(vmf, start_pgoff, end_pgoff); xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + return ret; } static const struct vm_operations_struct xfs_file_vm_ops = { diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d3fb4e..d3d4e307fa09 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -534,8 +534,8 @@ struct vm_fault { * is not NULL, otherwise pmd. */ pgtable_t prealloc_pte; /* Pre-allocated pte page table. - * vm_ops->map_pages() calls - * alloc_set_pte() from atomic context. + * vm_ops->map_pages() sets up a page + * table from from atomic context. * do_fault_around() pre-allocates * page table to avoid allocation from * atomic context. @@ -562,7 +562,7 @@ struct vm_operations_struct { vm_fault_t (*fault)(struct vm_fault *vmf); vm_fault_t (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size); - void (*map_pages)(struct vm_fault *vmf, + vm_fault_t (*map_pages)(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff); unsigned long (*pagesize)(struct vm_area_struct * area); @@ -972,7 +972,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma) return pte; } -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page); +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page); +void do_set_pte(struct vm_fault *vmf, struct page *page); + vm_fault_t finish_fault(struct vm_fault *vmf); vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #endif @@ -2621,7 +2623,7 @@ extern void truncate_inode_pages_final(struct address_space *); /* generic vm_area_ops exported for stackable file systems */ extern vm_fault_t filemap_fault(struct vm_fault *vmf); -extern void filemap_map_pages(struct vm_fault *vmf, +extern vm_fault_t filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff); extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e237004d498d..869c1921ceda 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1259,6 +1259,17 @@ static inline int pmd_trans_unstable(pmd_t *pmd) #endif } +/* + * the ordering of these checks is important for pmds with _page_devmap set. + * if we check pmd_trans_unstable() first we will trip the bad_pmd() check + * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly + * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. 
+ */ +static inline int pmd_devmap_trans_unstable(pmd_t *pmd) +{ + return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); +} + #ifndef CONFIG_NUMA_BALANCING /* * Technically a PTE can be PROTNONE even when not doing NUMA balancing but diff --git a/mm/filemap.c b/mm/filemap.c index 0b2067b3c328..cc4c99f5a287 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -42,6 +42,7 @@ #include <linux/psi.h> #include <linux/ramfs.h> #include <linux/page_idle.h> +#include <asm/pgalloc.h> #include "internal.h" #define CREATE_TRACE_POINTS @@ -2831,74 +2832,164 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) } EXPORT_SYMBOL(filemap_fault); -void filemap_map_pages(struct vm_fault *vmf, - pgoff_t start_pgoff, pgoff_t end_pgoff) +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page) { - struct file *file = vmf->vma->vm_file; + struct mm_struct *mm = vmf->vma->vm_mm; + + /* Huge page is mapped? No need to proceed. */ + if (pmd_trans_huge(*vmf->pmd)) { + unlock_page(page); + put_page(page); + return true; + } + + if (pmd_none(*vmf->pmd) && PageTransHuge(page)) { + vm_fault_t ret = do_set_pmd(vmf, page); + if (!ret) { + /* The page is mapped successfully, reference consumed. */ + unlock_page(page); + return true; + } + } + + if (pmd_none(*vmf->pmd)) { + vmf->ptl = pmd_lock(mm, vmf->pmd); + if (likely(pmd_none(*vmf->pmd))) { + mm_inc_nr_ptes(mm); + pmd_populate(mm, vmf->pmd, vmf->prealloc_pte); + vmf->prealloc_pte = NULL; + } + spin_unlock(vmf->ptl); + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) { + unlock_page(page); + put_page(page); + return true; + } + + return false; +} + +static struct page *next_uptodate_page(struct page *page, + struct address_space *mapping, + struct xa_state *xas, pgoff_t end_pgoff) +{ + unsigned long max_idx; + + do { + if (!page) + return NULL; + if (xas_retry(xas, page)) + continue; + if (xa_is_value(page)) + continue; + if (PageLocked(page)) + continue; + if (!page_cache_get_speculative(page)) + continue; + /* Has the page moved or been split? 
*/ + if (unlikely(page != xas_reload(xas))) + goto skip; + if (!PageUptodate(page) || PageReadahead(page)) + goto skip; + if (PageHWPoison(page)) + goto skip; + if (!trylock_page(page)) + goto skip; + if (page->mapping != mapping) + goto unlock; + if (!PageUptodate(page)) + goto unlock; + max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); + if (xas->xa_index >= max_idx) + goto unlock; + return page; +unlock: + unlock_page(page); +skip: + put_page(page); + } while ((page = xas_next_entry(xas, end_pgoff)) != NULL); + + return NULL; +} + +static inline struct page *first_map_page(struct address_space *mapping, + struct xa_state *xas, + pgoff_t end_pgoff) +{ + return next_uptodate_page(xas_find(xas, end_pgoff), + mapping, xas, end_pgoff); +} + +static inline struct page *next_map_page(struct address_space *mapping, + struct xa_state *xas, + pgoff_t end_pgoff) +{ + return next_uptodate_page(xas_next_entry(xas, end_pgoff), + mapping, xas, end_pgoff); +} + +vm_fault_t filemap_map_pages(struct vm_fault *vmf, + pgoff_t start_pgoff, pgoff_t end_pgoff) +{ + struct vm_area_struct *vma = vmf->vma; + struct file *file = vma->vm_file; struct address_space *mapping = file->f_mapping; pgoff_t last_pgoff = start_pgoff; - unsigned long max_idx; + unsigned long address = vmf->address; XA_STATE(xas, &mapping->i_pages, start_pgoff); struct page *head, *page; unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss); + vm_fault_t ret = 0; rcu_read_lock(); - xas_for_each(&xas, head, end_pgoff) { - if (xas_retry(&xas, head)) - continue; - if (xa_is_value(head)) - goto next; + head = first_map_page(mapping, &xas, end_pgoff); + if (!head) + goto out; - /* - * Check for a locked page first, as a speculative - * reference may adversely influence page migration. - */ - if (PageLocked(head)) - goto next; - if (!page_cache_get_speculative(head)) - goto next; + if (filemap_map_pmd(vmf, head)) { + ret = VM_FAULT_NOPAGE; + goto out; + } - /* Has the page moved or been split? */ - if (unlikely(head != xas_reload(&xas))) - goto skip; + vmf->address = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT); + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl); + do { page = find_subpage(head, xas.xa_index); - - if (!PageUptodate(head) || - PageReadahead(page) || - PageHWPoison(page)) - goto skip; - if (!trylock_page(head)) - goto skip; - - if (head->mapping != mapping || !PageUptodate(head)) - goto unlock; - - max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE); - if (xas.xa_index >= max_idx) + if (PageHWPoison(page)) goto unlock; if (mmap_miss > 0) mmap_miss--; vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT; - if (vmf->pte) - vmf->pte += xas.xa_index - last_pgoff; + vmf->pte += xas.xa_index - last_pgoff; last_pgoff = xas.xa_index; - if (alloc_set_pte(vmf, page)) + + if (!pte_none(*vmf->pte)) goto unlock; + + do_set_pte(vmf, page); + /* no need to invalidate: a not-present page won't be cached */ + update_mmu_cache(vma, vmf->address, vmf->pte); unlock_page(head); - goto next; + + /* The fault is handled */ + if (vmf->address == address) + ret = VM_FAULT_NOPAGE; + continue; unlock: unlock_page(head); -skip: put_page(head); -next: - /* Huge page is mapped? No need to proceed. 
*/ - if (pmd_trans_huge(*vmf->pmd)) - break; - } + } while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL); + pte_unmap_unlock(vmf->pte, vmf->ptl); +out: rcu_read_unlock(); + vmf->address = address; WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss); + return ret; } EXPORT_SYMBOL(filemap_map_pages); diff --git a/mm/memory.c b/mm/memory.c index c48f8df6e502..978798a6289b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (pte_alloc(vma->vm_mm, vmf->pmd)) return VM_FAULT_OOM; - /* See the comment in pte_alloc_one_map() */ + /* See comment in handle_pte_fault() */ if (unlikely(pmd_trans_unstable(vmf->pmd))) return 0; @@ -3630,66 +3630,6 @@ static vm_fault_t __do_fault(struct vm_fault *vmf) return ret; } -/* - * The ordering of these checks is important for pmds with _PAGE_DEVMAP set. - * If we check pmd_trans_unstable() first we will trip the bad_pmd() check - * inside of pmd_none_or_trans_huge_or_clear_bad(). This will end up correctly - * returning 1 but not before it spams dmesg with the pmd_clear_bad() output. - */ -static int pmd_devmap_trans_unstable(pmd_t *pmd) -{ - return pmd_devmap(*pmd) || pmd_trans_unstable(pmd); -} - -static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf) -{ - struct vm_area_struct *vma = vmf->vma; - - if (!pmd_none(*vmf->pmd)) - goto map_pte; - if (vmf->prealloc_pte) { - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_none(*vmf->pmd))) { - spin_unlock(vmf->ptl); - goto map_pte; - } - - mm_inc_nr_ptes(vma->vm_mm); - pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); - spin_unlock(vmf->ptl); - vmf->prealloc_pte = NULL; - } else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) { - return VM_FAULT_OOM; - } -map_pte: - /* - * If a huge pmd materialized under us just retry later. Use - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead of - * pmd_trans_huge() to ensure the pmd didn't become pmd_trans_huge - * under us and then back to pmd_none, as a result of MADV_DONTNEED - * running immediately after a huge pmd fault in a different thread of - * this mm, in turn leading to a misleading pmd_trans_huge() retval. - * All we have to ensure is that it is a regular pmd that we can walk - * with pte_offset_map() and we can do that through an atomic read in - * C, which is what pmd_trans_unstable() provides. - */ - if (pmd_devmap_trans_unstable(vmf->pmd)) - return VM_FAULT_NOPAGE; - - /* - * At this point we know that our vmf->pmd points to a page of ptes - * and it cannot become pmd_none(), pmd_devmap() or pmd_trans_huge() - * for the duration of the fault. If a racing MADV_DONTNEED runs and - * we zap the ptes pointed to by our vmf->pmd, the vmf->ptl will still - * be valid and we will re-check to make sure the vmf->pte isn't - * pte_none() under vmf->ptl protection when we return to - * alloc_set_pte(). 
- */ - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); - return 0; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { @@ -3704,7 +3644,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) vmf->prealloc_pte = NULL; } -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; @@ -3762,52 +3702,17 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) return ret; } #else -static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) +vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) { - BUILD_BUG(); - return 0; + return VM_FAULT_FALLBACK; } #endif -/** - * alloc_set_pte - setup new PTE entry for given page and add reverse page - * mapping. If needed, the function allocates page table or use pre-allocated. - * - * @vmf: fault environment - * @page: page to map - * - * Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on - * return. - * - * Target users are page handler itself and implementations of - * vm_ops->map_pages. - * - * Return: %0 on success, %VM_FAULT_ code in case of error. - */ -vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) +void do_set_pte(struct vm_fault *vmf, struct page *page) { struct vm_area_struct *vma = vmf->vma; bool write = vmf->flags & FAULT_FLAG_WRITE; pte_t entry; - vm_fault_t ret; - - if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { - ret = do_set_pmd(vmf, page); - if (ret != VM_FAULT_FALLBACK) - return ret; - } - - if (!vmf->pte) { - ret = pte_alloc_one_map(vmf); - if (ret) - return ret; - } - - /* Re-check under ptl */ - if (unlikely(!pte_none(*vmf->pte))) { - update_mmu_tlb(vma, vmf->address, vmf->pte); - return VM_FAULT_NOPAGE; - } flush_icache_page(vma, page); entry = mk_pte(page, vma->vm_page_prot); @@ -3824,14 +3729,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) page_add_file_rmap(page, false); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); - - /* no need to invalidate: a not-present page won't be cached */ - update_mmu_cache(vma, vmf->address, vmf->pte); - - return 0; } - /** * finish_fault - finish page fault once we have prepared the page to fault * @@ -3849,12 +3748,12 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) */ vm_fault_t finish_fault(struct vm_fault *vmf) { + struct vm_area_struct *vma = vmf->vma; struct page *page; - vm_fault_t ret = 0; + vm_fault_t ret; /* Did we COW the page? 
*/ - if ((vmf->flags & FAULT_FLAG_WRITE) && - !(vmf->vma->vm_flags & VM_SHARED)) + if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) page = vmf->cow_page; else page = vmf->page; @@ -3863,12 +3762,38 @@ vm_fault_t finish_fault(struct vm_fault *vmf) * check even for read faults because we might have lost our CoWed * page */ - if (!(vmf->vma->vm_flags & VM_SHARED)) - ret = check_stable_address_space(vmf->vma->vm_mm); - if (!ret) - ret = alloc_set_pte(vmf, page); - if (vmf->pte) - pte_unmap_unlock(vmf->pte, vmf->ptl); + if (!(vma->vm_flags & VM_SHARED)) { + ret = check_stable_address_space(vma->vm_mm); + if (ret) + return ret; + } + + if (pmd_none(*vmf->pmd)) { + if (PageTransCompound(page)) { + ret = do_set_pmd(vmf, page); + if (ret != VM_FAULT_FALLBACK) + return ret; + } + + if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) + return VM_FAULT_OOM; + } + + /* See comment in handle_pte_fault() */ + if (pmd_devmap_trans_unstable(vmf->pmd)) + return 0; + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + ret = 0; + /* Re-check under ptl */ + if (likely(pte_none(*vmf->pte))) + do_set_pte(vmf, page); + else + ret = VM_FAULT_NOPAGE; + + update_mmu_tlb(vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); return ret; } @@ -3938,13 +3863,12 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) pgoff_t start_pgoff = vmf->pgoff; pgoff_t end_pgoff; int off; - vm_fault_t ret = 0; nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT; mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK; - vmf->address = max(address & mask, vmf->vma->vm_start); - off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1); + address = max(address & mask, vmf->vma->vm_start); + off = ((vmf->address - address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1); start_pgoff -= off; /* @@ -3952,7 +3876,7 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) * the vma or nr_pages from start_pgoff, depending what is nearest. */ end_pgoff = start_pgoff - - ((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) + + ((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) + PTRS_PER_PTE - 1; end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1, start_pgoff + nr_pages - 1); @@ -3960,31 +3884,11 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) if (pmd_none(*vmf->pmd)) { vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm); if (!vmf->prealloc_pte) - goto out; + return VM_FAULT_OOM; smp_wmb(); /* See comment in __pte_alloc() */ } - vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff); - - /* Huge page is mapped? Page fault is solved */ - if (pmd_trans_huge(*vmf->pmd)) { - ret = VM_FAULT_NOPAGE; - goto out; - } - - /* ->map_pages() haven't done anything useful. Cold page cache? */ - if (!vmf->pte) - goto out; - - /* check if the page fault is solved */ - vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT); - if (!pte_none(*vmf->pte)) - ret = VM_FAULT_NOPAGE; - pte_unmap_unlock(vmf->pte, vmf->ptl); -out: - vmf->address = address; - vmf->pte = NULL; - return ret; + return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff); } static vm_fault_t do_read_fault(struct vm_fault *vmf) @@ -4340,7 +4244,18 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) */ vmf->pte = NULL; } else { - /* See comment in pte_alloc_one_map() */ + /* + * If a huge pmd materialized under us just retry later. 
Use + * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead + * of pmd_trans_huge() to ensure the pmd didn't become + * pmd_trans_huge under us and then back to pmd_none, as a + * result of MADV_DONTNEED running immediately after a huge pmd + * fault in a different thread of this mm, in turn leading to a + * misleading pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() provides. + */ if (pmd_devmap_trans_unstable(vmf->pmd)) return 0; /*
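On the rss point above: one way to keep the _fast counters private to mm/memory.c would be to batch in the map_pages loop and fold the total in once at the end. Roughly the shape below -- purely hypothetical, not part of the patch, and it glosses over the fact that mm_counter_file() picks MM_SHMEMPAGES for shmem pages, which is one reason the batching is less tidy than it first sounds.

	unsigned int nr_mapped = 0;

	do {
		/* ... existing checks and vmf->address/vmf->pte updates ... */
		do_set_pte(vmf, page);	/* imagine a variant that skips the
					 * per-page counter update */
		nr_mapped++;
	} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
	pte_unmap_unlock(vmf->pte, vmf->ptl);

	if (nr_mapped)
		add_mm_counter(vma->vm_mm, MM_FILEPAGES, nr_mapped);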
On Tue, Dec 29, 2020 at 04:28:19PM +0300, Kirill A. Shutemov wrote: > > At that point, there would no longer be any need to update the > > address/pte fields in the vmf struct, and in fact I think it could be > > made a "const" pointer in this cal chain. > > Unfortunately, we would still need to NULLify vmf->prealloc_pte once it's > consumed. It kills idea with const. We could abstract out a refcount for pgtable_t (usually it's a struct page, but sometimes it's a portion of a struct page, or sometimes it's multiple struct pages) and always put it instead of freeing it. There'd be some details to sort out, and I'm not sure it's worth it just to constify the vmf on this path, but something that might be worth it for a future case. > +++ b/fs/xfs/xfs_file.c > @@ -1319,17 +1319,19 @@ xfs_filemap_pfn_mkwrite( > return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); > } > > -static void > +static vm_fault_t > xfs_filemap_map_pages( Can we just ditch the ->map_pages callback and make the rule "if you've put uptodate pages in the page cache, they can be mapped without informing the filesystem"? > +++ b/include/linux/mm.h > @@ -534,8 +534,8 @@ struct vm_fault { > * is not NULL, otherwise pmd. > */ > pgtable_t prealloc_pte; /* Pre-allocated pte page table. > - * vm_ops->map_pages() calls > - * alloc_set_pte() from atomic context. > + * vm_ops->map_pages() sets up a page > + * table from from atomic context. Doubled word "from" here.
On Tue, Dec 29, 2020 at 5:28 AM Kirill A. Shutemov <kirill@shutemov.name> wrote: > > I reworked do_fault_around() so it doesn't touch vmf->address and pass > original address down to ->map_pages(). No need in the new argument in > ->map_pages. filemap_map_pages() calculates address based on pgoff of the > page handled. So I really like how the end result looks, but it worries me how many problems this patch has had, and I'd love to try to make it more incremental. In particular, I really liked your re-write of the loop in filemap_map_pages(), and how you separated out the pmd case to be a separate early case. I think that was some of the nicest part of the patch. And I really like how you changed the vmf->address meaning in this latest version, and pass the real address down, and pass the range just in the start/end_pgoff. That just looks like the RightThing(tm) to do. But I think these nice cleanups could and should be independent of some of the other changes. The unlocking change, for example, that changed the return type from void to vm_fault_t, also looks like a nice cleanup, but that's also the one that caused the last series of odd problems for Hugh. The fix for that then also caused that ugly "goto out" mess, which undid some of the "this looks really nice" effects. I don't even see the need for that "goto out". Yes, it resets the 'vmf->address' back to what it should be (so that __do_fault() fallback can work right), but the two first "goto out" cases haven't actually changed it (or their mmap_miss), so why did they need to do that? Anyway, I really like how it all looks (apart from that odd "goto out" detail - nitpicking here) but it does worry me that it changes so much and had so many small problems.. Linus
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..932886554586 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -426,6 +426,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PREFAULT_OLD: Initialise pre-faulted PTEs in the 'old' state.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -456,6 +457,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE		0x80
 #define FAULT_FLAG_INSTRUCTION		0x100
 #define FAULT_FLAG_INTERRUPTIBLE	0x200
+#define FAULT_FLAG_PREFAULT_OLD		0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -493,7 +495,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
 	{ FAULT_FLAG_USER,		"USER" }, \
 	{ FAULT_FLAG_REMOTE,		"REMOTE" }, \
 	{ FAULT_FLAG_INSTRUCTION,	"INSTRUCTION" }, \
-	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }
+	{ FAULT_FLAG_INTERRUPTIBLE,	"INTERRUPTIBLE" }, \
+	{ FAULT_FLAG_PREFAULT_OLD,	"PREFAULT_OLD" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..6b30c15120e7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -134,6 +134,18 @@ static inline bool arch_faults_on_old_pte(void)
 }
 #endif
 
+#ifndef arch_wants_old_faultaround_pte
+static inline bool arch_wants_old_faultaround_pte(void)
+{
+	/*
+	 * Transitioning a PTE from 'old' to 'young' can be expensive on
+	 * some architectures, even if it's performed in hardware. By
+	 * default, "false" means prefaulted entries will be 'young'.
+	 */
+	return false;
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
@@ -3788,6 +3800,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	bool old = vmf->flags & FAULT_FLAG_PREFAULT_OLD;
 	pte_t entry;
 	vm_fault_t ret;
 
@@ -3811,7 +3824,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
 
 	flush_icache_page(vma, page);
 	entry = mk_pte(page, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
+	entry = old ? pte_mkold(entry) : pte_sw_mkyoung(entry);
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	/* copy-on-write page */
@@ -3964,6 +3977,9 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
+	if (arch_wants_old_faultaround_pte())
+		vmf->flags |= FAULT_FLAG_PREFAULT_OLD;
+
 	vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
 
 	/* Huge page is mapped? Page fault is solved */
@@ -3978,8 +3994,17 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 
 	/* check if the page fault is solved */
 	vmf->pte -= (vmf->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
-	if (!pte_none(*vmf->pte))
-		ret = VM_FAULT_NOPAGE;
+	if (pte_none(*vmf->pte))
+		goto out_unlock;
+
+	if (vmf->flags & FAULT_FLAG_PREFAULT_OLD) {
+		pte_t pte = pte_mkyoung(*vmf->pte);
+
+		if (ptep_set_access_flags(vmf->vma, address, vmf->pte, pte, 0))
+			update_mmu_cache(vmf->vma, address, vmf->pte);
+	}
+
+	ret = VM_FAULT_NOPAGE;
+out_unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
 	vmf->address = address;
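Since the generic fallback above returns false, an architecture opts in by providing its own definition before mm/memory.c hits the #ifndef. A hypothetical override could look like the following; the capability check is a stand-in invented for illustration, not the actual arm64 hook from this series:

/* In a hypothetical arch's asm/pgtable.h (illustrative sketch only): */

static inline bool example_cpu_has_hw_access_flag(void)
{
	/* Stand-in for a real capability check, e.g. a CPU feature bit. */
	return true;
}

#define arch_wants_old_faultaround_pte arch_wants_old_faultaround_pte
static inline bool arch_wants_old_faultaround_pte(void)
{
	/* Only ask for 'old' prefaulted PTEs if hardware sets the access flag cheaply. */
	return example_cpu_has_hw_access_flag();
}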
Commit 5c0a85fad949 ("mm: make faultaround produce old ptes") changed
the "faultaround" behaviour to initialise prefaulted PTEs as 'old',
since this avoids vmscan wrongly assuming that they are hot, despite
having never been explicitly accessed by userspace.

The change has been shown to benefit numerous arm64 micro-architectures
(with hardware access flag) running Android, where both application
launch latency and direct reclaim time are significantly reduced.

Unfortunately, commit 315d09bf30c2 ("Revert "mm: make faultaround
produce old ptes"") reverted the change due to it being identified as
the cause of a ~6% regression in unixbench on x86. Experiments on a
variety of recent arm64 micro-architectures indicate that unixbench is
not affected by the original commit, which instead yields a 0-1%
performance improvement.

Since one size does not fit all for the initial state of prefaulted
PTEs, introduce arch_wants_old_faultaround_pte(), which allows an
architecture to opt in to 'old' prefaulted PTEs at runtime based on
whatever criteria it may have.

Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Will Deacon <will@kernel.org>
---
 include/linux/mm.h |  5 ++++-
 mm/memory.c        | 31 ++++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 4 deletions(-)
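The reclaim interaction described in the changelog can be illustrated with a small userspace toy (a simplified model for exposition, not kernel code): a prefaulted mapping that starts out 'young' answers "was this referenced?" in the affirmative even though userspace never touched it, so it looks hot to reclaim.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of a PTE's access flag; not a kernel structure. */
struct toy_pte {
	bool young;
};

/* Crude stand-in for reclaim's test-and-clear of the access flag. */
static bool toy_referenced(struct toy_pte *pte)
{
	bool was_young = pte->young;

	pte->young = false;
	return was_young;
}

int main(void)
{
	struct toy_pte young_prefault = { .young = true };	/* behaviour after the revert */
	struct toy_pte old_prefault = { .young = false };	/* behaviour this series enables on arm64 */

	printf("young prefault looks referenced: %d\n", toy_referenced(&young_prefault));
	printf("old prefault looks referenced:   %d\n", toy_referenced(&old_prefault));
	return 0;
}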