Message ID | 4d31abf5-56c0-9f3d-d12f-c9317936691@google.com (mailing list archive) |
---|---|
State | New |
Series | [mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd |
On Mon, Aug 21, 2023 at 12:51:20PM -0700, Hugh Dickins wrote:
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd. "BUG: Bad rss-counter state" followed.
>
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed(). Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.
>
> Just for this case, take the pmd_lock() two steps earlier: not because
> it gives any protection against this case itself, but because ptlock
> nests inside it, and it's the dropping of ptlock which let the bug in.
> In other cases, continue to minimize the pmd_lock() hold time.
>
> Reported-by: Jann Horn <jannh@google.com>
> Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
> Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
> Signed-off-by: Hugh Dickins <hughd@google.com>

The locking is indeed slightly complicated.. but I didn't spot anything
wrong.

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,

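The race in the commit message can be pictured with a toy, single-threaded model. This is only a sketch and not kernel code: `toy_mm`, `toy_clear_table()`, `toy_uffd_copy()` and `toy_collapse()` are hypothetical names, the "page table" is a small boolean array, and the window where ptlock is dropped is modelled as a plain function call placed between emptying the table and removing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PTES 8	/* hypothetical table size, far smaller than a real one */

/* Hypothetical stand-in for one page table plus the mm's rss counter. */
struct toy_mm {
	bool present[TOY_NR_PTES];
	long rss;		/* entries the mm believes are mapped */
};

/* Models steps 2 and 3: clear every present entry and unaccount it. */
static void toy_clear_table(struct toy_mm *mm)
{
	for (size_t i = 0; i < TOY_NR_PTES; i++) {
		if (mm->present[i]) {
			mm->present[i] = false;
			mm->rss--;
		}
	}
}

/* Models UFFDIO_COPY on a private mapping: fills a hole and accounts it. */
static void toy_uffd_copy(struct toy_mm *mm, size_t idx)
{
	assert(idx < TOY_NR_PTES);
	if (!mm->present[idx]) {
		mm->present[idx] = true;
		mm->rss++;
	}
}

/*
 * Runs one collapse and returns the rss left behind once the table is gone.
 * uffd_races: a fill lands in the window where "ptlock" was dropped.
 * recheck:    the table is rescanned once both "locks" are held again.
 */
static long toy_collapse(bool uffd_races, bool recheck)
{
	struct toy_mm mm = { .rss = TOY_NR_PTES };

	for (size_t i = 0; i < TOY_NR_PTES; i++)
		mm.present[i] = true;

	toy_clear_table(&mm);		/* steps 2 and 3, under "ptlock" */
	if (uffd_races)			/* window with "ptlock" dropped */
		toy_uffd_copy(&mm, 3);
	if (recheck)
		toy_clear_table(&mm);
	/* step 4: the table is removed; anything still present is lost */
	return mm.rss;
}
```

In this model a nonzero return value plays the role of the "Bad rss-counter state" report: the entry inserted in the window is still accounted when the table disappears. Closing the window (never dropping ptlock) or rescanning after retaking it both bring the count back to zero.
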
On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd. "BUG: Bad rss-counter state" followed.
>
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed(). Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.

I think we couldn't rely on anon_vma here anyway, since holding the
mmap_lock in read mode doesn't prevent concurrent creation of an
anon_vma?

> Just for this case, take the pmd_lock() two steps earlier: not because
> it gives any protection against this case itself, but because ptlock
> nests inside it, and it's the dropping of ptlock which let the bug in.
> In other cases, continue to minimize the pmd_lock() hold time.

Special-casing userfaultfd like this makes me a bit uncomfortable; but
I also can't find anything other than userfaultfd that would insert
pages into regions that are khugepaged-compatible, so I guess this
works?

I guess an alternative would be to use a spin_trylock() instead of the
current pmd_lock(), and if that fails, temporarily drop the page table
lock and then restart from step 2 with both locks held - and at that
point the page table scan should be fast since we expect it to usually
be empty.

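The trylock alternative suggested above follows a general pattern for taking an outer lock while already holding the inner one. The sketch below shows that pattern in userspace with POSIX mutexes rather than kernel spinlocks; `toy_lock_outer_holding_inner()`, `toy_contender()` and `toy_demo_slow_path()` are hypothetical names, and the return value stands in for "go back and rescan the page table".

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Called with `inner` held; returns with both locks held.
 * Returns true if `inner` had to be dropped on the way, in which case the
 * caller must revalidate whatever `inner` was protecting.
 */
static bool toy_lock_outer_holding_inner(pthread_mutex_t *outer,
					 pthread_mutex_t *inner)
{
	if (outer == inner)
		return false;	/* one lock plays both roles: nothing to take */
	if (pthread_mutex_trylock(outer) == 0)
		return false;	/* fast path: inner was never dropped */
	pthread_mutex_unlock(inner);
	pthread_mutex_lock(outer);	/* normal order: outer first... */
	pthread_mutex_lock(inner);	/* ...then inner nested inside it */
	return true;
}

struct toy_locks {
	pthread_mutex_t outer, inner;
	atomic_bool outer_taken;
};

/* Contender: takes outer, then inner, in the normal nesting order. */
static void *toy_contender(void *arg)
{
	struct toy_locks *l = arg;

	pthread_mutex_lock(&l->outer);
	atomic_store(&l->outer_taken, true);
	pthread_mutex_lock(&l->inner);	/* waits until the main thread drops it */
	pthread_mutex_unlock(&l->inner);
	pthread_mutex_unlock(&l->outer);
	return NULL;
}

/* Forces the slow path once and returns what the helper reported. */
static bool toy_demo_slow_path(void)
{
	struct toy_locks l;
	pthread_t t;
	bool dropped;

	pthread_mutex_init(&l.outer, NULL);
	pthread_mutex_init(&l.inner, NULL);
	atomic_init(&l.outer_taken, false);

	pthread_mutex_lock(&l.inner);
	pthread_create(&t, NULL, toy_contender, &l);
	while (!atomic_load(&l.outer_taken))
		sched_yield();	/* wait until the contender owns outer */

	dropped = toy_lock_outer_holding_inner(&l.outer, &l.inner);

	pthread_mutex_unlock(&l.inner);
	pthread_mutex_unlock(&l.outer);
	pthread_join(t, NULL);
	pthread_mutex_destroy(&l.inner);
	pthread_mutex_destroy(&l.outer);
	return dropped;
}
```

The trylock never blocks, so attempting it against the usual lock order cannot deadlock; only when it fails does the caller fall back to the ordinary order and pay for a recheck. The `outer == inner` test mirrors the `pml != ptl` checks in the patch, which exist because the two locks may or may not be the same lock depending on configuration.
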
On Mon, 21 Aug 2023, Jann Horn wrote:
> On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> > thought it had emptied: page lock on the huge page is enough to protect
> > against WP faults (which find the PTE has been cleared), but not enough
> > to protect against userfaultfd. "BUG: Bad rss-counter state" followed.
> >
> > retract_page_tables() protects against this by checking !vma->anon_vma;
> > but we know that MADV_COLLAPSE needs to be able to work on private shmem
> > mappings, even those with an anon_vma prepared for another part of the
> > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> > mappings which are userfaultfd_armed(). Whether it needs to work on
> > private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> > but assume that it does.
>
> I think we couldn't rely on anon_vma here anyway, since holding the
> mmap_lock in read mode doesn't prevent concurrent creation of an
> anon_vma?

We would have had to do the same as in retract_page_tables() (which
doesn't even have mmap_lock for read): recheck !vma->anon_vma after
finally acquiring ptlock. But the !anon_vma limitation is certainly
not acceptable here anyway.

>
> > Just for this case, take the pmd_lock() two steps earlier: not because
> > it gives any protection against this case itself, but because ptlock
> > nests inside it, and it's the dropping of ptlock which let the bug in.
> > In other cases, continue to minimize the pmd_lock() hold time.
>
> Special-casing userfaultfd like this makes me a bit uncomfortable; but
> I also can't find anything other than userfaultfd that would insert
> pages into regions that are khugepaged-compatible, so I guess this
> works?

I'm as sure as I can be that it's solely because userfaultfd breaks
the usual rules here (and in fairness, IIRC Andrea did ask my permission
before making it behave that way on shmem, COWing without a source page).

Perhaps something else will want that same behaviour in future (it's
tempting, but difficult to guarantee correctness); for now, it is just
userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
expecting uffd to add more such exceptional modes in future).

>
> I guess an alternative would be to use a spin_trylock() instead of the
> current pmd_lock(), and if that fails, temporarily drop the page table
> lock and then restart from step 2 with both locks held - and at that
> point the page table scan should be fast since we expect it to usually
> be empty.

That's certainly a good idea, if collapse on userfaultfd_armed private
is anything of a common case (I doubt, but I don't know). It may be a
better idea anyway (saving a drop and retake of ptlock).

I gave it a try, expecting to end up with something that would lead
me to say "I tried it, but it didn't work out well"; but actually it
looks okay to me. I wouldn't say I prefer it, but it seems reasonable,
and no more complicated (as Peter rightly observes) than the original.

It's up to you and Peter, and whoever has strong feelings about it,
to choose between them: I don't mind (but I shall be sad if someone
demands that I indent that comment deeper - I'm not a fan of long
multi-line comments near column 80).


[PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd. "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed(). Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Now trylock pmd lock without dropping ptlock (suggested by jannh): if
that fails, drop and retake ptlock around taking pmd lock, and just in
the uffd private case, go back to recheck and empty the page table.

Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 39 +++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40d43eccdee8..ad1c571772fe 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	struct page *hpage;
 	pte_t *start_pte, *pte;
 	pmd_t *pmd, pgt_pmd;
-	spinlock_t *pml, *ptl;
+	spinlock_t *pml = NULL, *ptl;
 	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
@@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				haddr, haddr + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 	notified = true;
-	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
-	if (!start_pte)	/* mmap_lock + page lock should prevent this */
-		goto abort;
+	spin_lock(ptl);
+recheck:
+	start_pte = pte_offset_map(pmd, haddr);
+	VM_BUG_ON(!start_pte);	/* mmap_lock + page lock should prevent this */
 
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		nr_ptes++;
 	}
 
-	pte_unmap_unlock(start_pte, ptl);
+	pte_unmap(start_pte);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (nr_ptes) {
 		page_ref_sub(hpage, nr_ptes);
 		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+		nr_ptes = 0;
 	}
 
-	/* step 4: remove page table */
+	/* step 4: remove empty page table */
+	if (!pml) {
+		pml = pmd_lockptr(mm, pmd);
+		if (pml != ptl && !spin_trylock(pml)) {
+			spin_unlock(ptl);
+			spin_lock(pml);
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	/*
+	 * pmd_lock covers a wider range than ptl, and (if split from mm's
+	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
+	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
+	 * inserts a valid as-if-COWed PTE without even looking up page cache.
+	 * So page lock of hpage does not protect from it, so we must not drop
+	 * ptl before pgt_pmd is removed, so uffd private needs rechecking.
+	 */
+			if (userfaultfd_armed(vma) &&
+			    !(vma->vm_flags & VM_SHARED))
+				goto recheck;
+		}
+	}
 
-	/* Huge page lock is still held, so page table must remain empty */
-	pml = pmd_lock(mm, pmd);
-	if (ptl != pml)
-		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
 	if (ptl != pml)
@@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 	if (start_pte)
 		pte_unmap_unlock(start_pte, ptl);
+	if (pml && pml != ptl)
+		spin_unlock(pml);
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_hpage:

Hi, Hugh, Jann,

On Mon, Aug 21, 2023 at 07:51:38PM -0700, Hugh Dickins wrote:
> On Mon, 21 Aug 2023, Jann Horn wrote:
> > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> > > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> > > thought it had emptied: page lock on the huge page is enough to protect
> > > against WP faults (which find the PTE has been cleared), but not enough
> > > to protect against userfaultfd. "BUG: Bad rss-counter state" followed.
> > >
> > > retract_page_tables() protects against this by checking !vma->anon_vma;
> > > but we know that MADV_COLLAPSE needs to be able to work on private shmem
> > > mappings, even those with an anon_vma prepared for another part of the
> > > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> > > mappings which are userfaultfd_armed(). Whether it needs to work on
> > > private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> > > but assume that it does.
> >
> > I think we couldn't rely on anon_vma here anyway, since holding the
> > mmap_lock in read mode doesn't prevent concurrent creation of an
> > anon_vma?
>
> We would have had to do the same as in retract_page_tables() (which
> doesn't even have mmap_lock for read): recheck !vma->anon_vma after
> finally acquiring ptlock. But the !anon_vma limitation is certainly
> not acceptable here anyway.
>
> >
> > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > it gives any protection against this case itself, but because ptlock
> > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > In other cases, continue to minimize the pmd_lock() hold time.
> >
> > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > I also can't find anything other than userfaultfd that would insert
> > pages into regions that are khugepaged-compatible, so I guess this
> > works?
>
> I'm as sure as I can be that it's solely because userfaultfd breaks
> the usual rules here (and in fairness, IIRC Andrea did ask my permission
> before making it behave that way on shmem, COWing without a source page).
>
> Perhaps something else will want that same behaviour in future (it's
> tempting, but difficult to guarantee correctness); for now, it is just
> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> expecting uffd to add more such exceptional modes in future).
>
> >
> > I guess an alternative would be to use a spin_trylock() instead of the
> > current pmd_lock(), and if that fails, temporarily drop the page table
> > lock and then restart from step 2 with both locks held - and at that
> > point the page table scan should be fast since we expect it to usually
> > be empty.
>
> That's certainly a good idea, if collapse on userfaultfd_armed private
> is anything of a common case (I doubt, but I don't know). It may be a
> better idea anyway (saving a drop and retake of ptlock).
>
> I gave it a try, expecting to end up with something that would lead
> me to say "I tried it, but it didn't work out well"; but actually it
> looks okay to me. I wouldn't say I prefer it, but it seems reasonable,
> and no more complicated (as Peter rightly observes) than the original.
>
> It's up to you and Peter, and whoever has strong feelings about it,
> to choose between them: I don't mind (but I shall be sad if someone
> demands that I indent that comment deeper - I'm not a fan of long
> multi-line comments near column 80).

No strong opinion here, either. Just one trivial comment/question below on
the new patch (if that will be preferred)..

>
>
> [PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
>
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd. "BUG: Bad rss-counter state" followed.
>
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed(). Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.
>
> Now trylock pmd lock without dropping ptlock (suggested by jannh): if
> that fails, drop and retake ptlock around taking pmd lock, and just in
> the uffd private case, go back to recheck and empty the page table.
>
> Reported-by: Jann Horn <jannh@google.com>
> Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
> Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/khugepaged.c | 39 +++++++++++++++++++++++++++++----------
>  1 file changed, 29 insertions(+), 10 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 40d43eccdee8..ad1c571772fe 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	struct page *hpage;
>  	pte_t *start_pte, *pte;
>  	pmd_t *pmd, pgt_pmd;
> -	spinlock_t *pml, *ptl;
> +	spinlock_t *pml = NULL, *ptl;
>  	int nr_ptes = 0, result = SCAN_FAIL;
>  	int i;
>
> @@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  				haddr, haddr + HPAGE_PMD_SIZE);
>  	mmu_notifier_invalidate_range_start(&range);
>  	notified = true;
> -	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> -	if (!start_pte)	/* mmap_lock + page lock should prevent this */
> -		goto abort;
> +	spin_lock(ptl);

.. here will the ptl always be valid?

That comes from the previous round of pte_offset_map_lock(), and I assume
after this whole "thp collapse without write lock" work landed, it has the
same lifecycle with the *pte pointer, so can be invalid right after the rcu
read lock released; mmap read lock isn't strong enough to protect the ptl,
not anymore.

Maybe it's all fine because the thp collapse path is the solo path(s) that
will release the pte pgtable page without write mmap lock (so as to release
the ptl too when doing so), and we at least still hold the page lock, so
the worst case is the other concurrent "thp collapse" will still serialize
with this one on the huge page lock. But that doesn't look as solid as
fetching again the ptl from another pte_offset_map_nolock().

So still just raise this question up. It's possible I just missed
something.

> +recheck:
> +	start_pte = pte_offset_map(pmd, haddr);
> +	VM_BUG_ON(!start_pte);	/* mmap_lock + page lock should prevent this */
>
>  	/* step 2: clear page table and adjust rmap */
>  	for (i = 0, addr = haddr, pte = start_pte;
> @@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		nr_ptes++;
>  	}
>
> -	pte_unmap_unlock(start_pte, ptl);
> +	pte_unmap(start_pte);
>
>  	/* step 3: set proper refcount and mm_counters. */
>  	if (nr_ptes) {
>  		page_ref_sub(hpage, nr_ptes);
>  		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> +		nr_ptes = 0;
>  	}
>
> -	/* step 4: remove page table */
> +	/* step 4: remove empty page table */
> +	if (!pml) {
> +		pml = pmd_lockptr(mm, pmd);
> +		if (pml != ptl && !spin_trylock(pml)) {
> +			spin_unlock(ptl);
> +			spin_lock(pml);
> +			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +	/*
> +	 * pmd_lock covers a wider range than ptl, and (if split from mm's
> +	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
> +	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
> +	 * inserts a valid as-if-COWed PTE without even looking up page cache.
> +	 * So page lock of hpage does not protect from it, so we must not drop
> +	 * ptl before pgt_pmd is removed, so uffd private needs rechecking.
> +	 */
> +			if (userfaultfd_armed(vma) &&
> +			    !(vma->vm_flags & VM_SHARED))
> +				goto recheck;
> +		}
> +	}
>
> -	/* Huge page lock is still held, so page table must remain empty */
> -	pml = pmd_lock(mm, pmd);
> -	if (ptl != pml)
> -		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>  	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>  	pmdp_get_lockless_sync();
>  	if (ptl != pml)
> @@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	}
>  	if (start_pte)
>  		pte_unmap_unlock(start_pte, ptl);
> +	if (pml && pml != ptl)
> +		spin_unlock(pml);
>  	if (notified)
>  		mmu_notifier_invalidate_range_end(&range);
>  drop_hpage:
> --
> 2.35.3

On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins <hughd@google.com> wrote:
> On Mon, 21 Aug 2023, Jann Horn wrote:
> > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > it gives any protection against this case itself, but because ptlock
> > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > In other cases, continue to minimize the pmd_lock() hold time.
> >
> > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > I also can't find anything other than userfaultfd that would insert
> > pages into regions that are khugepaged-compatible, so I guess this
> > works?
>
> I'm as sure as I can be that it's solely because userfaultfd breaks
> the usual rules here (and in fairness, IIRC Andrea did ask my permission
> before making it behave that way on shmem, COWing without a source page).
>
> Perhaps something else will want that same behaviour in future (it's
> tempting, but difficult to guarantee correctness); for now, it is just
> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> expecting uffd to add more such exceptional modes in future).

Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
wanted to make it possible to reliably install PTE markers with
madvise() or something like that, which might be nice for allowing
userspace to create guard pages without unnecessary extra VMAs...)

> > I guess an alternative would be to use a spin_trylock() instead of the
> > current pmd_lock(), and if that fails, temporarily drop the page table
> > lock and then restart from step 2 with both locks held - and at that
> > point the page table scan should be fast since we expect it to usually
> > be empty.
>
> That's certainly a good idea, if collapse on userfaultfd_armed private
> is anything of a common case (I doubt, but I don't know). It may be a
> better idea anyway (saving a drop and retake of ptlock).

I was thinking it also has the advantage that it would still perform
okay if we got rid of the userfaultfd_armed() condition at some point
- though I realize that designing too much for hypothetical future
features is an antipattern.

> I gave it a try, expecting to end up with something that would lead
> me to say "I tried it, but it didn't work out well"; but actually it
> looks okay to me. I wouldn't say I prefer it, but it seems reasonable,
> and no more complicated (as Peter rightly observes) than the original.
>
> It's up to you and Peter, and whoever has strong feelings about it,
> to choose between them: I don't mind (but I shall be sad if someone
> demands that I indent that comment deeper - I'm not a fan of long
> multi-line comments near column 80).

I prefer this version because it would make it easier to remove the
"userfaultfd_armed()" check in the future if we have to, but I guess
we could also always change it later if that becomes necessary, so I
don't really have strong feelings on it at this point.

On Tue, Aug 22, 2023 at 04:39:43PM +0200, Jann Horn wrote:
> > Perhaps something else will want that same behaviour in future (it's
> > tempting, but difficult to guarantee correctness); for now, it is just
> > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > expecting uffd to add more such exceptional modes in future).
>
> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> wanted to make it possible to reliably install PTE markers with
> madvise() or something like that, which might be nice for allowing
> userspace to create guard pages without unnecessary extra VMAs...)

I don't know what a userspace API for this would look like, but I have
a dream of creating guard VMAs which only live in the maple tree and
don't require the allocation of a struct VMA. Use some magic reserved
pointer value like XA_ZERO_ENTRY to represent them ... seems more
robust than putting a PTE marker in the page tables?

On 22.08.23 16:39, Jann Horn wrote:
> On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins <hughd@google.com> wrote:
>> On Mon, 21 Aug 2023, Jann Horn wrote:
>>> On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
>>>> Just for this case, take the pmd_lock() two steps earlier: not because
>>>> it gives any protection against this case itself, but because ptlock
>>>> nests inside it, and it's the dropping of ptlock which let the bug in.
>>>> In other cases, continue to minimize the pmd_lock() hold time.
>>>
>>> Special-casing userfaultfd like this makes me a bit uncomfortable; but
>>> I also can't find anything other than userfaultfd that would insert
>>> pages into regions that are khugepaged-compatible, so I guess this
>>> works?
>>
>> I'm as sure as I can be that it's solely because userfaultfd breaks
>> the usual rules here (and in fairness, IIRC Andrea did ask my permission
>> before making it behave that way on shmem, COWing without a source page).
>>
>> Perhaps something else will want that same behaviour in future (it's
>> tempting, but difficult to guarantee correctness); for now, it is just
>> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
>> expecting uffd to add more such exceptional modes in future).
>
> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> wanted to make it possible to reliably install PTE markers with
> madvise() or something like that, which might be nice for allowing
> userspace to create guard pages without unnecessary extra VMAs...)

I'm working on something similar that goes a bit further than just guard
pages. It also installs PTE markers into page tables, inside existing
large VMAs.

Initially, I'll only tackle anon VMAs, though.

On Tue, Aug 22, 2023 at 5:23 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Aug 22, 2023 at 04:39:43PM +0200, Jann Horn wrote:
> > > Perhaps something else will want that same behaviour in future (it's
> > > tempting, but difficult to guarantee correctness); for now, it is just
> > > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > > expecting uffd to add more such exceptional modes in future).
> >
> > Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> > wanted to make it possible to reliably install PTE markers with
> > madvise() or something like that, which might be nice for allowing
> > userspace to create guard pages without unnecessary extra VMAs...)
>
> I don't know what a userspace API for this would look like, but I have
> a dream of creating guard VMAs which only live in the maple tree and
> don't require the allocation of a struct VMA. Use some magic reserved
> pointer value like XA_ZERO_ENTRY to represent them ... seems more
> robust than putting a PTE marker in the page tables?

Chrome currently uses a lot of VMAs for its heap, which I think are
basically alternating PROT_NONE and PROT_READ|PROT_WRITE anonymous
VMAs. Like this:

[...]
3a10002cf000-3a10002d0000 ---p 00000000 00:00 0
3a10002d0000-3a10002e6000 rw-p 00000000 00:00 0
3a10002e6000-3a10002e8000 ---p 00000000 00:00 0
3a10002e8000-3a10002f2000 rw-p 00000000 00:00 0
3a10002f2000-3a10002f4000 ---p 00000000 00:00 0
3a10002f4000-3a10002fb000 rw-p 00000000 00:00 0
3a10002fb000-3a10002fc000 ---p 00000000 00:00 0
3a10002fc000-3a1000303000 rw-p 00000000 00:00 0
3a1000303000-3a1000304000 ---p 00000000 00:00 0
3a1000304000-3a100031b000 rw-p 00000000 00:00 0
3a100031b000-3a100031c000 ---p 00000000 00:00 0
3a100031c000-3a1000326000 rw-p 00000000 00:00 0
3a1000326000-3a1000328000 ---p 00000000 00:00 0
3a1000328000-3a100033a000 rw-p 00000000 00:00 0
3a100033a000-3a100033c000 ---p 00000000 00:00 0
3a100033c000-3a100038b000 rw-p 00000000 00:00 0
3a100038b000-3a100038c000 ---p 00000000 00:00 0
3a100038c000-3a100039b000 rw-p 00000000 00:00 0
3a100039b000-3a100039c000 ---p 00000000 00:00 0
3a100039c000-3a10003af000 rw-p 00000000 00:00 0
3a10003af000-3a10003b0000 ---p 00000000 00:00 0
3a10003b0000-3a10003e8000 rw-p 00000000 00:00 0
3a10003e8000-3a1000401000 ---p 00000000 00:00 0
3a1000401000-3a1000402000 rw-p 00000000 00:00 0
3a1000402000-3a100040c000 ---p 00000000 00:00 0
3a100040c000-3a100046f000 rw-p 00000000 00:00 0
3a100046f000-3a1000470000 ---p 00000000 00:00 0
3a1000470000-3a100047a000 rw-p 00000000 00:00 0
3a100047a000-3a100047c000 ---p 00000000 00:00 0
3a100047c000-3a1000492000 rw-p 00000000 00:00 0
3a1000492000-3a1000494000 ---p 00000000 00:00 0
3a1000494000-3a10004a2000 rw-p 00000000 00:00 0
3a10004a2000-3a10004a4000 ---p 00000000 00:00 0
3a10004a4000-3a10004b6000 rw-p 00000000 00:00 0
3a10004b6000-3a10004b8000 ---p 00000000 00:00 0
3a10004b8000-3a10004ea000 rw-p 00000000 00:00 0
3a10004ea000-3a10004ec000 ---p 00000000 00:00 0
3a10004ec000-3a10005f4000 rw-p 00000000 00:00 0
3a10005f4000-3a1000601000 ---p 00000000 00:00 0
3a1000601000-3a1000602000 rw-p 00000000 00:00 0
3a1000602000-3a1000604000 ---p 00000000 00:00 0
3a1000604000-3a100062b000 rw-p 00000000 00:00 0
3a100062b000-3a1000801000 ---p 00000000 00:00 0
[...]

I was thinking if you used PTE markers as guards, you could maybe turn
all that into more or less a single VMA?

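The alternating layout listed above is easy to reproduce on Linux. The following is a minimal sketch, assuming a readable /proc/self/maps; `toy_count_vmas()` and `toy_vmas_for_guarded_heap()` are hypothetical helper names, and the chunk sizes are arbitrary. It reserves one PROT_NONE region, opens up the data part of each chunk with mprotect(), and counts how many VMAs the reservation has been split into.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count /proc/self/maps entries overlapping [start, end); -1 if unreadable. */
static int toy_count_vmas(uintptr_t start, uintptr_t end)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512];
	int n = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		uintptr_t lo, hi;

		if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &lo, &hi) == 2 &&
		    lo < end && hi > start)
			n++;
	}
	fclose(f);
	return n;
}

/*
 * Reserve nchunks chunks, each made of data_pages usable pages followed by
 * one guard page, and return how many VMAs cover the reservation
 * (-1 on any failure).
 */
static int toy_vmas_for_guarded_heap(int nchunks, int data_pages)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	size_t chunk = ((size_t)data_pages + 1) * psz;
	size_t len = chunk * (size_t)nchunks;
	char *base;
	int n;

	/* Everything starts out inaccessible, like the ---p entries. */
	base = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* Opening the data part of each chunk splits the VMA in two. */
	for (int i = 0; i < nchunks; i++) {
		if (mprotect(base + (size_t)i * chunk, (size_t)data_pages * psz,
			     PROT_READ | PROT_WRITE)) {
			munmap(base, len);
			return -1;
		}
	}

	n = toy_count_vmas((uintptr_t)base, (uintptr_t)base + len);
	munmap(base, len);
	return n;
}
```

With N chunks the reservation should end up covered by 2 * N VMAs, one per data area and one per guard, because neighbouring areas with different protections cannot be merged. That per-guard VMA is the cost a PTE-marker-based guard would avoid.
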
On 22.08.23 17:30, Jann Horn wrote: > On Tue, Aug 22, 2023 at 5:23 PM Matthew Wilcox <willy@infradead.org> wrote: >> On Tue, Aug 22, 2023 at 04:39:43PM +0200, Jann Horn wrote: >>>> Perhaps something else will want that same behaviour in future (it's >>>> tempting, but difficult to guarantee correctness); for now, it is just >>>> userfaultfd (but by saying "_armed" rather than "_missing", I'm half- >>>> expecting uffd to add more such exceptional modes in future). >>> >>> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever >>> wanted to make it possible to reliably install PTE markers with >>> madvise() or something like that, which might be nice for allowing >>> userspace to create guard pages without unnecessary extra VMAs...) >> >> I don't know what a userspace API for this would look like, but I have >> a dream of creating guard VMAs which only live in the maple tree and >> don't require the allocation of a struct VMA. Use some magic reserved >> pointer value like XA_ZERO_ENTRY to represent them ... seems more >> robust than putting a PTE marker in the page tables? > > Chrome currently uses a lot of VMAs for its heap, which I think are > basically alternating PROT_NONE and PROT_READ|PROT_WRITE anonymous > VMAs. Like this: > > [...] 
> 3a10002cf000-3a10002d0000 ---p 00000000 00:00 0 > 3a10002d0000-3a10002e6000 rw-p 00000000 00:00 0 > 3a10002e6000-3a10002e8000 ---p 00000000 00:00 0 > 3a10002e8000-3a10002f2000 rw-p 00000000 00:00 0 > 3a10002f2000-3a10002f4000 ---p 00000000 00:00 0 > 3a10002f4000-3a10002fb000 rw-p 00000000 00:00 0 > 3a10002fb000-3a10002fc000 ---p 00000000 00:00 0 > 3a10002fc000-3a1000303000 rw-p 00000000 00:00 0 > 3a1000303000-3a1000304000 ---p 00000000 00:00 0 > 3a1000304000-3a100031b000 rw-p 00000000 00:00 0 > 3a100031b000-3a100031c000 ---p 00000000 00:00 0 > 3a100031c000-3a1000326000 rw-p 00000000 00:00 0 > 3a1000326000-3a1000328000 ---p 00000000 00:00 0 > 3a1000328000-3a100033a000 rw-p 00000000 00:00 0 > 3a100033a000-3a100033c000 ---p 00000000 00:00 0 > 3a100033c000-3a100038b000 rw-p 00000000 00:00 0 > 3a100038b000-3a100038c000 ---p 00000000 00:00 0 > 3a100038c000-3a100039b000 rw-p 00000000 00:00 0 > 3a100039b000-3a100039c000 ---p 00000000 00:00 0 > 3a100039c000-3a10003af000 rw-p 00000000 00:00 0 > 3a10003af000-3a10003b0000 ---p 00000000 00:00 0 > 3a10003b0000-3a10003e8000 rw-p 00000000 00:00 0 > 3a10003e8000-3a1000401000 ---p 00000000 00:00 0 > 3a1000401000-3a1000402000 rw-p 00000000 00:00 0 > 3a1000402000-3a100040c000 ---p 00000000 00:00 0 > 3a100040c000-3a100046f000 rw-p 00000000 00:00 0 > 3a100046f000-3a1000470000 ---p 00000000 00:00 0 > 3a1000470000-3a100047a000 rw-p 00000000 00:00 0 > 3a100047a000-3a100047c000 ---p 00000000 00:00 0 > 3a100047c000-3a1000492000 rw-p 00000000 00:00 0 > 3a1000492000-3a1000494000 ---p 00000000 00:00 0 > 3a1000494000-3a10004a2000 rw-p 00000000 00:00 0 > 3a10004a2000-3a10004a4000 ---p 00000000 00:00 0 > 3a10004a4000-3a10004b6000 rw-p 00000000 00:00 0 > 3a10004b6000-3a10004b8000 ---p 00000000 00:00 0 > 3a10004b8000-3a10004ea000 rw-p 00000000 00:00 0 > 3a10004ea000-3a10004ec000 ---p 00000000 00:00 0 > 3a10004ec000-3a10005f4000 rw-p 00000000 00:00 0 > 3a10005f4000-3a1000601000 ---p 00000000 00:00 0 > 3a1000601000-3a1000602000 rw-p 00000000 00:00 0 
> 3a1000602000-3a1000604000 ---p 00000000 00:00 0
> 3a1000604000-3a100062b000 rw-p 00000000 00:00 0
> 3a100062b000-3a1000801000 ---p 00000000 00:00 0
> [...]
>
> I was thinking if you used PTE markers as guards, you could maybe turn
> all that into more or less a single VMA?

I proposed the topic "A proper API for sparse memory mappings" for the
bi-weekly MM meeting on September 20, that would also cover exactly that
use case. :)

On Tue, 22 Aug 2023, Peter Xu wrote:
> On Mon, Aug 21, 2023 at 07:51:38PM -0700, Hugh Dickins wrote:
> > On Mon, 21 Aug 2023, Jann Horn wrote:
...
> > >
> > > I guess an alternative would be to use a spin_trylock() instead of the
> > > current pmd_lock(), and if that fails, temporarily drop the page table
> > > lock and then restart from step 2 with both locks held - and at that
> > > point the page table scan should be fast since we expect it to usually
> > > be empty.
> >
> > That's certainly a good idea, if collapse on userfaultfd_armed private
> > is anything of a common case (I doubt, but I don't know). It may be a
> > better idea anyway (saving a drop and retake of ptlock).
> >
> > I gave it a try, expecting to end up with something that would lead
> > me to say "I tried it, but it didn't work out well"; but actually it
> > looks okay to me. I wouldn't say I prefer it, but it seems reasonable,
> > and no more complicated (as Peter rightly observes) than the original.
> >
> > It's up to you and Peter, and whoever has strong feelings about it,
> > to choose between them: I don't mind (but I shall be sad if someone
> > demands that I indent that comment deeper - I'm not a fan of long
> > multi-line comments near column 80).
>
> No strong opinion here, either. Just one trivial comment/question below on
> the new patch (if that will be preferred)..

I'm going to settle for the original v1 for now (I'll explain why in reply
to Jann next) - which already has the blessing of your Acked-by, thanks.

(Yes, the locking is a bit confusing: but mainly for the unrelated reason,
that with the split locking configs, we never quite know whether this lock
is the same as that lock or not, and so have to be rather careful.)

> > [PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
...
> > @@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >  				haddr, haddr + HPAGE_PMD_SIZE);
> >  	mmu_notifier_invalidate_range_start(&range);
> >  	notified = true;
> > -	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> > -	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> > -		goto abort;
> > +	spin_lock(ptl);
>
> .. here will the ptl always be valid?
>
> That comes from the previous round of pte_offset_map_lock(), and I assume
> after this whole "thp collapse without write lock" work landed, it has the
> same lifecycle with the *pte pointer, so can be invalid right after the rcu
> read lock released; mmap read lock isn't strong enough to protect the ptl,
> not anymore.
>
> Maybe it's all fine because the thp collapse path is the solo path(s) that
> will release the pte pgtable page without write mmap lock (so as to release
> the ptl too when doing so), and we at least still hold the page lock, so
> the worst case is the other concurrent "thp collapse" will still serialize
> with this one on the huge page lock. But that doesn't look as solid as
> fetching again the ptl from another pte_offset_map_nolock(). So still just
> raise this question up. It's possible I just missed something.

It is safe, as you say because of us holding the hpage lock, which stops
any racing callers of collapse_pte_mapped_thp() or retract_page_tables():
and these are the functions which (currently) make the *pmd transition
which pte_offset_map_lock() etc. are being careful to guard against.

[In future we can imagine empty page table removal making that transition
too: and that wouldn't even have any hpage to lock. Will it rely on
mmap_lock for write? or pmd_lock? probably both, but no need to design
for it now.]

But I agree that it does *look* more questionable in this patch: there
was a reassuring pte_offset_map_lock() there before, and now I rely more
on the assumptions and just use the "previous" ptl (and that's why I chose
to make the !start_pte case a VM_BUG_ON a few lines later).

I expect, with more time spent, I could cast it back into more reassuring
form: but it's all a bit of a con trick - if you look further down (even
before v2 or v1 fixes) to "step 4", there we have "if (ptl != pml)" which
is also relying on the fact that ptl cannot have changed. And no doubt
that too could be recast into more reassuring-looking form, but it
wouldn't actually be worthwhile.

Thanks for considering these, Peter: I'll recommend v1 to Andrew.

Hugh

On Tue, Aug 22, 2023 at 11:34:19AM -0700, Hugh Dickins wrote:
> (Yes, the locking is a bit confusing: but mainly for the unrelated reason,
> that with the split locking configs, we never quite know whether this lock
> is the same as that lock or not, and so have to be rather careful.)

Is it time to remove the PTE split locking config option? I believe all
supported architectures have at least two levels of page tables, so if we
have split ptlocks, ptl and pml are always different from each other (it's
just that on two level machines, pmd == pud == p4d == pgd). With huge
thread counts now being the norm, it's hard to see why anybody would want
to support SMP and !SPLIT_PTE_PTLOCKS. To quote the documentation ...

  Split page table lock for PTE tables is enabled compile-time if
  CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less or equal to NR_CPUS.
  If split lock is disabled, all tables are guarded by mm->page_table_lock.

You can barely buy a wrist-watch without eight CPUs these days.

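For readers following the "same lock or not" point, here is a minimal userspace sketch of why callers end up writing "if (ptl != pml)". It is a toy model, not kernel code: the toy_* types and functions are hypothetical, a single flag stands in for the kernel's split-lock configuration, and only the enable rule (CONFIG_SPLIT_PTLOCK_CPUS less than or equal to NR_CPUS) is taken from the documentation quoted above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a spinlock; only its address matters here. */
struct toy_lock { int unused; };

/* Toy mm with one mm-wide lock, standing in for mm->page_table_lock. */
struct toy_mm { struct toy_lock page_table_lock; };

/* Toy page-table page, carrying its own lock for the split case. */
struct toy_table { struct toy_lock ptl; };

/* The quoted rule: split locking is on if SPLIT_PTLOCK_CPUS <= NR_CPUS. */
static bool toy_split_enabled(int nr_cpus, int split_ptlock_cpus)
{
	return split_ptlock_cpus <= nr_cpus;
}

/* Which lock guards a table: its own when split, else the mm-wide one. */
static struct toy_lock *toy_lockptr(struct toy_mm *mm, struct toy_table *table,
				    bool split)
{
	return split ? &table->ptl : &mm->page_table_lock;
}

/* True if a caller already holding pml must also take ptl separately. */
static bool toy_needs_nested_lock(struct toy_mm *mm, struct toy_table *pmd_table,
				  struct toy_table *pte_table, bool split)
{
	struct toy_lock *pml = toy_lockptr(mm, pmd_table, split);
	struct toy_lock *ptl = toy_lockptr(mm, pte_table, split);

	return ptl != pml;
}
```

With split locks the two pointers differ and the nested lock is needed; without them both resolve to the mm-wide lock, and taking it twice would self-deadlock, which is what the pointer comparison avoids.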
On Tue, 22 Aug 2023, Jann Horn wrote:
> On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 21 Aug 2023, Jann Horn wrote:
> > > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@google.com> wrote:
> > > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > > it gives any protection against this case itself, but because ptlock
> > > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > > In other cases, continue to minimize the pmd_lock() hold time.
> > >
> > > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > > I also can't find anything other than userfaultfd that would insert
> > > pages into regions that are khugepaged-compatible, so I guess this
> > > works?
> >
> > I'm as sure as I can be that it's solely because userfaultfd breaks
> > the usual rules here (and in fairness, IIRC Andrea did ask my permission
> > before making it behave that way on shmem, COWing without a source page).
> >
> > Perhaps something else will want that same behaviour in future (it's
> > tempting, but difficult to guarantee correctness); for now, it is just
> > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > expecting uffd to add more such exceptional modes in future).
>
> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> wanted to make it possible to reliably install PTE markers with
> madvise() or something like that, which might be nice for allowing
> userspace to create guard pages without unnecessary extra VMAs...)

I see the mailthread has taken inspiration from your comment there,
and veered off in that direction: but I'll ignore those futures.

>
> > > I guess an alternative would be to use a spin_trylock() instead of the
> > > current pmd_lock(), and if that fails, temporarily drop the page table
> > > lock and then restart from step 2 with both locks held - and at that
> > > point the page table scan should be fast since we expect it to usually
> > > be empty.
> >
> > That's certainly a good idea, if collapse on userfaultfd_armed private
> > is anything of a common case (I doubt, but I don't know). It may be a
> > better idea anyway (saving a drop and retake of ptlock).
>
> I was thinking it also has the advantage that it would still perform
> okay if we got rid of the userfaultfd_armed() condition at some point
> - though I realize that designing too much for hypothetical future
> features is an antipattern.
>
> > I gave it a try, expecting to end up with something that would lead
> > me to say "I tried it, but it didn't work out well"; but actually it
> > looks okay to me. I wouldn't say I prefer it, but it seems reasonable,
> > and no more complicated (as Peter rightly observes) than the original.
> >
> > It's up to you and Peter, and whoever has strong feelings about it,
> > to choose between them: I don't mind (but I shall be sad if someone
> > demands that I indent that comment deeper - I'm not a fan of long
> > multi-line comments near column 80).
>
> I prefer this version because it would make it easier to remove the
> "userfaultfd_armed()" check in the future if we have to, but I guess
> we could also always change it later if that becomes necessary, so I
> don't really have strong feelings on it at this point.

Thanks for considering them both, Jann. I do think your trylock way, as
in v2, is in principle superior, and we may well have good reason to
switch over to it in future; but I find it slightly more confusing, so
will follow your and Peter's "no strong feelings" for now, and ask Andrew
please to take the original (implicit v1).

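The trylock alternative discussed above can be sketched in userspace as follows. This is only an illustration of the general pattern (try the outer lock while holding the inner one; on failure drop the inner lock, take both in the proper order, and tell the caller to rescan). It is not the v2 patch: the toy_* names are hypothetical, the spinlock is a bare C11 atomic_flag, and the case where both pointers name the same lock is deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy spinlock built on a C11 atomic_flag. */
struct toy_spinlock { atomic_flag held; };

static void toy_lock(struct toy_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;	/* spin until the current holder releases */
}

static bool toy_trylock(struct toy_spinlock *l)
{
	/* test_and_set returns the previous value: false means we got it */
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_unlock(struct toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Called with the inner lock held; returns with both locks held.
 * Returns true if the inner lock had to be dropped on the way, in which
 * case whatever the caller established under it must be rechecked.
 */
static bool toy_take_outer(struct toy_spinlock *outer, struct toy_spinlock *inner)
{
	if (toy_trylock(outer))
		return false;		/* fast path: inner was never dropped */

	toy_unlock(inner);		/* respect the outer-then-inner order */
	toy_lock(outer);
	toy_lock(inner);
	return true;			/* caller must restart its scan */
}
```

The design point is that the common, uncontended case never drops the inner lock, while the contended case pays for one rescan, which the thread expects to be cheap because the table should already be empty.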
Overriding reason: I realized overnight that v2 is not quite correct:
I was clever enough to realize that nr_ptes needed to be reset to 0
to get the accounting right with a recheck pass, but not clever enough
to realize that resetting it to 0 there would likely skip the abort
path's flush_tlb_mm(mm), when we actually had cleared entries on the
first pass. It needs a separate bool to decide the flush_tlb_mm(mm),
or it needs that (ridiculously minor!) step 3 to be moved down.

But rather than reworking it, please let's just go with v1 for now.

Thanks,
Hugh

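The v2 problem described above can be modelled in a few lines of userspace C. This is a toy under stated assumptions, not the kernel code: the toy_* names are hypothetical, "tlb_flushed" merely records whether the abort path would have called flush_tlb_mm(mm), and the recheck pass is reduced to "reset the counter". The first function keys the flush off the counter, as Hugh describes for v2; the second uses the separate bool he suggests.

```c
#include <stdbool.h>

/* Hypothetical record of what an aborted collapse attempt ended up doing. */
struct toy_outcome {
	int nr_ptes;		/* counter value seen by the abort path */
	bool tlb_flushed;	/* would the abort path have flushed the TLB? */
};

/* v2-like model: the flush decision reuses the counter the recheck resets. */
static struct toy_outcome toy_abort_keyed_on_count(int cleared_first_pass,
						   bool recheck)
{
	struct toy_outcome o = { cleared_first_pass, false };

	if (recheck)
		o.nr_ptes = 0;		/* reset so the accounting stays right */
	if (o.nr_ptes)
		o.tlb_flushed = true;	/* skipped after a reset: the bug */
	return o;
}

/* Suggested shape: remember separately that entries were ever cleared. */
static struct toy_outcome toy_abort_with_flag(int cleared_first_pass,
					      bool recheck)
{
	struct toy_outcome o = { cleared_first_pass, false };
	bool cleared_any = cleared_first_pass > 0;

	if (recheck)
		o.nr_ptes = 0;		/* accounting reset, as before */
	if (cleared_any)
		o.tlb_flushed = true;	/* flush no longer depends on the count */
	return o;
}
```

Calling both with, say, three entries cleared and a recheck pass shows the difference: the counter-keyed variant reports no flush, the flag variant still flushes while leaving the counter at zero.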
On Tue, Aug 22, 2023 at 8:54 PM Hugh Dickins <hughd@google.com> wrote:
> But rather than reworking it, please let's just go with v1 for now.
Sounds good to me.
On Tue, 22 Aug 2023, Matthew Wilcox wrote:
> On Tue, Aug 22, 2023 at 11:34:19AM -0700, Hugh Dickins wrote:
> > (Yes, the locking is a bit confusing: but mainly for the unrelated reason,
> > that with the split locking configs, we never quite know whether this lock
> > is the same as that lock or not, and so have to be rather careful.)
>
> Is it time to remove the PTE split locking config option? I believe all
> supported architectures have at least two levels of page tables, so if we
> have split ptlocks, ptl and pml are always different from each other (it's
> just that on two level machines, pmd == pud == p4d == pgd). With huge
> thread counts now being the norm, it's hard to see why anybody would want
> to support SMP and !SPLIT_PTE_PTLOCKS. To quote the documentation ...
>
>   Split page table lock for PTE tables is enabled compile-time if
>   CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less or equal to NR_CPUS.
>   If split lock is disabled, all tables are guarded by mm->page_table_lock.
>
> You can barely buy a wrist-watch without eight CPUs these days.

Whilst I'm still happy with my 0-CPU wrist-watch, I do think you're right:
that SPLIT_PTLOCK_CPUS business was really just a safety-valve for when
introducing split ptlock in the first place, 4 pulled out of a hat, and
the unsplit ptlock path quite under-tested.

But I'll leave it to someone else to do the job of removing it whenever.

Hugh

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40d43eccdee8..d5650541083a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	struct page *hpage;
 	pte_t *start_pte, *pte;
 	pmd_t *pmd, pgt_pmd;
-	spinlock_t *pml, *ptl;
+	spinlock_t *pml = NULL, *ptl;
 	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
@@ -1572,9 +1572,25 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				haddr, haddr + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 	notified = true;
-	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+
+	/*
+	 * pmd_lock covers a wider range than ptl, and (if split from mm's
+	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
+	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
+	 * inserts a valid as-if-COWed PTE without even looking up page cache.
+	 * So page lock of hpage does not protect from it, so we must not drop
+	 * ptl before pgt_pmd is removed, so uffd private needs pml taken now.
+	 */
+	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
+		pml = pmd_lock(mm, pmd);
+
+	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
+	if (!pml)
+		spin_lock(ptl);
+	else if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1608,7 +1624,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		nr_ptes++;
 	}
 
-	pte_unmap_unlock(start_pte, ptl);
+	pte_unmap(start_pte);
+	if (!pml)
+		spin_unlock(ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (nr_ptes) {
@@ -1616,12 +1634,12 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
 	}
 
-	/* step 4: remove page table */
-
-	/* Huge page lock is still held, so page table must remain empty */
-	pml = pmd_lock(mm, pmd);
-	if (ptl != pml)
-		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	/* step 4: remove empty page table */
+	if (!pml) {
+		pml = pmd_lock(mm, pmd);
+		if (ptl != pml)
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	}
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
 	if (ptl != pml)
@@ -1648,6 +1666,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 	if (start_pte)
 		pte_unmap_unlock(start_pte, ptl);
+	if (pml && pml != ptl)
+		spin_unlock(pml);
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_hpage:
Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd. "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed(). Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Just for this case, take the pmd_lock() two steps earlier: not because
it gives any protection against this case itself, but because ptlock
nests inside it, and it's the dropping of ptlock which let the bug in.
In other cases, continue to minimize the pmd_lock() hold time.

Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 38 +++++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 9 deletions(-)