Message ID | 20220224122614.94921-13-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: COW fixes part 2: reliable GUP pins of anonymous pages | expand |
On Thu, Feb 24, 2022 at 01:26:13PM +0100, David Hildenbrand wrote: > Whenever GUP currently ends up taking a R/O pin on an anonymous page that > might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault > on the page table entry will end up replacing the mapped anonymous page > due to COW, resulting in the GUP pin no longer being consistent with the > page actually mapped into the page table. > > The possible ways to deal with this situation are: > (1) Ignore and pin -- what we do right now. > (2) Fail to pin -- which would be rather surprising to callers and > could break user space. > (3) Trigger unsharing and pin the now exclusive page -- reliable R/O > pins. How does this mesh with the common FOLL_FORCE|FOLL_WRITE|FOLL_PIN pattern used for requesting read access? Can they be converted to just FOLL_WRITE|FOLL_PIN after this? Jason
On 02.03.22 17:55, Jason Gunthorpe wrote: > On Thu, Feb 24, 2022 at 01:26:13PM +0100, David Hildenbrand wrote: >> Whenever GUP currently ends up taking a R/O pin on an anonymous page that >> might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault >> on the page table entry will end up replacing the mapped anonymous page >> due to COW, resulting in the GUP pin no longer being consistent with the >> page actually mapped into the page table. >> >> The possible ways to deal with this situation are: >> (1) Ignore and pin -- what we do right now. >> (2) Fail to pin -- which would be rather surprising to callers and >> could break user space. >> (3) Trigger unsharing and pin the now exclusive page -- reliable R/O >> pins. > Hi Jason, > How does this mesh with the common FOLL_FORCE|FOLL_WRITE|FOLL_PIN > pattern used for requesting read access? Can they be converted to > just FOLL_WRITE|FOLL_PIN after this? Interesting question, I thought about this in detail yet, let me give it a try: IIRC, the sole purpose of FOLL_FORCE in the context of R/O pins is to enforce the eventual COW -- meaning we COW (via FOLL_WRITE) even if we don't have the permissions to write (via FOLL_FORCE), to make sure we most certainly have an exclusive anonymoous page in a MAP_PRIVATE mapping. Dropping only the FOLL_FORCE would make the FOLL_WRITE request fail if the mapping is currently !VM_WRITE (but is VM_MAYWRITE), so that wouldn't work. I recall that we don't allow pinning the zero page ("special pte", !vm_normal_page()). So if you have an ordinary MAP_PRIVATE|MAP_ANON mapping, you will now only need a "FOLL_READ" and have a reliable pin, even if not previously writing to every page. It would we different with other MAP_PRIVATE file mappings I remember: With FOLL_FORCE|FOLL_WRITE|FOLL_PIN we'd force placement of an anonymous page, resulting in the R/O (long-term ?) pin not observing consecutive file changes. With a pure FOLL_READ we'd still observe file changes as we don't trigger a write fault. BUT, once we actually write to the private mapping via the page table, the GUP pin would go out of sync with the now-anonymous page mapped into the page table. However, I'm having a hard time answering what's actually expected? It's really hard to tell what the user wants with MAP_PRIVATE file mappings and stumbles over a !anon page (no modifications so far): (a) I want a R/O pin to observe file modifications. (b) I want the R/O pin to *not* observe file modifications but observe my (eventual? if any) private modifications, Of course, if we already wrote to that page and now have an anon page, it's easy: we are already no longer following file changes. Maybe FOLL_PIN would already do now what we'd expect from a R/O pin -- (a), maybe not. I'm wondering if FOLL_LONGTERM could give us an indication whether (a) or (b) applies.
On Wed, Mar 02, 2022 at 09:38:09PM +0100, David Hildenbrand wrote: > (a) I want a R/O pin to observe file modifications. > (b) I want the R/O pin to *not* observe file modifications but observe > my (eventual? if any) private modifications, A scenario I know that motivated this is fairly straightfoward: static char data[] = {}; ibv_reg_mr(data, READ_ONLY) data[0] = 1 .. go touch data via DMA .. We want to reliably observe the '1' What is happening under the covers is that 'data' is placed in the .data segment and becomes a file backed MAP_PRIVATE page. The write COWs that page It would work OK if it was in .bss instead I think the FOLL_FORCE is there because otherwise the trick doesn't work on true RO pages and the API becomes broken. Jason
On 3/2/22 12:38, David Hildenbrand wrote: ... > BUT, once we actually write to the private mapping via the page table, > the GUP pin would go out of sync with the now-anonymous page mapped into > the page table. However, I'm having a hard time answering what's > actually expected? > > It's really hard to tell what the user wants with MAP_PRIVATE file > mappings and stumbles over a !anon page (no modifications so far): > > (a) I want a R/O pin to observe file modifications. > (b) I want the R/O pin to *not* observe file modifications but observe > my (eventual? if any) private modifications, > On this aspect, I think it is easier than trying to discern user intentions. Because it is less a question of what the user wants, and more a question of how mmap(2) is specified. And the man page clearly indicates that the user has no right to expect to see file modifications. Here's the excerpt: "MAP_PRIVATE Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region. " > Of course, if we already wrote to that page and now have an anon page, > it's easy: we are already no longer following file changes. Yes, and in fact, I've always thought that the way this was written means that it should be treated as a snapshot of the file contents, and no longer reliably connected in either direction to the page(s). thanks,
On 03.03.22 02:47, John Hubbard wrote: > On 3/2/22 12:38, David Hildenbrand wrote: > ... >> BUT, once we actually write to the private mapping via the page table, >> the GUP pin would go out of sync with the now-anonymous page mapped into >> the page table. However, I'm having a hard time answering what's >> actually expected? >> >> It's really hard to tell what the user wants with MAP_PRIVATE file >> mappings and stumbles over a !anon page (no modifications so far): >> >> (a) I want a R/O pin to observe file modifications. >> (b) I want the R/O pin to *not* observe file modifications but observe >> my (eventual? if any) private modifications, >> > > On this aspect, I think it is easier than trying to discern user > intentions. Because it is less a question of what the user wants, and > more a question of how mmap(2) is specified. And the man page clearly > indicates that the user has no right to expect to see file > modifications. Here's the excerpt: > > "MAP_PRIVATE > > Create a private copy-on-write mapping. Updates to the mapping are not > visible to other processes mapping the same file, and are not carried > through to the underlying file. It is unspecified whether changes made > to the file after the mmap() call are visible in the mapped region. > " > >> Of course, if we already wrote to that page and now have an anon page, >> it's easy: we are already no longer following file changes. > > Yes, and in fact, I've always thought that the way this was written > means that it should be treated as a snapshot of the file contents, > and no longer reliably connected in either direction to the page(s). Thanks John, that's extremely helpful. I forgot about these MAP_PRIVATE mmap() details -- they help a lot to clarify which semantics to provide. So what we could do is: a) Extend FAULT_FLAG_UNSHARE to also unshare an !anon page in a MAP_RPIVATE mapping, replacing it with an (exclusive) anon page. R/O PTE permissions are maintained, just like unsharing in the context of this series. b) Similarly trigger FAULT_FLAG_UNSHARE from GUP when trying to take a R/O pin (FOLL_PIN) on a R/O-mapped !anon page in a MAP_PRIVATE mapping. c) Make R/O pins consistently use "FOLL_PIN" instead, getting rid of FOLL_FORCE|FOLL_WRITE. Of course, we can't detect MAP_PRIVATE vs. MAP_SHARED in GUP-fast (no VMA), so we'd always have to fallback in GUP-fast in case we intend to FOLL_PIN a R/O-mapped !anon page. That would imply that essentially any R/O pins (FOLL_PIN) would have to fallback to ordinary GUP. BUT, I mean we require FOLL_FORCE|FOLL_WRITE right now, which is not any different, so ... One optimization would be to trigger b) only for FOLL_LONGTERM. For !FOLL_LONGTERM there are "in theory" absolutely no guarantees which data will be observed if we modify concurrently to e.g., O_DIRECT IMHO. But that would require some more thought. Of course, that's all material for another journey, although it should be mostly straight forward.
On 02.03.22 21:59, Jason Gunthorpe wrote: > On Wed, Mar 02, 2022 at 09:38:09PM +0100, David Hildenbrand wrote: > >> (a) I want a R/O pin to observe file modifications. >> (b) I want the R/O pin to *not* observe file modifications but observe >> my (eventual? if any) private modifications, > > A scenario I know that motivated this is fairly straightfoward: > > static char data[] = {}; > > ibv_reg_mr(data, READ_ONLY) > data[0] = 1 > .. go touch data via DMA .. > > We want to reliably observe the '1' > > What is happening under the covers is that 'data' is placed in the > .data segment and becomes a file backed MAP_PRIVATE page. The write > COWs that page > > It would work OK if it was in .bss instead > > I think the FOLL_FORCE is there because otherwise the trick doesn't > work on true RO pages and the API becomes broken. Thanks for the nice example, it matches what John brought up in respect to MAP_PRIVATE semantics.
On 03.03.22 09:06, David Hildenbrand wrote: > On 03.03.22 02:47, John Hubbard wrote: >> On 3/2/22 12:38, David Hildenbrand wrote: >> ... >>> BUT, once we actually write to the private mapping via the page table, >>> the GUP pin would go out of sync with the now-anonymous page mapped into >>> the page table. However, I'm having a hard time answering what's >>> actually expected? >>> >>> It's really hard to tell what the user wants with MAP_PRIVATE file >>> mappings and stumbles over a !anon page (no modifications so far): >>> >>> (a) I want a R/O pin to observe file modifications. >>> (b) I want the R/O pin to *not* observe file modifications but observe >>> my (eventual? if any) private modifications, >>> >> >> On this aspect, I think it is easier than trying to discern user >> intentions. Because it is less a question of what the user wants, and >> more a question of how mmap(2) is specified. And the man page clearly >> indicates that the user has no right to expect to see file >> modifications. Here's the excerpt: >> >> "MAP_PRIVATE >> >> Create a private copy-on-write mapping. Updates to the mapping are not >> visible to other processes mapping the same file, and are not carried >> through to the underlying file. It is unspecified whether changes made >> to the file after the mmap() call are visible in the mapped region. >> " >> >>> Of course, if we already wrote to that page and now have an anon page, >>> it's easy: we are already no longer following file changes. >> >> Yes, and in fact, I've always thought that the way this was written >> means that it should be treated as a snapshot of the file contents, >> and no longer reliably connected in either direction to the page(s). > > Thanks John, that's extremely helpful. I forgot about these MAP_PRIVATE > mmap() details -- they help a lot to clarify which semantics to provide. > > So what we could do is: > > a) Extend FAULT_FLAG_UNSHARE to also unshare an !anon page in > a MAP_RPIVATE mapping, replacing it with an (exclusive) anon page. > R/O PTE permissions are maintained, just like unsharing in the > context of this series. > > b) Similarly trigger FAULT_FLAG_UNSHARE from GUP when trying to take a > R/O pin (FOLL_PIN) on a R/O-mapped !anon page in a MAP_PRIVATE > mapping. > > c) Make R/O pins consistently use "FOLL_PIN" instead, getting rid of > FOLL_FORCE|FOLL_WRITE. > > > Of course, we can't detect MAP_PRIVATE vs. MAP_SHARED in GUP-fast (no > VMA), so we'd always have to fallback in GUP-fast in case we intend to > FOLL_PIN a R/O-mapped !anon page. That would imply that essentially any > R/O pins (FOLL_PIN) would have to fallback to ordinary GUP. BUT, I mean > we require FOLL_FORCE|FOLL_WRITE right now, which is not any different, > so ... > > One optimization would be to trigger b) only for FOLL_LONGTERM. For > !FOLL_LONGTERM there are "in theory" absolutely no guarantees which data > will be observed if we modify concurrently to e.g., O_DIRECT IMHO. But > that would require some more thought. > > Of course, that's all material for another journey, although it should > be mostly straight forward. > Just a slight clarification after stumbling over shared zeropage code in follow_page_pte(): we do seem to support pinning the shared zeropage at least on the GUP-slow path. While I haven't played with it, I assume we'd have to implement+trigger unsharing in case we'd want to take a R/O pin on the shared zeropage. Of course, similar to file-backed MAP_PRIVATE handling, this is out of the scope of this series ("This change implies that whenever user space wrote to a private mapping (IOW, we have an anonymous page mapped), that GUP pins will always remain consistent: reliable R/O GUP pins of anonymous pages.").
diff --git a/include/linux/mm.h b/include/linux/mm.h index ffeb7f5fa32b..48ce9b7f903d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3003,6 +3003,45 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) return 0; } +/* + * Indicates for which pages that are write-protected in the page table, + * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the + * GUP pin will remain consistent with the pages mapped into the page tables + * of the MM. + * + * Temporary unmapping of PageAnonExclusive() pages or clearing of + * PageAnonExclusive() has to protect against concurrent GUP: + * * Ordinary GUP: Using the PT lock + * * GUP-fast and fork(): mm->write_protect_seq + * * GUP-fast and KSM or temporary unmapping (swap, migration): + * clear/invalidate+flush of the page table entry + * + * Must be called with the (sub)page that's actually referenced via the + * page table entry, which might not necessarily be the head page for a + * PTE-mapped THP. + */ +static inline bool gup_must_unshare(unsigned int flags, struct page *page) +{ + /* + * FOLL_WRITE is implicitly handled correctly as the page table entry + * has to be writable -- and if it references (part of) an anonymous + * folio, that part is required to be marked exclusive. + */ + if ((flags & (FOLL_WRITE | FOLL_PIN)) != FOLL_PIN) + return false; + /* + * Note: PageAnon(page) is stable until the page is actually getting + * freed. + */ + if (!PageAnon(page)) + return false; + /* + * Note that PageKsm() pages cannot be exclusive, and consequently, + * cannot get pinned. + */ + return !PageAnonExclusive(page); +} + typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data); extern int apply_to_page_range(struct mm_struct *mm, unsigned long address, unsigned long size, pte_fn_t fn, void *data); diff --git a/mm/gup.c b/mm/gup.c index 7127e789d210..2cb7cbb1fc1f 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -568,6 +568,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } } + if (!pte_write(pte) && gup_must_unshare(flags, page)) { + page = ERR_PTR(-EMLINK); + goto out; + } /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */ if (unlikely(!try_grab_page(page, flags))) { page = ERR_PTR(-ENOMEM); @@ -820,6 +824,11 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma, * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches * the device's dev_pagemap metadata to avoid repeating expensive lookups. * + * When getting an anonymous page and the caller has to trigger unsharing + * of a shared anonymous page first, -EMLINK is returned. The caller should + * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only + * relevant with FOLL_PIN and !FOLL_WRITE. + * * On output, the @ctx->page_mask is set according to the size of the page. * * Return: the mapped (struct page *), %NULL if no mapping exists, or @@ -943,7 +952,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, * is, *@locked will be set to 0 and -EBUSY returned. */ static int faultin_page(struct vm_area_struct *vma, - unsigned long address, unsigned int *flags, int *locked) + unsigned long address, unsigned int *flags, bool unshare, + int *locked) { unsigned int fault_flags = 0; vm_fault_t ret; @@ -968,6 +978,11 @@ static int faultin_page(struct vm_area_struct *vma, */ fault_flags |= FAULT_FLAG_TRIED; } + if (unshare) { + fault_flags |= FAULT_FLAG_UNSHARE; + /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */ + VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE); + } ret = handle_mm_fault(vma, address, fault_flags, NULL); if (ret & VM_FAULT_ERROR) { @@ -1189,8 +1204,9 @@ static long __get_user_pages(struct mm_struct *mm, cond_resched(); page = follow_page_mask(vma, start, foll_flags, &ctx); - if (!page) { - ret = faultin_page(vma, start, &foll_flags, locked); + if (!page || PTR_ERR(page) == -EMLINK) { + ret = faultin_page(vma, start, &foll_flags, + PTR_ERR(page) == -EMLINK, locked); switch (ret) { case 0: goto retry; @@ -2346,6 +2362,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, goto pte_unmap; } + if (!pte_write(pte) && gup_must_unshare(flags, page)) { + put_compound_head(head, 1, flags); + goto pte_unmap; + } + VM_BUG_ON_PAGE(compound_head(page) != head, page); /* @@ -2589,6 +2610,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, return 0; } + if (!pmd_write(orig) && gup_must_unshare(flags, head)) { + put_compound_head(head, refs, flags); + return 0; + } + *nr += refs; SetPageReferenced(head); return 1; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2cb8fa31377a..334cd9381635 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1396,6 +1396,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, page = pmd_page(*pmd); VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page); + if (!pmd_write(*pmd) && gup_must_unshare(flags, page)) + return ERR_PTR(-EMLINK); + if (!try_grab_page(page, flags)) return ERR_PTR(-ENOMEM); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ebcde7e9f704..6c2a7dd8a48d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5957,6 +5957,23 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma, } } +static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte, + bool *unshare) +{ + pte_t pteval = huge_ptep_get(pte); + + *unshare = false; + if (is_swap_pte(pteval)) + return true; + if (huge_pte_write(pteval)) + return false; + if (flags & FOLL_WRITE) + return true; + if (gup_must_unshare(flags, pte_page(pteval))) + *unshare = true; + return *unshare; +} + long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *position, unsigned long *nr_pages, @@ -5971,6 +5988,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, while (vaddr < vma->vm_end && remainder) { pte_t *pte; spinlock_t *ptl = NULL; + bool unshare; int absent; struct page *page; @@ -6021,9 +6039,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, * both cases, and because we can't follow correct pages * directly from any kind of swap entries. */ - if (absent || is_swap_pte(huge_ptep_get(pte)) || - ((flags & FOLL_WRITE) && - !huge_pte_write(huge_ptep_get(pte)))) { + if (absent || + __follow_hugetlb_must_fault(flags, pte, &unshare)) { vm_fault_t ret; unsigned int fault_flags = 0; @@ -6031,6 +6048,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, spin_unlock(ptl); if (flags & FOLL_WRITE) fault_flags |= FAULT_FLAG_WRITE; + else if (unshare) + fault_flags |= FAULT_FLAG_UNSHARE; if (locked) fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;