| Message ID | 20210421225613.60124-1-surenb@google.com (mailing list archive) |
| --- | --- |
| State | New, archived |
| Series | [1/1] gup: document and work around "COW can break either way" issue |
On Wed, Apr 21, 2021 at 03:56:13PM -0700, Suren Baghdasaryan wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
>
> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
>
> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> ambiguous: the page can be COW'ed at any time afterwards, and the
> direction of a COW event isn't defined.
>
> Yes, whoever writes to it will generally do the COW, but if the thread
> that did the get_user_pages() unmapped the page before the write (and
> that could happen due to memory pressure in addition to any outright
> action), the writer could also just take over the old page instead.
>
> End result: the get_user_pages() call might result in a page pointer
> that is no longer associated with the original VM, and is associated
> with - and controlled by - another VM having taken it over instead.
>
> So when doing a get_user_pages() on a COW mapping, the only really safe
> thing to do would be to break the COW when getting the page, even when
> only getting it for reading.
>
> At the same time, some users simply don't even care.
>
> For example, the perf code wants to look up the page not because it
> cares about the page, but because the code simply wants to look up the
> physical address of the access for informational purposes, and doesn't
> really care about races when a page might be unmapped and remapped
> elsewhere.
>
> This adds logic to force a COW event by setting FOLL_WRITE on any
> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> pointer as a result.
>
> The current semantics end up being:
>
> - __get_user_pages_fast(): no change. If you don't ask for a write,
>   you won't break COW. You'd better know what you're doing.
>
> - get_user_pages_fast(): the fast-case "look it up in the page tables
>   without anything getting mmap_sem" now refuses to follow a read-only
>   page, since it might need COW breaking. Which happens in the slow
>   path - the fast path doesn't know if the memory might be COW or not.
>
> - get_user_pages() (including the slow-path fallback for gup_fast()):
>   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
>   very similar semantics to FOLL_FORCE.
>
> If it turns out that we want finer granularity (ie "only break COW when
> it might actually matter" - things like the zero page are special and
> don't need to be broken) we might need to push these semantics deeper
> into the lookup fault path. So if people care enough, it's possible
> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> with the internal FOLL_COW flag we already have for tracking "I had a
> COW".
>
> Alternatively, if it turns out that different callers might want to
> explicitly control the forced COW break behavior, we might even want to
> make such a flag visible to the users of get_user_pages() instead of
> using the above default semantics.
>
> But for now, this is mostly commentary on the issue (this commit message
> being a lot bigger than the patch, and that patch in turn is almost all
> comments), with that minimal "enable COW breaking early" logic using the
> existing FOLL_WRITE behavior.
>
> [ It might be worth noting that we've always had this ambiguity, and it
>   could arguably be seen as a user-space issue.
>
>   You only get private COW mappings that could break either way in
>   situations where user space is doing cooperative things (ie fork()
>   before an execve() etc), but it _is_ surprising and very subtle, and
>   fork() is supposed to give you independent address spaces.
>
>   So let's treat this as a kernel issue and make the semantics of
>   get_user_pages() easier to understand. Note that obviously a true
>   shared mapping will still get a page that can change under us, so this
>   does _not_ mean that get_user_pages() somehow returns any "stable"
>   page ]
>
> [surenb: backport notes]
> Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_pgd_range.
> Removed FOLL_PIN usage in should_force_cow_break since it's missing in
> the earlier kernels.
>
> Reported-by: Jann Horn <jannh@google.com>
> Tested-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: Kirill Shutemov <kirill@shutemov.name>
> Acked-by: Jan Kara <jack@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> [surenb: backport to 4.19 kernel]
> Cc: stable@vger.kernel.org # 4.19.x
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/gup.c         | 44 ++++++++++++++++++++++++++++++++++++++------
>  mm/huge_memory.c |  7 +++----
>  2 files changed, 41 insertions(+), 10 deletions(-)

Thanks for these backports, I've now queued them up.

greg k-h
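To make the described semantics concrete, here is a minimal, hypothetical kernel-side sketch (not part of the thread or the patch) of the kind of read-only pin the commit message is talking about, using the 4.19-era `get_user_pages_fast()` signature that this backport targets; the helper name `pin_user_page_for_read()` is made up for illustration.

```c
#include <linux/err.h>
#include <linux/mm.h>

/*
 * Hypothetical helper: pin one user page for reading.
 *
 * Before this patch, if @addr lives in a private COW mapping shared
 * with a fork() child, the page returned here could later end up on
 * the "wrong" side of the COW and stop belonging to the calling mm,
 * even though we still hold a reference on it.
 */
static struct page *pin_user_page_for_read(unsigned long addr)
{
	struct page *page;
	int ret;

	/* 4.19 signature: (start, nr_pages, write, pages); write == 0 here */
	ret = get_user_pages_fast(addr, 1, 0, &page);
	if (ret < 1)
		return ERR_PTR(ret < 0 ? ret : -EFAULT);

	return page;	/* caller must put_page() when done */
}
```

With the patch applied, this write=0 call no longer follows a read-only COW pte in the fast path; it falls back to the slow path, where should_force_cow_break() turns on FOLL_WRITE and breaks the COW up front, so the returned page belongs to the calling mm.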
```diff
diff --git a/mm/gup.c b/mm/gup.c
index f3088d25bd92..44569927f0ea 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -710,12 +719,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (!vma || check_vma_flags(vma, gup_flags))
 				return i ? : -EFAULT;
 			if (is_vm_hugetlb_page(vma)) {
+				if (should_force_cow_break(vma, foll_flags))
+					foll_flags |= FOLL_WRITE;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags, nonblocking);
+						foll_flags, nonblocking);
 				continue;
 			}
 		}
+
+		if (should_force_cow_break(vma, foll_flags))
+			foll_flags |= FOLL_WRITE;
+
 retry:
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -1804,6 +1819,10 @@ bool gup_fast_permitted(unsigned long start, int nr_pages, int write)
  * the regular GUP.
  * Note a difference with get_user_pages_fast: this always returns the
  * number of pages pinned, 0 if no pages were pinned.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -1831,6 +1850,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
@@ -1876,9 +1901,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					(void __user *)start, len)))
 		return -EFAULT;
 
+	/*
+	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+	 * because get_user_pages() may need to cause an early COW in
+	 * order to avoid confusing the normal COW routines. So only
+	 * targets that are already writable are safe to do by just
+	 * looking at the page tables.
+	 */
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, write, pages, &nr);
+		gup_pgd_range(addr, end, 1, pages, &nr);
 		local_irq_enable();
 		ret = nr;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf9e2bbffdc1..7c374c0fcf0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1432,13 +1432,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
```
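For the cooperative fork() situation the commit message mentions, a hypothetical userspace illustration follows; the file path is made up, and it assumes an O_DIRECT-capable filesystem so that the direct-I/O write path pins the source buffer with a read-only GUP.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Page-sized buffer in a private anonymous mapping (COW after fork). */
	char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 'A', 4096);

	if (fork() == 0) {
		buf[0] = 'B';	/* child's write breaks the COW on this page */
		_exit(0);
	}

	/*
	 * O_DIRECT write: the direct-I/O path pins 'buf' with a read-only
	 * GUP. Before the patch, the pinned page could end up belonging to
	 * either side of the fork() depending on how the COW above breaks;
	 * with the patch, the GUP itself breaks the COW for this process
	 * before the page is handed to the I/O path.
	 */
	int fd = open("/tmp/gup-cow-demo", O_CREAT | O_WRONLY | O_DIRECT, 0600);
	if (fd < 0)
		return 1;
	if (write(fd, buf, 4096) != 4096)
		perror("write");
	close(fd);
	return 0;
}
```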