
[5.4] mm/gup: Do not force a COW break on file-backed memory

Message ID 20211201231757.332199-1-willy@infradead.org (mailing list archive)
State New
Series [5.4] mm/gup: Do not force a COW break on file-backed memory

Commit Message

Matthew Wilcox (Oracle) Dec. 1, 2021, 11:17 p.m. UTC
Commit 17839856fd58 ("gup: document and work around "COW can break either
way" issue") forces a COW break, even for read-only GUP.  This interacts
badly with CONFIG_READ_ONLY_THP_FOR_FS as it tries to write to a read-only
PMD and follow_trans_huge_pmd() returns NULL which induces an endless
loop as __get_user_pages() interprets that as page-not-present, tries
to fault it in and retries the follow_page_mask().

The issues fixed by 17839856fd58 don't apply to files.  We know which way
the COW breaks; the page cache keeps the original and any modifications
are private to that process.  There's no optimisation that allows a
process to reuse a file-backed MAP_PRIVATE page.  So we can skip the
breaking of the COW for file-backed mappings.

This problem only exists in v5.4.y; other stable kernels either predate
CONFIG_READ_ONLY_THP_FOR_FS or they include commit a308c71bf1e6 ("mm/gup:
Remove enfornced COW mechanism").

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
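
For context, a heavily abridged sketch of the cycle the commit message
describes, loosely modelled on v5.4's __get_user_pages() (not the verbatim
source; error handling and the hugetlb path are omitted):

	/* 17839856fd58 forces a write lookup even for read-only GUP. */
	if (should_force_cow_break(vma, foll_flags))
		foll_flags |= FOLL_WRITE;
retry:
	/*
	 * With CONFIG_READ_ONLY_THP_FOR_FS the address is mapped by a
	 * read-only, file-backed PMD; follow_trans_huge_pmd() refuses
	 * the forced write and returns NULL.
	 */
	page = follow_page_mask(vma, start, foll_flags, &ctx);
	if (!page) {
		/*
		 * NULL is taken as page-not-present, so fault it in.  The
		 * fault does not resolve anything for the read-only PMD,
		 * so the next lookup fails the same way: endless loop.
		 */
		ret = faultin_page(tsk, vma, start, &foll_flags, nonblocking);
		if (ret == 0)
			goto retry;
	}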

Comments

Jann Horn Dec. 2, 2021, 3:51 a.m. UTC | #1
On Thu, Dec 2, 2021 at 12:18 AM Matthew Wilcox (Oracle)
<willy@infradead.org> wrote:
> Commit 17839856fd58 ("gup: document and work around "COW can break either
> way" issue") forces a COW break, even for read-only GUP.  This interacts
> badly with CONFIG_READ_ONLY_THP_FOR_FS as it tries to write to a read-only
> PMD and follow_trans_huge_pmd() returns NULL which induces an endless
> loop as __get_user_pages() interprets that as page-not-present, tries
> to fault it in and retries the follow_page_mask().
>
> The issues fixed by 17839856fd58 don't apply to files.  We know which way
> the COW breaks; the page cache keeps the original and any modifications
> are private to that process.  There's no optimisation that allows a
> process to reuse a file-backed MAP_PRIVATE page.  So we can skip the
> breaking of the COW for file-backed mappings.
>
> This problem only exists in v5.4.y; other stable kernels either predate
> CONFIG_READ_ONLY_THP_FOR_FS or they include commit a308c71bf1e6 ("mm/gup:
> Remove enfornced COW mechanism").
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/gup.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 3ef769529548..d55e02411010 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -176,7 +176,8 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>   */
>  static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
>  {
> -       return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
> +       return is_cow_mapping(vma->vm_flags) && vma_is_anonymous(vma) &&
> +               (flags & FOLL_GET);
>  }

To be fully correct, the check would have to check for PageAnon(), not
whether the mapping is anonymous, right? Since a private file mapping
can still contain anonymous pages from a prior CoW?
Matthew Wilcox (Oracle) Dec. 2, 2021, 4:11 a.m. UTC | #2
On Thu, Dec 02, 2021 at 04:51:47AM +0100, Jann Horn wrote:
> On Thu, Dec 2, 2021 at 12:18 AM Matthew Wilcox (Oracle)
> <willy@infradead.org> wrote:
> > Commit 17839856fd58 ("gup: document and work around "COW can break either
> > way" issue") forces a COW break, even for read-only GUP.  This interacts
> > badly with CONFIG_READ_ONLY_THP_FOR_FS as it tries to write to a read-only
> > PMD and follow_trans_huge_pmd() returns NULL which induces an endless
> > loop as __get_user_pages() interprets that as page-not-present, tries
> > to fault it in and retries the follow_page_mask().
> >
> > The issues fixed by 17839856fd58 don't apply to files.  We know which way
> > the COW breaks; the page cache keeps the original and any modifications
> > are private to that process.  There's no optimisation that allows a
> > process to reuse a file-backed MAP_PRIVATE page.  So we can skip the
> > breaking of the COW for file-backed mappings.
> >
> > This problem only exists in v5.4.y; other stable kernels either predate
> > CONFIG_READ_ONLY_THP_FOR_FS or they include commit a308c71bf1e6 ("mm/gup:
> > Remove enfornced COW mechanism").
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  mm/gup.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 3ef769529548..d55e02411010 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -176,7 +176,8 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> >   */
> >  static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> >  {
> > -       return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
> > +       return is_cow_mapping(vma->vm_flags) && vma_is_anonymous(vma) &&
> > +               (flags & FOLL_GET);
> >  }
> 
> To be fully correct, the check would have to check for PageAnon(), not
> whether the mapping is anonymous, right? Since a private file mapping
> can still contain anonymous pages from a prior CoW?

Oh, right.  So parent process maps a file with MAP_PRIVATE, writes to
it, gets an anon page, forks.  Child stuffs the page into a pipe,
unmaps page.  Parent writes to page again, now child can read() the
modification?
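
A hypothetical userspace sketch of that sequence (untested; error handling
omitted, the file name, sizes and sleep-based ordering are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/etc/passwd", O_RDONLY);	/* any readable file */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE, fd, 0);
	int pipefd[2];
	char buf[8] = { 0 };

	p[0] = 'A';	/* write => COW => the page is now PageAnon() */
	pipe(pipefd);

	if (fork() == 0) {
		/* Child: read-only GUP pin of the anon page. */
		struct iovec iov = { .iov_base = p, .iov_len = 5 };

		vmsplice(pipefd[1], &iov, 1, 0);
		munmap(p, 4096);
		sleep(1);	/* crude ordering: let the parent write first */
		read(pipefd[0], buf, 5);
		printf("child sees: %s\n", buf);
		_exit(0);
	}

	usleep(100 * 1000);	/* after the child has pinned and unmapped */
	memcpy(p, "SECRT", 5);	/* parent's later, supposedly private, write */
	wait(NULL);
	return 0;
}

Whether the child reads the original bytes or "SECRT" depends on whether
the parent's second write reuses the still-pinned page or copies away
from it.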

The problem is that we don't even get to seeing the struct page with
the current code paths.  And we're looking for a fix for RO THP that's
less intrusive for v5.4 than backporting

09854ba94c6a ("mm: do_wp_page() simplification")
1a0cf26323c8 ("mm/ksm: Remove reuse_ksm_page()")
a308c71bf1e6 ("mm/gup: Remove enfornced COW mechanism")

The other patch we've been kicking around (and works) is:

 static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
 {
-       return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
+       return is_cow_mapping(vma->vm_flags) &&
+               (!(vma->vm_flags & VM_DENYWRITE)) && (flags & FOLL_GET);
 }

That limits the change to text pages only.  Generally programs do
not write to their text pages, and they certainly don't write *secrets*
to their text pages; if somebody else can read one, that's probably not
a problem in the same way it would be for a page of heap.
Jann Horn Dec. 2, 2021, 4:33 a.m. UTC | #3
On Thu, Dec 2, 2021 at 5:11 AM Matthew Wilcox <willy@infradead.org> wrote:
> On Thu, Dec 02, 2021 at 04:51:47AM +0100, Jann Horn wrote:
> > On Thu, Dec 2, 2021 at 12:18 AM Matthew Wilcox (Oracle)
> > <willy@infradead.org> wrote:
> > > Commit 17839856fd58 ("gup: document and work around "COW can break either
> > > way" issue") forces a COW break, even for read-only GUP.  This interacts
> > > badly with CONFIG_READ_ONLY_THP_FOR_FS as it tries to write to a read-only
> > > PMD and follow_trans_huge_pmd() returns NULL which induces an endless
> > > loop as __get_user_pages() interprets that as page-not-present, tries
> > > to fault it in and retries the follow_page_mask().
> > >
> > > The issues fixed by 17839856fd58 don't apply to files.  We know which way
> > > the COW breaks; the page cache keeps the original and any modifications
> > > are private to that process.  There's no optimisation that allows a
> > > process to reuse a file-backed MAP_PRIVATE page.  So we can skip the
> > > breaking of the COW for file-backed mappings.
> > >
> > > This problem only exists in v5.4.y; other stable kernels either predate
> > > CONFIG_READ_ONLY_THP_FOR_FS or they include commit a308c71bf1e6 ("mm/gup:
> > > Remove enfornced COW mechanism").
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > ---
> > >  mm/gup.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 3ef769529548..d55e02411010 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -176,7 +176,8 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
> > >   */
> > >  static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> > >  {
> > > -       return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
> > > +       return is_cow_mapping(vma->vm_flags) && vma_is_anonymous(vma) &&
> > > +               (flags & FOLL_GET);
> > >  }
> >
> > To be fully correct, the check would have to check for PageAnon(), not
> > whether the mapping is anonymous, right? Since a private file mapping
> > can still contain anonymous pages from a prior CoW?
>
> Oh, right.  So parent process maps a file with MAP_PRIVATE, writes to
> it, gets an anon page, forks.  Child stuffs the page into a pipe,
> unmaps page.  Parent writes to page again, now child can read() the
> modification?

Yeah - in theory that could happen e.g. with an ELF's .data section?
Those end up as writable private file mappings.

(I don't know whether that actually has real-world relevance though,
I'm just saying it's semantically off in theory.)

> The problem is that we don't even get to seeing the struct page with
> the current code paths.  And we're looking for a fix for RO THP that's
> less intrusive for v5.4 than backporting
>
> 09854ba94c6a ("mm: do_wp_page() simplification")
> 1a0cf26323c8 ("mm/ksm: Remove reuse_ksm_page()")
> a308c71bf1e6 ("mm/gup: Remove enfornced COW mechanism")
>
> The other patch we've been kicking around (and works) is:
>
>  static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
>  {
> -       return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
> +       return is_cow_mapping(vma->vm_flags) &&
> +               (!(vma->vm_flags & VM_DENYWRITE)) && (flags & FOLL_GET);
>  }
>
> That limits the change to text pages only.  Generally programs do
> not write to their text pages, and they certainly don't write *secrets*
> to their text pages; if somebody else can read one, that's probably not
> a problem in the same way it would be for a page of heap.

Hm, yeah. It's not exactly beautiful, but I guess it should do the job
for fixing stable...

It's a good thing that VM_DENYWRITE still exists in the 5.4 branch. ^^
Linus Torvalds Dec. 2, 2021, 6:54 p.m. UTC | #4
On Wed, Dec 1, 2021 at 8:11 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> The other patch we've been kicking around (and works) is:
>
>  static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
>  {
> -       return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
> +       return is_cow_mapping(vma->vm_flags) &&
> +               (!(vma->vm_flags & VM_DENYWRITE)) && (flags & FOLL_GET);
>  }

That patch makes no sense to me.

It may "work", but it doesn't actually do anything sensible or really
fix the problem that I can tell.

I suspect a real fix would be bigger and more invasive.

If the answer is not to backport all the other changes (and they were
_really_ invasive), I think one answer may be to simply move the
"should_force_cow_break()" down to below the point where you've looked
up the page.

Then you can actually look at "is this a file mapped page", and say
"if so, that's ok, we can return it as-is".

Otherwise, you do something like

        foll_flags |= FOLL_WRITE;
        free_page(page);
        goto repeat;

to repeat the loop (now with FOLL_WRITE).

So the patch is bigger and more involved, because you would have done
the page lookup (for reading) and now notice "Oh, I need it for
writing instead", so you need to undo and re-do.

But at least - unlike backporting everything else - it would be
limited to that one __get_user_pages() function.

Hmm?

(And you'd need to handle that follow_hugetlb_page() case too, not
just the follow_page_mask() one.)

             Linus
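
A rough sketch of that shape (hypothetical and untested; modelled on
v5.4's __get_user_pages() loop, with ctx, tsk and nonblocking taken from
that function, and put_page() assumed as the way to drop the reference
from the read lookup):

retry:
	page = follow_page_mask(vma, start, foll_flags, &ctx);
	if (!page) {
		ret = faultin_page(tsk, vma, start, &foll_flags, nonblocking);
		if (ret == 0)
			goto retry;
		/* ... existing error handling ... */
	}
	/* ... existing IS_ERR(page) handling ... */

	/*
	 * Only force the COW break once we can see what we got: a
	 * file-backed page can be returned as-is, an anonymous page in
	 * a COW mapping gets the lookup redone with FOLL_WRITE.
	 */
	if (!(foll_flags & FOLL_WRITE) && (foll_flags & FOLL_GET) &&
	    is_cow_mapping(vma->vm_flags) && PageAnon(page)) {
		put_page(page);
		foll_flags |= FOLL_WRITE;
		goto retry;
	}

As noted, follow_hugetlb_page() would need the same treatment.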
Matthew Wilcox (Oracle) Dec. 2, 2021, 7:59 p.m. UTC | #5
On Thu, Dec 02, 2021 at 10:54:48AM -0800, Linus Torvalds wrote:
> On Wed, Dec 1, 2021 at 8:11 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > The other patch we've been kicking around (and works) is:
> >
> >  static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> >  {
> > -       return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
> > +       return is_cow_mapping(vma->vm_flags) &&
> > +               (!(vma->vm_flags & VM_DENYWRITE)) && (flags & FOLL_GET);
> >  }
> 
> That patch makes no sense to me.
> 
> It may "work", but it doesn't actually do anything sensible or really
> fix the problem that I can tell.

Oh absolutely, it's semantically nonsense.  The only reason it fixes the
problem is that VM_DENYWRITE VMAs are the only ones considered for the
RO_THP merging, so they're the only ones which we've seen causing a
problem.

> I suspect a real fix would be bigger and more invasive.

Darn.  I was hoping you were going to say something like "The real
problem is follow_trans_huge_pmd() is complete garbage and it should
just do X, Y and Z".  Or "When we force on FOLL_WRITE, we should also
force on FOLL_SPLIT_PMD".

> If the answer is not to backport all the other changes (and they were
> _really_ invasive), I think one answer may be to simply move the
> "should_force_cow_break()" down to below the point where you've looked
> up the page.
> 
> Then you can actually look at "is this a file mapped page", and say
> "if so, that's ok, we can return it as-is".
> 
> Otherwise, you do something like
> 
>         foll_flags |= FOLL_WRITE;
>         free_page(page);
>         goto repeat;
> 
> to repeat the loop (now with FOLL_WRITE).
> 
> So the patch is bigger and more involved, because you would have done
> the page lookup (for reading) and now notice "Oh, I need it for
> writing instead", so you need to undo and re-do.
> 
> But at least - unlike backporting everything else - it would be
> limited to that one __get_user_pages() function.
> 
> Hmm?
> 
> (And you'd need to handle that follow_hugetlb_page() case too, not
> just the follow_page_mask() one.)

Thanks, I'll take a look.
Linus Torvalds Dec. 2, 2021, 10:33 p.m. UTC | #6
On Thu, Dec 2, 2021 at 11:59 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Oh absolutely, it's semantically nonsense.  The only reason it fixes the
> problem is that VM_DENYWRITE VMAs are the only ones considered for the
> RO_THP merging, so they're the only ones which we've seen causing a
> problem.

That would be a semantically meaningful argument, but I think the
reverse isn't true: regular pages in VM_DENYWRITE vmas - that aren't
using the RO_THP thing - are open to the same old "COW wrong way"
issue.

So it's not like VM_DENYWRITE is really meaningful for the
conditional, even if it's perhaps a prerequisite for it being a
problem.

> > I suspect a real fix would be bigger and more invasive.
>
> Darn.  I was hoping you were going to say something like "The real
> problem is follow_trans_huge_pmd() is complete garbage and it should
> just do X, Y and Z".  Or "When we force on FOLL_WRITE, we should also
> force on FOLL_SPLIT_PMD".

Well, maybe that "FOLL_SPLIT_PMD" thing would indeed be a valid thing?

But I _think_ that it shouldn't be too painful to do the
should_force_cow_break() call later, when you actually have the exact
page details, so while not exactly a one-liner, I hope that approach
would end up working out.

I only gave it a quick look, though, I might be missing something.

              Linus

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 3ef769529548..d55e02411010 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -176,7 +176,8 @@  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
  */
 static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
 {
-	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
+	return is_cow_mapping(vma->vm_flags) && vma_is_anonymous(vma) &&
+		(flags & FOLL_GET);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,