mm/khugepaged: Fix ->anon_vma race

Message ID	20230111133351.807024-1-jannh@google.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Jann Horn <jannh@google.com> To: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Zach O'Keefe <zokeefe@google.com>, linux-kernel@vger.kernel.org, David Hildenbrand <david@redhat.com>, Yang Shi <shy828301@gmail.com> Subject: [PATCH] mm/khugepaged: Fix ->anon_vma race Date: Wed, 11 Jan 2023 14:33:51 +0100 Message-Id: <20230111133351.807024-1-jannh@google.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm/khugepaged: Fix ->anon_vma race \| expand mm/khugepaged: Fix ->anon_vma race

Jann Horn Jan. 11, 2023, 1:33 p.m. UTC

If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
it to be locked. retract_page_tables() bails out if an ->anon_vma is
attached, but does this check before holding the mmap lock (as the comment
above the check explains).

If we racily merge an existing ->anon_vma (shared with a child process)
from a neighboring VMA, subsequent rmap traversals on pages belonging to
the child will be able to see the page tables that we are concurrently
removing while assuming that nothing else can access them.

Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
really is no concurrent page table access.

Reported-by: Zach O'Keefe <zokeefe@google.com>
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
---
zokeefe@ pointed out to me that the current code (after my last round of patches)
can hit a lockdep assert by racing, and after staring at it a bit I've
convinced myself that this is a real, preexisting bug.
(I haven't written a reproducer for it though. One way to hit it might be
something along the lines of:

 - set up a process A with a private-file-mapping VMA V1
 - let A fork() to create process B, thereby copying V1 in A to V1' in B
 - let B extend the end of V1'
 - let B put some anon pages into the extended part of V1'
 - let A map a new private-file-mapping VMA V2 directly behind V1, without
   an anon_vma
[race begins here]
  - in A's thread 1: begin retract_page_tables() on V2, run through first
    ->anon_vma check
  - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it
    merges the anon_vma of V1 (which implies V1 and V2 must be mapping the
    same file at compatible offsets)
  - in B: trigger rmap traversal on anon page in V1'

 mm/khugepaged.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)


base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25

Yang Shi Jan. 12, 2023, 1:06 a.m. UTC | #1

On Wed, Jan 11, 2023 at 5:33 AM Jann Horn <jannh@google.com> wrote:
>
> If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
> it to be locked. retract_page_tables() bails out if an ->anon_vma is
> attached, but does this check before holding the mmap lock (as the comment
> above the check explains).
>
> If we racily merge an existing ->anon_vma (shared with a child process)
> from a neighboring VMA, subsequent rmap traversals on pages belonging to
> the child will be able to see the page tables that we are concurrently
> removing while assuming that nothing else can access them.
>
> Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
> really is no concurrent page table access.
>
> Reported-by: Zach O'Keefe <zokeefe@google.com>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
> zokeefe@ pointed out to me that the current code (after my last round of patches)
> can hit a lockdep assert by racing, and after staring at it a bit I've

I'm supposed the lockdep is the one in collapse_and_free_pmd(). It is
better to have the splat included in the commit log.

> convinced myself that this is a real, preexisting bug.
> (I haven't written a reproducer for it though. One way to hit it might be
> something along the lines of:
>
>  - set up a process A with a private-file-mapping VMA V1
>  - let A fork() to create process B, thereby copying V1 in A to V1' in B
>  - let B extend the end of V1'
>  - let B put some anon pages into the extended part of V1'

I don't quite get why we need this step. A cow fault on A's V1 isn't
enough to have anon_vma for V1? This should not prevent V1 and V2 from
sharing anon_vma. Did I miss something?

>  - let A map a new private-file-mapping VMA V2 directly behind V1, without
>    an anon_vma
> [race begins here]
>   - in A's thread 1: begin retract_page_tables() on V2, run through first
>     ->anon_vma check
>   - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it
>     merges the anon_vma of V1 (which implies V1 and V2 must be mapping the
>     same file at compatible offsets)
>   - in B: trigger rmap traversal on anon page in V1'
>
>  mm/khugepaged.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5cb401aa2b9d..0bfed37f3a3b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>                  * has higher cost too. It would also probably require locking
>                  * the anon_vma.
>                  */
> -               if (vma->anon_vma) {
> +               if (READ_ONCE(vma->anon_vma)) {
>                         result = SCAN_PAGE_ANON;
>                         goto next;
>                 }
> @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>                 result = SCAN_PTE_MAPPED_HUGEPAGE;
>                 if ((cc->is_khugepaged || is_target) &&
>                     mmap_write_trylock(mm)) {
> +                       /*
> +                        * Re-check whether we have an ->anon_vma, because
> +                        * collapse_and_free_pmd() requires that either no
> +                        * ->anon_vma exists or the anon_vma is locked.
> +                        * We already checked ->anon_vma above, but that check
> +                        * is racy because ->anon_vma can be populated under the
> +                        * mmap lock in read mode.
> +                        */
> +                       if (vma->anon_vma) {
> +                               result = SCAN_PAGE_ANON;
> +                               goto unlock_next;
> +                       }
>                         /*
>                          * When a vma is registered with uffd-wp, we can't
>                          * recycle the pmd pgtable because there can be pte
>
> base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25
> --
> 2.39.0.314.g84b9a713c41-goog
>

Kirill A. Shutemov Jan. 12, 2023, 8:56 a.m. UTC | #2

On Wed, Jan 11, 2023 at 02:33:51PM +0100, Jann Horn wrote:
> If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
> it to be locked. retract_page_tables() bails out if an ->anon_vma is
> attached, but does this check before holding the mmap lock (as the comment
> above the check explains).
> 
> If we racily merge an existing ->anon_vma (shared with a child process)
> from a neighboring VMA, subsequent rmap traversals on pages belonging to
> the child will be able to see the page tables that we are concurrently
> removing while assuming that nothing else can access them.
> 
> Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
> really is no concurrent page table access.
> 
> Reported-by: Zach O'Keefe <zokeefe@google.com>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
> zokeefe@ pointed out to me that the current code (after my last round of patches)
> can hit a lockdep assert by racing, and after staring at it a bit I've
> convinced myself that this is a real, preexisting bug.
> (I haven't written a reproducer for it though. One way to hit it might be
> something along the lines of:
> 
>  - set up a process A with a private-file-mapping VMA V1
>  - let A fork() to create process B, thereby copying V1 in A to V1' in B
>  - let B extend the end of V1'
>  - let B put some anon pages into the extended part of V1'

At this point V1' gets it's own ->anon_vma, not connected to V1, right?

>  - let A map a new private-file-mapping VMA V2 directly behind V1, without
>    an anon_vma
> [race begins here]
>   - in A's thread 1: begin retract_page_tables() on V2, run through first
>     ->anon_vma check
>   - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it
>     merges the anon_vma of V1 (which implies V1 and V2 must be mapping the
>     same file at compatible offsets)
>   - in B: trigger rmap traversal on anon page in V1'

I don't follow the race. rmap on V1' will not reach V1.

>  mm/khugepaged.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5cb401aa2b9d..0bfed37f3a3b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>  		 * has higher cost too. It would also probably require locking
>  		 * the anon_vma.
>  		 */
> -		if (vma->anon_vma) {
> +		if (READ_ONCE(vma->anon_vma)) {
>  			result = SCAN_PAGE_ANON;
>  			goto next;
>  		}

This makes perfect sense. At least for readability. But I think
false-negative should not lead to bad results.

> @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>  		result = SCAN_PTE_MAPPED_HUGEPAGE;
>  		if ((cc->is_khugepaged || is_target) &&
>  		    mmap_write_trylock(mm)) {
> +			/*
> +			 * Re-check whether we have an ->anon_vma, because
> +			 * collapse_and_free_pmd() requires that either no
> +			 * ->anon_vma exists or the anon_vma is locked.
> +			 * We already checked ->anon_vma above, but that check
> +			 * is racy because ->anon_vma can be populated under the
> +			 * mmap lock in read mode.
> +			 */
> +			if (vma->anon_vma) {
> +				result = SCAN_PAGE_ANON;
> +				goto unlock_next;
> +			}

This is totally wrong direction. Or I don't understand the race.

At this point we already paid nearly all price of of pagetable retraction.
I don't see any correctness reason to stop here, except for the assert.

I think lockdep assert in collapse_and_free_pmd() is wrong and has to be
dropped.

>  			/*
>  			 * When a vma is registered with uffd-wp, we can't
>  			 * recycle the pmd pgtable because there can be pte
> 
> base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25
> -- 
> 2.39.0.314.g84b9a713c41-goog
>

Yang Shi Jan. 12, 2023, 6:12 p.m. UTC | #3

On Thu, Jan 12, 2023 at 12:56 AM Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> On Wed, Jan 11, 2023 at 02:33:51PM +0100, Jann Horn wrote:
> > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
> > it to be locked. retract_page_tables() bails out if an ->anon_vma is
> > attached, but does this check before holding the mmap lock (as the comment
> > above the check explains).
> >
> > If we racily merge an existing ->anon_vma (shared with a child process)
> > from a neighboring VMA, subsequent rmap traversals on pages belonging to
> > the child will be able to see the page tables that we are concurrently
> > removing while assuming that nothing else can access them.
> >
> > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
> > really is no concurrent page table access.
> >
> > Reported-by: Zach O'Keefe <zokeefe@google.com>
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jann Horn <jannh@google.com>
> > ---
> > zokeefe@ pointed out to me that the current code (after my last round of patches)
> > can hit a lockdep assert by racing, and after staring at it a bit I've
> > convinced myself that this is a real, preexisting bug.
> > (I haven't written a reproducer for it though. One way to hit it might be
> > something along the lines of:
> >
> >  - set up a process A with a private-file-mapping VMA V1
> >  - let A fork() to create process B, thereby copying V1 in A to V1' in B
> >  - let B extend the end of V1'
> >  - let B put some anon pages into the extended part of V1'
>
> At this point V1' gets it's own ->anon_vma, not connected to V1, right?

This is what I got confused too.

>
> >  - let A map a new private-file-mapping VMA V2 directly behind V1, without
> >    an anon_vma
> > [race begins here]
> >   - in A's thread 1: begin retract_page_tables() on V2, run through first
> >     ->anon_vma check
> >   - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it
> >     merges the anon_vma of V1 (which implies V1 and V2 must be mapping the
> >     same file at compatible offsets)
> >   - in B: trigger rmap traversal on anon page in V1'
>
> I don't follow the race. rmap on V1' will not reach V1.
>
> >  mm/khugepaged.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5cb401aa2b9d..0bfed37f3a3b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >                * has higher cost too. It would also probably require locking
> >                * the anon_vma.
> >                */
> > -             if (vma->anon_vma) {
> > +             if (READ_ONCE(vma->anon_vma)) {
> >                       result = SCAN_PAGE_ANON;
> >                       goto next;
> >               }
>
> This makes perfect sense. At least for readability. But I think
> false-negative should not lead to bad results.
>
> > @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >               result = SCAN_PTE_MAPPED_HUGEPAGE;
> >               if ((cc->is_khugepaged || is_target) &&
> >                   mmap_write_trylock(mm)) {
> > +                     /*
> > +                      * Re-check whether we have an ->anon_vma, because
> > +                      * collapse_and_free_pmd() requires that either no
> > +                      * ->anon_vma exists or the anon_vma is locked.
> > +                      * We already checked ->anon_vma above, but that check
> > +                      * is racy because ->anon_vma can be populated under the
> > +                      * mmap lock in read mode.
> > +                      */
> > +                     if (vma->anon_vma) {
> > +                             result = SCAN_PAGE_ANON;
> > +                             goto unlock_next;
> > +                     }
>
> This is totally wrong direction. Or I don't understand the race.
>
> At this point we already paid nearly all price of of pagetable retraction.
> I don't see any correctness reason to stop here, except for the assert.

Isn't it possible that collapse_and_free_pmd() clear the pmd which may
point to a PTE which maps the COW'ed anon page if this race happens?

>
> I think lockdep assert in collapse_and_free_pmd() is wrong and has to be
> dropped.
>
> >                       /*
> >                        * When a vma is registered with uffd-wp, we can't
> >                        * recycle the pmd pgtable because there can be pte
> >
> > base-commit: 7dd4b804e08041ff56c88bdd8da742d14b17ed25
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Kirill A. Shutemov Jan. 13, 2023, 12:10 a.m. UTC | #4

On Thu, Jan 12, 2023 at 10:12:53AM -0800, Yang Shi wrote:
> > This is totally wrong direction. Or I don't understand the race.
> >
> > At this point we already paid nearly all price of of pagetable retraction.
> > I don't see any correctness reason to stop here, except for the assert.
> 
> Isn't it possible that collapse_and_free_pmd() clear the pmd which may
> point to a PTE which maps the COW'ed anon page if this race happens?

No. At this point we have huge page in the place in the page cache and it
is locked. COW fault would serialize on the page lock.

Yang Shi Jan. 13, 2023, 3:22 a.m. UTC | #5

On Thu, Jan 12, 2023 at 4:10 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Thu, Jan 12, 2023 at 10:12:53AM -0800, Yang Shi wrote:
> > > This is totally wrong direction. Or I don't understand the race.
> > >
> > > At this point we already paid nearly all price of of pagetable retraction.
> > > I don't see any correctness reason to stop here, except for the assert.
> >
> > Isn't it possible that collapse_and_free_pmd() clear the pmd which may
> > point to a PTE which maps the COW'ed anon page if this race happens?
>
> No. At this point we have huge page in the place in the page cache and it
> is locked. COW fault would serialize on the page lock.

Aha, yeah, I missed the page lock. So IIUC even though anon_vma is set
there should be not any COW'ed anon pages since cow fault would be
blocked by the page lock. If so I tend to agree with you the lockdep
assert seems unnecessary.

>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Jann Horn Jan. 13, 2023, 7:28 p.m. UTC | #6

On Thu, Jan 12, 2023 at 9:56 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Wed, Jan 11, 2023 at 02:33:51PM +0100, Jann Horn wrote:
> > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
> > it to be locked. retract_page_tables() bails out if an ->anon_vma is
> > attached, but does this check before holding the mmap lock (as the comment
> > above the check explains).
> >
> > If we racily merge an existing ->anon_vma (shared with a child process)
> > from a neighboring VMA, subsequent rmap traversals on pages belonging to
> > the child will be able to see the page tables that we are concurrently
> > removing while assuming that nothing else can access them.
> >
> > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
> > really is no concurrent page table access.
> >
> > Reported-by: Zach O'Keefe <zokeefe@google.com>
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jann Horn <jannh@google.com>
> > ---
> > zokeefe@ pointed out to me that the current code (after my last round of patches)
> > can hit a lockdep assert by racing, and after staring at it a bit I've
> > convinced myself that this is a real, preexisting bug.
> > (I haven't written a reproducer for it though. One way to hit it might be
> > something along the lines of:
> >
> >  - set up a process A with a private-file-mapping VMA V1
> >  - let A fork() to create process B, thereby copying V1 in A to V1' in B
> >  - let B extend the end of V1'
> >  - let B put some anon pages into the extended part of V1'
>
> At this point V1' gets it's own ->anon_vma, not connected to V1, right?

Yes, true, it doesn't work in a child process because
reusable_anon_vma() refuses to reuse the anon_vma of an adjacent VMA
if that VMA is attached to more than one VMA. So to actually make it
work, you have to rearrange things a bit such that this happens in the
parent process, so that prepare_anon_vma() will reuse a neighboring
anon_vma via reusable_anon_vma().

Sorry, my description of the race was slightly off. See the attached
code for how to actually do it. Basically the trick is to use mremap()
to take one private file VMA with an anon_vma and turn it into two
private file VMAs ("mapping1" and "mapping2" in my code) with the same
anon_vma that map the same file range. Put an anonymous page into
mapping1. Then you can replace the first 2 MiB of "mapping2" with
another VMA that is not mergeable into mapping2, is
anon_vma-compatible with mapping2, has no anon_vma, and looks
THP-collapsible. And then you can, while khugepaged is in the race
window, attach the new VMA to the existing anon_vma and trigger an
rmap walk on the anonymous page.

> >  - let A map a new private-file-mapping VMA V2 directly behind V1, without
> >    an anon_vma
> > [race begins here]
> >   - in A's thread 1: begin retract_page_tables() on V2, run through first
> >     ->anon_vma check
> >   - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it
> >     merges the anon_vma of V1 (which implies V1 and V2 must be mapping the
> >     same file at compatible offsets)
> >   - in B: trigger rmap traversal on anon page in V1'
>
> I don't follow the race. rmap on V1' will not reach V1.
>
> >  mm/khugepaged.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5cb401aa2b9d..0bfed37f3a3b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >                * has higher cost too. It would also probably require locking
> >                * the anon_vma.
> >                */
> > -             if (vma->anon_vma) {
> > +             if (READ_ONCE(vma->anon_vma)) {
> >                       result = SCAN_PAGE_ANON;
> >                       goto next;
> >               }
>
> This makes perfect sense. At least for readability. But I think
> false-negative should not lead to bad results.
>
> > @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >               result = SCAN_PTE_MAPPED_HUGEPAGE;
> >               if ((cc->is_khugepaged || is_target) &&
> >                   mmap_write_trylock(mm)) {
> > +                     /*
> > +                      * Re-check whether we have an ->anon_vma, because
> > +                      * collapse_and_free_pmd() requires that either no
> > +                      * ->anon_vma exists or the anon_vma is locked.
> > +                      * We already checked ->anon_vma above, but that check
> > +                      * is racy because ->anon_vma can be populated under the
> > +                      * mmap lock in read mode.
> > +                      */
> > +                     if (vma->anon_vma) {
> > +                             result = SCAN_PAGE_ANON;
> > +                             goto unlock_next;
> > +                     }
>
> This is totally wrong direction. Or I don't understand the race.
>
> At this point we already paid nearly all price of of pagetable retraction.
> I don't see any correctness reason to stop here, except for the assert.

This condition can only be hit with a narrow race, since vma->anon_vma
was NULL a moment ago. So this branch will only run very rarely, which
means it is unlikely to have an effect on performance.

> I think lockdep assert in collapse_and_free_pmd() is wrong and has to be
> dropped.

No, that lockdep assert has to be there. Page table traversal is
allowed under any one of the mmap lock, the anon_vma lock (if the VMA
is associated with an anon_vma), and the mapping lock (if the VMA is
associated with a mapping); and so to be able to remove page tables,
we must hold all three of them. Even if we couldn't figure out a way
to actually exploit a lack of locking here, we still shouldn't be
skipping locking because of that. Unless we have clearly defined API
contracts such as "you are allowed to traverse page tables if you hold
one of these three locks", it is impossible to maintain code or to
make it robust and secure. Even if you could show that the lack of
locking can't lead to a crash in the current kernel, that still
wouldn't be a good reason to skip the locking. Without something at
least resembling an API contract, an innocent-looking change in
another part of the kernel can turn things in a completely different
part of the kernel into dangerous bugs, and it is impossible to change
anything while feeling confident that you haven't broken something
halfway across the kernel. (Or at least that's what it feels like to
me when I try to edit kernel code.)

Anyway, here are proper instructions for reproducing the bug.
First build the kernel with the attached patch
HACK-thp-anonvma-logging-and-delay.diff, which adds a bunch of debug
logging and injects extra timing delays to make the race much easier
to hit.
Then build with CONFIG_KASAN=y and
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y and lockdep, and run that
kernel.
Run "echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled"
to enable shmem THP.
Compile the attached reproducer thp_newbug.c ("gcc -o thp_newbug thp_newbug.c").
Let the reproducer run (it'll take a few seconds to run because it has
to wait for khugepaged and stuff). Then you should see a lockdep
assert failure followed by an ASAN splat due to page table UAF in
dmesg:


[ 1076.064348][  T116] khugepaged: retract_page_tables:
retract_page_tables(), cc->is_khugepaged=1, target_mm=ffff88817b4acc80
[ 1076.066331][  T116] khugepaged: retract_page_tables: running on vma
at 0x1fe000 in mm=ffff88817b4acc80
[ 1076.068464][  T116] khugepaged: retract_page_tables: about to call
collapse_and_free_pmd()
[ 1076.071078][  T116] khugepaged: retract_page_tables: done with
collapse_and_free_pmd()
[ 1076.072360][  T116] khugepaged: retract_page_tables: sleeping
before mmap lock...
[ 1076.135883][  T797] map_pte: executing under TEST1
pvmw->vma->vm_mm==ffff88817b4acc80 pvmw->vma->vm_start=0x23ff000
[ 1076.147212][  T797] map_pte:   ... begin delay...
[ 1079.074924][  T116] khugepaged: retract_page_tables: done sleeping
[ 1079.075874][  T116] khugepaged: retract_page_tables: about to call
collapse_and_free_pmd()
[ 1079.076964][  T116] ------------[ cut here ]------------
[ 1079.077663][  T116] WARNING: CPU: 14 PID: 116 at
mm/khugepaged.c:1406 collapse_and_free_pmd+0x364/0x420
[ 1079.078940][  T116] CPU: 14 PID: 116 Comm: khugepaged Not tainted
6.2.0-rc3-00060-gc757fc92a3f7-dirty #257
[ 1079.080191][  T116] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1079.081462][  T116] RIP: 0010:collapse_and_free_pmd+0x364/0x420
[ 1079.082290][  T116] Code: ae 7e e6 03 85 d2 0f 84 b3 fd ff ff 4c 89
ff e8 52 86 fe ff 49 8b 3f 31 f6 48 83 c7 78 e8 34 56 e3 01 85 c0 0f
85 95 fd ff ff <0f> 0b e9 8e fd ff ff e8 d0 13 e8 ff ba ff ff ff ff be
26 00 00 00
[ 1079.084735][  T116] RSP: 0018:ffffc900009b79b0 EFLAGS: 00010246
[ 1079.085527][  T116] RAX: 0000000000000000 RBX: ffff88817b4acc80
RCX: 0000000000000001
[ 1079.089956][  T116] RDX: 0000000000000000 RSI: ffffffffa5cab020
RDI: ffffffffa5e382e0
[ 1079.090967][  T116] RBP: 1ffff92000136f36 R08: 0000000000000000
R09: ffff88817b4acd6f
[ 1079.091942][  T116] R10: ffffed102f6959ad R11: 0000000000000001
R12: ffff888117b0a3e8
[ 1079.092953][  T116] R13: 0000000002400000 R14: ffff888171ad1090
R15: ffff8881149b1100
[ 1079.094066][  T116] FS:  0000000000000000(0000)
GS:ffff8881f7100000(0000) knlGS:0000000000000000
[ 1079.095387][  T116] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1079.096379][  T116] CR2: 000055c6faaf9578 CR3: 0000000111738005
CR4: 0000000000370ee0
[ 1079.097695][  T116] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 1079.098909][  T116] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 1079.100117][  T116] Call Trace:
[ 1079.100668][  T116]  <TASK>
[...]
[ 1079.106003][  T116]  collapse_file.cold+0x3a4/0x400
[...]
[ 1079.110117][  T116]  hpage_collapse_scan_file+0x72a/0xad0
[ 1079.110856][  T116]  ? lock_release+0x22e/0x460
[ 1079.111483][  T116]  ? __pfx_hpage_collapse_scan_file+0x10/0x10
[ 1079.112289][  T116]  khugepaged+0x66b/0xbf0
[...]
[ 1079.118911][  T116]  kthread+0x17e/0x1b0
[ 1079.119440][  T116]  ? __pfx_kthread+0x10/0x10
[ 1079.120012][  T116]  ret_from_fork+0x2c/0x50
[ 1079.120593][  T116]  </TASK>
[ 1079.120971][  T116] irq event stamp: 12259
[ 1079.121520][  T116] hardirqs last  enabled at (12271):
[<ffffffffa388204e>] __up_console_sem+0x5e/0x70
[ 1079.122735][  T116] hardirqs last disabled at (12282):
[<ffffffffa3882033>] __up_console_sem+0x43/0x70
[ 1079.123950][  T116] softirqs last  enabled at (12018):
[<ffffffffa37cd108>] __irq_exit_rcu+0xd8/0x140
[ 1079.125151][  T116] softirqs last disabled at (12013):
[<ffffffffa37cd108>] __irq_exit_rcu+0xd8/0x140
[ 1079.126342][  T116] ---[ end trace 0000000000000000 ]---
[ 1079.140447][  T116] khugepaged: retract_page_tables: done with
collapse_and_free_pmd()
[ 1079.143002][  T116] khugepaged: retract_page_tables: skipping vma
at 0x1400000 in mm=ffff88817b4acc80 because anon_vma
[ 1079.146490][  T116] khugepaged: retract_page_tables: skipping vma
at 0x1400000 in mm=ffff88817b4a9980 because anon_vma
[ 1079.159601][  T116] khugepaged: retract_page_tables:
retract_page_tables(), cc->is_khugepaged=1, target_mm=ffff88817b4acc80
[ 1079.161186][  T116] khugepaged: retract_page_tables: skipping vma
at 0x1400000 in mm=ffff88817b4acc80 because anon_vma
[ 1079.162648][  T116] khugepaged: retract_page_tables: skipping vma
at 0x1400000 in mm=ffff88817b4a9980 because anon_vma
[ 1079.164000][  T116] khugepaged: retract_page_tables: skipping vma
at 0x2600000 in mm=ffff88817b4acc80 because anon_vma
[ 1079.649601][  T797] map_pte:   ... end delay
[ 1079.650358][  T797]
==================================================================
[ 1079.651257][  T797] BUG: KASAN: use-after-free in
page_vma_mapped_walk+0x5f3/0xdf0
[ 1079.652120][  T797] Read of size 8 at addr ffff888171e54000 by task TEST1/797
[ 1079.652915][  T797]
[ 1079.653179][  T797] CPU: 13 PID: 797 Comm: TEST1 Tainted: G
W          6.2.0-rc3-00060-gc757fc92a3f7-dirty #257
[ 1079.654410][  T797] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1079.655534][  T797] Call Trace:
[ 1079.655907][  T797]  <TASK>
[ 1079.656268][  T797]  dump_stack_lvl+0x5b/0x77
[ 1079.656812][  T797]  print_report+0x168/0x458
[...]
[ 1079.659003][  T797]  kasan_report+0xbb/0xf0
[...]
[ 1079.662829][  T797]  page_vma_mapped_walk+0x5f3/0xdf0
[ 1079.663474][  T797]  folio_referenced_one+0x1b3/0x3e0
[...]
[ 1079.667670][  T797]  rmap_walk_anon+0x137/0x2b0
[ 1079.668252][  T797]  folio_referenced+0x298/0x2c0
[...]
[ 1079.673519][  T797]  shrink_folio_list+0xd30/0x1660
[...]
[ 1079.679030][  T797]  reclaim_folio_list.constprop.0+0xcf/0x200
[...]
[ 1079.681772][  T797]  reclaim_pages+0x154/0x1c0
[...]
[ 1079.683727][  T797]  madvise_cold_or_pageout_pte_range+0xf64/0x1900
[...]
[ 1079.686125][  T797]  walk_pgd_range+0x68a/0xcb0
[...]
[ 1079.687393][  T797]  __walk_page_range+0x25d/0x260
[...]
[ 1079.689327][  T797]  walk_page_range+0x253/0x2b0
[...]
[ 1079.691922][  T797]  madvise_pageout+0x1ab/0x2a0
[...]
[ 1079.694355][  T797]  do_madvise.part.0+0x3a3/0x1370
[...]
[ 1079.701143][  T797]  __x64_sys_madvise+0x7b/0xa0
[ 1079.701719][  T797]  do_syscall_64+0x3a/0x90
[ 1079.702207][  T797]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1079.702863][  T797] RIP: 0033:0x7f18ae2dfd07
[ 1079.703352][  T797] Code: ff ff ff ff c3 48 8b 15 87 61 0c 00 f7 d8
64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 90 b8 1c
00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 61 0c 00 f7 d8
64 89 01 48
[ 1079.705469][  T797] RSP: 002b:00007fffaaf31198 EFLAGS: 00000206
ORIG_RAX: 000000000000001c
[ 1079.706382][  T797] RAX: ffffffffffffffda RBX: 0000000000000000
RCX: 00007f18ae2dfd07
[ 1079.707244][  T797] RDX: 0000000000000015 RSI: 0000000000001000
RDI: 0000000001400000
[ 1079.710207][  T797] RBP: 00007fffaaf322c0 R08: 00007f18ae3a8800
R09: 00007f18ae3ad540
[ 1079.711092][  T797] R10: 00007f18ae2d6274 R11: 0000000000000206
R12: 00005577fac45220
[ 1079.712021][  T797] R13: 00007fffaaf323a0 R14: 0000000000000000
R15: 0000000000000000
[ 1079.712883][  T797]  </TASK>
[ 1079.713219][  T797]
[ 1079.713495][  T797] The buggy address belongs to the physical page:
[ 1079.714273][  T797] page:000000009614ae4b refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x171e54
[ 1079.715595][  T797] flags: 0x8000000000000000(zone=2)
[ 1079.716275][  T797] raw: 8000000000000000 ffffea0004562e48
ffff8881f713d810 0000000000000000
[ 1079.717389][  T797] raw: 0000000000000000 ffff8881183f50c0
00000000ffffffff 0000000000000000
[ 1079.718508][  T797] page dumped because: kasan: bad access detected
[ 1079.719358][  T797] page_owner tracks the page as freed
[ 1079.720053][  T797] page last allocated via order 0, migratetype
Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 796,
tgid 796 (thp_newbug), ts 1065542415232, free_ts 1079140429026
[ 1079.722276][  T797]  get_page_from_freelist+0xb0a/0x1300
[ 1079.722872][  T797]  __alloc_pages+0x17e/0x3c0
[ 1079.723376][  T797]  pte_alloc_one+0x18/0xe0
[ 1079.723885][  T797]  __handle_mm_fault+0x15fa/0x1660
[ 1079.724474][  T797]  handle_mm_fault+0x18c/0x4e0
[ 1079.724995][  T797]  do_user_addr_fault+0x232/0x7f0
[ 1079.725551][  T797]  exc_page_fault+0x5d/0xe0
[ 1079.726045][  T797]  asm_exc_page_fault+0x22/0x30
[ 1079.726577][  T797] page last free stack trace:
[ 1079.727089][  T797]  free_pcp_prepare+0x276/0x530
[ 1079.727680][  T797]  free_unref_page+0x37/0x250
[ 1079.728193][  T797]  collapse_and_free_pmd+0x2a3/0x420
[ 1079.728773][  T797]  collapse_file.cold+0x3a4/0x400
[ 1079.729321][  T797]  hpage_collapse_scan_file+0x72a/0xad0
[ 1079.729939][  T797]  khugepaged+0x66b/0xbf0
[ 1079.730426][  T797]  kthread+0x17e/0x1b0
[ 1079.730874][  T797]  ret_from_fork+0x2c/0x50
[ 1079.731393][  T797]
[ 1079.731674][  T797] Memory state around the buggy address:
[ 1079.732284][  T797]  ffff888171e53f00: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00
[ 1079.733154][  T797]  ffff888171e53f80: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00
[ 1079.734032][  T797] >ffff888171e54000: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[ 1079.734932][  T797]                    ^
[ 1079.735380][  T797]  ffff888171e54080: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[ 1079.736249][  T797]  ffff888171e54100: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[ 1079.737119][  T797]
==================================================================
[ 1079.738129][  T797] Disabling lock debugging due to kernel taint
[ 1079.738897][  T797] map_pte: executing under TEST1
pvmw->vma->vm_mm==ffff88817b4acc80 pvmw->vma->vm_start=0x1400000
[ 1079.744091][  T797] map_pte:   ... continuing
[ 1079.744621][  T797] map_pte: executing under TEST1
pvmw->vma->vm_mm==ffff88817b4a9980 pvmw->vma->vm_start=0x1400000
[ 1079.745802][  T797] map_pte:   ... continuing


Here's the corresponding reproducer output (including dumps of the
anon_vma hierarchies attached to the VMAs, thanks to the extra debug
info I added to procfs in the attached diff):


user@vm:~/thp/thp_newbug$ ./thp_newbug
Rss = 2048 kB
[...]
Rss = 2048 kB
Rss = 2048 kB
Rss = 2048 kB
Rss = 2048 kB
Rss = 0 kB
khugepaged is about to write_trylock! associating anon_vma...
anon_vma associated, telling child to hit rmap now...
child hitting rmap...
child done with rmap

=========== PARENT ===========
001fe000-00400000 r--p 003fe000 00:01 13313
  /memfd:memfd (deleted)
Size:               2056 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   0 kB
Pss:                   0 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd mr mw me dc hg
01400000-01800000 rw-p 00400000 00:01 13313
  /memfd:memfd (deleted)
Size:               4096 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   0 kB
Pss:                   0 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr mr mw me ac
AnonVMAs:
 # [  0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4
num_children=2 num_active_vmas=3 parent=ffff8881149b1100
023ff000-02600000 rwxp 003ff000 00:01 13313
  /memfd:memfd (deleted)
Size:               2052 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   4 kB
Pss:                   4 kB
Pss_Dirty:             4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr ex mr mw me dc ac
AnonVMAs:
 # [  0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4
num_children=2 num_active_vmas=3 parent=ffff8881149b1100
02600000-02800000 rw-p 00600000 00:01 13313
  /memfd:memfd (deleted)
Size:               2048 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   0 kB
Pss:                   0 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr mr mw me dc ac
AnonVMAs:
 # [  0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4
num_children=2 num_active_vmas=3 parent=ffff8881149b1100
[...]
=========== CHILD ===========
01400000-01800000 rw-p 00400000 00:01 13313
  /memfd:memfd (deleted)
Size:               4096 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   4 kB
Pss:                   4 kB
Pss_Dirty:             4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            0 kB
Anonymous:             4 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr mr mw me ac
AnonVMAs:
   [  0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4
num_children=2 num_active_vmas=3 parent=ffff8881149b1100
 #   [  1] ffff8881149b0cc0 root=ffff8881149b1100 refcount=1
anon_vmas=1 num_children=0 num_active_vmas=1 parent=ffff8881149b1100
[...]
user@vm:~/thp/thp_newbug$

Jann Horn Jan. 13, 2023, 7:36 p.m. UTC | #7

On Thu, Jan 12, 2023 at 2:06 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Jan 11, 2023 at 5:33 AM Jann Horn <jannh@google.com> wrote:
> >
> > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
> > it to be locked. retract_page_tables() bails out if an ->anon_vma is
> > attached, but does this check before holding the mmap lock (as the comment
> > above the check explains).
> >
> > If we racily merge an existing ->anon_vma (shared with a child process)
> > from a neighboring VMA, subsequent rmap traversals on pages belonging to
> > the child will be able to see the page tables that we are concurrently
> > removing while assuming that nothing else can access them.
> >
> > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
> > really is no concurrent page table access.
> >
> > Reported-by: Zach O'Keefe <zokeefe@google.com>
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jann Horn <jannh@google.com>
> > ---
> > zokeefe@ pointed out to me that the current code (after my last round of patches)
> > can hit a lockdep assert by racing, and after staring at it a bit I've
>
> I'm supposed the lockdep is the one in collapse_and_free_pmd(). It is
> better to have the splat included in the commit log.

I pasted the splat in
https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/
- which part do you think should go into the commit log? Just the
"WARNING: CPU: 14 PID: 116 at
mm/khugepaged.c:1406 collapse_and_free_pmd+0x364/0x420"? Or the whole
ASAN splat below the lockdep complaint with all three backtraces?

> > convinced myself that this is a real, preexisting bug.
> > (I haven't written a reproducer for it though. One way to hit it might be
> > something along the lines of:
> >
> >  - set up a process A with a private-file-mapping VMA V1
> >  - let A fork() to create process B, thereby copying V1 in A to V1' in B
> >  - let B extend the end of V1'
> >  - let B put some anon pages into the extended part of V1'
>
> I don't quite get why we need this step. A cow fault on A's V1 isn't
> enough to have anon_vma for V1? This should not prevent V1 and V2 from
> sharing anon_vma. Did I miss something?

You're right, these steps don't work.

Kirill A. Shutemov Jan. 15, 2023, 7:06 p.m. UTC | #8

On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote:
> No, that lockdep assert has to be there. Page table traversal is
> allowed under any one of the mmap lock, the anon_vma lock (if the VMA
> is associated with an anon_vma), and the mapping lock (if the VMA is
> associated with a mapping); and so to be able to remove page tables,
> we must hold all three of them.

Okay, that's fair. I agree with the patch now. Maybe adjust the commit
message a bit?

Anyway:

Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>

BTW, I've noticied that you recently added tlb_remove_table_sync_one().
I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not
good enough to serialize against GUP fast?

Jann Horn Jan. 16, 2023, 12:06 p.m. UTC | #9

On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote:
> > No, that lockdep assert has to be there. Page table traversal is
> > allowed under any one of the mmap lock, the anon_vma lock (if the VMA
> > is associated with an anon_vma), and the mapping lock (if the VMA is
> > associated with a mapping); and so to be able to remove page tables,
> > we must hold all three of them.
>
> Okay, that's fair. I agree with the patch now. Maybe adjust the commit
> message a bit?

Just to make sure we're on the same page: Are you suggesting that I
add this text?
"Page table traversal is allowed under any one of the mmap lock, the
anon_vma lock (if the VMA is associated with an anon_vma), and the
mapping lock (if the VMA is associated with a mapping); and so to be
able to remove page tables, we must hold all three of them."
Or something else?

> Anyway:
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>

Thanks!

> BTW, I've noticied that you recently added tlb_remove_table_sync_one().
> I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not
> good enough to serialize against GUP fast?

If that sent an IPI, it would be good enough; but
pmdp_collapse_flush() is not guaranteed to send an IPI.
It does a TLB flush, but on some architectures (including arm64 and
also virtualized x86), a remote TLB flush can be done without an IPI.
For example, arm64 has some fancy hardware support for remote TLB
invalidation without IPIs ("broadcast TLB invalidation"), and
virtualized x86 has (depending on the hypervisor) things like TLB
shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or
TLB shootdown signalling for preempted CPUs through shared memory
(under KVM, see kvm_flush_tlb_multi).

Kirill A. Shutemov Jan. 16, 2023, 12:34 p.m. UTC | #10

On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote:
> On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote:
> > > No, that lockdep assert has to be there. Page table traversal is
> > > allowed under any one of the mmap lock, the anon_vma lock (if the VMA
> > > is associated with an anon_vma), and the mapping lock (if the VMA is
> > > associated with a mapping); and so to be able to remove page tables,
> > > we must hold all three of them.
> >
> > Okay, that's fair. I agree with the patch now. Maybe adjust the commit
> > message a bit?
> 
> Just to make sure we're on the same page: Are you suggesting that I
> add this text?
> "Page table traversal is allowed under any one of the mmap lock, the
> anon_vma lock (if the VMA is associated with an anon_vma), and the
> mapping lock (if the VMA is associated with a mapping); and so to be
> able to remove page tables, we must hold all three of them."
> Or something else?

Looks good to me.

> > Anyway:
> >
> > Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>
> 
> Thanks!
> 
> > BTW, I've noticied that you recently added tlb_remove_table_sync_one().
> > I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not
> > good enough to serialize against GUP fast?
> 
> If that sent an IPI, it would be good enough; but
> pmdp_collapse_flush() is not guaranteed to send an IPI.
> It does a TLB flush, but on some architectures (including arm64 and
> also virtualized x86), a remote TLB flush can be done without an IPI.
> For example, arm64 has some fancy hardware support for remote TLB
> invalidation without IPIs ("broadcast TLB invalidation"), and
> virtualized x86 has (depending on the hypervisor) things like TLB
> shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or
> TLB shootdown signalling for preempted CPUs through shared memory
> (under KVM, see kvm_flush_tlb_multi).

I think such architectures must provide proper pmdp_collapse_flush()
with the required serialization. Power and S390 already do that.

Jann Horn Jan. 16, 2023, 12:54 p.m. UTC | #11

+cc mmu-gather maintainers

On Mon, Jan 16, 2023 at 1:34 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote:
> > On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > BTW, I've noticied that you recently added tlb_remove_table_sync_one().
> > > I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not
> > > good enough to serialize against GUP fast?
> >
> > If that sent an IPI, it would be good enough; but
> > pmdp_collapse_flush() is not guaranteed to send an IPI.
> > It does a TLB flush, but on some architectures (including arm64 and
> > also virtualized x86), a remote TLB flush can be done without an IPI.
> > For example, arm64 has some fancy hardware support for remote TLB
> > invalidation without IPIs ("broadcast TLB invalidation"), and
> > virtualized x86 has (depending on the hypervisor) things like TLB
> > shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or
> > TLB shootdown signalling for preempted CPUs through shared memory
> > (under KVM, see kvm_flush_tlb_multi).
>
> I think such architectures must provide proper pmdp_collapse_flush()
> with the required serialization.

FWIW, the IPI that I added is not unconditional;
tlb_remove_table_sync_one() is a no-op depending on
CONFIG_MMU_GATHER_RCU_TABLE_FREE, which an architecture can use to
signal that it uses "Semi RCU freeing of the page directories". The
kernel has arch-independent support for these semantics in the normal
TLB flushing code. But yeah, I guess you could move the
tlb_remove_table_sync_one() calls into pmdp_collapse_flush()
(including the generic version)? I'm CC-ing the mmu-gather maintainers
in case they have an opinion.

Anyway, I'm not going to do that refactor; feel free to do that if you want.

> Power and S390 already do that.

What's the call graph from pmdp_collapse_flush() to IPI on powerpc and s390?

David Hildenbrand Jan. 16, 2023, 1:07 p.m. UTC | #12

On 16.01.23 13:34, Kirill A. Shutemov wrote:
> On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote:
>> On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>> On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote:
>>>> No, that lockdep assert has to be there. Page table traversal is
>>>> allowed under any one of the mmap lock, the anon_vma lock (if the VMA
>>>> is associated with an anon_vma), and the mapping lock (if the VMA is
>>>> associated with a mapping); and so to be able to remove page tables,
>>>> we must hold all three of them.
>>>
>>> Okay, that's fair. I agree with the patch now. Maybe adjust the commit
>>> message a bit?
>>
>> Just to make sure we're on the same page: Are you suggesting that I
>> add this text?
>> "Page table traversal is allowed under any one of the mmap lock, the
>> anon_vma lock (if the VMA is associated with an anon_vma), and the
>> mapping lock (if the VMA is associated with a mapping); and so to be
>> able to remove page tables, we must hold all three of them."
>> Or something else?
> 
> Looks good to me.
> 
>>> Anyway:
>>>
>>> Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>
>>
>> Thanks!
>>
>>> BTW, I've noticied that you recently added tlb_remove_table_sync_one().
>>> I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not
>>> good enough to serialize against GUP fast?
>>
>> If that sent an IPI, it would be good enough; but
>> pmdp_collapse_flush() is not guaranteed to send an IPI.
>> It does a TLB flush, but on some architectures (including arm64 and
>> also virtualized x86), a remote TLB flush can be done without an IPI.
>> For example, arm64 has some fancy hardware support for remote TLB
>> invalidation without IPIs ("broadcast TLB invalidation"), and
>> virtualized x86 has (depending on the hypervisor) things like TLB
>> shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or
>> TLB shootdown signalling for preempted CPUs through shared memory
>> (under KVM, see kvm_flush_tlb_multi).
> 
> I think such architectures must provide proper pmdp_collapse_flush()
> with the required serialization. Power and S390 already do that.
> 

The plan is to eventually move away from (ab)using IPI to synchronize with
GUP-fast. Moving further into that direction a is wrong.

The flush was added as a quick fix for all architectures by Jann, until
we can do better.

Even for ppc64, see:

commit bedf03416913d88c796288f9dca109a53608c745
Author: Yang Shi <shy828301@gmail.com>
Date:   Wed Sep 7 11:01:44 2022 -0700

     powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush
     
     The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
     move to use RCU instead of disabling local interrupts in fast-GUP.  Using
     an IPI is the old-styled way of serializing against fast-GUP although it
     still works as expected now.
     
     And fast-GUP now fixed the potential race with THP collapse by checking
     whether PMD is changed or not.  So IPI broadcast in radix pmd collapse
     flush is not necessary anymore.  But it is still needed for hash TLB.

Kirill A. Shutemov Jan. 16, 2023, 1:47 p.m. UTC | #13

On Mon, Jan 16, 2023 at 02:07:41PM +0100, David Hildenbrand wrote:
> On 16.01.23 13:34, Kirill A. Shutemov wrote:
> > On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote:
> > > On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > > > On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote:
> > > > > No, that lockdep assert has to be there. Page table traversal is
> > > > > allowed under any one of the mmap lock, the anon_vma lock (if the VMA
> > > > > is associated with an anon_vma), and the mapping lock (if the VMA is
> > > > > associated with a mapping); and so to be able to remove page tables,
> > > > > we must hold all three of them.
> > > > 
> > > > Okay, that's fair. I agree with the patch now. Maybe adjust the commit
> > > > message a bit?
> > > 
> > > Just to make sure we're on the same page: Are you suggesting that I
> > > add this text?
> > > "Page table traversal is allowed under any one of the mmap lock, the
> > > anon_vma lock (if the VMA is associated with an anon_vma), and the
> > > mapping lock (if the VMA is associated with a mapping); and so to be
> > > able to remove page tables, we must hold all three of them."
> > > Or something else?
> > 
> > Looks good to me.
> > 
> > > > Anyway:
> > > > 
> > > > Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>
> > > 
> > > Thanks!
> > > 
> > > > BTW, I've noticied that you recently added tlb_remove_table_sync_one().
> > > > I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not
> > > > good enough to serialize against GUP fast?
> > > 
> > > If that sent an IPI, it would be good enough; but
> > > pmdp_collapse_flush() is not guaranteed to send an IPI.
> > > It does a TLB flush, but on some architectures (including arm64 and
> > > also virtualized x86), a remote TLB flush can be done without an IPI.
> > > For example, arm64 has some fancy hardware support for remote TLB
> > > invalidation without IPIs ("broadcast TLB invalidation"), and
> > > virtualized x86 has (depending on the hypervisor) things like TLB
> > > shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or
> > > TLB shootdown signalling for preempted CPUs through shared memory
> > > (under KVM, see kvm_flush_tlb_multi).
> > 
> > I think such architectures must provide proper pmdp_collapse_flush()
> > with the required serialization. Power and S390 already do that.
> > 
> 
> The plan is to eventually move away from (ab)using IPI to synchronize with
> GUP-fast. Moving further into that direction a is wrong.
> 
> The flush was added as a quick fix for all architectures by Jann, until
> we can do better.
> 
> Even for ppc64, see:
> 
> commit bedf03416913d88c796288f9dca109a53608c745
> Author: Yang Shi <shy828301@gmail.com>
> Date:   Wed Sep 7 11:01:44 2022 -0700
> 
>     powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush
>     The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
>     move to use RCU instead of disabling local interrupts in fast-GUP.  Using
>     an IPI is the old-styled way of serializing against fast-GUP although it
>     still works as expected now.
>     And fast-GUP now fixed the potential race with THP collapse by checking
>     whether PMD is changed or not.  So IPI broadcast in radix pmd collapse
>     flush is not necessary anymore.  But it is still needed for hash TLB.

Okay. But I think tlb_remove_table_sync_one() belongs inside
pmdp_collapse_flush(). Collapsing pmd table into huge page without
serialization is a bug. They should not be separate.

Yang Shi Jan. 17, 2023, 6:57 p.m. UTC | #14

On Sun, Jan 15, 2023 at 11:07 AM Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote:
> > No, that lockdep assert has to be there. Page table traversal is
> > allowed under any one of the mmap lock, the anon_vma lock (if the VMA
> > is associated with an anon_vma), and the mapping lock (if the VMA is
> > associated with a mapping); and so to be able to remove page tables,
> > we must hold all three of them.
>
> Okay, that's fair. I agree with the patch now. Maybe adjust the commit
> message a bit?

Yeah, the explanation makes sense to me too. Reviewed-by: Yang Shi
<shy828301@gmail.com>

>
> Anyway:
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>
>
> BTW, I've noticied that you recently added tlb_remove_table_sync_one().
> I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not
> good enough to serialize against GUP fast?
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Jann Horn Jan. 17, 2023, 7:12 p.m. UTC | #15

On Wed, Jan 11, 2023 at 2:33 PM Jann Horn <jannh@google.com> wrote:
> If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
> it to be locked. retract_page_tables() bails out if an ->anon_vma is
> attached, but does this check before holding the mmap lock (as the comment
> above the check explains).

@akpm please replace the commit message with the following, and maybe
also add a "Link:" entry pointing to
https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/
for the reproducer.

If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
it to be locked.
Page table traversal is allowed under any one of the mmap lock, the
anon_vma lock (if the VMA is associated with an anon_vma), and the
mapping lock (if the VMA is associated with a mapping); and so to be
able to remove page tables, we must hold all three of them.
retract_page_tables() bails out if an ->anon_vma is attached, but
does this check before holding the mmap lock (as the comment above
the check explains).

If we racily merge an existing ->anon_vma (shared with a child process)
from a neighboring VMA, subsequent rmap traversals on pages belonging to
the child will be able to see the page tables that we are concurrently
removing while assuming that nothing else can access them.

Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
really is no concurrent page table access.

Hitting this bug causes a lockdep warning in collapse_and_free_pmd(),
in the line
"lockdep_assert_held_write(&vma->anon_vma->root->rwsem)".
It can also lead to use-after-free access.

Andrew Morton Jan. 17, 2023, 10:55 p.m. UTC | #16

On Tue, 17 Jan 2023 20:12:49 +0100 Jann Horn <jannh@google.com> wrote:

> On Wed, Jan 11, 2023 at 2:33 PM Jann Horn <jannh@google.com> wrote:
> > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
> > it to be locked. retract_page_tables() bails out if an ->anon_vma is
> > attached, but does this check before holding the mmap lock (as the comment
> > above the check explains).
> 
> @akpm please replace the commit message with the following, and maybe
> also add a "Link:" entry pointing to
> https://lore.kernel.org/linux-mm/CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com/
> for the reproducer.

Done, thanks.

David Hildenbrand Jan. 23, 2023, 11:07 a.m. UTC | #17

On 16.01.23 14:47, Kirill A. Shutemov wrote:
> On Mon, Jan 16, 2023 at 02:07:41PM +0100, David Hildenbrand wrote:
>> On 16.01.23 13:34, Kirill A. Shutemov wrote:
>>> On Mon, Jan 16, 2023 at 01:06:59PM +0100, Jann Horn wrote:
>>>> On Sun, Jan 15, 2023 at 8:07 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>>>> On Fri, Jan 13, 2023 at 08:28:59PM +0100, Jann Horn wrote:
>>>>>> No, that lockdep assert has to be there. Page table traversal is
>>>>>> allowed under any one of the mmap lock, the anon_vma lock (if the VMA
>>>>>> is associated with an anon_vma), and the mapping lock (if the VMA is
>>>>>> associated with a mapping); and so to be able to remove page tables,
>>>>>> we must hold all three of them.
>>>>>
>>>>> Okay, that's fair. I agree with the patch now. Maybe adjust the commit
>>>>> message a bit?
>>>>
>>>> Just to make sure we're on the same page: Are you suggesting that I
>>>> add this text?
>>>> "Page table traversal is allowed under any one of the mmap lock, the
>>>> anon_vma lock (if the VMA is associated with an anon_vma), and the
>>>> mapping lock (if the VMA is associated with a mapping); and so to be
>>>> able to remove page tables, we must hold all three of them."
>>>> Or something else?
>>>
>>> Looks good to me.
>>>
>>>>> Anyway:
>>>>>
>>>>> Acked-by: Kirill A. Shutemov <kirill.shutemov@intel.linux.com>
>>>>
>>>> Thanks!
>>>>
>>>>> BTW, I've noticied that you recently added tlb_remove_table_sync_one().
>>>>> I'm not sure why it is needed. Why IPI in pmdp_collapse_flush() in not
>>>>> good enough to serialize against GUP fast?
>>>>
>>>> If that sent an IPI, it would be good enough; but
>>>> pmdp_collapse_flush() is not guaranteed to send an IPI.
>>>> It does a TLB flush, but on some architectures (including arm64 and
>>>> also virtualized x86), a remote TLB flush can be done without an IPI.
>>>> For example, arm64 has some fancy hardware support for remote TLB
>>>> invalidation without IPIs ("broadcast TLB invalidation"), and
>>>> virtualized x86 has (depending on the hypervisor) things like TLB
>>>> shootdown hypercalls (under Hyper-V, see hyperv_flush_tlb_multi) or
>>>> TLB shootdown signalling for preempted CPUs through shared memory
>>>> (under KVM, see kvm_flush_tlb_multi).
>>>
>>> I think such architectures must provide proper pmdp_collapse_flush()
>>> with the required serialization. Power and S390 already do that.
>>>
>>
>> The plan is to eventually move away from (ab)using IPI to synchronize with
>> GUP-fast. Moving further into that direction a is wrong.
>>
>> The flush was added as a quick fix for all architectures by Jann, until
>> we can do better.
>>
>> Even for ppc64, see:
>>
>> commit bedf03416913d88c796288f9dca109a53608c745
>> Author: Yang Shi <shy828301@gmail.com>
>> Date:   Wed Sep 7 11:01:44 2022 -0700
>>
>>      powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush
>>      The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
>>      move to use RCU instead of disabling local interrupts in fast-GUP.  Using
>>      an IPI is the old-styled way of serializing against fast-GUP although it
>>      still works as expected now.
>>      And fast-GUP now fixed the potential race with THP collapse by checking
>>      whether PMD is changed or not.  So IPI broadcast in radix pmd collapse
>>      flush is not necessary anymore.  But it is still needed for hash TLB.
> 
> Okay. But I think tlb_remove_table_sync_one() belongs inside
> pmdp_collapse_flush(). Collapsing pmd table into huge page without
> serialization is a bug. They should not be separate.

Agreed. But I wonder if it should be moved into a generic 
pmdp_collapse_flush(), that calls an arch specific __pmdp_collapse_flush().

Kirill A. Shutemov Jan. 24, 2023, 12:51 a.m. UTC | #18

On Mon, Jan 23, 2023 at 12:07:05PM +0100, David Hildenbrand wrote:
> > > commit bedf03416913d88c796288f9dca109a53608c745
> > > Author: Yang Shi <shy828301@gmail.com>
> > > Date:   Wed Sep 7 11:01:44 2022 -0700
> > > 
> > >      powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush
> > >      The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
> > >      move to use RCU instead of disabling local interrupts in fast-GUP.  Using
> > >      an IPI is the old-styled way of serializing against fast-GUP although it
> > >      still works as expected now.
> > >      And fast-GUP now fixed the potential race with THP collapse by checking
> > >      whether PMD is changed or not.  So IPI broadcast in radix pmd collapse
> > >      flush is not necessary anymore.  But it is still needed for hash TLB.
> > 
> > Okay. But I think tlb_remove_table_sync_one() belongs inside
> > pmdp_collapse_flush(). Collapsing pmd table into huge page without
> > serialization is a bug. They should not be separate.
> 
> Agreed. But I wonder if it should be moved into a generic
> pmdp_collapse_flush(), that calls an arch specific __pmdp_collapse_flush().

Um? There's no __pmdp_collapse_flush(). Architectural code overrides
pmdp_collapse_flush() altogether. It has know how to serialize properly
for the arch.

David Hildenbrand Jan. 24, 2023, 10:19 a.m. UTC | #19

On 24.01.23 01:51, Kirill A. Shutemov wrote:
> On Mon, Jan 23, 2023 at 12:07:05PM +0100, David Hildenbrand wrote:
>>>> commit bedf03416913d88c796288f9dca109a53608c745
>>>> Author: Yang Shi <shy828301@gmail.com>
>>>> Date:   Wed Sep 7 11:01:44 2022 -0700
>>>>
>>>>       powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush
>>>>       The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
>>>>       move to use RCU instead of disabling local interrupts in fast-GUP.  Using
>>>>       an IPI is the old-styled way of serializing against fast-GUP although it
>>>>       still works as expected now.
>>>>       And fast-GUP now fixed the potential race with THP collapse by checking
>>>>       whether PMD is changed or not.  So IPI broadcast in radix pmd collapse
>>>>       flush is not necessary anymore.  But it is still needed for hash TLB.
>>>
>>> Okay. But I think tlb_remove_table_sync_one() belongs inside
>>> pmdp_collapse_flush(). Collapsing pmd table into huge page without
>>> serialization is a bug. They should not be separate.
>>
>> Agreed. But I wonder if it should be moved into a generic
>> pmdp_collapse_flush(), that calls an arch specific __pmdp_collapse_flush().
> 
> Um? There's no __pmdp_collapse_flush().

Did I imply that there is already one? ;)

mm/khugepaged: Fix ->anon_vma race

Commit Message

Comments

Patch