
[6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling

Message ID 20201120143557.6715-7-will@kernel.org (mailing list archive)
State New, archived
Series: tlb: Fix access and (soft-)dirty bit management

Commit Message

Will Deacon Nov. 20, 2020, 2:35 p.m. UTC
clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
updating the page-tables for the current mm. However, since the mm is not
being freed, this can result in stale TLB entries on architectures which
elide 'fullmm' invalidation.

Ensure that TLB invalidation is performed after updating soft-dirty
entries via clear_refs_write() by using the non-fullmm API to MMU gather.

Signed-off-by: Will Deacon <will@kernel.org>
---
 fs/proc/task_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
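
For readers outside arm64, "eliding 'fullmm' invalidation" means roughly the
following (a simplified sketch of the arm64-style tlb_flush(), not the
verbatim arch code; the ranged-flush helper at the end is a placeholder):

	static inline void tlb_flush(struct mmu_gather *tlb)
	{
		if (tlb->fullmm) {
			/*
			 * The mm is assumed to be going away: its ASID won't
			 * be reused without a full invalidation, so leaf TLB
			 * entries can be left stale and only a freed
			 * page-table walk cache needs shooting down.
			 */
			if (tlb->freed_tables)
				flush_tlb_mm(tlb->mm);
			return;
		}

		/* Non-fullmm: invalidate the accumulated [start, end) range. */
		flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end); /* placeholder */
	}

clear_refs_write() frees no page tables, so on such an architecture the
'fullmm' path flushes nothing at all, which is the stale-TLB problem this
patch addresses.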

Comments

Linus Torvalds Nov. 20, 2020, 5:41 p.m. UTC | #1
On Fri, Nov 20, 2020 at 6:36 AM Will Deacon <will@kernel.org> wrote:
>
> Ensure that TLB invalidation is performed after updating soft-dirty
> entries via clear_refs_write() by using the non-fullmm API to MMU gather.

This code sequence looks bogus to begin with.

It does that

                tlb_gather_mmu(&tlb, mm, 0, -1);
     ..
                tlb_finish_mmu(&tlb, 0, -1);

around the loop (although your patch series changes those arguments), but
it doesn't actually use "tlb" anywhere inside the loop itself that I
can see.
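
For reference, the loop in question looks roughly like this (condensed from
fs/proc/task_mmu.c of that era; locking, error handling and the
non-soft-dirty plumbing are omitted):

	tlb_gather_mmu(&tlb, mm, 0, -1);
	if (type == CLEAR_REFS_SOFT_DIRTY) {
		for (vma = mm->mmap; vma; vma = vma->vm_next) {
			if (!(vma->vm_flags & VM_SOFTDIRTY))
				continue;
			vma->vm_flags &= ~VM_SOFTDIRTY;
			vma_set_page_prot(vma);
		}
		mmu_notifier_invalidate_range_start(&range);
	}
	/* The PTE updates happen inside the walker, which never touches "tlb". */
	walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops, &cp);
	if (type == CLEAR_REFS_SOFT_DIRTY)
		mmu_notifier_invalidate_range_end(&range);
	tlb_finish_mmu(&tlb, 0, -1);	/* the only place a flush can happen */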

Yeah, yeah, it sets the flush_pending thing etc, but that still
sounds fundamentally wrong. It should do the proper range adjustments
if/when it actually walks the range. No?

If I read this all right, it will do a full TLB flush even when it
doesn't do anything (eg CLEAR_REFS_SOFT_DIRTY with no softdirty
pages).

So this looks all kinds of bogus. Not your patch, but the code it patches.

               Linus
Linus Torvalds Nov. 20, 2020, 5:45 p.m. UTC | #2
On Fri, Nov 20, 2020 at 9:41 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This code sequence looks bogus to begin with.

Oh, never mind.

I was reading the patches out of order, because 4/6 showed up later in
my inbox since it had other replies.

You seem to have fixed that bogosity in 4/6.

             Linus
Yu Zhao Nov. 20, 2020, 8:40 p.m. UTC | #3
On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> updating the page-tables for the current mm. However, since the mm is not
> being freed, this can result in stale TLB entries on architectures which
> elide 'fullmm' invalidation.
> 
> Ensure that TLB invalidation is performed after updating soft-dirty
> entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  fs/proc/task_mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index a76d339b5754..316af047f1aa 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  			count = -EINTR;
>  			goto out_mm;
>  		}
> -		tlb_gather_mmu_fullmm(&tlb, mm);
> +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);

Let's assume my reply to patch 4 is wrong, and therefore we still need
tlb_gather/finish_mmu() here. But then wouldn't this change deprive
architectures other than ARM of the opportunity to optimize based on the
fact it's a full-mm flush?

It seems to me ARM's interpretation of tlb->fullmm is a special case,
not the other way around.
Will Deacon Nov. 23, 2020, 6:35 p.m. UTC | #4
On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > updating the page-tables for the current mm. However, since the mm is not
> > being freed, this can result in stale TLB entries on architectures which
> > elide 'fullmm' invalidation.
> > 
> > Ensure that TLB invalidation is performed after updating soft-dirty
> > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > 
> > Signed-off-by: Will Deacon <will@kernel.org>
> > ---
> >  fs/proc/task_mmu.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index a76d339b5754..316af047f1aa 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  			count = -EINTR;
> >  			goto out_mm;
> >  		}
> > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> 
> Let's assume my reply to patch 4 is wrong, and therefore we still need
> tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> architectures other than ARM the opportunity to optimize based on the
> fact it's a full-mm flush?

Only for the soft-dirty case, but I think TLB invalidation is required
there because we are write-protecting the entries and I don't see any
mechanism to handle lazy invalidation for that (compared with the aging
case, which is handled via pte_accessible()).

Furthermore, if we decide that we can relax the TLB invalidation
requirements here, then I'd much rather that was done deliberately, rather
than as an accidental side-effect of another commit (since I think the
current behaviour was a consequence of 7a30df49f63a).

> It seems to me ARM's interpretation of tlb->fullmm is a special case,
> not the other way around.

Although I agree that this is subtle and error-prone (which is why I'm
trying to make the API more explicit here), it _is_ documented clearly
in asm-generic/tlb.h:

 *  - mmu_gather::fullmm
 *
 *    A flag set by tlb_gather_mmu() to indicate we're going to free
 *    the entire mm; this allows a number of optimizations.
 *
 *    - We can ignore tlb_{start,end}_vma(); because we don't
 *      care about ranges. Everything will be shot down.
 *
 *    - (RISC) architectures that use ASIDs can cycle to a new ASID
 *      and delay the invalidation until ASID space runs out.

Will
Yu Zhao Nov. 23, 2020, 8:04 p.m. UTC | #5
On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > updating the page-tables for the current mm. However, since the mm is not
> > > being freed, this can result in stale TLB entries on architectures which
> > > elide 'fullmm' invalidation.
> > > 
> > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > 
> > > Signed-off-by: Will Deacon <will@kernel.org>
> > > ---
> > >  fs/proc/task_mmu.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index a76d339b5754..316af047f1aa 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > >  			count = -EINTR;
> > >  			goto out_mm;
> > >  		}
> > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > 
> > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > architectures other than ARM the opportunity to optimize based on the
> > fact it's a full-mm flush?

I double checked my conclusion on patch 4, and aside from a couple
of typos, it still seems correct after the weekend.

> Only for the soft-dirty case, but I think TLB invalidation is required
> there because we are write-protecting the entries and I don't see any
> mechanism to handle lazy invalidation for that (compared with the aging
> case, which is handled via pte_accessible()).

The lazy invalidation for that is done when we write-protect a page,
not an individual PTE. When we do so, our decision is based on both
the dirty bit and the writable bit on each PTE mapping this page. So
we only need to make sure we don't lose both on a PTE. And we don't
here.
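
For concreteness, the check being described lives in page_mkclean_one() and
looks roughly like this (paraphrased; PTE case only, with the
page_vma_mapped_walk() boilerplate and the pmd path trimmed):

	while (page_vma_mapped_walk(&pvmw)) {
		pte_t *pte = pvmw.pte;
		pte_t entry;

		/* Neither dirty nor writable: nothing to clean, no flush. */
		if (!pte_dirty(*pte) && !pte_write(*pte))
			continue;

		flush_cache_page(vma, address, pte_pfn(*pte));
		entry = ptep_clear_flush(vma, address, pte); /* TLB invalidated */
		entry = pte_wrprotect(entry);
		entry = pte_mkclean(entry);
		set_pte_at(vma->vm_mm, address, pte, entry);
	}

So as long as a PTE that might still be writable in some TLB keeps at least
one of the two bits, the cleaning path will flush.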

> Furthermore, If we decide that we can relax the TLB invalidation
> requirements here, then I'd much rather than was done deliberately, rather
> than as an accidental side-effect of another commit (since I think the
> current behaviour was a consequence of 7a30df49f63a).

Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
("mm: fix KSM data corruption") in the first place.

> > It seems to me ARM's interpretation of tlb->fullmm is a special case,
> > not the other way around.
> 
> Although I agree that this is subtle and error-prone (which is why I'm
> trying to make the API more explicit here), it _is_ documented clearly
> in asm-generic/tlb.h:
> 
>  *  - mmu_gather::fullmm
>  *
>  *    A flag set by tlb_gather_mmu() to indicate we're going to free
>  *    the entire mm; this allows a number of optimizations.
>  *
>  *    - We can ignore tlb_{start,end}_vma(); because we don't
>  *      care about ranges. Everything will be shot down.
>  *
>  *    - (RISC) architectures that use ASIDs can cycle to a new ASID
>  *      and delay the invalidation until ASID space runs out.

I'd leave the original tlb_gather/finish_mmu() for the first case and
add a new API for the second case, the special case that only applies
to exit_mmap(). This way we won't change any existing behaviors on
other architectures, which seems important to me.

Additional cleanups to tlb_gather/finish_mmu() come thereafter.
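
The split being proposed would roughly take this shape (the fullmm variant
below is the name this series itself introduces; treat the prototypes as
illustrative):

	/* Ranged gather: behaviour unchanged on every architecture; the
	 * accumulated [start, end) range is flushed in tlb_finish_mmu(). */
	void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
			    unsigned long start, unsigned long end);

	/* Full-mm gather: reserved for exit_mmap(), where the mm really is
	 * being freed and ASID-recycling shortcuts are safe. */
	void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);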
Will Deacon Nov. 23, 2020, 9:17 p.m. UTC | #6
On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > updating the page-tables for the current mm. However, since the mm is not
> > > > being freed, this can result in stale TLB entries on architectures which
> > > > elide 'fullmm' invalidation.
> > > > 
> > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > 
> > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > ---
> > > >  fs/proc/task_mmu.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index a76d339b5754..316af047f1aa 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > >  			count = -EINTR;
> > > >  			goto out_mm;
> > > >  		}
> > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > 
> > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > architectures other than ARM the opportunity to optimize based on the
> > > fact it's a full-mm flush?
> 
> I double checked my conclusion on patch 4, and aside from a couple
> of typos, it still seems correct after the weekend.

I still need to digest that, but I would prefer that we restore the
invalidation first, and then have a subsequent commit to relax it. I find
it hard to believe that the behaviour in mainline at the moment is deliberate.

That is, I'm not against optimising this, but I'd rather get it "obviously
correct" first and the current code is definitely not that.

> > Only for the soft-dirty case, but I think TLB invalidation is required
> > there because we are write-protecting the entries and I don't see any
> > mechanism to handle lazy invalidation for that (compared with the aging
> > case, which is handled via pte_accessible()).
> 
> The lazy invalidation for that is done when we write-protect a page,
> not an individual PTE. When we do so, our decision is based on both
> the dirty bit and the writable bit on each PTE mapping this page. So
> we only need to make sure we don't lose both on a PTE. And we don't
> here.

Sorry, I don't follow what you're getting at here (page vs pte). Please can
you point me to the code you're referring to? The case I'm worried about is
code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
where !pte_write() and assuming (despite pte_dirty()) that there can't be
any concurrent modifications to the mapped page. Granted, I haven't found
anything doing that, but I could not convince myself that it would be a bug
to write such code, either.

> > Furthermore, If we decide that we can relax the TLB invalidation
> > requirements here, then I'd much rather than was done deliberately, rather
> > than as an accidental side-effect of another commit (since I think the
> > current behaviour was a consequence of 7a30df49f63a).
> 
> Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
> ("mm: fix KSM data corruption") in the first place.

Sure, but if you check out b3a81d0841a9 then you have a fullmm TLB
invalidation in tlb_finish_mmu(). 7a30df49f63a is what removed that, no?

Will
Yu Zhao Nov. 24, 2020, 1:13 a.m. UTC | #7
On Mon, Nov 23, 2020 at 09:17:51PM +0000, Will Deacon wrote:
> On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> > On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > > updating the page-tables for the current mm. However, since the mm is not
> > > > > being freed, this can result in stale TLB entries on architectures which
> > > > > elide 'fullmm' invalidation.
> > > > > 
> > > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > > 
> > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > ---
> > > > >  fs/proc/task_mmu.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index a76d339b5754..316af047f1aa 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > > >  			count = -EINTR;
> > > > >  			goto out_mm;
> > > > >  		}
> > > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > > 
> > > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > > architectures other than ARM the opportunity to optimize based on the
> > > > fact it's a full-mm flush?
> > 
> > I double checked my conclusion on patch 4, and aside from a couple
> > of typos, it still seems correct after the weekend.
> 
> I still need to digest that, but I would prefer that we restore the
> invalidation first, and then have a subsequent commit to relax it. I find
> it hard to believe that the behaviour in mainline at the moment is deliberate.
> 
> That is, I'm not against optimising this, but I'd rather get it "obviously
> correct" first and the current code is definitely not that.

I wouldn't mind having this patch and patch 4 if the invalidation they
restore were in a correct state -- b3a81d0841a9 ("mm: fix KSM data
corruption") isn't correct to start with.

It is complicated, so please bear with me. Let's study this by looking
at examples this time.

> > > Only for the soft-dirty case, but I think TLB invalidation is required
> > > there because we are write-protecting the entries and I don't see any
> > > mechanism to handle lazy invalidation for that (compared with the aging
> > > case, which is handled via pte_accessible()).
> > 
> > The lazy invalidation for that is done when we write-protect a page,
> > not an individual PTE. When we do so, our decision is based on both
> > the dirty bit and the writable bit on each PTE mapping this page. So
> > we only need to make sure we don't lose both on a PTE. And we don't
> > here.
> 
> Sorry, I don't follow what you're getting at here (page vs pte). Please can
> you point me to the code you're referring to? The case I'm worried about is
> code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
> where !pte_write() and assuming (despite pte_dirty()) that there can't be
> any concurrent modifications to the mapped page. Granted, I haven't found
> anything doing that, but I could not convince myself that it would be a bug
> to write such code, either.

Example 1: memory corruption is still possible with patch 4 & 6

  CPU0        CPU1        CPU2        CPU3
  ----        ----        ----        ----
  userspace                           page writeback

  [cache writable
   PTE in TLB]

              inc_tlb_flush_pending()
              clean_record_pte()
              pte_mkclean()

                          tlb_gather_mmu()
                          [set mm_tlb_flush_pending()]
                          clear_refs_write()
                          pte_wrprotect()

                                      page_mkclean_one()
                                      !pte_dirty() && !pte_write()
                                      [true, no flush]

                                      write page to disk

  Write to page
  [using stale PTE]

                                      drop clean page
                                      [data integrity compromised]

              flush_tlb_range()

                          tlb_finish_mmu()
                          [flush (with patch 4)]

Example 2: why no flush when write-protecting is not a problem (after
we fix the problem correctly by adding mm_tlb_flush_pending()).
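
In code, that fix might look like the following tweak to page_mkclean_one()'s
check (a hypothetical sketch drawn from the cases below, not a merged patch):

	/*
	 * A pending deferred flush means a writable entry may still be
	 * cached in some TLB even though the PTE now looks clean and
	 * write-protected, so only skip the flush when nothing is pending.
	 */
	if (!pte_dirty(*pte) && !pte_write(*pte) &&
	    !mm_tlb_flush_pending(vma->vm_mm))
		continue;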

Case a:

  CPU0        CPU1        CPU2        CPU3
  ----        ----        ----        ----
  userspace                           page writeback

  [cache writable
   PTE in TLB]

              inc_tlb_flush_pending()
              clean_record_pte()
              pte_mkclean()

                          clear_refs_write()
                          pte_wrprotect()

                                      page_mkclean_one()
                                      !pte_dirty() && !pte_write() &&
                                      !mm_tlb_flush_pending()
                                      [false: flush]

                                      write page to disk

  Write to page
  [page fault]

                                      drop clean page
                                      [data integrity guaranteed]

              flush_tlb_range()

Case b:

  CPU0        CPU1        CPU2
  ----        ----        ----
  userspace               page writeback

  [cache writable
   PTE in TLB]

              clear_refs_write()
              pte_wrprotect()
              [pte_dirty() is false]

                          page_mkclean_one()
                          !pte_dirty() && !pte_write() &&
                          !mm_tlb_flush_pending()
                          [true: no flush]

                          write page to disk

  Write to page
  [h/w tries to set
   the dirty bit
   but sees write-
   protected PTE,
   page fault]

                          drop clean page
                          [data integrity guaranteed]

Case c:

  CPU0        CPU1        CPU2
  ----        ----        ----
  userspace               page writeback

  [cache writable
   PTE in TLB]

              clear_refs_write()
              pte_wrprotect()
              [pte_dirty() is true]

                          page_mkclean_one()
                          !pte_dirty() && !pte_write() &&
                          !mm_tlb_flush_pending()
                          [false: flush]

                          write page to disk

  Write to page
  [page fault]

                          drop clean page
                          [data integrity guaranteed]

> > > Furthermore, If we decide that we can relax the TLB invalidation
> > > requirements here, then I'd much rather than was done deliberately, rather
> > > than as an accidental side-effect of another commit (since I think the
> > > current behaviour was a consequence of 7a30df49f63a).
> > 
> > Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
                                  ^^^^^^ shouldn't

Another typo, I apologize.

> > ("mm: fix KSM data corruption") in the first place.
> 
> Sure, but if you check out b3a81d0841a9 then you have a fullmm TLB
> invalidation in tlb_finish_mmu(). 7a30df49f63a is what removed that, no?
> 
> Will
Will Deacon Nov. 24, 2020, 2:31 p.m. UTC | #8
On Mon, Nov 23, 2020 at 06:13:34PM -0700, Yu Zhao wrote:
> On Mon, Nov 23, 2020 at 09:17:51PM +0000, Will Deacon wrote:
> > On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> > > On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > > > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > > > updating the page-tables for the current mm. However, since the mm is not
> > > > > > being freed, this can result in stale TLB entries on architectures which
> > > > > > elide 'fullmm' invalidation.
> > > > > > 
> > > > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > > > 
> > > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > > ---
> > > > > >  fs/proc/task_mmu.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index a76d339b5754..316af047f1aa 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > > > >  			count = -EINTR;
> > > > > >  			goto out_mm;
> > > > > >  		}
> > > > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > > > 
> > > > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > > > architectures other than ARM the opportunity to optimize based on the
> > > > > fact it's a full-mm flush?
> > > 
> > > I double checked my conclusion on patch 4, and aside from a couple
> > > of typos, it still seems correct after the weekend.
> > 
> > I still need to digest that, but I would prefer that we restore the
> > invalidation first, and then have a subsequent commit to relax it. I find
> > it hard to believe that the behaviour in mainline at the moment is deliberate.
> > 
> > That is, I'm not against optimising this, but I'd rather get it "obviously
> > correct" first and the current code is definitely not that.
> 
> I wouldn't mind having this patch and patch 4 if the invalidation they
> restore were in a correct state -- b3a81d0841a9 ("mm: fix KSM data
> corruption") isn't correct to start with.
> 
> It is complicated, so please bear with me. Let's study this by looking
> at examples this time.

Thanks for putting these together. If you're right, then it looks like it's
even worse than I thought :(

> > > > Only for the soft-dirty case, but I think TLB invalidation is required
> > > > there because we are write-protecting the entries and I don't see any
> > > > mechanism to handle lazy invalidation for that (compared with the aging
> > > > case, which is handled via pte_accessible()).
> > > 
> > > The lazy invalidation for that is done when we write-protect a page,
> > > not an individual PTE. When we do so, our decision is based on both
> > > the dirty bit and the writable bit on each PTE mapping this page. So
> > > we only need to make sure we don't lose both on a PTE. And we don't
> > > here.
> > 
> > Sorry, I don't follow what you're getting at here (page vs pte). Please can
> > you point me to the code you're referring to? The case I'm worried about is
> > code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
> > where !pte_write() and assuming (despite pte_dirty()) that there can't be
> > any concurrent modifications to the mapped page. Granted, I haven't found
> > anything doing that, but I could not convince myself that it would be a bug
> > to write such code, either.
> 
> Example 1: memory corruption is still possible with patch 4 & 6
> 
>   CPU0        CPU1        CPU2        CPU3
>   ----        ----        ----        ----
>   userspace                           page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               inc_tlb_flush_pending()
>               clean_record_pte()
>               pte_mkclean()

This path:      ^^^^^ looks a bit weird to me and I _think_ only happens
via some vmware DRM driver (i.e. the only caller of
clean_record_shared_mapping_range()). Are you sure that's operating on
pages that can be reclaimed? I have a feeling it might all be pinned.

>                           tlb_gather_mmu()
>                           [set mm_tlb_flush_pending()]
>                           clear_refs_write()
>                           pte_wrprotect()
> 
>                                       page_mkclean_one()
>                                       !pte_dirty() && !pte_write()
>                                       [true, no flush]
> 
>                                       write page to disk
> 
>   Write to page
>   [using stale PTE]
> 
>                                       drop clean page
>                                       [data integrity compromised]
> 
>               flush_tlb_range()
> 
>                           tlb_finish_mmu()
>                           [flush (with patch 4)]

Setting my earlier comment aside, I think a useful observation here
is that even with correct TLB invalidation, there is still a window
between modifying the page-table and flushing the TLB where another CPU
could see the updated page-table and incorrectly elide a flush. In these
cases we need to rely either on locking or use of tlb_flush_pending() to
ensure the correct behaviour.

> Example 2: why no flush when write-protecting is not a problem (after
> we fix the problem correctly by adding mm_tlb_flush_pending()).

So here you add an mm_tlb_flush_pending() check to the reclaim path
to resolve the race above.

> Case a:
> 
>   CPU0        CPU1        CPU2        CPU3
>   ----        ----        ----        ----
>   userspace                           page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               inc_tlb_flush_pending()
>               clean_record_pte()
>               pte_mkclean()
> 
>                           clear_refs_write()
>                           pte_wrprotect()
> 
>                                       page_mkclean_one()
>                                       !pte_dirty() && !pte_write() &&
>                                       !mm_tlb_flush_pending()
>                                       [false: flush]
> 
>                                       write page to disk
> 
>   Write to page
>   [page fault]
> 
>                                       drop clean page
>                                       [data integrity guaranteed]
> 
>               flush_tlb_range()
> 
> Case b:
> 
>   CPU0        CPU1        CPU2
>   ----        ----        ----
>   userspace               page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               clear_refs_write()
>               pte_wrprotect()
>               [pte_dirty() is false]
> 
>                           page_mkclean_one()
>                           !pte_dirty() && !pte_write() &&
>                           !mm_tlb_flush_pending()
>                           [true: no flush]
> 
>                           write page to disk
> 
>   Write to page
>   [h/w tries to set
>    the dirty bit
>    but sees write-
>    protected PTE,
>    page fault]

I agree with you for this example, but I think if the page writeback ran
on CPU 1 after clear_refs_write() then we could have a problem: the updated
pte could sit in the store buffer of CPU1 and the walker on CPU0 would
be able to set the dirty bit. TLB invalidation in clear_refs_write()
would prevent that.

Will
Peter Zijlstra Nov. 24, 2020, 2:46 p.m. UTC | #9
On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:

> It seems to me ARM's interpretation of tlb->fullmm is a special case,
> not the other way around.

I don't think ARM is special here, IIRC there were more architectures
that did that.
Minchan Kim Nov. 25, 2020, 10:01 p.m. UTC | #10
On Mon, Nov 23, 2020 at 06:13:34PM -0700, Yu Zhao wrote:
> On Mon, Nov 23, 2020 at 09:17:51PM +0000, Will Deacon wrote:
> > On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> > > On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > > > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > > > updating the page-tables for the current mm. However, since the mm is not
> > > > > > being freed, this can result in stale TLB entries on architectures which
> > > > > > elide 'fullmm' invalidation.
> > > > > > 
> > > > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > > > 
> > > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > > ---
> > > > > >  fs/proc/task_mmu.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index a76d339b5754..316af047f1aa 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > > > >  			count = -EINTR;
> > > > > >  			goto out_mm;
> > > > > >  		}
> > > > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > > > 
> > > > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > > > architectures other than ARM the opportunity to optimize based on the
> > > > > fact it's a full-mm flush?
> > > 
> > > I double checked my conclusion on patch 4, and aside from a couple
> > > of typos, it still seems correct after the weekend.
> > 
> > I still need to digest that, but I would prefer that we restore the
> > invalidation first, and then have a subsequent commit to relax it. I find
> > it hard to believe that the behaviour in mainline at the moment is deliberate.
> > 
> > That is, I'm not against optimising this, but I'd rather get it "obviously
> > correct" first and the current code is definitely not that.
> 
> I wouldn't mind having this patch and patch 4 if the invalidation they
> restore were in a correct state -- b3a81d0841a9 ("mm: fix KSM data
> corruption") isn't correct to start with.
> 
> It is complicated, so please bear with me. Let's study this by looking
> at examples this time.
> 
> > > > Only for the soft-dirty case, but I think TLB invalidation is required
> > > > there because we are write-protecting the entries and I don't see any
> > > > mechanism to handle lazy invalidation for that (compared with the aging
> > > > case, which is handled via pte_accessible()).
> > > 
> > > The lazy invalidation for that is done when we write-protect a page,
> > > not an individual PTE. When we do so, our decision is based on both
> > > the dirty bit and the writable bit on each PTE mapping this page. So
> > > we only need to make sure we don't lose both on a PTE. And we don't
> > > here.
> > 
> > Sorry, I don't follow what you're getting at here (page vs pte). Please can
> > you point me to the code you're referring to? The case I'm worried about is
> > code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
> > where !pte_write() and assuming (despite pte_dirty()) that there can't be
> > any concurrent modifications to the mapped page. Granted, I haven't found
> > anything doing that, but I could not convince myself that it would be a bug
> > to write such code, either.
> 
> Example 1: memory corruption is still possible with patch 4 & 6
> 
>   CPU0        CPU1        CPU2        CPU3
>   ----        ----        ----        ----
>   userspace                           page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               inc_tlb_flush_pending()
>               clean_record_pte()
>               pte_mkclean()
> 
>                           tlb_gather_mmu()
>                           [set mm_tlb_flush_pending()]
>                           clear_refs_write()
>                           pte_wrprotect()
> 
>                                       page_mkclean_one()
>                                       !pte_dirty() && !pte_write()
>                                       [true, no flush]
> 
>                                       write page to disk
> 
>   Write to page
>   [using stale PTE]
> 
>                                       drop clean page
>                                       [data integrity compromised]
> 
>               flush_tlb_range()
> 
>                           tlb_finish_mmu()
>                           [flush (with patch 4)]
> 
> Example 2: why no flush when write-protecting is not a problem (after
> we fix the problem correctly by adding mm_tlb_flush_pending()).
> 
> Case a:
> 
>   CPU0        CPU1        CPU2        CPU3
>   ----        ----        ----        ----
>   userspace                           page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               inc_tlb_flush_pending()
>               clean_record_pte()
>               pte_mkclean()
> 
>                           clear_refs_write()
>                           pte_wrprotect()
> 
>                                       page_mkclean_one()
>                                       !pte_dirty() && !pte_write() &&
>                                       !mm_tlb_flush_pending()
>                                       [false: flush]
> 
>                                       write page to disk
> 
>   Write to page
>   [page fault]
> 
>                                       drop clean page
>                                       [data integrity guaranteed]
> 
>               flush_tlb_range()
> 
> Case b:
> 
>   CPU0        CPU1        CPU2
>   ----        ----        ----
>   userspace               page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               clear_refs_write()
>               pte_wrprotect()
>               [pte_dirty() is false]
> 
>                           page_mkclean_one()
>                           !pte_dirty() && !pte_write() &&
>                           !mm_tlb_flush_pending()
>                           [true: no flush]
> 
>                           write page to disk
> 
>   Write to page
>   [h/w tries to set
>    the dirty bit
>    but sees write-
>    protected PTE,
>    page fault]
> 
>                           drop clean page
>                           [data integrity guaranteed]
> 
> Case c:
> 
>   CPU0        CPU1        CPU2
>   ----        ----        ----
>   userspace               page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               clear_refs_write()
>               pte_wrprotect()
>               [pte_dirty() is true]
> 
>                           page_mkclean_one()
>                           !pte_dirty() && !pte_write() &&
>                           !mm_tlb_flush_pending()
>                           [false: flush]
> 
>                           write page to disk
> 
>   Write to page
>   [page fault]
> 
>                           drop clean page
>                           [data integrity guaranteed]
> 
> > > > Furthermore, If we decide that we can relax the TLB invalidation
> > > > requirements here, then I'd much rather than was done deliberately, rather
> > > > than as an accidental side-effect of another commit (since I think the
> > > > current behaviour was a consequence of 7a30df49f63a).
> > > 
> > > Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
>                                   ^^^^^^ shouldn't

I read all the examples Yu mentioned and think they are all correct. Furthermore,
I agree with Yu that we don't need the TLB gathering in clear_refs_write() in the
first place just to increase the TLB flush pending count, since MADV_FREE already
has it. However, I'd like to keep the final flushing logic in clear_refs_write()
rather than relying on luck, for better accuracy as well as stronger guarantees.

So, IMHO, technically Yu's points are all valid to me - we need to fix the page
writeback path. As for clear_refs_write(), Will's changes are still an
improvement (on the assumption that Yu agrees we need the TLB flush for
accuracy, not correctness), so they are still worth having.

Then, Yu, could you send the writeback path fix? Please Cc Hugh, Mel,
Andrea and Nadav on the next patchset since they are all experts in this
domain - referenced in https://lkml.org/lkml/2015/4/15/565

> 
> Another typo, I apologize.
> 
> > > ("mm: fix KSM data corruption") in the first place.
> > 
> > Sure, but if you check out b3a81d0841a9 then you have a fullmm TLB
> > invalidation in tlb_finish_mmu(). 7a30df49f63a is what removed that, no?
> > 
> > Will

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a76d339b5754..316af047f1aa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1238,7 +1238,7 @@  static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			count = -EINTR;
 			goto out_mm;
 		}
-		tlb_gather_mmu_fullmm(&tlb, mm);
+		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))