diff mbox series

mm/userfaultfd: fix memory corruption due to writeprotect

Message ID	20201219043006.2206347-1-namit@vmware.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=deGn=FX=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3895923BE2 From: Nadav Amit <nadav.amit@gmail.com> To: linux-mm@kvack.org, Peter Xu <peterx@redhat.com> Cc: linux-kernel@vger.kernel.org, Nadav Amit <namit@vmware.com>, Andrea Arcangeli <aarcange@redhat.com>, Pavel Emelyanov <xemul@openvz.org>, Mike Kravetz <mike.kravetz@oracle.com>, Mike Rapoport <rppt@linux.vnet.ibm.com>, stable@vger.kernel.org Subject: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect Date: Fri, 18 Dec 2020 20:30:06 -0800 Message-Id: <20201219043006.2206347-1-namit@vmware.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm/userfaultfd: fix memory corruption due to writeprotect \| expand mm/userfaultfd: fix memory corruption due to writeprotect

Commit Message

Nadav Amit Dec. 19, 2020, 4:30 a.m. UTC

From: Nadav Amit <namit@vmware.com>

Userfaultfd self-tests fail occasionally, indicating a memory
corruption.

Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range(). This might
cause the TLBs to be in an inconsistent state due to the deferred
batched TLB flushes.

Consider the following scenario with 3 CPUs (cpu2 is not shown):

cpu0				cpu1
----				----
userfaultfd_writeprotect()
[ write-protecting ]
mwriteprotect_range()
 mmap_read_lock()
 change_protection()
  change_protection_range()
   ...
   change_pte_range()
   [ defer TLB flushes]
				userfaultfd_writeprotect()
				 mmap_read_lock()
				 change_protection()
				 [ write-unprotect ]
				 ...
				  [ unprotect PTE logically ]
				...
				[ page-fault]
				...
				wp_page_copy()
				[ set new writable page in PTE]

At this point no TLB flush took place. cpu2 (not shown) might have a
stale writable PTE, which was cached in the TLB before cpu0 called
userfaultfd_writeprotect(), and this PTE points to a different page.

Therefore, write-protecting of memory, even using userfaultfd, which
does not change the vma, requires to prevent any concurrent reader (#PF
handler) from reading PTEs from the page-tables if any CPU might still
hold in it TLB a PTE with higher permissions for the same address. To
do so mmap_lock needs to be taken for write.

Surprisingly, memory-unprotection using userfaultfd also poses a
problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
code might actually cause a demotion of the permission, and therefore
requires a TLB flush.

During unprotection of userfaultfd managed memory region, the PTE is not
really made writable, but instead marked "logically" as writable, and
left for the #PF handler to be handled later. While this is ok, the code
currently also *removes* the write permission, and therefore makes it
necessary to flush the TLBs.

To resolve these problems, acquire mmap_lock for write when
write-protecting userfaultfd region using ioctl's. Keep taking mmap_lock
for read when unprotecting memory, but keep the write-bit set when
resolving userfaultfd write-protection.

Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c    |  3 ++-
 mm/userfaultfd.c | 15 +++++++++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)

Comments

Andrea Arcangeli Dec. 19, 2020, 7:15 p.m. UTC | #1

Hello,

On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> Analyzing this problem indicates that there is a real bug since
> mmap_lock is only taken for read in mwriteprotect_range(). This might

Never having to take the mmap_sem for writing, and in turn never
blocking, in order to modify the pagetables is quite an important
feature in uffd that justifies uffd instead of mprotect. It's not the
most important reason to use uffd, but it'd be nice if that guarantee
would remain also for the UFFDIO_WRITEPROTECT API, not only for the
other pgtable manipulations.

> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> 
> cpu0				cpu1
> ----				----
> userfaultfd_writeprotect()
> [ write-protecting ]
> mwriteprotect_range()
>  mmap_read_lock()
>  change_protection()
>   change_protection_range()
>    ...
>    change_pte_range()
>    [ defer TLB flushes]
> 				userfaultfd_writeprotect()
> 				 mmap_read_lock()
> 				 change_protection()
> 				 [ write-unprotect ]
> 				 ...
> 				  [ unprotect PTE logically ]
> 				...
> 				[ page-fault]
> 				...
> 				wp_page_copy()
> 				[ set new writable page in PTE]

Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
is set and do an explicit (potentially spurious) tlb flush before
write-unprotect?

Thanks,
Andrea

Nadav Amit Dec. 19, 2020, 10:06 p.m. UTC | #2

> On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> [ cc’ing some more people who have experience with similar problems ]
> 
>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> 
>> Hello,
>> 
>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>>> Analyzing this problem indicates that there is a real bug since
>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>> 
>> Never having to take the mmap_sem for writing, and in turn never
>> blocking, in order to modify the pagetables is quite an important
>> feature in uffd that justifies uffd instead of mprotect. It's not the
>> most important reason to use uffd, but it'd be nice if that guarantee
>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
>> other pgtable manipulations.
>> 
>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>> 
>>> cpu0				cpu1
>>> ----				----
>>> userfaultfd_writeprotect()
>>> [ write-protecting ]
>>> mwriteprotect_range()
>>> mmap_read_lock()
>>> change_protection()
>>> change_protection_range()
>>>  ...
>>>  change_pte_range()
>>>  [ defer TLB flushes]
>>> 				userfaultfd_writeprotect()
>>> 				 mmap_read_lock()
>>> 				 change_protection()
>>> 				 [ write-unprotect ]
>>> 				 ...
>>> 				  [ unprotect PTE logically ]
>>> 				...
>>> 				[ page-fault]
>>> 				...
>>> 				wp_page_copy()
>>> 				[ set new writable page in PTE]
>> 
>> Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
>> is set and do an explicit (potentially spurious) tlb flush before
>> write-unprotect?
> 
> There is a concrete scenario that I actually encountered and then there is a
> general problem.
> 
> In general, the kernel code assumes that PTEs that are read from the
> page-tables are coherent across all the TLBs, excluding permission promotion
> (i.e., the PTE may have higher permissions in the page-tables than those
> that are cached in the TLBs).
> 
> We therefore need to both: (a) protect change_protection_range() from the
> changes of others who might defer TLB flushes without taking mmap_sem for
> write (e.g., try_to_unmap_one()); and (b) to protect others (e.g.,
> page-fault handlers) from concurrent changes of change_protection().
> 
> We have already encountered several similar bugs, and debugging such issues
> s time consuming and these bugs impact is substantial (memory corruption,
> security). So I think we should only stick to general solutions.
> 
> So perhaps your the approach of your proposed solution is feasible, but it
> would have to be applied all over the place: we will need to add a check for
> mm_tlb_flush_pending() and conditionally flush the TLB in every case in
> which PTEs are read and there might be an assumption that the
> access-permission reflect what the TLBs hold. This includes page-fault
> handlers, but also NUMA migration code in change_protection(), softdirty
> cleanup in clear_refs_write() and maybe others.
> 
> [ I have in mind another solution, such as keeping in each page-table a 
> “table-generation” which is the mm-generation at the time of the change,
> and only flush if “table-generation”==“mm-generation”, but it requires
> some thought on how to avoid adding new memory barriers. ]
> 
> IOW: I think the change that you suggest is insufficient, and a proper
> solution is too intrusive for “stable".
> 
> As for performance, I can add another patch later to remove the TLB flush
> that is unnecessarily performed during change_protection_range() that does
> permission promotion. I know that your concern is about the “protect” case
> but I cannot think of a good immediate solution that avoids taking mmap_lock
> for write.
> 
> Thoughts?

On a second thought (i.e., I don’t know what I was thinking), doing so —
checking mm_tlb_flush_pending() on every PTE read which is potentially
dangerous and flushing if needed - can lead to huge amount of TLB flushes
and shootodowns as the counter might be elevated for considerable amount of
time.

So this solution seems to me as a no-go.

Andy Lutomirski Dec. 20, 2020, 2:01 a.m. UTC | #3

On Sat, Dec 19, 2020 at 1:34 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> [ cc’ing some more people who have experience with similar problems ]
>
> > On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Hello,
> >
> > On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >> Analyzing this problem indicates that there is a real bug since
> >> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >
> > Never having to take the mmap_sem for writing, and in turn never
> > blocking, in order to modify the pagetables is quite an important
> > feature in uffd that justifies uffd instead of mprotect. It's not the
> > most important reason to use uffd, but it'd be nice if that guarantee
> > would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> > other pgtable manipulations.
> >
> >> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>
> >> cpu0                         cpu1
> >> ----                         ----
> >> userfaultfd_writeprotect()
> >> [ write-protecting ]
> >> mwriteprotect_range()
> >> mmap_read_lock()
> >> change_protection()
> >>  change_protection_range()
> >>   ...
> >>   change_pte_range()
> >>   [ defer TLB flushes]
> >>                              userfaultfd_writeprotect()
> >>                               mmap_read_lock()
> >>                               change_protection()
> >>                               [ write-unprotect ]
> >>                               ...
> >>                                [ unprotect PTE logically ]
> >>                              ...
> >>                              [ page-fault]
> >>                              ...
> >>                              wp_page_copy()
> >>                              [ set new writable page in PTE]
> >
> > Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
> > is set and do an explicit (potentially spurious) tlb flush before
> > write-unprotect?
>
> There is a concrete scenario that I actually encountered and then there is a
> general problem.
>
> In general, the kernel code assumes that PTEs that are read from the
> page-tables are coherent across all the TLBs, excluding permission promotion
> (i.e., the PTE may have higher permissions in the page-tables than those
> that are cached in the TLBs).
>
> We therefore need to both: (a) protect change_protection_range() from the
> changes of others who might defer TLB flushes without taking mmap_sem for
> write (e.g., try_to_unmap_one()); and (b) to protect others (e.g.,
> page-fault handlers) from concurrent changes of change_protection().
>
> We have already encountered several similar bugs, and debugging such issues
> s time consuming and these bugs impact is substantial (memory corruption,
> security). So I think we should only stick to general solutions.
>
> So perhaps your the approach of your proposed solution is feasible, but it
> would have to be applied all over the place: we will need to add a check for
> mm_tlb_flush_pending() and conditionally flush the TLB in every case in
> which PTEs are read and there might be an assumption that the
> access-permission reflect what the TLBs hold. This includes page-fault
> handlers, but also NUMA migration code in change_protection(), softdirty
> cleanup in clear_refs_write() and maybe others.

I missed the beginning of this thread, but it looks to me like
userfaultfd changes PTEs with not locking except mmap_read_lock().  It
also calls inc_tlb_flush_pending(), which is very explicitly
documented as requiring the pagetable lock.  Those docs must be wrong,
because mprotect() uses the mmap_sem write lock, which is just fine,
but ISTM some kind of mutual exclusion with proper acquire/release
ordering is indeed needed.  So the userfaultfd code seems bogus.

I think userfaultfd either needs to take a real lock (probably doesn't
matter which) or the core rules about PTEs need to be rewritten.

Andrea Arcangeli Dec. 20, 2020, 2:20 a.m. UTC | #4

On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> > On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> > 
> > [ cc’ing some more people who have experience with similar problems ]
> > 
> >> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >> 
> >> Hello,
> >> 
> >> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>> Analyzing this problem indicates that there is a real bug since
> >>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >> 
> >> Never having to take the mmap_sem for writing, and in turn never
> >> blocking, in order to modify the pagetables is quite an important
> >> feature in uffd that justifies uffd instead of mprotect. It's not the
> >> most important reason to use uffd, but it'd be nice if that guarantee
> >> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >> other pgtable manipulations.
> >> 
> >>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>> 
> >>> cpu0				cpu1
> >>> ----				----
> >>> userfaultfd_writeprotect()
> >>> [ write-protecting ]
> >>> mwriteprotect_range()
> >>> mmap_read_lock()
> >>> change_protection()
> >>> change_protection_range()
> >>>  ...
> >>>  change_pte_range()
> >>>  [ defer TLB flushes]
> >>> 				userfaultfd_writeprotect()
> >>> 				 mmap_read_lock()
> >>> 				 change_protection()
> >>> 				 [ write-unprotect ]
> >>> 				 ...
> >>> 				  [ unprotect PTE logically ]

Is the uffd selftest failing with upstream or after your kernel
modification that removes the tlb flush from unprotect?

			} else if (uffd_wp_resolve) {
				/*
				 * Leave the write bit to be handled
				 * by PF interrupt handler, then
				 * things like COW could be properly
				 * handled.
				 */
				ptent = pte_clear_uffd_wp(ptent);
			}

Upstraem this will still do pages++, there's a tlb flush before
change_protection can return here, so I'm confused.

> >>> 				...
> >>> 				[ page-fault]
> >>> 				...
> >>> 				wp_page_copy()
> >>> 				[ set new writable page in PTE]
> >> 
> >> Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
> >> is set and do an explicit (potentially spurious) tlb flush before
> >> write-unprotect?
> > 
> > There is a concrete scenario that I actually encountered and then there is a
> > general problem.
> > 
> > In general, the kernel code assumes that PTEs that are read from the
> > page-tables are coherent across all the TLBs, excluding permission promotion
> > (i.e., the PTE may have higher permissions in the page-tables than those
> > that are cached in the TLBs).
> > 
> > We therefore need to both: (a) protect change_protection_range() from the
> > changes of others who might defer TLB flushes without taking mmap_sem for
> > write (e.g., try_to_unmap_one()); and (b) to protect others (e.g.,
> > page-fault handlers) from concurrent changes of change_protection().
> > 
> > We have already encountered several similar bugs, and debugging such issues
> > s time consuming and these bugs impact is substantial (memory corruption,
> > security). So I think we should only stick to general solutions.
> > 
> > So perhaps your the approach of your proposed solution is feasible, but it
> > would have to be applied all over the place: we will need to add a check for
> > mm_tlb_flush_pending() and conditionally flush the TLB in every case in
> > which PTEs are read and there might be an assumption that the
> > access-permission reflect what the TLBs hold. This includes page-fault
> > handlers, but also NUMA migration code in change_protection(), softdirty
> > cleanup in clear_refs_write() and maybe others.
> > 
> > [ I have in mind another solution, such as keeping in each page-table a 
> > “table-generation” which is the mm-generation at the time of the change,
> > and only flush if “table-generation”==“mm-generation”, but it requires
> > some thought on how to avoid adding new memory barriers. ]
> > 
> > IOW: I think the change that you suggest is insufficient, and a proper
> > solution is too intrusive for “stable".
> > 
> > As for performance, I can add another patch later to remove the TLB flush
> > that is unnecessarily performed during change_protection_range() that does
> > permission promotion. I know that your concern is about the “protect” case

You may want to check flush_tlb_fix_spurious_fault for other archs
before proceeding with your patch later, assuming it wasn't already applied.

> > but I cannot think of a good immediate solution that avoids taking mmap_lock
> > for write.
> > 
> > Thoughts?
> 
> On a second thought (i.e., I don’t know what I was thinking), doing so —
> checking mm_tlb_flush_pending() on every PTE read which is potentially

Note the part "if MM_CP_UFFD_WP_ALL is set" and probably just
MM_CP_UFFD_WP.

> dangerous and flushing if needed - can lead to huge amount of TLB flushes
> and shootodowns as the counter might be elevated for considerable amount of
> time.
> 
> So this solution seems to me as a no-go.

I don't share your concern. What matters is the PT lock, so it
wouldn't be one per pte, but a least an order 9 higher, but let's
assume one flush per pte.

It's either huge mapping and then it's likely running without other
tlb flushing in background (postcopy snapshotting), or it's a granular
protect with distributed shared memory in which case the number of
changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
matter if it's deferred.

I agree it may require a larger tlb flush review not just mprotect
though, but it didn't sound particularly complex. Note the
UFFDIO_WRITEPROTECT is still relatively recent so backports won't
risk to reject so heavy as to require a band-aid.

My second thought is, I don't see exactly the bug and it's not clear
if it's upstream reproducing this, but assuming this happens on
upstream, even ignoring everything else happening in the tlb flush
code, this sounds like purely introduced by userfaultfd_writeprotect()
vs userfaultfd_writeprotect() (since it's the only place changing
protection with mmap_sem for reading and note we already unmap and
flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
nothing new, the only new thing is the flush after wrprotect.

So instead of altering any tlb flush code, would it be possible to
just stick to mmap_lock for reading and then serialize
userfaultfd_writeprotect() against itself with an additional
mm->mmap_wprotect_lock mutex? That'd be a very local change to
userfaultfd too.

Can you look if the rule mmap_sem for reading plus a new
mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
wrprotecting ptes, is enough to comply with the current tlb flushing
code, so not to require any change non local to uffd (modulo the
additional mutex).

Thanks,
Andrea

Andrea Arcangeli Dec. 20, 2020, 2:49 a.m. UTC | #5

On Sat, Dec 19, 2020 at 06:01:39PM -0800, Andy Lutomirski wrote:
> I missed the beginning of this thread, but it looks to me like
> userfaultfd changes PTEs with not locking except mmap_read_lock().  It

There's no mmap_read_lock, I assume you mean mmap_lock for reading.

The ptes are changed always with the PT lock, in fact there's no
problem with the PTE updates. The only difference with mprotect
runtime is that the mmap_lock is taken for reading. And the effect
contested for this change doesn't affect the PTE, but supposedly the
tlb flushing deferral.

The change_protection_range is identical to what already happens with
zap_page_range. zap_page_range is called with mmap_lock for reading
in MADV_DONTNEED, and by munmap with mmap_lock for
writing. change_protection_range is called with mmap_lock for writing
by mprotect, and mmap_lock for reading by UFFDIO_WRITEPROTECT.

> also calls inc_tlb_flush_pending(), which is very explicitly
> documented as requiring the pagetable lock.  Those docs must be wrong,

The comment in inc_tlb_flush_pending() shows the pagetable lock is
taken after inc_tlb_flush_pending():

	 *	atomic_inc(&mm->tlb_flush_pending);
	 *	spin_lock(&ptl);

> because mprotect() uses the mmap_sem write lock, which is just fine,
> but ISTM some kind of mutual exclusion with proper acquire/release
> ordering is indeed needed.  So the userfaultfd code seems bogus.

If there's a bug, it'd be nice to fix without taking the mmap_lock for
writing.

The vma is guaranteed not modified, so I think it'd be pretty bad if
we had to give in the mmap_lock for writing just to wait for a tlb
flush that is issued deferred in the context of
userfaultfd_writeprotect.

> I think userfaultfd either needs to take a real lock (probably doesn't
> matter which) or the core rules about PTEs need to be rewritten.

It's not exactly clear how the do_wp_page could run on cpu1 before the
unprotect did an extra flush, I guess the trace needs one more
cpu/thread?

Anyway to wait the wrprotect to do the deferred flush, before the
unprotect can even start, one more mutex in the mm to take in all
callers of change_protection_range with the mmap_lock for reading may
be enough.

Thanks,
Andrea

Andy Lutomirski Dec. 20, 2020, 5:08 a.m. UTC | #6

On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Sat, Dec 19, 2020 at 06:01:39PM -0800, Andy Lutomirski wrote:
> > I missed the beginning of this thread, but it looks to me like
> > userfaultfd changes PTEs with not locking except mmap_read_lock().  It
>
> There's no mmap_read_lock, I assume you mean mmap_lock for reading.

Yes.

>
> The ptes are changed always with the PT lock, in fact there's no
> problem with the PTE updates. The only difference with mprotect
> runtime is that the mmap_lock is taken for reading. And the effect
> contested for this change doesn't affect the PTE, but supposedly the
> tlb flushing deferral.

Can you point me at where the lock ends up being taken in this path?
I apparently missed it somewhere.

> Anyway to wait the wrprotect to do the deferred flush, before the
> unprotect can even start, one more mutex in the mm to take in all
> callers of change_protection_range with the mmap_lock for reading may
> be enough.

I'll read the code again tomorrow.

Yu Zhao Dec. 20, 2020, 6:05 a.m. UTC | #7

On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
> [ cc’ing some more people who have experience with similar problems ]
> 
> > On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > Hello,
> > 
> > On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >> Analyzing this problem indicates that there is a real bug since
> >> mmap_lock is only taken for read in mwriteprotect_range(). This might
> > 
> > Never having to take the mmap_sem for writing, and in turn never
> > blocking, in order to modify the pagetables is quite an important
> > feature in uffd that justifies uffd instead of mprotect. It's not the
> > most important reason to use uffd, but it'd be nice if that guarantee
> > would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> > other pgtable manipulations.
> > 
> >> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >> 
> >> cpu0				cpu1
> >> ----				----
> >> userfaultfd_writeprotect()
> >> [ write-protecting ]
> >> mwriteprotect_range()
> >> mmap_read_lock()
> >> change_protection()
> >>  change_protection_range()
> >>   ...
> >>   change_pte_range()
> >>   [ defer TLB flushes]
> >> 				userfaultfd_writeprotect()
> >> 				 mmap_read_lock()
> >> 				 change_protection()
> >> 				 [ write-unprotect ]
> >> 				 ...
> >> 				  [ unprotect PTE logically ]
> >> 				...
> >> 				[ page-fault]
> >> 				...
> >> 				wp_page_copy()
> >> 				[ set new writable page in PTE]

I don't see any problem in this example -- wp_page_copy() calls
ptep_clear_flush_notify(), which should take care of the stale entry
left by cpu0.

That being said, I suspect the memory corruption you observed is
related this example, with cpu1 running something else that flushes
conditionally depending on pte_write().

Do you know which type of pages were corrupted? file, anon, etc.

> > Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
> > is set and do an explicit (potentially spurious) tlb flush before
> > write-unprotect?
> 
> There is a concrete scenario that I actually encountered and then there is a
> general problem.
> 
> In general, the kernel code assumes that PTEs that are read from the
> page-tables are coherent across all the TLBs, excluding permission promotion
> (i.e., the PTE may have higher permissions in the page-tables than those
> that are cached in the TLBs).
> 
> We therefore need to both: (a) protect change_protection_range() from the
> changes of others who might defer TLB flushes without taking mmap_sem for
> write (e.g., try_to_unmap_one()); and (b) to protect others (e.g.,
> page-fault handlers) from concurrent changes of change_protection().
> 
> We have already encountered several similar bugs, and debugging such issues
> s time consuming and these bugs impact is substantial (memory corruption,
> security). So I think we should only stick to general solutions.
> 
> So perhaps your the approach of your proposed solution is feasible, but it
> would have to be applied all over the place: we will need to add a check for
> mm_tlb_flush_pending() and conditionally flush the TLB in every case in
> which PTEs are read and there might be an assumption that the
> access-permission reflect what the TLBs hold. This includes page-fault
> handlers, but also NUMA migration code in change_protection(), softdirty
> cleanup in clear_refs_write() and maybe others.
> 
> [ I have in mind another solution, such as keeping in each page-table a 
> “table-generation” which is the mm-generation at the time of the change,
> and only flush if “table-generation”==“mm-generation”, but it requires
> some thought on how to avoid adding new memory barriers. ]
> 
> IOW: I think the change that you suggest is insufficient, and a proper
> solution is too intrusive for “stable".
> 
> As for performance, I can add another patch later to remove the TLB flush
> that is unnecessarily performed during change_protection_range() that does
> permission promotion. I know that your concern is about the “protect” case
> but I cannot think of a good immediate solution that avoids taking mmap_lock
> for write.
> 
> Thoughts?
>

Nadav Amit Dec. 20, 2020, 8:06 a.m. UTC | #8

> On Dec 19, 2020, at 10:05 PM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
>> [ cc’ing some more people who have experience with similar problems ]
>> 
>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>> 
>>> Hello,
>>> 
>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>>>> Analyzing this problem indicates that there is a real bug since
>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>>> 
>>> Never having to take the mmap_sem for writing, and in turn never
>>> blocking, in order to modify the pagetables is quite an important
>>> feature in uffd that justifies uffd instead of mprotect. It's not the
>>> most important reason to use uffd, but it'd be nice if that guarantee
>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
>>> other pgtable manipulations.
>>> 
>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>>> 
>>>> cpu0				cpu1
>>>> ----				----
>>>> userfaultfd_writeprotect()
>>>> [ write-protecting ]
>>>> mwriteprotect_range()
>>>> mmap_read_lock()
>>>> change_protection()
>>>> change_protection_range()
>>>>  ...
>>>>  change_pte_range()
>>>>  [ defer TLB flushes]
>>>> 				userfaultfd_writeprotect()
>>>> 				 mmap_read_lock()
>>>> 				 change_protection()
>>>> 				 [ write-unprotect ]
>>>> 				 ...
>>>> 				  [ unprotect PTE logically ]
>>>> 				...
>>>> 				[ page-fault]
>>>> 				...
>>>> 				wp_page_copy()
>>>> 				[ set new writable page in PTE]
> 
> I don't see any problem in this example -- wp_page_copy() calls
> ptep_clear_flush_notify(), which should take care of the stale entry
> left by cpu0.
> 
> That being said, I suspect the memory corruption you observed is
> related this example, with cpu1 running something else that flushes
> conditionally depending on pte_write().
> 
> Do you know which type of pages were corrupted? file, anon, etc.

First, Yu, you are correct. My analysis is incorrect, but let me have
another try (below). To answer your (and Andrea’s) question - this happens
with upstream without any changes, excluding a small fix to the selftest,
since it failed (got stuck) due to missing wake events. [1]

We are talking about anon memory.

So to correct myself, I think that what I really encountered was actually
during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
problem was that in this case the “write”-bit was removed during unprotect.
Sorry for the strange formatting to fit within 80 columns:


[ Start: PTE is writable ]

cpu0				cpu1			cpu2
----				----			----
							[ Writable PTE 
							  cached in TLB ]
userfaultfd_writeprotect()				
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
 ...
 change_pte_range()
 [ *clear* “write”-bit ]
 [ defer TLB flushes]
				[ page-fault ]
				…
				wp_page_copy()
				 cow_user_page()
				  [ copy page ]
							[ write to old
							  page ]
				…
				 set_pte_at_notify()

[ End: cpu2 write not copied form old to new page. ]


So this was actually resolved by the second part of the patch - changing
preserve_write in change_pte_range(). I removed the acquisition of mmap_lock
for write, left the change in change_pte_range() and the test passes.

Let me give some more thought on whether a mmap_lock is needed 
for write. I need to rehash this TLB flushing algorithm.

Thanks,
Nadav

[1] https://lore.kernel.org/patchwork/patch/1346386

Yu Zhao Dec. 20, 2020, 9:54 a.m. UTC | #9

On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
> > On Dec 19, 2020, at 10:05 PM, Yu Zhao <yuzhao@google.com> wrote:
> > 
> > On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
> >> [ cc’ing some more people who have experience with similar problems ]
> >> 
> >>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>>> Analyzing this problem indicates that there is a real bug since
> >>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >>> 
> >>> Never having to take the mmap_sem for writing, and in turn never
> >>> blocking, in order to modify the pagetables is quite an important
> >>> feature in uffd that justifies uffd instead of mprotect. It's not the
> >>> most important reason to use uffd, but it'd be nice if that guarantee
> >>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >>> other pgtable manipulations.
> >>> 
> >>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>> 
> >>>> cpu0				cpu1
> >>>> ----				----
> >>>> userfaultfd_writeprotect()
> >>>> [ write-protecting ]
> >>>> mwriteprotect_range()
> >>>> mmap_read_lock()
> >>>> change_protection()
> >>>> change_protection_range()
> >>>>  ...
> >>>>  change_pte_range()
> >>>>  [ defer TLB flushes]
> >>>> 				userfaultfd_writeprotect()
> >>>> 				 mmap_read_lock()
> >>>> 				 change_protection()
> >>>> 				 [ write-unprotect ]
> >>>> 				 ...
> >>>> 				  [ unprotect PTE logically ]
> >>>> 				...
> >>>> 				[ page-fault]
> >>>> 				...
> >>>> 				wp_page_copy()
> >>>> 				[ set new writable page in PTE]
> > 
> > I don't see any problem in this example -- wp_page_copy() calls
> > ptep_clear_flush_notify(), which should take care of the stale entry
> > left by cpu0.
> > 
> > That being said, I suspect the memory corruption you observed is
> > related this example, with cpu1 running something else that flushes
> > conditionally depending on pte_write().
> > 
> > Do you know which type of pages were corrupted? file, anon, etc.
> 
> First, Yu, you are correct. My analysis is incorrect, but let me have
> another try (below). To answer your (and Andrea’s) question - this happens
> with upstream without any changes, excluding a small fix to the selftest,
> since it failed (got stuck) due to missing wake events. [1]
> 
> We are talking about anon memory.
> 
> So to correct myself, I think that what I really encountered was actually
> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
> problem was that in this case the “write”-bit was removed during unprotect.

Thanks. You are right about when the problem happens: UFD write-
UNprotecting. But it's not UFD write-UNprotecting that removes the
writable bit -- the bit can only be removed during COW or UFD
write-protecting. So your original example was almost correct, except
the last line describing cpu1.

The problem is how do_wp_page() handles non-COW pages. (For COW pages,
do_wp_page() works correctly by either reusing an existing page or
make a new copy out of it.) In UFD case, the existing page may not
have been properly write-protected. As you pointed out, the tlb flush
may not be done yet. Making a copy can potentially race with the
writer on cpu2.

Should we fix the problem by ensuring integrity of the copy? IMO, no,
because do_wp_page() shouldn't copy at all in this case. It seems it
was recently broken by

  be068f29034f mm: fix misplaced unlock_page in do_wp_page()
  09854ba94c6a mm: do_wp_page() simplification

I haven't study them carefully. But if you could just revert them and
run the test again, we'd know where exactly to look at next.

> Sorry for the strange formatting to fit within 80 columns:
> 
> 
> [ Start: PTE is writable ]
> 
> cpu0				cpu1			cpu2
> ----				----			----
> 							[ Writable PTE 
> 							  cached in TLB ]
> userfaultfd_writeprotect()				
> [ write-*unprotect* ]
> mwriteprotect_range()
> mmap_read_lock()
> change_protection()
> 
> change_protection_range()
>  ...
>  change_pte_range()
>  [ *clear* “write”-bit ]
>  [ defer TLB flushes]
> 				[ page-fault ]
> 				…
> 				wp_page_copy()
> 				 cow_user_page()
> 				  [ copy page ]
> 							[ write to old
> 							  page ]
> 				…
> 				 set_pte_at_notify()
> 
> [ End: cpu2 write not copied form old to new page. ]
> 
> 
> So this was actually resolved by the second part of the patch - changing
> preserve_write in change_pte_range(). I removed the acquisition of mmap_lock
> for write, left the change in change_pte_range() and the test passes.
> 
> Let me give some more thought on whether a mmap_lock is needed 
> for write. I need to rehash this TLB flushing algorithm.
> 
> Thanks,
> Nadav
> 
> [1] https://lore.kernel.org/patchwork/patch/1346386

Nadav Amit Dec. 21, 2020, 3:33 a.m. UTC | #10

> On Dec 20, 2020, at 1:54 AM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
>>> On Dec 19, 2020, at 10:05 PM, Yu Zhao <yuzhao@google.com> wrote:
>>> 
>>> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
>>>> [ cc’ing some more people who have experience with similar problems ]
>>>> 
>>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>>>>>> Analyzing this problem indicates that there is a real bug since
>>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>>>>> 
>>>>> Never having to take the mmap_sem for writing, and in turn never
>>>>> blocking, in order to modify the pagetables is quite an important
>>>>> feature in uffd that justifies uffd instead of mprotect. It's not the
>>>>> most important reason to use uffd, but it'd be nice if that guarantee
>>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
>>>>> other pgtable manipulations.
>>>>> 
>>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>>>>> 
>>>>>> cpu0				cpu1
>>>>>> ----				----
>>>>>> userfaultfd_writeprotect()
>>>>>> [ write-protecting ]
>>>>>> mwriteprotect_range()
>>>>>> mmap_read_lock()
>>>>>> change_protection()
>>>>>> change_protection_range()
>>>>>> ...
>>>>>> change_pte_range()
>>>>>> [ defer TLB flushes]
>>>>>> 				userfaultfd_writeprotect()
>>>>>> 				 mmap_read_lock()
>>>>>> 				 change_protection()
>>>>>> 				 [ write-unprotect ]
>>>>>> 				 ...
>>>>>> 				  [ unprotect PTE logically ]
>>>>>> 				...
>>>>>> 				[ page-fault]
>>>>>> 				...
>>>>>> 				wp_page_copy()
>>>>>> 				[ set new writable page in PTE]
>>> 
>>> I don't see any problem in this example -- wp_page_copy() calls
>>> ptep_clear_flush_notify(), which should take care of the stale entry
>>> left by cpu0.
>>> 
>>> That being said, I suspect the memory corruption you observed is
>>> related this example, with cpu1 running something else that flushes
>>> conditionally depending on pte_write().
>>> 
>>> Do you know which type of pages were corrupted? file, anon, etc.
>> 
>> First, Yu, you are correct. My analysis is incorrect, but let me have
>> another try (below). To answer your (and Andrea’s) question - this happens
>> with upstream without any changes, excluding a small fix to the selftest,
>> since it failed (got stuck) due to missing wake events. [1]
>> 
>> We are talking about anon memory.
>> 
>> So to correct myself, I think that what I really encountered was actually
>> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
>> problem was that in this case the “write”-bit was removed during unprotect.
> 
> Thanks. You are right about when the problem happens: UFD write-
> UNprotecting. But it's not UFD write-UNprotecting that removes the
> writable bit -- the bit can only be removed during COW or UFD
> write-protecting. So your original example was almost correct, except
> the last line describing cpu1.

The scenario is a bit confusing, so stay with me. The idea behind uffd
unprotect is indeed only to mark the PTE logically as uffd-unprotected, and
not to *set* the writable bit, allowing the #PF handler to do COW or
whatever correctly upon #PF.

However, the problem that we have is that if a page is already writable,
write-unprotect *clears* the writable bit, making it write-protected (at
least for anonymous pages). This is not good from performance point-of-view,
but also a correctness issue, as I pointed out.

In some more detail: mwriteprotect_range() uses vm_get_page_prot() to
compute the new protection. For anonymous private memory, at least on x86,
this means the write-bit in the protection is clear. So later,
change_pte_range() *clears* the write-bit during *unprotection*.

That’s the reason the second part of my patch - the change to preserve_write
- fixes the problem.

> The problem is how do_wp_page() handles non-COW pages. (For COW pages,
> do_wp_page() works correctly by either reusing an existing page or
> make a new copy out of it.) In UFD case, the existing page may not
> have been properly write-protected. As you pointed out, the tlb flush
> may not be done yet. Making a copy can potentially race with the
> writer on cpu2.

Just to clarify the difference - You regard a scenario of UFFD
write-protect, while I am pretty sure the problem I encountered is during
write-unprotect.

I am not sure we are on the same page (but we may be). The problem I have is
with cow_user_page() that is called by do_wp_page() before any TLB flush
took place (either by change_protection_range() or by do_wp_page() which
does flush, but after the copy).

Let me know if you regard a different scenario.

> Should we fix the problem by ensuring integrity of the copy? IMO, no,
> because do_wp_page() shouldn't copy at all in this case. It seems it
> was recently broken by
> 
>  be068f29034f mm: fix misplaced unlock_page in do_wp_page()
>  09854ba94c6a mm: do_wp_page() simplification
> 
> I haven't study them carefully. But if you could just revert them and
> run the test again, we'd know where exactly to look at next.

These patches regard the wp_page_reuse() case, which makes me think we
are not on the same page. I do not see a problem with wp_page_reuse()
since it does not make a copy of the page. If you can explain what I
am missing, it would be great.

Nadav Amit Dec. 21, 2020, 4:36 a.m. UTC | #11

> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
>>> On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> 
>>> [ cc’ing some more people who have experience with similar problems ]
>>> 
>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>>>>> Analyzing this problem indicates that there is a real bug since
>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>>>> 
>>>> Never having to take the mmap_sem for writing, and in turn never
>>>> blocking, in order to modify the pagetables is quite an important
>>>> feature in uffd that justifies uffd instead of mprotect. It's not the
>>>> most important reason to use uffd, but it'd be nice if that guarantee
>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
>>>> other pgtable manipulations.
>>>> 
>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>>>> 
>>>>> cpu0				cpu1
>>>>> ----				----
>>>>> userfaultfd_writeprotect()
>>>>> [ write-protecting ]
>>>>> mwriteprotect_range()
>>>>> mmap_read_lock()
>>>>> change_protection()
>>>>> change_protection_range()
>>>>> ...
>>>>> change_pte_range()
>>>>> [ defer TLB flushes]
>>>>> 				userfaultfd_writeprotect()
>>>>> 				 mmap_read_lock()
>>>>> 				 change_protection()
>>>>> 				 [ write-unprotect ]
>>>>> 				 ...
>>>>> 				  [ unprotect PTE logically ]
> 
> Is the uffd selftest failing with upstream or after your kernel
> modification that removes the tlb flush from unprotect?

Please see my reply to Yu. I was wrong in this analysis, and I sent a
correction to my analysis. The problem actually happens when
userfaultfd_writeprotect() unprotects the memory.

> 			} else if (uffd_wp_resolve) {
> 				/*
> 				 * Leave the write bit to be handled
> 				 * by PF interrupt handler, then
> 				 * things like COW could be properly
> 				 * handled.
> 				 */
> 				ptent = pte_clear_uffd_wp(ptent);
> 			}
> 
> Upstraem this will still do pages++, there's a tlb flush before
> change_protection can return here, so I'm confused.
> 

You are correct. The problem I encountered with userfaultfd_writeprotect()
is during unprotecting path.

Having said that, I think that there are additional scenarios that are
problematic. Consider for instance madvise_dontneed_free() that is racing
with userfaultfd_writeprotect(). If madvise_dontneed_free() completed
removing the PTEs, but still did not flush, change_pte_range() will see
non-present PTEs, say a flush is not needed, and then
change_protection_range() will not do a flush, and return while
the memory is still not protected.

> I don't share your concern. What matters is the PT lock, so it
> wouldn't be one per pte, but a least an order 9 higher, but let's
> assume one flush per pte.
> 
> It's either huge mapping and then it's likely running without other
> tlb flushing in background (postcopy snapshotting), or it's a granular
> protect with distributed shared memory in which case the number of
> changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
> matter if it's deferred.
> 
> I agree it may require a larger tlb flush review not just mprotect
> though, but it didn't sound particularly complex. Note the
> UFFDIO_WRITEPROTECT is still relatively recent so backports won't
> risk to reject so heavy as to require a band-aid.
> 
> My second thought is, I don't see exactly the bug and it's not clear
> if it's upstream reproducing this, but assuming this happens on
> upstream, even ignoring everything else happening in the tlb flush
> code, this sounds like purely introduced by userfaultfd_writeprotect()
> vs userfaultfd_writeprotect() (since it's the only place changing
> protection with mmap_sem for reading and note we already unmap and
> flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
> the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
> nothing new, the only new thing is the flush after wrprotect.
> 
> So instead of altering any tlb flush code, would it be possible to
> just stick to mmap_lock for reading and then serialize
> userfaultfd_writeprotect() against itself with an additional
> mm->mmap_wprotect_lock mutex? That'd be a very local change to
> userfaultfd too.
> 
> Can you look if the rule mmap_sem for reading plus a new
> mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
> wrprotecting ptes, is enough to comply with the current tlb flushing
> code, so not to require any change non local to uffd (modulo the
> additional mutex).

So I did not fully understand your solution, but I took your point and
looked again on similar cases. To be fair, despite my experience with these
deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep
getting confused (e.g., can’t we somehow combine tlb_flush_batched and
tlb_flush_pending ?)

As I said before, my initial scenario was wrong, and the problem is not
userfaultfd_writeprotect() racing against itself. This one seems actually
benign to me.

Nevertheless, I do think there is a problem in change_protection_range().
Specifically, see the aforementioned scenario of a race between
madvise_dontneed_free() and userfaultfd_writeprotect().

So an immediate solution for such a case can be resolve without holding
mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in
change_protection_range():

        /*
	 * Only flush the TLB if we actually modified any entries
	 * or if there are pending TLB flushes.
	 */
        if (pages || mm_tlb_flush_nested(mm))
                flush_tlb_range(vma, start, end);
 
To be fair, I am not confident I did not miss other problematic cases.

But for now, this change, with the preserve_write change should address the
immediate issues. Let me know if you agree.

Let me know whether you agree.

Yu Zhao Dec. 21, 2020, 4:44 a.m. UTC | #12

On Sun, Dec 20, 2020 at 07:33:09PM -0800, Nadav Amit wrote:
> > On Dec 20, 2020, at 1:54 AM, Yu Zhao <yuzhao@google.com> wrote:
> > 
> > On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
> >>> On Dec 19, 2020, at 10:05 PM, Yu Zhao <yuzhao@google.com> wrote:
> >>> 
> >>> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
> >>>> [ cc’ing some more people who have experience with similar problems ]
> >>>> 
> >>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>>>> 
> >>>>> Hello,
> >>>>> 
> >>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>>>>> Analyzing this problem indicates that there is a real bug since
> >>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >>>>> 
> >>>>> Never having to take the mmap_sem for writing, and in turn never
> >>>>> blocking, in order to modify the pagetables is quite an important
> >>>>> feature in uffd that justifies uffd instead of mprotect. It's not the
> >>>>> most important reason to use uffd, but it'd be nice if that guarantee
> >>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >>>>> other pgtable manipulations.
> >>>>> 
> >>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>>>> 
> >>>>>> cpu0				cpu1
> >>>>>> ----				----
> >>>>>> userfaultfd_writeprotect()
> >>>>>> [ write-protecting ]
> >>>>>> mwriteprotect_range()
> >>>>>> mmap_read_lock()
> >>>>>> change_protection()
> >>>>>> change_protection_range()
> >>>>>> ...
> >>>>>> change_pte_range()
> >>>>>> [ defer TLB flushes]
> >>>>>> 				userfaultfd_writeprotect()
> >>>>>> 				 mmap_read_lock()
> >>>>>> 				 change_protection()
> >>>>>> 				 [ write-unprotect ]
> >>>>>> 				 ...
> >>>>>> 				  [ unprotect PTE logically ]
> >>>>>> 				...
> >>>>>> 				[ page-fault]
> >>>>>> 				...
> >>>>>> 				wp_page_copy()
> >>>>>> 				[ set new writable page in PTE]
> >>> 
> >>> I don't see any problem in this example -- wp_page_copy() calls
> >>> ptep_clear_flush_notify(), which should take care of the stale entry
> >>> left by cpu0.
> >>> 
> >>> That being said, I suspect the memory corruption you observed is
> >>> related this example, with cpu1 running something else that flushes
> >>> conditionally depending on pte_write().
> >>> 
> >>> Do you know which type of pages were corrupted? file, anon, etc.
> >> 
> >> First, Yu, you are correct. My analysis is incorrect, but let me have
> >> another try (below). To answer your (and Andrea’s) question - this happens
> >> with upstream without any changes, excluding a small fix to the selftest,
> >> since it failed (got stuck) due to missing wake events. [1]
> >> 
> >> We are talking about anon memory.
> >> 
> >> So to correct myself, I think that what I really encountered was actually
> >> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
> >> problem was that in this case the “write”-bit was removed during unprotect.
> > 
> > Thanks. You are right about when the problem happens: UFD write-
> > UNprotecting. But it's not UFD write-UNprotecting that removes the
> > writable bit -- the bit can only be removed during COW or UFD
> > write-protecting. So your original example was almost correct, except
> > the last line describing cpu1.
> 
> The scenario is a bit confusing, so stay with me. The idea behind uffd
> unprotect is indeed only to mark the PTE logically as uffd-unprotected, and
> not to *set* the writable bit, allowing the #PF handler to do COW or
> whatever correctly upon #PF.

Right.

> However, the problem that we have is that if a page is already writable,
> write-unprotect *clears* the writable bit, making it write-protected (at
> least for anonymous pages). This is not good from performance point-of-view,
> but also a correctness issue, as I pointed out.
> 
> In some more detail: mwriteprotect_range() uses vm_get_page_prot() to
> compute the new protection. For anonymous private memory, at least on x86,
> this means the write-bit in the protection is clear. So later,
> change_pte_range() *clears* the write-bit during *unprotection*.

Agreed.

> That’s the reason the second part of my patch - the change to preserve_write
> - fixes the problem.

Yes, it fixes a part of the problem.

But what if there is no writable bit there for you to preserve? If the
bit was cleared by COW, e.g., KSM, fork, etc., no problem, because they
guarantee the flush. If it was cleared by a priory UFD write-protecting,
you still would run into the same problem because of the deterred flush.
And you are trying to fix this part of the problem by grabbing
mmap_write_lock. I think I understand your patch correctly.

Both parts of the problem were introduced by the commits I listed, and
your patch does fix the problem. I'm just saying it's not an optimal fix
because for the second part of the problem:
1) there is no need to grab mmap_write_lock
2) there is no need to make a copy in do_wp_page() if the write bit was
cleared by UFD write-protecting (non-COW case).

The fix should be done in do_wp_page(), i.e., to handle non-COW pages
correctly. Preserving the write bit can be considered separately as an
optimization, not a fix. (It eliminates unnecessary page faults.)

> > The problem is how do_wp_page() handles non-COW pages. (For COW pages,
> > do_wp_page() works correctly by either reusing an existing page or
> > make a new copy out of it.) In UFD case, the existing page may not
> > have been properly write-protected. As you pointed out, the tlb flush
> > may not be done yet. Making a copy can potentially race with the
> > writer on cpu2.
> 
> Just to clarify the difference - You regard a scenario of UFFD
> write-protect, while I am pretty sure the problem I encountered is during
> write-unprotect.
> 
> I am not sure we are on the same page (but we may be). The problem I have is
> with cow_user_page() that is called by do_wp_page() before any TLB flush
> took place (either by change_protection_range() or by do_wp_page() which
> does flush, but after the copy).
> 
> Let me know if you regard a different scenario.
> 
> > Should we fix the problem by ensuring integrity of the copy? IMO, no,
> > because do_wp_page() shouldn't copy at all in this case. It seems it
> > was recently broken by
> > 
> >  be068f29034f mm: fix misplaced unlock_page in do_wp_page()
> >  09854ba94c6a mm: do_wp_page() simplification
> > 
> > I haven't study them carefully. But if you could just revert them and
> > run the test again, we'd know where exactly to look at next.
> 
> These patches regard the wp_page_reuse() case, which makes me think we
> are not on the same page. I do not see a problem with wp_page_reuse()
> since it does not make a copy of the page. If you can explain what I
> am missing, it would be great.
>

Yu Zhao Dec. 21, 2020, 5:12 a.m. UTC | #13

On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote:
> > On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> >>> On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> >>> 
> >>> [ cc’ing some more people who have experience with similar problems ]
> >>> 
> >>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>>> 
> >>>> Hello,
> >>>> 
> >>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>>>> Analyzing this problem indicates that there is a real bug since
> >>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >>>> 
> >>>> Never having to take the mmap_sem for writing, and in turn never
> >>>> blocking, in order to modify the pagetables is quite an important
> >>>> feature in uffd that justifies uffd instead of mprotect. It's not the
> >>>> most important reason to use uffd, but it'd be nice if that guarantee
> >>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >>>> other pgtable manipulations.
> >>>> 
> >>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>>> 
> >>>>> cpu0				cpu1
> >>>>> ----				----
> >>>>> userfaultfd_writeprotect()
> >>>>> [ write-protecting ]
> >>>>> mwriteprotect_range()
> >>>>> mmap_read_lock()
> >>>>> change_protection()
> >>>>> change_protection_range()
> >>>>> ...
> >>>>> change_pte_range()
> >>>>> [ defer TLB flushes]
> >>>>> 				userfaultfd_writeprotect()
> >>>>> 				 mmap_read_lock()
> >>>>> 				 change_protection()
> >>>>> 				 [ write-unprotect ]
> >>>>> 				 ...
> >>>>> 				  [ unprotect PTE logically ]
> > 
> > Is the uffd selftest failing with upstream or after your kernel
> > modification that removes the tlb flush from unprotect?
> 
> Please see my reply to Yu. I was wrong in this analysis, and I sent a
> correction to my analysis. The problem actually happens when
> userfaultfd_writeprotect() unprotects the memory.
> 
> > 			} else if (uffd_wp_resolve) {
> > 				/*
> > 				 * Leave the write bit to be handled
> > 				 * by PF interrupt handler, then
> > 				 * things like COW could be properly
> > 				 * handled.
> > 				 */
> > 				ptent = pte_clear_uffd_wp(ptent);
> > 			}
> > 
> > Upstraem this will still do pages++, there's a tlb flush before
> > change_protection can return here, so I'm confused.
> > 
> 
> You are correct. The problem I encountered with userfaultfd_writeprotect()
> is during unprotecting path.
> 
> Having said that, I think that there are additional scenarios that are
> problematic. Consider for instance madvise_dontneed_free() that is racing
> with userfaultfd_writeprotect(). If madvise_dontneed_free() completed
> removing the PTEs, but still did not flush, change_pte_range() will see
> non-present PTEs, say a flush is not needed, and then
> change_protection_range() will not do a flush, and return while
> the memory is still not protected.
> 
> > I don't share your concern. What matters is the PT lock, so it
> > wouldn't be one per pte, but a least an order 9 higher, but let's
> > assume one flush per pte.
> > 
> > It's either huge mapping and then it's likely running without other
> > tlb flushing in background (postcopy snapshotting), or it's a granular
> > protect with distributed shared memory in which case the number of
> > changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
> > matter if it's deferred.
> > 
> > I agree it may require a larger tlb flush review not just mprotect
> > though, but it didn't sound particularly complex. Note the
> > UFFDIO_WRITEPROTECT is still relatively recent so backports won't
> > risk to reject so heavy as to require a band-aid.
> > 
> > My second thought is, I don't see exactly the bug and it's not clear
> > if it's upstream reproducing this, but assuming this happens on
> > upstream, even ignoring everything else happening in the tlb flush
> > code, this sounds like purely introduced by userfaultfd_writeprotect()
> > vs userfaultfd_writeprotect() (since it's the only place changing
> > protection with mmap_sem for reading and note we already unmap and
> > flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
> > the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
> > nothing new, the only new thing is the flush after wrprotect.
> > 
> > So instead of altering any tlb flush code, would it be possible to
> > just stick to mmap_lock for reading and then serialize
> > userfaultfd_writeprotect() against itself with an additional
> > mm->mmap_wprotect_lock mutex? That'd be a very local change to
> > userfaultfd too.
> > 
> > Can you look if the rule mmap_sem for reading plus a new
> > mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
> > wrprotecting ptes, is enough to comply with the current tlb flushing
> > code, so not to require any change non local to uffd (modulo the
> > additional mutex).
> 
> So I did not fully understand your solution, but I took your point and
> looked again on similar cases. To be fair, despite my experience with these
> deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep
> getting confused (e.g., can’t we somehow combine tlb_flush_batched and
> tlb_flush_pending ?)
> 
> As I said before, my initial scenario was wrong, and the problem is not
> userfaultfd_writeprotect() racing against itself. This one seems actually
> benign to me.
> 
> Nevertheless, I do think there is a problem in change_protection_range().
> Specifically, see the aforementioned scenario of a race between
> madvise_dontneed_free() and userfaultfd_writeprotect().
> 
> So an immediate solution for such a case can be resolve without holding
> mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in
> change_protection_range():
> 
>         /*
> 	 * Only flush the TLB if we actually modified any entries
> 	 * or if there are pending TLB flushes.
> 	 */
>         if (pages || mm_tlb_flush_nested(mm))
>                 flush_tlb_range(vma, start, end);
>  
> To be fair, I am not confident I did not miss other problematic cases.
> 
> But for now, this change, with the preserve_write change should address the
> immediate issues. Let me know if you agree.
> 
> Let me know whether you agree.

The problem starts in UFD, and is related to tlb flush. But its focal
point is in do_wp_page(). I'd suggest you look at function and see
what it does before and after the commits I listed, with the following
conditions

PageAnon(), !PageKsm(), !PageSwapCache(), !pte_write(),
page_mapcount() = 1, page_count() > 1 or PageLocked()

when it runs against the two UFD examples you listed.

Nadav Amit Dec. 21, 2020, 5:25 a.m. UTC | #14

> On Dec 20, 2020, at 9:12 PM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote:
>>> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>> 
>>> On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
>>>>> On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>>>> 
>>>>> [ cc’ing some more people who have experience with similar problems ]
>>>>> 
>>>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>>>>>>> Analyzing this problem indicates that there is a real bug since
>>>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>>>>>> 
>>>>>> Never having to take the mmap_sem for writing, and in turn never
>>>>>> blocking, in order to modify the pagetables is quite an important
>>>>>> feature in uffd that justifies uffd instead of mprotect. It's not the
>>>>>> most important reason to use uffd, but it'd be nice if that guarantee
>>>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
>>>>>> other pgtable manipulations.
>>>>>> 
>>>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>>>>>> 
>>>>>>> cpu0				cpu1
>>>>>>> ----				----
>>>>>>> userfaultfd_writeprotect()
>>>>>>> [ write-protecting ]
>>>>>>> mwriteprotect_range()
>>>>>>> mmap_read_lock()
>>>>>>> change_protection()
>>>>>>> change_protection_range()
>>>>>>> ...
>>>>>>> change_pte_range()
>>>>>>> [ defer TLB flushes]
>>>>>>> 				userfaultfd_writeprotect()
>>>>>>> 				 mmap_read_lock()
>>>>>>> 				 change_protection()
>>>>>>> 				 [ write-unprotect ]
>>>>>>> 				 ...
>>>>>>> 				  [ unprotect PTE logically ]
>>> 
>>> Is the uffd selftest failing with upstream or after your kernel
>>> modification that removes the tlb flush from unprotect?
>> 
>> Please see my reply to Yu. I was wrong in this analysis, and I sent a
>> correction to my analysis. The problem actually happens when
>> userfaultfd_writeprotect() unprotects the memory.
>> 
>>> } else if (uffd_wp_resolve) {
>>> 				/*
>>> 				 * Leave the write bit to be handled
>>> 				 * by PF interrupt handler, then
>>> 				 * things like COW could be properly
>>> 				 * handled.
>>> 				 */
>>> 				ptent = pte_clear_uffd_wp(ptent);
>>> 			}
>>> 
>>> Upstraem this will still do pages++, there's a tlb flush before
>>> change_protection can return here, so I'm confused.
>> 
>> You are correct. The problem I encountered with userfaultfd_writeprotect()
>> is during unprotecting path.
>> 
>> Having said that, I think that there are additional scenarios that are
>> problematic. Consider for instance madvise_dontneed_free() that is racing
>> with userfaultfd_writeprotect(). If madvise_dontneed_free() completed
>> removing the PTEs, but still did not flush, change_pte_range() will see
>> non-present PTEs, say a flush is not needed, and then
>> change_protection_range() will not do a flush, and return while
>> the memory is still not protected.
>> 
>>> I don't share your concern. What matters is the PT lock, so it
>>> wouldn't be one per pte, but a least an order 9 higher, but let's
>>> assume one flush per pte.
>>> 
>>> It's either huge mapping and then it's likely running without other
>>> tlb flushing in background (postcopy snapshotting), or it's a granular
>>> protect with distributed shared memory in which case the number of
>>> changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
>>> matter if it's deferred.
>>> 
>>> I agree it may require a larger tlb flush review not just mprotect
>>> though, but it didn't sound particularly complex. Note the
>>> UFFDIO_WRITEPROTECT is still relatively recent so backports won't
>>> risk to reject so heavy as to require a band-aid.
>>> 
>>> My second thought is, I don't see exactly the bug and it's not clear
>>> if it's upstream reproducing this, but assuming this happens on
>>> upstream, even ignoring everything else happening in the tlb flush
>>> code, this sounds like purely introduced by userfaultfd_writeprotect()
>>> vs userfaultfd_writeprotect() (since it's the only place changing
>>> protection with mmap_sem for reading and note we already unmap and
>>> flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
>>> the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
>>> nothing new, the only new thing is the flush after wrprotect.
>>> 
>>> So instead of altering any tlb flush code, would it be possible to
>>> just stick to mmap_lock for reading and then serialize
>>> userfaultfd_writeprotect() against itself with an additional
>>> mm->mmap_wprotect_lock mutex? That'd be a very local change to
>>> userfaultfd too.
>>> 
>>> Can you look if the rule mmap_sem for reading plus a new
>>> mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
>>> wrprotecting ptes, is enough to comply with the current tlb flushing
>>> code, so not to require any change non local to uffd (modulo the
>>> additional mutex).
>> 
>> So I did not fully understand your solution, but I took your point and
>> looked again on similar cases. To be fair, despite my experience with these
>> deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep
>> getting confused (e.g., can’t we somehow combine tlb_flush_batched and
>> tlb_flush_pending ?)
>> 
>> As I said before, my initial scenario was wrong, and the problem is not
>> userfaultfd_writeprotect() racing against itself. This one seems actually
>> benign to me.
>> 
>> Nevertheless, I do think there is a problem in change_protection_range().
>> Specifically, see the aforementioned scenario of a race between
>> madvise_dontneed_free() and userfaultfd_writeprotect().
>> 
>> So an immediate solution for such a case can be resolve without holding
>> mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in
>> change_protection_range():
>> 
>>        /*
>> 	 * Only flush the TLB if we actually modified any entries
>> 	 * or if there are pending TLB flushes.
>> 	 */
>>        if (pages || mm_tlb_flush_nested(mm))
>>                flush_tlb_range(vma, start, end);
>> 
>> To be fair, I am not confident I did not miss other problematic cases.
>> 
>> But for now, this change, with the preserve_write change should address the
>> immediate issues. Let me know if you agree.
>> 
>> Let me know whether you agree.
> 
> The problem starts in UFD, and is related to tlb flush. But its focal
> point is in do_wp_page(). I'd suggest you look at function and see
> what it does before and after the commits I listed, with the following
> conditions
> 
> PageAnon(), !PageKsm(), !PageSwapCache(), !pte_write(),
> page_mapcount() = 1, page_count() > 1 or PageLocked()
> 
> when it runs against the two UFD examples you listed.

Thanks for your quick response. I wanted to write a lengthy response, but I
do want to sleep on it. I presume page_count() > 1, since I have multiple
concurrent page-faults on the same address in my test, but I will check.

Anyhow, before I give a further response, I was just wondering - since you
recently dealt with soft-dirty issue as I remember - isn't this problematic
COW for non-COW page scenario, in which the copy races with writes to a page
which is protected in the PTE but not in all TLB, also problematic for
soft-dirty clearing?

Nadav Amit Dec. 21, 2020, 5:39 a.m. UTC | #15

> On Dec 20, 2020, at 9:25 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>> On Dec 20, 2020, at 9:12 PM, Yu Zhao <yuzhao@google.com> wrote:
>> 
>> On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote:
>>>> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>>> 
>>>> On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
>>>>>> On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>>>>> 
>>>>>> [ cc’ing some more people who have experience with similar problems ]
>>>>>> 
>>>>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>>>>>>>> Analyzing this problem indicates that there is a real bug since
>>>>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>>>>>>> 
>>>>>>> Never having to take the mmap_sem for writing, and in turn never
>>>>>>> blocking, in order to modify the pagetables is quite an important
>>>>>>> feature in uffd that justifies uffd instead of mprotect. It's not the
>>>>>>> most important reason to use uffd, but it'd be nice if that guarantee
>>>>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
>>>>>>> other pgtable manipulations.
>>>>>>> 
>>>>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>>>>>>> 
>>>>>>>> cpu0				cpu1
>>>>>>>> ----				----
>>>>>>>> userfaultfd_writeprotect()
>>>>>>>> [ write-protecting ]
>>>>>>>> mwriteprotect_range()
>>>>>>>> mmap_read_lock()
>>>>>>>> change_protection()
>>>>>>>> change_protection_range()
>>>>>>>> ...
>>>>>>>> change_pte_range()
>>>>>>>> [ defer TLB flushes]
>>>>>>>> 				userfaultfd_writeprotect()
>>>>>>>> 				 mmap_read_lock()
>>>>>>>> 				 change_protection()
>>>>>>>> 				 [ write-unprotect ]
>>>>>>>> 				 ...
>>>>>>>> 				  [ unprotect PTE logically ]
>>>> 
>>>> Is the uffd selftest failing with upstream or after your kernel
>>>> modification that removes the tlb flush from unprotect?
>>> 
>>> Please see my reply to Yu. I was wrong in this analysis, and I sent a
>>> correction to my analysis. The problem actually happens when
>>> userfaultfd_writeprotect() unprotects the memory.
>>> 
>>>> } else if (uffd_wp_resolve) {
>>>> 				/*
>>>> 				 * Leave the write bit to be handled
>>>> 				 * by PF interrupt handler, then
>>>> 				 * things like COW could be properly
>>>> 				 * handled.
>>>> 				 */
>>>> 				ptent = pte_clear_uffd_wp(ptent);
>>>> 			}
>>>> 
>>>> Upstraem this will still do pages++, there's a tlb flush before
>>>> change_protection can return here, so I'm confused.
>>> 
>>> You are correct. The problem I encountered with userfaultfd_writeprotect()
>>> is during unprotecting path.
>>> 
>>> Having said that, I think that there are additional scenarios that are
>>> problematic. Consider for instance madvise_dontneed_free() that is racing
>>> with userfaultfd_writeprotect(). If madvise_dontneed_free() completed
>>> removing the PTEs, but still did not flush, change_pte_range() will see
>>> non-present PTEs, say a flush is not needed, and then
>>> change_protection_range() will not do a flush, and return while
>>> the memory is still not protected.
>>> 
>>>> I don't share your concern. What matters is the PT lock, so it
>>>> wouldn't be one per pte, but a least an order 9 higher, but let's
>>>> assume one flush per pte.
>>>> 
>>>> It's either huge mapping and then it's likely running without other
>>>> tlb flushing in background (postcopy snapshotting), or it's a granular
>>>> protect with distributed shared memory in which case the number of
>>>> changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
>>>> matter if it's deferred.
>>>> 
>>>> I agree it may require a larger tlb flush review not just mprotect
>>>> though, but it didn't sound particularly complex. Note the
>>>> UFFDIO_WRITEPROTECT is still relatively recent so backports won't
>>>> risk to reject so heavy as to require a band-aid.
>>>> 
>>>> My second thought is, I don't see exactly the bug and it's not clear
>>>> if it's upstream reproducing this, but assuming this happens on
>>>> upstream, even ignoring everything else happening in the tlb flush
>>>> code, this sounds like purely introduced by userfaultfd_writeprotect()
>>>> vs userfaultfd_writeprotect() (since it's the only place changing
>>>> protection with mmap_sem for reading and note we already unmap and
>>>> flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
>>>> the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
>>>> nothing new, the only new thing is the flush after wrprotect.
>>>> 
>>>> So instead of altering any tlb flush code, would it be possible to
>>>> just stick to mmap_lock for reading and then serialize
>>>> userfaultfd_writeprotect() against itself with an additional
>>>> mm->mmap_wprotect_lock mutex? That'd be a very local change to
>>>> userfaultfd too.
>>>> 
>>>> Can you look if the rule mmap_sem for reading plus a new
>>>> mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
>>>> wrprotecting ptes, is enough to comply with the current tlb flushing
>>>> code, so not to require any change non local to uffd (modulo the
>>>> additional mutex).
>>> 
>>> So I did not fully understand your solution, but I took your point and
>>> looked again on similar cases. To be fair, despite my experience with these
>>> deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep
>>> getting confused (e.g., can’t we somehow combine tlb_flush_batched and
>>> tlb_flush_pending ?)
>>> 
>>> As I said before, my initial scenario was wrong, and the problem is not
>>> userfaultfd_writeprotect() racing against itself. This one seems actually
>>> benign to me.
>>> 
>>> Nevertheless, I do think there is a problem in change_protection_range().
>>> Specifically, see the aforementioned scenario of a race between
>>> madvise_dontneed_free() and userfaultfd_writeprotect().
>>> 
>>> So an immediate solution for such a case can be resolve without holding
>>> mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in
>>> change_protection_range():
>>> 
>>>       /*
>>> 	 * Only flush the TLB if we actually modified any entries
>>> 	 * or if there are pending TLB flushes.
>>> 	 */
>>>       if (pages || mm_tlb_flush_nested(mm))
>>>               flush_tlb_range(vma, start, end);
>>> 
>>> To be fair, I am not confident I did not miss other problematic cases.
>>> 
>>> But for now, this change, with the preserve_write change should address the
>>> immediate issues. Let me know if you agree.
>>> 
>>> Let me know whether you agree.
>> 
>> The problem starts in UFD, and is related to tlb flush. But its focal
>> point is in do_wp_page(). I'd suggest you look at function and see
>> what it does before and after the commits I listed, with the following
>> conditions
>> 
>> PageAnon(), !PageKsm(), !PageSwapCache(), !pte_write(),
>> page_mapcount() = 1, page_count() > 1 or PageLocked()
>> 
>> when it runs against the two UFD examples you listed.
> 
> Thanks for your quick response. I wanted to write a lengthy response, but I
> do want to sleep on it. I presume page_count() > 1, since I have multiple
> concurrent page-faults on the same address in my test, but I will check.
> 
> Anyhow, before I give a further response, I was just wondering - since you
> recently dealt with soft-dirty issue as I remember - isn't this problematic
> COW for non-COW page scenario, in which the copy races with writes to a page
> which is protected in the PTE but not in all TLB, also problematic for
> soft-dirty clearing?

Stupid me. You hold mmap_lock for write, so no, it cannot happen when clear
soft-dirty.

Yu Zhao Dec. 21, 2020, 7:29 a.m. UTC | #16

On Sun, Dec 20, 2020 at 09:39:06PM -0800, Nadav Amit wrote:
> > On Dec 20, 2020, at 9:25 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> > 
> >> On Dec 20, 2020, at 9:12 PM, Yu Zhao <yuzhao@google.com> wrote:
> >> 
> >> On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote:
> >>>> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>>> 
> >>>> On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> >>>>>> On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> >>>>>> 
> >>>>>> [ cc’ing some more people who have experience with similar problems ]
> >>>>>> 
> >>>>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> Hello,
> >>>>>>> 
> >>>>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>>>>>>> Analyzing this problem indicates that there is a real bug since
> >>>>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >>>>>>> 
> >>>>>>> Never having to take the mmap_sem for writing, and in turn never
> >>>>>>> blocking, in order to modify the pagetables is quite an important
> >>>>>>> feature in uffd that justifies uffd instead of mprotect. It's not the
> >>>>>>> most important reason to use uffd, but it'd be nice if that guarantee
> >>>>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >>>>>>> other pgtable manipulations.
> >>>>>>> 
> >>>>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>>>>>> 
> >>>>>>>> cpu0				cpu1
> >>>>>>>> ----				----
> >>>>>>>> userfaultfd_writeprotect()
> >>>>>>>> [ write-protecting ]
> >>>>>>>> mwriteprotect_range()
> >>>>>>>> mmap_read_lock()
> >>>>>>>> change_protection()
> >>>>>>>> change_protection_range()
> >>>>>>>> ...
> >>>>>>>> change_pte_range()
> >>>>>>>> [ defer TLB flushes]
> >>>>>>>> 				userfaultfd_writeprotect()
> >>>>>>>> 				 mmap_read_lock()
> >>>>>>>> 				 change_protection()
> >>>>>>>> 				 [ write-unprotect ]
> >>>>>>>> 				 ...
> >>>>>>>> 				  [ unprotect PTE logically ]
> >>>> 
> >>>> Is the uffd selftest failing with upstream or after your kernel
> >>>> modification that removes the tlb flush from unprotect?
> >>> 
> >>> Please see my reply to Yu. I was wrong in this analysis, and I sent a
> >>> correction to my analysis. The problem actually happens when
> >>> userfaultfd_writeprotect() unprotects the memory.
> >>> 
> >>>> } else if (uffd_wp_resolve) {
> >>>> 				/*
> >>>> 				 * Leave the write bit to be handled
> >>>> 				 * by PF interrupt handler, then
> >>>> 				 * things like COW could be properly
> >>>> 				 * handled.
> >>>> 				 */
> >>>> 				ptent = pte_clear_uffd_wp(ptent);
> >>>> 			}
> >>>> 
> >>>> Upstraem this will still do pages++, there's a tlb flush before
> >>>> change_protection can return here, so I'm confused.
> >>> 
> >>> You are correct. The problem I encountered with userfaultfd_writeprotect()
> >>> is during unprotecting path.
> >>> 
> >>> Having said that, I think that there are additional scenarios that are
> >>> problematic. Consider for instance madvise_dontneed_free() that is racing
> >>> with userfaultfd_writeprotect(). If madvise_dontneed_free() completed
> >>> removing the PTEs, but still did not flush, change_pte_range() will see
> >>> non-present PTEs, say a flush is not needed, and then
> >>> change_protection_range() will not do a flush, and return while
> >>> the memory is still not protected.
> >>> 
> >>>> I don't share your concern. What matters is the PT lock, so it
> >>>> wouldn't be one per pte, but a least an order 9 higher, but let's
> >>>> assume one flush per pte.
> >>>> 
> >>>> It's either huge mapping and then it's likely running without other
> >>>> tlb flushing in background (postcopy snapshotting), or it's a granular
> >>>> protect with distributed shared memory in which case the number of
> >>>> changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
> >>>> matter if it's deferred.
> >>>> 
> >>>> I agree it may require a larger tlb flush review not just mprotect
> >>>> though, but it didn't sound particularly complex. Note the
> >>>> UFFDIO_WRITEPROTECT is still relatively recent so backports won't
> >>>> risk to reject so heavy as to require a band-aid.
> >>>> 
> >>>> My second thought is, I don't see exactly the bug and it's not clear
> >>>> if it's upstream reproducing this, but assuming this happens on
> >>>> upstream, even ignoring everything else happening in the tlb flush
> >>>> code, this sounds like purely introduced by userfaultfd_writeprotect()
> >>>> vs userfaultfd_writeprotect() (since it's the only place changing
> >>>> protection with mmap_sem for reading and note we already unmap and
> >>>> flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
> >>>> the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
> >>>> nothing new, the only new thing is the flush after wrprotect.
> >>>> 
> >>>> So instead of altering any tlb flush code, would it be possible to
> >>>> just stick to mmap_lock for reading and then serialize
> >>>> userfaultfd_writeprotect() against itself with an additional
> >>>> mm->mmap_wprotect_lock mutex? That'd be a very local change to
> >>>> userfaultfd too.
> >>>> 
> >>>> Can you look if the rule mmap_sem for reading plus a new
> >>>> mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
> >>>> wrprotecting ptes, is enough to comply with the current tlb flushing
> >>>> code, so not to require any change non local to uffd (modulo the
> >>>> additional mutex).
> >>> 
> >>> So I did not fully understand your solution, but I took your point and
> >>> looked again on similar cases. To be fair, despite my experience with these
> >>> deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep
> >>> getting confused (e.g., can’t we somehow combine tlb_flush_batched and
> >>> tlb_flush_pending ?)
> >>> 
> >>> As I said before, my initial scenario was wrong, and the problem is not
> >>> userfaultfd_writeprotect() racing against itself. This one seems actually
> >>> benign to me.
> >>> 
> >>> Nevertheless, I do think there is a problem in change_protection_range().
> >>> Specifically, see the aforementioned scenario of a race between
> >>> madvise_dontneed_free() and userfaultfd_writeprotect().
> >>> 
> >>> So an immediate solution for such a case can be resolve without holding
> >>> mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in
> >>> change_protection_range():
> >>> 
> >>>       /*
> >>> 	 * Only flush the TLB if we actually modified any entries
> >>> 	 * or if there are pending TLB flushes.
> >>> 	 */
> >>>       if (pages || mm_tlb_flush_nested(mm))
> >>>               flush_tlb_range(vma, start, end);
> >>> 
> >>> To be fair, I am not confident I did not miss other problematic cases.
> >>> 
> >>> But for now, this change, with the preserve_write change should address the
> >>> immediate issues. Let me know if you agree.
> >>> 
> >>> Let me know whether you agree.
> >> 
> >> The problem starts in UFD, and is related to tlb flush. But its focal
> >> point is in do_wp_page(). I'd suggest you look at function and see
> >> what it does before and after the commits I listed, with the following
> >> conditions
> >> 
> >> PageAnon(), !PageKsm(), !PageSwapCache(), !pte_write(),
> >> page_mapcount() = 1, page_count() > 1 or PageLocked()
> >> 
> >> when it runs against the two UFD examples you listed.
> > 
> > Thanks for your quick response. I wanted to write a lengthy response, but I
> > do want to sleep on it. I presume page_count() > 1, since I have multiple
> > concurrent page-faults on the same address in my test, but I will check.
> > 
> > Anyhow, before I give a further response, I was just wondering - since you
> > recently dealt with soft-dirty issue as I remember - isn't this problematic
> > COW for non-COW page scenario, in which the copy races with writes to a page
> > which is protected in the PTE but not in all TLB, also problematic for
> > soft-dirty clearing?

Yes, it has the same problem.

> Stupid me. You hold mmap_lock for write, so no, it cannot happen when clear
> soft-dirty.

mmap_write_lock is temporarily held to update vm_page_prot for write
notifications. It doesn't help in the context of this problem.

Peter Xu Dec. 21, 2020, 5:27 p.m. UTC | #17

Hi, Nadav,

On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:

[...]

> So to correct myself, I think that what I really encountered was actually
> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
> problem was that in this case the “write”-bit was removed during unprotect.
> Sorry for the strange formatting to fit within 80 columns:

I assume I can ignore the race mentioned in the commit message but only refer
to this one below.  However I'm still confused.  Please see below.

> 
> 
> [ Start: PTE is writable ]
> 
> cpu0				cpu1			cpu2
> ----				----			----
> 							[ Writable PTE 
> 							  cached in TLB ]

Here cpu2 got writable pte in tlb.  But why?

If below is an unprotect, it means it must have been protected once by
userfaultfd, right?  If so, the previous change_protection_range() which did
the wr-protect should have done a tlb flush already before it returns (since
pages>0 - we protected one pte at least).  Then I can't see why cpu2 tlb has
stall data.

If I assume cpu2 doesn't have that cached tlb, then "write to old page" won't
happen either, because cpu1/cpu2 will all go through the cow path and pgtable
lock should serialize them.

> userfaultfd_writeprotect()				
> [ write-*unprotect* ]
> mwriteprotect_range()
> mmap_read_lock()
> change_protection()
> 
> change_protection_range()
>  ...
>  change_pte_range()
>  [ *clear* “write”-bit ]
>  [ defer TLB flushes]
> 				[ page-fault ]
> 				…
> 				wp_page_copy()
> 				 cow_user_page()
> 				  [ copy page ]
> 							[ write to old
> 							  page ]
> 				…
> 				 set_pte_at_notify()
> 
> [ End: cpu2 write not copied form old to new page. ]

Could you share how to reproduce the problem?  I would be glad to give it a
shot as well.

> [1] https://lore.kernel.org/patchwork/patch/1346386

PS: Sorry to not have read the other series of yours.  It seems to need some
chunk of time so I postponed it a bit due to other things; but I'll read at
least the fixes very soon.

Thanks,

Andrea Arcangeli Dec. 21, 2020, 6:03 p.m. UTC | #18

Hello,

On Sat, Dec 19, 2020 at 09:08:55PM -0800, Andy Lutomirski wrote:
> On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> > The ptes are changed always with the PT lock, in fact there's no
> > problem with the PTE updates. The only difference with mprotect
> > runtime is that the mmap_lock is taken for reading. And the effect
> > contested for this change doesn't affect the PTE, but supposedly the
> > tlb flushing deferral.
> 
> Can you point me at where the lock ends up being taken in this path?

pte_offset_map_lock in change_pte_range, as in mprotect, no difference.

As I suspected on my follow up, the bug described wasn't there, but
I'll look at the new theory posted.

Thanks,
Andrea

Andy Lutomirski Dec. 21, 2020, 6:22 p.m. UTC | #19

On Mon, Dec 21, 2020 at 10:04 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello,
>
> On Sat, Dec 19, 2020 at 09:08:55PM -0800, Andy Lutomirski wrote:
> > On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> > > The ptes are changed always with the PT lock, in fact there's no
> > > problem with the PTE updates. The only difference with mprotect
> > > runtime is that the mmap_lock is taken for reading. And the effect
> > > contested for this change doesn't affect the PTE, but supposedly the
> > > tlb flushing deferral.
> >
> > Can you point me at where the lock ends up being taken in this path?
>
> pte_offset_map_lock in change_pte_range, as in mprotect, no difference.
>
> As I suspected on my follow up, the bug described wasn't there, but
> I'll look at the new theory posted.

Indeed.

Nadav Amit Dec. 21, 2020, 6:31 p.m. UTC | #20

> On Dec 21, 2020, at 9:27 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> Hi, Nadav,
> 
> On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
> 
> [...]
> 
>> So to correct myself, I think that what I really encountered was actually
>> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
>> problem was that in this case the “write”-bit was removed during unprotect.
>> Sorry for the strange formatting to fit within 80 columns:
> 
> I assume I can ignore the race mentioned in the commit message but only refer
> to this one below.  However I'm still confused.  Please see below.
> 
>> [ Start: PTE is writable ]
>> 
>> cpu0				cpu1			cpu2
>> ----				----			----
>> 							[ Writable PTE 
>> 							  cached in TLB ]
> 
> Here cpu2 got writable pte in tlb.  But why?
> 
> If below is an unprotect, it means it must have been protected once by
> userfaultfd, right?  If so, the previous change_protection_range() which did
> the wr-protect should have done a tlb flush already before it returns (since
> pages>0 - we protected one pte at least).  Then I can't see why cpu2 tlb has
> stall data.

Thanks, Peter. Just as you can munprotect() a region which was not protected
before, you can ufff-unprotect a region that was not protected before. It
might be that the user tried to unprotect a large region, which was
partially protected and partially unprotected.

The selftest obviously blindly unprotect some regions to check for bugs.

So to your question - it was not write-protected (think about initial copy
without write-protecting).

> If I assume cpu2 doesn't have that cached tlb, then "write to old page" won't
> happen either, because cpu1/cpu2 will all go through the cow path and pgtable
> lock should serialize them.
> 
>> userfaultfd_writeprotect()				
>> [ write-*unprotect* ]
>> mwriteprotect_range()
>> mmap_read_lock()
>> change_protection()
>> 
>> change_protection_range()
>> ...
>> change_pte_range()
>> [ *clear* “write”-bit ]
>> [ defer TLB flushes]
>> 				[ page-fault ]
>> 				…
>> 				wp_page_copy()
>> 				 cow_user_page()
>> 				  [ copy page ]
>> 							[ write to old
>> 							  page ]
>> 				…
>> 				 set_pte_at_notify()
>> 
>> [ End: cpu2 write not copied form old to new page. ]
> 
> Could you share how to reproduce the problem?  I would be glad to give it a
> shot as well.

You can run the selftests/userfaultfd with my small patch [1]. I ran it with
the following parameters: “ ./userfaultfd anon 100 100 “. I think that it is
more easily reproducible with “mitigations=off idle=poll” as kernel
parameters.

[1] https://lore.kernel.org/patchwork/patch/1346386/

> 
>> [1] https://lore.kernel.org/patchwork/patch/1346386
> 
> PS: Sorry to not have read the other series of yours.  It seems to need some
> chunk of time so I postponed it a bit due to other things; but I'll read at
> least the fixes very soon.

Thanks again, I will post RFCv2 with some numbers soon.

Yu Zhao Dec. 21, 2020, 7:16 p.m. UTC | #21

On Mon, Dec 21, 2020 at 10:31:57AM -0800, Nadav Amit wrote:
> > On Dec 21, 2020, at 9:27 AM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > Hi, Nadav,
> > 
> > On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
> > 
> > [...]
> > 
> >> So to correct myself, I think that what I really encountered was actually
> >> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
> >> problem was that in this case the “write”-bit was removed during unprotect.
> >> Sorry for the strange formatting to fit within 80 columns:
> > 
> > I assume I can ignore the race mentioned in the commit message but only refer
> > to this one below.  However I'm still confused.  Please see below.
> > 
> >> [ Start: PTE is writable ]
> >> 
> >> cpu0				cpu1			cpu2
> >> ----				----			----
> >> 							[ Writable PTE 
> >> 							  cached in TLB ]
> > 
> > Here cpu2 got writable pte in tlb.  But why?
> > 
> > If below is an unprotect, it means it must have been protected once by
> > userfaultfd, right?  If so, the previous change_protection_range() which did
> > the wr-protect should have done a tlb flush already before it returns (since
> > pages>0 - we protected one pte at least).  Then I can't see why cpu2 tlb has
> > stall data.
> 
> Thanks, Peter. Just as you can munprotect() a region which was not protected
> before, you can ufff-unprotect a region that was not protected before. It
> might be that the user tried to unprotect a large region, which was
> partially protected and partially unprotected.
> 
> The selftest obviously blindly unprotect some regions to check for bugs.
> 
> So to your question - it was not write-protected (think about initial copy
> without write-protecting).
> 
> > If I assume cpu2 doesn't have that cached tlb, then "write to old page" won't
> > happen either, because cpu1/cpu2 will all go through the cow path and pgtable
> > lock should serialize them.
> > 
> >> userfaultfd_writeprotect()				
> >> [ write-*unprotect* ]
> >> mwriteprotect_range()
> >> mmap_read_lock()
> >> change_protection()
> >> 
> >> change_protection_range()
> >> ...
> >> change_pte_range()
> >> [ *clear* “write”-bit ]
> >> [ defer TLB flushes]
> >> 				[ page-fault ]
> >> 				…
> >> 				wp_page_copy()
> >> 				 cow_user_page()
> >> 				  [ copy page ]
> >> 							[ write to old
> >> 							  page ]
> >> 				…
> >> 				 set_pte_at_notify()
> >> 
> >> [ End: cpu2 write not copied form old to new page. ]
> > 
> > Could you share how to reproduce the problem?  I would be glad to give it a
> > shot as well.
> 
> You can run the selftests/userfaultfd with my small patch [1]. I ran it with
> the following parameters: “ ./userfaultfd anon 100 100 “. I think that it is
> more easily reproducible with “mitigations=off idle=poll” as kernel
> parameters.
> 
> [1] https://lore.kernel.org/patchwork/patch/1346386/

Hi Linus,

Nadav Amit found memory corruptions when running userfaultfd test above.
It seems to me the problem is related to commit 09854ba94c6a ("mm:
do_wp_page() simplification"). Can you please take a look? Thanks.

TL;DR: it may not safe to make copies of singly mapped (non-COW) pages
when it's locked or has additional ref count because concurrent
clear_soft_dirty or change_pte_range may have removed pte_write but yet
to flush tlb.

Peter Xu Dec. 21, 2020, 7:53 p.m. UTC | #22

On Mon, Dec 21, 2020 at 10:31:57AM -0800, Nadav Amit wrote:
> > On Dec 21, 2020, at 9:27 AM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > Hi, Nadav,
> > 
> > On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
> > 
> > [...]
> > 
> >> So to correct myself, I think that what I really encountered was actually
> >> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
> >> problem was that in this case the “write”-bit was removed during unprotect.
> >> Sorry for the strange formatting to fit within 80 columns:
> > 
> > I assume I can ignore the race mentioned in the commit message but only refer
> > to this one below.  However I'm still confused.  Please see below.
> > 
> >> [ Start: PTE is writable ]
> >> 
> >> cpu0				cpu1			cpu2
> >> ----				----			----
> >> 							[ Writable PTE 
> >> 							  cached in TLB ]
> > 
> > Here cpu2 got writable pte in tlb.  But why?
> > 
> > If below is an unprotect, it means it must have been protected once by
> > userfaultfd, right?  If so, the previous change_protection_range() which did
> > the wr-protect should have done a tlb flush already before it returns (since
> > pages>0 - we protected one pte at least).  Then I can't see why cpu2 tlb has
> > stall data.
> 
> Thanks, Peter. Just as you can munprotect() a region which was not protected
> before, you can ufff-unprotect a region that was not protected before. It
> might be that the user tried to unprotect a large region, which was
> partially protected and partially unprotected.
> 
> The selftest obviously blindly unprotect some regions to check for bugs.
> 
> So to your question - it was not write-protected (think about initial copy
> without write-protecting).

If that's the only case, how about we don't touch the ptes at all? Instead of
playing with preserve_write, I'm thinking something like this right before
ptep_modify_prot_start(), even for uffd_wp==true:

  if (uffd_wp && pte_uffd_wp(old_pte)) {
    WARN_ON_ONCE(pte_write(old_pte));
    continue;
  }

  if (uffd_wp_resolve && !pte_uffd_wp(old_pte))
      continue;

Then we can also avoid the heavy operations on changing ptes back and forth.

> 
> > If I assume cpu2 doesn't have that cached tlb, then "write to old page" won't
> > happen either, because cpu1/cpu2 will all go through the cow path and pgtable
> > lock should serialize them.
> > 
> >> userfaultfd_writeprotect()				
> >> [ write-*unprotect* ]
> >> mwriteprotect_range()
> >> mmap_read_lock()
> >> change_protection()
> >> 
> >> change_protection_range()
> >> ...
> >> change_pte_range()
> >> [ *clear* “write”-bit ]
> >> [ defer TLB flushes]
> >> 				[ page-fault ]
> >> 				…
> >> 				wp_page_copy()
> >> 				 cow_user_page()
> >> 				  [ copy page ]
> >> 							[ write to old
> >> 							  page ]
> >> 				…
> >> 				 set_pte_at_notify()
> >> 
> >> [ End: cpu2 write not copied form old to new page. ]
> > 
> > Could you share how to reproduce the problem?  I would be glad to give it a
> > shot as well.
> 
> You can run the selftests/userfaultfd with my small patch [1]. I ran it with
> the following parameters: “ ./userfaultfd anon 100 100 “. I think that it is
> more easily reproducible with “mitigations=off idle=poll” as kernel
> parameters.
> 
> [1] https://lore.kernel.org/patchwork/patch/1346386/

Thanks.

> 
> > 
> >> [1] https://lore.kernel.org/patchwork/patch/1346386
> > 
> > PS: Sorry to not have read the other series of yours.  It seems to need some
> > chunk of time so I postponed it a bit due to other things; but I'll read at
> > least the fixes very soon.
> 
> Thanks again, I will post RFCv2 with some numbers soon.

I read the patch 1/3 of the series.  Would it be better to post them separately
just in case Andrew would like to pick them earlier?

Since you seem to be heavily working on uffd-wp - I do still have a few uffd-wp
fixes locally even for anonymous.  I think they're related to some corner cases
like either thp or migration entry convertions, but anyway I'll see whether I
should post them even earlier (I planned to add smap/pagemap support for
uffd-wp so maybe I can even write some test case to verify some of them).  Just
a FYI...

Thanks,

Linus Torvalds Dec. 21, 2020, 7:55 p.m. UTC | #23

On Mon, Dec 21, 2020 at 11:16 AM Yu Zhao <yuzhao@google.com> wrote:
>
> Nadav Amit found memory corruptions when running userfaultfd test above.
> It seems to me the problem is related to commit 09854ba94c6a ("mm:
> do_wp_page() simplification"). Can you please take a look? Thanks.
>
> TL;DR: it may not safe to make copies of singly mapped (non-COW) pages
> when it's locked or has additional ref count because concurrent
> clear_soft_dirty or change_pte_range may have removed pte_write but yet
> to flush tlb.

Hmm. The TLB flush shouldn't actually matter, because anything that
changes the writable bit had better be serialized by the page table
lock.

Yes, we often load the page table value without holding the page table
lock (in order to know what we are going to do), but then before we
finalize the operation, we then re-check - undet the page table lock -
that the value we loaded still matches.

But I think I see what *MAY* be going on.  The userfaultfd
mwriteprotect_range() code takes the mm lock for _reading_. Which
means that you can have

Thread A     Thread B

 - fault starts. Sees write-protected pte, allocates memory, copies data

                   - userfaultfd makes the regions writable

                   - usefaultfd case writes to the region

                   - userfaultfd makes region non-writable

 - fault continues, gets the page table lock, sees that the pte is the
same, uses old copied data

But if this is what's happening, I think it's a userfaultfd bug. I
think the mmap_read_lock(dst_mm) in mwriteprotect_range() needs to be
a mmap_write_lock().

mprotect() does this right, it looks like userfaultfd does not. You
cannot just change the writability of a page willy-nilly without the
correct locking.

Maybe there are other causes, but this one stands out to me as one
possible cause.

Comments?

              Linus

Yu Zhao Dec. 21, 2020, 8:21 p.m. UTC | #24

On Mon, Dec 21, 2020 at 11:55:02AM -0800, Linus Torvalds wrote:
> On Mon, Dec 21, 2020 at 11:16 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > Nadav Amit found memory corruptions when running userfaultfd test above.
> > It seems to me the problem is related to commit 09854ba94c6a ("mm:
> > do_wp_page() simplification"). Can you please take a look? Thanks.
> >
> > TL;DR: it may not safe to make copies of singly mapped (non-COW) pages
> > when it's locked or has additional ref count because concurrent
> > clear_soft_dirty or change_pte_range may have removed pte_write but yet
> > to flush tlb.
> 
> Hmm. The TLB flush shouldn't actually matter, because anything that
> changes the writable bit had better be serialized by the page table
> lock.

Well, unfortunately we have places that use optimizations like

  inc_tlb_flush_pending()
    lock page table
      pte_wrprotect
  flush_tlb_range()
  dec_tlb_flush_pending()

which complicate things. And usually checking mm_tlb_flush_pending()
in addition to pte_write() (while holding page table lock) would fix
the similar problems. But for this one, doing so apparently isn't as
straightforward or the best solution.

> Yes, we often load the page table value without holding the page table
> lock (in order to know what we are going to do), but then before we
> finalize the operation, we then re-check - undet the page table lock -
> that the value we loaded still matches.
> 
> But I think I see what *MAY* be going on.  The userfaultfd
> mwriteprotect_range() code takes the mm lock for _reading_. Which
> means that you can have
> 
> Thread A     Thread B
> 
>  - fault starts. Sees write-protected pte, allocates memory, copies data
> 
>                    - userfaultfd makes the regions writable
> 
>                    - usefaultfd case writes to the region
> 
>                    - userfaultfd makes region non-writable
> 
>  - fault continues, gets the page table lock, sees that the pte is the
> same, uses old copied data
> 
> But if this is what's happening, I think it's a userfaultfd bug. I
> think the mmap_read_lock(dst_mm) in mwriteprotect_range() needs to be
> a mmap_write_lock().
> 
> mprotect() does this right, it looks like userfaultfd does not. You
> cannot just change the writability of a page willy-nilly without the
> correct locking.
> 
> Maybe there are other causes, but this one stands out to me as one
> possible cause.
> 
> Comments?
> 
>               Linus

Nadav Amit Dec. 21, 2020, 8:23 p.m. UTC | #25

> On Dec 21, 2020, at 11:55 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Mon, Dec 21, 2020 at 11:16 AM Yu Zhao <yuzhao@google.com> wrote:
>> Nadav Amit found memory corruptions when running userfaultfd test above.
>> It seems to me the problem is related to commit 09854ba94c6a ("mm:
>> do_wp_page() simplification"). Can you please take a look? Thanks.
>> 
>> TL;DR: it may not safe to make copies of singly mapped (non-COW) pages
>> when it's locked or has additional ref count because concurrent
>> clear_soft_dirty or change_pte_range may have removed pte_write but yet
>> to flush tlb.
> 
> Hmm. The TLB flush shouldn't actually matter, because anything that
> changes the writable bit had better be serialized by the page table
> lock.
> 
> Yes, we often load the page table value without holding the page table
> lock (in order to know what we are going to do), but then before we
> finalize the operation, we then re-check - undet the page table lock -
> that the value we loaded still matches.
> 
> But I think I see what *MAY* be going on.  The userfaultfd
> mwriteprotect_range() code takes the mm lock for _reading_. Which
> means that you can have
> 
> Thread A     Thread B
> 
> - fault starts. Sees write-protected pte, allocates memory, copies data
> 
>                   - userfaultfd makes the regions writable
> 
>                   - usefaultfd case writes to the region
> 
>                   - userfaultfd makes region non-writable
> 
> - fault continues, gets the page table lock, sees that the pte is the
> same, uses old copied data
> 
> But if this is what's happening, I think it's a userfaultfd bug. I
> think the mmap_read_lock(dst_mm) in mwriteprotect_range() needs to be
> a mmap_write_lock().
> 
> mprotect() does this right, it looks like userfaultfd does not. You
> cannot just change the writability of a page willy-nilly without the
> correct locking.
> 
> Maybe there are other causes, but this one stands out to me as one
> possible cause.
> 
> Comments?

Using mmap_write_lock() was my initial fix and there was a strong pushback
on this approach due to its potential impact on performance.

So I think there are 4 possible directions for solutions that I thought of
or others have raised/hinted:

1. mmap_write_lock()

2. Copy the page in cow_user_page() while holding the PTL and after flushing
   has been done. I am not sure if there are potential problems with
   special-pages (2 flushes might be necessary for special pages).

3. Always reuse the page and never COW on userfaultfd/soft-dirty.
   I do not know if it is feasible.

4. Retry the page-fault if mm->tlb_flush_pending is set.

Hopefully I did not miss any other suggestion.

Linus Torvalds Dec. 21, 2020, 8:25 p.m. UTC | #26

On Mon, Dec 21, 2020 at 12:21 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Well, unfortunately we have places that use optimizations like
>
>   inc_tlb_flush_pending()
>     lock page table
>       pte_wrprotect
>   flush_tlb_range()
>   dec_tlb_flush_pending()
>
> which complicate things.

My point is, none of that  matters.

Because the software side that does the actual page table
modifications do not depend on the TLB at all.

They depend on the page table lock, and the pte in memory.

So the "pending flush" simply shoudln't be an issue. It's about the
actual hardware usage.

But what DOES matter for the software accesses is that you can't
modify protections without holding the proper lock.

And userfaultfd seems to do exactly that, breaking the whole "load pte
early, then check that it didn't change".

(Which we do in other places too, not just COW - it's basically _the_
pattern for page table updates).

               Linus

Linus Torvalds Dec. 21, 2020, 8:26 p.m. UTC | #27

On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Using mmap_write_lock() was my initial fix and there was a strong pushback
> on this approach due to its potential impact on performance.

From whom?

Somebody who doesn't understand that correctness is more important
than performance? And that userfaultfd is not the most important part
of the system?

The fact is, userfaultfd is CLEARLY BUGGY.

          Linus

Yu Zhao Dec. 21, 2020, 9:24 p.m. UTC | #28

On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >
> > Using mmap_write_lock() was my initial fix and there was a strong pushback
> > on this approach due to its potential impact on performance.
> 
> From whom?
> 
> Somebody who doesn't understand that correctness is more important
> than performance? And that userfaultfd is not the most important part
> of the system?
> 
> The fact is, userfaultfd is CLEARLY BUGGY.
> 
>           Linus

Fair enough.

Nadav, for your patch (you might want to update the commit message).

Reviewed-by: Yu Zhao <yuzhao@google.com>

While we are all here, there is also clear_soft_dirty() that could
use a similar fix...

Nadav Amit Dec. 21, 2020, 9:49 p.m. UTC | #29

> On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
>>> on this approach due to its potential impact on performance.
>> 
>> From whom?
>> 
>> Somebody who doesn't understand that correctness is more important
>> than performance? And that userfaultfd is not the most important part
>> of the system?
>> 
>> The fact is, userfaultfd is CLEARLY BUGGY.
>> 
>>          Linus
> 
> Fair enough.
> 
> Nadav, for your patch (you might want to update the commit message).

Yes, the commit message is completely off. Will fix.

Thanks for your guidance and assistance.

> 
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> 
> While we are all here, there is also clear_soft_dirty() that could
> use a similar fix…

Let me try to build a small reproducer for clear_soft_dirty() and then I’ll
send another patch for it too.

BTW: In general, I think that you are right, and that changing of PTEs
should not require taking mmap_lock for write. However, I am not sure
cow_user_page() is not the only one that poses a problem and whether a more
systematic solution is needed. If cow_user_pages() is the only problem, do
you think it is possible to do the copying while holding the PTL? It works
for normal-pages, but I am not sure whether special-pages pose special
problems.

Anyhow, this is an enhancement that we can try later.

Thanks again,
Nadav

Peter Xu Dec. 21, 2020, 9:55 p.m. UTC | #30

On Mon, Dec 21, 2020 at 12:23:06PM -0800, Nadav Amit wrote:
> 2. Copy the page in cow_user_page() while holding the PTL and after flushing
>    has been done. I am not sure if there are potential problems with
>    special-pages (2 flushes might be necessary for special pages).

This seems to be a good thing irrelevant of userfaultfd.

Frankly speaking I don't know why it's always safe to do data copy without the
pgtable lock in wp_page_copy(), since I don't know what guaranteed us from data
changing on the original page due to any reason.  We check and make sure pte
hasn't changed, but is that really enough?

So the safe way to me is that wp_page_copy() not only check pte_same() but also
data_same() on the old/new pages, then with that it's even easier if we just
move the data copy into the ptl as what Nadav said here.  But I have no idea on
the penalty to the rest of the world, especially on non-modern hosts.

Thanks,

Peter Xu Dec. 21, 2020, 10:30 p.m. UTC | #31

On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
> BTW: In general, I think that you are right, and that changing of PTEs
> should not require taking mmap_lock for write. However, I am not sure
> cow_user_page() is not the only one that poses a problem and whether a more
> systematic solution is needed. If cow_user_pages() is the only problem, do
> you think it is possible to do the copying while holding the PTL? It works
> for normal-pages, but I am not sure whether special-pages pose special
> problems.
> 
> Anyhow, this is an enhancement that we can try later.

AFAIU mprotect() is the only one who modifies the pte using the mmap write
lock.  NUMA balancing is also using read mmap lock when changing pte
protections, while my understanding is mprotect() used write lock only because
it manipulates the address space itself (aka. vma layout) rather than modifying
the ptes, so it needs to.

At the pte level, it seems always to be the pgtable lock that serializes things.

So it's perfectly legal to me for e.g. a driver to modify ptes with the read
lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
taken when doing so.

If there's a driver that manipulated the ptes, changed the content of the page,
recover the ptes to origin, and all these happen right after wp_page_copy()
unlocked the pgtable lock but before wp_page_copy() retakes the same lock
again, we may face the same issue finding that the page got copied contains
corrupted data at last.  While I don't know what to blame on the driver either
because it seems to be exactly following the rules.

I believe changing into write lock would solve the race here because tlb
flushing would be guaranteed along the way, but I'm just a bit worried it's not
the best way to go..

Thanks,

Nadav Amit Dec. 21, 2020, 10:55 p.m. UTC | #32

> On Dec 21, 2020, at 2:30 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
>> BTW: In general, I think that you are right, and that changing of PTEs
>> should not require taking mmap_lock for write. However, I am not sure
>> cow_user_page() is not the only one that poses a problem and whether a more
>> systematic solution is needed. If cow_user_pages() is the only problem, do
>> you think it is possible to do the copying while holding the PTL? It works
>> for normal-pages, but I am not sure whether special-pages pose special
>> problems.
>> 
>> Anyhow, this is an enhancement that we can try later.
> 
> AFAIU mprotect() is the only one who modifies the pte using the mmap write
> lock.  NUMA balancing is also using read mmap lock when changing pte
> protections, while my understanding is mprotect() used write lock only because
> it manipulates the address space itself (aka. vma layout) rather than modifying
> the ptes, so it needs to.

You are correct about NUMA balancing in general. Yet in practice it is not
an issue in our “use-case” since NUMA balancing preserves the write-bit.

> At the pte level, it seems always to be the pgtable lock that serializes things.
> 
> So it's perfectly legal to me for e.g. a driver to modify ptes with the read
> lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
> taken when doing so.
> 
> If there's a driver that manipulated the ptes, changed the content of the page,
> recover the ptes to origin, and all these happen right after wp_page_copy()
> unlocked the pgtable lock but before wp_page_copy() retakes the same lock
> again, we may face the same issue finding that the page got copied contains
> corrupted data at last.  While I don't know what to blame on the driver either
> because it seems to be exactly following the rules.

The driver would have to do so without flushing the TLB. Having said that,
the driver could have used inc_tlb_flush_pending() and batch flushes.

> 
> I believe changing into write lock would solve the race here because tlb
> flushing would be guaranteed along the way, but I'm just a bit worried it's not
> the best way to go..

It might be too big of a hammer. But the question that comes to my mind is,
if it is ok to change the PTEs without mmap_lock held for write, why
wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
mprotect change of PTE would not be done with the write-lock)? If you did
so, you would have the same problem as the one we encountered (concurrent
protect-unprotect allow concurrent cow-#PF copying the wrong data).

So as an alternative solution, I can do copying under the PTL after
flushing, which seems to solve the problem. First copying (without a lock)
and then comparing seems to me as suboptimal - I do not think the benefit
(if there is one) of shortening the time in which the lock is taken - worth
the additional compare (and the complexity with special pages).

There are 2 problems in doing so:

1. I think that copy_user_highpage() and __copy_from_user_inatomic() can be
called while holding the PTL, but I am not sure.

2. For special pages we would need 2 TLB flushes: one to ensure the
write-bit is cleared, and a second one after we clear the PTE. We
can limit ourselves to soft-dirty/UFFD VMAs, but if we have your
hypothetical driver, this would not be good enough.

Yu Zhao Dec. 21, 2020, 11:12 p.m. UTC | #33

On Mon, Dec 21, 2020 at 05:30:41PM -0500, Peter Xu wrote:
> On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
> > BTW: In general, I think that you are right, and that changing of PTEs
> > should not require taking mmap_lock for write. However, I am not sure
> > cow_user_page() is not the only one that poses a problem and whether a more
> > systematic solution is needed. If cow_user_pages() is the only problem, do
> > you think it is possible to do the copying while holding the PTL? It works
> > for normal-pages, but I am not sure whether special-pages pose special
> > problems.
> > 
> > Anyhow, this is an enhancement that we can try later.
> 
> AFAIU mprotect() is the only one who modifies the pte using the mmap write
> lock.  NUMA balancing is also using read mmap lock when changing pte
> protections,

NUMA balance doesn't clear pte_write() -- I would not call setting
pte_none() a change of protection.

> while my understanding is mprotect() used write lock only because
> it manipulates the address space itself (aka. vma layout) rather than modifying
> the ptes, so it needs to.

Yes, and personally, I would only take mmap lock for write when I
change VMAs, not PTE protections.

> At the pte level, it seems always to be the pgtable lock that serializes things.
> 
> So it's perfectly legal to me for e.g. a driver to modify ptes with the read
> lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
> taken when doing so.
> 
> If there's a driver that manipulated the ptes, changed the content of the page,
> recover the ptes to origin, and all these happen right after wp_page_copy()
> unlocked the pgtable lock but before wp_page_copy() retakes the same lock
> again, we may face the same issue finding that the page got copied contains
> corrupted data at last.  While I don't know what to blame on the driver either
> because it seems to be exactly following the rules.
> 
> I believe changing into write lock would solve the race here because tlb
> flushing would be guaranteed along the way, but I'm just a bit worried it's not
> the best way to go..

I can't say I disagree with you but the man has made the call and I
think we should just move on.

Linus Torvalds Dec. 21, 2020, 11:13 p.m. UTC | #34

On Mon, Dec 21, 2020 at 1:55 PM Peter Xu <peterx@redhat.com> wrote:
>
> Frankly speaking I don't know why it's always safe to do data copy without the
> pgtable lock in wp_page_copy(), since I don't know what guaranteed us from data
> changing on the original page due to any reason.

So the reason it should be safe is that

 (a) the pte is write-protected

 (b) we're clearly not a shared mapping, so that if anybody else
writes to the page in another mapping, that's irrelevant (think of it
as "page copy happened earlier")

 (c) before we install the copied page, we check that nothing changed
our initial assumption in (a).

And the problem is that userfaultfd basically can cause that (c) phase to fail.

                Linus

Linus Torvalds Dec. 21, 2020, 11:22 p.m. UTC | #35

On Mon, Dec 21, 2020 at 2:30 PM Peter Xu <peterx@redhat.com> wrote:
>
> AFAIU mprotect() is the only one who modifies the pte using the mmap write
> lock.  NUMA balancing is also using read mmap lock when changing pte
> protections, while my understanding is mprotect() used write lock only because
> it manipulates the address space itself (aka. vma layout) rather than modifying
> the ptes, so it needs to.

So it's ok to change the pte holding only the PTE lock, if it's a
*one*way* conversion.

That doesn't break the "re-check the PTE contents" model (which
predates _all_ of the rest: NUMA, userfaultfd, everything - it's
pretty much the original model for our page table operations, and goes
back to the dark ages even before SMP and the existence of a page
table lock).

So for example, a COW will always create a different pte (not just
because the page number itself changes - you could imagine a page
getting re-used and changing back - but because it's always a RO->RW
transition).

So two COW operations cannot "undo" each other and fool us into
thinking nothing changed.

Anything that changes RW->RO - like fork(), for example - needs to
take the mmap_lock.

NUMA balancing should be ok wrt COW, because it doesn't do that RW->RO
thing, it uses the present bit.

I think that you are right that NUMA balancing itself might cause
other issues, because it can cause that "pte changed and then came
back" (for numa protectoipn and then a numa fault) all with just the
mmap lock for reading.

However, even that shouldn't matter for COW, because the write protect
bit is the one that proptects the *contents* of the page, so even if
NUMA balancing caused that "load original PTE, then re-check later" to
succeed (despite the PTE actually changing in the middle), the
_contents_ of the page cannot have changed, so COW is ok. NUMA
balancing won't be making a read-only page temporarily writable.

           Linus

Linus Torvalds Dec. 21, 2020, 11:30 p.m. UTC | #36

On Mon, Dec 21, 2020 at 2:55 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> So as an alternative solution, I can do copying under the PTL after
> flushing, which seems to solve the problem.

I think that's a valid model, but note that we do the "re-check ptl"
in a (*completely(* different part than we do the actual PTE install.

Note that the "Re-validate under PTL" code in cow_user_page() is *not*
the "now we are installing the copy". No, that's actually for the
"uhhuh, the copy using the virtual address outside the ptl failed, now
we need to do something special".

The real "we hold teh ptl" actually happens in wp_page_copy(), after
cow_user_page() has already returned.

So you'd have to change how all of that works.

And honestly, I'm not sure it's worth it - if this was the *only*
case, then yes. But that whole "we load the original pte first, then
we do whatever we _think_ we will need to do, and then we install the
final pte after checking" is actually the case for every other page
fault handling case too.

So are we sure the COW case is so special?

I really think this is clearly just a userfaultfd bug that we hadn't
realized until now, and had possibly been hidden by timings or other
random stuff before.

    Linus

Linus Torvalds Dec. 21, 2020, 11:33 p.m. UTC | #37

On Mon, Dec 21, 2020 at 3:12 PM Yu Zhao <yuzhao@google.com> wrote:
>
> I can't say I disagree with you but the man has made the call and I
> think we should just move on.

"The man" can always be convinced by numbers.

So if somebody comes up with an alternate patch, and explains it, and
shows that it is better - go for it.

I just think that if mprotect() can take the mmap lock for writing,
then userfaultfd sure as hell can. What odd load does people have
where userfaultfd is more important than mprotect?

So as far as the man is concerned, I think "just fix userfaultfd" is
simply the default obvious operation.

Not necessarily a final endpoint.

                 Linus

Nadav Amit Dec. 21, 2020, 11:46 p.m. UTC | #38

> On Dec 21, 2020, at 3:30 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Mon, Dec 21, 2020 at 2:55 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> So as an alternative solution, I can do copying under the PTL after
>> flushing, which seems to solve the problem.
> 
> ...
> Note that the "Re-validate under PTL" code in cow_user_page() is *not*
> the "now we are installing the copy". No, that's actually for the
> "uhhuh, the copy using the virtual address outside the ptl failed, now
> we need to do something special”.
> ...
> So are we sure the COW case is so special?
> 
> I really think this is clearly just a userfaultfd bug that we hadn't
> realized until now, and had possibly been hidden by timings or other
> random stuff before.

Thanks for the detailed explanation. I think I got the COW parts correct,
but as you said, I am completely not sure that COW is so special.

Seems as if some general per page-table mechanism for detection of stale
PTEs is needed, so by default anyone that acquires the PTL is guaranteed
that the PTEs in memory are coherent across all the TLBs.

But I still did not figure out how to do so without introducing overheads,
and the question is indeed if people care about mprotect and uffd-wp
performance.

Yu Zhao Dec. 22, 2020, midnight UTC | #39

On Mon, Dec 21, 2020 at 03:33:30PM -0800, Linus Torvalds wrote:
> On Mon, Dec 21, 2020 at 3:12 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > I can't say I disagree with you but the man has made the call and I
> > think we should just move on.
> 
> "The man" can always be convinced by numbers.
> 
> So if somebody comes up with an alternate patch, and explains it, and
> shows that it is better - go for it.
> 
> I just think that if mprotect() can take the mmap lock for writing,
> then userfaultfd sure as hell can. What odd load does people have
> where userfaultfd is more important than mprotect?
> 
> So as far as the man is concerned, I think "just fix userfaultfd" is
> simply the default obvious operation.
> 
> Not necessarily a final endpoint.
> 
>                  Linus

My first instinct is to be conservative and revert 09854ba94c6a ("mm:
do_wp_page() simplification") so people are less likely to come back
and complain about performance issues from holding mmap lock for
write when clearing pte_write.

That being said, I do like the simplicity of 09854ba94c6a as well as
having one simple rule that dictates what we should do when clearing
pte_write(). And "userfaultfd is not the most important part of the
system" is a fair point.

Linus Torvalds Dec. 22, 2020, 12:11 a.m. UTC | #40

On Mon, Dec 21, 2020 at 4:00 PM Yu Zhao <yuzhao@google.com> wrote:
>
> My first instinct is to be conservative and revert 09854ba94c6a ("mm:
> do_wp_page() simplification") so people are less likely to come back
> and complain about performance issues from holding mmap lock for
> write when clearing pte_write.

Well, the thing is, that simplificaiton was actually part of fixing a
real regression wrt GUP.

Reverting that would break a308c71bf1e6 ("mm/gup: Remove enfornced COW
mechanism").

And that one was the (better) fix for commit 17839856fd58 that fixed a
real security issue, but did it with a big hammer that then caused
problems for uffd-wp (and some other loads). There's a bit more
context in the merge message in commit b25d1dc9474e Merge branch
'simplify-do_wp_page'.

So while that commit 09854ba94c6a on its own is "just" a
simplification, it's actually part of a bigger series that fixes
serious problems.

                Linus

Yu Zhao Dec. 22, 2020, 12:24 a.m. UTC | #41

On Mon, Dec 21, 2020 at 04:11:01PM -0800, Linus Torvalds wrote:
> On Mon, Dec 21, 2020 at 4:00 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > My first instinct is to be conservative and revert 09854ba94c6a ("mm:
> > do_wp_page() simplification") so people are less likely to come back
> > and complain about performance issues from holding mmap lock for
> > write when clearing pte_write.
> 
> Well, the thing is, that simplificaiton was actually part of fixing a
> real regression wrt GUP.
> 
> Reverting that would break a308c71bf1e6 ("mm/gup: Remove enfornced COW
> mechanism").
> 
> And that one was the (better) fix for commit 17839856fd58 that fixed a
> real security issue, but did it with a big hammer that then caused
> problems for uffd-wp (and some other loads). There's a bit more
> context in the merge message in commit b25d1dc9474e Merge branch
> 'simplify-do_wp_page'.
> 
> So while that commit 09854ba94c6a on its own is "just" a
> simplification, it's actually part of a bigger series that fixes
> serious problems.
> 
>                 Linus

Well, that settles it then. I don't any better idea that is as safe
as what Nadav has.

Andy Lutomirski Dec. 22, 2020, 3:19 a.m. UTC | #42

On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Dec 21, 2020 at 2:30 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > AFAIU mprotect() is the only one who modifies the pte using the mmap write
> > lock.  NUMA balancing is also using read mmap lock when changing pte
> > protections, while my understanding is mprotect() used write lock only because
> > it manipulates the address space itself (aka. vma layout) rather than modifying
> > the ptes, so it needs to.
>
> So it's ok to change the pte holding only the PTE lock, if it's a
> *one*way* conversion.
>
> That doesn't break the "re-check the PTE contents" model (which
> predates _all_ of the rest: NUMA, userfaultfd, everything - it's
> pretty much the original model for our page table operations, and goes
> back to the dark ages even before SMP and the existence of a page
> table lock).
>
> So for example, a COW will always create a different pte (not just
> because the page number itself changes - you could imagine a page
> getting re-used and changing back - but because it's always a RO->RW
> transition).
>
> So two COW operations cannot "undo" each other and fool us into
> thinking nothing changed.
>
> Anything that changes RW->RO - like fork(), for example - needs to
> take the mmap_lock.

Ugh, this is unpleasantly complicated.  I will admit that any API that
takes an address and more-or-less-blindly marks it RO makes me quite
nervous even assuming all the relevant locks are held.  At least
userfaultfd refuses to operate on VM_SHARED VMAs, but we have another
instance of this (with mmap_sem held for write) in x86:
mark_screen_rdonly().  Dare I ask how broken this is?  We could likely
get away with deleting it entirely.

--Andy

Linus Torvalds Dec. 22, 2020, 4:16 a.m. UTC | #43

On Mon, Dec 21, 2020 at 7:19 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> Ugh, this is unpleasantly complicated.

I probably should have phrased it differently, because the case you
quote is actually a good one:

>  I will admit that any API that
> takes an address and more-or-less-blindly marks it RO makes me quite
> nervous even assuming all the relevant locks are held.  At least
> userfaultfd refuses to operate on VM_SHARED VMAs, but we have another
> instance of this (with mmap_sem held for write) in x86:
> mark_screen_rdonly().  Dare I ask how broken this is?  We could likely
> get away with deleting it entirely.

So I think the basic rule is that "if you hold mmap_sem for writing,
you're always safe". And that really should be considered the
"default" locking.

ANY time you make a modification to the VM layer, you should basically
always treat it as a write operation, and get the mmap_sem for
writing.

Yeah, yeah, that's a bit simplified, and it ignores various special
cases (and the hardware page table walkers that obviously take no
locks at all), but if you hold the mmap_sem for writing you won't
really race with anything else - not page faults, and not other
"modify this VM".

So mark_screen_rdonly() looks trivially fine to me. I don't think it
really necessarily matters any more, and it's a legacy thing for a
legacy hardware issue, but I don't think there's any reason at all to
remove it either.

To a first approximation, everybody that changes the VM should take
the mmap_sem for writing, and the readers should just be just about
page fault handling (and I count GUP as "page fault handling" too -
it's kind of the same "look up page" rather than "modify vm" kind of
operation).

And there are just a _lot_ more page faults than there are things that
modify the page tables and the vma's.

So having that mental model of "lookup of pages in a VM take mmap_semn
for reading, any modification of the VM uses it for writing" makes
sense both from a performance angle and a logical standpoint. It's the
correct model.

And it's worth noting that COW is still "lookup of pages", even though
it might modify the page tables in the process. The same way lookup
can modify the page tables to mark things accessed or dirty.

So COW is still a lookup operation, in ways that "change the
writabiility of this range" very much is not. COW is "lookup for
write", and the magic we do to copy to make that write valid is still
all about the lookup of the page.

Which brings up another mental mistake I saw earlier in this thread:
you should not think "mmap_sem is for vma, and the page table lock is
for the page table changes".

mmap_sem is the primary lock for any modifications to the VM layout,
whether it be in the vma's or in the page tables.

Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
it is partly because

 (a) we have things that historically walked the page tables _without_
walking the vma's (notably the virtual memory scanning)

 (b) we do allow concurrent page faults, so we then need a lower-level
lock to serialize the parallelism we _do_ have.

And yes, we have ended up allowing some of those "modify the VM state"
to then take the mmap_sem for reading, just because their
modifications are slight enough that we can then say "ok, this is a
write modification, but the writes it does are protected by the page
table lock, we'll just get the mmap_sem for reading". It's an
optimization, but it should be considered exactly that: not
fundamental, but just a clever trick and an optimization.

It's why I went "userfaultfd is buggy" immediately when I noticed. It
basically became clear that "oh, that's an optimization, but it's an
*invalid* one", in that it didn't actually work and give the same end
result.

So when I said:

> Anything that changes RW->RO - like fork(), for example - needs to
> take the mmap_lock.

I didn't mean that we should consider that RW->RO change to be this
complex semantic marker that we should honor and say "ok, because it's
a RW->RO change, we need to hold the mmap_sem".

I phrased it really badly, in other words.

What I *should* have said is that *because* userfaultfd is changing
the VM layout, it should always act as if it had to take the mmap_sem
for writing, and that the RW->RO change is an example of when
downgrading that write-lock to a read lock is simply not valid -
because it basically breaks the rules about what a lookup (ie a read)
can do.

A lookup can never turn a writable page non-writable. A lookup -
through COW - _can_ turn a read-only page writable. So that's why
"RO->RW" can be ok under the read lock, but RW->RO is not.

Does that clarify the situation when I phrase it that way instead?

             Linus

Nadav Amit Dec. 22, 2020, 9:38 a.m. UTC | #44

> On Dec 21, 2020, at 7:19 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Mon, Dec 21, 2020 at 2:30 PM Peter Xu <peterx@redhat.com> wrote:
>>> AFAIU mprotect() is the only one who modifies the pte using the mmap write
>>> lock.  NUMA balancing is also using read mmap lock when changing pte
>>> protections, while my understanding is mprotect() used write lock only because
>>> it manipulates the address space itself (aka. vma layout) rather than modifying
>>> the ptes, so it needs to.
>> 
>> So it's ok to change the pte holding only the PTE lock, if it's a
>> *one*way* conversion.
>> 
>> That doesn't break the "re-check the PTE contents" model (which
>> predates _all_ of the rest: NUMA, userfaultfd, everything - it's
>> pretty much the original model for our page table operations, and goes
>> back to the dark ages even before SMP and the existence of a page
>> table lock).
>> 
>> So for example, a COW will always create a different pte (not just
>> because the page number itself changes - you could imagine a page
>> getting re-used and changing back - but because it's always a RO->RW
>> transition).
>> 
>> So two COW operations cannot "undo" each other and fool us into
>> thinking nothing changed.
>> 
>> Anything that changes RW->RO - like fork(), for example - needs to
>> take the mmap_lock.
> 
> Ugh, this is unpleasantly complicated.  I will admit that any API that
> takes an address and more-or-less-blindly marks it RO makes me quite
> nervous even assuming all the relevant locks are held.  At least
> userfaultfd refuses to operate on VM_SHARED VMAs, but we have another
> instance of this (with mmap_sem held for write) in x86:
> mark_screen_rdonly().  Dare I ask how broken this is?  We could likely
> get away with deleting it entirely.

If you only look at the function in isolation, it seems broken. It should
have flushed the TLB before releasing the mmap_lock. After the
mmap_write_unlock() and before the actual flush, a #PF on another thread can
happen, and a similar scenario to the one that is mentioned in this thread
(copying while a stale PTE in the TLBs is not-writeprotected) might happen.

Having said that, I do not know this code and the context in which this
function is called, so I do not know whether there are other mitigating
factors.

Funny, I had a deja-vu and indeed you have already raised (other) TLB issues
with mark_screen_rdonly() 3 years ago. At the time you said "I'd like to
delete it.” [1]

[1] https://lore.kernel.org/patchwork/patch/782486/#976151

Nadav Amit Dec. 22, 2020, 12:40 p.m. UTC | #45

> On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
>>> on this approach due to its potential impact on performance.
>> 
>> From whom?
>> 
>> Somebody who doesn't understand that correctness is more important
>> than performance? And that userfaultfd is not the most important part
>> of the system?
>> 
>> The fact is, userfaultfd is CLEARLY BUGGY.
>> 
>>          Linus
> 
> Fair enough.
> 
> Nadav, for your patch (you might want to update the commit message).
> 
> Reviewed-by: Yu Zhao <yuzhao@google.com>
> 
> While we are all here, there is also clear_soft_dirty() that could
> use a similar fix…

Just an update as for why I have still not sent v2: I fixed
clear_soft_dirty(), created a reproducer, and the reproducer kept failing.

So after some debugging, it appears that clear_refs_write() does not flush
the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
flush the TLB since there is clear_refs_write() does not call to
__tlb_adjust_range() (unless there are nested TLBs are pending).

So I have a patch for this issue too: arguably the tlb_gather interface is
not the right one for clear_refs_write() that does not clear PTEs but
changes them.

Yet, sadly, my reproducer keeps falling (less frequently, but still). So I
will keep debugging to see what goes wrong. I will send v2 once I figure out
what the heck is wrong in the code or my reproducer.

For the reference, here is my reproducer:

-- >8 --

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <threads.h>
#include <stdatomic.h>

#define PAGE_SIZE	(4096)
#define TLB_SIZE	(2000)
#define N_PAGES		(300000)
#define ITERATIONS	(100)
#define N_THREADS	(2)

static int stop;
static char *m;

static int writer(void *argp)
{
	unsigned long t_idx = (unsigned long)argp;
	int i, cnt = 0;

	while (!atomic_load(&stop)) {
		cnt++;
		atomic_fetch_add((atomic_int *)m, 1);

		/*
		 * First thread only accesses the page to have it cached in the
		 * TLB.
		 */
		if (t_idx == 0)
			continue;

		/*
		 * Other threads access enough entries to cause eviction from
		 * the TLB and trigger #PF upon the next access (before the TLB
		 * flush of clear_ref actually takes place).
		 */
		for (i = 1; i < TLB_SIZE; i++) {
			if (atomic_load((atomic_int *)(m + PAGE_SIZE * i))) {
				fprintf(stderr, "unexpected error\n");
				exit(1);
			}
		}
	}
	return cnt;
}

/*
 * Runs mlock/munlock in the background to raise the page-count of the page and
 * force copying instead of reusing the page.
 */
static int do_mlock(void *argp)
{
	while (!atomic_load(&stop)) {
		if (mlock(m, PAGE_SIZE) || munlock(m, PAGE_SIZE)) {
			perror("mlock/munlock");
			exit(1);
		}
	}
	return 0;
}

int main(void)
{
	int r, cnt, fd, total = 0;
	long i;
	thrd_t thr[N_THREADS];
	thrd_t mlock_thr[N_THREADS];

	fd = open("/proc/self/clear_refs", O_WRONLY, 0666);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	/*
	 * Have large memory for clear_ref, so there would be some time between
	 * the unmap and the actual deferred flush.
	 */
	m = mmap(NULL, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
	if (m == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (i = 0; i < N_THREADS; i++) {
		r = thrd_create(&thr[i], writer, (void *)i);
		assert(r == thrd_success);
	}

	for (i = 0; i < N_THREADS; i++) {
		r = thrd_create(&mlock_thr[i], do_mlock, (void *)i);
		assert(r == thrd_success);
	}

	for (i = 0; i < ITERATIONS; i++) {
		for (i = 0; i < ITERATIONS; i++) {
			r = pwrite(fd, "4", 1, 0);
			if (r < 0) {
				perror("pwrite");
				exit(1);
			}
		}
	}

	atomic_store(&stop, 1);

	for (i = 0; i < N_THREADS; i++) {
		r = thrd_join(mlock_thr[i], NULL);
		assert(r == thrd_success);
	}

	for (i = 0; i < N_THREADS; i++) {
		r = thrd_join(thr[i], &cnt);
		assert(r == thrd_success);
		total += cnt;
	}

	r = atomic_load((atomic_int *)(m));
	if (r != total) {
		fprintf(stderr, "failed - expected=%d actual=%d\n", total, r);
		exit(-1);
	}

	fprintf(stderr, "ok\n");
	return 0;
}

Yu Zhao Dec. 22, 2020, 6:30 p.m. UTC | #46

On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
> > On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@google.com> wrote:
> > 
> > On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
> >> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>> Using mmap_write_lock() was my initial fix and there was a strong pushback
> >>> on this approach due to its potential impact on performance.
> >> 
> >> From whom?
> >> 
> >> Somebody who doesn't understand that correctness is more important
> >> than performance? And that userfaultfd is not the most important part
> >> of the system?
> >> 
> >> The fact is, userfaultfd is CLEARLY BUGGY.
> >> 
> >>          Linus
> > 
> > Fair enough.
> > 
> > Nadav, for your patch (you might want to update the commit message).
> > 
> > Reviewed-by: Yu Zhao <yuzhao@google.com>
> > 
> > While we are all here, there is also clear_soft_dirty() that could
> > use a similar fix…
> 
> Just an update as for why I have still not sent v2: I fixed
> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
> 
> So after some debugging, it appears that clear_refs_write() does not flush
> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
> ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
> flush the TLB since there is clear_refs_write() does not call to
> __tlb_adjust_range() (unless there are nested TLBs are pending).

Sorry Nadav, I assumed you knew this existing problem fixed by:
https://patchwork.kernel.org/project/linux-mm/cover/20201210121110.10094-1-will@kernel.org/

> So I have a patch for this issue too: arguably the tlb_gather interface is
> not the right one for clear_refs_write() that does not clear PTEs but
> changes them.
> 
> Yet, sadly, my reproducer keeps falling (less frequently, but still). So I
> will keep debugging to see what goes wrong. I will send v2 once I figure out
> what the heck is wrong in the code or my reproducer.
> 
> For the reference, here is my reproducer:

Thanks. This would be helpful in case any other breakages happen in the
future.

> -- >8 --
> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/mman.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <assert.h>
> #include <fcntl.h>
> #include <string.h>
> #include <threads.h>
> #include <stdatomic.h>
> 
> #define PAGE_SIZE	(4096)
> #define TLB_SIZE	(2000)
> #define N_PAGES		(300000)
> #define ITERATIONS	(100)
> #define N_THREADS	(2)
> 
> static int stop;
> static char *m;
> 
> static int writer(void *argp)
> {
> 	unsigned long t_idx = (unsigned long)argp;
> 	int i, cnt = 0;
> 
> 	while (!atomic_load(&stop)) {
> 		cnt++;
> 		atomic_fetch_add((atomic_int *)m, 1);
> 
> 		/*
> 		 * First thread only accesses the page to have it cached in the
> 		 * TLB.
> 		 */
> 		if (t_idx == 0)
> 			continue;
> 
> 		/*
> 		 * Other threads access enough entries to cause eviction from
> 		 * the TLB and trigger #PF upon the next access (before the TLB
> 		 * flush of clear_ref actually takes place).
> 		 */
> 		for (i = 1; i < TLB_SIZE; i++) {
> 			if (atomic_load((atomic_int *)(m + PAGE_SIZE * i))) {
> 				fprintf(stderr, "unexpected error\n");
> 				exit(1);
> 			}
> 		}
> 	}
> 	return cnt;
> }
> 
> /*
>  * Runs mlock/munlock in the background to raise the page-count of the page and
>  * force copying instead of reusing the page.
>  */
> static int do_mlock(void *argp)
> {
> 	while (!atomic_load(&stop)) {
> 		if (mlock(m, PAGE_SIZE) || munlock(m, PAGE_SIZE)) {
> 			perror("mlock/munlock");
> 			exit(1);
> 		}
> 	}
> 	return 0;
> }
> 
> int main(void)
> {
> 	int r, cnt, fd, total = 0;
> 	long i;
> 	thrd_t thr[N_THREADS];
> 	thrd_t mlock_thr[N_THREADS];
> 
> 	fd = open("/proc/self/clear_refs", O_WRONLY, 0666);
> 	if (fd < 0) {
> 		perror("open");
> 		exit(1);
> 	}
> 
> 	/*
> 	 * Have large memory for clear_ref, so there would be some time between
> 	 * the unmap and the actual deferred flush.
> 	 */
> 	m = mmap(NULL, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE,
> 			MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
> 	if (m == MAP_FAILED) {
> 		perror("mmap");
> 		exit(1);
> 	}
> 
> 	for (i = 0; i < N_THREADS; i++) {
> 		r = thrd_create(&thr[i], writer, (void *)i);
> 		assert(r == thrd_success);
> 	}
> 
> 	for (i = 0; i < N_THREADS; i++) {
> 		r = thrd_create(&mlock_thr[i], do_mlock, (void *)i);
> 		assert(r == thrd_success);
> 	}
> 
> 	for (i = 0; i < ITERATIONS; i++) {
> 		for (i = 0; i < ITERATIONS; i++) {
> 			r = pwrite(fd, "4", 1, 0);
> 			if (r < 0) {
> 				perror("pwrite");
> 				exit(1);
> 			}
> 		}
> 	}
> 
> 	atomic_store(&stop, 1);
> 
> 	for (i = 0; i < N_THREADS; i++) {
> 		r = thrd_join(mlock_thr[i], NULL);
> 		assert(r == thrd_success);
> 	}
> 
> 	for (i = 0; i < N_THREADS; i++) {
> 		r = thrd_join(thr[i], &cnt);
> 		assert(r == thrd_success);
> 		total += cnt;
> 	}
> 
> 	r = atomic_load((atomic_int *)(m));
> 	if (r != total) {
> 		fprintf(stderr, "failed - expected=%d actual=%d\n", total, r);
> 		exit(-1);
> 	}
> 
> 	fprintf(stderr, "ok\n");
> 	return 0;
> }

Nadav Amit Dec. 22, 2020, 7:20 p.m. UTC | #47

> On Dec 22, 2020, at 10:30 AM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
>>> On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@google.com> wrote:
>>> 
>>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>>>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
>>>>> on this approach due to its potential impact on performance.
>>>> 
>>>> From whom?
>>>> 
>>>> Somebody who doesn't understand that correctness is more important
>>>> than performance? And that userfaultfd is not the most important part
>>>> of the system?
>>>> 
>>>> The fact is, userfaultfd is CLEARLY BUGGY.
>>>> 
>>>>         Linus
>>> 
>>> Fair enough.
>>> 
>>> Nadav, for your patch (you might want to update the commit message).
>>> 
>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>> 
>>> While we are all here, there is also clear_soft_dirty() that could
>>> use a similar fix…
>> 
>> Just an update as for why I have still not sent v2: I fixed
>> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
>> 
>> So after some debugging, it appears that clear_refs_write() does not flush
>> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
>> ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
>> flush the TLB since there is clear_refs_write() does not call to
>> __tlb_adjust_range() (unless there are nested TLBs are pending).
> 
> Sorry Nadav, I assumed you knew this existing problem fixed by:
> https://patchwork.kernel.org/project/linux-mm/cover/20201210121110.10094-1-will@kernel.org/
> 

Thanks, Yu! For some reason I assumed it was already upstreamed and did not
look back (yet if I was cc’d on v2…)

Yet, something still goes bad. Debugging.

Andrea Arcangeli Dec. 22, 2020, 7:31 p.m. UTC | #48

On Mon, Dec 21, 2020 at 07:19:35PM -0800, Andy Lutomirski wrote:
> instance of this (with mmap_sem held for write) in x86:
> mark_screen_rdonly().  Dare I ask how broken this is?  We could likely

That one is buggy despite it takes the mmap_write_lock... inverting
the last two lines would fix it though.

-	mmap_write_unlock(mm);
	flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT, false);
+	mmap_write_unlock(mm);

The issue here is that rightfully wp_page_copy copies outside the PT
lock (good thing) and it correctly assumes during the copy nobody can
modify the page if the fault happened in a write access and the pte
had no _PAGE_WRITE bit set.

The bug is clearly in userfaultfd_writeprotect that doesn't enforce
the above invariant.

If the userfaultfd_wrprotect(mode_wp = false) run after the deferred
TLB flush this could never happen because the uffd_wp bit was still
set and the page fault would hit the handle_userfault dead end and
it'd have to be restarted from scratch.

Leaving stale tlb entries after dropping the PT lock and with only the
mmap_lock_read is only ok if the page fault has a "check" that catches
that specific case, so that specific case is fully serialized by the
PT lock alone and the "catch" path is fully aware about the stale TLB
entries that were left around.

So looking at the two cases that predates uffd-wp that already did a
RW->RO transition with the mmap_lock_read, they both comply with the
wp_page_copy invariant but with two different techniques:

1) change_protection is called by the scheduler with the
  mmap_read_lock and with a deferred TLB flush, and things don't fall
  apart completely there simply because we catch that in do_numa_page:

	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+               /* catch it: no risk to end up in wp_page_copy even if the change_protection
+	           is running concurrently and the tlb flush is still pending */
		return do_numa_page(vmf);

  do_numa_page doesn't issue any further tlb flush, but basically
  restores the pte to the previous value, so then the pending flush
  becomes a noop, since there was no modification to the pte in the
  first place after do_numa_page runs.

2) ksm write_protect_page is also called with mmap_read_lock, and it
   will simply not defer the flush. If the tlb flush is done inside
   the PT lock it is ok to take mmap_read_lock since the page fault
   code will not make any assumption about the TLB before reading the
   pagetable content under PT lock.

   In fact if mm_tlb_flush_pending(mm) is true, write_protect_page in
   KSM has to issue an extra synchronous flush before dropping the PT
   lock, even if it finds the _PAGE_WRITE bit is already clear,
   precisely because there can be another deferred flush that cleared
   the pte but didn't flush the TLB yet.

userfaultfd_writeprotect() is already using the technique in 1) to be
safe, except the un-wrprotect can break it if run while the tlb flush
is still pending...

The good thing is this guarantee must be provided not more granular
even than a vma, not per-mm and a vma cannot be registered on two
different uffd ctx at the same time, so the locking can happen all
internal to the uffd ctx.

My previous suggestion to use a mutex to serialize
userfaultfd_writeprotect with a mutex will still work, but we can run
as many wrprotect and un-wrprotect as we want in parallel, as long as
they're not simultaneous, we can do much better than a mutex.

Ideally we would need a new two_group_semaphore, where each group can
run as many parallel instances as it wants, but no instance of one
group can run in parallel with any instance of the other group. AFIK
such a kind of lock doesn't exist right now.

wrprotect if left alone never had an issue since we had a working
"catch" check for it well before wp_page_copy could run. unprotect
also if left alone was ok since it's a permission upgrade.

Alternatively, we can solve this with the same technique as in 2),
because all it matters is that the 4k pte modification doesn't take
the mmap_lock_write. A modification to a single 4k pte or a single 2m
hugepmd is very likely indication that there's going to be a flood of
those in parallel from all CPUs and likely there's also a flood of
page faults from all CPUs in parallel. In addition for a 4k wrprotect
or un-wrprotect there's not a significant benefit in deferring the TLB
flush.

I don't think we can take the mmap_write_lock unconditionally because
one of the main reasons to deal with the extra complexity of resolving
page faults in userland is to bypass mprotect entirely.

Example, JITs can use unprivileged uffd to eliminate the software-set
dirty bit every time it modifies memory, that would then require
frequent wrprotect/un-wrprotect during page faults and the less likely
we have to block for I/O during those CPU-only events, the
better. This is one of the potential use cases, but there's plenty
more.

So the alternative would be to take mmap_read_lock for "len <=
HPAGE_PMD_SIZE" and the mmap_write_lock otherwise, and to add a
parameter in change_protection to tell it to flush before dropping the
PT lock and not defer the flush. So this is going to be more intrusive
in the VM and it's overall unnecessary.

The below is mostly untested... but it'd be good to hear some feedback
before doing more work in this direction.

From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Tue, 22 Dec 2020 14:30:16 -0500
Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after
 userfaultfd_writeprotect()

change_protection() is called by userfaultfd_writeprotect() with the
mmap_lock_read like in change_prot_numa().

The page fault code in wp_copy_page() rightfully assumes if the CPU
issued a write fault and the write bit in the pagetable is not set, no
CPU can write to the page. That's wrong assumption after
userfaultfd_writeprotect(). That's also wrong assumption after
change_prot_numa() where the regular page fault code also would assume
that if the present bit is not set and the page fault is running,
there should be no stale TLB entry, but there is still.

So to stay safe, the page fault code must be prevented to run as long
as long as the TLB flush remains pending. That is already achieved by
the do_numa_page() path for change_prot_numa() and by the
userfaultfd_pte_wp() path for userfaultfd_writeprotect().

The problem that needs fixing is that an un-wrprotect
(i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not
set) could run in between the original wrprotect
(i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set)
and wp_copy_page, while the TLB flush remains pending.

In such case the userfaultfd_pte_wp() check will stop being effective
to prevent the regular page fault code to run.

CPU0			CPU 1		CPU 2
------			--------	-------
userfaultfd_wrprotect(mode_wp = true)
PT lock
atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
PT unlock

			do_page_fault FAULT_FLAG_WRITE
					userfaultfd_wrprotect(mode_wp = false)
					PT lock
					ATOMIC clear _PAGE_UFFD_WP <- problem
					/* _PAGE_WRITE not set */
					PT unlock
					XXXXXXXXXXXXXX BUG RACE window open here

			PT lock
			FAULT_FLAG_WRITE is set by CPU
			_PAGE_WRITE is still clear in pte
			PT unlock

			wp_page_copy
			copy_user_page runs with stale TLB

deferred tlb flush <- too late
XXXXXXXXXXXXXX BUG RACE window close here
================================================================================

If the userfaultfd_wrprotect(mode_wp = false) can only run after the
deferred TLB flush the above cannot happen either, because the uffd_wp
bit will remain set until after the TLB flush and the page fault would
reliably hit the handle_userfault() dead end as long as the TLB is
stale.

So to fix this we need to prevent any un-wrprotect to start until the
last outstanding wrprotect completed and to prevent any further
wrprotect until the last outstanding un-wrprotect completed. Then
wp_page_copy can still run in parallel but only with the un-wrprotect,
and that's fine since it's a permission promotion.

We would need a new two_group_semaphore, where each group can run as
many parallel instances as it wants, but no instance of one group can
run in parallel with any instance of the other group. The below rwsem
with two atomics approximates that kind lock.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 39 +++++++++++++++++++++++++++++++++++++++
 mm/memory.c      | 20 ++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 894cc28142e7..3729ca99dae5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -76,6 +76,20 @@ struct userfaultfd_ctx {
 	bool mmap_changing;
 	/* mm with one ore more vmas attached to this userfaultfd_ctx */
 	struct mm_struct *mm;
+	/*
+	 * We cannot let userfaultd_writeprotect(mode_wp = false)
+	 * clear _PAGE_UFFD_WP from the pgtable, until the last
+	 * outstanding userfaultd_writeprotect(mode_wp = true)
+	 * completes, because the TLB flush is deferred. The page
+	 * fault must be forced to enter handle_userfault() while the
+	 * TLB flush is deferred, and that's achieved with the
+	 * _PAGE_UFFD_WP bit. The _PAGE_UFFD_WP can only be cleared
+	 * after there are no stale TLB entries left, only then it'll
+	 * be safe again for the page fault to continue and not hit
+	 * the handle_userfault() dead end anymore.
+	 */
+	struct rw_semaphore wrprotect_rwsem;
+	atomic64_t wrprotect_count[2];
 };

 struct userfaultfd_fork_ctx {
@@ -1783,6 +1797,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
 	bool mode_wp, mode_dontwake;
+	bool contention;

 	if (READ_ONCE(ctx->mmap_changing))
 		return -EAGAIN;
@@ -1808,9 +1823,30 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;

+	VM_WARN_ON(atomic64_read(&ctx->wrprotect_count[0]) < 0);
+	VM_WARN_ON(atomic64_read(&ctx->wrprotect_count[1]) < 0);
+	atomic64_inc(&ctx->wrprotect_count[mode_wp ? 0 : 1]);
+	smp_mb__after_atomic();
+	contention = atomic64_read(&ctx->wrprotect_count[mode_wp ? 1 : 0]) > 0;
+	if (!contention)
+		down_read(&ctx->wrprotect_rwsem);
+	else {
+		down_write(&ctx->wrprotect_rwsem);
+		if (!atomic64_read(&ctx->wrprotect_count[mode_wp ? 1 : 0])) {
+			contention = false;
+			downgrade_write(&ctx->wrprotect_rwsem);
+		}
+
+	}
 	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
 				  uffdio_wp.range.len, mode_wp,
 				  &ctx->mmap_changing);
+	if (!contention)
+		up_read(&ctx->wrprotect_rwsem);
+	else
+		up_write(&ctx->wrprotect_rwsem);
+	smp_mb__before_atomic();
+	atomic64_dec(&ctx->wrprotect_count[mode_wp ? 0 : 1]);
 	if (ret)
 		return ret;

@@ -1958,6 +1994,9 @@ static void init_once_userfaultfd_ctx(void *mem)
 	init_waitqueue_head(&ctx->fault_wqh);
 	init_waitqueue_head(&ctx->event_wqh);
 	init_waitqueue_head(&ctx->fd_wqh);
+	init_rwsem(&ctx->wrprotect_rwsem);
+	atomic64_set(&ctx->wrprotect_count[0], 0);
+	atomic64_set(&ctx->wrprotect_count[1], 0);
 	seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
 }

diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..4ff9f21d5a7b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3085,6 +3085,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;

+	/*
+	 * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
+	 * can be stale. We cannot allow do_wp_page to proceed or
+	 * it'll wrongly assume that nobody can still be writing to
+	 * the page if !pte_write.
+	 */
 	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
@@ -4274,6 +4280,12 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	if (vma_is_anonymous(vmf->vma)) {
+		/*
+		 * STALE_TLB_WARNING: while the uffd_wp bit is set,
+		 * the TLB can be stale. We cannot allow wp_huge_pmd()
+		 * to proceed or it'll wrongly assume that nobody can
+		 * still be writing to the page if !pmd_write.
+		 */
 		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
@@ -4388,6 +4400,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (!pte_present(vmf->orig_pte))
 		return do_swap_page(vmf);

+	/*
+	 * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
+	 * be stale.
+	 */
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);

@@ -4503,6 +4519,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			return 0;
 		}
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
+			/*
+			 * STALE_TLB_WARNING: if the pmd is NUMA
+			 * protnone the TLB can be stale.
+			 */
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);

Andrea Arcangeli Dec. 22, 2020, 7:44 p.m. UTC | #49

On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so

I assume you mean "in" mprotect_fixup, after change_protection.

If you would downgrade the mmap_lock to read there, then it'd severely
slowdown the non contention case, if there's more than vma that needs
change_protection.

You'd need to throw away the prev->vm_next info and you'd need to do a
new find_vma after droping the mmap_lock for reading and re-taking the
mmap_lock for writing at every iteration of the loop.

To do less harm to the non-contention case you could perhaps walk
vma->vm_next and check if it's outside the mprotect range and only
downgrade in such case. So let's assume we intend to optimize with
mmap_write_downgrade only the last vma.

The problem is once you had to take mmap_lock for writing, you already
stalled for I/O and waited all concurrent page faults and blocked them
as well for the vma allocations in split_vma, so that extra boost in
SMP scalability you get is lost in the noise there at best.

And the risk is that at worst that extra locked op of
mmap_write_downgrade() will hurt SMP scalability because it would
increase the locked ops of mprotect on the hottest false-shared
cacheline by 50% and that may outweight the benefit from unblocking
the page faults half a usec sooner on large systems.

But the ultimate reason why mprotect cannot do mmap_write_downgrade()
while userfaultfd_writeprotect can do mmap_read_lock and avoid the
mmap_write_lock altogether, is that mprotect leaves no mark in the
pte/hugepmd that allows to detect when the TLB is stale in order to
redirect the page fault in a dead end (handle_userfault() or
do_numa_page) until after the TLB has been flushed as it happens in
the the 4 cases below:

	/*
	 * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
	 * can be stale. We cannot allow do_wp_page to proceed or
	 * it'll wrongly assume that nobody can still be writing to
	 * the page if !pte_write.
	 */
	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
		/*
		 * STALE_TLB_WARNING: while the uffd_wp bit is set,
		 * the TLB can be stale. We cannot allow wp_huge_pmd()
		 * to proceed or it'll wrongly assume that nobody can
		 * still be writing to the page if !pmd_write.
		 */
		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
	/*
	 * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
	 * be stale.
	 */
	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
			/*
			 * STALE_TLB_WARNING: if the pmd is NUMA
			 * protnone the TLB can be stale.
			 */
			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))

Thanks,
Andrea

Matthew Wilcox Dec. 22, 2020, 8:15 p.m. UTC | #50

On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
> My previous suggestion to use a mutex to serialize
> userfaultfd_writeprotect with a mutex will still work, but we can run
> as many wrprotect and un-wrprotect as we want in parallel, as long as
> they're not simultaneous, we can do much better than a mutex.
> 
> Ideally we would need a new two_group_semaphore, where each group can
> run as many parallel instances as it wants, but no instance of one
> group can run in parallel with any instance of the other group. AFIK
> such a kind of lock doesn't exist right now.

Kent and I worked on one for a bit, and we called it a red-black mutex.
If team red had the lock, more members of team red could join in.
If team black had the lock, more members of team black could join in.
I forget what our rule was around fairness (if team red has the lock,
and somebody from team black is waiting, can another member of team red
take the lock, or must they block?)

It was to solve the direct-IO vs buffered-IO problem (you can have as many
direct-IO readers/writers at once or you can have as many buffered-IO
readers/writers at once, but exclude a mix of direct and buffered I/O).
In the end, we decided it didn't work all that well.

Andy Lutomirski Dec. 22, 2020, 8:19 p.m. UTC | #51

On Mon, Dec 21, 2020 at 8:16 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Dec 21, 2020 at 7:19 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > Ugh, this is unpleasantly complicated.
>
> What I *should* have said is that *because* userfaultfd is changing
> the VM layout, it should always act as if it had to take the mmap_sem
> for writing, and that the RW->RO change is an example of when
> downgrading that write-lock to a read lock is simply not valid -
> because it basically breaks the rules about what a lookup (ie a read)
> can do.
>
> A lookup can never turn a writable page non-writable. A lookup -
> through COW - _can_ turn a read-only page writable. So that's why
> "RO->RW" can be ok under the read lock, but RW->RO is not.
>
> Does that clarify the situation when I phrase it that way instead?

It's certainly more clear, but I'm still not thrilled with a bunch of
functions in different files maintained by different people that
cooperate using an undocumented lockless protocol.  It makes me
nervous from a maintainability standpoint even if the code is, in
fact, entirely correct.

But I didn't make my question clear either: when I asked about
mark_screen_rdonly(), I wasn't asking about locking or races at all.
mark_screen_rdonly() will happily mark *any* PTE readonly.  I can
easily believe that writable mappings of non-shared mappings all
follow the same CoW rules in the kernel and that, assuming all the
locking is correct, marking them readonly is conceptually okay.  But
shared mappings are a whole different story.  Is it actually the case
that all, or even most, drivers and other sources of shared mappings
will function correctly if one of their PTEs becomes readonly behind
their back?  Or maybe there are less bizarre ways for this to happen
without vm86 shenanigans and this is completely safe after all.

P.S. This whole mess reminds me that I need to take a closer look at
the upcoming SHSTK code.  Shadow stack pages violate all common sense
about how the various PTE bits work.

Nadav Amit Dec. 22, 2020, 8:19 p.m. UTC | #52

> On Dec 22, 2020, at 11:44 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
>> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
> 
> I assume you mean "in" mprotect_fixup, after change_protection.
> 
> If you would downgrade the mmap_lock to read there, then it'd severely
> slowdown the non contention case, if there's more than vma that needs
> change_protection.
> 
> You'd need to throw away the prev->vm_next info and you'd need to do a
> new find_vma after droping the mmap_lock for reading and re-taking the
> mmap_lock for writing at every iteration of the loop.
> 
> To do less harm to the non-contention case you could perhaps walk
> vma->vm_next and check if it's outside the mprotect range and only
> downgrade in such case. So let's assume we intend to optimize with
> mmap_write_downgrade only the last vma.
…

I read in detail your response and you make great points. To be fair,
I did not think it through and just tried to make a point that not
taking mmap_lock for write is an unrelated optimization.

So you make a great point that it is actually related and can only(?)
benefit uffd and arguably soft-dirty, to which I am going to add
mmap_write_lock().

Yet, my confidence in doing the optimization that you suggested (keep using
mmap_read_lock()) as part of the bug fix is very low and just got lower
since we discussed. So I do want in the future to try to avoid the overheads
I introduce (sorry), but it requires a systematic solution and some thought.

Perhaps any change to PTE in a page-table should increase a page-table
generation that we would save in the page-table page-struct (next to the
PTL) and pte_same() would also look at the original and updated "page-table
generation” when it checks if a PTE changed. So if a PTE went through
not-writable -> writable -> not-writable cycle without a TLB flush this can
go detected. Yet, there is still a question of how to detect pending TLB
flushes in finer granularity to avoid frequent unwarranted TLB flushes while
a TLB flush is pending.

It all requires some thought, and the fact that soft-dirty appears to be
broken too indicates that bugs can get unnoticed for some time.

Sorry for being a chicken, but I prefer to be safe than sorry.

Andrea Arcangeli Dec. 22, 2020, 8:26 p.m. UTC | #53

On Tue, Dec 22, 2020 at 08:15:53PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
> > My previous suggestion to use a mutex to serialize
> > userfaultfd_writeprotect with a mutex will still work, but we can run
> > as many wrprotect and un-wrprotect as we want in parallel, as long as
> > they're not simultaneous, we can do much better than a mutex.
> > 
> > Ideally we would need a new two_group_semaphore, where each group can
> > run as many parallel instances as it wants, but no instance of one
> > group can run in parallel with any instance of the other group. AFIK
> > such a kind of lock doesn't exist right now.
> 
> Kent and I worked on one for a bit, and we called it a red-black mutex.
> If team red had the lock, more members of team red could join in.
> If team black had the lock, more members of team black could join in.
> I forget what our rule was around fairness (if team red has the lock,
> and somebody from team black is waiting, can another member of team red
> take the lock, or must they block?)

In this case they would need to block and provide full fairness.

Well maybe just a bit of unfariness (to let a few more through the
door before it shuts) wouldn't be a deal breaker but it would need to
be bound or it'd starve the other color/side indefinitely. Otherwise
an ioctl mode_wp = true would block forever, if more ioctl mode_wp =
false keep coming in other CPUs (or the other way around).

The approximation with rwsem and two atomics provides full fariness in
both read and write mode (originally the read would stave the write
IIRC which was an issue for all mprotect etc.. not anymore thankfully).

> It was to solve the direct-IO vs buffered-IO problem (you can have as many
> direct-IO readers/writers at once or you can have as many buffered-IO
> readers/writers at once, but exclude a mix of direct and buffered I/O).
> In the end, we decided it didn't work all that well.

Well mixing buffered and direct-IO is certainly not a good practice so
it's reasonable to leave it up to userland to serialize if such mix is
needed, the kernel behavior is undefined if the mix is concurrent out
of order.

Andy Lutomirski Dec. 22, 2020, 8:34 p.m. UTC | #54

On Sat, Dec 19, 2020 at 2:06 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> > [ I have in mind another solution, such as keeping in each page-table a
> > “table-generation” which is the mm-generation at the time of the change,
> > and only flush if “table-generation”==“mm-generation”, but it requires
> > some thought on how to avoid adding new memory barriers. ]
> >
> > IOW: I think the change that you suggest is insufficient, and a proper
> > solution is too intrusive for “stable".
> >
> > As for performance, I can add another patch later to remove the TLB flush
> > that is unnecessarily performed during change_protection_range() that does
> > permission promotion. I know that your concern is about the “protect” case
> > but I cannot think of a good immediate solution that avoids taking mmap_lock
> > for write.
> >
> > Thoughts?
>
> On a second thought (i.e., I don’t know what I was thinking), doing so —
> checking mm_tlb_flush_pending() on every PTE read which is potentially
> dangerous and flushing if needed - can lead to huge amount of TLB flushes
> and shootodowns as the counter might be elevated for considerable amount of
> time.

I've lost track as to whether we still think that this particular
problem is really a problem, but could we perhaps make the
tlb_flush_pending field be per-ptl instead of per-mm?  Depending on
how it gets used, it could plausibly be done without atomics or
expensive barriers by using PTL to protect the field.

FWIW, x86 has a mm generation counter, and I don't think it would be
totally crazy to find a way to expose an mm generation to core code.
I don't think we'd want to expose the specific data structures that
x86 uses to track it -- they're very tailored to the oddities of x86
TLB management.  x86 also doesn't currently have any global concept of
which mm generation is guaranteed to have been propagated to all CPUs
-- we track the generation in the pagetables and, per cpu, the
generation that we know that CPU has seen.  x86 could offer a function
"ensure that all CPUs catch up to mm generation G and don't return
until this happens" and its relative "have all CPUs caught up to mm
generation G", but these would need to look at data from multiple CPUs
and would probably be too expensive on very large systems to use in
normal page faults unless we were to cache the results somewhere.
Making a nice cache for this is surely doable, but maybe more
complexity than we'd want.

--Andy

Nadav Amit Dec. 22, 2020, 8:58 p.m. UTC | #55

> On Dec 22, 2020, at 12:34 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Dec 19, 2020 at 2:06 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> [ I have in mind another solution, such as keeping in each page-table a
>>> “table-generation” which is the mm-generation at the time of the change,
>>> and only flush if “table-generation”==“mm-generation”, but it requires
>>> some thought on how to avoid adding new memory barriers. ]
>>> 
>>> IOW: I think the change that you suggest is insufficient, and a proper
>>> solution is too intrusive for “stable".
>>> 
>>> As for performance, I can add another patch later to remove the TLB flush
>>> that is unnecessarily performed during change_protection_range() that does
>>> permission promotion. I know that your concern is about the “protect” case
>>> but I cannot think of a good immediate solution that avoids taking mmap_lock
>>> for write.
>>> 
>>> Thoughts?
>> 
>> On a second thought (i.e., I don’t know what I was thinking), doing so —
>> checking mm_tlb_flush_pending() on every PTE read which is potentially
>> dangerous and flushing if needed - can lead to huge amount of TLB flushes
>> and shootodowns as the counter might be elevated for considerable amount of
>> time.
> 
> I've lost track as to whether we still think that this particular
> problem is really a problem,

If you mean “problem” as to whether there is a correctness issue with
userfaultfd and soft-dirty deferred flushes under mmap_read_lock() - yes
there is a problem and I produced these failures on upstream.

If you mean “problem” as to performance - I do not know.

> but could we perhaps make the
> tlb_flush_pending field be per-ptl instead of per-mm?  Depending on
> how it gets used, it could plausibly be done without atomics or
> expensive barriers by using PTL to protect the field.
> 
> FWIW, x86 has a mm generation counter, and I don't think it would be
> totally crazy to find a way to expose an mm generation to core code.
> I don't think we'd want to expose the specific data structures that
> x86 uses to track it -- they're very tailored to the oddities of x86
> TLB management.  x86 also doesn't currently have any global concept of
> which mm generation is guaranteed to have been propagated to all CPUs
> -- we track the generation in the pagetables and, per cpu, the
> generation that we know that CPU has seen.  x86 could offer a function
> "ensure that all CPUs catch up to mm generation G and don't return
> until this happens" and its relative "have all CPUs caught up to mm
> generation G", but these would need to look at data from multiple CPUs
> and would probably be too expensive on very large systems to use in
> normal page faults unless we were to cache the results somewhere.
> Making a nice cache for this is surely doable, but maybe more
> complexity than we'd want.

I had somewhat similar ideas - saving in each page-struct the generation,
which would allow to: (1) extend pte_same() to detect interim changes
that were reverted (RO->RW->RO) and (2) per-PTE pending flushes.

Obviously, I cannot do it as part of this fix. But moreover, it seems to me
that it would require a memory barrier after updating the PTEs and before
reading the current generation (that would be saved per page-table). I
try to think about schemes that would use the per-CPU generation instead,
but still could not and did not have the time to figure it out.

Yu Zhao Dec. 22, 2020, 9:14 p.m. UTC | #56

On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
> On Mon, Dec 21, 2020 at 07:19:35PM -0800, Andy Lutomirski wrote:
> > instance of this (with mmap_sem held for write) in x86:
> > mark_screen_rdonly().  Dare I ask how broken this is?  We could likely
> 
> That one is buggy despite it takes the mmap_write_lock... inverting
> the last two lines would fix it though.
> 
> -	mmap_write_unlock(mm);
> 	flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT, false);
> +	mmap_write_unlock(mm);
> 
> The issue here is that rightfully wp_page_copy copies outside the PT
> lock (good thing) and it correctly assumes during the copy nobody can
> modify the page if the fault happened in a write access and the pte
> had no _PAGE_WRITE bit set.
> 
> The bug is clearly in userfaultfd_writeprotect that doesn't enforce
> the above invariant.
> 
> If the userfaultfd_wrprotect(mode_wp = false) run after the deferred
> TLB flush this could never happen because the uffd_wp bit was still
> set and the page fault would hit the handle_userfault dead end and
> it'd have to be restarted from scratch.
> 
> Leaving stale tlb entries after dropping the PT lock and with only the
> mmap_lock_read is only ok if the page fault has a "check" that catches
> that specific case, so that specific case is fully serialized by the
> PT lock alone and the "catch" path is fully aware about the stale TLB
> entries that were left around.
> 
> So looking at the two cases that predates uffd-wp that already did a
> RW->RO transition with the mmap_lock_read, they both comply with the
> wp_page_copy invariant but with two different techniques:
> 
> 1) change_protection is called by the scheduler with the
>   mmap_read_lock and with a deferred TLB flush, and things don't fall
>   apart completely there simply because we catch that in do_numa_page:
> 
> 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> +               /* catch it: no risk to end up in wp_page_copy even if the change_protection
> +	           is running concurrently and the tlb flush is still pending */
> 		return do_numa_page(vmf);
> 
>   do_numa_page doesn't issue any further tlb flush, but basically
>   restores the pte to the previous value, so then the pending flush
>   becomes a noop, since there was no modification to the pte in the
>   first place after do_numa_page runs.
> 
> 2) ksm write_protect_page is also called with mmap_read_lock, and it
>    will simply not defer the flush. If the tlb flush is done inside
>    the PT lock it is ok to take mmap_read_lock since the page fault
>    code will not make any assumption about the TLB before reading the
>    pagetable content under PT lock.
> 
>    In fact if mm_tlb_flush_pending(mm) is true, write_protect_page in
>    KSM has to issue an extra synchronous flush before dropping the PT
>    lock, even if it finds the _PAGE_WRITE bit is already clear,
>    precisely because there can be another deferred flush that cleared
>    the pte but didn't flush the TLB yet.
> 
> userfaultfd_writeprotect() is already using the technique in 1) to be
> safe, except the un-wrprotect can break it if run while the tlb flush
> is still pending...
> 
> The good thing is this guarantee must be provided not more granular
> even than a vma, not per-mm and a vma cannot be registered on two
> different uffd ctx at the same time, so the locking can happen all
> internal to the uffd ctx.
> 
> My previous suggestion to use a mutex to serialize
> userfaultfd_writeprotect with a mutex will still work, but we can run
> as many wrprotect and un-wrprotect as we want in parallel, as long as
> they're not simultaneous, we can do much better than a mutex.
> 
> Ideally we would need a new two_group_semaphore, where each group can
> run as many parallel instances as it wants, but no instance of one
> group can run in parallel with any instance of the other group. AFIK
> such a kind of lock doesn't exist right now.
> 
> wrprotect if left alone never had an issue since we had a working
> "catch" check for it well before wp_page_copy could run. unprotect
> also if left alone was ok since it's a permission upgrade.
> 
> Alternatively, we can solve this with the same technique as in 2),
> because all it matters is that the 4k pte modification doesn't take
> the mmap_lock_write. A modification to a single 4k pte or a single 2m
> hugepmd is very likely indication that there's going to be a flood of
> those in parallel from all CPUs and likely there's also a flood of
> page faults from all CPUs in parallel. In addition for a 4k wrprotect
> or un-wrprotect there's not a significant benefit in deferring the TLB
> flush.
> 
> I don't think we can take the mmap_write_lock unconditionally because
> one of the main reasons to deal with the extra complexity of resolving
> page faults in userland is to bypass mprotect entirely.
> 
> Example, JITs can use unprivileged uffd to eliminate the software-set
> dirty bit every time it modifies memory, that would then require
> frequent wrprotect/un-wrprotect during page faults and the less likely
> we have to block for I/O during those CPU-only events, the
> better. This is one of the potential use cases, but there's plenty
> more.
> 
> So the alternative would be to take mmap_read_lock for "len <=
> HPAGE_PMD_SIZE" and the mmap_write_lock otherwise, and to add a
> parameter in change_protection to tell it to flush before dropping the
> PT lock and not defer the flush. So this is going to be more intrusive
> in the VM and it's overall unnecessary.
> 
> The below is mostly untested... but it'd be good to hear some feedback
> before doing more work in this direction.

This works but I don't prefer this option because 1) this is new
way of making pte_wrprotect safe and 2) it's specific to ufd and
can't be applied to clear_soft_dirty() which has no "catcher". No
matter how good the documentation about this new way is now, it
will be out of date, speaking from my personal experience.

I'd go with what Nadav has -- the memory corruption problem has been
there for a while and nobody has complained except Nadav. This makes
me think people is less likely to notice any performance issues from
holding mmap_lock for write in his patch either.

But I can't say I have zero concern with the potential performance
impact given that I have been expecting the fix to go to stable,
which I care most. So the next option on my list is to have a
common "catcher" in do_wp_page() which singles out pages that have
page_mapcount equal to one and reuse them instead of trying to
make copies when they have elevated page_count or are locked. Think
this as a common "catcher" for anybody who pte_wrprotect but doesn't
belong to the COW category (ksm, fork, etc.). I have to stress that
I don't know if this can be done confidently enough to be included
in stable, hence the second option on my list.

> From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Tue, 22 Dec 2020 14:30:16 -0500
> Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after
>  userfaultfd_writeprotect()
> 
> change_protection() is called by userfaultfd_writeprotect() with the
> mmap_lock_read like in change_prot_numa().
> 
> The page fault code in wp_copy_page() rightfully assumes if the CPU
> issued a write fault and the write bit in the pagetable is not set, no
> CPU can write to the page. That's wrong assumption after
> userfaultfd_writeprotect(). That's also wrong assumption after
> change_prot_numa() where the regular page fault code also would assume
> that if the present bit is not set and the page fault is running,
> there should be no stale TLB entry, but there is still.
> 
> So to stay safe, the page fault code must be prevented to run as long
> as long as the TLB flush remains pending. That is already achieved by
> the do_numa_page() path for change_prot_numa() and by the
> userfaultfd_pte_wp() path for userfaultfd_writeprotect().
> 
> The problem that needs fixing is that an un-wrprotect
> (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not
> set) could run in between the original wrprotect
> (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set)
> and wp_copy_page, while the TLB flush remains pending.
> 
> In such case the userfaultfd_pte_wp() check will stop being effective
> to prevent the regular page fault code to run.
> 
> CPU0			CPU 1		CPU 2
> ------			--------	-------
> userfaultfd_wrprotect(mode_wp = true)
> PT lock
> atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
> PT unlock
> 
> 			do_page_fault FAULT_FLAG_WRITE
> 					userfaultfd_wrprotect(mode_wp = false)
> 					PT lock
> 					ATOMIC clear _PAGE_UFFD_WP <- problem
> 					/* _PAGE_WRITE not set */
> 					PT unlock
> 					XXXXXXXXXXXXXX BUG RACE window open here
> 
> 			PT lock
> 			FAULT_FLAG_WRITE is set by CPU
> 			_PAGE_WRITE is still clear in pte
> 			PT unlock
> 
> 			wp_page_copy
> 			copy_user_page runs with stale TLB
> 
> deferred tlb flush <- too late
> XXXXXXXXXXXXXX BUG RACE window close here
> ================================================================================
> 
> If the userfaultfd_wrprotect(mode_wp = false) can only run after the
> deferred TLB flush the above cannot happen either, because the uffd_wp
> bit will remain set until after the TLB flush and the page fault would
> reliably hit the handle_userfault() dead end as long as the TLB is
> stale.
> 
> So to fix this we need to prevent any un-wrprotect to start until the
> last outstanding wrprotect completed and to prevent any further
> wrprotect until the last outstanding un-wrprotect completed. Then
> wp_page_copy can still run in parallel but only with the un-wrprotect,
> and that's fine since it's a permission promotion.
> 
> We would need a new two_group_semaphore, where each group can run as
> many parallel instances as it wants, but no instance of one group can
> run in parallel with any instance of the other group. The below rwsem
> with two atomics approximates that kind lock.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  fs/userfaultfd.c | 39 +++++++++++++++++++++++++++++++++++++++
>  mm/memory.c      | 20 ++++++++++++++++++++
>  2 files changed, 59 insertions(+)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 894cc28142e7..3729ca99dae5 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -76,6 +76,20 @@ struct userfaultfd_ctx {
>  	bool mmap_changing;
>  	/* mm with one ore more vmas attached to this userfaultfd_ctx */
>  	struct mm_struct *mm;
> +	/*
> +	 * We cannot let userfaultd_writeprotect(mode_wp = false)
> +	 * clear _PAGE_UFFD_WP from the pgtable, until the last
> +	 * outstanding userfaultd_writeprotect(mode_wp = true)
> +	 * completes, because the TLB flush is deferred. The page
> +	 * fault must be forced to enter handle_userfault() while the
> +	 * TLB flush is deferred, and that's achieved with the
> +	 * _PAGE_UFFD_WP bit. The _PAGE_UFFD_WP can only be cleared
> +	 * after there are no stale TLB entries left, only then it'll
> +	 * be safe again for the page fault to continue and not hit
> +	 * the handle_userfault() dead end anymore.
> +	 */
> +	struct rw_semaphore wrprotect_rwsem;
> +	atomic64_t wrprotect_count[2];
>  };
>  
>  struct userfaultfd_fork_ctx {
> @@ -1783,6 +1797,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	struct uffdio_writeprotect __user *user_uffdio_wp;
>  	struct userfaultfd_wake_range range;
>  	bool mode_wp, mode_dontwake;
> +	bool contention;
>  
>  	if (READ_ONCE(ctx->mmap_changing))
>  		return -EAGAIN;
> @@ -1808,9 +1823,30 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	if (mode_wp && mode_dontwake)
>  		return -EINVAL;
>  
> +	VM_WARN_ON(atomic64_read(&ctx->wrprotect_count[0]) < 0);
> +	VM_WARN_ON(atomic64_read(&ctx->wrprotect_count[1]) < 0);
> +	atomic64_inc(&ctx->wrprotect_count[mode_wp ? 0 : 1]);
> +	smp_mb__after_atomic();
> +	contention = atomic64_read(&ctx->wrprotect_count[mode_wp ? 1 : 0]) > 0;
> +	if (!contention)
> +		down_read(&ctx->wrprotect_rwsem);
> +	else {
> +		down_write(&ctx->wrprotect_rwsem);
> +		if (!atomic64_read(&ctx->wrprotect_count[mode_wp ? 1 : 0])) {
> +			contention = false;
> +			downgrade_write(&ctx->wrprotect_rwsem);
> +		}
> +
> +	}
>  	ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
>  				  uffdio_wp.range.len, mode_wp,
>  				  &ctx->mmap_changing);
> +	if (!contention)
> +		up_read(&ctx->wrprotect_rwsem);
> +	else
> +		up_write(&ctx->wrprotect_rwsem);
> +	smp_mb__before_atomic();
> +	atomic64_dec(&ctx->wrprotect_count[mode_wp ? 0 : 1]);
>  	if (ret)
>  		return ret;
>  
> @@ -1958,6 +1994,9 @@ static void init_once_userfaultfd_ctx(void *mem)
>  	init_waitqueue_head(&ctx->fault_wqh);
>  	init_waitqueue_head(&ctx->event_wqh);
>  	init_waitqueue_head(&ctx->fd_wqh);
> +	init_rwsem(&ctx->wrprotect_rwsem);
> +	atomic64_set(&ctx->wrprotect_count[0], 0);
> +	atomic64_set(&ctx->wrprotect_count[1], 0);
>  	seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
>  }
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 7d608765932b..4ff9f21d5a7b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3085,6 +3085,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  
> +	/*
> +	 * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
> +	 * can be stale. We cannot allow do_wp_page to proceed or
> +	 * it'll wrongly assume that nobody can still be writing to
> +	 * the page if !pte_write.
> +	 */
>  	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  		return handle_userfault(vmf, VM_UFFD_WP);
> @@ -4274,6 +4280,12 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
>  	if (vma_is_anonymous(vmf->vma)) {
> +		/*
> +		 * STALE_TLB_WARNING: while the uffd_wp bit is set,
> +		 * the TLB can be stale. We cannot allow wp_huge_pmd()
> +		 * to proceed or it'll wrongly assume that nobody can
> +		 * still be writing to the page if !pmd_write.
> +		 */
>  		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
>  			return handle_userfault(vmf, VM_UFFD_WP);
>  		return do_huge_pmd_wp_page(vmf, orig_pmd);
> @@ -4388,6 +4400,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	if (!pte_present(vmf->orig_pte))
>  		return do_swap_page(vmf);
>  
> +	/*
> +	 * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
> +	 * be stale.
> +	 */
>  	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
>  		return do_numa_page(vmf);
>  
> @@ -4503,6 +4519,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  			return 0;
>  		}
>  		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
> +			/*
> +			 * STALE_TLB_WARNING: if the pmd is NUMA
> +			 * protnone the TLB can be stale.
> +			 */
>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>  				return do_huge_pmd_numa_page(&vmf, orig_pmd);
>  
>

Nadav Amit Dec. 22, 2020, 9:14 p.m. UTC | #57

> On Dec 22, 2020, at 11:31 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Tue, 22 Dec 2020 14:30:16 -0500
> Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after
> userfaultfd_writeprotect()
> 
> change_protection() is called by userfaultfd_writeprotect() with the
> mmap_lock_read like in change_prot_numa().
> 
> The page fault code in wp_copy_page() rightfully assumes if the CPU
> issued a write fault and the write bit in the pagetable is not set, no
> CPU can write to the page. That's wrong assumption after
> userfaultfd_writeprotect(). That's also wrong assumption after
> change_prot_numa() where the regular page fault code also would assume
> that if the present bit is not set and the page fault is running,
> there should be no stale TLB entry, but there is still.
> 
> So to stay safe, the page fault code must be prevented to run as long
> as long as the TLB flush remains pending. That is already achieved by
> the do_numa_page() path for change_prot_numa() and by the
> userfaultfd_pte_wp() path for userfaultfd_writeprotect().
> 
> The problem that needs fixing is that an un-wrprotect
> (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not
> set) could run in between the original wrprotect
> (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set)
> and wp_copy_page, while the TLB flush remains pending.

I may need to read your patch more carefully, but in general I do not like
the approach. You are much more experienced than I am, but IMHO the TLB
flushing logic needs to be further simplified and generalized and not the
other way around.

The complexity is already too high. We have tlb_flush_batched and
tlb_flush_pending, which I think should be (somehow) combined.
tlb_gather_mmu() is designed for zapping, but can’t it be modified to suit
protection changes to avoid buggy use-cases (as the wrong use in
clear_refs_write() ) ?

So adding new userfaultfd specific code, which potentially does not address
all the interactions (now or the future), is concerning.

In this regard, a similar problem to the one in userfaultfd
(mmap_read_lock() while write-protecting) already exists with soft-dirty
clearing, so any solution should also address the soft-dirty issue.

Andrea Arcangeli Dec. 22, 2020, 9:17 p.m. UTC | #58

On Tue, Dec 22, 2020 at 12:19:49PM -0800, Nadav Amit wrote:
> Perhaps any change to PTE in a page-table should increase a page-table
> generation that we would save in the page-table page-struct (next to the

The current rule is that by the time in the page fault we find a
pte/hugepmd in certain state under the "PT lock", the TLB cannot be
left stale with a more permissive state at that point.

So if under "PT lock" the page fault sees !pte_write, it means no
other CPU can possibly modify the page anymore.

That must be enforced at all times, no sequence number is required if
you enforce that and the whole point of the fix is to enforce that.

This is how things always worked in the page fault and it's perfectly
fine.

For the record I never suggested to change anything in the page fault
code.

So the only way we can leave stale TLB after dropping PT lock with the
mmap_read_lock and concurrent page faults is with a marker:

1) take PT lock
2) leave in the pte/hugepmd a unique identifiable marker
3) change the pte to downgrade permissions
4) drop PT lock and leave stale TLB entries with the upgraded permissions

In point 3) the pte is built in a way that is guaranteed to trigger a
deterministic path in the page fault. And in the way of that
deterministic path you put the marker check for 2) to send the fault
to a dead-end where the stale TLB is actually irrelevant and harmless.

No change to the page fault is needed if you only enforce the above.

Even if we'd take the mmap_write_lock in userfaultfd_writeprotect, you
will still have to deal with the above technique because of
change_prot_numa().

Would you suggest to add a sequence counter and to change all pte_same
in order to make change_prot_numa safer or is it safe enough already
using the above technique that checks the marker in the page fault?

To be safe NUMA balancing is using the same mechanism above of sending
the page fault into a dead end, in order to call the very same
function (change_permissions) with the very same lock (mmap_read_lock)
as userfaultfd_writeprotect.

What happens then in do_numa_page then is different than what happens
in handle_userfault, but both are a dead ends as far as the page fault
code is concerned and they will never come back to the page fault
code. That's how they avoid breaking the page fault.

The final rule for the above logic to work safe for uffd, is that the
marker cannot be cleared until after the deferred TLB flush is
executed (that's where the window for the race existed, and the fix
closed such window).

do_numa_page differs in clearing the marker while the TLB flush still
pending, but it can do that because it puts the same exact pte value
(with the upgraded permissions) that was there before it put the
marker in the pte. In turn do_numa_page makes the pending TLB flush
becomes a noop and it doesn't need to wait for it before removing the
marker. Does that last difference in how the marker is cleared,
warrant to consider what NUMA balancing radically different so to
forbid userfaultfd_writeprotect to use the same logic in the very same
function with the very same lock in order to decrease the VM
complexity? In my view no.

Thanks,
Andrea

Andrea Arcangeli Dec. 22, 2020, 9:34 p.m. UTC | #59

On Tue, Dec 22, 2020 at 12:58:18PM -0800, Nadav Amit wrote:
> I had somewhat similar ideas - saving in each page-struct the generation,
> which would allow to: (1) extend pte_same() to detect interim changes
> that were reverted (RO->RW->RO) and (2) per-PTE pending flushes.

What don't you feel safe about, what's the problem with RO->RO->RO, I
don't get it.

The pte_same is perfectly ok without sequence counter in my view, I
never seen anything that would not be ok with pte_same given all the
invariant are respected. It's actually a great optimization compared
to any unscalable sequence counter.

The counter would slowdown everything, having to increase a counter
every time you change a pte, no matter if it's a counter per pgtable
or per-vma or per-mm, sounds very bad.

I'd rather prefer to take mmap_lock_write across the whole userfaultfd
ioctl, than having to deal with a new sequence counter increase for
every pte modification on a heavily contended cacheline.

Also note the counter would have solved nothing for
userfaultfd_writeprotect, it's useless to detect stale TLB entries.

See how !pte_write check happens after the counter was already increased:

CPU0			CPU 1		CPU 2
------			--------	-------
userfaultfd_wrprotect(mode_wp = true)
PT lock
atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
false_shared_counter_counter++ 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PT unlock

			do_page_fault FAULT_FLAG_WRITE
					userfaultfd_wrprotect(mode_wp = false)
					PT lock
					ATOMIC clear _PAGE_UFFD_WP <- problem
					/* _PAGE_WRITE not set */
					false_shared_counter_counter++ 
					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
					PT unlock
					XXXXXXXXXXXXXX BUG RACE window open here

			PT lock
			counter = false_shared_counter_counter
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
			FAULT_FLAG_WRITE is set by CPU
			_PAGE_WRITE is still clear in pte
			PT unlock

			wp_page_copy
			copy_user_page runs with stale TLB

			pte_same(counter, orig_pte, pte) -> PASS
				 ^^^^^^^                    ^^^^
			commit the copy to the pte with the lost writes

deferred tlb flush <- too late
XXXXXXXXXXXXXX BUG RACE window close here
================================================================================

Andrea Arcangeli Dec. 22, 2020, 10:02 p.m. UTC | #60

On Tue, Dec 22, 2020 at 02:14:41PM -0700, Yu Zhao wrote:
> This works but I don't prefer this option because 1) this is new
> way of making pte_wrprotect safe and 2) it's specific to ufd and
> can't be applied to clear_soft_dirty() which has no "catcher". No

I didn't look into clear_soft_dirty issue, I can look into that.

To avoid having to deal with a 3rd solution it will have to use one of
the two:

1) avoid deferring tlb flush and enforce a sync flush before dropping
  the PT lock even if mm_mm_tlb_flush_pending is true ->
  write_protect_page in KSM

2) add its own new catcher for its own specific marker (for UFFD-WP
   the marker is _PAGE_UFFD_WP, for change_prot_numa is PROT_NONE on a
   vma->vm_pgprot not PROT_NONE, for soft dirty it could be
   _PAGE_SOFT_DIRTY) to send the page fault to a dead end before the
   pte value is interpreted.

> matter how good the documentation about this new way is now, it
> will be out of date, speaking from my personal experience.

A common catcher for all 3 is not possible because each catcher
depends on whatever marker and whatever pte value they set that may
lead to a different deterministic path where to put the catcher or
multiple paths even. do_numa_page requires a catcher in a different
place already.

Or well, a common catcher for all 3 is technically possible but it'd
perform slower requiring to check things twice.

But perhaps the soft_dirty can use the same catcher of uffd-wp given
the similarity?

> I'd go with what Nadav has -- the memory corruption problem has been
> there for a while and nobody has complained except Nadav. This makes
> me think people is less likely to notice any performance issues from
> holding mmap_lock for write in his patch either.

uffd-wp is a fairly new feature, the current users are irrelevant,
keeping it optimal is just for the future potential.

> But I can't say I have zero concern with the potential performance
> impact given that I have been expecting the fix to go to stable,
> which I care most. So the next option on my list is to have a

Actually stable would be very fine to go with Nadav patch and use the
mmap_lock_write unconditionally. The optimal fix is only relevant for
the future potential, so it's only relevant for Linus's tree.

However the feature is recent enough that it won't require a deep
backport so the optimal fix is likely fine for stable as well,
generally stable prefers the same fix as in the upstream when there's
no major backport rejection issue.

The alternative solution for uffd is to do the deferred flush under
mmap_lock_write if len is > HPAGE_PMD_SIZE, or to tell
change_protection not to defer the flush and to take the
mmap_lock_read for <= HPAGE_PMD_SIZE. That will avoid depending on the
catcher and then userfaultfd_writeprotect(mode_wp=true)
userfaultfd_writeprotect(mode_wp=false) can even run in parallel at
all times. The cons is large userfaultfd_writeprotect will block for
I/O and those would happen at least in the postcopy live snapshotting
use case.

The main cons is that it'd require modification to change_protection
so it actually looks more intrusive, not less.

Overall anything that allows to wrprotect 1 pte with only the
mmap_lock_read exactly like KSM write_protect_page, would be enough for
uffd-wp.

What isn't ok in terms of future potential is unconditional
mmap_lock_write as in the original suggested patch in my view. It
doesn't mean we can take mmap_lock_write when the operation is large
and there is actually more benefit from deferring the flush.

> common "catcher" in do_wp_page() which singles out pages that have
> page_mapcount equal to one and reuse them instead of trying to

I don't think the page_mapcount matters here. If the wp page reuse was
more accurate (as it was before) we wouldn't notice this issue, but it
still would be a bug that there were stale TLB entries. It worked by
luck.

Thanks,
Andrea

Yu Zhao Dec. 22, 2020, 11:39 p.m. UTC | #61

On Tue, Dec 22, 2020 at 05:02:37PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 22, 2020 at 02:14:41PM -0700, Yu Zhao wrote:
> > This works but I don't prefer this option because 1) this is new
> > way of making pte_wrprotect safe and 2) it's specific to ufd and
> > can't be applied to clear_soft_dirty() which has no "catcher". No
> 
> I didn't look into clear_soft_dirty issue, I can look into that.
> 
> To avoid having to deal with a 3rd solution it will have to use one of
> the two:
> 
> 1) avoid deferring tlb flush and enforce a sync flush before dropping
>   the PT lock even if mm_mm_tlb_flush_pending is true ->
>   write_protect_page in KSM
> 
> 2) add its own new catcher for its own specific marker (for UFFD-WP
>    the marker is _PAGE_UFFD_WP, for change_prot_numa is PROT_NONE on a
>    vma->vm_pgprot not PROT_NONE, for soft dirty it could be
>    _PAGE_SOFT_DIRTY) to send the page fault to a dead end before the
>    pte value is interpreted.
> 
> > matter how good the documentation about this new way is now, it
> > will be out of date, speaking from my personal experience.
> 
> A common catcher for all 3 is not possible because each catcher
> depends on whatever marker and whatever pte value they set that may
> lead to a different deterministic path where to put the catcher or
> multiple paths even. do_numa_page requires a catcher in a different
> place already.
> 
> Or well, a common catcher for all 3 is technically possible but it'd
> perform slower requiring to check things twice.
> 
> But perhaps the soft_dirty can use the same catcher of uffd-wp given
> the similarity?
> 
> > I'd go with what Nadav has -- the memory corruption problem has been
> > there for a while and nobody has complained except Nadav. This makes
> > me think people is less likely to notice any performance issues from
> > holding mmap_lock for write in his patch either.
> 
> uffd-wp is a fairly new feature, the current users are irrelevant,
> keeping it optimal is just for the future potential.
> 
> > But I can't say I have zero concern with the potential performance
> > impact given that I have been expecting the fix to go to stable,
> > which I care most. So the next option on my list is to have a
> 
> Actually stable would be very fine to go with Nadav patch and use the
> mmap_lock_write unconditionally. The optimal fix is only relevant for
> the future potential, so it's only relevant for Linus's tree.
> 
> However the feature is recent enough that it won't require a deep
> backport so the optimal fix is likely fine for stable as well,
> generally stable prefers the same fix as in the upstream when there's
> no major backport rejection issue.
> 
> The alternative solution for uffd is to do the deferred flush under
> mmap_lock_write if len is > HPAGE_PMD_SIZE, or to tell
> change_protection not to defer the flush and to take the
> mmap_lock_read for <= HPAGE_PMD_SIZE. That will avoid depending on the
> catcher and then userfaultfd_writeprotect(mode_wp=true)
> userfaultfd_writeprotect(mode_wp=false) can even run in parallel at
> all times. The cons is large userfaultfd_writeprotect will block for
> I/O and those would happen at least in the postcopy live snapshotting
> use case.
> 
> The main cons is that it'd require modification to change_protection
> so it actually looks more intrusive, not less.
> 
> Overall anything that allows to wrprotect 1 pte with only the
> mmap_lock_read exactly like KSM write_protect_page, would be enough for
> uffd-wp.
> 
> What isn't ok in terms of future potential is unconditional
> mmap_lock_write as in the original suggested patch in my view. It
> doesn't mean we can take mmap_lock_write when the operation is large
> and there is actually more benefit from deferring the flush.
> 
> > common "catcher" in do_wp_page() which singles out pages that have
> > page_mapcount equal to one and reuse them instead of trying to
> 
> I don't think the page_mapcount matters here. If the wp page reuse was
> more accurate (as it was before) we wouldn't notice this issue, but it
> still would be a bug that there were stale TLB entries. It worked by
> luck.

Can you please correct my understanding below? Thank you.

Generally speaking, we have four possible combinations relevant to
this discussion:
  1) anon, COW
  2) anon, non-COW
  3) file, COW
  4) file, non-COW

Assume we pte_wrprotect while holding mmap_lock for read, all four
combinations can be routed to do_wp_page(). The difference is that
1) and 3) are guaranteed to be RO when they become COW, and what we
do on top of their existing state doesn't make any difference.

2) is the false positive because of what we do, and it's causing the
memory corruption because do_wp_page() tries to make copies of pages
that seem to be RO but may have stale RW tlb entries pending flush.

If we grab mmap_lock for write, we block the fault that tries to copy
before we flush. Once it's done, we'll have two faulting CPUs, one
that had the stale entry and would otherwise do the damage and the
other that tries to copy. Then there is the silly part: we make a copy
and swap it with the only user who already has it...

We are talking about non-COW anon pages here -- they can't be mapped
more than once. So why not just identify them by checking
page_mapcount == 1 and then unconditionally reuse them? (This is
probably where I've missed things.)

4) may also be false positive (there are other legit cases relying on
it), and it's correctly identified by !PageAnon() +
VM_WRITE|VM_SHARED. For this category, we always reuse and we hitch
a free ride.

Linus Torvalds Dec. 22, 2020, 11:50 p.m. UTC | #62

On Tue, Dec 22, 2020 at 3:39 PM Yu Zhao <yuzhao@google.com> wrote:
>
> 2) is the false positive because of what we do, and it's causing the
> memory corruption because do_wp_page() tries to make copies of pages
> that seem to be RO but may have stale RW tlb entries pending flush.

Yeah, that's definitely a different bug.

The rule is that the TLB flush has to be done before the page table
lock is released.

See zap_pte_range() for an example of doing it right, even in the
presence of complexities (ie that has an example of both flushing the
TLB, and doing the actual "free the pages after flush", and it does
the two cases separately).

           Linus

Linus Torvalds Dec. 23, 2020, 12:01 a.m. UTC | #63

On Tue, Dec 22, 2020 at 3:50 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> See zap_pte_range() for an example of doing it right, even in the
> presence of complexities (ie that has an example of both flushing the
> TLB, and doing the actual "free the pages after flush", and it does
> the two cases separately).

The more I look at the mprotect code, the less I like it. We seem to
be much better about the TLB flushes in other places (looking at
mremap, for example). The mprotect code seems to be very laissez-faire
about the TLB flushing.

Does adding a TLB flush to before that

        pte_unmap_unlock(pte - 1, ptl);

fix things for you?

That's not the right fix - leaving a stale TLB entry around is fine if
the TLB entry is more strict wrt protections - but it might be worth
testing as a "does it at least close the problem" patch.

          Linus

Linus Torvalds Dec. 23, 2020, 12:20 a.m. UTC | #64

On Tue, Dec 22, 2020 at 3:50 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The rule is that the TLB flush has to be done before the page table
> lock is released.

I take that back. I guess it's ok as long as the mmap_sem is held for
writing. Then the TLB flush can be delayed until just before releasing
the mmap_sem. I think.

The stale TLB entries still mean that somebody else can write through
them in another thread, but as long as anybody who actually unmaps the
page (and frees it - think rmap etc) is being careful, mprotect()
itself can probably afford to be a bit laissez-faire.

So mprotect() itself should be ok, I think, because it takes things for writing.

Even with the mmap_sem held for writing, truncate and friends can see
the read-only page table entries (because they can look things up
using the file i_mmap thing instead), but then they rely on the page
table lock and they'll also be careful if they then change that PTE
and will force their own TLB flushes.

So I think a pending TLB flush outside the page table lock is fine -
but once again only if you hold the mmap_sem for writing. Not for
reading, because then the page tables need to be synchronized with the
TLB so that other readers don't see the not-yet-synchronized state.

It once again looks like it's just userfaultfd that would trigger this
due to the read-lock on the mmap_sem. And mprotect() itself is fine.

Am I missing something?

But apparently Nadav sees problems even with that lock changed to a
write lock. Navad?

           Linus

Yu Zhao Dec. 23, 2020, 12:23 a.m. UTC | #65

On Tue, Dec 22, 2020 at 04:01:45PM -0800, Linus Torvalds wrote:
> On Tue, Dec 22, 2020 at 3:50 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > See zap_pte_range() for an example of doing it right, even in the
> > presence of complexities (ie that has an example of both flushing the
> > TLB, and doing the actual "free the pages after flush", and it does
> > the two cases separately).
> 
> The more I look at the mprotect code, the less I like it. We seem to
> be much better about the TLB flushes in other places (looking at
> mremap, for example). The mprotect code seems to be very laissez-faire
> about the TLB flushing.
> 
> Does adding a TLB flush to before that
> 
>         pte_unmap_unlock(pte - 1, ptl);
> 
> fix things for you?

It definitely does. But if I had to choose, I'd go with holding
mmap_lock for write because 1) it's less likely to storm other CPUs by
IPI and would only have performance impact on processes that use ufd,
which I guess already have high tolerance for not-so-good performance,
and 2) people are spearheading multiple efforts to reduce the mmap_lock
contention, which hopefully would make ufd users suffer less soon.

> That's not the right fix - leaving a stale TLB entry around is fine if
> the TLB entry is more strict wrt protections - but it might be worth
> testing as a "does it at least close the problem" patch.

Well, things get trick if we do this. I'm not sure if I could vouch
such a fix for stable as confident as I do holding mmap_lock for
write.

Andrea Arcangeli Dec. 23, 2020, 2:17 a.m. UTC | #66

On Tue, Dec 22, 2020 at 05:23:39PM -0700, Yu Zhao wrote:
> and 2) people are spearheading multiple efforts to reduce the mmap_lock
> contention, which hopefully would make ufd users suffer less soon.

In my view UFFD is an already deployed working solution that
eliminates the mmap_lock_write contention to allocate and free memory.

We need to add a UFFDIO_POPULATE to use in combination with
UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page or
THP and map it, it'll be indistinguishable to UFFDIO_ZEROPAGE, but it
will solve the last performance bottleneck by avoiding a suprious
wrprotect fault after the allocation).

After that malloc based on uffd should become competitive single
threaded and it won't ever require the mmap_lock_write so allocations
and freeing of memory can continue indefinitely from all threaded in
parallel. There will never be another mmap or munmap stalling all
threads.

This is not why uffd was created, it's just a secondary performance
benefit of uffd, but it's still a relevant benefit in my view.

Every time I hear people with major mmap_lock_write issues I recommend
uffd, but you know, until we add the UFFDIO_POPULATE, it will still
have higher fixed allocation overhead because of the wprotect fault
after UFFDIO_ZEROCOPY. UFFDIO_COPY also would be not as optimal as a
clear_page and currently it's not even THP capable.

In addition you'll get a SIGBUS after an user after free. It's not
like when you have a malloc lib doing MADV_DONTNEED at PAGE_SIZE
granularity to rate limit the costly munmap, and then the app does an
use after free and it reads zero or writes to a newly faulted in page.

The above will not require any special privilege and all allocated
virtual memory remains fully swappable, because SIGBUS mode will never
have to block any kernel initiated faults.

uffd-wp also is totally usable unprivileged by default to replace
various software dirty bits with the info provided in O(1) instead of
O(N), as long as the writes are done in userland also unprivileged by
default without tweaking any sysctl and with zero risk of increasing
reproduciblity of any exploit against unrelated random kernel bugs.

So if we're forced to take the mmap_lock_write it'd be cool if at
least we can avoid it for 1 single pte or hugepmd wrprotection, as it
happens in write_protect_page() KSM.

Thanks,
Andrea

Andrea Arcangeli Dec. 23, 2020, 2:56 a.m. UTC | #67

On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> We are talking about non-COW anon pages here -- they can't be mapped
> more than once. So why not just identify them by checking
> page_mapcount == 1 and then unconditionally reuse them? (This is
> probably where I've missed things.)

The problem in depending on page_mapcount to decide if it's COW or
non-COW (respectively wp_page_copy or wp_page_reuse) is that is GUP
may elevate the count of a COW anon page that become a non-COW anon
page.

This is Jann's idea not mine.

The problem is we have an unprivileged long term GUP like vmsplice
that facilitates elevating the page count indefinitely, until the
parent finally writes a secret to it. Theoretically a short term pin
would do it too so it's not just vmpslice, but the short term pin
would be incredibly more challenging to become a concern since it'd
kill a phone battery and flash before it can read any data.

So what happens with your page_mapcount == 1 check is that it doesn't
mean non-COW (we thought it did until it didn't for the long term gup
pin in vmsplice).

Jann's testcases does fork() and set page_mapcount 2 and page_count to
2, vmsplice, take unprivileged infinitely long GUP pin to set
page_count to 3, queue the page in the pipe with page_count elevated,
munmap to drop page_count to 2 and page_mapcount to 1.

page_mapcount is 1, so you'd think the page is non-COW and owned by
the parent, but the child can still read it so it's very much still
wp_page_copy material if the parent tries to modify it. Otherwise the
child can read the content.

This was supposed to be solvable by just doing the COW in gup(write=0)
case if page_mapcount > 1 with commit 17839856fd58. I'm not exactly
sure why that didn't fly and it had to be reverted by Peter in
a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a but at the time this was
happening I was side tracked by urgent issues and I didn't manage to
look back of how we ended up with the big hammer page_count == 1 check
instead to decide if to call wp_page_reuse or wp_page_shared.

So anyway, the only thing that is clear to me is that keeping the
child from reading the page_mapcount == 1 pages of the parent, is the
only reason why wp_page_reuse(vmf) will only be called on
page_count(page) == 1 and not on page_mapcount(page) == 1.

It's also the reason why your page_mapcount assumption will risk to
reintroduce the issue, and I only wish we could put back page_mapcount
== 1 back there.

Still even if we put back page_mapcount there, it is not ok to leave
the page fault with stale TLB entries and to rely on the fact
wp_page_shared won't run. It'd also avoid the problem but I think if
you leave stale TLB entries in change_protection just like NUMA
balancing does, it also requires a catcher just like NUMA balancing
has, or it'd truly work by luck.

So until we can put a page_mapcount == 1 check back there, the
page_count will be by definition unreliable because of the speculative
lookups randomly elevating all non zero page_counts at any time in the
background on all pages, so you will never be able to tell if a page
is true COW or if it's just a spurious COW because of a speculative
lookup. It is impossible to differentiate a speculative lookup from a
vmsplice ref in a child.

Thanks,
Andrea

Yu Zhao Dec. 23, 2020, 3:36 a.m. UTC | #68

On Tue, Dec 22, 2020 at 09:56:11PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> > We are talking about non-COW anon pages here -- they can't be mapped
> > more than once. So why not just identify them by checking
> > page_mapcount == 1 and then unconditionally reuse them? (This is
> > probably where I've missed things.)
> 
> The problem in depending on page_mapcount to decide if it's COW or
> non-COW (respectively wp_page_copy or wp_page_reuse) is that is GUP
> may elevate the count of a COW anon page that become a non-COW anon
> page.
> 
> This is Jann's idea not mine.
> 
> The problem is we have an unprivileged long term GUP like vmsplice
> that facilitates elevating the page count indefinitely, until the
> parent finally writes a secret to it. Theoretically a short term pin
> would do it too so it's not just vmpslice, but the short term pin
> would be incredibly more challenging to become a concern since it'd
> kill a phone battery and flash before it can read any data.
> 
> So what happens with your page_mapcount == 1 check is that it doesn't
> mean non-COW (we thought it did until it didn't for the long term gup
> pin in vmsplice).
> 
> Jann's testcases does fork() and set page_mapcount 2 and page_count to
> 2, vmsplice, take unprivileged infinitely long GUP pin to set
> page_count to 3, queue the page in the pipe with page_count elevated,
> munmap to drop page_count to 2 and page_mapcount to 1.
> 
> page_mapcount is 1, so you'd think the page is non-COW and owned by
> the parent, but the child can still read it so it's very much still
> wp_page_copy material if the parent tries to modify it. Otherwise the
> child can read the content.
> 
> This was supposed to be solvable by just doing the COW in gup(write=0)
> case if page_mapcount > 1 with commit 17839856fd58. I'm not exactly
> sure why that didn't fly and it had to be reverted by Peter in
> a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a but at the time this was
> happening I was side tracked by urgent issues and I didn't manage to
> look back of how we ended up with the big hammer page_count == 1 check
> instead to decide if to call wp_page_reuse or wp_page_shared.
> 
> So anyway, the only thing that is clear to me is that keeping the
> child from reading the page_mapcount == 1 pages of the parent, is the
> only reason why wp_page_reuse(vmf) will only be called on
> page_count(page) == 1 and not on page_mapcount(page) == 1.
> 
> It's also the reason why your page_mapcount assumption will risk to
> reintroduce the issue, and I only wish we could put back page_mapcount
> == 1 back there.
> 
> Still even if we put back page_mapcount there, it is not ok to leave
> the page fault with stale TLB entries and to rely on the fact
> wp_page_shared won't run. It'd also avoid the problem but I think if
> you leave stale TLB entries in change_protection just like NUMA
> balancing does, it also requires a catcher just like NUMA balancing
> has, or it'd truly work by luck.
> 
> So until we can put a page_mapcount == 1 check back there, the
> page_count will be by definition unreliable because of the speculative
> lookups randomly elevating all non zero page_counts at any time in the
> background on all pages, so you will never be able to tell if a page
> is true COW or if it's just a spurious COW because of a speculative
> lookup. It is impossible to differentiate a speculative lookup from a
> vmsplice ref in a child.

Thanks for the details.

In your patch, do we need to take wrprotect_rwsem in
handle_userfault() as well? Otherwise, it seems userspace would have
to synchronize between its wrprotect ioctl and fault handler? i.e.,
the fault hander needs to be aware that the content of write-
protected pages can actually change before the iotcl returns.

Linus Torvalds Dec. 23, 2020, 9:44 a.m. UTC | #69

On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The more I look at the mprotect code, the less I like it. We seem to
> be much better about the TLB flushes in other places (looking at
> mremap, for example). The mprotect code seems to be very laissez-faire
> about the TLB flushing.

No, this doesn't help.

> Does adding a TLB flush to before that
>
>         pte_unmap_unlock(pte - 1, ptl);
>
> fix things for you?

It really doesn't fix it. Exactly because - as pointed out earlier -
the actual page *copy* happens outside the pte lock.

So what can happen is:

 - CPU 1 holds the page table lock, while doing the write protect. It
has cleared the writable bit, but hasn't flushed the TLB's yet

 - CPU 2 did *not* have the TLB entry, sees the new read-only state,
takes a COW page fault, and reads the PTE from memory (into
vmf->orig_pte)

 - CPU 2 correctly decides it needs to be a COW, and copies the page contents

 - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
happened yet), and writes to that page in users apce

 - CPU 1 now does the TLB invalidate, and releases the page table lock

 - CPU 2 gets the page table lock, sees that its PTE matches
vmf->orig_pte, and switches it to be that writable copy of the page.

where the copy happened before CPU 3 had stopped writing to the page.

So the pte lock doesn't actually matter, unless we actually do the
page copy inside of it (on CPU2), in addition to doing the TLB flush
inside of it (on CPU1).

mprotect() is actually safe for two independent reasons: (a) it does
the mmap_sem for writing (so mprotect can't race with the COW logic at
all), and (b) it changes the vma permissions so turning something
read-only actually disables COW anyway, since it won't be a COW, it
will be a SIGSEGV.

So mprotect() is irrelevant, other than the fact that it shares some
code with that "turn it read-only in the page tables".

fork() is a much closer operation, in that it actually triggers that
COW behavior, but fork() takes the mmap_sem for writing, so it avoids
this too.

So it's really just userfaultfd and that kind of ilk that is relevant
here, I think. But that "you need to flush the TLB before releasing
the page table lock" was not true (well, it's true in other
circumstances - just not *here*), and is not part of the solution.

Or rather, if it's part of the solution here, it would have to be
matched with that "page copy needs to be done under the page table
lock too".

              Linus

Yu Zhao Dec. 23, 2020, 10:06 a.m. UTC | #70

On Wed, Dec 23, 2020 at 01:44:42AM -0800, Linus Torvalds wrote:
> On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The more I look at the mprotect code, the less I like it. We seem to
> > be much better about the TLB flushes in other places (looking at
> > mremap, for example). The mprotect code seems to be very laissez-faire
> > about the TLB flushing.
> 
> No, this doesn't help.
> 
> > Does adding a TLB flush to before that
> >
> >         pte_unmap_unlock(pte - 1, ptl);
> >
> > fix things for you?
> 
> It really doesn't fix it. Exactly because - as pointed out earlier -
> the actual page *copy* happens outside the pte lock.

I appreciate all the pointers. It seems to me it does.

> So what can happen is:
> 
>  - CPU 1 holds the page table lock, while doing the write protect. It
> has cleared the writable bit, but hasn't flushed the TLB's yet
> 
>  - CPU 2 did *not* have the TLB entry, sees the new read-only state,
> takes a COW page fault, and reads the PTE from memory (into
> vmf->orig_pte)

In handle_pte_fault(), we lock page table and check pte_write(), so
we either see a RW pte before CPU 1 runs or a RO one with no stale tlb
entries after CPU 1 runs, assume CPU 1 flushes tlb while holding the
same page table lock (not mmap_lock).

>  - CPU 2 correctly decides it needs to be a COW, and copies the page contents
> 
>  - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
> happened yet), and writes to that page in users apce
> 
>  - CPU 1 now does the TLB invalidate, and releases the page table lock
> 
>  - CPU 2 gets the page table lock, sees that its PTE matches
> vmf->orig_pte, and switches it to be that writable copy of the page.
> 
> where the copy happened before CPU 3 had stopped writing to the page.
> 
> So the pte lock doesn't actually matter, unless we actually do the
> page copy inside of it (on CPU2), in addition to doing the TLB flush
> inside of it (on CPU1).
> 
> mprotect() is actually safe for two independent reasons: (a) it does
> the mmap_sem for writing (so mprotect can't race with the COW logic at
> all), and (b) it changes the vma permissions so turning something
> read-only actually disables COW anyway, since it won't be a COW, it
> will be a SIGSEGV.
> 
> So mprotect() is irrelevant, other than the fact that it shares some
> code with that "turn it read-only in the page tables".
> 
> fork() is a much closer operation, in that it actually triggers that
> COW behavior, but fork() takes the mmap_sem for writing, so it avoids
> this too.
> 
> So it's really just userfaultfd and that kind of ilk that is relevant
> here, I think. But that "you need to flush the TLB before releasing
> the page table lock" was not true (well, it's true in other
> circumstances - just not *here*), and is not part of the solution.
> 
> Or rather, if it's part of the solution here, it would have to be
> matched with that "page copy needs to be done under the page table
> lock too".
> 
>               Linus
>

Peter Xu Dec. 23, 2020, 3:52 p.m. UTC | #71

On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> In your patch, do we need to take wrprotect_rwsem in
> handle_userfault() as well? Otherwise, it seems userspace would have
> to synchronize between its wrprotect ioctl and fault handler? i.e.,
> the fault hander needs to be aware that the content of write-
> protected pages can actually change before the iotcl returns.

The handle_userfault() thread should be sleeping until another uffd_wp_resolve
fixes the page fault for it.  However when the uffd_wp_resolve ioctl comes,
then rwsem (either the group rwsem lock as Andrea proposed, or the mmap_sem, or
any new rwsem lock we'd like to introduce, maybe per-uffd rather than per-mm)
should have guaranteed the previous wr-protect ioctls are finished and tlb must
have been flushed until this thread continues.

And I don't know why it matters even if the data changed - IMHO what uffd-wp
wants to do is simply to make sure after wr-protect ioctl returns to userspace,
no change on the page should ever happen anymore.  So "whether data changed"
seems matter more on the ioctl thread rather than the handle_userfault()
thread.  IOW, I think data changes before tlb flush but after pte wr-protect is
always fine - but that's not fine anymore if the syscall returns.

Thanks,

Will Deacon Dec. 23, 2020, 4:23 p.m. UTC | #72

On Tue, Dec 22, 2020 at 11:20:21AM -0800, Nadav Amit wrote:
> > On Dec 22, 2020, at 10:30 AM, Yu Zhao <yuzhao@google.com> wrote:
> > 
> > On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
> >>> On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@google.com> wrote:
> >>> 
> >>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
> >>>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
> >>>>> on this approach due to its potential impact on performance.
> >>>> 
> >>>> From whom?
> >>>> 
> >>>> Somebody who doesn't understand that correctness is more important
> >>>> than performance? And that userfaultfd is not the most important part
> >>>> of the system?
> >>>> 
> >>>> The fact is, userfaultfd is CLEARLY BUGGY.
> >>>> 
> >>>>         Linus
> >>> 
> >>> Fair enough.
> >>> 
> >>> Nadav, for your patch (you might want to update the commit message).
> >>> 
> >>> Reviewed-by: Yu Zhao <yuzhao@google.com>
> >>> 
> >>> While we are all here, there is also clear_soft_dirty() that could
> >>> use a similar fix…
> >> 
> >> Just an update as for why I have still not sent v2: I fixed
> >> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
> >> 
> >> So after some debugging, it appears that clear_refs_write() does not flush
> >> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
> >> ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
> >> flush the TLB since there is clear_refs_write() does not call to
> >> __tlb_adjust_range() (unless there are nested TLBs are pending).
> > 
> > Sorry Nadav, I assumed you knew this existing problem fixed by:
> > https://patchwork.kernel.org/project/linux-mm/cover/20201210121110.10094-1-will@kernel.org/
> > 
> 
> Thanks, Yu! For some reason I assumed it was already upstreamed and did not
> look back (yet if I was cc’d on v2…)

I'll repost in the new year, as it was a bit tight for the merge window.
I've made a note to put you on cc.

> Yet, something still goes bad. Debugging.

Did you figure this out? I tried to read the whole thread, but it's a bit
of a rollercoaster.

Will

Peter Xu Dec. 23, 2020, 4:24 p.m. UTC | #73

On Wed, Dec 23, 2020 at 03:06:30AM -0700, Yu Zhao wrote:
> On Wed, Dec 23, 2020 at 01:44:42AM -0800, Linus Torvalds wrote:
> > On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > The more I look at the mprotect code, the less I like it. We seem to
> > > be much better about the TLB flushes in other places (looking at
> > > mremap, for example). The mprotect code seems to be very laissez-faire
> > > about the TLB flushing.
> > 
> > No, this doesn't help.
> > 
> > > Does adding a TLB flush to before that
> > >
> > >         pte_unmap_unlock(pte - 1, ptl);
> > >
> > > fix things for you?
> > 
> > It really doesn't fix it. Exactly because - as pointed out earlier -
> > the actual page *copy* happens outside the pte lock.
> 
> I appreciate all the pointers. It seems to me it does.
> 
> > So what can happen is:
> > 
> >  - CPU 1 holds the page table lock, while doing the write protect. It
> > has cleared the writable bit, but hasn't flushed the TLB's yet
> > 
> >  - CPU 2 did *not* have the TLB entry, sees the new read-only state,
> > takes a COW page fault, and reads the PTE from memory (into
> > vmf->orig_pte)
> 
> In handle_pte_fault(), we lock page table and check pte_write(), so
> we either see a RW pte before CPU 1 runs or a RO one with no stale tlb
> entries after CPU 1 runs, assume CPU 1 flushes tlb while holding the
> same page table lock (not mmap_lock).

I think this is not against Linus's example - where cpu2 does not have tlb
cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
it.  So IMHO there's no problem here.

But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit.  Note that if
it's uffd-wp wr-protection it's always applied along with removing of the write
bit in change_pte_range():

        if (uffd_wp) {
                ptent = pte_wrprotect(ptent);
                ptent = pte_mkuffd_wp(ptent);
        }

So instead of being handled as COW page do_wp_page() will always trap
userfaultfd-wp first, hence no chance to race with COW.

COW could only trigger after another uffd-wp-resolve ioctl which could remove
the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
after all wr-protect completes, which guarantees that when reaching the COW
path the tlb must has been flushed anyways.  Then no one should be modifying
the page anymore even without pgtable lock in COW path.

So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
work, but it just may cause more tlb flush than Andrea's proposal especially
when the protection range is large (it's common to have a huge protection range
for e.g. VM live snapshotting, where we'll protect all guest rw ram).

My understanding of current issue is that either we can take Andrea's proposal
(although I think the group rwsem may not be extremely better than a per-mm
rwsem, which I don't know... at least not worst than that?), or we can also go
the other way (also as Andrea mentioned) so that when wr-protect:

  - for <=2M range (pmd or less), we take read rwsem, but flush tlb within
    pgtable lock

  - for >2M range, we take write rwsem directly but flush tlb once
  
Thanks,

> 
> >  - CPU 2 correctly decides it needs to be a COW, and copies the page contents
> > 
> >  - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
> > happened yet), and writes to that page in users apce
> > 
> >  - CPU 1 now does the TLB invalidate, and releases the page table lock
> > 
> >  - CPU 2 gets the page table lock, sees that its PTE matches
> > vmf->orig_pte, and switches it to be that writable copy of the page.
> > 
> > where the copy happened before CPU 3 had stopped writing to the page.
> > 
> > So the pte lock doesn't actually matter, unless we actually do the
> > page copy inside of it (on CPU2), in addition to doing the TLB flush
> > inside of it (on CPU1).
> > 
> > mprotect() is actually safe for two independent reasons: (a) it does
> > the mmap_sem for writing (so mprotect can't race with the COW logic at
> > all), and (b) it changes the vma permissions so turning something
> > read-only actually disables COW anyway, since it won't be a COW, it
> > will be a SIGSEGV.
> > 
> > So mprotect() is irrelevant, other than the fact that it shares some
> > code with that "turn it read-only in the page tables".
> > 
> > fork() is a much closer operation, in that it actually triggers that
> > COW behavior, but fork() takes the mmap_sem for writing, so it avoids
> > this too.
> > 
> > So it's really just userfaultfd and that kind of ilk that is relevant
> > here, I think. But that "you need to flush the TLB before releasing
> > the page table lock" was not true (well, it's true in other
> > circumstances - just not *here*), and is not part of the solution.
> > 
> > Or rather, if it's part of the solution here, it would have to be
> > matched with that "page copy needs to be done under the page table
> > lock too".
> > 
> >               Linus
> > 
>

Andrea Arcangeli Dec. 23, 2020, 6:51 p.m. UTC | #74

On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> I think this is not against Linus's example - where cpu2 does not have tlb
> cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> it.  So IMHO there's no problem here.
> 
> But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit.  Note that if
> it's uffd-wp wr-protection it's always applied along with removing of the write
> bit in change_pte_range():
> 
>         if (uffd_wp) {
>                 ptent = pte_wrprotect(ptent);
>                 ptent = pte_mkuffd_wp(ptent);
>         }
> 
> So instead of being handled as COW page do_wp_page() will always trap
> userfaultfd-wp first, hence no chance to race with COW.
> 
> COW could only trigger after another uffd-wp-resolve ioctl which could remove
> the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
> after all wr-protect completes, which guarantees that when reaching the COW
> path the tlb must has been flushed anyways.  Then no one should be modifying
> the page anymore even without pgtable lock in COW path.
> 
> So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
> work, but it just may cause more tlb flush than Andrea's proposal especially
> when the protection range is large (it's common to have a huge protection range
> for e.g. VM live snapshotting, where we'll protect all guest rw ram).
> 
> My understanding of current issue is that either we can take Andrea's proposal
> (although I think the group rwsem may not be extremely better than a per-mm
> rwsem, which I don't know... at least not worst than that?), or we can also go
> the other way (also as Andrea mentioned) so that when wr-protect:
> 
>   - for <=2M range (pmd or less), we take read rwsem, but flush tlb within
>     pgtable lock
> 
>   - for >2M range, we take write rwsem directly but flush tlb once

I fully agree indeed.

I still have to fully understand how the clear_refs_write was fixed.

clear_refs_write has not even a "catcher" in the page fault but it
clears the pte_write under mmap_read_lock and it doesn't flush before
releasing the PT lock. It appears way more broken than
userfaultfd_writeprotect ever was.

static inline void clear_soft_dirty(struct vm_area_struct *vma,
		unsigned long addr, pte_t *pte)
{
	/*
	 * The soft-dirty tracker uses #PF-s to catch writes
	 * to pages, so write-protect the pte as well. See the
	 * Documentation/admin-guide/mm/soft-dirty.rst for full description
	 * of how soft-dirty works.
	 */

       https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/

"Although this is fine when simply aging the ptes, in the case of
clearing the "soft-dirty" state we can end up with entries where
pte_write() is false, yet a writable mapping remains in the TLB.

Fix this by avoiding the mmu_gather API altogether: managing both the
'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the sort-dirty path, much like mprotect() does
already"

NOTE: about the above comment, that mprotect takes
mmap_read_lock. Your above code change in the commit above, still has
mmap_read_lock, there's still no similarity with mprotect whatsoever.

Did you test what happens when clear_refs_write leaves writable stale
TLB and you run do_wp_page copies the page while the other CPU still
is writing to the page? Has clear_refs_write a selftest as aggressive
and racy as the uffd selftest to reproduce that workload?

The race fixed with the group lock internally to uffd, is possible
thanks to marker+catcher in NUMA balancing style? But that's not there
for clear_refs_write even after the above commit.

It doesn't appear either that this patch is adding a synchronous tlb
flush inside the PT lock either.

Overall, it would be unfair if clear_refs_write is allowed to do the
same thing that userfaultfd_writeprotect has to do, but with the
mmap_read_lock, if userfaultfd_writeprotect will be forced to take
mmap_write_lock.

In my view there are 3 official ways to deal with this:

1) set a marker in the pte/hugepmd inside the PT lock, and add a
   catcher for the marker in the page fault to send the page fault to
   a dead end while there are stale TLB entries

   cases: userfaultfd_writeprotect() and NUMA balancing

2) take mmap_write_lock

   case: mprotect

3) flush the TLB before the PT lock release

   case: KSM write_protect_page

Where does the below patch land in the 3 cases? It doesn't fit any of
them and what it does looks still not sufficient to fix the userfault
memory corruption.

     https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/

It'd be backwards if the complexity of the kernel was increased for
clear_refs_write, but forbidden for the O(1) API that delivers the
exact same information (clear_refs_write API delivers that info in
O(N)).

We went the last mile to guarantee uffd can be always used fully
securely unprivileged and by default in current Linus's tree and in
future Android out of the box. It's simply impossible with the current
defaults, to use uffd to enlarge the any kernel user-after-free race
window either, so even that concern is gone. From on those research
testcases will stick to fuse instead I guess.

Thanks,
Andrea

Andrea Arcangeli Dec. 23, 2020, 6:55 p.m. UTC | #75

On Wed, Dec 23, 2020 at 01:51:59PM -0500, Andrea Arcangeli wrote:
> NOTE: about the above comment, that mprotect takes
> mmap_read_lock. Your above code change in the commit above, still has
       ^^^^ write

Correction to avoid any confusion.

Nadav Amit Dec. 23, 2020, 7:04 p.m. UTC | #76

> On Dec 23, 2020, at 8:23 AM, Will Deacon <will@kernel.org> wrote:
> 
> On Tue, Dec 22, 2020 at 11:20:21AM -0800, Nadav Amit wrote:
>>> On Dec 22, 2020, at 10:30 AM, Yu Zhao <yuzhao@google.com> wrote:
>>> 
>>> On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
>>>>> On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@google.com> wrote:
>>>>> 
>>>>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>>>>>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>>>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
>>>>>>> on this approach due to its potential impact on performance.
>>>>>> 
>>>>>> From whom?
>>>>>> 
>>>>>> Somebody who doesn't understand that correctness is more important
>>>>>> than performance? And that userfaultfd is not the most important part
>>>>>> of the system?
>>>>>> 
>>>>>> The fact is, userfaultfd is CLEARLY BUGGY.
>>>>>> 
>>>>>>        Linus
>>>>> 
>>>>> Fair enough.
>>>>> 
>>>>> Nadav, for your patch (you might want to update the commit message).
>>>>> 
>>>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>>>> 
>>>>> While we are all here, there is also clear_soft_dirty() that could
>>>>> use a similar fix…
>>>> 
>>>> Just an update as for why I have still not sent v2: I fixed
>>>> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
>>>> 
>>>> So after some debugging, it appears that clear_refs_write() does not flush
>>>> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
>>>> ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
>>>> flush the TLB since there is clear_refs_write() does not call to
>>>> __tlb_adjust_range() (unless there are nested TLBs are pending).
>>> 
>>> Sorry Nadav, I assumed you knew this existing problem fixed by:
>>> https://patchwork.kernel.org/project/linux-mm/cover/20201210121110.10094-1-will@kernel.org/
>> 
>> Thanks, Yu! For some reason I assumed it was already upstreamed and did not
>> look back (yet if I was cc’d on v2…)
> 
> I'll repost in the new year, as it was a bit tight for the merge window.
> I've made a note to put you on cc.

No worries. I just like to complain. I read v1 but forgot.

> 
>> Yet, something still goes bad. Debugging.
> 
> Did you figure this out? I tried to read the whole thread, but it's a bit
> of a rollercoaster.

Yes, it was embarrassing bug of mine (not in any code sent). The
soft-dirty code is entangled and the deep nesting of the code
is unnecessary and confusing.

I tried not to change much to ease backporting and merging with
your pending patch, but some merging will be needed.

Yu Zhao Dec. 23, 2020, 7:12 p.m. UTC | #77

On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> On Wed, Dec 23, 2020 at 03:06:30AM -0700, Yu Zhao wrote:
> > On Wed, Dec 23, 2020 at 01:44:42AM -0800, Linus Torvalds wrote:
> > > On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > The more I look at the mprotect code, the less I like it. We seem to
> > > > be much better about the TLB flushes in other places (looking at
> > > > mremap, for example). The mprotect code seems to be very laissez-faire
> > > > about the TLB flushing.
> > > 
> > > No, this doesn't help.
> > > 
> > > > Does adding a TLB flush to before that
> > > >
> > > >         pte_unmap_unlock(pte - 1, ptl);
> > > >
> > > > fix things for you?
> > > 
> > > It really doesn't fix it. Exactly because - as pointed out earlier -
> > > the actual page *copy* happens outside the pte lock.
> > 
> > I appreciate all the pointers. It seems to me it does.
> > 
> > > So what can happen is:
> > > 
> > >  - CPU 1 holds the page table lock, while doing the write protect. It
> > > has cleared the writable bit, but hasn't flushed the TLB's yet
> > > 
> > >  - CPU 2 did *not* have the TLB entry, sees the new read-only state,
> > > takes a COW page fault, and reads the PTE from memory (into
> > > vmf->orig_pte)
> > 
> > In handle_pte_fault(), we lock page table and check pte_write(), so
> > we either see a RW pte before CPU 1 runs or a RO one with no stale tlb
> > entries after CPU 1 runs, assume CPU 1 flushes tlb while holding the
> > same page table lock (not mmap_lock).
> 
> I think this is not against Linus's example - where cpu2 does not have tlb
> cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> it.  So IMHO there's no problem here.

None of the CPUs has stale entries when CPU 2 sees a RO PTE. We are
assuming that TLB flush will be done on CPU 1 while it's still holding
page table lock.

CPU 2 (re)locks page table and (re)checks the PTE under question when
it decides if copy is necessary. If it sees a RO PTE, it means the
flush has been done on all CPUs, therefore it fixes the problem.

> But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit.  Note that if
> it's uffd-wp wr-protection it's always applied along with removing of the write
> bit in change_pte_range():
> 
>         if (uffd_wp) {
>                 ptent = pte_wrprotect(ptent);
>                 ptent = pte_mkuffd_wp(ptent);
>         }
> 
> So instead of being handled as COW page do_wp_page() will always trap
> userfaultfd-wp first, hence no chance to race with COW.
> 
> COW could only trigger after another uffd-wp-resolve ioctl which could remove
> the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
> after all wr-protect completes, which guarantees that when reaching the COW
> path the tlb must has been flushed anyways.  Then no one should be modifying
> the page anymore even without pgtable lock in COW path.
> 
> So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
> work, but it just may cause more tlb flush than Andrea's proposal especially
> when the protection range is large (it's common to have a huge protection range
> for e.g. VM live snapshotting, where we'll protect all guest rw ram).
> 
> My understanding of current issue is that either we can take Andrea's proposal
> (although I think the group rwsem may not be extremely better than a per-mm
> rwsem, which I don't know... at least not worst than that?), or we can also go
> the other way (also as Andrea mentioned) so that when wr-protect:
> 
>   - for <=2M range (pmd or less), we take read rwsem, but flush tlb within
>     pgtable lock
> 
>   - for >2M range, we take write rwsem directly but flush tlb once
>   
> Thanks,
> 
> > 
> > >  - CPU 2 correctly decides it needs to be a COW, and copies the page contents
> > > 
> > >  - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
> > > happened yet), and writes to that page in users apce
> > > 
> > >  - CPU 1 now does the TLB invalidate, and releases the page table lock
> > > 
> > >  - CPU 2 gets the page table lock, sees that its PTE matches
> > > vmf->orig_pte, and switches it to be that writable copy of the page.
> > > 
> > > where the copy happened before CPU 3 had stopped writing to the page.
> > > 
> > > So the pte lock doesn't actually matter, unless we actually do the
> > > page copy inside of it (on CPU2), in addition to doing the TLB flush
> > > inside of it (on CPU1).
> > > 
> > > mprotect() is actually safe for two independent reasons: (a) it does
> > > the mmap_sem for writing (so mprotect can't race with the COW logic at
> > > all), and (b) it changes the vma permissions so turning something
> > > read-only actually disables COW anyway, since it won't be a COW, it
> > > will be a SIGSEGV.
> > > 
> > > So mprotect() is irrelevant, other than the fact that it shares some
> > > code with that "turn it read-only in the page tables".
> > > 
> > > fork() is a much closer operation, in that it actually triggers that
> > > COW behavior, but fork() takes the mmap_sem for writing, so it avoids
> > > this too.
> > > 
> > > So it's really just userfaultfd and that kind of ilk that is relevant
> > > here, I think. But that "you need to flush the TLB before releasing
> > > the page table lock" was not true (well, it's true in other
> > > circumstances - just not *here*), and is not part of the solution.
> > > 
> > > Or rather, if it's part of the solution here, it would have to be
> > > matched with that "page copy needs to be done under the page table
> > > lock too".
> > > 
> > >               Linus
> > > 
> > 
> 
> -- 
> Peter Xu
>

Peter Xu Dec. 23, 2020, 7:32 p.m. UTC | #78

On Wed, Dec 23, 2020 at 12:12:23PM -0700, Yu Zhao wrote:
> On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> > On Wed, Dec 23, 2020 at 03:06:30AM -0700, Yu Zhao wrote:
> > > On Wed, Dec 23, 2020 at 01:44:42AM -0800, Linus Torvalds wrote:
> > > > On Tue, Dec 22, 2020 at 4:01 PM Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > >
> > > > > The more I look at the mprotect code, the less I like it. We seem to
> > > > > be much better about the TLB flushes in other places (looking at
> > > > > mremap, for example). The mprotect code seems to be very laissez-faire
> > > > > about the TLB flushing.
> > > > 
> > > > No, this doesn't help.
> > > > 
> > > > > Does adding a TLB flush to before that
> > > > >
> > > > >         pte_unmap_unlock(pte - 1, ptl);
> > > > >
> > > > > fix things for you?
> > > > 
> > > > It really doesn't fix it. Exactly because - as pointed out earlier -
> > > > the actual page *copy* happens outside the pte lock.
> > > 
> > > I appreciate all the pointers. It seems to me it does.
> > > 
> > > > So what can happen is:
> > > > 
> > > >  - CPU 1 holds the page table lock, while doing the write protect. It
> > > > has cleared the writable bit, but hasn't flushed the TLB's yet
> > > > 
> > > >  - CPU 2 did *not* have the TLB entry, sees the new read-only state,
> > > > takes a COW page fault, and reads the PTE from memory (into
> > > > vmf->orig_pte)
> > > 
> > > In handle_pte_fault(), we lock page table and check pte_write(), so
> > > we either see a RW pte before CPU 1 runs or a RO one with no stale tlb
> > > entries after CPU 1 runs, assume CPU 1 flushes tlb while holding the
> > > same page table lock (not mmap_lock).
> > 
> > I think this is not against Linus's example - where cpu2 does not have tlb
> > cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> > it.  So IMHO there's no problem here.
> 
> None of the CPUs has stale entries when CPU 2 sees a RO PTE.

In this example we have - Please see [1] below.

> We are
> assuming that TLB flush will be done on CPU 1 while it's still holding
> page table lock.
> 
> CPU 2 (re)locks page table and (re)checks the PTE under question when
> it decides if copy is necessary. If it sees a RO PTE, it means the
> flush has been done on all CPUs, therefore it fixes the problem.

I guess you had the assumption that pgtable lock released in step 1 already.
But it's not happening in this specific example, not until step5 [2] below.

Indeed if pgtable lock is not released from cpu1, then COW path won't even
triger, afaiu... do_wp_page() needs the pgtable lock.  It seems just even safer.

Irrelevant of the small confusions here and there... I believe we're always on
the same page, at least the conclusion.

Thanks,

> 
> > But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit.  Note that if
> > it's uffd-wp wr-protection it's always applied along with removing of the write
> > bit in change_pte_range():
> > 
> >         if (uffd_wp) {
> >                 ptent = pte_wrprotect(ptent);
> >                 ptent = pte_mkuffd_wp(ptent);
> >         }
> > 
> > So instead of being handled as COW page do_wp_page() will always trap
> > userfaultfd-wp first, hence no chance to race with COW.
> > 
> > COW could only trigger after another uffd-wp-resolve ioctl which could remove
> > the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
> > after all wr-protect completes, which guarantees that when reaching the COW
> > path the tlb must has been flushed anyways.  Then no one should be modifying
> > the page anymore even without pgtable lock in COW path.
> > 
> > So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
> > work, but it just may cause more tlb flush than Andrea's proposal especially
> > when the protection range is large (it's common to have a huge protection range
> > for e.g. VM live snapshotting, where we'll protect all guest rw ram).
> > 
> > My understanding of current issue is that either we can take Andrea's proposal
> > (although I think the group rwsem may not be extremely better than a per-mm
> > rwsem, which I don't know... at least not worst than that?), or we can also go
> > the other way (also as Andrea mentioned) so that when wr-protect:
> > 
> >   - for <=2M range (pmd or less), we take read rwsem, but flush tlb within
> >     pgtable lock
> > 
> >   - for >2M range, we take write rwsem directly but flush tlb once
> >   
> > Thanks,
> > 
> > > 
> > > >  - CPU 2 correctly decides it needs to be a COW, and copies the page contents
> > > > 
> > > >  - CPU 3 *does* have a stale TLB (because TLB invalidation hasn't
> > > > happened yet), and writes to that page in users apce

[1]

> > > > 
> > > >  - CPU 1 now does the TLB invalidate, and releases the page table lock

[2]

> > > > 
> > > >  - CPU 2 gets the page table lock, sees that its PTE matches
> > > > vmf->orig_pte, and switches it to be that writable copy of the page.
> > > > 
> > > > where the copy happened before CPU 3 had stopped writing to the page.
> > > > 
> > > > So the pte lock doesn't actually matter, unless we actually do the
> > > > page copy inside of it (on CPU2), in addition to doing the TLB flush
> > > > inside of it (on CPU1).
> > > > 
> > > > mprotect() is actually safe for two independent reasons: (a) it does
> > > > the mmap_sem for writing (so mprotect can't race with the COW logic at
> > > > all), and (b) it changes the vma permissions so turning something
> > > > read-only actually disables COW anyway, since it won't be a COW, it
> > > > will be a SIGSEGV.
> > > > 
> > > > So mprotect() is irrelevant, other than the fact that it shares some
> > > > code with that "turn it read-only in the page tables".
> > > > 
> > > > fork() is a much closer operation, in that it actually triggers that
> > > > COW behavior, but fork() takes the mmap_sem for writing, so it avoids
> > > > this too.
> > > > 
> > > > So it's really just userfaultfd and that kind of ilk that is relevant
> > > > here, I think. But that "you need to flush the TLB before releasing
> > > > the page table lock" was not true (well, it's true in other
> > > > circumstances - just not *here*), and is not part of the solution.
> > > > 
> > > > Or rather, if it's part of the solution here, it would have to be
> > > > matched with that "page copy needs to be done under the page table
> > > > lock too".
> > > > 
> > > >               Linus
> > > > 
> > > 
> > 
> > -- 
> > Peter Xu
> > 
>

Andrea Arcangeli Dec. 23, 2020, 9:07 p.m. UTC | #79

On Wed, Dec 23, 2020 at 10:52:35AM -0500, Peter Xu wrote:
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > In your patch, do we need to take wrprotect_rwsem in
> > handle_userfault() as well? Otherwise, it seems userspace would have
> > to synchronize between its wrprotect ioctl and fault handler? i.e.,
> > the fault hander needs to be aware that the content of write-
> > protected pages can actually change before the iotcl returns.
> 
> The handle_userfault() thread should be sleeping until another uffd_wp_resolve
> fixes the page fault for it.  However when the uffd_wp_resolve ioctl comes,
> then rwsem (either the group rwsem lock as Andrea proposed, or the mmap_sem, or
> any new rwsem lock we'd like to introduce, maybe per-uffd rather than per-mm)
> should have guaranteed the previous wr-protect ioctls are finished and tlb must
> have been flushed until this thread continues.
> 
> And I don't know why it matters even if the data changed - IMHO what uffd-wp

The data will change indeed and it's fine.

> wants to do is simply to make sure after wr-protect ioctl returns to userspace,
> no change on the page should ever happen anymore.  So "whether data changed"
> seems matter more on the ioctl thread rather than the handle_userfault()
> thread.  IOW, I think data changes before tlb flush but after pte wr-protect is
> always fine - but that's not fine anymore if the syscall returns.

Agreed.

From the userland point of view all it matters is that the writes
through the stale TLB entries will stop in both the two cases:

1) before returning from the UFFDIO_WRITEPROTECT(mode_wp = true) ioctl syscall

2) before a parallel UFFDIO_WRITEPROTECT(mode_wp = false) can clear
   the _PAGE_UFFD_WP marker in the pte/hugepmd under the PT lock,
   assuming the syscall at point 1) is still in flight

Both points are guaranteed at all times by the group lock now, so
userland cannot even measure or perceive the existence of any stale
TLB at any given time in the whole uffd-wp workload.

So it's perfectly safe and identical as NUMA balancing and requires
zero extra locking in handle_userfault().

handle_userfault() is a dead end that simply waits and when it's the
right time it restarts the page fault. It can have occasional false positives
after f9bf352224d7d4612b55b8d0cd0eaa981a3246cf, false positive as in
restarting too soon, but even then it's perfectly safe since it's
equivalent of one more CPU hitting the page fault path. As long as the
marker is there, any spurious userfault will re-enter
handle_userfault().

handle_userfault() doesn't care about the data and in turn it cannot
care less about any stale TLB either. Userland cares but userland
cannot make any assumption about writes being fully stopped, until the
ioctl returned anyway and by that time the pending flush will be done
and in fact by the time userland can make any assumption also the
mmap_write_lock would have been released with the first proposed patch.

Thanks,
Andrea

Andrea Arcangeli Dec. 23, 2020, 9:39 p.m. UTC | #80

On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> Thanks for the details.

I hope we can find a way put the page_mapcount back where there's a
page_count right now.

If you're so worried about having to maintain a all defined well
documented (or to be documented even better if you ACK it)
marker/catcher for userfaultfd_writeprotect, I can't see how you could
consider to maintain the page fault safe against any random code
leaving too permissive TLB entries out of sync of the more restrictive
pte permissions as it was happening with clear_refs_write, which
worked by luck until page_mapcount was changed to page_count.

page_count is far from optimal, but it is a feature it finally allowed
us to notice that various code (clear_refs_write included apparently
even after the fix) leaves stale too permissive TLB entries when it
shouldn't.

The question is only which way you prefer to fix clear_refs_write and
I don't think we can deviate from those 3 methods that already exist
today. So clear_refs_write will have to pick one of those and
currently it's not falling in the same category with mprotect even
after the fix.

I think if clear_refs_write starts to take the mmap_write_lock and
really start to operate like mprotect, only then we can consider to
make userfaultfd_writeprotect also operate like mprotect.

Even then I'd hope we can at least be allowed to make it operate like
KSM write_protect_page for len <= HPAGE_PMD_SIZE, something that
clear_refs_write cannot do since it works in O(N) and tends to scan
everything at once, so there would be no point to optimize not to
defer the flush, for a process with a tiny amount of virtual memory
mapped.

vm86 also should be fixed to fall in the same category with mprotect,
since performance there is irrelevant.

Thanks,
Andrea

Andrea Arcangeli Dec. 23, 2020, 10:05 p.m. UTC | #81

On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
> > On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@google.com> wrote:
> > 
> > On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
> >> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
> >>> Using mmap_write_lock() was my initial fix and there was a strong pushback
> >>> on this approach due to its potential impact on performance.
> >> 
> >> From whom?
> >> 
> >> Somebody who doesn't understand that correctness is more important
> >> than performance? And that userfaultfd is not the most important part
> >> of the system?
> >> 
> >> The fact is, userfaultfd is CLEARLY BUGGY.
> >> 
> >>          Linus
> > 
> > Fair enough.
> > 
> > Nadav, for your patch (you might want to update the commit message).
> > 
> > Reviewed-by: Yu Zhao <yuzhao@google.com>
> > 
> > While we are all here, there is also clear_soft_dirty() that could
> > use a similar fix…
> 
> Just an update as for why I have still not sent v2: I fixed
> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
> 
> So after some debugging, it appears that clear_refs_write() does not flush
> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
> ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
> flush the TLB since there is clear_refs_write() does not call to
> __tlb_adjust_range() (unless there are nested TLBs are pending).
> 
> So I have a patch for this issue too: arguably the tlb_gather interface is
> not the right one for clear_refs_write() that does not clear PTEs but
> changes them.
> 
> Yet, sadly, my reproducer keeps falling (less frequently, but still). So I
> will keep debugging to see what goes wrong. I will send v2 once I figure out
> what the heck is wrong in the code or my reproducer.

If you put the page_mapcount check back in do_wp_page instead of
page_count, it'll stop reproducing but the bug is still very much
there...

It's a feature page_count finally shows you the corruption now by
virtue of the page_count being totally unreliable with all speculative
pagecache lookups randomly elevating it in the background.

The proof it worked by luck is that an unrelated change
(s/page_mapcount/page_count/) made the page fault behave slightly
different and broke clear_refs_write.

Even before page_mapcount was replaced with page_count, it has always
been forbidden to leave too permissive stale TLB entries out of sync
with a more restrictive pte/hugepmd permission past the PT unlock,
unless you're holding the mmap_write_lock.

So for example all rmap code has to flush before PT unlock release
too, usually it clears the pte as a whole but it's still a
downgrade.

The rmap_lock and the mmap_read_lock achieve the same: they keep the
vma stable but they can't stop the page fault from running (that's a
feature) so they have to flush inside the PT lock.

The tlb gather deals with preventing use after free (where userland
can modify kernel memory), but it cannot deal with the guarantee the
page fault requires.

So the clear_refs_write patch linked that alters the tlb flushing
appears a noop with respect to this bug. It cannot do anything to
prevent the page fault run with writable stale TLB entries with the
!pte_write.

If you don't add a marker here (it clears it, the exact opposite of
what should be happening), there's no way to avoid the mmap_write_lock
in my view.

static inline void clear_soft_dirty(struct vm_area_struct *vma,
		unsigned long addr, pte_t *pte)
{
	/*
	 * The soft-dirty tracker uses #PF-s to catch writes
	 * to pages, so write-protect the pte as well. See the
	 * Documentation/admin-guide/mm/soft-dirty.rst for full description
	 * of how soft-dirty works.
	 */
	pte_t ptent = *pte;

	if (pte_present(ptent)) {
		pte_t old_pte;

		old_pte = ptep_modify_prot_start(vma, addr, pte);
		ptent = pte_wrprotect(old_pte);
		ptent = pte_clear_soft_dirty(ptent);
+		ptent = pte_mkuffd_wp(ptent);
		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);

One solution that would fix the userland mm corruption in
clear_refs_write is to take the mmap_read_lock, take some mutex
somewhere (vma/mm whatever), then in clear_soft_dirty make the above
modification adding the _PAGE_UFFD_WP, then flush tlb, release the
mutex and then release the mmap_read_lock.

Then here:

	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		if (vma->vm_flags & VM_SOFTDIRTY)
+		   return handle_soft_dirty(vma);
		return handle_userfault(vmf, VM_UFFD_WP);

Of course handle_soft_dirty will have to take the mutex once (1
mutex_lock/unlock cycle to run after any pending flush).

And then we'll have to enforce uffd-wp cannot be registered if
VM_SOFTDIRTY is set or the other way around so that VM_UFFD* is
mutually exclusive with VM_SOFTDIRTY. So then we can also unify the
bit so they all use the same software bit in the pgtable (that's
something I considered anyway originally since it doesn't make whole
lot of sense to use the two features on the same vma at the same
time).

If the above is too complex, clear_refs_write will have to grind down
to I/O disk spindle speed as mprotect with
s/mmap_read_lock/mmap_write_lock/, unless it stops triggering
wrprotect faults altogether.

Thanks,
Andrea

Yu Zhao Dec. 23, 2020, 10:29 p.m. UTC | #82

On Wed, Dec 23, 2020 at 04:39:00PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > Thanks for the details.
> 
> I hope we can find a way put the page_mapcount back where there's a
> page_count right now.
> 
> If you're so worried about having to maintain a all defined well
> documented (or to be documented even better if you ACK it)
> marker/catcher for userfaultfd_writeprotect, I can't see how you could
> consider to maintain the page fault safe against any random code
> leaving too permissive TLB entries out of sync of the more restrictive
> pte permissions as it was happening with clear_refs_write, which
> worked by luck until page_mapcount was changed to page_count.
> 
> page_count is far from optimal, but it is a feature it finally allowed
> us to notice that various code (clear_refs_write included apparently
> even after the fix) leaves stale too permissive TLB entries when it
> shouldn't.
> 
> The question is only which way you prefer to fix clear_refs_write and
> I don't think we can deviate from those 3 methods that already exist
> today. So clear_refs_write will have to pick one of those and
> currently it's not falling in the same category with mprotect even
> after the fix.
> 
> I think if clear_refs_write starts to take the mmap_write_lock and
> really start to operate like mprotect, only then we can consider to
> make userfaultfd_writeprotect also operate like mprotect.
> 
> Even then I'd hope we can at least be allowed to make it operate like
> KSM write_protect_page for len <= HPAGE_PMD_SIZE, something that
> clear_refs_write cannot do since it works in O(N) and tends to scan
> everything at once, so there would be no point to optimize not to
> defer the flush, for a process with a tiny amount of virtual memory
> mapped.
> 
> vm86 also should be fixed to fall in the same category with mprotect,
> since performance there is irrelevant.

I was hesitant to suggest the following because it isn't that straight
forward. But since you seem to be less concerned with the complexity,
I'll just bring it on the table -- it would take care of both ufd and
clear_refs_write, wouldn't it?

diff --git a/mm/memory.c b/mm/memory.c
index 5e9ca612d7d7..af38c5ee327e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		goto unlock;
 	}
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
+				flush_tlb_page(vmf->vma, vmf->address);
 			return do_wp_page(vmf);
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);

Nadav Amit Dec. 23, 2020, 10:45 p.m. UTC | #83

> On Dec 23, 2020, at 2:05 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
>>> On Dec 21, 2020, at 1:24 PM, Yu Zhao <yuzhao@google.com> wrote:
>>> 
>>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>>>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>>>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
>>>>> on this approach due to its potential impact on performance.
>>>> 
>>>> From whom?
>>>> 
>>>> Somebody who doesn't understand that correctness is more important
>>>> than performance? And that userfaultfd is not the most important part
>>>> of the system?
>>>> 
>>>> The fact is, userfaultfd is CLEARLY BUGGY.
>>>> 
>>>>         Linus
>>> 
>>> Fair enough.
>>> 
>>> Nadav, for your patch (you might want to update the commit message).
>>> 
>>> Reviewed-by: Yu Zhao <yuzhao@google.com>
>>> 
>>> While we are all here, there is also clear_soft_dirty() that could
>>> use a similar fix…
>> 
>> Just an update as for why I have still not sent v2: I fixed
>> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
>> 
>> So after some debugging, it appears that clear_refs_write() does not flush
>> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
>> ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not
>> flush the TLB since there is clear_refs_write() does not call to
>> __tlb_adjust_range() (unless there are nested TLBs are pending).
>> 
>> So I have a patch for this issue too: arguably the tlb_gather interface is
>> not the right one for clear_refs_write() that does not clear PTEs but
>> changes them.
>> 
>> Yet, sadly, my reproducer keeps falling (less frequently, but still). So I
>> will keep debugging to see what goes wrong. I will send v2 once I figure out
>> what the heck is wrong in the code or my reproducer.
> 
> If you put the page_mapcount check back in do_wp_page instead of
> page_count, it'll stop reproducing but the bug is still very much
> there...

I know. I designed the (re)producer just to be able to hit the bug without
changing the kernel (and test my fix), but I am fully aware that it would
not work on older kernels although the bug is still there.

> And then we'll have to enforce uffd-wp cannot be registered if
> VM_SOFTDIRTY is set or the other way around so that VM_UFFD* is
> mutually exclusive with VM_SOFTDIRTY. So then we can also unify the
> bit so they all use the same software bit in the pgtable (that's
> something I considered anyway originally since it doesn't make whole
> lot of sense to use the two features on the same vma at the same
> time).

I think it may be reasonable.

Just a proposal: At some point we can also ask ourselves whether the
“artificial" limitation of the number of software bits per PTE should really
limit us, or do we want to hold some additional metadata per-PTE by either
putting it in an adjacent page (holding 64-bits of additional software-bits
per PTE) or by finding some place in the page-struct to link to this
metadata (and have the liberty of number of bits per PTE). One of the PTE
software-bits can be repurposed to say whether there is “extra-metadata”
associated with the PTE.

I am fully aware that there will be some overhead associated, but it
can be limited to less-common use-cases.

Andrea Arcangeli Dec. 23, 2020, 11:04 p.m. UTC | #84

On Wed, Dec 23, 2020 at 03:29:51PM -0700, Yu Zhao wrote:
> I was hesitant to suggest the following because it isn't that straight
> forward. But since you seem to be less concerned with the complexity,
> I'll just bring it on the table -- it would take care of both ufd and
> clear_refs_write, wouldn't it?

It certainly would since this is basically declaring "leaving stale
TLB entries past mmap_read_lock" is now permitted as long as the
pending flush counter is elevated under mmap_sem for reading.

Anything that prevents uffd-wp to take mmap_write_lock looks great to
me, anything, the below included, as long as it looks like easy to
enforce and understand. And the below certainly is.

My view is that the below is at the very least an improvement in terms
of total complexity, compared to v5.10. At least it'll be documented.

So what would be not ok to me is to depend on undocumented not
guaranteed behavior in do_wp_page like the page_mapcount, which is
what we had until now in clear_refs_write and in uffd-wp (but only if
wrprotect raced against un-wrprotect, a tiny window if compared to
clear_refs_write).

Documenting that clearing pte_write and deferring the flush is allowed
if mm_tlb_flush_pending was elevated before taking the PT lock is less
complex and very well defined rule, if compared to what we had before
in the page_mapcount dependency of clear_refs_write which was prone to
break, and in fact it just did.

We'll need a commentary in both userfaultfd_writeprotect and
clear_refs_write that links to the below snippet.

If we abstract it in a header we can hide there also a #ifdef
CONFIG_HAVE_ARCH_SOFT_DIRTY=y && CONFIG_HAVE_ARCH_USERFAULTFD_WP=y &&
CONFIG_USERFAULTFD=y to make it even more explicit.

However it may be simpler to keep it unconditional, I don't mind
either ways. If it was up to me I'd write it to those 3 config options
to be sure I remember where it comes from and to force any other user
to register to be explicit they depend on that.

> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  		goto unlock;
>  	}
>  	if (vmf->flags & FAULT_FLAG_WRITE) {
> -		if (!pte_write(entry))
> +		if (!pte_write(entry)) {
> +			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> +				flush_tlb_page(vmf->vma, vmf->address);
>  			return do_wp_page(vmf);
> +		}
>  		entry = pte_mkdirty(entry);
>  	}
>  	entry = pte_mkyoung(entry);
>

Linus Torvalds Dec. 23, 2020, 11:39 p.m. UTC | #85

On Wed, Dec 23, 2020 at 1:39 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > Thanks for the details.
>
> I hope we can find a way put the page_mapcount back where there's a
> page_count right now.

I really don't think that's ever going to happen - at least if you're
talking about do_wp_page().

I refuse to make the *simple* core operations VM have to jump through
hoops, and the old COW mapcount logic was that. I *much* prefer the
newer COW code, because the rules are more straightforward.

> page_count is far from optimal, but it is a feature it finally allowed
> us to notice that various code (clear_refs_write included apparently
> even after the fix) leaves stale too permissive TLB entries when it
> shouldn't.

I absolutely agree that page_count isn't exactly optimal, but
"mapcount" is just so much worse.

page_count() is at least _logical_, and has a very clear meaning:
"this page has other users". mapcount() means something else, and is
simply not sufficient or relevant wrt COW.

That doesn't mean that page_mapcount() is wrong - it's just that it's
wrong for COW. page_mapcount() is great for rmap, so that we can see
when we need to shoot down a memory mapping of a page that gets
released (truncate being the classic example).

I think that the mapcount games we used to have were horrible. I
absolutely much prefer where we are now wrt COW.

The modern rules for COW handling are:

 - if we got a COW fault there might be another user, we copy (and
this is where the page count makes so much logical sense).

 - if somebody needs to pin the page in the VM, we either make sure
that it is pre-COWed and we

   (a) either never turn it back into a COW page (ie the fork()-time
stuff we have for pinned pages)

   (b) or there is some explicit marker on the page or in the page
table (ie the userfaultfd_pte_wp thing).

those are _so_ much more straightforward than the very complex rules
we used to have that were actively buggy, in addition to requiring the
page lock. So they were buggy and slow.

And yes, I had forgotten about that userfaultfd_pte_wp() because I was
being myopic and only looking at wp_page_copy(). So using that as a
way to make sure that a page doesn't go through COW is a good way to
avoid the COW race, but I think that thing requires a bit in the page
table which might be a problem on some architectures?

          Linus

Andrea Arcangeli Dec. 23, 2020, 11:55 p.m. UTC | #86

On Wed, Dec 23, 2020 at 02:45:59PM -0800, Nadav Amit wrote:
> I think it may be reasonable.

Whatever solution used, there will be 2 users of it: uffd-wp will use
whatever technique used by clear_refs_write to avoid the
mmap_write_lock.

My favorite is Yu's patch and not the group lock anymore. The cons is
it changes the VM rules (which kind of reminds me my initial proposal
of adding a spurious tlb flush if mm_tlb_flush_pending is set, except
I didn't correctly specify it'd need to go in the page fault), but it
still appears the simplest.

> Just a proposal: At some point we can also ask ourselves whether the
> “artificial" limitation of the number of software bits per PTE should really
> limit us, or do we want to hold some additional metadata per-PTE by either
> putting it in an adjacent page (holding 64-bits of additional software-bits
> per PTE) or by finding some place in the page-struct to link to this
> metadata (and have the liberty of number of bits per PTE). One of the PTE
> software-bits can be repurposed to say whether there is “extra-metadata”
> associated with the PTE.
> 
> I am fully aware that there will be some overhead associated, but it
> can be limited to less-common use-cases.

That's a good point, so far far we didn't run out so it's not an
immediate concern. (as opposed we run out in page->flags where the
PG_tail went to some LSB).

In general kicking the can down the road sounds like the best thing to
do for those bit shortage matters, until we can't anymore at
least.. There's no gain to the kernel runtime, in doing something
generically good here (again see where PG_tail rightfully went).

So before spending RAM and CPU, we'd need to find a more compact
encoding with the bits we already have available.

This reminds me again we could double check if we could make
VM_UFFD_WP mutually exclusive with VM_SOFTDIRTY.

I wasn't sure if it could ever happen in a legit way to use both at
the same time (CRIU on a app using uffd-wp for its own internal mm
management?).

Theoretically it's too late already for it, but VM_UFFD_WP is
relatively new, if we're sure it cannot ever happen in a legit way, it
would be possible to at least evaluate/discuss it. This is an
immediate matter.

What we'll do if we later run out, is not an immediate matter instead,
because it won't make our life any simpler to resolve it now.

Thanks,
Andrea

Andrea Arcangeli Dec. 24, 2020, 1:01 a.m. UTC | #87

Hello Linus,

On Wed, Dec 23, 2020 at 03:39:53PM -0800, Linus Torvalds wrote:
> On Wed, Dec 23, 2020 at 1:39 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > > Thanks for the details.
> >
> > I hope we can find a way put the page_mapcount back where there's a
> > page_count right now.
> 
> I really don't think that's ever going to happen - at least if you're
> talking about do_wp_page().

Yes, I was referring to do_wp_page().

> I refuse to make the *simple* core operations VM have to jump through
> hoops, and the old COW mapcount logic was that. I *much* prefer the
> newer COW code, because the rules are more straightforward.

Agreed. I don't have any practical alternative proposal in fact, it
was only an hypothetical wish/hope, echoing Hugh's view on this
matter.

> > page_count is far from optimal, but it is a feature it finally allowed
> > us to notice that various code (clear_refs_write included apparently
> > even after the fix) leaves stale too permissive TLB entries when it
> > shouldn't.
> 
> I absolutely agree that page_count isn't exactly optimal, but
> "mapcount" is just so much worse.

Agreed. The "isn't exactly optimal" part is the only explanation for
wish/hope.

> page_count() is at least _logical_, and has a very clear meaning:
> "this page has other users". mapcount() means something else, and is
> simply not sufficient or relevant wrt COW.
>
> That doesn't mean that page_mapcount() is wrong - it's just that it's
> wrong for COW. page_mapcount() is great for rmap, so that we can see
> when we need to shoot down a memory mapping of a page that gets
> released (truncate being the classic example).
> 
> I think that the mapcount games we used to have were horrible. I
> absolutely much prefer where we are now wrt COW.
> 
> The modern rules for COW handling are:
> 
>  - if we got a COW fault there might be another user, we copy (and
> this is where the page count makes so much logical sense).
> 
>  - if somebody needs to pin the page in the VM, we either make sure
> that it is pre-COWed and we
> 
>    (a) either never turn it back into a COW page (ie the fork()-time
> stuff we have for pinned pages)
> 
>    (b) or there is some explicit marker on the page or in the page
> table (ie the userfaultfd_pte_wp thing).
> 
> those are _so_ much more straightforward than the very complex rules
> we used to have that were actively buggy, in addition to requiring the
> page lock. So they were buggy and slow.

Agreed, this is a very solid solution that solves the problem the
mapcount had with GUP pins.

The only alternative but very vapourware thought I had on this matter,
is that all troubles start when we let a GUP-pinned page be unmapped
from the pgtables.

I already told Jens, is io_uring could use mmu notifier already (that
would make any io_uring GUP pin not count anymore with regard to
page_mapcount vs page_count difference) and vmsplice also needs some
handling or maybe become privileged.

The above two improvements are orthogonal with the COW issue since
long term GUP pins do more than just mlock and they break most
advanced VM features. It's not ok just to account GUP pins in rlimit
mlock.

The above two improvements however would go into the direction to make
mapcount more suitable for COW, but it'd still not be enough.

The transient GUP pins also would need to be taken care of, but we
could wait for those to be released, since those are guaranteed to be
released or something else, not just munmap()/MADV_DONTNEED, will
remain in D state.

Anyway.. until io_uring and vmsplice are solved first, it's pointless
to even wonder about transient GUP pins.

> And yes, I had forgotten about that userfaultfd_pte_wp() because I was
> being myopic and only looking at wp_page_copy(). So using that as a
> way to make sure that a page doesn't go through COW is a good way to
> avoid the COW race, but I think that thing requires a bit in the page
> table which might be a problem on some architectures?

Correct, it requires a bit in the pgtable.

The bit is required in order to disambiguate which regions have been
marked for wrprotect faults and which didn't.

A practical example would be: fork() called by an app with a very
large vma with VM_UFFD_WP set.

There would be a storm of WP userfaults if we didn't have the software
bit to detect exactly which pte were wrprotected by uffd-wp and which
ones were wrprotected by fork.

That bit then sends the COW fault to a safe place if wrprotect is
running (the problem is we didn't see the un-wrprotect would clear the
bit while the TLB flush of the wrprotect could be still pending).

The page_mapcount I guess hidden that race to begin with, just like it
did in clear_refs_write.

uffd-wp is similar to soft dirty: it won't work well without a pgtable
software bit. The software bit, to avoid storms of false positives
during memory pressure, also survives swapin/swapout, again like soft
dirty.

The bit requirement is handled through a per-arch option
arch/x86/Kconfig similarly to soft dirty.

        select HAVE_ARCH_USERFAULTFD_WP         if X86_64 && USERFAULTFD

From an high level, the extra bit in the pte/hugepmd could be seen as
holding the information that would be generated by split_vma()
followed by clearing VM_WRITE in the middle vma. Setting that bit
results in a cheaper runtime than allocating 2 more vmas with the
associated locking and rbtree size increase, but same accuracy.

Thanks,
Andrea

Andy Lutomirski Dec. 24, 2020, 1:21 a.m. UTC | #88

> On Dec 23, 2020, at 2:29 PM, Yu Zhao <yuzhao@google.com> wrote:
> 

> 
> I was hesitant to suggest the following because it isn't that straight
> forward. But since you seem to be less concerned with the complexity,
> I'll just bring it on the table -- it would take care of both ufd and
> clear_refs_write, wouldn't it?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>        goto unlock;
>    }
>    if (vmf->flags & FAULT_FLAG_WRITE) {
> -        if (!pte_write(entry))
> +        if (!pte_write(entry)) {
> +            if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> +                flush_tlb_page(vmf->vma, vmf->address);
>            return do_wp_page(vmf);
> +        }

I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.

I’m not immediately sure how to do all that much better, though. We could potentially keep a record of pending ranges that need flushing per mm or per PTL, protected by the PTL, and arrange to do the flush if we notice that flushes are pending when we want to do_wp_page().  At least this would limit us to one point extra flush, at least until the concurrent mprotect (or whatever) makes further progress.  The bookkeeping might be nasty, though.

But x86 already sort of does some of this bookkeeping, and arguably x86’s code could be improved by tracking TLB ranges to flush per mm instead of per flush request — Nadav already got us half way there by making a little cache of flush_tlb_info structs.  IMO it wouldn’t be totally crazy to integrate this better with tlb_gather_mmu to make the pending flush data visible to other CPUs even before actually kicking off the flush. In the limit, this starts to look a bit like a fully async flush mechanism.  We would have a function to request a flush, and that function would return a generation count but not actually flush anything.  The case of flushing a range adjacent to a still-pending range would be explicitly optimized.  Then another function would actually initiate and wait for the flush to complete.  And we could, while holding PTL, scan the list of pending flushes, if any, to see if the PTE we’re looking at has a flush pending.  This is sort of easy in the one-PTL-per-mm case but potentially rather complicated in the split PTL case.  And I’m genuinely unsure where the “batch” TLB flush interface fits in, because it can create a batch that spans more than one mm.  x86 can deal with this moderately efficiently since we limit the number of live mms per CPU and our flushes are (for now?) per cpu, not per mm.

u64 gen = 0;
for(...)
  gen = queue_flush(mm, start, end, freed_tables);
flush_to_gen(mm, gen);

and the wp fault path does:

wait_for_pending_flushes(mm, address);

Other than the possibility of devolving to one flush per operation if one thread is page faulting at the same speed that another thread is modifying PTEs, this should be decently performant.

Andrea Arcangeli Dec. 24, 2020, 2 a.m. UTC | #89

On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.

mprotect can't run concurrently with a page fault in the first place.

One other near zero cost improvement easy to add if this would be "if
(vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made
conditional to the two config options too.

Still I don't mind doing it in some other way, uffd-wp has much easier
time doing it in another way in fact.

Whatever performs better is fine, but queuing up pending invalidate
ranges don't look very attractive since it'd be a fixed cost that we'd
always have to pay even when there's no fault (and there can't be any
fault at least for mprotect).

Thanks,
Andrea

Nadav Amit Dec. 24, 2020, 3:09 a.m. UTC | #90

> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
>> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.
> 
> mprotect can't run concurrently with a page fault in the first place.
> 
> One other near zero cost improvement easy to add if this would be "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made
> conditional to the two config options too.
> 
> Still I don't mind doing it in some other way, uffd-wp has much easier
> time doing it in another way in fact.
> 
> Whatever performs better is fine, but queuing up pending invalidate
> ranges don't look very attractive since it'd be a fixed cost that we'd
> always have to pay even when there's no fault (and there can't be any
> fault at least for mprotect).

I think there are other cases in which Andy’s concern is relevant
(MADV_PAGEOUT).

Perhaps holding some small bitmap based on part of the deferred flushed
pages (e.g., bits 12-17 of the address or some other kind of a single
hash-function bloom-filter) would be more performant to avoid (most)
unnecessary TLB flushes. It will be cleared before a TLB flush and set while
holding the PTL.

Checking if a flush is needed, under the PTL, would require a single memory
access (although potentially cache miss). It will however require one atomic
operation for each page-table whose PTEs’ flushes are deferred - in contrast
to the current scheme which requires two atomic operations for the *entire*
operation.

Nadav Amit Dec. 24, 2020, 3:30 a.m. UTC | #91

> On Dec 23, 2020, at 7:09 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>> 
>> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
>>> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.
>> 
>> mprotect can't run concurrently with a page fault in the first place.
>> 
>> One other near zero cost improvement easy to add if this would be "if
>> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made
>> conditional to the two config options too.
>> 
>> Still I don't mind doing it in some other way, uffd-wp has much easier
>> time doing it in another way in fact.
>> 
>> Whatever performs better is fine, but queuing up pending invalidate
>> ranges don't look very attractive since it'd be a fixed cost that we'd
>> always have to pay even when there's no fault (and there can't be any
>> fault at least for mprotect).
> 
> I think there are other cases in which Andy’s concern is relevant
> (MADV_PAGEOUT).
> 
> Perhaps holding some small bitmap based on part of the deferred flushed
> pages (e.g., bits 12-17 of the address or some other kind of a single
> hash-function bloom-filter) would be more performant to avoid (most)
> unnecessary TLB flushes. It will be cleared before a TLB flush and set while
> holding the PTL.
> 
> Checking if a flush is needed, under the PTL, would require a single memory
> access (although potentially cache miss). It will however require one atomic
> operation for each page-table whose PTEs’ flushes are deferred - in contrast
> to the current scheme which requires two atomic operations for the *entire*
> operation.

Just to finish my thought - clearing the bitmap is the tricky part,
which I still did not figure out.

Andrea Arcangeli Dec. 24, 2020, 3:31 a.m. UTC | #92

On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote:
> One other near zero cost improvement easy to add if this would be "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made

The next worry then is if UFFDIO_WRITEPROTECT is very large then there
would be a flood of wrprotect faults, and they'd end up all issuing a
tlb flush during the UFFDIO_WRITEPROTECT itself which again is a
performance concern for certain uffd-wp use cases.

Those use cases would be for example redis and postcopy live
snapshotting, to use it for async snapshots, unprivileged too in the
case of redis if it temporarily uses bounce buffers for the syscall
I/O for the duration of the snapshot. hypervisors tuned profiles need
to manually lift the unprivileged_userfaultfd to 1 unless their jailer
leaves one capability in the snapshot thread.

Moving the check after userfaultfd_pte_wp would solve
userfaultfd_writeprotect(mode_wp=true), but that still wouldn't avoid
a flood of tlb flushes during userfaultfd_writeprotect(mode_wp=false)
because change_protection doesn't restore the pte_write:

			} else if (uffd_wp_resolve) {
				/*
				 * Leave the write bit to be handled
				 * by PF interrupt handler, then
				 * things like COW could be properly
				 * handled.
				 */
				ptent = pte_clear_uffd_wp(ptent);
			}

When the snapshot is complete userfaultfd_writeprotect(mode_wp=false)
would need to run again on the whole range which can be very big
again.

Orthogonally I think we should also look to restore the pte_write
above orthogonally in uffd-wp, so it'll get yet an extra boost if
compared to current redis snapshotting fork(), that cannot restore all
pte_write after the snapshot child quit and forces a flood of spurious
wrprotect faults (uffd-wp can solve that too).

However, even if uffd-wp restored the pte_write, things would remain
suboptimal for a terabyte process under clear_refs, since softdirty
wrprotect faults that start happening while softdirty is still running
on the mm, won't be caught in userfaultfd_pte_wp.

Something like below, if cleaned up, abstracted properly and
documented well in the two places involved, will have a better chance
to perform optimally for softdirty too.

And on a side note the CONFIG_MEM_SOFT_DIRTY compile time check is
compulsory because VM_SOFTDIRTY is defined to zero if softdirty is not
built in. (for VM_UFFD_WP the CONFIG_HAVE_ARCH_USERFAULTFD_WP can be
removed and it won't make any measurable difference even when
USERFAULTFD=n)

RFC untested below, it's supposed to fix the softdirty testcase too,
even without the incremental fix, since it already does tlb_gather_mmu
before walk_page_range and tlb_finish_mmu after it and that appears
enough to define the inc/dec_tlb_flush_pending.

diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..66fd6d070c47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2844,11 +2844,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 	} else {
+		bool in_uffd_wp, in_softdirty;
 		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 				vmf->address);
 		if (!new_page)
 			goto oom;

+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+		in_uffd_wp = !!(vma->vm_flags & VM_UFFD_WP);
+#else
+		in_uffd_wp = false;
+#endif
+#ifdef CONFIG_MEM_SOFT_DIRTY
+		in_softdirty = !(vma->vm_flags & VM_SOFTDIRTY);
+#else
+		in_softdirty = false;
+#endif
+		if ((in_uffd_wp || in_softdirty) &&
+		    mm_tlb_flush_pending(mm))
+			flush_tlb_page(vma, vmf->address);
+
 		if (!cow_user_page(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,

Yu Zhao Dec. 24, 2020, 3:34 a.m. UTC | #93

On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
> >> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.
> > 
> > mprotect can't run concurrently with a page fault in the first place.
> > 
> > One other near zero cost improvement easy to add if this would be "if
> > (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made
> > conditional to the two config options too.
> > 
> > Still I don't mind doing it in some other way, uffd-wp has much easier
> > time doing it in another way in fact.
> > 
> > Whatever performs better is fine, but queuing up pending invalidate
> > ranges don't look very attractive since it'd be a fixed cost that we'd
> > always have to pay even when there's no fault (and there can't be any
> > fault at least for mprotect).
> 
> I think there are other cases in which Andy’s concern is relevant
> (MADV_PAGEOUT).

That patch only demonstrate a rough idea and I should have been
elaborate: if we ever decide to go that direction, we only need to
worry about "jumping through hoops", because the final patch (set)
I have in mind would not only have the build time optimization Andrea
suggested but also include runtime optimizations like skipping
do_swap_page() path and (!PageAnon() || page_mapcount > 1). Rest
assured, the performance impact on do_wp_page() from occasionally an
additional TLB flush on top of a page copy is negligible.

> Perhaps holding some small bitmap based on part of the deferred flushed
> pages (e.g., bits 12-17 of the address or some other kind of a single
> hash-function bloom-filter) would be more performant to avoid (most)
> unnecessary TLB flushes. It will be cleared before a TLB flush and set while
> holding the PTL.
> 
> Checking if a flush is needed, under the PTL, would require a single memory
> access (although potentially cache miss). It will however require one atomic
> operation for each page-table whose PTEs’ flushes are deferred - in contrast
> to the current scheme which requires two atomic operations for the *entire*
> operation.
>

Andrea Arcangeli Dec. 24, 2020, 4:01 a.m. UTC | #94

> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > I think there are other cases in which Andy’s concern is relevant
> > (MADV_PAGEOUT).

I didn't try to figure how it would help MADV_PAGEOUT. MADV_PAGEOUT
sounds cool feature, but maybe it'd need a way to flush the
invalidates out and a take a static key to enable the buildup of those
ranges?

I wouldn't like to slow down the fast paths even for MADV_PAGEOUT, and
the same applies to uffd-wp and softdirty in fact.

On Wed, Dec 23, 2020 at 08:34:00PM -0700, Yu Zhao wrote:
> That patch only demonstrate a rough idea and I should have been
> elaborate: if we ever decide to go that direction, we only need to
> worry about "jumping through hoops", because the final patch (set)
> I have in mind would not only have the build time optimization Andrea
> suggested but also include runtime optimizations like skipping
> do_swap_page() path and (!PageAnon() || page_mapcount > 1). Rest
> assured, the performance impact on do_wp_page() from occasionally an
> additional TLB flush on top of a page copy is negligible.

Agreed, I just posted something in that direction. Feel free to
refactor it, it's just a tentative implementation to show something
that may deliver all the performance we need in all cases involved
without slowing down the fast paths of non-uffd and non-softdirty
(well 1 branch).

> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
> > Perhaps holding some small bitmap based on part of the deferred flushed
> > pages (e.g., bits 12-17 of the address or some other kind of a single
> > hash-function bloom-filter) would be more performant to avoid (most)

The concern here aren't only the page faults having to run the bloom
filter, but how to manage the RAM storage pointed by the bloomfilter
or whatever index into the storage, which would slowdown mprotect.

Granted that mprotect is slow to begin with, but the idea we can't make
it any slower to make MADV_PAGEOUT or uffd-wp or clear_refs run
faster since it's too important and too frequent in comparison.

Just to restrict the potential false positive IPI caused by page_count
inevitable inaccuracies to uffd-wp and softdirty runtimes, a simple
check on vm_flags should be enough.

Thanks,
Andrea

Nadav Amit Dec. 24, 2020, 4:37 a.m. UTC | #95

> On Dec 23, 2020, at 7:34 PM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
>>> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>> 
>>> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
>>>> I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait in IO while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.
>>> 
>>> mprotect can't run concurrently with a page fault in the first place.
>>> 
>>> One other near zero cost improvement easy to add if this would be "if
>>> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made
>>> conditional to the two config options too.
>>> 
>>> Still I don't mind doing it in some other way, uffd-wp has much easier
>>> time doing it in another way in fact.
>>> 
>>> Whatever performs better is fine, but queuing up pending invalidate
>>> ranges don't look very attractive since it'd be a fixed cost that we'd
>>> always have to pay even when there's no fault (and there can't be any
>>> fault at least for mprotect).
>> 
>> I think there are other cases in which Andy’s concern is relevant
>> (MADV_PAGEOUT).
> 
> That patch only demonstrate a rough idea and I should have been
> elaborate: if we ever decide to go that direction, we only need to
> worry about "jumping through hoops", because the final patch (set)
> I have in mind would not only have the build time optimization Andrea
> suggested but also include runtime optimizations like skipping
> do_swap_page() path and (!PageAnon() || page_mapcount > 1). Rest
> assured, the performance impact on do_wp_page() from occasionally an
> additional TLB flush on top of a page copy is negligible.

I agree with you to a certain extent, since there is anyhow another TLB
flush in this path when the PTE is set after copying.

Yet, I think that having a combined and efficient central mechanism for
pending TLB flushes is important even for robustness: to prevent the
development of new independent deferred flushing schemes. I specifically do
not like tlb_flush_batched which every time that I look at gets me confused.
For example the following code completely confuses me:

  void flush_tlb_batched_pending(struct mm_struct *mm)
  {       
        if (data_race(mm->tlb_flush_batched)) {
                flush_tlb_mm(mm);

                /* 
                 * Do not allow the compiler to re-order the clearing of
                 * tlb_flush_batched before the tlb is flushed.
                 */
                barrier();
                mm->tlb_flush_batched = false;
        }
  }

… and then I ask myself (no answer):

1. What prevents concurrent flush_tlb_batched_pending() which is called by
madvise_free_pte_range(), for instance from madvise_free_pte_range(), from
clearing new deferred flush indication that was just set by
set_tlb_ubc_flush_pending()? Can it cause a missed TLB flush later?

2. Why the write to tlb_flush_batched is not done with WRITE_ONCE()?

3. Should we have smp_wmb() instead of barrier()? (probably the barrier() is
not needed at all since flush_tlb_mm() serializes if a flush is needed).

4. Why do we need 2 deferred TLB flushing mechanisms?

Nadav Amit Dec. 24, 2020, 5:18 a.m. UTC | #96

> On Dec 23, 2020, at 8:01 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
>> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
>>> Perhaps holding some small bitmap based on part of the deferred flushed
>>> pages (e.g., bits 12-17 of the address or some other kind of a single
>>> hash-function bloom-filter) would be more performant to avoid (most)
> 
> The concern here aren't only the page faults having to run the bloom
> filter, but how to manage the RAM storage pointed by the bloomfilter
> or whatever index into the storage, which would slowdown mprotect.
> 
> Granted that mprotect is slow to begin with, but the idea we can't make
> it any slower to make MADV_PAGEOUT or uffd-wp or clear_refs run
> faster since it's too important and too frequent in comparison.
> 
> Just to restrict the potential false positive IPI caused by page_count
> inevitable inaccuracies to uffd-wp and softdirty runtimes, a simple
> check on vm_flags should be enough.

Andrea,

I am not trying to be argumentative, and I did not think through about an
alternative solution. It sounds to me that your proposed solution is correct
and would probably be eventually (slightly) more efficient than anything
that I can propose.

Yet, I do want to explain my position. Reasoning on TLB flushes is hard, as
this long thread shows. The question is whether it has to be so hard. In
theory, we can only think about architectural considerations - whether a PTE
permissions are promoted/demoted and whether the PTE was changed/cleared.

Obviously, it is more complex than that. Yet, once you add into the equation
various parameters such as the VMA flags or whether a page is locked (which
Mel told me was once a consideration), things become much more complicated.
If all the logic of TLB flushes had been concentrated in a single point and
maintenance of this code did not require thought about users and use-cases,
I think things would have been much simpler, at least for me.

Regards,
Nadav

Andrea Arcangeli Dec. 24, 2020, 6:49 p.m. UTC | #97

On Wed, Dec 23, 2020 at 09:18:09PM -0800, Nadav Amit wrote:
> I am not trying to be argumentative, and I did not think through about an
> alternative solution. It sounds to me that your proposed solution is correct
> and would probably be eventually (slightly) more efficient than anything
> that I can propose.

On a side note, on the last proposed solution, I've been wondering
about the memory ordering mm_tlb_flush_pending.

There's plenty of locking before the point where the actual data is
read, but it's not a release/acquire full barrier (or more accurately
it is only on x86), so smp_rmb() seems needed before cow_user_page to
be sure no data can be read before we read the value of
mm_tlb_flush_pending.

To avoid an explicit smp_rmb() and to inherit the implicit smp_mb()
from the release/acquire of PT unlock/PT lock, we'd need to put the
flush before the previous PT unlock. (note, not after the next PT lock
or it'd be even worse).

So this made me look also the inc/dec:

+	smp_mb__before_atomic();
	atomic_dec(&mm->tlb_flush_pending);

Without the above, can't the CPU decrement the tlb_flush_pending while
the IPI to ack didn't arrive yet in csd_lock_wait?

The smp_rmb() is still needed in the page fault (implicit or explicit
doesn't matter), but we also need a smp_mb__before_atomic() above to
really make this work.

> Yet, I do want to explain my position. Reasoning on TLB flushes is hard, as
> this long thread shows. The question is whether it has to be so hard. In
> theory, we can only think about architectural considerations - whether a PTE
> permissions are promoted/demoted and whether the PTE was changed/cleared.
> 
> Obviously, it is more complex than that. Yet, once you add into the equation
> various parameters such as the VMA flags or whether a page is locked (which
> Mel told me was once a consideration), things become much more complicated.
> If all the logic of TLB flushes had been concentrated in a single point and
> maintenance of this code did not require thought about users and use-cases,
> I think things would have been much simpler, at least for me.

In your previous email you also suggested to add range invalidates and
bloom filter to index them by the address in the page fault since it'd
help MADV_PAGEOUT. That would increase complexity and it won't go
exactly in the above direction.

I assume that here nobody wants to add gratuitous complexity, and I
never suggested the code shouldn't become simpler and easier to
understand and maintain. But we can't solve everything in a single
thread in terms of cleaning up and auditing the entirety of the TLB
code.

To refocus: the only short term objective in this thread, is to fix
the data corruption in uffd-wp and clear_refs_write without
introducing new performance regressions compared to the previous
clear_refs_write runtime.

Once that is fixed and we didn't introduce a performance regression
while fixing an userland memory corruption regression (i.e. not really
a regression in theory, but in practice it is because it worked by
luck), then we can look at all the rest without hurry.

So if already want to start the cleanups like I mentioned in a
previous email, and I'll say it more explicitly, the tlb gather is for
freeing memory, it's just pure overhead and gives a false sense of
security, like it can make any difference, when just changing
protection with mmap_read_lock. It wouldn't be needed with the write
lock and it can't help solve the races that trigger with
mmap_read_lock either.

It is needed when you have to store the page pointer outside the pte,
clear a pte, flush the tlb and only then put_page. So it is needed to
keep tracks of which pages got cleared in the ptes so you don't have
to issue a tlb flush for each single pte that gets cleared.

So the only case when to use the tlb_gather is when you must make the
pte stop pointing to the page and you need an external storage that
will keep track of those pages that we cannot yet lose track of since
we didn't do put_page yet. That kind of external storage to keep track
of the pages that have pending tlb flushes, is never needed in
change_protection and clear_refs_write.

change_protection is already correct in that respect, it doesn't use
the tlb_gather. clear_refs_write is not ok and it need to stop using
tlb_gather_mmu/tlb_finish_mmu.

* tlb_gather_mmu - initialize an mmu_gather structure for page-table tear-down

See also the tear-down mention which doesn't apply to clear_refs_write.

clear_refs_write needs to become identical to
change_protection_range. Just increasing inc_tlb_flush_pending and
then doing a flush of the range right before the
dec_tlb_flush_pending.

That change is actually orthogonal to the regression fix to teach the
COW to deal with stale TLB entries and it will clean up the code.

So to pursue your simplification objective you could already go ahead
doing this improvement since it's very confusing to see the
clear_refs_write use something that it shouldn't use like if it
actually could make any difference and then seeing an incremental
patch trying to perfect the tlb_gather logic in clear_refs instead of
removing it. In fact it's so strange that it's hard to suggest
dropping tlb_gather entirely, I feel like I must be missing
something, so if I miss something this would be a good time to explain
me why tlb_gather is used in clear_refs.

Once that is also done, we can look at the flush_tlb_batched_pending
that I see you mentioned to Yu. I didn't go check it yet, but we can
certainly look at it deeper later, maybe starting a new thread for
it is simpler?

In the short term, for this thread, we can't solve everything at once
and reduce the complexity at the same time.

So refocusing on the memory ordering of dec_tlb_flush_pending and the
mm_tlb_flush_pending mentioned above, to find a proper abstraction and
write proper documentation for the flush in wp_copy_page would be
ideal. Then we can do the rest.

On my to-list before worrying about the rest in fact I also need to
re-send (I already proposed it for merging a few times on lkml) of the
ARM64 tlb flushing optimization to skip the hardware SMP un-scalable
ITLB broadcast for single threaded processes or multithreaded
processes that are temporarily running single threaded or bind to a
single CPU, which increases SMP scalability of the arm64 TLB flushing
by an order of magnitude for some workloads and we had to ship in
production already).

Thanks,
Andrea

Andrea Arcangeli Dec. 24, 2020, 7:16 p.m. UTC | #98

On Thu, Dec 24, 2020 at 01:49:45PM -0500, Andrea Arcangeli wrote:
> Without the above, can't the CPU decrement the tlb_flush_pending while
> the IPI to ack didn't arrive yet in csd_lock_wait?

Ehm: csd_lock_wait has smp_acquire__after_ctrl_dep() so the write side
looks ok after all sorry.

Peter Zijlstra Jan. 5, 2021, 3:37 p.m. UTC | #99

On Mon, Dec 21, 2020 at 08:16:11PM -0800, Linus Torvalds wrote:

> So I think the basic rule is that "if you hold mmap_sem for writing,
> you're always safe". And that really should be considered the
> "default" locking.
> 
> ANY time you make a modification to the VM layer, you should basically
> always treat it as a write operation, and get the mmap_sem for
> writing.
> 
> Yeah, yeah, that's a bit simplified, and it ignores various special
> cases (and the hardware page table walkers that obviously take no
> locks at all), but if you hold the mmap_sem for writing you won't
> really race with anything else - not page faults, and not other
> "modify this VM".

> To a first approximation, everybody that changes the VM should take
> the mmap_sem for writing, and the readers should just be just about
> page fault handling (and I count GUP as "page fault handling" too -
> it's kind of the same "look up page" rather than "modify vm" kind of
> operation).
> 
> And there are just a _lot_ more page faults than there are things that
> modify the page tables and the vma's.
> 
> So having that mental model of "lookup of pages in a VM take mmap_semn
> for reading, any modification of the VM uses it for writing" makes
> sense both from a performance angle and a logical standpoint. It's the
> correct model.

> And it's worth noting that COW is still "lookup of pages", even though
> it might modify the page tables in the process. The same way lookup
> can modify the page tables to mark things accessed or dirty.
> 
> So COW is still a lookup operation, in ways that "change the
> writabiility of this range" very much is not. COW is "lookup for
> write", and the magic we do to copy to make that write valid is still
> all about the lookup of the page.

(your other email clarified this point; the COW needs to copy while
holding the PTL and we need TLBI under PTL if we're to change this)

> Which brings up another mental mistake I saw earlier in this thread:
> you should not think "mmap_sem is for vma, and the page table lock is
> for the page table changes".
> 
> mmap_sem is the primary lock for any modifications to the VM layout,
> whether it be in the vma's or in the page tables.
> 
> Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
> it is partly because
> 
>  (a) we have things that historically walked the page tables _without_
> walking the vma's (notably the virtual memory scanning)
> 
>  (b) we do allow concurrent page faults, so we then need a lower-level
> lock to serialize the parallelism we _do_ have.

And I'm thinking the speculative page fault series steps right into all
this, it fundamentally avoids mmap_sem and entirely relies on the PTL.

Which opens it up to exactly these races explored here.

The range lock approach does not suffer this, but I'm still worried
about the actual performance of that thing.

Andrea Arcangeli Jan. 5, 2021, 6:03 p.m. UTC | #100

On Tue, Jan 05, 2021 at 04:37:27PM +0100, Peter Zijlstra wrote:
> (your other email clarified this point; the COW needs to copy while
> holding the PTL and we need TLBI under PTL if we're to change this)

The COW doesn't need to hold the PT lock, the TLBI broadcast doesn't
need to be delivered under PT lock either.

Simply there need to be a TLBI broadcast before the copy. The patch I
sent here https://lkml.kernel.org/r/X+QLr1WmGXMs33Ld@redhat.com that
needs to be cleaned up with some abstraction and better commentary
also misses a smp_mb() in the case flush_tlb_page is not called, but
that's a small detail.

> And I'm thinking the speculative page fault series steps right into all
> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.

I thought about that but that only applies to some kind of "anon" page
fault.

Here the problem isn't just the page fault, the problem is not to
regress clear_refs to block on page fault I/O, and all
MAP_PRIVATE/MAP_SHARED filebacked faults bitting the disk to read
/usr/ will still prevent clear_refs from running (and the other way
around) if it has to take the mmap_sem for writing.

I don't look at the speculative page fault for a while but last I
checked there was nothing there that can tame the above major
regression from CPU speed to disk I/O speed that would be inflicted on
both clear_refs on huge mm and on uffd-wp.

Thanks,
Andrea

Vinayak Menon Jan. 12, 2021, 11:43 a.m. UTC | #101

On 1/5/2021 9:07 PM, Peter Zijlstra wrote:
> On Mon, Dec 21, 2020 at 08:16:11PM -0800, Linus Torvalds wrote:
>
>> So I think the basic rule is that "if you hold mmap_sem for writing,
>> you're always safe". And that really should be considered the
>> "default" locking.
>>
>> ANY time you make a modification to the VM layer, you should basically
>> always treat it as a write operation, and get the mmap_sem for
>> writing.
>>
>> Yeah, yeah, that's a bit simplified, and it ignores various special
>> cases (and the hardware page table walkers that obviously take no
>> locks at all), but if you hold the mmap_sem for writing you won't
>> really race with anything else - not page faults, and not other
>> "modify this VM".
>> To a first approximation, everybody that changes the VM should take
>> the mmap_sem for writing, and the readers should just be just about
>> page fault handling (and I count GUP as "page fault handling" too -
>> it's kind of the same "look up page" rather than "modify vm" kind of
>> operation).
>>
>> And there are just a _lot_ more page faults than there are things that
>> modify the page tables and the vma's.
>>
>> So having that mental model of "lookup of pages in a VM take mmap_semn
>> for reading, any modification of the VM uses it for writing" makes
>> sense both from a performance angle and a logical standpoint. It's the
>> correct model.
>> And it's worth noting that COW is still "lookup of pages", even though
>> it might modify the page tables in the process. The same way lookup
>> can modify the page tables to mark things accessed or dirty.
>>
>> So COW is still a lookup operation, in ways that "change the
>> writabiility of this range" very much is not. COW is "lookup for
>> write", and the magic we do to copy to make that write valid is still
>> all about the lookup of the page.
> (your other email clarified this point; the COW needs to copy while
> holding the PTL and we need TLBI under PTL if we're to change this)
>
>> Which brings up another mental mistake I saw earlier in this thread:
>> you should not think "mmap_sem is for vma, and the page table lock is
>> for the page table changes".
>>
>> mmap_sem is the primary lock for any modifications to the VM layout,
>> whether it be in the vma's or in the page tables.
>>
>> Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
>> it is partly because
>>
>>   (a) we have things that historically walked the page tables _without_
>> walking the vma's (notably the virtual memory scanning)
>>
>>   (b) we do allow concurrent page faults, so we then need a lower-level
>> lock to serialize the parallelism we _do_ have.
> And I'm thinking the speculative page fault series steps right into all
> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
>
> Which opens it up to exactly these races explored here.
>
> The range lock approach does not suffer this, but I'm still worried
> about the actual performance of that thing.


Some thoughts on why there may not be an issue with speculative page fault.
Adding Laurent as well.

Possibility of race against other PTE modifiers

1) Fork - We have seen a case of SPF racing with fork marking PTEs RO 
and that
is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
2) mprotect - change_protection in mprotect which does the deferred flush is
marked under vm_write_begin/vm_write_end, thus SPF bails out on faults 
on those VMAs.
3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
But SPF does not take UFFD faults.
4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
(2) above.
5) Concurrent faults - SPF does not handle all faults. Only anon page 
faults.
Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
transitions without tlb flush. And I hope do_wp_page with RO->RW is fine 
as well.
I could not see a case where speculative path cannot see a PTE update 
done via
a fault on another CPU.

Laurent Dufour Jan. 12, 2021, 3:47 p.m. UTC | #102

Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
> On 1/5/2021 9:07 PM, Peter Zijlstra wrote:
>> On Mon, Dec 21, 2020 at 08:16:11PM -0800, Linus Torvalds wrote:
>>
>>> So I think the basic rule is that "if you hold mmap_sem for writing,
>>> you're always safe". And that really should be considered the
>>> "default" locking.
>>>
>>> ANY time you make a modification to the VM layer, you should basically
>>> always treat it as a write operation, and get the mmap_sem for
>>> writing.
>>>
>>> Yeah, yeah, that's a bit simplified, and it ignores various special
>>> cases (and the hardware page table walkers that obviously take no
>>> locks at all), but if you hold the mmap_sem for writing you won't
>>> really race with anything else - not page faults, and not other
>>> "modify this VM".
>>> To a first approximation, everybody that changes the VM should take
>>> the mmap_sem for writing, and the readers should just be just about
>>> page fault handling (and I count GUP as "page fault handling" too -
>>> it's kind of the same "look up page" rather than "modify vm" kind of
>>> operation).
>>>
>>> And there are just a _lot_ more page faults than there are things that
>>> modify the page tables and the vma's.
>>>
>>> So having that mental model of "lookup of pages in a VM take mmap_semn
>>> for reading, any modification of the VM uses it for writing" makes
>>> sense both from a performance angle and a logical standpoint. It's the
>>> correct model.
>>> And it's worth noting that COW is still "lookup of pages", even though
>>> it might modify the page tables in the process. The same way lookup
>>> can modify the page tables to mark things accessed or dirty.
>>>
>>> So COW is still a lookup operation, in ways that "change the
>>> writabiility of this range" very much is not. COW is "lookup for
>>> write", and the magic we do to copy to make that write valid is still
>>> all about the lookup of the page.
>> (your other email clarified this point; the COW needs to copy while
>> holding the PTL and we need TLBI under PTL if we're to change this)
>>
>>> Which brings up another mental mistake I saw earlier in this thread:
>>> you should not think "mmap_sem is for vma, and the page table lock is
>>> for the page table changes".
>>>
>>> mmap_sem is the primary lock for any modifications to the VM layout,
>>> whether it be in the vma's or in the page tables.
>>>
>>> Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
>>> it is partly because
>>>
>>>   (a) we have things that historically walked the page tables _without_
>>> walking the vma's (notably the virtual memory scanning)
>>>
>>>   (b) we do allow concurrent page faults, so we then need a lower-level
>>> lock to serialize the parallelism we _do_ have.
>> And I'm thinking the speculative page fault series steps right into all
>> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
>>
>> Which opens it up to exactly these races explored here.
>>
>> The range lock approach does not suffer this, but I'm still worried
>> about the actual performance of that thing.
> 
> 
> Some thoughts on why there may not be an issue with speculative page fault.
> Adding Laurent as well.
> 
> Possibility of race against other PTE modifiers
> 
> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> 2) mprotect - change_protection in mprotect which does the deferred flush is
> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults on those 
> VMAs.
> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> But SPF does not take UFFD faults.
> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> (2) above.
> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
> I could not see a case where speculative path cannot see a PTE update done via
> a fault on another CPU.
> 

Thanks Vinayak,

You explained it fine. Indeed SPF is handling deferred TLB invalidation by 
marking the VMA through vm_write_begin/end(), as for the fork case you 
mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE 
values read are valid.

Cheers,
Laurent.

Peter Zijlstra Jan. 12, 2021, 4:20 p.m. UTC | #103

On Tue, Jan 05, 2021 at 01:03:48PM -0500, Andrea Arcangeli wrote:
> On Tue, Jan 05, 2021 at 04:37:27PM +0100, Peter Zijlstra wrote:
> > (your other email clarified this point; the COW needs to copy while
> > holding the PTL and we need TLBI under PTL if we're to change this)
> 
> The COW doesn't need to hold the PT lock, the TLBI broadcast doesn't
> need to be delivered under PT lock either.
> 
> Simply there need to be a TLBI broadcast before the copy. The patch I
> sent here https://lkml.kernel.org/r/X+QLr1WmGXMs33Ld@redhat.com that
> needs to be cleaned up with some abstraction and better commentary
> also misses a smp_mb() in the case flush_tlb_page is not called, but
> that's a small detail.

That's horrific crap. All of that tlb-pending stuff is batshit, and this
makes it worse.

> > And I'm thinking the speculative page fault series steps right into all
> > this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
> 
> I thought about that but that only applies to some kind of "anon" page
> fault.

That must be something new; it used to handle all faults. I specifically
spend quite a bit of time getting the file crud right (which Linus
initially fingered for being horrible broken).

SPF fundamentally elides the mmap_sem, which Linus said must serialize
faults.

> Here the problem isn't just the page fault, the problem is not to
> regress clear_refs to block on page fault I/O, and all

IIRC we do the actual reads without any locks held, just like
VM_FAULT_RETRY does today. You take the fault, find you need IO, drop
locks, do IO, retake fault.

> MAP_PRIVATE/MAP_SHARED filebacked faults bitting the disk to read
> /usr/ will still prevent clear_refs from running (and the other way
> around) if it has to take the mmap_sem for writing.
> 
> I don't look at the speculative page fault for a while but last I
> checked there was nothing there that can tame the above major
> regression from CPU speed to disk I/O speed that would be inflicted on
> both clear_refs on huge mm and on uffd-wp.

All of the clear_refs nonsense is immaterial to SPF. Also, who again
cares about clear_refs? Why is it important?

Peter Zijlstra Jan. 12, 2021, 4:57 p.m. UTC | #104

On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :

> > Possibility of race against other PTE modifiers
> > 
> > 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> > is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/

Right, that's exactly the kind of thing I was worried about.

> > 2) mprotect - change_protection in mprotect which does the deferred flush is
> > marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
> > on those VMAs.

Sure, mprotect also changes vm_flags, so it really needs that anyway.

> > 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> > But SPF does not take UFFD faults.
> > 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> > (2) above.

> > 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.

What happened to shared/file-backed stuff? ISTR I had that working.

> > Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> > transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.

The tricky one is demotion, specifically write to non-write.

> > I could not see a case where speculative path cannot see a PTE update done via
> > a fault on another CPU.

One you didn't mention is the NUMA balancing scanning crud; although I
think that's fine, loosing a PTE update there is harmless. But I've not
thought overly hard on it.

> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
> marking the VMA through vm_write_begin/end(), as for the fork case you
> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
> values read are valid.

That should indeed work, but are we really sure we covered them all?
Should we invest in better TLBI APIs to make sure we can't get this
wrong?

Laurent Dufour Jan. 12, 2021, 7:02 p.m. UTC | #105

Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
> 
>>> Possibility of race against other PTE modifiers
>>>
>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> 
> Right, that's exactly the kind of thing I was worried about.
> 
>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
>>> on those VMAs.
> 
> Sure, mprotect also changes vm_flags, so it really needs that anyway.
> 
>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
>>> But SPF does not take UFFD faults.
>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
>>> (2) above.
> 
>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
> 
> What happened to shared/file-backed stuff? ISTR I had that working.

File-backed mappings are not processed in a speculative way, there were options 
to manage some of them depending on the underlying file system but that's still 
not done.

Shared anonymous mapping, are also not yet handled in a speculative way (vm_ops 
is not null).

>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
> 
> The tricky one is demotion, specifically write to non-write.
> 
>>> I could not see a case where speculative path cannot see a PTE update done via
>>> a fault on another CPU.
> 
> One you didn't mention is the NUMA balancing scanning crud; although I
> think that's fine, loosing a PTE update there is harmless. But I've not
> thought overly hard on it.

That's a good point, I need to double check on that side.

>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
>> marking the VMA through vm_write_begin/end(), as for the fork case you
>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
>> values read are valid.
> 
> That should indeed work, but are we really sure we covered them all?
> Should we invest in better TLBI APIs to make sure we can't get this
> wrong?

That may be a good option to identify deferred TLB invalidation but I've no clue 
on what this API would look like.

Nadav Amit Jan. 12, 2021, 7:15 p.m. UTC | #106

> On Jan 12, 2021, at 11:02 AM, Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
> 
> Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
>> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
>>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
>>>> Possibility of race against other PTE modifiers
>>>> 
>>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
>>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
>> Right, that's exactly the kind of thing I was worried about.
>>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
>>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
>>>> on those VMAs.
>> Sure, mprotect also changes vm_flags, so it really needs that anyway.
>>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
>>>> But SPF does not take UFFD faults.
>>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
>>>> (2) above.
>>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
>> What happened to shared/file-backed stuff? ISTR I had that working.
> 
> File-backed mappings are not processed in a speculative way, there were options to manage some of them depending on the underlying file system but that's still not done.
> 
> Shared anonymous mapping, are also not yet handled in a speculative way (vm_ops is not null).
> 
>>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
>>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
>> The tricky one is demotion, specifically write to non-write.
>>>> I could not see a case where speculative path cannot see a PTE update done via
>>>> a fault on another CPU.
>> One you didn't mention is the NUMA balancing scanning crud; although I
>> think that's fine, loosing a PTE update there is harmless. But I've not
>> thought overly hard on it.
> 
> That's a good point, I need to double check on that side.
> 
>>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
>>> marking the VMA through vm_write_begin/end(), as for the fork case you
>>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
>>> values read are valid.
>> That should indeed work, but are we really sure we covered them all?
>> Should we invest in better TLBI APIs to make sure we can't get this
>> wrong?
> 
> That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.

I will send an RFC soon for per-table deferred TLB flushes tracking.
The basic idea is to save a generation in the page-struct that tracks
when deferred PTE change took place, and track whenever a TLB flush
completed. In addition, other users - such as mprotect - would use
the tlb_gather interface.

Unfortunately, due to limited space in page-struct this would only
be possible for 64-bit (and my implementation is only for x86-64).

It would still require to do the copying while holding the PTL though.

Yu Zhao Jan. 12, 2021, 7:56 p.m. UTC | #107

On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> > On Jan 12, 2021, at 11:02 AM, Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
> > 
> > Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
> >> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
> >>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
> >>>> Possibility of race against other PTE modifiers
> >>>> 
> >>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> >>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> >> Right, that's exactly the kind of thing I was worried about.
> >>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
> >>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
> >>>> on those VMAs.
> >> Sure, mprotect also changes vm_flags, so it really needs that anyway.
> >>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> >>>> But SPF does not take UFFD faults.
> >>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> >>>> (2) above.
> >>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
> >> What happened to shared/file-backed stuff? ISTR I had that working.
> > 
> > File-backed mappings are not processed in a speculative way, there were options to manage some of them depending on the underlying file system but that's still not done.
> > 
> > Shared anonymous mapping, are also not yet handled in a speculative way (vm_ops is not null).
> > 
> >>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> >>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
> >> The tricky one is demotion, specifically write to non-write.
> >>>> I could not see a case where speculative path cannot see a PTE update done via
> >>>> a fault on another CPU.
> >> One you didn't mention is the NUMA balancing scanning crud; although I
> >> think that's fine, loosing a PTE update there is harmless. But I've not
> >> thought overly hard on it.
> > 
> > That's a good point, I need to double check on that side.
> > 
> >>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
> >>> marking the VMA through vm_write_begin/end(), as for the fork case you
> >>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
> >>> values read are valid.
> >> That should indeed work, but are we really sure we covered them all?
> >> Should we invest in better TLBI APIs to make sure we can't get this
> >> wrong?
> > 
> > That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
> 
> I will send an RFC soon for per-table deferred TLB flushes tracking.
> The basic idea is to save a generation in the page-struct that tracks
> when deferred PTE change took place, and track whenever a TLB flush
> completed. In addition, other users - such as mprotect - would use
> the tlb_gather interface.
> 
> Unfortunately, due to limited space in page-struct this would only
> be possible for 64-bit (and my implementation is only for x86-64).

I don't want to discourage you but I don't think this would end up
well. PPC doesn't necessarily follow one-page-struct-per-table rule,
and I've run into problems with this before while trying to do
something similar.

I'd recommend per-vma and per-category (unmapping, clearing writable
and clearing dirty) tracking, which only rely on arch-independent data
structures, i.e., vm_area_struct and mm_struct.

> It would still require to do the copying while holding the PTL though.

IMO, this is unacceptable. Most archs don't support per-table PTL, and
even x86_64 can be configured to use per-mm PTL. What if we want to
support a larger page size in the feature?

It seems to me the only way to solve the problem with self-explanatory
code and without performance impact is to check mm_tlb_flush_pending
and the writable bit (and two other cases I mentioned above) at the
same time. Of course, this requires a lot of effort to audit the
existing uses, clean them up and properly wrap them up with new
primitives, BUG_ON all invalid cases and document the exact workflow
to prevent misuses.

I've mentioned the following before -- it only demonstrates the rough
idea.

diff --git a/mm/memory.c b/mm/memory.c
index 5e9ca612d7d7..af38c5ee327e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		goto unlock;
 	}
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
+				flush_tlb_page(vmf->vma, vmf->address);
 			return do_wp_page(vmf);
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);

Nadav Amit Jan. 12, 2021, 8:38 p.m. UTC | #108

> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>> On Jan 12, 2021, at 11:02 AM, Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
>>> 
>>> Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
>>>> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
>>>>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
>>>>>> Possibility of race against other PTE modifiers
>>>>>> 
>>>>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
>>>>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
>>>> Right, that's exactly the kind of thing I was worried about.
>>>>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
>>>>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
>>>>>> on those VMAs.
>>>> Sure, mprotect also changes vm_flags, so it really needs that anyway.
>>>>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
>>>>>> But SPF does not take UFFD faults.
>>>>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
>>>>>> (2) above.
>>>>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
>>>> What happened to shared/file-backed stuff? ISTR I had that working.
>>> 
>>> File-backed mappings are not processed in a speculative way, there were options to manage some of them depending on the underlying file system but that's still not done.
>>> 
>>> Shared anonymous mapping, are also not yet handled in a speculative way (vm_ops is not null).
>>> 
>>>>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
>>>>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
>>>> The tricky one is demotion, specifically write to non-write.
>>>>>> I could not see a case where speculative path cannot see a PTE update done via
>>>>>> a fault on another CPU.
>>>> One you didn't mention is the NUMA balancing scanning crud; although I
>>>> think that's fine, loosing a PTE update there is harmless. But I've not
>>>> thought overly hard on it.
>>> 
>>> That's a good point, I need to double check on that side.
>>> 
>>>>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
>>>>> marking the VMA through vm_write_begin/end(), as for the fork case you
>>>>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
>>>>> values read are valid.
>>>> That should indeed work, but are we really sure we covered them all?
>>>> Should we invest in better TLBI APIs to make sure we can't get this
>>>> wrong?
>>> 
>>> That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
>> 
>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>> The basic idea is to save a generation in the page-struct that tracks
>> when deferred PTE change took place, and track whenever a TLB flush
>> completed. In addition, other users - such as mprotect - would use
>> the tlb_gather interface.
>> 
>> Unfortunately, due to limited space in page-struct this would only
>> be possible for 64-bit (and my implementation is only for x86-64).
> 
> I don't want to discourage you but I don't think this would end up
> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> and I've run into problems with this before while trying to do
> something similar.

Discourage, discourage. Better now than later.

It will be relatively easy to extend the scheme to be per-VMA instead of
per-table for architectures that prefer it this way. It does require
TLB-generation tracking though, which Andy only implemented for x86, so I
will focus on x86-64 right now.

[ For per-VMA it would require an additional cmpxchg, I presume to save the
last deferred generation though. ]

> I'd recommend per-vma and per-category (unmapping, clearing writable
> and clearing dirty) tracking, which only rely on arch-independent data
> structures, i.e., vm_area_struct and mm_struct.

I think that tracking changes on “what was changed” granularity is harder
and more fragile.

Let me finish trying the clean up the mess first, since fullmm and
need_flush_all semantics were mixed; there are 3 different flushing schemes
for mprotect(), munmap() and try_to_unmap(); there are missing memory
barriers; mprotect() performs TLB flushes even when permissions are
promoted; etc.

There are also some optimizations that we discussed before, such on x86 - 
RW->RO does not require a TLB flush if a PTE is not dirty, etc.

I am trying to finish something so you can say how terrible it is, so I will
not waste too much time. ;-)

>> It would still require to do the copying while holding the PTL though.
> 
> IMO, this is unacceptable. Most archs don't support per-table PTL, and
> even x86_64 can be configured to use per-mm PTL. What if we want to
> support a larger page size in the feature?
> 
> It seems to me the only way to solve the problem with self-explanatory
> code and without performance impact is to check mm_tlb_flush_pending
> and the writable bit (and two other cases I mentioned above) at the
> same time. Of course, this requires a lot of effort to audit the
> existing uses, clean them up and properly wrap them up with new
> primitives, BUG_ON all invalid cases and document the exact workflow
> to prevent misuses.
> 
> I've mentioned the following before -- it only demonstrates the rough
> idea.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> 		goto unlock;
> 	}
> 	if (vmf->flags & FAULT_FLAG_WRITE) {
> -		if (!pte_write(entry))
> +		if (!pte_write(entry)) {
> +			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> +				flush_tlb_page(vmf->vma, vmf->address);
> 			return do_wp_page(vmf);
> +		}
> 		entry = pte_mkdirty(entry);
> 	}
> 	entry = pte_mkyoung(entry);

I understand. This might be required, regardless of the deferred flushes
detection scheme. If we assume that no write-unprotect requires a COW (which
should be true in this case, since we take a reference on the page), your
proposal should be sufficient.

Still, I think that there are many unnecessary TLB flushes right now,
and others that might be missed due to the overly complicated invalidation
schemes. 

Regardless, as Andrea pointed, this requires first to figure out the
semantics of mprotect() and friends when pages are pinned.

Yu Zhao Jan. 12, 2021, 8:49 p.m. UTC | #109

On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> > On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
> > 
> > On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>> On Jan 12, 2021, at 11:02 AM, Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
> >>> 
> >>> Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
> >>>> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
> >>>>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
> >>>>>> Possibility of race against other PTE modifiers
> >>>>>> 
> >>>>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> >>>>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> >>>> Right, that's exactly the kind of thing I was worried about.
> >>>>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
> >>>>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
> >>>>>> on those VMAs.
> >>>> Sure, mprotect also changes vm_flags, so it really needs that anyway.
> >>>>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> >>>>>> But SPF does not take UFFD faults.
> >>>>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> >>>>>> (2) above.
> >>>>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
> >>>> What happened to shared/file-backed stuff? ISTR I had that working.
> >>> 
> >>> File-backed mappings are not processed in a speculative way, there were options to manage some of them depending on the underlying file system but that's still not done.
> >>> 
> >>> Shared anonymous mapping, are also not yet handled in a speculative way (vm_ops is not null).
> >>> 
> >>>>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> >>>>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
> >>>> The tricky one is demotion, specifically write to non-write.
> >>>>>> I could not see a case where speculative path cannot see a PTE update done via
> >>>>>> a fault on another CPU.
> >>>> One you didn't mention is the NUMA balancing scanning crud; although I
> >>>> think that's fine, loosing a PTE update there is harmless. But I've not
> >>>> thought overly hard on it.
> >>> 
> >>> That's a good point, I need to double check on that side.
> >>> 
> >>>>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
> >>>>> marking the VMA through vm_write_begin/end(), as for the fork case you
> >>>>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
> >>>>> values read are valid.
> >>>> That should indeed work, but are we really sure we covered them all?
> >>>> Should we invest in better TLBI APIs to make sure we can't get this
> >>>> wrong?
> >>> 
> >>> That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
> >> 
> >> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >> The basic idea is to save a generation in the page-struct that tracks
> >> when deferred PTE change took place, and track whenever a TLB flush
> >> completed. In addition, other users - such as mprotect - would use
> >> the tlb_gather interface.
> >> 
> >> Unfortunately, due to limited space in page-struct this would only
> >> be possible for 64-bit (and my implementation is only for x86-64).
> > 
> > I don't want to discourage you but I don't think this would end up
> > well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> > and I've run into problems with this before while trying to do
> > something similar.
> 
> Discourage, discourage. Better now than later.
> 
> It will be relatively easy to extend the scheme to be per-VMA instead of
> per-table for architectures that prefer it this way. It does require
> TLB-generation tracking though, which Andy only implemented for x86, so I
> will focus on x86-64 right now.
> 
> [ For per-VMA it would require an additional cmpxchg, I presume to save the
> last deferred generation though. ]
> 
> > I'd recommend per-vma and per-category (unmapping, clearing writable
> > and clearing dirty) tracking, which only rely on arch-independent data
> > structures, i.e., vm_area_struct and mm_struct.
> 
> I think that tracking changes on “what was changed” granularity is harder
> and more fragile.
> 
> Let me finish trying the clean up the mess first, since fullmm and
> need_flush_all semantics were mixed; there are 3 different flushing schemes
> for mprotect(), munmap() and try_to_unmap(); there are missing memory
> barriers; mprotect() performs TLB flushes even when permissions are
> promoted; etc.
> 
> There are also some optimizations that we discussed before, such on x86 - 
> RW->RO does not require a TLB flush if a PTE is not dirty, etc.
> 
> I am trying to finish something so you can say how terrible it is, so I will
> not waste too much time. ;-)
> 
> >> It would still require to do the copying while holding the PTL though.
> > 
> > IMO, this is unacceptable. Most archs don't support per-table PTL, and
> > even x86_64 can be configured to use per-mm PTL. What if we want to
> > support a larger page size in the feature?
> > 
> > It seems to me the only way to solve the problem with self-explanatory
> > code and without performance impact is to check mm_tlb_flush_pending
> > and the writable bit (and two other cases I mentioned above) at the
> > same time. Of course, this requires a lot of effort to audit the
> > existing uses, clean them up and properly wrap them up with new
> > primitives, BUG_ON all invalid cases and document the exact workflow
> > to prevent misuses.
> > 
> > I've mentioned the following before -- it only demonstrates the rough
> > idea.
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5e9ca612d7d7..af38c5ee327e 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > 		goto unlock;
> > 	}
> > 	if (vmf->flags & FAULT_FLAG_WRITE) {
> > -		if (!pte_write(entry))
> > +		if (!pte_write(entry)) {
> > +			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> > +				flush_tlb_page(vmf->vma, vmf->address);
> > 			return do_wp_page(vmf);
> > +		}
> > 		entry = pte_mkdirty(entry);
> > 	}
> > 	entry = pte_mkyoung(entry);
> 
> I understand. This might be required, regardless of the deferred flushes
> detection scheme. If we assume that no write-unprotect requires a COW (which
> should be true in this case, since we take a reference on the page), your
> proposal should be sufficient.
> 
> Still, I think that there are many unnecessary TLB flushes right now,
> and others that might be missed due to the overly complicated invalidation
> schemes. 
> 
> Regardless, as Andrea pointed, this requires first to figure out the
> semantics of mprotect() and friends when pages are pinned.

Thanks, I appreciate your effort. I'd be glad to review whatever you
come up with.

Will Deacon Jan. 12, 2021, 9:43 p.m. UTC | #110

On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> > On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
> > On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >> The basic idea is to save a generation in the page-struct that tracks
> >> when deferred PTE change took place, and track whenever a TLB flush
> >> completed. In addition, other users - such as mprotect - would use
> >> the tlb_gather interface.
> >> 
> >> Unfortunately, due to limited space in page-struct this would only
> >> be possible for 64-bit (and my implementation is only for x86-64).
> > 
> > I don't want to discourage you but I don't think this would end up
> > well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> > and I've run into problems with this before while trying to do
> > something similar.
> 
> Discourage, discourage. Better now than later.
> 
> It will be relatively easy to extend the scheme to be per-VMA instead of
> per-table for architectures that prefer it this way. It does require
> TLB-generation tracking though, which Andy only implemented for x86, so I
> will focus on x86-64 right now.

Can you remind me of what we're missing on arm64 in this area, please? I'm
happy to help get this up and running once you have something I can build
on.

Will

Nadav Amit Jan. 12, 2021, 10:29 p.m. UTC | #111

> On Jan 12, 2021, at 1:43 PM, Will Deacon <will@kernel.org> wrote:
> 
> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>> The basic idea is to save a generation in the page-struct that tracks
>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>> completed. In addition, other users - such as mprotect - would use
>>>> the tlb_gather interface.
>>>> 
>>>> Unfortunately, due to limited space in page-struct this would only
>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>> 
>>> I don't want to discourage you but I don't think this would end up
>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>> and I've run into problems with this before while trying to do
>>> something similar.
>> 
>> Discourage, discourage. Better now than later.
>> 
>> It will be relatively easy to extend the scheme to be per-VMA instead of
>> per-table for architectures that prefer it this way. It does require
>> TLB-generation tracking though, which Andy only implemented for x86, so I
>> will focus on x86-64 right now.
> 
> Can you remind me of what we're missing on arm64 in this area, please? I'm
> happy to help get this up and running once you have something I can build
> on.

Let me first finish making something that we can use as a basis for a
discussion. I do not waste your time before I have something ready.

Will Deacon Jan. 12, 2021, 10:46 p.m. UTC | #112

On Tue, Jan 12, 2021 at 02:29:51PM -0800, Nadav Amit wrote:
> > On Jan 12, 2021, at 1:43 PM, Will Deacon <will@kernel.org> wrote:
> > 
> > On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> >>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
> >>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >>>> The basic idea is to save a generation in the page-struct that tracks
> >>>> when deferred PTE change took place, and track whenever a TLB flush
> >>>> completed. In addition, other users - such as mprotect - would use
> >>>> the tlb_gather interface.
> >>>> 
> >>>> Unfortunately, due to limited space in page-struct this would only
> >>>> be possible for 64-bit (and my implementation is only for x86-64).
> >>> 
> >>> I don't want to discourage you but I don't think this would end up
> >>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> >>> and I've run into problems with this before while trying to do
> >>> something similar.
> >> 
> >> Discourage, discourage. Better now than later.
> >> 
> >> It will be relatively easy to extend the scheme to be per-VMA instead of
> >> per-table for architectures that prefer it this way. It does require
> >> TLB-generation tracking though, which Andy only implemented for x86, so I
> >> will focus on x86-64 right now.
> > 
> > Can you remind me of what we're missing on arm64 in this area, please? I'm
> > happy to help get this up and running once you have something I can build
> > on.
> 
> Let me first finish making something that we can use as a basis for a
> discussion. I do not waste your time before I have something ready.

Sure thing! Give me a shout when you're ready.

Will

Andy Lutomirski Jan. 13, 2021, 12:31 a.m. UTC | #113

> On Jan 12, 2021, at 2:29 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
>> 
>> On Jan 12, 2021, at 1:43 PM, Will Deacon <will@kernel.org> wrote:
>> 
>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
>>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>>>> The basic idea is to save a generation in the page-struct that tracks
>>>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>>>> completed. In addition, other users - such as mprotect - would use
>>>>>> the tlb_gather interface.
>>>>>> 
>>>>>> Unfortunately, due to limited space in page-struct this would only
>>>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>>> 
>>>>> I don't want to discourage you but I don't think this would end up
>>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>>>> and I've run into problems with this before while trying to do
>>>>> something similar.
>>> 
>>> Discourage, discourage. Better now than later.
>>> 
>>> It will be relatively easy to extend the scheme to be per-VMA instead of
>>> per-table for architectures that prefer it this way. It does require
>>> TLB-generation tracking though, which Andy only implemented for x86, so I
>>> will focus on x86-64 right now.
>> 
>> Can you remind me of what we're missing on arm64 in this area, please? I'm
>> happy to help get this up and running once you have something I can build
>> on.
> 
> Let me first finish making something that we can use as a basis for a
> discussion. I do not waste your time before I have something ready.

If you want a hand, let me know.  I suspect you understand the x86 code as well as I do at this point, though :)

Yu Zhao Jan. 17, 2021, 4:41 a.m. UTC | #114

On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> > > On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
> > > On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> > >> I will send an RFC soon for per-table deferred TLB flushes tracking.
> > >> The basic idea is to save a generation in the page-struct that tracks
> > >> when deferred PTE change took place, and track whenever a TLB flush
> > >> completed. In addition, other users - such as mprotect - would use
> > >> the tlb_gather interface.
> > >> 
> > >> Unfortunately, due to limited space in page-struct this would only
> > >> be possible for 64-bit (and my implementation is only for x86-64).
> > > 
> > > I don't want to discourage you but I don't think this would end up
> > > well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> > > and I've run into problems with this before while trying to do
> > > something similar.
> > 
> > Discourage, discourage. Better now than later.
> > 
> > It will be relatively easy to extend the scheme to be per-VMA instead of
> > per-table for architectures that prefer it this way. It does require
> > TLB-generation tracking though, which Andy only implemented for x86, so I
> > will focus on x86-64 right now.
> 
> Can you remind me of what we're missing on arm64 in this area, please? I'm
> happy to help get this up and running once you have something I can build
> on.

I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
Would it be something worth pursuing? Arm has been using mm_cpumask,
so it might not be too difficult I guess?

Nadav Amit Jan. 17, 2021, 7:32 a.m. UTC | #115

> On Jan 16, 2021, at 8:41 PM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>>> The basic idea is to save a generation in the page-struct that tracks
>>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>>> completed. In addition, other users - such as mprotect - would use
>>>>> the tlb_gather interface.
>>>>> 
>>>>> Unfortunately, due to limited space in page-struct this would only
>>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>> 
>>>> I don't want to discourage you but I don't think this would end up
>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>>> and I've run into problems with this before while trying to do
>>>> something similar.
>>> 
>>> Discourage, discourage. Better now than later.
>>> 
>>> It will be relatively easy to extend the scheme to be per-VMA instead of
>>> per-table for architectures that prefer it this way. It does require
>>> TLB-generation tracking though, which Andy only implemented for x86, so I
>>> will focus on x86-64 right now.
>> 
>> Can you remind me of what we're missing on arm64 in this area, please? I'm
>> happy to help get this up and running once you have something I can build
>> on.
> 
> I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
> Would it be something worth pursuing? Arm has been using mm_cpumask,
> so it might not be too difficult I guess?

[ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]

IIUC, there are at least two bugs in x86 implementation.

First, there is a missing memory barrier in tlbbatch_add_mm() between
inc_mm_tlb_gen() and the read of mm_cpumask().

Second, try_to_unmap_flush() clears flush_required after flushing. Another
thread can call set_tlb_ubc_flush_pending() after the flush and before
flush_required is cleared, and the indication that a TLB flush is pending
can be lost.

I am working on addressing these issues among others, but, as you already
saw, I am a bit slow.

On a different but related topic: Another thing that I noticed that Arm does
not do is batching TLB flushes across VMAs. Since Arm does not have its own
tlb_end_vma(), it uses the default tlb_end_vma(), which flushes each VMA
separately. Peter Zijlstra’s comment says that there are advantages in
flushing each VMA separately, but I am not sure it is better or intentional
(especially since x86 does not do so).

I am trying to remove the arch-specific tlb_end_vma() and have a config
option to control this behavior.

Again, sorry for being slow. I hope to send an RFC soon.

Yu Zhao Jan. 17, 2021, 9:16 a.m. UTC | #116

On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
> > On Jan 16, 2021, at 8:41 PM, Yu Zhao <yuzhao@google.com> wrote:
> > 
> > On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
> >> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> >>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
> >>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >>>>> The basic idea is to save a generation in the page-struct that tracks
> >>>>> when deferred PTE change took place, and track whenever a TLB flush
> >>>>> completed. In addition, other users - such as mprotect - would use
> >>>>> the tlb_gather interface.
> >>>>> 
> >>>>> Unfortunately, due to limited space in page-struct this would only
> >>>>> be possible for 64-bit (and my implementation is only for x86-64).
> >>>> 
> >>>> I don't want to discourage you but I don't think this would end up
> >>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> >>>> and I've run into problems with this before while trying to do
> >>>> something similar.
> >>> 
> >>> Discourage, discourage. Better now than later.
> >>> 
> >>> It will be relatively easy to extend the scheme to be per-VMA instead of
> >>> per-table for architectures that prefer it this way. It does require
> >>> TLB-generation tracking though, which Andy only implemented for x86, so I
> >>> will focus on x86-64 right now.
> >> 
> >> Can you remind me of what we're missing on arm64 in this area, please? I'm
> >> happy to help get this up and running once you have something I can build
> >> on.
> > 
> > I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
> > Would it be something worth pursuing? Arm has been using mm_cpumask,
> > so it might not be too difficult I guess?
> 
> [ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
> 
> IIUC, there are at least two bugs in x86 implementation.
> 
> First, there is a missing memory barrier in tlbbatch_add_mm() between
> inc_mm_tlb_gen() and the read of mm_cpumask().

In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has builtin barrier as its
comment says -- atomic update ops that return values are also full
memory barriers.

> Second, try_to_unmap_flush() clears flush_required after flushing. Another
> thread can call set_tlb_ubc_flush_pending() after the flush and before
> flush_required is cleared, and the indication that a TLB flush is pending
> can be lost.

This isn't a problem either because flush_required is per thread.

> I am working on addressing these issues among others, but, as you already
> saw, I am a bit slow.
> 
> On a different but related topic: Another thing that I noticed that Arm does
> not do is batching TLB flushes across VMAs. Since Arm does not have its own
> tlb_end_vma(), it uses the default tlb_end_vma(), which flushes each VMA
> separately. Peter Zijlstra’s comment says that there are advantages in
> flushing each VMA separately, but I am not sure it is better or intentional
> (especially since x86 does not do so).
> 
> I am trying to remove the arch-specific tlb_end_vma() and have a config
> option to control this behavior.

One thing worth noting is not all arm/arm64 hw versions support ranges.
(system_supports_tlb_range()). But IIUC what you are trying to do, this
isn't a problem.

> Again, sorry for being slow. I hope to send an RFC soon.

No worries. I brought it up only because I noticed it and didn't want
it to slip away.

Nadav Amit Jan. 17, 2021, 10:13 a.m. UTC | #117

> On Jan 17, 2021, at 1:16 AM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
>>> On Jan 16, 2021, at 8:41 PM, Yu Zhao <yuzhao@google.com> wrote:
>>> 
>>> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
>>>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
>>>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>>>>> The basic idea is to save a generation in the page-struct that tracks
>>>>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>>>>> completed. In addition, other users - such as mprotect - would use
>>>>>>> the tlb_gather interface.
>>>>>>> 
>>>>>>> Unfortunately, due to limited space in page-struct this would only
>>>>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>>>> 
>>>>>> I don't want to discourage you but I don't think this would end up
>>>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>>>>> and I've run into problems with this before while trying to do
>>>>>> something similar.
>>>>> 
>>>>> Discourage, discourage. Better now than later.
>>>>> 
>>>>> It will be relatively easy to extend the scheme to be per-VMA instead of
>>>>> per-table for architectures that prefer it this way. It does require
>>>>> TLB-generation tracking though, which Andy only implemented for x86, so I
>>>>> will focus on x86-64 right now.
>>>> 
>>>> Can you remind me of what we're missing on arm64 in this area, please? I'm
>>>> happy to help get this up and running once you have something I can build
>>>> on.
>>> 
>>> I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
>>> Would it be something worth pursuing? Arm has been using mm_cpumask,
>>> so it might not be too difficult I guess?
>> 
>> [ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
>> 
>> IIUC, there are at least two bugs in x86 implementation.
>> 
>> First, there is a missing memory barrier in tlbbatch_add_mm() between
>> inc_mm_tlb_gen() and the read of mm_cpumask().
> 
> In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has builtin barrier as its
> comment says -- atomic update ops that return values are also full
> memory barriers.

Yes, you are correct.

> 
>> Second, try_to_unmap_flush() clears flush_required after flushing. Another
>> thread can call set_tlb_ubc_flush_pending() after the flush and before
>> flush_required is cleared, and the indication that a TLB flush is pending
>> can be lost.
> 
> This isn't a problem either because flush_required is per thread.

Sorry, I meant mm->tlb_flush_batched . It is not per-thread.
flush_tlb_batched_pending() clears it after flush and indications that
set_tlb_ubc_flush_pending() sets in between can be lost.

>> I am working on addressing these issues among others, but, as you already
>> saw, I am a bit slow.
>> 
>> On a different but related topic: Another thing that I noticed that Arm does
>> not do is batching TLB flushes across VMAs. Since Arm does not have its own
>> tlb_end_vma(), it uses the default tlb_end_vma(), which flushes each VMA
>> separately. Peter Zijlstra’s comment says that there are advantages in
>> flushing each VMA separately, but I am not sure it is better or intentional
>> (especially since x86 does not do so).
>> 
>> I am trying to remove the arch-specific tlb_end_vma() and have a config
>> option to control this behavior.
> 
> One thing worth noting is not all arm/arm64 hw versions support ranges.
> (system_supports_tlb_range()). But IIUC what you are trying to do, this
> isn't a problem.

I just wanted to get rid of arch-specific tlb_start_vma() it in order to
cleanup the code (I am doing first the VMA-deferred tracking, as you asked).
While I was doing that, I noticed that Arm does per-VMA TLB flushes. I do
not know whether it is intentional, but it seems rather easy to get this
behavior unintentionally.

Yu Zhao Jan. 17, 2021, 7:25 p.m. UTC | #118

On Sun, Jan 17, 2021 at 02:13:43AM -0800, Nadav Amit wrote:
> > On Jan 17, 2021, at 1:16 AM, Yu Zhao <yuzhao@google.com> wrote:
> > 
> > On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
> >>> On Jan 16, 2021, at 8:41 PM, Yu Zhao <yuzhao@google.com> wrote:
> >>> 
> >>> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
> >>>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> >>>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
> >>>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >>>>>>> The basic idea is to save a generation in the page-struct that tracks
> >>>>>>> when deferred PTE change took place, and track whenever a TLB flush
> >>>>>>> completed. In addition, other users - such as mprotect - would use
> >>>>>>> the tlb_gather interface.
> >>>>>>> 
> >>>>>>> Unfortunately, due to limited space in page-struct this would only
> >>>>>>> be possible for 64-bit (and my implementation is only for x86-64).
> >>>>>> 
> >>>>>> I don't want to discourage you but I don't think this would end up
> >>>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> >>>>>> and I've run into problems with this before while trying to do
> >>>>>> something similar.
> >>>>> 
> >>>>> Discourage, discourage. Better now than later.
> >>>>> 
> >>>>> It will be relatively easy to extend the scheme to be per-VMA instead of
> >>>>> per-table for architectures that prefer it this way. It does require
> >>>>> TLB-generation tracking though, which Andy only implemented for x86, so I
> >>>>> will focus on x86-64 right now.
> >>>> 
> >>>> Can you remind me of what we're missing on arm64 in this area, please? I'm
> >>>> happy to help get this up and running once you have something I can build
> >>>> on.
> >>> 
> >>> I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
> >>> Would it be something worth pursuing? Arm has been using mm_cpumask,
> >>> so it might not be too difficult I guess?
> >> 
> >> [ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
> >> 
> >> IIUC, there are at least two bugs in x86 implementation.
> >> 
> >> First, there is a missing memory barrier in tlbbatch_add_mm() between
> >> inc_mm_tlb_gen() and the read of mm_cpumask().
> > 
> > In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has builtin barrier as its
> > comment says -- atomic update ops that return values are also full
> > memory barriers.
> 
> Yes, you are correct.
> 
> > 
> >> Second, try_to_unmap_flush() clears flush_required after flushing. Another
> >> thread can call set_tlb_ubc_flush_pending() after the flush and before
> >> flush_required is cleared, and the indication that a TLB flush is pending
> >> can be lost.
> > 
> > This isn't a problem either because flush_required is per thread.
> 
> Sorry, I meant mm->tlb_flush_batched . It is not per-thread.
> flush_tlb_batched_pending() clears it after flush and indications that
> set_tlb_ubc_flush_pending() sets in between can be lost.

Hmm, the PTL argument above flush_tlb_batched_pending() doesn't seem
to hold when USE_SPLIT_PTE_PTLOCKS is set. Do you have a reproducer?
KCSAN might be able to help in this case.

Nadav Amit Jan. 18, 2021, 2:49 a.m. UTC | #119

> On Jan 17, 2021, at 11:25 AM, Yu Zhao <yuzhao@google.com> wrote:
> 
> On Sun, Jan 17, 2021 at 02:13:43AM -0800, Nadav Amit wrote:
>>> On Jan 17, 2021, at 1:16 AM, Yu Zhao <yuzhao@google.com> wrote:
>>> 
>>> On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
>>>>> On Jan 16, 2021, at 8:41 PM, Yu Zhao <yuzhao@google.com> wrote:
>>>>> 
>>>>> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
>>>>>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>>>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <yuzhao@google.com> wrote:
>>>>>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>>>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>>>>>>> The basic idea is to save a generation in the page-struct that tracks
>>>>>>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>>>>>>> completed. In addition, other users - such as mprotect - would use
>>>>>>>>> the tlb_gather interface.
>>>>>>>>> 
>>>>>>>>> Unfortunately, due to limited space in page-struct this would only
>>>>>>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>>>>>> 
>>>>>>>> I don't want to discourage you but I don't think this would end up
>>>>>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>>>>>>> and I've run into problems with this before while trying to do
>>>>>>>> something similar.
>>>>>>> 
>>>>>>> Discourage, discourage. Better now than later.
>>>>>>> 
>>>>>>> It will be relatively easy to extend the scheme to be per-VMA instead of
>>>>>>> per-table for architectures that prefer it this way. It does require
>>>>>>> TLB-generation tracking though, which Andy only implemented for x86, so I
>>>>>>> will focus on x86-64 right now.
>>>>>> 
>>>>>> Can you remind me of what we're missing on arm64 in this area, please? I'm
>>>>>> happy to help get this up and running once you have something I can build
>>>>>> on.
>>>>> 
>>>>> I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
>>>>> Would it be something worth pursuing? Arm has been using mm_cpumask,
>>>>> so it might not be too difficult I guess?
>>>> 
>>>> [ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
>>>> 
>>>> IIUC, there are at least two bugs in x86 implementation.
>>>> 
>>>> First, there is a missing memory barrier in tlbbatch_add_mm() between
>>>> inc_mm_tlb_gen() and the read of mm_cpumask().
>>> 
>>> In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has builtin barrier as its
>>> comment says -- atomic update ops that return values are also full
>>> memory barriers.
>> 
>> Yes, you are correct.
>> 
>>>> Second, try_to_unmap_flush() clears flush_required after flushing. Another
>>>> thread can call set_tlb_ubc_flush_pending() after the flush and before
>>>> flush_required is cleared, and the indication that a TLB flush is pending
>>>> can be lost.
>>> 
>>> This isn't a problem either because flush_required is per thread.
>> 
>> Sorry, I meant mm->tlb_flush_batched . It is not per-thread.
>> flush_tlb_batched_pending() clears it after flush and indications that
>> set_tlb_ubc_flush_pending() sets in between can be lost.
> 
> Hmm, the PTL argument above flush_tlb_batched_pending() doesn't seem
> to hold when USE_SPLIT_PTE_PTLOCKS is set. Do you have a reproducer?
> KCSAN might be able to help in this case.

I do not have a reproducer. It is just based on my understanding of this
code.

I will give a short try for building a reproducer, although for some reason
“you guys” complain that my reproducers do not work for you (is it PTI that
I disable? idle=poll? running in a VM?). It is also not likely to be too
easy to build a reproducer that actually triggers a memory corruption.

Anyhow, apparently KCSAN has already shouted about this code, causing Qian
Cai to add "data_race()" to avoid KCSAN from shouting (9c1177b62a8c
"mm/rmap: annotate a data race at tlb_flush_batched”).

Note that Andrea asked me not to hijack this thread and have a different one
on this issue.

diff mbox series

Patch

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..c08c4055b051 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..7423808640ef 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -652,7 +652,15 @@  int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	/* Does the address range wrap, or is the span zero-sized? */
 	BUG_ON(start + len <= start);
 
-	mmap_read_lock(dst_mm);
+	/*
+	 * Although we do not change the VMA, we have to ensure deferred TLB
+	 * flushes are performed before page-faults can be handled. Otherwise
+	 * we can get inconsistent TLB state.
+	 */
+	if (enable_wp)
+		mmap_write_lock(dst_mm);
+	else
+		mmap_read_lock(dst_mm);
 
 	/*
 	 * If memory mappings are changing because of non-cooperative
@@ -686,6 +694,9 @@  int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 
 	err = 0;
 out_unlock:
-	mmap_read_unlock(dst_mm);
+	if (enable_wp)
+		mmap_write_unlock(dst_mm);
+	else
+		mmap_read_unlock(dst_mm);
 	return err;
 }