diff mbox series

[RFC,v2,1/2] mm/userfaultfd: fix memory corruption due to writeprotect

Message ID 20201225092529.3228466-2-namit@vmware.com (mailing list archive)
State New, archived
Headers show
Series mm: fix races due to deferred TLB flushes | expand

Commit Message

Nadav Amit Dec. 25, 2020, 9:25 a.m. UTC
From: Nadav Amit <namit@vmware.com>

Userfaultfd self-test fails occasionally, indicating a memory
corruption.

Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range() and defers
flushes, and since there is insufficient consideration of concurrent
deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from
the TLBs in wp_page_copy(), this flush takes place after the copy has
already been performed, and therefore changes of the page are possible
between the time of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses
a problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userrfaultfd code might actually cause a demotion of the architectural
PTE permission: when userfaultfd_writeprotect() unprotects memory
region, it unintentionally *clears* the RW-bit if it was already set.
Note that this unprotecting a PTE that is not write-protected is a valid
use-case: the userfaultfd monitor might ask to unprotect a region that
holds both write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0				cpu1			cpu2
----				----			----
							[ Writable PTE
							  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
				[ page-fault ]
				...
				wp_page_copy()
				 cow_user_page()
				  [ copy page ]
							[ write to old
							  page ]
				...
				 set_pte_at_notify()

A similar scenario can happen:

cpu0		cpu1		cpu2		cpu3
----		----		----		----
						[ Writable PTE
				  		  cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
		userfaultfd_writeprotect()
		[ write-unprotect ]
		[ deferred TLB flush]
				[ page-fault ]
				wp_page_copy()
				 cow_user_page()
				 [ copy page ]
				 ...		[ write to page ]
				set_pte_at_notify()

As Yu Zhao pointed, these races became more apparent since commit
09854ba94c6a ("mm: do_wp_page() simplification") which made
wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

Note that one might consider additional potentially dangerous scenarios,
which are not directly related to the deferred TLB flushes.  A memory
corruption might in theory occur if after the page is copied by
cow_user_page() and before the PTE is set, the PTE is write-unprotected
(by a concurrent page-fault handler) and then protected again (by
subsequent calls to userfaultfd_writeprotect() to protect and unprotect
the page). In practice, it seems that such scenarios cannot happen.

To resolve the aforementioned races, acquire mmap_lock for write when
write-protecting userfaultfd region using ioctl's. Keep acquiring
mmap_lock for read when unprotecting memory, but do not change the
write-bit set when performing userfaultfd write-unprotection.

This solution can introduce performance regression to userfaultfd
write-protection.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c    |  3 ++-
 mm/userfaultfd.c | 15 +++++++++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)

Comments

Peter Zijlstra Jan. 4, 2021, 12:22 p.m. UTC | #1
On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:

> The scenario that happens in selftests/vm/userfaultfd is as follows:
> 
> cpu0				cpu1			cpu2
> ----				----			----
> 							[ Writable PTE
> 							  cached in TLB ]
> userfaultfd_writeprotect()
> [ write-*unprotect* ]
> mwriteprotect_range()
> mmap_read_lock()
> change_protection()
> 
> change_protection_range()
> ...
> change_pte_range()
> [ *clear* “write”-bit ]
> [ defer TLB flushes ]
> 				[ page-fault ]
> 				...
> 				wp_page_copy()
> 				 cow_user_page()
> 				  [ copy page ]
> 							[ write to old
> 							  page ]
> 				...
> 				 set_pte_at_notify()

Yuck!

Isn't this all rather similar to the problem that resulted in the
tlb_flush_pending mess?

I still think that's all fundamentally buggered, the much saner solution
(IMO) would've been to make things wait for the pending flush, instead
of doing a local flush and fudging things like we do now.

Then the above could be fixed by having wp_page_copy() wait for the
pending invalidate (although a more fine-grained pending state would be
awesome).

The below probably doesn't compile and will probably cause massive
header fail at the very least, but does show the general.


diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 07d9acb5b19c..0210547ac424 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 	 *
 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending))
+		wake_up_var(&mm->tlb_flush_pending);
 }
 
 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
@@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }
 
+static inline void wait_tlb_flush_pending(struct mm_struct *mm)
+{
+	wait_var_event(&mm->tlb_flush_pending,
+		       atomic_read(&mm->tlb_flush_pending) == 0);
+}
+
 struct vm_fault;
 
 /**
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..3c36bca2972a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3087,6 +3087,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
+	wait_tlb_flush_pending(vma->vm_mm);
+
 	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
Andrea Arcangeli Jan. 4, 2021, 7:24 p.m. UTC | #2
Hello,

On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> 
> > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > 
> > cpu0				cpu1			cpu2
> > ----				----			----
> > 							[ Writable PTE
> > 							  cached in TLB ]
> > userfaultfd_writeprotect()
> > [ write-*unprotect* ]
> > mwriteprotect_range()
> > mmap_read_lock()
> > change_protection()
> > 
> > change_protection_range()
> > ...
> > change_pte_range()
> > [ *clear* “write”-bit ]
> > [ defer TLB flushes ]
> > 				[ page-fault ]
> > 				...
> > 				wp_page_copy()
> > 				 cow_user_page()
> > 				  [ copy page ]
> > 							[ write to old
> > 							  page ]
> > 				...
> > 				 set_pte_at_notify()
> 
> Yuck!
> 

Note, the above was posted before we figured out the details so it
wasn't showing the real deferred tlb flush that caused problems (the
one showed on the left causes zero issues).

The problematic one not pictured is the one of the wrprotect that has
to be running in another CPU which is also isn't picture above. More
accurate traces are posted later in the thread.

> Isn't this all rather similar to the problem that resulted in the
> tlb_flush_pending mess?
> 
> I still think that's all fundamentally buggered, the much saner solution
> (IMO) would've been to make things wait for the pending flush, instead

How do intend you wait in PT lock while the writer also has to take PT
lock repeatedly before it can do wake_up_var?

If you release the PT lock before calling wait_tlb_flush_pending it
all falls apart again.

This I guess explains why a local pte/hugepmd smp local invlpg is the
only working solution for this issue, similarly to how it's done in rmap.

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 07d9acb5b19c..0210547ac424 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>  	 *
>  	 * Therefore we must rely on tlb_flush_*() to guarantee order.
>  	 */
> -	atomic_dec(&mm->tlb_flush_pending);
> +	if (atomic_dec_and_test(&mm->tlb_flush_pending))
> +		wake_up_var(&mm->tlb_flush_pending);
>  }
>  
>  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>  }
>  
> +static inline void wait_tlb_flush_pending(struct mm_struct *mm)
> +{
> +	wait_var_event(&mm->tlb_flush_pending,
> +		       atomic_read(&mm->tlb_flush_pending) == 0);
> +}

I appreciate the effort in not regressing soft dirty and uffd-wp
writeprotect to disk-I/O spindle bandwidth and not using mmap_sem for
writing.

At the same time what was posted so far wasn't clean enough but it
wasn't even tested... if we abstract it in some clean way and we mark
all connected points (soft dirty, uffd-wp, the wrprotect page fault),
then I can be optimistic it will remain understandable when we look at
it again a few years down the road.

Or at the very least it can't get worse than the "tlb_flush_pending
mess" you mentioned above.

flush_tlb_batched_pending() has to be orthogonally re-reviewed for
those things Nadav pointed out. But I'd rather keep that review in a
separate thread since any bug in that code has zero connection to this
issue. The basic idea is similar but the methods and logic are
different and our flush here will be granular and it's going to be
only run if VM_SOFTDIRTY isn't set and soft dirty is compiled in, or
if VM_UFFD_WP is set. The flush_tlb_batched_pending is mm wide,
unconditional etc.. Pretty much all different.

Thanks,
Andrea
Nadav Amit Jan. 4, 2021, 7:35 p.m. UTC | #3
> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> Hello,
> 
> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>> 
>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>> 
>>> cpu0				cpu1			cpu2
>>> ----				----			----
>>> 							[ Writable PTE
>>> 							  cached in TLB ]
>>> userfaultfd_writeprotect()
>>> [ write-*unprotect* ]
>>> mwriteprotect_range()
>>> mmap_read_lock()
>>> change_protection()
>>> 
>>> change_protection_range()
>>> ...
>>> change_pte_range()
>>> [ *clear* “write”-bit ]
>>> [ defer TLB flushes ]
>>> 				[ page-fault ]
>>> 				...
>>> 				wp_page_copy()
>>> 				 cow_user_page()
>>> 				  [ copy page ]
>>> 							[ write to old
>>> 							  page ]
>>> 				...
>>> 				 set_pte_at_notify()
>> 
>> Yuck!
> 
> Note, the above was posted before we figured out the details so it
> wasn't showing the real deferred tlb flush that caused problems (the
> one showed on the left causes zero issues).

Actually it was posted after (note that this is v2). The aforementioned
scenario that Peter regards to is the one that I actually encountered (not
the second scenario that is “theoretical”). This scenario that Peter regards
is indeed more “stupid” in the sense that we should just not write-protect
the PTE on userfaultfd write-unprotect.

Let me know if I made any mistake in the description.

> The problematic one not pictured is the one of the wrprotect that has
> to be running in another CPU which is also isn't picture above. More
> accurate traces are posted later in the thread.

I think I included this scenario as well in the commit log (of v2). Let me
know if I screwed up and the description is not clear.
Andrea Arcangeli Jan. 4, 2021, 8:19 p.m. UTC | #4
On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > Hello,
> > 
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >> 
> >>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>> 
> >>> cpu0				cpu1			cpu2
> >>> ----				----			----
> >>> 							[ Writable PTE
> >>> 							  cached in TLB ]
> >>> userfaultfd_writeprotect()
> >>> [ write-*unprotect* ]
> >>> mwriteprotect_range()
> >>> mmap_read_lock()
> >>> change_protection()
> >>> 
> >>> change_protection_range()
> >>> ...
> >>> change_pte_range()
> >>> [ *clear* “write”-bit ]
> >>> [ defer TLB flushes ]
> >>> 				[ page-fault ]
> >>> 				...
> >>> 				wp_page_copy()
> >>> 				 cow_user_page()
> >>> 				  [ copy page ]
> >>> 							[ write to old
> >>> 							  page ]
> >>> 				...
> >>> 				 set_pte_at_notify()
> >> 
> >> Yuck!
> > 
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one showed on the left causes zero issues).
> 
> Actually it was posted after (note that this is v2). The aforementioned
> scenario that Peter regards to is the one that I actually encountered (not
> the second scenario that is “theoretical”). This scenario that Peter regards
> is indeed more “stupid” in the sense that we should just not write-protect
> the PTE on userfaultfd write-unprotect.
> 
> Let me know if I made any mistake in the description.

I didn't say there is a mistake. I said it is not showing the real
deferred tlb flush that cause problems.

The issue here is that we have a "defer tlb flush" that runs after
"write to old page".

If you look at the above, you're induced to think the "defer tlb
flush" that causes issues is the one in cpu0. It's not. That is
totally harmless.


> 
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which is also isn't picture above. More
> > accurate traces are posted later in the thread.
> 
> I think I included this scenario as well in the commit log (of v2). Let me
> know if I screwed up and the description is not clear.

Instead of not showing the real "defer tlb flush" in the trace and
then fixing it up in the comment, why don't you take the trace showing
the real problematic "defer tlb flush"? No need to reinvent it.

https://lkml.kernel.org/r/X+JJqK91plkBVisG@redhat.com

See here the detail underlined:

deferred tlb flush <- too late
XXXXXXXXXXXXXX BUG RACE window close here

This show the real deferred tlb flush, your v2 does not include it
instead.
Nadav Amit Jan. 4, 2021, 8:39 p.m. UTC | #5
> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>> 
>>> Hello,
>>> 
>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>> 
>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>>> 
>>>>> cpu0				cpu1			cpu2
>>>>> ----				----			----
>>>>> 							[ Writable PTE
>>>>> 							  cached in TLB ]
>>>>> userfaultfd_writeprotect()
>>>>> [ write-*unprotect* ]
>>>>> mwriteprotect_range()
>>>>> mmap_read_lock()
>>>>> change_protection()
>>>>> 
>>>>> change_protection_range()
>>>>> ...
>>>>> change_pte_range()
>>>>> [ *clear* “write”-bit ]
>>>>> [ defer TLB flushes ]
>>>>> 				[ page-fault ]
>>>>> 				...
>>>>> 				wp_page_copy()
>>>>> 				 cow_user_page()
>>>>> 				  [ copy page ]
>>>>> 							[ write to old
>>>>> 							  page ]
>>>>> 				...
>>>>> 				 set_pte_at_notify()
>>>> 
>>>> Yuck!
>>> 
>>> Note, the above was posted before we figured out the details so it
>>> wasn't showing the real deferred tlb flush that caused problems (the
>>> one showed on the left causes zero issues).
>> 
>> Actually it was posted after (note that this is v2). The aforementioned
>> scenario that Peter regards to is the one that I actually encountered (not
>> the second scenario that is “theoretical”). This scenario that Peter regards
>> is indeed more “stupid” in the sense that we should just not write-protect
>> the PTE on userfaultfd write-unprotect.
>> 
>> Let me know if I made any mistake in the description.
> 
> I didn't say there is a mistake. I said it is not showing the real
> deferred tlb flush that cause problems.
> 
> The issue here is that we have a "defer tlb flush" that runs after
> "write to old page".
> 
> If you look at the above, you're induced to think the "defer tlb
> flush" that causes issues is the one in cpu0. It's not. That is
> totally harmless.

I do not understand what you say. The deferred TLB flush on cpu0 *is* the
the one that causes the problem. The PTE is write-protected (although it is
a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
the page-fault (and copy), while cpu2 keeps writing to the source page. If
cpu0 did not defer the TLB flush, this problem would not happen.

>>> The problematic one not pictured is the one of the wrprotect that has
>>> to be running in another CPU which is also isn't picture above. More
>>> accurate traces are posted later in the thread.
>> 
>> I think I included this scenario as well in the commit log (of v2). Let me
>> know if I screwed up and the description is not clear.
> 
> Instead of not showing the real "defer tlb flush" in the trace and
> then fixing it up in the comment, why don't you take the trace showing
> the real problematic "defer tlb flush"? No need to reinvent it.

The scenario you mention is indeed identical to the second scenario I
mention in the commit log. I think the version I included is cleared since
it shows the write that triggers the corruption instead of discussing
“windows”, which might be less clear. Running copy_user_page() with stale
TLB is by itself not a bug if you detect it later (e.g., using pte_same()).

Note that my second scenario is also consistent in style with the first
scenario.

I am not married to my description and if you (and others) insist I would
copy-paste your version.

> This show the real deferred tlb flush, your v2 does not include it
> instead.

Are you talking about the first scenario (write-unprotect), the second one
(write-protect followed by write-unprotect), both? It seems to me that all
the deferred TLB flushes are mentioned at the point they are deferred. I can
add the point in which they are performed.
Andrea Arcangeli Jan. 4, 2021, 9:01 p.m. UTC | #6
On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >>>> 
> >>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>>>> 
> >>>>> cpu0				cpu1			cpu2
> >>>>> ----				----			----
> >>>>> 							[ Writable PTE
> >>>>> 							  cached in TLB ]
> >>>>> userfaultfd_writeprotect()
> >>>>> [ write-*unprotect* ]
> >>>>> mwriteprotect_range()
> >>>>> mmap_read_lock()
> >>>>> change_protection()
> >>>>> 
> >>>>> change_protection_range()
> >>>>> ...
> >>>>> change_pte_range()
> >>>>> [ *clear* “write”-bit ]
> >>>>> [ defer TLB flushes ]
> >>>>> 				[ page-fault ]
> >>>>> 				...
> >>>>> 				wp_page_copy()
> >>>>> 				 cow_user_page()
> >>>>> 				  [ copy page ]
> >>>>> 							[ write to old
> >>>>> 							  page ]
> >>>>> 				...
> >>>>> 				 set_pte_at_notify()
> >>>> 
> >>>> Yuck!
> >>> 
> >>> Note, the above was posted before we figured out the details so it
> >>> wasn't showing the real deferred tlb flush that caused problems (the
> >>> one showed on the left causes zero issues).
> >> 
> >> Actually it was posted after (note that this is v2). The aforementioned
> >> scenario that Peter regards to is the one that I actually encountered (not
> >> the second scenario that is “theoretical”). This scenario that Peter regards
> >> is indeed more “stupid” in the sense that we should just not write-protect
> >> the PTE on userfaultfd write-unprotect.
> >> 
> >> Let me know if I made any mistake in the description.
> > 
> > I didn't say there is a mistake. I said it is not showing the real
> > deferred tlb flush that cause problems.
> > 
> > The issue here is that we have a "defer tlb flush" that runs after
> > "write to old page".
> > 
> > If you look at the above, you're induced to think the "defer tlb
> > flush" that causes issues is the one in cpu0. It's not. That is
> > totally harmless.
> 
> I do not understand what you say. The deferred TLB flush on cpu0 *is* the
> the one that causes the problem. The PTE is write-protected (although it is
> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
> the page-fault (and copy), while cpu2 keeps writing to the source page. If
> cpu0 did not defer the TLB flush, this problem would not happen.

Your argument "If cpu0 did not defer the TLB flush, this problem would
not happen" is identical to "if the cpu0 has a small TLB size and the
tlb entry is recycled, the problem would not happen".

There are a multitude of factors that are unrelated to the real
problematic deferred tlb flush that may happen and still won't cause
the issue, including a parallel IPI.

The point is that we don't need to worry about the "defer TLB flushes"
of the un-wrprotect, because you said earlier that deferring tlb
flushes when you're doing "permission promotions" does not cause
problems.

The only "deferred tlb flush" we need to worry about, not in the
picture, is the one following the actual permission removal (the
wrprotection).

> it shows the write that triggers the corruption instead of discussing
> “windows”, which might be less clear. Running copy_user_page() with stale

I think showing exactly where the race window opens is key to
understand the issue, but then that's the way I work and feel free to
think it in any other way that may sound simpler.

I just worried people thinks the deferred tlb flush in your v2 trace
is the one that causes problem when obviously it's not since it
follows a permission promotion. Once that is clear, feel free to
reject my trace.

All I care about is that performance don't regress from CPU-speed to
disk I/O spindle speed, for soft dirty and uffd-wp.

> I am not married to my description and if you (and others) insist I would
> copy-paste your version.

I definitely don't insist, I only wanted to clarify in case it may not
have been clear the problematic deferred tlb flush wasn't part of your
trace.

> Are you talking about the first scenario (write-unprotect), the second one
> (write-protect followed by write-unprotect), both? It seems to me that all
> the deferred TLB flushes are mentioned at the point they are deferred. I can
> add the point in which they are performed.

The only case that has an issue for uffd-wp is in my trace and only
ever happens if there's a wrprotect in flight, the deferred tlb flush
of the wrprotect is deferred (and that's the problematic one that
closes the window when it finally runs) and un-wrprotect runs. The
window opens when the un-wrprotect unlocks the PT lock. The deferred
tlb flush of un-wrprotect is as relevant for this race, as random tlb
flushes from IPI or the TLB being small or none.

Thanks,
Andrea
Nadav Amit Jan. 4, 2021, 9:26 p.m. UTC | #7
> On Jan 4, 2021, at 1:01 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>> 
>>> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>>>> 
>>>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>>>>> 
>>>>>>> cpu0				cpu1			cpu2
>>>>>>> ----				----			----
>>>>>>> 							[ Writable PTE
>>>>>>> 							  cached in TLB ]
>>>>>>> userfaultfd_writeprotect()
>>>>>>> [ write-*unprotect* ]
>>>>>>> mwriteprotect_range()
>>>>>>> mmap_read_lock()
>>>>>>> change_protection()
>>>>>>> 
>>>>>>> change_protection_range()
>>>>>>> ...
>>>>>>> change_pte_range()
>>>>>>> [ *clear* “write”-bit ]
>>>>>>> [ defer TLB flushes ]
>>>>>>> 				[ page-fault ]
>>>>>>> 				...
>>>>>>> 				wp_page_copy()
>>>>>>> 				 cow_user_page()
>>>>>>> 				  [ copy page ]
>>>>>>> 							[ write to old
>>>>>>> 							  page ]
>>>>>>> 				...
>>>>>>> 				 set_pte_at_notify()
>>>>>> 
>>>>>> Yuck!
>>>>> 
>>>>> Note, the above was posted before we figured out the details so it
>>>>> wasn't showing the real deferred tlb flush that caused problems (the
>>>>> one showed on the left causes zero issues).
>>>> 
>>>> Actually it was posted after (note that this is v2). The aforementioned
>>>> scenario that Peter regards to is the one that I actually encountered (not
>>>> the second scenario that is “theoretical”). This scenario that Peter regards
>>>> is indeed more “stupid” in the sense that we should just not write-protect
>>>> the PTE on userfaultfd write-unprotect.
>>>> 
>>>> Let me know if I made any mistake in the description.
>>> 
>>> I didn't say there is a mistake. I said it is not showing the real
>>> deferred tlb flush that cause problems.
>>> 
>>> The issue here is that we have a "defer tlb flush" that runs after
>>> "write to old page".
>>> 
>>> If you look at the above, you're induced to think the "defer tlb
>>> flush" that causes issues is the one in cpu0. It's not. That is
>>> totally harmless.
>> 
>> I do not understand what you say. The deferred TLB flush on cpu0 *is* the
>> the one that causes the problem. The PTE is write-protected (although it is
>> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
>> the page-fault (and copy), while cpu2 keeps writing to the source page. If
>> cpu0 did not defer the TLB flush, this problem would not happen.
> 
> Your argument "If cpu0 did not defer the TLB flush, this problem would
> not happen" is identical to "if the cpu0 has a small TLB size and the
> tlb entry is recycled, the problem would not happen".
> 
> There are a multitude of factors that are unrelated to the real
> problematic deferred tlb flush that may happen and still won't cause
> the issue, including a parallel IPI.
> 
> The point is that we don't need to worry about the "defer TLB flushes"
> of the un-wrprotect, because you said earlier that deferring tlb
> flushes when you're doing "permission promotions" does not cause
> problems.
> 
> The only "deferred tlb flush" we need to worry about, not in the
> picture, is the one following the actual permission removal (the
> wrprotection).

I think you are missing the point of this scenario, which is different than
the second scenario.

In this scenario, change_pte_range(), when called to do userfaultfd’s
*unprotect* operation, did not preserve the write-bit if it was already set.
Instead change_pte_range() *cleared* the write-bit. So upon a logical
permission promotion operation - userfaultfd *unprotect* - you got a
physical permission demotion, turning RW PTEs into RO.

This problem is fully resolved by this part of the patch:

--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
		oldpte = *pte;
		if (pte_present(oldpte)) {
			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);

You can argue that this not directly related to the deferred TLB flush, as
once this chunk is added, a TLB flush would not be needed at all for
userfaultfd-unprotect. But I consider it a part of the problem, especially
since this is what triggered the userfaultfd self-tests to fail.

>> it shows the write that triggers the corruption instead of discussing
>> “windows”, which might be less clear. Running copy_user_page() with stale
> 
> I think showing exactly where the race window opens is key to
> understand the issue, but then that's the way I work and feel free to
> think it in any other way that may sound simpler.
> 
> I just worried people thinks the deferred tlb flush in your v2 trace
> is the one that causes problem when obviously it's not since it
> follows a permission promotion. Once that is clear, feel free to
> reject my trace.
> 
> All I care about is that performance don't regress from CPU-speed to
> disk I/O spindle speed, for soft dirty and uffd-wp.

I would feel more comfortable if you provide patches for uffd-wp. If you
want, I will do it, but I restate that I do not feel comfortable with this
solution (worried as it seems a bit ad-hoc and might leave out a scenario
we all missed or cause a TLB shootdown storm).

As for soft-dirty, I thought that you said that you do not see a better
(backportable) solution for soft-dirty. Correct me if I am wrong.

Anyhow, I will add your comments regarding the stale TLB window to make the
description clearer.
Peter Zijlstra Jan. 5, 2021, 8:13 a.m. UTC | #8
On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> The problematic one not pictured is the one of the wrprotect that has
> to be running in another CPU which is also isn't picture above. More
> accurate traces are posted later in the thread.

What thread? I don't seem to have discovered it yet, and the cover
letter from Nadav doesn't seem to have a msgid linking it either.
Nadav Amit Jan. 5, 2021, 8:52 a.m. UTC | #9
> On Jan 5, 2021, at 12:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
>> The problematic one not pictured is the one of the wrprotect that has
>> to be running in another CPU which is also isn't picture above. More
>> accurate traces are posted later in the thread.
> 
> What thread? I don't seem to have discovered it yet, and the cover
> letter from Nadav doesn't seem to have a msgid linking it either.

Sorry for that:

https://lore.kernel.org/lkml/X+K7JMrTEC9SpVIB@google.com/T/
Peter Zijlstra Jan. 5, 2021, 8:58 a.m. UTC | #10
On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > 
> > > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > > 
> > > cpu0				cpu1			cpu2
> > > ----				----			----
> > > 							[ Writable PTE
> > > 							  cached in TLB ]
> > > userfaultfd_writeprotect()
> > > [ write-*unprotect* ]
> > > mwriteprotect_range()
> > > mmap_read_lock()
> > > change_protection()
> > > 
> > > change_protection_range()
> > > ...
> > > change_pte_range()
> > > [ *clear* “write”-bit ]
> > > [ defer TLB flushes ]
> > > 				[ page-fault ]
> > > 				...
> > > 				wp_page_copy()
> > > 				 cow_user_page()
> > > 				  [ copy page ]
> > > 							[ write to old
> > > 							  page ]
> > > 				...
> > > 				 set_pte_at_notify()
> > 
> > Yuck!
> > 
> 
> Note, the above was posted before we figured out the details so it
> wasn't showing the real deferred tlb flush that caused problems (the
> one showed on the left causes zero issues).
> 
> The problematic one not pictured is the one of the wrprotect that has
> to be running in another CPU which is also isn't picture above. More
> accurate traces are posted later in the thread.

Lets assume CPU0 does a read-lock, W -> RO with deferred flush.

> > Isn't this all rather similar to the problem that resulted in the
> > tlb_flush_pending mess?
> > 
> > I still think that's all fundamentally buggered, the much saner solution
> > (IMO) would've been to make things wait for the pending flush, instead
> 
> How do intend you wait in PT lock while the writer also has to take PT
> lock repeatedly before it can do wake_up_var?
> 
> If you release the PT lock before calling wait_tlb_flush_pending it
> all falls apart again.

I suppose you can check for pending, if found, release lock, wait for 0,
and re-take the fault?

> This I guess explains why a local pte/hugepmd smp local invlpg is the
> only working solution for this issue, similarly to how it's done in rmap.

In that case a local invalidate on CPU1 simply doesn't help anything.

CPU1 needs to do a global invalidate or wait for the in-progress one to
complete, such that CPU2 is sure to not have a W entry left before CPU1
goes and copies the page.
Nadav Amit Jan. 5, 2021, 9:22 a.m. UTC | #11
> On Jan 5, 2021, at 12:58 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>> 
>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>> 
>>>> cpu0				cpu1			cpu2
>>>> ----				----			----
>>>> 							[ Writable PTE
>>>> 							  cached in TLB ]
>>>> userfaultfd_writeprotect()
>>>> [ write-*unprotect* ]
>>>> mwriteprotect_range()
>>>> mmap_read_lock()
>>>> change_protection()
>>>> 
>>>> change_protection_range()
>>>> ...
>>>> change_pte_range()
>>>> [ *clear* “write”-bit ]
>>>> [ defer TLB flushes ]
>>>> 				[ page-fault ]
>>>> 				...
>>>> 				wp_page_copy()
>>>> 				 cow_user_page()
>>>> 				  [ copy page ]
>>>> 							[ write to old
>>>> 							  page ]
>>>> 				...
>>>> 				 set_pte_at_notify()
>>> 
>>> Yuck!
>> 
>> Note, the above was posted before we figured out the details so it
>> wasn't showing the real deferred tlb flush that caused problems (the
>> one showed on the left causes zero issues).
>> 
>> The problematic one not pictured is the one of the wrprotect that has
>> to be running in another CPU which is also isn't picture above. More
>> accurate traces are posted later in the thread.
> 
> Lets assume CPU0 does a read-lock, W -> RO with deferred flush.

This is the second scenario that is mentioned in the patch. (The first one
is relatively easy to address by not clearing the write-bit).

>>> Isn't this all rather similar to the problem that resulted in the
>>> tlb_flush_pending mess?
>>> 
>>> I still think that's all fundamentally buggered, the much saner solution
>>> (IMO) would've been to make things wait for the pending flush, instead
>> 
>> How do intend you wait in PT lock while the writer also has to take PT
>> lock repeatedly before it can do wake_up_var?
>> 
>> If you release the PT lock before calling wait_tlb_flush_pending it
>> all falls apart again.
> 
> I suppose you can check for pending, if found, release lock, wait for 0,
> and re-take the fault?

My personal take on this issue (which for full disclosure I think Andrea
disagrees with) is that it the most important enhancement is to reduce the
number of cases which we mistakenly think that we must wait for pending TLB
flush. It will not be free though.

As to the enhancement that you propose: although it seems as a valid
enhancement to me, I think that it is more robust to make forward progress
when possible (as done today). This is especially important if the proposed
enhancement cannot be checked by lockdep.
Peter Zijlstra Jan. 5, 2021, 2:26 p.m. UTC | #12
On Tue, Jan 05, 2021 at 12:52:48AM -0800, Nadav Amit wrote:
> > On Jan 5, 2021, at 12:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> >> The problematic one not pictured is the one of the wrprotect that has
> >> to be running in another CPU which is also isn't picture above. More
> >> accurate traces are posted later in the thread.
> > 
> > What thread? I don't seem to have discovered it yet, and the cover
> > letter from Nadav doesn't seem to have a msgid linking it either.
> 
> Sorry for that:
> 
> https://lore.kernel.org/lkml/X+K7JMrTEC9SpVIB@google.com/T/

Much reading later..

OK, go with the write-lock for now.
Peter Xu Jan. 5, 2021, 3:08 p.m. UTC | #13
On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ab709023e9aa..c08c4055b051 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		oldpte = *pte;
>  		if (pte_present(oldpte)) {
>  			pte_t ptent;
> -			bool preserve_write = prot_numa && pte_write(oldpte);
> +			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
> +					      pte_write(oldpte);

Irrelevant of the other tlb issue, this is a standalone one and I commented in
v1 about simply ignore the change if necessary; unluckily that seems to be
ignored..  so I'll try again - would below be slightly better?

    if (uffd_wp_resolve && !pte_uffd_wp(oldpte))
        continue;

Firstly, current patch is confusing at least to me, because "uffd_wp_resolve"
means "unprotect the pte", whose write bit should mostly be cleared already
when uffd_wp_resolve is applicable.  Then "preserve_write" for that pte looks
odd already.

Meanwhile, if that really happens (when pte write bit set, but during a
uffd_wp_resolve request) imho there is really nothing we can do, so we should
simply avoid touching that at all, and also avoid ptep_modify_prot_start,
pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost.

Thanks,
Andrea Arcangeli Jan. 5, 2021, 5:58 p.m. UTC | #14
On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > > 
> > > > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > > > 
> > > > cpu0				cpu1			cpu2
> > > > ----				----			----
> > > > 							[ Writable PTE
> > > > 							  cached in TLB ]
> > > > userfaultfd_writeprotect()
> > > > [ write-*unprotect* ]
> > > > mwriteprotect_range()
> > > > mmap_read_lock()
> > > > change_protection()
> > > > 
> > > > change_protection_range()
> > > > ...
> > > > change_pte_range()
> > > > [ *clear* “write”-bit ]
> > > > [ defer TLB flushes ]
> > > > 				[ page-fault ]
> > > > 				...
> > > > 				wp_page_copy()
> > > > 				 cow_user_page()
> > > > 				  [ copy page ]
> > > > 							[ write to old
> > > > 							  page ]
> > > > 				...
> > > > 				 set_pte_at_notify()
> > > 
> > > Yuck!
> > > 
> > 
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one showed on the left causes zero issues).
> > 
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which is also isn't picture above. More
> > accurate traces are posted later in the thread.
> 
> Lets assume CPU0 does a read-lock, W -> RO with deferred flush.

I was mistaken saying the deferred tlb flush was not shown in the v2
trace, just this appears a new different case we didn't happen to
consider before.

In the previous case we discussed earlier, when un-wrprotect above is
called it never should have been a W->RO since a wrprotect run first.

Doesn't it ring a bell that if an un-wrprotect does a W->RO
transition, something is a bit going backwards?

I don't recall from previous discussion that un-wrprotect was
considered as called on read-write memory. I think we need the below
change to fix this new case.

			if (uffd_wp) {
+				if (unlikely(pte_uffd_wp(oldpte)))
+					continue;
				ptent = pte_wrprotect(ptent);
				ptent = pte_mkuffd_wp(ptent);
			} else if (uffd_wp_resolve) {
+				if (unlikely(!pte_uffd_wp(oldpte)))
+					continue;
				/*
				 * Leave the write bit to be handled
				 * by PF interrupt handler, then
				 * things like COW could be properly
				 * handled.
				 */
				ptent = pte_clear_uffd_wp(ptent);
			}

I now get why the v2 patch touches preserved_write, but this is not
about preserve_write, it's not about leaving the write bit alone. This
is about leaving the whole pte alone if the uffd-wp bit doesn't
actually change.

We shouldn't just defer the tlb flush if un-wprotect is called on
read-write memory: we should not have flushed the tlb at all in such
case.

Same for hugepmd in huge_memory.c which will be somewhere else.

Once the above is optimized, then un-wrprotect as in
MM_CP_UFFD_WP_RESOLVE is usually preceded by wrprotect as in
MM_CP_UFFD_WP, and so it'll never be a W->RO but a RO->RO transition
that just clears the uffd_wp flag and nothing else and whose tlb flush
is in turn irrelevant.

The fix discussed still works for this new case too: I'm not
suggesting we should rely on the above optimization for the tlb
safety. The above is just a missing optimization.

> > > Isn't this all rather similar to the problem that resulted in the
> > > tlb_flush_pending mess?
> > > 
> > > I still think that's all fundamentally buggered, the much saner solution
> > > (IMO) would've been to make things wait for the pending flush, instead
> > 
> > How do intend you wait in PT lock while the writer also has to take PT
> > lock repeatedly before it can do wake_up_var?
> > 
> > If you release the PT lock before calling wait_tlb_flush_pending it
> > all falls apart again.
> 
> I suppose you can check for pending, if found, release lock, wait for 0,
> and re-take the fault?

Aborting the page fault unconditionally while MADV_DONTNEED is running
on some other unrelated vma, sounds not desirable.

Doing it only for !VM_SOFTDIRTY or soft dirty not compiled in sounds
less bad but it would still mean that while clear_refs is running, no
thread can write to any anon memory of the process.

> > This I guess explains why a local pte/hugepmd smp local invlpg is the
> > only working solution for this issue, similarly to how it's done in rmap.
> 
> In that case a local invalidate on CPU1 simply doesn't help anything.
> 
> CPU1 needs to do a global invalidate or wait for the in-progress one to
> complete, such that CPU2 is sure to not have a W entry left before CPU1
> goes and copies the page.

Yes, it was a global invlpg, definitely not local sorry for the
confusion, as in the PoC posted here which needs cleaning up:

       https://lkml.kernel.org/r/X+QLr1WmGXMs33Ld@redhat.com

+                       flush_tlb_page(vma, vmf->address);

I think instead of the flush_tlb_page above, we just need an
ad-hoc abstraction there.

The added complexity to the page fault common code consist in having
to call such abstract call in the right place of the page fault.

The vm_flags to check will be the same for both the flush_tlb_page and
the wait_tlb_pending approaches.

Once the filter on vm_flags pass, the only difference is between
"flush_tlb_page; return void" or "PT unlock; wait_; return
VM_FAULT_RETRY" so it looks more an implementation detail with a
different tradeoff at runtime.

Thanks,
Andrea
Andrea Arcangeli Jan. 5, 2021, 6:08 p.m. UTC | #15
On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index ab709023e9aa..c08c4055b051 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  		oldpte = *pte;
> >  		if (pte_present(oldpte)) {
> >  			pte_t ptent;
> > -			bool preserve_write = prot_numa && pte_write(oldpte);
> > +			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
> > +					      pte_write(oldpte);
> 
> Irrelevant of the other tlb issue, this is a standalone one and I commented in
> v1 about simply ignore the change if necessary; unluckily that seems to be
> ignored..  so I'll try again - would below be slightly better?
> 
>     if (uffd_wp_resolve && !pte_uffd_wp(oldpte))
>         continue;

I posted the exact same code before seeing the above so I take it as a good
sign :). I'd suggest to add the reverse check to the uffd_wp too.

> Firstly, current patch is confusing at least to me, because "uffd_wp_resolve"
> means "unprotect the pte", whose write bit should mostly be cleared already
> when uffd_wp_resolve is applicable.  Then "preserve_write" for that pte looks
> odd already.
> 
> Meanwhile, if that really happens (when pte write bit set, but during a
> uffd_wp_resolve request) imho there is really nothing we can do, so we should
> simply avoid touching that at all, and also avoid ptep_modify_prot_start,
> pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost.

Agreed. It should not just defer the flush, by doing continue we will
not flush anything.

So ultimately the above will be an orthogonal optimization, but now I
get the why the deferred tlb flush on the cpu0 of the v2 patch was the
problematic one. I didn't see we lacked the above optimization and I
thought we were discussing still the regular case where un-wrprotect
is called on a pte with uffd-wp set.

thanks,
Andrea
Peter Xu Jan. 5, 2021, 6:41 p.m. UTC | #16
On Tue, Jan 05, 2021 at 01:08:48PM -0500, Andrea Arcangeli wrote:
> On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote:
> > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index ab709023e9aa..c08c4055b051 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > >  		oldpte = *pte;
> > >  		if (pte_present(oldpte)) {
> > >  			pte_t ptent;
> > > -			bool preserve_write = prot_numa && pte_write(oldpte);
> > > +			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
> > > +					      pte_write(oldpte);
> > 
> > Irrelevant of the other tlb issue, this is a standalone one and I commented in
> > v1 about simply ignore the change if necessary; unluckily that seems to be
> > ignored..  so I'll try again - would below be slightly better?
> > 
> >     if (uffd_wp_resolve && !pte_uffd_wp(oldpte))
> >         continue;
> 
> I posted the exact same code before seeing the above so I take it as a good
> sign :). I'd suggest to add the reverse check to the uffd_wp too.

Agreed. I didn't mention uffd_wp check (which I actually mentioned in the reply
to v1 patchset) here only because the uffd_wp check is pure optimization; while
the uffd_wp_resolve check is more critical because it is potentially a fix of
similar tlb flushing issue where we could have demoted the pte without being
noticed, so I think it's indeed more important as Nadav wanted to fix in the
same patch.

It would be even nicer if we have both covered (all of them can be in
unlikely() as Andrea suggested in the other email), then maybe nicer as a
standalone patch, then mention about the difference of the two in the commit
log (majorly, the resolving change will be more than optimization).

Thanks,
Andrea Arcangeli Jan. 5, 2021, 6:45 p.m. UTC | #17
On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote:
> I would feel more comfortable if you provide patches for uffd-wp. If you
> want, I will do it, but I restate that I do not feel comfortable with this
> solution (worried as it seems a bit ad-hoc and might leave out a scenario
> we all missed or cause a TLB shootdown storm).
>
> As for soft-dirty, I thought that you said that you do not see a better
> (backportable) solution for soft-dirty. Correct me if I am wrong.

I think they should use the same technique, since they deal with the
exact same challenge. I will try to cleanup the patch in the meantime.

I can also try to do the additional cleanups to clear_refs to
eliminate the tlb_gather completely since it doesn't gather any page
and it has no point in using it.

> Anyhow, I will add your comments regarding the stale TLB window to make the
> description clearer.

Having the mmap_write_lock solution as backup won't hurt, but I think
it's only for planB if planA doesn't work and the only stable tree
that will have to apply this is v5.9.x. All previous don't need any
change in this respect. So there's no worry of rejects.

It worked by luck until Aug 2020, but it did so reliably or somebody
would have noticed already. And it's not exploitable either, it just
works stable, but it was prone to break if the kernel changed in some
other way, and it eventually changed in Aug 2020 when an unrelated
patch happened to the reuse logic.

If you want to maintain the mmap_write_lock patch if you could drop
the preserved_write and adjust the Fixes to target Aug 2020 it'd be
ideal. The uffd-wp needs a different optimization that maybe Peter is
already working on or I can include in the patchset for this, but
definitely in a separate commit because it's orthogonal.

It's great you noticed the W->RO transition of un-wprotect so we can
optimize that too (it will have a positive runtime effect, it's not
just theoretical since it's normal to unwrprotect a huge range once
the postcopy snapshotting of the virtual machine is complete), I was
thinking at the previous case discussed in the other thread.

I just don't like to slow down a feature required in the future for
implementing postcopy live snapshotting or other snapshots to userland
processes (for the non-KVM case, also unprivileged by default if using
bounce buffers to feed the syscalls) that can be used by open source
hypervisors to beat proprietary hypervisors like vmware.

The security concern of uffd-wp that allows to enlarge the window of
use-after-free kernel bugs, is not as a concern as it is for regular
processes. First the jailer model can obtain the uffd before dropping
all caps and before firing up seccomp in the child, so it won't even
require to lift the unprivileged_userfaultfd in the superior and
cleaner monolithic jailer model.

If the uffd and uffd-wp can only run in rust-vmm and qemu, that
userland is system software to be trusted as the kernel from the guest
point of view. It's similar to fuse, if somebody gets into the fuse
process it can also stop the kernel initiated faults. From that
respect fuse is also system software despite it runs in userland.

In other words I think if there's a vm-escape that takes control of
rust-vmm userland, the last worry is the fact it can stop kernel
initiated page faults because the jailer took an uffd before drop privs.

Thanks,
Andrea
Andrea Arcangeli Jan. 5, 2021, 6:55 p.m. UTC | #18
On Tue, Jan 05, 2021 at 01:41:34PM -0500, Peter Xu wrote:
> Agreed. I didn't mention uffd_wp check (which I actually mentioned in the reply
> to v1 patchset) here only because the uffd_wp check is pure optimization; while

Agreed it's a pure optimization.

Only if we used the group lock to fix this (which we didn't since it
wouldn't help clear_refs to avoid the performance regression), the
optimization would have become not an optimization anymore.

> the uffd_wp_resolve check is more critical because it is potentially a fix of
> similar tlb flushing issue where we could have demoted the pte without being
> noticed, so I think it's indeed more important as Nadav wanted to fix in the
> same patch.

I didn't get why that was touched in the same patch, I already
suggested to remove that optimization...

> It would be even nicer if we have both covered (all of them can be in
> unlikely() as Andrea suggested in the other email), then maybe nicer as a
> standalone patch, then mention about the difference of the two in the commit
> log (majorly, the resolving change will be more than optimization).

Yes, if you want to go ahead optimizing both cases of the
UFFDIO_WRITEPROTECT, I don't think there's any dependency on this. The
huge_memory.c also needs covering but I didn't look at it, hopefully
the code will result as clean as in the pte case.

I'll try to cleanup the tlb flush in the meantime to see if it look
maintainable after the cleanups.

Then we can change it to wait_pending_flush(); return VM_FAULT_RETRY
model if we want to or if the IPI is slower, at least clear_refs will
still not block on random pagein or swapin from disk, but only anon
memory write access will block while clear_refs run.

Thanks,
Andrea
Nadav Amit Jan. 5, 2021, 7:05 p.m. UTC | #19
> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote:
>> I would feel more comfortable if you provide patches for uffd-wp. If you
>> want, I will do it, but I restate that I do not feel comfortable with this
>> solution (worried as it seems a bit ad-hoc and might leave out a scenario
>> we all missed or cause a TLB shootdown storm).
>> 
>> As for soft-dirty, I thought that you said that you do not see a better
>> (backportable) solution for soft-dirty. Correct me if I am wrong.
> 
> I think they should use the same technique, since they deal with the
> exact same challenge. I will try to cleanup the patch in the meantime.
> 
> I can also try to do the additional cleanups to clear_refs to
> eliminate the tlb_gather completely since it doesn't gather any page
> and it has no point in using it.
> 
>> Anyhow, I will add your comments regarding the stale TLB window to make the
>> description clearer.
> 
> Having the mmap_write_lock solution as backup won't hurt, but I think
> it's only for planB if planA doesn't work and the only stable tree
> that will have to apply this is v5.9.x. All previous don't need any
> change in this respect. So there's no worry of rejects.
> 
> It worked by luck until Aug 2020, but it did so reliably or somebody
> would have noticed already. And it's not exploitable either, it just
> works stable, but it was prone to break if the kernel changed in some
> other way, and it eventually changed in Aug 2020 when an unrelated
> patch happened to the reuse logic.
> 
> If you want to maintain the mmap_write_lock patch if you could drop
> the preserved_write and adjust the Fixes to target Aug 2020 it'd be
> ideal. The uffd-wp needs a different optimization that maybe Peter is
> already working on or I can include in the patchset for this, but
> definitely in a separate commit because it's orthogonal.
> 
> It's great you noticed the W->RO transition of un-wprotect so we can
> optimize that too (it will have a positive runtime effect, it's not
> just theoretical since it's normal to unwrprotect a huge range once
> the postcopy snapshotting of the virtual machine is complete), I was
> thinking at the previous case discussed in the other thread.

Understood. I will separate it to a different patch and use your version.
I am sorry that I missed Peter Xu feedback for that. As I understand that
this will not be backported, I will see if I can get rid of the TLB flush
and the inc_tlb_flush_pending() for write-unprotect case as well (which
I think I mentioned before).

> 
> I just don't like to slow down a feature required in the future for
> implementing postcopy live snapshotting or other snapshots to userland
> processes (for the non-KVM case, also unprivileged by default if using
> bounce buffers to feed the syscalls) that can be used by open source
> hypervisors to beat proprietary hypervisors like vmware.

Ouch, that’s uncalled for. I am sure that you understand that I have no
hidden agenda and we all have the same goal.

Regards,
Nadav
Nadav Amit Jan. 5, 2021, 7:07 p.m. UTC | #20
> On Jan 5, 2021, at 7:08 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index ab709023e9aa..c08c4055b051 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> 		oldpte = *pte;
>> 		if (pte_present(oldpte)) {
>> 			pte_t ptent;
>> -			bool preserve_write = prot_numa && pte_write(oldpte);
>> +			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
>> +					      pte_write(oldpte);
> 
> Irrelevant of the other tlb issue, this is a standalone one and I commented in
> v1 about simply ignore the change if necessary; unluckily that seems to be
> ignored..  so I'll try again - would below be slightly better?
> 
>    if (uffd_wp_resolve && !pte_uffd_wp(oldpte))
>        continue;
> 
> Firstly, current patch is confusing at least to me, because "uffd_wp_resolve"
> means "unprotect the pte", whose write bit should mostly be cleared already
> when uffd_wp_resolve is applicable.  Then "preserve_write" for that pte looks
> odd already.
> 
> Meanwhile, if that really happens (when pte write bit set, but during a
> uffd_wp_resolve request) imho there is really nothing we can do, so we should
> simply avoid touching that at all, and also avoid ptep_modify_prot_start,
> pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost.

Sorry for missing your feedback before. What you suggest makes perfect
sense.
Peter Xu Jan. 5, 2021, 7:43 p.m. UTC | #21
On Tue, Jan 05, 2021 at 07:07:51PM +0000, Nadav Amit wrote:
> > On Jan 5, 2021, at 7:08 AM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >> index ab709023e9aa..c08c4055b051 100644
> >> --- a/mm/mprotect.c
> >> +++ b/mm/mprotect.c
> >> @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >> 		oldpte = *pte;
> >> 		if (pte_present(oldpte)) {
> >> 			pte_t ptent;
> >> -			bool preserve_write = prot_numa && pte_write(oldpte);
> >> +			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
> >> +					      pte_write(oldpte);
> > 
> > Irrelevant of the other tlb issue, this is a standalone one and I commented in
> > v1 about simply ignore the change if necessary; unluckily that seems to be
> > ignored..  so I'll try again - would below be slightly better?
> > 
> >    if (uffd_wp_resolve && !pte_uffd_wp(oldpte))
> >        continue;
> > 
> > Firstly, current patch is confusing at least to me, because "uffd_wp_resolve"
> > means "unprotect the pte", whose write bit should mostly be cleared already
> > when uffd_wp_resolve is applicable.  Then "preserve_write" for that pte looks
> > odd already.
> > 
> > Meanwhile, if that really happens (when pte write bit set, but during a
> > uffd_wp_resolve request) imho there is really nothing we can do, so we should
> > simply avoid touching that at all, and also avoid ptep_modify_prot_start,
> > pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost.
> 
> Sorry for missing your feedback before. What you suggest makes perfect
> sense.

No problem.  I actually appreciated a lot for all your great works on these.
The strange thing is the userfaultfd kselftest seems to be working always fine
locally to me (probably another reason that I mostly test uffd-wp with
umapsort), so I won't be able to reproduce some issue you (and Andrea) have
encountered.  It's great you unveiled all these hard tlb problems and nailed
them down so lives should be easier for all of us.

Thanks,
Andrea Arcangeli Jan. 5, 2021, 7:45 p.m. UTC | #22
On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote:
> > On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > I just don't like to slow down a feature required in the future for
> > implementing postcopy live snapshotting or other snapshots to userland
> > processes (for the non-KVM case, also unprivileged by default if using
> > bounce buffers to feed the syscalls) that can be used by open source
> > hypervisors to beat proprietary hypervisors like vmware.
> 
> Ouch, that’s uncalled for. I am sure that you understand that I have no
> hidden agenda and we all have the same goal.

Ehm I never said you had an hidden agenda, so I'm not exactly why
you're accusing me of something I never said.

I merely pointed out one relevant justification for increasing kernel
complexity here by not slowing down clear_refs longstanding O(N)
clear_refs/softdirty feature and the recent uffd-wp O(1) feature, is
to be more competitive with proprietary software solutions, since
at least for uffd-wp, postcopy live snapshotting that the #1 use
case.

I never questioned your contribution or your preference, to be even
more explicit, it never crossed my mind that you have an hidden
agenda.

However since everyone already acked your patches and I'm not ok with
your patches because they will give a hit to KVM postcopy live
snapshotting and other container clear_refs users, I have to justify
why I NAK your patches and remaining competitive with proprietary
hypervisors is one of them, so I don't see what is wrong to state a
tangible end goal here.

Thanks,
Andrea
Nadav Amit Jan. 5, 2021, 8:06 p.m. UTC | #23
> On Jan 5, 2021, at 11:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote:
>>> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>> I just don't like to slow down a feature required in the future for
>>> implementing postcopy live snapshotting or other snapshots to userland
>>> processes (for the non-KVM case, also unprivileged by default if using
>>> bounce buffers to feed the syscalls) that can be used by open source
>>> hypervisors to beat proprietary hypervisors like vmware.
>> 
>> Ouch, that’s uncalled for. I am sure that you understand that I have no
>> hidden agenda and we all have the same goal.
> 
> Ehm I never said you had an hidden agenda, so I'm not exactly why
> you're accusing me of something I never said.
> 
> I merely pointed out one relevant justification for increasing kernel
> complexity here by not slowing down clear_refs longstanding O(N)
> clear_refs/softdirty feature and the recent uffd-wp O(1) feature, is
> to be more competitive with proprietary software solutions, since
> at least for uffd-wp, postcopy live snapshotting that the #1 use
> case.
> 
> I never questioned your contribution or your preference, to be even
> more explicit, it never crossed my mind that you have an hidden
> agenda.
> 
> However since everyone already acked your patches and I'm not ok with
> your patches because they will give a hit to KVM postcopy live
> snapshotting and other container clear_refs users, I have to justify
> why I NAK your patches and remaining competitive with proprietary
> hypervisors is one of them, so I don't see what is wrong to state a
> tangible end goal here.

I fully understand your objection to my patches and it is a valid
objection, which I will address.

I just thought that there might be some insinuation, as you mentioned VMware
by name. My response was half-jokingly and should have had a smiley to
prevent you from wasting your time on the explanation.
Andrea Arcangeli Jan. 5, 2021, 9:06 p.m. UTC | #24
On Tue, Jan 05, 2021 at 08:06:22PM +0000, Nadav Amit wrote:
> I just thought that there might be some insinuation, as you mentioned VMware
> by name. My response was half-jokingly and should have had a smiley to
> prevent you from wasting your time on the explanation.

No problem, actually I appreciate you pointed out to give me the extra
opportunity to further clarify I wasn't implying anything like that,
sorry again for any confusion I may have generated.

I mentioned vmware because I'd be shocked if for the whole duration of
the wrprotect on the guest physical memory it'd have to halt all minor
faults and all memory freeing like it would happen to rust-vmm/qemu if
we take the mmap_write_lock, that's all. Or am I wrong about this?

For uffd-wp avoiding the mmap_write_lock isn't an immediate concern
(obviously so in the rust-vmm case which won't even do postcopy live
migration), but the above concern applies for the long term and maybe
mid term for qemu.

The postcopy live snapshoitting was the #1 use case so it's hard not
to mention it, but there's still other interesting userland use cases
of uffd-wp with various users already testing it in their apps, that
may ultimately become more prevalent, who knows.

The point is that those that will experiment with uffd-wp will run a
benchmark, post a blog, others will see the blog, they will test too
in their app and post their blog. It needs to deliver the full
acceleration immediately, otherwise the evaluation may show it as a
fail or not worth it.

In theory we could just say, we'll optimize it later if significant
userbase emerge, but in my view it's bit of a chicken egg problem and
I'm afraid that such theory may not work well in practice.

Still, for the initial fix, avoiding the mmap_write_lock seems more
important actually for clear_refs than for uffd-wp. uffd-wp is
somewhat lucky and will just share any solution to keep clear_refs
scalable, since the issue is identical.

Thanks,
Andrea
Peter Xu Jan. 5, 2021, 9:43 p.m. UTC | #25
On Tue, Jan 05, 2021 at 04:06:27PM -0500, Andrea Arcangeli wrote:
> The postcopy live snapshoitting was the #1 use case so it's hard not
> to mention it, but there's still other interesting userland use cases
> of uffd-wp with various users already testing it in their apps, that
> may ultimately become more prevalent, who knows.

That's true.  AFAIU umap [1] uses uffd-wp for their computings already.  I
didn't really measure how far it can go, but currently the library is highly
concurrent, for example, there're quite a few macros that can tune the
parallelism of the library [2]:

  UMAP_PAGE_FILLERS This is the number of worker threads that will perform read
  operations from the backing store (including read-ahead) for a specific umap
  region.

  UMAP_PAGE_EVICTORS This is the number of worker threads that will perform
  evictions of pages. Eviction includes writing to the backing store if the
  page is dirty and telling the operating system that the page is no longer
  needed.

The write lock means at least all the evictor threads will be serialized,
immediately makes UMAP_PAGE_EVICTORS meaningless... not to mention all the rest
of read lock takers (filler threads, worker threads, etc.).  So if it happens,
I bet LLNL will suddenly observe a drastic drop after upgrading the kernel..

I don't know why umap didn't hit the tlb issue already.  It seems to me that
issues may only trigger with COW right after a stalled tlb so COW is the only
one affected (or, is it?) while umap may not use cow that lot by accident.  But
I could be completely wrong on that.

[1] https://github.com/LLNL/umap
[2] https://llnl-umap.readthedocs.io/en/develop/environment_variables.html
diff mbox series

Patch

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..c08c4055b051 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..7423808640ef 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -652,7 +652,15 @@  int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	/* Does the address range wrap, or is the span zero-sized? */
 	BUG_ON(start + len <= start);
 
-	mmap_read_lock(dst_mm);
+	/*
+	 * Although we do not change the VMA, we have to ensure deferred TLB
+	 * flushes are performed before page-faults can be handled. Otherwise
+	 * we can get inconsistent TLB state.
+	 */
+	if (enable_wp)
+		mmap_write_lock(dst_mm);
+	else
+		mmap_read_lock(dst_mm);
 
 	/*
 	 * If memory mappings are changing because of non-cooperative
@@ -686,6 +694,9 @@  int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 
 	err = 0;
 out_unlock:
-	mmap_read_unlock(dst_mm);
+	if (enable_wp)
+		mmap_write_unlock(dst_mm);
+	else
+		mmap_read_unlock(dst_mm);
 	return err;
 }