Message ID | 20201225092529.3228466-2-namit@vmware.com (mailing list archive)
---|---
State | New, archived
Series | mm: fix races due to deferred TLB flushes
On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> The scenario that happens in selftests/vm/userfaultfd is as follows:
>
> cpu0                            cpu1                        cpu2
> ----                            ----                        ----
>                                                             [ Writable PTE
>                                                               cached in TLB ]
> userfaultfd_writeprotect()
> [ write-*unprotect* ]
> mwriteprotect_range()
> mmap_read_lock()
> change_protection()
>
> change_protection_range()
> ...
> change_pte_range()
> [ *clear* “write”-bit ]
> [ defer TLB flushes ]
>                                 [ page-fault ]
>                                 ...
>                                 wp_page_copy()
>                                  cow_user_page()
>                                   [ copy page ]
>                                                             [ write to old
>                                                               page ]
>                                 ...
>                                 set_pte_at_notify()

Yuck!

Isn't this all rather similar to the problem that resulted in the
tlb_flush_pending mess?

I still think that's all fundamentally buggered, the much saner solution
(IMO) would've been to make things wait for the pending flush, instead
of doing a local flush and fudging things like we do now.

Then the above could be fixed by having wp_page_copy() wait for the
pending invalidate (although a more fine-grained pending state would be
awesome).

The below probably doesn't compile and will probably cause massive
header fail at the very least, but does show the general idea.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 07d9acb5b19c..0210547ac424 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
	 *
	 * Therefore we must rely on tlb_flush_*() to guarantee order.
	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending))
+		wake_up_var(&mm->tlb_flush_pending);
 }
 
 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
@@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }
 
+static inline void wait_tlb_flush_pending(struct mm_struct *mm)
+{
+	wait_var_event(&mm->tlb_flush_pending,
+		       atomic_read(&mm->tlb_flush_pending) == 0);
+}
+
 struct vm_fault;
 
 /**
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..3c36bca2972a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3087,6 +3087,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
+	wait_tlb_flush_pending(vma->vm_mm);
+
 	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
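To make the moving parts concrete: the counter that the wait_var_event()
above sleeps on is raised and dropped around the deferred flush in
mm/mprotect.c's change_protection_range(). The following is only a
simplified sketch of that pattern, with the PTE walk elided; it is not a
copy of the real function.

static unsigned long change_protection_range_sketch(struct vm_area_struct *vma,
                                                    unsigned long start,
                                                    unsigned long end)
{
        struct mm_struct *mm = vma->vm_mm;
        unsigned long pages = 0;

        inc_tlb_flush_pending(mm);      /* mm_tlb_flush_pending() is non-zero from here */

        /*
         * ... walk the range and modify the PTEs under the PT lock,
         * counting the number of changed entries in pages ...
         */

        if (pages)                      /* the deferred flush under discussion */
                flush_tlb_range(vma, start, end);

        dec_tlb_flush_pending(mm);      /* with the diff above, wake_up_var() fires here */

        return pages;
}

With the diff above applied, the dec_tlb_flush_pending() that closes this
window is also the wake-up point for any fault blocked in
wait_tlb_flush_pending().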
Hello, On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > The scenario that happens in selftests/vm/userfaultfd is as follows: > > > > cpu0 cpu1 cpu2 > > ---- ---- ---- > > [ Writable PTE > > cached in TLB ] > > userfaultfd_writeprotect() > > [ write-*unprotect* ] > > mwriteprotect_range() > > mmap_read_lock() > > change_protection() > > > > change_protection_range() > > ... > > change_pte_range() > > [ *clear* “write”-bit ] > > [ defer TLB flushes ] > > [ page-fault ] > > ... > > wp_page_copy() > > cow_user_page() > > [ copy page ] > > [ write to old > > page ] > > ... > > set_pte_at_notify() > > Yuck! > Note, the above was posted before we figured out the details so it wasn't showing the real deferred tlb flush that caused problems (the one showed on the left causes zero issues). The problematic one not pictured is the one of the wrprotect that has to be running in another CPU which is also isn't picture above. More accurate traces are posted later in the thread. > Isn't this all rather similar to the problem that resulted in the > tlb_flush_pending mess? > > I still think that's all fundamentally buggered, the much saner solution > (IMO) would've been to make things wait for the pending flush, instead How do intend you wait in PT lock while the writer also has to take PT lock repeatedly before it can do wake_up_var? If you release the PT lock before calling wait_tlb_flush_pending it all falls apart again. This I guess explains why a local pte/hugepmd smp local invlpg is the only working solution for this issue, similarly to how it's done in rmap. > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 07d9acb5b19c..0210547ac424 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm) > * > * Therefore we must rely on tlb_flush_*() to guarantee order. > */ > - atomic_dec(&mm->tlb_flush_pending); > + if (atomic_dec_and_test(&mm->tlb_flush_pending)) > + wake_up_var(&mm->tlb_flush_pending); > } > > static inline bool mm_tlb_flush_pending(struct mm_struct *mm) > @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm) > return atomic_read(&mm->tlb_flush_pending) > 1; > } > > +static inline void wait_tlb_flush_pending(struct mm_struct *mm) > +{ > + wait_var_event(&mm->tlb_flush_pending, > + atomic_read(&mm->tlb_flush_pending) == 0); > +} I appreciate the effort in not regressing soft dirty and uffd-wp writeprotect to disk-I/O spindle bandwidth and not using mmap_sem for writing. At the same time what was posted so far wasn't clean enough but it wasn't even tested... if we abstract it in some clean way and we mark all connected points (soft dirty, uffd-wp, the wrprotect page fault), then I can be optimistic it will remain understandable when we look at it again a few years down the road. Or at the very least it can't get worse than the "tlb_flush_pending mess" you mentioned above. flush_tlb_batched_pending() has to be orthogonally re-reviewed for those things Nadav pointed out. But I'd rather keep that review in a separate thread since any bug in that code has zero connection to this issue. The basic idea is similar but the methods and logic are different and our flush here will be granular and it's going to be only run if VM_SOFTDIRTY isn't set and soft dirty is compiled in, or if VM_UFFD_WP is set. The flush_tlb_batched_pending is mm wide, unconditional etc.. 
Pretty much all different. Thanks, Andrea
> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > Hello, > > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >> >>> The scenario that happens in selftests/vm/userfaultfd is as follows: >>> >>> cpu0 cpu1 cpu2 >>> ---- ---- ---- >>> [ Writable PTE >>> cached in TLB ] >>> userfaultfd_writeprotect() >>> [ write-*unprotect* ] >>> mwriteprotect_range() >>> mmap_read_lock() >>> change_protection() >>> >>> change_protection_range() >>> ... >>> change_pte_range() >>> [ *clear* “write”-bit ] >>> [ defer TLB flushes ] >>> [ page-fault ] >>> ... >>> wp_page_copy() >>> cow_user_page() >>> [ copy page ] >>> [ write to old >>> page ] >>> ... >>> set_pte_at_notify() >> >> Yuck! > > Note, the above was posted before we figured out the details so it > wasn't showing the real deferred tlb flush that caused problems (the > one showed on the left causes zero issues). Actually it was posted after (note that this is v2). The aforementioned scenario that Peter regards to is the one that I actually encountered (not the second scenario that is “theoretical”). This scenario that Peter regards is indeed more “stupid” in the sense that we should just not write-protect the PTE on userfaultfd write-unprotect. Let me know if I made any mistake in the description. > The problematic one not pictured is the one of the wrprotect that has > to be running in another CPU which is also isn't picture above. More > accurate traces are posted later in the thread. I think I included this scenario as well in the commit log (of v2). Let me know if I screwed up and the description is not clear.
On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote: > > On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Hello, > > > > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > >> > >>> The scenario that happens in selftests/vm/userfaultfd is as follows: > >>> > >>> cpu0 cpu1 cpu2 > >>> ---- ---- ---- > >>> [ Writable PTE > >>> cached in TLB ] > >>> userfaultfd_writeprotect() > >>> [ write-*unprotect* ] > >>> mwriteprotect_range() > >>> mmap_read_lock() > >>> change_protection() > >>> > >>> change_protection_range() > >>> ... > >>> change_pte_range() > >>> [ *clear* “write”-bit ] > >>> [ defer TLB flushes ] > >>> [ page-fault ] > >>> ... > >>> wp_page_copy() > >>> cow_user_page() > >>> [ copy page ] > >>> [ write to old > >>> page ] > >>> ... > >>> set_pte_at_notify() > >> > >> Yuck! > > > > Note, the above was posted before we figured out the details so it > > wasn't showing the real deferred tlb flush that caused problems (the > > one showed on the left causes zero issues). > > Actually it was posted after (note that this is v2). The aforementioned > scenario that Peter regards to is the one that I actually encountered (not > the second scenario that is “theoretical”). This scenario that Peter regards > is indeed more “stupid” in the sense that we should just not write-protect > the PTE on userfaultfd write-unprotect. > > Let me know if I made any mistake in the description. I didn't say there is a mistake. I said it is not showing the real deferred tlb flush that cause problems. The issue here is that we have a "defer tlb flush" that runs after "write to old page". If you look at the above, you're induced to think the "defer tlb flush" that causes issues is the one in cpu0. It's not. That is totally harmless. > > > The problematic one not pictured is the one of the wrprotect that has > > to be running in another CPU which is also isn't picture above. More > > accurate traces are posted later in the thread. > > I think I included this scenario as well in the commit log (of v2). Let me > know if I screwed up and the description is not clear. Instead of not showing the real "defer tlb flush" in the trace and then fixing it up in the comment, why don't you take the trace showing the real problematic "defer tlb flush"? No need to reinvent it. https://lkml.kernel.org/r/X+JJqK91plkBVisG@redhat.com See here the detail underlined: deferred tlb flush <- too late XXXXXXXXXXXXXX BUG RACE window close here This show the real deferred tlb flush, your v2 does not include it instead.
> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote: >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: >>> >>> Hello, >>> >>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >>>> >>>>> The scenario that happens in selftests/vm/userfaultfd is as follows: >>>>> >>>>> cpu0 cpu1 cpu2 >>>>> ---- ---- ---- >>>>> [ Writable PTE >>>>> cached in TLB ] >>>>> userfaultfd_writeprotect() >>>>> [ write-*unprotect* ] >>>>> mwriteprotect_range() >>>>> mmap_read_lock() >>>>> change_protection() >>>>> >>>>> change_protection_range() >>>>> ... >>>>> change_pte_range() >>>>> [ *clear* “write”-bit ] >>>>> [ defer TLB flushes ] >>>>> [ page-fault ] >>>>> ... >>>>> wp_page_copy() >>>>> cow_user_page() >>>>> [ copy page ] >>>>> [ write to old >>>>> page ] >>>>> ... >>>>> set_pte_at_notify() >>>> >>>> Yuck! >>> >>> Note, the above was posted before we figured out the details so it >>> wasn't showing the real deferred tlb flush that caused problems (the >>> one showed on the left causes zero issues). >> >> Actually it was posted after (note that this is v2). The aforementioned >> scenario that Peter regards to is the one that I actually encountered (not >> the second scenario that is “theoretical”). This scenario that Peter regards >> is indeed more “stupid” in the sense that we should just not write-protect >> the PTE on userfaultfd write-unprotect. >> >> Let me know if I made any mistake in the description. > > I didn't say there is a mistake. I said it is not showing the real > deferred tlb flush that cause problems. > > The issue here is that we have a "defer tlb flush" that runs after > "write to old page". > > If you look at the above, you're induced to think the "defer tlb > flush" that causes issues is the one in cpu0. It's not. That is > totally harmless. I do not understand what you say. The deferred TLB flush on cpu0 *is* the the one that causes the problem. The PTE is write-protected (although it is a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle the page-fault (and copy), while cpu2 keeps writing to the source page. If cpu0 did not defer the TLB flush, this problem would not happen. >>> The problematic one not pictured is the one of the wrprotect that has >>> to be running in another CPU which is also isn't picture above. More >>> accurate traces are posted later in the thread. >> >> I think I included this scenario as well in the commit log (of v2). Let me >> know if I screwed up and the description is not clear. > > Instead of not showing the real "defer tlb flush" in the trace and > then fixing it up in the comment, why don't you take the trace showing > the real problematic "defer tlb flush"? No need to reinvent it. The scenario you mention is indeed identical to the second scenario I mention in the commit log. I think the version I included is cleared since it shows the write that triggers the corruption instead of discussing “windows”, which might be less clear. Running copy_user_page() with stale TLB is by itself not a bug if you detect it later (e.g., using pte_same()). Note that my second scenario is also consistent in style with the first scenario. I am not married to my description and if you (and others) insist I would copy-paste your version. > This show the real deferred tlb flush, your v2 does not include it > instead. 
Are you talking about the first scenario (write-unprotect), the second
one (write-protect followed by write-unprotect), or both? It seems to me
that all the deferred TLB flushes are mentioned at the point they are
deferred. I can add the point at which they are performed.
On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote: > > On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote: > >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > >>> > >>> Hello, > >>> > >>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > >>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > >>>> > >>>>> The scenario that happens in selftests/vm/userfaultfd is as follows: > >>>>> > >>>>> cpu0 cpu1 cpu2 > >>>>> ---- ---- ---- > >>>>> [ Writable PTE > >>>>> cached in TLB ] > >>>>> userfaultfd_writeprotect() > >>>>> [ write-*unprotect* ] > >>>>> mwriteprotect_range() > >>>>> mmap_read_lock() > >>>>> change_protection() > >>>>> > >>>>> change_protection_range() > >>>>> ... > >>>>> change_pte_range() > >>>>> [ *clear* “write”-bit ] > >>>>> [ defer TLB flushes ] > >>>>> [ page-fault ] > >>>>> ... > >>>>> wp_page_copy() > >>>>> cow_user_page() > >>>>> [ copy page ] > >>>>> [ write to old > >>>>> page ] > >>>>> ... > >>>>> set_pte_at_notify() > >>>> > >>>> Yuck! > >>> > >>> Note, the above was posted before we figured out the details so it > >>> wasn't showing the real deferred tlb flush that caused problems (the > >>> one showed on the left causes zero issues). > >> > >> Actually it was posted after (note that this is v2). The aforementioned > >> scenario that Peter regards to is the one that I actually encountered (not > >> the second scenario that is “theoretical”). This scenario that Peter regards > >> is indeed more “stupid” in the sense that we should just not write-protect > >> the PTE on userfaultfd write-unprotect. > >> > >> Let me know if I made any mistake in the description. > > > > I didn't say there is a mistake. I said it is not showing the real > > deferred tlb flush that cause problems. > > > > The issue here is that we have a "defer tlb flush" that runs after > > "write to old page". > > > > If you look at the above, you're induced to think the "defer tlb > > flush" that causes issues is the one in cpu0. It's not. That is > > totally harmless. > > I do not understand what you say. The deferred TLB flush on cpu0 *is* the > the one that causes the problem. The PTE is write-protected (although it is > a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle > the page-fault (and copy), while cpu2 keeps writing to the source page. If > cpu0 did not defer the TLB flush, this problem would not happen. Your argument "If cpu0 did not defer the TLB flush, this problem would not happen" is identical to "if the cpu0 has a small TLB size and the tlb entry is recycled, the problem would not happen". There are a multitude of factors that are unrelated to the real problematic deferred tlb flush that may happen and still won't cause the issue, including a parallel IPI. The point is that we don't need to worry about the "defer TLB flushes" of the un-wrprotect, because you said earlier that deferring tlb flushes when you're doing "permission promotions" does not cause problems. The only "deferred tlb flush" we need to worry about, not in the picture, is the one following the actual permission removal (the wrprotection). > it shows the write that triggers the corruption instead of discussing > “windows”, which might be less clear. 
Running copy_user_page() with stale I think showing exactly where the race window opens is key to understand the issue, but then that's the way I work and feel free to think it in any other way that may sound simpler. I just worried people thinks the deferred tlb flush in your v2 trace is the one that causes problem when obviously it's not since it follows a permission promotion. Once that is clear, feel free to reject my trace. All I care about is that performance don't regress from CPU-speed to disk I/O spindle speed, for soft dirty and uffd-wp. > I am not married to my description and if you (and others) insist I would > copy-paste your version. I definitely don't insist, I only wanted to clarify in case it may not have been clear the problematic deferred tlb flush wasn't part of your trace. > Are you talking about the first scenario (write-unprotect), the second one > (write-protect followed by write-unprotect), both? It seems to me that all > the deferred TLB flushes are mentioned at the point they are deferred. I can > add the point in which they are performed. The only case that has an issue for uffd-wp is in my trace and only ever happens if there's a wrprotect in flight, the deferred tlb flush of the wrprotect is deferred (and that's the problematic one that closes the window when it finally runs) and un-wrprotect runs. The window opens when the un-wrprotect unlocks the PT lock. The deferred tlb flush of un-wrprotect is as relevant for this race, as random tlb flushes from IPI or the TLB being small or none. Thanks, Andrea
> On Jan 4, 2021, at 1:01 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote: >>> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: >>> >>> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote: >>>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: >>>>> >>>>> Hello, >>>>> >>>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >>>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >>>>>> >>>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows: >>>>>>> >>>>>>> cpu0 cpu1 cpu2 >>>>>>> ---- ---- ---- >>>>>>> [ Writable PTE >>>>>>> cached in TLB ] >>>>>>> userfaultfd_writeprotect() >>>>>>> [ write-*unprotect* ] >>>>>>> mwriteprotect_range() >>>>>>> mmap_read_lock() >>>>>>> change_protection() >>>>>>> >>>>>>> change_protection_range() >>>>>>> ... >>>>>>> change_pte_range() >>>>>>> [ *clear* “write”-bit ] >>>>>>> [ defer TLB flushes ] >>>>>>> [ page-fault ] >>>>>>> ... >>>>>>> wp_page_copy() >>>>>>> cow_user_page() >>>>>>> [ copy page ] >>>>>>> [ write to old >>>>>>> page ] >>>>>>> ... >>>>>>> set_pte_at_notify() >>>>>> >>>>>> Yuck! >>>>> >>>>> Note, the above was posted before we figured out the details so it >>>>> wasn't showing the real deferred tlb flush that caused problems (the >>>>> one showed on the left causes zero issues). >>>> >>>> Actually it was posted after (note that this is v2). The aforementioned >>>> scenario that Peter regards to is the one that I actually encountered (not >>>> the second scenario that is “theoretical”). This scenario that Peter regards >>>> is indeed more “stupid” in the sense that we should just not write-protect >>>> the PTE on userfaultfd write-unprotect. >>>> >>>> Let me know if I made any mistake in the description. >>> >>> I didn't say there is a mistake. I said it is not showing the real >>> deferred tlb flush that cause problems. >>> >>> The issue here is that we have a "defer tlb flush" that runs after >>> "write to old page". >>> >>> If you look at the above, you're induced to think the "defer tlb >>> flush" that causes issues is the one in cpu0. It's not. That is >>> totally harmless. >> >> I do not understand what you say. The deferred TLB flush on cpu0 *is* the >> the one that causes the problem. The PTE is write-protected (although it is >> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle >> the page-fault (and copy), while cpu2 keeps writing to the source page. If >> cpu0 did not defer the TLB flush, this problem would not happen. > > Your argument "If cpu0 did not defer the TLB flush, this problem would > not happen" is identical to "if the cpu0 has a small TLB size and the > tlb entry is recycled, the problem would not happen". > > There are a multitude of factors that are unrelated to the real > problematic deferred tlb flush that may happen and still won't cause > the issue, including a parallel IPI. > > The point is that we don't need to worry about the "defer TLB flushes" > of the un-wrprotect, because you said earlier that deferring tlb > flushes when you're doing "permission promotions" does not cause > problems. > > The only "deferred tlb flush" we need to worry about, not in the > picture, is the one following the actual permission removal (the > wrprotection). I think you are missing the point of this scenario, which is different than the second scenario. 
In this scenario, change_pte_range(), when called to do userfaultfd’s
*unprotect* operation, did not preserve the write-bit if it was already
set. Instead change_pte_range() *cleared* the write-bit. So upon a
logical permission promotion operation - userfaultfd *unprotect* - you
got a physical permission demotion, turning RW PTEs into RO.

This problem is fully resolved by this part of the patch:

--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);

You can argue that this is not directly related to the deferred TLB
flush, as once this chunk is added, a TLB flush would not be needed at
all for userfaultfd-unprotect. But I consider it a part of the problem,
especially since this is what triggered the userfaultfd self-tests to
fail.

>> it shows the write that triggers the corruption instead of discussing
>> “windows”, which might be less clear. Running copy_user_page() with stale
>
> I think showing exactly where the race window opens is key to
> understand the issue, but then that's the way I work and feel free to
> think it in any other way that may sound simpler.
>
> I just worried people thinks the deferred tlb flush in your v2 trace
> is the one that causes problem when obviously it's not since it
> follows a permission promotion. Once that is clear, feel free to
> reject my trace.
>
> All I care about is that performance don't regress from CPU-speed to
> disk I/O spindle speed, for soft dirty and uffd-wp.

I would feel more comfortable if you provide patches for uffd-wp. If you
want, I will do it, but I restate that I do not feel comfortable with
this solution (worried as it seems a bit ad-hoc and might leave out a
scenario we all missed or cause a TLB shootdown storm).

As for soft-dirty, I thought that you said that you do not see a better
(backportable) solution for soft-dirty. Correct me if I am wrong.

Anyhow, I will add your comments regarding the stale TLB window to make
the description clearer.
On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> The problematic one not pictured is the one of the wrprotect that has
> to be running in another CPU which is also isn't picture above. More
> accurate traces are posted later in the thread.

What thread? I don't seem to have discovered it yet, and the cover
letter from Nadav doesn't seem to have a msgid linking it either.
> On Jan 5, 2021, at 12:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
>> The problematic one not pictured is the one of the wrprotect that has
>> to be running in another CPU which is also isn't picture above. More
>> accurate traces are posted later in the thread.
>
> What thread? I don't seem to have discovered it yet, and the cover
> letter from Nadav doesn't seem to have a msgid linking it either.

Sorry for that:

https://lore.kernel.org/lkml/X+K7JMrTEC9SpVIB@google.com/T/
On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > > > The scenario that happens in selftests/vm/userfaultfd is as follows: > > > > > > cpu0 cpu1 cpu2 > > > ---- ---- ---- > > > [ Writable PTE > > > cached in TLB ] > > > userfaultfd_writeprotect() > > > [ write-*unprotect* ] > > > mwriteprotect_range() > > > mmap_read_lock() > > > change_protection() > > > > > > change_protection_range() > > > ... > > > change_pte_range() > > > [ *clear* “write”-bit ] > > > [ defer TLB flushes ] > > > [ page-fault ] > > > ... > > > wp_page_copy() > > > cow_user_page() > > > [ copy page ] > > > [ write to old > > > page ] > > > ... > > > set_pte_at_notify() > > > > Yuck! > > > > Note, the above was posted before we figured out the details so it > wasn't showing the real deferred tlb flush that caused problems (the > one showed on the left causes zero issues). > > The problematic one not pictured is the one of the wrprotect that has > to be running in another CPU which is also isn't picture above. More > accurate traces are posted later in the thread. Lets assume CPU0 does a read-lock, W -> RO with deferred flush. > > Isn't this all rather similar to the problem that resulted in the > > tlb_flush_pending mess? > > > > I still think that's all fundamentally buggered, the much saner solution > > (IMO) would've been to make things wait for the pending flush, instead > > How do intend you wait in PT lock while the writer also has to take PT > lock repeatedly before it can do wake_up_var? > > If you release the PT lock before calling wait_tlb_flush_pending it > all falls apart again. I suppose you can check for pending, if found, release lock, wait for 0, and re-take the fault? > This I guess explains why a local pte/hugepmd smp local invlpg is the > only working solution for this issue, similarly to how it's done in rmap. In that case a local invalidate on CPU1 simply doesn't help anything. CPU1 needs to do a global invalidate or wait for the in-progress one to complete, such that CPU2 is sure to not have a W entry left before CPU1 goes and copies the page.
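A rough sketch of the retry idea Peter suggests here, reusing the
wait_tlb_flush_pending() helper from his earlier untested diff; none of
this is existing kernel code, only mm_tlb_flush_pending(),
pte_unmap_unlock() and VM_FAULT_RETRY are real interfaces.

static vm_fault_t do_wp_page_sketch(struct vm_fault *vmf)
{
        struct mm_struct *mm = vmf->vma->vm_mm;

        if (unlikely(mm_tlb_flush_pending(mm))) {
                /* Can't sleep under the PT lock, so drop it first... */
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                /* ...wait until the batched flush has actually been done... */
                wait_tlb_flush_pending(mm);
                /* ...and redo the whole fault with a fresh view of the PTE. */
                return VM_FAULT_RETRY;
        }

        /* No flush in flight: the normal COW path is safe to run. */
        return 0;
}

The trade-off Andrea raises below applies: the pending counter is
mm-wide, so a fault would back off whenever any flush is pending on the
mm, not only one covering the faulting page.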
> On Jan 5, 2021, at 12:58 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: >> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >>> >>>> The scenario that happens in selftests/vm/userfaultfd is as follows: >>>> >>>> cpu0 cpu1 cpu2 >>>> ---- ---- ---- >>>> [ Writable PTE >>>> cached in TLB ] >>>> userfaultfd_writeprotect() >>>> [ write-*unprotect* ] >>>> mwriteprotect_range() >>>> mmap_read_lock() >>>> change_protection() >>>> >>>> change_protection_range() >>>> ... >>>> change_pte_range() >>>> [ *clear* “write”-bit ] >>>> [ defer TLB flushes ] >>>> [ page-fault ] >>>> ... >>>> wp_page_copy() >>>> cow_user_page() >>>> [ copy page ] >>>> [ write to old >>>> page ] >>>> ... >>>> set_pte_at_notify() >>> >>> Yuck! >> >> Note, the above was posted before we figured out the details so it >> wasn't showing the real deferred tlb flush that caused problems (the >> one showed on the left causes zero issues). >> >> The problematic one not pictured is the one of the wrprotect that has >> to be running in another CPU which is also isn't picture above. More >> accurate traces are posted later in the thread. > > Lets assume CPU0 does a read-lock, W -> RO with deferred flush. This is the second scenario that is mentioned in the patch. (The first one is relatively easy to address by not clearing the write-bit). >>> Isn't this all rather similar to the problem that resulted in the >>> tlb_flush_pending mess? >>> >>> I still think that's all fundamentally buggered, the much saner solution >>> (IMO) would've been to make things wait for the pending flush, instead >> >> How do intend you wait in PT lock while the writer also has to take PT >> lock repeatedly before it can do wake_up_var? >> >> If you release the PT lock before calling wait_tlb_flush_pending it >> all falls apart again. > > I suppose you can check for pending, if found, release lock, wait for 0, > and re-take the fault? My personal take on this issue (which for full disclosure I think Andrea disagrees with) is that it the most important enhancement is to reduce the number of cases which we mistakenly think that we must wait for pending TLB flush. It will not be free though. As to the enhancement that you propose: although it seems as a valid enhancement to me, I think that it is more robust to make forward progress when possible (as done today). This is especially important if the proposed enhancement cannot be checked by lockdep.
On Tue, Jan 05, 2021 at 12:52:48AM -0800, Nadav Amit wrote:
> > On Jan 5, 2021, at 12:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> >> The problematic one not pictured is the one of the wrprotect that has
> >> to be running in another CPU which is also isn't picture above. More
> >> accurate traces are posted later in the thread.
> >
> > What thread? I don't seem to have discovered it yet, and the cover
> > letter from Nadav doesn't seem to have a msgid linking it either.
>
> Sorry for that:
>
> https://lore.kernel.org/lkml/X+K7JMrTEC9SpVIB@google.com/T/

Much reading later.. OK, go with the write-lock for now.
On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ab709023e9aa..c08c4055b051 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		oldpte = *pte;
>  		if (pte_present(oldpte)) {
>  			pte_t ptent;
> -			bool preserve_write = prot_numa && pte_write(oldpte);
> +			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
> +					      pte_write(oldpte);

Irrespective of the other tlb issue, this is a standalone one and I
commented in v1 about simply ignoring the change if necessary; unluckily
that seems to have been ignored.. so I'll try again - would the below be
slightly better?

        if (uffd_wp_resolve && !pte_uffd_wp(oldpte))
                continue;

Firstly, the current patch is confusing at least to me, because
"uffd_wp_resolve" means "unprotect the pte", whose write bit should
mostly be cleared already when uffd_wp_resolve is applicable. Then
"preserve_write" for that pte looks odd already.

Meanwhile, if that really happens (when the pte write bit is set during
a uffd_wp_resolve request) imho there is really nothing we can do, so we
should simply avoid touching that pte at all, and also avoid the
ptep_modify_prot_start, pte_modify, ptep_modify_prot_commit calls etc.,
which take extra cost.

Thanks,
On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote: > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: > > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > > > > > The scenario that happens in selftests/vm/userfaultfd is as follows: > > > > > > > > cpu0 cpu1 cpu2 > > > > ---- ---- ---- > > > > [ Writable PTE > > > > cached in TLB ] > > > > userfaultfd_writeprotect() > > > > [ write-*unprotect* ] > > > > mwriteprotect_range() > > > > mmap_read_lock() > > > > change_protection() > > > > > > > > change_protection_range() > > > > ... > > > > change_pte_range() > > > > [ *clear* “write”-bit ] > > > > [ defer TLB flushes ] > > > > [ page-fault ] > > > > ... > > > > wp_page_copy() > > > > cow_user_page() > > > > [ copy page ] > > > > [ write to old > > > > page ] > > > > ... > > > > set_pte_at_notify() > > > > > > Yuck! > > > > > > > Note, the above was posted before we figured out the details so it > > wasn't showing the real deferred tlb flush that caused problems (the > > one showed on the left causes zero issues). > > > > The problematic one not pictured is the one of the wrprotect that has > > to be running in another CPU which is also isn't picture above. More > > accurate traces are posted later in the thread. > > Lets assume CPU0 does a read-lock, W -> RO with deferred flush. I was mistaken saying the deferred tlb flush was not shown in the v2 trace, just this appears a new different case we didn't happen to consider before. In the previous case we discussed earlier, when un-wrprotect above is called it never should have been a W->RO since a wrprotect run first. Doesn't it ring a bell that if an un-wrprotect does a W->RO transition, something is a bit going backwards? I don't recall from previous discussion that un-wrprotect was considered as called on read-write memory. I think we need the below change to fix this new case. if (uffd_wp) { + if (unlikely(pte_uffd_wp(oldpte))) + continue; ptent = pte_wrprotect(ptent); ptent = pte_mkuffd_wp(ptent); } else if (uffd_wp_resolve) { + if (unlikely(!pte_uffd_wp(oldpte))) + continue; /* * Leave the write bit to be handled * by PF interrupt handler, then * things like COW could be properly * handled. */ ptent = pte_clear_uffd_wp(ptent); } I now get why the v2 patch touches preserved_write, but this is not about preserve_write, it's not about leaving the write bit alone. This is about leaving the whole pte alone if the uffd-wp bit doesn't actually change. We shouldn't just defer the tlb flush if un-wprotect is called on read-write memory: we should not have flushed the tlb at all in such case. Same for hugepmd in huge_memory.c which will be somewhere else. Once the above is optimized, then un-wrprotect as in MM_CP_UFFD_WP_RESOLVE is usually preceded by wrprotect as in MM_CP_UFFD_WP, and so it'll never be a W->RO but a RO->RO transition that just clears the uffd_wp flag and nothing else and whose tlb flush is in turn irrelevant. The fix discussed still works for this new case too: I'm not suggesting we should rely on the above optimization for the tlb safety. The above is just a missing optimization. > > > Isn't this all rather similar to the problem that resulted in the > > > tlb_flush_pending mess? 
> > > > > > I still think that's all fundamentally buggered, the much saner solution > > > (IMO) would've been to make things wait for the pending flush, instead > > > > How do intend you wait in PT lock while the writer also has to take PT > > lock repeatedly before it can do wake_up_var? > > > > If you release the PT lock before calling wait_tlb_flush_pending it > > all falls apart again. > > I suppose you can check for pending, if found, release lock, wait for 0, > and re-take the fault? Aborting the page fault unconditionally while MADV_DONTNEED is running on some other unrelated vma, sounds not desirable. Doing it only for !VM_SOFTDIRTY or soft dirty not compiled in sounds less bad but it would still mean that while clear_refs is running, no thread can write to any anon memory of the process. > > This I guess explains why a local pte/hugepmd smp local invlpg is the > > only working solution for this issue, similarly to how it's done in rmap. > > In that case a local invalidate on CPU1 simply doesn't help anything. > > CPU1 needs to do a global invalidate or wait for the in-progress one to > complete, such that CPU2 is sure to not have a W entry left before CPU1 > goes and copies the page. Yes, it was a global invlpg, definitely not local sorry for the confusion, as in the PoC posted here which needs cleaning up: https://lkml.kernel.org/r/X+QLr1WmGXMs33Ld@redhat.com + flush_tlb_page(vma, vmf->address); I think instead of the flush_tlb_page above, we just need an ad-hoc abstraction there. The added complexity to the page fault common code consist in having to call such abstract call in the right place of the page fault. The vm_flags to check will be the same for both the flush_tlb_page and the wait_tlb_pending approaches. Once the filter on vm_flags pass, the only difference is between "flush_tlb_page; return void" or "PT unlock; wait_; return VM_FAULT_RETRY" so it looks more an implementation detail with a different tradeoff at runtime. Thanks, Andrea
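For comparison, a minimal sketch of the alternative Andrea outlines:
flush only the faulting page before the copy, filtered on the vma. Only
the flush_tlb_page(vma, vmf->address) call is taken from the linked PoC;
the helper name and the exact vm_flags filter are illustrative guesses.

static void flush_if_wrprotect_pending_sketch(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;

        /*
         * Only vmas where a wrprotect-style deferred flush may still be
         * pending: uffd-wp armed, or soft-dirty compiled in and not yet
         * set on the vma.
         */
        if ((vma->vm_flags & VM_UFFD_WP) ||
            (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vma->vm_flags & VM_SOFTDIRTY)))
                flush_tlb_page(vma, vmf->address);      /* a global, not CPU-local, flush */
}

Called early in the write-protect fault path, this leaves faults on
untracked vmas untouched, at the cost of an extra per-page shootdown on
the tracked ones, which is the runtime trade-off Andrea describes above.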
On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote: > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index ab709023e9aa..c08c4055b051 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > > oldpte = *pte; > > if (pte_present(oldpte)) { > > pte_t ptent; > > - bool preserve_write = prot_numa && pte_write(oldpte); > > + bool preserve_write = (prot_numa || uffd_wp_resolve) && > > + pte_write(oldpte); > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > v1 about simply ignore the change if necessary; unluckily that seems to be > ignored.. so I'll try again - would below be slightly better? > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > continue; I posted the exact same code before seeing the above so I take it as a good sign :). I'd suggest to add the reverse check to the uffd_wp too. > Firstly, current patch is confusing at least to me, because "uffd_wp_resolve" > means "unprotect the pte", whose write bit should mostly be cleared already > when uffd_wp_resolve is applicable. Then "preserve_write" for that pte looks > odd already. > > Meanwhile, if that really happens (when pte write bit set, but during a > uffd_wp_resolve request) imho there is really nothing we can do, so we should > simply avoid touching that at all, and also avoid ptep_modify_prot_start, > pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost. Agreed. It should not just defer the flush, by doing continue we will not flush anything. So ultimately the above will be an orthogonal optimization, but now I get the why the deferred tlb flush on the cpu0 of the v2 patch was the problematic one. I didn't see we lacked the above optimization and I thought we were discussing still the regular case where un-wrprotect is called on a pte with uffd-wp set. thanks, Andrea
On Tue, Jan 05, 2021 at 01:08:48PM -0500, Andrea Arcangeli wrote: > On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote: > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > > index ab709023e9aa..c08c4055b051 100644 > > > --- a/mm/mprotect.c > > > +++ b/mm/mprotect.c > > > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > > > oldpte = *pte; > > > if (pte_present(oldpte)) { > > > pte_t ptent; > > > - bool preserve_write = prot_numa && pte_write(oldpte); > > > + bool preserve_write = (prot_numa || uffd_wp_resolve) && > > > + pte_write(oldpte); > > > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > > v1 about simply ignore the change if necessary; unluckily that seems to be > > ignored.. so I'll try again - would below be slightly better? > > > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > > continue; > > I posted the exact same code before seeing the above so I take it as a good > sign :). I'd suggest to add the reverse check to the uffd_wp too. Agreed. I didn't mention uffd_wp check (which I actually mentioned in the reply to v1 patchset) here only because the uffd_wp check is pure optimization; while the uffd_wp_resolve check is more critical because it is potentially a fix of similar tlb flushing issue where we could have demoted the pte without being noticed, so I think it's indeed more important as Nadav wanted to fix in the same patch. It would be even nicer if we have both covered (all of them can be in unlikely() as Andrea suggested in the other email), then maybe nicer as a standalone patch, then mention about the difference of the two in the commit log (majorly, the resolving change will be more than optimization). Thanks,
On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote: > I would feel more comfortable if you provide patches for uffd-wp. If you > want, I will do it, but I restate that I do not feel comfortable with this > solution (worried as it seems a bit ad-hoc and might leave out a scenario > we all missed or cause a TLB shootdown storm). > > As for soft-dirty, I thought that you said that you do not see a better > (backportable) solution for soft-dirty. Correct me if I am wrong. I think they should use the same technique, since they deal with the exact same challenge. I will try to cleanup the patch in the meantime. I can also try to do the additional cleanups to clear_refs to eliminate the tlb_gather completely since it doesn't gather any page and it has no point in using it. > Anyhow, I will add your comments regarding the stale TLB window to make the > description clearer. Having the mmap_write_lock solution as backup won't hurt, but I think it's only for planB if planA doesn't work and the only stable tree that will have to apply this is v5.9.x. All previous don't need any change in this respect. So there's no worry of rejects. It worked by luck until Aug 2020, but it did so reliably or somebody would have noticed already. And it's not exploitable either, it just works stable, but it was prone to break if the kernel changed in some other way, and it eventually changed in Aug 2020 when an unrelated patch happened to the reuse logic. If you want to maintain the mmap_write_lock patch if you could drop the preserved_write and adjust the Fixes to target Aug 2020 it'd be ideal. The uffd-wp needs a different optimization that maybe Peter is already working on or I can include in the patchset for this, but definitely in a separate commit because it's orthogonal. It's great you noticed the W->RO transition of un-wprotect so we can optimize that too (it will have a positive runtime effect, it's not just theoretical since it's normal to unwrprotect a huge range once the postcopy snapshotting of the virtual machine is complete), I was thinking at the previous case discussed in the other thread. I just don't like to slow down a feature required in the future for implementing postcopy live snapshotting or other snapshots to userland processes (for the non-KVM case, also unprivileged by default if using bounce buffers to feed the syscalls) that can be used by open source hypervisors to beat proprietary hypervisors like vmware. The security concern of uffd-wp that allows to enlarge the window of use-after-free kernel bugs, is not as a concern as it is for regular processes. First the jailer model can obtain the uffd before dropping all caps and before firing up seccomp in the child, so it won't even require to lift the unprivileged_userfaultfd in the superior and cleaner monolithic jailer model. If the uffd and uffd-wp can only run in rust-vmm and qemu, that userland is system software to be trusted as the kernel from the guest point of view. It's similar to fuse, if somebody gets into the fuse process it can also stop the kernel initiated faults. From that respect fuse is also system software despite it runs in userland. In other words I think if there's a vm-escape that takes control of rust-vmm userland, the last worry is the fact it can stop kernel initiated page faults because the jailer took an uffd before drop privs. Thanks, Andrea
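The jailer model mentioned above (obtain the uffd while still
privileged, then drop capabilities and install seccomp before running
untrusted code) would look roughly like the userspace sketch below;
error handling and the sandboxing helpers are elided and hypothetical.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static int acquire_uffd_then_sandbox(void)
{
        /* 1) Obtain the uffd while the jailer still has its privileges. */
        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
        };

        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0)
                return -1;

        /*
         * 2) Only now drop all caps and load the seccomp filter; the
         *    already-open uffd keeps working inside the jail.
         */
        /* drop_all_caps(); install_seccomp_filter(); -- jailer specific */

        return uffd;
}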
On Tue, Jan 05, 2021 at 01:41:34PM -0500, Peter Xu wrote: > Agreed. I didn't mention uffd_wp check (which I actually mentioned in the reply > to v1 patchset) here only because the uffd_wp check is pure optimization; while Agreed it's a pure optimization. Only if we used the group lock to fix this (which we didn't since it wouldn't help clear_refs to avoid the performance regression), the optimization would have become not an optimization anymore. > the uffd_wp_resolve check is more critical because it is potentially a fix of > similar tlb flushing issue where we could have demoted the pte without being > noticed, so I think it's indeed more important as Nadav wanted to fix in the > same patch. I didn't get why that was touched in the same patch, I already suggested to remove that optimization... > It would be even nicer if we have both covered (all of them can be in > unlikely() as Andrea suggested in the other email), then maybe nicer as a > standalone patch, then mention about the difference of the two in the commit > log (majorly, the resolving change will be more than optimization). Yes, if you want to go ahead optimizing both cases of the UFFDIO_WRITEPROTECT, I don't think there's any dependency on this. The huge_memory.c also needs covering but I didn't look at it, hopefully the code will result as clean as in the pte case. I'll try to cleanup the tlb flush in the meantime to see if it look maintainable after the cleanups. Then we can change it to wait_pending_flush(); return VM_FAULT_RETRY model if we want to or if the IPI is slower, at least clear_refs will still not block on random pagein or swapin from disk, but only anon memory write access will block while clear_refs run. Thanks, Andrea
> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote: >> I would feel more comfortable if you provide patches for uffd-wp. If you >> want, I will do it, but I restate that I do not feel comfortable with this >> solution (worried as it seems a bit ad-hoc and might leave out a scenario >> we all missed or cause a TLB shootdown storm). >> >> As for soft-dirty, I thought that you said that you do not see a better >> (backportable) solution for soft-dirty. Correct me if I am wrong. > > I think they should use the same technique, since they deal with the > exact same challenge. I will try to cleanup the patch in the meantime. > > I can also try to do the additional cleanups to clear_refs to > eliminate the tlb_gather completely since it doesn't gather any page > and it has no point in using it. > >> Anyhow, I will add your comments regarding the stale TLB window to make the >> description clearer. > > Having the mmap_write_lock solution as backup won't hurt, but I think > it's only for planB if planA doesn't work and the only stable tree > that will have to apply this is v5.9.x. All previous don't need any > change in this respect. So there's no worry of rejects. > > It worked by luck until Aug 2020, but it did so reliably or somebody > would have noticed already. And it's not exploitable either, it just > works stable, but it was prone to break if the kernel changed in some > other way, and it eventually changed in Aug 2020 when an unrelated > patch happened to the reuse logic. > > If you want to maintain the mmap_write_lock patch if you could drop > the preserved_write and adjust the Fixes to target Aug 2020 it'd be > ideal. The uffd-wp needs a different optimization that maybe Peter is > already working on or I can include in the patchset for this, but > definitely in a separate commit because it's orthogonal. > > It's great you noticed the W->RO transition of un-wprotect so we can > optimize that too (it will have a positive runtime effect, it's not > just theoretical since it's normal to unwrprotect a huge range once > the postcopy snapshotting of the virtual machine is complete), I was > thinking at the previous case discussed in the other thread. Understood. I will separate it to a different patch and use your version. I am sorry that I missed Peter Xu feedback for that. As I understand that this will not be backported, I will see if I can get rid of the TLB flush and the inc_tlb_flush_pending() for write-unprotect case as well (which I think I mentioned before). > > I just don't like to slow down a feature required in the future for > implementing postcopy live snapshotting or other snapshots to userland > processes (for the non-KVM case, also unprivileged by default if using > bounce buffers to feed the syscalls) that can be used by open source > hypervisors to beat proprietary hypervisors like vmware. Ouch, that’s uncalled for. I am sure that you understand that I have no hidden agenda and we all have the same goal. Regards, Nadav
> On Jan 5, 2021, at 7:08 AM, Peter Xu <peterx@redhat.com> wrote: > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >> diff --git a/mm/mprotect.c b/mm/mprotect.c >> index ab709023e9aa..c08c4055b051 100644 >> --- a/mm/mprotect.c >> +++ b/mm/mprotect.c >> @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, >> oldpte = *pte; >> if (pte_present(oldpte)) { >> pte_t ptent; >> - bool preserve_write = prot_numa && pte_write(oldpte); >> + bool preserve_write = (prot_numa || uffd_wp_resolve) && >> + pte_write(oldpte); > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > v1 about simply ignore the change if necessary; unluckily that seems to be > ignored.. so I'll try again - would below be slightly better? > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > continue; > > Firstly, current patch is confusing at least to me, because "uffd_wp_resolve" > means "unprotect the pte", whose write bit should mostly be cleared already > when uffd_wp_resolve is applicable. Then "preserve_write" for that pte looks > odd already. > > Meanwhile, if that really happens (when pte write bit set, but during a > uffd_wp_resolve request) imho there is really nothing we can do, so we should > simply avoid touching that at all, and also avoid ptep_modify_prot_start, > pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost. Sorry for missing your feedback before. What you suggest makes perfect sense.
On Tue, Jan 05, 2021 at 07:07:51PM +0000, Nadav Amit wrote: > > On Jan 5, 2021, at 7:08 AM, Peter Xu <peterx@redhat.com> wrote: > > > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > >> diff --git a/mm/mprotect.c b/mm/mprotect.c > >> index ab709023e9aa..c08c4055b051 100644 > >> --- a/mm/mprotect.c > >> +++ b/mm/mprotect.c > >> @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > >> oldpte = *pte; > >> if (pte_present(oldpte)) { > >> pte_t ptent; > >> - bool preserve_write = prot_numa && pte_write(oldpte); > >> + bool preserve_write = (prot_numa || uffd_wp_resolve) && > >> + pte_write(oldpte); > > > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > > v1 about simply ignore the change if necessary; unluckily that seems to be > > ignored.. so I'll try again - would below be slightly better? > > > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > > continue; > > > > Firstly, current patch is confusing at least to me, because "uffd_wp_resolve" > > means "unprotect the pte", whose write bit should mostly be cleared already > > when uffd_wp_resolve is applicable. Then "preserve_write" for that pte looks > > odd already. > > > > Meanwhile, if that really happens (when pte write bit set, but during a > > uffd_wp_resolve request) imho there is really nothing we can do, so we should > > simply avoid touching that at all, and also avoid ptep_modify_prot_start, > > pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost. > > Sorry for missing your feedback before. What you suggest makes perfect > sense. No problem. I actually appreciated a lot for all your great works on these. The strange thing is the userfaultfd kselftest seems to be working always fine locally to me (probably another reason that I mostly test uffd-wp with umapsort), so I won't be able to reproduce some issue you (and Andrea) have encountered. It's great you unveiled all these hard tlb problems and nailed them down so lives should be easier for all of us. Thanks,
On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote: > > On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > I just don't like to slow down a feature required in the future for > > implementing postcopy live snapshotting or other snapshots to userland > > processes (for the non-KVM case, also unprivileged by default if using > > bounce buffers to feed the syscalls) that can be used by open source > > hypervisors to beat proprietary hypervisors like vmware. > > Ouch, that’s uncalled for. I am sure that you understand that I have no > hidden agenda and we all have the same goal. Ehm I never said you had an hidden agenda, so I'm not exactly why you're accusing me of something I never said. I merely pointed out one relevant justification for increasing kernel complexity here by not slowing down clear_refs longstanding O(N) clear_refs/softdirty feature and the recent uffd-wp O(1) feature, is to be more competitive with proprietary software solutions, since at least for uffd-wp, postcopy live snapshotting that the #1 use case. I never questioned your contribution or your preference, to be even more explicit, it never crossed my mind that you have an hidden agenda. However since everyone already acked your patches and I'm not ok with your patches because they will give a hit to KVM postcopy live snapshotting and other container clear_refs users, I have to justify why I NAK your patches and remaining competitive with proprietary hypervisors is one of them, so I don't see what is wrong to state a tangible end goal here. Thanks, Andrea
> On Jan 5, 2021, at 11:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote: >>> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: >>> I just don't like to slow down a feature required in the future for >>> implementing postcopy live snapshotting or other snapshots to userland >>> processes (for the non-KVM case, also unprivileged by default if using >>> bounce buffers to feed the syscalls) that can be used by open source >>> hypervisors to beat proprietary hypervisors like vmware. >> >> Ouch, that’s uncalled for. I am sure that you understand that I have no >> hidden agenda and we all have the same goal. > > Ehm I never said you had an hidden agenda, so I'm not exactly why > you're accusing me of something I never said. > > I merely pointed out one relevant justification for increasing kernel > complexity here by not slowing down clear_refs longstanding O(N) > clear_refs/softdirty feature and the recent uffd-wp O(1) feature, is > to be more competitive with proprietary software solutions, since > at least for uffd-wp, postcopy live snapshotting that the #1 use > case. > > I never questioned your contribution or your preference, to be even > more explicit, it never crossed my mind that you have an hidden > agenda. > > However since everyone already acked your patches and I'm not ok with > your patches because they will give a hit to KVM postcopy live > snapshotting and other container clear_refs users, I have to justify > why I NAK your patches and remaining competitive with proprietary > hypervisors is one of them, so I don't see what is wrong to state a > tangible end goal here. I fully understand your objection to my patches and it is a valid objection, which I will address. I just thought that there might be some insinuation, as you mentioned VMware by name. My response was half-jokingly and should have had a smiley to prevent you from wasting your time on the explanation.
On Tue, Jan 05, 2021 at 08:06:22PM +0000, Nadav Amit wrote: > I just thought that there might be some insinuation, as you mentioned VMware > by name. My response was half-jokingly and should have had a smiley to > prevent you from wasting your time on the explanation. No problem, actually I appreciate you pointed out to give me the extra opportunity to further clarify I wasn't implying anything like that, sorry again for any confusion I may have generated. I mentioned vmware because I'd be shocked if for the whole duration of the wrprotect on the guest physical memory it'd have to halt all minor faults and all memory freeing like it would happen to rust-vmm/qemu if we take the mmap_write_lock, that's all. Or am I wrong about this? For uffd-wp avoiding the mmap_write_lock isn't an immediate concern (obviously so in the rust-vmm case which won't even do postcopy live migration), but the above concern applies for the long term and maybe mid term for qemu. The postcopy live snapshoitting was the #1 use case so it's hard not to mention it, but there's still other interesting userland use cases of uffd-wp with various users already testing it in their apps, that may ultimately become more prevalent, who knows. The point is that those that will experiment with uffd-wp will run a benchmark, post a blog, others will see the blog, they will test too in their app and post their blog. It needs to deliver the full acceleration immediately, otherwise the evaluation may show it as a fail or not worth it. In theory we could just say, we'll optimize it later if significant userbase emerge, but in my view it's bit of a chicken egg problem and I'm afraid that such theory may not work well in practice. Still, for the initial fix, avoiding the mmap_write_lock seems more important actually for clear_refs than for uffd-wp. uffd-wp is somewhat lucky and will just share any solution to keep clear_refs scalable, since the issue is identical. Thanks, Andrea
On Tue, Jan 05, 2021 at 04:06:27PM -0500, Andrea Arcangeli wrote: > The postcopy live snapshoitting was the #1 use case so it's hard not > to mention it, but there's still other interesting userland use cases > of uffd-wp with various users already testing it in their apps, that > may ultimately become more prevalent, who knows. That's true. AFAIU umap [1] uses uffd-wp for their computings already. I didn't really measure how far it can go, but currently the library is highly concurrent, for example, there're quite a few macros that can tune the parallelism of the library [2]: UMAP_PAGE_FILLERS This is the number of worker threads that will perform read operations from the backing store (including read-ahead) for a specific umap region. UMAP_PAGE_EVICTORS This is the number of worker threads that will perform evictions of pages. Eviction includes writing to the backing store if the page is dirty and telling the operating system that the page is no longer needed. The write lock means at least all the evictor threads will be serialized, immediately makes UMAP_PAGE_EVICTORS meaningless... not to mention all the rest of read lock takers (filler threads, worker threads, etc.). So if it happens, I bet LLNL will suddenly observe a drastic drop after upgrading the kernel.. I don't know why umap didn't hit the tlb issue already. It seems to me that issues may only trigger with COW right after a stalled tlb so COW is the only one affected (or, is it?) while umap may not use cow that lot by accident. But I could be completely wrong on that. [1] https://github.com/LLNL/umap [2] https://llnl-umap.readthedocs.io/en/develop/environment_variables.html
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..c08c4055b051 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..7423808640ef 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -652,7 +652,15 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	/* Does the address range wrap, or is the span zero-sized? */
 	BUG_ON(start + len <= start);
 
-	mmap_read_lock(dst_mm);
+	/*
+	 * Although we do not change the VMA, we have to ensure deferred TLB
+	 * flushes are performed before page-faults can be handled. Otherwise
+	 * we can get inconsistent TLB state.
+	 */
+	if (enable_wp)
+		mmap_write_lock(dst_mm);
+	else
+		mmap_read_lock(dst_mm);
 
 	/*
 	 * If memory mappings are changing because of non-cooperative
@@ -686,6 +694,9 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 
 	err = 0;
 out_unlock:
-	mmap_read_unlock(dst_mm);
+	if (enable_wp)
+		mmap_write_unlock(dst_mm);
+	else
+		mmap_read_unlock(dst_mm);
 	return err;
 }
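For context on when each branch is taken: mwriteprotect_range() is
reached from the UFFDIO_WRITEPROTECT ioctl, and enable_wp reflects
whether userspace set UFFDIO_WRITEPROTECT_MODE_WP, so only the protect
direction takes mmap_lock for writing. A minimal userspace sketch of
both directions (the wrapper is hypothetical, the uffd API is standard):

#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int uffd_wp_range(int uffd, void *addr, size_t len, int protect)
{
        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)addr, .len = len },
                /* WP mode set => enable_wp is true in the kernel */
                .mode  = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
        };

        return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}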