| Message ID | 20220623234944.141869-1-pcc@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: arm64: permit MAP_SHARED mappings with MTE enabled |
+ Steven as he added the KVM and swap support for MTE.

On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> depend on being able to map guest memory as MAP_SHARED. The current
> restriction on sharing MAP_SHARED pages with the guest is preventing
> the use of those features with MTE. Therefore, remove this restriction.

We already have some corner cases where the PG_mte_tagged logic fails
even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
KVM MAP_SHARED will potentially make things worse (or hard to reason
about; for example the VMM sets PROT_MTE as well). I'm more inclined to
get rid of PG_mte_tagged altogether, always zero (or restore) the tags
on user page allocation, copy them on write. For swap we can scan the
tags and, if they are all 0, just skip saving them.

Another aspect is a change in the KVM ABI with this patch. It's probably
not that bad since it's rather a relaxation but it has the potential to
confuse the VMM, especially as it doesn't know whether it's running on
older kernels or not (it would have to probe unless we expose this info
to the VMM in some other way).

> To avoid races between multiple tasks attempting to clear tags on the
> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it
> atomically before beginning to clear tags on a page. If the flag was not
> initially set, spin until the other task has finished clearing the tags.

TBH, I can't mentally model all the corner cases, so maybe a formal
model would help (I can have a go with TLA+, though I'm not sure when
I'll find a bit of time this summer). If we get rid of PG_mte_tagged
altogether, this would simplify things (hopefully).

As you noticed, the problem is that setting PG_mte_tagged and clearing
(or restoring) the tags is not an atomic operation. There are places
like mprotect() + CoW where one task can end up with stale tags. Another
is shared memfd mappings if more than one mapping sets PROT_MTE and
there's the swap restoring on top.

> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index f6b00743c399..8f9655053a9f 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
>   	 * the new page->flags are visible before the tags were updated.
>   	 */
>   	smp_wmb();
> -	mte_clear_page_tags(page_address(page));
> +	mte_ensure_page_tags_cleared(page);
> +}
> +
> +void mte_ensure_page_tags_cleared(struct page *page)
> +{
> +	if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
> +		while (!test_bit(PG_mte_tagged, &page->flags))
> +			;
> +	} else {
> +		mte_clear_page_tags(page_address(page));
> +		set_bit(PG_mte_tagged, &page->flags);
> +	}
>  }

mte_sync_tags() already sets PG_mte_tagged prior to clearing the page
tags. The reason was so that multiple concurrent set_pte_at() calls
would not all rush to clear (or restore) the tags. But we do have the
risk of one thread accessing the page with the stale tags
(copy_user_highpage() is worse as the tags would be wrong in the
destination page). I'd rather be consistent everywhere with how we set
the flags.

However, I find it easier to reason about if we used the new flag as a
lock. IOW, if PG_mte_tagged is set, we know that tags are valid. If not
set, take the PG_mte_locked flag, check PG_mte_tagged again and
clear/restore the tags, followed by setting PG_mte_tagged (and you can
use test_and_set_bit_lock() for the acquire semantics).
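
For illustration, the lock-bit scheme described above could look roughly
like the sketch below. PG_mte_locked is the proposed flag (it does not
exist yet) and the function name is made up for the example; this is a
sketch, not the eventual implementation.

static void mte_ensure_page_tags(struct page *page)
{
        /* Fast path: tags already valid; pairs with the smp_wmb() below. */
        if (test_bit(PG_mte_tagged, &page->flags)) {
                smp_rmb();
                return;
        }

        /* test_and_set_bit_lock() gives acquire semantics on the lock bit. */
        while (test_and_set_bit_lock(PG_mte_locked, &page->flags))
                cpu_relax();

        /* Re-check under the lock before clearing/restoring the tags. */
        if (!test_bit(PG_mte_tagged, &page->flags)) {
                mte_clear_page_tags(page_address(page));
                smp_wmb();      /* tags visible before the flag */
                set_bit(PG_mte_tagged, &page->flags);
        }

        clear_bit_unlock(PG_mte_locked, &page->flags);
}
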
It would be interesting to benchmark the cost of always zeroing the tags
on allocation and copy when MTE is not in use:

diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index 0dea80bf6de4..d31708886bf9 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -21,7 +21,7 @@ void copy_highpage(struct page *to, struct page *from)
 
 	copy_page(kto, kfrom);
 
-	if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) {
+	if (system_supports_mte()) {
 		set_bit(PG_mte_tagged, &to->flags);
 		page_kasan_tag_reset(to);
 		/*
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c5e11768e5c1..b42cad9b9349 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -913,12 +913,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
 {
 	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
 
-	/*
-	 * If the page is mapped with PROT_MTE, initialise the tags at the
-	 * point of allocation and page zeroing as this is usually faster than
-	 * separate DC ZVA and STGM.
-	 */
-	if (vma->vm_flags & VM_MTE)
+	if (system_supports_mte())
 		flags |= __GFP_ZEROTAGS;
 
 	return alloc_page_vma(flags, vma, vaddr);

If that's negligible, we can hopefully get rid of PG_mte_tagged. For
swap we could move the restoring to arch_do_swap_page() (but move the
call one line above set_pte_at() in do_swap_page()).
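
A rough userspace proxy for that benchmark is sketched below: it times
first-touch faults on anonymous memory with and without PROT_MTE on an
MTE-capable system. This only approximates the kernel-side DC ZVA vs
DC GZVA difference and assumes 4K pages.

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#ifndef PROT_MTE
#define PROT_MTE 0x20                   /* arch/arm64/include/uapi/asm/mman.h */
#endif
#define PAGE_SZ 4096                    /* assumes 4K pages */

static double touch_pages(int prot, size_t npages)
{
        size_t len = npages * PAGE_SZ;
        struct timespec t0, t1;
        char *p = mmap(NULL, len, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return -1.0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < len; i += PAGE_SZ)
                p[i] = 1;               /* fault in: zero (and, with PROT_MTE, tag) each page */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        munmap(p, len);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
        size_t npages = 1 << 16;        /* 256MB worth of 4K pages */

        printf("plain   : %.3fs\n", touch_pages(PROT_READ | PROT_WRITE, npages));
        printf("PROT_MTE: %.3fs\n", touch_pages(PROT_READ | PROT_WRITE | PROT_MTE, npages));
        return 0;
}
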
On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > + Steven as he added the KVM and swap support for MTE. > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that > > depend on being able to map guest memory as MAP_SHARED. The current > > restriction on sharing MAP_SHARED pages with the guest is preventing > > the use of those features with MTE. Therefore, remove this restriction. > > We already have some corner cases where the PG_mte_tagged logic fails > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for > KVM MAP_SHARED will potentially make things worse (or hard to reason > about; for example the VMM sets PROT_MTE as well). I'm more inclined to > get rid of PG_mte_tagged altogether, always zero (or restore) the tags > on user page allocation, copy them on write. For swap we can scan and if > all tags are 0 and just skip saving them. A problem with this approach is that it would conflict with any potential future changes that we might make that would require the kernel to avoid modifying the tags for non-PROT_MTE pages. Thinking about this some more, another idea that I had was to only allow MAP_SHARED mappings in a guest with MTE enabled if the mapping is PROT_MTE and there are no non-PROT_MTE aliases. For anonymous mappings I don't think it's possible to create a non-PROT_MTE alias in another mm (since you can't turn off PROT_MTE with mprotect), and for memfd maybe we could introduce a flag that requires PROT_MTE on all mappings. That way, we are guaranteed that either the page has been tagged prior to fault or we have exclusive access to it so it can be tagged on demand without racing. Let me see what effect that has on crosvm. > Another aspect is a change in the KVM ABI with this patch. It's probably > not that bad since it's rather a relaxation but it has the potential to > confuse the VMM, especially as it doesn't know whether it's running on > older kernels or not (it would have to probe unless we expose this info > to the VMM in some other way). > > > To avoid races between multiple tasks attempting to clear tags on the > > same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it > > atomically before beginning to clear tags on a page. If the flag was not > > initially set, spin until the other task has finished clearing the tags. > > TBH, I can't mentally model all the corner cases, so maybe a formal > model would help (I can have a go with TLA+, though not sure when I find > a bit of time this summer). If we get rid of PG_mte_tagged altogether, > this would simplify things (hopefully). > > As you noticed, the problem is that setting PG_mte_tagged and clearing > (or restoring) the tags is not an atomic operation. There are places > like mprotect() + CoW where one task can end up with stale tags. Another > is shared memfd mappings if more than one mapping sets PROT_MTE and > there's the swap restoring on top. > > > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c > > index f6b00743c399..8f9655053a9f 100644 > > --- a/arch/arm64/kernel/mte.c > > +++ b/arch/arm64/kernel/mte.c > > @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte, > > * the new page->flags are visible before the tags were updated. 
> > */ > > smp_wmb(); > > - mte_clear_page_tags(page_address(page)); > > + mte_ensure_page_tags_cleared(page); > > +} > > + > > +void mte_ensure_page_tags_cleared(struct page *page) > > +{ > > + if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) { > > + while (!test_bit(PG_mte_tagged, &page->flags)) > > + ; > > + } else { > > + mte_clear_page_tags(page_address(page)); > > + set_bit(PG_mte_tagged, &page->flags); > > + } > > } > > mte_sync_tags() already sets PG_mte_tagged prior to clearing the page > tags. The reason was so that multiple concurrent set_pte_at() would not > all rush to clear (or restore) the tags. But we do have the risk of one > thread accessing the page with the stale tags (copy_user_highpage() is > worse as the tags would be wrong in the destination page). I'd rather be > consistent everywhere with how we set the flags. > > However, I find it easier to reason about if we used the new flag as a > lock. IOW, if PG_mte_tagged is set, we know that tags are valid. If not > set, take the PG_mte_locked flag, check PG_mte_tagged again and > clear/restore the tags followed by PG_mte_tagged (and you can use > test_and_set_bit_lock() for the acquire semantics). Okay, I can look at doing it that way as well. > It would be interesting to benchmark the cost of always zeroing the tags > on allocation and copy when MTE is not in use: > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c > index 0dea80bf6de4..d31708886bf9 100644 > --- a/arch/arm64/mm/copypage.c > +++ b/arch/arm64/mm/copypage.c > @@ -21,7 +21,7 @@ void copy_highpage(struct page *to, struct page *from) > > copy_page(kto, kfrom); > > - if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { > + if (system_supports_mte()) { > set_bit(PG_mte_tagged, &to->flags); > page_kasan_tag_reset(to); > /* > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index c5e11768e5c1..b42cad9b9349 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -913,12 +913,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma, > { > gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; > > - /* > - * If the page is mapped with PROT_MTE, initialise the tags at the > - * point of allocation and page zeroing as this is usually faster than > - * separate DC ZVA and STGM. > - */ > - if (vma->vm_flags & VM_MTE) > + if (system_supports_mte()) > flags |= __GFP_ZEROTAGS; > > return alloc_page_vma(flags, vma, vaddr); > > If that's negligible, we can hopefully get rid of PG_mte_tagged. For > swap we could move the restoring to arch_do_swap_page() (but move the > call one line above set_pte_at() in do_swap_page()). I could experiment with this but I'm hesistant given the potential to cut off potential future changes. Peter
On 24/06/2022 18:05, Catalin Marinas wrote: > + Steven as he added the KVM and swap support for MTE. > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: >> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that >> depend on being able to map guest memory as MAP_SHARED. The current >> restriction on sharing MAP_SHARED pages with the guest is preventing >> the use of those features with MTE. Therefore, remove this restriction. > > We already have some corner cases where the PG_mte_tagged logic fails > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for > KVM MAP_SHARED will potentially make things worse (or hard to reason > about; for example the VMM sets PROT_MTE as well). I'm more inclined to > get rid of PG_mte_tagged altogether, always zero (or restore) the tags > on user page allocation, copy them on write. For swap we can scan and if > all tags are 0 and just skip saving them. > > Another aspect is a change in the KVM ABI with this patch. It's probably > not that bad since it's rather a relaxation but it has the potential to > confuse the VMM, especially as it doesn't know whether it's running on > older kernels or not (it would have to probe unless we expose this info > to the VMM in some other way). > >> To avoid races between multiple tasks attempting to clear tags on the >> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it >> atomically before beginning to clear tags on a page. If the flag was not >> initially set, spin until the other task has finished clearing the tags. > > TBH, I can't mentally model all the corner cases, so maybe a formal > model would help (I can have a go with TLA+, though not sure when I find > a bit of time this summer). If we get rid of PG_mte_tagged altogether, > this would simplify things (hopefully). > > As you noticed, the problem is that setting PG_mte_tagged and clearing > (or restoring) the tags is not an atomic operation. There are places > like mprotect() + CoW where one task can end up with stale tags. Another > is shared memfd mappings if more than one mapping sets PROT_MTE and > there's the swap restoring on top. > >> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c >> index f6b00743c399..8f9655053a9f 100644 >> --- a/arch/arm64/kernel/mte.c >> +++ b/arch/arm64/kernel/mte.c >> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte, >> * the new page->flags are visible before the tags were updated. >> */ >> smp_wmb(); >> - mte_clear_page_tags(page_address(page)); >> + mte_ensure_page_tags_cleared(page); >> +} >> + >> +void mte_ensure_page_tags_cleared(struct page *page) >> +{ >> + if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) { >> + while (!test_bit(PG_mte_tagged, &page->flags)) >> + ; >> + } else { >> + mte_clear_page_tags(page_address(page)); >> + set_bit(PG_mte_tagged, &page->flags); >> + } I'm pretty sure we need some form of barrier in here to ensure the tag clearing is visible to the other CPU. Otherwise I can't immediately see any problems with the approach of a second flag (it was something I had considered). But I do also think we should seriously consider Catalin's approach of simply zeroing tags unconditionally - it would certainly simplify the code. The main issue with unconditionally zeroing is if you then want to (ab)use the tag memory carveout as extra memory for regions that are not being used for tags. 
Personally it seems pretty crazy (a whole lot of extra complexity for a small gain in ram), and I'm not convinced that memory is sufficiently moveable for it to reliably work. >> } > > mte_sync_tags() already sets PG_mte_tagged prior to clearing the page > tags. The reason was so that multiple concurrent set_pte_at() would not > all rush to clear (or restore) the tags. But we do have the risk of one > thread accessing the page with the stale tags (copy_user_highpage() is > worse as the tags would be wrong in the destination page). I'd rather be > consistent everywhere with how we set the flags. > > However, I find it easier to reason about if we used the new flag as a > lock. IOW, if PG_mte_tagged is set, we know that tags are valid. If not > set, take the PG_mte_locked flag, check PG_mte_tagged again and > clear/restore the tags followed by PG_mte_tagged (and you can use > test_and_set_bit_lock() for the acquire semantics). I agree - when I considered this before it was using the second flag as a lock. Steve > > It would be interesting to benchmark the cost of always zeroing the tags > on allocation and copy when MTE is not in use: > > diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c > index 0dea80bf6de4..d31708886bf9 100644 > --- a/arch/arm64/mm/copypage.c > +++ b/arch/arm64/mm/copypage.c > @@ -21,7 +21,7 @@ void copy_highpage(struct page *to, struct page *from) > > copy_page(kto, kfrom); > > - if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { > + if (system_supports_mte()) { > set_bit(PG_mte_tagged, &to->flags); > page_kasan_tag_reset(to); > /* > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index c5e11768e5c1..b42cad9b9349 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -913,12 +913,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma, > { > gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; > > - /* > - * If the page is mapped with PROT_MTE, initialise the tags at the > - * point of allocation and page zeroing as this is usually faster than > - * separate DC ZVA and STGM. > - */ > - if (vma->vm_flags & VM_MTE) > + if (system_supports_mte()) > flags |= __GFP_ZEROTAGS; > > return alloc_page_vma(flags, vma, vaddr); > > If that's negligible, we can hopefully get rid of PG_mte_tagged. For > swap we could move the restoring to arch_do_swap_page() (but move the > call one line above set_pte_at() in do_swap_page()). >
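
For reference, the ordering Steven asks for could be added to the helper
from the patch under discussion roughly as follows; this is a sketch of
where the barriers would go, not a tested fix.

void mte_ensure_page_tags_cleared(struct page *page)
{
        if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
                /* Wait for the other task to publish the flag... */
                while (!test_bit(PG_mte_tagged, &page->flags))
                        cpu_relax();
                /* ...and make its tag writes visible before we use them. */
                smp_rmb();
        } else {
                mte_clear_page_tags(page_address(page));
                /* Cleared tags must be visible before the flag is observed. */
                smp_wmb();
                set_bit(PG_mte_tagged, &page->flags);
        }
}
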
On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote: > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas > <catalin.marinas@arm.com> wrote: > > + Steven as he added the KVM and swap support for MTE. > > > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that > > > depend on being able to map guest memory as MAP_SHARED. The current > > > restriction on sharing MAP_SHARED pages with the guest is preventing > > > the use of those features with MTE. Therefore, remove this restriction. > > > > We already have some corner cases where the PG_mte_tagged logic fails > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for > > KVM MAP_SHARED will potentially make things worse (or hard to reason > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags > > on user page allocation, copy them on write. For swap we can scan and if > > all tags are 0 and just skip saving them. > > A problem with this approach is that it would conflict with any > potential future changes that we might make that would require the > kernel to avoid modifying the tags for non-PROT_MTE pages. Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the vma available where it matters. We can keep PG_mte_tagged around but always set it on page allocation (e.g. when zeroing or CoW) and check VM_MTE_ALLOWED rather than VM_MTE. I'm not sure how Linux can deal with pages that do not support MTE. Currently we only handle this at the vm_flags level. Assuming that we somehow manage to, we can still use PG_mte_tagged to mark the pages that supported tags on allocation (and they have been zeroed or copied). I guess if you want to move a page around, you'd need to go through something like try_to_unmap() (or set all mappings to PROT_NONE like in NUMA migration). Then you can either check whether the page is PROT_MTE anywhere and maybe read the tags to see whether all are zero after unmapping. Deferring tag zeroing/restoring to set_pte_at() can be racy without a lock (or your approach with another flag) but I'm not sure it's worth the extra logic if zeroing or copying the tags doesn't have a significant overhead for non-PROT_MTE pages. Another issue with set_pte_at() is that it can write tags on mprotect() even if the page is mapped read-only. So far I couldn't find a problem with this but it adds to the complexity. > Thinking about this some more, another idea that I had was to only > allow MAP_SHARED mappings in a guest with MTE enabled if the mapping > is PROT_MTE and there are no non-PROT_MTE aliases. For anonymous > mappings I don't think it's possible to create a non-PROT_MTE alias in > another mm (since you can't turn off PROT_MTE with mprotect), and for > memfd maybe we could introduce a flag that requires PROT_MTE on all > mappings. That way, we are guaranteed that either the page has been > tagged prior to fault or we have exclusive access to it so it can be > tagged on demand without racing. Let me see what effect that has on > crosvm. You could still have all initial shared mappings as !PROT_MTE and some mprotect() afterwards setting PG_mte_tagged and clearing the tags and this can race. AFAICT, the easiest way to avoid the race is to set PG_mte_tagged on allocation before it ends up in set_pte_at().
[I'm still in the process of trying to grok the issues surrounding MTE+KVM, so apologies in advance if I'm muddying the waters] On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote: > On 24/06/2022 18:05, Catalin Marinas wrote: >> + Steven as he added the KVM and swap support for MTE. >> >> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: >>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that >>> depend on being able to map guest memory as MAP_SHARED. The current >>> restriction on sharing MAP_SHARED pages with the guest is preventing >>> the use of those features with MTE. Therefore, remove this restriction. >> >> We already have some corner cases where the PG_mte_tagged logic fails >> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for >> KVM MAP_SHARED will potentially make things worse (or hard to reason >> about; for example the VMM sets PROT_MTE as well). I'm more inclined to >> get rid of PG_mte_tagged altogether, always zero (or restore) the tags >> on user page allocation, copy them on write. For swap we can scan and if >> all tags are 0 and just skip saving them. >> >> Another aspect is a change in the KVM ABI with this patch. It's probably >> not that bad since it's rather a relaxation but it has the potential to >> confuse the VMM, especially as it doesn't know whether it's running on >> older kernels or not (it would have to probe unless we expose this info >> to the VMM in some other way). Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.) >> >>> To avoid races between multiple tasks attempting to clear tags on the >>> same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it >>> atomically before beginning to clear tags on a page. If the flag was not >>> initially set, spin until the other task has finished clearing the tags. >> >> TBH, I can't mentally model all the corner cases, so maybe a formal >> model would help (I can have a go with TLA+, though not sure when I find >> a bit of time this summer). If we get rid of PG_mte_tagged altogether, >> this would simplify things (hopefully). >> >> As you noticed, the problem is that setting PG_mte_tagged and clearing >> (or restoring) the tags is not an atomic operation. There are places >> like mprotect() + CoW where one task can end up with stale tags. Another >> is shared memfd mappings if more than one mapping sets PROT_MTE and >> there's the swap restoring on top. >> >>> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c >>> index f6b00743c399..8f9655053a9f 100644 >>> --- a/arch/arm64/kernel/mte.c >>> +++ b/arch/arm64/kernel/mte.c >>> @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte, >>> * the new page->flags are visible before the tags were updated. >>> */ >>> smp_wmb(); >>> - mte_clear_page_tags(page_address(page)); >>> + mte_ensure_page_tags_cleared(page); >>> +} >>> + >>> +void mte_ensure_page_tags_cleared(struct page *page) >>> +{ >>> + if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) { >>> + while (!test_bit(PG_mte_tagged, &page->flags)) >>> + ; >>> + } else { >>> + mte_clear_page_tags(page_address(page)); >>> + set_bit(PG_mte_tagged, &page->flags); >>> + } > > I'm pretty sure we need some form of barrier in here to ensure the tag > clearing is visible to the other CPU. Otherwise I can't immediately see > any problems with the approach of a second flag (it was something I had > considered). 
But I do also think we should seriously consider Catalin's > approach of simply zeroing tags unconditionally - it would certainly > simplify the code. What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end up copying zeroes? That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS will be called? I.e. when implementing migration, it should be ok to call it while the vm is paused, but you probably won't get a consistent state while the vm is running? [Postcopy needs a different interface, I guess, so that the migration target can atomically place a received page and its metadata. I see https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/; has there been any follow-up?]
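
For context, a VMM saving tags at migration time would go through the
existing KVM_ARM_MTE_COPY_TAGS ioctl. A minimal sketch follows, assuming
the VM has KVM_CAP_ARM_MTE enabled, is paused, and uses 4K pages; flag
and structure names are taken from the arm64 KVM uapi header, and error
handling is deliberately minimal.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MTE_GRANULE_SIZE        16
#define PAGE_SZ                 4096

/* Copy the tags for one guest page into tag_buf (PAGE_SZ / 16 bytes). */
static int save_page_tags(int vm_fd, uint64_t guest_ipa, uint8_t *tag_buf)
{
        struct kvm_arm_copy_mte_tags copy = {
                .guest_ipa      = guest_ipa,
                .length         = PAGE_SZ,
                .addr           = tag_buf,      /* one tag byte per 16-byte granule */
                .flags          = KVM_ARM_TAGS_FROM_GUEST,
        };
        long ret = ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy);

        /* Negative means failure; a positive return reports how much was
         * copied (see Documentation/virt/kvm/api.rst for the exact unit). */
        return ret < 0 ? -1 : 0;
}
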
On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote: > > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas > > <catalin.marinas@arm.com> wrote: > > > + Steven as he added the KVM and swap support for MTE. > > > > > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: > > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that > > > > depend on being able to map guest memory as MAP_SHARED. The current > > > > restriction on sharing MAP_SHARED pages with the guest is preventing > > > > the use of those features with MTE. Therefore, remove this restriction. > > > > > > We already have some corner cases where the PG_mte_tagged logic fails > > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for > > > KVM MAP_SHARED will potentially make things worse (or hard to reason > > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to > > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags > > > on user page allocation, copy them on write. For swap we can scan and if > > > all tags are 0 and just skip saving them. > > > > A problem with this approach is that it would conflict with any > > potential future changes that we might make that would require the > > kernel to avoid modifying the tags for non-PROT_MTE pages. > > Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the > vma available where it matters. We can keep PG_mte_tagged around but > always set it on page allocation (e.g. when zeroing or CoW) and check > VM_MTE_ALLOWED rather than VM_MTE. Right, but for avoiding tagging we would like that to apply to as many pages as possible. If we check VM_MTE_ALLOWED then the heap pages of those processes that are not using MTE would not be covered, which on a mostly non-MTE system would be a majority of pages. Over the weekend I thought of another policy, which would be similar to your original one. We can always tag pages which are mapped as MAP_SHARED. These pages are much less common than private pages, so the impact would be less. So the if statement in alloc_zeroed_user_highpage_movable would become: if ((vma->vm_flags & VM_MTE) || (system_supports_mte() && (vma->vm_flags & VM_SHARED))) That would allow us to put basically any shared mapping in the guest address space without needing to deal with races in sanitise_mte_tags. We may consider going further than this and require all pages mapped into guests with MTE enabled to be PROT_MTE. I think it would allow dropping sanitise_mte_tags entirely. This would not be a relaxation of the ABI but perhaps we can get away with it if, as Cornelia mentioned, QEMU does not currently support MTE, and since crosvm doesn't currently support it either there's no userspace to break AFAIK. This would also address a current weirdness in the API where it is possible for the underlying pages of a MAP_SHARED file mapping to become tagged via KVM, said tags are exposed to the guest and are discarded when the underlying page is paged out. We can perhaps accomplish it by dropping support for KVM_CAP_ARM_MTE in the kernel and introducing something like a KVM_CAP_ARM_MTE_V2 with the new restriction. > I'm not sure how Linux can deal with pages that do not support MTE. > Currently we only handle this at the vm_flags level. 
Assuming that we > somehow manage to, we can still use PG_mte_tagged to mark the pages that > supported tags on allocation (and they have been zeroed or copied). I > guess if you want to move a page around, you'd need to go through > something like try_to_unmap() (or set all mappings to PROT_NONE like in > NUMA migration). Then you can either check whether the page is PROT_MTE > anywhere and maybe read the tags to see whether all are zero after > unmapping. > > Deferring tag zeroing/restoring to set_pte_at() can be racy without a > lock (or your approach with another flag) but I'm not sure it's worth > the extra logic if zeroing or copying the tags doesn't have a > significant overhead for non-PROT_MTE pages. > > Another issue with set_pte_at() is that it can write tags on mprotect() > even if the page is mapped read-only. So far I couldn't find a problem > with this but it adds to the complexity. > > > Thinking about this some more, another idea that I had was to only > > allow MAP_SHARED mappings in a guest with MTE enabled if the mapping > > is PROT_MTE and there are no non-PROT_MTE aliases. For anonymous > > mappings I don't think it's possible to create a non-PROT_MTE alias in > > another mm (since you can't turn off PROT_MTE with mprotect), and for > > memfd maybe we could introduce a flag that requires PROT_MTE on all > > mappings. That way, we are guaranteed that either the page has been > > tagged prior to fault or we have exclusive access to it so it can be > > tagged on demand without racing. Let me see what effect that has on > > crosvm. > > You could still have all initial shared mappings as !PROT_MTE and some > mprotect() afterwards setting PG_mte_tagged and clearing the tags and > this can race. AFAICT, the easiest way to avoid the race is to set > PG_mte_tagged on allocation before it ends up in set_pte_at(). For shared pages we will be safe with my new proposal because the pages will always be tagged. For private ones we might need to move pages to an MTE supported region in response to the mprotect, as you discussed above. Peter
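
Concretely, the policy Peter proposes (eagerly tagging PROT_MTE and
shared mappings) would make alloc_zeroed_user_highpage_movable() look
roughly like the sketch below, based on the current body quoted earlier
in the thread; illustrative only.

struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
                                                unsigned long vaddr)
{
        gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;

        /* Tag eagerly for PROT_MTE vmas and for shared mappings. */
        if ((vma->vm_flags & VM_MTE) ||
            (system_supports_mte() && (vma->vm_flags & VM_SHARED)))
                flags |= __GFP_ZEROTAGS;

        return alloc_page_vma(flags, vma, vaddr);
}
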
On Mon, Jun 27, 2022 at 11:16:17AM -0700, Peter Collingbourne wrote: > On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote: > > > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas > > > <catalin.marinas@arm.com> wrote: > > > > + Steven as he added the KVM and swap support for MTE. > > > > > > > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: > > > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that > > > > > depend on being able to map guest memory as MAP_SHARED. The current > > > > > restriction on sharing MAP_SHARED pages with the guest is preventing > > > > > the use of those features with MTE. Therefore, remove this restriction. > > > > > > > > We already have some corner cases where the PG_mte_tagged logic fails > > > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for > > > > KVM MAP_SHARED will potentially make things worse (or hard to reason > > > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to > > > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags > > > > on user page allocation, copy them on write. For swap we can scan and if > > > > all tags are 0 and just skip saving them. > > > > > > A problem with this approach is that it would conflict with any > > > potential future changes that we might make that would require the > > > kernel to avoid modifying the tags for non-PROT_MTE pages. > > > > Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the > > vma available where it matters. We can keep PG_mte_tagged around but > > always set it on page allocation (e.g. when zeroing or CoW) and check > > VM_MTE_ALLOWED rather than VM_MTE. > > Right, but for avoiding tagging we would like that to apply to as many > pages as possible. If we check VM_MTE_ALLOWED then the heap pages of > those processes that are not using MTE would not be covered, which on > a mostly non-MTE system would be a majority of pages. By non-MTE system, I guess you mean a system that supports MTE but most of the user apps don't use it. That's why it would be interesting to see the effect of using DC GZVA instead of DC ZVA for page zeroing. I suspect on Android you'd notice the fork() penalty a bit more with all the copy-on-write having to copy tags. But we can't tell until we do some benchmarks. If the penalty is indeed significant, we'll go back to assessing the races here. Another thing that won't happen for PG_mte_tagged currently is KSM page merging. I had a patch to allow comparing the tags but eventually dropped it (can dig it out). > Over the weekend I thought of another policy, which would be similar > to your original one. We can always tag pages which are mapped as > MAP_SHARED. These pages are much less common than private pages, so > the impact would be less. So the if statement in > alloc_zeroed_user_highpage_movable would become: > > if ((vma->vm_flags & VM_MTE) || (system_supports_mte() && > (vma->vm_flags & VM_SHARED))) > > That would allow us to put basically any shared mapping in the guest > address space without needing to deal with races in sanitise_mte_tags. It's not just about VM_SHARED. A page can be effectively shared as a result of a fork(). It is read-only in all processes but still shared and one task may call mprotect(PROT_MTE). 
Another case of sharing is between the VMM and the guest though I think an mprotect() in the VMM would trigger the unmapping of the guest address and pages mapped into guests already have PG_mte_tagged set. We probably need to draw a state machine of all the cases. AFAICT, we need to take into account a few of the below (it's probably incomplete; I've been with Steven through most of them IIRC): 1. Private mappings with mixed PROT_MTE, CoW sharing and concurrent mprotect(PROT_MTE). That's one of the things I dislike is that a late tag clearing via set_pte_at() can happen without breaking the CoW mapping. It's a bit counter-intuitive if you treat the tags as data (rather than some cache), you don't expect a read-only page to have some (tag) updated. 2. Shared mappings with concurrent mprotect(PROT_MTE). 3. Shared mapping restoring from swap. 4. Private mapping restoring from swap into CoW mapping. 5. KVM faults. 6. Concurrent ptrace accesses (or KVM tag copying) What currently risks failing I think is breaking a CoW mapping with concurrent mprotect(PROT_MTE) - we set PG_mte_tagged before zeroing the tags. A concurrent copy may read stale tags. In sanitise_mte_tags() we do this the other way around - clear tags first and then set the flag. I think using another bit as a lock may solve most (all) of these but another option is to treat the tags as data and make sure they are set before mapping. > We may consider going further than this and require all pages mapped > into guests with MTE enabled to be PROT_MTE. We discussed this when upstreaming KVM support and the idea got pushed back. The main problem is that the VMM may use MTE for itself but can no longer access the guest memory without the risk of taking a fault. We don't have a match-all tag in user space and we can't teach the VMM to use the PSTATE.TCO bit since driver emulation can be fairly generic. And, of course, there's also the ABI change now. > I think it would allow > dropping sanitise_mte_tags entirely. This would not be a relaxation of > the ABI but perhaps we can get away with it if, as Cornelia mentioned, > QEMU does not currently support MTE, and since crosvm doesn't > currently support it either there's no userspace to break AFAIK. This > would also address a current weirdness in the API where it is possible > for the underlying pages of a MAP_SHARED file mapping to become tagged > via KVM, said tags are exposed to the guest and are discarded when the > underlying page is paged out. Ah, good point, shared file mappings is another reason we did not allow MAP_SHARED and MTE for guest memory. BTW, in user_mem_abort() we should probably check for VM_MTE_ALLOWED irrespective of whether we allow MAP_SHARED or not. > We can perhaps accomplish it by dropping > support for KVM_CAP_ARM_MTE in the kernel and introducing something > like a KVM_CAP_ARM_MTE_V2 with the new restriction. That's an option for the ABI upgrade but we still need to solve the potential races.
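
The VM_MTE_ALLOWED check mentioned above could take a form along these
lines; the helper name is made up, and the real check has to run while
user_mem_abort() still has the vma available.

static bool kvm_vma_mte_allowed(struct kvm *kvm, struct vm_area_struct *vma)
{
        /* Only meaningful for MTE-enabled guests. */
        if (!kvm_has_mte(kvm))
                return true;

        /*
         * The backing vma must be able to hold tags (e.g. not a device or
         * non-MTE file mapping), regardless of MAP_SHARED vs MAP_PRIVATE.
         */
        return vma->vm_flags & VM_MTE_ALLOWED;
}
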
On Tue, Jun 28, 2022 at 10:58 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Mon, Jun 27, 2022 at 11:16:17AM -0700, Peter Collingbourne wrote: > > On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > > On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote: > > > > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas > > > > <catalin.marinas@arm.com> wrote: > > > > > + Steven as he added the KVM and swap support for MTE. > > > > > > > > > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: > > > > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that > > > > > > depend on being able to map guest memory as MAP_SHARED. The current > > > > > > restriction on sharing MAP_SHARED pages with the guest is preventing > > > > > > the use of those features with MTE. Therefore, remove this restriction. > > > > > > > > > > We already have some corner cases where the PG_mte_tagged logic fails > > > > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for > > > > > KVM MAP_SHARED will potentially make things worse (or hard to reason > > > > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to > > > > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags > > > > > on user page allocation, copy them on write. For swap we can scan and if > > > > > all tags are 0 and just skip saving them. > > > > > > > > A problem with this approach is that it would conflict with any > > > > potential future changes that we might make that would require the > > > > kernel to avoid modifying the tags for non-PROT_MTE pages. > > > > > > Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the > > > vma available where it matters. We can keep PG_mte_tagged around but > > > always set it on page allocation (e.g. when zeroing or CoW) and check > > > VM_MTE_ALLOWED rather than VM_MTE. > > > > Right, but for avoiding tagging we would like that to apply to as many > > pages as possible. If we check VM_MTE_ALLOWED then the heap pages of > > those processes that are not using MTE would not be covered, which on > > a mostly non-MTE system would be a majority of pages. > > By non-MTE system, I guess you mean a system that supports MTE but most > of the user apps don't use it. Yes, that's what I meant. > That's why it would be interesting to see > the effect of using DC GZVA instead of DC ZVA for page zeroing. > > I suspect on Android you'd notice the fork() penalty a bit more with all > the copy-on-write having to copy tags. But we can't tell until we do > some benchmarks. If the penalty is indeed significant, we'll go back to > assessing the races here. Okay, I can try to measure it. I do feel rather strongly though that we should try to avoid tagging pages as much as possible even ignoring the potential performance implications. Here's one more idea: we can tag pages eagerly as you propose, but introduce an opt-out. For example, we introduce a MAP_NOMTE flag, which would prevent tag initialization as well as causing any future attempt to mprotect(PROT_MTE) to fail. Allocators that know that the memory will not be used for MTE in the future can set this flag. For example, Scudo can start setting this flag once MTE has been disabled as it has no mechanism for turning MTE back on once disabled. And that way we will end up with no tags on the heap in the processes with MTE disabled. 
Mappings with MAP_NOMTE would not be permitted in the guest memory space of MTE enabled guests. For executables mapped by the kernel we may consider adding a bit to the ELF program headers to enable MAP_NOMTE. > Another thing that won't happen for PG_mte_tagged currently is KSM page > merging. I had a patch to allow comparing the tags but eventually > dropped it (can dig it out). > > > Over the weekend I thought of another policy, which would be similar > > to your original one. We can always tag pages which are mapped as > > MAP_SHARED. These pages are much less common than private pages, so > > the impact would be less. So the if statement in > > alloc_zeroed_user_highpage_movable would become: > > > > if ((vma->vm_flags & VM_MTE) || (system_supports_mte() && > > (vma->vm_flags & VM_SHARED))) > > > > That would allow us to put basically any shared mapping in the guest > > address space without needing to deal with races in sanitise_mte_tags. > > It's not just about VM_SHARED. A page can be effectively shared as a > result of a fork(). It is read-only in all processes but still shared > and one task may call mprotect(PROT_MTE). I see, so VM_SHARED isn't that useful for this then. > Another case of sharing is between the VMM and the guest though I think > an mprotect() in the VMM would trigger the unmapping of the guest > address and pages mapped into guests already have PG_mte_tagged set. > > We probably need to draw a state machine of all the cases. AFAICT, we > need to take into account a few of the below (it's probably incomplete; > I've been with Steven through most of them IIRC): > > 1. Private mappings with mixed PROT_MTE, CoW sharing and concurrent > mprotect(PROT_MTE). That's one of the things I dislike is that a late > tag clearing via set_pte_at() can happen without breaking the CoW > mapping. It's a bit counter-intuitive if you treat the tags as data > (rather than some cache), you don't expect a read-only page to have > some (tag) updated. > > 2. Shared mappings with concurrent mprotect(PROT_MTE). > > 3. Shared mapping restoring from swap. > > 4. Private mapping restoring from swap into CoW mapping. > > 5. KVM faults. > > 6. Concurrent ptrace accesses (or KVM tag copying) > > What currently risks failing I think is breaking a CoW mapping with > concurrent mprotect(PROT_MTE) - we set PG_mte_tagged before zeroing the > tags. A concurrent copy may read stale tags. In sanitise_mte_tags() we > do this the other way around - clear tags first and then set the flag. > > I think using another bit as a lock may solve most (all) of these but > another option is to treat the tags as data and make sure they are set > before mapping. > > > We may consider going further than this and require all pages mapped > > into guests with MTE enabled to be PROT_MTE. > > We discussed this when upstreaming KVM support and the idea got pushed > back. The main problem is that the VMM may use MTE for itself but can no > longer access the guest memory without the risk of taking a fault. We > don't have a match-all tag in user space and we can't teach the VMM to > use the PSTATE.TCO bit since driver emulation can be fairly generic. > And, of course, there's also the ABI change now. It sounds like MAP_NOMTE would solve this problem. Since tag initialization becomes independent of the memory access attributes, we don't have a risk of tag check faults in the VMM. > > I think it would allow > > dropping sanitise_mte_tags entirely. 
This would not be a relaxation of > > the ABI but perhaps we can get away with it if, as Cornelia mentioned, > > QEMU does not currently support MTE, and since crosvm doesn't > > currently support it either there's no userspace to break AFAIK. This > > would also address a current weirdness in the API where it is possible > > for the underlying pages of a MAP_SHARED file mapping to become tagged > > via KVM, said tags are exposed to the guest and are discarded when the > > underlying page is paged out. > > Ah, good point, shared file mappings is another reason we did not allow > MAP_SHARED and MTE for guest memory. > > BTW, in user_mem_abort() we should probably check for VM_MTE_ALLOWED > irrespective of whether we allow MAP_SHARED or not. Yes, that sounds like a good idea. If we made MAP_NOMTE clear VM_MTE_ALLOWED then I think that's the only check we would need. > > We can perhaps accomplish it by dropping > > support for KVM_CAP_ARM_MTE in the kernel and introducing something > > like a KVM_CAP_ARM_MTE_V2 with the new restriction. > > That's an option for the ABI upgrade but we still need to solve the > potential races. With MAP_NOMTE I think we wouldn't need an ABI change at all. Peter
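
For illustration, the opt-out Peter describes might be used by an
allocator roughly as below. MAP_NOMTE is hypothetical; the value is a
placeholder, not an ABI.

#include <sys/mman.h>

#ifndef MAP_NOMTE
#define MAP_NOMTE       0x800000        /* placeholder value for the proposed flag */
#endif

/* Once the allocator has permanently disabled MTE, new heap regions could
 * opt out of tag initialisation (and of future mprotect(PROT_MTE)). */
static void *alloc_untagged_region(size_t size)
{
        return mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NOMTE, -1, 0);
}
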
On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote: > [I'm still in the process of trying to grok the issues surrounding > MTE+KVM, so apologies in advance if I'm muddying the waters] No worries, we are not that far ahead either ;). > On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote: > > On 24/06/2022 18:05, Catalin Marinas wrote: > >> + Steven as he added the KVM and swap support for MTE. > >> > >> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: > >>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that > >>> depend on being able to map guest memory as MAP_SHARED. The current > >>> restriction on sharing MAP_SHARED pages with the guest is preventing > >>> the use of those features with MTE. Therefore, remove this restriction. > >> > >> We already have some corner cases where the PG_mte_tagged logic fails > >> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for > >> KVM MAP_SHARED will potentially make things worse (or hard to reason > >> about; for example the VMM sets PROT_MTE as well). I'm more inclined to > >> get rid of PG_mte_tagged altogether, always zero (or restore) the tags > >> on user page allocation, copy them on write. For swap we can scan and if > >> all tags are 0 and just skip saving them. > >> > >> Another aspect is a change in the KVM ABI with this patch. It's probably > >> not that bad since it's rather a relaxation but it has the potential to > >> confuse the VMM, especially as it doesn't know whether it's running on > >> older kernels or not (it would have to probe unless we expose this info > >> to the VMM in some other way). > > Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.) Steven to confirm but I think he only played with kvmtool. Adding Jean-Philippe who also had Qemu on his plans at some point. > What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end > up copying zeroes? Yes. For migration, the VMM could ignore sending over tags that are all zeros or maybe use some simple compression. We don't have a way to disable MTE for guests, so all pages mapped into the guest address space end up with PG_mte_tagged. > That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS > will be called? I.e. when implementing migration, it should be ok to > call it while the vm is paused, but you probably won't get a consistent > state while the vm is running? Wouldn't this be the same as migrating data? The VMM would only copy it after it was marked read-only. BTW, I think sanitise_mte_tags() needs a barrier before setting the PG_mte_tagged() flag (unless we end up with some lock for reading the tags). > [Postcopy needs a different interface, I guess, so that the migration > target can atomically place a received page and its metadata. I see > https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/; > has there been any follow-up?] I don't follow the qemu list, so I wasn't even aware of that thread. But postcopy, the VMM needs to ensure that both the data and tags are up to date before mapping such page into the guest address space.
On Tue, Jun 28, 2022 at 11:54:51AM -0700, Peter Collingbourne wrote: > On Tue, Jun 28, 2022 at 10:58 AM Catalin Marinas > <catalin.marinas@arm.com> wrote: > > That's why it would be interesting to see > > the effect of using DC GZVA instead of DC ZVA for page zeroing. > > > > I suspect on Android you'd notice the fork() penalty a bit more with all > > the copy-on-write having to copy tags. But we can't tell until we do > > some benchmarks. If the penalty is indeed significant, we'll go back to > > assessing the races here. > > Okay, I can try to measure it. I do feel rather strongly though that > we should try to avoid tagging pages as much as possible even ignoring > the potential performance implications. > > Here's one more idea: we can tag pages eagerly as you propose, but > introduce an opt-out. For example, we introduce a MAP_NOMTE flag, > which would prevent tag initialization as well as causing any future > attempt to mprotect(PROT_MTE) to fail. Allocators that know that the > memory will not be used for MTE in the future can set this flag. For > example, Scudo can start setting this flag once MTE has been disabled > as it has no mechanism for turning MTE back on once disabled. And that > way we will end up with no tags on the heap in the processes with MTE > disabled. Mappings with MAP_NOMTE would not be permitted in the guest > memory space of MTE enabled guests. For executables mapped by the > kernel we may consider adding a bit to the ELF program headers to > enable MAP_NOMTE. I don't like such negative flags and we should aim for minimal changes to code that doesn't care about MTE. If there's a performance penalty with zeroing the tags, we'll keep looking at the lazy tag initialisation. In the meantime, I'll think some more about the lazy stuff. We need at least mte_sync_tags() fixed to set the PG_mte_tagged after the tags have been updated (fixes the CoW + mprotect() race but probably breaks concurrent MAP_SHARED mprotect()). We'd have to add some barriers (maybe in a new function, set_page_tagged()). Some cases like restoring from swap (both private and shared) have the page lock held. KVM doesn't seem to take any page lock, so it can race with the VMM. Anyway, I doubt we can get away with a single bit. We can't make use of PG_locked either since set_pte_at() is called with the page either locked or unlocked.
On Wed, Jun 29, 2022 at 08:15:07PM +0100, Catalin Marinas wrote:
> In the meantime, I'll think some more about the lazy stuff. We need at
> least mte_sync_tags() fixed to set the PG_mte_tagged after the tags have
> been updated (fixes the CoW + mprotect() race but probably breaks
> concurrent MAP_SHARED mprotect()). We'd have to add some barriers (maybe
> in a new function, set_page_tagged()). Some cases like restoring from
> swap (both private and shared) have the page lock held. KVM doesn't seem
> to take any page lock, so it can race with the VMM.
>
> Anyway, I doubt we can get away with a single bit. We can't make use of
> PG_locked either since set_pte_at() is called with the page either
> locked or unlocked.

I dug out one of the old threads on MAP_SHARED and KVM:

https://lore.kernel.org/r/20210609174117.GA18459@arm.com

Last year we did not consider another page flag for the lock, only a big
lock or an array of locks, and neither looked appealing. An extra bit in
the page flags scales a lot better, though we tend not to go through
these bits that quickly.

Anyway, I think at least for the current setup, we need something like
below. It can break concurrent mprotect() by zeroing valid tags but
that's a lot less likely to happen than CoW. I'm still digging through
the KVM code; I think it has the mprotect() race already if it's done in
the VMM.

-----8<-----------------------------
From 1f424a52a71df0ff0dafef8a00fe1a8363de63ee Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas@arm.com>
Date: Thu, 30 Jun 2022 10:16:56 +0100
Subject: [PATCH] arm64: mte: Fix/clarify the PG_mte_tagged semantics

Currently the PG_mte_tagged page flag mostly means the page contains
valid tags and it should be set after the tags have been cleared or
restored. However, in mte_sync_tags() it is set before setting the tags
to avoid, in theory, a race with concurrent mprotect(PROT_MTE) for
shared pages. However, a concurrent mprotect(PROT_MTE) with a copy on
write in another thread can cause the new page to have stale tags.
Similarly, tag reading via ptrace() can read stale tags if the
PG_mte_tagged flag is set before actually clearing/restoring the tags.

Fix the PG_mte_tagged semantics so that it is only set after the tags
have been cleared or restored. This is safe for swap restoring into a
MAP_SHARED or CoW page since the core code takes the page lock. Add two
functions to test and set the PG_mte_tagged flag with acquire and
release semantics. The downside is that concurrent mprotect(PROT_MTE) on
a MAP_SHARED page may cause tag loss. This is already the case for KVM
guests if a VMM changes the page protection while the guest triggers a
user_mem_abort().
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Peter Collingbourne <pcc@google.com> --- arch/arm64/include/asm/mte.h | 32 ++++++++++++++++++++++++++++++++ arch/arm64/include/asm/pgtable.h | 2 +- arch/arm64/kernel/cpufeature.c | 4 +++- arch/arm64/kernel/elfcore.c | 2 +- arch/arm64/kernel/hibernate.c | 2 +- arch/arm64/kernel/mte.c | 8 ++++---- arch/arm64/kvm/guest.c | 4 ++-- arch/arm64/kvm/mmu.c | 4 ++-- arch/arm64/mm/copypage.c | 4 ++-- arch/arm64/mm/fault.c | 2 +- arch/arm64/mm/mteswap.c | 2 +- 11 files changed, 50 insertions(+), 16 deletions(-) diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h index aa523591a44e..dcf0e09112b9 100644 --- a/arch/arm64/include/asm/mte.h +++ b/arch/arm64/include/asm/mte.h @@ -37,6 +37,29 @@ void mte_free_tag_storage(char *storage); /* track which pages have valid allocation tags */ #define PG_mte_tagged PG_arch_2 +static inline void set_page_mte_tagged(struct page *page) +{ + /* + * Ensure that the tags written prior to this function are visible + * before the page flags update. + */ + smp_wmb(); + set_bit(PG_mte_tagged, &page->flags); +} + +static inline bool page_mte_tagged(struct page *page) +{ + bool ret = test_bit(PG_mte_tagged, &page->flags); + + /* + * If the page is tagged, ensure ordering with a likely subsequent + * read of the tags. + */ + if (ret) + smp_rmb(); + return ret; +} + void mte_zero_clear_page_tags(void *addr); void mte_sync_tags(pte_t old_pte, pte_t pte); void mte_copy_page_tags(void *kto, const void *kfrom); @@ -54,6 +77,15 @@ size_t mte_probe_user_range(const char __user *uaddr, size_t size); /* unused if !CONFIG_ARM64_MTE, silence the compiler */ #define PG_mte_tagged 0 +static inline set_page_mte_tagged(struct page *page) +{ +} + +static inline bool page_mte_tagged(struct page *page) +{ + return false; +} + static inline void mte_zero_clear_page_tags(void *addr) { } diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 0b6632f18364..08823669db0a 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1034,7 +1034,7 @@ static inline void arch_swap_invalidate_area(int type) static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio) { if (system_supports_mte() && mte_restore_tags(entry, &folio->page)) - set_bit(PG_mte_tagged, &folio->flags); + set_page_mte_tagged(&folio->page); } #endif /* CONFIG_ARM64_MTE */ diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 8d88433de81d..01eda797bd59 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -1964,8 +1964,10 @@ static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap) * Clear the tags in the zero page. This needs to be done via the * linear map which has the Tagged attribute. */ - if (!test_and_set_bit(PG_mte_tagged, &ZERO_PAGE(0)->flags)) + if (page_mte_tagged(ZERO_PAGE(0))) { mte_clear_page_tags(lm_alias(empty_zero_page)); + set_page_mte_tagged(ZERO_PAGE(0)); + } kasan_init_hw_tags_cpu(); } diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c index 98d67444a5b6..f91bb1572d22 100644 --- a/arch/arm64/kernel/elfcore.c +++ b/arch/arm64/kernel/elfcore.c @@ -47,7 +47,7 @@ static int mte_dump_tag_range(struct coredump_params *cprm, * Pages mapped in user space as !pte_access_permitted() (e.g. * PROT_EXEC only) may not have the PG_mte_tagged flag set. 
*/ - if (!test_bit(PG_mte_tagged, &page->flags)) { + if (!page_mte_tagged(page)) { put_page(page); dump_skip(cprm, MTE_PAGE_TAG_STORAGE); continue; diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c index 2e248342476e..018ae7a25737 100644 --- a/arch/arm64/kernel/hibernate.c +++ b/arch/arm64/kernel/hibernate.c @@ -271,7 +271,7 @@ static int swsusp_mte_save_tags(void) if (!page) continue; - if (!test_bit(PG_mte_tagged, &page->flags)) + if (!page_mte_tagged(page)) continue; ret = save_tags(page, pfn); diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c index f6b00743c399..37020353bb35 100644 --- a/arch/arm64/kernel/mte.c +++ b/arch/arm64/kernel/mte.c @@ -58,6 +58,7 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte, */ smp_wmb(); mte_clear_page_tags(page_address(page)); + set_page_mte_tagged(page); } void mte_sync_tags(pte_t old_pte, pte_t pte) @@ -73,7 +74,7 @@ void mte_sync_tags(pte_t old_pte, pte_t pte) /* if PG_mte_tagged is set, tags have already been initialised */ for (i = 0; i < nr_pages; i++, page++) { - if (!test_and_set_bit(PG_mte_tagged, &page->flags)) + if (!page_mte_tagged(page)) mte_sync_page_tags(page, old_pte, check_swap, pte_is_tagged); } @@ -100,8 +101,7 @@ int memcmp_pages(struct page *page1, struct page *page2) * pages is tagged, set_pte_at() may zero or change the tags of the * other page via mte_sync_tags(). */ - if (test_bit(PG_mte_tagged, &page1->flags) || - test_bit(PG_mte_tagged, &page2->flags)) + if (page_mte_tagged(page1) || page_mte_tagged(page2)) return addr1 != addr2; return ret; @@ -407,7 +407,7 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr, put_page(page); break; } - WARN_ON_ONCE(!test_bit(PG_mte_tagged, &page->flags)); + WARN_ON_ONCE(!page_mte_tagged(page)); /* limit access to the end of the page */ offset = offset_in_page(addr); diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c index 8c607199cad1..3b04e69006b4 100644 --- a/arch/arm64/kvm/guest.c +++ b/arch/arm64/kvm/guest.c @@ -1058,7 +1058,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm, maddr = page_address(page); if (!write) { - if (test_bit(PG_mte_tagged, &page->flags)) + if (page_mte_tagged(page)) num_tags = mte_copy_tags_to_user(tags, maddr, MTE_GRANULES_PER_PAGE); else @@ -1075,7 +1075,7 @@ long kvm_vm_ioctl_mte_copy_tags(struct kvm *kvm, * completed fully */ if (num_tags == MTE_GRANULES_PER_PAGE) - set_bit(PG_mte_tagged, &page->flags); + set_page_mte_tagged(page); kvm_release_pfn_dirty(pfn); } diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index f5651a05b6a8..9cfa516452e1 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1075,9 +1075,9 @@ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, return -EFAULT; for (i = 0; i < nr_pages; i++, page++) { - if (!test_bit(PG_mte_tagged, &page->flags)) { + if (!page_mte_tagged(page)) { mte_clear_page_tags(page_address(page)); - set_bit(PG_mte_tagged, &page->flags); + set_page_mte_tagged(page); } } diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c index 0dea80bf6de4..f36d796f1bce 100644 --- a/arch/arm64/mm/copypage.c +++ b/arch/arm64/mm/copypage.c @@ -21,8 +21,7 @@ void copy_highpage(struct page *to, struct page *from) copy_page(kto, kfrom); - if (system_supports_mte() && test_bit(PG_mte_tagged, &from->flags)) { - set_bit(PG_mte_tagged, &to->flags); + if (system_supports_mte() && page_mte_tagged(from)) { page_kasan_tag_reset(to); /* * We need smp_wmb() in between setting the flags and clearing the @@ -33,6 +32,7 @@ 
void copy_highpage(struct page *to, struct page *from) */ smp_wmb(); mte_copy_page_tags(kto, kfrom); + set_page_mte_tagged(to); } } EXPORT_SYMBOL(copy_highpage); diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index c5e11768e5c1..147fe28d3fbe 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -928,5 +928,5 @@ void tag_clear_highpage(struct page *page) { mte_zero_clear_page_tags(page_address(page)); page_kasan_tag_reset(page); - set_bit(PG_mte_tagged, &page->flags); + set_page_mte_tagged(page); } diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c index a9e50e930484..9d3a8cf388fc 100644 --- a/arch/arm64/mm/mteswap.c +++ b/arch/arm64/mm/mteswap.c @@ -24,7 +24,7 @@ int mte_save_tags(struct page *page) { void *tag_storage, *ret; - if (!test_bit(PG_mte_tagged, &page->flags)) + if (!page_mte_tagged(page)) return 0; tag_storage = mte_allocate_tag_storage();
On 29/06/2022 09:45, Catalin Marinas wrote: > On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote: >> [I'm still in the process of trying to grok the issues surrounding >> MTE+KVM, so apologies in advance if I'm muddying the waters] > > No worries, we are not that far ahead either ;). > >> On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote: >>> On 24/06/2022 18:05, Catalin Marinas wrote: >>>> + Steven as he added the KVM and swap support for MTE. >>>> >>>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: >>>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that >>>>> depend on being able to map guest memory as MAP_SHARED. The current >>>>> restriction on sharing MAP_SHARED pages with the guest is preventing >>>>> the use of those features with MTE. Therefore, remove this restriction. >>>> >>>> We already have some corner cases where the PG_mte_tagged logic fails >>>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for >>>> KVM MAP_SHARED will potentially make things worse (or hard to reason >>>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to >>>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags >>>> on user page allocation, copy them on write. For swap we can scan and if >>>> all tags are 0 and just skip saving them. >>>> >>>> Another aspect is a change in the KVM ABI with this patch. It's probably >>>> not that bad since it's rather a relaxation but it has the potential to >>>> confuse the VMM, especially as it doesn't know whether it's running on >>>> older kernels or not (it would have to probe unless we expose this info >>>> to the VMM in some other way). >> >> Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.) > > Steven to confirm but I think he only played with kvmtool. Adding > Jean-Philippe who also had Qemu on his plans at some point. Yes I've only played with kvmtool so far. 'basic support' at the moment is little more than enabling the cap - that allows the guest to access tags. However obviously aspects such as migration need to understand what's going on to correctly save/restore tags - which is mostly only relevant to Qemu. >> What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end >> up copying zeroes? > > Yes. For migration, the VMM could ignore sending over tags that are all > zeros or maybe use some simple compression. We don't have a way to > disable MTE for guests, so all pages mapped into the guest address space > end up with PG_mte_tagged. Architecturally we don't (yet) have a way of describing memory without tags, so indeed you will get all zeros if the guest hasn't populated the tags yet. >> That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS >> will be called? I.e. when implementing migration, it should be ok to >> call it while the vm is paused, but you probably won't get a consistent >> state while the vm is running? > > Wouldn't this be the same as migrating data? The VMM would only copy it > after it was marked read-only. BTW, I think sanitise_mte_tags() needs a > barrier before setting the PG_mte_tagged() flag (unless we end up with > some lock for reading the tags). As Catalin says, tags are no different from data so the VMM needs to either pause the VM or mark the page read-only to protect it from guest updates during the copy. The whole test_bit/set_bit dance does seem to be leaving open race conditions. 
I /think/ that Peter's extra flag as a lock with an added memory barrier is sufficient to mitigate it, but at this stage I'm also thinking some formal modelling would be wise as I don't trust my intuition when it comes to memory barriers. >> [Postcopy needs a different interface, I guess, so that the migration >> target can atomically place a received page and its metadata. I see >> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/; >> has there been any follow-up?] > > I don't follow the qemu list, so I wasn't even aware of that thread. But > postcopy, the VMM needs to ensure that both the data and tags are up to > date before mapping such page into the guest address space. > I'm not sure I see how atomically updating data+tags is different from the existing issues around atomically updating the data. The VMM needs to ensure that the guest doesn't see the page before all the data+all the tags are written. It does mean lazy setting of the tags isn't possible in the VMM, but I'm not sure that's a worthwhile thing anyway. Perhaps I'm missing something? Steve
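[To make the "extra flag as a lock" idea discussed above concrete, here is a minimal sketch of what the tag-initialisation path could look like. The PG_mte_tag_clearing and PG_mte_tagged names come from Peter's patch; the function name, the cpu_relax() spin and the exact barrier placement are assumptions for illustration, not the code as posted.

/* Illustrative sketch only, not the posted patch. */
static void mte_init_page_tags_once(struct page *page)
{
	/* Fast path: tags already initialised; pairs with the smp_wmb() below. */
	if (test_bit(PG_mte_tagged, &page->flags)) {
		smp_rmb();
		return;
	}

	if (!test_and_set_bit(PG_mte_tag_clearing, &page->flags)) {
		/* We won the race: initialise the tags exactly once. */
		mte_clear_page_tags(page_address(page));
		/* Publish the tags before the flag that marks them valid. */
		smp_wmb();
		set_bit(PG_mte_tagged, &page->flags);
	} else {
		/* Another task is clearing: wait for it to publish the flag. */
		while (!test_bit(PG_mte_tagged, &page->flags))
			cpu_relax();
		smp_rmb();
	}
}

The point of the structure is that a task only returns after it has either published the tags itself or observed PG_mte_tagged set by the task that did, so no caller should be able to proceed while stale tags are still visible.]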
On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote: > On 29/06/2022 09:45, Catalin Marinas wrote: >> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote: >>> [I'm still in the process of trying to grok the issues surrounding >>> MTE+KVM, so apologies in advance if I'm muddying the waters] >> >> No worries, we are not that far ahead either ;). >> >>> On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote: >>>> On 24/06/2022 18:05, Catalin Marinas wrote: >>>>> + Steven as he added the KVM and swap support for MTE. >>>>> >>>>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: >>>>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that >>>>>> depend on being able to map guest memory as MAP_SHARED. The current >>>>>> restriction on sharing MAP_SHARED pages with the guest is preventing >>>>>> the use of those features with MTE. Therefore, remove this restriction. >>>>> >>>>> We already have some corner cases where the PG_mte_tagged logic fails >>>>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for >>>>> KVM MAP_SHARED will potentially make things worse (or hard to reason >>>>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to >>>>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags >>>>> on user page allocation, copy them on write. For swap we can scan and if >>>>> all tags are 0 and just skip saving them. >>>>> >>>>> Another aspect is a change in the KVM ABI with this patch. It's probably >>>>> not that bad since it's rather a relaxation but it has the potential to >>>>> confuse the VMM, especially as it doesn't know whether it's running on >>>>> older kernels or not (it would have to probe unless we expose this info >>>>> to the VMM in some other way). >>> >>> Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.) >> >> Steven to confirm but I think he only played with kvmtool. Adding >> Jean-Philippe who also had Qemu on his plans at some point. > > Yes I've only played with kvmtool so far. 'basic support' at the moment > is little more than enabling the cap - that allows the guest to access > tags. However obviously aspects such as migration need to understand > what's going on to correctly save/restore tags - which is mostly only > relevant to Qemu. Yes, simply only enabling the cap seems to work fine in QEMU as well (as in, 'mte selftests work fine'). Migration support is the hard/interesting part. > >>> What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end >>> up copying zeroes? >> >> Yes. For migration, the VMM could ignore sending over tags that are all >> zeros or maybe use some simple compression. We don't have a way to >> disable MTE for guests, so all pages mapped into the guest address space >> end up with PG_mte_tagged. > > Architecturally we don't (yet) have a way of describing memory without > tags, so indeed you will get all zeros if the guest hasn't populated the > tags yet. Nod. > >>> That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS >>> will be called? I.e. when implementing migration, it should be ok to >>> call it while the vm is paused, but you probably won't get a consistent >>> state while the vm is running? >> >> Wouldn't this be the same as migrating data? The VMM would only copy it >> after it was marked read-only. BTW, I think sanitise_mte_tags() needs a >> barrier before setting the PG_mte_tagged() flag (unless we end up with >> some lock for reading the tags). 
> > As Catalin says, tags are no different from data so the VMM needs to > either pause the VM or mark the page read-only to protect it from guest > updates during the copy. Yes, that seems reasonable; not sure whether the documentation should call that out explicitly. > > The whole test_bit/set_bit dance does seem to be leaving open race > conditions. I /think/ that Peter's extra flag as a lock with an added > memory barrier is sufficient to mitigate it, but at this stage I'm also > thinking some formal modelling would be wise as I don't trust my > intuition when it comes to memory barriers. > >>> [Postcopy needs a different interface, I guess, so that the migration >>> target can atomically place a received page and its metadata. I see >>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/; >>> has there been any follow-up?] >> >> I don't follow the qemu list, so I wasn't even aware of that thread. But >> postcopy, the VMM needs to ensure that both the data and tags are up to >> date before mapping such page into the guest address space. >> > > I'm not sure I see how atomically updating data+tags is different from > the existing issues around atomically updating the data. The VMM needs > to ensure that the guest doesn't see the page before all the data+all > the tags are written. It does mean lazy setting of the tags isn't > possible in the VMM, but I'm not sure that's a worthwhile thing anyway. > Perhaps I'm missing something? For postcopy, we basically want to fault in any not-yet-migrated page via uffd once the guest accesses it. We only get the page data that way, though, not the tag. I'm wondering whether we'd need a 'page+metadata' uffd mode; not sure if that makes sense. Otherwise, we'd need to stop the guest while grabbing the tags for the page as well, and stopping is the thing we want to avoid here.
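[As an aside, detecting all-zero tags on the migration source can be done with the existing KVM_ARM_MTE_COPY_TAGS ioctl. A rough userspace sketch follows; it assumes KVM_CAP_ARM_MTE has been enabled on the VM, a 4K guest page size, and the helper name is made up for illustration, with error handling kept minimal.

#include <linux/kvm.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define GUEST_PAGE_SIZE		4096UL
#define MTE_GRANULE_SIZE	16UL
#define TAG_BYTES_PER_PAGE	(GUEST_PAGE_SIZE / MTE_GRANULE_SIZE)	/* 256 */

/* Returns true if the page's tags are all zero and need not be sent. */
static bool page_tags_all_zero(int vm_fd, uint64_t guest_ipa)
{
	uint8_t tags[TAG_BYTES_PER_PAGE];
	struct kvm_arm_copy_mte_tags copy = {
		.guest_ipa	= guest_ipa,		/* MTE-granule aligned */
		.length		= GUEST_PAGE_SIZE,
		.addr		= tags,			/* one byte per 16-byte granule */
		.flags		= KVM_ARM_TAGS_FROM_GUEST,
	};
	unsigned int i;

	if (ioctl(vm_fd, KVM_ARM_MTE_COPY_TAGS, &copy) < 0)
		return false;	/* be conservative and send the tags */

	for (i = 0; i < TAG_BYTES_PER_PAGE; i++)
		if (tags[i])
			return false;

	return true;
}

Whether the VMM then skips or compresses such pages is a policy decision; as with the data copy, the page has to be protected from concurrent guest writes while the tags are read.]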
On 04/07/2022 13:19, Cornelia Huck wrote: > On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote: > >> On 29/06/2022 09:45, Catalin Marinas wrote: >>> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote: >>>> [I'm still in the process of trying to grok the issues surrounding >>>> MTE+KVM, so apologies in advance if I'm muddying the waters] >>> >>> No worries, we are not that far ahead either ;). >>> >>>> On Sat, Jun 25 2022, Steven Price <steven.price@arm.com> wrote: >>>>> On 24/06/2022 18:05, Catalin Marinas wrote: >>>>>> + Steven as he added the KVM and swap support for MTE. >>>>>> >>>>>> On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote: >>>>>>> Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that >>>>>>> depend on being able to map guest memory as MAP_SHARED. The current >>>>>>> restriction on sharing MAP_SHARED pages with the guest is preventing >>>>>>> the use of those features with MTE. Therefore, remove this restriction. >>>>>> >>>>>> We already have some corner cases where the PG_mte_tagged logic fails >>>>>> even for MAP_PRIVATE (but page shared with CoW). Adding this on top for >>>>>> KVM MAP_SHARED will potentially make things worse (or hard to reason >>>>>> about; for example the VMM sets PROT_MTE as well). I'm more inclined to >>>>>> get rid of PG_mte_tagged altogether, always zero (or restore) the tags >>>>>> on user page allocation, copy them on write. For swap we can scan and if >>>>>> all tags are 0 and just skip saving them. >>>>>> >>>>>> Another aspect is a change in the KVM ABI with this patch. It's probably >>>>>> not that bad since it's rather a relaxation but it has the potential to >>>>>> confuse the VMM, especially as it doesn't know whether it's running on >>>>>> older kernels or not (it would have to probe unless we expose this info >>>>>> to the VMM in some other way). >>>> >>>> Which VMMs support KVM+MTE so far? (I'm looking at adding support in QEMU.) >>> >>> Steven to confirm but I think he only played with kvmtool. Adding >>> Jean-Philippe who also had Qemu on his plans at some point. >> >> Yes I've only played with kvmtool so far. 'basic support' at the moment >> is little more than enabling the cap - that allows the guest to access >> tags. However obviously aspects such as migration need to understand >> what's going on to correctly save/restore tags - which is mostly only >> relevant to Qemu. > > Yes, simply only enabling the cap seems to work fine in QEMU as well (as > in, 'mte selftests work fine'). Migration support is the > hard/interesting part. > >> >>>> What happens in kvm_vm_ioctl_mte_copy_tags()? I think we would just end >>>> up copying zeroes? >>> >>> Yes. For migration, the VMM could ignore sending over tags that are all >>> zeros or maybe use some simple compression. We don't have a way to >>> disable MTE for guests, so all pages mapped into the guest address space >>> end up with PG_mte_tagged. >> >> Architecturally we don't (yet) have a way of describing memory without >> tags, so indeed you will get all zeros if the guest hasn't populated the >> tags yet. > > Nod. > >> >>>> That said, do we make any assumptions about when KVM_ARM_MTE_COPY_TAGS >>>> will be called? I.e. when implementing migration, it should be ok to >>>> call it while the vm is paused, but you probably won't get a consistent >>>> state while the vm is running? >>> >>> Wouldn't this be the same as migrating data? The VMM would only copy it >>> after it was marked read-only. 
BTW, I think sanitise_mte_tags() needs a >>> barrier before setting the PG_mte_tagged() flag (unless we end up with >>> some lock for reading the tags). >> >> As Catalin says, tags are no different from data so the VMM needs to >> either pause the VM or mark the page read-only to protect it from guest >> updates during the copy. > > Yes, that seems reasonable; not sure whether the documentation should > call that out explicitly. > >> >> The whole test_bit/set_bit dance does seem to be leaving open race >> conditions. I /think/ that Peter's extra flag as a lock with an added >> memory barrier is sufficient to mitigate it, but at this stage I'm also >> thinking some formal modelling would be wise as I don't trust my >> intuition when it comes to memory barriers. >> >>>> [Postcopy needs a different interface, I guess, so that the migration >>>> target can atomically place a received page and its metadata. I see >>>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/; >>>> has there been any follow-up?] >>> >>> I don't follow the qemu list, so I wasn't even aware of that thread. But >>> postcopy, the VMM needs to ensure that both the data and tags are up to >>> date before mapping such page into the guest address space. >>> >> >> I'm not sure I see how atomically updating data+tags is different from >> the existing issues around atomically updating the data. The VMM needs >> to ensure that the guest doesn't see the page before all the data+all >> the tags are written. It does mean lazy setting of the tags isn't >> possible in the VMM, but I'm not sure that's a worthwhile thing anyway. >> Perhaps I'm missing something? > > For postcopy, we basically want to fault in any not-yet-migrated page > via uffd once the guest accesses it. We only get the page data that way, > though, not the tag. I'm wondering whether we'd need a 'page+metadata' > uffd mode; not sure if that makes sense. Otherwise, we'd need to stop > the guest while grabbing the tags for the page as well, and stopping is > the thing we want to avoid here. Ah, I think I see now. UFFDIO_COPY atomically populates the (data) page and ensures that no thread will see the partially populated page. But there's currently no way of doing that with tags as well. I'd not looked at the implementation of userfaultfd before and I'd assumed it avoided the need for an 'atomic' operation like this. But apparently not! AFAICT either a new ioctl would be needed (which can take a tag buffer) or a new flag to UFFDIO_COPY which would tighten the alignment requirements of `src` and would copy the tags along with the data. Steve
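[To make the second option above concrete, here is what the "new flag to UFFDIO_COPY" variant could look like from the VMM side. The flag below is purely hypothetical and does not exist in any kernel; the struct uffdio_copy fields are the existing ones, and the assumption is that src would have to be a suitably aligned PROT_MTE mapping so the kernel can pick the tags up together with the data.

#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical flag, not upstream: copy MTE tags from src along with the data. */
#define UFFDIO_COPY_MODE_MTE_TAGS	(1ULL << 2)

static int place_page_with_tags(int uffd, void *dst, void *src, uint64_t len)
{
	struct uffdio_copy copy = {
		.dst	= (uintptr_t)dst,
		.src	= (uintptr_t)src,	/* page-aligned PROT_MTE buffer */
		.len	= len,
		.mode	= UFFDIO_COPY_MODE_MTE_TAGS,	/* hypothetical */
	};

	/*
	 * As with plain UFFDIO_COPY, the page (now including its tags) would
	 * only become visible to the faulting guest thread once fully populated.
	 */
	return ioctl(uffd, UFFDIO_COPY, &copy);
}

A separate ioctl taking an explicit tag buffer would avoid the alignment requirement on src, but it is less obvious how it would keep the data+tags installation atomic with respect to the guest.]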
On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote: > On 04/07/2022 13:19, Cornelia Huck wrote: >> On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote: >> >>> On 29/06/2022 09:45, Catalin Marinas wrote: >>>> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote: >>> >>>>> [Postcopy needs a different interface, I guess, so that the migration >>>>> target can atomically place a received page and its metadata. I see >>>>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/; >>>>> has there been any follow-up?] >>>> >>>> I don't follow the qemu list, so I wasn't even aware of that thread. But >>>> postcopy, the VMM needs to ensure that both the data and tags are up to >>>> date before mapping such page into the guest address space. >>>> >>> >>> I'm not sure I see how atomically updating data+tags is different from >>> the existing issues around atomically updating the data. The VMM needs >>> to ensure that the guest doesn't see the page before all the data+all >>> the tags are written. It does mean lazy setting of the tags isn't >>> possible in the VMM, but I'm not sure that's a worthwhile thing anyway. >>> Perhaps I'm missing something? >> >> For postcopy, we basically want to fault in any not-yet-migrated page >> via uffd once the guest accesses it. We only get the page data that way, >> though, not the tag. I'm wondering whether we'd need a 'page+metadata' >> uffd mode; not sure if that makes sense. Otherwise, we'd need to stop >> the guest while grabbing the tags for the page as well, and stopping is >> the thing we want to avoid here. > > Ah, I think I see now. UFFDIO_COPY atomically populates the (data) page > and ensures that no thread will see the partially populated page. But > there's currently no way of doing that with tags as well. Nod. > > I'd not looked at the implementation of userfaultfd before and I'd > assumed it avoided the need for an 'atomic' operation like this. But > apparently not! AFAICT either a new ioctl would be needed (which can > take a tag buffer) or a new flag to UFFDIO_COPY which would tighten the > alignment requirements of `src` and would copy the tags along with the data. I was thinking about a new flag that implies "copy metadata"; not sure how we would get the same atomicity with a separate ioctl. I've only just started looking at userfaultfd, though, and I might be on a wrong track... One thing I'd like to avoid is having something that is too ARM-specific, I think there are other architecture features that might have similar issues. Maybe someone more familiar with uffd and/or postcopy can chime in?
On Fri, Jul 08, 2022 at 03:03:34PM +0200, Cornelia Huck wrote: > On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote: > > > On 04/07/2022 13:19, Cornelia Huck wrote: > >> On Mon, Jul 04 2022, Steven Price <steven.price@arm.com> wrote: > >> > >>> On 29/06/2022 09:45, Catalin Marinas wrote: > >>>> On Mon, Jun 27, 2022 at 05:55:33PM +0200, Cornelia Huck wrote: > >>> > >>>>> [Postcopy needs a different interface, I guess, so that the migration > >>>>> target can atomically place a received page and its metadata. I see > >>>>> https://lore.kernel.org/all/CAJc+Z1FZxSYB_zJit4+0uTR-88VqQL+-01XNMSEfua-dXDy6Wg@mail.gmail.com/; > >>>>> has there been any follow-up?] > >>>> > >>>> I don't follow the qemu list, so I wasn't even aware of that thread. But > >>>> postcopy, the VMM needs to ensure that both the data and tags are up to > >>>> date before mapping such page into the guest address space. > >>>> > >>> > >>> I'm not sure I see how atomically updating data+tags is different from > >>> the existing issues around atomically updating the data. The VMM needs > >>> to ensure that the guest doesn't see the page before all the data+all > >>> the tags are written. It does mean lazy setting of the tags isn't > >>> possible in the VMM, but I'm not sure that's a worthwhile thing anyway. > >>> Perhaps I'm missing something? > >> > >> For postcopy, we basically want to fault in any not-yet-migrated page > >> via uffd once the guest accesses it. We only get the page data that way, > >> though, not the tag. I'm wondering whether we'd need a 'page+metadata' > >> uffd mode; not sure if that makes sense. Otherwise, we'd need to stop > >> the guest while grabbing the tags for the page as well, and stopping is > >> the thing we want to avoid here. > > > > Ah, I think I see now. UFFDIO_COPY atomically populates the (data) page > > and ensures that no thread will see the partially populated page. But > > there's currently no way of doing that with tags as well. > > Nod. > > > > > I'd not looked at the implementation of userfaultfd before and I'd > > assumed it avoided the need for an 'atomic' operation like this. But > > apparently not! AFAICT either a new ioctl would be needed (which can > > take a tag buffer) or a new flag to UFFDIO_COPY which would tighten the > > alignment requirements of `src` and would copy the tags along with the data. > > I was thinking about a new flag that implies "copy metadata"; not sure > how we would get the same atomicity with a separate ioctl. I've only > just started looking at userfaultfd, though, and I might be on a wrong > track... One thing I'd like to avoid is having something that is too > ARM-specific, I think there are other architecture features that might > have similar issues. Agreed, to propose such an interface we'd better make sure it'll be easily applicable to other similar memory protection mechanisms elsewhere. > > Maybe someone more familiar with uffd and/or postcopy can chime in? Having UFFDIO_COPY provide a new flag sounds reasonable to me. I'm curious what's the maximum possible size of the tags and whether they can be embedded already into struct uffdio_copy somehow. Thanks,
On Fri, Jul 08 2022, Peter Xu <peterx@redhat.com> wrote: > On Fri, Jul 08, 2022 at 03:03:34PM +0200, Cornelia Huck wrote: >> I was thinking about a new flag that implies "copy metadata"; not sure >> how we would get the same atomicity with a separate ioctl. I've only >> just started looking at userfaultfd, though, and I might be on a wrong >> track... One thing I'd like to avoid is having something that is too >> ARM-specific, I think there are other architecture features that might >> have similar issues. > > Agreed, to propose such an interface we'd better make sure it'll be easily > applicable to other similar memory protection mechanisms elsewhere. There's storage keys on s390, although I believe they are considered legacy by now. I dimly recall something in x86 land. > >> >> Maybe someone more familiar with uffd and/or postcopy can chime in? > > Hanving UFFDIO_COPY provide a new flag sounds reasonable to me. I'm > curious what's the maximum possible size of the tags and whether they can > be embeded already into struct uffdio_copy somehow. Each tag is four bits and covers 16 bytes (also see the defs in arch/arm64/include/asm/mte-def.h).
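[For the sizing question, a quick sketch of the numbers, using the constants from mte-def.h. The "packed" figures assume two 4-bit tags per byte, while the existing KVM_ARM_MTE_COPY_TAGS buffer format spends a whole byte per granule.

/* One 4-bit tag per 16-byte granule (arch/arm64/include/asm/mte-def.h). */
#define MTE_GRANULE_SIZE	16UL
#define MTE_TAG_SIZE		4	/* bits */

/* 4K page:   4096/16 =  256 tags ->  128 bytes packed,  256 bytes at one byte/granule. */
/* 64K page: 65536/16 = 4096 tags -> 2048 bytes packed, 4096 bytes at one byte/granule. */

Either way this is far too large to embed directly in struct uffdio_copy, so an extension would presumably pass the tags via a pointer, or read them implicitly from a PROT_MTE src buffer as sketched earlier.]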
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h index aa523591a44e..f66f70194c76 100644 --- a/arch/arm64/include/asm/mte.h +++ b/arch/arm64/include/asm/mte.h @@ -19,6 +19,7 @@ #include <asm/pgtable-types.h> void mte_clear_page_tags(void *addr); +void mte_ensure_page_tags_cleared(struct page *page); unsigned long mte_copy_tags_from_user(void *to, const void __user *from, unsigned long n); unsigned long mte_copy_tags_to_user(void __user *to, void *from, @@ -37,6 +38,9 @@ void mte_free_tag_storage(char *storage); /* track which pages have valid allocation tags */ #define PG_mte_tagged PG_arch_2 +/* the page is or has been in the process of having its tags cleared */ +#define PG_mte_tag_clearing PG_arch_3 + void mte_zero_clear_page_tags(void *addr); void mte_sync_tags(pte_t old_pte, pte_t pte); void mte_copy_page_tags(void *kto, const void *kfrom); @@ -53,6 +57,7 @@ size_t mte_probe_user_range(const char __user *uaddr, size_t size); /* unused if !CONFIG_ARM64_MTE, silence the compiler */ #define PG_mte_tagged 0 +#define PG_mte_tag_clearing 0 static inline void mte_zero_clear_page_tags(void *addr) { diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c index f6b00743c399..8f9655053a9f 100644 --- a/arch/arm64/kernel/mte.c +++ b/arch/arm64/kernel/mte.c @@ -57,7 +57,18 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte, * the new page->flags are visible before the tags were updated. */ smp_wmb(); - mte_clear_page_tags(page_address(page)); + mte_ensure_page_tags_cleared(page); +} + +void mte_ensure_page_tags_cleared(struct page *page) +{ + if (test_and_set_bit(PG_mte_tag_clearing, &page->flags)) { + while (!test_bit(PG_mte_tagged, &page->flags)) + ; + } else { + mte_clear_page_tags(page_address(page)); + set_bit(PG_mte_tagged, &page->flags); + } } void mte_sync_tags(pte_t old_pte, pte_t pte) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index f5651a05b6a8..016d14c9d798 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1050,11 +1050,9 @@ static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva) * able to see the page's tags and therefore they must be initialised first. If * PG_mte_tagged is set, tags have already been initialised. * - * The race in the test/set of the PG_mte_tagged flag is handled by: - * - preventing VM_SHARED mappings in a memslot with MTE preventing two VMs - * racing to santise the same page - * - mmap_lock protects between a VM faulting a page in and the VMM performing - * an mprotect() to add VM_MTE + * The race in the test/set of the PG_mte_tagged flag is handled via the flag + * PG_mte_tag_clearing which is atomically test and set before beginning a tag + * initialization, and spinning on PG_mte_tagged if it was already set. 
*/ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, unsigned long size) @@ -1074,12 +1072,9 @@ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, if (!page) return -EFAULT; - for (i = 0; i < nr_pages; i++, page++) { - if (!test_bit(PG_mte_tagged, &page->flags)) { - mte_clear_page_tags(page_address(page)); - set_bit(PG_mte_tagged, &page->flags); - } - } + for (i = 0; i < nr_pages; i++, page++) + if (!test_bit(PG_mte_tagged, &page->flags)) + mte_ensure_page_tags_cleared(page); return 0; } @@ -1092,7 +1087,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, bool write_fault, writable, force_pte = false; bool exec_fault; bool device = false; - bool shared; unsigned long mmu_seq; struct kvm *kvm = vcpu->kvm; struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; @@ -1142,8 +1136,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, vma_shift = get_vma_page_shift(vma, hva); } - shared = (vma->vm_flags & VM_SHARED); - switch (vma_shift) { #ifndef __PAGETABLE_PMD_FOLDED case PUD_SHIFT: @@ -1263,11 +1255,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, } if (fault_status != FSC_PERM && !device && kvm_has_mte(kvm)) { - /* Check the VMM hasn't introduced a new VM_SHARED VMA */ - if (!shared) - ret = sanitise_mte_tags(kvm, pfn, vma_pagesize); - else - ret = -EFAULT; + ret = sanitise_mte_tags(kvm, pfn, vma_pagesize); if (ret) goto out_unlock; } @@ -1705,16 +1693,6 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, if (!vma) break; - /* - * VM_SHARED mappings are not allowed with MTE to avoid races - * when updating the PG_mte_tagged page flag, see - * sanitise_mte_tags for more details. - */ - if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED) { - ret = -EINVAL; - break; - } - if (vma->vm_flags & VM_PFNMAP) { /* IO region dirty page logging not allowed */ if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) { diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index e66f7aa3191d..447cdd4b1311 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -134,6 +134,7 @@ enum pageflags { #endif #ifdef CONFIG_64BIT PG_arch_2, + PG_arch_3, #endif #ifdef CONFIG_KASAN_HW_TAGS PG_skip_kasan_poison, diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index e87cb2b80ed3..ecf7ae0de3d8 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -92,9 +92,9 @@ #endif #ifdef CONFIG_64BIT -#define IF_HAVE_PG_ARCH_2(flag,string) ,{1UL << flag, string} +#define IF_HAVE_PG_ARCH_2_3(flag,string) ,{1UL << flag, string} #else -#define IF_HAVE_PG_ARCH_2(flag,string) +#define IF_HAVE_PG_ARCH_2_3(flag,string) #endif #ifdef CONFIG_KASAN_HW_TAGS @@ -130,7 +130,8 @@ IF_HAVE_PG_UNCACHED(PG_uncached, "uncached" ) \ IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \ IF_HAVE_PG_IDLE(PG_young, "young" ) \ IF_HAVE_PG_IDLE(PG_idle, "idle" ) \ -IF_HAVE_PG_ARCH_2(PG_arch_2, "arch_2" ) \ +IF_HAVE_PG_ARCH_2_3(PG_arch_2, "arch_2" ) \ +IF_HAVE_PG_ARCH_2_3(PG_arch_3, "arch_3" ) \ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison") #define show_page_flags(flags) \
Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that depend on being able to map guest memory as MAP_SHARED. The current restriction on sharing MAP_SHARED pages with the guest is preventing the use of those features with MTE. Therefore, remove this restriction. To avoid races between multiple tasks attempting to clear tags on the same page, introduce a new page flag, PG_mte_tag_clearing, and test-set it atomically before beginning to clear tags on a page. If the flag was not initially set, spin until the other task has finished clearing the tags. Link: https://linux-review.googlesource.com/id/I05a57f546f2b4d9b008a08f5f03e9ba3c950fdcc Signed-off-by: Peter Collingbourne <pcc@google.com> --- arch/arm64/include/asm/mte.h | 5 +++++ arch/arm64/kernel/mte.c | 13 +++++++++++- arch/arm64/kvm/mmu.c | 36 +++++++--------------------------- include/linux/page-flags.h | 1 + include/trace/events/mmflags.h | 7 ++++--- 5 files changed, 29 insertions(+), 33 deletions(-)