diff mbox series

[v3,1/1] KVM: arm64: Allow cacheable stage 2 mapping using VMA flags

Message ID 20250310103008.3471-2-ankita@nvidia.com (mailing list archive)
State New
Headers show
Series KVM: arm64: Map GPU device memory as cacheable | expand

Commit Message

Ankit Agrawal March 10, 2025, 10:30 a.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
based on pfn_is_map_memory (which tracks whether the device memory
is added to the kernel) and ignores the per-VMA flags that indicates the
memory attributes. The KVM code is thus restrictive and allows only for
the memory that is added to the kernel to be marked as cacheable.

The device memory such as on the Grace Hopper/Blackwell systems
is interchangeable with DDR memory and retains properties such as
cacheability, unaligned accesses, atomics and handling of executable
faults. This requires the device memory to be mapped as NORMAL in
stage-2.

Given that the GPU device memory is not added to the kernel (but is rather
VMA mapped through remap_pfn_range() in nvgrace-gpu module which sets
VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory
to be mapped Normal cacheable. The patch aims to solve this use case.

A cachebility check is made if the VM_PFNMAP is set in VMA flags by
consulting the VMA pgprot value. If the pgprot mapping type is MT_NORMAL,
it is safe to be mapped cacheable as the KVM S2 will have the same
Normal memory type as the VMA has in the S1 and KVM has no additional
responsibility for safety. Checking pgprot as NORMAL is thus a KVM
sanity check.

Introduce a new variable cacheable_devmem to indicate a safely
cacheable mapping. Do not set the device variable when cacheable_devmem
is true. This essentially have the effect of setting stage-2 mapping
as NORMAL through kvm_pgtable_stage2_map.

Add check for COW VM_PFNMAP and refuse such mapping.

No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
already tests it at an early stage during memslot creation. There would
not even be a fault if the memslot is not created.

Note when FWB is not enabled, the kernel expects to trivially do
cache management by flushing the memory by linearly converting a
kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). The
cache management thus relies on memory being mapped. So do not allow
!FWB.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/kvm/hyp/pgtable.c         |  2 +-
 arch/arm64/kvm/mmu.c                 | 43 +++++++++++++++++++++++++++-
 3 files changed, 51 insertions(+), 2 deletions(-)

Comments

Marc Zyngier March 10, 2025, 11:54 a.m. UTC | #1
On Mon, 10 Mar 2025 10:30:08 +0000,
<ankita@nvidia.com> wrote:
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Today KVM forces the memory to either NORMAL or DEVICE_nGnRE
> based on pfn_is_map_memory (which tracks whether the device memory
> is added to the kernel) and ignores the per-VMA flags that indicates the
> memory attributes. The KVM code is thus restrictive and allows only for
> the memory that is added to the kernel to be marked as cacheable.
> 
> The device memory such as on the Grace Hopper/Blackwell systems
> is interchangeable with DDR memory and retains properties such as
> cacheability, unaligned accesses, atomics and handling of executable
> faults. This requires the device memory to be mapped as NORMAL in
> stage-2.
> 
> Given that the GPU device memory is not added to the kernel (but is rather
> VMA mapped through remap_pfn_range() in nvgrace-gpu module which sets
> VM_PFNMAP), pfn_is_map_memory() is false and thus KVM prevents such memory
> to be mapped Normal cacheable. The patch aims to solve this use case.
> 
> A cachebility check is made if the VM_PFNMAP is set in VMA flags by
> consulting the VMA pgprot value. If the pgprot mapping type is MT_NORMAL,
> it is safe to be mapped cacheable as the KVM S2 will have the same
> Normal memory type as the VMA has in the S1 and KVM has no additional
> responsibility for safety. Checking pgprot as NORMAL is thus a KVM
> sanity check.
> 
> Introduce a new variable cacheable_devmem to indicate a safely
> cacheable mapping. Do not set the device variable when cacheable_devmem
> is true. This essentially have the effect of setting stage-2 mapping
> as NORMAL through kvm_pgtable_stage2_map.
> 
> Add check for COW VM_PFNMAP and refuse such mapping.
> 
> No additional checks for MTE are needed as kvm_arch_prepare_memory_region()
> already tests it at an early stage during memslot creation. There would
> not even be a fault if the memslot is not created.
> 
> Note when FWB is not enabled, the kernel expects to trivially do
> cache management by flushing the memory by linearly converting a
> kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). The
> cache management thus relies on memory being mapped. So do not allow
> !FWB.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
>  arch/arm64/kvm/hyp/pgtable.c         |  2 +-
>  arch/arm64/kvm/mmu.c                 | 43 +++++++++++++++++++++++++++-
>  3 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 6b9d274052c7..f21e2fae2bfe 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -507,6 +507,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
>   */
>  u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
>  
> +/**
> + * stage2_has_fwb() - Determine whether FWB is supported
> + * @pgt:    Page-table structure initialised by kvm_pgtable_stage2_init*()
> + *
> + * Return: True if FWB is supported.
> + */
> +bool stage2_has_fwb(struct kvm_pgtable *pgt);
> +
>  /**
>   * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
>   * @vtcr:	Content of the VTCR register.
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index df5cc74a7dd0..ee6b98fefd61 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -637,7 +637,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
>  	return vtcr;
>  }
>  
> -static bool stage2_has_fwb(struct kvm_pgtable *pgt)
> +bool stage2_has_fwb(struct kvm_pgtable *pgt)
>  {
>  	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
>  		return false;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1f55b0c7b11d..4b3a03c9d700 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1454,6 +1454,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>  
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type(pgprot_t page_prot)
> +{
> +	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_s2_trans *nested,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> @@ -1463,6 +1472,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	bool write_fault, writable, force_pte = false;
>  	bool exec_fault, mte_allowed;
>  	bool device = false, vfio_allow_any_uc = false;
> +	bool cacheable_devmem = false;
>  	unsigned long mmu_seq;
>  	phys_addr_t ipa = fault_ipa;
>  	struct kvm *kvm = vcpu->kvm;
> @@ -1600,6 +1610,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
>  
> +	if (vma->vm_flags & VM_PFNMAP) {
> +		/* Reject COW VM_PFNMAP */
> +		if (is_cow_mapping(vma->vm_flags))
> +			return -EINVAL;
> +		/*
> +		 * If the VM_PFNMAP is set in VMA flags, do a KVM sanity
> +		 * check to see if pgprot mapping type is MT_NORMAL and a
> +		 * safely cacheable device memory.
> +		 */
> +		if (mapping_type(vma->vm_page_prot) == MT_NORMAL)
> +			cacheable_devmem = true;
> +	}
> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
>  
> @@ -1633,8 +1656,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		 *
>  		 * In both cases, we don't let transparent_hugepage_adjust()
>  		 * change things at the last minute.
> +		 *
> +		 * Do not set device as the device memory is cacheable. Note
> +		 * that such mapping is safe as the KVM S2 will have the same
> +		 * Normal memory type as the VMA has in the S1.
>  		 */
> -		device = true;
> +		if (!cacheable_devmem)
> +			device = true;
>  	} else if (logging_active && !write_fault) {
>  		/*
>  		 * Only actually map the page as writable if this was a write
> @@ -1716,6 +1744,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		prot |= KVM_PGTABLE_PROT_X;
>  	}
>  
> +	/*
> +	 *  When FWB is unsupported KVM needs to do cache flushes
> +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> +	 *  only possible if the memory is already mapped into the kernel map.
> +	 *
> +	 *  Outright reject as the cacheable device memory is not present in
> +	 *  the kernel map and not suitable for cache management.
> +	 */
> +	if (cacheable_devmem && !stage2_has_fwb(pgt)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +

These new error reasons should at least be complemented by an
equivalent check at the point where the memslot is registered. It
maybe OK to blindly return an error at fault time (because userspace
has messed with the mapping behind our back), but there should at
least be something telling a well behaved userspace that there is a
bunch of combination we're unwilling to support.

Which brings me to the next point: FWB is not discoverable from
userspace. How do you expect a VMM to know what it can or cannot do?

	M.
Ankit Agrawal March 11, 2025, 3:42 a.m. UTC | #2
>> +     /*
>> +      *  When FWB is unsupported KVM needs to do cache flushes
>> +      *  (via dcache_clean_inval_poc()) of the underlying memory. This is
>> +      *  only possible if the memory is already mapped into the kernel map.
>> +      *
>> +      *  Outright reject as the cacheable device memory is not present in
>> +      *  the kernel map and not suitable for cache management.
>> +      */
>> +     if (cacheable_devmem && !stage2_has_fwb(pgt)) {
>> +             ret = -EINVAL;
>> +             goto out_unlock;
>> +     }
>> +
>
> These new error reasons should at least be complemented by an
> equivalent check at the point where the memslot is registered. It

Understood. I can add such check in kvm_arch_prepare_memory_region().


> maybe OK to blindly return an error at fault time (because userspace
> has messed with the mapping behind our back), but there should at
> least be something telling a well behaved userspace that there is a
> bunch of combination we're unwilling to support.

How about WARN_ON() or BUG() for the faulty situation?


> Which brings me to the next point: FWB is not discoverable from
> userspace. How do you expect a VMM to know what it can or cannot do?

Good point. I am not sure if it can. I suppose you are concerned about error
during fault handling when !FWB without VMM having any clear indications
of the cause? Perhaps we can gracefully fall back to the default device mapping
in such case? But that would cause VM to crash as soon as it makes some
access violating DEVICE_nGnRE.

>        M.
>
> --
> Without deviation from the norm, progress is not possible.
Marc Zyngier March 11, 2025, 11:18 a.m. UTC | #3
On Tue, 11 Mar 2025 03:42:23 +0000,
Ankit Agrawal <ankita@nvidia.com> wrote:
> 
> >> +     /*
> >> +      *  When FWB is unsupported KVM needs to do cache flushes
> >> +      *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> >> +      *  only possible if the memory is already mapped into the kernel map.
> >> +      *
> >> +      *  Outright reject as the cacheable device memory is not present in
> >> +      *  the kernel map and not suitable for cache management.
> >> +      */
> >> +     if (cacheable_devmem && !stage2_has_fwb(pgt)) {
> >> +             ret = -EINVAL;
> >> +             goto out_unlock;
> >> +     }
> >> +
> >
> > These new error reasons should at least be complemented by an
> > equivalent check at the point where the memslot is registered. It
> 
> Understood. I can add such check in kvm_arch_prepare_memory_region().
> 
> 
> > maybe OK to blindly return an error at fault time (because userspace
> > has messed with the mapping behind our back), but there should at
> > least be something telling a well behaved userspace that there is a
> > bunch of combination we're unwilling to support.
> 
> How about WARN_ON() or BUG() for the faulty situation?

Absolutely not. Do you really want any user to randomly crash the
kernel because they flip a mapping, which they can do anytime they
want?

The way KVM works is that we return to userspace for the VMM to fix
things. Either by emulating something we can't do in the kernel, or by
fixing things so that the kernel can replay the fault and sort it out.

Either way, this requires some form of fault syndrome so that usespace
has a chance of understanding WTF is going on.

> > Which brings me to the next point: FWB is not discoverable from
> > userspace. How do you expect a VMM to know what it can or cannot do?
> 
> Good point. I am not sure if it can. I suppose you are concerned about error
> during fault handling when !FWB without VMM having any clear indications
> of the cause?

No, I'm concerned that a well established API (populating a memslot)
works in some case and doesn't work in another without a clear
indication of *why* we have this behaviour.

To me, this indicates that userspace needs to buy in this new
behaviour, and that behaviour needs to be advertised by a capability,
which is in turn conditional on FWB.

> Perhaps we can gracefully fall back to the default device mapping
> in such case? But that would cause VM to crash as soon as it makes some
> access violating DEVICE_nGnRE.

Which would now be a regression...

My take is that this cacheable PNFMAP contraption must only be exposed
to a guest if FWB is available. We can't prevent someone to do an
mmap() behind our back, but we can at least:

- tell userspace whether this is supported
- only handle the fault if userspace has bought in this mode
- report the fault to userspace for it to fix things otherwise

	M.
Ankit Agrawal March 11, 2025, 12:07 p.m. UTC | #4
Thanks Marc for the feedback.

> No, I'm concerned that a well established API (populating a memslot)
> works in some case and doesn't work in another without a clear
> indication of *why* we have this behaviour.
>
> To me, this indicates that userspace needs to buy in this new
> behaviour, and that behaviour needs to be advertised by a capability,
> which is in turn conditional on FWB.

Yes, that makes sense.

>>> Perhaps we can gracefully fall back to the default device mapping
>>> in such case? But that would cause VM to crash as soon as it makes some
>>> access violating DEVICE_nGnRE.
>
> Which would now be a regression...
> 
> My take is that this cacheable PNFMAP contraption must only be exposed
> to a guest if FWB is available. We can't prevent someone to do an
> mmap() behind our back, but we can at least:
>
> - tell userspace whether this is supported

For my education, what is an accepted way to communicate this? Please let
me know if there are any relevant examples that you may be aware of.

I suppose just checking for FWB (for PFNMAP) and returning some sort of
an error on userspace mmap will not be enough of a hint here?

> - only handle the fault if userspace has bought in this mode
> - report the fault to userspace for it to fix things otherwise
>
>        M.
>
> --
> Without deviation from the norm, progress is not possible.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 6b9d274052c7..f21e2fae2bfe 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -507,6 +507,14 @@  u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
  */
 u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
 
+/**
+ * stage2_has_fwb() - Determine whether FWB is supported
+ * @pgt:    Page-table structure initialised by kvm_pgtable_stage2_init*()
+ *
+ * Return: True if FWB is supported.
+ */
+bool stage2_has_fwb(struct kvm_pgtable *pgt);
+
 /**
  * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
  * @vtcr:	Content of the VTCR register.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index df5cc74a7dd0..ee6b98fefd61 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -637,7 +637,7 @@  u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 	return vtcr;
 }
 
-static bool stage2_has_fwb(struct kvm_pgtable *pgt)
+bool stage2_has_fwb(struct kvm_pgtable *pgt)
 {
 	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
 		return false;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1f55b0c7b11d..4b3a03c9d700 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1454,6 +1454,15 @@  static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+/*
+ * Determine the memory region cacheability from VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static unsigned long mapping_type(pgprot_t page_prot)
+{
+	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1463,6 +1472,7 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	bool write_fault, writable, force_pte = false;
 	bool exec_fault, mte_allowed;
 	bool device = false, vfio_allow_any_uc = false;
+	bool cacheable_devmem = false;
 	unsigned long mmu_seq;
 	phys_addr_t ipa = fault_ipa;
 	struct kvm *kvm = vcpu->kvm;
@@ -1600,6 +1610,19 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	vfio_allow_any_uc = vma->vm_flags & VM_ALLOW_ANY_UNCACHED;
 
+	if (vma->vm_flags & VM_PFNMAP) {
+		/* Reject COW VM_PFNMAP */
+		if (is_cow_mapping(vma->vm_flags))
+			return -EINVAL;
+		/*
+		 * If the VM_PFNMAP is set in VMA flags, do a KVM sanity
+		 * check to see if pgprot mapping type is MT_NORMAL and a
+		 * safely cacheable device memory.
+		 */
+		if (mapping_type(vma->vm_page_prot) == MT_NORMAL)
+			cacheable_devmem = true;
+	}
+
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
@@ -1633,8 +1656,13 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		 *
 		 * In both cases, we don't let transparent_hugepage_adjust()
 		 * change things at the last minute.
+		 *
+		 * Do not set device as the device memory is cacheable. Note
+		 * that such mapping is safe as the KVM S2 will have the same
+		 * Normal memory type as the VMA has in the S1.
 		 */
-		device = true;
+		if (!cacheable_devmem)
+			device = true;
 	} else if (logging_active && !write_fault) {
 		/*
 		 * Only actually map the page as writable if this was a write
@@ -1716,6 +1744,19 @@  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		prot |= KVM_PGTABLE_PROT_X;
 	}
 
+	/*
+	 *  When FWB is unsupported KVM needs to do cache flushes
+	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
+	 *  only possible if the memory is already mapped into the kernel map.
+	 *
+	 *  Outright reject as the cacheable device memory is not present in
+	 *  the kernel map and not suitable for cache management.
+	 */
+	if (cacheable_devmem && !stage2_has_fwb(pgt)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,