[v1,1/2] KVM: arm64: determine memory type from VMA

Message ID	20230907181459.18145-2-ankita@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 216.228.117.161 as permitted sender) receiver=protection.outlook.com; client-ip=216.228.117.161; helo=mail.nvidia.com; pr=C From: <ankita@nvidia.com> To: <ankita@nvidia.com>, <jgg@nvidia.com>, <maz@kernel.org>, <oliver.upton@linux.dev>, <catalin.marinas@arm.com>, <will@kernel.org> CC: <aniketa@nvidia.com>, <cjia@nvidia.com>, <kwankhede@nvidia.com>, <targupta@nvidia.com>, <vsethi@nvidia.com>, <acurrid@nvidia.com>, <apopple@nvidia.com>, <jhubbard@nvidia.com>, <danw@nvidia.com>, <linux-arm-kernel@lists.infradead.org>, <kvmarm@lists.linux.dev>, <linux-kernel@vger.kernel.org> Subject: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA Date: Thu, 7 Sep 2023 11:14:58 -0700 Message-ID: <20230907181459.18145-2-ankita@nvidia.com> In-Reply-To: <20230907181459.18145-1-ankita@nvidia.com> References: <20230907181459.18145-1-ankita@nvidia.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	KVM: arm64: support write combining and cachable IO memory in VMs \| expand [v1,0/2] KVM: arm64: support write combining and cachable IO memory in VMs [v1,1/2] KVM: arm64: determine memory type from VMA [v1,2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory

Ankit Agrawal Sept. 7, 2023, 6:14 p.m. UTC

From: Ankit Agrawal <ankita@nvidia.com>

Currently KVM determines if a VMA is pointing at IO memory by checking
pfn_is_map_memory(). However, the MM already gives us a way to tell what
kind of memory it is by inspecting the VMA.

Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if
the memory is IO and thus needs stage-2 device mapping.

The VMA's pgprot is tested to determine the memory type with the
following mapping:

 pgprot_noncached    MT_DEVICE_nGnRnE   device
 pgprot_writecombine MT_NORMAL_NC       device
 pgprot_device       MT_DEVICE_nGnRE    device
 pgprot_tagged       MT_NORMAL_TAGGED   RAM

This patch solves a problems where it is possible for the kernel to
have VMAs pointing at cachable memory without causing
pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
devices. This memory is now properly marked as cachable in KVM.

Unfortunately when FWB is not enabled, the kernel expects to naively do
cache management by flushing the memory using an address in the
kernel's map. This does not work in several of the newly allowed
cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
and its mapping KVA is valid in case the FWB is absent before continuing.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/kvm/hyp/pgtable.c         |  2 +-
 arch/arm64/kvm/mmu.c                 | 40 +++++++++++++++++++++++++---
 3 files changed, 45 insertions(+), 5 deletions(-)

Jason Gunthorpe Sept. 7, 2023, 7:12 p.m. UTC | #1

On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.
> 
> Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if
> the memory is IO and thus needs stage-2 device mapping.
> 
> The VMA's pgprot is tested to determine the memory type with the
> following mapping:
> 
>  pgprot_noncached    MT_DEVICE_nGnRnE   device
>  pgprot_writecombine MT_NORMAL_NC       device
>  pgprot_device       MT_DEVICE_nGnRE    device
>  pgprot_tagged       MT_NORMAL_TAGGED   RAM
> 
> This patch solves a problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.
> 
> Unfortunately when FWB is not enabled, the kernel expects to naively do
> cache management by flushing the memory using an address in the
> kernel's map. This does not work in several of the newly allowed
> cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> and its mapping KVA is valid in case the FWB is absent before
> continuing.

Looking at this more closely, it relies on kvm_pte_follow() which
ultimately calls __va(), and the ARM 64 version of page_to_virt() is:

#define page_to_virt(x) ({                                              \
        __typeof__(x) __page = x;                                       \
        void *__addr = __va(page_to_phys(__page));                      \
        (void *)__tag_set((const void *)__addr, page_kasan_tag(__page));\
})

So we don't actually have an issue here, anything with a struct page
will be properly handled by KVM.

Thus this can just be:

	if (!stage2_has_fwb(pgt) && !pfn_valid(pfn)))

Then the last paragraph of the commit message is:

 As cachable vs device memory is now determined by the VMA adjust
 the other checks to match their purpose. In most cases the code needs
 to know if there is a struct page, or if the memory is in the kernel
 map and pfn_valid() is an appropriate API for this.

 Note when FWB is not enabled, the kernel expects to trivially do
 cache management by flushing the memory by linearly converting a
 kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). This is
 only possibile for struct page backed memory. Do not allow non-struct
 page memory to be cachable without FWB.

> @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn = fault_ipa >> PAGE_SHIFT;
>  	mte_allowed = kvm_vma_mte_allowed(vma);
>  
> +	/*
> +	 * Figure out the memory type based on the user va mapping properties
> +	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
> +	 * pgprot_device() and pgprot_noncached() respectively.
> +	 */
> +	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
> +		prot |= KVM_PGTABLE_PROT_DEVICE;
> +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> +		prot |= KVM_PGTABLE_PROT_X;
> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
>  
> @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;

You still didn't remove the kvm_is_device_pfn() check from this code,
I don't think it can really stay and make any sense..

Probably this:

   if (exec_fault && (prot & KVM_PGTABLE_PROT_DEVICE))
		return -ENOEXEC;

And these two should also be pfn_valid() [precompute pfn_valid]

	if (vma_pagesize == PAGE_SIZE && !(force_pte || !pfn_valid(pte))) {

	if (fault_status != ESR_ELx_FSC_PERM && pfn_valid(pte) && kvm_has_mte(kvm)) {

Makes sense?

Check if kvm_is_device_pfn() can be removed entirely.

Jason

Catalin Marinas Oct. 5, 2023, 4:15 p.m. UTC | #2

On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.

Well, it doesn't. It tells us what attributes the user mapped that
memory with, not whether it's I/O memory or standard RAM.

> Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if
> the memory is IO and thus needs stage-2 device mapping.
> 
> The VMA's pgprot is tested to determine the memory type with the
> following mapping:
> 
>  pgprot_noncached    MT_DEVICE_nGnRnE   device
>  pgprot_writecombine MT_NORMAL_NC       device
>  pgprot_device       MT_DEVICE_nGnRE    device
>  pgprot_tagged       MT_NORMAL_TAGGED   RAM

I would move the second patch to be the first, we could even merge that
independently as it is about relaxing the stage 2 mapping to Normal NC.
It would make it simpler I think to reason about the second patch which
further relaxes the stage 2 mapping to Normal Cacheable under certain
conditions.

> This patch solves a problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.
> 
> Unfortunately when FWB is not enabled, the kernel expects to naively do
> cache management by flushing the memory using an address in the
> kernel's map. This does not work in several of the newly allowed
> cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> and its mapping KVA is valid in case the FWB is absent before continuing.

I would only allow cacheable stage 2 mappings if FWB is enabled.
Otherwise we end up with a mismatch between the VMM mapping and whatever
the guest may do.

> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
>  arch/arm64/kvm/hyp/pgtable.c         |  2 +-
>  arch/arm64/kvm/mmu.c                 | 40 +++++++++++++++++++++++++---
>  3 files changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index d3e354bb8351..0579dbe958b9 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -430,6 +430,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
>   */
>  u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
>  
> +/**
> + * stage2_has_fwb() - Determine whether FWB is supported
> + * @pgt:    Page-table structure initialised by kvm_pgtable_stage2_init*()
> + *
> + * Return: True if FWB is supported.
> + */
> +bool stage2_has_fwb(struct kvm_pgtable *pgt);
> +
>  /**
>   * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
>   * @vtcr:	Content of the VTCR register.
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index f155b8c9e98c..ccd291b6893d 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -662,7 +662,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
>  	return vtcr;
>  }
>  
> -static bool stage2_has_fwb(struct kvm_pgtable *pgt)
> +bool stage2_has_fwb(struct kvm_pgtable *pgt)
>  {
>  	if (!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>  		return false;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 482280fe22d7..79f1caaa08a0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1391,6 +1391,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>  
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type(pgprot_t page_prot)
> +{
> +	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn = fault_ipa >> PAGE_SHIFT;
>  	mte_allowed = kvm_vma_mte_allowed(vma);
>  
> +	/*
> +	 * Figure out the memory type based on the user va mapping properties
> +	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
> +	 * pgprot_device() and pgprot_noncached() respectively.
> +	 */
> +	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
> +		prot |= KVM_PGTABLE_PROT_DEVICE;
> +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> +		prot |= KVM_PGTABLE_PROT_X;

Does this mean that we can end up with some I/O memory also mapped as
executable? Is there a use-case (e.g. using CXL memory as standard guest
RAM, executable)?

> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
>  
> @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;
>  
> -	if (device)
> -		prot |= KVM_PGTABLE_PROT_DEVICE;
> -	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> -		prot |= KVM_PGTABLE_PROT_X;
> +	/*
> +	 *  When FWB is unsupported KVM needs to do cache flushes
> +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> +	 *  only possible if the memory is already mapped into the kernel map
> +	 *  at the usual spot.
> +	 *
> +	 *  Validate that there is a struct page for the PFN which maps
> +	 *  to the KVA that the flushing code expects.
> +	 */
> +	if (!stage2_has_fwb(pgt) &&
> +	    !(pfn_valid(pfn) &&
> +	      page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}

My preference would be to keep most of the current logic (including
pfn_is_map_memory()) but force stage 2 cacheable for this page if the
user vma_page_prot is MT_NORMAL or MT_NORMAL_TAGGED and we have FWB. It
might be seen as an ABI change but I don't think it matters, it mostly
brings cacheable I/O mem mappings in line with standard RAM (bar the
exec permission unless there is a use-case for it).

Jason Gunthorpe Oct. 5, 2023, 4:54 p.m. UTC | #3

On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > Currently KVM determines if a VMA is pointing at IO memory by checking
> > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > kind of memory it is by inspecting the VMA.
> 
> Well, it doesn't. It tells us what attributes the user mapped that
> memory with, not whether it's I/O memory or standard RAM.

There is VM_IO which is intended to be used for address space with
side effects.

And there is VM_PFNMAP which is intended to be used for address space
without struct page (IO or not)

And finally we have the pgprot bit which define the cachability.

Do you have a definition of IO memory that those three things don't
cover?

I would propose that, for KVM's purpose, IO memory is marked with
VM_IO or a non-cachable pgprot

And "standard RAM" is defined by a cachable pgprot. Linux never makes
something that is VM_IO cachable.

?

> I would move the second patch to be the first, we could even merge that
> independently as it is about relaxing the stage 2 mapping to Normal NC.
> It would make it simpler I think to reason about the second patch which
> further relaxes the stage 2 mapping to Normal Cacheable under certain
> conditions.

Make sense
 
> > Unfortunately when FWB is not enabled, the kernel expects to naively do
> > cache management by flushing the memory using an address in the
> > kernel's map. This does not work in several of the newly allowed
> > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> > and its mapping KVA is valid in case the FWB is absent before continuing.
> 
> I would only allow cacheable stage 2 mappings if FWB is enabled.
> Otherwise we end up with a mismatch between the VMM mapping and whatever
> the guest may do.

Does it need to be stronger? If FWB is disabled and the cache flush
works then what is the issue?

> > +	/*
> > +	 * Figure out the memory type based on the user va mapping properties
> > +	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
> > +	 * pgprot_device() and pgprot_noncached() respectively.
> > +	 */
> > +	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
> > +	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
> > +	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
> > +		prot |= KVM_PGTABLE_PROT_DEVICE;
> > +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> > +		prot |= KVM_PGTABLE_PROT_X;
> 
> Does this mean that we can end up with some I/O memory also mapped as
> executable?

Yes. We don't have cachable memory with side effects in Linux?

> Is there a use-case (e.g. using CXL memory as standard guest
> RAM, executable)?

Certainly.

> > +
> >  	/* Don't use the VMA after the unlock -- it may have vanished */
> >  	vma = NULL;
> >  
> > @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	if (exec_fault)
> >  		prot |= KVM_PGTABLE_PROT_X;
> >  
> > -	if (device)
> > -		prot |= KVM_PGTABLE_PROT_DEVICE;
> > -	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> > -		prot |= KVM_PGTABLE_PROT_X;
> > +	/*
> > +	 *  When FWB is unsupported KVM needs to do cache flushes
> > +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> > +	 *  only possible if the memory is already mapped into the kernel map
> > +	 *  at the usual spot.
> > +	 *
> > +	 *  Validate that there is a struct page for the PFN which maps
> > +	 *  to the KVA that the flushing code expects.
> > +	 */
> > +	if (!stage2_has_fwb(pgt) &&
> > +	    !(pfn_valid(pfn) &&
> > +	      page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> 
> My preference would be to keep most of the current logic (including
> pfn_is_map_memory())

Why? I think this pfn_is_map_memory() is actually not desired, it
doesn't deal with modern memory hotplug or pgmap stuff? Isn't
pfn_valid() more appropriate?

> as an ABI change but I don't think it matters, it mostly brings
> cacheable I/O mem mappings in line with standard RAM (bar the exec
> permission unless there is a use-case for it).

I would discourage the concept of "cacheable I/O mem mappings".

Cachable memory located on a NUMA node close to the CPU should have
exactly the same treatement as cachable memory located on a NUMA node
distant from the CPU.

I think when you say "cachable I/O memory" it really just means normal
memory that lives on a NUMA node that is located on an IO device.

At the KVM level we don't care about the NUMA locality, we only care
if it is normal cachable system memory.

I think there are two issues here. 

1) KVM uses pfn_is_map_memory() which does not cover all our modern
NUMA and memory hotplug cases for normal struct page backed cachable
memory.

2) KVM doesn't work with normal cachable memory that does not have
struct pages.


For 1 the test should be 'does the pfn have a struct page, does the
struct page refer to cachable memory?'

For 2 the test should be 'does the VMA have pgprot = cachable,
VM_PFNMAP and not VM_IO (both implied)'

Jason

Catalin Marinas Oct. 10, 2023, 2:25 p.m. UTC | #4

On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > > Currently KVM determines if a VMA is pointing at IO memory by checking
> > > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > > kind of memory it is by inspecting the VMA.
> > 
> > Well, it doesn't. It tells us what attributes the user mapped that
> > memory with, not whether it's I/O memory or standard RAM.
> 
> There is VM_IO which is intended to be used for address space with
> side effects.
> 
> And there is VM_PFNMAP which is intended to be used for address space
> without struct page (IO or not)
> 
> And finally we have the pgprot bit which define the cachability.
> 
> Do you have a definition of IO memory that those three things don't
> cover?
> 
> I would propose that, for KVM's purpose, IO memory is marked with
> VM_IO or a non-cachable pgprot
> 
> And "standard RAM" is defined by a cachable pgprot. Linux never makes
> something that is VM_IO cachable.

I think we can safely set a stage 2 Normal NC for a vma with pgprot
other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is
not that simple. Just because the VMM was allowed to map it as cacheable
does not mean that it supports all the CPU features. One example is MTE
where we can only guarantee that the RAM given to the OS at boot
supports tagged accesses. I've seen something similar in the past with
LSE atomics (or was it exclusives?) not being propagated. These don't
make the memory safe for a guest to use as general purpose RAM.

I don't have a nice solution, it looks more like the host kernel being
able to trust what the VMM maps and gives to a guest (or we map
everything as Device at Stage 2 as we currently do). An alternative
would be for the host to know which physical address ranges support
which attributes and ignore the vma but not sure we have such
information in the ACPI tables (we can make something up for DT).

> > > Unfortunately when FWB is not enabled, the kernel expects to naively do
> > > cache management by flushing the memory using an address in the
> > > kernel's map. This does not work in several of the newly allowed
> > > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> > > and its mapping KVA is valid in case the FWB is absent before continuing.
> > 
> > I would only allow cacheable stage 2 mappings if FWB is enabled.
> > Otherwise we end up with a mismatch between the VMM mapping and whatever
> > the guest may do.
> 
> Does it need to be stronger? If FWB is disabled and the cache flush
> works then what is the issue?

I was thinking more about keeping things simpler and avoid any lack of
coherence between the VMM and the VM, in case the latter maps it as
Normal NC. But if the VMM doesn't touch it, the initial cache
maintenance by the KVM would do.

> I think there are two issues here. 
> 
> 1) KVM uses pfn_is_map_memory() which does not cover all our modern
> NUMA and memory hotplug cases for normal struct page backed cachable
> memory.
> 
> 2) KVM doesn't work with normal cachable memory that does not have
> struct pages.
> 
> For 1 the test should be 'does the pfn have a struct page, does the
> struct page refer to cachable memory?'
> 
> For 2 the test should be 'does the VMA have pgprot = cachable,
> VM_PFNMAP and not VM_IO (both implied)'

See above on the characteristics of the memory. If some of them are not
supported, it's probably fine (atomics not working) but others like MTE
accesses could either cause external aborts or access random addresses
from elsewhere. Currently we rely on pfn_is_map_memory() for this but we
need a way to tell that other ranges outside the initial RAM supports
all features. IOW, is any of this memory (mapped as cacheable in the
VMM) special purpose with only a subset of the CPU features supported?

Jason Gunthorpe Oct. 10, 2023, 3:05 p.m. UTC | #5

On Tue, Oct 10, 2023 at 03:25:22PM +0100, Catalin Marinas wrote:
> On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> > > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > > > From: Ankit Agrawal <ankita@nvidia.com>
> > > > Currently KVM determines if a VMA is pointing at IO memory by checking
> > > > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > > > kind of memory it is by inspecting the VMA.
> > > 
> > > Well, it doesn't. It tells us what attributes the user mapped that
> > > memory with, not whether it's I/O memory or standard RAM.
> > 
> > There is VM_IO which is intended to be used for address space with
> > side effects.
> > 
> > And there is VM_PFNMAP which is intended to be used for address space
> > without struct page (IO or not)
> > 
> > And finally we have the pgprot bit which define the cachability.
> > 
> > Do you have a definition of IO memory that those three things don't
> > cover?
> > 
> > I would propose that, for KVM's purpose, IO memory is marked with
> > VM_IO or a non-cachable pgprot
> > 
> > And "standard RAM" is defined by a cachable pgprot. Linux never makes
> > something that is VM_IO cachable.
> 
> I think we can safely set a stage 2 Normal NC for a vma with pgprot
> other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is
> not that simple. Just because the VMM was allowed to map it as cacheable
> does not mean that it supports all the CPU features. One example is MTE
> where we can only guarantee that the RAM given to the OS at boot
> supports tagged accesses. 

Is there a use case to supply the VMM with cachable memory that is not
full featured? At least the vfio cases I am aware of do not actually
want to do this and would probably like the arch to prevent these
corner cases upfront.

> I've seen something similar in the past with
> LSE atomics (or was it exclusives?) not being propagated. These don't
> make the memory safe for a guest to use as general purpose RAM.

At least from a mm perspective, I think it is important that cachable
struct pages are all the same and all interchangable. If the arch
cannot provide this it should not allow the pgmap/memremap to succeed
at all. Otherwise drivers using these new APIs are never going to work
fully right..

That leaves open the question of what to do with a cachable VM_PFNMAP,
but if the arch can deal with the memremap issue then it seems like it
can use the same logic when inspecting the VMA contents?

> I was thinking more about keeping things simpler and avoid any lack of
> coherence between the VMM and the VM, in case the latter maps it as
> Normal NC. But if the VMM doesn't touch it, the initial cache
> maintenance by the KVM would do.

I see, in general we usually don't have use cases for the VMM touching
the vfio memory as it is hard to make that secure. I would like it if
that memory could be a FD instead of a VMA. Maybe someday..

> See above on the characteristics of the memory. If some of them are not
> supported, it's probably fine (atomics not working) but others like MTE
> accesses could either cause external aborts or access random addresses
> from elsewhere. Currently we rely on pfn_is_map_memory() for this but we
> need a way to tell that other ranges outside the initial RAM supports
> all features. 

Indeed, I did not realize this about arm. See above on my note that at
the mm level we should not have different types of cachable struct
pages. So I think the right thing to do is fix that and still rely on
the struct page test here in kvm.

> IOW, is any of this memory (mapped as cacheable in the
> VMM) special purpose with only a subset of the CPU features supported?

At least the use cases I am interested in has the memory be
functionally indistinguishable from boot time DDR.

Cachable memory that does not function the same as DDR is something
else, maybe that is your "cachable IO memory" concept. I don't have a
use case for it. If we do decide to create it then it likely should be
tagged as MEMORY_DEVICE_PCI_P2PDMA in the mm side and not get into any
code paths that assume DDR.

Jason

Catalin Marinas Oct. 10, 2023, 5:19 p.m. UTC | #6

On Tue, Oct 10, 2023 at 12:05:02PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2023 at 03:25:22PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> > > > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > > > > From: Ankit Agrawal <ankita@nvidia.com>
> > > > > Currently KVM determines if a VMA is pointing at IO memory by checking
> > > > > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > > > > kind of memory it is by inspecting the VMA.
> > > > 
> > > > Well, it doesn't. It tells us what attributes the user mapped that
> > > > memory with, not whether it's I/O memory or standard RAM.
> > > 
> > > There is VM_IO which is intended to be used for address space with
> > > side effects.
> > > 
> > > And there is VM_PFNMAP which is intended to be used for address space
> > > without struct page (IO or not)
> > > 
> > > And finally we have the pgprot bit which define the cachability.
> > > 
> > > Do you have a definition of IO memory that those three things don't
> > > cover?
> > > 
> > > I would propose that, for KVM's purpose, IO memory is marked with
> > > VM_IO or a non-cachable pgprot
> > > 
> > > And "standard RAM" is defined by a cachable pgprot. Linux never makes
> > > something that is VM_IO cachable.
> > 
> > I think we can safely set a stage 2 Normal NC for a vma with pgprot
> > other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is
> > not that simple. Just because the VMM was allowed to map it as cacheable
> > does not mean that it supports all the CPU features. One example is MTE
> > where we can only guarantee that the RAM given to the OS at boot
> > supports tagged accesses. 
> 
> Is there a use case to supply the VMM with cachable memory that is not
> full featured? At least the vfio cases I am aware of do not actually
> want to do this and would probably like the arch to prevent these
> corner cases upfront.

The MTE case is the problematic one here. On a data access, the
interconnect shifts (right) the physical address and adds an offset. The
resulting address is used to access tags. Such shift+offset is
configured by firmware at boot and normally only covers the default
memory. If there's some memory on PCIe, it's very unlikely to be
covered and we can't tell whether it simply drops such tag accesses or
makes up some random address that may or may not hit an existing memory
or device.

We don't currently have a way to describe this in ACPI tables (there
were talks about describing special purpose memory, I lost track of the
progress) and the way MTE was first designed doesn't allow a hypervisor
to prevent the guest from generating a tagged access (other than mapping
the memory as non-cacheable at Stage 2). This has been fixed in newer
architecture versions but we haven't added Linux support for it yet (and
there's no hardware available either). AFAIK, there's no MTE support for
CXL-attached memory at the moment in the current interconnects, so
better not pretend it's general purpose memory that supports all the
features.

Other than preventing malicious guest behaviour, it depends what the VM
needs cacheable access for: some GPU memory that's only used for sharing
data and we don't need all features or general purpose memory that a VM
can use to run applications from etc. The former may not need all the
features (e.g. can skip exclusives) but the latter does.

We can probably work around the MTE case by only allowing cacheable
Stage 2 if MTE is disabled for the guest or FEAT_MTE_PERM is
implemented/supported (TBD when we'll add this). For the other cases, it
would be up to the VMM how it presents the mapping to the guest (device
pass-through or general purpose memory).

> > I've seen something similar in the past with
> > LSE atomics (or was it exclusives?) not being propagated. These don't
> > make the memory safe for a guest to use as general purpose RAM.
> 
> At least from a mm perspective, I think it is important that cachable
> struct pages are all the same and all interchangable. If the arch
> cannot provide this it should not allow the pgmap/memremap to succeed
> at all. Otherwise drivers using these new APIs are never going to work
> fully right..

Yes, for struct page backed memory, the current assumption is that all
are the same, support all CPU features. It's the PFN-based memory where
we don't have such guarantees.

> That leaves open the question of what to do with a cachable VM_PFNMAP,
> but if the arch can deal with the memremap issue then it seems like it
> can use the same logic when inspecting the VMA contents?

In the MTE case, memremap() never returns a Normal Tagged mapping and it
would not map it in user-space as PROT_MTE either. But if a page is not
mmap(PROT_MTE) (or VM_MTE in vma->flags) in the VMM, it doesn't mean the
guest should not be allowed to use MTE. Qemu for example doesn't map the
guest memory with mmap(PROT_MTE) since it doesn't have a need for it but
the guest can enable MTE independently (well, if enabled at the vCPU
level).

We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
by struct page. We could probe that in KVM and either fall back to
non-cacheable or allow cacheable if MTE is disable at the vCPU level.

Jason Gunthorpe Oct. 10, 2023, 6:23 p.m. UTC | #7

On Tue, Oct 10, 2023 at 06:19:14PM +0100, Catalin Marinas wrote:

> This has been fixed in newer architecture versions but we haven't
> added Linux support for it yet (and there's no hardware available
> either). AFAIK, there's no MTE support for CXL-attached memory at
> the moment in the current interconnects, so better not pretend it's
> general purpose memory that supports all the features.

I don't know much about MTE, but the use case imagined for CXL memory
allows the MM to exchange any system page with a CXL page. So there
cannot be a behavioral difference.

Can usespace continue to use tagged pointers even if the mm has moved
the pages to CXL that doesn't support it?

The main purpose for this is to tier VM memory, so having CXL
behaviorally different in a VM seems fatal to me.

Linux drivers need a way to understand this, like we can't have a CXL
memory pool driver or GPU driver calling memremap_pages() and getting
a somewhat broken system because of MTE incompatibilities. So maybe
ARM really should block memremap_pages() in case of MTE until
everything is resolved?

From the mm perspective we can't have two kinds of cachable struct
pages running around that are functionally different.

> Other than preventing malicious guest behaviour, it depends what the VM
> needs cacheable access for: some GPU memory that's only used for sharing
> data and we don't need all features or general purpose memory that a VM
> can use to run applications from etc. The former may not need all the
> features (e.g. can skip exclusives) but the latter does.

Like CXL memory pooling GPU memory is used interchangably with every
kind of DDR memory in every context. It must be completely transparent
and interchangable via the mm's migration machinery.

> > > I've seen something similar in the past with
> > > LSE atomics (or was it exclusives?) not being propagated. These don't
> > > make the memory safe for a guest to use as general purpose RAM.
> > 
> > At least from a mm perspective, I think it is important that cachable
> > struct pages are all the same and all interchangable. If the arch
> > cannot provide this it should not allow the pgmap/memremap to succeed
> > at all. Otherwise drivers using these new APIs are never going to work
> > fully right..
> 
> Yes, for struct page backed memory, the current assumption is that all
> are the same, support all CPU features. It's the PFN-based memory where
> we don't have such guarantees.

I see it got a bit confused, I am talking about memremap_pages() (ie
include/linux/memremap.h), not memremap (include/linux/io.h) for
getting non-struct page memory. It is confusing :|

memremap_pages() is one of the entry points of the struct page hotplug
machinery. Things like CXL drivers assume they can hot plug in new
memory through these APIs and get new cachable struct pages that are
functionally identical to boot time cachable struct pages.

> We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
> by struct page. We could probe that in KVM and either fall back to
> non-cacheable or allow cacheable if MTE is disable at the vCPU level.

I'm not sure what this does, it is only set by shmem_map? That is
much stricter than "mappings backed by struct page"

Still, I'm not sure how to proceed here - we veered into this MTE
stuff I don't know we have experiance with yet.

Thanks,
Jason

Catalin Marinas Oct. 11, 2023, 5:45 p.m. UTC | #8

On Tue, Oct 10, 2023 at 03:23:28PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2023 at 06:19:14PM +0100, Catalin Marinas wrote:
> > This has been fixed in newer architecture versions but we haven't
> > added Linux support for it yet (and there's no hardware available
> > either). AFAIK, there's no MTE support for CXL-attached memory at
> > the moment in the current interconnects, so better not pretend it's
> > general purpose memory that supports all the features.
> 
> I don't know much about MTE, but the use case imagined for CXL memory
> allows the MM to exchange any system page with a CXL page. So there
> cannot be a behavioral difference.
> 
> Can usespace continue to use tagged pointers even if the mm has moved
> the pages to CXL that doesn't support it?

That would lead to external aborts in the worst case or just losing tags
in the best. But none of them ideal.

> The main purpose for this is to tier VM memory, so having CXL
> behaviorally different in a VM seems fatal to me.

Yes, that's why if you can't tell the memory supports MTE, either it
should not be given to a guest with KVM_CAP_ARM_MTE or MTE gets disabled
in the guest.

> Linux drivers need a way to understand this, like we can't have a CXL
> memory pool driver or GPU driver calling memremap_pages() and getting
> a somewhat broken system because of MTE incompatibilities. So maybe
> ARM really should block memremap_pages() in case of MTE until
> everything is resolved?

I need to understand this part of the kernel a bit more but see below
(and MTE as it currently stands doesn't play well with server-like
systems, memory hotplug etc.)

> From the mm perspective we can't have two kinds of cachable struct
> pages running around that are functionally different.

From a Linux+MTE perspective, what goes into ZONE_MOVABLE should work
fine, all pages interchangeable (if it doesn't, the hardware is broken).
These are added via the add_memory_resource() hotplug path. If a
platform is known not to support this, it better not advertise MTE as a
feature (the CPUs usually have some tie-off signal when the rest of the
SoC cannot handle MTE). We could claim it's a hardware erratum if it
does.

But for ZONE_DEVICE ranges, these are not guaranteed to support all the
characteristics of the main RAM. I think that's what memremap_pages()
gives us. I'm not too familiar with this part of the kernel but IIUC
that falls under the HMM category, so not interchangeable with the
normal RAM (hotplugged or not).

> > Other than preventing malicious guest behaviour, it depends what the VM
> > needs cacheable access for: some GPU memory that's only used for sharing
> > data and we don't need all features or general purpose memory that a VM
> > can use to run applications from etc. The former may not need all the
> > features (e.g. can skip exclusives) but the latter does.
> 
> Like CXL memory pooling GPU memory is used interchangably with every
> kind of DDR memory in every context. It must be completely transparent
> and interchangable via the mm's migration machinery.

I don't see the mm code doing this but I haven't looked deep enough.
At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
allocating ZONE_DEVICE pages and passing them to the user.

> > > > I've seen something similar in the past with
> > > > LSE atomics (or was it exclusives?) not being propagated. These don't
> > > > make the memory safe for a guest to use as general purpose RAM.
> > > 
> > > At least from a mm perspective, I think it is important that cachable
> > > struct pages are all the same and all interchangable. If the arch
> > > cannot provide this it should not allow the pgmap/memremap to succeed
> > > at all. Otherwise drivers using these new APIs are never going to work
> > > fully right..
> > 
> > Yes, for struct page backed memory, the current assumption is that all
> > are the same, support all CPU features. It's the PFN-based memory where
> > we don't have such guarantees.
> 
> I see it got a bit confused, I am talking about memremap_pages() (ie
> include/linux/memremap.h), not memremap (include/linux/io.h) for
> getting non-struct page memory. It is confusing :|
> 
> memremap_pages() is one of the entry points of the struct page hotplug
> machinery. Things like CXL drivers assume they can hot plug in new
> memory through these APIs and get new cachable struct pages that are
> functionally identical to boot time cachable struct pages.

We have two mechanisms, one in memremap.c and another in
memory_hotplug.c. So far my assumption is that only the memory added by
the latter ends up in ZONE_MOVABLE and can be used by the kernel as any
of the ZONE_NORMAL RAM, transparently to the user. For ZONE_DEVICE
allocations, one would have to explicitly mmap() it via a device fd.

If a VMM wants to mmap() such GPU memory and give it to the guest as
general purpose RAM, it should make sure it has all the characteristics
as advertised by the CPU or disable certain features (if it can).
Currently we don't have a way to tell what such memory supports (neither
ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is
that it doesn't.

> > We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
> > by struct page. We could probe that in KVM and either fall back to
> > non-cacheable or allow cacheable if MTE is disable at the vCPU level.
> 
> I'm not sure what this does, it is only set by shmem_map? That is
> much stricter than "mappings backed by struct page"

This flag is similar to the VM_MAYWRITE etc. On an mmap(), the vma gets
the VM_MTE_ALLOWED flag if the mapping is MAP_ANONYMOUS (see
arch_calc_vm_flag_bits()) or the (driver) mmap function knows that the
memory supports MTE and sets the flag explicitly. Currently that's only
done in shmem_mmap() as we know where this memory is coming from. When
the user wants an mmap(PROT_MTE), the arch code checks whether
VM_MTE_ALLOWED is set on the vma before allowing tag accesses.

Memory mapped from ZONE_DEVICE won't have such flag set, so
mmap(PROT_MTE) will fail. But for KVM guests, there's no such mmap()
call into the hypervisor. A guest can simply enable MTE at stage 1
without the hypervisor being able to tell.

> Still, I'm not sure how to proceed here - we veered into this MTE
> stuff I don't know we have experiance with yet.

We veered mostly because on arm64 such GPU memory is not guaranteed to
have all the characteristics of the generic RAM. I think only MTE is the
dangerous one and it needs extra care but I wouldn't be surprised if we
notice atomics failing.

It looks like memremap_pages() also takes a memory type and I suspect
it's only safe to map MEMORY_DEVICE_COHERENT into a guest (as generic
RAM). Is there any sanity check on the host kernel side to allow VMM
cacheable mappings only for such memory and not the other
MEMORY_DEVICE_* types?

Going back to KVM, we can relax to cacheable mapping at Stage 2 if the
vma is cacheable and either VM_MTE_ALLOWED is set or KVM_CAP_ARM_MTE is
disabled. From the earlier discussions, we can probably ignore VM_IO
since we won't have a cacheable mapping with this flag. Not sure about
VM_PFNMAP.

Jason Gunthorpe Oct. 11, 2023, 6:38 p.m. UTC | #9

On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote:

> > From the mm perspective we can't have two kinds of cachable struct
> > pages running around that are functionally different.
> 
> From a Linux+MTE perspective, what goes into ZONE_MOVABLE should work
> fine, all pages interchangeable (if it doesn't, the hardware is
> broken).

Yes, and we imagine adding GPU and CXL memory to ZONE_MOVABLE (on a
NUMA node)

> These are added via the add_memory_resource() hotplug path. If a
> platform is known not to support this, it better not advertise MTE as a
> feature (the CPUs usually have some tie-off signal when the rest of the
> SoC cannot handle MTE). We could claim it's a hardware erratum if it
> does.

Seems logical
 
> But for ZONE_DEVICE ranges, these are not guaranteed to support all the
> characteristics of the main RAM. I think that's what memremap_pages()
> gives us. I'm not too familiar with this part of the kernel but IIUC
> that falls under the HMM category, so not interchangeable with the
> normal RAM (hotplugged or not).

DAX pages use ZONE_DEVICE and they are cachable, and not "HMM".

They are not fully interchangable, but they get into the page cache,
they can back .data segements, they could be subject atomics/etc. So
they should be fully functional like DDR.

> I don't see the mm code doing this but I haven't looked deep enough.
> At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
> allocating ZONE_DEVICE pages and passing them to the user.

Not ZONE_DEVICE. One popular coherent GPU approach is to use
ZONE_MOVABLE pages.

> > > > > I've seen something similar in the past with
> > > > > LSE atomics (or was it exclusives?) not being propagated. These don't
> > > > > make the memory safe for a guest to use as general purpose RAM.
> > > > 
> > > > At least from a mm perspective, I think it is important that cachable
> > > > struct pages are all the same and all interchangable. If the arch
> > > > cannot provide this it should not allow the pgmap/memremap to succeed
> > > > at all. Otherwise drivers using these new APIs are never going to work
> > > > fully right..
> > > 
> > > Yes, for struct page backed memory, the current assumption is that all
> > > are the same, support all CPU features. It's the PFN-based memory where
> > > we don't have such guarantees.
> > 
> > I see it got a bit confused, I am talking about memremap_pages() (ie
> > include/linux/memremap.h), not memremap (include/linux/io.h) for
> > getting non-struct page memory. It is confusing :|
> > 
> > memremap_pages() is one of the entry points of the struct page hotplug
> > machinery. Things like CXL drivers assume they can hot plug in new
> > memory through these APIs and get new cachable struct pages that are
> > functionally identical to boot time cachable struct pages.
> 
> We have two mechanisms, one in memremap.c and another in
> memory_hotplug.c.

Yes

>  So far my assumption is that only the memory added by
> the latter ends up in ZONE_MOVABLE and can be used by the kernel as any
> of the ZONE_NORMAL RAM, transparently to the user. 

Probably for now, yes. But CXL/GPU memory goes there too.

> For ZONE_DEVICE allocations, one would have to explicitly mmap() it
> via a device fd.

Not quite for DAX, it gets in through the page cache, but it is still
mmap on a FD and not anonymous memory.

> If a VMM wants to mmap() such GPU memory and give it to the guest as
> general purpose RAM, it should make sure it has all the characteristics
> as advertised by the CPU or disable certain features (if it can).

This is the VFIO flow we are talking about here, I think. PFNMAP
memory that goes into a VM that is cachable.

> Currently we don't have a way to tell what such memory supports (neither
> ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is
> that it doesn't.

Indeed, but my GPU driver hot plugged it as ZONE_MOVABLE and my VFIO
driver turned in into PFNMAP.. So these things seem incompatible.

> > > We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
> > > by struct page. We could probe that in KVM and either fall back to
> > > non-cacheable or allow cacheable if MTE is disable at the vCPU level.
> > 
> > I'm not sure what this does, it is only set by shmem_map? That is
> > much stricter than "mappings backed by struct page"
> 
> This flag is similar to the VM_MAYWRITE etc. On an mmap(), the vma gets
> the VM_MTE_ALLOWED flag if the mapping is MAP_ANONYMOUS (see
> arch_calc_vm_flag_bits()) or the (driver) mmap function knows that the
> memory supports MTE and sets the flag explicitly. Currently that's only
> done in shmem_mmap() as we know where this memory is coming from. When
> the user wants an mmap(PROT_MTE), the arch code checks whether
> VM_MTE_ALLOWED is set on the vma before allowing tag accesses.
> 
> Memory mapped from ZONE_DEVICE won't have such flag set, so
> mmap(PROT_MTE) will fail. But for KVM guests, there's no such mmap()
> call into the hypervisor. A guest can simply enable MTE at stage 1
> without the hypervisor being able to tell.

Yes, so this is all safe for DAX usages, not safe for GPU CXL NUMA
memory hotplug. :|

> > Still, I'm not sure how to proceed here - we veered into this MTE
> > stuff I don't know we have experiance with yet.
> 
> We veered mostly because on arm64 such GPU memory is not guaranteed to
> have all the characteristics of the generic RAM. I think only MTE is the
> dangerous one and it needs extra care but I wouldn't be surprised if we
> notice atomics failing.

So, at least for the system this change is being tested on, the
"pre-CXL" GPU memory is 100% interchangable with DDR memory. It is
surfaced to the OS as ZONE_MOVABLE and it should work in VMs like this
too.

> It looks like memremap_pages() also takes a memory type and I suspect
> it's only safe to map MEMORY_DEVICE_COHERENT into a guest (as generic
> RAM). Is there any sanity check on the host kernel side to allow VMM
> cacheable mappings only for such memory and not the other
> MEMORY_DEVICE_* types?

I guess it is this current KVM code we are discussing, it probably
happens by the pfn_is_map_memory() check?
 
> Going back to KVM, we can relax to cacheable mapping at Stage 2 if the
> vma is cacheable and either VM_MTE_ALLOWED is set or KVM_CAP_ARM_MTE is
> disabled.

This seems logical to me, thanks

> From the earlier discussions, we can probably ignore VM_IO
> since we won't have a cacheable mapping with this flag. Not sure about
> VM_PFNMAP.

PFNMAP is the interesting one for VFIO, at least. Can we use the same
reasoning that it will be !VM_MTE_ALLOWED and we can close the MTE
discussion.

Currently no VFIO driver is doing cachable that has memory that is
different from DDR memory. So this is sort of theoretical discussion
about future cachable HW that does use VFIO that does have a
non-uniformity.

Maybe that HW should set VM_IO on its VFIO PFN map and obviously not
use ZONE_MOVABLE?

Where does that leave us for this patch? We check the VM_MTE_ALLOWED
and check for ZONE_MOVABLE struct pages as one of the conditions for
NORMAL?

Jason

Catalin Marinas Oct. 12, 2023, 4:16 p.m. UTC | #10

On Wed, Oct 11, 2023 at 03:38:39PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote:
> > But for ZONE_DEVICE ranges, these are not guaranteed to support all the
> > characteristics of the main RAM. I think that's what memremap_pages()
> > gives us. I'm not too familiar with this part of the kernel but IIUC
> > that falls under the HMM category, so not interchangeable with the
> > normal RAM (hotplugged or not).
> 
> DAX pages use ZONE_DEVICE and they are cachable, and not "HMM".
> 
> They are not fully interchangable, but they get into the page cache,
> they can back .data segements, they could be subject atomics/etc. So
> they should be fully functional like DDR.

Unfortunately the Arm architecture makes the distinction between
"cacheable" and "cacheable tagged". We don't currently have any way of
describing this in firmware tables, so we rely on the hardware or
firmware not advertising MTE if such memory ends up as general purpose.

That said, DAX mappings are safe since the vma would have VM_MTE_ALLOWED
set, so no mmap(PROT_MTE) possible.

> > I don't see the mm code doing this but I haven't looked deep enough.
> > At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
> > allocating ZONE_DEVICE pages and passing them to the user.
> 
> Not ZONE_DEVICE. One popular coherent GPU approach is to use
> ZONE_MOVABLE pages.

OK, so presumably the driver could tell on which architecture it is
running and plug in the memory appropriately (or refuse to). It's a bit
more arch knowledge in a (generic) driver that I'd like but we don't
have a way to describe or probe this yet. Maybe a firmware config would
just turn MTE off in this case (SCTLR_EL3.ATA=0 and some other chicken
bit or tie-off for the ID registers not to advertise MTE).

> > If a VMM wants to mmap() such GPU memory and give it to the guest as
> > general purpose RAM, it should make sure it has all the characteristics
> > as advertised by the CPU or disable certain features (if it can).
> 
> This is the VFIO flow we are talking about here, I think. PFNMAP
> memory that goes into a VM that is cachable.
> 
> > Currently we don't have a way to tell what such memory supports (neither
> > ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is
> > that it doesn't.
> 
> Indeed, but my GPU driver hot plugged it as ZONE_MOVABLE and my VFIO
> driver turned in into PFNMAP.. So these things seem incompatible.

So can we end up with the same pfn mapped in two different vmas, one
backed by struct page and another with VM_PFNMAP (implying no struct
page)? I don't know how this is supposed to work. Even if the memory
supports MTE, we rely on the struct page to track the status of the tags
(the PG_mte_tagged flag). We may get away with this if user_mem_abort()
-> sanitise_mte_tags() -> pfn_to_page() finds a page structure but I'd
actually prevent this path altogether if VM_PFNMAP (well, we may
actually have a bug if we don't already do this).

> > From the earlier discussions, we can probably ignore VM_IO
> > since we won't have a cacheable mapping with this flag. Not sure about
> > VM_PFNMAP.
> 
> PFNMAP is the interesting one for VFIO, at least. Can we use the same
> reasoning that it will be !VM_MTE_ALLOWED and we can close the MTE
> discussion.
> 
> Currently no VFIO driver is doing cachable that has memory that is
> different from DDR memory. So this is sort of theoretical discussion
> about future cachable HW that does use VFIO that does have a
> non-uniformity.
> 
> Maybe that HW should set VM_IO on its VFIO PFN map and obviously not
> use ZONE_MOVABLE?

I think we should only keep VM_IO for memory mapped I/O with side
effects. Other ranges can be VM_PFNMAP if not backed by struct page.

> Where does that leave us for this patch? We check the VM_MTE_ALLOWED
> and check for ZONE_MOVABLE struct pages as one of the conditions for
> NORMAL?

I think we should keep it as simple as possible and, looking at it
again, maybe even ignore vm_page_prot. Two questions though:

1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
   MT_NORMAL_TAGGED?

2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and
   which end up in user_mem_abort()) imply VM_IO?

If yes to both, I think something like below would do:

	mte_allowed = kvm_vma_mte_allowed(vma);
	noncacheable = false;				// or 'device' as in user_mem_abort()
	...
	if (vma->flags & VM_IO)				// replaces !pfn_is_map_memory()
		noncacheable = true;
	else if (!mte_allowed && kvm_has_mte())
		noncaheable = true;
	...
	if (noncacheable)
		prot |= KVM_PGTABLE_PROT_DEVICE;	// or the new KVM_PGTABLE_PROT_NC

mte_allowed would cover DAX mappings (and, who knows, some future DAX
mapping may allow MTE and the driver explicitly set the flag). Anything
else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED set or
MTE disabled altogether.

Shameerali Kolothum Thodi Oct. 23, 2023, 1:20 p.m. UTC | #11

Hi,

> -----Original Message-----
> From: ankita@nvidia.com [mailto:ankita@nvidia.com]
> Sent: 07 September 2023 19:15
> To: ankita@nvidia.com; jgg@nvidia.com; maz@kernel.org;
> oliver.upton@linux.dev; catalin.marinas@arm.com; will@kernel.org
> Cc: aniketa@nvidia.com; cjia@nvidia.com; kwankhede@nvidia.com;
> targupta@nvidia.com; vsethi@nvidia.com; acurrid@nvidia.com;
> apopple@nvidia.com; jhubbard@nvidia.com; danw@nvidia.com;
> linux-arm-kernel@lists.infradead.org; kvmarm@lists.linux.dev;
> linux-kernel@vger.kernel.org
> Subject: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.
> 
> Replace pfn_is_map_memory() with a check on the VMA pgprot to
> determine if
> the memory is IO and thus needs stage-2 device mapping.
> 
> The VMA's pgprot is tested to determine the memory type with the
> following mapping:
> 
>  pgprot_noncached    MT_DEVICE_nGnRnE   device
>  pgprot_writecombine MT_NORMAL_NC       device
>  pgprot_device       MT_DEVICE_nGnRE    device
>  pgprot_tagged       MT_NORMAL_TAGGED   RAM
> 
> This patch solves a problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.
> 
> Unfortunately when FWB is not enabled, the kernel expects to naively do
> cache management by flushing the memory using an address in the
> kernel's map. This does not work in several of the newly allowed
> cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> and its mapping KVA is valid in case the FWB is absent before continuing.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
>  arch/arm64/kvm/hyp/pgtable.c         |  2 +-
>  arch/arm64/kvm/mmu.c                 | 40
> +++++++++++++++++++++++++---
>  3 files changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h
> b/arch/arm64/include/asm/kvm_pgtable.h
> index d3e354bb8351..0579dbe958b9 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -430,6 +430,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable
> *pgt, u64 addr, u64 size);
>   */
>  u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
> 
> +/**
> + * stage2_has_fwb() - Determine whether FWB is supported
> + * @pgt:    Page-table structure initialised by kvm_pgtable_stage2_init*()
> + *
> + * Return: True if FWB is supported.
> + */
> +bool stage2_has_fwb(struct kvm_pgtable *pgt);
> +
>  /**
>   * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2
> PGD
>   * @vtcr:	Content of the VTCR register.
> diff --git a/arch/arm64/kvm/hyp/pgtable.c
> b/arch/arm64/kvm/hyp/pgtable.c
> index f155b8c9e98c..ccd291b6893d 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -662,7 +662,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32
> phys_shift)
>  	return vtcr;
>  }
> 
> -static bool stage2_has_fwb(struct kvm_pgtable *pgt)
> +bool stage2_has_fwb(struct kvm_pgtable *pgt)
>  {
>  	if (!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>  		return false;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 482280fe22d7..79f1caaa08a0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1391,6 +1391,15 @@ static bool kvm_vma_mte_allowed(struct
> vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
> 
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type(pgprot_t page_prot)
> +{
> +	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu
> *vcpu, phys_addr_t fault_ipa,
>  	gfn = fault_ipa >> PAGE_SHIFT;
>  	mte_allowed = kvm_vma_mte_allowed(vma);
> 
> +	/*
> +	 * Figure out the memory type based on the user va mapping properties
> +	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
> +	 * pgprot_device() and pgprot_noncached() respectively.
> +	 */
> +	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
> +		prot |= KVM_PGTABLE_PROT_DEVICE;
> +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> +		prot |= KVM_PGTABLE_PROT_X;
> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
> 
> @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu
> *vcpu, phys_addr_t fault_ipa,
>  	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;
> 
> -	if (device)
> -		prot |= KVM_PGTABLE_PROT_DEVICE;
> -	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> -		prot |= KVM_PGTABLE_PROT_X;
> +	/*
> +	 *  When FWB is unsupported KVM needs to do cache flushes
> +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> +	 *  only possible if the memory is already mapped into the kernel map
> +	 *  at the usual spot.
> +	 *
> +	 *  Validate that there is a struct page for the PFN which maps
> +	 *  to the KVA that the flushing code expects.
> +	 */
> +	if (!stage2_has_fwb(pgt) &&
> +	    !(pfn_valid(pfn) &&
> +	      page_to_virt(pfn_to_page(pfn)) ==
> kvm_host_va(PFN_PHYS(pfn)))) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}

I don't quite follow the above check. Does no FWB matters for
Non-cacheable/device memory as well?

From a quick check, it breaks a n/w dev assignment on a platform that doesn't 
have FWB.

Qemu reports,

error: kvm run failed Invalid argument
 PC=ffff800080a95ed0 X00=ffff0000c7ff8090 X01=ffff0000c7ff80a0
X02=00000000fe180000 X03=ffff800085327000 X04=0000000000000001
X05=0000000000000040 X06=000000000000003f X07=0000000000000000
X08=ffff0000be190000 X09=0000000000000000 X10=0000000000000040

Please let me know.

Thanks,
Shameer

> 
>  	/*
>  	 * Under the premise of getting a FSC_PERM fault, we just need to relax
> --
> 2.17.1
>

Ankit Agrawal March 10, 2024, 3:49 a.m. UTC | #12

Bringing this to the fore.

>> Where does that leave us for this patch? We check the VM_MTE_ALLOWED
>> and check for ZONE_MOVABLE struct pages as one of the conditions for
>> NORMAL?
>
> I think we should keep it as simple as possible and, looking at it
> again, maybe even ignore vm_page_prot. Two questions though:
>
> 1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
>    MT_NORMAL_TAGGED?
>
> 2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and
>   which end up in user_mem_abort()) imply VM_IO?
>
> If yes to both, I think something like below would do:

How may we get the answer to these? It seems to be the logical behavior, but
how can we confirm?

While we discuss on that, I am considering to send out the next version of this
patch (incorporating the feedbacks on the thread) that is rebased to a commit
inclusive of the recently applied KVM patch series:

[KVM: arm64: Allow the VM to select DEVICE_* and NORMAL_NC for IO memory]
https://lore.kernel.org/all/20240224150546.368-1-ankita@nvidia.com/

>        mte_allowed = kvm_vma_mte_allowed(vma);
>        noncacheable = false;                           // or 'device' as in user_mem_abort()
>        ...
>        if (vma->flags & VM_IO)                         // replaces !pfn_is_map_memory()
>                noncacheable = true;
>        else if (!mte_allowed && kvm_has_mte())
>                noncaheable = true;
>        ...
>        if (noncacheable)
>                prot |= KVM_PGTABLE_PROT_DEVICE;        // or the new KVM_PGTABLE_PROT_NC
>
> mte_allowed would cover DAX mappings (and, who knows, some future DAX
> mapping may allow MTE and the driver explicitly set the flag). Anything
> else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED set or
> MTE disabled altogether.

--
Catalin

Jason Gunthorpe March 19, 2024, 1:38 p.m. UTC | #13

On Sun, Mar 10, 2024 at 03:49:36AM +0000, Ankit Agrawal wrote:
> Bringing this to the fore.
> 
> >> Where does that leave us for this patch? We check the VM_MTE_ALLOWED
> >> and check for ZONE_MOVABLE struct pages as one of the conditions for
> >> NORMAL?
> >
> > I think we should keep it as simple as possible and, looking at it
> > again, maybe even ignore vm_page_prot. Two questions though:
> >
> > 1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
> >    MT_NORMAL_TAGGED?

Drivers do crazy things, so I would not be surprised if something in
the tree does this (wrong) thing. We can check the pgprot in the vma
as well as the VM_IO to make the decision.

> > 2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and
> >   which end up in user_mem_abort()) imply VM_IO?

Yes, this is definately true for VFIO.

Jason

[v1,1/2] KVM: arm64: determine memory type from VMA

Commit Message

Comments

Patch