diff mbox series

[v3,1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

Message ID 20230606025320.22647-1-ankita@nvidia.com (mailing list archive)
State New, archived
Headers show
Series [v3,1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper | expand

Commit Message

Ankit Agrawal June 6, 2023, 2:53 a.m. UTC
From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
for the on-chip GPU that is the logical OS representation of the
internal proprietary cache coherent interconnect.

This representation has a number of limitations compared to a real PCI
device, in particular, it does not model the coherent GPU memory
aperture as a PCI config space BAR, and PCI doesn't know anything
about cacheable memory types.

Provide a VFIO PCI variant driver that adapts the unique PCI
representation into a more standard PCI representation facing
userspace. The GPU memory aperture is obtained from ACPI using
device_property_read_u64(), according to the FW specification,
and exported to userspace as a separate VFIO_REGION. Since the device
implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
to the next available PCI BAR (BAR2). Qemu will then naturally generate a
PCI device in the VM with two 64-bit BARs (where the cacheable aperture
reported in BAR2).

Since this memory region is actually cache coherent with the CPU, the
VFIO variant driver will mmap it into VMA using a cacheable mapping. The
mapping is done using remap_pfn_range().

PCI BAR are aligned to the power-of-2, but the actual memory on the
device may not. The physical address from the last device PFN up to the
next power-of-2 aligned PA thus is mapped to a dummy PFN through
vm_operations fault.

This goes along with a qemu series to provides the necessary
implementation of the Grace Hopper Superchip firmware specification so
that the guest operating system can see the correct ACPI modeling for
the coherent GPU device.
https://www.mail-archive.com/qemu-devel@nongnu.org/msg967557.html

This patch is split from a patch series being pursued separately:
https://lore.kernel.org/lkml/20230405180134.16932-1-ankita@nvidia.com/

Applied and tested over v6.4-rc4.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---

v2 -> v3
- Added fault handler to map the region outside the physical GPU memory
  up to the next power-of-2 to a dummy PFN.
- Changed to select instead of "depends on" VFIO_PCI_CORE for all the
  vfio-pci variant driver.
- Code cleanup based on feedback comments.
- Code implemented and tested against v6.4-rc4.

v1 -> v2
- Updated the wording of reference to BAR offset and replaced with
  index.
- The GPU memory is exposed at the fixed BAR2_REGION_INDEX.
- Code cleanup based on feedback comments.

 MAINTAINERS                        |   6 +
 drivers/vfio/pci/Kconfig           |   2 +
 drivers/vfio/pci/Makefile          |   2 +
 drivers/vfio/pci/hisilicon/Kconfig |   2 +-
 drivers/vfio/pci/mlx5/Kconfig      |   2 +-
 drivers/vfio/pci/nvgpu/Kconfig     |  10 ++
 drivers/vfio/pci/nvgpu/Makefile    |   3 +
 drivers/vfio/pci/nvgpu/main.c      | 257 +++++++++++++++++++++++++++++
 8 files changed, 282 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/pci/nvgpu/Kconfig
 create mode 100644 drivers/vfio/pci/nvgpu/Makefile
 create mode 100644 drivers/vfio/pci/nvgpu/main.c

Comments

Alex Williamson June 6, 2023, 2:32 p.m. UTC | #1
On Mon, 5 Jun 2023 19:53:20 -0700
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary cache coherent interconnect.
> 
> This representation has a number of limitations compared to a real PCI
> device, in particular, it does not model the coherent GPU memory
> aperture as a PCI config space BAR, and PCI doesn't know anything
> about cacheable memory types.
> 
> Provide a VFIO PCI variant driver that adapts the unique PCI
> representation into a more standard PCI representation facing
> userspace. The GPU memory aperture is obtained from ACPI using
> device_property_read_u64(), according to the FW specification,
> and exported to userspace as a separate VFIO_REGION. Since the device
> implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
> to the next available PCI BAR (BAR2). Qemu will then naturally generate a
> PCI device in the VM with two 64-bit BARs (where the cacheable aperture
> reported in BAR2).
> 
> Since this memory region is actually cache coherent with the CPU, the
> VFIO variant driver will mmap it into VMA using a cacheable mapping. The
> mapping is done using remap_pfn_range().
> 
> PCI BAR are aligned to the power-of-2, but the actual memory on the
> device may not. The physical address from the last device PFN up to the
> next power-of-2 aligned PA thus is mapped to a dummy PFN through
> vm_operations fault.

As noted in the QEMU series, this all suggests to me that we should
simply expose a device specific region for this coherent memory which
QEMU can then choose to expose to the VM as a BAR, or not.  It's
clearly not a BAR on bare metal, so if we need to go to all the trouble
to create ACPI tables to further define the coherent memory space,
what's the benefit of pretending that it's a PCI BAR?  ie. Why should a
VM view this as a BAR rather than follow the same semantics as bare
metal?

We can then have a discussion whether this even needs to be a variant
driver versus a vfio-pci quirk if this device specific region is the
only feature provided (ie. is migration in the future for this
driver?).  I think we're trying to push in the direction of variant
drivers, but I'm not sure how much that makes sense for the purpose of
only tacking on a device specific region.

Also, isn't "nvgpu" already overused?  Thanks,

Alex

 
> This goes along with a qemu series to provides the necessary
> implementation of the Grace Hopper Superchip firmware specification so
> that the guest operating system can see the correct ACPI modeling for
> the coherent GPU device.
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg967557.html
> 
> This patch is split from a patch series being pursued separately:
> https://lore.kernel.org/lkml/20230405180134.16932-1-ankita@nvidia.com/
> 
> Applied and tested over v6.4-rc4.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> 
> v2 -> v3
> - Added fault handler to map the region outside the physical GPU memory
>   up to the next power-of-2 to a dummy PFN.
> - Changed to select instead of "depends on" VFIO_PCI_CORE for all the
>   vfio-pci variant driver.
> - Code cleanup based on feedback comments.
> - Code implemented and tested against v6.4-rc4.
> 
> v1 -> v2
> - Updated the wording of reference to BAR offset and replaced with
>   index.
> - The GPU memory is exposed at the fixed BAR2_REGION_INDEX.
> - Code cleanup based on feedback comments.
> 
>  MAINTAINERS                        |   6 +
>  drivers/vfio/pci/Kconfig           |   2 +
>  drivers/vfio/pci/Makefile          |   2 +
>  drivers/vfio/pci/hisilicon/Kconfig |   2 +-
>  drivers/vfio/pci/mlx5/Kconfig      |   2 +-
>  drivers/vfio/pci/nvgpu/Kconfig     |  10 ++
>  drivers/vfio/pci/nvgpu/Makefile    |   3 +
>  drivers/vfio/pci/nvgpu/main.c      | 257 +++++++++++++++++++++++++++++
>  8 files changed, 282 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/pci/nvgpu/Kconfig
>  create mode 100644 drivers/vfio/pci/nvgpu/Makefile
>  create mode 100644 drivers/vfio/pci/nvgpu/main.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 250518fc70ff..51e364b8f00b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22150,6 +22150,12 @@ L:	kvm@vger.kernel.org
>  S:	Maintained
>  F:	drivers/vfio/platform/
>  
> +VFIO NVIDIA PCI DRIVER
> +M:	Ankit Agrawal <ankita@nvidia.com>
> +L:	kvm@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vfio/pci/nvgpu/
> +
>  VGA_SWITCHEROO
>  R:	Lukas Wunner <lukas@wunner.de>
>  S:	Maintained
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index f9d0c908e738..ade18b0ffb7b 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"
>  
>  source "drivers/vfio/pci/hisilicon/Kconfig"
>  
> +source "drivers/vfio/pci/nvgpu/Kconfig"
> +
>  endif
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 24c524224da5..0c93d452d0da 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>  obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>  
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
> +
> +obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu/
> diff --git a/drivers/vfio/pci/hisilicon/Kconfig b/drivers/vfio/pci/hisilicon/Kconfig
> index 5daa0f45d2f9..38e90e05d68a 100644
> --- a/drivers/vfio/pci/hisilicon/Kconfig
> +++ b/drivers/vfio/pci/hisilicon/Kconfig
> @@ -2,12 +2,12 @@
>  config HISI_ACC_VFIO_PCI
>  	tristate "VFIO PCI support for HiSilicon ACC devices"
>  	depends on ARM64 || (COMPILE_TEST && 64BIT)
> -	depends on VFIO_PCI_CORE
>  	depends on PCI_MSI
>  	depends on CRYPTO_DEV_HISI_QM
>  	depends on CRYPTO_DEV_HISI_HPRE
>  	depends on CRYPTO_DEV_HISI_SEC2
>  	depends on CRYPTO_DEV_HISI_ZIP
> +	select VFIO_PCI_CORE
>  	help
>  	  This provides generic PCI support for HiSilicon ACC devices
>  	  using the VFIO framework.
> diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
> index 29ba9c504a75..7088edc4fb28 100644
> --- a/drivers/vfio/pci/mlx5/Kconfig
> +++ b/drivers/vfio/pci/mlx5/Kconfig
> @@ -2,7 +2,7 @@
>  config MLX5_VFIO_PCI
>  	tristate "VFIO support for MLX5 PCI devices"
>  	depends on MLX5_CORE
> -	depends on VFIO_PCI_CORE
> +	select VFIO_PCI_CORE
>  	help
>  	  This provides migration support for MLX5 devices using the VFIO
>  	  framework.
> diff --git a/drivers/vfio/pci/nvgpu/Kconfig b/drivers/vfio/pci/nvgpu/Kconfig
> new file mode 100644
> index 000000000000..066f764f7c5f
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgpu/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config NVGPU_VFIO_PCI
> +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> +	depends on ARM64 || (COMPILE_TEST && 64BIT)
> +	select VFIO_PCI_CORE
> +	help
> +	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> +	  required to assign the GPU device to a VM using KVM/qemu/etc.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/nvgpu/Makefile b/drivers/vfio/pci/nvgpu/Makefile
> new file mode 100644
> index 000000000000..00fd3a078218
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgpu/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu-vfio-pci.o
> +nvgpu-vfio-pci-y := main.o
> diff --git a/drivers/vfio/pci/nvgpu/main.c b/drivers/vfio/pci/nvgpu/main.c
> new file mode 100644
> index 000000000000..9a838fe52a78
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgpu/main.c
> @@ -0,0 +1,257 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/vfio_pci_core.h>
> +
> +#define DUMMY_PFN \
> +	(((nvdev->mem_prop.hpa + nvdev->mem_prop.mem_length) >> PAGE_SHIFT) - 1)
> +
> +struct dev_mem_properties {
> +	uint64_t hpa;
> +	uint64_t mem_length;
> +};
> +
> +struct nvgpu_vfio_pci_core_device {
> +	struct vfio_pci_core_device core_device;
> +	struct dev_mem_properties mem_prop;
> +};
> +
> +static vm_fault_t nvgpu_vfio_pci_fault(struct vm_fault *vmf)
> +{
> +	unsigned long mem_offset = vmf->pgoff - vmf->vma->vm_pgoff;
> +	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
> +		vmf->vma->vm_file->private_data,
> +		struct nvgpu_vfio_pci_core_device, core_device.vdev);
> +	int ret;
> +
> +	/*
> +	 * We come here only if the access is to a memory region that is
> +	 * beyond physical GPU memory. Map such request to a dummy PFN.
> +	 */
> +	ret = remap_pfn_range(vmf->vma,
> +			vmf->vma->vm_start + (mem_offset << PAGE_SHIFT),
> +			DUMMY_PFN, PAGE_SIZE,
> +			vmf->vma->vm_page_prot);
> +	if (ret)
> +		return VM_FAULT_ERROR;
> +
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct nvgpu_vfio_pci_mmap_ops = {
> +	.fault = nvgpu_vfio_pci_fault,
> +};
> +
> +static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +	int ret;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	vfio_pci_core_finish_enable(vdev);
> +
> +	return 0;
> +}
> +
> +static int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
> +			struct vm_area_struct *vma)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
> +		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
> +
> +	unsigned long start_pfn;
> +	unsigned int index;
> +	u64 req_len, pgoff;
> +	int ret = 0;
> +
> +	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +	if (index != VFIO_PCI_BAR2_REGION_INDEX)
> +		return vfio_pci_core_mmap(core_vdev, vma);
> +
> +	/*
> +	 * Request to mmap the BAR. Map to the CPU accessible memory on the
> +	 * GPU using the memory information gathered from the system ACPI
> +	 * tables.
> +	 */
> +	start_pfn = nvdev->mem_prop.hpa >> PAGE_SHIFT;
> +	req_len = vma->vm_end - vma->vm_start;
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	if (pgoff >= (nvdev->mem_prop.mem_length >> PAGE_SHIFT))
> +		return -EINVAL;
> +
> +	/*
> +	 * Perform a PFN map to the memory. The device BAR is backed by the
> +	 * GPU memory now. Check that the mapping does not overflow out of
> +	 * the GPU memory size.
> +	 *
> +	 * The available GPU memory size may not be power-of-2 aligned. Given
> +	 * that the memory is exposed as a BAR, the mapping request is of the
> +	 * power-of-2 aligned size. Map only up to the size of the GPU memory.
> +	 * If the memory access is beyond the actual GPU memory size, it will
> +	 * be handled by the VMA ops fault through mapping to a dummy PFN.
> +	 *
> +	 * During device reset, the GPU is safely disconnected to the CPU
> +	 * and access to the BAR will be immediately returned preventing
> +	 * machine check.
> +	 */
> +	ret = remap_pfn_range(vma, vma->vm_start, start_pfn + pgoff,
> +			      min(req_len, nvdev->mem_prop.mem_length - pgoff),
> +			      vma->vm_page_prot);
> +	if (ret)
> +		return ret;
> +
> +	vma->vm_pgoff = start_pfn + pgoff;
> +	vma->vm_ops = &nvgpu_vfio_pci_mmap_ops;
> +
> +	return 0;
> +}
> +
> +static long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev,
> +			unsigned int cmd, unsigned long arg)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
> +		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
> +
> +	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> +	struct vfio_region_info info;
> +
> +	if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (info.index == VFIO_PCI_BAR2_REGION_INDEX) {
> +			/*
> +			 * Request to determine the BAR region information. Send the
> +			 * GPU memory information.
> +			 */
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +			info.size = nvdev->mem_prop.mem_length;
> +			info.flags = VFIO_REGION_INFO_FLAG_READ |
> +				     VFIO_REGION_INFO_FLAG_WRITE |
> +				     VFIO_REGION_INFO_FLAG_MMAP;
> +			return copy_to_user((void __user *)arg, &info, minsz) ?
> +				       -EFAULT : 0;
> +		}
> +	}
> +
> +	return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> +}
> +
> +static const struct vfio_device_ops nvgpu_vfio_pci_ops = {
> +	.name = "nvgpu-vfio-pci",
> +	.init = vfio_pci_core_init_dev,
> +	.release = vfio_pci_core_release_dev,
> +	.open_device = nvgpu_vfio_pci_open_device,
> +	.close_device = vfio_pci_core_close_device,
> +	.ioctl = nvgpu_vfio_pci_ioctl,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = nvgpu_vfio_pci_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +	.bind_iommufd = vfio_iommufd_physical_bind,
> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static struct nvgpu_vfio_pci_core_device *nvgpu_drvdata(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +
> +	return container_of(core_device, struct nvgpu_vfio_pci_core_device,
> +			    core_device);
> +}
> +
> +static int
> +nvgpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
> +				     struct nvgpu_vfio_pci_core_device *nvdev)
> +{
> +	int ret;
> +
> +	/*
> +	 * The memory information is present in the system ACPI tables as DSD
> +	 * properties nvidia,gpu-mem-base-pa and nvidia,gpu-mem-size.
> +	 */
> +	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-base-pa",
> +				       &(nvdev->mem_prop.hpa));
> +	if (ret)
> +		return ret;
> +
> +	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-size",
> +				       &(nvdev->mem_prop.mem_length));
> +	return ret;
> +}
> +
> +static int nvgpu_vfio_pci_probe(struct pci_dev *pdev,
> +				const struct pci_device_id *id)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev;
> +	int ret;
> +
> +	nvdev = vfio_alloc_device(nvgpu_vfio_pci_core_device, core_device.vdev,
> +				  &pdev->dev, &nvgpu_vfio_pci_ops);
> +	if (IS_ERR(nvdev))
> +		return PTR_ERR(nvdev);
> +
> +	dev_set_drvdata(&pdev->dev, nvdev);
> +
> +	ret = nvgpu_vfio_pci_fetch_memory_property(pdev, nvdev);
> +	if (ret)
> +		goto out_put_vdev;
> +
> +	ret = vfio_pci_core_register_device(&nvdev->core_device);
> +	if (ret)
> +		goto out_put_vdev;
> +
> +	return ret;
> +
> +out_put_vdev:
> +	vfio_put_device(&nvdev->core_device.vdev);
> +	return ret;
> +}
> +
> +static void nvgpu_vfio_pci_remove(struct pci_dev *pdev)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev = nvgpu_drvdata(pdev);
> +	struct vfio_pci_core_device *vdev = &nvdev->core_device;
> +
> +	vfio_pci_core_unregister_device(vdev);
> +	vfio_put_device(&vdev->vdev);
> +}
> +
> +static const struct pci_device_id nvgpu_vfio_pci_table[] = {
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2343) },
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
> +	{}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, nvgpu_vfio_pci_table);
> +
> +static struct pci_driver nvgpu_vfio_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = nvgpu_vfio_pci_table,
> +	.probe = nvgpu_vfio_pci_probe,
> +	.remove = nvgpu_vfio_pci_remove,
> +	.err_handler = &vfio_pci_core_err_handlers,
> +	.driver_managed_dma = true,
> +};
> +
> +module_pci_driver(nvgpu_vfio_pci_driver);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
> +MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
> +MODULE_DESCRIPTION(
> +	"VFIO NVGPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
Jason Gunthorpe June 6, 2023, 3:32 p.m. UTC | #2
On Tue, Jun 06, 2023 at 08:32:38AM -0600, Alex Williamson wrote:
> On Mon, 5 Jun 2023 19:53:20 -0700
> <ankita@nvidia.com> wrote:
> 
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > for the on-chip GPU that is the logical OS representation of the
> > internal proprietary cache coherent interconnect.
> > 
> > This representation has a number of limitations compared to a real PCI
> > device, in particular, it does not model the coherent GPU memory
> > aperture as a PCI config space BAR, and PCI doesn't know anything
> > about cacheable memory types.
> > 
> > Provide a VFIO PCI variant driver that adapts the unique PCI
> > representation into a more standard PCI representation facing
> > userspace. The GPU memory aperture is obtained from ACPI using
> > device_property_read_u64(), according to the FW specification,
> > and exported to userspace as a separate VFIO_REGION. Since the device
> > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
> > to the next available PCI BAR (BAR2). Qemu will then naturally generate a
> > PCI device in the VM with two 64-bit BARs (where the cacheable aperture
> > reported in BAR2).
> > 
> > Since this memory region is actually cache coherent with the CPU, the
> > VFIO variant driver will mmap it into VMA using a cacheable mapping. The
> > mapping is done using remap_pfn_range().
> > 
> > PCI BAR are aligned to the power-of-2, but the actual memory on the
> > device may not. The physical address from the last device PFN up to the
> > next power-of-2 aligned PA thus is mapped to a dummy PFN through
> > vm_operations fault.
> 
> As noted in the QEMU series, this all suggests to me that we should
> simply expose a device specific region for this coherent memory which
> QEMU can then choose to expose to the VM as a BAR, or not.  

It doesn't expose as a BAR on bare metal due to a HW limitation. When
we look toward VFIO CXL devices I would expect them to have proper
BARs and not this ACPI hack.

So the approach is to compartmentalize the hack to the bare metal
kernel driver and let the ABI and qemu parts be closer to what CXL
will eventually need.

> It's clearly not a BAR on bare metal, so if we need to go to all the
> trouble to create ACPI tables to further define the coherent memory
> space,

The ACPI tables shouldn't relate to the "BAR", they are needed to
overcome the NUMA problems in the kernel in the same way real device
FW does.

> what's the benefit of pretending that it's a PCI BAR?  ie. Why should a
> VM view this as a BAR rather than follow the same semantics as bare
> metal?

Primarily it is a heck of a lot simpler in qemu and better aligned
with where things are going.

> We can then have a discussion whether this even needs to be a variant
> driver versus a vfio-pci quirk if this device specific region is the
> only feature provided (ie. is migration in the future for this
> driver?).  

There is alot more here, go back to the original v1 posting to see it
all. This is way too much to be just a quirk.

Jason
Alex Williamson June 6, 2023, 5:05 p.m. UTC | #3
On Tue, 6 Jun 2023 12:32:13 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jun 06, 2023 at 08:32:38AM -0600, Alex Williamson wrote:
> > On Mon, 5 Jun 2023 19:53:20 -0700
> > <ankita@nvidia.com> wrote:
> >   
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > > 
> > > NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> > > for the on-chip GPU that is the logical OS representation of the
> > > internal proprietary cache coherent interconnect.
> > > 
> > > This representation has a number of limitations compared to a real PCI
> > > device, in particular, it does not model the coherent GPU memory
> > > aperture as a PCI config space BAR, and PCI doesn't know anything
> > > about cacheable memory types.
> > > 
> > > Provide a VFIO PCI variant driver that adapts the unique PCI
> > > representation into a more standard PCI representation facing
> > > userspace. The GPU memory aperture is obtained from ACPI using
> > > device_property_read_u64(), according to the FW specification,
> > > and exported to userspace as a separate VFIO_REGION. Since the device
> > > implements only one 64-bit BAR (BAR0), the GPU memory aperture is mapped
> > > to the next available PCI BAR (BAR2). Qemu will then naturally generate a
> > > PCI device in the VM with two 64-bit BARs (where the cacheable aperture
> > > reported in BAR2).
> > > 
> > > Since this memory region is actually cache coherent with the CPU, the
> > > VFIO variant driver will mmap it into VMA using a cacheable mapping. The
> > > mapping is done using remap_pfn_range().
> > > 
> > > PCI BAR are aligned to the power-of-2, but the actual memory on the
> > > device may not. The physical address from the last device PFN up to the
> > > next power-of-2 aligned PA thus is mapped to a dummy PFN through
> > > vm_operations fault.  
> > 
> > As noted in the QEMU series, this all suggests to me that we should
> > simply expose a device specific region for this coherent memory which
> > QEMU can then choose to expose to the VM as a BAR, or not.    
> 
> It doesn't expose as a BAR on bare metal due to a HW limitation. When
> we look toward VFIO CXL devices I would expect them to have proper
> BARs and not this ACPI hack.
> 
> So the approach is to compartmentalize the hack to the bare metal
> kernel driver and let the ABI and qemu parts be closer to what CXL
> will eventually need.
> 
> > It's clearly not a BAR on bare metal, so if we need to go to all the
> > trouble to create ACPI tables to further define the coherent memory
> > space,  
> 
> The ACPI tables shouldn't relate to the "BAR", they are needed to
> overcome the NUMA problems in the kernel in the same way real device
> FW does.
> 
> > what's the benefit of pretending that it's a PCI BAR?  ie. Why should a
> > VM view this as a BAR rather than follow the same semantics as bare
> > metal?  
> 
> Primarily it is a heck of a lot simpler in qemu and better aligned
> with where things are going.

It actually seems more complicated this way.  We're masquerading this
region as a BAR, but then QEMU needs to know based on device IDs that
it's really not a BAR, it has special size properties, mapping
attributes, error handling, etc.  Maybe we should have taken the hint
that it's not affected by the PCI config space memory enable bit that a
BAR region is not the right way for vfio to compose the device.

It's really beside the point whether you want QEMU to expose the memory
region to the VM as a BAR, but the more I see how this works the more
it makes sense to me that this should be a device specific region that
is the trigger for QEMU to setup these special properties.  It is
trivial for QEMU to expose a region as a BAR and then it can manage the
size issues for mapping, keeping things like an overflow page out of
the kernel.

> > We can then have a discussion whether this even needs to be a variant
> > driver versus a vfio-pci quirk if this device specific region is the
> > only feature provided (ie. is migration in the future for this
> > driver?).    
> 
> There is alot more here, go back to the original v1 posting to see it
> all. This is way too much to be just a quirk.

I'm not privy to a v1, the earliest I see is this (v3):

https://lore.kernel.org/all/20230405180134.16932-1-ankita@nvidia.com/

That outlines that we have a proprietary interconnect exposing cache
coherent memory which requires use of special mapping attributes vs a
standard PCI BAR and participates in ECC.  All of which seems like it
would be easier to setup in QEMU if the vfio-pci representation of the
device didn't masquerade this regions as a standard BAR.  In fact it
also reminds me of NVlink2 coherent RAM on POWER machines that was
similarly handled as device specific regions.  Thanks,

Alex
Jason Gunthorpe June 6, 2023, 5:16 p.m. UTC | #4
On Tue, Jun 06, 2023 at 11:05:10AM -0600, Alex Williamson wrote:

> It actually seems more complicated this way.  We're masquerading this
> region as a BAR, but then QEMU needs to know based on device IDs that
> it's really not a BAR, it has special size properties, mapping
> attributes, error handling, etc.  

This seems like something has gone wrong then. ie the SIGUBS error
handling stuff should be totally generic in the qemu side. Mapping
attributes are set by the kernel, qemu shouldn't know, doesn't need to
know.

The size issue is going to a be a problem in future anyhow, I expect
some new standards coming to support non-power-two sizes and they will
want to map to PCI devices in VMs still.

It seems OK to me if qemu can do this generically for any "BAR"
region, at least creating an entire "nvidia only" code path just for
non power 2 BAR sizing seems like a bad ABI choice.

> I'm not privy to a v1, the earliest I see is this (v3):
> 
> https://lore.kernel.org/all/20230405180134.16932-1-ankita@nvidia.com/
> 
> That outlines that we have a proprietary interconnect exposing cache
> coherent memory which requires use of special mapping attributes vs a
> standard PCI BAR and participates in ECC.  All of which seems like it
> would be easier to setup in QEMU if the vfio-pci representation of the
> device didn't masquerade this regions as a standard BAR.  In fact it
> also reminds me of NVlink2 coherent RAM on POWER machines that was
> similarly handled as device specific regions.  

It wasn't so good on POWER and if some of that stuff has been done
more generally we would have been further ahead here..

Jason
Alex Williamson June 6, 2023, 6:13 p.m. UTC | #5
On Tue, 6 Jun 2023 14:16:42 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jun 06, 2023 at 11:05:10AM -0600, Alex Williamson wrote:
> 
> > It actually seems more complicated this way.  We're masquerading this
> > region as a BAR, but then QEMU needs to know based on device IDs that
> > it's really not a BAR, it has special size properties, mapping
> > attributes, error handling, etc.    
> 
> This seems like something has gone wrong then. ie the SIGUBS error
> handling stuff should be totally generic in the qemu side. Mapping
> attributes are set by the kernel, qemu shouldn't know, doesn't need to
> know.

You asked me to look at the v1 posting to see why there's so much more
going on here than a quirk.  That's what I read from the first public
posting, a coherent memory region masqueraded as a BAR which requires
different memory mapping and participates in ECC.  I agree that the
actual mapping is done by the kernel, but it doesn't really make a
difference if that's a vfio-pci variant driver providing a different
mmap callback for a BAR region or a device specific region handler.

> The size issue is going to a be a problem in future anyhow, I expect
> some new standards coming to support non-power-two sizes and they will
> want to map to PCI devices in VMs still.

Ok, but a PCI BAR has specific constraints and a non-power-of-2 BAR is
not software compatible with those constraints.  That's obviously not
to say that a new capability couldn't expose arbitrary resources sizes
on a PCI-like device though.  I don't see how a non-power-of-2 BAR at
this stage helps or fits within any spec, which is exactly what's
being proposed through this BAR masquerade.
 
> It seems OK to me if qemu can do this generically for any "BAR"
> region, at least creating an entire "nvidia only" code path just for
> non power 2 BAR sizing seems like a bad ABI choice.

Have you looked at Ankit's QEMU series?  It's entirely NVIDIA-only code
paths.  Also nothing here precludes that shared code in QEMU might
expose some known arbitrary sized regions as a BAR, or whatever spec
defined thing allows that in the future.  It would only be a slight
modification in the QEMU code to key on the presence of a device
specific region rather than PCI vendor and device IDs, to then register
that region as a PCI BAR and proceed with all this NVIDIA specific
PXM/SRAT setup. IMO it makes a lot more sense to create memory-only
NUMA nodes based on a device specific region than it does a PCI BAR.

> > I'm not privy to a v1, the earliest I see is this (v3):
> > 
> > https://lore.kernel.org/all/20230405180134.16932-1-ankita@nvidia.com/
> > 
> > That outlines that we have a proprietary interconnect exposing cache
> > coherent memory which requires use of special mapping attributes vs a
> > standard PCI BAR and participates in ECC.  All of which seems like it
> > would be easier to setup in QEMU if the vfio-pci representation of the
> > device didn't masquerade this regions as a standard BAR.  In fact it
> > also reminds me of NVlink2 coherent RAM on POWER machines that was
> > similarly handled as device specific regions.    
> 
> It wasn't so good on POWER and if some of that stuff has been done
> more generally we would have been further ahead here..

Specifics?  Nothing here explained why masquerading the coherent memory
as a BAR in the vfio-pci ABI is anything more than a hack that QEMU
could assemble on its own with a device specific region.  Thanks,

Alex
Jason Gunthorpe June 6, 2023, 7:05 p.m. UTC | #6
On Tue, Jun 06, 2023 at 12:13:48PM -0600, Alex Williamson wrote:
> On Tue, 6 Jun 2023 14:16:42 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Jun 06, 2023 at 11:05:10AM -0600, Alex Williamson wrote:
> > 
> > > It actually seems more complicated this way.  We're masquerading this
> > > region as a BAR, but then QEMU needs to know based on device IDs that
> > > it's really not a BAR, it has special size properties, mapping
> > > attributes, error handling, etc.    
> > 
> > This seems like something has gone wrong then. ie the SIGUBS error
> > handling stuff should be totally generic in the qemu side. Mapping
> > attributes are set by the kernel, qemu shouldn't know, doesn't need to
> > know.
> 
> You asked me to look at the v1 posting to see why there's so much more
> going on here than a quirk.  That's what I read from the first public
> posting, a coherent memory region masqueraded as a BAR which requires
> different memory mapping and participates in ECC.  I agree that the
> actual mapping is done by the kernel, but it doesn't really make a
> difference if that's a vfio-pci variant driver providing a different
> mmap callback for a BAR region or a device specific region handler.

The ECC stuff is generic too.

Even the non-power-2 size thing is *generic*, even if isn't following
PCI SIG.

> > The size issue is going to a be a problem in future anyhow, I expect
> > some new standards coming to support non-power-two sizes and they will
> > want to map to PCI devices in VMs still.
> 
> Ok, but a PCI BAR has specific constraints and a non-power-of-2 BAR is
> not software compatible with those constraints.  That's obviously not
> to say that a new capability couldn't expose arbitrary resources sizes
> on a PCI-like device though.  I don't see how a non-power-of-2 BAR at
> this stage helps or fits within any spec, which is exactly what's
> being proposed through this BAR masquerade.

To emulate PCI, someone, somewhere, has to fix this mismatch up.

So given choices
  1) Qemu sees a special NVIDIA thing and fixes it
  2) Qemu sees a VFIO_PCI_BAR0_REGION with an odd size and fixes it
  3) Kernel lies and makes a power-2 size and it fixes it

2 seems the most forward looking and reusable.

I definately don't think this is important enough to stick a vendor
label on it.

Broadly, we are looking toward a way for the kernel VFIO variant
driver to provide the majority of the "PCI emulation" and the VMM can
be general. It is not nice if every PCI emulation type driver needs
unique modifications to the VMM to support it. We are planning more
than one of these things already, and there are industry standards
afoot that will widly open the door here.

> > It seems OK to me if qemu can do this generically for any "BAR"
> > region, at least creating an entire "nvidia only" code path just for
> > non power 2 BAR sizing seems like a bad ABI choice.
> 
> Have you looked at Ankit's QEMU series? 

Not well, I haven't seen any versions of it till it was posted

> It's entirely NVIDIA-only code paths.  Also nothing here precludes
> that shared code in QEMU might expose some known arbitrary sized
> regions as a BAR, or whatever spec defined thing allows that in the
> future.

It should not be like that. From a kernel perspective this is
basically all generic VMM code that can apply to any VFIO driver. The
kernel side was ment to be a general purpose API for the VMM.

The only special bit is emulating the weird Grace FW ACPI stuff.

> > > coherent memory which requires use of special mapping attributes vs a
> > > standard PCI BAR and participates in ECC.  All of which seems like it
> > > would be easier to setup in QEMU if the vfio-pci representation of the
> > > device didn't masquerade this regions as a standard BAR.  In fact it
> > > also reminds me of NVlink2 coherent RAM on POWER machines that was
> > > similarly handled as device specific regions.    
> > 
> > It wasn't so good on POWER and if some of that stuff has been done
> > more generally we would have been further ahead here..
> 
> Specifics?  Nothing here explained why masquerading the coherent memory
> as a BAR in the vfio-pci ABI is anything more than a hack that QEMU
> could assemble on its own with a device specific region.  Thanks,

Well, we deleted all the POWER stuff and got nothing general out of
it. Isn't that enough to say it was a bad idea? Here we are again with
the same basic CXLish hardware design and the answer should not be
keep making more vendor ABI.

I think this is the direction things are trending in the
industry. There is nothing particuarlly special, and certainly nothing
*nvidia specific* about these things.

So lets find a way to give these things appropriate generic names at
the ABI level please..

Jason
Alex Williamson June 6, 2023, 9:30 p.m. UTC | #7
On Tue, 6 Jun 2023 16:05:25 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jun 06, 2023 at 12:13:48PM -0600, Alex Williamson wrote:
> > On Tue, 6 Jun 2023 14:16:42 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Jun 06, 2023 at 11:05:10AM -0600, Alex Williamson wrote:
> > >   
> > > > It actually seems more complicated this way.  We're masquerading this
> > > > region as a BAR, but then QEMU needs to know based on device IDs that
> > > > it's really not a BAR, it has special size properties, mapping
> > > > attributes, error handling, etc.      
> > > 
> > > This seems like something has gone wrong then. ie the SIGUBS error
> > > handling stuff should be totally generic in the qemu side. Mapping
> > > attributes are set by the kernel, qemu shouldn't know, doesn't need to
> > > know.  
> > 
> > You asked me to look at the v1 posting to see why there's so much more
> > going on here than a quirk.  That's what I read from the first public
> > posting, a coherent memory region masqueraded as a BAR which requires
> > different memory mapping and participates in ECC.  I agree that the
> > actual mapping is done by the kernel, but it doesn't really make a
> > difference if that's a vfio-pci variant driver providing a different
> > mmap callback for a BAR region or a device specific region handler.  
> 
> The ECC stuff is generic too.
> 
> Even the non-power-2 size thing is *generic*, even if isn't following
> PCI SIG.

Generic yes, but exposing a non-power-of-2 size region for a BAR is...
well, wrong.

> > > The size issue is going to a be a problem in future anyhow, I expect
> > > some new standards coming to support non-power-two sizes and they will
> > > want to map to PCI devices in VMs still.  
> > 
> > Ok, but a PCI BAR has specific constraints and a non-power-of-2 BAR is
> > not software compatible with those constraints.  That's obviously not
> > to say that a new capability couldn't expose arbitrary resources sizes
> > on a PCI-like device though.  I don't see how a non-power-of-2 BAR at
> > this stage helps or fits within any spec, which is exactly what's
> > being proposed through this BAR masquerade.  
> 
> To emulate PCI, someone, somewhere, has to fix this mismatch up.
> 
> So given choices
>   1) Qemu sees a special NVIDIA thing and fixes it
>   2) Qemu sees a VFIO_PCI_BAR0_REGION with an odd size and fixes it
>   3) Kernel lies and makes a power-2 size and it fixes it
> 
> 2 seems the most forward looking and reusable.

What?!  It's not just a matter of fixing it.  The vfio-pci uAPI should
never return a BAR region that's not compatible as a BAR.  It's
incorrectly sized, it does special things with mmap under the covers,
and it doesn't honor the memory enable bit.  And then QEMU goes on to
ignore this peculiarity when setting up all the ACPI features, instead
relying on the PCI vendor/device ID when it could be using a device
specific region to initiate that support.

> I definately don't think this is important enough to stick a vendor
> label on it.

How high is the bar for a device specific region?  This certainly looks
and smells like one to me.

> Broadly, we are looking toward a way for the kernel VFIO variant
> driver to provide the majority of the "PCI emulation" and the VMM can
> be general. It is not nice if every PCI emulation type driver needs
> unique modifications to the VMM to support it. We are planning more
> than one of these things already, and there are industry standards
> afoot that will widly open the door here.

Meanwhile every VMM needs a hook to extend non-compliant BAR sizes,
assuming the kernel will fixup mappings beyond the region extent, and
pretend that none of this is a device bug?  It really is a very small
amount of code in QEMU to setup a MemoryRegion based on a device
specific region and register it as a PCI BAR.  The non-standard size is
a factor here when mapping to the VM address space, but I'm a bit
surprised to hear an argument for hacking that in the kernel rather
than userspace.

> > > It seems OK to me if qemu can do this generically for any "BAR"
> > > region, at least creating an entire "nvidia only" code path just for
> > > non power 2 BAR sizing seems like a bad ABI choice.  
> > 
> > Have you looked at Ankit's QEMU series?   
> 
> Not well, I haven't seen any versions of it till it was posted
> 
> > It's entirely NVIDIA-only code paths.  Also nothing here precludes
> > that shared code in QEMU might expose some known arbitrary sized
> > regions as a BAR, or whatever spec defined thing allows that in the
> > future.  
> 
> It should not be like that. From a kernel perspective this is
> basically all generic VMM code that can apply to any VFIO driver. The
> kernel side was ment to be a general purpose API for the VMM.
> 
> The only special bit is emulating the weird Grace FW ACPI stuff.

And a device specific region seems like the ideal jumping off point to
create that memory-only node for this thing.
 
> > > > coherent memory which requires use of special mapping attributes vs a
> > > > standard PCI BAR and participates in ECC.  All of which seems like it
> > > > would be easier to setup in QEMU if the vfio-pci representation of the
> > > > device didn't masquerade this regions as a standard BAR.  In fact it
> > > > also reminds me of NVlink2 coherent RAM on POWER machines that was
> > > > similarly handled as device specific regions.      
> > > 
> > > It wasn't so good on POWER and if some of that stuff has been done
> > > more generally we would have been further ahead here..  
> > 
> > Specifics?  Nothing here explained why masquerading the coherent memory
> > as a BAR in the vfio-pci ABI is anything more than a hack that QEMU
> > could assemble on its own with a device specific region.  Thanks,  
> 
> Well, we deleted all the POWER stuff and got nothing general out of
> it. Isn't that enough to say it was a bad idea? Here we are again with
> the same basic CXLish hardware design and the answer should not be
> keep making more vendor ABI.
> 
> I think this is the direction things are trending in the
> industry. There is nothing particuarlly special, and certainly nothing
> *nvidia specific* about these things.
> 
> So lets find a way to give these things appropriate generic names at
> the ABI level please..

What is the generic feature that "these things" implement?

There's a lot of vendor specific things going on here.  Not only is all
that "weird Grace FW ACPI stuff" based on this region, but also if we
are exposing it as a BAR, which BAR index(s) for a given device.

If "the industry" does come out with a spec for "these things",
couldn't QEMU optionally plug a device specific region into the
implementation of that spec, potentially along with some commonly
defined region to make this device appear to honor that new spec?
Versus with the in-kernel variant driver masquerading the BAR, we're
stuck with what the kernel implements.  Would future hardware
implementing to this new spec require a variant driver or would we
extend vfio-pci to support them with extended regions and expect the
VMM to compose them appropriately?  Thanks,

Alex
Jason Gunthorpe June 7, 2023, 12:14 a.m. UTC | #8
On Tue, Jun 06, 2023 at 03:30:57PM -0600, Alex Williamson wrote:
> > To emulate PCI, someone, somewhere, has to fix this mismatch up.
> > 
> > So given choices
> >   1) Qemu sees a special NVIDIA thing and fixes it
> >   2) Qemu sees a VFIO_PCI_BAR0_REGION with an odd size and fixes it
> >   3) Kernel lies and makes a power-2 size and it fixes it
> > 
> > 2 seems the most forward looking and reusable.
> 
> What?!  It's not just a matter of fixing it.  The vfio-pci uAPI should
> never return a BAR region that's not compatible as a BAR. 

Why? But OK, if you don't like then then let's call it
VFIO_PCI_BAR0_REGION_NOT_POW2. Not seeing that it really helps so
much..

> It's incorrectly sized, it does special things with mmap under the
> covers, and it doesn't honor the memory enable bit.

The mmap attributes stuff is not uAPI visible, so it doesn't matter.

Normal vfio-pci devices will SIGBUS on memory disable, this could do
that if it was really important (I don't think it is)

So we are left with.. The size is weird. Someone has to provide the
fixing to fit that into the PCI config space world because we are
emulating a real PCI device.

The fixing is generic, a generic function does not elevate to create a
vendor uAPI IMHO.

> And then QEMU goes on to ignore this peculiarity when setting up all
> the ACPI features, instead relying on the PCI vendor/device ID when
> it could be using a device specific region to initiate that support.

We really should not rely on vendor regions to trigger device specific
VMM behaviors for a variant driver. If we want to do this better than
vendor/device ID we should have a VFIO ioctl report which variant
driver is running the device so userspace can do whatever.

> > I definately don't think this is important enough to stick a vendor
> > label on it.
> 
> How high is the bar for a device specific region?  This certainly looks
> and smells like one to me.

I would say if the thing that is showing up on the VM side is not PCI
then maybe a vendor label might make sense.

> > Broadly, we are looking toward a way for the kernel VFIO variant
> > driver to provide the majority of the "PCI emulation" and the VMM can
> > be general. It is not nice if every PCI emulation type driver needs
> > unique modifications to the VMM to support it. We are planning more
> > than one of these things already, and there are industry standards
> > afoot that will widly open the door here.
> 
> Meanwhile every VMM needs a hook to extend non-compliant BAR sizes,
> assuming the kernel will fixup mappings beyond the region extent,

Yes! It is a basically a new generic VFIO ability to allow this size
adaptation. If you don't like this version of the uAPI then lets tweak
it, but it still needs the same basic operation where the kernel tells
userspace that a certain mmap is to be placed in a certain BAR config
space.

> pretend that none of this is a device bug?  It really is a very small
> amount of code in QEMU to setup a MemoryRegion based on a device
> specific region and register it as a PCI BAR.  The non-standard size is
> a factor here when mapping to the VM address space, but I'm a bit
> surprised to hear an argument for hacking that in the kernel rather
> than userspace.

Well, I'd rather do it in userspace.

> > The only special bit is emulating the weird Grace FW ACPI stuff.
> 
> And a device specific region seems like the ideal jumping off point to
> create that memory-only node for this thing.

It really has nothing to do with the regions, it is something that is
needed if this variant driver is being used at all. The vPCI device
will work without the ACPI, but the Linux drivers won't like it.

> > So lets find a way to give these things appropriate generic names at
> > the ABI level please..
> 
> What is the generic feature that "these things" implement?

As far as I can see, non-power-2 size is the thing the VMM needs to
worry about.

And maybe a generic way to detect which variant driver is running.

> There's a lot of vendor specific things going on here.  Not only is all
> that "weird Grace FW ACPI stuff" based on this region, but also if we
> are exposing it as a BAR, which BAR index(s) for a given device.

The kernel decides the BAR indexes, not the vmm, because broadly we
want to have the kernel in charge of making the synthetic config
space.

The ACPI is not related to the region. It is just creating many empty
NUMA nodes. They should have no CPUs and no memory. The patch is
trying to make the insertion of the ACPI automatic. Keying it off a
region is not right for the purpose.

> If "the industry" does come out with a spec for "these things",
> couldn't QEMU optionally plug a device specific region into the
> implementation of that spec, potentially along with some commonly
> defined region to make this device appear to honor that new spec?
> Versus with the in-kernel variant driver masquerading the BAR, we're
> stuck with what the kernel implements.  Would future hardware
> implementing to this new spec require a variant driver or would we
> extend vfio-pci to support them with extended regions and expect the
> VMM to compose them appropriately?  

Look at Intel's SIOV document for some idea, I expect devices like
that will be a VFIO driver (not a variant PCI driver) that largely
exposes the vfio-pci uAPI with the purpose of creating a vPCI device
in the VM.

There is no real PCI function under this so all the config space and
so on will be synthetic. It is convenient if the kernel driver does
this so that it works on all the VMMs generically.

Non-power-2 BAR is desirable in this world because address space is
precious at high scale and power2 scaling gets wasteful.

Further, I would expect there will be a generic CXL driver and generic
CXL related ACPI someday.

This device is sort of in the middle where it does have a real PCI
function but it is also synthesizing some config space. We have
another VFIO driver in progress that is also doing some modification
of the config space..

Jason
Alex Williamson June 7, 2023, 6:23 p.m. UTC | #9
On Tue, 6 Jun 2023 21:14:07 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jun 06, 2023 at 03:30:57PM -0600, Alex Williamson wrote:
> > > To emulate PCI, someone, somewhere, has to fix this mismatch up.
> > > 
> > > So given choices
> > >   1) Qemu sees a special NVIDIA thing and fixes it
> > >   2) Qemu sees a VFIO_PCI_BAR0_REGION with an odd size and fixes it
> > >   3) Kernel lies and makes a power-2 size and it fixes it
> > > 
> > > 2 seems the most forward looking and reusable.  
> > 
> > What?!  It's not just a matter of fixing it.  The vfio-pci uAPI should
> > never return a BAR region that's not compatible as a BAR.   
> 
> Why? But OK, if you don't like then then let's call it
> VFIO_PCI_BAR0_REGION_NOT_POW2. Not seeing that it really helps so
> much..
> 
> > It's incorrectly sized, it does special things with mmap under the
> > covers, and it doesn't honor the memory enable bit.  
> 
> The mmap attributes stuff is not uAPI visible, so it doesn't matter.
> 
> Normal vfio-pci devices will SIGBUS on memory disable, this could do
> that if it was really important (I don't think it is)
> 
> So we are left with.. The size is weird. Someone has to provide the
> fixing to fit that into the PCI config space world because we are
> emulating a real PCI device.
> 
> The fixing is generic, a generic function does not elevate to create a
> vendor uAPI IMHO.

Who else is trying to expose a non-power-of-2 region as a BAR right
now?  We have neither a specification nor a complimentary implementation
from which to derive generic support.  It's possible that the vendor
uAPI could become the de facto implementation, at which point we still
might share the code and generalize the interface, but I have no
crystal ball to predict that.

GPUs seem to manage to have non-power-of-2 size VRAM while still
providing BARs that are a power-of-2, ex. 6GB VRAM, 8GB BAR.  Why
shouldn't the variant driver here extend the BAR region to a power-of-2
and then we can decide the error handling should accesses exceed the
implemented range? (as we know, calling remap_pfn_range() as done here
from a fault handler is troublesome)  If you really want to eliminate
VMM changes, it would be through this.

If there's a requirement for providing the actual size, a vfio
capability on the region or variant driver implementaiton of a device
specific capability in config space of the device could provide that.

> > And then QEMU goes on to ignore this peculiarity when setting up all
> > the ACPI features, instead relying on the PCI vendor/device ID when
> > it could be using a device specific region to initiate that support.  
> 
> We really should not rely on vendor regions to trigger device specific
> VMM behaviors for a variant driver. If we want to do this better than
> vendor/device ID we should have a VFIO ioctl report which variant
> driver is running the device so userspace can do whatever.

Disagree, vendor specific regions or capabilities are the preferred way
to trigger device specific behavior in QEMU.  Surely we don't really
want to endorse a solution of exposing a driver name to infer the
feature set when we have a means to explicitly report features through
capabilities and device specific regions...

> > > I definately don't think this is important enough to stick a vendor
> > > label on it.  
> > 
> > How high is the bar for a device specific region?  This certainly looks
> > and smells like one to me.  
> 
> I would say if the thing that is showing up on the VM side is not PCI
> then maybe a vendor label might make sense.

Well, how do you suppose a device with a non-power-of-2 BAR is PCI
compliant?  You're asking the VMM to assume what the driver meant by
providing that non-standard BAR, which sounds vendor specific to me.

> > > Broadly, we are looking toward a way for the kernel VFIO variant
> > > driver to provide the majority of the "PCI emulation" and the VMM can
> > > be general. It is not nice if every PCI emulation type driver needs
> > > unique modifications to the VMM to support it. We are planning more
> > > than one of these things already, and there are industry standards
> > > afoot that will widly open the door here.  
> > 
> > Meanwhile every VMM needs a hook to extend non-compliant BAR sizes,
> > assuming the kernel will fixup mappings beyond the region extent,  
> 
> Yes! It is a basically a new generic VFIO ability to allow this size
> adaptation. If you don't like this version of the uAPI then lets tweak
> it, but it still needs the same basic operation where the kernel tells
> userspace that a certain mmap is to be placed in a certain BAR config
> space.

"Size adaptation", that sounds like marketing spin for non-compliant.

> > pretend that none of this is a device bug?  It really is a very small
> > amount of code in QEMU to setup a MemoryRegion based on a device
> > specific region and register it as a PCI BAR.  The non-standard size is
> > a factor here when mapping to the VM address space, but I'm a bit
> > surprised to hear an argument for hacking that in the kernel rather
> > than userspace.  
> 
> Well, I'd rather do it in userspace.

A device specific region is the only way I see to do it in userspace.
We should never be providing a non-standard size BAR through the vfio
BAR region indexes.  Is your primary complaint here that you don't want
that region to be labeled VFIO_PCI_NVGPU_BAR1?  We could certainly
define VFIO_PCI_VENDOR_BAR0..5 where QEMU knows that it's supposed to
relax expectations and mangle the region into a compliant BAR, but now
you're adding complexity that may never be used elsewhere.

> > > The only special bit is emulating the weird Grace FW ACPI stuff.  
> > 
> > And a device specific region seems like the ideal jumping off point to
> > create that memory-only node for this thing.  
> 
> It really has nothing to do with the regions, it is something that is
> needed if this variant driver is being used at all. The vPCI device
> will work without the ACPI, but the Linux drivers won't like it.

OTOH if the ACPI work is based on device specific regions, the list of
device IDs in QEMU goes away and support for a new device requires no
VMM changes.

> > > So lets find a way to give these things appropriate generic names at
> > > the ABI level please..  
> > 
> > What is the generic feature that "these things" implement?  
> 
> As far as I can see, non-power-2 size is the thing the VMM needs to
> worry about.
> 
> And maybe a generic way to detect which variant driver is running.

Technically a capability on the device info would allow for this, but I
think it's the wrong approach.  Either expose a compliant BAR region
and do whatever you want with device IDs to hook in the ACPI changes or
make the VMM handle the emulation of a device specific region as a BAR,
at which point not basing the ACPI hooks on those regions would be
redundant and require more maintenance overhead.

> > There's a lot of vendor specific things going on here.  Not only is all
> > that "weird Grace FW ACPI stuff" based on this region, but also if we
> > are exposing it as a BAR, which BAR index(s) for a given device.  
> 
> The kernel decides the BAR indexes, not the vmm, because broadly we
> want to have the kernel in charge of making the synthetic config
> space.
> 
> The ACPI is not related to the region. It is just creating many empty
> NUMA nodes. They should have no CPUs and no memory. The patch is
> trying to make the insertion of the ACPI automatic. Keying it off a
> region is not right for the purpose.

Why aren't the different NUMA nodes a machine type option?  If we start
having each device mangle the machine in incompatible ways it seems
like we're going to get conflicts not only with other devices, but also
user specified NUMA configurations.  I'm struggling with whether I can
set some bits in the root port devcap2 register[1] based on device
capabilities and here this is fundamentally manipulating the VM
topology.

[1]https://lore.kernel.org/all/20230526231558.1660396-1-alex.williamson@redhat.com/

> > If "the industry" does come out with a spec for "these things",
> > couldn't QEMU optionally plug a device specific region into the
> > implementation of that spec, potentially along with some commonly
> > defined region to make this device appear to honor that new spec?
> > Versus with the in-kernel variant driver masquerading the BAR, we're
> > stuck with what the kernel implements.  Would future hardware
> > implementing to this new spec require a variant driver or would we
> > extend vfio-pci to support them with extended regions and expect the
> > VMM to compose them appropriately?    
> 
> Look at Intel's SIOV document for some idea, I expect devices like
> that will be a VFIO driver (not a variant PCI driver) that largely
> exposes the vfio-pci uAPI with the purpose of creating a vPCI device
> in the VM.
> 
> There is no real PCI function under this so all the config space and
> so on will be synthetic. It is convenient if the kernel driver does
> this so that it works on all the VMMs generically.
> 
> Non-power-2 BAR is desirable in this world because address space is
> precious at high scale and power2 scaling gets wasteful.
> 
> Further, I would expect there will be a generic CXL driver and generic
> CXL related ACPI someday.
> 
> This device is sort of in the middle where it does have a real PCI
> function but it is also synthesizing some config space. We have
> another VFIO driver in progress that is also doing some modification
> of the config space..

Thanks for the background, but PCI is still a standard that requires
power-of-2 BAR sizes, so either the in-kernel variant driver needs to
round up and handle the empty space or the VMM does, and the VMM should
specifically enable this for new region indexes and be allowed to
assume PCI compliant BARs through the existing region indexes.  Thanks,

Alex
Jason Gunthorpe June 13, 2023, 2:35 p.m. UTC | #10
On Wed, Jun 07, 2023 at 12:23:03PM -0600, Alex Williamson wrote:

> > The fixing is generic, a generic function does not elevate to create a
> > vendor uAPI IMHO.
> 
> Who else is trying to expose a non-power-of-2 region as a BAR right
> now? 

I see a few needs in other places internally that are not public yet,
especially in the SIOV world I alluded to below.

> We have neither a specification nor a complimentary implementation
> from which to derive generic support.  

"specification"? It is literally - the size of the bar is not a power
of two, so make the resulting hole when mapping that to a vPCI discard
writes and read 0s.

This isn't a PCI specification, it is a contract between the kernel
side and the VMM side within VFIO about how to handle VFIO devices
where the kernel driver wants to have a padding.

It makes sense that we shouldn't just do this blidly for the existing
indexes without some negotation, but that isn't a PCI "compliance"
problem, that is a incomplete VFIO uAPI design in this patch.

> GPUs seem to manage to have non-power-of-2 size VRAM while still
> providing BARs that are a power-of-2, ex. 6GB VRAM, 8GB BAR.  

Generally excess BAR is modeled in HW as discard writes return 0 (or
maybe return -1). HW can do this easially. SW is more tricky

> Why shouldn't the variant driver here extend the BAR region to a
> power-of-2 and then we can decide the error handling should accesses
> exceed the implemented range?

This sounds doable, I don't know if Ankit had a problem with using the
sparse mmap feature to do this. We'd need to have the padding be
non-mmapable space and then the kernel driver would do the discard/0
with the read/write calls.

If this works out I'm happy enough if we go this way too. I suppose it
allows better compatibility with all the VMMs.

> > I would say if the thing that is showing up on the VM side is not PCI
> > then maybe a vendor label might make sense.
> 
> Well, how do you suppose a device with a non-power-of-2 BAR is PCI
> compliant?  You're asking the VMM to assume what the driver meant by
> providing that non-standard BAR, which sounds vendor specific to me.

It shows up as a PCI compliant device in the VM, with a compliant
power of two size.

> Is your primary complaint here that you don't want
> that region to be labeled VFIO_PCI_NVGPU_BAR1?  

Yes. This modified VFIO uAPI contract is general enough it should not
be forced as unique to this device.

> We could certainly define VFIO_PCI_VENDOR_BAR0..5 where QEMU knows
> that it's supposed to relax expectations and mangle the region into
> a compliant BAR, but now you're adding complexity that may never be
> used elsewhere.

I wouldn't use a _VENDOR_ name for this since it is generic.

I would suggest using a FEATURE:

/*
 * When SET to 1 it indicates that the userspace understands non-power of two
 * region sizes on VFIO_PCI_BARx_REGION_INDEX. If the kernel driver requests a
 * non-power of two size then if userspace needs to round the size back up to a
 * power of two, eg to create a vPCI device, it should return 0 for reads and
 * discard writes for the padding that was added.
 *
 * If this flag is not set, and the VFIO driver cannot work without non-power of
 * two BARs then mmap of those BAR indexes will fail. (FIXME: maybe
 * this needs more thinking)
 */
#define VFIO_DEVICE_FEATURE_PCI_PAD_BARS 10

As adding new index types for every new functionality will become
combinatoral, and adding discovery of which vPCI BAR index the new
indexes should map too is also complicated..

Though sparse mmap is probably better.

> > It really has nothing to do with the regions, it is something that is
> > needed if this variant driver is being used at all. The vPCI device
> > will work without the ACPI, but the Linux drivers won't like it.
> 
> OTOH if the ACPI work is based on device specific regions, the list of
> device IDs in QEMU goes away and support for a new device requires no
> VMM changes.

Using the device ID seems like the better approach as we don't know
what future devices using this varient driver are going to need for
ACPI modeling.

It also seems I was too optimistic to think a simple variant driver ID
would be sufficient. IMHO you are closer below, it depends on what
bare metal FW qemu is trying to emulate, which I suppose is
technically a combination of the machine type and the installed vPCI
devices..

> > The ACPI is not related to the region. It is just creating many empty
> > NUMA nodes. They should have no CPUs and no memory. The patch is
> > trying to make the insertion of the ACPI automatic. Keying it off a
> > region is not right for the purpose.
> 
> Why aren't the different NUMA nodes a machine type option?  If we start
> having each device mangle the machine in incompatible ways it seems
> like we're going to get conflicts not only with other devices, but also
> user specified NUMA configurations.

You may be right, I think this patch is trying to make things
automatic for user, but a dedicated machine type might make more
sense.

> I'm struggling with whether I can
> set some bits in the root port devcap2 register[1] based on device
> capabilities and here this is fundamentally manipulating the VM
> topology.
> 
> [1]https://lore.kernel.org/all/20230526231558.1660396-1-alex.williamson@redhat.com/

That does seem a bit dicey - alot of this stuff, especially ACPI, is
reasonably assumed to be set at boot time by an OS. Changing it
dynamically becomes exciting..

Thanks,
Jason
Alex Williamson June 13, 2023, 7:24 p.m. UTC | #11
On Tue, 13 Jun 2023 11:35:45 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jun 07, 2023 at 12:23:03PM -0600, Alex Williamson wrote:
> 
> > > The fixing is generic, a generic function does not elevate to create a
> > > vendor uAPI IMHO.  
> > 
> > Who else is trying to expose a non-power-of-2 region as a BAR right
> > now?   
> 
> I see a few needs in other places internally that are not public yet,
> especially in the SIOV world I alluded to below.
> 
> > We have neither a specification nor a complimentary implementation
> > from which to derive generic support.    
> 
> "specification"? It is literally - the size of the bar is not a power
> of two, so make the resulting hole when mapping that to a vPCI discard
> writes and read 0s.
> 
> This isn't a PCI specification, it is a contract between the kernel
> side and the VMM side within VFIO about how to handle VFIO devices
> where the kernel driver wants to have a padding.
> 
> It makes sense that we shouldn't just do this blidly for the existing
> indexes without some negotation, but that isn't a PCI "compliance"
> problem, that is a incomplete VFIO uAPI design in this patch.
> 
> > GPUs seem to manage to have non-power-of-2 size VRAM while still
> > providing BARs that are a power-of-2, ex. 6GB VRAM, 8GB BAR.    
> 
> Generally excess BAR is modeled in HW as discard writes return 0 (or
> maybe return -1). HW can do this easially. SW is more tricky
> 
> > Why shouldn't the variant driver here extend the BAR region to a
> > power-of-2 and then we can decide the error handling should accesses
> > exceed the implemented range?  
> 
> This sounds doable, I don't know if Ankit had a problem with using the
> sparse mmap feature to do this. We'd need to have the padding be
> non-mmapable space and then the kernel driver would do the discard/0
> with the read/write calls.
> 
> If this works out I'm happy enough if we go this way too. I suppose it
> allows better compatibility with all the VMMs.

I'd even forgotten about the sparse mmap solution here, that's even
better than trying to do something clever with the mmap.

Existing QEMU has an assert in pci_register_bar() that the size of the
MemoryRegion must be a power of 2, therefore it is ABI that the a vfio
PCI BAR region must be a power of 2.  It's not sufficient to change
QEMU, an old QEMU running on a kernel with this device would assert.
The BAR region must either be properly sized or we define a new region
definition which requires new QEMU support to mimic a BAR, but I don't
know why we wouldn't just use this sparse mmap trick.

> > > I would say if the thing that is showing up on the VM side is not PCI
> > > then maybe a vendor label might make sense.  
> > 
> > Well, how do you suppose a device with a non-power-of-2 BAR is PCI
> > compliant?  You're asking the VMM to assume what the driver meant by
> > providing that non-standard BAR, which sounds vendor specific to me.  
> 
> It shows up as a PCI compliant device in the VM, with a compliant
> power of two size.
> 
> > Is your primary complaint here that you don't want
> > that region to be labeled VFIO_PCI_NVGPU_BAR1?    
> 
> Yes. This modified VFIO uAPI contract is general enough it should not
> be forced as unique to this device.
> 
> > We could certainly define VFIO_PCI_VENDOR_BAR0..5 where QEMU knows
> > that it's supposed to relax expectations and mangle the region into
> > a compliant BAR, but now you're adding complexity that may never be
> > used elsewhere.  
> 
> I wouldn't use a _VENDOR_ name for this since it is generic.
> 
> I would suggest using a FEATURE:
> 
> /*
>  * When SET to 1 it indicates that the userspace understands non-power of two
>  * region sizes on VFIO_PCI_BARx_REGION_INDEX. If the kernel driver requests a
>  * non-power of two size then if userspace needs to round the size back up to a
>  * power of two, eg to create a vPCI device, it should return 0 for reads and
>  * discard writes for the padding that was added.
>  *
>  * If this flag is not set, and the VFIO driver cannot work without non-power of
>  * two BARs then mmap of those BAR indexes will fail. (FIXME: maybe
>  * this needs more thinking)
>  */
> #define VFIO_DEVICE_FEATURE_PCI_PAD_BARS 10
> 
> As adding new index types for every new functionality will become
> combinatoral, and adding discovery of which vPCI BAR index the new
> indexes should map too is also complicated..
> 
> Though sparse mmap is probably better.

Yes.

> > > It really has nothing to do with the regions, it is something that is
> > > needed if this variant driver is being used at all. The vPCI device
> > > will work without the ACPI, but the Linux drivers won't like it.  
> > 
> > OTOH if the ACPI work is based on device specific regions, the list of
> > device IDs in QEMU goes away and support for a new device requires no
> > VMM changes.  
> 
> Using the device ID seems like the better approach as we don't know
> what future devices using this varient driver are going to need for
> ACPI modeling.
> 
> It also seems I was too optimistic to think a simple variant driver ID
> would be sufficient. IMHO you are closer below, it depends on what
> bare metal FW qemu is trying to emulate, which I suppose is
> technically a combination of the machine type and the installed vPCI
> devices..
> 
> > > The ACPI is not related to the region. It is just creating many empty
> > > NUMA nodes. They should have no CPUs and no memory. The patch is
> > > trying to make the insertion of the ACPI automatic. Keying it off a
> > > region is not right for the purpose.  
> > 
> > Why aren't the different NUMA nodes a machine type option?  If we start
> > having each device mangle the machine in incompatible ways it seems
> > like we're going to get conflicts not only with other devices, but also
> > user specified NUMA configurations.  
> 
> You may be right, I think this patch is trying to make things
> automatic for user, but a dedicated machine type might make more
> sense.

Juan and I discussed this with Ankit last week, there are a lot of down
sides with another machine type, but the automatic manipulation of the
machine is still problematic.  Another option we have is to use QEMU
command line options for each feature.  For example we already support
NUMA VM configurations and loading command line ACPI tables, hopefully
also associating devices to nodes.  Do we end up with just a
configuration spec for the VM to satisfy the in-guest drivers?
Potentially guest driver requirements may changes over time, so a hard
coded recipe built-in to QEMU might not be the best solution anyway.

> > I'm struggling with whether I can
> > set some bits in the root port devcap2 register[1] based on device
> > capabilities and here this is fundamentally manipulating the VM
> > topology.
> > 
> > [1]https://lore.kernel.org/all/20230526231558.1660396-1-alex.williamson@redhat.com/  
> 
> That does seem a bit dicey - alot of this stuff, especially ACPI, is
> reasonably assumed to be set at boot time by an OS. Changing it
> dynamically becomes exciting..

I think we need to be careful about how and where we apply it.  Yes,
there could be a guest OS that caches the capabilities of the root
ports and could have stale information at the time the driver probes
Atomic Ops information.  But the spec doesn't prohibit RO defined bits
from changing that I can find... and I can't come up with a better
solution to support atomic ops.

The alternative would again be device switches on the QEMU command
line, which I think would have a dependency on disabling hotplug for
the slot (pcie-root-port,hotplug=off) and also need to validate that
all virtual downstream devices capable of those atomic ops have
equivalent host support.  I couldn't find enough downsides to the risk
that a guest booted with an Atomic Ops capable device that also caches
the capabilities of the root port (Linux doesn't) might hot-unplug that
device, hot-plug a new device w/o the same host support (ex. different
slot or different host after a migration), and report the stale atomic
ops support.

I think NVIDIA might have an interest in enabling Atomic Ops support in
VMs as well, so please comment in the series thread if there are
concerns here or if anyone can definitively says that another guest OS
we might care about does cache root port capability bits.  Thanks,

Alex
Jason Gunthorpe June 14, 2023, 5:55 p.m. UTC | #12
On Tue, Jun 13, 2023 at 01:24:02PM -0600, Alex Williamson wrote:

> I'd even forgotten about the sparse mmap solution here, that's even
> better than trying to do something clever with the mmap.

Okay, Ankit please try this, it sounds good

> > You may be right, I think this patch is trying to make things
> > automatic for user, but a dedicated machine type might make more
> > sense.
> 
> Juan and I discussed this with Ankit last week, there are a lot of down
> sides with another machine type, but the automatic manipulation of the
> machine is still problematic.  Another option we have is to use QEMU
> command line options for each feature.  For example we already support
> NUMA VM configurations and loading command line ACPI tables, hopefully
> also associating devices to nodes.  Do we end up with just a
> configuration spec for the VM to satisfy the in-guest drivers?
> Potentially guest driver requirements may changes over time, so a hard
> coded recipe built-in to QEMU might not be the best solution anyway.

Let's have those discussions settle then, I know there are a few
different ideas here people are looking at.

> I think NVIDIA might have an interest in enabling Atomic Ops support in
> VMs as well, so please comment in the series thread if there are
> concerns here or if anyone can definitively says that another guest OS
> we might care about does cache root port capability bits.  Thanks,

I expect we do - I haven't heard of atomic ops specifically yet
though.

We just did a big exercise on relaxed ordering which is similarly
troubled.

Here we deciced to just not use the VM's config space at all. The
device itself knows if it can do relaxed ordering and it just reports
this directly to the driver.

In many ways I would prefer to do the same for atomic.. I haven't
checked fully but I think we do this anyhow as you can see mlx5 simply
tries to enable PCI atomics but doesn't appear to do anything with the
result of it. I expect the actual success/fail is looped back through
the device interface itself.

So, for mlx5, it probably already works in most real cases. Passing a
PF might not work I guess.

It is not a satisfying answer from a VMM design perspective..

Some qemu command line to say what root ports with what atomic caps to
create seems like a reasonable thing to do.

Jason
Alex Williamson June 14, 2023, 7:20 p.m. UTC | #13
On Wed, 14 Jun 2023 14:55:28 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Jun 13, 2023 at 01:24:02PM -0600, Alex Williamson wrote:
> 
> > I'd even forgotten about the sparse mmap solution here, that's even
> > better than trying to do something clever with the mmap.  
> 
> Okay, Ankit please try this, it sounds good
> 
> > > You may be right, I think this patch is trying to make things
> > > automatic for user, but a dedicated machine type might make more
> > > sense.  
> > 
> > Juan and I discussed this with Ankit last week, there are a lot of down
> > sides with another machine type, but the automatic manipulation of the
> > machine is still problematic.  Another option we have is to use QEMU
> > command line options for each feature.  For example we already support
> > NUMA VM configurations and loading command line ACPI tables, hopefully
> > also associating devices to nodes.  Do we end up with just a
> > configuration spec for the VM to satisfy the in-guest drivers?
> > Potentially guest driver requirements may changes over time, so a hard
> > coded recipe built-in to QEMU might not be the best solution anyway.  
> 
> Let's have those discussions settle then, I know there are a few
> different ideas here people are looking at.
> 
> > I think NVIDIA might have an interest in enabling Atomic Ops support in
> > VMs as well, so please comment in the series thread if there are
> > concerns here or if anyone can definitively says that another guest OS
> > we might care about does cache root port capability bits.  Thanks,  
> 
> I expect we do - I haven't heard of atomic ops specifically yet
> though.
> 
> We just did a big exercise on relaxed ordering which is similarly
> troubled.
> 
> Here we deciced to just not use the VM's config space at all. The
> device itself knows if it can do relaxed ordering and it just reports
> this directly to the driver.
> 
> In many ways I would prefer to do the same for atomic.. I haven't
> checked fully but I think we do this anyhow as you can see mlx5 simply
> tries to enable PCI atomics but doesn't appear to do anything with the
> result of it. I expect the actual success/fail is looped back through
> the device interface itself.
> 
> So, for mlx5, it probably already works in most real cases. Passing a
> PF might not work I guess.
> 
> It is not a satisfying answer from a VMM design perspective..
> 
> Some qemu command line to say what root ports with what atomic caps to
> create seems like a reasonable thing to do.

The referenced QEMU proposal puts a number of restrictions on
automatically flipping bits on the root port, ex. as exposed in the VM
the endpoint must be directly connected to a root port (avoiding
complications around atomic ops routing support) and must be a
single function device at devfn 0x0 (avoiding heterogeneous paths on
the host).  It also tests the root port bits to make sure they aren't
otherwise set in order to be compatible with some future root port
device switch to enable fixed atomic completer support.

This tries to balance the idea that we want to support device switches
for these sort of fine grained VM configuration, but there are also
isolated cases which can be automatically enabled that can potentially
cover the vast majority of use cases.

OTOH, trying to do something automatic for 'AtomicOps Routing Support'
looks far more challenging and we probably would rely on command line
device switches for that.

Regarding relaxed ordering, are we talking about the 'No RO-enabled
PR-PR Passing' bit in the devcap2 register?  Unfortunately that bit is
labeled HwInit, so we don't have the same leniency towards modifying it
at runtime as we do for the immediately preceding AtomicOps completer
support bits.  In my interpretation, that bit also only seems to be
reporting whether a specific reordering is implemented, so more
important in determining expected performance than functionality(?)

In general, I think we put driver writers in an awkward place if they
start trying things that are clearly not supported as reported by
hardware capability bits.  Error handling can be pretty fragile,
especially when value-add firmware thinks it knows best.  Thanks,

Alex
Jason Gunthorpe June 14, 2023, 7:32 p.m. UTC | #14
On Wed, Jun 14, 2023 at 01:20:47PM -0600, Alex Williamson wrote:

> Regarding relaxed ordering, are we talking about the 'No RO-enabled
> PR-PR Passing' bit in the devcap2 register?  

I don't think so.. The RO problem is that the RO bit in the config
space is only defined for a SRIOV PF not a SRIOV VF. The VF wires 0 to
that config space.

So if you stick a VF into a VM as a vPCI then it appears to have a 0
RO bit in the config space. A VM driver that relies on it will think
it is not enabled.

I suppose atomic will have the same issue?

> In general, I think we put driver writers in an awkward place if they
> start trying things that are clearly not supported as reported by
> hardware capability bits.  

If you start down this path then IMHO it all has to work the same for
a VF and the VF has to inherent any relevant PF capability bits. This
also means the vPCI emulated bits don't actually work correctly which
is a different mess..

This is why the answer of "don't trust the config space" is appealing
for driver writers because the VMMs are not able to emulate bare metal
very well in these details.

Jason
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 250518fc70ff..51e364b8f00b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22150,6 +22150,12 @@  L:	kvm@vger.kernel.org
 S:	Maintained
 F:	drivers/vfio/platform/
 
+VFIO NVIDIA PCI DRIVER
+M:	Ankit Agrawal <ankita@nvidia.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	drivers/vfio/pci/nvgpu/
+
 VGA_SWITCHEROO
 R:	Lukas Wunner <lukas@wunner.de>
 S:	Maintained
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index f9d0c908e738..ade18b0ffb7b 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,4 +59,6 @@  source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
 
+source "drivers/vfio/pci/nvgpu/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5..0c93d452d0da 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -11,3 +11,5 @@  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
 
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
+
+obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu/
diff --git a/drivers/vfio/pci/hisilicon/Kconfig b/drivers/vfio/pci/hisilicon/Kconfig
index 5daa0f45d2f9..38e90e05d68a 100644
--- a/drivers/vfio/pci/hisilicon/Kconfig
+++ b/drivers/vfio/pci/hisilicon/Kconfig
@@ -2,12 +2,12 @@ 
 config HISI_ACC_VFIO_PCI
 	tristate "VFIO PCI support for HiSilicon ACC devices"
 	depends on ARM64 || (COMPILE_TEST && 64BIT)
-	depends on VFIO_PCI_CORE
 	depends on PCI_MSI
 	depends on CRYPTO_DEV_HISI_QM
 	depends on CRYPTO_DEV_HISI_HPRE
 	depends on CRYPTO_DEV_HISI_SEC2
 	depends on CRYPTO_DEV_HISI_ZIP
+	select VFIO_PCI_CORE
 	help
 	  This provides generic PCI support for HiSilicon ACC devices
 	  using the VFIO framework.
diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig
index 29ba9c504a75..7088edc4fb28 100644
--- a/drivers/vfio/pci/mlx5/Kconfig
+++ b/drivers/vfio/pci/mlx5/Kconfig
@@ -2,7 +2,7 @@ 
 config MLX5_VFIO_PCI
 	tristate "VFIO support for MLX5 PCI devices"
 	depends on MLX5_CORE
-	depends on VFIO_PCI_CORE
+	select VFIO_PCI_CORE
 	help
 	  This provides migration support for MLX5 devices using the VFIO
 	  framework.
diff --git a/drivers/vfio/pci/nvgpu/Kconfig b/drivers/vfio/pci/nvgpu/Kconfig
new file mode 100644
index 000000000000..066f764f7c5f
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/Kconfig
@@ -0,0 +1,10 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+config NVGPU_VFIO_PCI
+	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
+	depends on ARM64 || (COMPILE_TEST && 64BIT)
+	select VFIO_PCI_CORE
+	help
+	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
+	  required to assign the GPU device to a VM using KVM/qemu/etc.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvgpu/Makefile b/drivers/vfio/pci/nvgpu/Makefile
new file mode 100644
index 000000000000..00fd3a078218
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/Makefile
@@ -0,0 +1,3 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu-vfio-pci.o
+nvgpu-vfio-pci-y := main.o
diff --git a/drivers/vfio/pci/nvgpu/main.c b/drivers/vfio/pci/nvgpu/main.c
new file mode 100644
index 000000000000..9a838fe52a78
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/main.c
@@ -0,0 +1,257 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+#define DUMMY_PFN \
+	(((nvdev->mem_prop.hpa + nvdev->mem_prop.mem_length) >> PAGE_SHIFT) - 1)
+
+struct dev_mem_properties {
+	uint64_t hpa;
+	uint64_t mem_length;
+};
+
+struct nvgpu_vfio_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	struct dev_mem_properties mem_prop;
+};
+
+static vm_fault_t nvgpu_vfio_pci_fault(struct vm_fault *vmf)
+{
+	unsigned long mem_offset = vmf->pgoff - vmf->vma->vm_pgoff;
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		vmf->vma->vm_file->private_data,
+		struct nvgpu_vfio_pci_core_device, core_device.vdev);
+	int ret;
+
+	/*
+	 * We come here only if the access is to a memory region that is
+	 * beyond physical GPU memory. Map such request to a dummy PFN.
+	 */
+	ret = remap_pfn_range(vmf->vma,
+			vmf->vma->vm_start + (mem_offset << PAGE_SHIFT),
+			DUMMY_PFN, PAGE_SIZE,
+			vmf->vma->vm_page_prot);
+	if (ret)
+		return VM_FAULT_ERROR;
+
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct nvgpu_vfio_pci_mmap_ops = {
+	.fault = nvgpu_vfio_pci_fault,
+};
+
+static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	vfio_pci_core_finish_enable(vdev);
+
+	return 0;
+}
+
+static int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
+			struct vm_area_struct *vma)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
+
+	unsigned long start_pfn;
+	unsigned int index;
+	u64 req_len, pgoff;
+	int ret = 0;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	if (index != VFIO_PCI_BAR2_REGION_INDEX)
+		return vfio_pci_core_mmap(core_vdev, vma);
+
+	/*
+	 * Request to mmap the BAR. Map to the CPU accessible memory on the
+	 * GPU using the memory information gathered from the system ACPI
+	 * tables.
+	 */
+	start_pfn = nvdev->mem_prop.hpa >> PAGE_SHIFT;
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	if (pgoff >= (nvdev->mem_prop.mem_length >> PAGE_SHIFT))
+		return -EINVAL;
+
+	/*
+	 * Perform a PFN map to the memory. The device BAR is backed by the
+	 * GPU memory now. Check that the mapping does not overflow out of
+	 * the GPU memory size.
+	 *
+	 * The available GPU memory size may not be power-of-2 aligned. Given
+	 * that the memory is exposed as a BAR, the mapping request is of the
+	 * power-of-2 aligned size. Map only up to the size of the GPU memory.
+	 * If the memory access is beyond the actual GPU memory size, it will
+	 * be handled by the VMA ops fault through mapping to a dummy PFN.
+	 *
+	 * During device reset, the GPU is safely disconnected to the CPU
+	 * and access to the BAR will be immediately returned preventing
+	 * machine check.
+	 */
+	ret = remap_pfn_range(vma, vma->vm_start, start_pfn + pgoff,
+			      min(req_len, nvdev->mem_prop.mem_length - pgoff),
+			      vma->vm_page_prot);
+	if (ret)
+		return ret;
+
+	vma->vm_pgoff = start_pfn + pgoff;
+	vma->vm_ops = &nvgpu_vfio_pci_mmap_ops;
+
+	return 0;
+}
+
+static long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev,
+			unsigned int cmd, unsigned long arg)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
+
+	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+	struct vfio_region_info info;
+
+	if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_BAR2_REGION_INDEX) {
+			/*
+			 * Request to determine the BAR region information. Send the
+			 * GPU memory information.
+			 */
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = nvdev->mem_prop.mem_length;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE |
+				     VFIO_REGION_INFO_FLAG_MMAP;
+			return copy_to_user((void __user *)arg, &info, minsz) ?
+				       -EFAULT : 0;
+		}
+	}
+
+	return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+}
+
+static const struct vfio_device_ops nvgpu_vfio_pci_ops = {
+	.name = "nvgpu-vfio-pci",
+	.init = vfio_pci_core_init_dev,
+	.release = vfio_pci_core_release_dev,
+	.open_device = nvgpu_vfio_pci_open_device,
+	.close_device = vfio_pci_core_close_device,
+	.ioctl = nvgpu_vfio_pci_ioctl,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = nvgpu_vfio_pci_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.bind_iommufd = vfio_iommufd_physical_bind,
+	.unbind_iommufd = vfio_iommufd_physical_unbind,
+	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+};
+
+static struct nvgpu_vfio_pci_core_device *nvgpu_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct nvgpu_vfio_pci_core_device,
+			    core_device);
+}
+
+static int
+nvgpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
+				     struct nvgpu_vfio_pci_core_device *nvdev)
+{
+	int ret;
+
+	/*
+	 * The memory information is present in the system ACPI tables as DSD
+	 * properties nvidia,gpu-mem-base-pa and nvidia,gpu-mem-size.
+	 */
+	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-base-pa",
+				       &(nvdev->mem_prop.hpa));
+	if (ret)
+		return ret;
+
+	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-size",
+				       &(nvdev->mem_prop.mem_length));
+	return ret;
+}
+
+static int nvgpu_vfio_pci_probe(struct pci_dev *pdev,
+				const struct pci_device_id *id)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev;
+	int ret;
+
+	nvdev = vfio_alloc_device(nvgpu_vfio_pci_core_device, core_device.vdev,
+				  &pdev->dev, &nvgpu_vfio_pci_ops);
+	if (IS_ERR(nvdev))
+		return PTR_ERR(nvdev);
+
+	dev_set_drvdata(&pdev->dev, nvdev);
+
+	ret = nvgpu_vfio_pci_fetch_memory_property(pdev, nvdev);
+	if (ret)
+		goto out_put_vdev;
+
+	ret = vfio_pci_core_register_device(&nvdev->core_device);
+	if (ret)
+		goto out_put_vdev;
+
+	return ret;
+
+out_put_vdev:
+	vfio_put_device(&nvdev->core_device.vdev);
+	return ret;
+}
+
+static void nvgpu_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = nvgpu_drvdata(pdev);
+	struct vfio_pci_core_device *vdev = &nvdev->core_device;
+
+	vfio_pci_core_unregister_device(vdev);
+	vfio_put_device(&vdev->vdev);
+}
+
+static const struct pci_device_id nvgpu_vfio_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2343) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, nvgpu_vfio_pci_table);
+
+static struct pci_driver nvgpu_vfio_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = nvgpu_vfio_pci_table,
+	.probe = nvgpu_vfio_pci_probe,
+	.remove = nvgpu_vfio_pci_remove,
+	.err_handler = &vfio_pci_core_err_handlers,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(nvgpu_vfio_pci_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
+MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
+MODULE_DESCRIPTION(
+	"VFIO NVGPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");