Message ID | 20230509040734.24392-1-ankita@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2,1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper |
[Copying additional vfio-pci variant driver reviewers]

On Mon, 8 May 2023 21:07:34 -0700
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal propritary cache coherent interconnect.
>
> This representation has a number of limitations compared to a real PCI
> device, in particular, it does not model the coherent GPU memory
> aperture as a PCI config space BAR, and PCI doesn't know anything
> about cacheable memory types.
>
> Provide a VFIO PCI variant driver that adapts the unique PCI
> representation into a more standard PCI representation facing
> userspace. The GPU memory aperture is obtained from ACPI using
> device_property_read_u64(), according to the FW specification,
> and exported to userspace as the VFIO_REGION that covers the first
> PCI BAR. qemu will naturally generate a PCI device in the VM where the
> cacheable aperture is reported in BAR1.
>
> Since this memory region is actually cache coherent with the CPU, the
> VFIO variant driver will mmap it into VMA using a cacheable mapping. The
> mapping is done using remap_pfn_range().
>
> This goes along with a qemu series to provides the necessary
> implementation of the Grace Hopper Superchip firmware specification so
> that the guest operating system can see the correct ACPI modeling for
> the coherent GPU device.
> https://github.com/qemu/qemu/compare/master...ankita-nv:qemu:dev-ankit/cohmem-0330

This doesn't seem like a very stable link, a lore link to an RFC
posting would be preferred.

> This patch is split from a patch series being pursued separately:
> https://lore.kernel.org/lkml/20230405180134.16932-1-ankita@nvidia.com/
>
> Applied and tested over v6.3-rc7.
>
> v1 -> v2

Patch version changes can go below the '---' below to keep them out of
the committed change log.

> - Updated the wording of reference to BAR offset and replaced with
>   index.
> - The GPU memory is exposed at the fixed BAR2_REGION_INDEX.

Nit, the commit log above still refers to BAR1.

> - Code cleanup based on feedback comments.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  MAINTAINERS                     |   6 +
>  drivers/vfio/pci/Kconfig        |   2 +
>  drivers/vfio/pci/Makefile       |   2 +
>  drivers/vfio/pci/nvgpu/Kconfig  |  10 ++
>  drivers/vfio/pci/nvgpu/Makefile |   3 +
>  drivers/vfio/pci/nvgpu/main.c   | 217 ++++++++++++++++++++++++++++++++
>  6 files changed, 240 insertions(+)
>  create mode 100644 drivers/vfio/pci/nvgpu/Kconfig
>  create mode 100644 drivers/vfio/pci/nvgpu/Makefile
>  create mode 100644 drivers/vfio/pci/nvgpu/main.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 0e64787aace8..6b55861bbfbe 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -21949,6 +21949,12 @@ L:	kvm@vger.kernel.org
>  S:	Maintained
>  F:	drivers/vfio/pci/mlx5/
>
> +VFIO NVIDIA PCI DRIVER
> +M:	Ankit Agrawal <ankita@nvidia.com>
> +L:	kvm@vger.kernel.org
> +S:	Maintained
> +F:	drivers/vfio/pci/nvgpu/
> +
>  VGA_SWITCHEROO
>  R:	Lukas Wunner <lukas@wunner.de>
>  S:	Maintained
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index f9d0c908e738..ade18b0ffb7b 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"
>
>  source "drivers/vfio/pci/hisilicon/Kconfig"
>
> +source "drivers/vfio/pci/nvgpu/Kconfig"
> +
>  endif
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 24c524224da5..0c93d452d0da 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>  obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
>
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
> +
> +obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu/
> diff --git a/drivers/vfio/pci/nvgpu/Kconfig b/drivers/vfio/pci/nvgpu/Kconfig
> new file mode 100644
> index 000000000000..066f764f7c5f
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgpu/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config NVGPU_VFIO_PCI
> +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> +	depends on ARM64 || (COMPILE_TEST && 64BIT)
> +	select VFIO_PCI_CORE

I think this should be a 'depends on' as well, that's what we have for
the other vfio-pci variant drivers.

> +	help
> +	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> +	  required to assign the GPU device to a VM using KVM/qemu/etc.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/nvgpu/Makefile b/drivers/vfio/pci/nvgpu/Makefile
> new file mode 100644
> index 000000000000..00fd3a078218
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgpu/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu-vfio-pci.o
> +nvgpu-vfio-pci-y := main.o
> diff --git a/drivers/vfio/pci/nvgpu/main.c b/drivers/vfio/pci/nvgpu/main.c
> new file mode 100644
> index 000000000000..bb09dada9907
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgpu/main.c
> @@ -0,0 +1,217 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/vfio_pci_core.h>
> +
> +struct dev_mem_properties {
> +	uint64_t hpa;
> +	uint64_t mem_length;
> +};
> +
> +struct nvgpu_vfio_pci_core_device {
> +	struct vfio_pci_core_device core_device;
> +	struct dev_mem_properties mem_prop;
> +};
> +
> +static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +	int ret;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	vfio_pci_core_finish_enable(vdev);
> +
> +	return ret;
> +}
> +
> +static int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
> +			struct vm_area_struct *vma)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
> +		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
> +
> +	unsigned long start_pfn;
> +	unsigned int index;
> +	u64 req_len, pgoff;
> +	int ret = 0;
> +
> +	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +	if (index != VFIO_PCI_BAR2_REGION_INDEX)
> +		return vfio_pci_core_mmap(core_vdev, vma);
> +
> +	/*
> +	 * Request to mmap the BAR. Map to the CPU accessible memory on the
> +	 * GPU using the memory information gathered from the system ACPI
> +	 * tables.
> +	 */
> +	start_pfn = nvdev->mem_prop.hpa >> PAGE_SHIFT;
> +	req_len = vma->vm_end - vma->vm_start;
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	if (pgoff >= (nvdev->mem_prop.mem_length >> PAGE_SHIFT))
> +		return -EINVAL;
> +
> +	/*
> +	 * Perform a PFN map to the memory. The device BAR is backed by the
> +	 * GPU memory now. Check that the mapping does not overflow out of
> +	 * the GPU memory size.

vfio-pci-core returns -EINVAL if the mapping length exceeds the
resource length whereas this silently resizes the requested length.
IMO it should fail like vfio-pci-core does.

Is our test for vm_end < vm_start in vfio-pci-core just paranoia? I
don't see an equivalent here.

Can we also get a comment in the code outlining the various reasons
that this "BAR" doesn't need the disabled access protections that
vfio-pci-core implements? For example outlining the behavior relative
to BAR access while the memory enable bit is disabled, the bus being in
reset, or the device being in a low-power state. Thanks,

Alex

> +	 */
> +	ret = remap_pfn_range(vma, vma->vm_start, start_pfn + pgoff,
> +			      min(req_len, nvdev->mem_prop.mem_length - pgoff),
> +			      vma->vm_page_prot);
> +	if (ret)
> +		return ret;
> +
> +	vma->vm_pgoff = start_pfn + pgoff;
> +
> +	return 0;
> +}
> +
> +static long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev,
> +			unsigned int cmd, unsigned long arg)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
> +		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
> +
> +	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> +	struct vfio_region_info info;
> +
> +	if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (info.index == VFIO_PCI_BAR2_REGION_INDEX) {
> +			/*
> +			 * Request to determine the BAR region information. Send the
> +			 * GPU memory information.
> +			 */
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +			info.size = nvdev->mem_prop.mem_length;
> +			info.flags = VFIO_REGION_INFO_FLAG_READ |
> +				     VFIO_REGION_INFO_FLAG_WRITE |
> +				     VFIO_REGION_INFO_FLAG_MMAP;
> +			return copy_to_user((void __user *)arg, &info, minsz) ?
> +				       -EFAULT : 0;
> +		}
> +	}
> +
> +	return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> +}
> +
> +static const struct vfio_device_ops nvgpu_vfio_pci_ops = {
> +	.name = "nvgpu-vfio-pci",
> +	.init = vfio_pci_core_init_dev,
> +	.release = vfio_pci_core_release_dev,
> +	.open_device = nvgpu_vfio_pci_open_device,
> +	.close_device = vfio_pci_core_close_device,
> +	.ioctl = nvgpu_vfio_pci_ioctl,
> +	.read = vfio_pci_core_read,
> +	.write = vfio_pci_core_write,
> +	.mmap = nvgpu_vfio_pci_mmap,
> +	.request = vfio_pci_core_request,
> +	.match = vfio_pci_core_match,
> +	.bind_iommufd = vfio_iommufd_physical_bind,
> +	.unbind_iommufd = vfio_iommufd_physical_unbind,
> +	.attach_ioas = vfio_iommufd_physical_attach_ioas,
> +};
> +
> +static struct nvgpu_vfio_pci_core_device *nvgpu_drvdata(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +
> +	return container_of(core_device, struct nvgpu_vfio_pci_core_device,
> +			    core_device);
> +}
> +
> +static int
> +nvgpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
> +				     struct nvgpu_vfio_pci_core_device *nvdev)
> +{
> +	int ret;
> +
> +	/*
> +	 * The memory information is present in the system ACPI tables as DSD
> +	 * properties nvidia,gpu-mem-base-pa and nvidia,gpu-mem-size.
> +	 */
> +	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-base-pa",
> +				       &(nvdev->mem_prop.hpa));
> +	if (ret)
> +		return ret;
> +
> +	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-size",
> +				       &(nvdev->mem_prop.mem_length));
> +	return ret;
> +}
> +
> +static int nvgpu_vfio_pci_probe(struct pci_dev *pdev,
> +				const struct pci_device_id *id)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev;
> +	int ret;
> +
> +	nvdev = vfio_alloc_device(nvgpu_vfio_pci_core_device, core_device.vdev,
> +				  &pdev->dev, &nvgpu_vfio_pci_ops);
> +	if (IS_ERR(nvdev))
> +		return PTR_ERR(nvdev);
> +
> +	dev_set_drvdata(&pdev->dev, nvdev);
> +
> +	ret = nvgpu_vfio_pci_fetch_memory_property(pdev, nvdev);
> +	if (ret)
> +		goto out_put_vdev;
> +
> +	ret = vfio_pci_core_register_device(&nvdev->core_device);
> +	if (ret)
> +		goto out_put_vdev;
> +
> +	return ret;
> +
> +out_put_vdev:
> +	vfio_put_device(&nvdev->core_device.vdev);
> +	return ret;
> +}
> +
> +static void nvgpu_vfio_pci_remove(struct pci_dev *pdev)
> +{
> +	struct nvgpu_vfio_pci_core_device *nvdev = nvgpu_drvdata(pdev);
> +	struct vfio_pci_core_device *vdev = &nvdev->core_device;
> +
> +	vfio_pci_core_unregister_device(vdev);
> +	vfio_put_device(&vdev->vdev);
> +}
> +
> +static const struct pci_device_id nvgpu_vfio_pci_table[] = {
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2343) },
> +	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
> +	{}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, nvgpu_vfio_pci_table);
> +
> +static struct pci_driver nvgpu_vfio_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = nvgpu_vfio_pci_table,
> +	.probe = nvgpu_vfio_pci_probe,
> +	.remove = nvgpu_vfio_pci_remove,
> +	.err_handler = &vfio_pci_core_err_handlers,
> +	.driver_managed_dma = true,
> +};
> +
> +module_pci_driver(nvgpu_vfio_pci_driver);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
> +MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
> +MODULE_DESCRIPTION(
> +	"VFIO NVGPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
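The length check requested in the review above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the driver's code: `sketch_check_mmap_range()` and `SKETCH_PAGE_SHIFT` are hypothetical names, 4 KiB pages are assumed, and the policy shown (reject an oversized request instead of clamping it) is the one the review asks for. The sketch also keeps both sides of the comparison in pages, whereas the posted `min(req_len, mem_length - pgoff)` appears to subtract a page offset from a byte length.

```c
#include <errno.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. The kernel uses PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Hypothetical helper, not part of the patch.
 *
 * pgoff      - page offset of the mapping inside the region
 * req_len    - requested mapping length in bytes (page multiple)
 * mem_length - size of the device memory region in bytes
 *
 * Returns 0 when [pgoff, pgoff + req_len) fits inside the region and
 * -EINVAL otherwise, instead of silently shrinking the request.
 */
int sketch_check_mmap_range(uint64_t pgoff, uint64_t req_len,
			    uint64_t mem_length)
{
	uint64_t mem_pages = mem_length >> SKETCH_PAGE_SHIFT;
	uint64_t req_pages = req_len >> SKETCH_PAGE_SHIFT;

	/* The start of the mapping must lie inside the region. */
	if (pgoff >= mem_pages)
		return -EINVAL;

	/* Compare in pages; written this way the subtraction cannot wrap. */
	if (req_pages > mem_pages - pgoff)
		return -EINVAL;

	return 0;
}
```

With a 16 KiB region, a four-page mapping at page offset 0 is accepted, while the same request at page offset 1 is rejected with -EINVAL rather than being truncated to three pages.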
On Tue, May 16, 2023 at 03:09:14PM -0600, Alex Williamson wrote:

> > +# SPDX-License-Identifier: GPL-2.0-only
> > +config NVGPU_VFIO_PCI
> > +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> > +	depends on ARM64 || (COMPILE_TEST && 64BIT)
> > +	select VFIO_PCI_CORE
>
> I think this should be a 'depends on' as well, that's what we have for
> the other vfio-pci variant drivers.

It should be removed completely, AFAICT:

config VFIO_PCI
	tristate "Generic VFIO support for any PCI device"
	select VFIO_PCI_CORE

Ensures it is turned on

if VFIO_PCI
source "drivers/vfio/pci/mlx5/Kconfig"
endif

Automatically injects a 'depends on VFIO_PCI' to all the enclosed
kconfig statements (and puts them nicely in the menu)

So we have everything needed already

SELECT is the correct action since it doesn't have a config text.

> Is our test for vm_end < vm_start in vfio-pci-core just paranoia? I
> don't see an equivalent here.

Yes, mm core will not invoke the op with something incorrect.

> Can we also get a comment in the code outlining the various reasons
> that this "BAR" doesn't need the disabled access protections that
> vfio-pci-core implements? For example outlining the behavior relative
> to BAR access while the memory enable bit is disabled, the bus being in
> reset, or the device being in a low-power state.

The HW has some "isolation" feature that kicks in and safely
disconnects the GPU from the CPU.

A lot of work has been done to make things like VFIO and KVM safe
against machine checks/etc under basically all circumstances.

Jason
> From: ankita@nvidia.com <ankita@nvidia.com>
> Sent: Tuesday, May 9, 2023 12:08 PM
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal propritary cache coherent interconnect.
>
> This representation has a number of limitations compared to a real PCI
> device, in particular, it does not model the coherent GPU memory
> aperture as a PCI config space BAR, and PCI doesn't know anything
> about cacheable memory types.
>
> Provide a VFIO PCI variant driver that adapts the unique PCI
> representation into a more standard PCI representation facing
> userspace. The GPU memory aperture is obtained from ACPI using
> device_property_read_u64(), according to the FW specification,
> and exported to userspace as the VFIO_REGION that covers the first
> PCI BAR. qemu will naturally generate a PCI device in the VM where the
> cacheable aperture is reported in BAR1.

BAR2. It would also be more informative to describe how many BARs this
device already implements, so it is clear that BAR2 is selected because
it's free.

> +
> +static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +	int ret;
> +
> +	ret = vfio_pci_core_enable(vdev);
> +	if (ret)
> +		return ret;
> +
> +	vfio_pci_core_finish_enable(vdev);
> +
> +	return ret;
> +}

NIT. "return 0" as other variant drivers do.

> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
> +MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
> +MODULE_DESCRIPTION(
> +	"VFIO NVGPU PF - User Level driver for NVIDIA devices with CPU
> coherently accessible device memory");

What does 'PF' mean? Physical function?

Probably needs a more specific name for the coherent part...
nvgpu-vfio-pci sounds like it covers all NV GPUs.
On Tue, 16 May 2023 21:28:35 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 16, 2023 at 03:09:14PM -0600, Alex Williamson wrote:
>
> > > +# SPDX-License-Identifier: GPL-2.0-only
> > > +config NVGPU_VFIO_PCI
> > > +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> > > +	depends on ARM64 || (COMPILE_TEST && 64BIT)
> > > +	select VFIO_PCI_CORE
> >
> > I think this should be a 'depends on' as well, that's what we have for
> > the other vfio-pci variant drivers.
>
> It should be removed completely, AFAICT:
>
> config VFIO_PCI
> 	tristate "Generic VFIO support for any PCI device"
> 	select VFIO_PCI_CORE
>
> Ensures it is turned on
>
> if VFIO_PCI
> source "drivers/vfio/pci/mlx5/Kconfig"
> endif

The source command actually comes after the VFIO_PCI endif, the mlx5
Kconfig is sourced if PCI && MMU.

> Automatically injects a 'depends on VFIO_PCI' to all the enclosed
> kconfig statements (and puts them nicely in the menu)
>
> So we have everything needed already
>
> SELECT is the correct action since it doesn't have a config text.

In fact I think it's the current variant drivers that are incorrect to
make use of 'depends on', this makes those variant drivers implicitly
depend on VFIO_PCI, but it should instead be possible to build a kernel
that doesn't include vfio-pci but does include mlx5-vfio-pci, or other
vfio-pci variant drivers. Currently if I disable VFIO_PCI I no longer
have the option to select either the mlx5 or hisi_acc drivers, they
actually depend only on VFIO_PCI_CORE, but currently only VFIO_PCI can
select VFIO_PCI_CORE.

I withdraw my objection to using select, the other variant drivers
should adopt select as well, imo.

> > Is our test for vm_end < vm_start in vfio-pci-core just paranoia? I
> > don't see an equivalent here.
>
> Yes, mm core will not invoke the op with something incorrect.
>
> > Can we also get a comment in the code outlining the various reasons
> > that this "BAR" doesn't need the disabled access protections that
> > vfio-pci-core implements? For example outlining the behavior relative
> > to BAR access while the memory enable bit is disabled, the bus being in
> > reset, or the device being in a low-power state.
>
> The HW has some "isolation" feature that kicks in and safely
> disconnects the GPU from the CPU.
>
> A lot of work has been done to make things like VFIO and KVM safe
> against machine checks/etc under basically all circumstances.

So a comment in the code to reflect that the hardware takes this into
account such that we don't need to worry about mmap access during bus
reset or otherwise disabled MMIO access of the PCI device would not be
unreasonable. Thanks,

Alex
On 5/8/2023 9:07 PM, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal propritary cache coherent interconnect.
           ^proprietary

---Trilok Soni
On Thu, May 25, 2023 at 09:21:23AM -0600, Alex Williamson wrote:
> On Tue, 16 May 2023 21:28:35 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Tue, May 16, 2023 at 03:09:14PM -0600, Alex Williamson wrote:
> >
> > > > +# SPDX-License-Identifier: GPL-2.0-only
> > > > +config NVGPU_VFIO_PCI
> > > > +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> > > > +	depends on ARM64 || (COMPILE_TEST && 64BIT)
> > > > +	select VFIO_PCI_CORE
> > >
> > > I think this should be a 'depends on' as well, that's what we have for
> > > the other vfio-pci variant drivers.
> >
> > It should be removed completely, AFAICT:
> >
> > config VFIO_PCI
> > 	tristate "Generic VFIO support for any PCI device"
> > 	select VFIO_PCI_CORE
> >
> > Ensures it is turned on
> >
> > if VFIO_PCI
> > source "drivers/vfio/pci/mlx5/Kconfig"
> > endif
>
> The source command actually comes after the VFIO_PCI endif, the mlx5
> Kconfig is sourced if PCI && MMU.

Ah, I forgot we made the VFIO_PCI_CORE a hidden menu choice, so yeah,
it should be select everywhere and we can't use the IF trick.

> In fact I think it's the current variant drivers that are incorrect to
> make use of 'depends on', this makes those variant drivers implicitly
> depend on VFIO_PCI

Yes

Jason
diff --git a/MAINTAINERS b/MAINTAINERS
index 0e64787aace8..6b55861bbfbe 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21949,6 +21949,12 @@ L:	kvm@vger.kernel.org
 S:	Maintained
 F:	drivers/vfio/pci/mlx5/
 
+VFIO NVIDIA PCI DRIVER
+M:	Ankit Agrawal <ankita@nvidia.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	drivers/vfio/pci/nvgpu/
+
 VGA_SWITCHEROO
 R:	Lukas Wunner <lukas@wunner.de>
 S:	Maintained
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index f9d0c908e738..ade18b0ffb7b 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -59,4 +59,6 @@ source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
 
+source "drivers/vfio/pci/nvgpu/Kconfig"
+
 endif
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 24c524224da5..0c93d452d0da 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -11,3 +11,5 @@ obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
 
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
+
+obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu/
diff --git a/drivers/vfio/pci/nvgpu/Kconfig b/drivers/vfio/pci/nvgpu/Kconfig
new file mode 100644
index 000000000000..066f764f7c5f
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config NVGPU_VFIO_PCI
+	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
+	depends on ARM64 || (COMPILE_TEST && 64BIT)
+	select VFIO_PCI_CORE
+	help
+	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
+	  required to assign the GPU device to a VM using KVM/qemu/etc.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvgpu/Makefile b/drivers/vfio/pci/nvgpu/Makefile
new file mode 100644
index 000000000000..00fd3a078218
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVGPU_VFIO_PCI) += nvgpu-vfio-pci.o
+nvgpu-vfio-pci-y := main.o
diff --git a/drivers/vfio/pci/nvgpu/main.c b/drivers/vfio/pci/nvgpu/main.c
new file mode 100644
index 000000000000..bb09dada9907
--- /dev/null
+++ b/drivers/vfio/pci/nvgpu/main.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+struct dev_mem_properties {
+	uint64_t hpa;
+	uint64_t mem_length;
+};
+
+struct nvgpu_vfio_pci_core_device {
+	struct vfio_pci_core_device core_device;
+	struct dev_mem_properties mem_prop;
+};
+
+static int nvgpu_vfio_pci_open_device(struct vfio_device *core_vdev)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	ret = vfio_pci_core_enable(vdev);
+	if (ret)
+		return ret;
+
+	vfio_pci_core_finish_enable(vdev);
+
+	return ret;
+}
+
+static int nvgpu_vfio_pci_mmap(struct vfio_device *core_vdev,
+			struct vm_area_struct *vma)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
+
+	unsigned long start_pfn;
+	unsigned int index;
+	u64 req_len, pgoff;
+	int ret = 0;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	if (index != VFIO_PCI_BAR2_REGION_INDEX)
+		return vfio_pci_core_mmap(core_vdev, vma);
+
+	/*
+	 * Request to mmap the BAR. Map to the CPU accessible memory on the
+	 * GPU using the memory information gathered from the system ACPI
+	 * tables.
+	 */
+	start_pfn = nvdev->mem_prop.hpa >> PAGE_SHIFT;
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	if (pgoff >= (nvdev->mem_prop.mem_length >> PAGE_SHIFT))
+		return -EINVAL;
+
+	/*
+	 * Perform a PFN map to the memory. The device BAR is backed by the
+	 * GPU memory now. Check that the mapping does not overflow out of
+	 * the GPU memory size.
+	 */
+	ret = remap_pfn_range(vma, vma->vm_start, start_pfn + pgoff,
+			      min(req_len, nvdev->mem_prop.mem_length - pgoff),
+			      vma->vm_page_prot);
+	if (ret)
+		return ret;
+
+	vma->vm_pgoff = start_pfn + pgoff;
+
+	return 0;
+}
+
+static long nvgpu_vfio_pci_ioctl(struct vfio_device *core_vdev,
+			unsigned int cmd, unsigned long arg)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = container_of(
+		core_vdev, struct nvgpu_vfio_pci_core_device, core_device.vdev);
+
+	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+	struct vfio_region_info info;
+
+	if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_BAR2_REGION_INDEX) {
+			/*
+			 * Request to determine the BAR region information. Send the
+			 * GPU memory information.
+			 */
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = nvdev->mem_prop.mem_length;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE |
+				     VFIO_REGION_INFO_FLAG_MMAP;
+			return copy_to_user((void __user *)arg, &info, minsz) ?
+				       -EFAULT : 0;
+		}
+	}
+
+	return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+}
+
+static const struct vfio_device_ops nvgpu_vfio_pci_ops = {
+	.name = "nvgpu-vfio-pci",
+	.init = vfio_pci_core_init_dev,
+	.release = vfio_pci_core_release_dev,
+	.open_device = nvgpu_vfio_pci_open_device,
+	.close_device = vfio_pci_core_close_device,
+	.ioctl = nvgpu_vfio_pci_ioctl,
+	.read = vfio_pci_core_read,
+	.write = vfio_pci_core_write,
+	.mmap = nvgpu_vfio_pci_mmap,
+	.request = vfio_pci_core_request,
+	.match = vfio_pci_core_match,
+	.bind_iommufd = vfio_iommufd_physical_bind,
+	.unbind_iommufd = vfio_iommufd_physical_unbind,
+	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+};
+
+static struct nvgpu_vfio_pci_core_device *nvgpu_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct nvgpu_vfio_pci_core_device,
+			    core_device);
+}
+
+static int
+nvgpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
+				     struct nvgpu_vfio_pci_core_device *nvdev)
+{
+	int ret;
+
+	/*
+	 * The memory information is present in the system ACPI tables as DSD
+	 * properties nvidia,gpu-mem-base-pa and nvidia,gpu-mem-size.
+	 */
+	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-base-pa",
+				       &(nvdev->mem_prop.hpa));
+	if (ret)
+		return ret;
+
+	ret = device_property_read_u64(&(pdev->dev), "nvidia,gpu-mem-size",
+				       &(nvdev->mem_prop.mem_length));
+	return ret;
+}
+
+static int nvgpu_vfio_pci_probe(struct pci_dev *pdev,
+				const struct pci_device_id *id)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev;
+	int ret;
+
+	nvdev = vfio_alloc_device(nvgpu_vfio_pci_core_device, core_device.vdev,
+				  &pdev->dev, &nvgpu_vfio_pci_ops);
+	if (IS_ERR(nvdev))
+		return PTR_ERR(nvdev);
+
+	dev_set_drvdata(&pdev->dev, nvdev);
+
+	ret = nvgpu_vfio_pci_fetch_memory_property(pdev, nvdev);
+	if (ret)
+		goto out_put_vdev;
+
+	ret = vfio_pci_core_register_device(&nvdev->core_device);
+	if (ret)
+		goto out_put_vdev;
+
+	return ret;
+
+out_put_vdev:
+	vfio_put_device(&nvdev->core_device.vdev);
+	return ret;
+}
+
+static void nvgpu_vfio_pci_remove(struct pci_dev *pdev)
+{
+	struct nvgpu_vfio_pci_core_device *nvdev = nvgpu_drvdata(pdev);
+	struct vfio_pci_core_device *vdev = &nvdev->core_device;
+
+	vfio_pci_core_unregister_device(vdev);
+	vfio_put_device(&vdev->vdev);
+}
+
+static const struct pci_device_id nvgpu_vfio_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2343) },
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, nvgpu_vfio_pci_table);
+
+static struct pci_driver nvgpu_vfio_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = nvgpu_vfio_pci_table,
+	.probe = nvgpu_vfio_pci_probe,
+	.remove = nvgpu_vfio_pci_remove,
+	.err_handler = &vfio_pci_core_err_handlers,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(nvgpu_vfio_pci_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
+MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
+MODULE_DESCRIPTION(
+	"VFIO NVGPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
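
Both nvgpu_vfio_pci_mmap() and the VFIO_DEVICE_GET_REGION_INFO handler in the patch rely on the vfio-pci convention of packing the region index into the upper bits of the file offset. The sketch below reproduces that arithmetic in plain userspace C to make the decoding in the mmap path easier to follow. The `SKETCH_*` constants and `sketch_*` helpers are hypothetical names; the shift of 40 is assumed to mirror `VFIO_PCI_OFFSET_SHIFT`, the index 2 is assumed to mirror `VFIO_PCI_BAR2_REGION_INDEX`, and 4 KiB pages are assumed.

```c
#include <stdint.h>

/* Assumptions for this sketch, mirroring the kernel constants by value. */
#define SKETCH_OFFSET_SHIFT 40	/* assumed value of VFIO_PCI_OFFSET_SHIFT */
#define SKETCH_PAGE_SHIFT   12	/* 4 KiB pages */
#define SKETCH_BAR2_INDEX   2	/* assumed value of VFIO_PCI_BAR2_REGION_INDEX */

/* Region index -> byte offset reported to userspace in region info. */
uint64_t sketch_index_to_offset(unsigned int index)
{
	return (uint64_t)index << SKETCH_OFFSET_SHIFT;
}

/* Page offset seen by mmap (vm_pgoff) -> region index. */
unsigned int sketch_pgoff_to_index(uint64_t vm_pgoff)
{
	return (unsigned int)(vm_pgoff >>
			      (SKETCH_OFFSET_SHIFT - SKETCH_PAGE_SHIFT));
}

/* Page offset seen by mmap (vm_pgoff) -> page offset inside the region. */
uint64_t sketch_pgoff_in_region(uint64_t vm_pgoff)
{
	uint64_t mask =
		((uint64_t)1 << (SKETCH_OFFSET_SHIFT - SKETCH_PAGE_SHIFT)) - 1;

	return vm_pgoff & mask;
}
```

Userspace that mmaps at `sketch_index_to_offset(SKETCH_BAR2_INDEX)` plus three pages would arrive in the handler with a `vm_pgoff` that decodes to index 2 and an in-region page offset of 3, which is the pair the patch uses to choose between the GPU memory path and `vfio_pci_core_mmap()`.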