[RFCv1,0/7] vfio: Allow userspace to specify the address for each MSI vector

Message-ID: <cover.1731130093.git.nicolinc@nvidia.com>
From: Nicolin Chen <nicolinc@nvidia.com>
To: <maz@kernel.org>, <tglx@linutronix.de>, <bhelgaas@google.com>, <alex.williamson@redhat.com>
CC: <jgg@nvidia.com>, <leonro@nvidia.com>, <shameerali.kolothum.thodi@huawei.com>, <robin.murphy@arm.com>, <dlemoal@kernel.org>, <kevin.tian@intel.com>, <smostafa@google.com>, <andriy.shevchenko@linux.intel.com>, <reinette.chatre@intel.com>, <eric.auger@redhat.com>, <ddutile@redhat.com>, <yebin10@huawei.com>, <brauner@kernel.org>, <apatel@ventanamicro.com>, <shivamurthy.shastri@linutronix.de>, <anna-maria@linutronix.de>, <nipun.gupta@amd.com>, <marek.vasut+renesas@mailbox.org>, <linux-arm-kernel@lists.infradead.org>, <linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>, <kvm@vger.kernel.org>
Subject: [PATCH RFCv1 0/7] vfio: Allow userspace to specify the address for each MSI vector
Date: Fri, 8 Nov 2024 21:48:45 -0800

Nicolin Chen Nov. 9, 2024, 5:48 a.m. UTC

On ARM GIC systems and others, the target address of the MSI is translated
by the IOMMU. For GIC, the MSI address page is called the "ITS" page. When
the IOMMU is disabled, the MSI address is programmed to the physical
location of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled,
the ITS page is behind the IOMMU, so the MSI address is programmed to an
allocated IO virtual address (a.k.a. IOVA), e.g. 0xFFFF0000, which must be
mapped to the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
When a 2-stage translation is enabled, an IOVA is still used to program
the MSI address, though the mapping is done in two stages:
  IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> PA (0x20200000)
(IPA stands for Intermediate Physical Address).
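
To make the address chain above concrete, here is a minimal sketch of a
two-stage lookup using the example addresses from this letter. It is not
kernel code: struct page_map, translate() and translate_2stage() are
hypothetical names for this illustration, and a flat 4K-page table stands
in for real IOMMU page tables.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_MASK 0xFFFULL	/* 4K pages, assumed for illustration */

/* One page-granular mapping: input page base -> output page base. */
struct page_map {
	uint64_t in;
	uint64_t out;
};

/* Look up addr in a flat table; 0 on success, -1 on a translation fault. */
static int translate(const struct page_map *tbl, size_t n,
		     uint64_t addr, uint64_t *out)
{
	for (size_t i = 0; i < n; i++) {
		if ((addr & ~SKETCH_PAGE_MASK) == tbl[i].in) {
			/* Keep the offset within the page. */
			*out = tbl[i].out | (addr & SKETCH_PAGE_MASK);
			return 0;
		}
	}
	return -1;
}

/* Stage 1 (IOVA -> IPA) followed by stage 2 (IPA -> PA). */
static int translate_2stage(const struct page_map *s1, size_t n1,
			    const struct page_map *s2, size_t n2,
			    uint64_t iova, uint64_t *pa)
{
	uint64_t ipa;

	if (translate(s1, n1, iova, &ipa))
		return -1;
	return translate(s2, n2, ipa, pa);
}
```

With a stage-1 entry 0xFFFF0000 -> 0x80900000 and a stage-2 entry
0x80900000 -> 0x20200000, an MSI write to the IOVA lands on the physical
ITS page. If either stage lacks its entry, the lookup faults.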

If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
IOVA is dynamically allocated from the top of the IOVA space. If attached
to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.

So far, this IOMMU_RESV_SW_MSI works well as long as the kernel is entirely
in charge of the IOMMU translation (1-stage translation), since the IOVA
for the ITS page is fixed and known by the kernel. However, when a virtual
machine enables a nested IOMMU translation (2-stage), the guest kernel
directly controls the stage-1 translation with an IOMMU_DOMAIN_DMA, mapping
a vITS page (at an IPA 0x80900000) onto its own IOVA space (e.g.
0xEEEE0000). The host kernel then has no way to know the guest-level IOVA
that it must program as the MSI address.

To solve this problem, the VMM should capture the MSI IOVA allocated by the
guest kernel and relay it to the GIC driver in the host kernel, so that the
correct MSI IOVA gets programmed. This requires a new ioctl via VFIO.

Extend the VFIO path to allow an MSI target IOVA to be forwarded into the
kernel and pushed down to the GIC driver.

Add VFIO ioctl VFIO_IRQ_SET_ACTION_PREPARE with VFIO_IRQ_SET_DATA_MSI_IOVA
to carry the data.
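
As a rough illustration of what such a request could look like from the
VMM side, the sketch below packs one 64-bit IOVA per vector into the
variable-length buffer used by VFIO_DEVICE_SET_IRQS. Several things here
are assumptions rather than the interface defined by this series: the flag
values are placeholders, the one-u64-per-vector payload is a guess, and
struct irq_set_hdr is a local mirror of the uapi struct vfio_irq_set
header so that the example builds without kernel headers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of the struct vfio_irq_set header layout. */
struct irq_set_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint32_t index;
	uint32_t start;
	uint32_t count;
	uint8_t  data[];
};

/* HYPOTHETICAL values; the real ones are whatever patch 7 defines. */
#define SKETCH_DATA_MSI_IOVA	(1u << 6)
#define SKETCH_ACTION_PREPARE	(1u << 7)

/*
 * Build a request carrying one IOVA per vector in [start, start + count).
 * Returns a heap buffer the caller must free(), or NULL on allocation
 * failure.
 */
static struct irq_set_hdr *build_msi_iova_request(uint32_t index,
						  uint32_t start,
						  const uint64_t *iovas,
						  uint32_t count)
{
	size_t payload = (size_t)count * sizeof(uint64_t);
	struct irq_set_hdr *req = calloc(1, sizeof(*req) + payload);

	if (!req)
		return NULL;

	req->argsz = (uint32_t)(sizeof(*req) + payload);
	req->flags = SKETCH_DATA_MSI_IOVA | SKETCH_ACTION_PREPARE;
	req->index = index;
	req->start = start;
	req->count = count;
	/* data[] is only 4-byte aligned, so copy rather than cast. */
	memcpy(req->data, iovas, payload);
	return req;
}
```

A VMM would then pass the buffer to ioctl(device_fd, VFIO_DEVICE_SET_IRQS,
req) before enabling the vectors, assuming the kernel side accepts this
layout.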

The call chain from VFIO down to the ITS driver is quite long. So, in order
to carry the MSI IOVA from the top to its_irq_domain_alloc(), add patches
in a leaf-to-root order:

  vfio_pci_core_ioctl:
    vfio_pci_set_irqs_ioctl:
      vfio_pci_set_msi_prepare:                           // PATCH-7
        pci_alloc_irq_vectors_iovas:                      // PATCH-6
          __pci_alloc_irq_vectors:                        // PATCH-5
            __pci_enable_msi/msix_range:                  // PATCH-4
              msi/msix_capability_init:                   // PATCH-3
                msi/msix_setup_msi_descs:
                  msi_insert_msi_desc();                  // PATCH-1
                pci_msi_setup_msi_irqs:
                  msi_domain_alloc_irqs_all_locked:
                    __msi_domain_alloc_locked:
                      __msi_domain_alloc_irqs:
                        __irq_domain_alloc_irqs:
                          irq_domain_alloc_irqs_locked:
                            irq_domain_alloc_irqs_hierarchy:
                              msi_domain_alloc:
                                irq_domain_alloc_irqs_parent:
                                  its_irq_domain_alloc(); // PATCH-2

Note that this series solves only half of the problem: it lets the kernel
program the physical PCI MSI/MSI-X on the device with the correct head IOVA
of a 2-stage translation. The guest kernel does the stage-1 mapping from
that MSI IOVA (0xEEEE0000) to its own vITS page (0x80900000), but the
stage-2 mapping from that IPA to the physical ITS page is still missing:
  0xEEEE0000 ===> 0x80900000 =x=> 0x20200000
A follow-up series should fill that gap by adding the stage-2 mapping from
the vITS page (0x80900000) to the physical ITS page (0x20200000), likely
via a new IOMMUFD ioctl. Once the VMM sets up this stage-2 mapping, the VM
will act the same as bare metal, relying on the running kernel to handle
the stage-1 mapping:
  0xEEEE0000 ===> 0x80900000 ===> 0x20200000

This series (prototype) is on GitHub:
https://github.com/nicolinc/iommufd/commits/vfio_msi_giova-rfcv1/
It was tested by hacking the host kernel to hard-code a stage-2 mapping.

Thanks!
Nicolin

Nicolin Chen (7):
  genirq/msi: Allow preset IOVA in struct msi_desc for MSI doorbell
    address
  irqchip/gic-v3-its: Bypass iommu_cookie if desc->msi_iova is preset
  PCI/MSI: Pass in msi_iova to msi_domain_insert_msi_desc
  PCI/MSI: Allow __pci_enable_msi_range to pass in iova
  PCI/MSI: Extract a common __pci_alloc_irq_vectors function
  PCI/MSI: Add pci_alloc_irq_vectors_iovas helper
  vfio/pci: Allow preset MSI IOVAs via VFIO_IRQ_SET_ACTION_PREPARE

 drivers/pci/msi/msi.h             |   3 +-
 include/linux/msi.h               |  11 +++
 include/linux/pci.h               |  18 ++++
 include/linux/vfio_pci_core.h     |   1 +
 include/uapi/linux/vfio.h         |   8 +-
 drivers/irqchip/irq-gic-v3-its.c  |  21 ++++-
 drivers/pci/msi/api.c             | 136 ++++++++++++++++++++----------
 drivers/pci/msi/msi.c             |  20 +++--
 drivers/vfio/pci/vfio_pci_intrs.c |  41 ++++++++-
 drivers/vfio/vfio_main.c          |   3 +
 kernel/irq/msi.c                  |   6 ++
 11 files changed, 212 insertions(+), 56 deletions(-)

Robin Murphy Nov. 11, 2024, 1:09 p.m. UTC | #1

On 2024-11-09 5:48 am, Nicolin Chen wrote:
> On ARM GIC systems and others, the target address of the MSI is translated
> by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the
> IOMMU is disabled, the MSI address is programmed to the physical location
> of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
> page is behind the IOMMU, so the MSI address is programmed to an allocated
> IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to
> the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
> When a 2-stage translation is enabled, IOVA will be still used to program
> the MSI address, though the mappings will be in two stages:
>    IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> 0x20200000
> (IPA stands for Intermediate Physical Address).
> 
> If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
> IOVA is dynamically allocated from the top of the IOVA space. If attached
> to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
> fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
> which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
> 
> So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in charge
> of the IOMMU translation (1-stage translation), since the IOVA for the ITS
> page is fixed and known by kernel. However, with virtual machine enabling
> a nested IOMMU translation (2-stage), a guest kernel directly controls the
> stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an
> IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000). Then, the host
> kernel can't know that guest-level IOVA to program the MSI address.
> 
> To solve this problem the VMM should capture the MSI IOVA allocated by the
> guest kernel and relay it to the GIC driver in the host kernel, to program
> the correct MSI IOVA. And this requires a new ioctl via VFIO.

Once VFIO has that information from userspace, though, do we really need 
the whole complicated dance to push it right down into the irqchip layer 
just so it can be passed back up again? AFAICS 
vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already explicitly 
rewrites MSI-X vectors, so it seems like it should be pretty 
straightforward to override the message address in general at that 
level, without the lower layers having to be aware at all, no?

Thanks,
Robin.


Marc Zyngier Nov. 11, 2024, 2:14 p.m. UTC | #2

On Mon, 11 Nov 2024 13:09:20 +0000,
Robin Murphy <robin.murphy@arm.com> wrote:
> 
> On 2024-11-09 5:48 am, Nicolin Chen wrote:
> > On ARM GIC systems and others, the target address of the MSI is translated
> > by the IOMMU. For GIC, the MSI address page is called "ITS" page. When the
> > IOMMU is disabled, the MSI address is programmed to the physical location
> > of the GIC ITS page (e.g. 0x20200000). When the IOMMU is enabled, the ITS
> > page is behind the IOMMU, so the MSI address is programmed to an allocated
> > IO virtual address (a.k.a IOVA), e.g. 0xFFFF0000, which must be mapped to
> > the physical ITS page: IOVA (0xFFFF0000) ===> PA (0x20200000).
> > When a 2-stage translation is enabled, IOVA will be still used to program
> > the MSI address, though the mappings will be in two stages:
> >    IOVA (0xFFFF0000) ===> IPA (e.g. 0x80900000) ===> 0x20200000
> > (IPA stands for Intermediate Physical Address).
> > 
> > If the device that generates MSI is attached to an IOMMU_DOMAIN_DMA, the
> > IOVA is dynamically allocated from the top of the IOVA space. If attached
> > to an IOMMU_DOMAIN_UNMANAGED (e.g. a VFIO passthrough device), the IOVA is
> > fixed to an MSI window reported by the IOMMU driver via IOMMU_RESV_SW_MSI,
> > which is hardwired to MSI_IOVA_BASE (IOVA==0x8000000) for ARM IOMMUs.
> > 
> > So far, this IOMMU_RESV_SW_MSI works well as kernel is entirely in charge
> > of the IOMMU translation (1-stage translation), since the IOVA for the ITS
> > page is fixed and known by kernel. However, with virtual machine enabling
> > a nested IOMMU translation (2-stage), a guest kernel directly controls the
> > stage-1 translation with an IOMMU_DOMAIN_DMA, mapping a vITS page (at an
> > IPA 0x80900000) onto its own IOVA space (e.g. 0xEEEE0000). Then, the host
> > kernel can't know that guest-level IOVA to program the MSI address.
> > 
> > To solve this problem the VMM should capture the MSI IOVA allocated by the
> > guest kernel and relay it to the GIC driver in the host kernel, to program
> > the correct MSI IOVA. And this requires a new ioctl via VFIO.
> 
> Once VFIO has that information from userspace, though, do we really
> need the whole complicated dance to push it right down into the
> irqchip layer just so it can be passed back up again? AFAICS
> vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already
> explicitly rewrites MSI-X vectors, so it seems like it should be
> pretty straightforward to override the message address in general at
> that level, without the lower layers having to be aware at all, no?

+1.

I would like to avoid polluting each and every interrupt controller
with usage-specific knowledge (they usually are brain-damaged enough).
We already have an indirection into the IOMMU subsystem and it
shouldn't be a big deal to intercept the message for all
implementations at this level.

I also wonder how to handle the case of braindead^Wwonderful platforms
where ITS transactions are not translated by the SMMU. Somehow, VFIO
should be made aware of this situation.

Thanks,

	M.

Nicolin Chen Nov. 12, 2024, 9:54 p.m. UTC | #3

On Mon, Nov 11, 2024 at 01:09:20PM +0000, Robin Murphy wrote:
> On 2024-11-09 5:48 am, Nicolin Chen wrote:
> > To solve this problem the VMM should capture the MSI IOVA allocated by the
> > guest kernel and relay it to the GIC driver in the host kernel, to program
> > the correct MSI IOVA. And this requires a new ioctl via VFIO.
> 
> Once VFIO has that information from userspace, though, do we really need
> the whole complicated dance to push it right down into the irqchip layer
> just so it can be passed back up again? AFAICS
> vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already explicitly
> rewrites MSI-X vectors, so it seems like it should be pretty
> straightforward to override the message address in general at that
> level, without the lower layers having to be aware at all, no?

Didn't see that clearly!! It works with the following simple override:
--------------------------------------------------------------------
@@ -497,6 +497,10 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
                struct msi_msg msg;

                get_cached_msi_msg(irq, &msg);
+               if (vdev->msi_iovas) {
+                       msg.address_lo = lower_32_bits(vdev->msi_iovas[vector]);
+                       msg.address_hi = upper_32_bits(vdev->msi_iovas[vector]);
+               }
                pci_write_msi_msg(irq, &msg);
        }
 
--------------------------------------------------------------------

With that, I think we only need one VFIO change for this part :)

Thanks!
Nicolin
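
For readers without the kernel tree at hand, the override in the diff above
amounts to splitting a 64-bit IOVA into the two 32-bit halves of the MSI
message address. A standalone sketch follows. struct msg_sketch is a
simplified stand-in for the kernel's struct msi_msg, the function name is
hypothetical, and the shift and mask play the role of upper_32_bits() and
lower_32_bits().

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct msi_msg. */
struct msg_sketch {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/*
 * Replace the message address with the per-vector IOVA, if a table of
 * IOVAs was supplied. The data payload is left untouched.
 */
static void sketch_override_msi_address(struct msg_sketch *msg,
					const uint64_t *msi_iovas,
					unsigned int vector)
{
	if (!msi_iovas)
		return;

	msg->address_lo = (uint32_t)(msi_iovas[vector] & 0xFFFFFFFFu);
	msg->address_hi = (uint32_t)(msi_iovas[vector] >> 32);
}
```

Only the address changes: the data word composed by the irqchip still
selects the interrupt, which is why the override can sit above the IRQ
layer.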

Nicolin Chen Nov. 12, 2024, 10:13 p.m. UTC | #4

On Mon, Nov 11, 2024 at 02:14:15PM +0000, Marc Zyngier wrote:
> On Mon, 11 Nov 2024 13:09:20 +0000,
> Robin Murphy <robin.murphy@arm.com> wrote:
> > On 2024-11-09 5:48 am, Nicolin Chen wrote:
> > > To solve this problem the VMM should capture the MSI IOVA allocated by the
> > > guest kernel and relay it to the GIC driver in the host kernel, to program
> > > the correct MSI IOVA. And this requires a new ioctl via VFIO.
> >
> > Once VFIO has that information from userspace, though, do we really
> > need the whole complicated dance to push it right down into the
> > irqchip layer just so it can be passed back up again? AFAICS
> > vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already
> > explicitly rewrites MSI-X vectors, so it seems like it should be
> > pretty straightforward to override the message address in general at
> > that level, without the lower layers having to be aware at all, no?
> 
> +1.
> 
> I would like to avoid polluting each and every interrupt controller
> with usage-specific knowledge (they usually are brain-damaged enough).
> We already have an indirection into the IOMMU subsystem and it
> shouldn't be a big deal to intercept the message for all
> implementations at this level.
> 
> I also wonder how to handle the case of braindead^Wwonderful platforms
> where ITS transactions are not translated by the SMMU. Somehow, VFIO
> should be made aware of this situation.

Perhaps we should do iommu_get_domain_for_dev(&vdev->pdev->dev) and
check the returned domain->type:
 * if (domain->type & __IOMMU_DOMAIN_PAGING): 1-stage translation
 * if (domain->type == IOMMU_DOMAIN_NESTED): 2-stage translation

And for this particular topic/series, we should do something like:
	if (vdev->msi_iovas && domain->type == IOMMU_DOMAIN_NESTED) {
		msg.address_lo = lower_32_bits(vdev->msi_iovas[vector]);
		msg.address_hi = upper_32_bits(vdev->msi_iovas[vector]);
	}
?

Thanks
Nicolin
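
The domain-type check proposed above can be sketched outside the kernel as
follows. The constants are stand-ins chosen for this example and are not
claimed to match the kernel's iommu.h values. The helper names are
hypothetical as well.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in values for this sketch only. */
#define SKETCH_DOMAIN_PAGING_BIT	(1u << 0)
#define SKETCH_DOMAIN_NESTED		(1u << 6)

enum sketch_stages {
	SKETCH_STAGES_UNKNOWN = 0,
	SKETCH_STAGES_ONE = 1,
	SKETCH_STAGES_TWO = 2,
};

/* Classify a domain type the way the two bullet points above do. */
static enum sketch_stages sketch_translation_stages(uint32_t domain_type)
{
	if (domain_type == SKETCH_DOMAIN_NESTED)
		return SKETCH_STAGES_TWO;
	if (domain_type & SKETCH_DOMAIN_PAGING_BIT)
		return SKETCH_STAGES_ONE;
	return SKETCH_STAGES_UNKNOWN;
}

/* Use the guest-provided IOVA only when userspace set one and the
 * device sits behind a nested (2-stage) domain. */
static bool sketch_use_guest_iova(const uint64_t *msi_iovas,
				  uint32_t domain_type)
{
	return msi_iovas != NULL &&
	       sketch_translation_stages(domain_type) == SKETCH_STAGES_TWO;
}
```

A device attached to a plain paging domain keeps the address composed by
the host kernel, even if userspace supplied IOVAs.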

Jason Gunthorpe Nov. 13, 2024, 1:34 a.m. UTC | #5

On Tue, Nov 12, 2024 at 01:54:58PM -0800, Nicolin Chen wrote:
> On Mon, Nov 11, 2024 at 01:09:20PM +0000, Robin Murphy wrote:
> > On 2024-11-09 5:48 am, Nicolin Chen wrote:
> > > To solve this problem the VMM should capture the MSI IOVA allocated by the
> > > guest kernel and relay it to the GIC driver in the host kernel, to program
> > > the correct MSI IOVA. And this requires a new ioctl via VFIO.
> > 
> > Once VFIO has that information from userspace, though, do we really need
> > the whole complicated dance to push it right down into the irqchip layer
> > just so it can be passed back up again? AFAICS
> > vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already explicitly
> > rewrites MSI-X vectors, so it seems like it should be pretty
> > straightforward to override the message address in general at that
> > level, without the lower layers having to be aware at all, no?
> 
> Didn't see that clearly!! It works with a simple following override:
> --------------------------------------------------------------------
> @@ -497,6 +497,10 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>                 struct msi_msg msg;
> 
>                 get_cached_msi_msg(irq, &msg);
> +               if (vdev->msi_iovas) {
> +                       msg.address_lo = lower_32_bits(vdev->msi_iovas[vector]);
> +                       msg.address_hi = upper_32_bits(vdev->msi_iovas[vector]);
> +               }
>                 pci_write_msi_msg(irq, &msg);
>         }
>  
> --------------------------------------------------------------------
> 
> With that, I think we only need one VFIO change for this part :)

Wow, is that really OK from a layering perspective? The comment is
pretty clear on the intention that this is to resync the irq layer
view of the device with the physical HW.

Editing the msi_msg while doing that resync smells bad.

Also, this is only doing MSI-X, we should include normal MSI as
well. (it probably should have a resync too?)

I'd want Thomas/Marc/Alex to agree.. (please read the cover letter for
context)

I think there are many options here we just need to get a clearer
understanding what best fits the architecture of the interrupt
subsystem.

Jason

Alex Williamson Nov. 13, 2024, 9:11 p.m. UTC | #6

On Tue, 12 Nov 2024 21:34:30 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Nov 12, 2024 at 01:54:58PM -0800, Nicolin Chen wrote:
> > On Mon, Nov 11, 2024 at 01:09:20PM +0000, Robin Murphy wrote:  
> > > On 2024-11-09 5:48 am, Nicolin Chen wrote:  
> > > > To solve this problem the VMM should capture the MSI IOVA allocated by the
> > > > guest kernel and relay it to the GIC driver in the host kernel, to program
> > > > the correct MSI IOVA. And this requires a new ioctl via VFIO.  
> > > 
> > > Once VFIO has that information from userspace, though, do we really need
> > > the whole complicated dance to push it right down into the irqchip layer
> > > just so it can be passed back up again? AFAICS
> > > vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already explicitly
> > > rewrites MSI-X vectors, so it seems like it should be pretty
> > > straightforward to override the message address in general at that
> > > level, without the lower layers having to be aware at all, no?  
> > 
> > Didn't see that clearly!! It works with a simple following override:
> > --------------------------------------------------------------------
> > @@ -497,6 +497,10 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
> >                 struct msi_msg msg;
> > 
> >                 get_cached_msi_msg(irq, &msg);
> > +               if (vdev->msi_iovas) {
> > +                       msg.address_lo = lower_32_bits(vdev->msi_iovas[vector]);
> > +                       msg.address_hi = upper_32_bits(vdev->msi_iovas[vector]);
> > +               }
> >                 pci_write_msi_msg(irq, &msg);
> >         }
> >  
> > --------------------------------------------------------------------
> > 
> > With that, I think we only need one VFIO change for this part :)  
> 
> Wow, is that really OK from a layering perspective? The comment is
> pretty clear on the intention that this is to resync the irq layer
> view of the device with the physical HW.
> 
> Editing the msi_msg while doing that resync smells bad.
> 
> Also, this is only doing MSI-X, we should include normal MSI as
> well. (it probably should have a resync too?)

This was added for a specific IBM HBA that clears the vector table
during a built-in self test, so it's possible the MSI table being in
config space never had the same issue, or we just haven't encountered
it.  I don't expect anything else actually requires this.

> I'd want Thomas/Marc/Alex to agree.. (please read the cover letter for
> context)

It seems suspect to me too.  In a sense it is still just synchronizing
the MSI address, but to a different address space.

Is it possible to do this with the existing write_msi_msg callback on
the msi descriptor?  For instance we could simply translate the msg
address and call pci_write_msi_msg() (while avoiding an infinite
recursion).  Or maybe there should be an xlate_msi_msg callback we can
register.  Or I suppose there might be a way to insert an irqchip that
does the translation on write.  Thanks,

Alex
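
One way to picture the "xlate_msi_msg callback" idea from the message above
is a write path that translates a private copy of the message before it
reaches the device. The sketch below is purely illustrative: no such
callback exists in the kernel as far as this thread shows, and every name
in it is hypothetical. Calling the raw writer directly from the wrapper is
what avoids the recursion mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

struct msg_sketch {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

struct desc_sketch {
	/* Optional hook; NULL means "write the message unmodified". */
	void (*xlate_msi_msg)(struct desc_sketch *desc, struct msg_sketch *msg);
	uint64_t iova;		/* address the hook substitutes */
	struct msg_sketch hw;	/* stands in for the device's MSI registers */
};

/* Lowest-level write: never calls back into the hook. */
static void sketch_raw_write(struct desc_sketch *desc,
			     const struct msg_sketch *msg)
{
	desc->hw = *msg;
}

/* Write path: translate a copy, so the caller's message stays intact. */
static void sketch_write_msg(struct desc_sketch *desc,
			     const struct msg_sketch *msg)
{
	struct msg_sketch tmp = *msg;

	if (desc->xlate_msi_msg)
		desc->xlate_msi_msg(desc, &tmp);
	sketch_raw_write(desc, &tmp);
}

/* Example hook: swap in the IOVA recorded in the descriptor. */
static void sketch_xlate_to_iova(struct desc_sketch *desc,
				 struct msg_sketch *msg)
{
	msg->address_lo = (uint32_t)(desc->iova & 0xFFFFFFFFu);
	msg->address_hi = (uint32_t)(desc->iova >> 32);
}
```

Because the hook edits only the temporary copy, the IRQ layer's cached
message keeps the host-side address while the device is programmed with
the guest-level one.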

Robin Murphy Nov. 14, 2024, 3:35 p.m. UTC | #7

On 13/11/2024 9:11 pm, Alex Williamson wrote:
> On Tue, 12 Nov 2024 21:34:30 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Nov 12, 2024 at 01:54:58PM -0800, Nicolin Chen wrote:
>>> On Mon, Nov 11, 2024 at 01:09:20PM +0000, Robin Murphy wrote:
>>>> On 2024-11-09 5:48 am, Nicolin Chen wrote:
>>>>> To solve this problem the VMM should capture the MSI IOVA allocated by the
>>>>> guest kernel and relay it to the GIC driver in the host kernel, to program
>>>>> the correct MSI IOVA. And this requires a new ioctl via VFIO.
>>>>
>>>> Once VFIO has that information from userspace, though, do we really need
>>>> the whole complicated dance to push it right down into the irqchip layer
>>>> just so it can be passed back up again? AFAICS
>>>> vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already explicitly
>>>> rewrites MSI-X vectors, so it seems like it should be pretty
>>>> straightforward to override the message address in general at that
>>>> level, without the lower layers having to be aware at all, no?
>>>
>>> Didn't see that clearly!! It works with a simple following override:
>>> --------------------------------------------------------------------
>>> @@ -497,6 +497,10 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>>>                  struct msi_msg msg;
>>>
>>>                  get_cached_msi_msg(irq, &msg);
>>> +               if (vdev->msi_iovas) {
>>> +                       msg.address_lo = lower_32_bits(vdev->msi_iovas[vector]);
>>> +                       msg.address_hi = upper_32_bits(vdev->msi_iovas[vector]);
>>> +               }
>>>                  pci_write_msi_msg(irq, &msg);
>>>          }
>>>   
>>> --------------------------------------------------------------------
>>>
>>> With that, I think we only need one VFIO change for this part :)
>>
>> Wow, is that really OK from a layering perspective? The comment is
>> pretty clear on the intention that this is to resync the irq layer
>> view of the device with the physical HW.
>>
>> Editing the msi_msg while doing that resync smells bad.
>>
>> Also, this is only doing MSI-X, we should include normal MSI as
>> well. (it probably should have a resync too?)
> 
> This was added for a specific IBM HBA that clears the vector table
> during a built-in self test, so it's possible the MSI table being in
> config space never had the same issue, or we just haven't encountered
> it.  I don't expect anything else actually requires this.

Yeah, I wasn't really suggesting to literally hook into this exact case; 
it was more just a general observation that if VFIO already has one 
justification for tinkering with pci_write_msi_msg() directly without 
going through the msi_domain layer, then adding another (wherever it 
fits best) can't be *entirely* unreasonable.

At the end of the day, the semantic here is that VFIO does know more 
than the IRQ layer, and does need to program the endpoint differently 
from what the irqchip assumes, so I don't see much benefit in dressing 
that up more than functionally necessary.

>> I'd want Thomas/Marc/Alex to agree.. (please read the cover letter for
>> context)
> 
> It seems suspect to me too.  In a sense it is still just synchronizing
> the MSI address, but to a different address space.
> 
> Is it possible to do this with the existing write_msi_msg callback on
> the msi descriptor?  For instance we could simply translate the msg
> address and call pci_write_msi_msg() (while avoiding an infinite
> recursion).  Or maybe there should be an xlate_msi_msg callback we can
> register.  Or I suppose there might be a way to insert an irqchip that
> does the translation on write.  Thanks,

I'm far from keen on the idea, but if there really is an appetite for 
more indirection, then I guess the least-worst option would be yet 
another type of iommu_dma_cookie to work via the existing 
iommu_dma_compose_msi_msg() flow, with some interface for VFIO to update 
per-device addresses directly. But then it's still going to need some 
kind of "layering violation" for VFIO to poke the IRQ layer into 
re-composing and re-writing a message whenever userspace feels like 
changing an address, because we're fundamentally stepping outside the 
established lifecycle of a kernel-managed IRQ around which said layering 
was designed...

Thanks,
Robin.

Eric Auger Nov. 20, 2024, 1:17 p.m. UTC | #8

On 11/14/24 16:35, Robin Murphy wrote:
> On 13/11/2024 9:11 pm, Alex Williamson wrote:
>> On Tue, 12 Nov 2024 21:34:30 -0400
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Tue, Nov 12, 2024 at 01:54:58PM -0800, Nicolin Chen wrote:
>>>> On Mon, Nov 11, 2024 at 01:09:20PM +0000, Robin Murphy wrote:
>>>>> On 2024-11-09 5:48 am, Nicolin Chen wrote:
>>>>>> To solve this problem the VMM should capture the MSI IOVA
>>>>>> allocated by the
>>>>>> guest kernel and relay it to the GIC driver in the host kernel,
>>>>>> to program
>>>>>> the correct MSI IOVA. And this requires a new ioctl via VFIO.
>>>>>
>>>>> Once VFIO has that information from userspace, though, do we
>>>>> really need
>>>>> the whole complicated dance to push it right down into the irqchip
>>>>> layer
>>>>> just so it can be passed back up again? AFAICS
>>>>> vfio_msi_set_vector_signal() via VFIO_DEVICE_SET_IRQS already
>>>>> explicitly
>>>>> rewrites MSI-X vectors, so it seems like it should be pretty
>>>>> straightforward to override the message address in general at that
>>>>> level, without the lower layers having to be aware at all, no?
>>>>
>>>> Didn't see that clearly!! It works with a simple following override:
>>>> --------------------------------------------------------------------
>>>> @@ -497,6 +497,10 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>>>>                  struct msi_msg msg;
>>>>
>>>>                  get_cached_msi_msg(irq, &msg);
>>>> +               if (vdev->msi_iovas) {
>>>> +                       msg.address_lo = lower_32_bits(vdev->msi_iovas[vector]);
>>>> +                       msg.address_hi = upper_32_bits(vdev->msi_iovas[vector]);
>>>> +               }
>>>>                  pci_write_msi_msg(irq, &msg);
>>>>          }
>>>>   --------------------------------------------------------------------
>>>>
>>>> With that, I think we only need one VFIO change for this part :)
>>>
>>> Wow, is that really OK from a layering perspective? The comment is
>>> pretty clear on the intention that this is to resync the irq layer
>>> view of the device with the physical HW.
>>>
>>> Editing the msi_msg while doing that resync smells bad.
>>>
>>> Also, this is only doing MSI-X, we should include normal MSI as
>>> well. (it probably should have a resync too?)
>>
>> This was added for a specific IBM HBA that clears the vector table
>> during a built-in self test, so it's possible the MSI table being in
>> config space never had the same issue, or we just haven't encountered
>> it.  I don't expect anything else actually requires this.
>
> Yeah, I wasn't really suggesting to literally hook into this exact
> case; it was more just a general observation that if VFIO already has
> one justification for tinkering with pci_write_msi_msg() directly
> without going through the msi_domain layer, then adding another
> (wherever it fits best) can't be *entirely* unreasonable.
>
> At the end of the day, the semantic here is that VFIO does know more
> than the IRQ layer, and does need to program the endpoint differently
> from what the irqchip assumes, so I don't see much benefit in dressing
> that up more than functionally necessary.
>
>>> I'd want Thomas/Marc/Alex to agree.. (please read the cover letter for
>>> context)
>>
>> It seems suspect to me too.  In a sense it is still just synchronizing
>> the MSI address, but to a different address space.
>>
>> Is it possible to do this with the existing write_msi_msg callback on
>> the msi descriptor?  For instance we could simply translate the msg
>> address and call pci_write_msi_msg() (while avoiding an infinite
>> recursion).  Or maybe there should be an xlate_msi_msg callback we can
>> register.  Or I suppose there might be a way to insert an irqchip that
>> does the translation on write.  Thanks,
>
> I'm far from keen on the idea, but if there really is an appetite for
> more indirection, then I guess the least-worst option would be yet
> another type of iommu_dma_cookie to work via the existing
> iommu_dma_compose_msi_msg() flow, with some interface for VFIO to
> update per-device addresses directly. But then it's still going to
> need some kind of "layering violation" for VFIO to poke the IRQ layer
> into re-composing and re-writing a message whenever userspace feels
> like changing an address, because we're fundamentally stepping outside
> the established lifecycle of a kernel-managed IRQ around which said
> layering was designed...

For the record, the first integration was based on such a distinct
iommu_dma_cookie:

[PATCH v15 00/12] SMMUv3 Nested Stage Setup (IOMMU part) <https://lore.kernel.org/all/20210411111228.14386-1-eric.auger@redhat.com/#r>, patches 8 - 11

Thanks

Eric



>
> Thanks,
> Robin.
>

Jason Gunthorpe Nov. 20, 2024, 2:03 p.m. UTC | #9

On Wed, Nov 20, 2024 at 02:17:46PM +0100, Eric Auger wrote:
> > Yeah, I wasn't really suggesting to literally hook into this exact
> > case; it was more just a general observation that if VFIO already has
> > one justification for tinkering with pci_write_msi_msg() directly
> > without going through the msi_domain layer, then adding another
> > (wherever it fits best) can't be *entirely* unreasonable.

I'm not sure that we can assume VFIO is the only thing touching the
interrupt programming.

I think there is a KVM path, and also the /proc/ path that will change
the MSI affinity on the fly for a VFIO created IRQ. If the platform
requires an MSI update to do this (i.e. encoding affinity in the
addr/data, not using IRQ remapping HW) then we still need to ensure the
correct MSI address is hooked in.

> >> Is it possible to do this with the existing write_msi_msg callback on
> >> the msi descriptor?  For instance we could simply translate the msg
> >> address and call pci_write_msi_msg() (while avoiding an infinite
> >> recursion).  Or maybe there should be an xlate_msi_msg callback we can
> >> register.  Or I suppose there might be a way to insert an irqchip that
> >> does the translation on write.  Thanks,
> >
> > I'm far from keen on the idea, but if there really is an appetite for
> > more indirection, then I guess the least-worst option would be yet
> > another type of iommu_dma_cookie to work via the existing
> > iommu_dma_compose_msi_msg() flow, 

For this direction I think I would turn iommu_dma_compose_msi_msg()
into a function pointer stored in the iommu_domain and have
vfio/iommufd provide its own implementation. The thing that is in
control of the domain's translation should be providing the msi_msg.
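
A minimal userspace sketch of that idea, assuming simplified stand-ins
for the kernel structures. The compose_msi_msg member, struct
vmsi_owner, and the integer dev_id are hypothetical names used only for
illustration; none of them are existing kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; not the real kernel definitions. */
struct msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

struct msi_desc {
	int dev_id;              /* hypothetical stand-in for msi_desc->dev */
	unsigned int msi_index;
};

struct iommu_domain;

/* Hypothetical per-domain hook replacing a global compose helper. */
typedef int (*compose_msi_msg_fn)(struct iommu_domain *domain,
				  const struct msi_desc *desc,
				  struct msi_msg *msg);

struct iommu_domain {
	compose_msi_msg_fn compose_msi_msg; /* assumed member */
	void *owner_data;                   /* private to the domain owner */
};

/* Hypothetical owner state: one doorbell IOVA supplied by userspace. */
struct vmsi_owner {
	uint64_t iova;
};

/* Owner-provided implementation: rewrite only the address, keep data. */
int owner_compose(struct iommu_domain *domain, const struct msi_desc *desc,
		  struct msi_msg *msg)
{
	const struct vmsi_owner *owner = domain->owner_data;

	(void)desc; /* a real implementation would look up per-desc state */
	msg->address_lo = (uint32_t)owner->iova;
	msg->address_hi = (uint32_t)(owner->iova >> 32);
	return 0;
}

/* What the irqchip side would call; a domain without a hook is a no-op. */
int domain_compose_msi_msg(struct iommu_domain *domain,
			   const struct msi_desc *desc, struct msi_msg *msg)
{
	assert(msg != NULL);
	if (!domain || !domain->compose_msi_msg)
		return 0;
	return domain->compose_msi_msg(domain, desc, msg);
}
```

The point of the indirection is that whoever owns the domain's
translation installs the hook, so the irqchip code never needs to know
which address space the final message address lives in.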

> > update per-device addresses directly. But then it's still going to
> > need some kind of "layering violation" for VFIO to poke the IRQ layer
> > into re-composing and re-writing a message whenever userspace feels
> > like changing an address

I think we'd need to get into the affinity update path and force an MSI
write as well, even if the platform isn't changing the MSI for
affinity. Processing a vMSI entry update would be two steps where we
update the MSI addr in VFIO and then set the affinity.
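
The two-step update could be sketched roughly as below. This is a
userspace mock under loud assumptions: struct mock_vector and both
functions are hypothetical, and the "affinity" step is reduced to an
unconditional re-write of the message, which is the behaviour being
asked for here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Hypothetical per-vector state; these names do not exist in the kernel. */
struct mock_vector {
	struct msi_msg host_msg; /* message as composed by the irq layer */
	struct msi_msg hw_msg;   /* what is currently programmed in the device */
	uint64_t vmsi_iova;      /* doorbell address provided by userspace */
	bool vmsi_valid;
	unsigned int hw_writes;  /* counts writes to the (mock) device */
};

/* Step 1: record the new vMSI address; nothing reaches the device yet. */
void mock_vmsi_set_addr(struct mock_vector *v, uint64_t iova)
{
	assert(v != NULL);
	v->vmsi_iova = iova;
	v->vmsi_valid = true;
}

/*
 * Step 2: affinity update. The message is written unconditionally, so a
 * changed address is picked up even when the host-side message is the same.
 */
void mock_set_affinity(struct mock_vector *v)
{
	struct msi_msg msg;

	assert(v != NULL);
	msg = v->host_msg;
	if (v->vmsi_valid) {
		msg.address_lo = (uint32_t)v->vmsi_iova;
		msg.address_hi = (uint32_t)(v->vmsi_iova >> 32);
	}
	v->hw_msg = msg;
	v->hw_writes++;
}
```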

> For the record, the first integration was based on such a distinct
> iommu_dma_cookie:
> 
> [PATCH v15 00/12] SMMUv3 Nested Stage Setup (IOMMU part) <https://lore.kernel.org/all/20210411111228.14386-1-eric.auger@redhat.com/#r>, patches 8 - 11

There are some significant differences from that series with this idea:

 - We want to maintain a per-MSI-index/per-device lookup table. It
   is not just a simple cookie, the msi_desc->dev &&
   msi_desc->msi_index have to be matched against what userspace
   provides in the per-vMSI IOCTL

 - There would be no implicit programming of the stage 2, this will be
   done directly in userspace by creating an IOAS area for the ITS page

 - It shouldn't have any sort of dynamic allocation behavior. It is an
   error for the kernel to ask for an msi_desc that userspace hasn't
   provided a mapping for

 - It should work well with nested and non-nested domains
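
As a rough illustration of the first and third points, here is a
userspace sketch of a fixed-size table keyed by (device, MSI index).
All names (struct vmsi_table, vmsi_table_set, vmsi_table_lookup,
VMSI_MAX_ENTRIES, the integer dev_id) are hypothetical; a real
implementation would key on msi_desc->dev and msi_desc->msi_index and
would need locking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VMSI_MAX_ENTRIES 8 /* arbitrary size for the sketch */

struct vmsi_entry {
	int dev_id;             /* stand-in for msi_desc->dev */
	unsigned int msi_index; /* stand-in for msi_desc->msi_index */
	uint64_t iova;          /* address provided by userspace */
	bool valid;
};

struct vmsi_table {
	struct vmsi_entry entries[VMSI_MAX_ENTRIES];
};

/* Userspace side: record (or update) the address for one vector. */
int vmsi_table_set(struct vmsi_table *t, int dev_id, unsigned int msi_index,
		   uint64_t iova)
{
	struct vmsi_entry *free_slot = NULL;
	size_t i;

	assert(t != NULL);
	for (i = 0; i < VMSI_MAX_ENTRIES; i++) {
		struct vmsi_entry *e = &t->entries[i];

		if (e->valid && e->dev_id == dev_id &&
		    e->msi_index == msi_index) {
			e->iova = iova;
			return 0;
		}
		if (!e->valid && !free_slot)
			free_slot = e;
	}
	if (!free_slot)
		return -ENOSPC;
	free_slot->dev_id = dev_id;
	free_slot->msi_index = msi_index;
	free_slot->iova = iova;
	free_slot->valid = true;
	return 0;
}

/*
 * Kernel side: pure lookup with no allocation. Asking for a vector that
 * userspace never provided is reported as an error.
 */
int vmsi_table_lookup(const struct vmsi_table *t, int dev_id,
		      unsigned int msi_index, uint64_t *iova)
{
	size_t i;

	assert(t != NULL && iova != NULL);
	for (i = 0; i < VMSI_MAX_ENTRIES; i++) {
		const struct vmsi_entry *e = &t->entries[i];

		if (e->valid && e->dev_id == dev_id &&
		    e->msi_index == msi_index) {
			*iova = e->iova;
			return 0;
		}
	}
	return -ENOENT;
}
```

The lookup path deliberately never creates an entry, so a missing
mapping surfaces as an error instead of being papered over.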

Jason
