[RFC,V3,00/26] vfio/pci: Back guest interrupts from Interrupt Message Store (IMS)

Message ID cover.1698422237.git.reinette.chatre@intel.com (mailing list archive)
Series vfio/pci: Back guest interrupts from Interrupt Message Store (IMS)

Message

Reinette Chatre Oct. 27, 2023, 5 p.m. UTC
Changes since RFC V2:
- RFC V2: https://lore.kernel.org/lkml/cover.1696609476.git.reinette.chatre@intel.com/
- Still submitting this as an RFC series. I believe that this now matches the
  expectations raised during earlier reviews. If you agree this is
  the right direction then I can drop the RFC prefix on next submission.
  If you do not agree then please do let me know where I missed
  expectations.
- First patch (PCI/MSI: Provide stubs for IMS functions)
  has been submitted upstream separately and is queued for inclusion during
  the next merge window. I still include it in this series to avoid
  the noise from issues that bots will find when checking this series
  without it.
  https://lore.kernel.org/lkml/169757242009.3135.5502383859327174030.tip-bot2@tip-bot2/
- Eliminated duplicate code between the PCI passthrough device backend and
  the IMS backend through more abstraction within the interrupt management
  frontend. (Kevin)
- Emulated interrupts are now managed by the interrupt management
  frontend and no longer unique to IMS. (Jason and Kevin)
- Since being an emulated interrupt is a persistent property, there is
  a new functional change to PCI interrupt management: per-interrupt
  contexts (managed by the frontend) are now persistent (they remain
  allocated until device release).
- Most of the patches from RFC V2 look the same with more patches
  added to support the additional abstraction needed to eliminate the
  duplicate code. The IMS support was refactored to benefit from the
  new abstraction. Please refer to individual patches for specific changes.

Changes since RFC V1:
- RFC V1: https://lore.kernel.org/lkml/cover.1692892275.git.reinette.chatre@intel.com/
- This is a complete rewrite based on feedback from Jason and Kevin.
  The primary change is to make IMS a new backend for MSI-X
  emulation: VFIO PCI becomes an interrupt management frontend,
  with the existing interrupt management for PCI passthrough devices
  as one backend and IMS interrupt management introduced as a new backend.
  The first part of the series splits VFIO PCI interrupt
  management into a "frontend" and "backend" with the existing PCI
  interrupt management as its first backend. The second part of the
  series adds IMS interrupt management as a new interrupt management
  backend.
  This is a significant change from RFC V1, both in the series itself
  and in the impact on existing VFIO PCI. This was done in response to
  feedback that I hope I understood as intended. If I did not get it
  right, please do point out to me where I went astray and I'd be
  happy to rewrite. Of course, suggestions for improvement will
  be much appreciated.

Hi Everybody,

With Interrupt Message Store (IMS) support introduced in
commit 0194425af0c8 ("PCI/MSI: Provide IMS (Interrupt Message Store)
support") a device can create a secondary interrupt domain that works
side by side with MSI-X on the same device. IMS allows for
implementation-specific interrupt storage that is managed by the
implementation-specific interrupt chip associated with the IMS domain
at the time the domain is created for the device via
pci_create_ims_domain().

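As a rough sketch of that flow (the chip callbacks and all "my_" names
below are invented for the example; only pci_create_ims_domain() and the
msi_domain_template layout are the kernel's API), a host driver creating
an IMS domain could look like:

  #include <linux/irq.h>
  #include <linux/msi.h>
  #include <linux/pci.h>

  static void my_ims_mask(struct irq_data *data)
  {
          /* Mask the entry in the device's interrupt storage */
  }

  static void my_ims_unmask(struct irq_data *data)
  {
          /* Unmask the entry */
  }

  static void my_ims_write_msg(struct irq_data *data, struct msi_msg *msg)
  {
          /* Store msg->address_hi/address_lo/data in the device */
  }

  static const struct msi_domain_template my_ims_template = {
          .chip = {
                  .name                   = "my-ims",
                  .irq_mask               = my_ims_mask,
                  .irq_unmask             = my_ims_unmask,
                  .irq_write_msi_msg      = my_ims_write_msg,
          },
  };

  static int my_probe_ims(struct pci_dev *pdev)
  {
          /* 2048 is an arbitrary example for the domain size */
          if (!pci_create_ims_domain(pdev, &my_ims_template, 2048, NULL))
                  return -ENODEV;
          return 0;
  }
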
An example usage of IMS is for devices that can have their resources
assigned to guests with varying granularity. For example, an
accelerator device may support many workqueues and a single workqueue
can be composed into a virtual device for use by a guest. Using
IMS interrupts for the guest preserves MSI-X for host usage while
allowing a significantly larger number of interrupt vectors than
MSI-X allows, all while enabling use of the same device driver
within the host and guest.

This series introduces IMS support to VFIO PCI for use by
virtual devices whose MSI-X interrupts are backed by IMS
interrupts on the host. Specifically, that means that when the virtual
device's VFIO_DEVICE_SET_IRQS ioctl() receives a "trigger interrupt"
(VFIO_IRQ_SET_ACTION_TRIGGER) for an MSI-X index, VFIO PCI IMS
allocates/frees an IMS interrupt on the host.

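For reference, the userspace side is the existing VFIO MSI-X sequence.
A minimal sketch of a VMM enabling one MSI-X vector with an eventfd
(error handling omitted, device_fd assumed to be an open VFIO device fd):

  #include <stdlib.h>
  #include <string.h>
  #include <sys/eventfd.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  int enable_msix_vector(int device_fd, unsigned int index)
  {
          int efd = eventfd(0, EFD_CLOEXEC);
          size_t sz = sizeof(struct vfio_irq_set) + sizeof(int);
          struct vfio_irq_set *set = malloc(sz);

          set->argsz = sz;
          set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
                       VFIO_IRQ_SET_ACTION_TRIGGER;
          set->index = VFIO_PCI_MSIX_IRQ_INDEX;
          set->start = index;
          set->count = 1;
          memcpy(set->data, &efd, sizeof(int));

          /* With the IMS backend, this is where an IMS interrupt
           * would be allocated on the host. */
          return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
  }
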
VFIO PCI assumes that it is managing interrupts of a passthrough PCI
device. Split VFIO PCI into a "frontend" and "backend" to support
interrupt management for virtual devices that are not passthrough PCI
devices. The VFIO PCI frontend directs guest requests to the
appropriate backend. The existing interrupt management for passthrough
PCI devices becomes the first backend; guest MSI-X interrupts backed by
IMS interrupts on the host are handled by the new backend (VFIO PCI IMS).

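Conceptually (the names below are hypothetical, not the structures from
the patches) the split boils down to the frontend dispatching through a
small ops structure implemented by each backend:

  struct vfio_pci_intr_ctx;       /* generic per-device interrupt state */

  /* Hypothetical illustration of a backend contract */
  struct vfio_pci_intr_ops {
          int  (*request_interrupt)(struct vfio_pci_intr_ctx *ctx,
                                    unsigned int index, int eventfd);
          void (*free_interrupt)(struct vfio_pci_intr_ctx *ctx,
                                 unsigned int index);
  };
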
An IMS interrupt is allocated via pci_ims_alloc_irq(), which requires
an implementation-specific cookie that is opaque to VFIO PCI IMS. This
can be a PASID, a queue ID, a pointer, etc. During initialization
VFIO PCI IMS learns which PCI device to operate on and what the
default cookie should be for any new interrupt allocation. VFIO PCI
IMS can also associate a unique cookie with each vector.

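A minimal sketch of how a backend could use that API (the
PASID-as-cookie choice is just an example; pci_ims_alloc_irq(),
pci_ims_free_irq(), union msi_instance_cookie and struct msi_map are the
kernel's API):

  #include <linux/msi.h>
  #include <linux/pci.h>

  static int example_ims_alloc(struct pci_dev *pdev, u32 pasid)
  {
          union msi_instance_cookie icookie = { .value = pasid };
          struct msi_map map;

          map = pci_ims_alloc_irq(pdev, &icookie, NULL);
          if (map.index < 0)
                  return map.index;       /* negative errno on failure */

          /* map.virq is the Linux IRQ number to use with request_irq() */

          pci_ims_free_irq(pdev, map);
          return 0;
  }
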
Guests may access a virtual device via both 'direct-path', where the
guest interacts directly with the underlying hardware, and 'intercepted
path', where the virtual device emulates operations. VFIO PCI
supports emulated interrupts (better naming suggestions are welcome) to
handle 'intercepted path' operations where completion interrupts are
signaled from the virtual device, not the underlying hardware.

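In its simplest form an emulated interrupt is just the host driver
signaling the eventfd that the guest registered for the vector; a sketch
(hypothetical function, real code also needs the usual locking around
the trigger):

  #include <linux/eventfd.h>

  static void my_vdev_signal_completion(struct eventfd_ctx *trigger)
  {
          /* No hardware interrupt is involved: the 'intercepted path'
           * completion is injected directly into the guest. */
          if (trigger)
                  eventfd_signal(trigger, 1);
  }
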
This has been tested with a yet-to-be-published VFIO driver for the
Intel Data Accelerators (IDXD) present in Intel Xeon CPUs.

While this series contains a working implementation, it is presented
as an RFC with the goal of obtaining feedback on whether VFIO PCI IMS
is appropriate for inclusion into VFIO and whether it is
(or could be adapted to be) appropriate for supporting other
planned IMS usages you may be aware of.

Any feedback will be greatly appreciated.

Reinette

Reinette Chatre (26):
  PCI/MSI: Provide stubs for IMS functions
  vfio/pci: Move PCI specific check from wrapper to PCI function
  vfio/pci: Use unsigned int instead of unsigned
  vfio/pci: Make core interrupt callbacks accessible to all virtual
    devices
  vfio/pci: Split PCI interrupt management into front and backend
  vfio/pci: Separate MSI and MSI-X handling
  vfio/pci: Move interrupt eventfd to interrupt context
  vfio/pci: Move mutex acquisition into function
  vfio/pci: Move per-interrupt contexts to generic interrupt struct
  vfio/pci: Move IRQ type to generic interrupt context
  vfio/pci: Provide interrupt context to irq_is() and is_irq_none()
  vfio/pci: Provide interrupt context to generic ops
  vfio/pci: Provide interrupt context to vfio_msi_enable() and
    vfio_msi_disable()
  vfio/pci: Let interrupt management backend interpret interrupt index
  vfio/pci: Move generic code to frontend
  vfio/pci: Split interrupt context initialization
  vfio/pci: Make vfio_pci_set_irqs_ioctl() available
  vfio/pci: Preserve per-interrupt contexts
  vfio/pci: Store Linux IRQ number in per-interrupt context
  vfio/pci: Separate frontend and backend code during interrupt
    enable/disable
  vfio/pci: Replace backend specific calls with callbacks
  vfio/pci: Introduce backend specific context initializer
  vfio/pci: Support emulated interrupts
  vfio/pci: Add core IMS support
  vfio/pci: Add accessor for IMS index
  vfio/pci: Support IMS cookie modification

 drivers/vfio/pci/vfio_pci_config.c |   2 +-
 drivers/vfio/pci/vfio_pci_core.c   |  50 +-
 drivers/vfio/pci/vfio_pci_intrs.c  | 758 ++++++++++++++++++++++++-----
 drivers/vfio/pci/vfio_pci_priv.h   |   2 +-
 include/linux/pci.h                |  34 +-
 include/linux/vfio_pci_core.h      |  87 +++-
 6 files changed, 756 insertions(+), 177 deletions(-)


base-commit: 611da07b89fdd53f140d7b33013f255bf0ed8f34

Comments

Tian, Kevin Oct. 31, 2023, 7:31 a.m. UTC | #1
> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Saturday, October 28, 2023 1:01 AM
> 
> Changes since RFC V2:
> - RFC V2:
> https://lore.kernel.org/lkml/cover.1696609476.git.reinette.chatre@intel.com
> /
> - Still submitting this as an RFC series. I believe that this now matches the
>   expectations raised during earlier reviews. If you agree this is
>   the right direction then I can drop the RFC prefix on next submission.
>   If you do not agree then please do let me know where I missed
>   expectations.

Overall this matches my expectation. Let's wait for Alex/Jason's thoughts
before moving to next-level refinement.

btw as commented on the last version, if this is the agreed direction probably
the next version can be split into two parts: part1 contains the new framework
and converts the intel vgpu driver to use it, then part2 adds the ims-specific
logic.

this way part1 can be verified and merged as an integral part.
Alex Williamson Nov. 1, 2023, 6:07 p.m. UTC | #2
On Tue, 31 Oct 2023 07:31:24 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Chatre, Reinette <reinette.chatre@intel.com>
> > Sent: Saturday, October 28, 2023 1:01 AM
> > 
> > Changes since RFC V2:
> > - RFC V2:
> > https://lore.kernel.org/lkml/cover.1696609476.git.reinette.chatre@intel.com
> > /
> > - Still submitting this as an RFC series. I believe that this now matches the
> >   expectations raised during earlier reviews. If you agree this is
> >   the right direction then I can drop the RFC prefix on next submission.
> >   If you do not agree then please do let me know where I missed
> >   expectations.  
> 
> Overall this matches my expectation. Let's wait for Alex/Jason's thoughts
> before moving to next-level refinement.

It feels like there's a lot of gratuitous change without any clear
purpose.  We create an ops structure so that a variant/mdev driver can
make use of the vfio-pci-core set_irqs ioctl piecemeal, but then the
two entry points that are actually implemented by the ims version are
the same as the core version, so the ops appear to be at the wrong
level.  The use of the priv pointer for the core callbacks looks like
it's just trying to justify the existence of the opaque pointer, it
should really just be using container_of().  We drill down into various
support functions for MSI (ie. enable, disable, request_interrupt,
free_interrupt, device name), but INTx is largely ignored, where we
haven't even kept is_intx() consistent with the other helpers.

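E.g. the usual pattern (type names hypothetical):

  #include <linux/container_of.h>

  struct vfio_pci_intr_ctx { /* generic interrupt state */ };

  /* Embed the generic context in the backend structure and recover it
   * with container_of() instead of carrying an opaque priv pointer. */
  struct vfio_pci_ims {
          struct vfio_pci_intr_ctx ctx;   /* generic part, embedded */
          /* IMS specific state ... */
  };

  static struct vfio_pci_ims *to_ims(struct vfio_pci_intr_ctx *ctx)
  {
          return container_of(ctx, struct vfio_pci_ims, ctx);
  }
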
Without an in-tree user of this code, we're just chopping up code for
no real purpose.  There's no reason that a variant driver requiring IMS
couldn't initially implement their own SET_IRQS ioctl.  Doing that
might lead to a more organic solution where we create interfaces where
they're actually needed.  The existing mdev sample drivers should also
be considered in any schemes to refactor the core code into a generic
SET_IRQS helper for devices exposing a vfio-pci API.  Thanks,

Alex
Tian, Kevin Nov. 2, 2023, 2:51 a.m. UTC | #3
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, November 2, 2023 2:07 AM
> 
> On Tue, 31 Oct 2023 07:31:24 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Chatre, Reinette <reinette.chatre@intel.com>
> > > Sent: Saturday, October 28, 2023 1:01 AM
> > >
> > > Changes since RFC V2:
> > > - RFC V2:
> > >
> https://lore.kernel.org/lkml/cover.1696609476.git.reinette.chatre@intel.com
> > > /
> > > - Still submitting this as an RFC series. I believe that this now matches the
> > >   expectations raised during earlier reviews. If you agree this is
> > >   the right direction then I can drop the RFC prefix on next submission.
> > >   If you do not agree then please do let me know where I missed
> > >   expectations.
> >
> > Overall this matches my expectation. Let's wait for Alex/Jason's thoughts
> > before moving to next-level refinement.
> 
> It feels like there's a lot of gratuitous change without any clear
> purpose.  We create an ops structure so that a variant/mdev driver can
> make use of the vfio-pci-core set_irqs ioctl piecemeal, but then the
> two entry points that are actually implemented by the ims version are
> the same as the core version, so the ops appear to be at the wrong
> level.  The use of the priv pointer for the core callbacks looks like
> it's just trying to justify the existence of the opaque pointer, it
> should really just be using container_of().  We drill down into various
> support functions for MSI (ie. enable, disable, request_interrupt,
> free_interrupt, device name), but INTx is largely ignored, where we
> haven't even kept is_intx() consistent with the other helpers.

All above are good points. The main interest of this series is to share
the MSI frontend interface with various backends (PCI MSI/X, IMS, and
purely emulated). From this angle the current ops abstraction does
seem to sit at the wrong level. But if we count your suggestion to also
refactor the mdev sample drivers (e.g. mtty emulates INTx) then there
might be a different outcome.

> 
> Without an in-tree user of this code, we're just chopping up code for
> no real purpose.  There's no reason that a variant driver requiring IMS
> couldn't initially implement their own SET_IRQS ioctl.  Doing that

this is an interesting idea. We haven't seen a real usage which wants
such MSI emulation on IMS for variant drivers, but if the code is
simple enough to demonstrate the 1st user of IMS it might not be
a bad choice. There is additional trap-emulation required in the
device MMIO bar (mostly copying the MSI permission entry, which contains
PASID info, to the corresponding IMS entry). At a glance that area
is 4k-aligned so should be doable.

let's explore more into this option.

> might lead to a more organic solution where we create interfaces where
> they're actually needed.  The existing mdev sample drivers should also
> be considered in any schemes to refactor the core code into a generic
> SET_IRQS helper for devices exposing a vfio-pci API.  Thanks,
> 

In this case we'll have mtty to demonstrate an emulated INTx backend
and intel vgpu to contain an emulated MSI backend.

and moving forward all drivers with a vfio-pci API should share the same
frontend interface.
Tian, Kevin Nov. 2, 2023, 3:14 a.m. UTC | #4
> From: Tian, Kevin
> Sent: Thursday, November 2, 2023 10:52 AM
> 
> >
> > Without an in-tree user of this code, we're just chopping up code for
> > no real purpose.  There's no reason that a variant driver requiring IMS
> > couldn't initially implement their own SET_IRQS ioctl.  Doing that
> 
> this is an interesting idea. We haven't seen a real usage which wants
> such MSI emulation on IMS for variant drivers. but if the code is
> simple enough to demonstrate the 1st user of IMS it might not be
> a bad choice. There are additional trap-emulation required in the
> device MMIO bar (mostly copying MSI permission entry which contains
> PASID info to the corresponding IMS entry). At a glance that area
> is 4k-aligned so should be doable.
> 

Misread the spec. The MSI-X permission table, which provides
auxiliary data to the MSI-X table, is not 4k-aligned. It sits in the 1st
4k page together with many other registers. Emulation of them
could be simple with a native read/write handler, but I am not sure
whether any of them sit in a hot path where trapping would
affect perf...
Alex Williamson Nov. 2, 2023, 9:13 p.m. UTC | #5
On Thu, 2 Nov 2023 03:14:09 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Tian, Kevin
> > Sent: Thursday, November 2, 2023 10:52 AM
> >   
> > >
> > > Without an in-tree user of this code, we're just chopping up code for
> > > no real purpose.  There's no reason that a variant driver requiring IMS
> > > couldn't initially implement their own SET_IRQS ioctl.  Doing that  
> > 
> > this is an interesting idea. We haven't seen a real usage which wants
> > such MSI emulation on IMS for variant drivers. but if the code is
> > simple enough to demonstrate the 1st user of IMS it might not be
> > a bad choice. There are additional trap-emulation required in the
> > device MMIO bar (mostly copying MSI permission entry which contains
> > PASID info to the corresponding IMS entry). At a glance that area
> > is 4k-aligned so should be doable.
> >   
> 
> misread the spec. the MSI-X permission table which provides
> auxiliary data to MSI-X table is not 4k-aligned. It sits in the 1st
> 4k page together with many other registers. emulation of them
> could be simple with a native read/write handler but not sure
> whether any of them may sit in a hot path to affect perf due to
> trap...

I'm not sure if you're referring to a specific device spec or the PCI
spec, but the PCI spec has long included an implementation note
suggesting alignment of the MSI-X vector table and pba and separation
from CSRs, and I see this is now even more strongly worded in the 6.0
spec.

Note though that for QEMU, these are emulated in the VMM and not
written through to the device.  The result of writes to the vector
table in the VMM are translated to vector use/unuse operations, which
we see at the kernel level through SET_IRQS ioctl calls.  Are you
expecting to get PASID information written by the guest through the
emulated vector table?  That would entail something more than a simple
IMS backend to MSI-X frontend.  Thanks,

Alex
Tian, Kevin Nov. 3, 2023, 7:23 a.m. UTC | #6
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, November 3, 2023 5:14 AM
> 
> On Thu, 2 Nov 2023 03:14:09 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Tian, Kevin
> > > Sent: Thursday, November 2, 2023 10:52 AM
> > >
> > > >
> > > > Without an in-tree user of this code, we're just chopping up code for
> > > > no real purpose.  There's no reason that a variant driver requiring IMS
> > > > couldn't initially implement their own SET_IRQS ioctl.  Doing that
> > >
> > > this is an interesting idea. We haven't seen a real usage which wants
> > > such MSI emulation on IMS for variant drivers. but if the code is
> > > simple enough to demonstrate the 1st user of IMS it might not be
> > > a bad choice. There are additional trap-emulation required in the
> > > device MMIO bar (mostly copying MSI permission entry which contains
> > > PASID info to the corresponding IMS entry). At a glance that area
> > > is 4k-aligned so should be doable.
> > >
> >
> > misread the spec. the MSI-X permission table which provides
> > auxiliary data to MSI-X table is not 4k-aligned. It sits in the 1st
> > 4k page together with many other registers. emulation of them
> > could be simple with a native read/write handler but not sure
> > whether any of them may sit in a hot path to affect perf due to
> > trap...
> 
> I'm not sure if you're referring to a specific device spec or the PCI
> spec, but the PCI spec has long included an implementation note
> suggesting alignment of the MSI-X vector table and pba and separation
> from CSRs, and I see this is now even more strongly worded in the 6.0
> spec.
> 
> Note though that for QEMU, these are emulated in the VMM and not
> written through to the device.  The result of writes to the vector
> table in the VMM are translated to vector use/unuse operations, which
> we see at the kernel level through SET_IRQS ioctl calls.  Are you
> expecting to get PASID information written by the guest through the
> emulated vector table?  That would entail something more than a simple
> IMS backend to MSI-X frontend.  Thanks,
> 

I was referring to the IDXD device spec. Basically it allows a process to
submit a descriptor which contains a completion interrupt handle.
The handle is the index of an MSI-X entry or IMS entry allocated by
the idxd driver. To mark the association between an application and
its handles, the driver records the PASID of the application
in an auxiliary structure for MSI-X (called the MSI-X permission table)
or directly in the IMS entry. This additional info includes whether
an MSI-X/IMS entry has PASID enabled and, if so, the PASID
value to be checked against the descriptor.

As you said, virtualizing the MSI-X table itself is done via SET_IRQS
and it is 4k-aligned. Then we also need to capture guest updates to the
MSI-X permission table and copy the PASID information into the
corresponding IMS entry when using the IMS backend. Since the MSI-X
permission table is not 4k-aligned, trapping it will affect adjacent
registers.

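As a sketch of the copy (hypothetical helper built on the pci_ims_*()
API from the cover letter; a real implementation would also need to move
the registered handler over to the new Linux IRQ):

  #include <linux/msi.h>
  #include <linux/pci.h>

  /* Re-back a vector with a fresh IMS allocation carrying the PASID
   * captured from the trapped MSI-X permission table write. */
  static int example_update_pasid(struct pci_dev *pdev,
                                  struct msi_map *map, u32 pasid)
  {
          union msi_instance_cookie icookie = { .value = pasid };

          if (map->index >= 0)
                  pci_ims_free_irq(pdev, *map);
          *map = pci_ims_alloc_irq(pdev, &icookie, NULL);

          return map->index < 0 ? map->index : 0;
  }
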
My quick check of the idxd spec doesn't reveal a real impact in the
perf-critical path. Most registers are configuration/control registers
accessed at driver init time, plus a few interrupt registers related
to errors or administrative purposes.
Alex Williamson Nov. 3, 2023, 3:51 p.m. UTC | #7
On Fri, 3 Nov 2023 07:23:13 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, November 3, 2023 5:14 AM
> > 
> > On Thu, 2 Nov 2023 03:14:09 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Tian, Kevin
> > > > Sent: Thursday, November 2, 2023 10:52 AM
> > > >  
> > > > >
> > > > > Without an in-tree user of this code, we're just chopping up code for
> > > > > no real purpose.  There's no reason that a variant driver requiring IMS
> > > > > couldn't initially implement their own SET_IRQS ioctl.  Doing that  
> > > >
> > > > this is an interesting idea. We haven't seen a real usage which wants
> > > > such MSI emulation on IMS for variant drivers. but if the code is
> > > > simple enough to demonstrate the 1st user of IMS it might not be
> > > > a bad choice. There are additional trap-emulation required in the
> > > > device MMIO bar (mostly copying MSI permission entry which contains
> > > > PASID info to the corresponding IMS entry). At a glance that area
> > > > is 4k-aligned so should be doable.
> > > >  
> > >
> > > misread the spec. the MSI-X permission table which provides
> > > auxiliary data to MSI-X table is not 4k-aligned. It sits in the 1st
> > > 4k page together with many other registers. emulation of them
> > > could be simple with a native read/write handler but not sure
> > > whether any of them may sit in a hot path to affect perf due to
> > > trap...  
> > 
> > I'm not sure if you're referring to a specific device spec or the PCI
> > spec, but the PCI spec has long included an implementation note
> > suggesting alignment of the MSI-X vector table and pba and separation
> > from CSRs, and I see this is now even more strongly worded in the 6.0
> > spec.
> > 
> > Note though that for QEMU, these are emulated in the VMM and not
> > written through to the device.  The result of writes to the vector
> > table in the VMM are translated to vector use/unuse operations, which
> > we see at the kernel level through SET_IRQS ioctl calls.  Are you
> > expecting to get PASID information written by the guest through the
> > emulated vector table?  That would entail something more than a simple
> > IMS backend to MSI-X frontend.  Thanks,
> >   
> 
> I was referring to IDXD device spec. Basically it allows a process to
> submit a descriptor which contains a completion interrupt handle.
> The handle is the index of a MSI-X entry or IMS entry allocated by
> the idxd driver. To mark the association between application and
> related handles the driver records the PASID of the application
> in an auxiliary structure for MSI-X (called MSI-X permission table)
> or directly in the IMS entry. This additional info includes whether
> an MSI-X/IMS entry has PASID enabled and if yes what is the PASID
> value to be checked against the descriptor.
> 
> As you said virtualizing MSI-X table itself is via SET_IRQS and it's
> 4k aligned. Then we also need to capture guest updates to the MSI-X
> permission table and copy the PASID information into the
> corresponding IMS entry when using the IMS backend. It's MSI-X
> permission table not 4k aligned then trapping it will affect adjacent
> registers.
> 
> My quick check in idxd spec doesn't reveal an real impact in perf
> critical path. Most registers are configuration/control registers
> accessed at driver init time and a few interrupt registers related
> to errors or administrative purpose.

Right, it looks like you'll need to trap writes to the MSI-X
Permissions Table via a sparse mmap capability to avoid assumptions
whether it lives on the same page as the MSI-X vector table or PBA.
Ideally the hardware folks have considered this to avoid any conflict
with latency sensitive registers.

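For illustration (offsets hypothetical), the sparse mmap areas would
simply skip the page holding the Permissions Table so that only accesses
to it trap:

  #include <linux/sizes.h>
  #include <linux/vfio.h>

  /* Hypothetical layout: everything in BAR0 is mmap'able except the 4k
   * page containing the MSI-X Permissions Table (PERM_PAGE_OFF and
   * BAR0_SIZE are invented for the example). */
  #define PERM_PAGE_OFF   0x1000
  #define BAR0_SIZE       0x10000

  static const struct vfio_region_sparse_mmap_area bar0_mmaps[] = {
          { .offset = 0,                      .size = PERM_PAGE_OFF },
          { .offset = PERM_PAGE_OFF + SZ_4K,
            .size   = BAR0_SIZE - PERM_PAGE_OFF - SZ_4K },
  };
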
The variant driver would use this for collecting the metadata relative
to the IMS interrupt, but this is all tangential to whether we
preemptively slice up vfio-pci-core's SET_IRQS ioctl or the iDXD driver
implements its own.

And just to be clear, I don't expect the iDXD variant driver to go to
extraordinary lengths to duplicate the core ioctl, we can certainly
refactor and export things where it makes sense, but I think it likely
makes more sense for the variant driver to implement the shell of the
ioctl rather than trying to multiplex the entire core ioctl with an ops
structure that's so intimately tied to the core implementation and
focused only on the MSI-X code paths.  Thanks,

Alex
Tian, Kevin Nov. 7, 2023, 8:29 a.m. UTC | #8
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, November 3, 2023 11:51 PM
> 
> On Fri, 3 Nov 2023 07:23:13 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, November 3, 2023 5:14 AM
> > >
> > > On Thu, 2 Nov 2023 03:14:09 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Tian, Kevin
> > > > > Sent: Thursday, November 2, 2023 10:52 AM
> > > > >
> > > > > >
> > > > > > Without an in-tree user of this code, we're just chopping up code for
> > > > > > no real purpose.  There's no reason that a variant driver requiring
> IMS
> > > > > > couldn't initially implement their own SET_IRQS ioctl.  Doing that
> > > > >
> > > > > this is an interesting idea. We haven't seen a real usage which wants
> > > > > such MSI emulation on IMS for variant drivers. but if the code is
> > > > > simple enough to demonstrate the 1st user of IMS it might not be
> > > > > a bad choice. There are additional trap-emulation required in the
> > > > > device MMIO bar (mostly copying MSI permission entry which
> contains
> > > > > PASID info to the corresponding IMS entry). At a glance that area
> > > > > is 4k-aligned so should be doable.
> > > > >
> > > >
> > > > misread the spec. the MSI-X permission table which provides
> > > > auxiliary data to MSI-X table is not 4k-aligned. It sits in the 1st
> > > > 4k page together with many other registers. emulation of them
> > > > could be simple with a native read/write handler but not sure
> > > > whether any of them may sit in a hot path to affect perf due to
> > > > trap...
> > >
> > > I'm not sure if you're referring to a specific device spec or the PCI
> > > spec, but the PCI spec has long included an implementation note
> > > suggesting alignment of the MSI-X vector table and pba and separation
> > > from CSRs, and I see this is now even more strongly worded in the 6.0
> > > spec.
> > >
> > > Note though that for QEMU, these are emulated in the VMM and not
> > > written through to the device.  The result of writes to the vector
> > > table in the VMM are translated to vector use/unuse operations, which
> > > we see at the kernel level through SET_IRQS ioctl calls.  Are you
> > > expecting to get PASID information written by the guest through the
> > > emulated vector table?  That would entail something more than a simple
> > > IMS backend to MSI-X frontend.  Thanks,
> > >
> >
> > I was referring to IDXD device spec. Basically it allows a process to
> > submit a descriptor which contains a completion interrupt handle.
> > The handle is the index of a MSI-X entry or IMS entry allocated by
> > the idxd driver. To mark the association between application and
> > related handles the driver records the PASID of the application
> > in an auxiliary structure for MSI-X (called MSI-X permission table)
> > or directly in the IMS entry. This additional info includes whether
> > an MSI-X/IMS entry has PASID enabled and if yes what is the PASID
> > value to be checked against the descriptor.
> >
> > As you said virtualizing MSI-X table itself is via SET_IRQS and it's
> > 4k aligned. Then we also need to capture guest updates to the MSI-X
> > permission table and copy the PASID information into the
> > corresponding IMS entry when using the IMS backend. It's MSI-X
> > permission table not 4k aligned then trapping it will affect adjacent
> > registers.
> >
> > My quick check in idxd spec doesn't reveal an real impact in perf
> > critical path. Most registers are configuration/control registers
> > accessed at driver init time and a few interrupt registers related
> > to errors or administrative purpose.
> 
> Right, it looks like you'll need to trap writes to the MSI-X
> Permissions Table via a sparse mmap capability to avoid assumptions
> whether it lives on the same page as the MSI-X vector table or PBA.
> Ideally the hardware folks have considered this to avoid any conflict
> with latency sensitive registers.
> 
> The variant driver would use this for collecting the meta data relative
> to the IMS interrupt, but this is all tangential to whether we
> preemptively slice up vfio-pci-core's SET_IRQS ioctl or the iDXD driver
> implements its own.

Agree

> 
> And just to be clear, I don't expect the iDXD variant driver to go to
> extraordinary lengths to duplicate the core ioctl, we can certainly
> refactor and export things where it makes sense, but I think it likely
> makes more sense for the variant driver to implement the shell of the
> ioctl rather than trying to multiplex the entire core ioctl with an ops
> structure that's so intimately tied to the core implementation and
> focused only on the MSI-X code paths.  Thanks,
> 

btw I'll let Reinette decide whether she wants to implement such
a variant driver or wait until the idxd siov driver is ready to demonstrate
the usage. One concern with the variant driver approach is the lack
of a real-world usage (e.g. what IMS brings when the guest only wants
MSI-X on an assigned PF). Though it may provide a shorter path
to enable the IMS backend support, it also comes with a long-term
maintenance burden.
Reinette Chatre Nov. 7, 2023, 7:48 p.m. UTC | #9
Hi Alex and Kevin,

On 11/7/2023 12:29 AM, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Friday, November 3, 2023 11:51 PM
>>
>> On Fri, 3 Nov 2023 07:23:13 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>
>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>> Sent: Friday, November 3, 2023 5:14 AM
>>>>
>>>> On Thu, 2 Nov 2023 03:14:09 +0000
>>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>>
>>>>>> From: Tian, Kevin
>>>>>> Sent: Thursday, November 2, 2023 10:52 AM
>>>>>>
>>>>>>>
>>>>>>> Without an in-tree user of this code, we're just chopping up code for
>>>>>>> no real purpose.  There's no reason that a variant driver requiring
>> IMS
>>>>>>> couldn't initially implement their own SET_IRQS ioctl.  Doing that
>>>>>>
>>>>>> this is an interesting idea. We haven't seen a real usage which wants
>>>>>> such MSI emulation on IMS for variant drivers. but if the code is
>>>>>> simple enough to demonstrate the 1st user of IMS it might not be
>>>>>> a bad choice. There are additional trap-emulation required in the
>>>>>> device MMIO bar (mostly copying MSI permission entry which
>> contains
>>>>>> PASID info to the corresponding IMS entry). At a glance that area
>>>>>> is 4k-aligned so should be doable.
>>>>>>
>>>>>
>>>>> misread the spec. the MSI-X permission table which provides
>>>>> auxiliary data to MSI-X table is not 4k-aligned. It sits in the 1st
>>>>> 4k page together with many other registers. emulation of them
>>>>> could be simple with a native read/write handler but not sure
>>>>> whether any of them may sit in a hot path to affect perf due to
>>>>> trap...
>>>>
>>>> I'm not sure if you're referring to a specific device spec or the PCI
>>>> spec, but the PCI spec has long included an implementation note
>>>> suggesting alignment of the MSI-X vector table and pba and separation
>>>> from CSRs, and I see this is now even more strongly worded in the 6.0
>>>> spec.
>>>>
>>>> Note though that for QEMU, these are emulated in the VMM and not
>>>> written through to the device.  The result of writes to the vector
>>>> table in the VMM are translated to vector use/unuse operations, which
>>>> we see at the kernel level through SET_IRQS ioctl calls.  Are you
>>>> expecting to get PASID information written by the guest through the
>>>> emulated vector table?  That would entail something more than a simple
>>>> IMS backend to MSI-X frontend.  Thanks,
>>>>
>>>
>>> I was referring to IDXD device spec. Basically it allows a process to
>>> submit a descriptor which contains a completion interrupt handle.
>>> The handle is the index of a MSI-X entry or IMS entry allocated by
>>> the idxd driver. To mark the association between application and
>>> related handles the driver records the PASID of the application
>>> in an auxiliary structure for MSI-X (called MSI-X permission table)
>>> or directly in the IMS entry. This additional info includes whether
>>> an MSI-X/IMS entry has PASID enabled and if yes what is the PASID
>>> value to be checked against the descriptor.
>>>
>>> As you said virtualizing MSI-X table itself is via SET_IRQS and it's
>>> 4k aligned. Then we also need to capture guest updates to the MSI-X
>>> permission table and copy the PASID information into the
>>> corresponding IMS entry when using the IMS backend. It's MSI-X
>>> permission table not 4k aligned then trapping it will affect adjacent
>>> registers.
>>>
>>> My quick check in idxd spec doesn't reveal an real impact in perf
>>> critical path. Most registers are configuration/control registers
>>> accessed at driver init time and a few interrupt registers related
>>> to errors or administrative purpose.
>>
>> Right, it looks like you'll need to trap writes to the MSI-X
>> Permissions Table via a sparse mmap capability to avoid assumptions
>> whether it lives on the same page as the MSI-X vector table or PBA.
>> Ideally the hardware folks have considered this to avoid any conflict
>> with latency sensitive registers.
>>
>> The variant driver would use this for collecting the meta data relative
>> to the IMS interrupt, but this is all tangential to whether we
>> preemptively slice up vfio-pci-core's SET_IRQS ioctl or the iDXD driver
>> implements its own.
> 
> Agree
> 
>>
>> And just to be clear, I don't expect the iDXD variant driver to go to
>> extraordinary lengths to duplicate the core ioctl, we can certainly
>> refactor and export things where it makes sense, but I think it likely
>> makes more sense for the variant driver to implement the shell of the
>> ioctl rather than trying to multiplex the entire core ioctl with an ops
>> structure that's so intimately tied to the core implementation and
>> focused only on the MSI-X code paths.  Thanks,
>>
> 
> btw I'll let Reinette to decide whether she wants to implement such
> a variant driver or waits until idxd siov driver is ready to demonstrate
> the usage. One concern with the variant driver approach is lacking
> of a real-world usage (e.g. what IMS brings when guest only wants
> MSI-X on an assigned PF). Though it may provide a shorter path
> to enable the IMS backend support, also comes with the long-term
> maintenance burden.

Thank you very much for your guidance and advice.

I'd be happy to implement what is required for this work. Unfortunately
I am not confident that I understand what is meant by "variant driver".

I initially understood "variant driver" to mean the planned IDXD virtual
device driver (the "IDXD VDCM" driver) that will assign IDXD resources
to guests with varying granularity and back the guest MSI-X interrupts
of these virtual devices with IMS interrupts on the host. From Kevin
I understand "variant driver" is a new sample driver for an IDXD
assigned PF, backing guest MSI-X interrupts with IMS interrupts on
the host.

The IDXD VDCM driver is in progress. If a new variant driver needs to be
created then I still need to do some research into how to accomplish it.
Even so, it is not clear to me who the users of this new driver would be.
If the requirement is to demonstrate this VFIO IMS usage then we could
perhaps wait until the IDXD VDCM driver is ready and thus not have to deal
with additional maintenance burden.

In the meantime there are items that I do understand better
and will work on right away:
- Do not have ops span the SET_IRQS ioctl()
- Use container_of() instead of opaque pointer
- Do not ignore INTx, consider the mdev sample driver when refactoring
  this code.
- Consider the Intel vgpu driver as user of new emulated interrupt
  interface.

Reinette
Alex Williamson Nov. 7, 2023, 11:06 p.m. UTC | #10
On Tue, 7 Nov 2023 11:48:28 -0800
Reinette Chatre <reinette.chatre@intel.com> wrote:

> Hi Alex and Kevin,
> 
> On 11/7/2023 12:29 AM, Tian, Kevin wrote:
> >> From: Alex Williamson <alex.williamson@redhat.com>
> >> Sent: Friday, November 3, 2023 11:51 PM
> >>
> >> On Fri, 3 Nov 2023 07:23:13 +0000
> >> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >>  
> >>>> From: Alex Williamson <alex.williamson@redhat.com>
> >>>> Sent: Friday, November 3, 2023 5:14 AM
> >>>>
> >>>> On Thu, 2 Nov 2023 03:14:09 +0000
> >>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >>>>  
> >>>>>> From: Tian, Kevin
> >>>>>> Sent: Thursday, November 2, 2023 10:52 AM
> >>>>>>  
> >>>>>>>
> >>>>>>> Without an in-tree user of this code, we're just chopping up code for
> >>>>>>> no real purpose.  There's no reason that a variant driver requiring  
> >> IMS  
> >>>>>>> couldn't initially implement their own SET_IRQS ioctl.  Doing that  
> >>>>>>
> >>>>>> this is an interesting idea. We haven't seen a real usage which wants
> >>>>>> such MSI emulation on IMS for variant drivers. but if the code is
> >>>>>> simple enough to demonstrate the 1st user of IMS it might not be
> >>>>>> a bad choice. There are additional trap-emulation required in the
> >>>>>> device MMIO bar (mostly copying MSI permission entry which  
> >> contains  
> >>>>>> PASID info to the corresponding IMS entry). At a glance that area
> >>>>>> is 4k-aligned so should be doable.
> >>>>>>  
> >>>>>
> >>>>> misread the spec. the MSI-X permission table which provides
> >>>>> auxiliary data to MSI-X table is not 4k-aligned. It sits in the 1st
> >>>>> 4k page together with many other registers. emulation of them
> >>>>> could be simple with a native read/write handler but not sure
> >>>>> whether any of them may sit in a hot path to affect perf due to
> >>>>> trap...  
> >>>>
> >>>> I'm not sure if you're referring to a specific device spec or the PCI
> >>>> spec, but the PCI spec has long included an implementation note
> >>>> suggesting alignment of the MSI-X vector table and pba and separation
> >>>> from CSRs, and I see this is now even more strongly worded in the 6.0
> >>>> spec.
> >>>>
> >>>> Note though that for QEMU, these are emulated in the VMM and not
> >>>> written through to the device.  The result of writes to the vector
> >>>> table in the VMM are translated to vector use/unuse operations, which
> >>>> we see at the kernel level through SET_IRQS ioctl calls.  Are you
> >>>> expecting to get PASID information written by the guest through the
> >>>> emulated vector table?  That would entail something more than a simple
> >>>> IMS backend to MSI-X frontend.  Thanks,
> >>>>  
> >>>
> >>> I was referring to IDXD device spec. Basically it allows a process to
> >>> submit a descriptor which contains a completion interrupt handle.
> >>> The handle is the index of a MSI-X entry or IMS entry allocated by
> >>> the idxd driver. To mark the association between application and
> >>> related handles the driver records the PASID of the application
> >>> in an auxiliary structure for MSI-X (called MSI-X permission table)
> >>> or directly in the IMS entry. This additional info includes whether
> >>> an MSI-X/IMS entry has PASID enabled and if yes what is the PASID
> >>> value to be checked against the descriptor.
> >>>
> >>> As you said virtualizing MSI-X table itself is via SET_IRQS and it's
> >>> 4k aligned. Then we also need to capture guest updates to the MSI-X
> >>> permission table and copy the PASID information into the
> >>> corresponding IMS entry when using the IMS backend. It's MSI-X
> >>> permission table not 4k aligned then trapping it will affect adjacent
> >>> registers.
> >>>
> >>> My quick check in idxd spec doesn't reveal an real impact in perf
> >>> critical path. Most registers are configuration/control registers
> >>> accessed at driver init time and a few interrupt registers related
> >>> to errors or administrative purpose.  
> >>
> >> Right, it looks like you'll need to trap writes to the MSI-X
> >> Permissions Table via a sparse mmap capability to avoid assumptions
> >> whether it lives on the same page as the MSI-X vector table or PBA.
> >> Ideally the hardware folks have considered this to avoid any conflict
> >> with latency sensitive registers.
> >>
> >> The variant driver would use this for collecting the meta data relative
> >> to the IMS interrupt, but this is all tangential to whether we
> >> preemptively slice up vfio-pci-core's SET_IRQS ioctl or the iDXD driver
> >> implements its own.  
> > 
> > Agree
> >   
> >>
> >> And just to be clear, I don't expect the iDXD variant driver to go to
> >> extraordinary lengths to duplicate the core ioctl, we can certainly
> >> refactor and export things where it makes sense, but I think it likely
> >> makes more sense for the variant driver to implement the shell of the
> >> ioctl rather than trying to multiplex the entire core ioctl with an ops
> >> structure that's so intimately tied to the core implementation and
> >> focused only on the MSI-X code paths.  Thanks,
> >>  
> > 
> > btw I'll let Reinette to decide whether she wants to implement such
> > a variant driver or waits until idxd siov driver is ready to demonstrate
> > the usage. One concern with the variant driver approach is lacking
> > of a real-world usage (e.g. what IMS brings when guest only wants
> > MSI-X on an assigned PF). Though it may provide a shorter path
> > to enable the IMS backend support, also comes with the long-term
> > maintenance burden.  
> 
> Thank you very much for your guidance and advice.
> 
> I'd be happy to implement what is required for this work. Unfortunately
> I am not confident that I understand what is meant with "variant driver".
> 
> I initially understood "variant driver" to mean the planned IDXD virtual
> device driver (the "IDXD VDCM" driver) that will assign IDXD resources
> to guests with varying granularity and back the guest MSI-X interrupts
> of these virtual devices with IMS interrupts on the host. From Kevin
> I understand "variant driver" is a new sample driver for an IDXD
> assigned PF, backing guest MSI-X interrupts with IMS interrupts on
> the host.
> 
> The IDXD VDCM driver is in progress. If a new variant driver needs to be
> created then I still need to do some research in how to accomplish it.
> Even so, it is not clear to me who the users of this new driver would be.
> If the requirement is to demonstrate this VFIO IMS usage then we could
> perhaps wait until the IDXD VDCM driver is ready and thus not have to deal
> with additional maintenance burden.

A vfio-pci variant driver is specifically a driver that leverages
portions of vfio-pci-core for implementing vfio_device_ops and binds to
a PCI device.  It might actually be the wrong term here, but I jumped
to that since the series tries to generalize portions of one of the
vfio-pci-core code paths. You might very well be intending to use this
with something more like an mdev driver, which is fine.

That also sort of illustrates the point though that this series is
taking a pretty broad approach to slicing up vfio-pci-core's SET_IRQS
ioctl code path, enabling support for IMS backed interrupts, but in
effect complicating the whole thing without any actual consumer to
justify the complication.  Meanwhile I think the goal is to reduce
complication to a driver that doesn't exist yet.  So it currently seems
like a poor trade-off.

This driver that doesn't exist yet could implement its own SET_IRQS
ioctl that backs MSI-X with IMS as a starting point.  Presumably we
expect multiple drivers to require this behavior, so common code makes
sense, but the rest of us in the community can't really evaluate how
much it makes sense to slice the common code without seeing that
implementation and how it might leverage, if not directly use, the
existing core code.

The sample drivers come into play in that if we were to make the
vfio-pci-core SET_IRQS path usable in this generic way, then any driver
implementing SET_IRQS for a PCI device should be able to take advantage
of it, including those that already exist in samples/vfio-mdev/.  I
don't think anyone is requesting an iDXD sample driver, the real driver
should be sufficient.

> In the mean time there are items that I do understand better
> and will work on right away:
> - Do not have ops span the SET_IRQS ioctl()
> - Use container_of() instead of opaque pointer
> - Do not ignore INTx, consider the mdev sample driver when refactoring
>   this code.
> - Consider the Intel vgpu driver as user of new emulated interrupt
>   interface.

I think looking at kvmgt and the existing sample drivers to find
commonality that actually provides simplification would be a good
start, but I don't anticipate implementing IMS backed MSI-X in common
code unless we're at least imminently able to make use of it.  Thanks,

Alex
Jason Gunthorpe Nov. 8, 2023, 2:49 a.m. UTC | #11
On Tue, Nov 07, 2023 at 04:06:41PM -0700, Alex Williamson wrote:

> A vfio-pci variant driver is specifically a driver that leverages
> portions of vfio-pci-core for implementing vfio_device_ops and binds to
> a PCI device.  It might actually be the wrong term here, but I jumped
> to that since the series tries to generalize portions of one of the
> vfio-pci-core code paths. You might very well be intending to use this
> with something more like an mdev driver, which is fine.

IDXD will be a SIOV device and we need to have a serious talk about
how SIOV device lifecycle will work.

> That also sort of illustrates the point though that this series is
> taking a pretty broad approach to slicing up vfio-pci-core's SET_IRQS
> ioctl code path, enabling support for IMS backed interrupts, but in
> effect complicating the whole thing without any actual consumer to
> justify the complication.  Meanwhile I think the goal is to reduce
> complication to a driver that doesn't exist yet.  So it currently seems
> like a poor trade-off.

I think we need to see some draft of the IDXD driver to really
understand this

> This driver that doesn't exist yet could implement its own SET_IRQS
> ioctl that backs MSI-X with IMS as a starting point.  Presumably we
> expect multiple drivers to require this behavior, so common code makes
> sense, but the rest of us in the community can't really evaluate how
> much it makes sense to slice the common code without seeing that
> implementation and how it might leverage, if not directly use, the
> existing core code.

I've been seeing a general interest in taking something that is not
MSI-X (eg "IMS" for IDXD) and converting it into MSI-X for the vPCI
function. I think this will be a durable need in this space.

Ideally it will be overtaken by simply teaching the guest, vfio and
the hypervisor interrupt logic how to directly generate interrupts
with a guest-controlled addr/data pair without requiring MSI-X
trapping. That is the fundamental reason why this has to be done this
convoluted way.

Jason
Tian, Kevin Nov. 8, 2023, 9:16 a.m. UTC | #12
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, November 8, 2023 10:50 AM
> 
> On Tue, Nov 07, 2023 at 04:06:41PM -0700, Alex Williamson wrote:
> 
> > This driver that doesn't exist yet could implement its own SET_IRQS
> > ioctl that backs MSI-X with IMS as a starting point.  Presumably we
> > expect multiple drivers to require this behavior, so common code makes
> > sense, but the rest of us in the community can't really evaluate how
> > much it makes sense to slice the common code without seeing that
> > implementation and how it might leverage, if not directly use, the
> > existing core code.
> 
> I've been seeing a general interest in taking something that is not
> MSI-X (eg "IMS" for IDXD) and converting it into MSI-X for the vPCI
> function. I think this will be a durable need in this space.
> 
> Ideally it will be overtaken by simply teaching the guest, vfio and
> the hypervisor interrupt logic how to directly generate interrupts
> with a guest controlled addr/data pair without requiring MSI-X
> trapping. That is the fundamental reason why this has to be done this
> convoluted way.
> 

Even with that, a legacy guest which doesn't support such an enlightened
mechanism still needs this convoluted way.
Reinette Chatre Nov. 8, 2023, 4:52 p.m. UTC | #13
Hi Alex,

On 11/7/2023 3:06 PM, Alex Williamson wrote:
> That also sort of illustrates the point though that this series is
> taking a pretty broad approach to slicing up vfio-pci-core's SET_IRQS
> ioctl code path, enabling support for IMS backed interrupts, but in
> effect complicating the whole thing without any actual consumer to
> justify the complication.  Meanwhile I think the goal is to reduce
> complication to a driver that doesn't exist yet.  So it currently seems
> like a poor trade-off.
> 
> This driver that doesn't exist yet could implement its own SET_IRQS
> ioctl that backs MSI-X with IMS as a starting point.  Presumably we
> expect multiple drivers to require this behavior, so common code makes
> sense, but the rest of us in the community can't really evaluate how
> much it makes sense to slice the common code without seeing that
> implementation and how it might leverage, if not directly use, the
> existing core code.

I understand. I'm hearing the same from you and Jason. I plan to
work on addressing your feedback but will only share it when it can be
accompanied by a draft of the IDXD VDCM driver. Please let me know
if you prefer a different approach.

Reinette