Message ID | cover.1678911529.git.reinette.chatre@intel.com (mailing list archive)
---|---
Series | vfio/pci: Support dynamic allocation of MSI-X interrupts
On Wed, 15 Mar 2023 13:59:20 -0700
Reinette Chatre <reinette.chatre@intel.com> wrote:

> == Cover letter ==
>
> Qemu allocates interrupts incrementally at the time the guest unmasks an
> interrupt, for example each time a Linux guest runs request_irq().
>
> Dynamic allocation of MSI-X interrupts was not possible until v6.2 [1].
> This prompted Qemu to, when allocating a new interrupt, first release all
> previously allocated interrupts (including disable of MSI-X) followed
> by re-allocation of all interrupts that includes the new interrupt.
> Please see [2] for a detailed discussion about this issue.
>
> Releasing and re-allocating interrupts may be acceptable if all
> interrupts are unmasked during device initialization. If unmasking of
> interrupts occurs during runtime this may result in lost interrupts.
> For example, consider an accelerator device with multiple work queues,
> each work queue having a dedicated interrupt. A work queue can be
> enabled at any time with its associated interrupt unmasked while other
> work queues are already active. Having all interrupts released and MSI-X
> disabled to enable the new work queue will impact active work queues.
>
> This series builds on the recent interrupt sub-system core changes
> that added support for dynamic MSI-X allocation after initial MSI-X
> enabling.
>
> Add support for dynamic MSI-X allocation to vfio-pci. A flag
> indicating lack of support for dynamic allocation already exists:
> VFIO_IRQ_INFO_NORESIZE and has always been set for MSI and MSI-X. With
> support for dynamic MSI-X the flag is cleared for MSI-X, enabling
> Qemu to modify its behavior.
>
> == Why is this an RFC ? ==
>
> vfio support for dynamic MSI-X needs to work with existing user space
> as well as upcoming user space that takes advantage of this feature.
> I would appreciate guidance on the expectations and requirements
> surrounding error handling when considering existing user space.
>
> For example, consider the following scenario:
> Start: Consider a passthrough device that supports 10 MSI-X interrupts
>        (0 to 9) and existing Qemu allocated interrupts 0 to 4.
>
> Scenario: Qemu (hypothetically) attempts to associate a new action to
>           interrupts 0 to 7. Early checking of this range in
>           vfio_set_irqs_validate_and_prepare() will pass since it
>           is a valid range for the device. What happens after the
>           early checking is considered next:
>
> Current behavior (before this series): Since the provided range, 0 - 7,
> exceeds the allocated range, no action will be taken on existing
> allocated interrupts 0 - 4 and the Qemu request will fail without
> making any state changes.

I must be missing something, it was described correctly earlier that
QEMU will disable MSI-X and re-enable with a new vector count. Not only
does QEMU not really have a way to fail this change, pretty much nothing
would work if we did.

What happens in this case is that the QEMU vfio-pci driver gets a
vector_use callback for one of these new vectors {5,6,7},
vfio_msix_vector_do_use() checks that's greater than we have enabled,
disables MSI-X, and re-enables it for the new vector count. Worst case
we'll do that 3 times if the vectors are presented in ascending order.

> New behavior (with this series): Since vfio supports dynamic MSI-X, new
> interrupts will be allocated for vectors 5, 6, and 7. Please note
> that this changes the behavior encountered by unmodified Qemu: new
> interrupts are allocated instead of returning an error. Even more,
> since the range provided by Qemu spans 0 - 7, a failure during
> allocation of 5, 6, or 7 will result in release of the entire range.

As above, QEMU doesn't currently return an error here, nor is there
actually any means to return an error. MSI-X is not paravirtualized and
the PCI spec definition of MSI-X does not define that a driver needs to
check whether the device accepted unmasking a vector. All we can do in
QEMU if the host fails to configure the device as directed is print an
error message as a breadcrumb to help debug why the device stopped
working. In practice, I don't think I've ever seen this.

So really the difference should be transparent to the guest. The risk of
an error allocating vectors on the host is the same; the only
significant difference should be the amount of disruption at the device,
since we can better approximate the request of the guest.

> This series aims to maintain existing error behavior for MSI (please see
> "vfio/pci: Remove interrupt context counter") but I would appreciate your
> guidance on whether existing error behavior should be maintained for MSI-X
> and how to do so if it is a requirement.

Based on above, there really can never be an error if we expect the
device to work, so I think there's a misread of the current status.
Dynamic MSI-X support should simply reduce the disruption and chance of
lost interrupts at the device, but the points where we risk that the
host cannot provide the configuration we need are the same.

Thanks,
Alex
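For reference, the request in the 0 - 7 scenario discussed above corresponds roughly to a single VFIO_DEVICE_SET_IRQS call that attaches eventfd triggers to MSI-X vectors 0 through 7. The sketch below is only illustrative of the uAPI and is not QEMU code; `device_fd` and the `efds` array are assumed to already exist.

```c
#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative sketch only (not QEMU code): associate eventfd triggers
 * with MSI-X vectors 0..count-1 in one VFIO_DEVICE_SET_IRQS call, as in
 * the 0 - 7 scenario above. device_fd and efds are assumed to exist.
 */
int set_msix_triggers(int device_fd, int32_t *efds, uint32_t count)
{
	struct vfio_irq_set *set;
	size_t argsz = sizeof(*set) + count * sizeof(int32_t);
	int ret;

	set = calloc(1, argsz);
	if (!set)
		return -1;

	set->argsz = argsz;
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = VFIO_PCI_MSIX_IRQ_INDEX;
	set->start = 0;		/* first vector in the range */
	set->count = count;	/* e.g. 8 for vectors 0 - 7 */
	memcpy(set->data, efds, count * sizeof(int32_t));

	ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
	free(set);
	return ret;
}
```

Before this series such a call fails when the range extends past the vectors already allocated; with dynamic MSI-X support the missing vectors are allocated instead, as described in the thread.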
Hi Alex,

On 3/16/2023 2:56 PM, Alex Williamson wrote:
> On Wed, 15 Mar 2023 13:59:20 -0700 Reinette Chatre
> <reinette.chatre@intel.com> wrote:
> ...
>> == Why is this an RFC ? ==
>>
>> vfio support for dynamic MSI-X needs to work with existing user
>> space as well as upcoming user space that takes advantage of this
>> feature. I would appreciate guidance on the expectations and
>> requirements surrounding error handling when considering existing
>> user space.
>>
>> For example, consider the following scenario:
>> Start: Consider a passthrough device that supports 10 MSI-X
>> interrupts (0 to 9) and existing Qemu allocated interrupts 0 to 4.
>>
>> Scenario: Qemu (hypothetically) attempts to associate a new action
>> to interrupts 0 to 7. Early checking of this range in
>> vfio_set_irqs_validate_and_prepare() will pass since it is a valid
>> range for the device. What happens after the early checking is
>> considered next:
>>
>> Current behavior (before this series): Since the provided range, 0
>> - 7, exceeds the allocated range, no action will be taken on
>> existing allocated interrupts 0 - 4 and the Qemu request will fail
>> without making any state changes.
>
> I must be missing something, it was described correctly earlier that
> QEMU will disable MSI-X and re-enable with a new vector count. Not
> only does QEMU not really have a way to fail this change, pretty
> much nothing would work if we did.

Thank you very much for confirming Qemu behavior. One of my goals is to
ensure that these kernel changes do not break existing user space in any
way. The only area I could find where existing user space may encounter
new behavior is if user space makes the (pre dynamic MSI-X) mistake of
attempting to associate triggers with interrupts that have not been
allocated. Thank you for confirming that this is not a valid scenario
for Qemu.

> What happens in this case is that the QEMU vfio-pci driver gets a
> vector_use callback for one of these new vectors {5,6,7},
> vfio_msix_vector_do_use() checks that's greater than we have
> enabled, disables MSI-X, and re-enables it for the new vector count.
> Worst case we'll do that 3 times if the vectors are presented in
> ascending order.

Indeed. While testing this work I modified Qemu's
vfio_msix_vector_do_use() to consider VFIO_IRQ_INFO_NORESIZE. I saw the
vector_use callback arriving for each new vector, and having Qemu just
associate a new action with the new vector (call
vfio_set_irq_signaling()) instead of disabling MSI-X followed by
enabling all vectors worked well with the changes I present here. A
single interrupt was dynamically allocated and enabled without impacting
the others.

> Based on above, there really can never be an error if we expect the
> device to work, so I think there's a misread of the current status.
> Dynamic MSI-X support should simply reduce the disruption and chance
> of lost interrupts at the device, but the points where we risk that
> the host cannot provide the configuration we need are the same.

Thank you very much, Alex. In this case, please do consider this
submission as a submission for inclusion. I'd be happy to resubmit
without the "RFC" prefix if that is preferred.

Reinette
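The VFIO_IRQ_INFO_NORESIZE check mentioned above can be illustrated against the vfio uAPI roughly as follows. This is only a sketch of the idea, not the actual QEMU modification (which goes through vfio_set_irq_signaling()); `device_fd` is an assumed, already-open vfio device file descriptor.

```c
#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdbool.h>

/*
 * Sketch only: query the MSI-X IRQ index and test VFIO_IRQ_INFO_NORESIZE.
 * A clear flag indicates the kernel supports dynamic MSI-X allocation.
 * device_fd is assumed to be an open vfio device file descriptor.
 */
static bool msix_supports_dynamic_alloc(int device_fd)
{
	struct vfio_irq_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_MSIX_IRQ_INDEX,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info))
		return false;	/* be conservative and assume legacy behavior */

	return !(info.flags & VFIO_IRQ_INFO_NORESIZE);
}
```

When this returns true, a newly unmasked vector can be enabled with a VFIO_DEVICE_SET_IRQS call restricted to that single vector (start = vector, count = 1), avoiding the disable/re-enable-all sequence described in the thread.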
> From: Chatre, Reinette <reinette.chatre@intel.com>
> Sent: Friday, March 17, 2023 7:38 AM
>
> > Based on above, there really can never be an error if we expect the
> > device to work, so I think there's a misread of the current status.
> > Dynamic MSI-X support should simply reduce the disruption and chance
> > of lost interrupts at the device, but the points where we risk that
> > the host cannot provide the configuration we need are the same.
>
> Thank you very much Alex. In this case, please do consider this
> submission as a submission for inclusion. I'd be happy to resubmit
> without the "RFC" prefix if that is preferred.
>

With that do we still want to keep the error behavior for MSI?

If no, patch5 can be simplified, e.g. no need of
vfio_irq_ctx_range_allocated() and MSI/MSI-X error behaviors become
consistent.
Hi Kevin,

On 3/16/2023 4:52 PM, Tian, Kevin wrote:
>> From: Chatre, Reinette <reinette.chatre@intel.com>
>> Sent: Friday, March 17, 2023 7:38 AM
>>
>>> Based on above, there really can never be an error if we expect the
>>> device to work, so I think there's a misread of the current status.
>>> Dynamic MSI-X support should simply reduce the disruption and chance
>>> of lost interrupts at the device, but the points where we risk that
>>> the host cannot provide the configuration we need are the same.
>>
>> Thank you very much Alex. In this case, please do consider this
>> submission as a submission for inclusion. I'd be happy to resubmit
>> without the "RFC" prefix if that is preferred.
>>
>
> With that do we still want to keep the error behavior for MSI?
>
> If no, patch5 can be simplified, e.g. no need of vfio_irq_ctx_range_allocated()
> and MSI/MSI-X error behaviors become consistent.

Thank you. Yes, if I understand correctly, MSI and MSI-X error handling
can become consistent. I'll modify patch 5 to remove
vfio_irq_ctx_range_allocated().

Reinette