Message ID | 20230308132903.465159-13-yi.l.liu@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | cover-letter: Add vfio_device cdev for iommufd support | expand |
> From: Yi Liu > Sent: Wednesday, March 8, 2023 9:29 PM > > This is another method to issue PCI hot reset for the users that bounds > device to a positive iommufd value. In such case, iommufd is a proof of > device ownership. By passing a zero-length fd array, user indicates kernel > to do ownership check with the bound iommufd. All the opened devices > within > the affected dev_set should have been bound to the same iommufd. This is > simpler and faster as user does not need to pass a set of fds and kernel > no need to search the device within the given fds. > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Yi Liu <yi.l.liu@intel.com> I think you also need a s-o-b from Jason since he wrote most of the code here. > +struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev) > +{ > + /* Only serve for physical device */ > + if (!vdev->iommufd_device) > + return NULL; pointless comment > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info { > * The ownership can be proved by: > * - An array of group fds > * - An array of device fds > + * - A zero-length array > + * > + * In the last case all affected devices which are opened by this user > + * must have been bound to a same iommufd_ctx. This approach is only > + * available for devices bound to positive iommufd. As we chatted before I still think the last sentence is pointless. If a device is bound to negative iommufd value (i.e. noiommu) it doesn't have a valid iommufd_ctx so won't meet "bound to a same iommufd_ctx".
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Friday, March 10, 2023 1:31 PM > > > From: Yi Liu > > Sent: Wednesday, March 8, 2023 9:29 PM > > > > This is another method to issue PCI hot reset for the users that bounds > > device to a positive iommufd value. In such case, iommufd is a proof of > > device ownership. By passing a zero-length fd array, user indicates kernel > > to do ownership check with the bound iommufd. All the opened devices > > within > > the affected dev_set should have been bound to the same iommufd. This > is > > simpler and faster as user does not need to pass a set of fds and kernel > > no need to search the device within the given fds. > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com> > > I think you also need a s-o-b from Jason since he wrote most of the > code here. Yes, it is. I'll add it if no objection from Jason. > > +struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device > *vdev) > > +{ > > + /* Only serve for physical device */ > > + if (!vdev->iommufd_device) > > + return NULL; > > pointless comment Will remove it. > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info { > > * The ownership can be proved by: > > * - An array of group fds > > * - An array of device fds > > + * - A zero-length array > > + * > > + * In the last case all affected devices which are opened by this user > > + * must have been bound to a same iommufd_ctx. This approach is only > > + * available for devices bound to positive iommufd. > > As we chatted before I still think the last sentence is pointless. If a device > is bound to negative iommufd value (i.e. noiommu) it doesn't have a > valid iommufd_ctx so won't meet "bound to a same iommufd_ctx". Yes, it is. But iommufd_ctx is more a kernel thing, userspace may just know whether it has bound a positive iommufd or a negative iommufd to the device. So positive iommufd may be more straightforward to userspace programmers.
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Friday, March 10, 2023 2:04 PM > > > + * > > > + * In the last case all affected devices which are opened by this user > > > + * must have been bound to a same iommufd_ctx. This approach is only > > > + * available for devices bound to positive iommufd. > > > > As we chatted before I still think the last sentence is pointless. If a device > > is bound to negative iommufd value (i.e. noiommu) it doesn't have a > > valid iommufd_ctx so won't meet "bound to a same iommufd_ctx". > > Yes, it is. But iommufd_ctx is more a kernel thing, userspace may just > know whether it has bound a positive iommufd or a negative iommufd > to the device. So positive iommufd may be more straightforward to > userspace programmers.
On Fri, Mar 10, 2023 at 06:04:02AM +0000, Liu, Yi L wrote: > > From: Tian, Kevin <kevin.tian@intel.com> > > Sent: Friday, March 10, 2023 1:31 PM > > > > > From: Yi Liu > > > Sent: Wednesday, March 8, 2023 9:29 PM > > > > > > This is another method to issue PCI hot reset for the users that bounds > > > device to a positive iommufd value. In such case, iommufd is a proof of > > > device ownership. By passing a zero-length fd array, user indicates kernel > > > to do ownership check with the bound iommufd. All the opened devices > > > within > > > the affected dev_set should have been bound to the same iommufd. This > > is > > > simpler and faster as user does not need to pass a set of fds and kernel > > > no need to search the device within the given fds. > > > > > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com> > > > Signed-off-by: Yi Liu <yi.l.liu@intel.com> > > > > I think you also need a s-o-b from Jason since he wrote most of the > > code here. > > Yes, it is. I'll add it if no objection from Jason. Go ahead Jason
On Wed, 8 Mar 2023 05:28:51 -0800 Yi Liu <yi.l.liu@intel.com> wrote: > This is another method to issue PCI hot reset for the users that bounds > device to a positive iommufd value. In such case, iommufd is a proof of > device ownership. By passing a zero-length fd array, user indicates kernel > to do ownership check with the bound iommufd. All the opened devices within > the affected dev_set should have been bound to the same iommufd. This is > simpler and faster as user does not need to pass a set of fds and kernel > no need to search the device within the given fds. Couldn't this same idea apply to containers? I'm afraid this proposal reduces or eliminates the handshake we have with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the _INFO ioctl altogether, resulting in drivers that don't understand the scope of the reset. Is it worth it? What do we really gain? > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index d80141969cd1..382d95455f89 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info { > * The ownership can be proved by: > * - An array of group fds > * - An array of device fds > + * - A zero-length array > + * > + * In the last case all affected devices which are opened by this user > + * must have been bound to a same iommufd_ctx. This approach is only > + * available for devices bound to positive iommufd. > * > * Return: 0 on success, -errno on failure. > */ There's no introspection that this feature is supported, is that why containers are not considered? ie. if the host supports vfio cdevs, it necessarily must support vfio-pci hot reset w/ a zero-length array? Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Thursday, March 16, 2023 6:53 AM > > On Wed, 8 Mar 2023 05:28:51 -0800 > Yi Liu <yi.l.liu@intel.com> wrote: > > > This is another method to issue PCI hot reset for the users that bounds > > device to a positive iommufd value. In such case, iommufd is a proof of > > device ownership. By passing a zero-length fd array, user indicates kernel > > to do ownership check with the bound iommufd. All the opened devices > within > > the affected dev_set should have been bound to the same iommufd. This is > > simpler and faster as user does not need to pass a set of fds and kernel > > no need to search the device within the given fds. > > Couldn't this same idea apply to containers? User is allowed to create multiple containers. Looks we don't have a way to check whether multiple containers belong to the same user today. > > I'm afraid this proposal reduces or eliminates the handshake we have > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the > _INFO ioctl altogether, resulting in drivers that don't understand the > scope of the reset. Is it worth it? What do we really gain? Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually useful today. It's an interface on opened device. So the tiny difference is whether the user knows the device is resettable when calling GET_INFO or later when actually calling PCI_HOT_RESET. and with this series we also allow reset on affected devices which are not opened. Such dynamic cannot be reflected in static GET_INFO. More suitable a try-and-fail style. > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > index d80141969cd1..382d95455f89 100644 > > --- a/include/uapi/linux/vfio.h > > +++ b/include/uapi/linux/vfio.h > > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info { > > * The ownership can be proved by: > > * - An array of group fds > > * - An array of device fds > > + * - A zero-length array > > + * > > + * In the last case all affected devices which are opened by this user > > + * must have been bound to a same iommufd_ctx. This approach is only > > + * available for devices bound to positive iommufd. > > * > > * Return: 0 on success, -errno on failure. > > */ > > There's no introspection that this feature is supported, is that why > containers are not considered? ie. if the host supports vfio cdevs, it > necessarily must support vfio-pci hot reset w/ a zero-length array? > Thanks, > yes. It's more for users who knows that iommufd is used.
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Thursday, March 16, 2023 7:31 AM > > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Thursday, March 16, 2023 6:53 AM > > > > On Wed, 8 Mar 2023 05:28:51 -0800 > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > This is another method to issue PCI hot reset for the users that bounds > > > device to a positive iommufd value. In such case, iommufd is a proof of > > > device ownership. By passing a zero-length fd array, user indicates > kernel > > > to do ownership check with the bound iommufd. All the opened devices > > within > > > the affected dev_set should have been bound to the same iommufd. > This is > > > simpler and faster as user does not need to pass a set of fds and kernel > > > no need to search the device within the given fds. > > > > Couldn't this same idea apply to containers? > > User is allowed to create multiple containers. Looks we don't have a way > to check whether multiple containers belong to the same user today. Hi Kevin, This reminds me. In the compat mode, container fd is actually iommufd. If the compat mode passes a zeror-length array to do reset, it is possible that the opened devices in this affected dev_set may be set to different containers (a.k.a. iommufd_ctx). This would break what we defined in uapi. So a better description is users that use cdev can use this zero-length approach. And also, in kernel, we need to check if this approach is abused by the compat mode. > > > > I'm afraid this proposal reduces or eliminates the handshake we have > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore > the > > _INFO ioctl altogether, resulting in drivers that don't understand the > > scope of the reset. Is it worth it? What do we really gain? > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually > useful today. > > It's an interface on opened device. So the tiny difference is whether the > user knows the device is resettable when calling GET_INFO or later when > actually calling PCI_HOT_RESET. > > and with this series we also allow reset on affected devices which are not > opened. Such dynamic cannot be reflected in static GET_INFO. More > suitable a try-and-fail style. Got the usage of zero-length, > > > > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > > > index d80141969cd1..382d95455f89 100644 > > > --- a/include/uapi/linux/vfio.h > > > +++ b/include/uapi/linux/vfio.h > > > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info { > > > * The ownership can be proved by: > > > * - An array of group fds > > > * - An array of device fds > > > + * - A zero-length array > > > + * > > > + * In the last case all affected devices which are opened by this user > > > + * must have been bound to a same iommufd_ctx. This approach is > only > > > + * available for devices bound to positive iommufd. > > > * > > > * Return: 0 on success, -errno on failure. > > > */ > > > > There's no introspection that this feature is supported, is that why > > containers are not considered? ie. if the host supports vfio cdevs, it > > necessarily must support vfio-pci hot reset w/ a zero-length array? > > Thanks, > > > > yes. It's more for users who knows that iommufd is used. Needs to be more accurate. Only users that uses cdev.
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Thursday, March 16, 2023 11:55 AM > > > From: Tian, Kevin <kevin.tian@intel.com> > > Sent: Thursday, March 16, 2023 7:31 AM > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Thursday, March 16, 2023 6:53 AM > > > > > > On Wed, 8 Mar 2023 05:28:51 -0800 > > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > > > This is another method to issue PCI hot reset for the users that bounds > > > > device to a positive iommufd value. In such case, iommufd is a proof of > > > > device ownership. By passing a zero-length fd array, user indicates > > kernel > > > > to do ownership check with the bound iommufd. All the opened devices > > > within > > > > the affected dev_set should have been bound to the same iommufd. > > This is > > > > simpler and faster as user does not need to pass a set of fds and kernel > > > > no need to search the device within the given fds. > > > > > > Couldn't this same idea apply to containers? > > > > User is allowed to create multiple containers. Looks we don't have a way > > to check whether multiple containers belong to the same user today. > > Hi Kevin, > > This reminds me. In the compat mode, container fd is actually iommufd. > If the compat mode passes a zeror-length array to do reset, it is possible > that the opened devices in this affected dev_set may be set to different > containers (a.k.a. iommufd_ctx). This would break what we defined in > uapi. So a better description is users that use cdev can use this zero-length > approach. And also, in kernel, we need to check if this approach is abused > by the compat mode. > In normal case legacy application uses group fd array and new application with cdev uses zero-length approach. In rare case an application which opens /dev/iommu but opts to use it as a container in compat mode can also use zero-length array to reset if all devices are attached to a single container (internally to a same iommufd_ctx). It's still kind of matching uAPI description. I'm not sure whether we want to add explicit check to prevent it. Of course if affected devices span multiple compat iommufd's then it will fail. The open Alex raised is whether we want to further extend it to legacy container if all affected devices are in one container. But I hesitate to do so since iommufd is the future and if an application can be rewritten to utilize zero-length reset then it probably should explicitly embrace iommufd instead. Anyway let's not wait here. Send your v7 and we can have more focused discussion in your split series about hot reset. Thanks Kevin
> From: Tian, Kevin <kevin.tian@intel.com> > Sent: Thursday, March 16, 2023 2:10 PM > > > From: Liu, Yi L <yi.l.liu@intel.com> > > Sent: Thursday, March 16, 2023 11:55 AM > > > > > From: Tian, Kevin <kevin.tian@intel.com> > > > Sent: Thursday, March 16, 2023 7:31 AM > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Thursday, March 16, 2023 6:53 AM > > > > > > > > On Wed, 8 Mar 2023 05:28:51 -0800 > > > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > > > > > This is another method to issue PCI hot reset for the users that > bounds > > > > > device to a positive iommufd value. In such case, iommufd is a proof > of > > > > > device ownership. By passing a zero-length fd array, user indicates > > > kernel > > > > > to do ownership check with the bound iommufd. All the opened > devices > > > > within > > > > > the affected dev_set should have been bound to the same iommufd. > > > This is > > > > > simpler and faster as user does not need to pass a set of fds and > kernel > > > > > no need to search the device within the given fds. > > > > > > > > Couldn't this same idea apply to containers? > > > > > > User is allowed to create multiple containers. Looks we don't have a way > > > to check whether multiple containers belong to the same user today. > > > > Hi Kevin, > > > > This reminds me. In the compat mode, container fd is actually iommufd. > > If the compat mode passes a zeror-length array to do reset, it is possible > > that the opened devices in this affected dev_set may be set to different > > containers (a.k.a. iommufd_ctx). This would break what we defined in > > uapi. So a better description is users that use cdev can use this zero-length > > approach. And also, in kernel, we need to check if this approach is abused > > by the compat mode. > > > > In normal case legacy application uses group fd array and new application > with cdev uses zero-length approach. > > In rare case an application which opens /dev/iommu but opts to use it > as a container in compat mode can also use zero-length array to reset > if all devices are attached to a single container (internally to a same > iommufd_ctx). It's still kind of matching uAPI description. > > I'm not sure whether we want to add explicit check to prevent it. > > Of course if affected devices span multiple compat iommufd's then > it will fail. Yes. this failure matches the uapi description. And it is rare case, mostly likely applications should be explicitly updated to use cdev and then use this zero-length approach. Before that, the legacy applications do not know it at all. Even it uses this approach by mistake, it will fail in the multiple compat iommufd case. So maybe no need to limit it. > The open Alex raised is whether we want to further extend it to > legacy container if all affected devices are in one container. But > I hesitate to do so since iommufd is the future and if an application > can be rewritten to utilize zero-length reset then it probably > should explicitly embrace iommufd instead. For this, I agree with you. > > Anyway let's not wait here. Send your v7 and we can have more > focused discussion in your split series about hot reset. Sure. Once Nicolin's patch is updated, I can send v7 with the hot reset series as well. Regards, Yi Liu
On Thu, Mar 16, 2023 at 06:28:28AM +0000, Liu, Yi L wrote: > > Anyway let's not wait here. Send your v7 and we can have more > > focused discussion in your split series about hot reset. > > Sure. Once Nicolin's patch is updated, I can send v7 with the hot > reset series as well. I've updated three commits and pushed here: https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-03152023 Please pull the following commit to the emulated series: "iommufd: Create access in vfio_iommufd_emulated_bind()" https://github.com/nicolinc/iommufd/commit/6467e332584de62d1c4d5daab404a8c8d5a90a2d Please pull the following commit to the cdev series or a place that you feel it'd be better -- it's required by the change of adding vfio_iommufd_emulated_detach_ioas(): "iommufd/device: Add iommufd_access_detach() API" https://github.com/nicolinc/iommufd/commit/86346b5d06100640037cbb4a14bd249476072dec The other one adding replace() will go with the replace series. And regarding the new baseline for the replace series and the nesting series, it'd be nicer to have another one git-merging your cdev v7 branch on top of Jason's iommufd_hwpt branch. We could wait for him updating to 6.3-rc2, if that's necessary. Thanks Nic
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Thursday, March 16, 2023 2:49 PM > > On Thu, Mar 16, 2023 at 06:28:28AM +0000, Liu, Yi L wrote: > > > > Anyway let's not wait here. Send your v7 and we can have more > > > focused discussion in your split series about hot reset. > > > > Sure. Once Nicolin's patch is updated, I can send v7 with the hot > > reset series as well. > > I've updated three commits and pushed here: > https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting- > 03152023 > > Please pull the following commit to the emulated series: > "iommufd: Create access in vfio_iommufd_emulated_bind()" > > https://github.com/nicolinc/iommufd/commit/6467e332584de62d1c4d5daa > b404a8c8d5a90a2d > > Please pull the following commit to the cdev series or a place > that you feel it'd be better -- it's required by the change of > adding vfio_iommufd_emulated_detach_ioas(): > "iommufd/device: Add iommufd_access_detach() API" > > https://github.com/nicolinc/iommufd/commit/86346b5d06100640037cbb4a > 14bd249476072dec Thanks, I've taken them. v7 was sent out. > The other one adding replace() will go with the replace series. > > And regarding the new baseline for the replace series and the > nesting series, it'd be nicer to have another one git-merging > your cdev v7 branch on top of Jason's iommufd_hwpt branch. We > could wait for him updating to 6.3-rc2, if that's necessary. Yes. I cherry-pick his iommufd_hwpt to 6.3-rc2 and then try a merge and then cherry-pick the replace and nesting series from your above branch. Though the order between cdev and iommufd_hwpt not perfect, we may use it as a wip baseline when we try to address the comments w.r.t. nesting and replace series. https://github.com/yiliu1765/iommufd/tree/wip/iommufd_nesting-03162023 Regards, Yi Liu
On Wed, 15 Mar 2023 23:31:23 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Thursday, March 16, 2023 6:53 AM > > > > On Wed, 8 Mar 2023 05:28:51 -0800 > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > This is another method to issue PCI hot reset for the users that bounds > > > device to a positive iommufd value. In such case, iommufd is a proof of > > > device ownership. By passing a zero-length fd array, user indicates kernel > > > to do ownership check with the bound iommufd. All the opened devices > > within > > > the affected dev_set should have been bound to the same iommufd. This is > > > simpler and faster as user does not need to pass a set of fds and kernel > > > no need to search the device within the given fds. > > > > Couldn't this same idea apply to containers? > > User is allowed to create multiple containers. Looks we don't have a way > to check whether multiple containers belong to the same user today. No, but a common configuration is that all devices are in the same container, ie. no-vIOMMU, and it's also rather common that when we have multi-function devices, all functions are within the same IOMMU group and therefore necessarily require the same address space and therefore container. > > I'm afraid this proposal reduces or eliminates the handshake we have > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the > > _INFO ioctl altogether, resulting in drivers that don't understand the > > scope of the reset. Is it worth it? What do we really gain? > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually > useful today. > > It's an interface on opened device. So the tiny difference is whether the > user knows the device is resettable when calling GET_INFO or later when > actually calling PCI_HOT_RESET. No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET can be performed, but equally important the scope of the reset, ie. which devices are affected by the reset. If we de-emphasize the INFO portion, then this easily gets confused as just a variant of VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset. In fact, I'd say the interface is not only trying to validate that the user has sufficient privileges for the reset, but that they explicitly acknowledge the scope of the reset. > and with this series we also allow reset on affected devices which are not > opened. Such dynamic cannot be reflected in static GET_INFO. More > suitable a try-and-fail style. Resets have side-effects, obviously, so this isn't the sort of thing we can simply ask the user to probe for. I agree that dynamics can change, the GET_PCI_HOT_RESET_INFO is a point in time, isolated functions on the same bus can change ownership. However, in practice, and in its primary use case with GPUs without isolation, it's sufficiently static. So I think this is a mischaracterization. Thanks, Alex
On Thu, Mar 16, 2023 at 01:22:58PM +0000, Liu, Yi L wrote: > > And regarding the new baseline for the replace series and the > > nesting series, it'd be nicer to have another one git-merging > > your cdev v7 branch on top of Jason's iommufd_hwpt branch. We > > could wait for him updating to 6.3-rc2, if that's necessary. > > Yes. I cherry-pick his iommufd_hwpt to 6.3-rc2 and then try a > merge and then cherry-pick the replace and nesting series from > your above branch. Though the order between cdev and > iommufd_hwpt not perfect, we may use it as a wip baseline > when we try to address the comments w.r.t. nesting and > replace series. > > https://github.com/yiliu1765/iommufd/tree/wip/iommufd_nesting-03162023 Nice. It looks like you integrated everything from my tree so it saves me some effort :) Regarding the order between cdev and iommufd_hwpt, I think it would be Jason's decision whether to merge his changes prior to the PR from the VFIO tree, or the other way around. One way or another, I think the replace v5 and the nesting v2 will be less impacted, unless Jason makes some huge changes to his branch. Let's use this tree this week to rework both series, and rebase after he comes back and updates his tree. Lemme know if you need a help for the nesting series or so. Thanks Nic
> From: Alex Williamson > Sent: Friday, March 17, 2023 2:46 AM > > On Wed, 15 Mar 2023 23:31:23 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Thursday, March 16, 2023 6:53 AM > > > I'm afraid this proposal reduces or eliminates the handshake we have > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore > the > > > _INFO ioctl altogether, resulting in drivers that don't understand the > > > scope of the reset. Is it worth it? What do we really gain? > > > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually > > useful today. > > > > It's an interface on opened device. So the tiny difference is whether the > > user knows the device is resettable when calling GET_INFO or later when > > actually calling PCI_HOT_RESET. > > No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET > can > be performed, but equally important the scope of the reset, ie. which > devices are affected by the reset. If we de-emphasize the INFO > portion, then this easily gets confused as just a variant of > VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset. In > fact, I'd say the interface is not only trying to validate that the > user has sufficient privileges for the reset, but that they explicitly > acknowledge the scope of the reset. > IMHO the usefulness of scope is if it's discoverable by the management stack which then can try to assign devices with affected reset to a same user. but this info is only available after the device is opened. Then the mgmt. stack just assigns devices w/o awareness of reset scope and nothing can be changed by the user no matter it knows the scope or not. from this angle I don't see a value of probe-and-reset vs. direct reset when reset itself also takes the scope into consideration. Probably the slight difference is that with probe a more informative error message can be printed out?
On Thu, 16 Mar 2023 23:29:21 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson > > Sent: Friday, March 17, 2023 2:46 AM > > > > On Wed, 15 Mar 2023 23:31:23 +0000 > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Thursday, March 16, 2023 6:53 AM > > > > I'm afraid this proposal reduces or eliminates the handshake we have > > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and > > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore > > the > > > > _INFO ioctl altogether, resulting in drivers that don't understand the > > > > scope of the reset. Is it worth it? What do we really gain? > > > > > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually > > > useful today. > > > > > > It's an interface on opened device. So the tiny difference is whether the > > > user knows the device is resettable when calling GET_INFO or later when > > > actually calling PCI_HOT_RESET. > > > > No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET > > can > > be performed, but equally important the scope of the reset, ie. which > > devices are affected by the reset. If we de-emphasize the INFO > > portion, then this easily gets confused as just a variant of > > VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset. In > > fact, I'd say the interface is not only trying to validate that the > > user has sufficient privileges for the reset, but that they explicitly > > acknowledge the scope of the reset. > > > > IMHO the usefulness of scope is if it's discoverable by the management > stack which then can try to assign devices with affected reset to a same > user. Disagree, the user needs to know the scope of reset. Take for instance two function of a device configured onto separate buses within a VM. The VMM needs to know that a hot-reset of one will reset the other. That's not obvious to the VMM without some understanding of PCI/e topology and analysis of the host system. The info ioctl simplifies that discovery for the VMM and the handshake of passing the affected groups makes sure that the info ioctl remains relevant. OTOH, I really haven't seen any evidence that the null-array concept provides significant simplification for userspace, especially without compromising the user's understanding of the scope of the provided reset. Thanks, Alex
> From: Alex Williamson > Sent: Friday, March 17, 2023 8:23 AM > > On Thu, 16 Mar 2023 23:29:21 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Alex Williamson > > > Sent: Friday, March 17, 2023 2:46 AM > > > > > > On Wed, 15 Mar 2023 23:31:23 +0000 > > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > Sent: Thursday, March 16, 2023 6:53 AM > > > > > I'm afraid this proposal reduces or eliminates the handshake we have > > > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO > and > > > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to > ignore > > > the > > > > > _INFO ioctl altogether, resulting in drivers that don't understand the > > > > > scope of the reset. Is it worth it? What do we really gain? > > > > > > > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually > > > > useful today. > > > > > > > > It's an interface on opened device. So the tiny difference is whether the > > > > user knows the device is resettable when calling GET_INFO or later > when > > > > actually calling PCI_HOT_RESET. > > > > > > No, GET_PCI_HOT_RESET_INFO conveys not only whether a > PCI_HOT_RESET > > > can > > > be performed, but equally important the scope of the reset, ie. which > > > devices are affected by the reset. If we de-emphasize the INFO > > > portion, then this easily gets confused as just a variant of > > > VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset. In > > > fact, I'd say the interface is not only trying to validate that the > > > user has sufficient privileges for the reset, but that they explicitly > > > acknowledge the scope of the reset. > > > > > > > IMHO the usefulness of scope is if it's discoverable by the management > > stack which then can try to assign devices with affected reset to a same > > user. > > Disagree, the user needs to know the scope of reset. Take for instance > two function of a device configured onto separate buses within a VM. > The VMM needs to know that a hot-reset of one will reset the other. > That's not obvious to the VMM without some understanding of PCI/e > topology and analysis of the host system. The info ioctl simplifies > that discovery for the VMM and the handshake of passing the affected > groups makes sure that the info ioctl remains relevant. If that is the intended usage then I don't see why this proposal will promote userspace to ignore the _INFO ioctl. It should be always queried no matter how the reset ioctl itself is designed. The motivation of calling _INFO is not from the reset ioctl asking for an array of fds. > > OTOH, I really haven't seen any evidence that the null-array concept > provides significant simplification for userspace, especially without > compromising the user's understanding of the scope of the provided > reset. Thanks, > I'll let Jason to further comment after he is back. The bottom line, if this cannot be converged in short time, is to move it out of the preparatory series for cdev. There is no reason to block cdev just for this open. Anyway we'll allow using device fd array for cdev so there is no function gap.
On Fri, 17 Mar 2023 00:57:23 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson > > Sent: Friday, March 17, 2023 8:23 AM > > > > On Thu, 16 Mar 2023 23:29:21 +0000 > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > From: Alex Williamson > > > > Sent: Friday, March 17, 2023 2:46 AM > > > > > > > > On Wed, 15 Mar 2023 23:31:23 +0000 > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > > > Sent: Thursday, March 16, 2023 6:53 AM > > > > > > I'm afraid this proposal reduces or eliminates the handshake we have > > > > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO > > and > > > > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to > > ignore > > > > the > > > > > > _INFO ioctl altogether, resulting in drivers that don't understand the > > > > > > scope of the reset. Is it worth it? What do we really gain? > > > > > > > > > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually > > > > > useful today. > > > > > > > > > > It's an interface on opened device. So the tiny difference is whether the > > > > > user knows the device is resettable when calling GET_INFO or later > > when > > > > > actually calling PCI_HOT_RESET. > > > > > > > > No, GET_PCI_HOT_RESET_INFO conveys not only whether a > > PCI_HOT_RESET > > > > can > > > > be performed, but equally important the scope of the reset, ie. which > > > > devices are affected by the reset. If we de-emphasize the INFO > > > > portion, then this easily gets confused as just a variant of > > > > VFIO_DEVICE_RESET, which is explicitly a device-level cscope reset. In > > > > fact, I'd say the interface is not only trying to validate that the > > > > user has sufficient privileges for the reset, but that they explicitly > > > > acknowledge the scope of the reset. > > > > > > > > > > IMHO the usefulness of scope is if it's discoverable by the management > > > stack which then can try to assign devices with affected reset to a same > > > user. > > > > Disagree, the user needs to know the scope of reset. Take for instance > > two function of a device configured onto separate buses within a VM. > > The VMM needs to know that a hot-reset of one will reset the other. > > That's not obvious to the VMM without some understanding of PCI/e > > topology and analysis of the host system. The info ioctl simplifies > > that discovery for the VMM and the handshake of passing the affected > > groups makes sure that the info ioctl remains relevant. > > If that is the intended usage then I don't see why this proposal will > promote userspace to ignore the _INFO ioctl. It should be always > queried no matter how the reset ioctl itself is designed. The motivation > of calling _INFO is not from the reset ioctl asking for an array of fds. The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev) fds that encompass the set of affected devices reported by the VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the last sentence above. This proposal seems to be based on some confusion about the difference between VFIO_DEVICE_RESET and VFIO_DEVICE_PCI_HOT_RESET, and therefore IMO, proliferates that confusion by making the scope argument optional, which I see as a key difference. This therefore makes the behavior of the ioctl less intuitive, easier to get wrong, and I expect we'll see users unitentionally resetting devices beyond the desired scope if the group/device fd array is allowed to be empty. Thanks, Alex
On Fri, Mar 17, 2023 at 09:15:57AM -0600, Alex Williamson wrote: > > If that is the intended usage then I don't see why this proposal will > > promote userspace to ignore the _INFO ioctl. It should be always > > queried no matter how the reset ioctl itself is designed. The motivation > > of calling _INFO is not from the reset ioctl asking for an array of fds. > > The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev) > fds that encompass the set of affected devices reported by the > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the > last sentence above. There are two things going on - VFIO_DEVICE_PCI_HOT_RESET requires to prove security that the userspace is not attempting to reset something that it does not have ownership over. Eg a reset group that spans multiple iommu groups. The second is for userspace to discover the reset group so it can understand what is happening. IMHO it is perfectly fine for each API to be only concerned with its own purpose. VFIO_DEVICE_PCI_HOT_RESET needs to check security, which the iommufd_ctx check does just fine VFIO_DEVICE_GET_PCI_HOT_RESET_INFO needs to convey the reset group span so userspace can do something with this. I think confusing security and scope and "acknowledgment" is not a good idea. The APIs are well defined and userspace can always use them wrong. It doesn't need to call RESET_INFO even today, it can just trivially pass every group FD it owns to meet the security check. It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security check without code marshalling fds, which is why we went this direction. Jason
On Mon, 20 Mar 2023 14:14:48 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Fri, Mar 17, 2023 at 09:15:57AM -0600, Alex Williamson wrote: > > > If that is the intended usage then I don't see why this proposal will > > > promote userspace to ignore the _INFO ioctl. It should be always > > > queried no matter how the reset ioctl itself is designed. The motivation > > > of calling _INFO is not from the reset ioctl asking for an array of fds. > > > > The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev) > > fds that encompass the set of affected devices reported by the > > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the > > last sentence above. > > There are two things going on - VFIO_DEVICE_PCI_HOT_RESET requires to > prove security that the userspace is not attempting to reset something > that it does not have ownership over. Eg a reset group that spans > multiple iommu groups. > > The second is for userspace to discover the reset group so it can > understand what is happening. > > IMHO it is perfectly fine for each API to be only concerned with its > own purpose. > > VFIO_DEVICE_PCI_HOT_RESET needs to check security, which the > iommufd_ctx check does just fine > > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO needs to convey the reset group > span so userspace can do something with this. > > I think confusing security and scope and "acknowledgment" is not a > good idea. > > The APIs are well defined and userspace can always use them wrong. It > doesn't need to call RESET_INFO even today, it can just trivially pass > every group FD it owns to meet the security check. That's not actually true, in order to avoid arbitrarily large buffers from the user, the ioctl won't accept an array greater than the number of devices affected by the reset. > It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security > check without code marshalling fds, which is why we went this > direction. I agree that nullifying the arg makes the ioctl easier to use, but my hesitation is whether it makes it more difficult to use correctly, which includes resetting devices unexpectedly. We're talking about something that's a relatively rare event, so I don't see that time overhead is a factor, nor has the complexity overhead in the QEMU implementation ever been raised as an issue previously. We can always blame the developer for using an interface incorrectly, but if we make it easier to use incorrectly in order to optimize something that doesn't need to be optimized, does that make it a good choice for the uAPI? Thanks, Alex
On Mon, Mar 20, 2023 at 04:52:17PM -0600, Alex Williamson wrote: > > The APIs are well defined and userspace can always use them wrong. It > > doesn't need to call RESET_INFO even today, it can just trivially pass > > every group FD it owns to meet the security check. > > That's not actually true, in order to avoid arbitrarily large buffers > from the user, the ioctl won't accept an array greater than the number > of devices affected by the reset. Oh yuk! > > It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security > > check without code marshalling fds, which is why we went this > > direction. > > I agree that nullifying the arg makes the ioctl easier to use, but my > hesitation is whether it makes it more difficult to use correctly, > which includes resetting devices unexpectedly. I don't think it makes it harder to use correctly. It maybe makes it easier to misuse, but IMHO not too much. If the desire was to have an API that explicitly acknowledged the reset scope then it should have taken in a list of device FDs and optimally reset all of them or fail EPERM. What is going to make this hard to use is the _INFO IOCTL, it returns basically the BDF string, but I think we effectively get rid of this in the new model. libvirt will know the BDF and open the cdev, then fd pass the cdev to qemu. Qemu shouldn't also have to know the sysfs path.. So we really want a new _INFO ioctl to make this easier to use.. > We can always blame the developer for using an interface incorrectly, > but if we make it easier to use incorrectly in order to optimize > something that doesn't need to be optimized, does that make it a good > choice for the uAPI? IMHO the API is designed around a security proof. Present some groups and a subset of devices in those groups will be reset. You can't know the subset unless you do the _INFO thing. If we wanted it to be clearly linked to scope it should have taken in a list of device FDs, and reset those devices FDs optimally or returned -EPERM. Then the reset scope is very clearly connected to the API. Jason
On Mon, 20 Mar 2023 20:39:07 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, Mar 20, 2023 at 04:52:17PM -0600, Alex Williamson wrote: > > > > The APIs are well defined and userspace can always use them wrong. It > > > doesn't need to call RESET_INFO even today, it can just trivially pass > > > every group FD it owns to meet the security check. > > > > That's not actually true, in order to avoid arbitrarily large buffers > > from the user, the ioctl won't accept an array greater than the number > > of devices affected by the reset. > > Oh yuk! > > > > It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security > > > check without code marshalling fds, which is why we went this > > > direction. > > > > I agree that nullifying the arg makes the ioctl easier to use, but my > > hesitation is whether it makes it more difficult to use correctly, > > which includes resetting devices unexpectedly. > > I don't think it makes it harder to use correctly. It maybe makes it > easier to misuse, but IMHO not too much. > > If the desire was to have an API that explicitly acknowledged the > reset scope then it should have taken in a list of device FDs and > optimally reset all of them or fail EPERM. > > What is going to make this hard to use is the _INFO IOCTL, it returns > basically the BDF string, but I think we effectively get rid of this > in the new model. libvirt will know the BDF and open the cdev, then fd > pass the cdev to qemu. Qemu shouldn't also have to know the sysfs > path.. > > So we really want a new _INFO ioctl to make this easier to use.. I think this makes it even worse. If userspace cannot match BDFs from the _INFO ioctl to devices files, for proof of ownership or scope, then the answer is clearly not "get rid of the device files". > > We can always blame the developer for using an interface incorrectly, > > but if we make it easier to use incorrectly in order to optimize > > something that doesn't need to be optimized, does that make it a good > > choice for the uAPI? > > IMHO the API is designed around a security proof. Present some groups > and a subset of devices in those groups will be reset. You can't know > the subset unless you do the _INFO thing. > > If we wanted it to be clearly linked to scope it should have taken in > a list of device FDs, and reset those devices FDs optimally or > returned -EPERM. Then the reset scope is very clearly connected to the > API. This just seems like nit-picking that the API could have accomplished this more concisely. Probably that's true, but I think you've identified a gap above that amplifies the issue. If the user cannot map BDFs to cdevs because the cdevs are passed as open fds to the user driver, the _INFO results become meaningless and by removing the fds array, that becomes the obvious choice that a user presented with this dilemma would take. We're skipping past easier to misuse, difficult to use correctly, and circling around no obvious way to use correctly. Unfortunately the _INFO ioctl does presume that userspace knows the BDF to device mappings today, so if we are attempting to pre-enable a case with cdev support where that is not the case, then there must be something done with the _INFO ioctl to provide scope. Thanks, Alex
On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote: > This just seems like nit-picking that the API could have accomplished > this more concisely. Probably that's true, but I think you've > identified a gap above that amplifies the issue. If the user cannot > map BDFs to cdevs because the cdevs are passed as open fds to the user > driver, the _INFO results become meaningless and by removing the fds > array, that becomes the obvious choice that a user presented with this > dilemma would take. We're skipping past easier to misuse, difficult to > use correctly, and circling around no obvious way to use correctly. No - this just isn't finished yet is all it means :( I just noticed it just now, presumably Eric would have discovered this when he tried to implement the FD pass and we would have made a new _INFO at that point (or more ugly, have libvirt pass the BDF along with the FD). > Unfortunately the _INFO ioctl does presume that userspace knows the BDF > to device mappings today, so if we are attempting to pre-enable a case > with cdev support where that is not the case, then there must be > something done with the _INFO ioctl to provide scope. Yes, something is required with _INFO before libvirt can use a FD pass. I'm thinking of a new _INFO query that returns the iommufd dev_ids for the reset group. Then qemu can match the dev_ids back to cdev FDs and thus vPCI devices and do what it needs to do. But for the current qemu setup it will open cdev directly and it will know the BDF so it can still use the current _INFO. Though it would be nice if qemu didn't need two implementations so Yi I'd rather see a new info in this series as well and qemu can just consistently use dev_id and never bdf in iommufd mode. Anyhow, I don't see the two topics as really related, the intention is not to discourage people from calling _INFO, it just to make the security proof simpler and more logical. Jason
On Tue, 21 Mar 2023 17:50:08 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote: > > > This just seems like nit-picking that the API could have accomplished > > this more concisely. Probably that's true, but I think you've > > identified a gap above that amplifies the issue. If the user cannot > > map BDFs to cdevs because the cdevs are passed as open fds to the user > > driver, the _INFO results become meaningless and by removing the fds > > array, that becomes the obvious choice that a user presented with this > > dilemma would take. We're skipping past easier to misuse, difficult to > > use correctly, and circling around no obvious way to use correctly. > > No - this just isn't finished yet is all it means :( > > I just noticed it just now, presumably Eric would have discovered this > when he tried to implement the FD pass and we would have made a new > _INFO at that point (or more ugly, have libvirt pass the BDF along > with the FD). > > > Unfortunately the _INFO ioctl does presume that userspace knows the BDF > > to device mappings today, so if we are attempting to pre-enable a case > > with cdev support where that is not the case, then there must be > > something done with the _INFO ioctl to provide scope. > > Yes, something is required with _INFO before libvirt can use a FD > pass. I'm thinking of a new _INFO query that returns the iommufd > dev_ids for the reset group. Then qemu can match the dev_ids back to > cdev FDs and thus vPCI devices and do what it needs to do. > > But for the current qemu setup it will open cdev directly and it will > know the BDF so it can still use the current _INFO. > > Though it would be nice if qemu didn't need two implementations so Yi > I'd rather see a new info in this series as well and qemu can just > consistently use dev_id and never bdf in iommufd mode. We also need to consider how libvirt determines if QEMU has the kernel support it needs to pass file descriptors. It'd be a lot cleaner if this aligned with the introduction of vfio cdevs. > Anyhow, I don't see the two topics as really related, the intention is > not to discourage people from calling _INFO, it just to make the > security proof simpler and more logical. At a minimum, we need a new _INFO ioctl to get back to the point where it's only a discussion of whether we're checking the user on scope. We can't remove the array while doing so opens up an obviously incorrect solution to an impossible to use API. Thanks, Alex
On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote: > > Though it would be nice if qemu didn't need two implementations so Yi > > I'd rather see a new info in this series as well and qemu can just > > consistently use dev_id and never bdf in iommufd mode. > > We also need to consider how libvirt determines if QEMU has the kernel > support it needs to pass file descriptors. It'd be a lot cleaner if > this aligned with the introduction of vfio cdevs. Yes, that would be much better if it was one package. But this is starting to grow and we have so many threads that need to progress blocked on this cdev enablement :( Could we go forward with the cdev main patches and kconfig it to experimental or something while the rest of the parts are completed and tested through qemu? ie move the vfio-pci reset enablment to after the cdev patches? Jason
On Tue, 21 Mar 2023 19:20:37 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote: > > > > Though it would be nice if qemu didn't need two implementations so Yi > > > I'd rather see a new info in this series as well and qemu can just > > > consistently use dev_id and never bdf in iommufd mode. > > > > We also need to consider how libvirt determines if QEMU has the kernel > > support it needs to pass file descriptors. It'd be a lot cleaner if > > this aligned with the introduction of vfio cdevs. > > Yes, that would be much better if it was one package. > > But this is starting to grow and we have so many threads that need to > progress blocked on this cdev enablement :( > > Could we go forward with the cdev main patches and kconfig it to > experimental or something while the rest of the parts are completed > and tested through qemu? ie move the vfio-pci reset enablment to after > the cdev patches? We need to be able to guarantee that there cannot be any significant builds of the kernel with vfio cdev support if our intention is to stage it for libvirt. We don't have a global EXPERIMENTAL config option any more. Adding new code under BROKEN seems wrong. Fedora ships with STAGING enabled. A sternly worded Kconfig entry is toothless. What is the proposed mechanism to make this not look like a big uncompiled code dump? Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, March 22, 2023 6:48 AM > > On Tue, 21 Mar 2023 19:20:37 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote: > > > > > > Though it would be nice if qemu didn't need two implementations so Yi > > > > I'd rather see a new info in this series as well and qemu can just > > > > consistently use dev_id and never bdf in iommufd mode. > > > > > > We also need to consider how libvirt determines if QEMU has the kernel > > > support it needs to pass file descriptors. It'd be a lot cleaner if > > > this aligned with the introduction of vfio cdevs. > > > > Yes, that would be much better if it was one package. > > > > But this is starting to grow and we have so many threads that need to > > progress blocked on this cdev enablement :( > > > > Could we go forward with the cdev main patches and kconfig it to > > experimental or something while the rest of the parts are completed > > and tested through qemu? ie move the vfio-pci reset enablment to after > > the cdev patches? > > We need to be able to guarantee that there cannot be any significant > builds of the kernel with vfio cdev support if our intention is to stage > it for libvirt. We don't have a global EXPERIMENTAL config option any > more. Adding new code under BROKEN seems wrong. Fedora ships with > STAGING enabled. A sternly worded Kconfig entry is toothless. What is > the proposed mechanism to make this not look like a big uncompiled code > dump? Thanks, Just out of curious, is the BDF mapping gap only for cdev or it also exists in the traditional group path? IMHO, if it is only a gap for cdev, maybe we can use CONFIG_VFIO_DEVICE_CDEV to stage it. This kconfig is N by default. I think it won't change until one day the whole ecosystem is updated. Anyhow, I'll also see the complexity of adding a new _INFO ioctl. It should return a set of dev_id to user rather than the bdf info in the existing _INFO ioctl. Is it? Regards, Yi Liu
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, March 22, 2023 4:50 AM > > On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote: > > > This just seems like nit-picking that the API could have accomplished > > this more concisely. Probably that's true, but I think you've > > identified a gap above that amplifies the issue. If the user cannot > > map BDFs to cdevs because the cdevs are passed as open fds to the user > > driver, the _INFO results become meaningless and by removing the fds > > array, that becomes the obvious choice that a user presented with this > > dilemma would take. We're skipping past easier to misuse, difficult to > > use correctly, and circling around no obvious way to use correctly. > > No - this just isn't finished yet is all it means :( > > I just noticed it just now, presumably Eric would have discovered this > when he tried to implement the FD pass and we would have made a new > _INFO at that point (or more ugly, have libvirt pass the BDF along > with the FD). > > > Unfortunately the _INFO ioctl does presume that userspace knows the BDF > > to device mappings today, so if we are attempting to pre-enable a case > > with cdev support where that is not the case, then there must be > > something done with the _INFO ioctl to provide scope. > > Yes, something is required with _INFO before libvirt can use a FD > pass. I'm thinking of a new _INFO query that returns the iommufd > dev_ids for the reset group. Then qemu can match the dev_ids back to > cdev FDs and thus vPCI devices and do what it needs to do. Could you elaborate what is required with _INFO before libvirt can use a FD pass? > But for the current qemu setup it will open cdev directly and it will > know the BDF so it can still use the current _INFO. > > Though it would be nice if qemu didn't need two implementations so Yi > I'd rather see a new info in this series as well and qemu can just > consistently use dev_id and never bdf in iommufd mode. I have one concern here. iommufd dev_id is not a static info as much as bdf. It is generated when bound to iommufd. So if there are devices that are affected but not bound to iommufd yet at the time of invoking _INFO, then the _INFO ioctl just gets a subset of the affected devices. Is it enough? Regards, Yi Liu
On Wed, Mar 22, 2023 at 08:17:54AM +0000, Liu, Yi L wrote: > Could you elaborate what is required with _INFO before libvirt can > use a FD pass? Make a new _INFO that returns an array of dev_ids within the cdev's iommufd_ctx that are part of the reset group, eg the devset. qemu will call this for each dev_id after it opens the cdev to generate the groupings. > > But for the current qemu setup it will open cdev directly and it will > > know the BDF so it can still use the current _INFO. > > > > Though it would be nice if qemu didn't need two implementations so Yi > > I'd rather see a new info in this series as well and qemu can just > > consistently use dev_id and never bdf in iommufd mode. > > I have one concern here. iommufd dev_id is not a static info as much as > bdf. It is generated when bound to iommufd. So if there are devices that > are affected but not bound to iommufd yet at the time of invoking _INFO, > then the _INFO ioctl just gets a subset of the affected devices. Is it enough? I'd probably use similar logic as the reset path, if one of reset group devices is open and on a different iommufd_ctx then fail the IOCTL with EPERM. Jason
On Wed, 22 Mar 2023 04:42:16 +0000 "Liu, Yi L" <yi.l.liu@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, March 22, 2023 6:48 AM > > > > On Tue, 21 Mar 2023 19:20:37 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote: > > > > > > > > Though it would be nice if qemu didn't need two implementations so Yi > > > > > I'd rather see a new info in this series as well and qemu can just > > > > > consistently use dev_id and never bdf in iommufd mode. > > > > > > > > We also need to consider how libvirt determines if QEMU has the kernel > > > > support it needs to pass file descriptors. It'd be a lot cleaner if > > > > this aligned with the introduction of vfio cdevs. > > > > > > Yes, that would be much better if it was one package. > > > > > > But this is starting to grow and we have so many threads that need to > > > progress blocked on this cdev enablement :( > > > > > > Could we go forward with the cdev main patches and kconfig it to > > > experimental or something while the rest of the parts are completed > > > and tested through qemu? ie move the vfio-pci reset enablment to after > > > the cdev patches? > > > > We need to be able to guarantee that there cannot be any significant > > builds of the kernel with vfio cdev support if our intention is to stage > > it for libvirt. We don't have a global EXPERIMENTAL config option any > > more. Adding new code under BROKEN seems wrong. Fedora ships with > > STAGING enabled. A sternly worded Kconfig entry is toothless. What is > > the proposed mechanism to make this not look like a big uncompiled code > > dump? Thanks, > > Just out of curious, is the BDF mapping gap only for cdev or it also > exists in the traditional group path? The group path doesn't support passing file descriptors, getting access to the device files requires a full container configuration, which implies significant policy decisions in libvirt. Even if groups were passed, QEMU would need to know the device name, ie. BDF in string format, to get the device from the group. > IMHO, if it is only a gap for cdev, maybe > we can use CONFIG_VFIO_DEVICE_CDEV to stage it. This kconfig is N by > default. I think it won't change until one day the whole ecosystem is > updated. See the "toothless" comment above, disabling vfio cdev support by default because we don't have feature parity in reset support does not provide any guarantees to libvirt that it can effectively take advantage of passing cdev fds to QEMU. Thanks, Alex
On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote: > On Tue, 21 Mar 2023 19:20:37 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote: > > > > > > Though it would be nice if qemu didn't need two implementations so Yi > > > > I'd rather see a new info in this series as well and qemu can just > > > > consistently use dev_id and never bdf in iommufd mode. > > > > > > We also need to consider how libvirt determines if QEMU has the kernel > > > support it needs to pass file descriptors. It'd be a lot cleaner if > > > this aligned with the introduction of vfio cdevs. > > > > Yes, that would be much better if it was one package. > > > > But this is starting to grow and we have so many threads that need to > > progress blocked on this cdev enablement :( > > > > Could we go forward with the cdev main patches and kconfig it to > > experimental or something while the rest of the parts are completed > > and tested through qemu? ie move the vfio-pci reset enablment to after > > the cdev patches? > > We need to be able to guarantee that there cannot be any significant > builds of the kernel with vfio cdev support if our intention is to stage > it for libvirt. We don't have a global EXPERIMENTAL config option any > more. Adding new code under BROKEN seems wrong. Fedora ships with > STAGING enabled. A sternly worded Kconfig entry is toothless. What is > the proposed mechanism to make this not look like a big uncompiled code > dump? Thanks, I would suggest a sternly worded kconfig and STAGING. This isn't such a big issue, we are trying to say that a future released qemu is not required to work on older kernels with a STAGING kconfig mark. IOW we are saying that qemu release X.0 with production iommufd requires kernel version > x.y and just lightly reflecting this into the kconfig. qemu should simply not support iommufd if it finds itself on a old kernel. Jason
On Wed, 22 Mar 2023 09:27:16 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote: > > On Tue, 21 Mar 2023 19:20:37 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote: > > > > > > > > Though it would be nice if qemu didn't need two implementations so Yi > > > > > I'd rather see a new info in this series as well and qemu can just > > > > > consistently use dev_id and never bdf in iommufd mode. > > > > > > > > We also need to consider how libvirt determines if QEMU has the kernel > > > > support it needs to pass file descriptors. It'd be a lot cleaner if > > > > this aligned with the introduction of vfio cdevs. > > > > > > Yes, that would be much better if it was one package. > > > > > > But this is starting to grow and we have so many threads that need to > > > progress blocked on this cdev enablement :( > > > > > > Could we go forward with the cdev main patches and kconfig it to > > > experimental or something while the rest of the parts are completed > > > and tested through qemu? ie move the vfio-pci reset enablment to after > > > the cdev patches? > > > > We need to be able to guarantee that there cannot be any significant > > builds of the kernel with vfio cdev support if our intention is to stage > > it for libvirt. We don't have a global EXPERIMENTAL config option any > > more. Adding new code under BROKEN seems wrong. Fedora ships with > > STAGING enabled. A sternly worded Kconfig entry is toothless. What is > > the proposed mechanism to make this not look like a big uncompiled code > > dump? Thanks, > > I would suggest a sternly worded kconfig and STAGING. > > This isn't such a big issue, we are trying to say that a future > released qemu is not required to work on older kernels with a STAGING > kconfig mark. > > IOW we are saying that qemu release X.0 with production iommufd > requires kernel version > x.y and just lightly reflecting this into > the kconfig. > > qemu should simply not support iommufd if it finds itself on a old > kernel. Inferring features based on kernel versions doesn't work in a world where downstreams backport features to older kernels. Thanks, Alex
On Wed, Mar 22, 2023 at 06:36:14AM -0600, Alex Williamson wrote: > On Wed, 22 Mar 2023 09:27:16 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote: > > > On Tue, 21 Mar 2023 19:20:37 -0300 > > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote: > > > > > > > > > > Though it would be nice if qemu didn't need two implementations so Yi > > > > > > I'd rather see a new info in this series as well and qemu can just > > > > > > consistently use dev_id and never bdf in iommufd mode. > > > > > > > > > > We also need to consider how libvirt determines if QEMU has the kernel > > > > > support it needs to pass file descriptors. It'd be a lot cleaner if > > > > > this aligned with the introduction of vfio cdevs. > > > > > > > > Yes, that would be much better if it was one package. > > > > > > > > But this is starting to grow and we have so many threads that need to > > > > progress blocked on this cdev enablement :( > > > > > > > > Could we go forward with the cdev main patches and kconfig it to > > > > experimental or something while the rest of the parts are completed > > > > and tested through qemu? ie move the vfio-pci reset enablment to after > > > > the cdev patches? > > > > > > We need to be able to guarantee that there cannot be any significant > > > builds of the kernel with vfio cdev support if our intention is to stage > > > it for libvirt. We don't have a global EXPERIMENTAL config option any > > > more. Adding new code under BROKEN seems wrong. Fedora ships with > > > STAGING enabled. A sternly worded Kconfig entry is toothless. What is > > > the proposed mechanism to make this not look like a big uncompiled code > > > dump? Thanks, > > > > I would suggest a sternly worded kconfig and STAGING. > > > > This isn't such a big issue, we are trying to say that a future > > released qemu is not required to work on older kernels with a STAGING > > kconfig mark. > > > > IOW we are saying that qemu release X.0 with production iommufd > > requires kernel version > x.y and just lightly reflecting this into > > the kconfig. > > > > qemu should simply not support iommufd if it finds itself on a old > > kernel. > > Inferring features based on kernel versions doesn't work in a world > where downstreams backport features to older kernels. I don't mean actual kernel versions as a compatability test. I mean it as documention and an expected "support" window. ie we are disclaiming support for STAGING kernel as a matter of documentation, not code. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, March 22, 2023 8:18 PM > > On Wed, Mar 22, 2023 at 08:17:54AM +0000, Liu, Yi L wrote: > > > Could you elaborate what is required with _INFO before libvirt can > > use a FD pass? > > Make a new _INFO that returns an array of dev_ids within the cdev's > iommufd_ctx that are part of the reset group, eg the devset. > > qemu will call this for each dev_id after it opens the cdev to > generate the groupings. Thanks. So this new _INFO only reports a limited scope instead of the full list of affected devices. Also, it is not static scope since device may be opened just after the _INFO returns. > > > But for the current qemu setup it will open cdev directly and it will > > > know the BDF so it can still use the current _INFO. > > > > > > Though it would be nice if qemu didn't need two implementations so Yi > > > I'd rather see a new info in this series as well and qemu can just > > > consistently use dev_id and never bdf in iommufd mode. > > > > I have one concern here. iommufd dev_id is not a static info as much as > > bdf. It is generated when bound to iommufd. So if there are devices that > > are affected but not bound to iommufd yet at the time of invoking _INFO, > > then the _INFO ioctl just gets a subset of the affected devices. Is it enough? > > I'd probably use similar logic as the reset path, if one of reset > group devices is open and on a different iommufd_ctx then fail the > IOCTL with EPERM. Say there are three devices in the dev_set. When the first device is opened by the current qemu, this new _INFO would return one dev_id to user. When the second device is opened, this new _INFO will return two dev_ids to user. If the third device is opened by another qemu, then the new _INFO would fail since the former two devices were opened and have different iommufd_ctx with the third device. Regards, Yi Liu
On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote: > Thanks. So this new _INFO only reports a limited scope instead of > the full list of affected devices. Also, it is not static scope since device > may be opened just after the _INFO returns. Yes, it would be simplest for qemu to do the query after it gains a new dev_id and then it can add the new dev_id with the correct reset group. > > I'd probably use similar logic as the reset path, if one of reset > > group devices is open and on a different iommufd_ctx then fail the > > IOCTL with EPERM. > > Say there are three devices in the dev_set. When the first device is > opened by the current qemu, this new _INFO would return one dev_id > to user. When the second device is opened, this new _INFO will return > two dev_ids to user. Yes > If the third device is opened by another qemu, then > the new _INFO would fail since the former two devices were opened and > have different iommufd_ctx with the third device. Yes qemu should refuse to use the device at this moment. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, March 22, 2023 9:43 PM > > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote: > > > Thanks. So this new _INFO only reports a limited scope instead of > > the full list of affected devices. Also, it is not static scope since device > > may be opened just after the _INFO returns. > > Yes, it would be simplest for qemu to do the query after it gains a > new dev_id and then it can add the new dev_id with the correct reset > group. I see. QEMU can decide. For now, it seems like QEMU doesn't store such the info return by the existing _INFO ioctl. It is used in the hot reset helper and freed before it returns. Though, I'm not sure whether QEMU will store the dev_id info returned by the new _INFO. Perhaps Alex can give some guidance. > > > I'd probably use similar logic as the reset path, if one of reset > > > group devices is open and on a different iommufd_ctx then fail the > > > IOCTL with EPERM. > > > > Say there are three devices in the dev_set. When the first device is > > opened by the current qemu, this new _INFO would return one dev_id > > to user. When the second device is opened, this new _INFO will return > > two dev_ids to user. > > Yes > > > If the third device is opened by another qemu, then > > the new _INFO would fail since the former two devices were opened and > > have different iommufd_ctx with the third device. > > Yes > > qemu should refuse to use the device at this moment. Yes. it is. btw. Another question about this new _INFO ioctl. If there are affected devices that have not been bound to vfio driver yet, should this new _INFO ioctl fail all the same with EPERM? Or it still just returns the dev_ids of the devices that are already bound and opened. This seems to make sense with two reasons: - for one, the new _INFO is not designed to give userspace an exact affected device list; - for two, reset shall fail when checking vfio_pci_dev_set_resettable(); Please feel free to correct me if this is wrong. Regards, Yi Liu
On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, March 22, 2023 9:43 PM > > > > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote: > > > > > Thanks. So this new _INFO only reports a limited scope instead of > > > the full list of affected devices. Also, it is not static scope since device > > > may be opened just after the _INFO returns. > > > > Yes, it would be simplest for qemu to do the query after it gains a > > new dev_id and then it can add the new dev_id with the correct reset > > group. > > I see. QEMU can decide. For now, it seems like QEMU doesn't store > such the info return by the existing _INFO ioctl. It is used in the hot > reset helper and freed before it returns. Though, I'm not sure whether > QEMU will store the dev_id info returned by the new _INFO. Perhaps > Alex can give some guidance. That seems a bit confusing, qemu should take the reset group information and encode it so that the guest can know it as well. If all it does is blindly invoke the hot_reset with the right parameters then what was the point of all this discussion? > btw. Another question about this new _INFO ioctl. If there are affected > devices that have not been bound to vfio driver yet, should this new _INFO > ioctl fail all the same with EPERM? Yeah, it should EPERM the same as the normal hot reset if it can't marshal the device list. Jason
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, March 22, 2023 5:01 AM > > On Tue, 21 Mar 2023 17:50:08 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > Though it would be nice if qemu didn't need two implementations so Yi > > I'd rather see a new info in this series as well and qemu can just > > consistently use dev_id and never bdf in iommufd mode. > > We also need to consider how libvirt determines if QEMU has the kernel > support it needs to pass file descriptors. It'd be a lot cleaner if > this aligned with the introduction of vfio cdevs. > Libvirt can check whether the kernel creates cdev for a given device via sysfs. but I'm not sure how Libvirt determines whether QEMU supports a feature that it wants to use. But sounds this is a general handshake problem as Libvirt needs to support many versions of QEMU then there must be a way for such negotiation?
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, March 23, 2023 8:02 PM > > On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Wednesday, March 22, 2023 9:43 PM > > > > > > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote: > > > > > > > Thanks. So this new _INFO only reports a limited scope instead of > > > > the full list of affected devices. Also, it is not static scope since device > > > > may be opened just after the _INFO returns. > > > > > > Yes, it would be simplest for qemu to do the query after it gains a > > > new dev_id and then it can add the new dev_id with the correct reset > > > group. > > > > I see. QEMU can decide. For now, it seems like QEMU doesn't store > > such the info return by the existing _INFO ioctl. It is used in the hot > > reset helper and freed before it returns. Though, I'm not sure whether > > QEMU will store the dev_id info returned by the new _INFO. Perhaps > > Alex can give some guidance. > > That seems a bit confusing, qemu should take the reset group > information and encode it so that the guest can know it as well. > > If all it does is blindly invoke the hot_reset with the right > parameters then what was the point of all this discussion? > > > btw. Another question about this new _INFO ioctl. If there are affected > > devices that have not been bound to vfio driver yet, should this new _INFO > > ioctl fail all the same with EPERM? > > Yeah, it should EPERM the same as the normal hot reset if it can't > marshal the device list. Hi Jason, Alex, I got a draft patch to add the new _INFO? It checks if all the affected devices are in the dev_set, and then gets the dev_id of all the opened devices within the dev_set. There is still one thing not quite clear. It is the noiommu mode. In this mode, there is no iommufd bind, so no dev_id. For now, I just fail this new _INFO ioctl if there is no iommufd_device. Hence, this new _INFO is not available for users that operating in noiommu mode. Is this acceptable? From e763474e255ff9805b1fb76d6b6b9ccedb61058f Mon Sep 17 00:00:00 2001 From: Yi Liu <yi.l.liu@intel.com> Date: Fri, 24 Mar 2023 00:54:08 -0700 Subject: [PATCH 06/10] vfio/pci: Add VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO to report the affected devices for a given device's hot reset. It is a list of iommufd dev_id that is opened by the user. If there is device that is not bound to vfio driver or opened by another user, this shall fail with -EPERM. For the noiommu mode in the vfio device cdev path, this shall fail as no dev_id would be generated. Signed-off-by: Yi Liu <yi.l.liu@intel.com> --- drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++ include/uapi/linux/vfio.h | 28 +++++++++ 2 files changed, 126 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index b68fcba67a4b..5789933a64ae 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, return ret; } +static struct pci_dev * +vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set); + +static int vfio_pci_ioctl_get_pci_hot_reset_group_info( + struct vfio_pci_core_device *vdev, + struct vfio_pci_hot_reset_group_info __user *arg) +{ + unsigned long minsz = + offsetofend(struct vfio_pci_hot_reset_group_info, count); + struct vfio_pci_hot_reset_group_info hdr; + struct iommufd_ctx *iommufd, *cur_iommufd; + u32 count = 0, index = 0, *devices = NULL; + struct vfio_pci_core_device *cur; + bool slot = false; + int ret = 0; + + if (copy_from_user(&hdr, arg, minsz)) + return -EFAULT; + + if (hdr.argsz < minsz) + return -EINVAL; + + hdr.flags = 0; + + /* Can we do a slot or bus reset or neither? */ + if (!pci_probe_reset_slot(vdev->pdev->slot)) + slot = true; + else if (pci_probe_reset_bus(vdev->pdev->bus)) + return -ENODEV; + + mutex_lock(&vdev->vdev.dev_set->lock); + if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) { + ret = -EPERM; + goto out_unlock; + } + + iommufd = vfio_iommufd_physical_ictx(&vdev->vdev); + if (!iommufd) { + ret = -EPERM; + goto out_unlock; + } + + /* How many devices are affected? */ + ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs, + &count, slot); + if (ret) + goto out_unlock; + + WARN_ON(!count); /* Should always be at least one */ + + /* + * If there's enough space, fill it now, otherwise return -ENOSPC and + * the number of devices affected. + */ + if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) { + ret = -ENOSPC; + hdr.count = count; + goto reset_info_exit; + } + + devices = kcalloc(count, sizeof(*devices), GFP_KERNEL); + if (!devices) { + ret = -ENOMEM; + goto reset_info_exit; + } + + list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) { + cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev); + if (cur->vdev.open_count) { + if (cur_iommufd != iommufd) { + ret = -EPERM; + break; + } + ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]); + if (ret) + break; + index++; + } + } + +reset_info_exit: + if (copy_to_user(arg, &hdr, minsz)) + ret = -EFAULT; + + if (!ret) { + if (copy_to_user(&arg->devices, devices, + hdr.count * sizeof(*devices))) + ret = -EFAULT; + } + + kfree(devices); +out_unlock: + mutex_unlock(&vdev->vdev.dev_set->lock); + return ret; +} + static int vfio_pci_ioctl_get_pci_hot_reset_info( struct vfio_pci_core_device *vdev, struct vfio_pci_hot_reset_info __user *arg) @@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, return vfio_pci_ioctl_get_irq_info(vdev, uarg); case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO: return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg); + case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO: + return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg); case VFIO_DEVICE_GET_REGION_INFO: return vfio_pci_ioctl_get_region_info(vdev, uarg); case VFIO_DEVICE_IOEVENTFD: diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 17aa5d09db41..572497cda3ca 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -669,6 +669,43 @@ struct vfio_pci_hot_reset_info { #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12) +/** + * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12, + * struct vfio_pci_hot_reset_group_info) + * + * This is used in the vfio device cdev mode. It returns the list of + * affected devices (represented by iommufd dev_id) when hot reset is + * issued on the current device with which this ioctl is invoked. It + * only includes the devices that are opened by the current user in the + * time of this command is invoked. So user should re-invoke it when + * it has opened new devices. This information gives user a scope of + * affected devices that is opened by it. This is helpful for user to + * decide whether VFIO_DEVICE_PCI_HOT_RESET can be issued. However, + * even user can do hot reset per the info got via this ioctl, hot reset + * can fail since the not-opened affected device can be opened by other + * users in the window between the two ioctls. Detail can check the + * description of VFIO_DEVICE_PCI_HOT_RESET. + * + * Return: 0 on success, -errno on failure: + * -enospc = insufficient buffer; + * -enodev = unsupported for device; + * -eperm = no permission for device, this error comes: + * - when there are affected devices that are opened but + * not bound to the same iommufd with the current device + * with which this ioctl is invoked, + * - there are affected devices that are not bound to vfio + * driver yet. + * - no valid iommufd is bound (e.g. noiommu mode) + */ +struct vfio_pci_hot_reset_group_info { + __u32 argsz; + __u32 flags; + __u32 count; + __u32 devices[]; +}; + +#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO _IO(VFIO_TYPE, VFIO_BASE + 18) + /** * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13, * struct vfio_pci_hot_reset)
On Fri, Mar 24, 2023 at 09:09:59AM +0000, Tian, Kevin wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, March 22, 2023 5:01 AM > > > > On Tue, 21 Mar 2023 17:50:08 -0300 > > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > > > Though it would be nice if qemu didn't need two implementations so Yi > > > I'd rather see a new info in this series as well and qemu can just > > > consistently use dev_id and never bdf in iommufd mode. > > > > We also need to consider how libvirt determines if QEMU has the kernel > > support it needs to pass file descriptors. It'd be a lot cleaner if > > this aligned with the introduction of vfio cdevs. > > > > Libvirt can check whether the kernel creates cdev for a given device > via sysfs. > > but I'm not sure how Libvirt determines whether QEMU supports a > feature that it wants to use. But sounds this is a general handshake > problem as Libvirt needs to support many versions of QEMU then > there must be a way for such negotiation? Ideally libvirt would be able to learn what ioctls are supported on the cdev fd after it opens it, but before binding. I don't think we have something here yet, but I could imagine having something. Clearly it is easier if cdev alone is enough proof to know that it should go ahead in cdev mode. Jason
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Friday, March 24, 2023 5:25 PM > > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Thursday, March 23, 2023 8:02 PM > > > > On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Wednesday, March 22, 2023 9:43 PM > > > > > > > > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote: > > > > > > > > > Thanks. So this new _INFO only reports a limited scope instead of > > > > > the full list of affected devices. Also, it is not static scope since device > > > > > may be opened just after the _INFO returns. > > > > > > > > Yes, it would be simplest for qemu to do the query after it gains a > > > > new dev_id and then it can add the new dev_id with the correct reset > > > > group. > > > > > > I see. QEMU can decide. For now, it seems like QEMU doesn't store > > > such the info return by the existing _INFO ioctl. It is used in the hot > > > reset helper and freed before it returns. Though, I'm not sure whether > > > QEMU will store the dev_id info returned by the new _INFO. Perhaps > > > Alex can give some guidance. > > > > That seems a bit confusing, qemu should take the reset group > > information and encode it so that the guest can know it as well. > > > > If all it does is blindly invoke the hot_reset with the right > > parameters then what was the point of all this discussion? > > > > > btw. Another question about this new _INFO ioctl. If there are affected > > > devices that have not been bound to vfio driver yet, should this new > _INFO > > > ioctl fail all the same with EPERM? > > > > Yeah, it should EPERM the same as the normal hot reset if it can't > > marshal the device list. > > Hi Jason, Alex, > > I got a draft patch to add the new _INFO? It checks if all the affected devices > are in the dev_set, and then gets the dev_id of all the opened devices within > the dev_set. There is still one thing not quite clear. It is the noiommu mode. > In this mode, there is no iommufd bind, so no dev_id. For now, I just fail this > new _INFO ioctl if there is no iommufd_device. Hence, this new _INFO is not > available for users that operating in noiommu mode. Is this acceptable? The latest _INFO ioctl is in below link. https://lore.kernel.org/kvm/20230327093458.44939-11-yi.l.liu@intel.com/
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 9087cd8ed3ea..dbcee0d38a48 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -131,6 +131,12 @@ void iommufd_device_unbind(struct iommufd_device *idev) } EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD); +struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev) +{ + return idev->ictx; +} +EXPORT_SYMBOL_NS_GPL(iommufd_device_to_ictx, IOMMUFD); + static int iommufd_device_setup_msi(struct iommufd_device *idev, struct iommufd_hw_pagetable *hwpt, phys_addr_t sw_msi_start) diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c index 768d353cb6fa..30c0da2e11f9 100644 --- a/drivers/vfio/iommufd.c +++ b/drivers/vfio/iommufd.c @@ -69,6 +69,15 @@ void vfio_iommufd_unbind(struct vfio_device *vdev) vdev->ops->unbind_iommufd(vdev); } +struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev) +{ + /* Only serve for physical device */ + if (!vdev->iommufd_device) + return NULL; + return iommufd_device_to_ictx(vdev->iommufd_device); +} +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_ctx); + /* * The physical standard ops mean that the iommufd_device is bound to the * physical device vdev->dev that was provided to vfio_init_group_dev(). Drivers diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 123b468ead73..b039fbd5c656 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev) struct vfio_pci_user_file_info; static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set); static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, - struct vfio_pci_user_file_info *user_info); + struct vfio_pci_user_file_info *user_info, + struct iommufd_ctx *iommufd_ctx); /* * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND @@ -1255,29 +1256,17 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info( return ret; } -static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, +static int +vfio_pci_ioctl_pci_hot_reset_user_files(struct vfio_pci_core_device *vdev, + struct vfio_pci_hot_reset *hdr, + bool slot, struct vfio_pci_hot_reset __user *arg) { - unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count); - struct vfio_pci_hot_reset hdr; int32_t *user_fds; struct file **files; struct vfio_pci_user_file_info info; - bool slot = false; int file_idx, count = 0, ret = 0; - if (copy_from_user(&hdr, arg, minsz)) - return -EFAULT; - - if (hdr.argsz < minsz || hdr.flags) - return -EINVAL; - - /* Can we do a slot or bus reset or neither? */ - if (!pci_probe_reset_slot(vdev->pdev->slot)) - slot = true; - else if (pci_probe_reset_bus(vdev->pdev->bus)) - return -ENODEV; - /* * We can't let userspace give us an arbitrarily large buffer to copy, * so verify how many we think there could be. Note groups can have @@ -1289,11 +1278,11 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, return ret; /* Somewhere between 1 and count is OK */ - if (!hdr.count || hdr.count > count) + if (hdr->count > count) return -EINVAL; - user_fds = kcalloc(hdr.count, sizeof(*user_fds), GFP_KERNEL); - files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL); + user_fds = kcalloc(hdr->count, sizeof(*user_fds), GFP_KERNEL); + files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL); if (!user_fds || !files) { kfree(user_fds); kfree(files); @@ -1301,7 +1290,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, } if (copy_from_user(user_fds, arg->fds, - hdr.count * sizeof(*user_fds))) { + hdr->count * sizeof(*user_fds))) { kfree(user_fds); kfree(files); return -EFAULT; @@ -1311,7 +1300,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, * Get the file for each fd to ensure the group/device file * is held across the reset */ - for (file_idx = 0; file_idx < hdr.count; file_idx++) { + for (file_idx = 0; file_idx < hdr->count; file_idx++) { struct file *file = fget(user_fds[file_idx]); if (!file) { @@ -1341,10 +1330,10 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, if (ret) goto hot_reset_release; - info.count = hdr.count; + info.count = hdr->count; info.files = files; - ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info); + ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL); hot_reset_release: for (file_idx--; file_idx >= 0; file_idx--) @@ -1354,6 +1343,36 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, return ret; } +static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev, + struct vfio_pci_hot_reset __user *arg) +{ + unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count); + struct vfio_pci_hot_reset hdr; + struct iommufd_ctx *iommufd; + bool slot = false; + + if (copy_from_user(&hdr, arg, minsz)) + return -EFAULT; + + if (hdr.argsz < minsz || hdr.flags) + return -EINVAL; + + /* Can we do a slot or bus reset or neither? */ + if (!pci_probe_reset_slot(vdev->pdev->slot)) + slot = true; + else if (pci_probe_reset_bus(vdev->pdev->bus)) + return -ENODEV; + + if (hdr.count) + return vfio_pci_ioctl_pci_hot_reset_user_files(vdev, &hdr, slot, arg); + + iommufd = vfio_iommufd_physical_ctx(&vdev->vdev); + if (!iommufd) + return -EINVAL; + + return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd); +} + static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev, struct vfio_device_ioeventfd __user *arg) { @@ -2323,6 +2342,9 @@ static bool vfio_dev_in_user_fds(struct vfio_pci_core_device *vdev, { unsigned int i; + if (!user_info) + return false; + for (i = 0; i < user_info->count; i++) if (vfio_file_has_dev(user_info->files[i], &vdev->vdev)) return true; @@ -2398,13 +2420,25 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set) return ret; } +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev, + struct iommufd_ctx *iommufd_ctx) +{ + struct iommufd_ctx *iommufd = vfio_iommufd_physical_ctx(&vdev->vdev); + + if (!iommufd) + return false; + + return iommufd == iommufd_ctx; +} + /* * We need to get memory_lock for each device, but devices can share mmap_lock, * therefore we need to zap and hold the vma_lock for each device, and only then * get each memory_lock. */ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, - struct vfio_pci_user_file_info *user_info) + struct vfio_pci_user_file_info *user_info, + struct iommufd_ctx *iommufd_ctx) { struct vfio_pci_core_device *cur_mem; struct vfio_pci_core_device *cur_vma; @@ -2448,10 +2482,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, * For the devices that have been opened, needs to check the * ownership. If the user provides a set of group/device * fds, test whether all the opened devices are contained - * by the set of groups/devices provided by the user. + * by the set of groups/devices provided by the user. If + * user provides a zero-length array, the ownerhsip check + * is done by checking if all the opened devices are bound + * to the same iommufd_ctx. */ if (cur_vma->vdev.open_count && - !vfio_dev_in_user_fds(cur_vma, user_info)) { + !vfio_dev_in_user_fds(cur_vma, user_info) && + !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx)) { ret = -EINVAL; goto err_undo; } diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index 365f11e8e615..7a0d7f2c4237 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -20,6 +20,7 @@ struct file; struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, struct device *dev, u32 *id); void iommufd_device_unbind(struct iommufd_device *idev); +struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev); int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id); void iommufd_device_detach(struct iommufd_device *idev); diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 1c69be2d687e..fc14f8430a10 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -116,6 +116,7 @@ struct vfio_device_ops { int vfio_iommufd_physical_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx, u32 *out_device_id); void vfio_iommufd_physical_unbind(struct vfio_device *vdev); +struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev); int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id); int vfio_iommufd_emulated_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx, u32 *out_device_id); @@ -127,6 +128,8 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id); u32 *out_device_id)) NULL) #define vfio_iommufd_physical_unbind \ ((void (*)(struct vfio_device *vdev)) NULL) +#define vfio_iommufd_physical_ctx \ + ((struct iommufd_ctx * (*)(struct vfio_device *vdev)) NULL) #define vfio_iommufd_physical_attach_ioas \ ((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL) #define vfio_iommufd_emulated_bind \ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index d80141969cd1..382d95455f89 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info { * The ownership can be proved by: * - An array of group fds * - An array of device fds + * - A zero-length array + * + * In the last case all affected devices which are opened by this user + * must have been bound to a same iommufd_ctx. This approach is only + * available for devices bound to positive iommufd. * * Return: 0 on success, -errno on failure. */
This is another method to issue PCI hot reset for the users that bounds device to a positive iommufd value. In such case, iommufd is a proof of device ownership. By passing a zero-length fd array, user indicates kernel to do ownership check with the bound iommufd. All the opened devices within the affected dev_set should have been bound to the same iommufd. This is simpler and faster as user does not need to pass a set of fds and kernel no need to search the device within the given fds. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> --- drivers/iommu/iommufd/device.c | 6 +++ drivers/vfio/iommufd.c | 9 ++++ drivers/vfio/pci/vfio_pci_core.c | 92 ++++++++++++++++++++++---------- include/linux/iommufd.h | 1 + include/linux/vfio.h | 3 ++ include/uapi/linux/vfio.h | 5 ++ 6 files changed, 89 insertions(+), 27 deletions(-)