Message ID: 0-v1-90bf0950c42c+39-vfio_group_disassociate_jgg@nvidia.com
Series: Allow the group FD to remain open when unplugging a device
On Thu, 6 Oct 2022 09:40:35 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> Testing has shown that virtnodedevd will leave the group FD open for long
> periods, even after all the cdevs have been destroyed. This blocks
> destruction of the VFIO device and is undesirable.
>
> That approach was selected to accommodate SPAPR, which has a broken
> lifecycle model for the iommu_group. However, we can accommodate SPAPR by
> realizing that it doesn't use the iommu core at all, so rules about
> iommu_group lifetime do not apply to it.
>
> Giving the KVM code its own kref on the iommu_group allows the VFIO core
> code to release its iommu_group reference earlier, and we can remove the
> sleep that only existed for SPAPR.
>
> Jason Gunthorpe (3):
>   vfio: Add vfio_file_is_group()
>   vfio: Hold a reference to the iommu_group in kvm for SPAPR
>   vfio: Make the group FD disassociate from the iommu_group
>
>  drivers/vfio/pci/vfio_pci_core.c |  2 +-
>  drivers/vfio/vfio.h              |  1 -
>  drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
>  include/linux/vfio.h             |  1 +
>  virt/kvm/vfio.c                  | 45 +++++++++++-----
>  5 files changed, 94 insertions(+), 45 deletions(-)

Containers aren't getting cleaned up with this series: starting and
shutting down a libvirt-managed VM with vfio-pci devices, an mtty mdev
device, and hugepages in use, /proc/meminfo shows the hugepages are not
released on VM shutdown and I'm unable to subsequently restart the VM.
Thanks,

Alex
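A quick way to see the symptom Alex describes is to compare the hugepage
counters in /proc/meminfo before and after VM shutdown; with a clean
teardown, HugePages_Free climbs back to HugePages_Total. A minimal sketch
of that check follows (an illustration added here, not code from the
thread):

/*
 * Sketch: print the hugepage counters Alex checked in /proc/meminfo.
 * With a healthy container teardown, HugePages_Free returns to
 * HugePages_Total once the hugepage-backed guest has shut down.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("fopen /proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "HugePages_Total:", 16) ||
		    !strncmp(line, "HugePages_Free:", 15))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}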
On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
> On Thu, 6 Oct 2022 09:40:35 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > [...]
>
> Containers aren't getting cleaned up with this series: starting and
> shutting down a libvirt-managed VM with vfio-pci devices, an mtty mdev
> device, and hugepages in use, /proc/meminfo shows the hugepages are not
> released on VM shutdown and I'm unable to subsequently restart the VM.
> Thanks,

Oh, I'm surprised the s390 testing didn't hit this!!

This hunk should remain, since not all group FD closures are due to
device hot unplug:

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index f9cb734d3991b3..62aba3a128fb8d 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -954,6 +954,13 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 	filep->private_data = NULL;
 
 	mutex_lock(&group->group_lock);
+	/*
+	 * Device FDs hold a group file reference, therefore the group release
+	 * is only called when there are no open devices.
+	 */
+	WARN_ON(group->notifier.head);
+	if (group->container)
+		vfio_group_detach_container(group);
 	group->opened_file = NULL;
 	mutex_unlock(&group->group_lock);
 	return 0;
On 10/6/22 6:42 PM, Jason Gunthorpe wrote:
> On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
>> [...]
>> Containers aren't getting cleaned up with this series: starting and
>> shutting down a libvirt-managed VM with vfio-pci devices, an mtty mdev
>> device, and hugepages in use, /proc/meminfo shows the hugepages are not
>> released on VM shutdown and I'm unable to subsequently restart the VM.
>
> Oh, I'm surprised the s390 testing didn't hit this!!

Huh, me too, at least eventually - I think it's because we aren't pinning
everything upfront but rather on demand, so the missing type1 release /
vfio_iommu_unmap_unpin_all wouldn't be so obvious. I definitely did
multiple VM (re)starts and hot (un)plugs. But while my test workloads did
some I/O, the long-running one was focused on the plug/unplug scenarios to
recreate the initial issue, so the I/O (and thus pinning) done would have
been minimal.

-ap and -ccw also don't pin everything upfront (and I did far less testing
with those).

Ugh. Moving forward, it might be worth seeing how I can loop some
non-s390-specific vfio testing into my routine.

> This hunk should remain, since not all group FD closures are due to
> device hot unplug:
>
> [...]

Anyway, FWIW, I folded this in and re-ran a brief series of -pci, -ccw and
-ap tests on s390, and things still look good. For completeness I'll start
some longer-running pci tests next, but I expect this will still be fine
as well.
On 10/6/22 7:28 PM, Matthew Rosato wrote:
> On 10/6/22 6:42 PM, Jason Gunthorpe wrote:
>> [...]
>> This hunk should remain, since not all group FD closures are due to
>> device hot unplug:
>> [...]
>
> Anyway, FWIW, I folded this in and re-ran a brief series of -pci, -ccw
> and -ap tests on s390, and things still look good. For completeness I'll
> start some longer-running pci tests next, but I expect this will still
> be fine as well.

Further s390 vfio-pci testing still looks good with this change too.
On 07.10.22 00:42, Jason Gunthorpe wrote:
> On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
>> [...]
>> /proc/meminfo shows the hugepages are not released on VM shutdown and
>> I'm unable to subsequently restart the VM.
>
> Oh, I'm surprised the s390 testing didn't hit this!!

I guess it's because most of our CI test cases create a new ephemeral
guest for each test case. We do test reboot, but not shutdown and restart.
Will have a look at whether that can be improved.
On Thu, Oct 06, 2022 at 07:28:53PM -0400, Matthew Rosato wrote:
> > Oh, I'm surprised the s390 testing didn't hit this!!
>
> Huh, me too, at least eventually - I think it's because we aren't
> pinning everything upfront but rather on demand, so the missing type1
> release / vfio_iommu_unmap_unpin_all wouldn't be so obvious.
> I definitely did multiple VM (re)starts and hot (un)plugs. But
> while my test workloads did some I/O, the long-running one was
> focused on the plug/unplug scenarios to recreate the initial issue,
> so the I/O (and thus pinning) done would have been minimal.

That explains ccw/ap a bit, but for PCI the iommu ownership wasn't
released, so it becomes impossible to re-attach a container to the
group - e.g. a 2nd VM can never be started.

Ah well, thanks!

Jason
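For readers unfamiliar with the UAPI under discussion, "re-attaching a
container to the group" is the sequence a VMM runs at startup against the
documented VFIO type1 interface, and with the unreleased iommu ownership
Jason describes it is the step a second VM run can no longer complete. A
rough sketch of that sequence, using a placeholder group number
(/dev/vfio/26) and minimal error handling (an illustration added here, not
code from the thread):

/*
 * Sketch: attach a VFIO group to a freshly created type1 container, the
 * way a VMM does before mapping guest memory. The group number is a
 * placeholder. With the broken teardown discussed above, the
 * VFIO_GROUP_SET_CONTAINER step cannot succeed again for a second VM,
 * because the stale container still owns the group.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
	struct vfio_group_status status = { .argsz = sizeof(status) };
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);	/* placeholder group */

	if (container < 0 || group < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(group, VFIO_GROUP_GET_STATUS, &status) ||
	    !(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
		fprintf(stderr, "group is not viable\n");
		return 1;
	}

	/* Attach the group to the container, then enable the type1 IOMMU. */
	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) ||
	    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
		perror("attaching container");
		return 1;
	}

	printf("group attached to a new type1 container\n");
	return 0;
}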
On 10/7/22 9:37 AM, Jason Gunthorpe wrote:
> On Thu, Oct 06, 2022 at 07:28:53PM -0400, Matthew Rosato wrote:
>> [...]
>
> That explains ccw/ap a bit, but for PCI the iommu ownership wasn't
> released, so it becomes impossible to re-attach a container to the
> group - e.g. a 2nd VM can never be started.
>
> Ah well, thanks!
>
> Jason

Well, this bugged me enough that I traced the v1 series without the fixup,
and vfio-pci on s390 was OK because it was still calling detach_container
on VM shutdown via this chain:

  vfio_pci_remove
    vfio_pci_core_unregister_device
      vfio_unregister_group_dev
        vfio_device_remove_group
          vfio_group_detach_container

I'd guess non-s390 vfio-pci would do the same. Alex also had the mtty
mdev, maybe that's relevant.
On Fri, Oct 07, 2022 at 10:37:11AM -0400, Matthew Rosato wrote:
> Well, this bugged me enough that I traced the v1 series without the
> fixup, and vfio-pci on s390 was OK because it was still calling
> detach_container on VM shutdown via this chain:
>
>   vfio_pci_remove
>     vfio_pci_core_unregister_device
>       vfio_unregister_group_dev
>         vfio_device_remove_group
>           vfio_group_detach_container
>
> I'd guess non-s390 vfio-pci would do the same. Alex also had the mtty
> mdev, maybe that's relevant.

As long as you are unplugging a driver, the v1 series would work. The
failure mode is when you don't unplug the driver and just run a VM
twice in a row.

Jason
On 10/7/22 10:39 AM, Jason Gunthorpe wrote:
> On Fri, Oct 07, 2022 at 10:37:11AM -0400, Matthew Rosato wrote:
>> [...]
>
> As long as you are unplugging a driver, the v1 series would work. The
> failure mode is when you don't unplug the driver and just run a VM
> twice in a row.

Oh, duh - and yep, all of my tests are using managed libvirt, so it's
unbinding from vfio-pci back to the default host driver on VM shutdown.
OK, if I force the point and leave vfio-pci bound, the 2nd guest boot
indeed fails setting up the container with unmodified v1. I'll try again
with the new v2 now.
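The "managed" libvirt behaviour Matthew mentions amounts to writing the
device address to vfio-pci's sysfs unbind attribute on VM shutdown, which
is what drives the vfio_pci_remove chain traced above; leaving the device
bound instead is how the failure mode reproduces. A rough sketch of that
unbind step, with a placeholder PCI address (an illustration added here,
not code from the thread):

/*
 * Sketch: unbind a device from vfio-pci via sysfs, as managed libvirt
 * does on VM shutdown. This drives vfio_pci_remove -> ... ->
 * vfio_group_detach_container, which is why the bug was masked. The PCI
 * address below is a placeholder.
 */
#include <stdio.h>

int main(void)
{
	const char *bdf = "0000:01:00.0";	/* placeholder device address */
	FILE *f = fopen("/sys/bus/pci/drivers/vfio-pci/unbind", "w");

	if (!f) {
		perror("fopen unbind");
		return 1;
	}
	if (fputs(bdf, f) == EOF || fclose(f) != 0) {
		perror("unbind write");
		return 1;
	}
	printf("%s unbound from vfio-pci\n", bdf);
	return 0;
}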