[0/3] Allow the group FD to remain open when unplugging a device

Message ID 0-v1-90bf0950c42c+39-vfio_group_disassociate_jgg@nvidia.com

Message

Jason Gunthorpe Oct. 6, 2022, 12:40 p.m. UTC
Testing has shown that virtnodedevd will leave the group FD open for long
periods, even after all the cdevs have been destroyed. This blocks
destruction of the VFIO device and is undesirable.

That approach was selected to accommodate SPAPR, which has a broken
lifecycle model for the iommu_group. However, we can accommodate SPAPR by
realizing that it doesn't use the iommu core at all, so rules about
iommu_group lifetime do not apply to it.

Giving the KVM code its own kref on the iommu_group allows the VFIO core
code to release its iommu_group reference earlier and we can remove the
sleep that only existed for SPAPR.
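
The refcounting pattern behind patch 2 is roughly the following (a sketch
with made-up names, not the actual kvm-vfio code): KVM takes its own
reference with iommu_group_ref_get() and drops it with iommu_group_put(),
so the SPAPR code no longer needs VFIO to keep the iommu_group alive:

/*
 * Minimal sketch only: names here are illustrative and the real code in
 * virt/kvm/vfio.c differs in detail.  The point is simply that KVM keeps
 * its own counted reference on the iommu_group, so the SPAPR TCE paths
 * no longer rely on VFIO holding the group alive.
 */
#include <linux/iommu.h>

struct kvm_spapr_group_ref {
	struct iommu_group *grp;
};

static void kvm_spapr_group_ref_take(struct kvm_spapr_group_ref *r,
				     struct iommu_group *grp)
{
	/* KVM's own kref, independent of the VFIO group's reference */
	r->grp = iommu_group_ref_get(grp);
}

static void kvm_spapr_group_ref_drop(struct kvm_spapr_group_ref *r)
{
	iommu_group_put(r->grp);
	r->grp = NULL;
}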

Jason Gunthorpe (3):
  vfio: Add vfio_file_is_group()
  vfio: Hold a reference to the iommu_group in kvm for SPAPR
  vfio: Make the group FD disassociate from the iommu_group

 drivers/vfio/pci/vfio_pci_core.c |  2 +-
 drivers/vfio/vfio.h              |  1 -
 drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
 include/linux/vfio.h             |  1 +
 virt/kvm/vfio.c                  | 45 +++++++++++-----
 5 files changed, 94 insertions(+), 45 deletions(-)


base-commit: c82e81ab2569559ad873b3061217c2f37560682b

Comments

Alex Williamson Oct. 6, 2022, 7:53 p.m. UTC | #1
On Thu,  6 Oct 2022 09:40:35 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> Testing has shown that virtnodedevd will leave the group FD open for long
> periods, even after all the cdevs have been destroyed. This blocks
> destruction of the VFIO device and is undesirable.
> 
> That approach was selected to accomodate SPAPR which has an broken
> lifecyle model for the iommu_group. However, we can accomodate SPAPR by
> realizing that it doesn't use the iommu core at all, so rules about
> iommu_group lifetime do not apply to it.
> 
> Giving the KVM code its own kref on the iommu_group allows the VFIO core
> code to release its iommu_group reference earlier and we can remove the
> sleep that only existed for SPAPR.
> 
> Jason Gunthorpe (3):
>   vfio: Add vfio_file_is_group()
>   vfio: Hold a reference to the iommu_group in kvm for SPAPR
>   vfio: Make the group FD disassociate from the iommu_group
> 
>  drivers/vfio/pci/vfio_pci_core.c |  2 +-
>  drivers/vfio/vfio.h              |  1 -
>  drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
>  include/linux/vfio.h             |  1 +
>  virt/kvm/vfio.c                  | 45 +++++++++++-----
>  5 files changed, 94 insertions(+), 45 deletions(-)

Containers aren't getting cleaned up with this series.  Starting and
shutting down a libvirt-managed VM with vfio-pci devices, an mtty mdev
device, and hugepages in use, /proc/meminfo shows the hugepages are not
released on VM shutdown and I'm unable to subsequently restart the
VM. Thanks,

Alex
Jason Gunthorpe Oct. 6, 2022, 10:42 p.m. UTC | #2
On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
> On Thu,  6 Oct 2022 09:40:35 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > Testing has shown that virtnodedevd will leave the group FD open for long
> > periods, even after all the cdevs have been destroyed. This blocks
> > destruction of the VFIO device and is undesirable.
> > 
> > That approach was selected to accomodate SPAPR which has an broken
> > lifecyle model for the iommu_group. However, we can accomodate SPAPR by
> > realizing that it doesn't use the iommu core at all, so rules about
> > iommu_group lifetime do not apply to it.
> > 
> > Giving the KVM code its own kref on the iommu_group allows the VFIO core
> > code to release its iommu_group reference earlier and we can remove the
> > sleep that only existed for SPAPR.
> > 
> > Jason Gunthorpe (3):
> >   vfio: Add vfio_file_is_group()
> >   vfio: Hold a reference to the iommu_group in kvm for SPAPR
> >   vfio: Make the group FD disassociate from the iommu_group
> > 
> >  drivers/vfio/pci/vfio_pci_core.c |  2 +-
> >  drivers/vfio/vfio.h              |  1 -
> >  drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
> >  include/linux/vfio.h             |  1 +
> >  virt/kvm/vfio.c                  | 45 +++++++++++-----
> >  5 files changed, 94 insertions(+), 45 deletions(-)
> 
> Containers aren't getting cleaned up with this series, starting and
> shutting down a libvirt managed VM with vfio-pci devices, an mtty mdev
> device, and making use of hugepages, /proc/meminfo shows the hugepages
> are not released on VM shutdown and I'm unable to subsequently restart
> the VM. Thanks,

Oh, I'm surprised the s390 testing didn't hit this!!

This hunk should remain, since not all group FD closes are due to device
hot unplug:

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index f9cb734d3991b3..62aba3a128fb8d 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -954,6 +954,13 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 	filep->private_data = NULL;
 
 	mutex_lock(&group->group_lock);
+	/*
+	 * Device FDs hold a group file reference, therefore the group release
+	 * is only called when there are no open devices.
+	 */
+	WARN_ON(group->notifier.head);
+	if (group->container)
+		vfio_group_detach_container(group);
 	group->opened_file = NULL;
 	mutex_unlock(&group->group_lock);
 	return 0;
Matthew Rosato Oct. 6, 2022, 11:28 p.m. UTC | #3
On 10/6/22 6:42 PM, Jason Gunthorpe wrote:
> On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
>> On Thu,  6 Oct 2022 09:40:35 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> Testing has shown that virtnodedevd will leave the group FD open for long
>>> periods, even after all the cdevs have been destroyed. This blocks
>>> destruction of the VFIO device and is undesirable.
>>>
>>> That approach was selected to accomodate SPAPR which has an broken
>>> lifecyle model for the iommu_group. However, we can accomodate SPAPR by
>>> realizing that it doesn't use the iommu core at all, so rules about
>>> iommu_group lifetime do not apply to it.
>>>
>>> Giving the KVM code its own kref on the iommu_group allows the VFIO core
>>> code to release its iommu_group reference earlier and we can remove the
>>> sleep that only existed for SPAPR.
>>>
>>> Jason Gunthorpe (3):
>>>   vfio: Add vfio_file_is_group()
>>>   vfio: Hold a reference to the iommu_group in kvm for SPAPR
>>>   vfio: Make the group FD disassociate from the iommu_group
>>>
>>>  drivers/vfio/pci/vfio_pci_core.c |  2 +-
>>>  drivers/vfio/vfio.h              |  1 -
>>>  drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
>>>  include/linux/vfio.h             |  1 +
>>>  virt/kvm/vfio.c                  | 45 +++++++++++-----
>>>  5 files changed, 94 insertions(+), 45 deletions(-)
>>
>> Containers aren't getting cleaned up with this series, starting and
>> shutting down a libvirt managed VM with vfio-pci devices, an mtty mdev
>> device, and making use of hugepages, /proc/meminfo shows the hugepages
>> are not released on VM shutdown and I'm unable to subsequently restart
>> the VM. Thanks,
> 
> Oh, I'm surprised the s390 testing didn't hit this!!

Huh, me too, at least eventually - I think it's because we aren't pinning everything upfront but rather on-demand, so the missing type1 release / vfio_iommu_unmap_unpin_all wouldn't be so obvious.  I definitely did multiple VM (re)starts and hot (un)plugs.  But while my test workloads did some I/O, the long-running one was focused on the plug/unplug scenarios to recreate the initial issue, so the I/O (and thus pinning) done would have been minimal.

-ap and -ccw also don't pin everything upfront (and I did far less testing with those).

Ugh.  Moving forward, it might be worth seeing how I can loop some non-s390-specific vfio testing into my routine.

> 
> This hunk should remain since not all cases are closures due to device
> hot unplug
> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index f9cb734d3991b3..62aba3a128fb8d 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -954,6 +954,13 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>  	filep->private_data = NULL;
>  
>  	mutex_lock(&group->group_lock);
> +	/*
> +	 * Device FDs hold a group file reference, therefore the group release
> +	 * is only called when there are no open devices.
> +	 */
> +	WARN_ON(group->notifier.head);
> +	if (group->container)
> +		vfio_group_detach_container(group);
>  	group->opened_file = NULL;
>  	mutex_unlock(&group->group_lock);
>  	return 0;

Anyway, FWIW, I folded this in and re-ran a brief series of -pci, -ccw and -ap tests on s390 and things still look good.  For completeness I'll start some longer-running pci tests next but I expect this will still be fine as well.
Matthew Rosato Oct. 7, 2022, 1:46 a.m. UTC | #4
On 10/6/22 7:28 PM, Matthew Rosato wrote:
> On 10/6/22 6:42 PM, Jason Gunthorpe wrote:
>> On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
>>> On Thu,  6 Oct 2022 09:40:35 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>
>>>> Testing has shown that virtnodedevd will leave the group FD open for long
>>>> periods, even after all the cdevs have been destroyed. This blocks
>>>> destruction of the VFIO device and is undesirable.
>>>>
>>>> That approach was selected to accomodate SPAPR which has an broken
>>>> lifecyle model for the iommu_group. However, we can accomodate SPAPR by
>>>> realizing that it doesn't use the iommu core at all, so rules about
>>>> iommu_group lifetime do not apply to it.
>>>>
>>>> Giving the KVM code its own kref on the iommu_group allows the VFIO core
>>>> code to release its iommu_group reference earlier and we can remove the
>>>> sleep that only existed for SPAPR.
>>>>
>>>> Jason Gunthorpe (3):
>>>>   vfio: Add vfio_file_is_group()
>>>>   vfio: Hold a reference to the iommu_group in kvm for SPAPR
>>>>   vfio: Make the group FD disassociate from the iommu_group
>>>>
>>>>  drivers/vfio/pci/vfio_pci_core.c |  2 +-
>>>>  drivers/vfio/vfio.h              |  1 -
>>>>  drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
>>>>  include/linux/vfio.h             |  1 +
>>>>  virt/kvm/vfio.c                  | 45 +++++++++++-----
>>>>  5 files changed, 94 insertions(+), 45 deletions(-)
>>>
>>> Containers aren't getting cleaned up with this series, starting and
>>> shutting down a libvirt managed VM with vfio-pci devices, an mtty mdev
>>> device, and making use of hugepages, /proc/meminfo shows the hugepages
>>> are not released on VM shutdown and I'm unable to subsequently restart
>>> the VM. Thanks,
>>
>> Oh, I'm surprised the s390 testing didn't hit this!!
> 
> Huh, me too, at least eventually - I think it's because we aren't pinning everything upfront but rather on-demand so the missing the type1 release / vfio_iommu_unmap_unpin_all wouldn't be so obvious.  I definitely did multiple VM (re)starts and hot (un)plugs.  But while my test workloads did some I/O, the long-running one was focused on the plug/unplug scenarios to recreate the initial issue so the I/O (and thus pinning) done would have been minimal.
> 
> -ap and -ccw also don't pin everything upfront (and I did far less testing with those).
> 
> Ugh.  Moving forward, might be worth seeing how I can loop in some non-s390-specific vfio testing into my routine.
> 
>>
>> This hunk should remain since not all cases are closures due to device
>> hot unplug
>>
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index f9cb734d3991b3..62aba3a128fb8d 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -954,6 +954,13 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>>  	filep->private_data = NULL;
>>  
>>  	mutex_lock(&group->group_lock);
>> +	/*
>> +	 * Device FDs hold a group file reference, therefore the group release
>> +	 * is only called when there are no open devices.
>> +	 */
>> +	WARN_ON(group->notifier.head);
>> +	if (group->container)
>> +		vfio_group_detach_container(group);
>>  	group->opened_file = NULL;
>>  	mutex_unlock(&group->group_lock);
>>  	return 0;
> 
> Anyway, FWIW, I folded this in and re-ran a brief series of -pci, -ccw and -ap tests on s390 and things still look good.  For completeness I'll start some longer-running pci tests next but I expect this will still be fine as well.

Further s390 vfio-pci testing still looks good with this change too.
Christian Borntraeger Oct. 7, 2022, 11:17 a.m. UTC | #5
Am 07.10.22 um 00:42 schrieb Jason Gunthorpe:
> On Thu, Oct 06, 2022 at 01:53:15PM -0600, Alex Williamson wrote:
>> On Thu,  6 Oct 2022 09:40:35 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> Testing has shown that virtnodedevd will leave the group FD open for long
>>> periods, even after all the cdevs have been destroyed. This blocks
>>> destruction of the VFIO device and is undesirable.
>>>
>>> That approach was selected to accomodate SPAPR which has an broken
>>> lifecyle model for the iommu_group. However, we can accomodate SPAPR by
>>> realizing that it doesn't use the iommu core at all, so rules about
>>> iommu_group lifetime do not apply to it.
>>>
>>> Giving the KVM code its own kref on the iommu_group allows the VFIO core
>>> code to release its iommu_group reference earlier and we can remove the
>>> sleep that only existed for SPAPR.
>>>
>>> Jason Gunthorpe (3):
>>>    vfio: Add vfio_file_is_group()
>>>    vfio: Hold a reference to the iommu_group in kvm for SPAPR
>>>    vfio: Make the group FD disassociate from the iommu_group
>>>
>>>   drivers/vfio/pci/vfio_pci_core.c |  2 +-
>>>   drivers/vfio/vfio.h              |  1 -
>>>   drivers/vfio/vfio_main.c         | 90 +++++++++++++++++++++-----------
>>>   include/linux/vfio.h             |  1 +
>>>   virt/kvm/vfio.c                  | 45 +++++++++++-----
>>>   5 files changed, 94 insertions(+), 45 deletions(-)
>>
>> Containers aren't getting cleaned up with this series, starting and
>> shutting down a libvirt managed VM with vfio-pci devices, an mtty mdev
>> device, and making use of hugepages, /proc/meminfo shows the hugepages
>> are not released on VM shutdown and I'm unable to subsequently restart
>> the VM. Thanks,
> 
> Oh, I'm surprised the s390 testing didn't hit this!!

I guess it's because most of our CI testcases create a new ephemeral
guest for each testcase. We do test reboot, but not shutdown and restart.
I will have a look at whether that can be improved.
Jason Gunthorpe Oct. 7, 2022, 1:37 p.m. UTC | #6
On Thu, Oct 06, 2022 at 07:28:53PM -0400, Matthew Rosato wrote:

> > Oh, I'm surprised the s390 testing didn't hit this!!
> 
> Huh, me too, at least eventually - I think it's because we aren't
> pinning everything upfront but rather on-demand so the missing the
> type1 release / vfio_iommu_unmap_unpin_all wouldn't be so obvious.
> I definitely did multiple VM (re)starts and hot (un)plugs.  But
> while my test workloads did some I/O, the long-running one was
> focused on the plug/unplug scenarios to recreate the initial issue
> so the I/O (and thus pinning) done would have been minimal.

That explains ccw/ap a bit, but for PCI the iommu ownership wasn't
released, so it becomes impossible to re-attach a container to the
group, e.g. a 2nd VM can never be started.

Ah well, thanks!

Jason
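
The ownership mentioned above is the iommu core's DMA ownership claim
that VFIO takes for IOMMU-backed devices. A simplified sketch of the
pattern, with hypothetical wrapper names:

/*
 * Simplified sketch with hypothetical wrapper names; the real VFIO code
 * paths are the container attach/detach functions (including
 * vfio_group_detach_container() from the hunk above).  VFIO claims DMA
 * ownership of the iommu_group when a container is attached; if the
 * release is skipped when the group FD closes, the next claim fails and
 * a second VM cannot attach a container to the same group.
 */
#include <linux/iommu.h>

static int demo_container_attach(struct iommu_group *grp, void *owner)
{
	/* fails (e.g. -EBUSY) if a previous owner never released */
	return iommu_group_claim_dma_owner(grp, owner);
}

static void demo_container_detach(struct iommu_group *grp)
{
	iommu_group_release_dma_owner(grp);
}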
Matthew Rosato Oct. 7, 2022, 2:37 p.m. UTC | #7
On 10/7/22 9:37 AM, Jason Gunthorpe wrote:
> On Thu, Oct 06, 2022 at 07:28:53PM -0400, Matthew Rosato wrote:
> 
>>> Oh, I'm surprised the s390 testing didn't hit this!!
>>
>> Huh, me too, at least eventually - I think it's because we aren't
>> pinning everything upfront but rather on-demand so the missing the
>> type1 release / vfio_iommu_unmap_unpin_all wouldn't be so obvious.
>> I definitely did multiple VM (re)starts and hot (un)plugs.  But
>> while my test workloads did some I/O, the long-running one was
>> focused on the plug/unplug scenarios to recreate the initial issue
>> so the I/O (and thus pinning) done would have been minimal.
> 
> That explains ccw/ap a bit but for PCI the iommu ownership wasn't
> released so it becomes impossible to re-attach a container to the
> group. eg a 2nd VM can never be started
> 
> Ah well, thanks!
> 
> Jason

Well, this bugged me enough that I traced the v1 series without the fixup, and vfio-pci on s390 was OK because it was still calling detach_container on VM shutdown via this chain:

vfio_pci_remove
 vfio_pci_core_unregister_device
  vfio_unregister_group_dev
   vfio_device_remove_group
    vfio_group_detach_container

I'd guess non-s390 vfio-pci would do the same.  Alex also had the mtty mdev, maybe that's relevant.
Jason Gunthorpe Oct. 7, 2022, 2:39 p.m. UTC | #8
On Fri, Oct 07, 2022 at 10:37:11AM -0400, Matthew Rosato wrote:
> On 10/7/22 9:37 AM, Jason Gunthorpe wrote:
> > On Thu, Oct 06, 2022 at 07:28:53PM -0400, Matthew Rosato wrote:
> > 
> >>> Oh, I'm surprised the s390 testing didn't hit this!!
> >>
> >> Huh, me too, at least eventually - I think it's because we aren't
> >> pinning everything upfront but rather on-demand so the missing the
> >> type1 release / vfio_iommu_unmap_unpin_all wouldn't be so obvious.
> >> I definitely did multiple VM (re)starts and hot (un)plugs.  But
> >> while my test workloads did some I/O, the long-running one was
> >> focused on the plug/unplug scenarios to recreate the initial issue
> >> so the I/O (and thus pinning) done would have been minimal.
> > 
> > That explains ccw/ap a bit but for PCI the iommu ownership wasn't
> > released so it becomes impossible to re-attach a container to the
> > group. eg a 2nd VM can never be started
> > 
> > Ah well, thanks!
> > 
> > Jason
> 
> Well, this bugged me enough that I traced the v1 series without fixup and vfio-pci on s390 was OK because it was still calling detach_container on vm shutdown via this chain:
> 
> vfio_pci_remove
>  vfio_pci_core_unregister_device
>   vfio_unregister_group_dev
>    vfio_device_remove_group
>     vfio_group_detach_container
> 
> I'd guess non-s390 vfio-pci would do the same.  Alex also had the mtty mdev, maybe that's relevant.

As long as you are unplugging a driver the v1 series would work. The
failure mode is when you don't unplug the driver and just run a VM
twice in a row.

Jason
Matthew Rosato Oct. 7, 2022, 2:46 p.m. UTC | #9
On 10/7/22 10:39 AM, Jason Gunthorpe wrote:
> On Fri, Oct 07, 2022 at 10:37:11AM -0400, Matthew Rosato wrote:
>> On 10/7/22 9:37 AM, Jason Gunthorpe wrote:
>>> On Thu, Oct 06, 2022 at 07:28:53PM -0400, Matthew Rosato wrote:
>>>
>>>>> Oh, I'm surprised the s390 testing didn't hit this!!
>>>>
>>>> Huh, me too, at least eventually - I think it's because we aren't
>>>> pinning everything upfront but rather on-demand so the missing the
>>>> type1 release / vfio_iommu_unmap_unpin_all wouldn't be so obvious.
>>>> I definitely did multiple VM (re)starts and hot (un)plugs.  But
>>>> while my test workloads did some I/O, the long-running one was
>>>> focused on the plug/unplug scenarios to recreate the initial issue
>>>> so the I/O (and thus pinning) done would have been minimal.
>>>
>>> That explains ccw/ap a bit but for PCI the iommu ownership wasn't
>>> released so it becomes impossible to re-attach a container to the
>>> group. eg a 2nd VM can never be started
>>>
>>> Ah well, thanks!
>>>
>>> Jason
>>
>> Well, this bugged me enough that I traced the v1 series without fixup and vfio-pci on s390 was OK because it was still calling detach_container on vm shutdown via this chain:
>>
>> vfio_pci_remove
>>  vfio_pci_core_unregister_device
>>   vfio_unregister_group_dev
>>    vfio_device_remove_group
>>     vfio_group_detach_container
>>
>> I'd guess non-s390 vfio-pci would do the same.  Alex also had the mtty mdev, maybe that's relevant.
> 
> As long as you are unplugging a driver the v1 series would work. The
> failure mode is when you don't unplug the driver and just run a VM
> twice in a row.
> 
> Jason

Oh, duh - and yep, all of my tests are using managed libvirt, so it's unbinding from vfio-pci back to the default host driver on VM shutdown.

OK, if I force the point and leave vfio-pci bound, the 2nd guest boot indeed fails to set up the container with the unmodified v1.  I'll try again with the new v2 now