diff mbox series

vfio: Get/put KVM only for the first/last vfio_df_open/close in cdev path

Message ID 20240628151845.22166-1-yan.y.zhao@intel.com (mailing list archive)
State New, archived
Headers show
Series vfio: Get/put KVM only for the first/last vfio_df_open/close in cdev path | expand

Commit Message

Yan Zhao June 28, 2024, 3:18 p.m. UTC
In the device cdev path, adjust the handling of the KVM reference count to
only increment with the first vfio_df_open() and decrement after the final
vfio_df_close(). This change addresses a KVM reference leak that occurs
when a device cdev file is opened multiple times and attempts to bind to
iommufd repeatedly.

Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open()
in the cdev path during iommufd binding. The corresponding
vfio_device_put_kvm() is executed either when iommufd is unbound or if an
error occurs during the binding process.

However, issues arise when a device binds to iommufd more than once. The
second vfio_df_open() will fail during iommufd binding, and
vfio_device_put_kvm() will be triggered, setting device->kvm to NULL.
Consequently, when iommufd is unbound from the first successfully bound
device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the
KVM reference count.

Below is the calltrace that will be produced in this scenario when the KVM
module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S):
Objects remaining in kvm_vcpu on __kmem_cache_shutdown()".

Call Trace:
 <TASK>
 dump_stack_lvl+0x80/0xc0
 slab_err+0xb0/0xf0
 ? __kmem_cache_shutdown+0xc1/0x4e0
 ? rcu_is_watching+0x11/0x50
 ? lock_acquired+0x144/0x3c0
 __kmem_cache_shutdown+0x1b7/0x4e0
 kmem_cache_destroy+0xa6/0x260
 kvm_exit+0x80/0xc0 [kvm]
 vmx_exit+0xe/0x20 [kvm_intel]
 __x64_sys_delete_module+0x143/0x250
 ? ktime_get_coarse_real_ts64+0xd3/0xe0
 ? syscall_trace_enter+0x143/0x210
 do_syscall_64+0x6f/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/device_cdev.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)


base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f

Comments

Yi Liu July 1, 2024, 5:08 a.m. UTC | #1
On 2024/6/28 23:18, Yan Zhao wrote:
> In the device cdev path, adjust the handling of the KVM reference count to
> only increment with the first vfio_df_open() and decrement after the final
> vfio_df_close(). This change addresses a KVM reference leak that occurs
> when a device cdev file is opened multiple times and attempts to bind to
> iommufd repeatedly.
> 
> Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open()
> in the cdev path during iommufd binding. The corresponding
> vfio_device_put_kvm() is executed either when iommufd is unbound or if an
> error occurs during the binding process.
> 
> However, issues arise when a device binds to iommufd more than once. The
> second vfio_df_open() will fail during iommufd binding, and
> vfio_device_put_kvm() will be triggered, setting device->kvm to NULL.
> Consequently, when iommufd is unbound from the first successfully bound
> device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the
> KVM reference count.

Good catch!!!

> Below is the calltrace that will be produced in this scenario when the KVM
> module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S):
> Objects remaining in kvm_vcpu on __kmem_cache_shutdown()".
> 
> Call Trace:
>   <TASK>
>   dump_stack_lvl+0x80/0xc0
>   slab_err+0xb0/0xf0
>   ? __kmem_cache_shutdown+0xc1/0x4e0
>   ? rcu_is_watching+0x11/0x50
>   ? lock_acquired+0x144/0x3c0
>   __kmem_cache_shutdown+0x1b7/0x4e0
>   kmem_cache_destroy+0xa6/0x260
>   kvm_exit+0x80/0xc0 [kvm]
>   vmx_exit+0xe/0x20 [kvm_intel]
>   __x64_sys_delete_module+0x143/0x250
>   ? ktime_get_coarse_real_ts64+0xd3/0xe0
>   ? syscall_trace_enter+0x143/0x210
>   do_syscall_64+0x6f/0x140
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD")
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>   drivers/vfio/device_cdev.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> index bb1817bd4ff3..3b85d01d1b27 100644
> --- a/drivers/vfio/device_cdev.c
> +++ b/drivers/vfio/device_cdev.c
> @@ -65,6 +65,7 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
>   {
>   	struct vfio_device *device = df->device;
>   	struct vfio_device_bind_iommufd bind;
> +	bool put_kvm = false;
>   	unsigned long minsz;
>   	int ret;
>   
> @@ -101,12 +102,15 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
>   	}
>   
>   	/*
> -	 * Before the device open, get the KVM pointer currently
> +	 * Before the device's first open, get the KVM pointer currently
>   	 * associated with the device file (if there is) and obtain
> -	 * a reference.  This reference is held until device closed.
> +	 * a reference.  This reference is held until device's last closed.
>   	 * Save the pointer in the device for use by drivers.
>   	 */
> -	vfio_df_get_kvm_safe(df);
> +	if (device->open_count == 0) {
> +		vfio_df_get_kvm_safe(df);
> +		put_kvm = true;
> +	}
>   
>   	ret = vfio_df_open(df);
>   	if (ret)
> @@ -129,7 +133,8 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
>   out_close_device:
>   	vfio_df_close(df);
>   out_put_kvm:
> -	vfio_device_put_kvm(device);
> +	if (put_kvm)

you may use if (device->open_count == 0) as well here to save a bool. 
Otherwise looks good to me.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>

> +		vfio_device_put_kvm(device);
>   	iommufd_ctx_put(df->iommufd);
>   	df->iommufd = NULL;
>   out_unlock:
> 
> base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f

BTW. The vfio_device_get_kvm_safe() is not supposed to be invoked multiple
times by design as the second call would override the device->put_kvm and
device->kvm. This does not change the put_kvm nor the kvm though. But not a
good thing anyghow. maybe worth a warn like below.

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index ee72c0b61795..a4bead0e5820 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -408,6 +408,8 @@ void vfio_device_get_kvm_safe(struct vfio_device 
*device, struct kvm *kvm)
  	if (!kvm)
  		return;

+	WARN_ON(device->put_kvm || device->kvm);
+
  	pfn = symbol_get(kvm_put_kvm);
  	if (WARN_ON(!pfn))
  		return;
Yan Zhao July 1, 2024, 5:42 a.m. UTC | #2
On Mon, Jul 01, 2024 at 01:08:30PM +0800, Yi Liu wrote:
> On 2024/6/28 23:18, Yan Zhao wrote:
> > In the device cdev path, adjust the handling of the KVM reference count to
> > only increment with the first vfio_df_open() and decrement after the final
> > vfio_df_close(). This change addresses a KVM reference leak that occurs
> > when a device cdev file is opened multiple times and attempts to bind to
> > iommufd repeatedly.
> > 
> > Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open()
> > in the cdev path during iommufd binding. The corresponding
> > vfio_device_put_kvm() is executed either when iommufd is unbound or if an
> > error occurs during the binding process.
> > 
> > However, issues arise when a device binds to iommufd more than once. The
> > second vfio_df_open() will fail during iommufd binding, and
> > vfio_device_put_kvm() will be triggered, setting device->kvm to NULL.
> > Consequently, when iommufd is unbound from the first successfully bound
> > device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the
> > KVM reference count.
> 
> Good catch!!!
> 
> > Below is the calltrace that will be produced in this scenario when the KVM
> > module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S):
> > Objects remaining in kvm_vcpu on __kmem_cache_shutdown()".
> > 
> > Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0x80/0xc0
> >   slab_err+0xb0/0xf0
> >   ? __kmem_cache_shutdown+0xc1/0x4e0
> >   ? rcu_is_watching+0x11/0x50
> >   ? lock_acquired+0x144/0x3c0
> >   __kmem_cache_shutdown+0x1b7/0x4e0
> >   kmem_cache_destroy+0xa6/0x260
> >   kvm_exit+0x80/0xc0 [kvm]
> >   vmx_exit+0xe/0x20 [kvm_intel]
> >   __x64_sys_delete_module+0x143/0x250
> >   ? ktime_get_coarse_real_ts64+0xd3/0xe0
> >   ? syscall_trace_enter+0x143/0x210
> >   do_syscall_64+0x6f/0x140
> >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > 
> > Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD")
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >   drivers/vfio/device_cdev.c | 13 +++++++++----
> >   1 file changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> > index bb1817bd4ff3..3b85d01d1b27 100644
> > --- a/drivers/vfio/device_cdev.c
> > +++ b/drivers/vfio/device_cdev.c
> > @@ -65,6 +65,7 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
> >   {
> >   	struct vfio_device *device = df->device;
> >   	struct vfio_device_bind_iommufd bind;
> > +	bool put_kvm = false;
> >   	unsigned long minsz;
> >   	int ret;
> > @@ -101,12 +102,15 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
> >   	}
> >   	/*
> > -	 * Before the device open, get the KVM pointer currently
> > +	 * Before the device's first open, get the KVM pointer currently
> >   	 * associated with the device file (if there is) and obtain
> > -	 * a reference.  This reference is held until device closed.
> > +	 * a reference.  This reference is held until device's last closed.
> >   	 * Save the pointer in the device for use by drivers.
> >   	 */
> > -	vfio_df_get_kvm_safe(df);
> > +	if (device->open_count == 0) {
> > +		vfio_df_get_kvm_safe(df);
> > +		put_kvm = true;
> > +	}
> >   	ret = vfio_df_open(df);
> >   	if (ret)
> > @@ -129,7 +133,8 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
> >   out_close_device:
> >   	vfio_df_close(df);
> >   out_put_kvm:
> > -	vfio_device_put_kvm(device);
> > +	if (put_kvm)
> 
> you may use if (device->open_count == 0) as well here to save a bool.
> Otherwise looks good to me.
Upon here, device->open_count is not necessarily 0 even for the first open.
The failure can be after a successful increment of device->open_count.

Maybe renaming "bool put_kvm" to "bool is_first_open" to save an assignment
in most common case?
 
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Thanks:)

> 
> > +		vfio_device_put_kvm(device);
> >   	iommufd_ctx_put(df->iommufd);
> >   	df->iommufd = NULL;
> >   out_unlock:
> > 
> > base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
> 
> BTW. The vfio_device_get_kvm_safe() is not supposed to be invoked multiple
> times by design as the second call would override the device->put_kvm and
> device->kvm. This does not change the put_kvm nor the kvm though. But not a
"kvm" may also be changed if the second bind is from a different VM.

> good thing anyghow. maybe worth a warn like below.
> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index ee72c0b61795..a4bead0e5820 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -408,6 +408,8 @@ void vfio_device_get_kvm_safe(struct vfio_device
> *device, struct kvm *kvm)
>  	if (!kvm)
>  		return;
> 
> +	WARN_ON(device->put_kvm || device->kvm);
Yes, better.

>  	pfn = symbol_get(kvm_put_kvm);
>  	if (WARN_ON(!pfn))
>  		return;
> 
> -- 
> Regards,
> Yi Liu
Yi Liu July 1, 2024, 6:01 a.m. UTC | #3
On 2024/7/1 13:42, Yan Zhao wrote:
> On Mon, Jul 01, 2024 at 01:08:30PM +0800, Yi Liu wrote:
>> On 2024/6/28 23:18, Yan Zhao wrote:
>>> In the device cdev path, adjust the handling of the KVM reference count to
>>> only increment with the first vfio_df_open() and decrement after the final
>>> vfio_df_close(). This change addresses a KVM reference leak that occurs
>>> when a device cdev file is opened multiple times and attempts to bind to
>>> iommufd repeatedly.
>>>
>>> Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open()
>>> in the cdev path during iommufd binding. The corresponding
>>> vfio_device_put_kvm() is executed either when iommufd is unbound or if an
>>> error occurs during the binding process.
>>>
>>> However, issues arise when a device binds to iommufd more than once. The
>>> second vfio_df_open() will fail during iommufd binding, and
>>> vfio_device_put_kvm() will be triggered, setting device->kvm to NULL.
>>> Consequently, when iommufd is unbound from the first successfully bound
>>> device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the
>>> KVM reference count.
>>
>> Good catch!!!
>>
>>> Below is the calltrace that will be produced in this scenario when the KVM
>>> module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S):
>>> Objects remaining in kvm_vcpu on __kmem_cache_shutdown()".
>>>
>>> Call Trace:
>>>    <TASK>
>>>    dump_stack_lvl+0x80/0xc0
>>>    slab_err+0xb0/0xf0
>>>    ? __kmem_cache_shutdown+0xc1/0x4e0
>>>    ? rcu_is_watching+0x11/0x50
>>>    ? lock_acquired+0x144/0x3c0
>>>    __kmem_cache_shutdown+0x1b7/0x4e0
>>>    kmem_cache_destroy+0xa6/0x260
>>>    kvm_exit+0x80/0xc0 [kvm]
>>>    vmx_exit+0xe/0x20 [kvm_intel]
>>>    __x64_sys_delete_module+0x143/0x250
>>>    ? ktime_get_coarse_real_ts64+0xd3/0xe0
>>>    ? syscall_trace_enter+0x143/0x210
>>>    do_syscall_64+0x6f/0x140
>>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD")
>>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>>> ---
>>>    drivers/vfio/device_cdev.c | 13 +++++++++----
>>>    1 file changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
>>> index bb1817bd4ff3..3b85d01d1b27 100644
>>> --- a/drivers/vfio/device_cdev.c
>>> +++ b/drivers/vfio/device_cdev.c
>>> @@ -65,6 +65,7 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
>>>    {
>>>    	struct vfio_device *device = df->device;
>>>    	struct vfio_device_bind_iommufd bind;
>>> +	bool put_kvm = false;
>>>    	unsigned long minsz;
>>>    	int ret;
>>> @@ -101,12 +102,15 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
>>>    	}
>>>    	/*
>>> -	 * Before the device open, get the KVM pointer currently
>>> +	 * Before the device's first open, get the KVM pointer currently
>>>    	 * associated with the device file (if there is) and obtain
>>> -	 * a reference.  This reference is held until device closed.
>>> +	 * a reference.  This reference is held until device's last closed.
>>>    	 * Save the pointer in the device for use by drivers.
>>>    	 */
>>> -	vfio_df_get_kvm_safe(df);
>>> +	if (device->open_count == 0) {
>>> +		vfio_df_get_kvm_safe(df);
>>> +		put_kvm = true;
>>> +	}
>>>    	ret = vfio_df_open(df);
>>>    	if (ret)
>>> @@ -129,7 +133,8 @@ long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
>>>    out_close_device:
>>>    	vfio_df_close(df);
>>>    out_put_kvm:
>>> -	vfio_device_put_kvm(device);
>>> +	if (put_kvm)
>>
>> you may use if (device->open_count == 0) as well here to save a bool.
>> Otherwise looks good to me.
> Upon here, device->open_count is not necessarily 0 even for the first open.
> The failure can be after a successful increment of device->open_count.
> 
> Maybe renaming "bool put_kvm" to "bool is_first_open" to save an assignment
> in most common case?
>   
>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> Thanks:)
> 
>>
>>> +		vfio_device_put_kvm(device);
>>>    	iommufd_ctx_put(df->iommufd);
>>>    	df->iommufd = NULL;
>>>    out_unlock:
>>>
>>> base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
>>
>> BTW. The vfio_device_get_kvm_safe() is not supposed to be invoked multiple
>> times by design as the second call would override the device->put_kvm and
>> device->kvm. This does not change the put_kvm nor the kvm though. But not a
> "kvm" may also be changed if the second bind is from a different VM.
> 

yep.

>> good thing anyghow. maybe worth a warn like below.
>>
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index ee72c0b61795..a4bead0e5820 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -408,6 +408,8 @@ void vfio_device_get_kvm_safe(struct vfio_device
>> *device, struct kvm *kvm)
>>   	if (!kvm)
>>   		return;
>>
>> +	WARN_ON(device->put_kvm || device->kvm);
> Yes, better.
> 
>>   	pfn = symbol_get(kvm_put_kvm);
>>   	if (WARN_ON(!pfn))
>>   		return;
>>
>> -- 
>> Regards,
>> Yi Liu
Tian, Kevin July 1, 2024, 8:43 a.m. UTC | #4
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Friday, June 28, 2024 11:19 PM
> 
> In the device cdev path, adjust the handling of the KVM reference count to
> only increment with the first vfio_df_open() and decrement after the final
> vfio_df_close(). This change addresses a KVM reference leak that occurs
> when a device cdev file is opened multiple times and attempts to bind to
> iommufd repeatedly.
> 
> Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open()
> in the cdev path during iommufd binding. The corresponding
> vfio_device_put_kvm() is executed either when iommufd is unbound or if an
> error occurs during the binding process.
> 
> However, issues arise when a device binds to iommufd more than once. The
> second vfio_df_open() will fail during iommufd binding, and
> vfio_device_put_kvm() will be triggered, setting device->kvm to NULL.
> Consequently, when iommufd is unbound from the first successfully bound
> device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the
> KVM reference count.

To be accurate this happens only when two binds are issued via different 
fds otherwise below check will happen earlier when two binds are in a
same fd:

	/* one device cannot be bound twice */
	if (df->access_granted) {
		ret = -EINVAL;
		goto out_unlock;
	}

> 
> Below is the calltrace that will be produced in this scenario when the KVM
> module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S):
> Objects remaining in kvm_vcpu on __kmem_cache_shutdown()".
> 
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x80/0xc0
>  slab_err+0xb0/0xf0
>  ? __kmem_cache_shutdown+0xc1/0x4e0
>  ? rcu_is_watching+0x11/0x50
>  ? lock_acquired+0x144/0x3c0
>  __kmem_cache_shutdown+0x1b7/0x4e0
>  kmem_cache_destroy+0xa6/0x260
>  kvm_exit+0x80/0xc0 [kvm]
>  vmx_exit+0xe/0x20 [kvm_intel]
>  __x64_sys_delete_module+0x143/0x250
>  ? ktime_get_coarse_real_ts64+0xd3/0xe0
>  ? syscall_trace_enter+0x143/0x210
>  do_syscall_64+0x6f/0x140
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD")
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/device_cdev.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> index bb1817bd4ff3..3b85d01d1b27 100644
> --- a/drivers/vfio/device_cdev.c
> +++ b/drivers/vfio/device_cdev.c
> @@ -65,6 +65,7 @@ long vfio_df_ioctl_bind_iommufd(struct
> vfio_device_file *df,
>  {
>  	struct vfio_device *device = df->device;
>  	struct vfio_device_bind_iommufd bind;
> +	bool put_kvm = false;
>  	unsigned long minsz;
>  	int ret;
> 
> @@ -101,12 +102,15 @@ long vfio_df_ioctl_bind_iommufd(struct
> vfio_device_file *df,
>  	}
> 
>  	/*
> -	 * Before the device open, get the KVM pointer currently
> +	 * Before the device's first open, get the KVM pointer currently
>  	 * associated with the device file (if there is) and obtain
> -	 * a reference.  This reference is held until device closed.
> +	 * a reference.  This reference is held until device's last closed.
>  	 * Save the pointer in the device for use by drivers.
>  	 */
> -	vfio_df_get_kvm_safe(df);
> +	if (device->open_count == 0) {
> +		vfio_df_get_kvm_safe(df);
> +		put_kvm = true;
> +	}
> 
>  	ret = vfio_df_open(df);
>  	if (ret)
> @@ -129,7 +133,8 @@ long vfio_df_ioctl_bind_iommufd(struct
> vfio_device_file *df,
>  out_close_device:
>  	vfio_df_close(df);
>  out_put_kvm:
> -	vfio_device_put_kvm(device);
> +	if (put_kvm)
> +		vfio_device_put_kvm(device);
>  	iommufd_ctx_put(df->iommufd);
>  	df->iommufd = NULL;
>  out_unlock:
> 

what about extending vfio_df_open() to unify the get/put_kvm()
and open_count trick in one place?

int vfio_df_open(struct vfio_device_file *df, struct kvm *kvm,
	spinlock_t *kvm_ref_lock)
{
	struct vfio_device *device = df->device;
	int ret = 0;
	
	lockdep_assert_held(&device->dev_set->lock);

	if (device->open_count == 0) {
		spin_lock(kvm_ref_lock);
		vfio_device_get_kvm_safe(device, kvm);
		spin_unlock(kvm_ref_lock);
	}

	/*
	 * Only the group path allows the device to be opened multiple
	 * times.  The device cdev path doesn't have a secure way for it.
	 */
	if (device->open_count != 0 && !df->group)
		return -EINVAL;

	device->open_count++;
	if (device->open_count == 1) {
		ret = vfio_df_device_first_open(df);
		if (ret)
			device->open_count--;
	}

	if (ret)
		vfio_device_put_kvm(device);
	return ret;
}

void vfio_df_close(struct vfio_device_file *df)
{
 	struct vfio_device *device = df->device;

	lockdep_assert_held(&device->dev_set->lock);

	vfio_assert_device_open(device);
	if (device->open_count == 1) {
		vfio_df_device_last_close(df);
		vfio_device_put_kvm(device);
	}
	device->open_count--;
}
Yi Liu July 1, 2024, 10:30 a.m. UTC | #5
On 2024/7/1 16:43, Tian, Kevin wrote:
>> From: Zhao, Yan Y <yan.y.zhao@intel.com>
>> Sent: Friday, June 28, 2024 11:19 PM
>>
>> In the device cdev path, adjust the handling of the KVM reference count to
>> only increment with the first vfio_df_open() and decrement after the final
>> vfio_df_close(). This change addresses a KVM reference leak that occurs
>> when a device cdev file is opened multiple times and attempts to bind to
>> iommufd repeatedly.
>>
>> Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open()
>> in the cdev path during iommufd binding. The corresponding
>> vfio_device_put_kvm() is executed either when iommufd is unbound or if an
>> error occurs during the binding process.
>>
>> However, issues arise when a device binds to iommufd more than once. The
>> second vfio_df_open() will fail during iommufd binding, and
>> vfio_device_put_kvm() will be triggered, setting device->kvm to NULL.
>> Consequently, when iommufd is unbound from the first successfully bound
>> device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the
>> KVM reference count.
> 
> To be accurate this happens only when two binds are issued via different
> fds otherwise below check will happen earlier when two binds are in a
> same fd:
> 
> 	/* one device cannot be bound twice */
> 	if (df->access_granted) {
> 		ret = -EINVAL;
> 		goto out_unlock;
> 	}

yes

>>
>> Below is the calltrace that will be produced in this scenario when the KVM
>> module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S):
>> Objects remaining in kvm_vcpu on __kmem_cache_shutdown()".
>>
>> Call Trace:
>>   <TASK>
>>   dump_stack_lvl+0x80/0xc0
>>   slab_err+0xb0/0xf0
>>   ? __kmem_cache_shutdown+0xc1/0x4e0
>>   ? rcu_is_watching+0x11/0x50
>>   ? lock_acquired+0x144/0x3c0
>>   __kmem_cache_shutdown+0x1b7/0x4e0
>>   kmem_cache_destroy+0xa6/0x260
>>   kvm_exit+0x80/0xc0 [kvm]
>>   vmx_exit+0xe/0x20 [kvm_intel]
>>   __x64_sys_delete_module+0x143/0x250
>>   ? ktime_get_coarse_real_ts64+0xd3/0xe0
>>   ? syscall_trace_enter+0x143/0x210
>>   do_syscall_64+0x6f/0x140
>>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD")
>> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>> ---
>>   drivers/vfio/device_cdev.c | 13 +++++++++----
>>   1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
>> index bb1817bd4ff3..3b85d01d1b27 100644
>> --- a/drivers/vfio/device_cdev.c
>> +++ b/drivers/vfio/device_cdev.c
>> @@ -65,6 +65,7 @@ long vfio_df_ioctl_bind_iommufd(struct
>> vfio_device_file *df,
>>   {
>>   	struct vfio_device *device = df->device;
>>   	struct vfio_device_bind_iommufd bind;
>> +	bool put_kvm = false;
>>   	unsigned long minsz;
>>   	int ret;
>>
>> @@ -101,12 +102,15 @@ long vfio_df_ioctl_bind_iommufd(struct
>> vfio_device_file *df,
>>   	}
>>
>>   	/*
>> -	 * Before the device open, get the KVM pointer currently
>> +	 * Before the device's first open, get the KVM pointer currently
>>   	 * associated with the device file (if there is) and obtain
>> -	 * a reference.  This reference is held until device closed.
>> +	 * a reference.  This reference is held until device's last closed.
>>   	 * Save the pointer in the device for use by drivers.
>>   	 */
>> -	vfio_df_get_kvm_safe(df);
>> +	if (device->open_count == 0) {
>> +		vfio_df_get_kvm_safe(df);
>> +		put_kvm = true;
>> +	}
>>
>>   	ret = vfio_df_open(df);
>>   	if (ret)
>> @@ -129,7 +133,8 @@ long vfio_df_ioctl_bind_iommufd(struct
>> vfio_device_file *df,
>>   out_close_device:
>>   	vfio_df_close(df);
>>   out_put_kvm:
>> -	vfio_device_put_kvm(device);
>> +	if (put_kvm)
>> +		vfio_device_put_kvm(device);
>>   	iommufd_ctx_put(df->iommufd);
>>   	df->iommufd = NULL;
>>   out_unlock:
>>
> 
> what about extending vfio_df_open() to unify the get/put_kvm()
> and open_count trick in one place?
> 
> int vfio_df_open(struct vfio_device_file *df, struct kvm *kvm,
> 	spinlock_t *kvm_ref_lock)
> {

this should work. But need a comment to note why need pass in both kvm
and kvm_ref_lock given df has both of them. :)

> 	struct vfio_device *device = df->device;
> 	int ret = 0;
> 	
> 	lockdep_assert_held(&device->dev_set->lock);
> 
> 	if (device->open_count == 0) {
> 		spin_lock(kvm_ref_lock);
> 		vfio_device_get_kvm_safe(device, kvm);
> 		spin_unlock(kvm_ref_lock);
> 	}

perhaps it can be put in the "if (device->open_count == 1)" branch, just
before invoking vfio_df_device_first_open().

> 
> 	/*
> 	 * Only the group path allows the device to be opened multiple
> 	 * times.  The device cdev path doesn't have a secure way for it.
> 	 */
> 	if (device->open_count != 0 && !df->group)
> 		return -EINVAL;
> 
> 	device->open_count++;
> 	if (device->open_count == 1) {
> 		ret = vfio_df_device_first_open(df);
> 		if (ret)
> 			device->open_count--;
> 	}
> 
> 	if (ret)
> 		vfio_device_put_kvm(device);
> 	return ret;
> }
> 
> void vfio_df_close(struct vfio_device_file *df)
> {
>   	struct vfio_device *device = df->device;
> 
> 	lockdep_assert_held(&device->dev_set->lock);
> 
> 	vfio_assert_device_open(device);
> 	if (device->open_count == 1) {
> 		vfio_df_device_last_close(df);
> 		vfio_device_put_kvm(device);
> 	}
> 	device->open_count--;
> }
Yan Zhao July 1, 2024, 12:02 p.m. UTC | #6
On Mon, Jul 01, 2024 at 06:30:05PM +0800, Yi Liu wrote:
> On 2024/7/1 16:43, Tian, Kevin wrote:
> > > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > > Sent: Friday, June 28, 2024 11:19 PM
> > > 
> > > In the device cdev path, adjust the handling of the KVM reference count to
> > > only increment with the first vfio_df_open() and decrement after the final
> > > vfio_df_close(). This change addresses a KVM reference leak that occurs
> > > when a device cdev file is opened multiple times and attempts to bind to
> > > iommufd repeatedly.
> > > 
> > > Currently, vfio_df_get_kvm_safe() is invoked prior to each vfio_df_open()
> > > in the cdev path during iommufd binding. The corresponding
> > > vfio_device_put_kvm() is executed either when iommufd is unbound or if an
> > > error occurs during the binding process.
> > > 
> > > However, issues arise when a device binds to iommufd more than once. The
> > > second vfio_df_open() will fail during iommufd binding, and
> > > vfio_device_put_kvm() will be triggered, setting device->kvm to NULL.
> > > Consequently, when iommufd is unbound from the first successfully bound
> > > device, vfio_device_put_kvm() becomes ineffective, leading to a leak in the
> > > KVM reference count.
> > 
> > To be accurate this happens only when two binds are issued via different
> > fds otherwise below check will happen earlier when two binds are in a
> > same fd:
> > 
> > 	/* one device cannot be bound twice */
> > 	if (df->access_granted) {
> > 		ret = -EINVAL;
> > 		goto out_unlock;
> > 	}
> 
> yes
> 
> > > 
> > > Below is the calltrace that will be produced in this scenario when the KVM
> > > module is unloaded afterwards, reporting "BUG kvm_vcpu (Tainted: G S):
> > > Objects remaining in kvm_vcpu on __kmem_cache_shutdown()".
> > > 
> > > Call Trace:
> > >   <TASK>
> > >   dump_stack_lvl+0x80/0xc0
> > >   slab_err+0xb0/0xf0
> > >   ? __kmem_cache_shutdown+0xc1/0x4e0
> > >   ? rcu_is_watching+0x11/0x50
> > >   ? lock_acquired+0x144/0x3c0
> > >   __kmem_cache_shutdown+0x1b7/0x4e0
> > >   kmem_cache_destroy+0xa6/0x260
> > >   kvm_exit+0x80/0xc0 [kvm]
> > >   vmx_exit+0xe/0x20 [kvm_intel]
> > >   __x64_sys_delete_module+0x143/0x250
> > >   ? ktime_get_coarse_real_ts64+0xd3/0xe0
> > >   ? syscall_trace_enter+0x143/0x210
> > >   do_syscall_64+0x6f/0x140
> > >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > 
> > > Fixes: 5fcc26969a16 ("vfio: Add VFIO_DEVICE_BIND_IOMMUFD")
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >   drivers/vfio/device_cdev.c | 13 +++++++++----
> > >   1 file changed, 9 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> > > index bb1817bd4ff3..3b85d01d1b27 100644
> > > --- a/drivers/vfio/device_cdev.c
> > > +++ b/drivers/vfio/device_cdev.c
> > > @@ -65,6 +65,7 @@ long vfio_df_ioctl_bind_iommufd(struct
> > > vfio_device_file *df,
> > >   {
> > >   	struct vfio_device *device = df->device;
> > >   	struct vfio_device_bind_iommufd bind;
> > > +	bool put_kvm = false;
> > >   	unsigned long minsz;
> > >   	int ret;
> > > 
> > > @@ -101,12 +102,15 @@ long vfio_df_ioctl_bind_iommufd(struct
> > > vfio_device_file *df,
> > >   	}
> > > 
> > >   	/*
> > > -	 * Before the device open, get the KVM pointer currently
> > > +	 * Before the device's first open, get the KVM pointer currently
> > >   	 * associated with the device file (if there is) and obtain
> > > -	 * a reference.  This reference is held until device closed.
> > > +	 * a reference.  This reference is held until device's last closed.
> > >   	 * Save the pointer in the device for use by drivers.
> > >   	 */
> > > -	vfio_df_get_kvm_safe(df);
> > > +	if (device->open_count == 0) {
> > > +		vfio_df_get_kvm_safe(df);
> > > +		put_kvm = true;
> > > +	}
> > > 
> > >   	ret = vfio_df_open(df);
> > >   	if (ret)
> > > @@ -129,7 +133,8 @@ long vfio_df_ioctl_bind_iommufd(struct
> > > vfio_device_file *df,
> > >   out_close_device:
> > >   	vfio_df_close(df);
> > >   out_put_kvm:
> > > -	vfio_device_put_kvm(device);
> > > +	if (put_kvm)
> > > +		vfio_device_put_kvm(device);
> > >   	iommufd_ctx_put(df->iommufd);
> > >   	df->iommufd = NULL;
> > >   out_unlock:
> > > 
> > 
> > what about extending vfio_df_open() to unify the get/put_kvm()
> > and open_count trick in one place?
> > 
> > int vfio_df_open(struct vfio_device_file *df, struct kvm *kvm,
> > 	spinlock_t *kvm_ref_lock)
> > {
> 
> this should work. But need a comment to note why need pass in both kvm
> and kvm_ref_lock given df has both of them. :)
So why to pass them?

What about making vfio_device_group_get_kvm_safe() or vfio_df_get_kvm_safe()
not static and calling one of them in vfio_df_open() according to the df->group
?

> 
> > 	struct vfio_device *device = df->device;
> > 	int ret = 0;
> > 	
> > 	lockdep_assert_held(&device->dev_set->lock);
> > 
> > 	if (device->open_count == 0) {
> > 		spin_lock(kvm_ref_lock);
> > 		vfio_device_get_kvm_safe(device, kvm);
> > 		spin_unlock(kvm_ref_lock);
> > 	}
> 
> perhaps it can be put in the "if (device->open_count == 1)" branch, just
> before invoking vfio_df_device_first_open().
What about just moving the get/put_kvm into vfio_df_device_first_open()?

> > 
> > 	/*
> > 	 * Only the group path allows the device to be opened multiple
> > 	 * times.  The device cdev path doesn't have a secure way for it.
> > 	 */
> > 	if (device->open_count != 0 && !df->group)
> > 		return -EINVAL;
> > 
> > 	device->open_count++;
> > 	if (device->open_count == 1) {
> > 		ret = vfio_df_device_first_open(df);
> > 		if (ret)
> > 			device->open_count--;
> > 	}
> > 
> > 	if (ret)
> > 		vfio_device_put_kvm(device);
> > 	return ret;
> > }
> > 
> > void vfio_df_close(struct vfio_device_file *df)
> > {
> >   	struct vfio_device *device = df->device;
> > 
> > 	lockdep_assert_held(&device->dev_set->lock);
> > 
> > 	vfio_assert_device_open(device);
> > 	if (device->open_count == 1) {
> > 		vfio_df_device_last_close(df);
> > 		vfio_device_put_kvm(device);
> > 	}
> > 	device->open_count--;
> > }
> 
> -- 
> Regards,
> Yi Liu
Tian, Kevin July 2, 2024, 12:17 a.m. UTC | #7
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Monday, July 1, 2024 8:02 PM
> 
> On Mon, Jul 01, 2024 at 06:30:05PM +0800, Yi Liu wrote:
> > On 2024/7/1 16:43, Tian, Kevin wrote:
> > >
> > > what about extending vfio_df_open() to unify the get/put_kvm()
> > > and open_count trick in one place?
> > >
> > > int vfio_df_open(struct vfio_device_file *df, struct kvm *kvm,
> > > 	spinlock_t *kvm_ref_lock)
> > > {
> >
> > this should work. But need a comment to note why need pass in both kvm
> > and kvm_ref_lock given df has both of them. :)
> So why to pass them?

hmm actually passing them is wrong especially for the group path.
We have to get kvm upon the first reference to the pointer otherwise
it is prone to use-after-free issue.

> 
> What about making vfio_device_group_get_kvm_safe() or
> vfio_df_get_kvm_safe()
> not static and calling one of them in vfio_df_open() according to the df-
> >group
> ?
> 

yeah, this sounds better.
diff mbox series

Patch

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index bb1817bd4ff3..3b85d01d1b27 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -65,6 +65,7 @@  long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
 {
 	struct vfio_device *device = df->device;
 	struct vfio_device_bind_iommufd bind;
+	bool put_kvm = false;
 	unsigned long minsz;
 	int ret;
 
@@ -101,12 +102,15 @@  long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
 	}
 
 	/*
-	 * Before the device open, get the KVM pointer currently
+	 * Before the device's first open, get the KVM pointer currently
 	 * associated with the device file (if there is) and obtain
-	 * a reference.  This reference is held until device closed.
+	 * a reference.  This reference is held until device's last closed.
 	 * Save the pointer in the device for use by drivers.
 	 */
-	vfio_df_get_kvm_safe(df);
+	if (device->open_count == 0) {
+		vfio_df_get_kvm_safe(df);
+		put_kvm = true;
+	}
 
 	ret = vfio_df_open(df);
 	if (ret)
@@ -129,7 +133,8 @@  long vfio_df_ioctl_bind_iommufd(struct vfio_device_file *df,
 out_close_device:
 	vfio_df_close(df);
 out_put_kvm:
-	vfio_device_put_kvm(device);
+	if (put_kvm)
+		vfio_device_put_kvm(device);
 	iommufd_ctx_put(df->iommufd);
 	df->iommufd = NULL;
 out_unlock: