[v3,3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT

Message ID	20240912131729.14951-4-yi.l.liu@intel.com (mailing list archive)
State	New
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1E63B1AD278; Thu, 12 Sep 2024 13:17:39 +0000 (UTC) From: Yi Liu <yi.l.liu@intel.com> To: joro@8bytes.org, jgg@nvidia.com, kevin.tian@intel.com, baolu.lu@linux.intel.com Cc: alex.williamson@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org, chao.p.peng@linux.intel.com, yi.l.liu@intel.com, iommu@lists.linux.dev, zhenzhong.duan@intel.com, linux-kselftest@vger.kernel.org, vasant.hegde@amd.com Subject: [PATCH v3 3/4] vfio: Add VFIO_DEVICE_PASID_[AT\|DE]TACH_IOMMUFD_PT Date: Thu, 12 Sep 2024 06:17:28 -0700 Message-Id: <20240912131729.14951-4-yi.l.liu@intel.com> In-Reply-To: <20240912131729.14951-1-yi.l.liu@intel.com> References: <20240912131729.14951-1-yi.l.liu@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	vfio-pci support pasid attach/detach \| expand [v3,0/4] vfio-pci support pasid attach/detach [v3,1/4] ida: Add ida_find_first_range() [v3,2/4] vfio-iommufd: Support pasid [at\|de]tach for physical VFIO devices [v3,3/4] vfio: Add VFIO_DEVICE_PASID_[AT\|DE]TACH_IOMMUFD_PT [v3,4/4] iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability

Yi Liu Sept. 12, 2024, 1:17 p.m. UTC

This adds ioctls for the userspace to attach/detach a given pasid of a
vfio device to/from an IOAS/HWPT.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/device_cdev.c | 51 +++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio.h        |  4 +++
 drivers/vfio/vfio_main.c   |  8 ++++++
 include/uapi/linux/vfio.h  | 55 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 118 insertions(+)

Jason Gunthorpe Sept. 26, 2024, 7:16 p.m. UTC | #1

On Thu, Sep 12, 2024 at 06:17:28AM -0700, Yi Liu wrote:
> This adds ioctls for the userspace to attach/detach a given pasid of a
> vfio device to/from an IOAS/HWPT.
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/device_cdev.c | 51 +++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio.h        |  4 +++
>  drivers/vfio/vfio_main.c   |  8 ++++++
>  include/uapi/linux/vfio.h  | 55 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 118 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

Tian, Kevin Sept. 30, 2024, 7:55 a.m. UTC | #2

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, September 12, 2024 9:17 PM
> 
> +/*
> + * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE,
> VFIO_BASE + 21,
> + *					      struct
> vfio_device_pasid_attach_iommufd_pt)
> + * @argsz:	User filled size of this data.
> + * @flags:	Must be 0.
> + * @pasid:	The pasid to be attached.
> + * @pt_id:	Input the target id which can represent an ioas or a hwpt
> + *		allocated via iommufd subsystem.
> + *		Output the input ioas id or the attached hwpt id which could
> + *		be the specified hwpt itself or a hwpt automatically created
> + *		for the specified ioas by kernel during the attachment.
> + *
> + * Associate a pasid with an address space within the bound iommufd.
> Undo by
> + * VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT or device fd close. This is
> only allowed
> + * on cdev fds.
> + *
> + * If a pasid is currently attached to a valid hwpt, without doing a
> + * VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT, a second
> + * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT ioctl passing in another
> hwpt id is
> + * allowed. This action, also known as a hwpt replacement, will replace the
> + * pasid's currently attached hwpt with a new hwpt corresponding to the
> given
> + * @pt_id.
> + *
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_device_pasid_attach_iommufd_pt {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u32	pasid;
> +	__u32	pt_id;
> +};
> +
> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
> VFIO_BASE + 21)

Not sure whether this was discussed before. Does it make sense
to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
by introducing a new pasid field and a new flag bit?

Jason Gunthorpe Oct. 1, 2024, 12:11 p.m. UTC | #3

On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:

> > +struct vfio_device_pasid_attach_iommufd_pt {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +	__u32	pasid;
> > +	__u32	pt_id;
> > +};
> > +
> > +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
> > VFIO_BASE + 21)
> 
> Not sure whether this was discussed before. Does it make sense
> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
> by introducing a new pasid field and a new flag bit?

Maybe? I don't have a strong feeling either way.

There is somewhat less code if you reuse the ioctl at least

Jason

Yi Liu Oct. 12, 2024, 1:49 p.m. UTC | #4

On 2024/10/1 20:11, Jason Gunthorpe wrote:
> On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
> 
>>> +struct vfio_device_pasid_attach_iommufd_pt {
>>> +	__u32	argsz;
>>> +	__u32	flags;
>>> +	__u32	pasid;
>>> +	__u32	pt_id;
>>> +};
>>> +
>>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
>>> VFIO_BASE + 21)
>>
>> Not sure whether this was discussed before. Does it make sense
>> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
>> by introducing a new pasid field and a new flag bit?
> 
> Maybe? I don't have a strong feeling either way.
> 
> There is somewhat less code if you reuse the ioctl at least

I had a rough memory that I was suggested to add a separate ioctl for
PASID. Let's see Alex's opinion.

@Alex.

Alex Williamson Oct. 14, 2024, 3:49 p.m. UTC | #5

On Sat, 12 Oct 2024 21:49:05 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/10/1 20:11, Jason Gunthorpe wrote:
> > On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
> >   
> >>> +struct vfio_device_pasid_attach_iommufd_pt {
> >>> +	__u32	argsz;
> >>> +	__u32	flags;
> >>> +	__u32	pasid;
> >>> +	__u32	pt_id;
> >>> +};
> >>> +
> >>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
> >>> VFIO_BASE + 21)  
> >>
> >> Not sure whether this was discussed before. Does it make sense
> >> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
> >> by introducing a new pasid field and a new flag bit?  
> > 
> > Maybe? I don't have a strong feeling either way.
> > 
> > There is somewhat less code if you reuse the ioctl at least  
> 
> I had a rough memory that I was suggested to add a separate ioctl for
> PASID. Let's see Alex's opinion.

I don't recall any previous arguments for separate ioctls, but it seems
to make a lot of sense to me to extend the existing ioctls with a flag
to indicate pasid cscope and id.  Thanks,

Alex

Yi Liu Oct. 15, 2024, 11:07 a.m. UTC | #6

On 2024/10/14 23:49, Alex Williamson wrote:
> On Sat, 12 Oct 2024 21:49:05 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/10/1 20:11, Jason Gunthorpe wrote:
>>> On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
>>>    
>>>>> +struct vfio_device_pasid_attach_iommufd_pt {
>>>>> +	__u32	argsz;
>>>>> +	__u32	flags;
>>>>> +	__u32	pasid;
>>>>> +	__u32	pt_id;
>>>>> +};
>>>>> +
>>>>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
>>>>> VFIO_BASE + 21)
>>>>
>>>> Not sure whether this was discussed before. Does it make sense
>>>> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
>>>> by introducing a new pasid field and a new flag bit?
>>>
>>> Maybe? I don't have a strong feeling either way.
>>>
>>> There is somewhat less code if you reuse the ioctl at least
>>
>> I had a rough memory that I was suggested to add a separate ioctl for
>> PASID. Let's see Alex's opinion.
> 
> I don't recall any previous arguments for separate ioctls, but it seems
> to make a lot of sense to me to extend the existing ioctls with a flag
> to indicate pasid cscope and id.  Thanks,

thanks for the confirmation. How about the below?

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index bb1817bd4ff3..c78533bce3c6 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df)
  int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
  			    struct vfio_device_attach_iommufd_pt __user *arg)
  {
-	struct vfio_device *device = df->device;
+	unsigned long minsz = offsetofend(
+			struct vfio_device_attach_iommufd_pt, pt_id);
  	struct vfio_device_attach_iommufd_pt attach;
-	unsigned long minsz;
+	struct vfio_device *device = df->device;
+	u32 user_size;
  	int ret;

-	minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
+	ret = get_user(user_size, (u32 __user *)arg);
+	if (ret)
+		return ret;

-	if (copy_from_user(&attach, arg, minsz))
-		return -EFAULT;
+	ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
+	if (ret)
+		return ret;

-	if (attach.argsz < minsz || attach.flags)
+	if (attach.argsz < minsz || attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
  		return -EINVAL;

+	if ((attach.flags & VFIO_DEVICE_ATTACH_PASID) &&
+	    !device->ops->pasid_attach_ioas)
+		return -EOPNOTSUPP;
+
  	mutex_lock(&device->dev_set->lock);
-	ret = device->ops->attach_ioas(device, &attach.pt_id);
+	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
+		ret = device->ops->pasid_attach_ioas(device, attach.pasid,
+						     &attach.pt_id);
+	else
+		ret = device->ops->attach_ioas(device, &attach.pt_id);
  	if (ret)
  		goto out_unlock;

@@ -198,20 +211,33 @@ int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
  int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
  			    struct vfio_device_detach_iommufd_pt __user *arg)
  {
-	struct vfio_device *device = df->device;
+	unsigned long minsz =
+		offsetofend(struct vfio_device_detach_iommufd_pt, flags);
  	struct vfio_device_detach_iommufd_pt detach;
-	unsigned long minsz;
+	struct vfio_device *device = df->device;
+	u32 user_size;
+	int ret;

-	minsz = offsetofend(struct vfio_device_detach_iommufd_pt, flags);
+	ret = get_user(user_size, (u32 __user *)arg);
+	if (ret)
+		return ret;

-	if (copy_from_user(&detach, arg, minsz))
-		return -EFAULT;
+	ret = copy_struct_from_user(&detach, sizeof(detach), arg, user_size);
+	if (ret)
+		return ret;

-	if (detach.argsz < minsz || detach.flags)
+	if (detach.argsz < minsz || detach.flags & (~VFIO_DEVICE_DETACH_PASID))
  		return -EINVAL;

+	if ((detach.flags & VFIO_DEVICE_DETACH_PASID) &&
+	    !device->ops->pasid_detach_ioas)
+		return -EOPNOTSUPP;
+
  	mutex_lock(&device->dev_set->lock);
-	device->ops->detach_ioas(device);
+	if (detach.flags & VFIO_DEVICE_DETACH_PASID)
+		device->ops->pasid_detach_ioas(device, detach.pasid);
+	else
+		device->ops->detach_ioas(device);
  	mutex_unlock(&device->dev_set->lock);

  	return 0;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf190..40b414e642f5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -931,29 +931,34 @@ struct vfio_device_bind_iommufd {
   * VFIO_DEVICE_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 19,
   *					struct vfio_device_attach_iommufd_pt)
   * @argsz:	User filled size of this data.
- * @flags:	Must be 0.
+ * @flags:	Flags for attach.
   * @pt_id:	Input the target id which can represent an ioas or a hwpt
   *		allocated via iommufd subsystem.
   *		Output the input ioas id or the attached hwpt id which could
   *		be the specified hwpt itself or a hwpt automatically created
   *		for the specified ioas by kernel during the attachment.
+ * @pasid:	The pasid to be attached, only meaningful when
+ *		VFIO_DEVICE_ATTACH_PASID is set in @flags
   *
   * Associate the device with an address space within the bound iommufd.
   * Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close.  This is only
   * allowed on cdev fds.
   *
- * If a vfio device is currently attached to a valid hw_pagetable, without 
doing
- * a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second VFIO_DEVICE_ATTACH_IOMMUFD_PT 
ioctl
- * passing in another hw_pagetable (hwpt) id is allowed. This action, also 
known
- * as a hw_pagetable replacement, will replace the device's currently attached
- * hw_pagetable with a new hw_pagetable corresponding to the given pt_id.
+ * If a vfio device or a pasid of this device is currently attached to a valid
+ * hw_pagetable (hwpt), without doing a VFIO_DEVICE_DETACH_IOMMUFD_PT, a 
second
+ * VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl passing in another hwpt id is allowed.
+ * This action, also known as a hw_pagetable replacement, will replace the
+ * currently attached hwpt of the device or the pasid of this device with 
a new
+ * hwpt corresponding to the given pt_id.
   *
   * Return: 0 on success, -errno on failure.
   */
  struct vfio_device_attach_iommufd_pt {
  	__u32	argsz;
  	__u32	flags;
+#define VFIO_DEVICE_ATTACH_PASID	(1 << 0)
  	__u32	pt_id;
+	__u32	pasid;
  };

  #define VFIO_DEVICE_ATTACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 19)
@@ -962,17 +967,21 @@ struct vfio_device_attach_iommufd_pt {
   * VFIO_DEVICE_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 20,
   *					struct vfio_device_detach_iommufd_pt)
   * @argsz:	User filled size of this data.
- * @flags:	Must be 0.
+ * @flags:	Flags for detach.
+ * @pasid:	The pasid to be detached, only meaningful when
+ *		VFIO_DEVICE_DETACH_PASID is set in @flags
   *
- * Remove the association of the device and its current associated address
- * space.  After it, the device should be in a blocking DMA state.  This 
is only
- * allowed on cdev fds.
+ * Remove the association of the device or a pasid of the device and its 
current
+ * associated address space.  After it, the device or the pasid should be in a
+ * blocking DMA state.  This is only allowed on cdev fds.
   *
   * Return: 0 on success, -errno on failure.
   */
  struct vfio_device_detach_iommufd_pt {
  	__u32	argsz;
  	__u32	flags;
+#define VFIO_DEVICE_DETACH_PASID	(1 << 0)
+	__u32	pasid;
  };

  #define VFIO_DEVICE_DETACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 20)

Alex Williamson Oct. 15, 2024, 4:22 p.m. UTC | #7

On Tue, 15 Oct 2024 19:07:52 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/10/14 23:49, Alex Williamson wrote:
> > On Sat, 12 Oct 2024 21:49:05 +0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> >> On 2024/10/1 20:11, Jason Gunthorpe wrote:  
> >>> On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
> >>>      
> >>>>> +struct vfio_device_pasid_attach_iommufd_pt {
> >>>>> +	__u32	argsz;
> >>>>> +	__u32	flags;
> >>>>> +	__u32	pasid;
> >>>>> +	__u32	pt_id;
> >>>>> +};
> >>>>> +
> >>>>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
> >>>>> VFIO_BASE + 21)  
> >>>>
> >>>> Not sure whether this was discussed before. Does it make sense
> >>>> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
> >>>> by introducing a new pasid field and a new flag bit?  
> >>>
> >>> Maybe? I don't have a strong feeling either way.
> >>>
> >>> There is somewhat less code if you reuse the ioctl at least  
> >>
> >> I had a rough memory that I was suggested to add a separate ioctl for
> >> PASID. Let's see Alex's opinion.  
> > 
> > I don't recall any previous arguments for separate ioctls, but it seems
> > to make a lot of sense to me to extend the existing ioctls with a flag
> > to indicate pasid cscope and id.  Thanks,  
> 
> thanks for the confirmation. How about the below?
> 
> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> index bb1817bd4ff3..c78533bce3c6 100644
> --- a/drivers/vfio/device_cdev.c
> +++ b/drivers/vfio/device_cdev.c
> @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df)
>   int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
>   			    struct vfio_device_attach_iommufd_pt __user *arg)
>   {
> -	struct vfio_device *device = df->device;
> +	unsigned long minsz = offsetofend(
> +			struct vfio_device_attach_iommufd_pt, pt_id);
>   	struct vfio_device_attach_iommufd_pt attach;
> -	unsigned long minsz;
> +	struct vfio_device *device = df->device;
> +	u32 user_size;
>   	int ret;
> 
> -	minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
> +	ret = get_user(user_size, (u32 __user *)arg);
> +	if (ret)
> +		return ret;
> 
> -	if (copy_from_user(&attach, arg, minsz))
> -		return -EFAULT;
> +	ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
> +	if (ret)
> +		return ret;

I think this could break current users.  For better or worse, we don't
currently have any requirements for the remainder of the user buffer,
whereas copy_struct_from_user() returns an error for non-zero trailing
bytes.  I think we need to monotonically increase the structure size,
but maybe something more like below, using flags.  The expectation
would be that if we add another flag that extends the structure, we'd
test that flag after PASID and clobber xend to a new value further into
the new structure.  We'd also add that flag to the flags mask, but we'd
share the copy code.

	if (attach.argsz < minsz)
		return -EINVAL;

	if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
		return -EINVAL;

	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
		xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);

	if (xend) {
		if (attach.argsz < xend)
			return -EINVAL;
	
		if (copy_from_user((void *)&attach + minsz,
				    (void __user *)arg + minsz, xend - minsz))
			return -EFAULT;
	}

Maybe there are still more elegant options available.

We also generally try to label flags with FLAGS in the name, but it
does get rather unwieldy, so I'm open to suggestions.  Thanks,

Alex

> 
> -	if (attach.argsz < minsz || attach.flags)
> +	if (attach.argsz < minsz || attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
>   		return -EINVAL;
> 
> +	if ((attach.flags & VFIO_DEVICE_ATTACH_PASID) &&
> +	    !device->ops->pasid_attach_ioas)
> +		return -EOPNOTSUPP;
> +
>   	mutex_lock(&device->dev_set->lock);
> -	ret = device->ops->attach_ioas(device, &attach.pt_id);
> +	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
> +		ret = device->ops->pasid_attach_ioas(device, attach.pasid,
> +						     &attach.pt_id);
> +	else
> +		ret = device->ops->attach_ioas(device, &attach.pt_id);
>   	if (ret)
>   		goto out_unlock;
> 
> @@ -198,20 +211,33 @@ int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
>   int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
>   			    struct vfio_device_detach_iommufd_pt __user *arg)
>   {
> -	struct vfio_device *device = df->device;
> +	unsigned long minsz =
> +		offsetofend(struct vfio_device_detach_iommufd_pt, flags);
>   	struct vfio_device_detach_iommufd_pt detach;
> -	unsigned long minsz;
> +	struct vfio_device *device = df->device;
> +	u32 user_size;
> +	int ret;
> 
> -	minsz = offsetofend(struct vfio_device_detach_iommufd_pt, flags);
> +	ret = get_user(user_size, (u32 __user *)arg);
> +	if (ret)
> +		return ret;
> 
> -	if (copy_from_user(&detach, arg, minsz))
> -		return -EFAULT;
> +	ret = copy_struct_from_user(&detach, sizeof(detach), arg, user_size);
> +	if (ret)
> +		return ret;
> 
> -	if (detach.argsz < minsz || detach.flags)
> +	if (detach.argsz < minsz || detach.flags & (~VFIO_DEVICE_DETACH_PASID))
>   		return -EINVAL;
> 
> +	if ((detach.flags & VFIO_DEVICE_DETACH_PASID) &&
> +	    !device->ops->pasid_detach_ioas)
> +		return -EOPNOTSUPP;
> +
>   	mutex_lock(&device->dev_set->lock);
> -	device->ops->detach_ioas(device);
> +	if (detach.flags & VFIO_DEVICE_DETACH_PASID)
> +		device->ops->pasid_detach_ioas(device, detach.pasid);
> +	else
> +		device->ops->detach_ioas(device);
>   	mutex_unlock(&device->dev_set->lock);
> 
>   	return 0;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2b68e6cdf190..40b414e642f5 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -931,29 +931,34 @@ struct vfio_device_bind_iommufd {
>    * VFIO_DEVICE_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 19,
>    *					struct vfio_device_attach_iommufd_pt)
>    * @argsz:	User filled size of this data.
> - * @flags:	Must be 0.
> + * @flags:	Flags for attach.
>    * @pt_id:	Input the target id which can represent an ioas or a hwpt
>    *		allocated via iommufd subsystem.
>    *		Output the input ioas id or the attached hwpt id which could
>    *		be the specified hwpt itself or a hwpt automatically created
>    *		for the specified ioas by kernel during the attachment.
> + * @pasid:	The pasid to be attached, only meaningful when
> + *		VFIO_DEVICE_ATTACH_PASID is set in @flags
>    *
>    * Associate the device with an address space within the bound iommufd.
>    * Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close.  This is only
>    * allowed on cdev fds.
>    *
> - * If a vfio device is currently attached to a valid hw_pagetable, without 
> doing
> - * a VFIO_DEVICE_DETACH_IOMMUFD_PT, a second VFIO_DEVICE_ATTACH_IOMMUFD_PT 
> ioctl
> - * passing in another hw_pagetable (hwpt) id is allowed. This action, also 
> known
> - * as a hw_pagetable replacement, will replace the device's currently attached
> - * hw_pagetable with a new hw_pagetable corresponding to the given pt_id.
> + * If a vfio device or a pasid of this device is currently attached to a valid
> + * hw_pagetable (hwpt), without doing a VFIO_DEVICE_DETACH_IOMMUFD_PT, a 
> second
> + * VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl passing in another hwpt id is allowed.
> + * This action, also known as a hw_pagetable replacement, will replace the
> + * currently attached hwpt of the device or the pasid of this device with 
> a new
> + * hwpt corresponding to the given pt_id.
>    *
>    * Return: 0 on success, -errno on failure.
>    */
>   struct vfio_device_attach_iommufd_pt {
>   	__u32	argsz;
>   	__u32	flags;
> +#define VFIO_DEVICE_ATTACH_PASID	(1 << 0)
>   	__u32	pt_id;
> +	__u32	pasid;
>   };
> 
>   #define VFIO_DEVICE_ATTACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 19)
> @@ -962,17 +967,21 @@ struct vfio_device_attach_iommufd_pt {
>    * VFIO_DEVICE_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 20,
>    *					struct vfio_device_detach_iommufd_pt)
>    * @argsz:	User filled size of this data.
> - * @flags:	Must be 0.
> + * @flags:	Flags for detach.
> + * @pasid:	The pasid to be detached, only meaningful when
> + *		VFIO_DEVICE_DETACH_PASID is set in @flags
>    *
> - * Remove the association of the device and its current associated address
> - * space.  After it, the device should be in a blocking DMA state.  This 
> is only
> - * allowed on cdev fds.
> + * Remove the association of the device or a pasid of the device and its 
> current
> + * associated address space.  After it, the device or the pasid should be in a
> + * blocking DMA state.  This is only allowed on cdev fds.
>    *
>    * Return: 0 on success, -errno on failure.
>    */
>   struct vfio_device_detach_iommufd_pt {
>   	__u32	argsz;
>   	__u32	flags;
> +#define VFIO_DEVICE_DETACH_PASID	(1 << 0)
> +	__u32	pasid;
>   };
> 
>   #define VFIO_DEVICE_DETACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 20)

Yi Liu Oct. 16, 2024, 3:35 a.m. UTC | #8

On 2024/10/16 00:22, Alex Williamson wrote:
> On Tue, 15 Oct 2024 19:07:52 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/10/14 23:49, Alex Williamson wrote:
>>> On Sat, 12 Oct 2024 21:49:05 +0800
>>> Yi Liu <yi.l.liu@intel.com> wrote:
>>>    
>>>> On 2024/10/1 20:11, Jason Gunthorpe wrote:
>>>>> On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
>>>>>       
>>>>>>> +struct vfio_device_pasid_attach_iommufd_pt {
>>>>>>> +	__u32	argsz;
>>>>>>> +	__u32	flags;
>>>>>>> +	__u32	pasid;
>>>>>>> +	__u32	pt_id;
>>>>>>> +};
>>>>>>> +
>>>>>>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
>>>>>>> VFIO_BASE + 21)
>>>>>>
>>>>>> Not sure whether this was discussed before. Does it make sense
>>>>>> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
>>>>>> by introducing a new pasid field and a new flag bit?
>>>>>
>>>>> Maybe? I don't have a strong feeling either way.
>>>>>
>>>>> There is somewhat less code if you reuse the ioctl at least
>>>>
>>>> I had a rough memory that I was suggested to add a separate ioctl for
>>>> PASID. Let's see Alex's opinion.
>>>
>>> I don't recall any previous arguments for separate ioctls, but it seems
>>> to make a lot of sense to me to extend the existing ioctls with a flag
>>> to indicate pasid cscope and id.  Thanks,
>>
>> thanks for the confirmation. How about the below?
>>
>> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
>> index bb1817bd4ff3..c78533bce3c6 100644
>> --- a/drivers/vfio/device_cdev.c
>> +++ b/drivers/vfio/device_cdev.c
>> @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df)
>>    int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
>>    			    struct vfio_device_attach_iommufd_pt __user *arg)
>>    {
>> -	struct vfio_device *device = df->device;
>> +	unsigned long minsz = offsetofend(
>> +			struct vfio_device_attach_iommufd_pt, pt_id);
>>    	struct vfio_device_attach_iommufd_pt attach;
>> -	unsigned long minsz;
>> +	struct vfio_device *device = df->device;
>> +	u32 user_size;
>>    	int ret;
>>
>> -	minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
>> +	ret = get_user(user_size, (u32 __user *)arg);
>> +	if (ret)
>> +		return ret;
>>
>> -	if (copy_from_user(&attach, arg, minsz))
>> -		return -EFAULT;
>> +	ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
>> +	if (ret)
>> +		return ret;
> 
> I think this could break current users. 

not quite get here. My understanding is as the below:

If the current user (compiled with the existing uapi struct), it will
provide less data that the new kernel knows. The copy_struct_from_user()
would zero the trailing bytes. And such user won't set the new flag, so
it should be fine.

So to me, the trailing bytes concern comes when user is compiled with the
new uapi struct but running on an eld kernel (say <= 6.12).But the eld
kernel uses copy_from_user(), so even there is non-zero trailing bytes in
the user buffer, the eld kernel just ignores them. So this seems not an
issue to me so far.

> For better or worse, we don't
> currently have any requirements for the remainder of the user buffer,
> whereas copy_struct_from_user() returns an error for non-zero trailing
> bytes.

This seems to be a general requirement when using copy_struct_from_user().
User needs to zero the fields that it does not intend to use. If there is
no such requirement for user, then the trailing bytes can be a concern in
the future but not this time as the existing kernel uses copy_from_user().

> I think we need to monotonically increase the structure size,
> but maybe something more like below, using flags.  The expectation
> would be that if we add another flag that extends the structure, we'd
> test that flag after PASID and clobber xend to a new value further into
> the new structure.  We'd also add that flag to the flags mask, but we'd
> share the copy code.

agree, this share code might be needed for other path as well. Some macros
I guess.

> 
> 	if (attach.argsz < minsz)
> 		return -EINVAL;
> 
> 	if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
> 		return -EINVAL;
> 
> 	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
> 		xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
> 
> 	if (xend) {
> 		if (attach.argsz < xend)
> 			return -EINVAL;
> 	
> 		if (copy_from_user((void *)&attach + minsz,
> 				    (void __user *)arg + minsz, xend - minsz))
> 			return -EFAULT;
> 	}

I think it might need to zero the trailing bytes if the user does not set
the extended flag. is it?

> Maybe there are still more elegant options available.
> 
> We also generally try to label flags with FLAGS in the name, but it
> does get rather unwieldy, so I'm open to suggestions.  Thanks,

There is already examples that new field added to a kernel-to-user uapi
struct like the vfio_region_info::cap_offset and
vfio_device_info::cap_offset. But it might be a little bit different
with the case we face here as it's user-to-kernel struct. It's a time for
you to set a rule for such extensions. :)

Alex Williamson Oct. 16, 2024, 4:11 p.m. UTC | #9

On Wed, 16 Oct 2024 11:35:05 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/10/16 00:22, Alex Williamson wrote:
> > On Tue, 15 Oct 2024 19:07:52 +0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> >> On 2024/10/14 23:49, Alex Williamson wrote:  
> >>> On Sat, 12 Oct 2024 21:49:05 +0800
> >>> Yi Liu <yi.l.liu@intel.com> wrote:
> >>>      
> >>>> On 2024/10/1 20:11, Jason Gunthorpe wrote:  
> >>>>> On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
> >>>>>         
> >>>>>>> +struct vfio_device_pasid_attach_iommufd_pt {
> >>>>>>> +	__u32	argsz;
> >>>>>>> +	__u32	flags;
> >>>>>>> +	__u32	pasid;
> >>>>>>> +	__u32	pt_id;
> >>>>>>> +};
> >>>>>>> +
> >>>>>>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
> >>>>>>> VFIO_BASE + 21)  
> >>>>>>
> >>>>>> Not sure whether this was discussed before. Does it make sense
> >>>>>> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
> >>>>>> by introducing a new pasid field and a new flag bit?  
> >>>>>
> >>>>> Maybe? I don't have a strong feeling either way.
> >>>>>
> >>>>> There is somewhat less code if you reuse the ioctl at least  
> >>>>
> >>>> I had a rough memory that I was suggested to add a separate ioctl for
> >>>> PASID. Let's see Alex's opinion.  
> >>>
> >>> I don't recall any previous arguments for separate ioctls, but it seems
> >>> to make a lot of sense to me to extend the existing ioctls with a flag
> >>> to indicate pasid cscope and id.  Thanks,  
> >>
> >> thanks for the confirmation. How about the below?
> >>
> >> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
> >> index bb1817bd4ff3..c78533bce3c6 100644
> >> --- a/drivers/vfio/device_cdev.c
> >> +++ b/drivers/vfio/device_cdev.c
> >> @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df)
> >>    int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
> >>    			    struct vfio_device_attach_iommufd_pt __user *arg)
> >>    {
> >> -	struct vfio_device *device = df->device;
> >> +	unsigned long minsz = offsetofend(
> >> +			struct vfio_device_attach_iommufd_pt, pt_id);
> >>    	struct vfio_device_attach_iommufd_pt attach;
> >> -	unsigned long minsz;
> >> +	struct vfio_device *device = df->device;
> >> +	u32 user_size;
> >>    	int ret;
> >>
> >> -	minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
> >> +	ret = get_user(user_size, (u32 __user *)arg);
> >> +	if (ret)
> >> +		return ret;
> >>
> >> -	if (copy_from_user(&attach, arg, minsz))
> >> -		return -EFAULT;
> >> +	ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
> >> +	if (ret)
> >> +		return ret;  
> > 
> > I think this could break current users.   
> 
> not quite get here. My understanding is as the below:
> 
> If the current user (compiled with the existing uapi struct), it will
> provide less data that the new kernel knows. The copy_struct_from_user()
> would zero the trailing bytes. And such user won't set the new flag, so
> it should be fine.

You're making an assumption that the user is passing exactly the
existing struct size as argsz, which is not a requirement.  The user
could allocate a buffer page, argsz might be 4K.
 
> So to me, the trailing bytes concern comes when user is compiled with the
> new uapi struct but running on an eld kernel (say <= 6.12).But the eld
> kernel uses copy_from_user(), so even there is non-zero trailing bytes in
> the user buffer, the eld kernel just ignores them. So this seems not an
> issue to me so far.

That's new userspace, old kernel.  I'm referencing an issue with old
userspace, new kernel, where old userspace has no requirement that
argsz is exactly the structure size, nor that the trailing bytes from
the structure size to argsz are zero'd.
 
> > For better or worse, we don't
> > currently have any requirements for the remainder of the user buffer,
> > whereas copy_struct_from_user() returns an error for non-zero trailing
> > bytes.  
> 
> This seems to be a general requirement when using copy_struct_from_user().
> User needs to zero the fields that it does not intend to use. If there is
> no such requirement for user, then the trailing bytes can be a concern in
> the future but not this time as the existing kernel uses copy_from_user().

Exactly, the current implementation makes no requirement on trailing
non-zero bytes, copy_struct_from_user() does.
 
> > I think we need to monotonically increase the structure size,
> > but maybe something more like below, using flags.  The expectation
> > would be that if we add another flag that extends the structure, we'd
> > test that flag after PASID and clobber xend to a new value further into
> > the new structure.  We'd also add that flag to the flags mask, but we'd
> > share the copy code.  
> 
> agree, this share code might be needed for other path as well. Some macros
> I guess.
> 
> > 
> > 	if (attach.argsz < minsz)
> > 		return -EINVAL;
> > 
> > 	if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
> > 		return -EINVAL;
> > 
> > 	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
> > 		xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
> > 
> > 	if (xend) {
> > 		if (attach.argsz < xend)
> > 			return -EINVAL;
> > 	
> > 		if (copy_from_user((void *)&attach + minsz,
> > 				    (void __user *)arg + minsz, xend - minsz))
> > 			return -EFAULT;
> > 	}  
> 
> I think it might need to zero the trailing bytes if the user does not set
> the extended flag. is it?

We could zero initialize the attach structure for safety, but the kernel
side code should also be driven by flags.  We should never look at a
field beyond the base structure that isn't indicated by flags and copied
from the user.

> > Maybe there are still more elegant options available.
> > 
> > We also generally try to label flags with FLAGS in the name, but it
> > does get rather unwieldy, so I'm open to suggestions.  Thanks,  
> 
> There is already examples that new field added to a kernel-to-user uapi
> struct like the vfio_region_info::cap_offset and
> vfio_device_info::cap_offset. But it might be a little bit different
> with the case we face here as it's user-to-kernel struct. It's a time for
> you to set a rule for such extensions. :)

That's a returned structure, so it's a bit different, but the same
philosophy, we don't break userspace.  In that case we used argsz as an
output field and flags to indicate there are capabilities.  Old
userspace ignores the flag and argsz semantics and continues to work.
New userspace checks the flag, reallocs the buffer with argsz and
retries.  This is however an example of userspace providing an argsz
value that exceeds the ioctl structure, with no requirement to zero the
buffer, though it is an output structure rather than the input
structure here.

I think the same holds here, our policy is to not break userspace.  We
potentially do break userspace if we impose a requirement that the
trailing buffer size must now be zero.  We have the flags field so that
we don't need to blindly base the structure version on the size of the
user buffer.  We should use the flags field to determine relevant and
valid fields beyond the base structure without imposing new
requirements to userspace.  Thanks,

Alex

Yi Liu Oct. 18, 2024, 5:40 a.m. UTC | #10

On 2024/10/17 00:11, Alex Williamson wrote:
> On Wed, 16 Oct 2024 11:35:05 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/10/16 00:22, Alex Williamson wrote:
>>> On Tue, 15 Oct 2024 19:07:52 +0800
>>> Yi Liu <yi.l.liu@intel.com> wrote:
>>>    
>>>> On 2024/10/14 23:49, Alex Williamson wrote:
>>>>> On Sat, 12 Oct 2024 21:49:05 +0800
>>>>> Yi Liu <yi.l.liu@intel.com> wrote:
>>>>>       
>>>>>> On 2024/10/1 20:11, Jason Gunthorpe wrote:
>>>>>>> On Mon, Sep 30, 2024 at 07:55:08AM +0000, Tian, Kevin wrote:
>>>>>>>          
>>>>>>>>> +struct vfio_device_pasid_attach_iommufd_pt {
>>>>>>>>> +	__u32	argsz;
>>>>>>>>> +	__u32	flags;
>>>>>>>>> +	__u32	pasid;
>>>>>>>>> +	__u32	pt_id;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE,
>>>>>>>>> VFIO_BASE + 21)
>>>>>>>>
>>>>>>>> Not sure whether this was discussed before. Does it make sense
>>>>>>>> to reuse the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
>>>>>>>> by introducing a new pasid field and a new flag bit?
>>>>>>>
>>>>>>> Maybe? I don't have a strong feeling either way.
>>>>>>>
>>>>>>> There is somewhat less code if you reuse the ioctl at least
>>>>>>
>>>>>> I had a rough memory that I was suggested to add a separate ioctl for
>>>>>> PASID. Let's see Alex's opinion.
>>>>>
>>>>> I don't recall any previous arguments for separate ioctls, but it seems
>>>>> to make a lot of sense to me to extend the existing ioctls with a flag
>>>>> to indicate pasid cscope and id.  Thanks,
>>>>
>>>> thanks for the confirmation. How about the below?
>>>>
>>>> diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
>>>> index bb1817bd4ff3..c78533bce3c6 100644
>>>> --- a/drivers/vfio/device_cdev.c
>>>> +++ b/drivers/vfio/device_cdev.c
>>>> @@ -162,21 +162,34 @@ void vfio_df_unbind_iommufd(struct vfio_device_file *df)
>>>>     int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
>>>>     			    struct vfio_device_attach_iommufd_pt __user *arg)
>>>>     {
>>>> -	struct vfio_device *device = df->device;
>>>> +	unsigned long minsz = offsetofend(
>>>> +			struct vfio_device_attach_iommufd_pt, pt_id);
>>>>     	struct vfio_device_attach_iommufd_pt attach;
>>>> -	unsigned long minsz;
>>>> +	struct vfio_device *device = df->device;
>>>> +	u32 user_size;
>>>>     	int ret;
>>>>
>>>> -	minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
>>>> +	ret = get_user(user_size, (u32 __user *)arg);
>>>> +	if (ret)
>>>> +		return ret;
>>>>
>>>> -	if (copy_from_user(&attach, arg, minsz))
>>>> -		return -EFAULT;
>>>> +	ret = copy_struct_from_user(&attach, sizeof(attach), arg, user_size);
>>>> +	if (ret)
>>>> +		return ret;
>>>
>>> I think this could break current users.
>>
>> not quite get here. My understanding is as the below:
>>
>> If the current user (compiled with the existing uapi struct), it will
>> provide less data that the new kernel knows. The copy_struct_from_user()
>> would zero the trailing bytes. And such user won't set the new flag, so
>> it should be fine.
> 
> You're making an assumption that the user is passing exactly the
> existing struct size as argsz, which is not a requirement.  The user
> could allocate a buffer page, argsz might be 4K.

I see. If the argsz is far larger than the existing struct size, it might
be large than the size of the new struct as well. Then the trailing bytes
would result in failure which does not fail with old user and old kernel.

>> So to me, the trailing bytes concern comes when user is compiled with the
>> new uapi struct but running on an eld kernel (say <= 6.12).But the eld
>> kernel uses copy_from_user(), so even there is non-zero trailing bytes in
>> the user buffer, the eld kernel just ignores them. So this seems not an
>> issue to me so far.
> 
> That's new userspace, old kernel.  I'm referencing an issue with old
> userspace, new kernel, where old userspace has no requirement that
> argsz is exactly the structure size, nor that the trailing bytes from
> the structure size to argsz are zero'd.

got it.

>>> For better or worse, we don't
>>> currently have any requirements for the remainder of the user buffer,
>>> whereas copy_struct_from_user() returns an error for non-zero trailing
>>> bytes.
>>
>> This seems to be a general requirement when using copy_struct_from_user().
>> User needs to zero the fields that it does not intend to use. If there is
>> no such requirement for user, then the trailing bytes can be a concern in
>> the future but not this time as the existing kernel uses copy_from_user().
> 
> Exactly, the current implementation makes no requirement on trailing
> non-zero bytes, copy_struct_from_user() does.

got it.

>>> I think we need to monotonically increase the structure size,
>>> but maybe something more like below, using flags.  The expectation
>>> would be that if we add another flag that extends the structure, we'd
>>> test that flag after PASID and clobber xend to a new value further into
>>> the new structure.  We'd also add that flag to the flags mask, but we'd
>>> share the copy code.
>>
>> agree, this share code might be needed for other path as well. Some macros
>> I guess.
>>
>>>
>>> 	if (attach.argsz < minsz)
>>> 		return -EINVAL;
>>>
>>> 	if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
>>> 		return -EINVAL;
>>>
>>> 	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
>>> 		xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
>>>
>>> 	if (xend) {
>>> 		if (attach.argsz < xend)
>>> 			return -EINVAL;
>>> 	
>>> 		if (copy_from_user((void *)&attach + minsz,
>>> 				    (void __user *)arg + minsz, xend - minsz))
>>> 			return -EFAULT;
>>> 	}
>>
>> I think it might need to zero the trailing bytes if the user does not set
>> the extended flag. is it?
> 
> We could zero initialize the attach structure for safety, but the kernel
> side code should also be driven by flags.  We should never look at a
> field beyond the base structure that isn't indicated by flags and copied
> from the user.

yes, zeroing the trailing bytes is for safety, and all the new fields
should be indicated by flags.

>>> Maybe there are still more elegant options available.
>>>
>>> We also generally try to label flags with FLAGS in the name, but it
>>> does get rather unwieldy, so I'm open to suggestions.  Thanks,
>>
>> There is already examples that new field added to a kernel-to-user uapi
>> struct like the vfio_region_info::cap_offset and
>> vfio_device_info::cap_offset. But it might be a little bit different
>> with the case we face here as it's user-to-kernel struct. It's a time for
>> you to set a rule for such extensions. :)
> 
> That's a returned structure, so it's a bit different, but the same
> philosophy, we don't break userspace.  In that case we used argsz as an
> output field and flags to indicate there are capabilities.  Old
> userspace ignores the flag and argsz semantics and continues to work.
> New userspace checks the flag, reallocs the buffer with argsz and
> retries.  This is however an example of userspace providing an argsz
> value that exceeds the ioctl structure, with no requirement to zero the
> buffer, though it is an output structure rather than the input
> structure here.
> 
> I think the same holds here, our policy is to not break userspace.  We
> potentially do break userspace if we impose a requirement that the
> trailing buffer size must now be zero.  We have the flags field so that
> we don't need to blindly base the structure version on the size of the
> user buffer.  We should use the flags field to determine relevant and
> valid fields beyond the base structure without imposing new
> requirements to userspace.  Thanks,

got it. let me send another version.

Yi Liu Oct. 30, 2024, 12:54 p.m. UTC | #11

Hi Alex,

On 2024/10/18 13:40, Yi Liu wrote:
>>>> I think we need to monotonically increase the structure size,
>>>> but maybe something more like below, using flags.  The expectation
>>>> would be that if we add another flag that extends the structure, we'd
>>>> test that flag after PASID and clobber xend to a new value further into
>>>> the new structure.  We'd also add that flag to the flags mask, but we'd
>>>> share the copy code.
>>>
>>> agree, this share code might be needed for other path as well. Some macros
>>> I guess.
>>>
>>>>
>>>>     if (attach.argsz < minsz)
>>>>         return -EINVAL;
>>>>
>>>>     if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
>>>>         return -EINVAL;
>>>>
>>>>     if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
>>>>         xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
>>>>
>>>>     if (xend) {
>>>>         if (attach.argsz < xend)
>>>>             return -EINVAL;

Need to check the future usage of 'xend'. In understanding, 'xend' should
always be offsetofend(struct, the_last_field). A userspace that uses @pasid 
field would set argsz >= offsetofend(struct, pasid), most likely it would
just set argsz==offsetofend(struct, pasid). If so, such userspace would be
failed when running on a kernel that has added new fields behind @pasid.

Say two decades later, we add a new field (say @xyz) to this user struct,
the 'xend' would be updated to be offsetofend(struct, xyz). This 'xend'
would be larger than the argsz provided by the aforementioned userspace.
Hence it would be failed in the above check. To make it work, I'm
considering to make some changes to the code. When argsz < xend, we only
copy extra data with size==argsz-minsz. Just as the below.

	if (xend) {
		unsigned long size;

		if (attach.argsz < xend)
			size = attach.argsz - minsz;
		else
			size = xend - minsz;

		if (copy_from_user((void *)&attach + minsz,
				  (void __user *)arg + minsz, size))
			return -EFAULT;
	}

However, it seems to have another problem. If the userspace that uses
@pasid set the argsz==offsetofend(struct, pasid) - 1, such userspace is
not supposed to work and should be failed by kernel. is it? However, my
above code cannot fail it. :(

Any suggestion about it?

>>>>
>>>>         if (copy_from_user((void *)&attach + minsz,
>>>>                     (void __user *)arg + minsz, xend - minsz))
>>>>             return -EFAULT;
>>>>     }
>>>

Alex Williamson Oct. 30, 2024, 5:51 p.m. UTC | #12

On Wed, 30 Oct 2024 20:54:09 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> On 2024/10/18 13:40, Yi Liu wrote:
> >>>> I think we need to monotonically increase the structure size,
> >>>> but maybe something more like below, using flags.  The expectation
> >>>> would be that if we add another flag that extends the structure, we'd
> >>>> test that flag after PASID and clobber xend to a new value further into
> >>>> the new structure.  We'd also add that flag to the flags mask, but we'd
> >>>> share the copy code.  
> >>>
> >>> agree, this share code might be needed for other path as well. Some macros
> >>> I guess.
> >>>  
> >>>>
> >>>>     if (attach.argsz < minsz)
> >>>>         return -EINVAL;
> >>>>
> >>>>     if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
> >>>>         return -EINVAL;
> >>>>
> >>>>     if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
> >>>>         xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
> >>>>
> >>>>     if (xend) {
> >>>>         if (attach.argsz < xend)
> >>>>             return -EINVAL;  
> 
> Need to check the future usage of 'xend'. In understanding, 'xend' should
> always be offsetofend(struct, the_last_field). A userspace that uses @pasid 
> field would set argsz >= offsetofend(struct, pasid), most likely it would
> just set argsz==offsetofend(struct, pasid). If so, such userspace would be
> failed when running on a kernel that has added new fields behind @pasid.

No, xend denotes the end of the structure we need to satisfy the flags
that are requested by the user.
 
> Say two decades later, we add a new field (say @xyz) to this user struct,
> the 'xend' would be updated to be offsetofend(struct, xyz). This 'xend'
> would be larger than the argsz provided by the aforementioned userspace.
> Hence it would be failed in the above check.

New field xyz would require a new flag, VFIO_DEVICE_XYZ and we'd extend
the above code as:

	if (attach.argsz < minsz)
		return -EINVAL;

	if (attach.flags & (~(VFIO_DEVICE_ATTACH_PASID |
			      VFIO_DEVICE_XYZ)))
		return -EINVAL;

	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
		xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);

	if (attach.flags & VFIO_DEVICE_XYZ)
		xend = offsetofend(struct vfio_device_attach_iommufd_pt, xyz);

	if (xend) {
		if (attach.argsz < xend)
			return -EINVAL;  

New userspace can provide argsz = offsetofend(, xyz), just as it could
provide argsz = PAGE_SIZE now if it really wanted, but argsz > minsz is
only required if the user sets any of these new flags.  Therefore old
userspace on new kernel continues to work.

> To make it work, I'm
> considering to make some changes to the code. When argsz < xend, we only
> copy extra data with size==argsz-minsz. Just as the below.
> 
> 	if (xend) {
> 		unsigned long size;
> 
> 		if (attach.argsz < xend)

This is an -EINVAL condition, xend tracks the flags the user has set.
The user must provide a sufficient buffer for the flags they've set.

> 			size = attach.argsz - minsz;
> 		else
> 			size = xend - minsz;

This is the only correct copy size.

> 
> 		if (copy_from_user((void *)&attach + minsz,
> 				  (void __user *)arg + minsz, size))
> 			return -EFAULT;
> 	}
> 
> However, it seems to have another problem. If the userspace that uses
> @pasid set the argsz==offsetofend(struct, pasid) - 1, such userspace is
> not supposed to work and should be failed by kernel. is it? However, my
> above code cannot fail it. :(
> 
> Any suggestion about it?

If a user sets the ATTACH_PASID flag and argsz is less than
offsetofend(struct, pasid), we need to return -EINVAL as indicated
above.  Thanks,

Alex

> 
> >>>>
> >>>>         if (copy_from_user((void *)&attach + minsz,
> >>>>                     (void __user *)arg + minsz, xend - minsz))
> >>>>             return -EFAULT;
> >>>>     }  
> >>>  
>

Yi Liu Oct. 31, 2024, 7:23 a.m. UTC | #13

On 2024/10/31 01:51, Alex Williamson wrote:
> On Wed, 30 Oct 2024 20:54:09 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> Hi Alex,
>>
>> On 2024/10/18 13:40, Yi Liu wrote:
>>>>>> I think we need to monotonically increase the structure size,
>>>>>> but maybe something more like below, using flags.  The expectation
>>>>>> would be that if we add another flag that extends the structure, we'd
>>>>>> test that flag after PASID and clobber xend to a new value further into
>>>>>> the new structure.  We'd also add that flag to the flags mask, but we'd
>>>>>> share the copy code.
>>>>>
>>>>> agree, this share code might be needed for other path as well. Some macros
>>>>> I guess.
>>>>>   
>>>>>>
>>>>>>      if (attach.argsz < minsz)
>>>>>>          return -EINVAL;
>>>>>>
>>>>>>      if (attach.flags & (~VFIO_DEVICE_ATTACH_PASID))
>>>>>>          return -EINVAL;
>>>>>>
>>>>>>      if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
>>>>>>          xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
>>>>>>
>>>>>>      if (xend) {
>>>>>>          if (attach.argsz < xend)
>>>>>>              return -EINVAL;
>>
>> Need to check the future usage of 'xend'. In understanding, 'xend' should
>> always be offsetofend(struct, the_last_field). A userspace that uses @pasid
>> field would set argsz >= offsetofend(struct, pasid), most likely it would
>> just set argsz==offsetofend(struct, pasid). If so, such userspace would be
>> failed when running on a kernel that has added new fields behind @pasid.
> 
> No, xend denotes the end of the structure we need to satisfy the flags
> that are requested by the user.
>   
>> Say two decades later, we add a new field (say @xyz) to this user struct,
>> the 'xend' would be updated to be offsetofend(struct, xyz). This 'xend'
>> would be larger than the argsz provided by the aforementioned userspace.
>> Hence it would be failed in the above check.
> 
> New field xyz would require a new flag, VFIO_DEVICE_XYZ and we'd extend
> the above code as:
> 
> 	if (attach.argsz < minsz)
> 		return -EINVAL;
> 
> 	if (attach.flags & (~(VFIO_DEVICE_ATTACH_PASID |
> 			      VFIO_DEVICE_XYZ)))
> 		return -EINVAL;
> 
> 	if (attach.flags & VFIO_DEVICE_ATTACH_PASID)
> 		xend = offsetofend(struct vfio_device_attach_iommufd_pt, pasid);
> 
> 	if (attach.flags & VFIO_DEVICE_XYZ)
> 		xend = offsetofend(struct vfio_device_attach_iommufd_pt, xyz);
> 
> 	if (xend) {
> 		if (attach.argsz < xend)
> 			return -EINVAL;
> 
> New userspace can provide argsz = offsetofend(, xyz), just as it could
> provide argsz = PAGE_SIZE now if it really wanted, but argsz > minsz is
> only required if the user sets any of these new flags.  Therefore old
> userspace on new kernel continues to work.

got it. This should work. thanks.:)

>> To make it work, I'm
>> considering to make some changes to the code. When argsz < xend, we only
>> copy extra data with size==argsz-minsz. Just as the below.
>>
>> 	if (xend) {
>> 		unsigned long size;
>>
>> 		if (attach.argsz < xend)
> 
> This is an -EINVAL condition, xend tracks the flags the user has set.
> The user must provide a sufficient buffer for the flags they've set.
> 
>> 			size = attach.argsz - minsz;
>> 		else
>> 			size = xend - minsz;
> 
> This is the only correct copy size.
> 
>>
>> 		if (copy_from_user((void *)&attach + minsz,
>> 				  (void __user *)arg + minsz, size))
>> 			return -EFAULT;
>> 	}
>>
>> However, it seems to have another problem. If the userspace that uses
>> @pasid set the argsz==offsetofend(struct, pasid) - 1, such userspace is
>> not supposed to work and should be failed by kernel. is it? However, my
>> above code cannot fail it. :(
>>
>> Any suggestion about it?
> 
> If a user sets the ATTACH_PASID flag and argsz is less than
> offsetofend(struct, pasid), we need to return -EINVAL as indicated
> above.  Thanks,

yep.

> 
>>
>>>>>>
>>>>>>          if (copy_from_user((void *)&attach + minsz,
>>>>>>                      (void __user *)arg + minsz, xend - minsz))
>>>>>>              return -EFAULT;
>>>>>>      }
>>>>>   
>>
> 
>

[v3,3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT

Commit Message

Comments

Patch