diff mbox series

[v2] vfio/pci: Properly hide first-in-list PCIe extended capability

Message ID 20241121140057.25157-1-avihaih@nvidia.com (mailing list archive)
State New, archived
Headers show
Series [v2] vfio/pci: Properly hide first-in-list PCIe extended capability | expand

Commit Message

Avihai Horon Nov. 21, 2024, 2 p.m. UTC
There are cases where a PCIe extended capability should be hidden from
the user. For example, an unknown capability (i.e., capability with ID
greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally
chosen to be hidden from the user.

Hiding a capability is done by virtualizing and modifying the 'Next
Capability Offset' field of the previous capability so it points to the
capability after the one that should be hidden.

The special case where the first capability in the list should be hidden
is handled differently because there is no previous capability that can
be modified. In this case, the capability ID and version are zeroed
while leaving the next pointer intact. This hides the capability and
leaves an anchor for the rest of the capability list.

However, today, hiding the first capability in the list is not done
properly if the capability is unknown, as struct
vfio_pci_core_device->pci_config_map is set to the capability ID during
initialization but the capability ID is not properly checked later when
used in vfio_config_do_rw(). This leads to the following warning [1] and
to an out-of-bounds access to ecap_perms array.

Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater
than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct
read only access instead of the ecap_perms array.

Note that this is safe since the above is the only case where cap_id can
exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which
are already checked before).

[1]

WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
(snip)
Call Trace:
 <TASK>
 ? show_regs+0x69/0x80
 ? __warn+0x8d/0x140
 ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
 ? report_bug+0x18f/0x1a0
 ? handle_bug+0x63/0xa0
 ? exc_invalid_op+0x19/0x70
 ? asm_exc_invalid_op+0x1b/0x20
 ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
 ? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
 vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
 vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
 vfio_device_fops_read+0x27/0x40 [vfio]
 vfs_read+0xbd/0x340
 ? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
 ? __rseq_handle_notify_resume+0xa4/0x4b0
 __x64_sys_pread64+0x96/0xc0
 x64_sys_call+0x1c3d/0x20d0
 do_syscall_64+0x4d/0x120
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
Changes from v1:
* Use Alex's suggestion to fix the bug and adapt the commit message.
---
 drivers/vfio/pci/vfio_pci_config.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

Comments

Yi Liu Nov. 22, 2024, 12:45 p.m. UTC | #1
On 2024/11/21 22:00, Avihai Horon wrote:
> There are cases where a PCIe extended capability should be hidden from
> the user. For example, an unknown capability (i.e., capability with ID
> greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally
> chosen to be hidden from the user.
> 
> Hiding a capability is done by virtualizing and modifying the 'Next
> Capability Offset' field of the previous capability so it points to the
> capability after the one that should be hidden.
> 
> The special case where the first capability in the list should be hidden
> is handled differently because there is no previous capability that can
> be modified. In this case, the capability ID and version are zeroed
> while leaving the next pointer intact. This hides the capability and
> leaves an anchor for the rest of the capability list.
> 
> However, today, hiding the first capability in the list is not done
> properly if the capability is unknown, as struct
> vfio_pci_core_device->pci_config_map is set to the capability ID during
> initialization but the capability ID is not properly checked later when
> used in vfio_config_do_rw(). This leads to the following warning [1] and
> to an out-of-bounds access to ecap_perms array.
> 
> Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater
> than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct
> read only access instead of the ecap_perms array.
> 
> Note that this is safe since the above is the only case where cap_id can
> exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which
> are already checked before).
> 
> [1]
> 
> WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]

strange, it is not in the vfio_config_do_rw(). But never mind.

> CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
> (snip)
> Call Trace:
>   <TASK>
>   ? show_regs+0x69/0x80
>   ? __warn+0x8d/0x140
>   ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
>   ? report_bug+0x18f/0x1a0
>   ? handle_bug+0x63/0xa0
>   ? exc_invalid_op+0x19/0x70
>   ? asm_exc_invalid_op+0x1b/0x20
>   ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
>   ? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
>   vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
>   vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
>   vfio_device_fops_read+0x27/0x40 [vfio]
>   vfs_read+0xbd/0x340
>   ? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
>   ? __rseq_handle_notify_resume+0xa4/0x4b0
>   __x64_sys_pread64+0x96/0xc0
>   x64_sys_call+0x1c3d/0x20d0
>   do_syscall_64+0x4d/0x120
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
> Changes from v1:
> * Use Alex's suggestion to fix the bug and adapt the commit message.
> ---
>   drivers/vfio/pci/vfio_pci_config.c | 20 ++++++++++++++++----
>   1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 97422aafaa7b..b2a1ba66e5f1 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -313,12 +313,16 @@ static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
>   	return count;
>   }
>   
> +static const struct perm_bits direct_ro_perms = {
> +	.readfn = vfio_direct_config_read,
> +};
> +
>   /* Default capability regions to read-only, no-virtualization */
>   static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
> -	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
> +	[0 ... PCI_CAP_ID_MAX] = direct_ro_perms
>   };
>   static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
> -	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
> +	[0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
>   };
>   /*
>    * Default unassigned regions to raw read-write access.  Some devices
> @@ -1897,9 +1901,17 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>   		cap_start = *ppos;
>   	} else {
>   		if (*ppos >= PCI_CFG_SPACE_SIZE) {
> -			WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
> +			/*
> +			 * We can get a cap_id that exceeds PCI_EXT_CAP_ID_MAX
> +			 * if we're hiding an unknown capability at the start
> +			 * of the extended capability list.  Use default, ro
> +			 * access, which will virtualize the id and next values.
> +			 */
> +			if (cap_id > PCI_EXT_CAP_ID_MAX)
> +				perm = (struct perm_bits *)&direct_ro_perms;
> +			else
> +				perm = &ecap_perms[cap_id];
>   
> -			perm = &ecap_perms[cap_id];
>   			cap_start = vfio_find_cap_start(vdev, *ppos);
>   		} else {
>   			WARN_ON(cap_id > PCI_CAP_ID_MAX);

Looks good to me. :) I'm able to trigger this warning by hide the first 
ecap on my system with the below hack.

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index b2a1ba66e5f1..db91e19a48b3 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1617,6 +1617,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device 
*vdev)
  	u16 epos;
  	__le32 *prev = NULL;
  	int loops, ret, ecaps = 0;
+	int iii =0;

  	if (!vdev->extended_caps)
  		return 0;
@@ -1635,7 +1636,11 @@ static int vfio_ecap_init(struct 
vfio_pci_core_device *vdev)
  		if (ret)
  			return ret;

-		ecap = PCI_EXT_CAP_ID(header);
+		if (iii == 0) {
+			ecap = 0x61;
+			iii++;
+		} else
+			ecap = PCI_EXT_CAP_ID(header);

  		if (ecap <= PCI_EXT_CAP_ID_MAX) {
  			len = pci_ext_cap_length[ecap];
@@ -1664,6 +1669,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device 
*vdev)
  			 */
  			len = PCI_CAP_SIZEOF;
  			hidden = true;
+			printk("%s set hide\n", __func__);
  		}

  		for (i = 0; i < len; i++) {
@@ -1893,6 +1899,7 @@ static ssize_t vfio_config_do_rw(struct 
vfio_pci_core_device *vdev, char __user

  	cap_id = vdev->pci_config_map[*ppos];

+	printk("%s cap_id: %x\n", __func__, cap_id);
  	if (cap_id == PCI_CAP_ID_INVALID) {
  		perm = &unassigned_perms;
  		cap_start = *ppos;

And then this warning is gone after applying this patch. Hence,

Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>

But I can still see a valid next pointer. Like the below log, I hide
the first ecap at offset 0x100, its ID is zeroed. The second ecap locates
at offset==0x150, its cap_id is 0x0018. I can see the next pointer in the
guest. Is it expected?

Guest:
100: 00 00 00 15 00 00 00 00 00 00 10 00 00 00 04 00
110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
150: 18 00 01 16 00 00 00 00 00 00 00 00 00 00 00 00
160: 17 00 01 17 05 02 01 00 00 00 00 00 00 00 00 00

Host:
100: 01 00 02 15 00 00 00 00 00 00 10 00 00 00 04 00
110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
150: 18 00 01 16 00 00 00 00 00 00 00 00 00 00 00 00
160: 17 00 01 17 05 02 01 00 00 00 00 00 00 00 00 00


BTW. If the first PCI cap is a unknown cap, will it have a problem? The
vfio_pci_core_device->pci_config_map is kept to be PCI_CAP_ID_INVALID,
hence it would use the unassigned_perms. But it makes more sense to use the
direct_ro_perms introduced here. is it?
Alex Williamson Nov. 22, 2024, 4:48 p.m. UTC | #2
On Fri, 22 Nov 2024 20:45:08 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/11/21 22:00, Avihai Horon wrote:
> > There are cases where a PCIe extended capability should be hidden from
> > the user. For example, an unknown capability (i.e., capability with ID
> > greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally
> > chosen to be hidden from the user.
> > 
> > Hiding a capability is done by virtualizing and modifying the 'Next
> > Capability Offset' field of the previous capability so it points to the
> > capability after the one that should be hidden.
> > 
> > The special case where the first capability in the list should be hidden
> > is handled differently because there is no previous capability that can
> > be modified. In this case, the capability ID and version are zeroed
> > while leaving the next pointer intact. This hides the capability and
> > leaves an anchor for the rest of the capability list.
> > 
> > However, today, hiding the first capability in the list is not done
> > properly if the capability is unknown, as struct
> > vfio_pci_core_device->pci_config_map is set to the capability ID during
> > initialization but the capability ID is not properly checked later when
> > used in vfio_config_do_rw(). This leads to the following warning [1] and
> > to an out-of-bounds access to ecap_perms array.
> > 
> > Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater
> > than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct
> > read only access instead of the ecap_perms array.
> > 
> > Note that this is safe since the above is the only case where cap_id can
> > exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which
> > are already checked before).
> > 
> > [1]
> > 
> > WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]  
> 
> strange, it is not in the vfio_config_do_rw(). But never mind.
> 
> > CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
> > (snip)
> > Call Trace:
> >   <TASK>
> >   ? show_regs+0x69/0x80
> >   ? __warn+0x8d/0x140
> >   ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
> >   ? report_bug+0x18f/0x1a0
> >   ? handle_bug+0x63/0xa0
> >   ? exc_invalid_op+0x19/0x70
> >   ? asm_exc_invalid_op+0x1b/0x20
> >   ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
> >   ? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
> >   vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
> >   vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
> >   vfio_device_fops_read+0x27/0x40 [vfio]
> >   vfs_read+0xbd/0x340
> >   ? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
> >   ? __rseq_handle_notify_resume+0xa4/0x4b0
> >   __x64_sys_pread64+0x96/0xc0
> >   x64_sys_call+0x1c3d/0x20d0
> >   do_syscall_64+0x4d/0x120
> >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > 
> > Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > ---
> > Changes from v1:
> > * Use Alex's suggestion to fix the bug and adapt the commit message.
> > ---
> >   drivers/vfio/pci/vfio_pci_config.c | 20 ++++++++++++++++----
> >   1 file changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> > index 97422aafaa7b..b2a1ba66e5f1 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -313,12 +313,16 @@ static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
> >   	return count;
> >   }
> >   
> > +static const struct perm_bits direct_ro_perms = {
> > +	.readfn = vfio_direct_config_read,
> > +};
> > +
> >   /* Default capability regions to read-only, no-virtualization */
> >   static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
> > -	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
> > +	[0 ... PCI_CAP_ID_MAX] = direct_ro_perms
> >   };
> >   static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
> > -	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
> > +	[0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
> >   };
> >   /*
> >    * Default unassigned regions to raw read-write access.  Some devices
> > @@ -1897,9 +1901,17 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
> >   		cap_start = *ppos;
> >   	} else {
> >   		if (*ppos >= PCI_CFG_SPACE_SIZE) {
> > -			WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
> > +			/*
> > +			 * We can get a cap_id that exceeds PCI_EXT_CAP_ID_MAX
> > +			 * if we're hiding an unknown capability at the start
> > +			 * of the extended capability list.  Use default, ro
> > +			 * access, which will virtualize the id and next values.
> > +			 */
> > +			if (cap_id > PCI_EXT_CAP_ID_MAX)
> > +				perm = (struct perm_bits *)&direct_ro_perms;
> > +			else
> > +				perm = &ecap_perms[cap_id];
> >   
> > -			perm = &ecap_perms[cap_id];
> >   			cap_start = vfio_find_cap_start(vdev, *ppos);
> >   		} else {
> >   			WARN_ON(cap_id > PCI_CAP_ID_MAX);  
> 
> Looks good to me. :) I'm able to trigger this warning by hide the first 
> ecap on my system with the below hack.
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> b/drivers/vfio/pci/vfio_pci_config.c
> index b2a1ba66e5f1..db91e19a48b3 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1617,6 +1617,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device 
> *vdev)
>   	u16 epos;
>   	__le32 *prev = NULL;
>   	int loops, ret, ecaps = 0;
> +	int iii =0;
> 
>   	if (!vdev->extended_caps)
>   		return 0;
> @@ -1635,7 +1636,11 @@ static int vfio_ecap_init(struct 
> vfio_pci_core_device *vdev)
>   		if (ret)
>   			return ret;
> 
> -		ecap = PCI_EXT_CAP_ID(header);
> +		if (iii == 0) {
> +			ecap = 0x61;
> +			iii++;
> +		} else
> +			ecap = PCI_EXT_CAP_ID(header);
> 
>   		if (ecap <= PCI_EXT_CAP_ID_MAX) {
>   			len = pci_ext_cap_length[ecap];
> @@ -1664,6 +1669,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device 
> *vdev)
>   			 */
>   			len = PCI_CAP_SIZEOF;
>   			hidden = true;
> +			printk("%s set hide\n", __func__);
>   		}
> 
>   		for (i = 0; i < len; i++) {
> @@ -1893,6 +1899,7 @@ static ssize_t vfio_config_do_rw(struct 
> vfio_pci_core_device *vdev, char __user
> 
>   	cap_id = vdev->pci_config_map[*ppos];
> 
> +	printk("%s cap_id: %x\n", __func__, cap_id);
>   	if (cap_id == PCI_CAP_ID_INVALID) {
>   		perm = &unassigned_perms;
>   		cap_start = *ppos;
> 
> And then this warning is gone after applying this patch. Hence,
> 
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> Tested-by: Yi Liu <yi.l.liu@intel.com>

Thanks, good testing!
 
> But I can still see a valid next pointer. Like the below log, I hide
> the first ecap at offset 0x100, its ID is zeroed. The second ecap locates
> at offset==0x150, its cap_id is 0x0018. I can see the next pointer in the
> guest. Is it expected?

This is what makes hiding the first ecap unique, the ecap chain always
starts at 0x100, the next pointer must be valid for the rest of the
chain to remain.  For standard capabilities we can change the register
pointing at the head of the list.  This therefore looks like expected
behavior, unless I'm missing something more subtle in your example.
 
> Guest:
> 100: 00 00 00 15 00 00 00 00 00 00 10 00 00 00 04 00
> 110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 150: 18 00 01 16 00 00 00 00 00 00 00 00 00 00 00 00
> 160: 17 00 01 17 05 02 01 00 00 00 00 00 00 00 00 00
> 
> Host:
> 100: 01 00 02 15 00 00 00 00 00 00 10 00 00 00 04 00
> 110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 150: 18 00 01 16 00 00 00 00 00 00 00 00 00 00 00 00
> 160: 17 00 01 17 05 02 01 00 00 00 00 00 00 00 00 00
> 
> 
> BTW. If the first PCI cap is a unknown cap, will it have a problem? The
> vfio_pci_core_device->pci_config_map is kept to be PCI_CAP_ID_INVALID,
> hence it would use the unassigned_perms. But it makes more sense to use the
> direct_ro_perms introduced here. is it?

Once we've masked the capability ID, if the guest driver is still
touching the remaining body of the capability, we're really in the
space of undefined behavior, imo.  We've already taken the stance that
inter-capability space is accessible as a necessity for certain
devices.  It's certainly been suggested that we might want to take a
more guarded approach, even so far as readjusting the capability layout
for compatibility.  We might head in that direction but I don't think
we should start with this bug fix.  Thanks,

Alex
kernel test robot Nov. 22, 2024, 9:33 p.m. UTC | #3
Hi Avihai,

kernel test robot noticed the following build errors:

[auto build test ERROR on awilliam-vfio/next]
[also build test ERROR on linus/master awilliam-vfio/for-linus v6.12 next-20241122]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Avihai-Horon/vfio-pci-Properly-hide-first-in-list-PCIe-extended-capability/20241121-220249
base:   https://github.com/awilliam/linux-vfio.git next
patch link:    https://lore.kernel.org/r/20241121140057.25157-1-avihaih%40nvidia.com
patch subject: [PATCH v2] vfio/pci: Properly hide first-in-list PCIe extended capability
config: s390-randconfig-r053-20241122 (https://download.01.org/0day-ci/archive/20241123/202411230727.abAsDI8W-lkp@intel.com/config)
compiler: clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241123/202411230727.abAsDI8W-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411230727.abAsDI8W-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/vfio/pci/vfio_pci_config.c:24:
   In file included from include/linux/pci.h:39:
   In file included from include/linux/io.h:14:
   In file included from arch/s390/include/asm/io.h:95:
   include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:37:59: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
                                                             ^
   include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
   #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
                                                        ^
   In file included from drivers/vfio/pci/vfio_pci_config.c:24:
   In file included from include/linux/pci.h:39:
   In file included from include/linux/io.h:14:
   In file included from arch/s390/include/asm/io.h:95:
   include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:35:59: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
                                                             ^
   include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
   #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
                                                        ^
   In file included from drivers/vfio/pci/vfio_pci_config.c:24:
   In file included from include/linux/pci.h:39:
   In file included from include/linux/io.h:14:
   In file included from arch/s390/include/asm/io.h:95:
   include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsb(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsw(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsl(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesb(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesw(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesl(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
>> drivers/vfio/pci/vfio_pci_config.c:322:27: error: initializer element is not a compile-time constant
           [0 ... PCI_CAP_ID_MAX] = direct_ro_perms
                                    ^~~~~~~~~~~~~~~
   drivers/vfio/pci/vfio_pci_config.c:325:31: error: initializer element is not a compile-time constant
           [0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
                                        ^~~~~~~~~~~~~~~
   12 warnings and 2 errors generated.


vim +322 drivers/vfio/pci/vfio_pci_config.c

   319	
   320	/* Default capability regions to read-only, no-virtualization */
   321	static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
 > 322		[0 ... PCI_CAP_ID_MAX] = direct_ro_perms
   323	};
   324	static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
   325		[0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
   326	};
   327	/*
   328	 * Default unassigned regions to raw read-write access.  Some devices
   329	 * require this to function as they hide registers between the gaps in
   330	 * config space (be2net).  Like MMIO and I/O port registers, we have
   331	 * to trust the hardware isolation.
   332	 */
   333	static struct perm_bits unassigned_perms = {
   334		.readfn = vfio_raw_config_read,
   335		.writefn = vfio_raw_config_write
   336	};
   337
Alex Williamson Nov. 22, 2024, 10:02 p.m. UTC | #4
On Sat, 23 Nov 2024 05:33:10 +0800
kernel test robot <lkp@intel.com> wrote:

> Hi Avihai,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on awilliam-vfio/next]
> [also build test ERROR on linus/master awilliam-vfio/for-linus v6.12 next-20241122]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Avihai-Horon/vfio-pci-Properly-hide-first-in-list-PCIe-extended-capability/20241121-220249
> base:   https://github.com/awilliam/linux-vfio.git next
> patch link:    https://lore.kernel.org/r/20241121140057.25157-1-avihaih%40nvidia.com
> patch subject: [PATCH v2] vfio/pci: Properly hide first-in-list PCIe extended capability
> config: s390-randconfig-r053-20241122 (https://download.01.org/0day-ci/archive/20241123/202411230727.abAsDI8W-lkp@intel.com/config)
> compiler: clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241123/202411230727.abAsDI8W-lkp@intel.com/reproduce)

...
> >> drivers/vfio/pci/vfio_pci_config.c:322:27: error: initializer element is not a compile-time constant  
>            [0 ... PCI_CAP_ID_MAX] = direct_ro_perms
>                                     ^~~~~~~~~~~~~~~
>    drivers/vfio/pci/vfio_pci_config.c:325:31: error: initializer element is not a compile-time constant
>            [0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
>                                         ^~~~~~~~~~~~~~~
>    12 warnings and 2 errors generated.
> 
> 
> vim +322 drivers/vfio/pci/vfio_pci_config.c
> 
>    319	
>    320	/* Default capability regions to read-only, no-virtualization */
>    321	static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
>  > 322		[0 ... PCI_CAP_ID_MAX] = direct_ro_perms  
>    323	};

I thought declaring direct_ro_perms as const was enough to resolve
this, it is with gcc but I didn't test clang.  Feel free to leave the
existing declarations alone if there's not an obvious fix.  Thanks,

Alex
Yi Liu Nov. 25, 2024, 2:31 a.m. UTC | #5
On 2024/11/23 00:48, Alex Williamson wrote:
> On Fri, 22 Nov 2024 20:45:08 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/11/21 22:00, Avihai Horon wrote:
>>> There are cases where a PCIe extended capability should be hidden from
>>> the user. For example, an unknown capability (i.e., capability with ID
>>> greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally
>>> chosen to be hidden from the user.
>>>
>>> Hiding a capability is done by virtualizing and modifying the 'Next
>>> Capability Offset' field of the previous capability so it points to the
>>> capability after the one that should be hidden.
>>>
>>> The special case where the first capability in the list should be hidden
>>> is handled differently because there is no previous capability that can
>>> be modified. In this case, the capability ID and version are zeroed
>>> while leaving the next pointer intact. This hides the capability and
>>> leaves an anchor for the rest of the capability list.
>>>
>>> However, today, hiding the first capability in the list is not done
>>> properly if the capability is unknown, as struct
>>> vfio_pci_core_device->pci_config_map is set to the capability ID during
>>> initialization but the capability ID is not properly checked later when
>>> used in vfio_config_do_rw(). This leads to the following warning [1] and
>>> to an out-of-bounds access to ecap_perms array.
>>>
>>> Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater
>>> than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct
>>> read only access instead of the ecap_perms array.
>>>
>>> Note that this is safe since the above is the only case where cap_id can
>>> exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which
>>> are already checked before).
>>>
>>> [1]
>>>
>>> WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
>>
>> strange, it is not in the vfio_config_do_rw(). But never mind.
>>
>>> CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
>>> (snip)
>>> Call Trace:
>>>    <TASK>
>>>    ? show_regs+0x69/0x80
>>>    ? __warn+0x8d/0x140
>>>    ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
>>>    ? report_bug+0x18f/0x1a0
>>>    ? handle_bug+0x63/0xa0
>>>    ? exc_invalid_op+0x19/0x70
>>>    ? asm_exc_invalid_op+0x1b/0x20
>>>    ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
>>>    ? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
>>>    vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
>>>    vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
>>>    vfio_device_fops_read+0x27/0x40 [vfio]
>>>    vfs_read+0xbd/0x340
>>>    ? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
>>>    ? __rseq_handle_notify_resume+0xa4/0x4b0
>>>    __x64_sys_pread64+0x96/0xc0
>>>    x64_sys_call+0x1c3d/0x20d0
>>>    do_syscall_64+0x4d/0x120
>>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>> Changes from v1:
>>> * Use Alex's suggestion to fix the bug and adapt the commit message.
>>> ---
>>>    drivers/vfio/pci/vfio_pci_config.c | 20 ++++++++++++++++----
>>>    1 file changed, 16 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>> index 97422aafaa7b..b2a1ba66e5f1 100644
>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>> @@ -313,12 +313,16 @@ static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
>>>    	return count;
>>>    }
>>>    
>>> +static const struct perm_bits direct_ro_perms = {
>>> +	.readfn = vfio_direct_config_read,
>>> +};
>>> +
>>>    /* Default capability regions to read-only, no-virtualization */
>>>    static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
>>> -	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
>>> +	[0 ... PCI_CAP_ID_MAX] = direct_ro_perms
>>>    };
>>>    static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
>>> -	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
>>> +	[0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
>>>    };
>>>    /*
>>>     * Default unassigned regions to raw read-write access.  Some devices
>>> @@ -1897,9 +1901,17 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>>    		cap_start = *ppos;
>>>    	} else {
>>>    		if (*ppos >= PCI_CFG_SPACE_SIZE) {
>>> -			WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
>>> +			/*
>>> +			 * We can get a cap_id that exceeds PCI_EXT_CAP_ID_MAX
>>> +			 * if we're hiding an unknown capability at the start
>>> +			 * of the extended capability list.  Use default, ro
>>> +			 * access, which will virtualize the id and next values.
>>> +			 */
>>> +			if (cap_id > PCI_EXT_CAP_ID_MAX)
>>> +				perm = (struct perm_bits *)&direct_ro_perms;
>>> +			else
>>> +				perm = &ecap_perms[cap_id];
>>>    
>>> -			perm = &ecap_perms[cap_id];
>>>    			cap_start = vfio_find_cap_start(vdev, *ppos);
>>>    		} else {
>>>    			WARN_ON(cap_id > PCI_CAP_ID_MAX);
>>
>> Looks good to me. :) I'm able to trigger this warning by hide the first
>> ecap on my system with the below hack.
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_config.c
>> b/drivers/vfio/pci/vfio_pci_config.c
>> index b2a1ba66e5f1..db91e19a48b3 100644
>> --- a/drivers/vfio/pci/vfio_pci_config.c
>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>> @@ -1617,6 +1617,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device
>> *vdev)
>>    	u16 epos;
>>    	__le32 *prev = NULL;
>>    	int loops, ret, ecaps = 0;
>> +	int iii =0;
>>
>>    	if (!vdev->extended_caps)
>>    		return 0;
>> @@ -1635,7 +1636,11 @@ static int vfio_ecap_init(struct
>> vfio_pci_core_device *vdev)
>>    		if (ret)
>>    			return ret;
>>
>> -		ecap = PCI_EXT_CAP_ID(header);
>> +		if (iii == 0) {
>> +			ecap = 0x61;
>> +			iii++;
>> +		} else
>> +			ecap = PCI_EXT_CAP_ID(header);
>>
>>    		if (ecap <= PCI_EXT_CAP_ID_MAX) {
>>    			len = pci_ext_cap_length[ecap];
>> @@ -1664,6 +1669,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device
>> *vdev)
>>    			 */
>>    			len = PCI_CAP_SIZEOF;
>>    			hidden = true;
>> +			printk("%s set hide\n", __func__);
>>    		}
>>
>>    		for (i = 0; i < len; i++) {
>> @@ -1893,6 +1899,7 @@ static ssize_t vfio_config_do_rw(struct
>> vfio_pci_core_device *vdev, char __user
>>
>>    	cap_id = vdev->pci_config_map[*ppos];
>>
>> +	printk("%s cap_id: %x\n", __func__, cap_id);
>>    	if (cap_id == PCI_CAP_ID_INVALID) {
>>    		perm = &unassigned_perms;
>>    		cap_start = *ppos;
>>
>> And then this warning is gone after applying this patch. Hence,
>>
>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>> Tested-by: Yi Liu <yi.l.liu@intel.com>
> 
> Thanks, good testing!
>   
>> But I can still see a valid next pointer. Like the below log, I hide
>> the first ecap at offset 0x100, its ID is zeroed. The second ecap locates
>> at offset==0x150, its cap_id is 0x0018. I can see the next pointer in the
>> guest. Is it expected?
> 
> This is what makes hiding the first ecap unique, the ecap chain always
> starts at 0x100, the next pointer must be valid for the rest of the
> chain to remain.  For standard capabilities we can change the register
> pointing at the head of the list.  This therefore looks like expected
> behavior, unless I'm missing something more subtle in your example.

Got you. I was a little bit misled by the below comment. I thought the
cap_id, version and next would be zeroed. But the code actually only zeros
the cap_id and version. :)

		/*
		 * If we're just using this capability to anchor the list,
		 * hide the real ID.  Only count real ecaps.  XXX PCI spec
		 * indicates to use cap id = 0, version = 0, next = 0 if
		 * ecaps are absent, hope users check all the way to next.
		 */
		if (hidden)
			*(__le32 *)&vdev->vconfig[epos] &=
				cpu_to_le32((0xffcU << 20));
		else
			ecaps++;

>> Guest:
>> 100: 00 00 00 15 00 00 00 00 00 00 10 00 00 00 04 00
>> 110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 150: 18 00 01 16 00 00 00 00 00 00 00 00 00 00 00 00
>> 160: 17 00 01 17 05 02 01 00 00 00 00 00 00 00 00 00
>>
>> Host:
>> 100: 01 00 02 15 00 00 00 00 00 00 10 00 00 00 04 00
>> 110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 150: 18 00 01 16 00 00 00 00 00 00 00 00 00 00 00 00
>> 160: 17 00 01 17 05 02 01 00 00 00 00 00 00 00 00 00
>>
>>
>> BTW. If the first PCI cap is a unknown cap, will it have a problem? The
>> vfio_pci_core_device->pci_config_map is kept to be PCI_CAP_ID_INVALID,
>> hence it would use the unassigned_perms. But it makes more sense to use the
>> direct_ro_perms introduced here. is it?
> 
> Once we've masked the capability ID, if the guest driver is still
> touching the remaining body of the capability, we're really in the
> space of undefined behavior, imo.  We've already taken the stance that
> inter-capability space is accessible as a necessity for certain
> devices.  It's certainly been suggested that we might want to take a
> more guarded approach, even so far as readjusting the capability layout
> for compatibility.  We might head in that direction but I don't think
> we should start with this bug fix.  Thanks,

sure.
Alex Williamson Nov. 25, 2024, 3:20 a.m. UTC | #6
On Mon, 25 Nov 2024 10:31:38 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/11/23 00:48, Alex Williamson wrote:
> > On Fri, 22 Nov 2024 20:45:08 +0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> >> On 2024/11/21 22:00, Avihai Horon wrote:  
> >>> There are cases where a PCIe extended capability should be hidden from
> >>> the user. For example, an unknown capability (i.e., capability with ID
> >>> greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally
> >>> chosen to be hidden from the user.
> >>>
> >>> Hiding a capability is done by virtualizing and modifying the 'Next
> >>> Capability Offset' field of the previous capability so it points to the
> >>> capability after the one that should be hidden.
> >>>
> >>> The special case where the first capability in the list should be hidden
> >>> is handled differently because there is no previous capability that can
> >>> be modified. In this case, the capability ID and version are zeroed
> >>> while leaving the next pointer intact. This hides the capability and
> >>> leaves an anchor for the rest of the capability list.
> >>>
> >>> However, today, hiding the first capability in the list is not done
> >>> properly if the capability is unknown, as struct
> >>> vfio_pci_core_device->pci_config_map is set to the capability ID during
> >>> initialization but the capability ID is not properly checked later when
> >>> used in vfio_config_do_rw(). This leads to the following warning [1] and
> >>> to an out-of-bounds access to ecap_perms array.
> >>>
> >>> Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater
> >>> than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct
> >>> read only access instead of the ecap_perms array.
> >>>
> >>> Note that this is safe since the above is the only case where cap_id can
> >>> exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which
> >>> are already checked before).
> >>>
> >>> [1]
> >>>
> >>> WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]  
> >>
> >> strange, it is not in the vfio_config_do_rw(). But never mind.
> >>  
> >>> CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
> >>> (snip)
> >>> Call Trace:
> >>>    <TASK>
> >>>    ? show_regs+0x69/0x80
> >>>    ? __warn+0x8d/0x140
> >>>    ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
> >>>    ? report_bug+0x18f/0x1a0
> >>>    ? handle_bug+0x63/0xa0
> >>>    ? exc_invalid_op+0x19/0x70
> >>>    ? asm_exc_invalid_op+0x1b/0x20
> >>>    ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
> >>>    ? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
> >>>    vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
> >>>    vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
> >>>    vfio_device_fops_read+0x27/0x40 [vfio]
> >>>    vfs_read+0xbd/0x340
> >>>    ? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
> >>>    ? __rseq_handle_notify_resume+0xa4/0x4b0
> >>>    __x64_sys_pread64+0x96/0xc0
> >>>    x64_sys_call+0x1c3d/0x20d0
> >>>    do_syscall_64+0x4d/0x120
> >>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >>>
> >>> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
> >>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> >>> ---
> >>> Changes from v1:
> >>> * Use Alex's suggestion to fix the bug and adapt the commit message.
> >>> ---
> >>>    drivers/vfio/pci/vfio_pci_config.c | 20 ++++++++++++++++----
> >>>    1 file changed, 16 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> >>> index 97422aafaa7b..b2a1ba66e5f1 100644
> >>> --- a/drivers/vfio/pci/vfio_pci_config.c
> >>> +++ b/drivers/vfio/pci/vfio_pci_config.c
> >>> @@ -313,12 +313,16 @@ static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
> >>>    	return count;
> >>>    }
> >>>    
> >>> +static const struct perm_bits direct_ro_perms = {
> >>> +	.readfn = vfio_direct_config_read,
> >>> +};
> >>> +
> >>>    /* Default capability regions to read-only, no-virtualization */
> >>>    static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
> >>> -	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
> >>> +	[0 ... PCI_CAP_ID_MAX] = direct_ro_perms
> >>>    };
> >>>    static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
> >>> -	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
> >>> +	[0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
> >>>    };
> >>>    /*
> >>>     * Default unassigned regions to raw read-write access.  Some devices
> >>> @@ -1897,9 +1901,17 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
> >>>    		cap_start = *ppos;
> >>>    	} else {
> >>>    		if (*ppos >= PCI_CFG_SPACE_SIZE) {
> >>> -			WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
> >>> +			/*
> >>> +			 * We can get a cap_id that exceeds PCI_EXT_CAP_ID_MAX
> >>> +			 * if we're hiding an unknown capability at the start
> >>> +			 * of the extended capability list.  Use default, ro
> >>> +			 * access, which will virtualize the id and next values.
> >>> +			 */
> >>> +			if (cap_id > PCI_EXT_CAP_ID_MAX)
> >>> +				perm = (struct perm_bits *)&direct_ro_perms;
> >>> +			else
> >>> +				perm = &ecap_perms[cap_id];
> >>>    
> >>> -			perm = &ecap_perms[cap_id];
> >>>    			cap_start = vfio_find_cap_start(vdev, *ppos);
> >>>    		} else {
> >>>    			WARN_ON(cap_id > PCI_CAP_ID_MAX);  
> >>
> >> Looks good to me. :) I'm able to trigger this warning by hide the first
> >> ecap on my system with the below hack.
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_config.c
> >> b/drivers/vfio/pci/vfio_pci_config.c
> >> index b2a1ba66e5f1..db91e19a48b3 100644
> >> --- a/drivers/vfio/pci/vfio_pci_config.c
> >> +++ b/drivers/vfio/pci/vfio_pci_config.c
> >> @@ -1617,6 +1617,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device
> >> *vdev)
> >>    	u16 epos;
> >>    	__le32 *prev = NULL;
> >>    	int loops, ret, ecaps = 0;
> >> +	int iii =0;
> >>
> >>    	if (!vdev->extended_caps)
> >>    		return 0;
> >> @@ -1635,7 +1636,11 @@ static int vfio_ecap_init(struct
> >> vfio_pci_core_device *vdev)
> >>    		if (ret)
> >>    			return ret;
> >>
> >> -		ecap = PCI_EXT_CAP_ID(header);
> >> +		if (iii == 0) {
> >> +			ecap = 0x61;
> >> +			iii++;
> >> +		} else
> >> +			ecap = PCI_EXT_CAP_ID(header);
> >>
> >>    		if (ecap <= PCI_EXT_CAP_ID_MAX) {
> >>    			len = pci_ext_cap_length[ecap];
> >> @@ -1664,6 +1669,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device
> >> *vdev)
> >>    			 */
> >>    			len = PCI_CAP_SIZEOF;
> >>    			hidden = true;
> >> +			printk("%s set hide\n", __func__);
> >>    		}
> >>
> >>    		for (i = 0; i < len; i++) {
> >> @@ -1893,6 +1899,7 @@ static ssize_t vfio_config_do_rw(struct
> >> vfio_pci_core_device *vdev, char __user
> >>
> >>    	cap_id = vdev->pci_config_map[*ppos];
> >>
> >> +	printk("%s cap_id: %x\n", __func__, cap_id);
> >>    	if (cap_id == PCI_CAP_ID_INVALID) {
> >>    		perm = &unassigned_perms;
> >>    		cap_start = *ppos;
> >>
> >> And then this warning is gone after applying this patch. Hence,
> >>
> >> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> >> Tested-by: Yi Liu <yi.l.liu@intel.com>  
> > 
> > Thanks, good testing!
> >     
> >> But I can still see a valid next pointer. Like the below log, I hide
> >> the first ecap at offset 0x100, its ID is zeroed. The second ecap locates
> >> at offset==0x150, its cap_id is 0x0018. I can see the next pointer in the
> >> guest. Is it expected?  
> > 
> > This is what makes hiding the first ecap unique, the ecap chain always
> > starts at 0x100, the next pointer must be valid for the rest of the
> > chain to remain.  For standard capabilities we can change the register
> > pointing at the head of the list.  This therefore looks like expected
> > behavior, unless I'm missing something more subtle in your example.  
> 
> Got you. I was a little bit misled by the below comment. I thought the
> cap_id, version and next would be zeroed. But the code actually only zeros
> the cap_id and version. :)
> 
> 		/*
> 		 * If we're just using this capability to anchor the list,
> 		 * hide the real ID.  Only count real ecaps.  XXX PCI spec
> 		 * indicates to use cap id = 0, version = 0, next = 0 if
> 		 * ecaps are absent, hope users check all the way to next.
> 		 */

This is just expressing concern that a sloppy guest might decide the
list ends at the zero capability ID w/o strictly following the spec.
We've never seen evidence of such a guest.  Thanks,

Alex
Yi Liu Nov. 25, 2024, 4:53 a.m. UTC | #7
On 2024/11/25 11:20, Alex Williamson wrote:
> On Mon, 25 Nov 2024 10:31:38 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/11/23 00:48, Alex Williamson wrote:
>>> On Fri, 22 Nov 2024 20:45:08 +0800
>>> Yi Liu <yi.l.liu@intel.com> wrote:
>>>    
>>>> On 2024/11/21 22:00, Avihai Horon wrote:
>>>>> There are cases where a PCIe extended capability should be hidden from
>>>>> the user. For example, an unknown capability (i.e., capability with ID
>>>>> greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally
>>>>> chosen to be hidden from the user.
>>>>>
>>>>> Hiding a capability is done by virtualizing and modifying the 'Next
>>>>> Capability Offset' field of the previous capability so it points to the
>>>>> capability after the one that should be hidden.
>>>>>
>>>>> The special case where the first capability in the list should be hidden
>>>>> is handled differently because there is no previous capability that can
>>>>> be modified. In this case, the capability ID and version are zeroed
>>>>> while leaving the next pointer intact. This hides the capability and
>>>>> leaves an anchor for the rest of the capability list.
>>>>>
>>>>> However, today, hiding the first capability in the list is not done
>>>>> properly if the capability is unknown, as struct
>>>>> vfio_pci_core_device->pci_config_map is set to the capability ID during
>>>>> initialization but the capability ID is not properly checked later when
>>>>> used in vfio_config_do_rw(). This leads to the following warning [1] and
>>>>> to an out-of-bounds access to ecap_perms array.
>>>>>
>>>>> Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater
>>>>> than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct
>>>>> read only access instead of the ecap_perms array.
>>>>>
>>>>> Note that this is safe since the above is the only case where cap_id can
>>>>> exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which
>>>>> are already checked before).
>>>>>
>>>>> [1]
>>>>>
>>>>> WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
>>>>
>>>> strange, it is not in the vfio_config_do_rw(). But never mind.
>>>>   
>>>>> CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
>>>>> (snip)
>>>>> Call Trace:
>>>>>     <TASK>
>>>>>     ? show_regs+0x69/0x80
>>>>>     ? __warn+0x8d/0x140
>>>>>     ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
>>>>>     ? report_bug+0x18f/0x1a0
>>>>>     ? handle_bug+0x63/0xa0
>>>>>     ? exc_invalid_op+0x19/0x70
>>>>>     ? asm_exc_invalid_op+0x1b/0x20
>>>>>     ? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
>>>>>     ? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
>>>>>     vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
>>>>>     vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
>>>>>     vfio_device_fops_read+0x27/0x40 [vfio]
>>>>>     vfs_read+0xbd/0x340
>>>>>     ? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
>>>>>     ? __rseq_handle_notify_resume+0xa4/0x4b0
>>>>>     __x64_sys_pread64+0x96/0xc0
>>>>>     x64_sys_call+0x1c3d/0x20d0
>>>>>     do_syscall_64+0x4d/0x120
>>>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>
>>>>> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
>>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>>> ---
>>>>> Changes from v1:
>>>>> * Use Alex's suggestion to fix the bug and adapt the commit message.
>>>>> ---
>>>>>     drivers/vfio/pci/vfio_pci_config.c | 20 ++++++++++++++++----
>>>>>     1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>>>> index 97422aafaa7b..b2a1ba66e5f1 100644
>>>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>>>> @@ -313,12 +313,16 @@ static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
>>>>>     	return count;
>>>>>     }
>>>>>     
>>>>> +static const struct perm_bits direct_ro_perms = {
>>>>> +	.readfn = vfio_direct_config_read,
>>>>> +};
>>>>> +
>>>>>     /* Default capability regions to read-only, no-virtualization */
>>>>>     static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
>>>>> -	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
>>>>> +	[0 ... PCI_CAP_ID_MAX] = direct_ro_perms
>>>>>     };
>>>>>     static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
>>>>> -	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
>>>>> +	[0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
>>>>>     };
>>>>>     /*
>>>>>      * Default unassigned regions to raw read-write access.  Some devices
>>>>> @@ -1897,9 +1901,17 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>>>>     		cap_start = *ppos;
>>>>>     	} else {
>>>>>     		if (*ppos >= PCI_CFG_SPACE_SIZE) {
>>>>> -			WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
>>>>> +			/*
>>>>> +			 * We can get a cap_id that exceeds PCI_EXT_CAP_ID_MAX
>>>>> +			 * if we're hiding an unknown capability at the start
>>>>> +			 * of the extended capability list.  Use default, ro
>>>>> +			 * access, which will virtualize the id and next values.
>>>>> +			 */
>>>>> +			if (cap_id > PCI_EXT_CAP_ID_MAX)
>>>>> +				perm = (struct perm_bits *)&direct_ro_perms;
>>>>> +			else
>>>>> +				perm = &ecap_perms[cap_id];
>>>>>     
>>>>> -			perm = &ecap_perms[cap_id];
>>>>>     			cap_start = vfio_find_cap_start(vdev, *ppos);
>>>>>     		} else {
>>>>>     			WARN_ON(cap_id > PCI_CAP_ID_MAX);
>>>>
>>>> Looks good to me. :) I'm able to trigger this warning by hide the first
>>>> ecap on my system with the below hack.
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c
>>>> b/drivers/vfio/pci/vfio_pci_config.c
>>>> index b2a1ba66e5f1..db91e19a48b3 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>>> @@ -1617,6 +1617,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device
>>>> *vdev)
>>>>     	u16 epos;
>>>>     	__le32 *prev = NULL;
>>>>     	int loops, ret, ecaps = 0;
>>>> +	int iii =0;
>>>>
>>>>     	if (!vdev->extended_caps)
>>>>     		return 0;
>>>> @@ -1635,7 +1636,11 @@ static int vfio_ecap_init(struct
>>>> vfio_pci_core_device *vdev)
>>>>     		if (ret)
>>>>     			return ret;
>>>>
>>>> -		ecap = PCI_EXT_CAP_ID(header);
>>>> +		if (iii == 0) {
>>>> +			ecap = 0x61;
>>>> +			iii++;
>>>> +		} else
>>>> +			ecap = PCI_EXT_CAP_ID(header);
>>>>
>>>>     		if (ecap <= PCI_EXT_CAP_ID_MAX) {
>>>>     			len = pci_ext_cap_length[ecap];
>>>> @@ -1664,6 +1669,7 @@ static int vfio_ecap_init(struct vfio_pci_core_device
>>>> *vdev)
>>>>     			 */
>>>>     			len = PCI_CAP_SIZEOF;
>>>>     			hidden = true;
>>>> +			printk("%s set hide\n", __func__);
>>>>     		}
>>>>
>>>>     		for (i = 0; i < len; i++) {
>>>> @@ -1893,6 +1899,7 @@ static ssize_t vfio_config_do_rw(struct
>>>> vfio_pci_core_device *vdev, char __user
>>>>
>>>>     	cap_id = vdev->pci_config_map[*ppos];
>>>>
>>>> +	printk("%s cap_id: %x\n", __func__, cap_id);
>>>>     	if (cap_id == PCI_CAP_ID_INVALID) {
>>>>     		perm = &unassigned_perms;
>>>>     		cap_start = *ppos;
>>>>
>>>> And then this warning is gone after applying this patch. Hence,
>>>>
>>>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>>>> Tested-by: Yi Liu <yi.l.liu@intel.com>
>>>
>>> Thanks, good testing!
>>>      
>>>> But I can still see a valid next pointer. Like the below log, I hide
>>>> the first ecap at offset 0x100, its ID is zeroed. The second ecap locates
>>>> at offset==0x150, its cap_id is 0x0018. I can see the next pointer in the
>>>> guest. Is it expected?
>>>
>>> This is what makes hiding the first ecap unique, the ecap chain always
>>> starts at 0x100, the next pointer must be valid for the rest of the
>>> chain to remain.  For standard capabilities we can change the register
>>> pointing at the head of the list.  This therefore looks like expected
>>> behavior, unless I'm missing something more subtle in your example.
>>
>> Got you. I was a little bit misled by the below comment. I thought the
>> cap_id, version and next would be zeroed. But the code actually only zeros
>> the cap_id and version. :)
>>
>> 		/*
>> 		 * If we're just using this capability to anchor the list,
>> 		 * hide the real ID.  Only count real ecaps.  XXX PCI spec
>> 		 * indicates to use cap id = 0, version = 0, next = 0 if
>> 		 * ecaps are absent, hope users check all the way to next.
>> 		 */
> 
> This is just expressing concern that a sloppy guest might decide the
> list ends at the zero capability ID w/o strictly following the spec.
> We've never seen evidence of such a guest.  Thanks,

got it. I see linux pci_find_next_ext_capability() does it correctly.
diff mbox series

Patch

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 97422aafaa7b..b2a1ba66e5f1 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -313,12 +313,16 @@  static int vfio_virt_config_read(struct vfio_pci_core_device *vdev, int pos,
 	return count;
 }
 
+static const struct perm_bits direct_ro_perms = {
+	.readfn = vfio_direct_config_read,
+};
+
 /* Default capability regions to read-only, no-virtualization */
 static struct perm_bits cap_perms[PCI_CAP_ID_MAX + 1] = {
-	[0 ... PCI_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+	[0 ... PCI_CAP_ID_MAX] = direct_ro_perms
 };
 static struct perm_bits ecap_perms[PCI_EXT_CAP_ID_MAX + 1] = {
-	[0 ... PCI_EXT_CAP_ID_MAX] = { .readfn = vfio_direct_config_read }
+	[0 ... PCI_EXT_CAP_ID_MAX] = direct_ro_perms
 };
 /*
  * Default unassigned regions to raw read-write access.  Some devices
@@ -1897,9 +1901,17 @@  static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
 		cap_start = *ppos;
 	} else {
 		if (*ppos >= PCI_CFG_SPACE_SIZE) {
-			WARN_ON(cap_id > PCI_EXT_CAP_ID_MAX);
+			/*
+			 * We can get a cap_id that exceeds PCI_EXT_CAP_ID_MAX
+			 * if we're hiding an unknown capability at the start
+			 * of the extended capability list.  Use default, ro
+			 * access, which will virtualize the id and next values.
+			 */
+			if (cap_id > PCI_EXT_CAP_ID_MAX)
+				perm = (struct perm_bits *)&direct_ro_perms;
+			else
+				perm = &ecap_perms[cap_id];
 
-			perm = &ecap_perms[cap_id];
 			cap_start = vfio_find_cap_start(vdev, *ppos);
 		} else {
 			WARN_ON(cap_id > PCI_CAP_ID_MAX);