diff mbox series

[v3] x86/msi: fix locking for SR-IOV devices

Message ID 20240812203938.981588-1-stewart.hildebrand@amd.com (mailing list archive)
State Superseded
Headers show
Series [v3] x86/msi: fix locking for SR-IOV devices | expand

Commit Message

Stewart Hildebrand Aug. 12, 2024, 8:39 p.m. UTC
In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
structure") a lock moved from allocate_and_map_msi_pirq() to the caller
and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
call path wasn't updated to reflect the change, leading to a failed
assertion observed under the following conditions:

* PV dom0
* Debug build (debug=y) of Xen
* There is an SR-IOV device in the system with one or more VFs enabled
* Dom0 has loaded the driver for the VF and enabled MSI-X

(XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
(XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
(XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
(XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
(XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
(XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
(XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
(XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
(XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
(XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
(XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150

In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
associated PF to access the vf_rlen array. This array is initialized in
pci_add_device() and is only populated in the associated PF's struct
pci_dev.

Add a link from the VF's struct pci_dev to the associated PF struct
pci_dev, ensuring the PF's struct doesn't get deallocated until all its
VFs have gone away. Access the vf_rlen array via the new link to the PF,
and remove the troublesome call to pci_get_pdev().

Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
pci_add_device() to set up the link from VF to PF. In case the new
pci_add_device() invocation fails to find the associated PF (returning
NULL), we are no worse off than before: read_pci_mem_bar() will still
return 0 in that case.

Note that currently the only way for Xen to know if a device is a VF is
if the toolstack tells Xen about it. Using PHYSDEVOP_manage_pci_add for
a VF is not a case that Xen handles.

Fixes: 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci structure")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
---
Candidate for backport to 4.19

v2->v3:
* link from VF to PF's struct pci_dev *

v1->v2:
* remove call to pci_get_pdev()
---
 xen/arch/x86/msi.c            | 34 ++++++++++++++-------------
 xen/drivers/passthrough/pci.c | 44 +++++++++++++++++++++++++++++++----
 xen/include/xen/pci.h         | 12 +++++++++-
 3 files changed, 69 insertions(+), 21 deletions(-)


base-commit: d61248b0fa2cdfc0be1d806c43d1b211a72b5d49

Comments

Jan Beulich Aug. 13, 2024, 2:01 p.m. UTC | #1
On 12.08.2024 22:39, Stewart Hildebrand wrote:
> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
> call path wasn't updated to reflect the change, leading to a failed
> assertion observed under the following conditions:
> 
> * PV dom0
> * Debug build (debug=y) of Xen
> * There is an SR-IOV device in the system with one or more VFs enabled
> * Dom0 has loaded the driver for the VF and enabled MSI-X
> 
> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
> 
> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
> associated PF to access the vf_rlen array. This array is initialized in
> pci_add_device() and is only populated in the associated PF's struct
> pci_dev.
> 
> Add a link from the VF's struct pci_dev to the associated PF struct
> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
> VFs have gone away. Access the vf_rlen array via the new link to the PF,
> and remove the troublesome call to pci_get_pdev().
> 
> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
> pci_add_device() to set up the link from VF to PF. In case the new
> pci_add_device() invocation fails to find the associated PF (returning
> NULL), we are no worse off than before: read_pci_mem_bar() will still
> return 0 in that case.
> 
> Note that currently the only way for Xen to know if a device is a VF is
> if the toolstack tells Xen about it. Using PHYSDEVOP_manage_pci_add for
> a VF is not a case that Xen handles.

How does the toolstack come into play here? It's still the Dom0 kernel to
tell Xen, via PHYSDEVOP_pci_device_add (preferred) or
PHYSDEVOP_manage_pci_add_ext (kind of deprecated; PHYSDEVOP_manage_pci_add
is even more kind of deprecated).

> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -662,34 +662,32 @@ static int msi_capability_init(struct pci_dev *dev,
>      return 0;
>  }
>  
> -static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
> +static u64 read_pci_mem_bar(struct pf_info *pf_info, u16 seg, u8 bus, u8 slot,
> +                            u8 func, u8 bir, int vf)
>  {
> +    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
>      u8 limit;
>      u32 addr, base = PCI_BASE_ADDRESS_0;
>      u64 disp = 0;
>  
>      if ( vf >= 0 )
>      {
> -        struct pci_dev *pdev = pci_get_pdev(NULL,
> -                                            PCI_SBDF(seg, bus, slot, func));
>          unsigned int pos;
>          uint16_t ctrl, num_vf, offset, stride;
>  
> -        if ( !pdev )
> -            return 0;
> -
> -        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
> -        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
> -        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
> -        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
> -        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
> +        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
> +        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
> +        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
> +        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
> +        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
>  
>          if ( !pos ||
>               !(ctrl & PCI_SRIOV_CTRL_VFE) ||
>               !(ctrl & PCI_SRIOV_CTRL_MSE) ||
>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>               bir >= PCI_SRIOV_NUM_BARS ||
> -             !pdev->vf_rlen[bir] )
> +             !pf_info ||

See further down - I think it would be nice if we didn't need this new
check.

> @@ -813,6 +811,7 @@ static int msix_capability_init(struct pci_dev *dev,
>          int vf;
>          paddr_t pba_paddr;
>          unsigned int pba_offset;
> +        struct pf_info *pf_info = NULL;

Pointer-to-const?

> @@ -827,9 +826,12 @@ static int msix_capability_init(struct pci_dev *dev,
>              pslot = PCI_SLOT(dev->info.physfn.devfn);
>              pfunc = PCI_FUNC(dev->info.physfn.devfn);
>              vf = dev->sbdf.bdf;
> +            if ( dev->virtfn.pf_pdev )
> +                pf_info = &dev->virtfn.pf_pdev->physfn;
>          }
>  
> -        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
> +        table_paddr = read_pci_mem_bar(pf_info, seg, pbus, pslot, pfunc, bir,
> +                                       vf);

I don't think putting the new arg first is very nice. SBDF should remain
first imo. Between the latter arguments I'm not as fussed.

> @@ -446,7 +448,27 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>  
>      list_del(&pdev->alldevs_list);
>      pdev_msi_deinit(pdev);
> -    xfree(pdev);
> +
> +    if ( pdev->info.is_virtfn )
> +    {
> +        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
> +
> +        if ( pf_pdev )
> +        {
> +            list_del(&pdev->virtfn.next);
> +            if ( pf_pdev->physfn.dealloc_pending &&
> +                 list_empty(&pf_pdev->physfn.vf_list) )
> +                xfree(pf_pdev);
> +        }
> +        xfree(pdev);
> +    }
> +    else
> +    {
> +        if ( list_empty(&pdev->physfn.vf_list) )
> +            xfree(pdev);
> +        else
> +            pdev->physfn.dealloc_pending = true;
> +    }

Could I talk you into un-indenting the last if/else by a level, by making
the earlier else and "else if()"?

One aspect I didn't properly consider when making the suggestion: What if,
without all VFs having gone away, the PF is re-added? In that case we
would better recycle the existing structure. That's getting a little
complicated, so maybe a better approach is to refuse the request (in
pci_remove_device()) when the list isn't empty?

> @@ -700,10 +722,22 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>           * extended function.
>           */
>          if ( pdev->info.is_virtfn )
> +        {
> +            struct pci_dev *pf_pdev;
> +
>              pdev->info.is_extfn = pf_is_extfn;
> +            pf_pdev = pci_get_pdev(NULL,
> +                                   PCI_SBDF(seg, info->physfn.bus,
> +                                            info->physfn.devfn));
> +            if ( pf_pdev )
> +            {
> +                pdev->virtfn.pf_pdev = pf_pdev;
> +                list_add(&pdev->virtfn.next, &pf_pdev->physfn.vf_list);
> +            }
> +        }

Hmm, yes, we can't really use the result of the earlier pci_get_pdev(), as
we drop the lock intermediately. I wonder though if we wouldn't better fail
the function if we can't find the PF instance.

> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -150,7 +150,17 @@ struct pci_dev {
>          unsigned int count;
>  #define PT_FAULT_THRESHOLD 10
>      } fault;
> -    u64 vf_rlen[6];
> +    struct pf_info {
> +        /* Only populated for PFs (info.is_virtfn == false) */
> +        uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
> +        struct list_head vf_list;             /* List head */
> +        bool dealloc_pending;
> +    } physfn;
> +    struct {
> +        /* Only populated for VFs (info.is_virtfn == true) */
> +        struct pci_dev *pf_pdev;              /* Link from VF to PF */
> +        struct list_head next;                /* Entry in vf_list */

For doubly-linked lists "next" isn't really a good name. Since both struct
variants need such a member, why not use vf_list uniformly? That'll then
also lower the significance of my other question:

> +    } virtfn;

Should the two new struct-s perhaps be a union?

Jan
Stewart Hildebrand Aug. 15, 2024, 1:28 a.m. UTC | #2
On 8/13/24 10:01, Jan Beulich wrote:
> On 12.08.2024 22:39, Stewart Hildebrand wrote:
>> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
>> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
>> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
>> call path wasn't updated to reflect the change, leading to a failed
>> assertion observed under the following conditions:
>>
>> * PV dom0
>> * Debug build (debug=y) of Xen
>> * There is an SR-IOV device in the system with one or more VFs enabled
>> * Dom0 has loaded the driver for the VF and enabled MSI-X
>>
>> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
>> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
>> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
>> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
>> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
>> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
>> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
>> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
>> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
>> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
>> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
>>
>> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
>> associated PF to access the vf_rlen array. This array is initialized in
>> pci_add_device() and is only populated in the associated PF's struct
>> pci_dev.
>>
>> Add a link from the VF's struct pci_dev to the associated PF struct
>> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
>> VFs have gone away. Access the vf_rlen array via the new link to the PF,
>> and remove the troublesome call to pci_get_pdev().
>>
>> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
>> pci_add_device() to set up the link from VF to PF. In case the new
>> pci_add_device() invocation fails to find the associated PF (returning
>> NULL), we are no worse off than before: read_pci_mem_bar() will still
>> return 0 in that case.
>>
>> Note that currently the only way for Xen to know if a device is a VF is
>> if the toolstack tells Xen about it. Using PHYSDEVOP_manage_pci_add for
>> a VF is not a case that Xen handles.
> 
> How does the toolstack come into play here? It's still the Dom0 kernel to
> tell Xen, via PHYSDEVOP_pci_device_add (preferred) or
> PHYSDEVOP_manage_pci_add_ext (kind of deprecated; PHYSDEVOP_manage_pci_add
> is even more kind of deprecated).

I guess I meant to say Dom0 kernel, not toolstack. I'm actually
questioning how much value this last portion of the commit description
is really adding. Maybe it would be better to just remove this bit.

>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -662,34 +662,32 @@ static int msi_capability_init(struct pci_dev *dev,
>>      return 0;
>>  }
>>  
>> -static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
>> +static u64 read_pci_mem_bar(struct pf_info *pf_info, u16 seg, u8 bus, u8 slot,
>> +                            u8 func, u8 bir, int vf)
>>  {
>> +    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
>>      u8 limit;
>>      u32 addr, base = PCI_BASE_ADDRESS_0;
>>      u64 disp = 0;
>>  
>>      if ( vf >= 0 )
>>      {
>> -        struct pci_dev *pdev = pci_get_pdev(NULL,
>> -                                            PCI_SBDF(seg, bus, slot, func));
>>          unsigned int pos;
>>          uint16_t ctrl, num_vf, offset, stride;
>>  
>> -        if ( !pdev )
>> -            return 0;
>> -
>> -        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
>> -        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
>> -        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
>> -        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> -        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
>> +        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
>> +        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
>> +        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
>> +        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> +        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
>>  
>>          if ( !pos ||
>>               !(ctrl & PCI_SRIOV_CTRL_VFE) ||
>>               !(ctrl & PCI_SRIOV_CTRL_MSE) ||
>>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>>               bir >= PCI_SRIOV_NUM_BARS ||
>> -             !pdev->vf_rlen[bir] )
>> +             !pf_info ||
> 
> See further down - I think it would be nice if we didn't need this new
> check.
> 
>> @@ -813,6 +811,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>          int vf;
>>          paddr_t pba_paddr;
>>          unsigned int pba_offset;
>> +        struct pf_info *pf_info = NULL;
> 
> Pointer-to-const?
> 
>> @@ -827,9 +826,12 @@ static int msix_capability_init(struct pci_dev *dev,
>>              pslot = PCI_SLOT(dev->info.physfn.devfn);
>>              pfunc = PCI_FUNC(dev->info.physfn.devfn);
>>              vf = dev->sbdf.bdf;
>> +            if ( dev->virtfn.pf_pdev )
>> +                pf_info = &dev->virtfn.pf_pdev->physfn;
>>          }
>>  
>> -        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
>> +        table_paddr = read_pci_mem_bar(pf_info, seg, pbus, pslot, pfunc, bir,
>> +                                       vf);
> 
> I don't think putting the new arg first is very nice. SBDF should remain
> first imo. Between the latter arguments I'm not as fussed.
> 
>> @@ -446,7 +448,27 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>>  
>>      list_del(&pdev->alldevs_list);
>>      pdev_msi_deinit(pdev);
>> -    xfree(pdev);
>> +
>> +    if ( pdev->info.is_virtfn )
>> +    {
>> +        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
>> +
>> +        if ( pf_pdev )
>> +        {
>> +            list_del(&pdev->virtfn.next);
>> +            if ( pf_pdev->physfn.dealloc_pending &&
>> +                 list_empty(&pf_pdev->physfn.vf_list) )
>> +                xfree(pf_pdev);
>> +        }
>> +        xfree(pdev);
>> +    }
>> +    else
>> +    {
>> +        if ( list_empty(&pdev->physfn.vf_list) )
>> +            xfree(pdev);
>> +        else
>> +            pdev->physfn.dealloc_pending = true;
>> +    }
> 
> Could I talk you into un-indenting the last if/else by a level, by making
> the earlier else and "else if()"?
> 
> One aspect I didn't properly consider when making the suggestion: What if,
> without all VFs having gone away, the PF is re-added? In that case we
> would better recycle the existing structure. That's getting a little
> complicated, so maybe a better approach is to refuse the request (in
> pci_remove_device()) when the list isn't empty?

Hm. Right. The growing complexity of maintaining these PF<->VF links
(introduced on a hunch that they may be useful in the future) is making
me favor the previous approach from v2 of simply copying the vf_len
array. So unless there are any objections I'll go back to that approach
for v4.

>> @@ -700,10 +722,22 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>           * extended function.
>>           */
>>          if ( pdev->info.is_virtfn )
>> +        {
>> +            struct pci_dev *pf_pdev;
>> +
>>              pdev->info.is_extfn = pf_is_extfn;
>> +            pf_pdev = pci_get_pdev(NULL,
>> +                                   PCI_SBDF(seg, info->physfn.bus,
>> +                                            info->physfn.devfn));
>> +            if ( pf_pdev )
>> +            {
>> +                pdev->virtfn.pf_pdev = pf_pdev;
>> +                list_add(&pdev->virtfn.next, &pf_pdev->physfn.vf_list);
>> +            }
>> +        }
> 
> Hmm, yes, we can't really use the result of the earlier pci_get_pdev(), as
> we drop the lock intermediately. I wonder though if we wouldn't better fail
> the function if we can't find the PF instance.
> 
>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -150,7 +150,17 @@ struct pci_dev {
>>          unsigned int count;
>>  #define PT_FAULT_THRESHOLD 10
>>      } fault;
>> -    u64 vf_rlen[6];
>> +    struct pf_info {
>> +        /* Only populated for PFs (info.is_virtfn == false) */
>> +        uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
>> +        struct list_head vf_list;             /* List head */
>> +        bool dealloc_pending;
>> +    } physfn;
>> +    struct {
>> +        /* Only populated for VFs (info.is_virtfn == true) */
>> +        struct pci_dev *pf_pdev;              /* Link from VF to PF */
>> +        struct list_head next;                /* Entry in vf_list */
> 
> For doubly-linked lists "next" isn't really a good name. Since both struct
> variants need such a member, why not use vf_list uniformly? That'll then
> also lower the significance of my other question:
> 
>> +    } virtfn;
> 
> Should the two new struct-s perhaps be a union?
> 
> Jan
Jan Beulich Aug. 15, 2024, 8:36 a.m. UTC | #3
On 15.08.2024 03:28, Stewart Hildebrand wrote:
> On 8/13/24 10:01, Jan Beulich wrote:
>> On 12.08.2024 22:39, Stewart Hildebrand wrote:
>>> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
>>> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
>>> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
>>> call path wasn't updated to reflect the change, leading to a failed
>>> assertion observed under the following conditions:
>>>
>>> * PV dom0
>>> * Debug build (debug=y) of Xen
>>> * There is an SR-IOV device in the system with one or more VFs enabled
>>> * Dom0 has loaded the driver for the VF and enabled MSI-X
>>>
>>> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
>>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
>>> ...
>>> (XEN) Xen call trace:
>>> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
>>> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
>>> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
>>> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
>>> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
>>> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
>>> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
>>> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
>>> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
>>> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
>>> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
>>>
>>> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
>>> associated PF to access the vf_rlen array. This array is initialized in
>>> pci_add_device() and is only populated in the associated PF's struct
>>> pci_dev.
>>>
>>> Add a link from the VF's struct pci_dev to the associated PF struct
>>> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
>>> VFs have gone away. Access the vf_rlen array via the new link to the PF,
>>> and remove the troublesome call to pci_get_pdev().
>>>
>>> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
>>> pci_add_device() to set up the link from VF to PF. In case the new
>>> pci_add_device() invocation fails to find the associated PF (returning
>>> NULL), we are no worse off than before: read_pci_mem_bar() will still
>>> return 0 in that case.
>>>
>>> Note that currently the only way for Xen to know if a device is a VF is
>>> if the toolstack tells Xen about it. Using PHYSDEVOP_manage_pci_add for
>>> a VF is not a case that Xen handles.
>>
>> How does the toolstack come into play here? It's still the Dom0 kernel to
>> tell Xen, via PHYSDEVOP_pci_device_add (preferred) or
>> PHYSDEVOP_manage_pci_add_ext (kind of deprecated; PHYSDEVOP_manage_pci_add
>> is even more kind of deprecated).
> 
> I guess I meant to say Dom0 kernel, not toolstack. I'm actually
> questioning how much value this last portion of the commit description
> is really adding. Maybe it would be better to just remove this bit.

+1

>>> @@ -446,7 +448,27 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>>>  
>>>      list_del(&pdev->alldevs_list);
>>>      pdev_msi_deinit(pdev);
>>> -    xfree(pdev);
>>> +
>>> +    if ( pdev->info.is_virtfn )
>>> +    {
>>> +        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
>>> +
>>> +        if ( pf_pdev )
>>> +        {
>>> +            list_del(&pdev->virtfn.next);
>>> +            if ( pf_pdev->physfn.dealloc_pending &&
>>> +                 list_empty(&pf_pdev->physfn.vf_list) )
>>> +                xfree(pf_pdev);
>>> +        }
>>> +        xfree(pdev);
>>> +    }
>>> +    else
>>> +    {
>>> +        if ( list_empty(&pdev->physfn.vf_list) )
>>> +            xfree(pdev);
>>> +        else
>>> +            pdev->physfn.dealloc_pending = true;
>>> +    }
>>
>> Could I talk you into un-indenting the last if/else by a level, by making
>> the earlier else and "else if()"?
>>
>> One aspect I didn't properly consider when making the suggestion: What if,
>> without all VFs having gone away, the PF is re-added? In that case we
>> would better recycle the existing structure. That's getting a little
>> complicated, so maybe a better approach is to refuse the request (in
>> pci_remove_device()) when the list isn't empty?
> 
> Hm. Right. The growing complexity of maintaining these PF<->VF links
> (introduced on a hunch that they may be useful in the future) is making
> me favor the previous approach from v2 of simply copying the vf_len
> array. So unless there are any objections I'll go back to that approach
> for v4.

I continue to consider the earlier approach at least undesirable. The
vf_len[] array shouldn't be copied around, risking for it to be / go
stale. There should be one central place for every bit of information,
unless there are very good reasons to duplicate something.

Jan
Stewart Hildebrand Aug. 23, 2024, 6:02 p.m. UTC | #4
On 8/15/24 04:36, Jan Beulich wrote:
> On 15.08.2024 03:28, Stewart Hildebrand wrote:
>> On 8/13/24 10:01, Jan Beulich wrote:
>>> On 12.08.2024 22:39, Stewart Hildebrand wrote:
>>>> @@ -446,7 +448,27 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>>>>  
>>>>      list_del(&pdev->alldevs_list);
>>>>      pdev_msi_deinit(pdev);
>>>> -    xfree(pdev);
>>>> +
>>>> +    if ( pdev->info.is_virtfn )
>>>> +    {
>>>> +        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
>>>> +
>>>> +        if ( pf_pdev )
>>>> +        {
>>>> +            list_del(&pdev->virtfn.next);
>>>> +            if ( pf_pdev->physfn.dealloc_pending &&
>>>> +                 list_empty(&pf_pdev->physfn.vf_list) )
>>>> +                xfree(pf_pdev);
>>>> +        }
>>>> +        xfree(pdev);
>>>> +    }
>>>> +    else
>>>> +    {
>>>> +        if ( list_empty(&pdev->physfn.vf_list) )
>>>> +            xfree(pdev);
>>>> +        else
>>>> +            pdev->physfn.dealloc_pending = true;
>>>> +    }
>>>
>>> Could I talk you into un-indenting the last if/else by a level, by making
>>> the earlier else and "else if()"?
>>>
>>> One aspect I didn't properly consider when making the suggestion: What if,
>>> without all VFs having gone away, the PF is re-added? In that case we
>>> would better recycle the existing structure. That's getting a little
>>> complicated, so maybe a better approach is to refuse the request (in
>>> pci_remove_device()) when the list isn't empty?
>>
>> Hm. Right. The growing complexity of maintaining these PF<->VF links
>> (introduced on a hunch that they may be useful in the future) is making
>> me favor the previous approach from v2 of simply copying the vf_len
>> array. So unless there are any objections I'll go back to that approach
>> for v4.
> 
> I continue to consider the earlier approach at least undesirable. The
> vf_len[] array shouldn't be copied around, risking for it to be / go
> stale. There should be one central place for every bit of information,
> unless there are very good reasons to duplicate something.

OK, I'll continue to refine the PF<->VF link approach for v4.
Stewart Hildebrand Aug. 25, 2024, 6:03 p.m. UTC | #5
On 8/13/24 10:01, Jan Beulich wrote:
> On 12.08.2024 22:39, Stewart Hildebrand wrote:
>> In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
>> structure") a lock moved from allocate_and_map_msi_pirq() to the caller
>> and changed from pcidevs_lock() to read_lock(&d->pci_lock). However, one
>> call path wasn't updated to reflect the change, leading to a failed
>> assertion observed under the following conditions:
>>
>> * PV dom0
>> * Debug build (debug=y) of Xen
>> * There is an SR-IOV device in the system with one or more VFs enabled
>> * Dom0 has loaded the driver for the VF and enabled MSI-X
>>
>> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
>> (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
>> ...
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
>> (XEN)    [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
>> (XEN)    [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
>> (XEN)    [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
>> (XEN)    [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
>> (XEN)    [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
>> (XEN)    [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
>> (XEN)    [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
>> (XEN)    [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
>> (XEN)    [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
>> (XEN)    [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
>>
>> In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
>> associated PF to access the vf_rlen array. This array is initialized in
>> pci_add_device() and is only populated in the associated PF's struct
>> pci_dev.
>>
>> Add a link from the VF's struct pci_dev to the associated PF struct
>> pci_dev, ensuring the PF's struct doesn't get deallocated until all its
>> VFs have gone away. Access the vf_rlen array via the new link to the PF,
>> and remove the troublesome call to pci_get_pdev().
>>
>> Add a call to pci_get_pdev() inside the pcidevs_lock()-locked section of
>> pci_add_device() to set up the link from VF to PF. In case the new
>> pci_add_device() invocation fails to find the associated PF (returning
>> NULL), we are no worse off than before: read_pci_mem_bar() will still
>> return 0 in that case.
>>
>> Note that currently the only way for Xen to know if a device is a VF is
>> if the toolstack tells Xen about it. Using PHYSDEVOP_manage_pci_add for
>> a VF is not a case that Xen handles.
> 
> How does the toolstack come into play here? It's still the Dom0 kernel to
> tell Xen, via PHYSDEVOP_pci_device_add (preferred) or
> PHYSDEVOP_manage_pci_add_ext (kind of deprecated; PHYSDEVOP_manage_pci_add
> is even more kind of deprecated).

This last bit of the commit description isn't adding much, I'll remove
it.

>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -662,34 +662,32 @@ static int msi_capability_init(struct pci_dev *dev,
>>      return 0;
>>  }
>>  
>> -static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
>> +static u64 read_pci_mem_bar(struct pf_info *pf_info, u16 seg, u8 bus, u8 slot,
>> +                            u8 func, u8 bir, int vf)
>>  {
>> +    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
>>      u8 limit;
>>      u32 addr, base = PCI_BASE_ADDRESS_0;
>>      u64 disp = 0;
>>  
>>      if ( vf >= 0 )
>>      {
>> -        struct pci_dev *pdev = pci_get_pdev(NULL,
>> -                                            PCI_SBDF(seg, bus, slot, func));
>>          unsigned int pos;
>>          uint16_t ctrl, num_vf, offset, stride;
>>  
>> -        if ( !pdev )
>> -            return 0;
>> -
>> -        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
>> -        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
>> -        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
>> -        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> -        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
>> +        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
>> +        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
>> +        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
>> +        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
>> +        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
>>  
>>          if ( !pos ||
>>               !(ctrl & PCI_SRIOV_CTRL_VFE) ||
>>               !(ctrl & PCI_SRIOV_CTRL_MSE) ||
>>               !num_vf || !offset || (num_vf > 1 && !stride) ||
>>               bir >= PCI_SRIOV_NUM_BARS ||
>> -             !pdev->vf_rlen[bir] )
>> +             !pf_info ||
> 
> See further down - I think it would be nice if we didn't need this new
> check.

OK

>> @@ -813,6 +811,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>          int vf;
>>          paddr_t pba_paddr;
>>          unsigned int pba_offset;
>> +        struct pf_info *pf_info = NULL;
> 
> Pointer-to-const?

OK

>> @@ -827,9 +826,12 @@ static int msix_capability_init(struct pci_dev *dev,
>>              pslot = PCI_SLOT(dev->info.physfn.devfn);
>>              pfunc = PCI_FUNC(dev->info.physfn.devfn);
>>              vf = dev->sbdf.bdf;
>> +            if ( dev->virtfn.pf_pdev )
>> +                pf_info = &dev->virtfn.pf_pdev->physfn;
>>          }
>>  
>> -        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
>> +        table_paddr = read_pci_mem_bar(pf_info, seg, pbus, pslot, pfunc, bir,
>> +                                       vf);
> 
> I don't think putting the new arg first is very nice. SBDF should remain
> first imo. Between the latter arguments I'm not as fussed.

OK

>> @@ -446,7 +448,27 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
>>  
>>      list_del(&pdev->alldevs_list);
>>      pdev_msi_deinit(pdev);
>> -    xfree(pdev);
>> +
>> +    if ( pdev->info.is_virtfn )
>> +    {
>> +        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
>> +
>> +        if ( pf_pdev )
>> +        {
>> +            list_del(&pdev->virtfn.next);
>> +            if ( pf_pdev->physfn.dealloc_pending &&
>> +                 list_empty(&pf_pdev->physfn.vf_list) )
>> +                xfree(pf_pdev);
>> +        }
>> +        xfree(pdev);
>> +    }
>> +    else
>> +    {
>> +        if ( list_empty(&pdev->physfn.vf_list) )
>> +            xfree(pdev);
>> +        else
>> +            pdev->physfn.dealloc_pending = true;
>> +    }
> 
> Could I talk you into un-indenting the last if/else by a level, by making
> the earlier else and "else if()"?

Yep. Actually, there will be no need for dealloc_pending after reworking
pci_remove_device() (see below).

> One aspect I didn't properly consider when making the suggestion: What if,
> without all VFs having gone away, the PF is re-added? In that case we
> would better recycle the existing structure. That's getting a little
> complicated, so maybe a better approach is to refuse the request (in
> pci_remove_device()) when the list isn't empty?

I set up a test case locally to remove a PF without removing the
associated VFs by hacking an SR-IOV PF driver. Although the PF driver
*should* remove the VFs first, it's completely up to the particular PF
driver how VFs/PFs are removed during hot-un-plug, in what order, or
whether at all to remove the VFs before removing the PF. Anyway, during
PF-only removal, at least the Linux PCI subsystem warns about it:

[  106.500334] igb 0000:01:00.0: driver left SR-IOV enabled after remove

Returning an error code in pci_remove_device() results in only a warning
from Linux:

[  106.507011] pci 0000:01:00.0: Failed to delete - passthrough or MSI/MSI-X might fail!

Despite the warning, Linux still proceeds to remove the PF, and we would
retain a stale PF in Xen. Re-adding (hotplugging) the just-removed PF
led to Xen crashing in another weird way.

To handle this more gracefully, I suggest removing the VFs right away
along with the PF in pci_remove_device() when a PF removal request comes
along. This would satisfy the test case described here without Xen
crashing.

>> @@ -700,10 +722,22 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>>           * extended function.
>>           */
>>          if ( pdev->info.is_virtfn )
>> +        {
>> +            struct pci_dev *pf_pdev;
>> +
>>              pdev->info.is_extfn = pf_is_extfn;
>> +            pf_pdev = pci_get_pdev(NULL,
>> +                                   PCI_SBDF(seg, info->physfn.bus,
>> +                                            info->physfn.devfn));
>> +            if ( pf_pdev )
>> +            {
>> +                pdev->virtfn.pf_pdev = pf_pdev;
>> +                list_add(&pdev->virtfn.next, &pf_pdev->physfn.vf_list);
>> +            }
>> +        }
> 
> Hmm, yes, we can't really use the result of the earlier pci_get_pdev(), as
> we drop the lock intermediately. I wonder though if we wouldn't better fail
> the function if we can't find the PF instance.

Yes

>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -150,7 +150,17 @@ struct pci_dev {
>>          unsigned int count;
>>  #define PT_FAULT_THRESHOLD 10
>>      } fault;
>> -    u64 vf_rlen[6];
>> +    struct pf_info {
>> +        /* Only populated for PFs (info.is_virtfn == false) */
>> +        uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
>> +        struct list_head vf_list;             /* List head */
>> +        bool dealloc_pending;
>> +    } physfn;
>> +    struct {
>> +        /* Only populated for VFs (info.is_virtfn == true) */
>> +        struct pci_dev *pf_pdev;              /* Link from VF to PF */
>> +        struct list_head next;                /* Entry in vf_list */
> 
> For doubly-linked lists "next" isn't really a good name. Since both struct
> variants need such a member, why not use vf_list uniformly?

I'd like to retain some distinction between what's a list head and
what's an entry in the list. I suggest the name "entry".

> That'll then
> also lower the significance of my other question:
> 
>> +    } virtfn;
> 
> Should the two new struct-s perhaps be a union?

Yes
Jan Beulich Aug. 26, 2024, 10:20 a.m. UTC | #6
On 25.08.2024 20:03, Stewart Hildebrand wrote:
> On 8/13/24 10:01, Jan Beulich wrote:
>> One aspect I didn't properly consider when making the suggestion: What if,
>> without all VFs having gone away, the PF is re-added? In that case we
>> would better recycle the existing structure. That's getting a little
>> complicated, so maybe a better approach is to refuse the request (in
>> pci_remove_device()) when the list isn't empty?
> 
> I set up a test case locally to remove a PF without removing the
> associated VFs by hacking an SR-IOV PF driver. Although the PF driver
> *should* remove the VFs first, it's completely up to the particular PF
> driver how VFs/PFs are removed during hot-un-plug, in what order, or
> whether at all to remove the VFs before removing the PF. Anyway, during
> PF-only removal, at least the Linux PCI subsystem warns about it:
> 
> [  106.500334] igb 0000:01:00.0: driver left SR-IOV enabled after remove
> 
> Returning an error code in pci_remove_device() results in only a warning
> from Linux:
> 
> [  106.507011] pci 0000:01:00.0: Failed to delete - passthrough or MSI/MSI-X might fail!
> 
> Despite the warning, Linux still proceeds to remove the PF, and we would
> retain a stale PF in Xen. Re-adding (hotplugging) the just-removed PF
> led to Xen crashing in another weird way.
> 
> To handle this more gracefully, I suggest removing the VFs right away
> along with the PF in pci_remove_device() when a PF removal request comes
> along. This would satisfy the test case described here without Xen
> crashing.

Hmm. That's an option, yet would introduce an asymmetry: The PF can be
added late (after VFs), so it would only seem consistent to allow it to
be removed early (keeping the VFs). Suitably justified / commented it
may nevertheless be the route to take, for (hopefully) reducing possible
complications.

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 0c97fbb3fc03..1e3a0c450224 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -662,34 +662,32 @@  static int msi_capability_init(struct pci_dev *dev,
     return 0;
 }
 
-static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
+static u64 read_pci_mem_bar(struct pf_info *pf_info, u16 seg, u8 bus, u8 slot,
+                            u8 func, u8 bir, int vf)
 {
+    pci_sbdf_t sbdf = PCI_SBDF(seg, bus, slot, func);
     u8 limit;
     u32 addr, base = PCI_BASE_ADDRESS_0;
     u64 disp = 0;
 
     if ( vf >= 0 )
     {
-        struct pci_dev *pdev = pci_get_pdev(NULL,
-                                            PCI_SBDF(seg, bus, slot, func));
         unsigned int pos;
         uint16_t ctrl, num_vf, offset, stride;
 
-        if ( !pdev )
-            return 0;
-
-        pos = pci_find_ext_capability(pdev->sbdf, PCI_EXT_CAP_ID_SRIOV);
-        ctrl = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_CTRL);
-        num_vf = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_NUM_VF);
-        offset = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_OFFSET);
-        stride = pci_conf_read16(pdev->sbdf, pos + PCI_SRIOV_VF_STRIDE);
+        pos = pci_find_ext_capability(sbdf, PCI_EXT_CAP_ID_SRIOV);
+        ctrl = pci_conf_read16(sbdf, pos + PCI_SRIOV_CTRL);
+        num_vf = pci_conf_read16(sbdf, pos + PCI_SRIOV_NUM_VF);
+        offset = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_OFFSET);
+        stride = pci_conf_read16(sbdf, pos + PCI_SRIOV_VF_STRIDE);
 
         if ( !pos ||
              !(ctrl & PCI_SRIOV_CTRL_VFE) ||
              !(ctrl & PCI_SRIOV_CTRL_MSE) ||
              !num_vf || !offset || (num_vf > 1 && !stride) ||
              bir >= PCI_SRIOV_NUM_BARS ||
-             !pdev->vf_rlen[bir] )
+             !pf_info ||
+             !pf_info->vf_rlen[bir] )
             return 0;
         base = pos + PCI_SRIOV_BAR;
         vf -= PCI_BDF(bus, slot, func) + offset;
@@ -703,8 +701,8 @@  static u64 read_pci_mem_bar(u16 seg, u8 bus, u8 slot, u8 func, u8 bir, int vf)
         }
         if ( vf >= num_vf )
             return 0;
-        BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
-        disp = vf * pdev->vf_rlen[bir];
+        BUILD_BUG_ON(ARRAY_SIZE(pf_info->vf_rlen) != PCI_SRIOV_NUM_BARS);
+        disp = vf * pf_info->vf_rlen[bir];
         limit = PCI_SRIOV_NUM_BARS;
     }
     else switch ( pci_conf_read8(PCI_SBDF(seg, bus, slot, func),
@@ -813,6 +811,7 @@  static int msix_capability_init(struct pci_dev *dev,
         int vf;
         paddr_t pba_paddr;
         unsigned int pba_offset;
+        struct pf_info *pf_info = NULL;
 
         if ( !dev->info.is_virtfn )
         {
@@ -827,9 +826,12 @@  static int msix_capability_init(struct pci_dev *dev,
             pslot = PCI_SLOT(dev->info.physfn.devfn);
             pfunc = PCI_FUNC(dev->info.physfn.devfn);
             vf = dev->sbdf.bdf;
+            if ( dev->virtfn.pf_pdev )
+                pf_info = &dev->virtfn.pf_pdev->physfn;
         }
 
-        table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
+        table_paddr = read_pci_mem_bar(pf_info, seg, pbus, pslot, pfunc, bir,
+                                       vf);
         WARN_ON(msi && msi->table_base != table_paddr);
         if ( !table_paddr )
         {
@@ -852,7 +854,7 @@  static int msix_capability_init(struct pci_dev *dev,
 
         pba_offset = pci_conf_read32(dev->sbdf, msix_pba_offset_reg(pos));
         bir = (u8)(pba_offset & PCI_MSIX_BIRMASK);
-        pba_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf);
+        pba_paddr = read_pci_mem_bar(pf_info, seg, pbus, pslot, pfunc, bir, vf);
         WARN_ON(!pba_paddr);
         pba_paddr += pba_offset & ~PCI_MSIX_BIRMASK;
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 5a446d3dcee0..1294836bca44 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -341,6 +341,8 @@  static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
 
     list_add(&pdev->alldevs_list, &pseg->alldevs_list);
 
+    INIT_LIST_HEAD(&pdev->physfn.vf_list);
+
     /* update bus2bridge */
     switch ( pdev->type = pdev_type(pseg->nr, bus, devfn) )
     {
@@ -446,7 +448,27 @@  static void free_pdev(struct pci_seg *pseg, struct pci_dev *pdev)
 
     list_del(&pdev->alldevs_list);
     pdev_msi_deinit(pdev);
-    xfree(pdev);
+
+    if ( pdev->info.is_virtfn )
+    {
+        struct pci_dev *pf_pdev = pdev->virtfn.pf_pdev;
+
+        if ( pf_pdev )
+        {
+            list_del(&pdev->virtfn.next);
+            if ( pf_pdev->physfn.dealloc_pending &&
+                 list_empty(&pf_pdev->physfn.vf_list) )
+                xfree(pf_pdev);
+        }
+        xfree(pdev);
+    }
+    else
+    {
+        if ( list_empty(&pdev->physfn.vf_list) )
+            xfree(pdev);
+        else
+            pdev->physfn.dealloc_pending = true;
+    }
 }
 
 static void __init _pci_hide_device(struct pci_dev *pdev)
@@ -700,10 +722,22 @@  int pci_add_device(u16 seg, u8 bus, u8 devfn,
          * extended function.
          */
         if ( pdev->info.is_virtfn )
+        {
+            struct pci_dev *pf_pdev;
+
             pdev->info.is_extfn = pf_is_extfn;
+            pf_pdev = pci_get_pdev(NULL,
+                                   PCI_SBDF(seg, info->physfn.bus,
+                                            info->physfn.devfn));
+            if ( pf_pdev )
+            {
+                pdev->virtfn.pf_pdev = pf_pdev;
+                list_add(&pdev->virtfn.next, &pf_pdev->physfn.vf_list);
+            }
+        }
     }
 
-    if ( !pdev->info.is_virtfn && !pdev->vf_rlen[0] )
+    if ( !pdev->info.is_virtfn && !pdev->physfn.vf_rlen[0] )
     {
         unsigned int pos = pci_find_ext_capability(pdev->sbdf,
                                                    PCI_EXT_CAP_ID_SRIOV);
@@ -715,7 +749,9 @@  int pci_add_device(u16 seg, u8 bus, u8 devfn,
         {
             unsigned int i;
 
-            BUILD_BUG_ON(ARRAY_SIZE(pdev->vf_rlen) != PCI_SRIOV_NUM_BARS);
+            BUILD_BUG_ON(ARRAY_SIZE(pdev->physfn.vf_rlen) !=
+                                    PCI_SRIOV_NUM_BARS);
+
             for ( i = 0; i < PCI_SRIOV_NUM_BARS; )
             {
                 unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
@@ -730,7 +766,7 @@  int pci_add_device(u16 seg, u8 bus, u8 devfn,
                     continue;
                 }
                 ret = pci_size_mem_bar(pdev->sbdf, idx, NULL,
-                                       &pdev->vf_rlen[i],
+                                       &pdev->physfn.vf_rlen[i],
                                        PCI_BAR_VF |
                                        ((i == PCI_SRIOV_NUM_BARS - 1) ?
                                         PCI_BAR_LAST : 0));
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 63e49f0117e9..5bb6fef3934b 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -150,7 +150,17 @@  struct pci_dev {
         unsigned int count;
 #define PT_FAULT_THRESHOLD 10
     } fault;
-    u64 vf_rlen[6];
+    struct pf_info {
+        /* Only populated for PFs (info.is_virtfn == false) */
+        uint64_t vf_rlen[PCI_SRIOV_NUM_BARS];
+        struct list_head vf_list;             /* List head */
+        bool dealloc_pending;
+    } physfn;
+    struct {
+        /* Only populated for VFs (info.is_virtfn == true) */
+        struct pci_dev *pf_pdev;              /* Link from VF to PF */
+        struct list_head next;                /* Entry in vf_list */
+    } virtfn;
 
     /* Data for vPCI. */
     struct vpci *vpci;