[v7,01/12] vpci: introduce per-domain lock to protect vpci structure

Message ID 20230613103159.524763-2-volodymyr_babchuk@epam.com (mailing list archive)
State Superseded
Series PCI devices passthrough on Arm, part 3

Commit Message

Volodymyr Babchuk June 13, 2023, 10:32 a.m. UTC
From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Introduce a per-domain read/write lock used to check whether vpci is present,
so we can be sure there are no accesses to the contents of the vpci struct
if it is not. This lock can be used (and in a few cases is used right away)
so that vpci removal can be performed while holding the lock in write
mode. Previously such removal could race with vpci_read, for example.

1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
from being removed.

2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if done
under the read lock, requires vpci->lock to be acquired on both devices
being compared, which may produce a deadlock. It is not possible to
upgrade a read lock to a write lock in such a case. So, in order to prevent
the deadlock, check which registers are going to be written and acquire
the lock in the appropriate mode from the beginning (a sketch of this
follows the list below).

All other code, which doesn't lead to pdev->vpci destruction and does not
access multiple pdevs at the same time, can still use a combination of the
read lock and pdev->vpci->lock.

3. Optimize the detection of whether a ROM BAR write requires the write
lock by caching the offset of the ROM BAR register in
vpci->header->rom_reg, which depends on the header's type.

4. Reduce locked region in vpci_remove_device as it is now possible
to set pdev->vpci to NULL early right after the write lock is acquired.

5. Reduce locked region in vpci_add_handlers as it is possible to
initialize many more fields of the struct vpci before assigning it to
pdev->vpci.

6. vpci_{add|remove}_register are required to be called with the write lock
held, but it is not feasible to add an assert there as it requires
struct domain to be passed for that. So, add a comment about this requirement
to these and other functions with the equivalent constraints.

7. Drop const qualifier where the new rwlock is used and this is appropriate.

8. Do not call process_pending_softirqs with any locks held. For that,
unlock prior to the call and re-acquire the locks afterwards. After
re-acquiring the lock there is no need to check whether pdev->vpci exists:
 - in apply_map because of the context it is called (no race condition
   possible)
 - for MSI/MSI-X debug code because it is called at the end of
   pdev->vpci access and no further access to pdev->vpci is made

9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
and, if so, allow reading or writing the hardware register directly. This is
acceptable as we only deal with Dom0 as of now. Once DomU support is
added, writes will need to be ignored and reads will need to return all 0's
for the guests, while Dom0 can still access the registers directly.

10. Introduce pcidevs_trylock, so callers have a way to attempt taking
the pcidevs lock without blocking.

11. Take the pcidevs lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.

12. This is based on the discussion at [1].

[1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/
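
To make items 2, 3 and 9 concrete, here is a rough sketch of the intended
entry logic in vpci_write(). This is a simplified illustration rather than a
verbatim hunk from the patch; d is the current domain, vpci_write_hw() is the
existing pass-through helper, and the pdev lookup and error paths are omitted:

    /*
     * Advisory pre-check only, used to pick the lock mode; pdev->vpci is
     * re-checked below once the lock is actually held.
     */
    bool need_write = pdev->vpci &&
                      vpci_header_need_write_lock(pdev, reg, size);

    if ( need_write )
        write_lock(&d->vpci_rwlock);   /* may end up in modify_bars() */
    else
        read_lock(&d->vpci_rwlock);    /* combined with pdev->vpci->lock */

    if ( !pdev->vpci )
    {
        /* Item 9: no vPCI data (Dom0 only for now): access hw directly. */
        vpci_write_hw(sbdf, reg, size, data);
        if ( need_write )
            write_unlock(&d->vpci_rwlock);
        else
            read_unlock(&d->vpci_rwlock);
        return;
    }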

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
This was checked on x86: with and without PVH Dom0.
---
 xen/arch/x86/hvm/vmsi.c       |   7 +++
 xen/common/domain.c           |   3 +
 xen/drivers/passthrough/pci.c |   5 ++
 xen/drivers/vpci/header.c     |  52 +++++++++++++++++
 xen/drivers/vpci/msi.c        |  25 +++++++-
 xen/drivers/vpci/msix.c       |  51 +++++++++++++---
 xen/drivers/vpci/vpci.c       | 107 +++++++++++++++++++++++++---------
 xen/include/xen/pci.h         |   1 +
 xen/include/xen/sched.h       |   3 +
 xen/include/xen/vpci.h        |   6 ++
 10 files changed, 223 insertions(+), 37 deletions(-)

Comments

Roger Pau Monné June 16, 2023, 4:30 p.m. UTC | #1
On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Introduce a per-domain read/write lock to check whether vpci is present,
> so we are sure there are no accesses to the contents of the vpci struct
> if not. This lock can be used (and in a few cases is used right away)
> so that vpci removal can be performed while holding the lock in write
> mode. Previously such removal could race with vpci_read for example.
> 
> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if done
> under the read lock, requires vpci->lock to be acquired on both devices
> being compared, which may produce a deadlock. It is not possible to
> upgrade read lock to write lock in such a case. So, in order to prevent
> the deadlock, check which registers are going to be written and acquire
> the lock in the appropriate mode from the beginning.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does not
> access multiple pdevs at the same time, can still use a combination of the
> read lock and pdev->vpci->lock.
> 
> 3. Optimize if ROM BAR write lock required detection by caching offset
> of the ROM BAR register in vpci->header->rom_reg which depends on
> header's type.
> 
> 4. Reduce locked region in vpci_remove_device as it is now possible
> to set pdev->vpci to NULL early right after the write lock is acquired.
> 
> 5. Reduce locked region in vpci_add_handlers as it is possible to
> initialize many more fields of the struct vpci before assigning it to
> pdev->vpci.
> 
> 6. vpci_{add|remove}_register are required to be called with the write lock
> held, but it is not feasible to add an assert there as it requires
> struct domain to be passed for that. So, add a comment about this requirement
> to these and other functions with the equivalent constraints.
> 
> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
> 
> 8. Do not call process_pending_softirqs with any locks held. For that unlock
> prior the call and re-acquire the locks after. After re-acquiring the
> lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made
> 
> 9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
> and if so, allow reading or writing the hardware register directly. This is
> acceptable as we only deal with Dom0 as of now. Once DomU support is
> added the write will need to be ignored and read return all 0's for the
> guests, while Dom0 can still access the registers directly.
> 
> 10. Introduce pcidevs_trylock, so there is a possibility to try locking
> the pcidev's lock.
> 
> 11. Use pcidev's lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> 12. This is based on the discussion at [1].
> 
> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Thanks.

I haven't looked in full detail, but I'm afraid there's an ABBA
deadlock with the per-domain vpci lock and the pcidevs lock in
modify_bars() vs  vpci_add_handlers() and vpci_remove_device().

I've made some comments below.

> ---
> This was checked on x86: with and without PVH Dom0.
> ---
>  xen/arch/x86/hvm/vmsi.c       |   7 +++
>  xen/common/domain.c           |   3 +
>  xen/drivers/passthrough/pci.c |   5 ++
>  xen/drivers/vpci/header.c     |  52 +++++++++++++++++
>  xen/drivers/vpci/msi.c        |  25 +++++++-
>  xen/drivers/vpci/msix.c       |  51 +++++++++++++---
>  xen/drivers/vpci/vpci.c       | 107 +++++++++++++++++++++++++---------
>  xen/include/xen/pci.h         |   1 +
>  xen/include/xen/sched.h       |   3 +
>  xen/include/xen/vpci.h        |   6 ++
>  10 files changed, 223 insertions(+), 37 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 3cd4923060..f188450b1b 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -895,6 +895,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>  
> +    ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
> +    ASSERT(pcidevs_locked());
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +916,11 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            pcidevs_unlock();
> +            read_unlock(&pdev->domain->vpci_rwlock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->vpci_rwlock);
> +            pcidevs_lock();

Why do you need the pcidevs_lock here?  Isn't it enough to have the
per-domain rwlock taken in read mode in order to prevent devices from
being de-assigned from the domain?

>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 6a440590fe..a4640acb62 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -622,6 +622,9 @@ struct domain *domain_create(domid_t domid,
>  
>  #ifdef CONFIG_HAS_PCI
>      INIT_LIST_HEAD(&d->pdev_list);
> +#ifdef CONFIG_HAS_VPCI
> +    rwlock_init(&d->vpci_rwlock);
> +#endif
>  #endif
>  
>      /* All error paths can depend on the above setup. */
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 07d1986d33..0afcb59af0 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>      spin_lock_recursive(&_pcidevs_lock);
>  }
>  
> +int pcidevs_trylock(void)
> +{
> +    return spin_trylock_recursive(&_pcidevs_lock);
> +}
> +
>  void pcidevs_unlock(void)
>  {
>      spin_unlock_recursive(&_pcidevs_lock);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index ec2e978a4e..23b5223adf 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -152,12 +152,14 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>  
> +        read_lock(&v->domain->vpci_rwlock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
>                          rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
>                          !rc && v->vpci.rom_only);
>          spin_unlock(&v->vpci.pdev->vpci->lock);
> +        read_unlock(&v->domain->vpci_rwlock);

I think you likely want to expand the scope of the domain vpci lock in
order to also protect the data in v->vpci?  So that vPCI device
removal can clear this data if required without racing with
vpci_process_pending().  Otherwise you need to pause the domain when
removing vPCI.
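
A sketch of that suggestion (assumed, not part of this patch) would hold the
read lock across every use of v->vpci in vpci_process_pending(), so that
removal, which takes the lock in write mode, cannot race with it (the later
v->vpci cleanup would be covered similarly):

    read_lock(&v->domain->vpci_rwlock);
    rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
    if ( rc != -ERESTART )
    {
        spin_lock(&v->vpci.pdev->vpci->lock);
        /* Disable memory decoding unconditionally on failure. */
        modify_decoding(v->vpci.pdev,
                        rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
                        !rc && v->vpci.rom_only);
        spin_unlock(&v->vpci.pdev->vpci->lock);
    }
    read_unlock(&v->domain->vpci_rwlock);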

>  
>          rangeset_destroy(v->vpci.mem);
>          v->vpci.mem = NULL;
> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&d->vpci_rwlock));
> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        write_unlock(&d->vpci_rwlock);
>          process_pending_softirqs();
> +        write_lock(&d->vpci_rwlock);
> +    }
> +
>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>      raise_softirq(SCHEDULE_SOFTIRQ);
>  }
>  
> +/* This must hold domain's vpci_rwlock in write mode. */

Why it must be in write mode?

Is this to avoid ABBA deadlocks as not taking the per-domain lock in
write mode would then force us to take each pdev->vpci->lock in order
to prevent concurrent modifications of remote devices?

>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>  {
>      struct vpci_header *header = &pdev->vpci->header;
> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>       * Check for overlaps with other BARs. Note that only BARs that are
>       * currently mapped (enabled) are checked for overlaps.
>       */
> +    pcidevs_lock();

This can be problematic I'm afraid, as it's a guest controlled path
that allows applying unrestricted contention on the pcidevs lock.  I'm
unsure whether multiple guests could be able to trigger the watchdog
if given enough devices/vcpus to be able to perform enough
simultaneous calls to modify_bars().

Maybe you could expand the per-domain vpci lock to also protect
modifications of domain->pdev_list?

IOW: kind of a per-domain pdev_lock.
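
A rough illustration of that idea (the field name here is made up): the
overlap scan in modify_bars() would then iterate the domain's device list
under a per-domain lock instead of the global pcidevs lock:

    read_lock(&pdev->domain->pdev_lock);   /* hypothetical per-domain lock */
    for_each_pdev ( pdev->domain, tmp )
    {
        /* ... overlap checks against the enabled BARs of tmp ... */
    }
    read_unlock(&pdev->domain->pdev_lock);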

>      for_each_pdev ( pdev->domain, tmp )
>      {
>          if ( tmp == pdev )
> @@ -326,10 +342,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>                  printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>                         start, end, rc);
>                  rangeset_destroy(mem);
> +                pcidevs_unlock();
>                  return rc;
>              }
>          }
>      }
> +    pcidevs_unlock();
>  
>      ASSERT(dev);
>  
> @@ -482,6 +500,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_bar *bars = header->bars;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> @@ -570,6 +590,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>          }
>      }
>  
> +    ASSERT(!header->rom_reg);
> +
>      /* Check expansion ROM. */
>      rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
>      if ( rc > 0 && size )
> @@ -586,12 +608,42 @@ static int cf_check init_bars(struct pci_dev *pdev)
>                                 4, rom);
>          if ( rc )
>              rom->type = VPCI_BAR_EMPTY;
> +
> +        header->rom_reg = rom_reg;
>      }
>  
>      return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
>  }
>  REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>  
> +static bool overlap(unsigned int r1_offset, unsigned int r1_size,
> +                    unsigned int r2_offset, unsigned int r2_size)
> +{
> +    /* Return true if there is an overlap. */
> +    return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;

Hm, we already have a vpci_register_cmp(), which does a very similar
comparison.  I wonder if it would be possible to somehow use that
here.
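
For reference, the predicate is a plain interval-intersection test. A tiny
self-contained check (illustrative only; PCI_COMMAND lives at config space
offset 0x04 and is 2 bytes wide):

    #include <assert.h>

    static _Bool overlap(unsigned int r1_offset, unsigned int r1_size,
                         unsigned int r2_offset, unsigned int r2_size)
    {
        return r1_offset < r2_offset + r2_size &&
               r2_offset < r1_offset + r1_size;
    }

    int main(void)
    {
        assert(overlap(0x04, 2, 0x04, 2));   /* same register         */
        assert(overlap(0x03, 2, 0x04, 2));   /* one byte in common    */
        assert(!overlap(0x02, 2, 0x04, 2));  /* adjacent but disjoint */
        return 0;
    }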

> +}
> +
> +bool vpci_header_need_write_lock(const struct pci_dev *pdev,
> +                                 unsigned int start, unsigned int size)
> +{
> +    /*
> +     * Writing the command register and ROM BAR register may trigger
> +     * modify_bars to run, which in turn may access multiple pdevs while
> +     * checking for the existing BAR's overlap. The overlapping check, if done
> +     * under the read lock, requires vpci->lock to be acquired on both devices
> +     * being compared, which may produce a deadlock. At the same time it is not
> +     * possible to upgrade read lock to write lock in such a case.
> +     * Check which registers are going to be written and return true if lock
> +     * needs to be acquired in write mode.
> +     */
> +    if ( overlap(start, size, PCI_COMMAND, 2) ||
> +         (pdev->vpci->header.rom_reg &&
> +          overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
> +        return true;
> +
> +    return false;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..e7d1f916a0 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>      uint16_t control;
>      int ret;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
> +
>      if ( !pos )
>          return 0;
>  
> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -277,6 +279,15 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !read_trylock(&d->vpci_rwlock) )
> +            continue;
> +
> +        if ( !pcidevs_trylock() )
> +        {
> +            read_unlock(&d->vpci_rwlock);
> +            continue;
> +        }
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -318,14 +329,22 @@ void vpci_dump_msi(void)
>                       * holding the lock.
>                       */
>                      printk("unable to print all MSI-X entries: %d\n", rc);
> -                    process_pending_softirqs();
> -                    continue;
> +                    goto pdev_done;
>                  }
>              }
>  
>              spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> +            pcidevs_unlock();
> +            read_unlock(&d->vpci_rwlock);
> +
>              process_pending_softirqs();
> +
> +            read_lock(&d->vpci_rwlock);
> +            pcidevs_lock();
>          }
> +        pcidevs_unlock();
> +        read_unlock(&d->vpci_rwlock);
>      }
>      rcu_read_unlock(&domlist_read_lock);
>  }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index 25bde77586..b5a7dfdf9c 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -143,6 +143,7 @@ static void cf_check control_write(
>          pci_conf_write16(pdev->sbdf, reg, val);
>  }
>  
> +/* This must hold domain's vpci_rwlock in write mode. */
>  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
> @@ -163,7 +164,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->vpci_rwlock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->vpci_rwlock);
> +
> +    return rc;
>  }
>  
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +365,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
>      *data = ~0ul;
>  
> +    read_lock(&d->vpci_rwlock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->vpci_rwlock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);
> +        read_unlock(&d->vpci_rwlock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->vpci_rwlock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +424,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->vpci_rwlock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +512,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> +    read_lock(&d->vpci_rwlock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->vpci_rwlock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +        read_unlock(&d->vpci_rwlock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->vpci_rwlock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +613,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->vpci_rwlock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -665,6 +700,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>      struct vpci_msix *msix;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
> +
>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>                                        PCI_CAP_ID_MSIX);
>      if ( !msix_offset )
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 652807a4a4..1270174e78 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
> +    struct vpci *vpci;
> +
> +    if ( !has_vpci(pdev->domain) )
>          return;
>  
> -    spin_lock(&pdev->vpci->lock);
> +    write_lock(&pdev->domain->vpci_rwlock);
> +    if ( !pdev->vpci )
> +    {
> +        write_unlock(&pdev->domain->vpci_rwlock);
> +        return;
> +    }
> +
> +    vpci = pdev->vpci;
> +    pdev->vpci = NULL;
> +    write_unlock(&pdev->domain->vpci_rwlock);
> +
>      while ( !list_empty(&pdev->vpci->handlers) )
>      {
> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>                                                     struct vpci_register,
>                                                     node);
>  
>          list_del(&r->node);
>          xfree(r);
>      }
> -    spin_unlock(&pdev->vpci->lock);
> +
>      if ( pdev->vpci->msix )
>      {
>          unsigned int i;
> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>              if ( pdev->vpci->msix->table[i] )
>                  iounmap(pdev->vpci->msix->table[i]);
>      }
> -    xfree(pdev->vpci->msix);
> -    xfree(pdev->vpci->msi);
> -    xfree(pdev->vpci);
> -    pdev->vpci = NULL;
> +    xfree(vpci->msix);
> +    xfree(vpci->msi);
> +    xfree(vpci);
>  }
>  
>  int vpci_add_handlers(struct pci_dev *pdev)
>  {
> +    struct vpci *vpci;
>      unsigned int i;
>      int rc = 0;
>  
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>  
> +    vpci = xzalloc(struct vpci);
> +    if ( !vpci )
> +        return -ENOMEM;
> +
> +    INIT_LIST_HEAD(&vpci->handlers);
> +    spin_lock_init(&vpci->lock);
> +
> +    write_lock(&pdev->domain->vpci_rwlock);

I think the usage of the vpci per-domain lock here and in
vpci_remove_device() create an ABBA deadlock situation with the usage
of it in modify_bars().

Both vpci_add_handlers() and vpci_remove_device() get called with the
pcidevs lock taken, and take the per-domain vpci lock afterwards,
while modify_bars() does the inverse locking, gets called with the
per-domain vpci lock taken and then take the pcidevs lock inside the
function.
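
Spelled out, the two inverse orderings look like this (illustration only):

    /*
     * CPU A: vpci_add_handlers() / vpci_remove_device() callers:
     *     pcidevs_lock();
     *     write_lock(&d->vpci_rwlock);
     *
     * CPU B: vpci_write() -> modify_bars():
     *     write_lock(&d->vpci_rwlock);
     *     pcidevs_lock();
     *
     * If A holds the pcidevs lock while B holds d->vpci_rwlock, each then
     * blocks forever on the lock the other already owns: an ABBA deadlock.
     */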

> +
>      /* We should not get here twice for the same device. */
>      ASSERT(!pdev->vpci);
>  
> -    pdev->vpci = xzalloc(struct vpci);
> -    if ( !pdev->vpci )
> -        return -ENOMEM;
> -
> -    INIT_LIST_HEAD(&pdev->vpci->handlers);
> -    spin_lock_init(&pdev->vpci->lock);
> +    pdev->vpci = vpci;
>  
>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>      {
> @@ -91,6 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
>          if ( rc )
>              break;
>      }
> +    write_unlock(&pdev->domain->vpci_rwlock);
>  
>      if ( rc )
>          vpci_remove_device(pdev);
> @@ -139,6 +156,7 @@ uint32_t cf_check vpci_hw_read32(
>      return pci_conf_read32(pdev->sbdf, reg);
>  }
>  
> +/* This must hold domain's vpci_rwlock in write mode. */
>  int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>                        vpci_write_t *write_handler, unsigned int offset,
>                        unsigned int size, void *data)
> @@ -162,8 +180,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>      r->offset = offset;
>      r->private = data;
>  
> -    spin_lock(&vpci->lock);

If you remove the lock here we need to assert that the per-domain vpci
lock is taken in write mode.

> -
>      /* The list of handlers must be kept sorted at all times. */
>      list_for_each ( prev, &vpci->handlers )
>      {
> @@ -175,25 +191,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>              break;
>          if ( cmp == 0 )
>          {
> -            spin_unlock(&vpci->lock);
>              xfree(r);
>              return -EEXIST;
>          }
>      }
>  
>      list_add_tail(&r->node, prev);
> -    spin_unlock(&vpci->lock);
>  
>      return 0;
>  }
>  
> +/* This must hold domain's vpci_rwlock in write mode. */
>  int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>                           unsigned int size)
>  {
>      const struct vpci_register r = { .offset = offset, .size = size };
>      struct vpci_register *rm;
>  
> -    spin_lock(&vpci->lock);
>      list_for_each_entry ( rm, &vpci->handlers, node )
>      {
>          int cmp = vpci_register_cmp(&r, rm);
> @@ -205,14 +219,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>          if ( !cmp && rm->offset == offset && rm->size == size )
>          {
>              list_del(&rm->node);
> -            spin_unlock(&vpci->lock);
>              xfree(rm);
>              return 0;
>          }
>          if ( cmp <= 0 )
>              break;
>      }
> -    spin_unlock(&vpci->lock);

Same here about the per-domain lock being taken.

>  
>      return -ENOENT;
>  }
> @@ -320,7 +332,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -333,10 +345,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>      }
>  
>      /* Find the PCI dev matching the address. */
> +    pcidevs_lock();
>      pdev = pci_get_pdev(d, sbdf);
> -    if ( !pdev || !pdev->vpci )
> +    pcidevs_unlock();

I think it's worth looking into expanding the per-domain vpci-lock to
a pdevs lock or similar in order to protect the pdev_list also if
possible.  So that we can avoid taking the pcidevs lock here.
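
A sketch of what the lookup could then look like (assumed; vpci_read_hw() is
the existing pass-through helper):

    read_lock(&d->vpci_rwlock);            /* also keeps pdev_list stable */
    pdev = pci_get_pdev(d, sbdf);
    if ( !pdev || !pdev->vpci )
    {
        read_unlock(&d->vpci_rwlock);
        /* Pass the access through to the hardware (Dom0 only for now). */
        return vpci_read_hw(sbdf, reg, size);
    }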

Thanks, Roger.
Stewart Hildebrand June 20, 2023, 8:17 p.m. UTC | #2
On 6/13/23 06:32, Volodymyr Babchuk wrote:

...

> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 652807a4a4..1270174e78 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
> 
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
> +    struct vpci *vpci;
> +
> +    if ( !has_vpci(pdev->domain) )
>          return;
> 
> -    spin_lock(&pdev->vpci->lock);
> +    write_lock(&pdev->domain->vpci_rwlock);
> +    if ( !pdev->vpci )
> +    {
> +        write_unlock(&pdev->domain->vpci_rwlock);
> +        return;
> +    }
> +
> +    vpci = pdev->vpci;
> +    pdev->vpci = NULL;

Here we set pdev->vpci to NULL, yet there are still a few uses remaining after this in vpci_remove_device. I suspect they were missed during a s/pdev->vpci/vpci/ search and replace.

> +    write_unlock(&pdev->domain->vpci_rwlock);
> +
>      while ( !list_empty(&pdev->vpci->handlers) )

pdev->vpci dereferenced here

>      {
> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>                                                     struct vpci_register,
>                                                     node);
> 
>          list_del(&r->node);
>          xfree(r);
>      }
> -    spin_unlock(&pdev->vpci->lock);
> +
>      if ( pdev->vpci->msix )

pdev->vpci dereferenced here

>      {
>          unsigned int i;
> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>              if ( pdev->vpci->msix->table[i] )

pdev->vpci dereferenced here, and two more above not shown in the diff context

>                  iounmap(pdev->vpci->msix->table[i]);

pdev->vpci dereferenced here

>      }
> -    xfree(pdev->vpci->msix);
> -    xfree(pdev->vpci->msi);
> -    xfree(pdev->vpci);
> -    pdev->vpci = NULL;
> +    xfree(vpci->msix);
> +    xfree(vpci->msi);
> +    xfree(vpci);
>  }
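
To make the fix concrete, here is a sketch of the function with the remaining
dereferences switched to the local pointer (illustrative only; the context
lines elided from the quoted diff are approximated):

    void vpci_remove_device(struct pci_dev *pdev)
    {
        struct vpci *vpci;

        if ( !has_vpci(pdev->domain) )
            return;

        write_lock(&pdev->domain->vpci_rwlock);
        if ( !pdev->vpci )
        {
            write_unlock(&pdev->domain->vpci_rwlock);
            return;
        }

        vpci = pdev->vpci;
        pdev->vpci = NULL;
        write_unlock(&pdev->domain->vpci_rwlock);

        while ( !list_empty(&vpci->handlers) )
        {
            struct vpci_register *r = list_first_entry(&vpci->handlers,
                                                       struct vpci_register,
                                                       node);

            list_del(&r->node);
            xfree(r);
        }

        if ( vpci->msix )
        {
            unsigned int i;

            list_del(&vpci->msix->next);
            for ( i = 0; i < ARRAY_SIZE(vpci->msix->table); i++ )
                if ( vpci->msix->table[i] )
                    iounmap(vpci->msix->table[i]);
        }

        xfree(vpci->msix);
        xfree(vpci->msi);
        xfree(vpci);
    }
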
Volodymyr Babchuk June 21, 2023, 10:07 p.m. UTC | #3
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Introduce a per-domain read/write lock to check whether vpci is present,
>> so we are sure there are no accesses to the contents of the vpci struct
>> if not. This lock can be used (and in a few cases is used right away)
>> so that vpci removal can be performed while holding the lock in write
>> mode. Previously such removal could race with vpci_read for example.
>> 
>> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure
>> from being removed.
>> 
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if done
>> under the read lock, requires vpci->lock to be acquired on both devices
>> being compared, which may produce a deadlock. It is not possible to
>> upgrade read lock to write lock in such a case. So, in order to prevent
>> the deadlock, check which registers are going to be written and acquire
>> the lock in the appropriate mode from the beginning.
>> 
>> All other code, which doesn't lead to pdev->vpci destruction and does not
>> access multiple pdevs at the same time, can still use a combination of the
>> read lock and pdev->vpci->lock.
>> 
>> 3. Optimize if ROM BAR write lock required detection by caching offset
>> of the ROM BAR register in vpci->header->rom_reg which depends on
>> header's type.
>> 
>> 4. Reduce locked region in vpci_remove_device as it is now possible
>> to set pdev->vpci to NULL early right after the write lock is acquired.
>> 
>> 5. Reduce locked region in vpci_add_handlers as it is possible to
>> initialize many more fields of the struct vpci before assigning it to
>> pdev->vpci.
>> 
>> 6. vpci_{add|remove}_register are required to be called with the write lock
>> held, but it is not feasible to add an assert there as it requires
>> struct domain to be passed for that. So, add a comment about this requirement
>> to these and other functions with the equivalent constraints.
>> 
>> 7. Drop const qualifier where the new rwlock is used and this is appropriate.
>> 
>> 8. Do not call process_pending_softirqs with any locks held. For that unlock
>> prior the call and re-acquire the locks after. After re-acquiring the
>> lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>> 
>> 9. Check for !pdev->vpci in vpci_{read|write} after acquiring the lock
>> and if so, allow reading or writing the hardware register directly. This is
>> acceptable as we only deal with Dom0 as of now. Once DomU support is
>> added the write will need to be ignored and read return all 0's for the
>> guests, while Dom0 can still access the registers directly.
>> 
>> 10. Introduce pcidevs_trylock, so there is a possibility to try locking
>> the pcidev's lock.
>> 
>> 11. Use pcidev's lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>> 
>> 12. This is based on the discussion at [1].
>> 
>> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/
>> 
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> Thanks.
>
> I haven't looked in full detail, but I'm afraid there's an ABBA
> deadlock with the per-domain vpci lock and the pcidevs lock in
> modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>
> I've made some comments below.

Thank you for the review. I believe that it is a good idea to have a
per-domain pdev_list lock. See my answers below.


>> ---
>> This was checked on x86: with and without PVH Dom0.
>> ---
>>  xen/arch/x86/hvm/vmsi.c       |   7 +++
>>  xen/common/domain.c           |   3 +
>>  xen/drivers/passthrough/pci.c |   5 ++
>>  xen/drivers/vpci/header.c     |  52 +++++++++++++++++
>>  xen/drivers/vpci/msi.c        |  25 +++++++-
>>  xen/drivers/vpci/msix.c       |  51 +++++++++++++---
>>  xen/drivers/vpci/vpci.c       | 107 +++++++++++++++++++++++++---------
>>  xen/include/xen/pci.h         |   1 +
>>  xen/include/xen/sched.h       |   3 +
>>  xen/include/xen/vpci.h        |   6 ++
>>  10 files changed, 223 insertions(+), 37 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 3cd4923060..f188450b1b 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -895,6 +895,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>  {
>>      unsigned int i;
>>  
>> +    ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
>> +    ASSERT(pcidevs_locked());
>> +
>>      for ( i = 0; i < msix->max_entries; i++ )
>>      {
>>          const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +916,11 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>              struct pci_dev *pdev = msix->pdev;
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>> +            pcidevs_unlock();
>> +            read_unlock(&pdev->domain->vpci_rwlock);
>>              process_pending_softirqs();
>> +            read_lock(&pdev->domain->vpci_rwlock);
>> +            pcidevs_lock();
>
> Why do you need the pcidevs_lock here?  Isn't it enough to have the
> per-domain rwlock taken in read mode in order to prevent devices from
> being de-assigned from the domain?
>

Yes, looks like you are right. I'll remove pcidevs_lock() 


>>              /* NB: we assume that pdev cannot go away for an alive domain. */
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 6a440590fe..a4640acb62 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -622,6 +622,9 @@ struct domain *domain_create(domid_t domid,
>>  
>>  #ifdef CONFIG_HAS_PCI
>>      INIT_LIST_HEAD(&d->pdev_list);
>> +#ifdef CONFIG_HAS_VPCI
>> +    rwlock_init(&d->vpci_rwlock);
>> +#endif
>>  #endif
>>  
>>      /* All error paths can depend on the above setup. */
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 07d1986d33..0afcb59af0 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>>      spin_lock_recursive(&_pcidevs_lock);
>>  }
>>  
>> +int pcidevs_trylock(void)
>> +{
>> +    return spin_trylock_recursive(&_pcidevs_lock);
>> +}
>> +
>>  void pcidevs_unlock(void)
>>  {
>>      spin_unlock_recursive(&_pcidevs_lock);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index ec2e978a4e..23b5223adf 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -152,12 +152,14 @@ bool vpci_process_pending(struct vcpu *v)
>>          if ( rc == -ERESTART )
>>              return true;
>>  
>> +        read_lock(&v->domain->vpci_rwlock);
>>          spin_lock(&v->vpci.pdev->vpci->lock);
>>          /* Disable memory decoding unconditionally on failure. */
>>          modify_decoding(v->vpci.pdev,
>>                          rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
>>                          !rc && v->vpci.rom_only);
>>          spin_unlock(&v->vpci.pdev->vpci->lock);
>> +        read_unlock(&v->domain->vpci_rwlock);
>
> I think you likely want to expand the scope of the domain vpci lock in
> order to also protect the data in v->vpci?  So that vPCI device
> removal can clear this data if required without racing with
> vpci_process_pending().  Otherwise you need to pause the domain when
> removing vPCI.

Yes, you are right.

>
>>  
>>          rangeset_destroy(v->vpci.mem);
>>          v->vpci.mem = NULL;
>> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>      struct map_data data = { .d = d, .map = true };
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&d->vpci_rwlock));
>> +
>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> +    {
>> +        /*
>> +         * It's safe to drop and reacquire the lock in this context
>> +         * without risking pdev disappearing because devices cannot be
>> +         * removed until the initial domain has been started.
>> +         */
>> +        write_unlock(&d->vpci_rwlock);
>>          process_pending_softirqs();
>> +        write_lock(&d->vpci_rwlock);
>> +    }
>> +
>>      rangeset_destroy(mem);
>>      if ( !rc )
>>          modify_decoding(pdev, cmd, false);
>> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>>      raise_softirq(SCHEDULE_SOFTIRQ);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>
> Why it must be in write mode?
>
> Is this to avoid ABBA deadlocks as not taking the per-domain lock in
> write mode would then force us to take each pdev->vpci->lock in order
> to prevent concurrent modifications of remote devices?

Yes, exactly this. This is mentioned in the commit message:

    2. Writing the command register and ROM BAR register may trigger
    modify_bars to run, which in turn may access multiple pdevs while
    checking for the existing BAR's overlap. The overlapping check, if done
    under the read lock, requires vpci->lock to be acquired on both devices
    being compared, which may produce a deadlock. It is not possible to
    upgrade read lock to write lock in such a case. So, in order to prevent
    the deadlock, check which registers are going to be written and acquire
    the lock in the appropriate mode from the beginning.

>
>>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>  {
>>      struct vpci_header *header = &pdev->vpci->header;
>> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>       * Check for overlaps with other BARs. Note that only BARs that are
>>       * currently mapped (enabled) are checked for overlaps.
>>       */
>> +    pcidevs_lock();
>
> This can be problematic I'm afraid, as it's a guest controlled path
> that allows applying unrestricted contention on the pcidevs lock.  I'm
> unsure whether multiple guests could be able to trigger the watchdog
> if given enough devices/vcpus to be able to perform enough
> simultaneous calls to modify_bars().
>
> Maybe you could expand the per-domain vpci lock to also protect
> modifications of domain->pdev_list?
>
> IOW: kind of a per-domain pdev_lock.

This might work, but I am not sure if we need to expand the vpci lock. Maybe
it is better to add another lock that protects domain->pdev_list? I'm
afraid that a combined lock would be confusing, as it would protect two
semi-related entities.
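
For the sake of discussion, a hypothetical arrangement with a dedicated lock
(pdev_lock is a made-up name) could look like:

    struct domain {
        /* ... */
    #ifdef CONFIG_HAS_PCI
        struct list_head pdev_list;
        rwlock_t pdev_lock;     /* hypothetical: protects pdev_list */
    #ifdef CONFIG_HAS_VPCI
        rwlock_t vpci_rwlock;   /* this patch: protects pdev->vpci presence */
    #endif
    #endif
        /* ... */
    };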

>
>>      for_each_pdev ( pdev->domain, tmp )
>>      {
>>          if ( tmp == pdev )
>> @@ -326,10 +342,12 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>                  printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
>>                         start, end, rc);
>>                  rangeset_destroy(mem);
>> +                pcidevs_unlock();
>>                  return rc;
>>              }
>>          }
>>      }
>> +    pcidevs_unlock();
>>  
>>      ASSERT(dev);
>>  
>> @@ -482,6 +500,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>      struct vpci_bar *bars = header->bars;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>      {
>>      case PCI_HEADER_TYPE_NORMAL:
>> @@ -570,6 +590,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>          }
>>      }
>>  
>> +    ASSERT(!header->rom_reg);
>> +
>>      /* Check expansion ROM. */
>>      rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
>>      if ( rc > 0 && size )
>> @@ -586,12 +608,42 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>                                 4, rom);
>>          if ( rc )
>>              rom->type = VPCI_BAR_EMPTY;
>> +
>> +        header->rom_reg = rom_reg;
>>      }
>>  
>>      return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
>>  }
>>  REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
>>  
>> +static bool overlap(unsigned int r1_offset, unsigned int r1_size,
>> +                    unsigned int r2_offset, unsigned int r2_size)
>> +{
>> +    /* Return true if there is an overlap. */
>> +    return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;
>
> Hm, we already have a vpci_register_cmp(), which does a very similar
> comparison.  I wonder if it would be possible to somehow use that
> here.
>

Yeah, I'll revisit this part.

>> +}
>> +
>> +bool vpci_header_need_write_lock(const struct pci_dev *pdev,
>> +                                 unsigned int start, unsigned int size)
>> +{
>> +    /*
>> +     * Writing the command register and ROM BAR register may trigger
>> +     * modify_bars to run, which in turn may access multiple pdevs while
>> +     * checking for the existing BAR's overlap. The overlapping check, if done
>> +     * under the read lock, requires vpci->lock to be acquired on both devices
>> +     * being compared, which may produce a deadlock. At the same time it is not
>> +     * possible to upgrade read lock to write lock in such a case.
>> +     * Check which registers are going to be written and return true if lock
>> +     * needs to be acquired in write mode.
>> +     */
>> +    if ( overlap(start, size, PCI_COMMAND, 2) ||
>> +         (pdev->vpci->header.rom_reg &&
>> +          overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
>> +        return true;
>> +
>> +    return false;
>> +}
>> +
>>  /*
>>   * Local variables:
>>   * mode: C
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..e7d1f916a0 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>>      uint16_t control;
>>      int ret;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      if ( !pos )
>>          return 0;
>>  
>> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>  
>>  void vpci_dump_msi(void)
>>  {
>> -    const struct domain *d;
>> +    struct domain *d;
>>  
>>      rcu_read_lock(&domlist_read_lock);
>>      for_each_domain ( d )
>> @@ -277,6 +279,15 @@ void vpci_dump_msi(void)
>>  
>>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>  
>> +        if ( !read_trylock(&d->vpci_rwlock) )
>> +            continue;
>> +
>> +        if ( !pcidevs_trylock() )
>> +        {
>> +            read_unlock(&d->vpci_rwlock);
>> +            continue;
>> +        }
>> +
>>          for_each_pdev ( d, pdev )
>>          {
>>              const struct vpci_msi *msi;
>> @@ -318,14 +329,22 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */
>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>  
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            pcidevs_unlock();
>> +            read_unlock(&d->vpci_rwlock);
>> +
>>              process_pending_softirqs();
>> +
>> +            read_lock(&d->vpci_rwlock);
>> +            pcidevs_lock();
>>          }
>> +        pcidevs_unlock();
>> +        read_unlock(&d->vpci_rwlock);
>>      }
>>      rcu_read_unlock(&domlist_read_lock);
>>  }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index 25bde77586..b5a7dfdf9c 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -143,6 +143,7 @@ static void cf_check control_write(
>>          pci_conf_write16(pdev->sbdf, reg, val);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  {
>>      struct vpci_msix *msix;
>> @@ -163,7 +164,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  
>>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>>  {
>> -    return !!msix_find(v->domain, addr);
>> +    int rc;
>> +
>> +    read_lock(&v->domain->vpci_rwlock);
>> +    rc = !!msix_find(v->domain, addr);
>> +    read_unlock(&v->domain->vpci_rwlock);
>> +
>> +    return rc;
>>  }
>>  
>>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +365,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_read(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      const struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>>      *data = ~0ul;
>>  
>> +    read_lock(&d->vpci_rwlock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_read(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_read(d, msix, addr, len, data);
>> +        read_unlock(&d->vpci_rwlock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -404,6 +424,7 @@ static int cf_check msix_read(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->vpci_rwlock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -491,19 +512,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_write(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>> +    read_lock(&d->vpci_rwlock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_write(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_write(d, msix, addr, len, data);
>> +        read_unlock(&d->vpci_rwlock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->vpci_rwlock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -579,6 +613,7 @@ static int cf_check msix_write(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->vpci_rwlock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -665,6 +700,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>>      struct vpci_msix *msix;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>> +
>>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>>                                        PCI_CAP_ID_MSIX);
>>      if ( !msix_offset )
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 652807a4a4..1270174e78 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>  
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
>> +    struct vpci *vpci;
>> +
>> +    if ( !has_vpci(pdev->domain) )
>>          return;
>>  
>> -    spin_lock(&pdev->vpci->lock);
>> +    write_lock(&pdev->domain->vpci_rwlock);
>> +    if ( !pdev->vpci )
>> +    {
>> +        write_unlock(&pdev->domain->vpci_rwlock);
>> +        return;
>> +    }
>> +
>> +    vpci = pdev->vpci;
>> +    pdev->vpci = NULL;
>> +    write_unlock(&pdev->domain->vpci_rwlock);
>> +
>>      while ( !list_empty(&pdev->vpci->handlers) )
>>      {
>> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
>> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>>                                                     struct vpci_register,
>>                                                     node);
>>  
>>          list_del(&r->node);
>>          xfree(r);
>>      }
>> -    spin_unlock(&pdev->vpci->lock);
>> +
>>      if ( pdev->vpci->msix )
>>      {
>>          unsigned int i;
>> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>>              if ( pdev->vpci->msix->table[i] )
>>                  iounmap(pdev->vpci->msix->table[i]);
>>      }
>> -    xfree(pdev->vpci->msix);
>> -    xfree(pdev->vpci->msi);
>> -    xfree(pdev->vpci);
>> -    pdev->vpci = NULL;
>> +    xfree(vpci->msix);
>> +    xfree(vpci->msi);
>> +    xfree(vpci);
>>  }
>>  
>>  int vpci_add_handlers(struct pci_dev *pdev)
>>  {
>> +    struct vpci *vpci;
>>      unsigned int i;
>>      int rc = 0;
>>  
>>      if ( !has_vpci(pdev->domain) )
>>          return 0;
>>  
>> +    vpci = xzalloc(struct vpci);
>> +    if ( !vpci )
>> +        return -ENOMEM;
>> +
>> +    INIT_LIST_HEAD(&vpci->handlers);
>> +    spin_lock_init(&vpci->lock);
>> +
>> +    write_lock(&pdev->domain->vpci_rwlock);
>
> I think the usage of the vpci per-domain lock here and in
> vpci_remove_device() create an ABBA deadlock situation with the usage
> of it in modify_bars().
>
> Both vpci_add_handlers() and vpci_remove_device() get called with the
> pcidevs lock taken, and take the per-domain vpci lock afterwards,
> while modify_bars() does the inverse locking, gets called with the
> per-domain vpci lock taken and then take the pcidevs lock inside the
> function.

You suggested using a separate per-domain pdev_list lock in
modify_bars. Looks like this can help with the ABBA deadlock.
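
With such a lock (hypothetical name d->pdev_lock), the nesting could be made
consistent on all paths, for example:

    /*
     * vpci_write() -> modify_bars():
     *     write_lock(&d->vpci_rwlock);
     *     read_lock(&d->pdev_lock);      <- instead of pcidevs_lock()
     *
     * vpci_add_handlers() / vpci_remove_device() (callers hold pcidevs):
     *     pcidevs_lock();
     *     write_lock(&d->vpci_rwlock);
     *
     * The pcidevs lock is then never acquired after d->vpci_rwlock, which
     * removes the ABBA pair.
     */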

>
>> +
>>      /* We should not get here twice for the same device. */
>>      ASSERT(!pdev->vpci);
>>  
>> -    pdev->vpci = xzalloc(struct vpci);
>> -    if ( !pdev->vpci )
>> -        return -ENOMEM;
>> -
>> -    INIT_LIST_HEAD(&pdev->vpci->handlers);
>> -    spin_lock_init(&pdev->vpci->lock);
>> +    pdev->vpci = vpci;
>>  
>>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>>      {
>> @@ -91,6 +107,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>          if ( rc )
>>              break;
>>      }
>> +    write_unlock(&pdev->domain->vpci_rwlock);
>>  
>>      if ( rc )
>>          vpci_remove_device(pdev);
>> @@ -139,6 +156,7 @@ uint32_t cf_check vpci_hw_read32(
>>      return pci_conf_read32(pdev->sbdf, reg);
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>                        vpci_write_t *write_handler, unsigned int offset,
>>                        unsigned int size, void *data)
>> @@ -162,8 +180,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>      r->offset = offset;
>>      r->private = data;
>>  
>> -    spin_lock(&vpci->lock);
>
> If you remove the lock here we need to assert that the per-domain vpci
> lock is taken in write mode.

Yep, an assert will be better than the comment above.

>
>> -
>>      /* The list of handlers must be kept sorted at all times. */
>>      list_for_each ( prev, &vpci->handlers )
>>      {
>> @@ -175,25 +191,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
>>              break;
>>          if ( cmp == 0 )
>>          {
>> -            spin_unlock(&vpci->lock);
>>              xfree(r);
>>              return -EEXIST;
>>          }
>>      }
>>  
>>      list_add_tail(&r->node, prev);
>> -    spin_unlock(&vpci->lock);
>>  
>>      return 0;
>>  }
>>  
>> +/* This must hold domain's vpci_rwlock in write mode. */
>>  int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>                           unsigned int size)
>>  {
>>      const struct vpci_register r = { .offset = offset, .size = size };
>>      struct vpci_register *rm;
>>  
>> -    spin_lock(&vpci->lock);
>>      list_for_each_entry ( rm, &vpci->handlers, node )
>>      {
>>          int cmp = vpci_register_cmp(&r, rm);
>> @@ -205,14 +219,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
>>          if ( !cmp && rm->offset == offset && rm->size == size )
>>          {
>>              list_del(&rm->node);
>> -            spin_unlock(&vpci->lock);
>>              xfree(rm);
>>              return 0;
>>          }
>>          if ( cmp <= 0 )
>>              break;
>>      }
>> -    spin_unlock(&vpci->lock);
>
> Same here about the per-domain lock being taken.
>
>>  
>>      return -ENOENT;
>>  }
>> @@ -320,7 +332,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>> @@ -333,10 +345,18 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>      }
>>  
>>      /* Find the PCI dev matching the address. */
>> +    pcidevs_lock();
>>      pdev = pci_get_pdev(d, sbdf);
>> -    if ( !pdev || !pdev->vpci )
>> +    pcidevs_unlock();
>
> I think it's worth looking into expanding the per-domain vpci-lock to
> a pdevs lock or similar in order to protect the pdev_list also if
> possible.  So that we can avoid taking the pcidevs lock here.

Agree
Roger Pau Monné June 22, 2023, 8:15 a.m. UTC | #4
On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
> >> [...]
> >
> > Thanks.
> >
> > I haven't looked in full detail, but I'm afraid there's an ABBA
> > deadlock with the per-domain vpci lock and the pcidevs lock in
> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
> >
> > I've made some comments below.
> 
> Thank you for the review. I believe that it is a good idea to have a
> per-domain pdev_list lock. See my answers below.

I think it's important that the lock that protects domain->pdev_list
must be the same one that also protects pdev->vpci, or else you might run
into similar ABBA deadlock situations.

The problem then could be that in vpci_{read,write} you will take the
per-domain pdev lock in read mode in order to get the pdev, and for
writes to the command register or the ROM BAR you would have to
upgrade such lock to a write lock without dropping it, and we don't
have such functionality for rw locks ATM.

Maybe just re-starting the function knowing that the lock must be
taken in write mode would be a good solution: writes to the command
register will already be slow since they are likely to involve
modifications to the p2m.

> >> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
> >>      raise_softirq(SCHEDULE_SOFTIRQ);
> >>  }
> >>  
> >> +/* This must hold domain's vpci_rwlock in write mode. */
> >
> > Why it must be in write mode?
> >
> > Is this to avoid ABBA deadlocks as not taking the per-domain lock in
> > write mode would then force us to take each pdev->vpci->lock in order
> > to prevent concurrent modifications of remote devices?
> 
> Yes, exactly this. This is mentioned in the commit message:
> 
>     2. Writing the command register and ROM BAR register may trigger
>     modify_bars to run, which in turn may access multiple pdevs while
>     checking for the existing BAR's overlap. The overlapping check, if done
>     under the read lock, requires vpci->lock to be acquired on both devices
>     being compared, which may produce a deadlock. It is not possible to
>     upgrade read lock to write lock in such a case. So, in order to prevent
>     the deadlock, check which registers are going to be written and acquire
>     the lock in the appropriate mode from the beginning.

Might be good to put part of the commit message in the code comment;
just saying that the lock must be taken in write mode is not very
informative.

/*
 * The <lock-name> per domain lock must be taken in write mode in
 * order to prevent changes to the vPCI data of devices assigned to
 * the domain, since the function parses such data.
 */

> 
> >
> >>  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>  {
> >>      struct vpci_header *header = &pdev->vpci->header;
> >> @@ -287,6 +302,7 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
> >>       * Check for overlaps with other BARs. Note that only BARs that are
> >>       * currently mapped (enabled) are checked for overlaps.
> >>       */
> >> +    pcidevs_lock();
> >
> > This can be problematic I'm afraid, as it's a guest controlled path
> > that allows applying unrestricted contention on the pcidevs lock.  I'm
> > unsure whether multiple guests could be able to trigger the watchdog
> > if given enough devices/vcpus to be able to perform enough
> > simultaneous calls to modify_bars().
> >
> > Maybe you could expand the per-domain vpci lock to also protect
> > modifications of domain->pdev_list?
> >
> > IOW: kind of a per-domain pdev_lock.
> 
> This might work, but I am not sure if we need to expand vpci lock. Maybe
> it is better to add another lock that protects domain->pdev_list? I
> afraid that combined lock will be confusing, as it protects two
> semi-related entities.

I think expanding is the best solution, otherwise you are likely to
run into the same ABBA deadlock situations.

Thanks, Roger.
Volodymyr Babchuk June 22, 2023, 9:17 p.m. UTC | #5
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
>> 
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> >> [...]
>> >
>> > Thanks.
>> >
>> > I haven't looked in full detail, but I'm afraid there's an ABBA
>> > deadlock with the per-domain vpci lock and the pcidevs lock in
>> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>> >
>> > I've made some comments below.
>> 
>> Thank you for the review. I believe that it is a good idea to have a
>> per-domain pdev_list lock. See my answers below.
>
> I think it's important that the lock that protects domain->pdev_list
> must be the same that also protects pdev->vpci, or else you might run
> into similar ABBA deadlock situations.
>
> The problem then could be that in vpci_{read,write} you will take the
> per-domain pdev lock in read mode in order to get the pdev, and for
> writes to the command register or the ROM BAR you would have to
> upgrade such lock to a write lock without dropping it, and we don't
> have such functionality for rw locks ATM.
>
> Maybe just re-starting the function knowing that the lock must be
> taken in write mode would be a good solution: writes to the command
> register will already be slow since they are likely to involve
> modifications to the p2m.

Looks like modify_bars() is the only cause for this extended lock. I
know that this was discussed earlier, but can we rework modify_bars to
not iterate over all the pdevs? We can store a copy of all enabled BARs in
a domain structure, protected by domain->vpci_lock. Something akin to

struct vpci_bar {
        struct list_head list;
        struct vpci *vpci;
        unsigned long start;
        unsigned long end;
        bool is_rom;
};
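
Then the overlap check in modify_bars() could walk this cached list
alone, without touching other pdevs. A rough sketch (the d->vpci_bars
list head and the locking are tentative):

static bool bars_overlap(const struct domain *d, unsigned long start,
                         unsigned long end, const struct vpci *self)
{
    const struct vpci_bar *bar;

    /* Caller is expected to hold d->vpci_lock. */
    list_for_each_entry ( bar, &d->vpci_bars, list )
        if ( bar->vpci != self && bar->start <= end && start <= bar->end )
            return true;

    return false;
}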


>> >> @@ -213,6 +227,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
>> >>      raise_softirq(SCHEDULE_SOFTIRQ);
>> >>  }
>> >>  
>> >> +/* This must hold domain's vpci_rwlock in write mode. */
>> >
>> > Why it must be in write mode?
>> >
>> > Is this to avoid ABBA deadlocks as not taking the per-domain lock in
>> > write mode would then force us to take each pdev->vpci->lock in order
>> > to prevent concurrent modifications of remote devices?
>> 
>> Yes, exactly this. This is mentioned in the commit message:
>> 
>>     2. Writing the command register and ROM BAR register may trigger
>>     modify_bars to run, which in turn may access multiple pdevs while
>>     checking for the existing BAR's overlap. The overlapping check, if done
>>     under the read lock, requires vpci->lock to be acquired on both devices
>>     being compared, which may produce a deadlock. It is not possible to
>>     upgrade read lock to write lock in such a case. So, in order to prevent
>>     the deadlock, check which registers are going to be written and acquire
>>     the lock in the appropriate mode from the beginning.
>
> Might be good to put part of the commit message in the code comment,
> just saying that the lock must be taken in write mode is not very
> informative.
>
> /*
>  * The <lock-name> per domain lock must be taken in write mode in
>  * order to prevent changes to the vPCI data of devices assigned to
>  * the domain, since the function parses such data.
>  */
>

Agree. Also, I'll add a corresponding ASSERT()
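Something like this, assuming the lock keeps its current name:

    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));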
Volodymyr Babchuk June 22, 2023, 9:18 p.m. UTC | #6
Hi Stewart,

Stewart Hildebrand <stewart.hildebrand@amd.com> writes:

> On 6/13/23 06:32, Volodymyr Babchuk wrote:
>
> ...
>
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index 652807a4a4..1270174e78 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,20 +38,32 @@ extern vpci_register_init_t *const __end_vpci_array[];
>> 
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> -    if ( !has_vpci(pdev->domain) || !pdev->vpci )
>> +    struct vpci *vpci;
>> +
>> +    if ( !has_vpci(pdev->domain) )
>>          return;
>> 
>> -    spin_lock(&pdev->vpci->lock);
>> +    write_lock(&pdev->domain->vpci_rwlock);
>> +    if ( !pdev->vpci )
>> +    {
>> +        write_unlock(&pdev->domain->vpci_rwlock);
>> +        return;
>> +    }
>> +
>> +    vpci = pdev->vpci;
>> +    pdev->vpci = NULL;
>
> Here we set pdev->vpci to NULL, yet there are still a few uses
> remaining after this in vpci_remove_device. I suspect they were missed
> during a s/pdev->vpci/vpci/ search and replace.
>

Yes, you are right. Thank you, I'll fix this in the next version.

>> +    write_unlock(&pdev->domain->vpci_rwlock);
>> +
>>      while ( !list_empty(&pdev->vpci->handlers) )
>
> pdev->vpci dereferenced here
>
>>      {
>> -        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
>> +        struct vpci_register *r = list_first_entry(&vpci->handlers,
>>                                                     struct vpci_register,
>>                                                     node);
>> 
>>          list_del(&r->node);
>>          xfree(r);
>>      }
>> -    spin_unlock(&pdev->vpci->lock);
>> +
>>      if ( pdev->vpci->msix )
>
> pdev->vpci dereferenced here
>
>>      {
>>          unsigned int i;
>> @@ -61,29 +73,33 @@ void vpci_remove_device(struct pci_dev *pdev)
>>              if ( pdev->vpci->msix->table[i] )
>
> pdev->vpci dereferenced here, and two more above not shown in the diff context
>
>>                  iounmap(pdev->vpci->msix->table[i]);
>
> pdev->vpci dereferenced here
>
>>      }
>> -    xfree(pdev->vpci->msix);
>> -    xfree(pdev->vpci->msi);
>> -    xfree(pdev->vpci);
>> -    pdev->vpci = NULL;
>> +    xfree(vpci->msix);
>> +    xfree(vpci->msi);
>> +    xfree(vpci);
>>  }
Roger Pau Monné June 23, 2023, 8:50 a.m. UTC | #7
On Thu, Jun 22, 2023 at 09:17:36PM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
> >> 
> >> Hi Roger,
> >> 
> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> 
> >> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
> >> >> [...]
> >> >
> >> > Thanks.
> >> >
> >> > I haven't looked in full detail, but I'm afraid there's an ABBA
> >> > deadlock with the per-domain vpci lock and the pcidevs lock in
> >> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
> >> >
> >> > I've made some comments below.
> >> 
> >> Thank you for the review. I believe that it is a good idea to have a
> >> per-domain pdev_list lock. See my answers below.
> >
> > I think it's important that the lock that protects domain->pdev_list
> > must be the same that also protects pdev->vpci, or else you might run
> > into similar ABBA deadlock situations.
> >
> > The problem then could be that in vpci_{read,write} you will take the
> > per-domain pdev lock in read mode in order to get the pdev, and for
> > writes to the command register or the ROM BAR you would have to
> > upgrade such lock to a write lock without dropping it, and we don't
> > have such functionality for rw locks ATM.
> >
> > Maybe just re-starting the function knowing that the lock must be
> > taken in write mode would be a good solution: writes to the command
> > register will already be slow since they are likely to involve
> > modifications to the p2m.
> 
> Looks like modify_bars() is the only cause for this extended lock. I
> know that this was discussed earlier, but can we rework modify_bars to
> not iterate over all the pdevs? We can store copy of all enabled BARs in
> a domain structure, protected by domain->vpci_lock. Something akin to
> 
> struct vpci_bar {
>         list_head list;
>         struct vpci *vpci;
>         unsigned long start;
>         unsigned long end;
>         bool is_rom;
> };

This IMO makes the logic more complicated, as each time a BAR is
updated we would have to change the cached address and size in two
different places.  It's also duplicated data that takes up memory, and
there are systems with a non-trivial number of PCI devices and thus
BARs to track.

I think it's easier to just make the newly introduced per-domain
rwlock also protect the domain's pdev_list (unless I'm missing
something).  AFAICT it would also simplify locking, as such rwlock
protecting the domain->pdev_list will avoid you from having to take
the pcidevs lock in vpci_{read,write} in order to find the device the
access belongs to.
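
IOW the device lookup in vpci_read() could then become something like
this sketch (reusing the vpci_rwlock name from the patch):

    read_lock(&d->vpci_rwlock);
    pdev = pci_get_pdev(d, sbdf);
    if ( !pdev || !pdev->vpci )
    {
        read_unlock(&d->vpci_rwlock);
        return vpci_read_hw(sbdf, reg, size);
    }
    /* ... dispatch to the register handlers, then ... */
    read_unlock(&d->vpci_rwlock);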

Thanks, Roger.
Volodymyr Babchuk June 23, 2023, 9:26 a.m. UTC | #8
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jun 22, 2023 at 09:17:36PM +0000, Volodymyr Babchuk wrote:
>> 
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
>> >> 
>> >> Hi Roger,
>> >> 
>> >> Roger Pau Monné <roger.pau@citrix.com> writes:
>> >> 
>> >> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> >> >> [...]
>> >> >
>> >> > Thanks.
>> >> >
>> >> > I haven't looked in full detail, but I'm afraid there's an ABBA
>> >> > deadlock with the per-domain vpci lock and the pcidevs lock in
>> >> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>> >> >
>> >> > I've made some comments below.
>> >> 
>> >> Thank you for the review. I believe that it is a good idea to have a
>> >> per-domain pdev_list lock. See my answers below.
>> >
>> > I think it's important that the lock that protects domain->pdev_list
>> > must be the same that also protects pdev->vpci, or else you might run
>> > into similar ABBA deadlock situations.
>> >
>> > The problem then could be that in vpci_{read,write} you will take the
>> > per-domain pdev lock in read mode in order to get the pdev, and for
>> > writes to the command register or the ROM BAR you would have to
>> > upgrade such lock to a write lock without dropping it, and we don't
>> > have such functionality for rw locks ATM.
>> >
>> > Maybe just re-starting the function knowing that the lock must be
>> > taken in write mode would be a good solution: writes to the command
>> > register will already be slow since they are likely to involve
>> > modifications to the p2m.
>> 
>> Looks like modify_bars() is the only cause for this extended lock. I
>> know that this was discussed earlier, but can we rework modify_bars to
>> not iterate over all the pdevs? We can store copy of all enabled BARs in
>> a domain structure, protected by domain->vpci_lock. Something akin to
>> 
>> struct vpci_bar {
>>         list_head list;
>>         struct vpci *vpci;
>>         unsigned long start;
>>         unsigned long end;
>>         bool is_rom;
>> };
>
> This IMO makes the logic more complicated, as each time a BAR is
> updated we would have to change the cached address and size in two
> different places.  It's also duplicated data that takes up memory, and
> there are system with a non-trivial amount of PCI devices and thus
> BARs to track.
>
> I think it's easier to just make the newly introduced per-domain
> rwlock also protect the domain's pdev_list (unless I'm missing
> something).  AFAICT it would also simplify locking, as such rwlock
> protecting the domain->pdev_list will avoid you from having to take
> the pcidevs lock in vpci_{read,write} in order to find the device the
> access belongs to.

In my opinion, this will make the whole locking logic complex, because
we will have one rwlock that protects:

1. pdev->vpci
2. domain->pdev_list

These are two semi-related things. I feel that this will be hard to
understand for anyone who is not deeply familiar with the PCI
code. Anyway, I want this patch series to get going, so I'll trust your
judgment here. How should I name this lock? Taking into account that
its scope will be broader, "domain->vpci_rwlock" does not seem like a
good name.
Jan Beulich June 23, 2023, 5:09 p.m. UTC | #9
On 23.06.2023 11:26, Volodymyr Babchuk wrote:
> Roger Pau Monné <roger.pau@citrix.com> writes:
>> On Thu, Jun 22, 2023 at 09:17:36PM +0000, Volodymyr Babchuk wrote:
>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>> On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
>>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>>> I've made some comments below.
>>>>>
>>>>> Thank you for the review. I believe that it is a good idea to have a
>>>>> per-domain pdev_list lock. See my answers below.
>>>>
>>>> I think it's important that the lock that protects domain->pdev_list
>>>> must be the same that also protects pdev->vpci, or else you might run
>>>> into similar ABBA deadlock situations.
>>>>
>>>> The problem then could be that in vpci_{read,write} you will take the
>>>> per-domain pdev lock in read mode in order to get the pdev, and for
>>>> writes to the command register or the ROM BAR you would have to
>>>> upgrade such lock to a write lock without dropping it, and we don't
>>>> have such functionality for rw locks ATM.
>>>>
>>>> Maybe just re-starting the function knowing that the lock must be
>>>> taken in write mode would be a good solution: writes to the command
>>>> register will already be slow since they are likely to involve
>>>> modifications to the p2m.
>>>
>>> Looks like modify_bars() is the only cause for this extended lock. I
>>> know that this was discussed earlier, but can we rework modify_bars to
>>> not iterate over all the pdevs? We can store copy of all enabled BARs in
>>> a domain structure, protected by domain->vpci_lock. Something akin to
>>>
>>> struct vpci_bar {
>>>         list_head list;
>>>         struct vpci *vpci;
>>>         unsigned long start;
>>>         unsigned long end;
>>>         bool is_rom;
>>> };
>>
>> This IMO makes the logic more complicated, as each time a BAR is
>> updated we would have to change the cached address and size in two
>> different places.  It's also duplicated data that takes up memory, and
>> there are system with a non-trivial amount of PCI devices and thus
>> BARs to track.
>>
>> I think it's easier to just make the newly introduced per-domain
>> rwlock also protect the domain's pdev_list (unless I'm missing
>> something).  AFAICT it would also simplify locking, as such rwlock
>> protecting the domain->pdev_list will avoid you from having to take
>> the pcidevs lock in vpci_{read,write} in order to find the device the
>> access belongs to.
> 
> In my opinion, this will make the whole locking logic complex, because
> we will have one rwlock that protects:
> 
> 1. pdev->vpci
> 2. domain->pdev_list

The important thing at this stage is to get the granularity down from
the global pcidevs lock. What exactly is grouped together is
secondary for the moment; it needs writing down clearly in a code
comment, of course.

Just to be sure I recall things correctly: 1 above is covering
only the pointer, not the contents of the pointed-to struct? If
so, I would agree putting both under the same lock is a little odd,
but if that limits lock nesting issues, I'd say so be it. If not,
then this lock could simply be declared as covering everything
PCI-ish that a domain has in use. Especially in the latter case ...

> This is a two semi-related things. I feel that this will be hard to
> understand for anyone who is not deeply intimate with the PCI
> code. Anyways, I want this patch series to get going, so I believe your
> judgment here. How should I name this lock? Taking into account, that
> scope is will be more broad, "domain->vpci_rwlock" does not seems as a
> good name.

... d->pci_lock would seem quite appropriate.
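
With a comment along these lines next to the declaration (wording
merely a suggestion):

    /*
     * Protects:
     *  - d->pdev_list,
     *  - pdev->vpci for all devices assigned to d.
     * Must be held in write mode when any of the above is modified, in
     * particular when pdev->vpci is allocated or freed.
     */
    rwlock_t pci_lock;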

Jan
Volodymyr Babchuk July 4, 2023, 9:03 p.m. UTC | #10
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jun 22, 2023 at 09:17:36PM +0000, Volodymyr Babchuk wrote:
>> 
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Wed, Jun 21, 2023 at 10:07:20PM +0000, Volodymyr Babchuk wrote:
>> >> 
>> >> Hi Roger,
>> >> 
>> >> Roger Pau Monné <roger.pau@citrix.com> writes:
>> >> 
>> >> > On Tue, Jun 13, 2023 at 10:32:26AM +0000, Volodymyr Babchuk wrote:
>> >> >> [...]
>> >> >
>> >> > Thanks.
>> >> >
>> >> > I haven't looked in full detail, but I'm afraid there's an ABBA
>> >> > deadlock with the per-domain vpci lock and the pcidevs lock in
>> >> > modify_bars() vs  vpci_add_handlers() and vpci_remove_device().
>> >> >
>> >> > I've made some comments below.
>> >> 
>> >> Thank you for the review. I believe that it is a good idea to have a
>> >> per-domain pdev_list lock. See my answers below.
>> >
>> > I think it's important that the lock that protects domain->pdev_list
>> > must be the same that also protects pdev->vpci, or else you might run
>> > into similar ABBA deadlock situations.
>> >
>> > The problem then could be that in vpci_{read,write} you will take the
>> > per-domain pdev lock in read mode in order to get the pdev, and for
>> > writes to the command register or the ROM BAR you would have to
>> > upgrade such lock to a write lock without dropping it, and we don't
>> > have such functionality for rw locks ATM.
>> >
>> > Maybe just re-starting the function knowing that the lock must be
>> > taken in write mode would be a good solution: writes to the command
>> > register will already be slow since they are likely to involve
>> > modifications to the p2m.
>> 
>> Looks like modify_bars() is the only cause for this extended lock. I
>> know that this was discussed earlier, but can we rework modify_bars to
>> not iterate over all the pdevs? We can store copy of all enabled BARs in
>> a domain structure, protected by domain->vpci_lock. Something akin to
>> 
>> struct vpci_bar {
>>         list_head list;
>>         struct vpci *vpci;
>>         unsigned long start;
>>         unsigned long end;
>>         bool is_rom;
>> };
>
> This IMO makes the logic more complicated, as each time a BAR is
> updated we would have to change the cached address and size in two
> different places.  It's also duplicated data that takes up memory, and
> there are system with a non-trivial amount of PCI devices and thus
> BARs to track.
>
> I think it's easier to just make the newly introduced per-domain
> rwlock also protect the domain's pdev_list (unless I'm missing
> something).  AFAICT it would also simplify locking, as such rwlock
> protecting the domain->pdev_list will avoid you from having to take
> the pcidevs lock in vpci_{read,write} in order to find the device the
> access belongs to.

I am currently implementing your proposal (along with Jan's
suggestions), but I am facing an ABBA deadlock with IOMMU's
reassign_device() call, which has this piece of code:

        list_move(&pdev->domain_list, &target->pdev_list);

My immediate change was:

        write_lock(&pdev->domain->pci_lock);
        list_del(&pdev->domain_list);
        write_unlock(&pdev->domain->pci_lock);

        write_lock(&target->pci_lock);
        list_add(&pdev->domain_list, &target->pdev_list);
        write_unlock(&target->pci_lock);

But this will not work because reassign_device() is called from
pci_release_devices(), which iterates over d->pdev_list, so we need to
take d->pci_lock early.

Any suggestions on how to fix this? My idea is to remove a device from a
list one at a time:

int pci_release_devices(struct domain *d)
{
    struct pci_dev *pdev;
    u8 bus, devfn;
    int ret;

    pcidevs_lock();
    write_lock(&d->pci_lock);
    ret = arch_pci_clean_pirqs(d);
    if ( ret )
    {
        pcidevs_unlock();
        write_unlock(&d->pci_lock);
        return ret;
    }

    while ( !list_empty(&d->pdev_list) )
    {
        pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
        bus = pdev->bus;
        devfn = pdev->devfn;
        list_del(&pdev->domain_list);
        write_unlock(&d->pci_lock);
        ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
        write_lock(&d->pci_lock);
    }
    write_unlock(&d->pci_lock);
    pcidevs_unlock();

    return ret;
}

But it is somewhat ugly.
Jan Beulich July 5, 2023, 7:11 a.m. UTC | #11
On 04.07.2023 23:03, Volodymyr Babchuk wrote:
> I am currently implementing your proposal (along with Jan's
> suggestions), but I am facing ABBA deadlock with IOMMU's
> reassign_device() call, which has this piece of code:
> 
>         list_move(&pdev->domain_list, &target->pdev_list);
> 
> My immediate change was:
> 
>         write_lock(&pdev->domain->pci_lock);
>         list_del(&pdev->domain_list);
>         write_unlock(&pdev->domain->pci_lock);
> 
>         write_lock(&target->pci_lock);
>         list_add(&pdev->domain_list, &target->pdev_list);
>         write_unlock(&target->pci_lock);
> 
> But this will not work because reassign_device is called from
> pci_release_devices() which iterates over d->pdev_list, so we need to
> take a d->pci_lock early.
> 
> Any suggestions on how to fix this? My idea is to remove a device from a
> list one at time:
> 
> int pci_release_devices(struct domain *d)
> {
>     struct pci_dev *pdev;
>     u8 bus, devfn;
>     int ret;
> 
>     pcidevs_lock();
>     write_lock(&d->pci_lock);
>     ret = arch_pci_clean_pirqs(d);
>     if ( ret )
>     {
>         pcidevs_unlock();
>         write_unlock(&d->pci_lock);
>         return ret;
>     }
> 
>     while ( !list_empty(&d->pdev_list) )
>     {
> >         pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>         bus = pdev->bus;
>         devfn = pdev->devfn;
>         list_del(&pdev->domain_list);
>         write_unlock(&d->pci_lock);
>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>         write_lock(&d->pci_lock);

I think it needs doing almost like this, but with two more tweaks and
no list_del() right here (first and foremost to avoid needing to
figure whether removing early isn't going to subtly break anything;
see below for an error case that would end up with changed behavior):

    while ( !list_empty(&d->pdev_list) )
    {
        const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
        uint16_t seg = pdev->seg;
        uint8_t bus = pdev->bus;
        uint8_t devfn = pdev->devfn;

        write_unlock(&d->pci_lock);
        ret = deassign_device(d, seg, bus, devfn) ?: ret;
        write_lock(&d->pci_lock);
    }

One caveat though: The original list_for_each_entry_safe() guarantees
the loop to complete; your use of list_del() would guarantee that too,
but then the device wouldn't be on the list anymore if deassign_device()
failed. Therefore I guess you will need another local list where you
(temporarily) put all the devices which deassign_device() left on the
list, and which you would then move back to d->pdev_list after the loop
has finished. (Whether it is sufficient to inspect the list head to
determine whether the pdev is still on the list needs careful checking.)

Jan
Roger Pau Monné July 5, 2023, 8:59 a.m. UTC | #12
On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
> > I am currently implementing your proposal (along with Jan's
> > suggestions), but I am facing ABBA deadlock with IOMMU's
> > reassign_device() call, which has this piece of code:
> > 
> >         list_move(&pdev->domain_list, &target->pdev_list);
> > 
> > My immediate change was:
> > 
> >         write_lock(&pdev->domain->pci_lock);
> >         list_del(&pdev->domain_list);
> >         write_unlock(&pdev->domain->pci_lock);
> > 
> >         write_lock(&target->pci_lock);
> >         list_add(&pdev->domain_list, &target->pdev_list);
> >         write_unlock(&target->pci_lock);
> > 
> > But this will not work because reassign_device is called from
> > pci_release_devices() which iterates over d->pdev_list, so we need to
> > take a d->pci_lock early.
> > 
> > Any suggestions on how to fix this? My idea is to remove a device from a
> > list one at time:
> > 
> > int pci_release_devices(struct domain *d)
> > {
> >     struct pci_dev *pdev;
> >     u8 bus, devfn;
> >     int ret;
> > 
> >     pcidevs_lock();
> >     write_lock(&d->pci_lock);
> >     ret = arch_pci_clean_pirqs(d);
> >     if ( ret )
> >     {
> >         pcidevs_unlock();
> >         write_unlock(&d->pci_lock);
> >         return ret;
> >     }
> > 
> >     while ( !list_empty(&d->pdev_list) )
> >     {
> >         pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
> >         bus = pdev->bus;
> >         devfn = pdev->devfn;
> >         list_del(&pdev->domain_list);
> >         write_unlock(&d->pci_lock);
> >         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
> >         write_lock(&d->pci_lock);
> 
> I think it needs doing almost like this, but with two more tweaks and
> no list_del() right here (first and foremost to avoid needing to
> figure whether removing early isn't going to subtly break anything;
> see below for an error case that would end up with changed behavior):
> 
>     while ( !list_empty(&d->pdev_list) )
>     {
>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>         uint16_t seg = pdev->seg;
>         uint8_t bus = pdev->bus;
>         uint8_t devfn = pdev->devfn;
> 
>         write_unlock(&d->pci_lock);

I think you need to remove the device from the pdev_list before
dropping the lock, or else release could race with other operations.

That's unlikely, but still if the lock is dropped and the routine
needs to operate on the device it is better remove such device from
the domain so other operations cannot get a reference to it.

Otherwise you could modify reassign_device() implementations so they
require the caller to hold the source->pci_lock when calling the
routine, but that's ugly because the lock would need to be dropped in
order to reassign the device from source to target domains.

Another option would be to move the whole d->pdev_list to a local
variable (so that d->pdev_list would be empty) and then iterate over
it without the d->pci_lock.  On failure you would take the lock and
add the failing device back into d->pdev_list.

Thanks, Roger.
Jan Beulich July 5, 2023, 9:13 a.m. UTC | #13
On 05.07.2023 10:59, Roger Pau Monné wrote:
> On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>> I am currently implementing your proposal (along with Jan's
>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>> reassign_device() call, which has this piece of code:
>>>
>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>
>>> My immediate change was:
>>>
>>>         write_lock(&pdev->domain->pci_lock);
>>>         list_del(&pdev->domain_list);
>>>         write_unlock(&pdev->domain->pci_lock);
>>>
>>>         write_lock(&target->pci_lock);
>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>         write_unlock(&target->pci_lock);
>>>
>>> But this will not work because reassign_device is called from
>>> pci_release_devices() which iterates over d->pdev_list, so we need to
>>> take a d->pci_lock early.
>>>
>>> Any suggestions on how to fix this? My idea is to remove a device from a
>>> list one at time:
>>>
>>> int pci_release_devices(struct domain *d)
>>> {
>>>     struct pci_dev *pdev;
>>>     u8 bus, devfn;
>>>     int ret;
>>>
>>>     pcidevs_lock();
>>>     write_lock(&d->pci_lock);
>>>     ret = arch_pci_clean_pirqs(d);
>>>     if ( ret )
>>>     {
>>>         pcidevs_unlock();
>>>         write_unlock(&d->pci_lock);
>>>         return ret;
>>>     }
>>>
>>>     while ( !list_empty(&d->pdev_list) )
>>>     {
>>>         pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>         bus = pdev->bus;
>>>         devfn = pdev->devfn;
>>>         list_del(&pdev->domain_list);
>>>         write_unlock(&d->pci_lock);
>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>         write_lock(&d->pci_lock);
>>
>> I think it needs doing almost like this, but with two more tweaks and
>> no list_del() right here (first and foremost to avoid needing to
>> figure whether removing early isn't going to subtly break anything;
>> see below for an error case that would end up with changed behavior):
>>
>>     while ( !list_empty(&d->pdev_list) )
>>     {
>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>         uint16_t seg = pdev->seg;
>>         uint8_t bus = pdev->bus;
>>         uint8_t devfn = pdev->devfn;
>>
>>         write_unlock(&d->pci_lock);
> 
> I think you need to remove the device from the pdev_list before
> dropping the lock, or else release could race with other operations.
> 
> That's unlikely, but still if the lock is dropped and the routine
> needs to operate on the device it is better remove such device from
> the domain so other operations cannot get a reference to it.
> 
> Otherwise you could modify reassign_device() implementations so they
> require the caller to hold the source->pci_lock when calling the
> routine, but that's ugly because the lock would need to be dropped in
> order to reassign the device from source to target domains.
> 
> Another option would be to move the whole d->pdev_list to a local
> variable (so that d->pdev_list would be empty) and then iterate over
> it without the d->pci_lock.  On failure you would take the lock and
> add the failing device back into d->pdev_list.

Conceptually I like this last variant, but like the individual list_del()
it requires auditing code for no dependency on the device still being on
that list. In fact deassign_device()'s use of pci_get_pdev() does. The
function would then need changing to have struct pci_dev * passed in.
Yet who knows where else there are uses of pci_get_pdev() lurking.

Jan
Volodymyr Babchuk July 7, 2023, 2:02 a.m. UTC | #14
Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 05.07.2023 10:59, Roger Pau Monné wrote:
>> On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
>>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>>> I am currently implementing your proposal (along with Jan's
>>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>>> reassign_device() call, which has this piece of code:
>>>>
>>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>>
>>>> My immediate change was:
>>>>
>>>>         write_lock(&pdev->domain->pci_lock);
>>>>         list_del(&pdev->domain_list);
>>>>         write_unlock(&pdev->domain->pci_lock);
>>>>
>>>>         write_lock(&target->pci_lock);
>>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>>         write_unlock(&target->pci_lock);
>>>>
>>>> But this will not work because reassign_device is called from
>>>> pci_release_devices() which iterates over d->pdev_list, so we need to
>>>> take a d->pci_lock early.
>>>>
>>>> Any suggestions on how to fix this? My idea is to remove a device from a
>>>> list one at time:
>>>>
>>>> int pci_release_devices(struct domain *d)
>>>> {
>>>>     struct pci_dev *pdev;
>>>>     u8 bus, devfn;
>>>>     int ret;
>>>>
>>>>     pcidevs_lock();
>>>>     write_lock(&d->pci_lock);
>>>>     ret = arch_pci_clean_pirqs(d);
>>>>     if ( ret )
>>>>     {
>>>>         pcidevs_unlock();
>>>>         write_unlock(&d->pci_lock);
>>>>         return ret;
>>>>     }
>>>>
>>>>     while ( !list_empty(&d->pdev_list) )
>>>>     {
>>>>         pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>>         bus = pdev->bus;
>>>>         devfn = pdev->devfn;
>>>>         list_del(&pdev->domain_list);
>>>>         write_unlock(&d->pci_lock);
>>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>>         write_lock(&d->pci_lock);
>>>
>>> I think it needs doing almost like this, but with two more tweaks and
>>> no list_del() right here (first and foremost to avoid needing to
>>> figure whether removing early isn't going to subtly break anything;
>>> see below for an error case that would end up with changed behavior):
>>>
>>>     while ( !list_empty(&d->pdev_list) )
>>>     {
>>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>         uint16_t seg = pdev->seg;
>>>         uint8_t bus = pdev->bus;
>>>         uint8_t devfn = pdev->devfn;
>>>
>>>         write_unlock(&d->pci_lock);
>> 
>> I think you need to remove the device from the pdev_list before
>> dropping the lock, or else release could race with other operations.
>> 
>> That's unlikely, but still if the lock is dropped and the routine
>> needs to operate on the device it is better to remove such a device from
>> the domain so other operations cannot get a reference to it.
>> 
>> Otherwise you could modify reassign_device() implementations so they
>> require the caller to hold the source->pci_lock when calling the
>> routine, but that's ugly because the lock would need to be dropped in
>> order to reassign the device from source to target domains.
>> 
>> Another option would be to move the whole d->pdev_list to a local
>> variable (so that d->pdev_list would be empty) and then iterate over
>> it without the d->pci_lock.  On failure you would take the lock and
>> add the failing device back into d->pdev_list.
>
> Conceptually I like this last variant, but like the individual list_del()
> it requires auditing code for no dependency on the device still being on
> that list. In fact deassign_device()'s use of pci_get_pdev() does. The
> function would then need changing to have struct pci_dev * passed in.
> Yet who knows where else there are uses of pci_get_pdev() lurking.

Okay, so I changed the deassign_device() signature and reworked the loop
in pci_release_devices() in the following way:

    INIT_LIST_HEAD(&tmp_list);
    /* Move all entries to tmp_list, so we can drop d->pci_lock */
    list_splice_init(&d->pdev_list, &tmp_list);
    write_unlock(&d->pci_lock);

    list_for_each_entry_safe ( pdev, tmp, &tmp_list, domain_list )
    {
        rc = deassign_device(d, pdev);
        if ( rc )
        {
            /* Return device back to the domain list */
            write_lock(&d->pci_lock);
            list_add(&pdev->domain_list, &d->pdev_list);
            write_unlock(&d->pci_lock);
            func_ret = rc;
        }
    }


Also, I checked all pci_get_pdev() calls and found that struct
domain (the first parameter) is passed only in a handful of places:

*** xen/drivers/vpci/vpci.c:
vpci_read[504]                 pdev = pci_get_pdev(d, sbdf);
vpci_read[506]                 pdev = pci_get_pdev(dom_xen, sbdf);
vpci_write[621]                pdev = pci_get_pdev(d, sbdf);
vpci_write[623]                pdev = pci_get_pdev(dom_xen, sbdf);

*** xen/arch/x86/irq.c:
map_domain_pirq[2166]          pdev = pci_get_pdev(d, msi->sbdf);

*** xen/drivers/passthrough/pci.c:
XEN_GUEST_HANDLE_PARAM[1712]   pdev = pci_get_pdev(d, machine_sbdf);

The last one is due to my change to deassign_device() signature.

==============================

d->pdev_list can be accessed there:

*** xen/drivers/passthrough/amd/pci_amd_iommu.c:
reassign_device[489]           list_add(&pdev->domain_list, &target->pdev_list);

*** xen/drivers/passthrough/pci.c:
_pci_hide_device[463]          list_add(&pdev->domain_list, &dom_xen->pdev_list);
pci_get_pdev[561]              list_for_each_entry ( pdev, &d->pdev_list, domain_list )
pci_add_device[759]            list_add(&pdev->domain_list, &hardware_domain->pdev_list);
pci_release_devices[917]       list_splice_init(&d->pdev_list, &tmp_list);
pci_release_devices[922]       pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
pci_release_devices[928]       list_add(&pdev->domain_list, &d->pdev_list);
_setup_hwdom_pci_devices[1155] list_add(&pdev->domain_list, &ctxt->d->pdev_list);

*** xen/drivers/passthrough/vtd/iommu.c:
reassign_device_ownership[2819] list_add(&pdev->domain_list, &target->pdev_list);

*** xen/include/xen/pci.h:
for_each_pdev[149]             list_for_each_entry(pdev, &(domain)->pdev_list, domain_list)
has_arch_pdevs[151]            #define has_arch_pdevs(d) (!list_empty(&(d)->pdev_list))

==============================

And has_arch_pdevs() is used there:

*** xen/arch/x86/hvm/hvm.c:
hvm_set_cr0[2388]              has_arch_pdevs(d)) )

*** xen/arch/x86/hvm/vmx/vmcs.c:
vmx_do_resume[1892]            if ( has_arch_pdevs(v->domain) && !iommu_snoop

*** xen/arch/x86/mm.c:
l1_disallow_mask[172]          !has_arch_pdevs(d) && \

*** xen/arch/x86/mm/p2m-pod.c:
p2m_pod_set_mem_target[352]    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
guest_physmap_mark_populate_on_demand[1404] if ( has_arch_pdevs(d) || cache_flush_permitted(d) )

*** xen/arch/x86/mm/paging.c:
paging_log_dirty_enable[208]   if ( has_arch_pdevs(d) )

*** xen/drivers/passthrough/vtd/iommu.c:
reassign_device_ownership[2773] if ( !has_arch_pdevs(target) )
reassign_device_ownership[2807] if ( !has_arch_pdevs(target) )
reassign_device_ownership[2825] if ( !has_arch_pdevs(source) )


has_arch_pdevs() bothers me most, actually, because it is not always
obvious how to add locking for the callers. I am planning to rework it
in the following way:

#define has_arch_pdevs_unlocked(d) (!list_empty(&(d)->pdev_list))

static inline bool has_arch_pdevs(struct domain *d)
{
    bool ret;

    read_lock(&d->pci_lock);
    ret = has_arch_pdevs_unlocked(d);
    read_unlock(&d->pci_lock);

    return ret;
}

And then use appropriate macro/function depending on circumstances.
Jan Beulich July 7, 2023, 6:32 a.m. UTC | #15
On 07.07.2023 04:02, Volodymyr Babchuk wrote:
> 
> Hi Jan,
> 
> Jan Beulich <jbeulich@suse.com> writes:
> 
>> On 05.07.2023 10:59, Roger Pau Monné wrote:
>>> On Wed, Jul 05, 2023 at 09:11:10AM +0200, Jan Beulich wrote:
>>>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>>>> I am currently implementing your proposal (along with Jan's
>>>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>>>> reassign_device() call, which has this piece of code:
>>>>>
>>>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>>>
>>>>> My immediate change was:
>>>>>
>>>>>         write_lock(&pdev->domain->pci_lock);
>>>>>         list_del(&pdev->domain_list);
>>>>>         write_unlock(&pdev->domain->pci_lock);
>>>>>
>>>>>         write_lock(&target->pci_lock);
>>>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>>>         write_unlock(&target->pci_lock);
>>>>>
>>>>> But this will not work because reassign_device is called from
>>>>> pci_release_devices() which iterates over d->pdev_list, so we need to
>>>>> take a d->pci_lock early.
>>>>>
>>>>> Any suggestions on how to fix this? My idea is to remove a device from a
>>>>> list one at a time:
>>>>>
>>>>> int pci_release_devices(struct domain *d)
>>>>> {
>>>>>     struct pci_dev *pdev;
>>>>>     u8 bus, devfn;
>>>>>     int ret;
>>>>>
>>>>>     pcidevs_lock();
>>>>>     write_lock(&d->pci_lock);
>>>>>     ret = arch_pci_clean_pirqs(d);
>>>>>     if ( ret )
>>>>>     {
>>>>>         pcidevs_unlock();
>>>>>         write_unlock(&d->pci_lock);
>>>>>         return ret;
>>>>>     }
>>>>>
>>>>>     while ( !list_empty(&d->pdev_list) )
>>>>>     {
>>>>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>>>         bus = pdev->bus;
>>>>>         devfn = pdev->devfn;
>>>>>         list_del(&pdev->domain_list);
>>>>>         write_unlock(&d->pci_lock);
>>>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>>>         write_lock(&d->pci_lock);
>>>>
>>>> I think it needs doing almost like this, but with two more tweaks and
>>>> no list_del() right here (first and foremost to avoid needing to
>>>> figure whether removing early isn't going to subtly break anything;
>>>> see below for an error case that would end up with changed behavior):
>>>>
>>>>     while ( !list_empty(&d->pdev_list) )
>>>>     {
>>>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>>         uint16_t seg = pdev->seg;
>>>>         uint8_t bus = pdev->bus;
>>>>         uint8_t devfn = pdev->devfn;
>>>>
>>>>         write_unlock(&d->pci_lock);
>>>
>>> I think you need to remove the device from the pdev_list before
>>> dropping the lock, or else release could race with other operations.
>>>
>>> That's unlikely, but still if the lock is dropped and the routine
>>> needs to operate on the device it is better to remove such a device from
>>> the domain so other operations cannot get a reference to it.
>>>
>>> Otherwise you could modify reassign_device() implementations so they
>>> require the caller to hold the source->pci_lock when calling the
>>> routine, but that's ugly because the lock would need to be dropped in
>>> order to reassign the device from source to target domains.
>>>
>>> Another option would be to move the whole d->pdev_list to a local
>>> variable (so that d->pdev_list would be empty) and then iterate over
>>> it without the d->pci_lock.  On failure you would take the lock and
>>> add the failing device back into d->pdev_list.
>>
>> Conceptually I like this last variant, but like the individual list_del()
>> it requires auditing code for no dependency on the device still being on
>> that list. In fact deassign_device()'s use of pci_get_pdev() does. The
>> function would then need changing to have struct pci_dev * passed in.
>> Yet who knows where else there are uses of pci_get_pdev() lurking.
> 
> Okay, so I changed the deassign_device() signature and reworked the loop
> in pci_release_devices() in the following way:
> 
>     INIT_LIST_HEAD(&tmp_list);
>     /* Move all entries to tmp_list, so we can drop d->pci_lock */
>     list_splice_init(&d->pdev_list, &tmp_list);
>     write_unlock(&d->pci_lock);
> 
>     list_for_each_entry_safe ( pdev, tmp, &tmp_list, domain_list )
>     {
>         rc = deassign_device(d, pdev);
>         if ( rc )
>         {
>             /* Return device back to the domain list */
>             write_lock(&d->pci_lock);
>             list_add(&pdev->domain_list, &d->pdev_list);
>             write_unlock(&d->pci_lock);
>             func_ret = rc;
>         }
>     }
> 
> 
> Also, I checked all pci_get_pdev() calls and found that struct
> domain (the first parameter) is passed only in a handful of places:
> 
> *** xen/drivers/vpci/vpci.c:
> vpci_read[504]                 pdev = pci_get_pdev(d, sbdf);
> vpci_read[506]                 pdev = pci_get_pdev(dom_xen, sbdf);
> vpci_write[621]                pdev = pci_get_pdev(d, sbdf);
> vpci_write[623]                pdev = pci_get_pdev(dom_xen, sbdf);
> 
> *** xen/arch/x86/irq.c:
> map_domain_pirq[2166]          pdev = pci_get_pdev(d, msi->sbdf);
> 
> *** xen/drivers/passthrough/pci.c:
> XEN_GUEST_HANDLE_PARAM[1712]   pdev = pci_get_pdev(d, machine_sbdf);
> 
> The last one is due to my change to deassign_device() signature.

And which is going to continue to return NULL when earlier on you've
emptied the list. The purpose of passing in struct pdev * was to
eliminate this call. (Yet there may be further reasons why eliminating
this call actually isn't correct.)
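
Spelling the failure mode out as a sketch (sbdf standing in for the
device's address):

    LIST_HEAD(tmp_list);

    write_lock(&d->pci_lock);
    list_splice_init(&d->pdev_list, &tmp_list); /* d->pdev_list now empty */
    write_unlock(&d->pci_lock);

    /*
     * Any subsequent lookup by address fails, even though the device
     * is still owned by d (it merely sits on tmp_list):
     */
    pdev = pci_get_pdev(d, sbdf); /* walks the empty list -> NULL */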

> ==============================
> 
> d->pdev_list can be accessed there:
> 
> *** xen/drivers/passthrough/amd/pci_amd_iommu.c:
> reassign_device[489]           list_add(&pdev->domain_list, &target->pdev_list);
> 
> *** xen/drivers/passthrough/pci.c:
> _pci_hide_device[463]          list_add(&pdev->domain_list, &dom_xen->pdev_list);
> pci_get_pdev[561]              list_for_each_entry ( pdev, &d->pdev_list, domain_list )
> pci_add_device[759]            list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> pci_release_devices[917]       list_splice_init(&d->pdev_list, &tmp_list);
> pci_release_devices[922]       pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
> pci_release_devices[928]       list_add(&pdev->domain_list, &d->pdev_list);
> _setup_hwdom_pci_devices[1155] list_add(&pdev->domain_list, &ctxt->d->pdev_list);
> 
> *** xen/drivers/passthrough/vtd/iommu.c:
> reassign_device_ownership[2819] list_add(&pdev->domain_list, &target->pdev_list);
> 
> *** xen/include/xen/pci.h:
> for_each_pdev[149]             list_for_each_entry(pdev, &(domain)->pdev_list, domain_list)
> has_arch_pdevs[151]            #define has_arch_pdevs(d) (!list_empty(&(d)->pdev_list))
> 
> ==============================
> 
> And has_arch_pdevs() is used there:
> 
> *** xen/arch/x86/hvm/hvm.c:
> hvm_set_cr0[2388]              has_arch_pdevs(d)) )
> 
> *** xen/arch/x86/hvm/vmx/vmcs.c:
> vmx_do_resume[1892]            if ( has_arch_pdevs(v->domain) && !iommu_snoop
> 
> *** xen/arch/x86/mm.c:
> l1_disallow_mask[172]          !has_arch_pdevs(d) && \
> 
> *** xen/arch/x86/mm/p2m-pod.c:
> p2m_pod_set_mem_target[352]    if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
> guest_physmap_mark_populate_on_demand[1404] if ( has_arch_pdevs(d) || cache_flush_permitted(d) )
> 
> *** xen/arch/x86/mm/paging.c:
> paging_log_dirty_enable[208]   if ( has_arch_pdevs(d) )
> 
> *** xen/drivers/passthrough/vtd/iommu.c:
> reassign_device_ownership[2773] if ( !has_arch_pdevs(target) )
> reassign_device_ownership[2807] if ( !has_arch_pdevs(target) )
> reassign_device_ownership[2825] if ( !has_arch_pdevs(source) )
> 
> 
> has_arch_pdevs() bothers me most, actually, because it is not always
> obvious how to add locking for the callers. I am planning to rework it
> in the following way:

Locking is only one aspect. As above, another is whether the function
might wrongly return "false" when you prematurely empty the list of
devices.

> #define has_arch_pdevs_unlocked(d) (!list_empty(&(d)->pdev_list))
> 
> static inline bool has_arch_pdevs(struct domain *d)
> {
>     bool ret;
> 
>     read_lock(&d->pci_lock);
>     ret = has_arch_pdevs_unlocked(d);
>     read_unlock(&d->pci_lock);
> 
>     return ret;
> }

No, this follows a pattern that earlier was said isn't acceptable:
The result of such a check is meaningless to the caller without holding
the lock past actually consuming the result. It simply is stale by the
time you return to the caller. (There may be special cases where other
constraints eliminate the concern, like maybe during domain
construction or domain cleanup, but such need carefully considering in
each individual case. Which in particular means there shouldn't be any
common-use helper functions doing what is unsafe in the general case.)
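
As a sketch of the distinction (shape of the caller assumed): the check
is only meaningful while the lock is still held, i.e.

    read_lock(&d->pci_lock);
    if ( !list_empty(&d->pdev_list) )
    {
        /*
         * Whatever depends on "the domain has devices" must happen
         * here; once the lock is dropped the answer may be stale.
         */
    }
    read_unlock(&d->pci_lock);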

Jan
Volodymyr Babchuk July 9, 2023, 10:41 p.m. UTC | #16
Hi Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>> I am currently implementing your proposal (along with Jan's
>> suggestions), but I am facing ABBA deadlock with IOMMU's
>> reassign_device() call, which has this piece of code:
>> 
>>         list_move(&pdev->domain_list, &target->pdev_list);
>> 
>> My immediate change was:
>> 
>>         write_lock(&pdev->domain->pci_lock);
>>         list_del(&pdev->domain_list);
>>         write_unlock(&pdev->domain->pci_lock);
>> 
>>         write_lock(&target->pci_lock);
>>         list_add(&pdev->domain_list, &target->pdev_list);
>>         write_unlock(&target->pci_lock);
>> 
>> But this will not work because reassign_device is called from
>> pci_release_devices() which iterates over d->pdev_list, so we need to
>> take a d->pci_lock early.
>> 
>> Any suggestions on how to fix this? My idea is to remove a device from a
>> list one at a time:
>> 
>> int pci_release_devices(struct domain *d)
>> {
>>     struct pci_dev *pdev;
>>     u8 bus, devfn;
>>     int ret;
>> 
>>     pcidevs_lock();
>>     write_lock(&d->pci_lock);
>>     ret = arch_pci_clean_pirqs(d);
>>     if ( ret )
>>     {
>>         pcidevs_unlock();
>>         write_unlock(&d->pci_lock);
>>         return ret;
>>     }
>> 
>>     while ( !list_empty(&d->pdev_list) )
>>     {
>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>         bus = pdev->bus;
>>         devfn = pdev->devfn;
>>         list_del(&pdev->domain_list);
>>         write_unlock(&d->pci_lock);
>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>         write_lock(&d->pci_lock);
>
> I think it needs doing almost like this, but with two more tweaks and
> no list_del() right here (first and foremost to avoid needing to
> figure whether removing early isn't going to subtly break anything;
> see below for an error case that would end up with changed behavior):
>
>     while ( !list_empty(&d->pdev_list) )
>     {
>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>         uint16_t seg = pdev->seg;
>         uint8_t bus = pdev->bus;
>         uint8_t devfn = pdev->devfn;
>
>         write_unlock(&d->pci_lock);
>         ret = deassign_device(d, seg, bus, devfn) ?: ret;
>         write_lock(&d->pci_lock);
>     }
>
> One caveat though: The original list_for_each_entry_safe() guarantees
> the loop to complete; your use of list_del() would guarantee that too,
> but then the device wouldn't be on the list anymore if deassign_device()
> failed. Therefore I guess you will need another local list where you
> (temporarily) put all the devices which deassign_device() left on the
> list, and which you would then move back to d->pdev_list after the loop
> has finished.

Okay, taking into account all that you wrote, I came up with this
implementation:

int pci_release_devices(struct domain *d)
{
    int combined_ret;
    LIST_HEAD(failed_pdevs);

    pcidevs_lock();
    write_lock(&d->pci_lock);
    combined_ret = arch_pci_clean_pirqs(d);
    if ( combined_ret )
    {
        pcidevs_unlock();
        write_unlock(&d->pci_lock);
        return combined_ret;
    }

    while ( !list_empty(&d->pdev_list) )
    {
        struct pci_dev *pdev = list_first_entry(&d->pdev_list,
                                                struct pci_dev,
                                                domain_list);
        uint16_t seg = pdev->seg;
        uint8_t bus = pdev->bus;
        uint8_t devfn = pdev->devfn;
        int ret;

        write_unlock(&d->pci_lock);
        ret = deassign_device(d, seg, bus, devfn);
        write_lock(&d->pci_lock);
        if ( ret )
        {
            bool still_present = false;
            const struct pci_dev *tmp;

            for_each_pdev ( d, tmp )
            {
                if ( tmp == (const struct pci_dev*)pdev )
                {
                    still_present = true;
                    break;
                }
            }
            if ( still_present )
                list_move(&pdev->domain_list, &failed_pdevs);
            combined_ret = ret;
        }
    }

    list_splice(&failed_pdevs, &d->pdev_list);
    write_unlock(&d->pci_lock);
    pcidevs_unlock();

    return combined_ret;
}


> (Whether it is sufficient to inspect the list head to
> determine whether the pdev is still on the list needs careful checking.)

I am not sure that this is enough. We dropped d->pci_lock, so we can
expect that the list was permuted in a random way.
Jan Beulich July 10, 2023, 6:20 a.m. UTC | #17
On 10.07.2023 00:41, Volodymyr Babchuk wrote:
> 
> Hi Jan,
> 
> Jan Beulich <jbeulich@suse.com> writes:
> 
>> On 04.07.2023 23:03, Volodymyr Babchuk wrote:
>>> I am currently implementing your proposal (along with Jan's
>>> suggestions), but I am facing ABBA deadlock with IOMMU's
>>> reassign_device() call, which has this piece of code:
>>>
>>>         list_move(&pdev->domain_list, &target->pdev_list);
>>>
>>> My immediate change was:
>>>
>>>         write_lock(&pdev->domain->pci_lock);
>>>         list_del(&pdev->domain_list);
>>>         write_unlock(&pdev->domain->pci_lock);
>>>
>>>         write_lock(&target->pci_lock);
>>>         list_add(&pdev->domain_list, &target->pdev_list);
>>>         write_unlock(&target->pci_lock);
>>>
>>> But this will not work because reassign_device is called from
>>> pci_release_devices() which iterates over d->pdev_list, so we need to
>>> take a d->pci_lock early.
>>>
>>> Any suggestions on how to fix this? My idea is to remove a device from a
>>> list one at a time:
>>>
>>> int pci_release_devices(struct domain *d)
>>> {
>>>     struct pci_dev *pdev;
>>>     u8 bus, devfn;
>>>     int ret;
>>>
>>>     pcidevs_lock();
>>>     write_lock(&d->pci_lock);
>>>     ret = arch_pci_clean_pirqs(d);
>>>     if ( ret )
>>>     {
>>>         pcidevs_unlock();
>>>         write_unlock(&d->pci_lock);
>>>         return ret;
>>>     }
>>>
>>>     while ( !list_empty(&d->pdev_list) )
>>>     {
>>>         pdev = list_entry(&d->pdev_list, struct pci_dev, domain_list);
>>>         bus = pdev->bus;
>>>         devfn = pdev->devfn;
>>>         list_del(&pdev->domain_list);
>>>         write_unlock(&d->pci_lock);
>>>         ret = deassign_device(d, pdev->seg, bus, devfn) ?: ret;
>>>         write_lock(&d->pci_lock);
>>
>> I think it needs doing almost like this, but with two more tweaks and
>> no list_del() right here (first and foremost to avoid needing to
>> figure whether removing early isn't going to subtly break anything;
>> see below for an error case that would end up with changed behavior):
>>
>>     while ( !list_empty(&d->pdev_list) )
>>     {
>>         const struct pci_dev *pdev = list_first_entry(&d->pdev_list, struct pci_dev, domain_list);
>>         uint16_t seg = pdev->seg;
>>         uint8_t bus = pdev->bus;
>>         uint8_t devfn = pdev->devfn;
>>
>>         write_unlock(&d->pci_lock);
>>         ret = deassign_device(d, seg, bus, devfn) ?: ret;
>>         write_lock(&d->pci_lock);
>>     }
>>
>> One caveat though: The original list_for_each_entry_safe() guarantees
>> the loop to complete; your use of list_del() would guarantee that too,
>> but then the device wouldn't be on the list anymore if deassign_device()
>> failed. Therefore I guess you will need another local list where you
>> (temporarily) put all the devices which deassign_device() left on the
>> list, and which you would then move back to d->pdev_list after the loop
>> has finished.
> 
> Okay, taking into account all that you wrote, I came up with this
> implementation:

Looks plausible at first glance, but will of course need looking at
in more detail in the context of the full patch. Just one (largely)
cosmetic remark right away:

> int pci_release_devices(struct domain *d)
> {
>     int combined_ret;
>     LIST_HEAD(failed_pdevs);
> 
>     pcidevs_lock();
>     write_lock(&d->pci_lock);
>     combined_ret = arch_pci_clean_pirqs(d);
>     if ( combined_ret )
>     {
>         pcidevs_unlock();
>         write_unlock(&d->pci_lock);
>         return combined_ret;
>     }
> 
>     while ( !list_empty(&d->pdev_list) )
>     {
>         struct pci_dev *pdev = list_first_entry(&d->pdev_list,
>                                                 struct pci_dev,
>                                                 domain_list);
>         uint16_t seg = pdev->seg;
>         uint8_t bus = pdev->bus;
>         uint8_t devfn = pdev->devfn;
>         int ret;
> 
>         write_unlock(&d->pci_lock);
>         ret = deassign_device(d, seg, bus, devfn);
>         write_lock(&d->pci_lock);
>         if ( ret )
>         {
>             bool still_present = false;
>             const struct pci_dev *tmp;
> 
>             for_each_pdev ( d, tmp )
>             {
>                 if ( tmp == (const struct pci_dev*)pdev )

As I'm sure I expressed earlier, casts should be avoided whenever
possible. I see no reason for one here: Comparing pointer-to-
const-<type> with pointer-to-<type> is permitted by the language.
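
I.e. simply (sketch):

    for_each_pdev ( d, tmp )
        if ( tmp == pdev ) /* struct pci_dev * converts implicitly */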

>                 {
>                     still_present = true;
>                     break;
>                 }
>             }
>             if ( still_present )
>                 list_move(&pdev->domain_list, &failed_pdevs);
>             combined_ret = ret;
>         }
>     }
> 
>     list_splice(&failed_pdevs, &d->pdev_list);
>     write_unlock(&d->pci_lock);
>     pcidevs_unlock();
> 
>     return combined_ret;
> }
> 
> 
>> (Whether it is sufficient to inspect the list head to
>> determine whether the pdev is still on the list needs careful checking.)
> 
> I am not sure that this is enough. We dropped d->pci_lock, so we can
> expect that the list was permuted in a random way.

Right, if that cannot be excluded for sure, then better stay on the safe
side for now. A code comment would be nice though ahead of the
for_each_pdev().
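
Perhaps something along these lines (wording assumed):

    /*
     * deassign_device() ran with d->pci_lock dropped, so d->pdev_list
     * may have changed under our feet; confirm pdev is still on it
     * before moving the entry to the failed list.
     */
    for_each_pdev ( d, tmp )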

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 3cd4923060..f188450b1b 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -895,6 +895,9 @@  int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->vpci_rwlock));
+    ASSERT(pcidevs_locked());
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +916,11 @@  int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            pcidevs_unlock();
+            read_unlock(&pdev->domain->vpci_rwlock);
             process_pending_softirqs();
+            read_lock(&pdev->domain->vpci_rwlock);
+            pcidevs_lock();
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 6a440590fe..a4640acb62 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -622,6 +622,9 @@  struct domain *domain_create(domid_t domid,
 
 #ifdef CONFIG_HAS_PCI
     INIT_LIST_HEAD(&d->pdev_list);
+#ifdef CONFIG_HAS_VPCI
+    rwlock_init(&d->vpci_rwlock);
+#endif
 #endif
 
     /* All error paths can depend on the above setup. */
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 07d1986d33..0afcb59af0 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -57,6 +57,11 @@  void pcidevs_lock(void)
     spin_lock_recursive(&_pcidevs_lock);
 }
 
+int pcidevs_trylock(void)
+{
+    return spin_trylock_recursive(&_pcidevs_lock);
+}
+
 void pcidevs_unlock(void)
 {
     spin_unlock_recursive(&_pcidevs_lock);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index ec2e978a4e..23b5223adf 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -152,12 +152,14 @@  bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        read_lock(&v->domain->vpci_rwlock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
                         rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
                         !rc && v->vpci.rom_only);
         spin_unlock(&v->vpci.pdev->vpci->lock);
+        read_unlock(&v->domain->vpci_rwlock);
 
         rangeset_destroy(v->vpci.mem);
         v->vpci.mem = NULL;
@@ -181,8 +183,20 @@  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_write_locked(&d->vpci_rwlock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        write_unlock(&d->vpci_rwlock);
         process_pending_softirqs();
+        write_lock(&d->vpci_rwlock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -213,6 +227,7 @@  static void defer_map(struct domain *d, struct pci_dev *pdev,
     raise_softirq(SCHEDULE_SOFTIRQ);
 }
 
+/* Must be called with the domain's vpci_rwlock held in write mode. */
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
@@ -287,6 +302,7 @@  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
      * Check for overlaps with other BARs. Note that only BARs that are
      * currently mapped (enabled) are checked for overlaps.
      */
+    pcidevs_lock();
     for_each_pdev ( pdev->domain, tmp )
     {
         if ( tmp == pdev )
@@ -326,10 +342,12 @@  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
                 printk(XENLOG_G_WARNING "Failed to remove [%lx, %lx]: %d\n",
                        start, end, rc);
                 rangeset_destroy(mem);
+                pcidevs_unlock();
                 return rc;
             }
         }
     }
+    pcidevs_unlock();
 
     ASSERT(dev);
 
@@ -482,6 +500,8 @@  static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
@@ -570,6 +590,8 @@  static int cf_check init_bars(struct pci_dev *pdev)
         }
     }
 
+    ASSERT(!header->rom_reg);
+
     /* Check expansion ROM. */
     rc = pci_size_mem_bar(pdev->sbdf, rom_reg, &addr, &size, PCI_BAR_ROM);
     if ( rc > 0 && size )
@@ -586,12 +608,42 @@  static int cf_check init_bars(struct pci_dev *pdev)
                                4, rom);
         if ( rc )
             rom->type = VPCI_BAR_EMPTY;
+
+        header->rom_reg = rom_reg;
     }
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
 }
 REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);
 
+static bool overlap(unsigned int r1_offset, unsigned int r1_size,
+                    unsigned int r2_offset, unsigned int r2_size)
+{
+    /* Return true if there is an overlap. */
+    return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size;
+}
+
+bool vpci_header_need_write_lock(const struct pci_dev *pdev,
+                                 unsigned int start, unsigned int size)
+{
+    /*
+     * Writing the command register and ROM BAR register may trigger
+     * modify_bars to run, which in turn may access multiple pdevs while
+     * checking for the existing BAR's overlap. The overlapping check, if done
+     * under the read lock, requires vpci->lock to be acquired on both devices
+     * being compared, which may produce a deadlock. At the same time it is not
+     * possible to upgrade read lock to write lock in such a case.
+     * Check which registers are going to be written and return true if lock
+     * needs to be acquired in write mode.
+     */
+    if ( overlap(start, size, PCI_COMMAND, 2) ||
+         (pdev->vpci->header.rom_reg &&
+          overlap(start, size, pdev->vpci->header.rom_reg, 4)) )
+        return true;
+
+    return false;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 8f2b59e61a..e7d1f916a0 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -190,6 +190,8 @@  static int cf_check init_msi(struct pci_dev *pdev)
     uint16_t control;
     int ret;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     if ( !pos )
         return 0;
 
@@ -265,7 +267,7 @@  REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -277,6 +279,15 @@  void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->vpci_rwlock) )
+            continue;
+
+        if ( !pcidevs_trylock() )
+        {
+            read_unlock(&d->vpci_rwlock);
+            continue;
+        }
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -318,14 +329,22 @@  void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+            pcidevs_unlock();
+            read_unlock(&d->vpci_rwlock);
+
             process_pending_softirqs();
+
+            read_lock(&d->vpci_rwlock);
+            pcidevs_lock();
         }
+        pcidevs_unlock();
+        read_unlock(&d->vpci_rwlock);
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index 25bde77586..b5a7dfdf9c 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -143,6 +143,7 @@  static void cf_check control_write(
         pci_conf_write16(pdev->sbdf, reg, val);
 }
 
+/* Must be called with the domain's vpci_rwlock held in at least read mode. */
 static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
@@ -163,7 +164,13 @@  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->vpci_rwlock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->vpci_rwlock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +365,34 @@  static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0ul;
 
+    read_lock(&d->vpci_rwlock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+        read_unlock(&d->vpci_rwlock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +424,7 @@  static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +512,32 @@  static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->vpci_rwlock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+        read_unlock(&d->vpci_rwlock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +613,7 @@  static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     return X86EMUL_OKAY;
 }
@@ -665,6 +700,8 @@  static int cf_check init_msix(struct pci_dev *pdev)
     struct vpci_msix *msix;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
                                       PCI_CAP_ID_MSIX);
     if ( !msix_offset )
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 652807a4a4..1270174e78 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,20 +38,32 @@  extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
-    if ( !has_vpci(pdev->domain) || !pdev->vpci )
+    struct vpci *vpci;
+
+    if ( !has_vpci(pdev->domain) )
         return;
 
-    spin_lock(&pdev->vpci->lock);
+    write_lock(&pdev->domain->vpci_rwlock);
+    if ( !pdev->vpci )
+    {
+        write_unlock(&pdev->domain->vpci_rwlock);
+        return;
+    }
+
+    vpci = pdev->vpci;
+    pdev->vpci = NULL;
+    write_unlock(&pdev->domain->vpci_rwlock);
+
-    while ( !list_empty(&pdev->vpci->handlers) )
+    while ( !list_empty(&vpci->handlers) )
     {
-        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
+        struct vpci_register *r = list_first_entry(&vpci->handlers,
                                                    struct vpci_register,
                                                    node);
 
         list_del(&r->node);
         xfree(r);
     }
-    spin_unlock(&pdev->vpci->lock);
+
-    if ( pdev->vpci->msix )
+    if ( vpci->msix )
     {
         unsigned int i;
@@ -61,29 +73,33 @@  void vpci_remove_device(struct pci_dev *pdev)
-            if ( pdev->vpci->msix->table[i] )
-                iounmap(pdev->vpci->msix->table[i]);
+            if ( vpci->msix->table[i] )
+                iounmap(vpci->msix->table[i]);
     }
-    xfree(pdev->vpci->msix);
-    xfree(pdev->vpci->msi);
-    xfree(pdev->vpci);
-    pdev->vpci = NULL;
+    xfree(vpci->msix);
+    xfree(vpci->msi);
+    xfree(vpci);
 }
 
 int vpci_add_handlers(struct pci_dev *pdev)
 {
+    struct vpci *vpci;
     unsigned int i;
     int rc = 0;
 
     if ( !has_vpci(pdev->domain) )
         return 0;
 
+    vpci = xzalloc(struct vpci);
+    if ( !vpci )
+        return -ENOMEM;
+
+    INIT_LIST_HEAD(&vpci->handlers);
+    spin_lock_init(&vpci->lock);
+
+    write_lock(&pdev->domain->vpci_rwlock);
+
     /* We should not get here twice for the same device. */
     ASSERT(!pdev->vpci);
 
-    pdev->vpci = xzalloc(struct vpci);
-    if ( !pdev->vpci )
-        return -ENOMEM;
-
-    INIT_LIST_HEAD(&pdev->vpci->handlers);
-    spin_lock_init(&pdev->vpci->lock);
+    pdev->vpci = vpci;
 
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
@@ -91,6 +107,7 @@  int vpci_add_handlers(struct pci_dev *pdev)
         if ( rc )
             break;
     }
+    write_unlock(&pdev->domain->vpci_rwlock);
 
     if ( rc )
         vpci_remove_device(pdev);
@@ -139,6 +156,7 @@  uint32_t cf_check vpci_hw_read32(
     return pci_conf_read32(pdev->sbdf, reg);
 }
 
+/* Must be called with the domain's vpci_rwlock held in write mode. */
 int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                       vpci_write_t *write_handler, unsigned int offset,
                       unsigned int size, void *data)
@@ -162,8 +180,6 @@  int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
     r->offset = offset;
     r->private = data;
 
-    spin_lock(&vpci->lock);
-
     /* The list of handlers must be kept sorted at all times. */
     list_for_each ( prev, &vpci->handlers )
     {
@@ -175,25 +191,23 @@  int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
             break;
         if ( cmp == 0 )
         {
-            spin_unlock(&vpci->lock);
             xfree(r);
             return -EEXIST;
         }
     }
 
     list_add_tail(&r->node, prev);
-    spin_unlock(&vpci->lock);
 
     return 0;
 }
 
+/* Must be called with the domain's vpci_rwlock held in write mode. */
 int vpci_remove_register(struct vpci *vpci, unsigned int offset,
                          unsigned int size)
 {
     const struct vpci_register r = { .offset = offset, .size = size };
     struct vpci_register *rm;
 
-    spin_lock(&vpci->lock);
     list_for_each_entry ( rm, &vpci->handlers, node )
     {
         int cmp = vpci_register_cmp(&r, rm);
@@ -205,14 +219,12 @@  int vpci_remove_register(struct vpci *vpci, unsigned int offset,
         if ( !cmp && rm->offset == offset && rm->size == size )
         {
             list_del(&rm->node);
-            spin_unlock(&vpci->lock);
             xfree(rm);
             return 0;
         }
         if ( cmp <= 0 )
             break;
     }
-    spin_unlock(&vpci->lock);
 
     return -ENOENT;
 }
@@ -320,7 +332,7 @@  static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -333,10 +345,18 @@  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     }
 
     /* Find the PCI dev matching the address. */
+    pcidevs_lock();
     pdev = pci_get_pdev(d, sbdf);
-    if ( !pdev || !pdev->vpci )
+    pcidevs_unlock();
+    if ( !pdev )
         return vpci_read_hw(sbdf, reg, size);
 
+    read_lock(&d->vpci_rwlock);
+    if ( !pdev->vpci )
+    {
+        read_unlock(&d->vpci_rwlock);
+        return vpci_read_hw(sbdf, reg, size);
+    }
     spin_lock(&pdev->vpci->lock);
 
     /* Read from the hardware or the emulated register handlers. */
@@ -381,6 +401,7 @@  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     if ( data_offset < size )
     {
@@ -423,11 +444,12 @@  static void vpci_write_helper(const struct pci_dev *pdev,
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
+    bool write_locked = false;
 
     if ( !size )
     {
@@ -443,14 +465,38 @@  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
      * Find the PCI dev matching the address.
      * Passthrough everything that's not trapped.
      */
+    pcidevs_lock();
     pdev = pci_get_pdev(d, sbdf);
-    if ( !pdev || !pdev->vpci )
+    pcidevs_unlock();
+    if ( !pdev )
     {
         vpci_write_hw(sbdf, reg, size, data);
         return;
     }
 
-    spin_lock(&pdev->vpci->lock);
+    if ( vpci_header_need_write_lock(pdev, reg, size) )
+    {
+        /* Gain exclusive access to all of the domain pdevs vpci. */
+        write_lock(&d->vpci_rwlock);
+        if ( !pdev->vpci )
+        {
+            write_unlock(&d->vpci_rwlock);
+            vpci_write_hw(sbdf, reg, size, data);
+            return;
+        }
+        write_locked = true;
+    }
+    else
+    {
+        read_lock(&d->vpci_rwlock);
+        if ( !pdev->vpci )
+        {
+            read_unlock(&d->vpci_rwlock);
+            vpci_write_hw(sbdf, reg, size, data);
+            return;
+        }
+        spin_lock(&pdev->vpci->lock);
+    }
 
     /* Write the value to the hardware or emulated registers. */
     list_for_each_entry ( r, &pdev->vpci->handlers, node )
@@ -485,7 +531,14 @@  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
             break;
         ASSERT(data_offset < size);
     }
-    spin_unlock(&pdev->vpci->lock);
+
+    if ( write_locked )
+        write_unlock(&d->vpci_rwlock);
+    else
+    {
+        spin_unlock(&pdev->vpci->lock);
+        read_unlock(&d->vpci_rwlock);
+    }
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 5975ca2f30..4512910dca 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -157,6 +157,7 @@  struct pci_dev {
  */
 
 void pcidevs_lock(void);
+int pcidevs_trylock(void);
 void pcidevs_unlock(void);
 bool_t __must_check pcidevs_locked(void);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 85242a73d3..78227b7a1d 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -460,6 +460,9 @@  struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+#ifdef CONFIG_HAS_VPCI
+    rwlock_t vpci_rwlock;
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 0b8a2a3c74..3a7fcc4463 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -57,6 +57,9 @@  uint32_t cf_check vpci_hw_read32(
  */
 bool __must_check vpci_process_pending(struct vcpu *v);
 
+bool vpci_header_need_write_lock(const struct pci_dev *pdev,
+                                 unsigned int start, unsigned int size);
+
 struct vpci {
     /* List of vPCI handlers for a device. */
     struct list_head handlers;
@@ -83,6 +86,9 @@  struct vpci {
         } bars[PCI_HEADER_NORMAL_NR_BARS + 1];
         /* At most 6 BARS + 1 expansion ROM BAR. */
 
+        /* Offset to the ROM BAR register if any. */
+        unsigned int rom_reg;
+
         /*
          * Store whether the ROM enable bit is set (doesn't imply ROM BAR
          * is mapped into guest p2m) if there's a ROM BAR on the device.