[v9,02/16] vpci: use per-domain PCI lock to protect vpci structure

Message ID 20230829231912.4091958-3-volodymyr_babchuk@epam.com (mailing list archive)
State Superseded
Series PCI devices passthrough on Arm, part 3

Commit Message

Volodymyr Babchuk Aug. 29, 2023, 11:19 p.m. UTC
From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Use the previously introduced per-domain read/write lock to check
whether vpci is present, so we can be sure there are no accesses to
the contents of the vpci struct if it is not. This lock can be used
(and in a few cases is used right away) so that vpci removal can be
performed while holding the lock in write mode. Previously such
removal could race with vpci_read, for example.

When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock, to
avoid possible deadlocks.
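
For illustration, the intended nesting looks like this (sketch only,
surrounding context omitted):

    read_lock(&d->pci_lock);          /* 1st: per-domain PCI lock */
    if ( pdev->vpci )                 /* removal needs the write lock, so
                                       * vpci cannot vanish under us */
    {
        spin_lock(&pdev->vpci->lock); /* 2nd: per-device vPCI lock */
        /* ... access pdev->vpci contents ... */
        spin_unlock(&pdev->vpci->lock);
    }
    read_unlock(&d->pci_lock);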

1. The per-domain pci_lock is used to protect the pdev->vpci
structure from being removed.

2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for overlaps with existing BARs. The overlap check, if done
under the read lock, would require vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade a read lock to a write lock in such a case. So,
in order to prevent the deadlock, use d->pci_lock instead. To prevent
deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
always lock hwdom first.
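
For the hwdom case this is the pattern used by vpci_write() below
(sketch):

    write_lock(&d->pci_lock);
    /* dom_xen->pci_lock must always be taken second. */
    if ( is_hardware_domain(d) )
        write_lock(&dom_xen->pci_lock);
    /* ... modify_bars() may walk the pdevs of both domains ... */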

All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock.

3. Drop the const qualifier where the new rwlock is used and this is
appropriate.

4. Do not call process_pending_softirqs with any locks held. For
that, unlock prior to the call and re-acquire the locks afterwards.
After re-acquiring the lock there is no need to check whether
pdev->vpci exists:
 - in apply_map, because of the context in which it is called (no race
   condition possible)
 - for MSI/MSI-X debug code because it is called at the end of
   pdev->vpci access and no further access to pdev->vpci is made
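
The resulting drop/re-acquire pattern (as used in apply_map() below)
is:

    while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
    {
        read_unlock(&d->pci_lock);
        process_pending_softirqs();
        read_lock(&d->pci_lock);
    }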

5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.

There is a possible lock inversion in MSI code, as some parts of it
acquire pcidevs_lock() while already holding d->pci_lock.
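
Sketched, the two conflicting orders are:

    vPCI->vMSI path:                 device add/remove path:
        read_lock(&d->pci_lock);         pcidevs_lock();
        pcidevs_lock();                  write_lock(&d->pci_lock);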

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---
Changes in v9:
 - extended locked region to protect vpci_remove_device and
   vpci_add_handlers() calls
 - vpci_write() takes lock in the write mode to protect
   potential call to modify_bars()
 - renamed lock releasing function
 - removed ASSERT()s from msi code
 - added trylock in vpci_dump_msi

Changes in v8:
 - changed d->vpci_lock to d->pci_lock
 - introduced d->pci_lock in a separate patch
 - extended locked region in vpci_process_pending
 - removed pcidevs_lock() in vpci_dump_msi()
 - removed some changes as they are not needed with
   the new locking scheme
 - added handling for hwdom && dom_xen case
---
 xen/arch/x86/hvm/vmsi.c       | 24 ++++++++--------
 xen/arch/x86/hvm/vmx/vmx.c    |  2 --
 xen/arch/x86/irq.c            | 15 +++++++---
 xen/arch/x86/msi.c            |  8 ++----
 xen/drivers/passthrough/pci.c |  7 +++--
 xen/drivers/vpci/header.c     | 18 ++++++++++++
 xen/drivers/vpci/msi.c        | 22 +++++++++++++--
 xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
 xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
 9 files changed, 154 insertions(+), 40 deletions(-)

Comments

Roger Pau Monné Sept. 19, 2023, 3:39 p.m. UTC | #1
On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Use a previously introduced per-domain read/write lock to check
> whether vpci is present, so we are sure there are no accesses to the
> contents of the vpci struct if not. This lock can be used (and in a
> few cases is used right away) so that vpci removal can be performed
> while holding the lock in write mode. Previously such removal could
> race with vpci_read for example.
> 
> When taking both d->pci_lock and pdev->vpci->lock they are should be

When taking both d->pci_lock and pdev->vpci->lock the order should be
...

> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
> possible deadlock situations.
> 
> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead. To prevent
> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
> always lock hwdom first.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
> 
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
> 
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made
> 
> 5. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> There is a possible lock inversion in MSI code, as some parts of it
> acquire pcidevs_lock() while already holding d->pci_lock.

Those would as a minimum need to be pointed out with TODO comments of
some kind in order to be aware of them.

> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> Changes in v9:
>  - extended locked region to protect vpci_remove_device and
>    vpci_add_handlers() calls
>  - vpci_write() takes lock in the write mode to protect
>    potential call to modify_bars()
>  - renamed lock releasing function
>  - removed ASSERT()s from msi code
>  - added trylock in vpci_dump_msi
> 
> Changes in v8:
>  - changed d->vpci_lock to d->pci_lock
>  - introducing d->pci_lock in a separate patch
>  - extended locked region in vpci_process_pending
>  - removed pcidevs_lockis vpci_dump_msi()
>  - removed some changes as they are not needed with
>    the new locking scheme
>  - added handling for hwdom && dom_xen case
> ---
>  xen/arch/x86/hvm/vmsi.c       | 24 ++++++++--------
>  xen/arch/x86/hvm/vmx/vmx.c    |  2 --
>  xen/arch/x86/irq.c            | 15 +++++++---
>  xen/arch/x86/msi.c            |  8 ++----
>  xen/drivers/passthrough/pci.c |  7 +++--
>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>  xen/drivers/vpci/msi.c        | 22 +++++++++++++--
>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>  xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
>  9 files changed, 154 insertions(+), 40 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 128f236362..fde76cc6b4 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
>      struct msixtbl_entry *entry, *new_entry;
>      int r = -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
>      struct pci_dev *pdev;
>      struct msixtbl_entry *entry;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      if ( !msixtbl_initialised(d) )
> @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
>  {
>      unsigned int i;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
>      if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
>      {
> @@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>      int rc;
>  
>      ASSERT(msi->arch.pirq != INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    pcidevs_lock();
>      for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq unbind = {
> @@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
>  
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
>                                         msi->vectors, msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  }
>  
>  static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
> @@ -778,15 +777,13 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
>      int rc;
>  
>      ASSERT(msi->arch.pirq == INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>      rc = vpci_msi_enable(pdev, vectors, 0);
>      if ( rc < 0 )
>          return rc;
>      msi->arch.pirq = rc;
> -
> -    pcidevs_lock();
>      msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
>                                         msi->arch.pirq, msi->mask);
> -    pcidevs_unlock();
>  
>      return 0;
>  }
> @@ -797,8 +794,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      unsigned int i;
>  
>      ASSERT(pirq != INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>  
> -    pcidevs_lock();
>      for ( i = 0; i < nr && bound; i++ )
>      {
>          struct xen_domctl_bind_pt_irq bind = {
> @@ -814,7 +811,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
>      write_lock(&pdev->domain->event_lock);
>      unmap_domain_pirq(pdev->domain, pirq);
>      write_unlock(&pdev->domain->event_lock);
> -    pcidevs_unlock();
>  }
>  
>  void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
> @@ -854,6 +850,8 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>      int rc;
>  
>      ASSERT(entry->arch.pirq == INVALID_PIRQ);
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> +
>      rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
>                           table_base);
>      if ( rc < 0 )
> @@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>  
>      entry->arch.pirq = rc;
>  
> -    pcidevs_lock();
>      rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
>                           entry->masked);
>      if ( rc )
> @@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
>          vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
>          entry->arch.pirq = INVALID_PIRQ;
>      }
> -    pcidevs_unlock();
>  
>      return rc;
>  }
> @@ -895,6 +891,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>  
> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +911,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            read_unlock(&pdev->domain->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->pci_lock);
>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 1edc7f1e91..545a27796e 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>  
>      spin_unlock_irq(&desc->lock);
>  
> -    ASSERT(pcidevs_locked());
> -

Hm, this removal seems dubious, same with some of the removal below.
And I don't see any comment in the log message as to why removing the
asserts here and in __pci_enable_msi{,x}(), pci_prepare_msix() is
safe.

>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>  
>   unlock_out:
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index 6abfd81621..cb99ae5392 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2157,7 +2157,7 @@ int map_domain_pirq(
>          struct pci_dev *pdev;
>          unsigned int nr = 0;
>  
> -        ASSERT(pcidevs_locked());
> +        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>  
>          ret = -ENODEV;
>          if ( !cpu_has_apic )
> @@ -2314,7 +2314,7 @@ int unmap_domain_pirq(struct domain *d, int pirq)
>      if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
>          return -EINVAL;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
>      ASSERT(rw_is_write_locked(&d->event_lock));
>  
>      info = pirq_info(d, pirq);
> @@ -2908,7 +2908,13 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>      msi->irq = irq;
>  
> -    pcidevs_lock();
> +    /*
> +     * If we are called via vPCI->vMSI path, we already are holding
> +     * d->pci_lock so there is no need to take pcidevs_lock, as it
> +     * will cause lock inversion.
> +     */
> +    if ( !rw_is_locked(&d->pci_lock) )
> +        pcidevs_lock();

This is not a safe expression to use, rw_is_locked() just returns
whether the lock is taken, but not if it's taken by the current CPU.
This is fine to use in assertions and debug code, but not in order to
take lock ordering decisions I'm afraid.

You will likely need to move the locking to the callers of the
function.
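
To illustrate the hazard (sketch): with another CPU merely holding
d->pci_lock for reading, this CPU skips pcidevs_lock() while itself
holding neither lock:

    /* CPU1, holding no locks, while CPU0 holds d->pci_lock: */
    if ( !rw_is_locked(&d->pci_lock) )  /* false: CPU0 holds it */
        pcidevs_lock();                 /* skipped */
    /* ... proceeds without pcidevs_lock or d->pci_lock held ... */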

>      /* Verify or get pirq. */
>      write_lock(&d->event_lock);
>      pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
> @@ -2924,7 +2930,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>  
>   done:
>      write_unlock(&d->event_lock);
> -    pcidevs_unlock();
> +    if ( !rw_is_locked(&d->pci_lock) )
> +        pcidevs_unlock();
>      if ( ret )
>      {
>          switch ( type )
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index d0bf63df1d..ba2963b7d2 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -613,7 +613,7 @@ static int msi_capability_init(struct pci_dev *dev,
>      u8 slot = PCI_SLOT(dev->devfn);
>      u8 func = PCI_FUNC(dev->devfn);
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
>      if ( !pos )
>          return -ENODEV;
> @@ -783,7 +783,7 @@ static int msix_capability_init(struct pci_dev *dev,
>      if ( !pos )
>          return -ENODEV;
>  
> -    ASSERT(pcidevs_locked());
> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>  
>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>      /*
> @@ -1000,7 +1000,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>      struct pci_dev *pdev;
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(NULL, msi->sbdf);
>      if ( !pdev )
>          return -ENODEV;
> @@ -1055,7 +1054,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>      struct pci_dev *pdev;
>      struct msi_desc *old_desc;
>  
> -    ASSERT(pcidevs_locked());
>      pdev = pci_get_pdev(NULL, msi->sbdf);
>      if ( !pdev || !pdev->msix )
>          return -ENODEV;
> @@ -1170,8 +1168,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>   */
>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>  {
> -    ASSERT(pcidevs_locked());
> -
>      if ( !use_msi )
>          return -EPERM;
>  
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 79ca928672..4f18293900 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -752,7 +752,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          pdev->domain = hardware_domain;
>          write_lock(&hardware_domain->pci_lock);
>          list_add(&pdev->domain_list, &hardware_domain->pdev_list);
> -        write_unlock(&hardware_domain->pci_lock);
>  
>          /*
>           * For devices not discovered by Xen during boot, add vPCI handlers
> @@ -762,17 +761,17 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
>          if ( ret )
>          {
>              printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);

You likely want to move the printk after the unlock now.

> -            write_lock(&hardware_domain->pci_lock);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
>              goto out;
>          }
> +        write_unlock(&hardware_domain->pci_lock);
>          ret = iommu_add_device(pdev);
>          if ( ret )
>          {
> -            vpci_remove_device(pdev);
>              write_lock(&hardware_domain->pci_lock);
> +            vpci_remove_device(pdev);
>              list_del(&pdev->domain_list);
>              write_unlock(&hardware_domain->pci_lock);
>              pdev->domain = NULL;
> @@ -1147,7 +1146,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>  
> +    write_lock(&ctxt->d->pci_lock);
>      err = vpci_add_handlers(pdev);
> +    write_unlock(&ctxt->d->pci_lock);
>      if ( err )
>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>                 ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 60f7049e34..177a6b57a5 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -172,6 +172,7 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>  
> +        write_lock(&v->domain->pci_lock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
> @@ -190,6 +191,7 @@ bool vpci_process_pending(struct vcpu *v)
>               * failure.
>               */
>              vpci_remove_device(v->vpci.pdev);
> +        write_unlock(&v->domain->pci_lock);

vpci_process_pending() is problematic wrt vpci_remove_device(), as the
removal of a device with pending map operations would render such
operations stale, effectively leaking the mappings to a device MMIO
area that's no longer owned by the domain.

In the same sense vpci_remove_device() should take care of removing
any MMIO mappings created, which is not currently the case.  I guess
such problem warrant at least some kind of comment in
vpci_process_pending() and/or vpci_remove_device().
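
Something along the lines of (sketch of such a comment):

    /*
     * TODO: vpci_remove_device() does not revert the p2m mappings that
     * a pending vpci_process_pending() operation may already have
     * established, so removing a device with an in-flight map leaks
     * those mappings to the guest.
     */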

>      }
>  
>      return false;
> @@ -201,8 +203,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));

You want rw_is_write_locked(), as that checks for exclusive ownership
of the lock (like you have in modify_bars()).

> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        read_unlock(&d->pci_lock);
>          process_pending_softirqs();
> +        read_lock(&d->pci_lock);
> +    }
> +
>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -243,6 +257,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      unsigned int i;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !mem )
>          return -ENOMEM;
>  
> @@ -522,6 +538,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_bar *bars = header->bars;
>      int rc;
>  
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));

Same here, initialization should be done with the lock exclusively
held.

> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..a0733bb2cb 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -265,7 +265,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -277,6 +277,9 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !read_trylock(&d->pci_lock) )
> +            continue;
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
>                       * holding the lock.
>                       */
>                      printk("unable to print all MSI-X entries: %d\n", rc);
> -                    process_pending_softirqs();
> -                    continue;
> +                    goto pdev_done;
>                  }
>              }
>  
>              spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> +            /*
> +             * Unlock lock to process pending softirqs. This is
> +             * potentially unsafe, as d->pdev_list can be changed in
> +             * meantime.
> +             */
> +            read_unlock(&d->pci_lock);
>              process_pending_softirqs();
> +            if ( !read_trylock(&d->pci_lock) )
> +            {
> +                printk("unable to access other devices for the domain\n");
> +                goto domain_done;

Shouldn't the domain_done label be after the read_unlock(), so that we
can proceed to try to dump the devices for the next domain?  With the
proposed code a failure to acquire one of the domains' pci_lock
terminates the dump.

> +            }
>          }
> +        read_unlock(&d->pci_lock);
>      }
> + domain_done:
>      rcu_read_unlock(&domlist_read_lock);
>  }
>  
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index f9df506f29..f8c5bd393b 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>      {
>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->pci_lock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->pci_lock);
> +
> +    return rc;
>  }
>  
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
>      *data = ~0ul;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);
> +
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +426,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +616,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index d73fa76302..34fff2ef2d 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>          return;
>  
> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      const unsigned long *ro_map;
>      int rc = 0;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>  
> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
>      uint32_t data = ~(uint32_t)0;
> +    rwlock_t *lock;
>  
>      if ( !size )
>      {
> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>       * Find the PCI dev matching the address, which for hwdom also requires
>       * consulting DomXEN.  Passthrough everything that's not trapped.
>       */
> +    lock = &d->pci_lock;
> +    read_lock(lock);
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
> +    {
> +        read_unlock(lock);
> +        lock = &dom_xen->pci_lock;
> +        read_lock(lock);
>          pdev = pci_get_pdev(dom_xen, sbdf);
> +    }
>      if ( !pdev || !pdev->vpci )
> +    {
> +        read_unlock(lock);
>          return vpci_read_hw(sbdf, reg, size);
> +    }
>  
>      spin_lock(&pdev->vpci->lock);
>  
> @@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    read_unlock(lock);
>  
>      if ( data_offset < size )
>      {
> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>               r->private);
>  }
>  
> +/* Helper function to unlock locks taken by vpci_write in proper order */
> +static void release_domain_locks(struct domain *d)

release_domain_write_locks() might be more descriptive in case we ever
need a similar helper for reads also.

> +{
> +    ASSERT(rw_is_write_locked(&d->pci_lock));
> +
> +    if ( is_hardware_domain(d) )
> +    {
> +        ASSERT(rw_is_write_locked(&dom_xen->pci_lock));
> +        write_unlock(&dom_xen->pci_lock);
> +    }
> +    write_unlock(&d->pci_lock);
> +}
> +
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.
> +     * If this is hwdom, we need to hold locks for both domain in case if
> +     * modify_bars() is called
>       */
> +    write_lock(&d->pci_lock);
> +
> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> +    if ( is_hardware_domain(d) )
> +        write_lock(&dom_xen->pci_lock);

Strictly speaking we only need the pci_lock in exclusive mode when
enabling/disabling the BARs AFAICT?  For the rest of the operations
the per-device vPCI lock already protects against concurrent accesses.
Might be worth mentioning that the write lock is only required for
those accesses, but that such an improvement is left as a TODO.

> +
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
>          pdev = pci_get_pdev(dom_xen, sbdf);
> @@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>          if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
>              vpci_write_hw(sbdf, reg, size, data);
> +
> +        release_domain_locks(d);

You can release the lock before the vpci_write_hw() call.

Thanks, Roger.
Jan Beulich Sept. 19, 2023, 3:55 p.m. UTC | #2
On 19.09.2023 17:39, Roger Pau Monné wrote:
> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> @@ -2908,7 +2908,13 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
>>  
>>      msi->irq = irq;
>>  
>> -    pcidevs_lock();
>> +    /*
>> +     * If we are called via vPCI->vMSI path, we already are holding
>> +     * d->pci_lock so there is no need to take pcidevs_lock, as it
>> +     * will cause lock inversion.
>> +     */
>> +    if ( !rw_is_locked(&d->pci_lock) )
>> +        pcidevs_lock();
> 
> This is not a safe expression to use, rw_is_locked() just returns
> whether the lock is taken, but not if it's taken by the current CPU.
> This is fine to use in assertions and debug code, but not in order to
> take lock ordering decisions I'm afraid.
> 
> You will likely need to move the locking to the callers of the
> function.

Along the lines of a later comment, I think it would be rw_is_write_locked()
here anyway. Noting that xen/rwlock.h already has an internal
_is_write_locked_by_me(), it would in principle be possible to construct
something along the lines of what the comment says. But it would certainly
be better if that could be avoided.
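
A possible shape for such a wrapper (sketch only):

    static inline bool rw_is_write_locked_by_me(const rwlock_t *l)
    {
        return _is_write_locked_by_me(atomic_read(&l->cnts));
    }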

As to the comment: A lock inversion cannot be used to justify not
acquiring a necessary lock. The wording therefore wants adjusting (if the
logic was to stay).

Jan
Stewart Hildebrand Sept. 19, 2023, 4:20 p.m. UTC | #3
On 9/19/23 11:39, Roger Pau Monné wrote:
> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..a0733bb2cb 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */
>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            /*
>> +             * Unlock lock to process pending softirqs. This is
>> +             * potentially unsafe, as d->pdev_list can be changed in
>> +             * meantime.
>> +             */
>> +            read_unlock(&d->pci_lock);
>>              process_pending_softirqs();
>> +            if ( !read_trylock(&d->pci_lock) )
>> +            {
>> +                printk("unable to access other devices for the domain\n");
>> +                goto domain_done;
> 
> Shouldn't the domain_done label be after the read_unlock(), so that we
> can proceed to try to dump the devices for the next domain?  With the
> proposed code a failure to acquire one of the domains pci_lock
> terminates the dump.
> 
>> +            }
>>          }
>> +        read_unlock(&d->pci_lock);
>>      }
>> + domain_done:
>>      rcu_read_unlock(&domlist_read_lock);
>>  }
>>

With the label moved, a no-op expression after the label is needed to make the compiler happy:

            }
        }
        read_unlock(&d->pci_lock);
 domain_done:
        (void)0;
    }
    rcu_read_unlock(&domlist_read_lock);
}


If the no-op is omitted, the compiler may complain (gcc 9.4.0):

drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
drivers/vpci/msi.c:351:2: error: label at end of compound statement
  351 |  domain_done:
      |  ^~~~~~~~~~~
Roger Pau Monné Sept. 20, 2023, 8:09 a.m. UTC | #4
On Tue, Sep 19, 2023 at 12:20:39PM -0400, Stewart Hildebrand wrote:
> On 9/19/23 11:39, Roger Pau Monné wrote:
> > On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> >> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> >> index 8f2b59e61a..a0733bb2cb 100644
> >> --- a/xen/drivers/vpci/msi.c
> >> +++ b/xen/drivers/vpci/msi.c
> >> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
> >>                       * holding the lock.
> >>                       */
> >>                      printk("unable to print all MSI-X entries: %d\n", rc);
> >> -                    process_pending_softirqs();
> >> -                    continue;
> >> +                    goto pdev_done;
> >>                  }
> >>              }
> >>
> >>              spin_unlock(&pdev->vpci->lock);
> >> + pdev_done:
> >> +            /*
> >> +             * Unlock lock to process pending softirqs. This is
> >> +             * potentially unsafe, as d->pdev_list can be changed in
> >> +             * meantime.
> >> +             */
> >> +            read_unlock(&d->pci_lock);
> >>              process_pending_softirqs();
> >> +            if ( !read_trylock(&d->pci_lock) )
> >> +            {
> >> +                printk("unable to access other devices for the domain\n");
> >> +                goto domain_done;
> > 
> > Shouldn't the domain_done label be after the read_unlock(), so that we
> > can proceed to try to dump the devices for the next domain?  With the
> > proposed code a failure to acquire one of the domains pci_lock
> > terminates the dump.
> > 
> >> +            }
> >>          }
> >> +        read_unlock(&d->pci_lock);
> >>      }
> >> + domain_done:
> >>      rcu_read_unlock(&domlist_read_lock);
> >>  }
> >>
> 
> With the label moved, a no-op expression after the label is needed to make the compiler happy:
> 
>             }
>         }
>         read_unlock(&d->pci_lock);
>  domain_done:
>         (void)0;
>     }
>     rcu_read_unlock(&domlist_read_lock);
> }
> 
> 
> If the no-op is omitted, the compiler may complain (gcc 9.4.0):
> 
> drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
> drivers/vpci/msi.c:351:2: error: label at end of compound statement
>   351 |  domain_done:
>       |  ^~~~~~~~~~~


Might be better to place the label at the start of the loop, and
likely rename to next_domain.

Thanks, Roger.
Roger Pau Monné Sept. 20, 2023, 8:12 a.m. UTC | #5
On Tue, Sep 19, 2023 at 05:55:42PM +0200, Jan Beulich wrote:
> On 19.09.2023 17:39, Roger Pau Monné wrote:
> > On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> >> @@ -2908,7 +2908,13 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
> >>  
> >>      msi->irq = irq;
> >>  
> >> -    pcidevs_lock();
> >> +    /*
> >> +     * If we are called via vPCI->vMSI path, we already are holding
> >> +     * d->pci_lock so there is no need to take pcidevs_lock, as it
> >> +     * will cause lock inversion.
> >> +     */
> >> +    if ( !rw_is_locked(&d->pci_lock) )
> >> +        pcidevs_lock();
> > 
> > This is not a safe expression to use, rw_is_locked() just returns
> > whether the lock is taken, but not if it's taken by the current CPU.
> > This is fine to use in assertions and debug code, but not in order to
> > take lock ordering decisions I'm afraid.
> > 
> > You will likely need to move the locking to the callers of the
> > function.
> 
> Along the lines of a later comment, I think it would by rw_is_write_locked()
> here anyway. Noting that xen/rwlock.h already has an internal
> _is_write_locked_by_me(), it would in principle be possible to construct
> something along the lines of what the comment says. But it would certainly
> be better if that could be avoided.

I personally don't like constructs like the above, they are fragile and
should be avoided.

It might be better to introduce some wrappers around
allocate_and_map_msi_pirq() for the different locking contexts of
callers if possible.

Regards, Roger.
Stewart Hildebrand Sept. 20, 2023, 1:56 p.m. UTC | #6
On 9/20/23 04:09, Roger Pau Monné wrote:
> On Tue, Sep 19, 2023 at 12:20:39PM -0400, Stewart Hildebrand wrote:
>> On 9/19/23 11:39, Roger Pau Monné wrote:
>>> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>>>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>>>> index 8f2b59e61a..a0733bb2cb 100644
>>>> --- a/xen/drivers/vpci/msi.c
>>>> +++ b/xen/drivers/vpci/msi.c
>>>> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
>>>>                       * holding the lock.
>>>>                       */
>>>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>>>> -                    process_pending_softirqs();
>>>> -                    continue;
>>>> +                    goto pdev_done;
>>>>                  }
>>>>              }
>>>>
>>>>              spin_unlock(&pdev->vpci->lock);
>>>> + pdev_done:
>>>> +            /*
>>>> +             * Unlock lock to process pending softirqs. This is
>>>> +             * potentially unsafe, as d->pdev_list can be changed in
>>>> +             * meantime.
>>>> +             */
>>>> +            read_unlock(&d->pci_lock);
>>>>              process_pending_softirqs();
>>>> +            if ( !read_trylock(&d->pci_lock) )
>>>> +            {
>>>> +                printk("unable to access other devices for the domain\n");
>>>> +                goto domain_done;
>>>
>>> Shouldn't the domain_done label be after the read_unlock(), so that we
>>> can proceed to try to dump the devices for the next domain?  With the
>>> proposed code a failure to acquire one of the domains pci_lock
>>> terminates the dump.
>>>
>>>> +            }
>>>>          }
>>>> +        read_unlock(&d->pci_lock);
>>>>      }
>>>> + domain_done:
>>>>      rcu_read_unlock(&domlist_read_lock);
>>>>  }
>>>>
>>
>> With the label moved, a no-op expression after the label is needed to make the compiler happy:
>>
>>             }
>>         }
>>         read_unlock(&d->pci_lock);
>>  domain_done:
>>         (void)0;
>>     }
>>     rcu_read_unlock(&domlist_read_lock);
>> }
>>
>>
>> If the no-op is omitted, the compiler may complain (gcc 9.4.0):
>>
>> drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
>> drivers/vpci/msi.c:351:2: error: label at end of compound statement
>>   351 |  domain_done:
>>       |  ^~~~~~~~~~~
> 
> 
> Might be better to place the label at the start of the loop, and
> likely rename to next_domain.

That would bypass the loop condition and increment statements.
Stewart Hildebrand Sept. 20, 2023, 7:16 p.m. UTC | #7
On 9/19/23 11:39, Roger Pau Monné wrote:
> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index 1edc7f1e91..545a27796e 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
>>
>>      spin_unlock_irq(&desc->lock);
>>
>> -    ASSERT(pcidevs_locked());
>> -
> 
> Hm, this removal seems dubious, same with some of the removal below.
> And I don't see any comment in the log message as to why removing the
> asserts here and in __pci_enable_msi{,x}(), pci_prepare_msix() is
> safe.
> 

I suspect we may want:

    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));

However, we don't have d. Using v->domain here is tricky because v may be NULL. How about passing struct domain *d as an arg to {hvm,vmx}_pi_update_irte()? Or ensuring that all callers pass a valid v?
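
E.g. (hypothetical signature change):

    static int cf_check vmx_pi_update_irte(const struct vcpu *v,
                                           struct domain *d,
                                           const struct pirq *pirq,
                                           uint8_t gvec);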

>>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
>>
>>   unlock_out:
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index d0bf63df1d..ba2963b7d2 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -613,7 +613,7 @@ static int msi_capability_init(struct pci_dev *dev,
>>      u8 slot = PCI_SLOT(dev->devfn);
>>      u8 func = PCI_FUNC(dev->devfn);
>>
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
>>      if ( !pos )
>>          return -ENODEV;
>> @@ -783,7 +783,7 @@ static int msix_capability_init(struct pci_dev *dev,
>>      if ( !pos )
>>          return -ENODEV;
>>
>> -    ASSERT(pcidevs_locked());
>> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
>>
>>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
>>      /*
>> @@ -1000,7 +1000,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>>      struct pci_dev *pdev;
>>      struct msi_desc *old_desc;
>>
>> -    ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(NULL, msi->sbdf);
>>      if ( !pdev )
>>          return -ENODEV;

I think we can move the ASSERT here, after we obtain the pdev. Then we can add the pdev->domain->pci_lock check into the mix:

    ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));

>> @@ -1055,7 +1054,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
>>      struct pci_dev *pdev;
>>      struct msi_desc *old_desc;
>>
>> -    ASSERT(pcidevs_locked());
>>      pdev = pci_get_pdev(NULL, msi->sbdf);
>>      if ( !pdev || !pdev->msix )
>>          return -ENODEV;

Same here

>> @@ -1170,8 +1168,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
>>   */
>>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
>>  {
>> -    ASSERT(pcidevs_locked());
>> -

This removal inside pci_enable_msi() may be okay if both __pci_enable_msi() and __pci_enable_msix() have an appropriate ASSERT.

>>      if ( !use_msi )
>>          return -EPERM;
>>

Related: in xen/drivers/passthrough/pci.c:pci_get_pdev() I run into an ASSERT with a PVH dom0:

(XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:534
(XEN) ----[ Xen-4.18-unstable  x86_64  debug=y  Tainted:   C    ]----
...
(XEN) Xen call trace:
(XEN)    [<ffff82d040285a3b>] R pci_get_pdev+0x4c/0xab
(XEN)    [<ffff82d04034742e>] F arch/x86/msi.c#__pci_enable_msi+0x1d/0xb4
(XEN)    [<ffff82d0403477b5>] F pci_enable_msi+0x20/0x28
(XEN)    [<ffff82d04034cfa4>] F map_domain_pirq+0x2b0/0x718
(XEN)    [<ffff82d04034e37c>] F allocate_and_map_msi_pirq+0xff/0x26b
(XEN)    [<ffff82d0402e088b>] F arch/x86/hvm/vmsi.c#vpci_msi_enable+0x53/0x9d
(XEN)    [<ffff82d0402e19d5>] F vpci_msi_arch_enable+0x36/0x6c
(XEN)    [<ffff82d04026f49d>] F drivers/vpci/msi.c#control_write+0x71/0x114
(XEN)    [<ffff82d04026d050>] F drivers/vpci/vpci.c#vpci_write_helper+0x6f/0x7c
(XEN)    [<ffff82d04026de39>] F vpci_write+0x249/0x2f9
...

With the patch applied, it's valid to call pci_get_pdev() with only d->pci_lock held, so the ASSERT in pci_get_pdev() needs to be reworked too. Inside pci_get_pdev(), d may be null, so we can't easily add || rw_is_locked(&d->pci_lock) into the ASSERT. Instead I propose something like the following, which resolves the observed assertion failure:

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 572643abe412..2b4ad804510c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -531,8 +531,6 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
 {
     struct pci_dev *pdev;

-    ASSERT(d || pcidevs_locked());
-
     /*
      * The hardware domain owns the majority of the devices in the system.
      * When there are multiple segments, traversing the per-segment list is
@@ -549,12 +547,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
         list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
             if ( pdev->sbdf.bdf == sbdf.bdf &&
                  (!d || pdev->domain == d) )
+            {
+                ASSERT(d || pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
                 return pdev;
+            }
     }
     else
         list_for_each_entry ( pdev, &d->pdev_list, domain_list )
             if ( pdev->sbdf.sbdf == sbdf.sbdf )
+            {
+                ASSERT(d || pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));
                 return pdev;
+            }

     return NULL;
 }
Jan Beulich Sept. 21, 2023, 7:42 a.m. UTC | #8
On 20.09.2023 15:56, Stewart Hildebrand wrote:
> On 9/20/23 04:09, Roger Pau Monné wrote:
>> On Tue, Sep 19, 2023 at 12:20:39PM -0400, Stewart Hildebrand wrote:
>>> On 9/19/23 11:39, Roger Pau Monné wrote:
>>>> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>>>>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>>>>> index 8f2b59e61a..a0733bb2cb 100644
>>>>> --- a/xen/drivers/vpci/msi.c
>>>>> +++ b/xen/drivers/vpci/msi.c
>>>>> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
>>>>>                       * holding the lock.
>>>>>                       */
>>>>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>>>>> -                    process_pending_softirqs();
>>>>> -                    continue;
>>>>> +                    goto pdev_done;
>>>>>                  }
>>>>>              }
>>>>>
>>>>>              spin_unlock(&pdev->vpci->lock);
>>>>> + pdev_done:
>>>>> +            /*
>>>>> +             * Unlock lock to process pending softirqs. This is
>>>>> +             * potentially unsafe, as d->pdev_list can be changed in
>>>>> +             * meantime.
>>>>> +             */
>>>>> +            read_unlock(&d->pci_lock);
>>>>>              process_pending_softirqs();
>>>>> +            if ( !read_trylock(&d->pci_lock) )
>>>>> +            {
>>>>> +                printk("unable to access other devices for the domain\n");
>>>>> +                goto domain_done;
>>>>
>>>> Shouldn't the domain_done label be after the read_unlock(), so that we
>>>> can proceed to try to dump the devices for the next domain?  With the
>>>> proposed code a failure to acquire one of the domains pci_lock
>>>> terminates the dump.
>>>>
>>>>> +            }
>>>>>          }
>>>>> +        read_unlock(&d->pci_lock);
>>>>>      }
>>>>> + domain_done:
>>>>>      rcu_read_unlock(&domlist_read_lock);
>>>>>  }
>>>>>
>>>
>>> With the label moved, a no-op expression after the label is needed to make the compiler happy:
>>>
>>>             }
>>>         }
>>>         read_unlock(&d->pci_lock);
>>>  domain_done:
>>>         (void)0;
>>>     }
>>>     rcu_read_unlock(&domlist_read_lock);
>>> }
>>>
>>>
>>> If the no-op is omitted, the compiler may complain (gcc 9.4.0):
>>>
>>> drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
>>> drivers/vpci/msi.c:351:2: error: label at end of compound statement
>>>   351 |  domain_done:
>>>       |  ^~~~~~~~~~~
>>
>>
>> Might be better to place the label at the start of the loop, and
>> likely rename to next_domain.
> 
> That would bypass the loop condition and increment statements.

Right, such a label would be bogus even without that; instead of "goto"
the use site then simply should use "continue".

Jan
Roger Pau Monné Sept. 21, 2023, 9 a.m. UTC | #9
On Thu, Sep 21, 2023 at 09:42:08AM +0200, Jan Beulich wrote:
> On 20.09.2023 15:56, Stewart Hildebrand wrote:
> > On 9/20/23 04:09, Roger Pau Monné wrote:
> >> On Tue, Sep 19, 2023 at 12:20:39PM -0400, Stewart Hildebrand wrote:
> >>> On 9/19/23 11:39, Roger Pau Monné wrote:
> >>>> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> >>>>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> >>>>> index 8f2b59e61a..a0733bb2cb 100644
> >>>>> --- a/xen/drivers/vpci/msi.c
> >>>>> +++ b/xen/drivers/vpci/msi.c
> >>>>> @@ -318,15 +321,28 @@ void vpci_dump_msi(void)
> >>>>>                       * holding the lock.
> >>>>>                       */
> >>>>>                      printk("unable to print all MSI-X entries: %d\n", rc);
> >>>>> -                    process_pending_softirqs();
> >>>>> -                    continue;
> >>>>> +                    goto pdev_done;
> >>>>>                  }
> >>>>>              }
> >>>>>
> >>>>>              spin_unlock(&pdev->vpci->lock);
> >>>>> + pdev_done:
> >>>>> +            /*
> >>>>> +             * Unlock lock to process pending softirqs. This is
> >>>>> +             * potentially unsafe, as d->pdev_list can be changed in
> >>>>> +             * meantime.
> >>>>> +             */
> >>>>> +            read_unlock(&d->pci_lock);
> >>>>>              process_pending_softirqs();
> >>>>> +            if ( !read_trylock(&d->pci_lock) )
> >>>>> +            {
> >>>>> +                printk("unable to access other devices for the domain\n");
> >>>>> +                goto domain_done;
> >>>>
> >>>> Shouldn't the domain_done label be after the read_unlock(), so that we
> >>>> can proceed to try to dump the devices for the next domain?  With the
> >>>> proposed code a failure to acquire one of the domains pci_lock
> >>>> terminates the dump.
> >>>>
> >>>>> +            }
> >>>>>          }
> >>>>> +        read_unlock(&d->pci_lock);
> >>>>>      }
> >>>>> + domain_done:
> >>>>>      rcu_read_unlock(&domlist_read_lock);
> >>>>>  }
> >>>>>
> >>>
> >>> With the label moved, a no-op expression after the label is needed to make the compiler happy:
> >>>
> >>>             }
> >>>         }
> >>>         read_unlock(&d->pci_lock);
> >>>  domain_done:
> >>>         (void)0;
> >>>     }
> >>>     rcu_read_unlock(&domlist_read_lock);
> >>> }
> >>>
> >>>
> >>> If the no-op is omitted, the compiler may complain (gcc 9.4.0):
> >>>
> >>> drivers/vpci/msi.c: In function ‘vpci_dump_msi’:
> >>> drivers/vpci/msi.c:351:2: error: label at end of compound statement
> >>>   351 |  domain_done:
> >>>       |  ^~~~~~~~~~~
> >>
> >>
> >> Might be better to place the label at the start of the loop, and
> >> likely rename to next_domain.
> > 
> > That would bypass the loop condition and increment statements.
> 
> Right, such a label would be bogus even without that; instead of "goto"
> the use site then simply should use "continue".

IIRC continue is not suitable because the code would reach the
read_unlock() without having the lock held.

Anyway, I would leave to the submitter to find a suitable way to
continue the domain iteration.

Thanks, Roger.
Roger Pau Monné Sept. 21, 2023, 9:41 a.m. UTC | #10
On Wed, Sep 20, 2023 at 03:16:00PM -0400, Stewart Hildebrand wrote:
> On 9/19/23 11:39, Roger Pau Monné wrote:
> > On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
> >> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> >> index 1edc7f1e91..545a27796e 100644
> >> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >> @@ -413,8 +413,6 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v,
> >>
> >>      spin_unlock_irq(&desc->lock);
> >>
> >> -    ASSERT(pcidevs_locked());
> >> -
> > 
> > Hm, this removal seems dubious, same with some of the removal below.
> > And I don't see any comment in the log message as to why removing the
> > asserts here and in __pci_enable_msi{,x}(), pci_prepare_msix() is
> > safe.
> > 
> 
> I suspect we may want:
> 
>     ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
> 
> However, we don't have d. Using v->domain here is tricky because v may be NULL. How about passing struct domain *d as an arg to {hvm,vmx}_pi_update_irte()? Or ensuring that all callers pass a valid v?

I guess there was a reason to expect a path with v == NULL, but would
need to go through the call paths that lead here.

Another option might be to use:

ASSERT(pcidevs_locked() || (v && rw_is_locked(&v->domain->pci_lock)));

But we would need some understanding of the call site of
vmx_pi_update_irte().

> 
> >>      return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
> >>
> >>   unlock_out:
> >> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> >> index d0bf63df1d..ba2963b7d2 100644
> >> --- a/xen/arch/x86/msi.c
> >> +++ b/xen/arch/x86/msi.c
> >> @@ -613,7 +613,7 @@ static int msi_capability_init(struct pci_dev *dev,
> >>      u8 slot = PCI_SLOT(dev->devfn);
> >>      u8 func = PCI_FUNC(dev->devfn);
> >>
> >> -    ASSERT(pcidevs_locked());
> >> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
> >>      pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
> >>      if ( !pos )
> >>          return -ENODEV;
> >> @@ -783,7 +783,7 @@ static int msix_capability_init(struct pci_dev *dev,
> >>      if ( !pos )
> >>          return -ENODEV;
> >>
> >> -    ASSERT(pcidevs_locked());
> >> +    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
> >>
> >>      control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
> >>      /*
> >> @@ -1000,7 +1000,6 @@ static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
> >>      struct pci_dev *pdev;
> >>      struct msi_desc *old_desc;
> >>
> >> -    ASSERT(pcidevs_locked());
> >>      pdev = pci_get_pdev(NULL, msi->sbdf);
> >>      if ( !pdev )
> >>          return -ENODEV;
> 
> I think we can move the ASSERT here, after we obtain the pdev. Then we can add the pdev->domain->pci_lock check into the mix:
> 
>     ASSERT(pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));

Hm, it would be better to perform the ASSERT before possibly accessing
the pdev list without holding any locks, but it's just an assert so
that might be the best option.

> 
> >> @@ -1055,7 +1054,6 @@ static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
> >>      struct pci_dev *pdev;
> >>      struct msi_desc *old_desc;
> >>
> >> -    ASSERT(pcidevs_locked());
> >>      pdev = pci_get_pdev(NULL, msi->sbdf);
> >>      if ( !pdev || !pdev->msix )
> >>          return -ENODEV;
> 
> Same here
> 
> >> @@ -1170,8 +1168,6 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
> >>   */
> >>  int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
> >>  {
> >> -    ASSERT(pcidevs_locked());
> >> -
> 
> This removal inside pci_enable_msi() may be okay if both __pci_enable_msi() and __pci_enable_msix() have an appropriate ASSERT.

Hm, yes, that's likely fine, but would want a small mention in the
commit message.

> >>      if ( !use_msi )
> >>          return -EPERM;
> >>
> 
> Related: in xen/drivers/passthrough/pci.c:pci_get_pdev() I run into an ASSERT with a PVH dom0:
> 
> (XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:534
> (XEN) ----[ Xen-4.18-unstable  x86_64  debug=y  Tainted:   C    ]----
> ...
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040285a3b>] R pci_get_pdev+0x4c/0xab
> (XEN)    [<ffff82d04034742e>] F arch/x86/msi.c#__pci_enable_msi+0x1d/0xb4
> (XEN)    [<ffff82d0403477b5>] F pci_enable_msi+0x20/0x28
> (XEN)    [<ffff82d04034cfa4>] F map_domain_pirq+0x2b0/0x718
> (XEN)    [<ffff82d04034e37c>] F allocate_and_map_msi_pirq+0xff/0x26b
> (XEN)    [<ffff82d0402e088b>] F arch/x86/hvm/vmsi.c#vpci_msi_enable+0x53/0x9d
> (XEN)    [<ffff82d0402e19d5>] F vpci_msi_arch_enable+0x36/0x6c
> (XEN)    [<ffff82d04026f49d>] F drivers/vpci/msi.c#control_write+0x71/0x114
> (XEN)    [<ffff82d04026d050>] F drivers/vpci/vpci.c#vpci_write_helper+0x6f/0x7c
> (XEN)    [<ffff82d04026de39>] F vpci_write+0x249/0x2f9
> ...
> 
> With the patch applied, it's valid to call pci_get_pdev() with only d->pci_lock held, so the ASSERT in pci_get_pdev() needs to be reworked too. Inside pci_get_pdev(), d may be NULL, so we can't easily add || rw_is_locked(&d->pci_lock) into the ASSERT. Instead I propose something like the following, which resolves the observed assertion failure:
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 572643abe412..2b4ad804510c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -531,8 +531,6 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
>  {
>      struct pci_dev *pdev;
> 
> -    ASSERT(d || pcidevs_locked());
> -
>      /*
>       * The hardware domain owns the majority of the devices in the system.
>       * When there are multiple segments, traversing the per-segment list is
> @@ -549,12 +547,18 @@ struct pci_dev *pci_get_pdev(const struct domain *d, pci_sbdf_t sbdf)
>          list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
>              if ( pdev->sbdf.bdf == sbdf.bdf &&
>                   (!d || pdev->domain == d) )
> +            {
> +                ASSERT(d || pcidevs_locked() || rw_is_locked(&pdev->domain->pci_lock));

Hm, strictly speaking, iterating over the pseg list while holding just
d->pci_lock is not safe; we should instead iterate over d->pdev_list.

We might have to slightly modify pci_enable_msi() to take a pdev so
that the search can be done by the caller (holding the right lock).
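
As a sketch (assuming pci_get_pdev() keeps its current signature), the
per-domain branch could walk d->pdev_list directly, which is exactly the
list that d->pci_lock protects:

    if ( d )
    {
        list_for_each_entry ( pdev, &d->pdev_list, domain_list )
            if ( pdev->sbdf.bdf == sbdf.bdf )
                return pdev;
        return NULL;
    }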

Thanks, Roger.
Volodymyr Babchuk Sept. 25, 2023, 11:03 p.m. UTC | #11
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Tue, Aug 29, 2023 at 11:19:42PM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Use a previously introduced per-domain read/write lock to check
>> whether vpci is present, so we are sure there are no accesses to the
>> contents of the vpci struct if not. This lock can be used (and in a
>> few cases is used right away) so that vpci removal can be performed
>> while holding the lock in write mode. Previously such removal could
>> race with vpci_read for example.
>> 
>> When taking both d->pci_lock and pdev->vpci->lock they are should be
>
> When taking both d->pci_lock and pdev->vpci->lock the order should be
> ...
>

>> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
>> possible deadlock situations.
>>

Would it be better to write it like this:

"When taking both d->pci_lock and pdev->vpci->lock, they should be
taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid
possible deadlock situations."

?

I am asking because your suggestion, applied verbatim, leads to "When
taking both d->pci_lock and pdev->vpci->lock the order should be taken
in this exact order: ..."

[...]

As for the other comments, I am taking your, Jan's, and Stewart's
comments into account and reworking this patch.

Patch

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f236362..fde76cc6b4 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -468,7 +468,7 @@  int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable)
     struct msixtbl_entry *entry, *new_entry;
     int r = -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -538,7 +538,7 @@  void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq)
     struct pci_dev *pdev;
     struct msixtbl_entry *entry;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
@@ -684,7 +684,7 @@  static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data,
 {
     unsigned int i;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
     if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
     {
@@ -725,8 +725,8 @@  void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
     int rc;
 
     ASSERT(msi->arch.pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
     {
         struct xen_domctl_bind_pt_irq unbind = {
@@ -745,7 +745,6 @@  void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev)
 
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
                                        msi->vectors, msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 }
 
 static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -778,15 +777,13 @@  int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev,
     int rc;
 
     ASSERT(msi->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
     rc = vpci_msi_enable(pdev, vectors, 0);
     if ( rc < 0 )
         return rc;
     msi->arch.pirq = rc;
-
-    pcidevs_lock();
     msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
                                        msi->arch.pirq, msi->mask);
-    pcidevs_unlock();
 
     return 0;
 }
@@ -797,8 +794,8 @@  static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     unsigned int i;
 
     ASSERT(pirq != INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
 
-    pcidevs_lock();
     for ( i = 0; i < nr && bound; i++ )
     {
         struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +811,6 @@  static void vpci_msi_disable(const struct pci_dev *pdev, int pirq,
     write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
     write_unlock(&pdev->domain->event_lock);
-    pcidevs_unlock();
 }
 
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -854,6 +850,8 @@  int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
     int rc;
 
     ASSERT(entry->arch.pirq == INVALID_PIRQ);
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry),
                          table_base);
     if ( rc < 0 )
@@ -861,7 +859,6 @@  int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
 
     entry->arch.pirq = rc;
 
-    pcidevs_lock();
     rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
                          entry->masked);
     if ( rc )
@@ -869,7 +866,6 @@  int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry,
         vpci_msi_disable(pdev, entry->arch.pirq, 1, false);
         entry->arch.pirq = INVALID_PIRQ;
     }
-    pcidevs_unlock();
 
     return rc;
 }
@@ -895,6 +891,8 @@  int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +911,9 @@  int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            read_unlock(&pdev->domain->pci_lock);
             process_pending_softirqs();
+            read_lock(&pdev->domain->pci_lock);
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 1edc7f1e91..545a27796e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -413,8 +413,6 @@  static int cf_check vmx_pi_update_irte(const struct vcpu *v,
 
     spin_unlock_irq(&desc->lock);
 
-    ASSERT(pcidevs_locked());
-
     return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg);
 
  unlock_out:
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 6abfd81621..cb99ae5392 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2157,7 +2157,7 @@  int map_domain_pirq(
         struct pci_dev *pdev;
         unsigned int nr = 0;
 
-        ASSERT(pcidevs_locked());
+        ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
 
         ret = -ENODEV;
         if ( !cpu_has_apic )
@@ -2314,7 +2314,7 @@  int unmap_domain_pirq(struct domain *d, int pirq)
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));
     ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
@@ -2908,7 +2908,13 @@  int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
     msi->irq = irq;
 
-    pcidevs_lock();
+    /*
+     * If we are called via the vPCI->vMSI path, we are already holding
+     * d->pci_lock, so there is no need to take pcidevs_lock; doing so
+     * would cause lock inversion.
+     */
+    if ( !rw_is_locked(&d->pci_lock) )
+        pcidevs_lock();
     /* Verify or get pirq. */
     write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
@@ -2924,7 +2930,8 @@  int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p,
 
  done:
     write_unlock(&d->event_lock);
-    pcidevs_unlock();
+    if ( !rw_is_locked(&d->pci_lock) )
+        pcidevs_unlock();
     if ( ret )
     {
         switch ( type )
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index d0bf63df1d..ba2963b7d2 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -613,7 +613,7 @@  static int msi_capability_init(struct pci_dev *dev,
     u8 slot = PCI_SLOT(dev->devfn);
     u8 func = PCI_FUNC(dev->devfn);
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
     pos = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
     if ( !pos )
         return -ENODEV;
@@ -783,7 +783,7 @@  static int msix_capability_init(struct pci_dev *dev,
     if ( !pos )
         return -ENODEV;
 
-    ASSERT(pcidevs_locked());
+    ASSERT(pcidevs_locked() || rw_is_locked(&dev->domain->pci_lock));
 
     control = pci_conf_read16(dev->sbdf, msix_control_reg(pos));
     /*
@@ -1000,7 +1000,6 @@  static int __pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
     struct pci_dev *pdev;
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(NULL, msi->sbdf);
     if ( !pdev )
         return -ENODEV;
@@ -1055,7 +1054,6 @@  static int __pci_enable_msix(struct msi_info *msi, struct msi_desc **desc)
     struct pci_dev *pdev;
     struct msi_desc *old_desc;
 
-    ASSERT(pcidevs_locked());
     pdev = pci_get_pdev(NULL, msi->sbdf);
     if ( !pdev || !pdev->msix )
         return -ENODEV;
@@ -1170,8 +1168,6 @@  int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off)
  */
 int pci_enable_msi(struct msi_info *msi, struct msi_desc **desc)
 {
-    ASSERT(pcidevs_locked());
-
     if ( !use_msi )
         return -EPERM;
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 79ca928672..4f18293900 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -752,7 +752,6 @@  int pci_add_device(u16 seg, u8 bus, u8 devfn,
         pdev->domain = hardware_domain;
         write_lock(&hardware_domain->pci_lock);
         list_add(&pdev->domain_list, &hardware_domain->pdev_list);
-        write_unlock(&hardware_domain->pci_lock);
 
         /*
          * For devices not discovered by Xen during boot, add vPCI handlers
@@ -762,17 +761,17 @@  int pci_add_device(u16 seg, u8 bus, u8 devfn,
         if ( ret )
         {
             printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret);
-            write_lock(&hardware_domain->pci_lock);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
             goto out;
         }
+        write_unlock(&hardware_domain->pci_lock);
         ret = iommu_add_device(pdev);
         if ( ret )
         {
-            vpci_remove_device(pdev);
             write_lock(&hardware_domain->pci_lock);
+            vpci_remove_device(pdev);
             list_del(&pdev->domain_list);
             write_unlock(&hardware_domain->pci_lock);
             pdev->domain = NULL;
@@ -1147,7 +1146,9 @@  static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
+    write_lock(&ctxt->d->pci_lock);
     err = vpci_add_handlers(pdev);
+    write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 60f7049e34..177a6b57a5 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -172,6 +172,7 @@  bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        write_lock(&v->domain->pci_lock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
@@ -190,6 +191,7 @@  bool vpci_process_pending(struct vcpu *v)
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);
+        write_unlock(&v->domain->pci_lock);
     }
 
     return false;
@@ -201,8 +203,20 @@  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking the pdev disappearing, because devices
+         * cannot be removed until the initial domain has been started.
+         */
+        read_unlock(&d->pci_lock);
         process_pending_softirqs();
+        read_lock(&d->pci_lock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -243,6 +257,8 @@  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     unsigned int i;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !mem )
         return -ENOMEM;
 
@@ -522,6 +538,8 @@  static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
 
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 8f2b59e61a..a0733bb2cb 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -265,7 +265,7 @@  REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -277,6 +277,9 @@  void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->pci_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -318,15 +321,28 @@  void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+            /*
+             * Drop the lock to process pending softirqs. This is
+             * potentially unsafe, as d->pdev_list can change in the
+             * meantime.
+             */
+            read_unlock(&d->pci_lock);
             process_pending_softirqs();
+            if ( !read_trylock(&d->pci_lock) )
+            {
+                printk("unable to access other devices for the domain\n");
+                goto domain_done;
+            }
         }
+        read_unlock(&d->pci_lock);
     }
+ domain_done:
     rcu_read_unlock(&domlist_read_lock);
 }
 
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index f9df506f29..f8c5bd393b 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->pci_lock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->pci_lock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,35 @@  static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0ul;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +426,7 @@  static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +514,33 @@  static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +616,7 @@  static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index d73fa76302..34fff2ef2d 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,6 +38,8 @@  extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
@@ -73,6 +75,8 @@  int vpci_add_handlers(struct pci_dev *pdev)
     const unsigned long *ro_map;
     int rc = 0;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -326,11 +330,12 @@  static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     uint32_t data = ~(uint32_t)0;
+    rwlock_t *lock;
 
     if ( !size )
     {
@@ -342,11 +347,21 @@  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
      */
+    lock = &d->pci_lock;
+    read_lock(lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
+    {
+        read_unlock(lock);
+        lock = &dom_xen->pci_lock;
+        read_lock(lock);
         pdev = pci_get_pdev(dom_xen, sbdf);
+    }
     if ( !pdev || !pdev->vpci )
+    {
+        read_unlock(lock);
         return vpci_read_hw(sbdf, reg, size);
+    }
 
     spin_lock(&pdev->vpci->lock);
 
@@ -392,6 +407,7 @@  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(lock);
 
     if ( data_offset < size )
     {
@@ -431,10 +447,23 @@  static void vpci_write_helper(const struct pci_dev *pdev,
              r->private);
 }
 
+/* Helper to release the locks taken by vpci_write() in the proper order. */
+static void release_domain_locks(struct domain *d)
+{
+    ASSERT(rw_is_write_locked(&d->pci_lock));
+
+    if ( is_hardware_domain(d) )
+    {
+        ASSERT(rw_is_write_locked(&dom_xen->pci_lock));
+        write_unlock(&dom_xen->pci_lock);
+    }
+    write_unlock(&d->pci_lock);
+}
+
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -447,8 +476,16 @@  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
-     * consulting DomXEN.  Passthrough everything that's not trapped.
+     * consulting DomXEN. Passthrough everything that's not trapped.
+     * If this is hwdom, we need to hold the locks of both domains in
+     * case modify_bars() is called.
      */
+    write_lock(&d->pci_lock);
+
+    /* dom_xen->pci_lock should always be taken second to prevent deadlock. */
+    if ( is_hardware_domain(d) )
+        write_lock(&dom_xen->pci_lock);
+
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
@@ -459,6 +496,8 @@  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
         if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
             vpci_write_hw(sbdf, reg, size, data);
+
+        release_domain_locks(d);
         return;
     }
 
@@ -498,6 +537,7 @@  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    release_domain_locks(d);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */