
[v8,02/13] vpci: use per-domain PCI lock to protect vpci structure

Message ID 20230720003205.1828537-3-volodymyr_babchuk@epam.com
State Superseded
Series PCI devices passthrough on Arm, part 3

Commit Message

Volodymyr Babchuk July 20, 2023, 12:32 a.m. UTC
From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Use a previously introduced per-domain read/write lock to check
whether vpci is present, so we are sure there are no accesses to the
contents of the vpci struct if not. This lock can be used (and in a
few cases is used right away) so that vpci removal can be performed
while holding the lock in write mode. Previously such removal could
race with vpci_read for example.

1. The per-domain pci_rwlock is used to protect the pdev->vpci structure
from being removed.

2. Writing the command register and ROM BAR register may trigger
modify_bars to run, which in turn may access multiple pdevs while
checking for the existing BAR's overlap. The overlapping check, if
done under the read lock, requires vpci->lock to be acquired on both
devices being compared, which may produce a deadlock. It is not
possible to upgrade read lock to write lock in such a case. So, in
order to prevent the deadlock, use d->pci_lock instead. To prevent
deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
always lock hwdom first.

All other code, which doesn't lead to pdev->vpci destruction and does
not access multiple pdevs at the same time, can still use a
combination of the read lock and pdev->vpci->lock (see the sketch
after this list).

3. Drop const qualifier where the new rwlock is used and this is
appropriate.

4. Do not call process_pending_softirqs with any locks held. For that,
unlock prior to the call and re-acquire the locks afterwards. After
re-acquiring the lock there is no need to check if pdev->vpci exists:
 - in apply_map, because of the context in which it is called (no race
   condition is possible)
 - for the MSI/MSI-X debug code, because it is called at the end of the
   pdev->vpci access and no further access to pdev->vpci is made

5. Introduce pcidevs_trylock, so that it is possible to try to acquire
the pcidevs lock.

6. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
while accessing pdevs in vpci code.
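
As a minimal sketch of the discipline described above (illustrative only,
not part of the patch; all identifiers are the ones used by this series):

    /* Reader (e.g. a config space access): presence first, then contents. */
    static void reader_path(struct domain *d, struct pci_dev *pdev)
    {
        read_lock(&d->pci_lock);          /* pdev->vpci can't be freed now */
        if ( pdev->vpci )
        {
            spin_lock(&pdev->vpci->lock); /* protects the vpci contents */
            /* ... access pdev->vpci ... */
            spin_unlock(&pdev->vpci->lock);
        }
        read_unlock(&d->pci_lock);
    }

    /* Removal: write mode guarantees no reader holds pdev->vpci. */
    static void remover_path(struct domain *d, struct pci_dev *pdev)
    {
        write_lock(&d->pci_lock);
        vpci_remove_device(pdev);         /* may free pdev->vpci */
        write_unlock(&d->pci_lock);
    }

For the hwdom case described in point 2, the reader additionally takes
dom_xen->pci_lock after d->pci_lock, so that a single global lock order
exists.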

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>

---

Changes in v8:
 - changed d->vpci_lock to d->pci_lock
 - introducing d->pci_lock in a separate patch
 - extended locked region in vpci_process_pending
 - removed pcidevs_lock in vpci_dump_msi()
 - removed some changes as they are not needed with
   the new locking scheme
 - added handling for hwdom && dom_xen case
---
 xen/arch/x86/hvm/vmsi.c       |  4 +++
 xen/drivers/passthrough/pci.c |  7 +++++
 xen/drivers/vpci/header.c     | 18 ++++++++++++
 xen/drivers/vpci/msi.c        | 14 ++++++++--
 xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
 xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
 xen/include/xen/pci.h         |  1 +
 7 files changed, 129 insertions(+), 13 deletions(-)

Comments

Roger Pau Monné July 20, 2023, 11:20 a.m. UTC | #1
On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Use a previously introduced per-domain read/write lock to check
> whether vpci is present, so we are sure there are no accesses to the
> contents of the vpci struct if not. This lock can be used (and in a
> few cases is used right away) so that vpci removal can be performed
> while holding the lock in write mode. Previously such removal could
> race with vpci_read for example.

This I think needs to state the locking order of the per-domain
pci_lock wrt the vpci->lock.  AFAICT that's d->pci_lock first, then
vpci->lock.

> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
> from being removed.
> 
> 2. Writing the command register and ROM BAR register may trigger
> modify_bars to run, which in turn may access multiple pdevs while
> checking for the existing BAR's overlap. The overlapping check, if
> done under the read lock, requires vpci->lock to be acquired on both
> devices being compared, which may produce a deadlock. It is not
> possible to upgrade read lock to write lock in such a case. So, in
> order to prevent the deadlock, use d->pci_lock instead. To prevent
> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
> always lock hwdom first.
> 
> All other code, which doesn't lead to pdev->vpci destruction and does
> not access multiple pdevs at the same time, can still use a
> combination of the read lock and pdev->vpci->lock.
> 
> 3. Drop const qualifier where the new rwlock is used and this is
> appropriate.
> 
> 4. Do not call process_pending_softirqs with any locks held. For that
> unlock prior the call and re-acquire the locks after. After
> re-acquiring the lock there is no need to check if pdev->vpci exists:
>  - in apply_map because of the context it is called (no race condition
>    possible)
>  - for MSI/MSI-X debug code because it is called at the end of
>    pdev->vpci access and no further access to pdev->vpci is made

I assume that's vpci_msix_arch_print(): there are further accesses to
pdev->vpci, but those use the msix local variable, which holds a copy
of the pointer in pdev->vpci->msix, so that last sentence is not true
I'm afraid.

However the code already tries to cater for the pdev going away, and
hence it's IMO fine.  IOW: your change doesn't make this any better or
worse.

> 
> 5. Introduce pcidevs_trylock, so there is a possibility to try locking
> the pcidev's lock.

I'm confused by this addition, the more so that it's not used anywhere.  Can
you defer the addition until the patch that makes use of it?

> 
> 6. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
> while accessing pdevs in vpci code.
> 
> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
> Suggested-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
> 
> ---
> 
> Changes in v8:
>  - changed d->vpci_lock to d->pci_lock
>  - introducing d->pci_lock in a separate patch
>  - extended locked region in vpci_process_pending
>  - removed pcidevs_lock in vpci_dump_msi()
>  - removed some changes as they are not needed with
>    the new locking scheme
>  - added handling for hwdom && dom_xen case
> ---
>  xen/arch/x86/hvm/vmsi.c       |  4 +++
>  xen/drivers/passthrough/pci.c |  7 +++++
>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>  xen/drivers/vpci/msi.c        | 14 ++++++++--
>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>  xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
>  xen/include/xen/pci.h         |  1 +
>  7 files changed, 129 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 3cd4923060..8c1bd66b9c 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>  {
>      unsigned int i;
>  
> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
> +
>      for ( i = 0; i < msix->max_entries; i++ )
>      {
>          const struct vpci_msix_entry *entry = &msix->entries[i];
> @@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>              struct pci_dev *pdev = msix->pdev;
>  
>              spin_unlock(&msix->pdev->vpci->lock);
> +            read_unlock(&pdev->domain->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&pdev->domain->pci_lock);

This should be a read_trylock(), much like the spin_trylock() below.

>              /* NB: we assume that pdev cannot go away for an alive domain. */
>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>                  return -EBUSY;
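
To make the suggestion concrete, a hedged sketch of how the reacquisition
could look, mirroring the spin_trylock() handling quoted below (note that
the lock state on the error paths is exactly what the follow-up discussion
further down is about):

    spin_unlock(&msix->pdev->vpci->lock);
    read_unlock(&pdev->domain->pci_lock);
    process_pending_softirqs();
    /* Bail out rather than block, as the spin_trylock() below does. */
    if ( !read_trylock(&pdev->domain->pci_lock) )
        return -EBUSY;
    if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
        return -EBUSY;
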
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 5b4632ead2..6f8692cd9c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>      spin_lock_recursive(&_pcidevs_lock);
>  }
>  
> +int pcidevs_trylock(void)
> +{
> +    return spin_trylock_recursive(&_pcidevs_lock);
> +}
> +
>  void pcidevs_unlock(void)
>  {
>      spin_unlock_recursive(&_pcidevs_lock);
> @@ -1144,7 +1149,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>      } while ( devfn != pdev->devfn &&
>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>  
> +    write_lock(&ctxt->d->pci_lock);
>      err = vpci_add_handlers(pdev);
> +    write_unlock(&ctxt->d->pci_lock);
>      if ( err )
>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>                 ctxt->d->domain_id, err);
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index b41556d007..2780fcae72 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -152,6 +152,7 @@ bool vpci_process_pending(struct vcpu *v)
>          if ( rc == -ERESTART )
>              return true;
>  
> +        write_lock(&v->domain->pci_lock);
>          spin_lock(&v->vpci.pdev->vpci->lock);
>          /* Disable memory decoding unconditionally on failure. */
>          modify_decoding(v->vpci.pdev,
> @@ -170,6 +171,7 @@ bool vpci_process_pending(struct vcpu *v)
>               * failure.
>               */
>              vpci_remove_device(v->vpci.pdev);
> +        write_unlock(&v->domain->pci_lock);
>      }

The handling in vpci_process_pending() wrt vpci_remove_device() is
racy and will need some thinking to get it solved.  Your change
doesn't make it any worse, but I would also be fine with adding a note
in the commit message that vpci_process_pending() is not adjusted to
use the new lock because it needs to be reworked first in order to be
safe against a concurrent vpci_remove_device() call.

>  
>      return false;
> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>      struct map_data data = { .d = d, .map = true };
>      int rc;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> +    {
> +        /*
> +         * It's safe to drop and reacquire the lock in this context
> +         * without risking pdev disappearing because devices cannot be
> +         * removed until the initial domain has been started.
> +         */
> +        read_unlock(&d->pci_lock);
>          process_pending_softirqs();
> +        read_lock(&d->pci_lock);
> +    }

Since this is init-only code you could likely forego the usage of the
locks, but I guess that's more churn than just using them.  In any
case, as this gets called from modify_bars() the locks need to be
dropped/taken in write mode (see comment below).

>      rangeset_destroy(mem);
>      if ( !rc )
>          modify_decoding(pdev, cmd, false);
> @@ -223,6 +237,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>      unsigned int i;
>      int rc;
>  
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));

The lock here needs to be taken in write mode I think, so the code can
safely iterate over the contents of each pdev->vpci assigned to the
domain.
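
To illustrate why write mode matters here, a hypothetical simplification of
the overlap scan (tmp and the loop body are stand-ins for the real
rangeset-based logic; the reasoning is the one Roger gives above):

    /*
     * With d->pci_lock held in write mode no pdev->vpci of the domain
     * can be freed, and no BAR fields can change while we scan.
     */
    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

    for_each_pdev ( pdev->domain, tmp )
    {
        if ( tmp == pdev || !tmp->vpci )
            continue;
        /* Safe to inspect tmp->vpci->header.bars: writers are excluded. */
        /* ... check for overlap with pdev's BARs ... */
    }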

> +
>      if ( !mem )
>          return -ENOMEM;
>  
> @@ -502,6 +518,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>      struct vpci_bar *bars = header->bars;
>      int rc;
>  
> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
> +
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
>      case PCI_HEADER_TYPE_NORMAL:
> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
> index 8f2b59e61a..e63152c224 100644
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>      uint16_t control;
>      int ret;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

I'm confused by the difference in lock requirements between
init_bars() and init_msi().  In the former you assert for the lock
being taken in read mode, while the latter asserts for write mode.

We want to do initialization in write mode, so that modify_bars()
called by init_bars() has exclusive access to the contents of
pdev->vpci.

> +
>      if ( !pos )
>          return 0;
>  
> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>  
>  void vpci_dump_msi(void)
>  {
> -    const struct domain *d;
> +    struct domain *d;
>  
>      rcu_read_lock(&domlist_read_lock);
>      for_each_domain ( d )
> @@ -277,6 +279,9 @@ void vpci_dump_msi(void)
>  
>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>  
> +        if ( !read_trylock(&d->pci_lock) )
> +            continue;
> +
>          for_each_pdev ( d, pdev )
>          {
>              const struct vpci_msi *msi;
> @@ -318,14 +323,17 @@ void vpci_dump_msi(void)
>                       * holding the lock.
>                       */
>                      printk("unable to print all MSI-X entries: %d\n", rc);
> -                    process_pending_softirqs();
> -                    continue;
> +                    goto pdev_done;
>                  }
>              }
>  
>              spin_unlock(&pdev->vpci->lock);
> + pdev_done:
> +            read_unlock(&d->pci_lock);
>              process_pending_softirqs();
> +            read_lock(&d->pci_lock);

read_trylock().

This is not very safe, as the list could be modified while the lock is
dropped, but it's a debug key handler so I'm not very concerned.
However we should at least add a comment that this relies on the list
not being altered while the lock is dropped.
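
Putting both remarks together, the reacquisition might become something
like the following (a sketch only; the domain_done label past the final
read_unlock() is hypothetical):

            read_unlock(&d->pci_lock);
            process_pending_softirqs();
            /*
             * NB: this relies on the domain's pdev list not being
             * altered while the lock was dropped; tolerable for a
             * debug key handler.
             */
            if ( !read_trylock(&d->pci_lock) )
                goto domain_done;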

>          }
> +        read_unlock(&d->pci_lock);
>      }
>      rcu_read_unlock(&domlist_read_lock);
>  }
> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
> index 25bde77586..9481274579 100644
> --- a/xen/drivers/vpci/msix.c
> +++ b/xen/drivers/vpci/msix.c
> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  {
>      struct vpci_msix *msix;
>  
> +    ASSERT(rw_is_locked(&d->pci_lock));

Hm, here you are iterating over pdev->vpci->header.bars for multiple
devices, so I think in addition to the pci_lock in read mode we should
also take the vpci->lock for each pdev.

I think I would like to rework msix_find() so it's msix_get() and
returns with the appropriate vpci->lock taken.  Anyway, that's for a
different patch, the usage of the lock in read mode seems correct,
albeit I might want to move the read_lock() call inside of msix_get()
in the future.
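
A rough shape of that future msix_get() (a hedged sketch; msix_in_range()
is a stand-in for the existing BAR-based check):

    /* Find the MSI-X table at addr; returns with its vpci->lock held. */
    static struct vpci_msix *msix_get(const struct domain *d,
                                      unsigned long addr)
    {
        struct vpci_msix *msix;

        ASSERT(rw_is_locked(&d->pci_lock));

        list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
        {
            /* Lock the device before inspecting its BAR fields. */
            spin_lock(&msix->pdev->vpci->lock);
            if ( msix_in_range(msix, addr) )
                return msix;              /* caller must spin_unlock() */
            spin_unlock(&msix->pdev->vpci->lock);
        }

        return NULL;
    }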

> +
>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>      {
>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>  
>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>  {
> -    return !!msix_find(v->domain, addr);
> +    int rc;
> +
> +    read_lock(&v->domain->pci_lock);
> +    rc = !!msix_find(v->domain, addr);
> +    read_unlock(&v->domain->pci_lock);
> +
> +    return rc;
>  }
>  
>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
> @@ -358,21 +366,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_read(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      const struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
>      *data = ~0ul;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_read(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_read(d, msix, addr, len, data);

Nit: missing newline (here and below).

> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -404,6 +425,7 @@ static int cf_check msix_read(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -491,19 +513,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>  static int cf_check msix_write(
>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>  {
> -    const struct domain *d = v->domain;
> -    struct vpci_msix *msix = msix_find(d, addr);
> +    struct domain *d = v->domain;
> +    struct vpci_msix *msix;
>      struct vpci_msix_entry *entry;
>      unsigned int offset;
>  
> +    read_lock(&d->pci_lock);
> +
> +    msix = msix_find(d, addr);
>      if ( !msix )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_RETRY;
> +    }
>  
>      if ( adjacent_handle(msix, addr) )
> -        return adjacent_write(d, msix, addr, len, data);
> +    {
> +        int rc = adjacent_write(d, msix, addr, len, data);
> +        read_unlock(&d->pci_lock);
> +        return rc;
> +    }
>  
>      if ( !access_allowed(msix->pdev, addr, len) )
> +    {
> +        read_unlock(&d->pci_lock);
>          return X86EMUL_OKAY;
> +    }
>  
>      spin_lock(&msix->pdev->vpci->lock);
>      entry = get_entry(msix, addr);
> @@ -579,6 +614,7 @@ static int cf_check msix_write(
>          break;
>      }
>      spin_unlock(&msix->pdev->vpci->lock);
> +    read_unlock(&d->pci_lock);
>  
>      return X86EMUL_OKAY;
>  }
> @@ -665,6 +701,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>      struct vpci_msix *msix;
>      int rc;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>                                        PCI_CAP_ID_MSIX);
>      if ( !msix_offset )
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index d73fa76302..f22cbf2112 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>  
>  void vpci_remove_device(struct pci_dev *pdev)
>  {
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>          return;
>  
> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>      const unsigned long *ro_map;
>      int rc = 0;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
>      if ( !has_vpci(pdev->domain) )
>          return 0;
>  
> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>  
>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
>      uint32_t data = ~(uint32_t)0;
> +    rwlock_t *lock;
>  
>      if ( !size )
>      {
> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>       * Find the PCI dev matching the address, which for hwdom also requires
>       * consulting DomXEN.  Passthrough everything that's not trapped.
>       */
> +    lock = &d->pci_lock;
> +    read_lock(lock);
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
> +    {
> +        read_unlock(lock);
> +        lock = &dom_xen->pci_lock;
> +        read_lock(lock);
>          pdev = pci_get_pdev(dom_xen, sbdf);
> +    }
>      if ( !pdev || !pdev->vpci )
> +    {
> +        read_unlock(lock);
>          return vpci_read_hw(sbdf, reg, size);
> +    }
>  
>      spin_lock(&pdev->vpci->lock);
>  
> @@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    read_unlock(lock);
>  
>      if ( data_offset < size )
>      {
> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>               r->private);
>  }
>  
> +/* Helper function to unlock locks taken by vpci_write in proper order */
> +static void unlock_locks(struct domain *d)
> +{
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
> +    if ( is_hardware_domain(d) )
> +    {
> +        ASSERT(rw_is_locked(&d->pci_lock));
> +        read_unlock(&dom_xen->pci_lock);
> +    }
> +    read_unlock(&d->pci_lock);
> +}
> +
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.
> +     * If this is hwdom, we need to hold locks for both domain in case if
> +     * modify_bars is called()

Typo: the () wants to be at the end of modify_bars().

>       */
> +    read_lock(&d->pci_lock);
> +
> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> +    if ( is_hardware_domain(d) )
> +        read_lock(&dom_xen->pci_lock);

For modify_bars() we also want the locks to be in write mode (at least
the hw one), so that the position of the BARs can't be changed while
modify_bars() is iterating over them.

Is this something that will be done in a followup change?

> +
>      pdev = pci_get_pdev(d, sbdf);
>      if ( !pdev && is_hardware_domain(d) )
>          pdev = pci_get_pdev(dom_xen, sbdf);
> @@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>          if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
>              vpci_write_hw(sbdf, reg, size, data);
> +
> +        unlock_locks(d);
>          return;
>      }
>  
> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    unlock_locks(d);

There's one issue here: some handlers will call pcidevs_lock(), which
will result in a lock order inversion, as in the previous patch we
agreed that the locking order was pcidevs_lock first, d->pci_lock
after.

For example the MSI control_write() handler will call
vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
have to look into using a dedicated lock for MSI related handling, as
that's the only place where I think we have this pattern of taking the
pcidevs_lock after the d->pci_lock.
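
To spell out the inversion (a hedged sketch; the call chains are the ones
named above, the wrapper functions are mine):

    /* Agreed order from the previous patch: pcidevs_lock -> d->pci_lock. */
    static void assignment_path(struct domain *d)
    {
        pcidevs_lock();                /* 1st */
        write_lock(&d->pci_lock);      /* 2nd */
        /* ... */
        write_unlock(&d->pci_lock);
        pcidevs_unlock();
    }

    /* vpci_write() path in this patch: the order ends up inverted. */
    static void config_write_path(struct domain *d)
    {
        read_lock(&d->pci_lock);       /* 1st */
        /* control_write() -> vpci_msi_arch_enable() ... */
        pcidevs_lock();                /* 2nd: ABBA against the above */
        /* ... */
        pcidevs_unlock();
        read_unlock(&d->pci_lock);
    }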

Thanks, Roger.
Jan Beulich July 20, 2023, 1:27 p.m. UTC | #2
On 20.07.2023 13:20, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>  
>>      /*
>>       * Find the PCI dev matching the address, which for hwdom also requires
>> -     * consulting DomXEN.  Passthrough everything that's not trapped.
>> +     * consulting DomXEN. Passthrough everything that's not trapped.
>> +     * If this is hwdom, we need to hold locks for both domain in case if
>> +     * modify_bars is called()
> 
> Typo: the () wants to be at the end of modify_bars().
> 
>>       */
>> +    read_lock(&d->pci_lock);
>> +
>> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
>> +    if ( is_hardware_domain(d) )
>> +        read_lock(&dom_xen->pci_lock);
> 
> For modify_bars() we also want the locks to be in write mode (at least
> the hw one), so that the position of the BARs can't be changed while
> modify_bars() is iterating over them.

Isn't changing of the BARs happening under the vpci lock? Or else I guess
I haven't understood the description correctly: My reading so far was
that it is only the presence (allocation status / pointer validity) that
is protected by this new lock.

Jan
Roger Pau Monné July 20, 2023, 1:50 p.m. UTC | #3
On Thu, Jul 20, 2023 at 03:27:29PM +0200, Jan Beulich wrote:
> On 20.07.2023 13:20, Roger Pau Monné wrote:
> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>  
> >>      /*
> >>       * Find the PCI dev matching the address, which for hwdom also requires
> >> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> >> +     * consulting DomXEN. Passthrough everything that's not trapped.
> >> +     * If this is hwdom, we need to hold locks for both domain in case if
> >> +     * modify_bars is called()
> > 
> > Typo: the () wants to be at the end of modify_bars().
> > 
> >>       */
> >> +    read_lock(&d->pci_lock);
> >> +
> >> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> >> +    if ( is_hardware_domain(d) )
> >> +        read_lock(&dom_xen->pci_lock);
> > 
> > For modify_bars() we also want the locks to be in write mode (at least
> > the hw one), so that the position of the BARs can't be changed while
> > modify_bars() is iterating over them.
> 
> Isn't changing of the BARs happening under the vpci lock?

It is.

> Or else I guess
> I haven't understood the description correctly: My reading so far was
> that it is only the presence (allocation status / pointer validity) that
> is protected by this new lock.

Hm, I see, yes.  I guess it was a previous patch version that also
took care of the modify_bars() issue by taking the lock in exclusive
mode here.

We can always do that later, so forget about that comment (for now).

Thanks, Roger.
Jan Beulich July 20, 2023, 3:53 p.m. UTC | #4
On 20.07.2023 13:20, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> @@ -318,14 +323,17 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */

Note the comment here.

>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>  
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            read_unlock(&d->pci_lock);
>>              process_pending_softirqs();
>> +            read_lock(&d->pci_lock);
> 
> read_trylock().

Plus the same scheme as with the spin lock wants following imo:
vpci_msix_arch_print() returns an error only with (now) both locks
dropped. This then wants reflecting in the comment pointed out
above.

Jan
Jan Beulich July 20, 2023, 4:03 p.m. UTC | #5
On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> --- a/xen/drivers/vpci/msi.c
> +++ b/xen/drivers/vpci/msi.c
> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>      uint16_t control;
>      int ret;
>  
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));

I'm afraid I have to ask the opposite question, compared to Roger's:
Why do you need the lock held for write here (and in init_msix())?
Neither list of devices nor the pdev->vpci pointer are being altered.

Jan
Jan Beulich July 20, 2023, 4:09 p.m. UTC | #6
On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>               r->private);
>  }
>  
> +/* Helper function to unlock locks taken by vpci_write in proper order */
> +static void unlock_locks(struct domain *d)
> +{
> +    ASSERT(rw_is_locked(&d->pci_lock));
> +
> +    if ( is_hardware_domain(d) )
> +    {
> +        ASSERT(rw_is_locked(&d->pci_lock));

Copy-and-paste mistake? You've asserted this same condition already above.

> +        read_unlock(&dom_xen->pci_lock);
> +    }
> +    read_unlock(&d->pci_lock);
> +}
> +
>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
>  {
> -    const struct domain *d = current->domain;
> +    struct domain *d = current->domain;
>      const struct pci_dev *pdev;
>      const struct vpci_register *r;
>      unsigned int data_offset = 0;
> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>  
>      /*
>       * Find the PCI dev matching the address, which for hwdom also requires
> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> +     * consulting DomXEN. Passthrough everything that's not trapped.
> +     * If this is hwdom, we need to hold locks for both domain in case if
> +     * modify_bars is called()
>       */
> +    read_lock(&d->pci_lock);
> +
> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> +    if ( is_hardware_domain(d) )
> +        read_lock(&dom_xen->pci_lock);

But I wonder anyway - can we perhaps get away without acquiring dom_xen's
lock here? Its list isn't altered anymore post-boot, iirc.

> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>          ASSERT(data_offset < size);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +    unlock_locks(d);

In this context the question arises whether the function wouldn't better
be named more specifically to its purpose: it's obvious here that it doesn't
unlock all the locks involved.

Jan
Roger Pau Monné July 20, 2023, 4:14 p.m. UTC | #7
On Thu, Jul 20, 2023 at 06:03:49PM +0200, Jan Beulich wrote:
> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
> > --- a/xen/drivers/vpci/msi.c
> > +++ b/xen/drivers/vpci/msi.c
> > @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
> >      uint16_t control;
> >      int ret;
> >  
> > +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> 
> I'm afraid I have to ask the opposite question, compared to Roger's:
> Why do you need the lock held for write here (and in init_msix())?
> Neither list of devices nor the pdev->vpci pointer are being
> altered.

This is called from vpci_add_handlers(), which will acquire the lock in
write mode (or requires being called with it held) in order to set
pdev->vpci, I would assume.  Strictly speaking however the init
handlers don't require the lock in write mode unless we use such
locking to get exclusive access to the BAR arrays of all the devices
assigned to the domain for modify_bars().

Thanks, Roger.
Jan Beulich July 21, 2023, 6:02 a.m. UTC | #8
On 20.07.2023 18:14, Roger Pau Monné wrote:
> On Thu, Jul 20, 2023 at 06:03:49PM +0200, Jan Beulich wrote:
>> On 20.07.2023 02:32, Volodymyr Babchuk wrote:
>>> --- a/xen/drivers/vpci/msi.c
>>> +++ b/xen/drivers/vpci/msi.c
>>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>>>      uint16_t control;
>>>      int ret;
>>>  
>>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>>
>> I'm afraid I have to ask the opposite question, compared to Roger's:
>> Why do you need the lock held for write here (and in init_msix())?
>> Neither list of devices nor the pdev->vpci pointer are being
>> altered.
> 
> This is called from vpci_add_handlers() which will acquire (or
> requires being called) with the lock in write mode in order to set
> pdev->vpci I would assume.

Right.

>  Strictly speaking however the init
> handlers don't require the lock in write mode unless we use such
> locking to get exclusive access to all the devices assigned to the
> domain BARs array for modify_bars().

Aiui in the present model modify_bars() has to use the vpci lock for
protection. Therefore imo in any of the init functions the assertions
should either express the real requirements of those functions, or be
omitted on the basis that they're all called out of add-handlers
anyway.

Jan
Roger Pau Monné July 21, 2023, 7:43 a.m. UTC | #9
On Fri, Jul 21, 2023 at 08:02:33AM +0200, Jan Beulich wrote:
> On 20.07.2023 18:14, Roger Pau Monné wrote:
> >  Strictly speaking however the init
> > handlers don't require the lock in write mode unless we use such
> > locking to get exclusive access to all the devices assigned to the
> > domain BARs array for modify_bars().
> 
> Aiui in the present model modify_bars() has to use the vpci lock for
> protection.

But the current protection is insufficient, as we only hold the vpci
lock of the current device, but we don't hold the vpci lock of the
other devices when we iterate over them in order to find overlapping
BARs (or else it would be an ABBA deadlock situation).

So my suggestion (which can be done later) is to take the newly
introduced per-domain rwlock in exclusive mode for modify_bars() in
order to assert there are no changes to the other devices' vpci BAR
fields.

> Therefore imo in any of the init functions the assertions
> should either express the real requirements of those functions, or be
> omitted on the basis that they're all called out of add-handlers
> anyway.

I'm happy to omit for the time being.  Iff we agree that modify_bars()
requires the rwlock in exclusive mode then we could add the assertion
there, but not in the init functions themselves.

Thanks, Roger.
Jan Beulich July 21, 2023, 8:48 a.m. UTC | #10
On 21.07.2023 09:43, Roger Pau Monné wrote:
> On Fri, Jul 21, 2023 at 08:02:33AM +0200, Jan Beulich wrote:
>> On 20.07.2023 18:14, Roger Pau Monné wrote:
>>>  Strictly speaking however the init
>>> handlers don't require the lock in write mode unless we use such
>>> locking to get exclusive access to all the devices assigned to the
>>> domain BARs array for modify_bars().
>>
>> Aiui in the present model modify_bars() has to use the vpci lock for
>> protection.
> 
> But the current protection is insufficient, as we only hold the vpci
> lock of the current device, but we don't hold the vpci lock of the
> other devices when we iterate over in order to find overlapping bars
> (or else it wold be an ABBA deadlock situation).
> 
> So my suggestion (which can be done later) is to take the newly
> introduced per-domain rwlock in exclusive mode for modify_bars() in
> order to assert there are no changes to the other devices vpci bar
> fields.

I think this makes sense, just that it doesn't belong in this patch.

Jan

>> Therefore imo in any of the init functions the assertions
>> should either express the real requirements of those functions, or be
>> omitted on the basis that they're all called out of add-handlers
>> anyway.
> 
> I'm happy to omit for the time being.  Iff we agree that modify_bars()
> requires the rwlock in exclusive mode then we could add the assertion
> there, but not in the init functions themselves.
> 
> Thanks, Roger.
Volodymyr Babchuk July 24, 2023, 12:07 a.m. UTC | #11
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jul 20, 2023 at 03:27:29PM +0200, Jan Beulich wrote:
>> On 20.07.2023 13:20, Roger Pau Monné wrote:
>> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> >> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> >>  
>> >>      /*
>> >>       * Find the PCI dev matching the address, which for hwdom also requires
>> >> -     * consulting DomXEN.  Passthrough everything that's not trapped.
>> >> +     * consulting DomXEN. Passthrough everything that's not trapped.
>> >> +     * If this is hwdom, we need to hold locks for both domain in case if
>> >> +     * modify_bars is called()
>> > 
>> > Typo: the () wants to be at the end of modify_bars().
>> > 
>> >>       */
>> >> +    read_lock(&d->pci_lock);
>> >> +
>> >> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
>> >> +    if ( is_hardware_domain(d) )
>> >> +        read_lock(&dom_xen->pci_lock);
>> > 
>> > For modify_bars() we also want the locks to be in write mode (at least
>> > the hw one), so that the position of the BARs can't be changed while
>> > modify_bars() is iterating over them.
>> 
>> Isn't changing of the BARs happening under the vpci lock?
>
> It is.
>
>> Or else I guess
>> I haven't understood the description correctly: My reading so far was
>> that it is only the presence (allocation status / pointer validity) that
>> is protected by this new lock.
>
> Hm, I see, yes.  I guess it was a previous patch version that also
> took care of the modify_bars() issue by taking the lock in exclusive
> mode here.
>
> We can always do that later, so forget about that comment (for now).

Are you sure? I'd rather rework the code to use the write lock in
modify_bars(). This is why we began this whole journey in the first place.
Roger Pau Monné July 24, 2023, 7:59 a.m. UTC | #12
On Mon, Jul 24, 2023 at 12:07:48AM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Thu, Jul 20, 2023 at 03:27:29PM +0200, Jan Beulich wrote:
> >> On 20.07.2023 13:20, Roger Pau Monné wrote:
> >> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> >> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >> >>  
> >> >>      /*
> >> >>       * Find the PCI dev matching the address, which for hwdom also requires
> >> >> -     * consulting DomXEN.  Passthrough everything that's not trapped.
> >> >> +     * consulting DomXEN. Passthrough everything that's not trapped.
> >> >> +     * If this is hwdom, we need to hold locks for both domain in case if
> >> >> +     * modify_bars is called()
> >> > 
> >> > Typo: the () wants to be at the end of modify_bars().
> >> > 
> >> >>       */
> >> >> +    read_lock(&d->pci_lock);
> >> >> +
> >> >> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
> >> >> +    if ( is_hardware_domain(d) )
> >> >> +        read_lock(&dom_xen->pci_lock);
> >> > 
> >> > For modify_bars() we also want the locks to be in write mode (at least
> >> > the hw one), so that the position of the BARs can't be changed while
> >> > modify_bars() is iterating over them.
> >> 
> >> Isn't changing of the BARs happening under the vpci lock?
> >
> > It is.
> >
> >> Or else I guess
> >> I haven't understood the description correctly: My reading so far was
> >> that it is only the presence (allocation status / pointer validity) that
> >> is protected by this new lock.
> >
> > Hm, I see, yes.  I guess it was a previous patch version that also
> > took care of the modify_bars() issue by taking the lock in exclusive
> > mode here.
> >
> > We can always do that later, so forget about that comment (for now).
> 
> Are you sure? I'd rather rework the code to use write lock in the
> modify_bars(). This is why we began all this journey in the first place.

Well, I was just saying that it doesn't need to be done in this same
patch, it can be done as a followup if that's preferred, but one way
or another we need to deal with it.

I'm fine if you want to adjust the commit message and do the change in
this same patch.

Thanks, Roger.
Volodymyr Babchuk July 26, 2023, 1:17 a.m. UTC | #13
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> 
>> Use a previously introduced per-domain read/write lock to check
>> whether vpci is present, so we are sure there are no accesses to the
>> contents of the vpci struct if not. This lock can be used (and in a
>> few cases is used right away) so that vpci removal can be performed
>> while holding the lock in write mode. Previously such removal could
>> race with vpci_read for example.
>
> This I think needs to state the locking order of the per-domain
> pci_lock wrt the vpci->lock.  AFAICT that's d->pci_lock first, then
> vpci->lock.

Will add, thanks.

>> 1. Per-domain's pci_rwlock is used to protect pdev->vpci structure
>> from being removed.
>> 
>> 2. Writing the command register and ROM BAR register may trigger
>> modify_bars to run, which in turn may access multiple pdevs while
>> checking for the existing BAR's overlap. The overlapping check, if
>> done under the read lock, requires vpci->lock to be acquired on both
>> devices being compared, which may produce a deadlock. It is not
>> possible to upgrade read lock to write lock in such a case. So, in
>> order to prevent the deadlock, use d->pci_lock instead. To prevent
>> deadlock while locking both hwdom->pci_lock and dom_xen->pci_lock,
>> always lock hwdom first.
>> 
>> All other code, which doesn't lead to pdev->vpci destruction and does
>> not access multiple pdevs at the same time, can still use a
>> combination of the read lock and pdev->vpci->lock.
>> 
>> 3. Drop const qualifier where the new rwlock is used and this is
>> appropriate.
>> 
>> 4. Do not call process_pending_softirqs with any locks held. For that
>> unlock prior the call and re-acquire the locks after. After
>> re-acquiring the lock there is no need to check if pdev->vpci exists:
>>  - in apply_map because of the context it is called (no race condition
>>    possible)
>>  - for MSI/MSI-X debug code because it is called at the end of
>>    pdev->vpci access and no further access to pdev->vpci is made
>
> I assume that's vpci_msix_arch_print(): there are further accesses to
> pdev->vpci, but those use the msix local variable, which holds a copy
> of the pointer in pdev->vpci->msix, so that last sentence is not true
> I'm afraid.

Yes, I see. I am wondering if we can remember the sbdf and call
pci_get_pdev() after re-acquiring the lock. Of course, there is a slight
chance that we will get another pdev with the same sbdf...
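
A sketch of that idea (hedged; as noted above, the re-looked-up pdev may
not be the same device any more):

    struct domain *d = pdev->domain;
    pci_sbdf_t sbdf = pdev->sbdf;  /* remember the address, not the pointer */

    spin_unlock(&pdev->vpci->lock);
    read_unlock(&d->pci_lock);
    process_pending_softirqs();
    read_lock(&d->pci_lock);

    /* Re-look-up: the old pdev pointer must not be trusted any more. */
    pdev = pci_get_pdev(d, sbdf);
    if ( !pdev || !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
        return -EBUSY;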

> However the code already try to cater for the pdev going away, and
> hence it's IMO fine.  IOW: your change doesn't make this any better or
> worse.
>
>> 
>> 5. Introduce pcidevs_trylock, so there is a possibility to try locking
>> the pcidev's lock.
>
> I'm confused by this addition, the more that's no used anywhere.  Can
> you defer the addition until the patch that makes use of it?
>

Yup. This is another rebasing artifact. There were users for this
function in the previous version of the patch.

>> 
>> 6. Use d->pci_lock around for_each_pdev and pci_get_pdev_by_domain
>> while accessing pdevs in vpci code.
>> 
>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
>> Suggested-by: Jan Beulich <jbeulich@suse.com>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
>> 
>> ---
>> 
>> Changes in v8:
>>  - changed d->vpci_lock to d->pci_lock
>>  - introducing d->pci_lock in a separate patch
>>  - extended locked region in vpci_process_pending
>>  - removed pcidevs_lock in vpci_dump_msi()
>>  - removed some changes as they are not needed with
>>    the new locking scheme
>>  - added handling for hwdom && dom_xen case
>> ---
>>  xen/arch/x86/hvm/vmsi.c       |  4 +++
>>  xen/drivers/passthrough/pci.c |  7 +++++
>>  xen/drivers/vpci/header.c     | 18 ++++++++++++
>>  xen/drivers/vpci/msi.c        | 14 ++++++++--
>>  xen/drivers/vpci/msix.c       | 52 ++++++++++++++++++++++++++++++-----
>>  xen/drivers/vpci/vpci.c       | 46 +++++++++++++++++++++++++++++--
>>  xen/include/xen/pci.h         |  1 +
>>  7 files changed, 129 insertions(+), 13 deletions(-)
>> 
>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
>> index 3cd4923060..8c1bd66b9c 100644
>> --- a/xen/arch/x86/hvm/vmsi.c
>> +++ b/xen/arch/x86/hvm/vmsi.c
>> @@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>  {
>>      unsigned int i;
>>  
>> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
>> +
>>      for ( i = 0; i < msix->max_entries; i++ )
>>      {
>>          const struct vpci_msix_entry *entry = &msix->entries[i];
>> @@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>              struct pci_dev *pdev = msix->pdev;
>>  
>>              spin_unlock(&msix->pdev->vpci->lock);
>> +            read_unlock(&pdev->domain->pci_lock);
>>              process_pending_softirqs();
>> +            read_lock(&pdev->domain->pci_lock);
>
> This should be a read_trylock(), much like the spin_trylock() below.

vpci_dump_msi() expects that vpci_msix_arch_print() will return holding
this lock. I can rework both functions, of course. But then we will be in
a situation where we need to know the exact behavior of vpci_dump_msi()
wrt locks in the calling code...

>
>>              /* NB: we assume that pdev cannot go away for an alive domain. */
>>              if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
>>                  return -EBUSY;
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 5b4632ead2..6f8692cd9c 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -57,6 +57,11 @@ void pcidevs_lock(void)
>>      spin_lock_recursive(&_pcidevs_lock);
>>  }
>>  
>> +int pcidevs_trylock(void)
>> +{
>> +    return spin_trylock_recursive(&_pcidevs_lock);
>> +}
>> +
>>  void pcidevs_unlock(void)
>>  {
>>      spin_unlock_recursive(&_pcidevs_lock);
>> @@ -1144,7 +1149,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
>>      } while ( devfn != pdev->devfn &&
>>                PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
>>  
>> +    write_lock(&ctxt->d->pci_lock);
>>      err = vpci_add_handlers(pdev);
>> +    write_unlock(&ctxt->d->pci_lock);
>>      if ( err )
>>          printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
>>                 ctxt->d->domain_id, err);
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index b41556d007..2780fcae72 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -152,6 +152,7 @@ bool vpci_process_pending(struct vcpu *v)
>>          if ( rc == -ERESTART )
>>              return true;
>>  
>> +        write_lock(&v->domain->pci_lock);
>>          spin_lock(&v->vpci.pdev->vpci->lock);
>>          /* Disable memory decoding unconditionally on failure. */
>>          modify_decoding(v->vpci.pdev,
>> @@ -170,6 +171,7 @@ bool vpci_process_pending(struct vcpu *v)
>>               * failure.
>>               */
>>              vpci_remove_device(v->vpci.pdev);
>> +        write_unlock(&v->domain->pci_lock);
>>      }
>
> The handling in vpci_process_pending() wrt vpci_remove_device() is
> racy and will need some thinking to get it solved.  Your change
> doesn't make it any worse, but I would also be fine with adding a note
> in the commit message that vpci_process_pending() is not adjusted to
> use the new lock because it needs to be reworked first in order to be
> safe against a concurrent vpci_remove_device() call.

It is racy because we are accessing v->vpci.pdev->vpci, I see. At least
we can check that it is not NULL...

But the problem is broader, of course, as the vpci struct could be
destroyed and created anew. This raises the question of whether we should
delay vpci_process_pending() in the first place.
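
The minimal guard mentioned above might look like this (a sketch only; as
said, it narrows the race but does not address vpci being recreated):

    write_lock(&v->domain->pci_lock);
    if ( !v->vpci.pdev->vpci )
    {
        /* vpci was torn down while the mapping work was preempted. */
        write_unlock(&v->domain->pci_lock);
        return false;
    }
    spin_lock(&v->vpci.pdev->vpci->lock);
    /* ... */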

>>  
>>      return false;
>> @@ -181,8 +183,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>      struct map_data data = { .d = d, .map = true };
>>      int rc;
>>  
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>> +
>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>> +    {
>> +        /*
>> +         * It's safe to drop and reacquire the lock in this context
>> +         * without risking pdev disappearing because devices cannot be
>> +         * removed until the initial domain has been started.
>> +         */
>> +        read_unlock(&d->pci_lock);
>>          process_pending_softirqs();
>> +        read_lock(&d->pci_lock);
>> +    }
>
> Since this is init only code you could likely forego the usage of the
> locks, but I guess that's more churn than just using them.  In any
> case, as this gets called from modify_bars() the locks need to be
> dropped/taken in write mode (see comment below).
>
>>      rangeset_destroy(mem);
>>      if ( !rc )
>>          modify_decoding(pdev, cmd, false);
>> @@ -223,6 +237,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
>>      unsigned int i;
>>      int rc;
>>  
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>
> The lock here needs to be taken in write mode I think, so the code can
> safely iterate over the contents of each pdev->vpci assigned to the
> domain.
>

Yep, reworked.

>> +
>>      if ( !mem )
>>          return -ENOMEM;
>>  
>> @@ -502,6 +518,8 @@ static int cf_check init_bars(struct pci_dev *pdev)
>>      struct vpci_bar *bars = header->bars;
>>      int rc;
>>  
>> +    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
>> +
>>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>      {
>>      case PCI_HEADER_TYPE_NORMAL:
>> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
>> index 8f2b59e61a..e63152c224 100644
>> --- a/xen/drivers/vpci/msi.c
>> +++ b/xen/drivers/vpci/msi.c
>> @@ -190,6 +190,8 @@ static int cf_check init_msi(struct pci_dev *pdev)
>>      uint16_t control;
>>      int ret;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>
> I'm confused by the difference in lock requirements between
> init_bars() and init_msi().  In the former you assert for the lock
> being taken in read mode, while the later asserts for write mode.
>
> We want to do initialization in write mode, so that modify_bars()
> called by init_bars() has exclusive access to the contents of
> pdev->vpci.
>

Taking into account your discussion with Jan, I removed this ASSERT altogether.

>> +
>>      if ( !pos )
>>          return 0;
>>  
>> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
>>  
>>  void vpci_dump_msi(void)
>>  {
>> -    const struct domain *d;
>> +    struct domain *d;
>>  
>>      rcu_read_lock(&domlist_read_lock);
>>      for_each_domain ( d )
>> @@ -277,6 +279,9 @@ void vpci_dump_msi(void)
>>  
>>          printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
>>  
>> +        if ( !read_trylock(&d->pci_lock) )
>> +            continue;
>> +
>>          for_each_pdev ( d, pdev )
>>          {
>>              const struct vpci_msi *msi;
>> @@ -318,14 +323,17 @@ void vpci_dump_msi(void)
>>                       * holding the lock.
>>                       */
>>                      printk("unable to print all MSI-X entries: %d\n", rc);
>> -                    process_pending_softirqs();
>> -                    continue;
>> +                    goto pdev_done;
>>                  }
>>              }
>>  
>>              spin_unlock(&pdev->vpci->lock);
>> + pdev_done:
>> +            read_unlock(&d->pci_lock);
>>              process_pending_softirqs();
>> +            read_lock(&d->pci_lock);
>
> read_trylock().
>
> This is not very safe, as the list could be modified while the lock is
> dropped, but it's a debug key handler so I'm not very concerned.
> However we should at least add a comment that this relies on the list
> not being altered while the lock is dropped.

Added, thanks.

>
>>          }
>> +        read_unlock(&d->pci_lock);
>>      }
>>      rcu_read_unlock(&domlist_read_lock);
>>  }
>> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
>> index 25bde77586..9481274579 100644
>> --- a/xen/drivers/vpci/msix.c
>> +++ b/xen/drivers/vpci/msix.c
>> @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  {
>>      struct vpci_msix *msix;
>>  
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>
> Hm, here you are iterating over pdev->vpci->header.bars for multiple
> devices, so I think in addition to the pci_lock in read mode we should
> also take the vpci->lock for each pdev.
>
> I think I would like to rework msix_find() so it's msix_get() and
> returns with the appropriate vpci->lock taken.  Anyway, that's for a
> different patch, the usage of the lock in read mode seems correct,
> albeit I might want to move the read_lock() call inside of msix_get()
> in the future.
>
>> +
>>      list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
>>      {
>>          const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
>> @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
>>  
>>  static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
>>  {
>> -    return !!msix_find(v->domain, addr);
>> +    int rc;
>> +
>> +    read_lock(&v->domain->pci_lock);
>> +    rc = !!msix_find(v->domain, addr);
>> +    read_unlock(&v->domain->pci_lock);
>> +
>> +    return rc;
>>  }
>>  
>>  static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
>> @@ -358,21 +366,34 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_read(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      const struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>>      *data = ~0ul;
>>  
>> +    read_lock(&d->pci_lock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_read(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_read(d, msix, addr, len, data);
>
> Nit: missing newline (here and below).
>
>> +        read_unlock(&d->pci_lock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -404,6 +425,7 @@ static int cf_check msix_read(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->pci_lock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -491,19 +513,32 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
>>  static int cf_check msix_write(
>>      struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
>>  {
>> -    const struct domain *d = v->domain;
>> -    struct vpci_msix *msix = msix_find(d, addr);
>> +    struct domain *d = v->domain;
>> +    struct vpci_msix *msix;
>>      struct vpci_msix_entry *entry;
>>      unsigned int offset;
>>  
>> +    read_lock(&d->pci_lock);
>> +
>> +    msix = msix_find(d, addr);
>>      if ( !msix )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_RETRY;
>> +    }
>>  
>>      if ( adjacent_handle(msix, addr) )
>> -        return adjacent_write(d, msix, addr, len, data);
>> +    {
>> +        int rc = adjacent_write(d, msix, addr, len, data);
>> +        read_unlock(&d->pci_lock);
>> +        return rc;
>> +    }
>>  
>>      if ( !access_allowed(msix->pdev, addr, len) )
>> +    {
>> +        read_unlock(&d->pci_lock);
>>          return X86EMUL_OKAY;
>> +    }
>>  
>>      spin_lock(&msix->pdev->vpci->lock);
>>      entry = get_entry(msix, addr);
>> @@ -579,6 +614,7 @@ static int cf_check msix_write(
>>          break;
>>      }
>>      spin_unlock(&msix->pdev->vpci->lock);
>> +    read_unlock(&d->pci_lock);
>>  
>>      return X86EMUL_OKAY;
>>  }
>> @@ -665,6 +701,8 @@ static int cf_check init_msix(struct pci_dev *pdev)
>>      struct vpci_msix *msix;
>>      int rc;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
>>                                        PCI_CAP_ID_MSIX);
>>      if ( !msix_offset )
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index d73fa76302..f22cbf2112 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -38,6 +38,8 @@ extern vpci_register_init_t *const __end_vpci_array[];
>>  
>>  void vpci_remove_device(struct pci_dev *pdev)
>>  {
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) || !pdev->vpci )
>>          return;
>>  
>> @@ -73,6 +75,8 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>      const unsigned long *ro_map;
>>      int rc = 0;
>>  
>> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
>> +
>>      if ( !has_vpci(pdev->domain) )
>>          return 0;
>>  
>> @@ -326,11 +330,12 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
>>  
>>  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>>      uint32_t data = ~(uint32_t)0;
>> +    rwlock_t *lock;
>>  
>>      if ( !size )
>>      {
>> @@ -342,11 +347,21 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>       * Find the PCI dev matching the address, which for hwdom also requires
>>       * consulting DomXEN.  Passthrough everything that's not trapped.
>>       */
>> +    lock = &d->pci_lock;
>> +    read_lock(lock);
>>      pdev = pci_get_pdev(d, sbdf);
>>      if ( !pdev && is_hardware_domain(d) )
>> +    {
>> +        read_unlock(lock);
>> +        lock = &dom_xen->pci_lock;
>> +        read_lock(lock);
>>          pdev = pci_get_pdev(dom_xen, sbdf);
>> +    }
>>      if ( !pdev || !pdev->vpci )
>> +    {
>> +        read_unlock(lock);
>>          return vpci_read_hw(sbdf, reg, size);
>> +    }
>>  
>>      spin_lock(&pdev->vpci->lock);
>>  
>> @@ -392,6 +407,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>          ASSERT(data_offset < size);
>>      }
>>      spin_unlock(&pdev->vpci->lock);
>> +    read_unlock(lock);
>>  
>>      if ( data_offset < size )
>>      {
>> @@ -431,10 +447,23 @@ static void vpci_write_helper(const struct pci_dev *pdev,
>>               r->private);
>>  }
>>  
>> +/* Helper function to unlock locks taken by vpci_write in proper order */
>> +static void unlock_locks(struct domain *d)
>> +{
>> +    ASSERT(rw_is_locked(&d->pci_lock));
>> +
>> +    if ( is_hardware_domain(d) )
>> +    {
>> +        ASSERT(rw_is_locked(&dom_xen->pci_lock));
>> +        read_unlock(&dom_xen->pci_lock);
>> +    }
>> +    read_unlock(&d->pci_lock);
>> +}
>> +
>>  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>                  uint32_t data)
>>  {
>> -    const struct domain *d = current->domain;
>> +    struct domain *d = current->domain;
>>      const struct pci_dev *pdev;
>>      const struct vpci_register *r;
>>      unsigned int data_offset = 0;
>> @@ -447,8 +476,16 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>  
>>      /*
>>       * Find the PCI dev matching the address, which for hwdom also requires
>> -     * consulting DomXEN.  Passthrough everything that's not trapped.
>> +     * consulting DomXEN. Passthrough everything that's not trapped.
>> +     * If this is hwdom, we need to hold locks for both domain in case if
>> +     * modify_bars is called()
>
> Typo: the () wants to be at the end of modify_bars().
>
>>       */
>> +    read_lock(&d->pci_lock);
>> +
>> +    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
>> +    if ( is_hardware_domain(d) )
>> +        read_lock(&dom_xen->pci_lock);
>
> For modify_bars() we also want the locks to be in write mode (at least
> the hw one), so that the position of the BARs can't be changed while
> modify_bars() is iterating over them.
>
> Is this something that will be done in a followup change?

I'll do it in this change.
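
Something along these lines, I suppose (an untested sketch, not the
final patch; it simply takes the locks in write mode in vpci_write(),
which is heavier than strictly needed but keeps modify_bars() safe):

    /* Write mode so modify_bars() can iterate over other devices' BARs. */
    write_lock(&d->pci_lock);
    /* dom_xen->pci_lock always taken second to prevent deadlock. */
    if ( is_hardware_domain(d) )
        write_lock(&dom_xen->pci_lock);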

>
>> +
>>      pdev = pci_get_pdev(d, sbdf);
>>      if ( !pdev && is_hardware_domain(d) )
>>          pdev = pci_get_pdev(dom_xen, sbdf);
>> @@ -459,6 +496,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>  
>>          if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
>>              vpci_write_hw(sbdf, reg, size, data);
>> +
>> +        unlock_locks(d);
>>          return;
>>      }
>>  
>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>          ASSERT(data_offset < size);
>>      }
>>      spin_unlock(&pdev->vpci->lock);
>> +    unlock_locks(d);
>
> There's one issue here, some handlers will call pcidevs_lock(), which
> will result in a lock order inversion, as in the previous patch we
> agreed that the locking order was pcidevs_lock first, d->pci_lock
> after.
>
> For example the MSI control_write() handler will call
> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
> have to look into using a dedicated lock for MSI related handling, as
> that's the only place where I think we have this pattern of taking the
> pcidevs_lock after the d->pci_lock.

I'll mention this in the commit message. Is there something else that I
should do right now?
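
For the record, the inversion path in question, as I understand it:

    vpci_write()                  /* takes d->pci_lock (read mode) */
      -> control_write()          /* vPCI MSI handler */
        -> vpci_msi_arch_enable()
          -> pcidevs_lock()       /* taken after d->pci_lock: inversion */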
Jan Beulich July 26, 2023, 6:39 a.m. UTC | #14
On 26.07.2023 03:17, Volodymyr Babchuk wrote:
> Roger Pau Monné <roger.pau@citrix.com> writes:
>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>> --- a/xen/arch/x86/hvm/vmsi.c
>>> +++ b/xen/arch/x86/hvm/vmsi.c
>>> @@ -895,6 +895,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>>  {
>>>      unsigned int i;
>>>  
>>> +    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
>>> +
>>>      for ( i = 0; i < msix->max_entries; i++ )
>>>      {
>>>          const struct vpci_msix_entry *entry = &msix->entries[i];
>>> @@ -913,7 +915,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
>>>              struct pci_dev *pdev = msix->pdev;
>>>  
>>>              spin_unlock(&msix->pdev->vpci->lock);
>>> +            read_unlock(&pdev->domain->pci_lock);
>>>              process_pending_softirqs();
>>> +            read_lock(&pdev->domain->pci_lock);
>>
>> This should be a read_trylock(), much like the spin_trylock() below.
> 
> vpci_dump_msi() expects that vpci_msix_arch_print() will return holding
> this lock. I can rework both functions, of course. But then we will be
> in a situation where we need to know the exact behavior of vpci_dump_msi()
> wrt locks in the calling code...

Your reply sounds as if you hadn't seen my earlier suggestion on this
matter (making the behavior match also for the now 2nd lock involved).
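
I.e. something along these lines (untested, merely to illustrate the
shape I have in mind):

            spin_unlock(&msix->pdev->vpci->lock);
            read_unlock(&pdev->domain->pci_lock);
            process_pending_softirqs();
            if ( !read_trylock(&pdev->domain->pci_lock) )
                return -EBUSY;
            if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
            {
                read_unlock(&pdev->domain->pci_lock);
                return -EBUSY;
            }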

Jan
Roger Pau Monné July 26, 2023, 9:35 a.m. UTC | #15
On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>          ASSERT(data_offset < size);
> >>      }
> >>      spin_unlock(&pdev->vpci->lock);
> >> +    unlock_locks(d);
> >
> > There's one issue here, some handlers will call pcidevs_lock(), which
> > will result in a lock order inversion, as in the previous patch we
> > agreed that the locking order was pcidevs_lock first, d->pci_lock
> > after.
> >
> > For example the MSI control_write() handler will call
> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
> > have to look into using a dedicated lock for MSI related handling, as
> > that's the only place where I think we have this pattern of taking the
> > pcidevs_lock after the d->pci_lock.
> 
> I'll mention this in the commit message. Is there something else that I
> should do right now?

Well, I don't think we want to commit this as-is with a known lock
inversion.

The functions that require the pcidevs lock are:

pt_irq_{create,destroy}_bind()
unmap_domain_pirq()

AFAICT those functions require the lock in order to assert that the
underlying device doesn't go away, as they do also use d->event_lock
in order to get exclusive access to the data fields.  Please double
check that I'm not mistaken.

If that's accurate you will have to check the call tree that spawns
from those functions in order to modify the asserts to check for
either the pcidevs or the per-domain pci_list lock being taken.
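
IOW turn the existing ASSERT(pcidevs_locked()) instances into something
like (just a sketch):

    ASSERT(pcidevs_locked() || rw_is_locked(&d->pci_lock));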

Thanks, Roger.
Volodymyr Babchuk July 27, 2023, 12:56 a.m. UTC | #16
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>> 
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> >>          ASSERT(data_offset < size);
>> >>      }
>> >>      spin_unlock(&pdev->vpci->lock);
>> >> +    unlock_locks(d);
>> >
>> > There's one issue here, some handlers will call pcidevs_lock(), which
>> > will result in a lock order inversion, as in the previous patch we
>> > agreed that the locking order was pcidevs_lock first, d->pci_lock
>> > after.
>> >
>> > For example the MSI control_write() handler will call
>> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>> > have to look into using a dedicated lock for MSI related handling, as
>> > that's the only place where I think we have this pattern of taking the
>> > pcidevs_lock after the d->pci_lock.
>> 
>> I'll mention this in the commit message. Is there something else that I
>> should do right now?
>
> Well, I don't think we want to commit this as-is with a known lock
> inversion.
>
> The functions that require the pcidevs lock are:
>
> pt_irq_{create,destroy}_bind()
> unmap_domain_pirq()
>
> AFAICT those functions require the lock in order to assert that the
> underlying device doesn't go away, as they do also use d->event_lock
> in order to get exclusive access to the data fields.  Please double
> check that I'm not mistaken.

You are right, all three functions do not access any PCI state
directly. However...

> If that's accurate you will have to check the call tree that spawns
> from those functions in order to modify the asserts to check for
> either the pcidevs or the per-domain pci_list lock being taken.

... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one call that
bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
vmx_pi_update_irte():

amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().

Both functions read basic pdev fields like sbdf or type. I see no
problem there, as values of those fields are not supposed to be changed.
Also, those functions use their own locks to protect shared state. But as the
IO-MMU code is quite convoluted, it is hard to be sure that it is safe to call
those functions without holding pcidevs_lock. All I can say is that those
functions and their callees have no ASSERT(pcidevs_locked()).
Jan Beulich July 27, 2023, 7:41 a.m. UTC | #17
On 27.07.2023 02:56, Volodymyr Babchuk wrote:
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
>> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>>>
>>> Hi Roger,
>>>
>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>
>>>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>          ASSERT(data_offset < size);
>>>>>      }
>>>>>      spin_unlock(&pdev->vpci->lock);
>>>>> +    unlock_locks(d);
>>>>
>>>> There's one issue here, some handlers will call pcidevs_lock(), which
>>>> will result in a lock order inversion, as in the previous patch we
>>>> agreed that the locking order was pcidevs_lock first, d->pci_lock
>>>> after.
>>>>
>>>> For example the MSI control_write() handler will call
>>>> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>>>> have to look into using a dedicated lock for MSI related handling, as
>>>> that's the only place where I think we have this pattern of taking the
>>>> pcidevs_lock after the d->pci_lock.
>>>
>>> I'll mention this in the commit message. Is there something else that I
>>> should do right now?
>>
>> Well, I don't think we want to commit this as-is with a known lock
>> inversion.
>>
>> The functions that require the pcidevs lock are:
>>
>> pt_irq_{create,destroy}_bind()
>> unmap_domain_pirq()
>>
>> AFAICT those functions require the lock in order to assert that the
>> underlying device doesn't go away, as they do also use d->event_lock
>> in order to get exclusive access to the data fields.  Please double
>> check that I'm not mistaken.
> 
> You are right, all three functions do not access any PCI state
> directly. However...
> 
>> If that's accurate you will have to check the call tree that spawns
>> from those functions in order to modify the asserts to check for
>> either the pcidevs or the per-domain pci_list lock being taken.
> 
> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one call that
> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
> vmx_pi_update_irte():
> 
> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
> 
> Both functions read basic pdev fields like sbdf or type. I see no
> problem there, as values of those fields are not supposed to be changed.

But whether fields are basic or will never change doesn't matter when
the pdev struct itself suddenly disappears.

Jan

> Also, those functions use their own locks to protect shared state. But as the
> IO-MMU code is quite convoluted, it is hard to be sure that it is safe to call
> those functions without holding pcidevs_lock. All I can say is that those
> functions and their callees have no ASSERT(pcidevs_locked()).
>
Volodymyr Babchuk July 27, 2023, 10:31 a.m. UTC | #18
Hi Jan

Jan Beulich <jbeulich@suse.com> writes:

> On 27.07.2023 02:56, Volodymyr Babchuk wrote:
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>>> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>>>>
>>>> Hi Roger,
>>>>
>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>
>>>>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>          ASSERT(data_offset < size);
>>>>>>      }
>>>>>>      spin_unlock(&pdev->vpci->lock);
>>>>>> +    unlock_locks(d);
>>>>>
>>>>> There's one issue here, some handlers will call pcidevs_lock(), which
>>>>> will result in a lock order inversion, as in the previous patch we
>>>>> agreed that the locking order was pcidevs_lock first, d->pci_lock
>>>>> after.
>>>>>
>>>>> For example the MSI control_write() handler will call
>>>>> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>>>>> have to look into using a dedicated lock for MSI related handling, as
>>>>> that's the only place where I think we have this pattern of taking the
>>>>> pcidevs_lock after the d->pci_lock.
>>>>
>>>> I'll mention this in the commit message. Is there something else that I
>>>> should do right now?
>>>
>>> Well, I don't think we want to commit this as-is with a known lock
>>> inversion.
>>>
>>> The functions that require the pcidevs lock are:
>>>
>>> pt_irq_{create,destroy}_bind()
>>> unmap_domain_pirq()
>>>
>>> AFAICT those functions require the lock in order to assert that the
>>> underlying device doesn't go away, as they do also use d->event_lock
>>> in order to get exclusive access to the data fields.  Please double
>>> check that I'm not mistaken.
>> 
>> You are right, all three functions do not access any PCI state
>> directly. However...
>> 
>>> If that's accurate you will have to check the call tree that spawns
>>> from those functions in order to modify the asserts to check for
>>> either the pcidevs or the per-domain pci_list lock being taken.
>> 
>> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one call that
>> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
>> vmx_pi_update_irte():
>> 
>> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
>> 
>> Both functions read basic pdev fields like sbdf or type. I see no
>> problem there, as values of those fields are not supposed to be changed.
>
> But whether fields are basic or will never change doesn't matter when
> the pdev struct itself suddenly disappears.

This is not a problem, as it is expected that d->pci_lock is being held,
so the pdev structure will not disappear. I am trying to answer another
question: is d->pci_lock enough, or is pcidevs_lock also required?
Jan Beulich July 27, 2023, 11:37 a.m. UTC | #19
On 27.07.2023 12:31, Volodymyr Babchuk wrote:
> 
> Hi Jan
> 
> Jan Beulich <jbeulich@suse.com> writes:
> 
>> On 27.07.2023 02:56, Volodymyr Babchuk wrote:
>>> Hi Roger,
>>>
>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>
>>>> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>>>>>
>>>>> Hi Roger,
>>>>>
>>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>>
>>>>>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>>          ASSERT(data_offset < size);
>>>>>>>      }
>>>>>>>      spin_unlock(&pdev->vpci->lock);
>>>>>>> +    unlock_locks(d);
>>>>>>
>>>>>> There's one issue here, some handlers will call pcidevs_lock(), which
>>>>>> will result in a lock order inversion, as in the previous patch we
>>>>>> agreed that the locking order was pcidevs_lock first, d->pci_lock
>>>>>> after.
>>>>>>
>>>>>> For example the MSI control_write() handler will call
>>>>>> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>>>>>> have to look into using a dedicated lock for MSI related handling, as
>>>>>> that's the only place where I think we have this pattern of taking the
>>>>>> pcidevs_lock after the d->pci_lock.
>>>>>
>>>>> I'll mention this in the commit message. Is there something else that I
>>>>> should do right now?
>>>>
>>>> Well, I don't think we want to commit this as-is with a known lock
>>>> inversion.
>>>>
>>>> The functions that require the pcidevs lock are:
>>>>
>>>> pt_irq_{create,destroy}_bind()
>>>> unmap_domain_pirq()
>>>>
>>>> AFAICT those functions require the lock in order to assert that the
>>>> underlying device doesn't go away, as they do also use d->event_lock
>>>> in order to get exclusive access to the data fields.  Please double
>>>> check that I'm not mistaken.
>>>
>>> You are right, all three functions do not access any PCI state
>>> directly. However...
>>>
>>>> If that's accurate you will have to check the call tree that spawns
>>>> from those functions in order to modify the asserts to check for
>>>> either the pcidevs or the per-domain pci_list lock being taken.
>>>
>>> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one call that
>>> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
>>> vmx_pi_update_irte():
>>>
>>> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
>>>
>>> Both functions read basic pdev fields like sbdf or type. I see no
>>> problem there, as values of those fields are not supposed to be changed.
>>
>> But whether fields are basic or will never change doesn't matter when
>> the pdev struct itself suddenly disappears.
> 
> This is not a problem, as it is expected that d->pci_lock is being held,
> so the pdev structure will not disappear. I am trying to answer another
> question: is d->pci_lock enough, or is pcidevs_lock also required?

To answer such questions, may I ask that you first firmly write down
(and submit) what each of the locks guards?

Jan
Roger Pau Monné July 27, 2023, 12:42 p.m. UTC | #20
On Thu, Jul 27, 2023 at 12:56:54AM +0000, Volodymyr Babchuk wrote:
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
> >> 
> >> Hi Roger,
> >> 
> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> 
> >> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >> >>          ASSERT(data_offset < size);
> >> >>      }
> >> >>      spin_unlock(&pdev->vpci->lock);
> >> >> +    unlock_locks(d);
> >> >
> >> > There's one issue here, some handlers will call pcidevs_lock(), which
> >> > will result in a lock order inversion, as in the previous patch we
> >> > agreed that the locking order was pcidevs_lock first, d->pci_lock
> >> > after.
> >> >
> >> > For example the MSI control_write() handler will call
> >> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
> >> > have to look into using a dedicated lock for MSI related handling, as
> >> > that's the only place where I think we have this pattern of taking the
> >> > pcidevs_lock after the d->pci_lock.
> >> 
> >> I'll mention this in the commit message. Is there something else that I
> >> should do right now?
> >
> > Well, I don't think we want to commit this as-is with a known lock
> > inversion.
> >
> > The functions that require the pcidevs lock are:
> >
> > pt_irq_{create,destroy}_bind()
> > unmap_domain_pirq()
> >
> > AFAICT those functions require the lock in order to assert that the
> > underlying device doesn't go away, as they do also use d->event_lock
> > in order to get exclusive access to the data fields.  Please double
> > check that I'm not mistaken.
> 
> You are right, all three functions do not access any PCI state
> directly. However...
> 
> > If that's accurate you will have to check the call tree that spawns
> > from those functions in order to modify the asserts to check for
> > either the pcidevs or the per-domain pci_list lock being taken.
> 
> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one call that
> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
> vmx_pi_update_irte():
> 
> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().

That path is only for VT-d, so strictly speaking you only need to worry
about msi_msg_write_remap_rte().

msi_msg_write_remap_rte() does take the IOMMU intremap lock.

There are also existing callers of iommu_update_ire_from_msi() that
call the functions without the pcidevs locked.  See
hpet_msi_set_affinity() for example.

Thanks, Roger.
Jan Beulich July 27, 2023, 12:56 p.m. UTC | #21
On 27.07.2023 14:42, Roger Pau Monné wrote:
> There are also existing callers of iommu_update_ire_from_msi() that
> call the functions without the pcidevs locked.  See
> hpet_msi_set_affinity() for example.

Ftaod first and foremost because there's no pdev in that case.

Jan
Roger Pau Monné July 27, 2023, 2:43 p.m. UTC | #22
On Thu, Jul 27, 2023 at 02:56:18PM +0200, Jan Beulich wrote:
> On 27.07.2023 14:42, Roger Pau Monné wrote:
> > There are also existing callers of iommu_update_ire_from_msi() that
> > call the functions without the pcidevs locked.  See
> > hpet_msi_set_affinity() for example.
> 
> Ftaod first and foremost because there's no pdev in that case.

Likewise for (mostly?) the rest of the callers, as callers of the
.set_affinity hw_irq_controller hook don't have a PCI device at
hand.

Thanks, Roger.
Volodymyr Babchuk July 27, 2023, 3:13 p.m. UTC | #23
Jan,

Jan Beulich <jbeulich@suse.com> writes:

> On 27.07.2023 12:31, Volodymyr Babchuk wrote:
>> 
>> Hi Jan
>> 
>> Jan Beulich <jbeulich@suse.com> writes:
>> 
>>> On 27.07.2023 02:56, Volodymyr Babchuk wrote:
>>>> Hi Roger,
>>>>
>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>
>>>>> On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>>>>>>
>>>>>> Hi Roger,
>>>>>>
>>>>>> Roger Pau Monné <roger.pau@citrix.com> writes:
>>>>>>
>>>>>>> On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>>>>          ASSERT(data_offset < size);
>>>>>>>>      }
>>>>>>>>      spin_unlock(&pdev->vpci->lock);
>>>>>>>> +    unlock_locks(d);
>>>>>>>
>>>>>>> There's one issue here, some handlers will call pcidevs_lock(), which
>>>>>>> will result in a lock order inversion, as in the previous patch we
>>>>>>> agreed that the locking order was pcidevs_lock first, d->pci_lock
>>>>>>> after.
>>>>>>>
>>>>>>> For example the MSI control_write() handler will call
>>>>>>> vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>>>>>>> have to look into using a dedicated lock for MSI related handling, as
>>>>>>> that's the only place where I think we have this pattern of taking the
>>>>>>> pcidevs_lock after the d->pci_lock.
>>>>>>
>>>>>> I'll mention this in the commit message. Is there something else that I
>>>>>> should do right now?
>>>>>
>>>>> Well, I don't think we want to commit this as-is with a known lock
>>>>> inversion.
>>>>>
>>>>> The functions that require the pcidevs lock are:
>>>>>
>>>>> pt_irq_{create,destroy}_bind()
>>>>> unmap_domain_pirq()
>>>>>
>>>>> AFAICT those functions require the lock in order to assert that the
>>>>> underlying device doesn't go away, as they do also use d->event_lock
>>>>> in order to get exclusive access to the data fields.  Please double
>>>>> check that I'm not mistaken.
>>>>
>>>> You are right, all three functions do not access any PCI state
>>>> directly. However...
>>>>
>>>>> If that's accurate you will have to check the call tree that spawns
>>>>> from those functions in order to modify the asserts to check for
>>>>> either the pcidevs or the per-domain pci_list lock being taken.
>>>>
>>>> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one call that
>>>> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
>>>> vmx_pi_update_irte():
>>>>
>>>> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
>>>>
>>>> Both functions read basic pdev fields like sbdf or type. I see no
>>>> problem there, as values of those fields are not supposed to be changed.
>>>
>>> But whether fields are basic or will never change doesn't matter when
>>> the pdev struct itself suddenly disappears.
>> 
>> This is not a problem, as it is expected that d->pci_lock is being held,
>> so the pdev structure will not disappear. I am trying to answer another
>> question: is d->pci_lock enough, or is pcidevs_lock also required?
>
> To answer such questions, may I ask that you first firmly write down
> (and submit) what each of the locks guards?

I can do this for the newly introduced lock. So domain->pci_lock guards:

1. domain->pcidevs_list. This means that PCI devices can't be added to
or removed from a domain while the lock is held in read mode. As a
byproduct, any pdev assigned to a domain can't be deleted, because we
need to deassign it first. To modify domain->pcidevs_list we need to
hold both d->pci_lock in write mode and pcidevs_lock.

2. Presence of the pdev->vpci struct for any pdev assigned to a domain.
The structure itself is protected by pdev->vpci->lock, but to add or
remove pdev->vpci itself we need to hold d->pci_lock in write mode.

As for pcidevs_lock, AFAIK there are no strictly written rules about
what exactly is protected by this lock.
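
If it helps, this is roughly how I would document the new field (a draft
based strictly on the above, to be submitted properly):

    /*
     * Protects:
     *  - the list of PCI devices assigned to the domain: holding the
     *    lock in read mode guarantees that no device is added to or
     *    removed from the domain (and hence that no assigned pdev can
     *    be freed); modifying the list requires the lock in write mode
     *    plus pcidevs_lock;
     *  - the presence of pdev->vpci for any pdev assigned to the
     *    domain: allocating or removing the vpci struct requires the
     *    lock in write mode, while the contents are protected by
     *    pdev->vpci->lock.
     */
    rwlock_t pci_lock;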
Volodymyr Babchuk July 28, 2023, 12:21 a.m. UTC | #24
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Thu, Jul 27, 2023 at 12:56:54AM +0000, Volodymyr Babchuk wrote:
>> Hi Roger,
>> 
>> Roger Pau Monné <roger.pau@citrix.com> writes:
>> 
>> > On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
>> >> 
>> >> Hi Roger,
>> >> 
>> >> Roger Pau Monné <roger.pau@citrix.com> writes:
>> >> 
>> >> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
>> >> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> >> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>> >> >>          ASSERT(data_offset < size);
>> >> >>      }
>> >> >>      spin_unlock(&pdev->vpci->lock);
>> >> >> +    unlock_locks(d);
>> >> >
> >> >> > There's one issue here, some handlers will call pcidevs_lock(), which
> >> >> > will result in a lock order inversion, as in the previous patch we
>> >> > agreed that the locking order was pcidevs_lock first, d->pci_lock
>> >> > after.
>> >> >
>> >> > For example the MSI control_write() handler will call
>> >> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
>> >> > have to look into using a dedicated lock for MSI related handling, as
>> >> > that's the only place where I think we have this pattern of taking the
>> >> > pcidevs_lock after the d->pci_lock.
>> >> 
>> >> I'll mention this in the commit message. Is there something else that I
>> >> should do right now?
>> >
>> > Well, I don't think we want to commit this as-is with a known lock
>> > inversion.
>> >
>> > The functions that require the pcidevs lock are:
>> >
>> > pt_irq_{create,destroy}_bind()
>> > unmap_domain_pirq()
>> >
>> > AFAICT those functions require the lock in order to assert that the
>> > underlying device doesn't go away, as they do also use d->event_lock
>> > in order to get exclusive access to the data fields.  Please double
>> > check that I'm not mistaken.
>> 
> >> You are right, all three functions do not access any PCI state
>> directly. However...
>> 
>> > If that's accurate you will have to check the call tree that spawns
>> > from those functions in order to modify the asserts to check for
>> > either the pcidevs or the per-domain pci_list lock being taken.
>> 
> >> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one call that
>> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
>> vmx_pi_update_irte():
>> 
>> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
>
> That path is only for VT-d, so strictly speaking you only need to worry
> about msi_msg_write_remap_rte().
>
> msi_msg_write_remap_rte() does take the IOMMU intremap lock.
>
> There are also existing callers of iommu_update_ire_from_msi() that
> call the functions without the pcidevs locked.  See
> hpet_msi_set_affinity() for example.

Thank you for clarifying this.

I have found another call path that causes trouble:
__pci_enable_msi[x] is called from pci_enable_msi() via vMSI, via
physdev_map_irq, and also directly from the ns16550 driver.

__pci_enable_msi[x] accesses pdev fields, mostly pdev->msix or
pdev->msi_list, so it looks like we need pcidevs_lock(), as pdev fields
are not protected by d->pci_lock...
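
The paths I am looking at, abridged (please correct me if I got the
call chain wrong):

    physdev_map_irq / vPCI MSI handlers
      -> map_domain_pirq()           /* d->event_lock held */
        -> pci_enable_msi()
          -> __pci_enable_msi{,x}()  /* walks pdev->msi_list / pdev->msix */

    ns16550 driver
      -> pci_enable_msi()            /* direct call, init time */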
Roger Pau Monné July 28, 2023, 1:55 p.m. UTC | #25
On Fri, Jul 28, 2023 at 12:21:54AM +0000, Volodymyr Babchuk wrote:
> 
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> > On Thu, Jul 27, 2023 at 12:56:54AM +0000, Volodymyr Babchuk wrote:
> >> Hi Roger,
> >> 
> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> 
> >> > On Wed, Jul 26, 2023 at 01:17:58AM +0000, Volodymyr Babchuk wrote:
> >> >> 
> >> >> Hi Roger,
> >> >> 
> >> >> Roger Pau Monné <roger.pau@citrix.com> writes:
> >> >> 
> >> >> > On Thu, Jul 20, 2023 at 12:32:31AM +0000, Volodymyr Babchuk wrote:
> >> >> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> >> >> @@ -498,6 +537,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >> >> >>          ASSERT(data_offset < size);
> >> >> >>      }
> >> >> >>      spin_unlock(&pdev->vpci->lock);
> >> >> >> +    unlock_locks(d);
> >> >> >
> >> >> > There's one issue here, some handlers will call pcidevs_lock(), which
> >> >> > will result in a lock order inversion, as in the previous patch we
> >> >> > agreed that the locking order was pcidevs_lock first, d->pci_lock
> >> >> > after.
> >> >> >
> >> >> > For example the MSI control_write() handler will call
> >> >> > vpci_msi_arch_enable() which takes the pcidevs lock.  I think I will
> >> >> > have to look into using a dedicated lock for MSI related handling, as
> >> >> > that's the only place where I think we have this pattern of taking the
> >> >> > pcidevs_lock after the d->pci_lock.
> >> >> 
> >> >> I'll mention this in the commit message. Is there something else that I
> >> >> should do right now?
> >> >
> >> > Well, I don't think we want to commit this as-is with a known lock
> >> > inversion.
> >> >
> >> > The functions that require the pcidevs lock are:
> >> >
> >> > pt_irq_{create,destroy}_bind()
> >> > unmap_domain_pirq()
> >> >
> >> > AFAICT those functions require the lock in order to assert that the
> >> > underlying device doesn't go away, as they do also use d->event_lock
> >> > in order to get exclusive access to the data fields.  Please double
> >> > check that I'm not mistaken.
> >> 
> >> You are right, all three functions do not access any PCI state
> >> directly. However...
> >> 
> >> > If that's accurate you will have to check the call tree that spawns
> >> > from those functions in order to modify the asserts to check for
> >> > either the pcidevs or the per-domain pci_list lock being taken.
> >> 
> >> ... I checked the calls for the PT_IRQ_TYPE_MSI case; there is only one call that
> >> bothers me: hvm_pi_update_irte(), which calls IO-MMU code via
> >> vmx_pi_update_irte():
> >> 
> >> amd_iommu_msi_msg_update_ire() or msi_msg_write_remap_rte().
> >
> > That path is only for VT-d, so strictly speaking you only need to worry
> > about msi_msg_write_remap_rte().
> >
> > msi_msg_write_remap_rte() does take the IOMMU intremap lock.
> >
> > There are also existing callers of iommu_update_ire_from_msi() that
> > call the functions without the pcidevs locked.  See
> > hpet_msi_set_affinity() for example.
> 
> Thank you for clarifying this.
> 
> I have found another call path that causes trouble:
> __pci_enable_msi[x] is called from pci_enable_msi() via vMSI, via
> physdev_map_irq, and also directly from the ns16550 driver.

Both vPCI and physdev_map_irq() use the same path: map_domain_pirq(),
which gets called with d->event_lock taken in exclusive mode; that
should be enough, as a device cannot be assigned to multiple guests.

ns16550_init_postirq() is an init function, which means it won't be
executed after Xen has booted, so I think this is all fine, as
concurrent accesses from ns16550_init_postirq() and map_domain_pirq()
are impossible.

Thanks, Roger.
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 3cd4923060..8c1bd66b9c 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -895,6 +895,8 @@  int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -913,7 +915,9 @@  int vpci_msix_arch_print(const struct vpci_msix *msix)
             struct pci_dev *pdev = msix->pdev;
 
             spin_unlock(&msix->pdev->vpci->lock);
+            read_unlock(&pdev->domain->pci_lock);
             process_pending_softirqs();
+            read_lock(&pdev->domain->pci_lock);
             /* NB: we assume that pdev cannot go away for an alive domain. */
             if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
                 return -EBUSY;
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 5b4632ead2..6f8692cd9c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -57,6 +57,11 @@  void pcidevs_lock(void)
     spin_lock_recursive(&_pcidevs_lock);
 }
 
+int pcidevs_trylock(void)
+{
+    return spin_trylock_recursive(&_pcidevs_lock);
+}
+
 void pcidevs_unlock(void)
 {
     spin_unlock_recursive(&_pcidevs_lock);
@@ -1144,7 +1149,9 @@  static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt,
     } while ( devfn != pdev->devfn &&
               PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) );
 
+    write_lock(&ctxt->d->pci_lock);
     err = vpci_add_handlers(pdev);
+    write_unlock(&ctxt->d->pci_lock);
     if ( err )
         printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n",
                ctxt->d->domain_id, err);
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index b41556d007..2780fcae72 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -152,6 +152,7 @@  bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        write_lock(&v->domain->pci_lock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
@@ -170,6 +171,7 @@  bool vpci_process_pending(struct vcpu *v)
              * failure.
              */
             vpci_remove_device(v->vpci.pdev);
+        write_unlock(&v->domain->pci_lock);
     }
 
     return false;
@@ -181,8 +183,20 @@  static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
     struct map_data data = { .d = d, .map = true };
     int rc;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+    {
+        /*
+         * It's safe to drop and reacquire the lock in this context
+         * without risking pdev disappearing because devices cannot be
+         * removed until the initial domain has been started.
+         */
+        read_unlock(&d->pci_lock);
         process_pending_softirqs();
+        read_lock(&d->pci_lock);
+    }
+
     rangeset_destroy(mem);
     if ( !rc )
         modify_decoding(pdev, cmd, false);
@@ -223,6 +237,8 @@  static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
     unsigned int i;
     int rc;
 
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     if ( !mem )
         return -ENOMEM;
 
@@ -502,6 +518,8 @@  static int cf_check init_bars(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
 
+    ASSERT(rw_is_locked(&pdev->domain->pci_lock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 8f2b59e61a..e63152c224 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -190,6 +190,8 @@  static int cf_check init_msi(struct pci_dev *pdev)
     uint16_t control;
     int ret;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !pos )
         return 0;
 
@@ -265,7 +267,7 @@  REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -277,6 +279,9 @@  void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->pci_lock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -318,14 +323,17 @@  void vpci_dump_msi(void)
                      * holding the lock.
                      */
                     printk("unable to print all MSI-X entries: %d\n", rc);
-                    process_pending_softirqs();
-                    continue;
+                    goto pdev_done;
                 }
             }
 
             spin_unlock(&pdev->vpci->lock);
+ pdev_done:
+            read_unlock(&d->pci_lock);
             process_pending_softirqs();
+            read_lock(&d->pci_lock);
         }
+        read_unlock(&d->pci_lock);
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index 25bde77586..9481274579 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -147,6 +147,8 @@  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
 
+    ASSERT(rw_is_locked(&d->pci_lock));
+
     list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next )
     {
         const struct vpci_bar *bars = msix->pdev->vpci->header.bars;
@@ -163,7 +165,13 @@  static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int cf_check msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->pci_lock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->pci_lock);
+
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -358,21 +366,34 @@  static int adjacent_read(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_read(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0ul;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_read(d, msix, addr, len, data);
+    {
+        int rc = adjacent_read(d, msix, addr, len, data);
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -404,6 +425,7 @@  static int cf_check msix_read(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -491,19 +513,32 @@  static int adjacent_write(const struct domain *d, const struct vpci_msix *msix,
 static int cf_check msix_write(
     struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->pci_lock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_RETRY;
+    }
 
     if ( adjacent_handle(msix, addr) )
-        return adjacent_write(d, msix, addr, len, data);
+    {
+        int rc = adjacent_write(d, msix, addr, len, data);
+        read_unlock(&d->pci_lock);
+        return rc;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->pci_lock);
         return X86EMUL_OKAY;
+    }
 
     spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
@@ -579,6 +614,7 @@  static int cf_check msix_write(
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->pci_lock);
 
     return X86EMUL_OKAY;
 }
@@ -665,6 +701,8 @@  static int cf_check init_msix(struct pci_dev *pdev)
     struct vpci_msix *msix;
     int rc;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
                                       PCI_CAP_ID_MSIX);
     if ( !msix_offset )
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index d73fa76302..f22cbf2112 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -38,6 +38,8 @@  extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) || !pdev->vpci )
         return;
 
@@ -73,6 +75,8 @@  int vpci_add_handlers(struct pci_dev *pdev)
     const unsigned long *ro_map;
     int rc = 0;
 
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
     if ( !has_vpci(pdev->domain) )
         return 0;
 
@@ -326,11 +330,12 @@  static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     uint32_t data = ~(uint32_t)0;
+    rwlock_t *lock;
 
     if ( !size )
     {
@@ -342,11 +347,21 @@  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
      * Find the PCI dev matching the address, which for hwdom also requires
      * consulting DomXEN.  Passthrough everything that's not trapped.
      */
+    lock = &d->pci_lock;
+    read_lock(lock);
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
+    {
+        read_unlock(lock);
+        lock = &dom_xen->pci_lock;
+        read_lock(lock);
         pdev = pci_get_pdev(dom_xen, sbdf);
+    }
     if ( !pdev || !pdev->vpci )
+    {
+        read_unlock(lock);
         return vpci_read_hw(sbdf, reg, size);
+    }
 
     spin_lock(&pdev->vpci->lock);
 
@@ -392,6 +407,7 @@  uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(lock);
 
     if ( data_offset < size )
     {
@@ -431,10 +447,23 @@  static void vpci_write_helper(const struct pci_dev *pdev,
              r->private);
 }
 
+/* Helper function to unlock locks taken by vpci_write in proper order */
+static void unlock_locks(struct domain *d)
+{
+    ASSERT(rw_is_locked(&d->pci_lock));
+
+    if ( is_hardware_domain(d) )
+    {
+        ASSERT(rw_is_locked(&dom_xen->pci_lock));
+        read_unlock(&dom_xen->pci_lock);
+    }
+    read_unlock(&d->pci_lock);
+}
+
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -447,8 +476,16 @@  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
     /*
      * Find the PCI dev matching the address, which for hwdom also requires
-     * consulting DomXEN.  Passthrough everything that's not trapped.
+     * consulting DomXEN. Passthrough everything that's not trapped.
+     * If this is hwdom, we need to hold locks for both domains in case
+     * modify_bars() is called.
      */
+    read_lock(&d->pci_lock);
+
+    /* dom_xen->pci_lock always should be taken second to prevent deadlock */
+    if ( is_hardware_domain(d) )
+        read_lock(&dom_xen->pci_lock);
+
     pdev = pci_get_pdev(d, sbdf);
     if ( !pdev && is_hardware_domain(d) )
         pdev = pci_get_pdev(dom_xen, sbdf);
@@ -459,6 +496,8 @@  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
 
         if ( !ro_map || !test_bit(sbdf.bdf, ro_map) )
             vpci_write_hw(sbdf, reg, size, data);
+
+        unlock_locks(d);
         return;
     }
 
@@ -498,6 +537,7 @@  void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    unlock_locks(d);
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 5975ca2f30..4512910dca 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -157,6 +157,7 @@  struct pci_dev {
  */
 
 void pcidevs_lock(void);
+int pcidevs_trylock(void);
 void pcidevs_unlock(void);
 bool_t __must_check pcidevs_locked(void);