Message ID: 20240202213321.1920347-2-stewart.hildebrand@amd.com (mailing list archive)
State: New, archived
Series: PCI devices passthrough on Arm, part 3
On 2/2/24 16:33, Stewart Hildebrand wrote: > --- > Changes in v13: > - hold off adding Roger's R-b tag even though it was provided on v12.2 > - use a wrapper construct to ease readability of odd-looking ASSERTs > - new placement of ASSERT in __pci_enable_msix(), __pci_enable_msi(), > and pci_enable_msi(). Rearrange/add pdev NULL check. > - expand commit description with details about using either > pcidevs_lock() or d->pci_lock > > Changes in v12.2: > - drop Roger's R-b > - drop both locks on error paths in vpci_msix_arch_print() > - add another ASSERT in vpci_msix_arch_print(), to enforce the > expectation both locks are held before calling vpci_msix_arch_print() > - move pdev_done label in vpci_dump_msi() > - update comments in vpci_dump_msi() to say locks (plural) Here's a patchew link to show just the diff-of-diff from v12.2 (where Roger had given a R-b) to v13. https://patchew.org/Xen/20240115194309.45683-1-stewart.hildebrand@amd.com/diff/20240202213321.1920347-2-stewart.hildebrand@amd.com/
On Fri, Feb 02, 2024 at 04:33:05PM -0500, Stewart Hildebrand wrote: > From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > > Use the per-domain PCI read/write lock to protect the presence of the > pci device vpci field. This lock can be used (and in a few cases is used > right away) so that vpci removal can be performed while holding the lock > in write mode. Previously such removal could race with vpci_read for > example. > > When taking both d->pci_lock and pdev->vpci->lock, they should be > taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid > possible deadlock situations. > > 1. Per-domain's pci_lock is used to protect pdev->vpci structure > from being removed. > > 2. Writing the command register and ROM BAR register may trigger > modify_bars to run, which in turn may access multiple pdevs while > checking for the existing BAR's overlap. The overlapping check, if > done under the read lock, requires vpci->lock to be acquired on both > devices being compared, which may produce a deadlock. It is not > possible to upgrade read lock to write lock in such a case. So, in > order to prevent the deadlock, use d->pci_lock in write mode instead. > > All other code, which doesn't lead to pdev->vpci destruction and does > not access multiple pdevs at the same time, can still use a > combination of the read lock and pdev->vpci->lock. > > 3. Drop const qualifier where the new rwlock is used and this is > appropriate. > > 4. Do not call process_pending_softirqs with any locks held. For that > unlock prior the call and re-acquire the locks after. After > re-acquiring the lock there is no need to check if pdev->vpci exists: > - in apply_map because of the context it is called (no race condition > possible) > - for MSI/MSI-X debug code because it is called at the end of > pdev->vpci access and no further access to pdev->vpci is made > > 5. Use d->pci_lock around for_each_pdev and pci_get_pdev() > while accessing pdevs in vpci code. > > 6. 
Switch vPCI functions to use per-domain pci_lock for ensuring pdevs > do not go away. The vPCI functions call several MSI-related functions > which already have existing non-vPCI callers. Change those MSI-related > functions to allow using either pcidevs_lock() or d->pci_lock for > ensuring pdevs do not go away. Holding d->pci_lock in read mode is > sufficient. Note that this pdev protection mechanism does not protect > other state or critical sections. These MSI-related functions already > have other race condition and state protection mechanisms (e.g. > d->event_lock and msixtbl RCU), so we deduce that the use of the global > pcidevs_lock() is to ensure that pdevs do not go away. Existing non-vPCI > callers of these MSI-related functions will remain (ab)using the global > pcidevs_lock() to ensure pdevs do not go away so as to minimize changes > to existing non-vPCI call paths. > > 7. Introduce wrapper construct, pdev_list_is_read_locked(), for checking > that pdevs do not go away. The purpose of this wrapper is to aid > readability and document the intent of the pdev protection mechanism. I would add that, when possible, the existing callers haven't been switched to use the newly introduced per-domain pci_lock, and will continue to use the global pcidevs lock. This is done to reduce the risk of the new locking scheme introducing regressions. Those users will be adjusted in due time. IIRC Jan had concerns about why some existing use-cases are not switched straight to use the new per-domain pci_lock in this patch. 
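To make the locking discipline from points 1 and 2 of the commit message concrete, here is a minimal self-contained sketch. Everything below is hypothetical and simplified: plain counters stand in for Xen's rwlock_t and spinlock_t, and the function names do not exist in the tree.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real locks: an int for d->pci_lock
 * (>0 readers, -1 writer) and a flag for pdev->vpci->lock. */
struct vpci { int lock; int cmd; };
struct domain { int pci_lock; };

static void read_lock(struct domain *d)    { assert(d->pci_lock >= 0); d->pci_lock++; }
static void read_unlock(struct domain *d)  { d->pci_lock--; }
static void write_lock(struct domain *d)   { assert(d->pci_lock == 0); d->pci_lock = -1; }
static void write_unlock(struct domain *d) { d->pci_lock = 0; }
static void vpci_lock(struct vpci *v)      { assert(!v->lock); v->lock = 1; }
static void vpci_unlock(struct vpci *v)    { v->lock = 0; }

/* Single-device access: d->pci_lock in read mode keeps pdev->vpci from
 * disappearing; vpci->lock then protects its contents. The order is
 * always d->pci_lock first, vpci->lock second. */
static int vpci_read_sketch(struct domain *d, struct vpci *v)
{
    int val;

    read_lock(d);       /* 1st: d->pci_lock (read mode) */
    vpci_lock(v);       /* 2nd: vpci->lock */
    val = v->cmd;
    vpci_unlock(v);
    read_unlock(d);

    return val;
}

/* modify_bars-like path: the overlap check touches several devices, so
 * rather than nesting two vpci->lock's (deadlock-prone), take
 * d->pci_lock in write mode, which excludes all readers. */
static int overlap_sketch(struct domain *d, struct vpci *a, struct vpci *b)
{
    int same;

    write_lock(d);      /* write mode: no per-device locks needed */
    same = (a->cmd == b->cmd);
    write_unlock(d);

    return same;
}
```

The key invariant is the fixed acquisition order, plus switching to write mode whenever more than one device's vpci state must be examined at once.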
> > Suggested-by: Roger Pau Monné <roger.pau@citrix.com> > Suggested-by: Jan Beulich <jbeulich@suse.com> > Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> > Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> > --- > Changes in v13: > - hold off adding Roger's R-b tag even though it was provided on v12.2 > - use a wrapper construct to ease readability of odd-looking ASSERTs > - new placement of ASSERT in __pci_enable_msix(), __pci_enable_msi(), > and pci_enable_msi(). Rearrange/add pdev NULL check. > - expand commit description with details about using either > pcidevs_lock() or d->pci_lock > > Changes in v12.2: > - drop Roger's R-b > - drop both locks on error paths in vpci_msix_arch_print() > - add another ASSERT in vpci_msix_arch_print(), to enforce the > expectation both locks are held before calling vpci_msix_arch_print() > - move pdev_done label in vpci_dump_msi() > - update comments in vpci_dump_msi() to say locks (plural) > > Changes in v12.1: > - use read_trylock() in vpci_msix_arch_print() > - fixup in-code comments (revert double space, use DomXEN) in > vpci_{read,write}() > - minor updates in commit message > - add Roger's R-b > > Changes in v12: > - s/pci_rwlock/pci_lock/ in commit message > - expand comment about scope of pci_lock in sched.h > - in vpci_{read,write}, if hwdom is trying to access a device assigned > to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold > dom_xen->pci_lock) > - reintroduce ASSERT in vmx_pi_update_irte() > - reintroduce ASSERT in __pci_enable_msi{x}() > - delete note 6. in commit message about removing ASSERTs since we have > reintroduced them > > Changes in v11: > - Fixed commit message regarding possible spinlocks > - Removed parameter from allocate_and_map_msi_pirq(), which was added > in the prev version. 
Now we are taking pcidevs_lock in > physdev_map_pirq() > - Returned ASSERT to pci_enable_msi > - Fixed case when we took read lock instead of write one > - Fixed label indentation > > Changes in v10: > - Moved printk past locked area > - Returned back ASSERTs > - Added new parameter to allocate_and_map_msi_pirq() so it knows if > it should take the global pci lock > - Added comment about possible improvement in vpci_write > - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in > appropriate places > - Renamed release_domain_locks() to release_domain_write_locks() > - moved domain_done label in vpci_dump_msi() to correct place > Changes in v9: > - extended locked region to protect vpci_remove_device and > vpci_add_handlers() calls > - vpci_write() takes lock in the write mode to protect > potential call to modify_bars() > - renamed lock releasing function > - removed ASSERT()s from msi code > - added trylock in vpci_dump_msi > > Changes in v8: > - changed d->vpci_lock to d->pci_lock > - introducing d->pci_lock in a separate patch > - extended locked region in vpci_process_pending > - removed pcidevs_lock in vpci_dump_msi() > - removed some changes as they are not needed with > the new locking scheme > - added handling for hwdom && dom_xen case > --- > xen/arch/x86/hvm/vmsi.c | 31 +++++++++++++-------- > xen/arch/x86/hvm/vmx/vmx.c | 2 +- > xen/arch/x86/irq.c | 8 +++--- > xen/arch/x86/msi.c | 20 +++++++++----- > xen/arch/x86/physdev.c | 2 ++ > xen/drivers/passthrough/pci.c | 9 +++--- > xen/drivers/vpci/header.c | 18 ++++++++++++ > xen/drivers/vpci/msi.c | 30 +++++++++++++++++--- > xen/drivers/vpci/msix.c | 52 ++++++++++++++++++++++++++++++----- > xen/drivers/vpci/vpci.c | 24 ++++++++++++++-- > xen/include/xen/sched.h | 15 +++++++++- > 11 files changed, 170 insertions(+), 41 deletions(-) > > diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c > index 128f23636279..f29089178a59 100644 > --- a/xen/arch/x86/hvm/vmsi.c > +++ b/xen/arch/x86/hvm/vmsi.c > @@ 
-468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable) > struct msixtbl_entry *entry, *new_entry; > int r = -EINVAL; > > - ASSERT(pcidevs_locked()); > + ASSERT(pdev_list_is_read_locked(d)); > ASSERT(rw_is_write_locked(&d->event_lock)); > > if ( !msixtbl_initialised(d) ) > @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq) > struct pci_dev *pdev; > struct msixtbl_entry *entry; > > - ASSERT(pcidevs_locked()); > + ASSERT(pdev_list_is_read_locked(d)); > ASSERT(rw_is_write_locked(&d->event_lock)); > > if ( !msixtbl_initialised(d) ) > @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data, > { > unsigned int i; > > - ASSERT(pcidevs_locked()); > + ASSERT(rw_is_locked(&pdev->domain->pci_lock)); Any reason to not use the newly introduced helper here? I know the pcidevs will never be locked here given the new lock usage, but still it would be less confusing if the new helper was used consistently. Otherwise we need a comment here as to why the helper can't be used, in order to avoid confusion in the future. >> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > index 9da91e0e6244..c3adec1aca3c 100644 > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -462,7 +462,8 @@ struct domain > #ifdef CONFIG_HAS_PCI > struct list_head pdev_list; > /* > - * pci_lock protects access to pdev_list. > + * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci > + * structure from being removed. > * > * Any user *reading* from pdev_list, or from devices stored in pdev_list, > * should hold either pcidevs_lock() or pci_lock in read mode. Optionally, > @@ -628,6 +629,18 @@ struct domain > unsigned int cdf; > }; > > +/* > + * Check for use in ASSERTs to ensure that: > + * 1. we can *read* d->pdev_list > + * 2. pdevs (belonging to this domain) do not go away > + * 3. 
pdevs (belonging to this domain) do not get assigned to other domains I think you can just state that this check ensures there will be no changes to the entries in d->pdev_list, but not the contents of each entry. No changes to d->pdev_list already ensures no devices can be deassigned or removed from the system, and obviously makes the list safe to iterate against. I would also drop the explicit mention that this is intended for ASSERT usage: there's nothing specific in the code that prevents it from being used in other places (albeit I think that's unlikely). > + * This check is not suitable for protecting other state or critical regions. > + */ > +#define pdev_list_is_read_locked(d) ({ \ I would be tempted to drop at least the '_read_' part from the name, the name is getting a bit too long for my taste. > + struct domain *d_ = (d); \ Why do you need this local domain variable? Can't you use the d parameter directly? Such an assignment will prevent using a const 'd' parameter, and 'd_' itself should be const IMO (iff we really need this). Also sched.h is not the best place, can't you just place it in pci.h? Thanks, Roger.
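For readers following the discussion of the wrapper, a self-contained mock may help. The stub predicates and struct layout below are invented for illustration and do not match Xen's actual definitions; only the overall shape (evaluate the argument once, then accept either protection mechanism) mirrors the macro under review.

```c
#include <assert.h>
#include <stdbool.h>

/* Stub lock state standing in for Xen's pcidevs_locked() and
 * rw_is_locked(&d->pci_lock); the real ones query actual lock words. */
static bool global_pcidevs_locked;

struct domain {
    int pci_lock_holders;   /* >0 while pci_lock is held (read or write) */
};

static bool pcidevs_locked(void)
{
    return global_pcidevs_locked;
}

static bool pci_lock_is_locked(const struct domain *d)
{
    return d->pci_lock_holders > 0;
}

/* Shape of the wrapper under discussion: capture 'd' exactly once in a
 * const local, then accept either pdev-protection mechanism. The exact
 * body of the real macro is only partially visible in the thread. */
#define pdev_list_is_read_locked(d) ({                   \
    const struct domain *d_ = (d);                       \
    pcidevs_locked() || pci_lock_is_locked(d_);          \
})
```

Either the global lock or the per-domain lock satisfies the check, matching the commit message's "either pcidevs_lock() or d->pci_lock" wording.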
On 13.02.2024 09:35, Roger Pau Monné wrote: > On Fri, Feb 02, 2024 at 04:33:05PM -0500, Stewart Hildebrand wrote: >> --- a/xen/include/xen/sched.h >> +++ b/xen/include/xen/sched.h >> @@ -462,7 +462,8 @@ struct domain >> #ifdef CONFIG_HAS_PCI >> struct list_head pdev_list; >> /* >> - * pci_lock protects access to pdev_list. >> + * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci >> + * structure from being removed. >> * >> * Any user *reading* from pdev_list, or from devices stored in pdev_list, >> * should hold either pcidevs_lock() or pci_lock in read mode. Optionally, >> @@ -628,6 +629,18 @@ struct domain >> unsigned int cdf; >> }; >> >> +/* >> + * Check for use in ASSERTs to ensure that: >> + * 1. we can *read* d->pdev_list >> + * 2. pdevs (belonging to this domain) do not go away >> + * 3. pdevs (belonging to this domain) do not get assigned to other domains > > I think you can just state that this check ensures there will be no > changes to the entries in d->pdev_list, but not the contents of each > entry. No changes to d->pdev_list already ensures not devices can be > deassigned or removed from the system, and obviously makes the list > safe to iterate against. > > I would also drop the explicitly mention this is intended for ASSERT > usage: there's nothing specific in the code that prevents it from > being used in other places (albeit I think that's unlikely). But pcidevs_locked(), resolving to spin_is_locked(), isn't reliable. The assertion usage is best-effort only, without a guarantee that all wrong uses would be caught. >> + * This check is not suitable for protecting other state or critical regions. >> + */ >> +#define pdev_list_is_read_locked(d) ({ \ > > I would be tempted to drop at least the '_read_' part from the name, > the name is getting a bit too long for my taste. While I agree with the long-ish aspect, I'm afraid the "read" part is crucial. As a result I see no room for shortening. 
>> + struct domain *d_ = (d); \ > > Why do you need this local domain variable? Can't you use the d > parameter directly? It would be evaluated then somewhere between 0 and 2 times. Jan
On Tue, Feb 13, 2024 at 09:44:58AM +0100, Jan Beulich wrote: > On 13.02.2024 09:35, Roger Pau Monné wrote: > > On Fri, Feb 02, 2024 at 04:33:05PM -0500, Stewart Hildebrand wrote: > >> --- a/xen/include/xen/sched.h > >> +++ b/xen/include/xen/sched.h > >> @@ -462,7 +462,8 @@ struct domain > >> #ifdef CONFIG_HAS_PCI > >> struct list_head pdev_list; > >> /* > >> - * pci_lock protects access to pdev_list. > >> + * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci > >> + * structure from being removed. > >> * > >> * Any user *reading* from pdev_list, or from devices stored in pdev_list, > >> * should hold either pcidevs_lock() or pci_lock in read mode. Optionally, > >> @@ -628,6 +629,18 @@ struct domain > >> unsigned int cdf; > >> }; > >> > >> +/* > >> + * Check for use in ASSERTs to ensure that: > >> + * 1. we can *read* d->pdev_list > >> + * 2. pdevs (belonging to this domain) do not go away > >> + * 3. pdevs (belonging to this domain) do not get assigned to other domains > > > > I think you can just state that this check ensures there will be no > > changes to the entries in d->pdev_list, but not the contents of each > > entry. No changes to d->pdev_list already ensures not devices can be > > deassigned or removed from the system, and obviously makes the list > > safe to iterate against. > > > > I would also drop the explicitly mention this is intended for ASSERT > > usage: there's nothing specific in the code that prevents it from > > being used in other places (albeit I think that's unlikely). > > But pcidevs_locked(), resolving to spin_is_locked(), isn't reliable. The > assertion usage is best-effort only, without a guarantee that all wrong > uses would be caught. Do we want to protect this with !NDEBUG guards then? > >> + * This check is not suitable for protecting other state or critical regions. 
> >> + */ > >> +#define pdev_list_is_read_locked(d) ({ \ > > > > I would be tempted to drop at least the '_read_' part from the name, > > the name is getting a bit too long for my taste. > > While I agree with the long-ish aspect, I'm afraid the "read" part is > crucial. As a result I see no room for shortening. OK, if you think that's crucial then I'm not going to argue. > >> + struct domain *d_ = (d); \ > > > > Why do you need this local domain variable? Can't you use the d > > parameter directly? > > It would be evaluated then somewhere between 0 and 2 times. It's ASSERT code only, so I don't see that as an issue. Otherwise d_ needs to be made const. Thanks, Roger.
On 13.02.2024 10:01, Roger Pau Monné wrote: > On Tue, Feb 13, 2024 at 09:44:58AM +0100, Jan Beulich wrote: >> On 13.02.2024 09:35, Roger Pau Monné wrote: >>> On Fri, Feb 02, 2024 at 04:33:05PM -0500, Stewart Hildebrand wrote: >>>> --- a/xen/include/xen/sched.h >>>> +++ b/xen/include/xen/sched.h >>>> @@ -462,7 +462,8 @@ struct domain >>>> #ifdef CONFIG_HAS_PCI >>>> struct list_head pdev_list; >>>> /* >>>> - * pci_lock protects access to pdev_list. >>>> + * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci >>>> + * structure from being removed. >>>> * >>>> * Any user *reading* from pdev_list, or from devices stored in pdev_list, >>>> * should hold either pcidevs_lock() or pci_lock in read mode. Optionally, >>>> @@ -628,6 +629,18 @@ struct domain >>>> unsigned int cdf; >>>> }; >>>> >>>> +/* >>>> + * Check for use in ASSERTs to ensure that: >>>> + * 1. we can *read* d->pdev_list >>>> + * 2. pdevs (belonging to this domain) do not go away >>>> + * 3. pdevs (belonging to this domain) do not get assigned to other domains >>> >>> I think you can just state that this check ensures there will be no >>> changes to the entries in d->pdev_list, but not the contents of each >>> entry. No changes to d->pdev_list already ensures not devices can be >>> deassigned or removed from the system, and obviously makes the list >>> safe to iterate against. >>> >>> I would also drop the explicitly mention this is intended for ASSERT >>> usage: there's nothing specific in the code that prevents it from >>> being used in other places (albeit I think that's unlikely). >> >> But pcidevs_locked(), resolving to spin_is_locked(), isn't reliable. The >> assertion usage is best-effort only, without a guarantee that all wrong >> uses would be caught. > > Do we want to protect this with !NDEBUG guards then? Yes, that would look to be desirable. >>>> + * This check is not suitable for protecting other state or critical regions. 
>>>> + */ >>>> +#define pdev_list_is_read_locked(d) ({ \ >>> >>> I would be tempted to drop at least the '_read_' part from the name, >>> the name is getting a bit too long for my taste. >> >> While I agree with the long-ish aspect, I'm afraid the "read" part is >> crucial. As a result I see no room for shortening. > > OK, if you think that's crucial then I'm not going to argue. > >>>> + struct domain *d_ = (d); \ >>> >>> Why do you need this local domain variable? Can't you use the d >>> parameter directly? >> >> It would be evaluated then somewhere between 0 and 2 times. > > It's ASSERT code only, so I don't see that as an issue. Fair point. > Otherwise d_ needs to be made const. Indeed, but for assert-only code I agree the option is slightly better, ideally suitably commented upon. Jan
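Jan's "between 0 and 2 times" remark concerns macro-argument evaluation under short-circuiting. A small self-contained illustration (names are hypothetical; the real macro body is only partially visible in the quoted diff):

```c
#include <assert.h>
#include <stddef.h>

static int eval_count;       /* how many times the macro argument ran */
static int global_flag;      /* stand-in for pcidevs_locked() */
static int marker;

static int global_locked(void) { return global_flag; }
static int lock_check(const int *d) { (void)d; return 1; }

/* Side-effecting expression standing in for the 'd' argument. */
static int *get_dom(void) { eval_count++; return &marker; }

/* Naive form: 'd' is expanded twice, so short-circuiting evaluates it
 * 0, 1, or 2 times depending on which checks actually run. */
#define CHECK_NAIVE(d) \
    (global_locked() || ((d) != NULL && lock_check(d)))

/* Form from the patch: a local captures 'd' exactly once, regardless
 * of which part of the check short-circuits. */
#define CHECK_ONCE(d) ({                                 \
    const int *d_ = (d);                                 \
    global_locked() || (d_ != NULL && lock_check(d_));   \
})
```

With a plain variable as argument the difference is invisible; it only matters when the argument has side effects, which is why capturing it once is the defensive choice for a macro in a public header.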
On 2/13/24 03:35, Roger Pau Monné wrote: > On Fri, Feb 02, 2024 at 04:33:05PM -0500, Stewart Hildebrand wrote: >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >> >> Use the per-domain PCI read/write lock to protect the presence of the >> pci device vpci field. This lock can be used (and in a few cases is used >> right away) so that vpci removal can be performed while holding the lock >> in write mode. Previously such removal could race with vpci_read for >> example. >> >> When taking both d->pci_lock and pdev->vpci->lock, they should be >> taken in this exact order: d->pci_lock then pdev->vpci->lock to avoid >> possible deadlock situations. >> >> 1. Per-domain's pci_lock is used to protect pdev->vpci structure >> from being removed. >> >> 2. Writing the command register and ROM BAR register may trigger >> modify_bars to run, which in turn may access multiple pdevs while >> checking for the existing BAR's overlap. The overlapping check, if >> done under the read lock, requires vpci->lock to be acquired on both >> devices being compared, which may produce a deadlock. It is not >> possible to upgrade read lock to write lock in such a case. So, in >> order to prevent the deadlock, use d->pci_lock in write mode instead. >> >> All other code, which doesn't lead to pdev->vpci destruction and does >> not access multiple pdevs at the same time, can still use a >> combination of the read lock and pdev->vpci->lock. >> >> 3. Drop const qualifier where the new rwlock is used and this is >> appropriate. >> >> 4. Do not call process_pending_softirqs with any locks held. For that >> unlock prior the call and re-acquire the locks after. After >> re-acquiring the lock there is no need to check if pdev->vpci exists: >> - in apply_map because of the context it is called (no race condition >> possible) >> - for MSI/MSI-X debug code because it is called at the end of >> pdev->vpci access and no further access to pdev->vpci is made >> >> 5. 
Use d->pci_lock around for_each_pdev and pci_get_pdev() >> while accessing pdevs in vpci code. >> >> 6. Switch vPCI functions to use per-domain pci_lock for ensuring pdevs >> do not go away. The vPCI functions call several MSI-related functions >> which already have existing non-vPCI callers. Change those MSI-related >> functions to allow using either pcidevs_lock() or d->pci_lock for >> ensuring pdevs do not go away. Holding d->pci_lock in read mode is >> sufficient. Note that this pdev protection mechanism does not protect >> other state or critical sections. These MSI-related functions already >> have other race condition and state protection mechanims (e.g. >> d->event_lock and msixtbl RCU), so we deduce that the use of the global >> pcidevs_lock() is to ensure that pdevs do not go away. Existing non-vPCI >> callers of these MSI-related functions will remain (ab)using the global >> pcidevs_lock() to ensure pdevs do not go away so as to minimize changes >> to existing non-vPCI call paths. >> >> 7. Introduce wrapper construct, pdev_list_is_read_locked(), for checking >> that pdevs do not go away. The purpose of this wrapper is to aid >> readability and document the intent of the pdev protection mechanism. > > I would add that when possible, the existing callers haven't been > switched to use the newly introduced per-domain pci_lock, and will > continue to use the global pcidevs lock. This is done to reduce the > risk of the new locking scheme introducing regressions. Those users > will be adjusted in due time. I'll use this wording, thanks > > IIRC Jan had concerns about why some existing use-cases are not > switched straight to use the new per-domain pci_lock in this patch. 
I hope the clarified commit description addresses this > >> >> Suggested-by: Roger Pau Monné <roger.pau@citrix.com> >> Suggested-by: Jan Beulich <jbeulich@suse.com> >> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >> Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> >> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> >> --- >> Changes in v13: >> - hold off adding Roger's R-b tag even though it was provided on v12.2 >> - use a wrapper construct to ease readability of odd-looking ASSERTs >> - new placement of ASSERT in __pci_enable_msix(), __pci_enable_msi(), >> and pci_enable_msi(). Rearrange/add pdev NULL check. >> - expand commit description with details about using either >> pcidevs_lock() or d->pci_lock >> >> Changes in v12.2: >> - drop Roger's R-b >> - drop both locks on error paths in vpci_msix_arch_print() >> - add another ASSERT in vpci_msix_arch_print(), to enforce the >> expectation both locks are held before calling vpci_msix_arch_print() >> - move pdev_done label in vpci_dump_msi() >> - update comments in vpci_dump_msi() to say locks (plural) >> >> Changes in v12.1: >> - use read_trylock() in vpci_msix_arch_print() >> - fixup in-code comments (revert double space, use DomXEN) in >> vpci_{read,write}() >> - minor updates in commit message >> - add Roger's R-b >> >> Changes in v12: >> - s/pci_rwlock/pci_lock/ in commit message >> - expand comment about scope of pci_lock in sched.h >> - in vpci_{read,write}, if hwdom is trying to access a device assigned >> to dom_xen, holding hwdom->pci_lock is sufficient (no need to hold >> dom_xen->pci_lock) >> - reintroduce ASSERT in vmx_pi_update_irte() >> - reintroduce ASSERT in __pci_enable_msi{x}() >> - delete note 6. in commit message about removing ASSERTs since we have >> reintroduced them >> >> Changes in v11: >> - Fixed commit message regarding possible spinlocks >> - Removed parameter from allocate_and_map_msi_pirq(), which was added >> in the prev version. 
Now we are taking pcidevs_lock in >> physdev_map_pirq() >> - Returned ASSERT to pci_enable_msi >> - Fixed case when we took read lock instead of write one >> - Fixed label indentation >> >> Changes in v10: >> - Moved printk pas locked area >> - Returned back ASSERTs >> - Added new parameter to allocate_and_map_msi_pirq() so it knows if >> it should take the global pci lock >> - Added comment about possible improvement in vpci_write >> - Changed ASSERT(rw_is_locked()) to rw_is_write_locked() in >> appropriate places >> - Renamed release_domain_locks() to release_domain_write_locks() >> - moved domain_done label in vpci_dump_msi() to correct place >> Changes in v9: >> - extended locked region to protect vpci_remove_device and >> vpci_add_handlers() calls >> - vpci_write() takes lock in the write mode to protect >> potential call to modify_bars() >> - renamed lock releasing function >> - removed ASSERT()s from msi code >> - added trylock in vpci_dump_msi >> >> Changes in v8: >> - changed d->vpci_lock to d->pci_lock >> - introducing d->pci_lock in a separate patch >> - extended locked region in vpci_process_pending >> - removed pcidevs_lockis vpci_dump_msi() >> - removed some changes as they are not needed with >> the new locking scheme >> - added handling for hwdom && dom_xen case >> --- >> xen/arch/x86/hvm/vmsi.c | 31 +++++++++++++-------- >> xen/arch/x86/hvm/vmx/vmx.c | 2 +- >> xen/arch/x86/irq.c | 8 +++--- >> xen/arch/x86/msi.c | 20 +++++++++----- >> xen/arch/x86/physdev.c | 2 ++ >> xen/drivers/passthrough/pci.c | 9 +++--- >> xen/drivers/vpci/header.c | 18 ++++++++++++ >> xen/drivers/vpci/msi.c | 30 +++++++++++++++++--- >> xen/drivers/vpci/msix.c | 52 ++++++++++++++++++++++++++++++----- >> xen/drivers/vpci/vpci.c | 24 ++++++++++++++-- >> xen/include/xen/sched.h | 15 +++++++++- >> 11 files changed, 170 insertions(+), 41 deletions(-) >> >> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c >> index 128f23636279..f29089178a59 100644 >> --- 
a/xen/arch/x86/hvm/vmsi.c >> +++ b/xen/arch/x86/hvm/vmsi.c >> @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable) >> struct msixtbl_entry *entry, *new_entry; >> int r = -EINVAL; >> >> - ASSERT(pcidevs_locked()); >> + ASSERT(pdev_list_is_read_locked(d)); >> ASSERT(rw_is_write_locked(&d->event_lock)); >> >> if ( !msixtbl_initialised(d) ) >> @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq) >> struct pci_dev *pdev; >> struct msixtbl_entry *entry; >> >> - ASSERT(pcidevs_locked()); >> + ASSERT(pdev_list_is_read_locked(d)); >> ASSERT(rw_is_write_locked(&d->event_lock)); >> >> if ( !msixtbl_initialised(d) ) >> @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data, >> { >> unsigned int i; >> >> - ASSERT(pcidevs_locked()); >> + ASSERT(rw_is_locked(&pdev->domain->pci_lock)); > > Any reason to not use the newly introduced helper here? I know the > pcidevs will never be locked here given the new lock usage, but still > it would be less confusing if the new helper was used consistently. Nope. Agreed, it would help readability, I'll switch to the helper. In addition to vpci_msi_update(), I assume this comment also applies to the remaining occurrences: xen/arch/x86/hvm/vmsi.c:vpci_msi_arch_update() xen/arch/x86/hvm/vmsi.c:vpci_msi_arch_enable() xen/arch/x86/hvm/vmsi.c:vpci_msi_disable() xen/arch/x86/hvm/vmsi.c:vpci_msix_arch_enable_entry() xen/drivers/vpci/msix.c:msix_find() But not: xen/arch/x86/hvm/vmsi.c:vpci_msix_arch_print() I'll add a suitable comment to this one exception. > > Otherwise we need a comment here as to why the helper can't be used, > in order to avoid confusion in the future. 
> >>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >> index 9da91e0e6244..c3adec1aca3c 100644 >> --- a/xen/include/xen/sched.h >> +++ b/xen/include/xen/sched.h >> @@ -462,7 +462,8 @@ struct domain >> #ifdef CONFIG_HAS_PCI >> struct list_head pdev_list; >> /* >> - * pci_lock protects access to pdev_list. >> + * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci >> + * structure from being removed. >> * >> * Any user *reading* from pdev_list, or from devices stored in pdev_list, >> * should hold either pcidevs_lock() or pci_lock in read mode. Optionally, >> @@ -628,6 +629,18 @@ struct domain >> unsigned int cdf; >> }; >> >> +/* >> + * Check for use in ASSERTs to ensure that: >> + * 1. we can *read* d->pdev_list >> + * 2. pdevs (belonging to this domain) do not go away >> + * 3. pdevs (belonging to this domain) do not get assigned to other domains > > I think you can just state that this check ensures there will be no > changes to the entries in d->pdev_list, but not the contents of each > entry. No changes to d->pdev_list already ensures not devices can be > deassigned or removed from the system, and obviously makes the list > safe to iterate against. OK, I'll simplify the comment > > I would also drop the explicitly mention this is intended for ASSERT > usage: there's nothing specific in the code that prevents it from > being used in other places (albeit I think that's unlikely). > >> + * This check is not suitable for protecting other state or critical regions. >> + */ >> +#define pdev_list_is_read_locked(d) ({ \ > > I would be tempted to drop at least the '_read_' part from the name, > the name is getting a bit too long for my taste. > >> + struct domain *d_ = (d); \ > > Why do you need this local domain variable? Can't you use the d > parameter directly? > > Such assign will prevent using a const 'd' parameter, and 'd_' itself > should be const IMO (iff we really need this). 
> > Also sched.h is not the best place, can't you just place it in > pci.h? Yes, I'll move it to xen/include/xen/pci.h
On 2/13/24 04:05, Jan Beulich wrote: > On 13.02.2024 10:01, Roger Pau Monné wrote: >> On Tue, Feb 13, 2024 at 09:44:58AM +0100, Jan Beulich wrote: >>> On 13.02.2024 09:35, Roger Pau Monné wrote: >>>> On Fri, Feb 02, 2024 at 04:33:05PM -0500, Stewart Hildebrand wrote: >>>>> --- a/xen/include/xen/sched.h >>>>> +++ b/xen/include/xen/sched.h >>>>> @@ -462,7 +462,8 @@ struct domain >>>>> #ifdef CONFIG_HAS_PCI >>>>> struct list_head pdev_list; >>>>> /* >>>>> - * pci_lock protects access to pdev_list. >>>>> + * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci >>>>> + * structure from being removed. >>>>> * >>>>> * Any user *reading* from pdev_list, or from devices stored in pdev_list, >>>>> * should hold either pcidevs_lock() or pci_lock in read mode. Optionally, >>>>> @@ -628,6 +629,18 @@ struct domain >>>>> unsigned int cdf; >>>>> }; >>>>> >>>>> +/* >>>>> + * Check for use in ASSERTs to ensure that: >>>>> + * 1. we can *read* d->pdev_list >>>>> + * 2. pdevs (belonging to this domain) do not go away >>>>> + * 3. pdevs (belonging to this domain) do not get assigned to other domains >>>> >>>> I think you can just state that this check ensures there will be no >>>> changes to the entries in d->pdev_list, but not the contents of each >>>> entry. No changes to d->pdev_list already ensures not devices can be >>>> deassigned or removed from the system, and obviously makes the list >>>> safe to iterate against. >>>> >>>> I would also drop the explicitly mention this is intended for ASSERT >>>> usage: there's nothing specific in the code that prevents it from >>>> being used in other places (albeit I think that's unlikely). >>> >>> But pcidevs_locked(), resolving to spin_is_locked(), isn't reliable. The >>> assertion usage is best-effort only, without a guarantee that all wrong >>> uses would be caught. >> >> Do we want to protect this with !NDEBUG guards then? > > Yes, that would look to be desirable. 
We will then also need a definition of pdev_list_is_read_locked() in the
#else case so we don't risk running into "error: implicit declaration of
function 'pdev_list_is_read_locked'".

Such a definition might look like:

#define pdev_list_is_read_locked(d) ({ (void)d; ASSERT_UNREACHABLE(); false; })

so that we still evaluate d exactly once in the NDEBUG case.

>>>>> + * This check is not suitable for protecting other state or critical regions.
>>>>> + */
>>>>> +#define pdev_list_is_read_locked(d) ({ \
>>>>
>>>> I would be tempted to drop at least the '_read_' part from the name,
>>>> the name is getting a bit too long for my taste.
>>>
>>> While I agree with the long-ish aspect, I'm afraid the "read" part is
>>> crucial. As a result I see no room for shortening.
>>
>> OK, if you think that's crucial then I'm not going to argue.
>>
>>>>> + struct domain *d_ = (d); \
>>>>
>>>> Why do you need this local domain variable? Can't you use the d
>>>> parameter directly?
>>>
>>> It would be evaluated then somewhere between 0 and 2 times.
>>
>> It's ASSERT code only, so I don't see that as an issue.
>
> Fair point.
>
>> Otherwise d_ needs to be made const.
>
> Indeed, but for assert-only code I agree the option is slightly better,
> ideally suitably commented upon.

Is "the option" here referring to making d_ const, or using d directly
(with suitable comment)?
On 13.02.2024 17:58, Stewart Hildebrand wrote:
> On 2/13/24 04:05, Jan Beulich wrote:
>> On 13.02.2024 10:01, Roger Pau Monné wrote:
>>> On Tue, Feb 13, 2024 at 09:44:58AM +0100, Jan Beulich wrote:
>>>> On 13.02.2024 09:35, Roger Pau Monné wrote:
>>>>> On Fri, Feb 02, 2024 at 04:33:05PM -0500, Stewart Hildebrand wrote:
>>>>>> --- a/xen/include/xen/sched.h
>>>>>> +++ b/xen/include/xen/sched.h
>>>>>> @@ -462,7 +462,8 @@ struct domain
>>>>>>  #ifdef CONFIG_HAS_PCI
>>>>>>      struct list_head pdev_list;
>>>>>>      /*
>>>>>> -     * pci_lock protects access to pdev_list.
>>>>>> +     * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci
>>>>>> +     * structure from being removed.
>>>>>>       *
>>>>>>       * Any user *reading* from pdev_list, or from devices stored in pdev_list,
>>>>>>       * should hold either pcidevs_lock() or pci_lock in read mode. Optionally,
>>>>>> @@ -628,6 +629,18 @@ struct domain
>>>>>>      unsigned int cdf;
>>>>>>  };
>>>>>>
>>>>>> +/*
>>>>>> + * Check for use in ASSERTs to ensure that:
>>>>>> + * 1. we can *read* d->pdev_list
>>>>>> + * 2. pdevs (belonging to this domain) do not go away
>>>>>> + * 3. pdevs (belonging to this domain) do not get assigned to other domains
>>>>>
>>>>> I think you can just state that this check ensures there will be no
>>>>> changes to the entries in d->pdev_list, but not the contents of each
>>>>> entry. No changes to d->pdev_list already ensures not devices can be
>>>>> deassigned or removed from the system, and obviously makes the list
>>>>> safe to iterate against.
>>>>>
>>>>> I would also drop the explicitly mention this is intended for ASSERT
>>>>> usage: there's nothing specific in the code that prevents it from
>>>>> being used in other places (albeit I think that's unlikely).
>>>>
>>>> But pcidevs_locked(), resolving to spin_is_locked(), isn't reliable. The
>>>> assertion usage is best-effort only, without a guarantee that all wrong
>>>> uses would be caught.
>>>
>>> Do we want to protect this with !NDEBUG guards then?
>>
>> Yes, that would look to be desirable.
>
> We will then also need a definition of pdev_list_is_read_locked() in the
> #else case so we don't risk running into "error: implicit declaration of
> function 'pdev_list_is_read_locked'".
>
> Such a definition might look like:
>
> #define pdev_list_is_read_locked(d) ({ (void)d; ASSERT_UNREACHABLE(); false; })
>
> so that we still evaluate d exactly once in the NDEBUG case.

Except that ASSERT_UNREACHABLE() use is bogus in the NDEBUG case. The
way our ASSERT() works in the NDEBUG case looks to make it sufficient
for there to be

bool pdev_list_is_read_locked(const struct domain *d);

in the #else case (with no implementation anywhere).

>>>>>> + * This check is not suitable for protecting other state or critical regions.
>>>>>> + */
>>>>>> +#define pdev_list_is_read_locked(d) ({ \
>>>>>
>>>>> I would be tempted to drop at least the '_read_' part from the name,
>>>>> the name is getting a bit too long for my taste.
>>>>
>>>> While I agree with the long-ish aspect, I'm afraid the "read" part is
>>>> crucial. As a result I see no room for shortening.
>>>
>>> OK, if you think that's crucial then I'm not going to argue.
>>>
>>>>>> + struct domain *d_ = (d); \
>>>>>
>>>>> Why do you need this local domain variable? Can't you use the d
>>>>> parameter directly?
>>>>
>>>> It would be evaluated then somewhere between 0 and 2 times.
>>>
>>> It's ASSERT code only, so I don't see that as an issue.
>>
>> Fair point.
>>
>>> Otherwise d_ needs to be made const.
>>
>> Indeed, but for assert-only code I agree the option is slightly better,
>> ideally suitably commented upon.
>
> Is "the option" here referring to making d_ const, or using d directly
> (with suitable comment)?

The latter.

Jan
On 02.02.2024 22:33, Stewart Hildebrand wrote:
> --- a/xen/arch/x86/physdev.c
> +++ b/xen/arch/x86/physdev.c
> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>
>      case MAP_PIRQ_TYPE_MSI:
>      case MAP_PIRQ_TYPE_MULTI_MSI:
> +        pcidevs_lock();
>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
> +        pcidevs_unlock();
>          break;

I'm afraid I need to come back to this: This is the only place where
this patch retains (moves) use of the global lock. By moving, its scope
is actually extended. It was previously said that conversion doesn't
happen to limit the scope of what is changing. But with
allocate_and_map_msi_pirq() being happy about either lock being held,
I'm having a hard time seeing why here the global lock would continue
to need using. To me doing so suggests uncertainty whether the checking
in the function is actually correct.

Jan
On 2/14/24 06:38, Jan Beulich wrote:
> On 02.02.2024 22:33, Stewart Hildebrand wrote:
>> --- a/xen/arch/x86/physdev.c
>> +++ b/xen/arch/x86/physdev.c
>> @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
>>
>>      case MAP_PIRQ_TYPE_MSI:
>>      case MAP_PIRQ_TYPE_MULTI_MSI:
>> +        pcidevs_lock();
>>          ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi);
>> +        pcidevs_unlock();
>>          break;
>
> I'm afraid I need to come back to this: This is the only place where
> this patch retains (moves) use of the global lock. By moving, its scope
> is actually extended. It was previously said that conversion doesn't
> happen to limit the scope of what is changing. But with
> allocate_and_map_msi_pirq() being happy about either lock being held,
> I'm having a hard time seeing why here the global lock would continue
> to need using. To me doing so suggests uncertainty whether the checking
> in the function is actually correct.

I understand the concern. I've gone back and forth on this one in
particular myself [1]. I've tested it both ways (i.e. as shown here with
pcidevs_lock() and replacing it with read_lock(&d->pci_lock)). In the
analysis I've done, I also cannot find a need to continue using
pcidevs_lock() here. read_lock(&d->pci_lock) is sufficient. Let's be
clear: both ways are correct.

The only reason I left it at pcidevs_lock() for v13 was to make sure the
code was doing what the commit description and notes in the cover letter
say.

allocate_and_map_msi_pirq() acquires write_lock(&d->event_lock), and
accesses to non-domain-specific data fall in the category of reading
__read_mostly globals or are appropriately protected by desc->lock
and/or vector_lock and/or cmpxchg.

I'll change it to read_lock(&d->pci_lock) in this one particular
instance (and update the description appropriately).

[1] https://lore.kernel.org/xen-devel/3c1023c4-25fb-41f1-83eb-03cbc1c3720e@amd.com/
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c index 128f23636279..f29089178a59 100644 --- a/xen/arch/x86/hvm/vmsi.c +++ b/xen/arch/x86/hvm/vmsi.c @@ -468,7 +468,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq *pirq, uint64_t gtable) struct msixtbl_entry *entry, *new_entry; int r = -EINVAL; - ASSERT(pcidevs_locked()); + ASSERT(pdev_list_is_read_locked(d)); ASSERT(rw_is_write_locked(&d->event_lock)); if ( !msixtbl_initialised(d) ) @@ -538,7 +538,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq *pirq) struct pci_dev *pdev; struct msixtbl_entry *entry; - ASSERT(pcidevs_locked()); + ASSERT(pdev_list_is_read_locked(d)); ASSERT(rw_is_write_locked(&d->event_lock)); if ( !msixtbl_initialised(d) ) @@ -684,7 +684,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, uint32_t data, { unsigned int i; - ASSERT(pcidevs_locked()); + ASSERT(rw_is_locked(&pdev->domain->pci_lock)); if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER ) { @@ -725,8 +725,8 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev) int rc; ASSERT(msi->arch.pirq != INVALID_PIRQ); + ASSERT(rw_is_locked(&pdev->domain->pci_lock)); - pcidevs_lock(); for ( i = 0; i < msi->vectors && msi->arch.bound; i++ ) { struct xen_domctl_bind_pt_irq unbind = { @@ -745,7 +745,6 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const struct pci_dev *pdev) msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, msi->vectors, msi->arch.pirq, msi->mask); - pcidevs_unlock(); } static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr, @@ -778,15 +777,14 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const struct pci_dev *pdev, int rc; ASSERT(msi->arch.pirq == INVALID_PIRQ); + ASSERT(rw_is_locked(&pdev->domain->pci_lock)); rc = vpci_msi_enable(pdev, vectors, 0); if ( rc < 0 ) return rc; msi->arch.pirq = rc; - pcidevs_lock(); msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors, msi->arch.pirq, msi->mask); - 
pcidevs_unlock(); return 0; } @@ -797,8 +795,8 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq, unsigned int i; ASSERT(pirq != INVALID_PIRQ); + ASSERT(rw_is_locked(&pdev->domain->pci_lock)); - pcidevs_lock(); for ( i = 0; i < nr && bound; i++ ) { struct xen_domctl_bind_pt_irq bind = { @@ -814,7 +812,6 @@ static void vpci_msi_disable(const struct pci_dev *pdev, int pirq, write_lock(&pdev->domain->event_lock); unmap_domain_pirq(pdev->domain, pirq); write_unlock(&pdev->domain->event_lock); - pcidevs_unlock(); } void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev) @@ -854,6 +851,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry, int rc; ASSERT(entry->arch.pirq == INVALID_PIRQ); + ASSERT(rw_is_locked(&pdev->domain->pci_lock)); rc = vpci_msi_enable(pdev, vmsix_entry_nr(pdev->vpci->msix, entry), table_base); if ( rc < 0 ) @@ -861,7 +859,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry, entry->arch.pirq = rc; - pcidevs_lock(); rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq, entry->masked); if ( rc ) @@ -869,7 +866,6 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry *entry, vpci_msi_disable(pdev, entry->arch.pirq, 1, false); entry->arch.pirq = INVALID_PIRQ; } - pcidevs_unlock(); return rc; } @@ -895,6 +891,9 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) { unsigned int i; + ASSERT(rw_is_locked(&msix->pdev->domain->pci_lock)); + ASSERT(spin_is_locked(&msix->pdev->vpci->lock)); + for ( i = 0; i < msix->max_entries; i++ ) { const struct vpci_msix_entry *entry = &msix->entries[i]; @@ -913,13 +912,23 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) struct pci_dev *pdev = msix->pdev; spin_unlock(&msix->pdev->vpci->lock); + read_unlock(&pdev->domain->pci_lock); process_pending_softirqs(); + + if ( !read_trylock(&pdev->domain->pci_lock) ) + return -EBUSY; + /* NB: we assume that pdev cannot go away for an alive domain. 
*/ if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) ) + { + read_unlock(&pdev->domain->pci_lock); return -EBUSY; + } + if ( pdev->vpci->msix != msix ) { spin_unlock(&pdev->vpci->lock); + read_unlock(&pdev->domain->pci_lock); return -EAGAIN; } } diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 1500dca6039f..5aac4863d30a 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -413,7 +413,7 @@ static int cf_check vmx_pi_update_irte(const struct vcpu *v, spin_unlock_irq(&desc->lock); - ASSERT(pcidevs_locked()); + ASSERT(pdev_list_is_read_locked(msi_desc->dev->domain)); return iommu_update_ire_from_msi(msi_desc, &msi_desc->msg); diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c index bbae7751e494..e31144d82fd4 100644 --- a/xen/arch/x86/irq.c +++ b/xen/arch/x86/irq.c @@ -2164,7 +2164,7 @@ int map_domain_pirq( struct pci_dev *pdev; unsigned int nr = 0; - ASSERT(pcidevs_locked()); + ASSERT(pdev_list_is_read_locked(d)); ret = -ENODEV; if ( !cpu_has_apic ) @@ -2321,7 +2321,7 @@ int unmap_domain_pirq(struct domain *d, int pirq) if ( (pirq < 0) || (pirq >= d->nr_pirqs) ) return -EINVAL; - ASSERT(pcidevs_locked()); + ASSERT(pdev_list_is_read_locked(d)); ASSERT(rw_is_write_locked(&d->event_lock)); info = pirq_info(d, pirq); @@ -2886,6 +2886,8 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p, { int irq, pirq, ret; + ASSERT(pdev_list_is_read_locked(d)); + switch ( type ) { case MAP_PIRQ_TYPE_MSI: @@ -2915,7 +2917,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p, msi->irq = irq; - pcidevs_lock(); /* Verify or get pirq. 
*/ write_lock(&d->event_lock); pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr); @@ -2931,7 +2932,6 @@ int allocate_and_map_msi_pirq(struct domain *d, int index, int *pirq_p, done: write_unlock(&d->event_lock); - pcidevs_unlock(); if ( ret ) { switch ( type ) diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c index 335c0868a225..e008b6789a28 100644 --- a/xen/arch/x86/msi.c +++ b/xen/arch/x86/msi.c @@ -602,7 +602,7 @@ static int msi_capability_init(struct pci_dev *dev, unsigned int i, mpos; uint16_t control; - ASSERT(pcidevs_locked()); + ASSERT(pdev_list_is_read_locked(dev->domain)); pos = pci_find_cap_offset(dev->sbdf, PCI_CAP_ID_MSI); if ( !pos ) return -ENODEV; @@ -771,7 +771,7 @@ static int msix_capability_init(struct pci_dev *dev, if ( !pos ) return -ENODEV; - ASSERT(pcidevs_locked()); + ASSERT(pdev_list_is_read_locked(dev->domain)); control = pci_conf_read16(dev->sbdf, msix_control_reg(pos)); /* @@ -988,11 +988,11 @@ static int __pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi, { struct msi_desc *old_desc; - ASSERT(pcidevs_locked()); - if ( !pdev ) return -ENODEV; + ASSERT(pdev_list_is_read_locked(pdev->domain)); + old_desc = find_msi_entry(pdev, msi->irq, PCI_CAP_ID_MSI); if ( old_desc ) { @@ -1043,9 +1043,12 @@ static int __pci_enable_msix(struct pci_dev *pdev, struct msi_info *msi, { struct msi_desc *old_desc; - ASSERT(pcidevs_locked()); + if ( !pdev ) + return -ENODEV; + + ASSERT(pdev_list_is_read_locked(pdev->domain)); - if ( !pdev || !pdev->msix ) + if ( !pdev->msix ) return -ENODEV; if ( msi->entry_nr >= pdev->msix->nr_entries ) @@ -1154,7 +1157,10 @@ int pci_prepare_msix(u16 seg, u8 bus, u8 devfn, bool off) int pci_enable_msi(struct pci_dev *pdev, struct msi_info *msi, struct msi_desc **desc) { - ASSERT(pcidevs_locked()); + if ( !pdev ) + return -ENODEV; + + ASSERT(pdev_list_is_read_locked(pdev->domain)); if ( !use_msi ) return -EPERM; diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c index 
47c4da0af7e1..369c9e788c1c 100644 --- a/xen/arch/x86/physdev.c +++ b/xen/arch/x86/physdev.c @@ -123,7 +123,9 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p, case MAP_PIRQ_TYPE_MSI: case MAP_PIRQ_TYPE_MULTI_MSI: + pcidevs_lock(); ret = allocate_and_map_msi_pirq(d, *index, pirq_p, type, msi); + pcidevs_unlock(); break; default: diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c index 47c0eee7bdcc..c97dd4504a7a 100644 --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -750,7 +750,6 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, pdev->domain = hardware_domain; write_lock(&hardware_domain->pci_lock); list_add(&pdev->domain_list, &hardware_domain->pdev_list); - write_unlock(&hardware_domain->pci_lock); /* * For devices not discovered by Xen during boot, add vPCI handlers @@ -759,18 +758,18 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn, ret = vpci_add_handlers(pdev); if ( ret ) { - printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret); - write_lock(&hardware_domain->pci_lock); list_del(&pdev->domain_list); write_unlock(&hardware_domain->pci_lock); pdev->domain = NULL; + printk(XENLOG_ERR "Setup of vPCI failed: %d\n", ret); goto out; } + write_unlock(&hardware_domain->pci_lock); ret = iommu_add_device(pdev); if ( ret ) { - vpci_remove_device(pdev); write_lock(&hardware_domain->pci_lock); + vpci_remove_device(pdev); list_del(&pdev->domain_list); write_unlock(&hardware_domain->pci_lock); pdev->domain = NULL; @@ -1146,7 +1145,9 @@ static void __hwdom_init setup_one_hwdom_device(const struct setup_hwdom *ctxt, } while ( devfn != pdev->devfn && PCI_SLOT(devfn) == PCI_SLOT(pdev->devfn) ); + write_lock(&ctxt->d->pci_lock); err = vpci_add_handlers(pdev); + write_unlock(&ctxt->d->pci_lock); if ( err ) printk(XENLOG_ERR "setup of vPCI for d%d failed: %d\n", ctxt->d->domain_id, err); diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c index 58195549d50a..8f5850b8cf6d 100644 --- 
a/xen/drivers/vpci/header.c +++ b/xen/drivers/vpci/header.c @@ -173,6 +173,7 @@ bool vpci_process_pending(struct vcpu *v) if ( rc == -ERESTART ) return true; + write_lock(&v->domain->pci_lock); spin_lock(&v->vpci.pdev->vpci->lock); /* Disable memory decoding unconditionally on failure. */ modify_decoding(v->vpci.pdev, @@ -191,6 +192,7 @@ bool vpci_process_pending(struct vcpu *v) * failure. */ vpci_remove_device(v->vpci.pdev); + write_unlock(&v->domain->pci_lock); } return false; @@ -202,8 +204,20 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev, struct map_data data = { .d = d, .map = true }; int rc; + ASSERT(rw_is_write_locked(&d->pci_lock)); + while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) + { + /* + * It's safe to drop and reacquire the lock in this context + * without risking pdev disappearing because devices cannot be + * removed until the initial domain has been started. + */ + write_unlock(&d->pci_lock); process_pending_softirqs(); + write_lock(&d->pci_lock); + } + rangeset_destroy(mem); if ( !rc ) modify_decoding(pdev, cmd, false); @@ -244,6 +258,8 @@ static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only) unsigned int i; int rc; + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock)); + if ( !mem ) return -ENOMEM; @@ -524,6 +540,8 @@ static int cf_check init_header(struct pci_dev *pdev) int rc; bool mask_cap_list = false; + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock)); + switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f ) { case PCI_HEADER_TYPE_NORMAL: diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c index a253ccbd7db7..dc71938e23f5 100644 --- a/xen/drivers/vpci/msi.c +++ b/xen/drivers/vpci/msi.c @@ -263,7 +263,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW); void vpci_dump_msi(void) { - const struct domain *d; + struct domain *d; rcu_read_lock(&domlist_read_lock); for_each_domain ( d ) @@ -275,6 +275,9 @@ void vpci_dump_msi(void) printk("vPCI 
MSI/MSI-X d%d\n", d->domain_id); + if ( !read_trylock(&d->pci_lock) ) + continue; + for_each_pdev ( d, pdev ) { const struct vpci_msi *msi; @@ -313,17 +316,36 @@ void vpci_dump_msi(void) { /* * On error vpci_msix_arch_print will always return without - * holding the lock. + * holding the locks. */ printk("unable to print all MSI-X entries: %d\n", rc); - process_pending_softirqs(); - continue; + goto pdev_done; } } + /* + * Unlock locks to process pending softirqs. This is + * potentially unsafe, as d->pdev_list can be changed in + * meantime. + */ spin_unlock(&pdev->vpci->lock); + read_unlock(&d->pci_lock); + pdev_done: process_pending_softirqs(); + if ( !read_trylock(&d->pci_lock) ) + { + printk("unable to access other devices for the domain\n"); + goto domain_done; + } } + read_unlock(&d->pci_lock); + domain_done: + /* + * We need this label at the end of the loop, but some + * compilers might not be happy about label at the end of the + * compound statement so we adding an empty statement here. 
+ */ + ; } rcu_read_unlock(&domlist_read_lock); } diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c index d1126a417da9..b6abab47efdd 100644 --- a/xen/drivers/vpci/msix.c +++ b/xen/drivers/vpci/msix.c @@ -147,6 +147,8 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr) { struct vpci_msix *msix; + ASSERT(rw_is_locked(&d->pci_lock)); + list_for_each_entry ( msix, &d->arch.hvm.msix_tables, next ) { const struct vpci_bar *bars = msix->pdev->vpci->header.bars; @@ -163,7 +165,13 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr) static int cf_check msix_accept(struct vcpu *v, unsigned long addr) { - return !!msix_find(v->domain, addr); + int rc; + + read_lock(&v->domain->pci_lock); + rc = !!msix_find(v->domain, addr); + read_unlock(&v->domain->pci_lock); + + return rc; } static bool access_allowed(const struct pci_dev *pdev, unsigned long addr, @@ -358,21 +366,35 @@ static int adjacent_read(const struct domain *d, const struct vpci_msix *msix, static int cf_check msix_read( struct vcpu *v, unsigned long addr, unsigned int len, unsigned long *data) { - const struct domain *d = v->domain; - struct vpci_msix *msix = msix_find(d, addr); + struct domain *d = v->domain; + struct vpci_msix *msix; const struct vpci_msix_entry *entry; unsigned int offset; *data = ~0UL; + read_lock(&d->pci_lock); + + msix = msix_find(d, addr); if ( !msix ) + { + read_unlock(&d->pci_lock); return X86EMUL_RETRY; + } if ( adjacent_handle(msix, addr) ) - return adjacent_read(d, msix, addr, len, data); + { + int rc = adjacent_read(d, msix, addr, len, data); + + read_unlock(&d->pci_lock); + return rc; + } if ( !access_allowed(msix->pdev, addr, len) ) + { + read_unlock(&d->pci_lock); return X86EMUL_OKAY; + } spin_lock(&msix->pdev->vpci->lock); entry = get_entry(msix, addr); @@ -404,6 +426,7 @@ static int cf_check msix_read( break; } spin_unlock(&msix->pdev->vpci->lock); + read_unlock(&d->pci_lock); return X86EMUL_OKAY; } @@ -491,19 
+514,33 @@ static int adjacent_write(const struct domain *d, const struct vpci_msix *msix, static int cf_check msix_write( struct vcpu *v, unsigned long addr, unsigned int len, unsigned long data) { - const struct domain *d = v->domain; - struct vpci_msix *msix = msix_find(d, addr); + struct domain *d = v->domain; + struct vpci_msix *msix; struct vpci_msix_entry *entry; unsigned int offset; + read_lock(&d->pci_lock); + + msix = msix_find(d, addr); if ( !msix ) + { + read_unlock(&d->pci_lock); return X86EMUL_RETRY; + } if ( adjacent_handle(msix, addr) ) - return adjacent_write(d, msix, addr, len, data); + { + int rc = adjacent_write(d, msix, addr, len, data); + + read_unlock(&d->pci_lock); + return rc; + } if ( !access_allowed(msix->pdev, addr, len) ) + { + read_unlock(&d->pci_lock); return X86EMUL_OKAY; + } spin_lock(&msix->pdev->vpci->lock); entry = get_entry(msix, addr); @@ -579,6 +616,7 @@ static int cf_check msix_write( break; } spin_unlock(&msix->pdev->vpci->lock); + read_unlock(&d->pci_lock); return X86EMUL_OKAY; } diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c index 72ef277c4f8e..475272b173f3 100644 --- a/xen/drivers/vpci/vpci.c +++ b/xen/drivers/vpci/vpci.c @@ -42,6 +42,8 @@ extern vpci_register_init_t *const __end_vpci_array[]; void vpci_remove_device(struct pci_dev *pdev) { + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock)); + if ( !has_vpci(pdev->domain) || !pdev->vpci ) return; @@ -77,6 +79,8 @@ int vpci_add_handlers(struct pci_dev *pdev) const unsigned long *ro_map; int rc = 0; + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock)); + if ( !has_vpci(pdev->domain) ) return 0; @@ -361,7 +365,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size, uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) { - const struct domain *d = current->domain; + struct domain *d = current->domain; const struct pci_dev *pdev; const struct vpci_register *r; unsigned int data_offset = 0; @@ -376,12 +380,18 @@ 
uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) /* * Find the PCI dev matching the address, which for hwdom also requires * consulting DomXEN. Passthrough everything that's not trapped. + * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's + * pci_lock is sufficient. */ + read_lock(&d->pci_lock); pdev = pci_get_pdev(d, sbdf); if ( !pdev && is_hardware_domain(d) ) pdev = pci_get_pdev(dom_xen, sbdf); if ( !pdev || !pdev->vpci ) + { + read_unlock(&d->pci_lock); return vpci_read_hw(sbdf, reg, size); + } spin_lock(&pdev->vpci->lock); @@ -428,6 +438,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) ASSERT(data_offset < size); } spin_unlock(&pdev->vpci->lock); + read_unlock(&d->pci_lock); if ( data_offset < size ) { @@ -470,7 +481,7 @@ static void vpci_write_helper(const struct pci_dev *pdev, void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, uint32_t data) { - const struct domain *d = current->domain; + struct domain *d = current->domain; const struct pci_dev *pdev; const struct vpci_register *r; unsigned int data_offset = 0; @@ -484,7 +495,13 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, /* * Find the PCI dev matching the address, which for hwdom also requires * consulting DomXEN. Passthrough everything that's not trapped. + * If this is hwdom and the device is assigned to DomXEN, acquiring hwdom's + * pci_lock is sufficient. + * + * TODO: We need to take pci_locks in exclusive mode only if we + * are modifying BARs, so there is a room for improvement. */ + write_lock(&d->pci_lock); pdev = pci_get_pdev(d, sbdf); if ( !pdev && is_hardware_domain(d) ) pdev = pci_get_pdev(dom_xen, sbdf); @@ -493,6 +510,8 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, /* Ignore writes to read-only devices, which have no ->vpci. 
*/ const unsigned long *ro_map = pci_get_ro_map(sbdf.seg); + write_unlock(&d->pci_lock); + if ( !ro_map || !test_bit(sbdf.bdf, ro_map) ) vpci_write_hw(sbdf, reg, size, data); return; @@ -534,6 +553,7 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, ASSERT(data_offset < size); } spin_unlock(&pdev->vpci->lock); + write_unlock(&d->pci_lock); if ( data_offset < size ) /* Tailing gap, write the remaining. */ diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 9da91e0e6244..c3adec1aca3c 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -462,7 +462,8 @@ struct domain #ifdef CONFIG_HAS_PCI struct list_head pdev_list; /* - * pci_lock protects access to pdev_list. + * pci_lock protects access to pdev_list. pci_lock also protects pdev->vpci + * structure from being removed. * * Any user *reading* from pdev_list, or from devices stored in pdev_list, * should hold either pcidevs_lock() or pci_lock in read mode. Optionally, @@ -628,6 +629,18 @@ struct domain unsigned int cdf; }; +/* + * Check for use in ASSERTs to ensure that: + * 1. we can *read* d->pdev_list + * 2. pdevs (belonging to this domain) do not go away + * 3. pdevs (belonging to this domain) do not get assigned to other domains + * This check is not suitable for protecting other state or critical regions. + */ +#define pdev_list_is_read_locked(d) ({ \ + struct domain *d_ = (d); \ + pcidevs_locked() || (d_ && rw_is_locked(&d_->pci_lock)); \ + }) + static inline struct page_list_head *page_to_list( struct domain *d, const struct page_info *pg) {