| Message ID | 20220209133627.959649-1-andr2000@gmail.com (mailing list archive) |
|---|---|
| State | Superseded |
| Series | vpci: introduce per-domain lock to protect vpci structure |
On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: > From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > > Introduce a per-domain read/write lock to check whether vpci is present, > so we are sure there are no accesses to the contents of the vpci struct > if not. This lock can be used (and in a few cases is used right away) > so that vpci removal can be performed while holding the lock in write > mode. Previously such removal could race with vpci_read for example. Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt pci_remove_device, and likely when vPCI gets also used in {de}assign_device I think. > 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure > from being removed. > > 2. Writing the command register and ROM BAR register may trigger > modify_bars to run, which in turn may access multiple pdevs while > checking for the existing BAR's overlap. The overlapping check, if done > under the read lock, requires vpci->lock to be acquired on both devices > being compared, which may produce a deadlock. It is not possible to > upgrade read lock to write lock in such a case. So, in order to prevent > the deadlock, check which registers are going to be written and acquire > the lock in the appropriate mode from the beginning. > > All other code, which doesn't lead to pdev->vpci destruction and does not > access multiple pdevs at the same time, can still use a combination of the > read lock and pdev->vpci->lock. > > 3. Optimize if ROM BAR write lock required detection by caching offset > of the ROM BAR register in vpci->header->rom_reg which depends on > header's type. > > 4. Reduce locked region in vpci_remove_device as it is now possible > to set pdev->vpci to NULL early right after the write lock is acquired. > > 5. Reduce locked region in vpci_add_handlers as it is possible to > initialize many more fields of the struct vpci before assigning it to > pdev->vpci. > > 6. 
vpci_{add|remove}_register are required to be called with the write lock > held, but it is not feasible to add an assert there as it requires > struct domain to be passed for that. So, add a comment about this requirement > to these and other functions with the equivalent constraints. > > 7. Drop const qualifier where the new rwlock is used and this is appropriate. > > 8. This is based on the discussion at [1]. > > [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/ > > Suggested-by: Roger Pau Monné <roger.pau@citrix.com> > Suggested-by: Jan Beulich <jbeulich@suse.com> > Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > > --- > This was checked on x86: with and without PVH Dom0. > --- > xen/arch/x86/hvm/vmsi.c | 2 + > xen/common/domain.c | 3 + > xen/drivers/vpci/header.c | 8 +++ > xen/drivers/vpci/msi.c | 8 ++- > xen/drivers/vpci/msix.c | 40 +++++++++++-- > xen/drivers/vpci/vpci.c | 114 ++++++++++++++++++++++++++++---------- > xen/include/xen/sched.h | 3 + > xen/include/xen/vpci.h | 2 + > 8 files changed, 146 insertions(+), 34 deletions(-) > > diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c > index 13e2a190b439..351cb968a423 100644 > --- a/xen/arch/x86/hvm/vmsi.c > +++ b/xen/arch/x86/hvm/vmsi.c > @@ -893,6 +893,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) > { > unsigned int i; > > + ASSERT(!!rw_is_locked(&msix->pdev->domain->vpci_rwlock)); ^ no need for the double negation. Also this asserts that the lock is taken, but could be by a different pCPU. I guess it's better than nothing. > + > for ( i = 0; i < msix->max_entries; i++ ) > { > const struct vpci_msix_entry *entry = &msix->entries[i]; Since this function is now called with the per-domain rwlock read locked it's likely not appropriate to call process_pending_softirqs while holding such lock (check below). 
We will likely need to re-iterate over the list of pdevs assigned to the domain and assert that the pdev is still assigned to the same domain. > diff --git a/xen/common/domain.c b/xen/common/domain.c > index 2048ebad86ff..10558c22285d 100644 > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -616,6 +616,9 @@ struct domain *domain_create(domid_t domid, > > #ifdef CONFIG_HAS_PCI > INIT_LIST_HEAD(&d->pdev_list); > +#ifdef CONFIG_HAS_VPCI > + rwlock_init(&d->vpci_rwlock); > +#endif > #endif > > /* All error paths can depend on the above setup. */ > diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c > index 40ff79c33f8f..9e2aeb2055c9 100644 > --- a/xen/drivers/vpci/header.c > +++ b/xen/drivers/vpci/header.c > @@ -142,12 +142,14 @@ bool vpci_process_pending(struct vcpu *v) > if ( rc == -ERESTART ) > return true; > > + read_lock(&v->domain->vpci_rwlock); > spin_lock(&v->vpci.pdev->vpci->lock); > /* Disable memory decoding unconditionally on failure. */ > modify_decoding(v->vpci.pdev, > rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd, > !rc && v->vpci.rom_only); > spin_unlock(&v->vpci.pdev->vpci->lock); > + read_unlock(&v->domain->vpci_rwlock); > > rangeset_destroy(v->vpci.mem); > v->vpci.mem = NULL; > @@ -203,6 +205,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev, > raise_softirq(SCHEDULE_SOFTIRQ); > } > > +/* This must hold domain's vpci_rwlock in write mode. 
*/ > static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only) > { > struct vpci_header *header = &pdev->vpci->header; > @@ -454,6 +457,8 @@ static int init_bars(struct pci_dev *pdev) > struct vpci_bar *bars = header->bars; > int rc; > > + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock)); > + > switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f ) > { > case PCI_HEADER_TYPE_NORMAL: > @@ -548,6 +553,7 @@ static int init_bars(struct pci_dev *pdev) > { > struct vpci_bar *rom = &header->bars[num_bars]; > > + header->rom_reg = rom_reg; > rom->type = VPCI_BAR_ROM; > rom->size = size; > rom->addr = addr; > @@ -559,6 +565,8 @@ static int init_bars(struct pci_dev *pdev) > if ( rc ) > rom->type = VPCI_BAR_EMPTY; You can also set 'rom_reg = ~0' here. Or move the setting of rom_reg after the handler has been successfully installed. I think it would be easier to just signal no ROM BAR with rom_reg == 0. There's no header where the ROM BAR is at offset 0. That way you will only have to set rom_reg on the successful path, but you don't need to care about the case where there's no ROM BAR. > } > + else > + header->rom_reg = ~(unsigned int)0; No need for the cast. > > return (cmd & PCI_COMMAND_MEMORY) ? 
modify_bars(pdev, cmd, false) : 0; > } > diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c > index 5757a7aed20f..5df3dfa8243c 100644 > --- a/xen/drivers/vpci/msi.c > +++ b/xen/drivers/vpci/msi.c > @@ -190,6 +190,8 @@ static int init_msi(struct pci_dev *pdev) > uint16_t control; > int ret; > > + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock)); > + > if ( !pos ) > return 0; > > @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW); > > void vpci_dump_msi(void) > { > - const struct domain *d; > + struct domain *d; > > rcu_read_lock(&domlist_read_lock); > for_each_domain ( d ) > @@ -277,6 +279,9 @@ void vpci_dump_msi(void) > > printk("vPCI MSI/MSI-X d%d\n", d->domain_id); > > + if ( !read_trylock(&d->vpci_rwlock) ) > + continue; > + > for_each_pdev ( d, pdev ) > { > const struct vpci_msi *msi; > @@ -326,6 +331,7 @@ void vpci_dump_msi(void) > spin_unlock(&pdev->vpci->lock); > process_pending_softirqs(); > } > + read_unlock(&d->vpci_rwlock); Same here, you are calling process_pending_softirqs while holding vpci_rwlock. > } > rcu_read_unlock(&domlist_read_lock); > } > diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c > index 846f1b8d7038..5296d6025d8e 100644 > --- a/xen/drivers/vpci/msix.c > +++ b/xen/drivers/vpci/msix.c > @@ -138,6 +138,7 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg, > pci_conf_write16(pdev->sbdf, reg, val); > } > > +/* This must hold domain's vpci_rwlock in write mode. */ > static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr) > { > struct vpci_msix *msix; > @@ -158,7 +159,12 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr) > > static int msix_accept(struct vcpu *v, unsigned long addr) > { > - return !!msix_find(v->domain, addr); > + int rc; > + > + read_lock(&v->domain->vpci_rwlock); > + rc = !!msix_find(v->domain, addr); > + read_unlock(&v->domain->vpci_rwlock); Newline before return. 
> + return rc; > } > > static bool access_allowed(const struct pci_dev *pdev, unsigned long addr, > index fb0947179b79..16bb3b832e6a 100644 > --- a/xen/drivers/vpci/vpci.c > +++ b/xen/drivers/vpci/vpci.c > @@ -89,22 +104,28 @@ int vpci_add_handlers(struct pci_dev *pdev) > } > #endif /* __XEN__ */ > > -static int vpci_register_cmp(const struct vpci_register *r1, > - const struct vpci_register *r2) > +static int vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size, > + unsigned int r2_offset, unsigned int r2_size) > { > /* Return 0 if registers overlap. */ > - if ( r1->offset < r2->offset + r2->size && > - r2->offset < r1->offset + r1->size ) > + if ( r1_offset < r2_offset + r2_size && > + r2_offset < r1_offset + r1_size ) > return 0; > - if ( r1->offset < r2->offset ) > + if ( r1_offset < r2_offset ) > return -1; > - if ( r1->offset > r2->offset ) > + if ( r1_offset > r2_offset ) > return 1; > > ASSERT_UNREACHABLE(); > return 0; > } > > +static int vpci_register_cmp(const struct vpci_register *r1, > + const struct vpci_register *r2) > +{ > + return vpci_offset_cmp(r1->offset, r1->size, r2->offset, r2->size); > +} Seeing how this gets used below I'm not sure it's very helpful to reuse vpci_register_cmp, see my comment below. > + > /* Dummy hooks, writes are ignored, reads return 1's */ > static uint32_t vpci_ignored_read(const struct pci_dev *pdev, unsigned int reg, > void *data) > @@ -129,6 +150,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg, > return pci_conf_read32(pdev->sbdf, reg); > } > > +/* This must hold domain's vpci_rwlock in write mode. */ > int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler, > vpci_write_t *write_handler, unsigned int offset, > unsigned int size, void *data) > @@ -152,8 +174,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler, > r->offset = offset; > r->private = data; > > - spin_lock(&vpci->lock); > - > /* The list of handlers must be kept sorted at all times. 
*/ > list_for_each ( prev, &vpci->handlers ) > { > @@ -165,25 +185,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler, > break; > if ( cmp == 0 ) > { > - spin_unlock(&vpci->lock); > xfree(r); > return -EEXIST; > } > } > > list_add_tail(&r->node, prev); > - spin_unlock(&vpci->lock); > > return 0; > } > > +/* This must hold domain's vpci_rwlock in write mode. */ > int vpci_remove_register(struct vpci *vpci, unsigned int offset, > unsigned int size) > { > const struct vpci_register r = { .offset = offset, .size = size }; > struct vpci_register *rm; > > - spin_lock(&vpci->lock); > list_for_each_entry ( rm, &vpci->handlers, node ) > { > int cmp = vpci_register_cmp(&r, rm); > @@ -195,14 +213,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset, > if ( !cmp && rm->offset == offset && rm->size == size ) > { > list_del(&rm->node); > - spin_unlock(&vpci->lock); > xfree(rm); > return 0; > } > if ( cmp <= 0 ) > break; > } > - spin_unlock(&vpci->lock); > > return -ENOENT; > } > @@ -310,7 +326,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size, > > uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) > { > - const struct domain *d = current->domain; > + struct domain *d = current->domain; > const struct pci_dev *pdev; > const struct vpci_register *r; > unsigned int data_offset = 0; > @@ -327,6 +343,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) > if ( !pdev ) > return vpci_read_hw(sbdf, reg, size); > > + read_lock(&d->vpci_rwlock); After taking the domain lock you need to check that pdev->vpci != NULL, as vpci_remove_device will set pdev->vpci == NULL before removing the device from the domain. Same applies to vpci_add_handlers which will be called with the device already added to the domain, but with pdev->vpci == NULL. 
We would also need some kind of protection around pci_get_pdev_by_domain so that devices are not removed (from the domain) while we are iterating over it. > spin_lock(&pdev->vpci->lock); > > /* Read from the hardware or the emulated register handlers. */ > @@ -371,6 +388,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) > ASSERT(data_offset < size); > } > spin_unlock(&pdev->vpci->lock); > + read_unlock(&d->vpci_rwlock); > > if ( data_offset < size ) > { > @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev *pdev, > r->private); > } > > +static bool vpci_header_write_lock(const struct pci_dev *pdev, > + unsigned int start, unsigned int size) I think this should live in header.c, for consistency. I'm also not sure it's worth adding vpci_offset_cmp: you just need to do a range overlap check, and that can be easily open coded. It's just the first 'if' in vpci_register_cmp that you want, the rest of the code is just adding overhead. > +{ > + /* > + * Writing the command register and ROM BAR register may trigger > + * modify_bars to run which in turn may access multiple pdevs while > + * checking for the existing BAR's overlap. The overlapping check, if done > + * under the read lock, requires vpci->lock to be acquired on both devices > + * being compared, which may produce a deadlock. It is not possible to > + * upgrade read lock to write lock in such a case. So, in order to prevent > + * the deadlock, check which registers are going to be written and acquire > + * the lock in the appropriate mode from the beginning. > + */ > + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) > + return true; > + > + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) No need for the comparison if rom_reg is unset. Also you can OR both conditions into a single if. 
> + return true; > + > + return false; > +} > + > void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, > uint32_t data) > { > - const struct domain *d = current->domain; > + struct domain *d = current->domain; > const struct pci_dev *pdev; > const struct vpci_register *r; > unsigned int data_offset = 0; > const unsigned long *ro_map = pci_get_ro_map(sbdf.seg); > + bool write_locked = false; > > if ( !size ) > { > @@ -440,7 +481,17 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, > return; > } > > - spin_lock(&pdev->vpci->lock); > + if ( vpci_header_write_lock(pdev, reg, size) ) > + { > + /* Gain exclusive access to all of the domain pdevs vpci. */ > + write_lock(&d->vpci_rwlock); > + write_locked = true; Here you need to check that pdev->vpci != NULL... > + } > + else > + { > + read_lock(&d->vpci_rwlock); ...and also here. > + spin_lock(&pdev->vpci->lock); > + } > > /* Write the value to the hardware or emulated registers. */ > list_for_each_entry ( r, &pdev->vpci->handlers, node ) > @@ -475,7 +526,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, > break; > ASSERT(data_offset < size); > } > - spin_unlock(&pdev->vpci->lock); > + > + if ( write_locked ) > + write_unlock(&d->vpci_rwlock); > + else > + { > + spin_unlock(&pdev->vpci->lock); > + read_unlock(&d->vpci_rwlock); > + } > > if ( data_offset < size ) > /* Tailing gap, write the remaining. 
*/ > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > index 37f78cc4c4c9..ecd34481a7af 100644 > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -444,6 +444,9 @@ struct domain > > #ifdef CONFIG_HAS_PCI > struct list_head pdev_list; > +#ifdef CONFIG_HAS_VPCI > + rwlock_t vpci_rwlock; > +#endif > #endif > > #ifdef CONFIG_HAS_PASSTHROUGH > diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h > index e8ac1eb39513..e19e462ee5cb 100644 > --- a/xen/include/xen/vpci.h > +++ b/xen/include/xen/vpci.h > @@ -88,6 +88,8 @@ struct vpci { > * is mapped into guest p2m) if there's a ROM BAR on the device. > */ > bool rom_enabled : 1; > + /* Offset to the ROM BAR register if any. */ > + unsigned int rom_reg; Could you please place this field after 'type'? That will avoid some padding in the structure. Thanks, Roger.
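Roger's two suggestions above (open-code the range overlap instead of adding vpci_offset_cmp, and use rom_reg == 0 as the "no ROM BAR" sentinel since no header type has the ROM BAR at offset 0) can be sketched as follows. This is an illustrative, standalone sketch: `overlaps` and `header_needs_write_lock` are hypothetical names, not the actual Xen code.

```c
#include <assert.h>
#include <stdbool.h>

#define PCI_COMMAND 0x04 /* offset of the 16-bit command register */

/*
 * Open-coded range overlap test: just the first 'if' of
 * vpci_register_cmp, without the -1/0/1 comparison overhead.
 */
static bool overlaps(unsigned int s1, unsigned int len1,
                     unsigned int s2, unsigned int len2)
{
    return s1 < s2 + len2 && s2 < s1 + len1;
}

/*
 * rom_reg == 0 means "no ROM BAR", so only the successful init path
 * has to set it; the check short-circuits when it is unset, and both
 * conditions are ORed into a single expression.
 */
static bool header_needs_write_lock(unsigned int rom_reg,
                                    unsigned int start, unsigned int size)
{
    return overlaps(start, size, PCI_COMMAND, 2) ||
           (rom_reg && overlaps(start, size, rom_reg, 4));
}
```

For a type-0 header with the ROM BAR at 0x30, a 4-byte write at 0x30 or a write touching 0x04-0x05 would take the write lock; a plain BAR write at 0x10 would only need the read lock.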
Hi, Roger! On 10.02.22 18:16, Roger Pau Monné wrote: > On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >> >> Introduce a per-domain read/write lock to check whether vpci is present, >> so we are sure there are no accesses to the contents of the vpci struct >> if not. This lock can be used (and in a few cases is used right away) >> so that vpci removal can be performed while holding the lock in write >> mode. Previously such removal could race with vpci_read for example. > Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt > pci_remove_device, and likely when vPCI gets also used in > {de}assign_device I think. Yes, this is indeed an issue, but I was not trying to solve it in the context of vPCI locking yet. I think we should discuss how we approach pdev locking, so I can create a patch for that. That being said, I would like not to solve pdev locking in this patch yet... I do understand we want to avoid that, but at the moment the single reliable way of making sure a pdev is alive seems to be pcidevs_lock... > >> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure >> from being removed. >> >> 2. Writing the command register and ROM BAR register may trigger >> modify_bars to run, which in turn may access multiple pdevs while >> checking for the existing BAR's overlap. The overlapping check, if done >> under the read lock, requires vpci->lock to be acquired on both devices >> being compared, which may produce a deadlock. It is not possible to >> upgrade read lock to write lock in such a case. So, in order to prevent >> the deadlock, check which registers are going to be written and acquire >> the lock in the appropriate mode from the beginning. >> >> All other code, which doesn't lead to pdev->vpci destruction and does not >> access multiple pdevs at the same time, can still use a combination of the >> read lock and pdev->vpci->lock. >> >> 3. 
Optimize if ROM BAR write lock required detection by caching offset >> of the ROM BAR register in vpci->header->rom_reg which depends on >> header's type. >> >> 4. Reduce locked region in vpci_remove_device as it is now possible >> to set pdev->vpci to NULL early right after the write lock is acquired. >> >> 5. Reduce locked region in vpci_add_handlers as it is possible to >> initialize many more fields of the struct vpci before assigning it to >> pdev->vpci. >> >> 6. vpci_{add|remove}_register are required to be called with the write lock >> held, but it is not feasible to add an assert there as it requires >> struct domain to be passed for that. So, add a comment about this requirement >> to these and other functions with the equivalent constraints. >> >> 7. Drop const qualifier where the new rwlock is used and this is appropriate. >> >> 8. This is based on the discussion at [1]. >> >> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/ >> >> Suggested-by: Roger Pau Monné <roger.pau@citrix.com> >> Suggested-by: Jan Beulich <jbeulich@suse.com> >> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >> >> --- >> This was checked on x86: with and without PVH Dom0. 
>> --- >> xen/arch/x86/hvm/vmsi.c | 2 + >> xen/common/domain.c | 3 + >> xen/drivers/vpci/header.c | 8 +++ >> xen/drivers/vpci/msi.c | 8 ++- >> xen/drivers/vpci/msix.c | 40 +++++++++++-- >> xen/drivers/vpci/vpci.c | 114 ++++++++++++++++++++++++++++---------- >> xen/include/xen/sched.h | 3 + >> xen/include/xen/vpci.h | 2 + >> 8 files changed, 146 insertions(+), 34 deletions(-) >> >> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c >> index 13e2a190b439..351cb968a423 100644 >> --- a/xen/arch/x86/hvm/vmsi.c >> +++ b/xen/arch/x86/hvm/vmsi.c >> @@ -893,6 +893,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) >> { >> unsigned int i; >> >> + ASSERT(!!rw_is_locked(&msix->pdev->domain->vpci_rwlock)); > ^ no need for the double negation. Ok, will update all asserts which use !! > > Also this asserts that the lock is taken, but could be by a different > pCPU. I guess it's better than nothing. Fair enough. Do you still want the asserts or should I remove them? > >> + >> for ( i = 0; i < msix->max_entries; i++ ) >> { >> const struct vpci_msix_entry *entry = &msix->entries[i]; > Since this function is now called with the per-domain rwlock read > locked it's likely not appropriate to call process_pending_softirqs > while holding such lock (check below). You are right, as it is possible that: process_pending_softirqs -> vpci_process_pending -> read_lock Even more, vpci_process_pending may also read_unlock -> vpci_remove_device -> write_lock in its error path. So, any invocation of process_pending_softirqs must not hold d->vpci_rwlock at least. And also we need to check that pdev->vpci was not removed in between or *re-created* > > We will likely need to re-iterate over the list of pdevs assigned to > the domain and assert that the pdev is still assigned to the same > domain. So, do you mean a pattern like the below should be used at all places where we need to call process_pending_softirqs? 
read_unlock
process_pending_softirqs
read_lock
pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) )
    <continue processing>
> >> diff --git a/xen/common/domain.c b/xen/common/domain.c >> index 2048ebad86ff..10558c22285d 100644 >> --- a/xen/common/domain.c >> +++ b/xen/common/domain.c >> @@ -616,6 +616,9 @@ struct domain *domain_create(domid_t domid, >> >> #ifdef CONFIG_HAS_PCI >> INIT_LIST_HEAD(&d->pdev_list); >> +#ifdef CONFIG_HAS_VPCI >> + rwlock_init(&d->vpci_rwlock); >> +#endif >> #endif >> >> /* All error paths can depend on the above setup. */ >> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c >> index 40ff79c33f8f..9e2aeb2055c9 100644 >> --- a/xen/drivers/vpci/header.c >> +++ b/xen/drivers/vpci/header.c >> @@ -142,12 +142,14 @@ bool vpci_process_pending(struct vcpu *v) >> if ( rc == -ERESTART ) >> return true; >> >> + read_lock(&v->domain->vpci_rwlock); >> spin_lock(&v->vpci.pdev->vpci->lock); >> /* Disable memory decoding unconditionally on failure. */ >> modify_decoding(v->vpci.pdev, >> rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd, >> !rc && v->vpci.rom_only); >> spin_unlock(&v->vpci.pdev->vpci->lock); >> + read_unlock(&v->domain->vpci_rwlock); >> >> rangeset_destroy(v->vpci.mem); >> v->vpci.mem = NULL; >> @@ -203,6 +205,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev, >> raise_softirq(SCHEDULE_SOFTIRQ); >> } >> >> +/* This must hold domain's vpci_rwlock in write mode. 
*/ >> static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only) >> { >> struct vpci_header *header = &pdev->vpci->header; >> @@ -454,6 +457,8 @@ static int init_bars(struct pci_dev *pdev) >> struct vpci_bar *bars = header->bars; >> int rc; >> >> + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock)); >> + >> switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f ) >> { >> case PCI_HEADER_TYPE_NORMAL: >> @@ -548,6 +553,7 @@ static int init_bars(struct pci_dev *pdev) >> { >> struct vpci_bar *rom = &header->bars[num_bars]; >> >> + header->rom_reg = rom_reg; >> rom->type = VPCI_BAR_ROM; >> rom->size = size; >> rom->addr = addr; >> @@ -559,6 +565,8 @@ static int init_bars(struct pci_dev *pdev) >> if ( rc ) >> rom->type = VPCI_BAR_EMPTY; > You can also set 'rom_reg = ~0' here. Or move the setting of rom_reg > after the handler has been successfully installed. > > I think it would be easier to just signal no ROM BAR with rom_reg == > 0. There's no header where the ROM BAR is at offset 0. That way you > will only have to set rom_reg on the successful path, but you don't > need to care about the case where there's no ROM BAR. Yes, it is possible > >> } >> + else >> + header->rom_reg = ~(unsigned int)0; > No need for the cast. Ok > >> >> return (cmd & PCI_COMMAND_MEMORY) ? 
modify_bars(pdev, cmd, false) : 0; >> } >> diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c >> index 5757a7aed20f..5df3dfa8243c 100644 >> --- a/xen/drivers/vpci/msi.c >> +++ b/xen/drivers/vpci/msi.c >> @@ -190,6 +190,8 @@ static int init_msi(struct pci_dev *pdev) >> uint16_t control; >> int ret; >> >> + ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock)); >> + >> if ( !pos ) >> return 0; >> >> @@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW); >> >> void vpci_dump_msi(void) >> { >> - const struct domain *d; >> + struct domain *d; >> >> rcu_read_lock(&domlist_read_lock); >> for_each_domain ( d ) >> @@ -277,6 +279,9 @@ void vpci_dump_msi(void) >> >> printk("vPCI MSI/MSI-X d%d\n", d->domain_id); >> >> + if ( !read_trylock(&d->vpci_rwlock) ) >> + continue; >> + >> for_each_pdev ( d, pdev ) >> { >> const struct vpci_msi *msi; >> @@ -326,6 +331,7 @@ void vpci_dump_msi(void) >> spin_unlock(&pdev->vpci->lock); >> process_pending_softirqs(); >> } >> + read_unlock(&d->vpci_rwlock); > Same here, you are calling process_pending_softirqs while holding > vpci_rwlock. Same as above > >> } >> rcu_read_unlock(&domlist_read_lock); >> } >> diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c >> index 846f1b8d7038..5296d6025d8e 100644 >> --- a/xen/drivers/vpci/msix.c >> +++ b/xen/drivers/vpci/msix.c >> @@ -138,6 +138,7 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg, >> pci_conf_write16(pdev->sbdf, reg, val); >> } >> >> +/* This must hold domain's vpci_rwlock in write mode. 
*/ >> static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr) >> { >> struct vpci_msix *msix; >> @@ -158,7 +159,12 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr) >> >> static int msix_accept(struct vcpu *v, unsigned long addr) >> { >> - return !!msix_find(v->domain, addr); >> + int rc; >> + >> + read_lock(&v->domain->vpci_rwlock); >> + rc = !!msix_find(v->domain, addr); >> + read_unlock(&v->domain->vpci_rwlock); > Newline before return. Ok > >> + return rc; >> } >> >> static bool access_allowed(const struct pci_dev *pdev, unsigned long addr, >> index fb0947179b79..16bb3b832e6a 100644 >> --- a/xen/drivers/vpci/vpci.c >> +++ b/xen/drivers/vpci/vpci.c >> @@ -89,22 +104,28 @@ int vpci_add_handlers(struct pci_dev *pdev) >> } >> #endif /* __XEN__ */ >> >> -static int vpci_register_cmp(const struct vpci_register *r1, >> - const struct vpci_register *r2) >> +static int vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size, >> + unsigned int r2_offset, unsigned int r2_size) >> { >> /* Return 0 if registers overlap. */ >> - if ( r1->offset < r2->offset + r2->size && >> - r2->offset < r1->offset + r1->size ) >> + if ( r1_offset < r2_offset + r2_size && >> + r2_offset < r1_offset + r1_size ) >> return 0; >> - if ( r1->offset < r2->offset ) >> + if ( r1_offset < r2_offset ) >> return -1; >> - if ( r1->offset > r2->offset ) >> + if ( r1_offset > r2_offset ) >> return 1; >> >> ASSERT_UNREACHABLE(); >> return 0; >> } >> >> +static int vpci_register_cmp(const struct vpci_register *r1, >> + const struct vpci_register *r2) >> +{ >> + return vpci_offset_cmp(r1->offset, r1->size, r2->offset, r2->size); >> +} > Seeing how this gets used below I'm not sure it's very helpful to > reuse vpci_register_cmp, see my comment below. 
Yes, the only thing which is used is the first check for the overlap > >> + >> /* Dummy hooks, writes are ignored, reads return 1's */ >> static uint32_t vpci_ignored_read(const struct pci_dev *pdev, unsigned int reg, >> void *data) >> @@ -129,6 +150,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg, >> return pci_conf_read32(pdev->sbdf, reg); >> } >> >> +/* This must hold domain's vpci_rwlock in write mode. */ >> int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler, >> vpci_write_t *write_handler, unsigned int offset, >> unsigned int size, void *data) >> @@ -152,8 +174,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler, >> r->offset = offset; >> r->private = data; >> >> - spin_lock(&vpci->lock); >> - >> /* The list of handlers must be kept sorted at all times. */ >> list_for_each ( prev, &vpci->handlers ) >> { >> @@ -165,25 +185,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler, >> break; >> if ( cmp == 0 ) >> { >> - spin_unlock(&vpci->lock); >> xfree(r); >> return -EEXIST; >> } >> } >> >> list_add_tail(&r->node, prev); >> - spin_unlock(&vpci->lock); >> >> return 0; >> } >> >> +/* This must hold domain's vpci_rwlock in write mode. 
*/ >> int vpci_remove_register(struct vpci *vpci, unsigned int offset, >> unsigned int size) >> { >> const struct vpci_register r = { .offset = offset, .size = size }; >> struct vpci_register *rm; >> >> - spin_lock(&vpci->lock); >> list_for_each_entry ( rm, &vpci->handlers, node ) >> { >> int cmp = vpci_register_cmp(&r, rm); >> @@ -195,14 +213,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset, >> if ( !cmp && rm->offset == offset && rm->size == size ) >> { >> list_del(&rm->node); >> - spin_unlock(&vpci->lock); >> xfree(rm); >> return 0; >> } >> if ( cmp <= 0 ) >> break; >> } >> - spin_unlock(&vpci->lock); >> >> return -ENOENT; >> } >> @@ -310,7 +326,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size, >> >> uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) >> { >> - const struct domain *d = current->domain; >> + struct domain *d = current->domain; >> const struct pci_dev *pdev; >> const struct vpci_register *r; >> unsigned int data_offset = 0; >> @@ -327,6 +343,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) >> if ( !pdev ) >> return vpci_read_hw(sbdf, reg, size); >> >> + read_lock(&d->vpci_rwlock); > After taking the domain lock you need to check that pdev->vpci != > NULL, as vpci_remove_device will set pdev->vpci == NULL before > removing the device from the domain. Same applies to vpci_add_handlers > which will be called with the device already added to the domain, but > with pdev->vpci == NULL. Sure, will add > > We would also need some kind of protection arround > pci_get_pdev_by_domain so that devices are not removed (from the > domain) while we are iterating over it. Again, this is a problem. And we need to find a way to solve that > >> spin_lock(&pdev->vpci->lock); >> >> /* Read from the hardware or the emulated register handlers. 
*/ >> @@ -371,6 +388,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size) >> ASSERT(data_offset < size); >> } >> spin_unlock(&pdev->vpci->lock); >> + read_unlock(&d->vpci_rwlock); >> >> if ( data_offset < size ) >> { >> @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev *pdev, >> r->private); >> } >> >> +static bool vpci_header_write_lock(const struct pci_dev *pdev, >> + unsigned int start, unsigned int size) > I think this should live in header.c, for consistency. Ok, will move > > I'm also not sure it's worth adding vpci_offset_cmp: you just need to > do a range overlap check, and that can be easily open coded. It's just > the first 'if' in vpci_register_cmp that you want, the rest of the > code is just adding overhead. Agree > >> +{ >> + /* >> + * Writing the command register and ROM BAR register may trigger >> + * modify_bars to run which in turn may access multiple pdevs while >> + * checking for the existing BAR's overlap. The overlapping check, if done >> + * under the read lock, requires vpci->lock to be acquired on both devices >> + * being compared, which may produce a deadlock. It is not possible to >> + * upgrade read lock to write lock in such a case. So, in order to prevent >> + * the deadlock, check which registers are going to be written and acquire >> + * the lock in the appropriate mode from the beginning. >> + */ >> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) >> + return true; >> + >> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) > No need for the comparison if rom_reg is unset. Also you can OR both > conditions into a single if. If we open code vpci_offset_cmp with a single if all this is going to be a bit clumsy: if ( r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size ) return 0; This is a single check. 
Now we need to check two registers with the code above and also check that pdev->vpci->header.rom_reg != 0 I think it would be more readable if we have a tiny helper function static bool vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size, unsigned int r2_offset, unsigned int r2_size) { /* Return 0 if registers overlap. */ if ( r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size ) return false; return true; } So, then we can have something like static bool vpci_header_write_lock(const struct pci_dev *pdev, unsigned int start, unsigned int size) { if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) || (pdev->vpci->header.rom_reg && !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4)) ) return true; return false; } > >> + return true; >> + >> + return false; >> +} >> + >> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, >> uint32_t data) >> { >> - const struct domain *d = current->domain; >> + struct domain *d = current->domain; >> const struct pci_dev *pdev; >> const struct vpci_register *r; >> unsigned int data_offset = 0; >> const unsigned long *ro_map = pci_get_ro_map(sbdf.seg); >> + bool write_locked = false; >> >> if ( !size ) >> { >> @@ -440,7 +481,17 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, >> return; >> } >> >> - spin_lock(&pdev->vpci->lock); >> + if ( vpci_header_write_lock(pdev, reg, size) ) >> + { >> + /* Gain exclusive access to all of the domain pdevs vpci. */ >> + write_lock(&d->vpci_rwlock); >> + write_locked = true; > Here you need to check that pdev->vpci != NULL... Will do > >> + } >> + else >> + { >> + read_lock(&d->vpci_rwlock); > ...and also here. Will do > >> + spin_lock(&pdev->vpci->lock); >> + } >> >> /* Write the value to the hardware or emulated registers. 
*/ >> list_for_each_entry ( r, &pdev->vpci->handlers, node ) >> @@ -475,7 +526,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size, >> break; >> ASSERT(data_offset < size); >> } >> - spin_unlock(&pdev->vpci->lock); >> + >> + if ( write_locked ) >> + write_unlock(&d->vpci_rwlock); >> + else >> + { >> + spin_unlock(&pdev->vpci->lock); >> + read_unlock(&d->vpci_rwlock); >> + } >> >> if ( data_offset < size ) >> /* Tailing gap, write the remaining. */ >> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >> index 37f78cc4c4c9..ecd34481a7af 100644 >> --- a/xen/include/xen/sched.h >> +++ b/xen/include/xen/sched.h >> @@ -444,6 +444,9 @@ struct domain >> >> #ifdef CONFIG_HAS_PCI >> struct list_head pdev_list; >> +#ifdef CONFIG_HAS_VPCI >> + rwlock_t vpci_rwlock; >> +#endif >> #endif >> >> #ifdef CONFIG_HAS_PASSTHROUGH >> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h >> index e8ac1eb39513..e19e462ee5cb 100644 >> --- a/xen/include/xen/vpci.h >> +++ b/xen/include/xen/vpci.h >> @@ -88,6 +88,8 @@ struct vpci { >> * is mapped into guest p2m) if there's a ROM BAR on the device. >> */ >> bool rom_enabled : 1; >> + /* Offset to the ROM BAR register if any. */ >> + unsigned int rom_reg; > Could you please place this field after 'type'? That will avoid some > padding in the structure. Sure > > Thanks, Roger. Thank you, Oleksandr
On 10.02.22 18:16, Roger Pau Monné wrote: > On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >> >> Introduce a per-domain read/write lock to check whether vpci is present, >> so we are sure there are no accesses to the contents of the vpci struct >> if not. This lock can be used (and in a few cases is used right away) >> so that vpci removal can be performed while holding the lock in write >> mode. Previously such removal could race with vpci_read for example. > Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt > pci_remove_device, and likely when vPCI gets also used in > {de}assign_device I think. > How about the below? It seems to guarantee that we can access pdev without issues and without requiring pcidevs_lock to be used? diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c index e8b09d77d880..fd464a58b3b3 100644 --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -937,8 +937,14 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, } devfn = pdev->devfn; +#ifdef CONFIG_HAS_VPCI + write_lock(&d->vpci_rwlock); +#endif ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn, pci_to_dev(pdev)); +#ifdef CONFIG_HAS_VPCI + write_unlock(&d->vpci_rwlock); +#endif if ( ret ) goto out; @@ -1474,6 +1480,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) const struct domain_iommu *hd = dom_iommu(d); struct pci_dev *pdev; int rc = 0; +#ifdef CONFIG_HAS_VPCI + struct domain *old_d; +#endif if ( !is_iommu_enabled(d) ) return 0; @@ -1487,15 +1496,34 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) ASSERT(pdev && (pdev->domain == hardware_domain || pdev->domain == dom_io)); +#ifdef CONFIG_HAS_VPCI + /* pdev->domain is either hwdom or dom_io. We do not want the later. */ + old_d = pdev->domain == hardware_domain ? 
pdev->domain : NULL; + if ( old_d ) + write_lock(&old_d->vpci_rwlock); +#endif + rc = pdev_msix_assign(d, pdev); if ( rc ) + { +#ifdef CONFIG_HAS_VPCI + if ( old_d ) + write_unlock(&old_d->vpci_rwlock); +#endif goto done; + } pdev->fault.count = 0; if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn, pci_to_dev(pdev), flag)) ) + { +#ifdef CONFIG_HAS_VPCI + if ( old_d ) + write_unlock(&old_d->vpci_rwlock); +#endif goto done; + } for ( ; pdev->phantom_stride; rc = 0 ) { I think we don't care about pci_remove_device because: int pci_remove_device(u16 seg, u8 bus, u8 devfn) { [snip] pcidevs_lock(); list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) if ( pdev->bus == bus && pdev->devfn == devfn ) { vpci_remove_device(pdev); as vpci_remove_device will take care there are no readers and will safely remove vpci. Thank you, Oleksandr
On Fri, Feb 11, 2022 at 07:27:39AM +0000, Oleksandr Andrushchenko wrote: > Hi, Roger! > > On 10.02.22 18:16, Roger Pau Monné wrote: > > On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: > >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > >> > >> Introduce a per-domain read/write lock to check whether vpci is present, > >> so we are sure there are no accesses to the contents of the vpci struct > >> if not. This lock can be used (and in a few cases is used right away) > >> so that vpci removal can be performed while holding the lock in write > >> mode. Previously such removal could race with vpci_read for example. > > Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt > > pci_remove_device, and likely when vPCI gets also used in > > {de}assign_device I think. > Yes, this is indeed an issue, but I was not trying to solve it in > context of vPCI locking yet. I think we should discuss how do > we approach pdev locking, so I can create a patch for that. > that being said, I would like not to solve pdev in this patch yet > > ...I do understand we do want to avoid that, but at the moment > a single reliable way for making sure pdev is alive seems to > be pcidevs_lock.... I think we will need to make pcidevs_lock a rwlock and take it in read mode for pci_get_pdev_by_domain. We didn't have this scenario before where PCI emulation is done in the hypervisor, and hence the locking around those data structures has not been designed for those use-cases. > > > >> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure > >> from being removed. > >> > >> 2. Writing the command register and ROM BAR register may trigger > >> modify_bars to run, which in turn may access multiple pdevs while > >> checking for the existing BAR's overlap. The overlapping check, if done > >> under the read lock, requires vpci->lock to be acquired on both devices > >> being compared, which may produce a deadlock. 
It is not possible to > >> upgrade read lock to write lock in such a case. So, in order to prevent > >> the deadlock, check which registers are going to be written and acquire > >> the lock in the appropriate mode from the beginning. > >> > >> All other code, which doesn't lead to pdev->vpci destruction and does not > >> access multiple pdevs at the same time, can still use a combination of the > >> read lock and pdev->vpci->lock. > >> > >> 3. Optimize if ROM BAR write lock required detection by caching offset > >> of the ROM BAR register in vpci->header->rom_reg which depends on > >> header's type. > >> > >> 4. Reduce locked region in vpci_remove_device as it is now possible > >> to set pdev->vpci to NULL early right after the write lock is acquired. > >> > >> 5. Reduce locked region in vpci_add_handlers as it is possible to > >> initialize many more fields of the struct vpci before assigning it to > >> pdev->vpci. > >> > >> 6. vpci_{add|remove}_register are required to be called with the write lock > >> held, but it is not feasible to add an assert there as it requires > >> struct domain to be passed for that. So, add a comment about this requirement > >> to these and other functions with the equivalent constraints. > >> > >> 7. Drop const qualifier where the new rwlock is used and this is appropriate. > >> > >> 8. This is based on the discussion at [1]. > >> > >> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/ > >> > >> Suggested-by: Roger Pau Monné <roger.pau@citrix.com> > >> Suggested-by: Jan Beulich <jbeulich@suse.com> > >> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > >> > >> --- > >> This was checked on x86: with and without PVH Dom0.
> >> --- > >> xen/arch/x86/hvm/vmsi.c | 2 + > >> xen/common/domain.c | 3 + > >> xen/drivers/vpci/header.c | 8 +++ > >> xen/drivers/vpci/msi.c | 8 ++- > >> xen/drivers/vpci/msix.c | 40 +++++++++++-- > >> xen/drivers/vpci/vpci.c | 114 ++++++++++++++++++++++++++++---------- > >> xen/include/xen/sched.h | 3 + > >> xen/include/xen/vpci.h | 2 + > >> 8 files changed, 146 insertions(+), 34 deletions(-) > >> > >> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c > >> index 13e2a190b439..351cb968a423 100644 > >> --- a/xen/arch/x86/hvm/vmsi.c > >> +++ b/xen/arch/x86/hvm/vmsi.c > >> @@ -893,6 +893,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) > >> { > >> unsigned int i; > >> > >> + ASSERT(!!rw_is_locked(&msix->pdev->domain->vpci_rwlock)); > > ^ no need for the double negation. > Ok, will update all asserts which use !! > > > > Also this asserts that the lock is taken, but could be by a different > > pCPU. I guess it's better than nothing. > Fair enough. Do you still want the asserts or should I remove them? Likely fine to leave them. > > > >> + > >> for ( i = 0; i < msix->max_entries; i++ ) > >> { > >> const struct vpci_msix_entry *entry = &msix->entries[i]; > > Since this function is now called with the per-domain rwlock read > > locked it's likely not appropriate to call process_pending_softirqs > > while holding such lock (check below). > You are right, as it is possible that: > > process_pending_softirqs -> vpci_process_pending -> read_lock > > Even more, vpci_process_pending may also > > read_unlock -> vpci_remove_device -> write_lock > > in its error path. So, any invocation of process_pending_softirqs > must not hold d->vpci_rwlock at least. > > And also we need to check that pdev->vpci was not removed > in between or *re-created* > > > > We will likely need to re-iterate over the list of pdevs assigned to > > the domain and assert that the pdev is still assigned to the same > > domain. 
> So, do you mean a pattern like the below should be used at all > places where we need to call process_pending_softirqs? > > read_unlock > process_pending_softirqs > read_lock > pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); > if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) > <continue processing> Something along those lines. You likely need to continue iterate using for_each_pdev. > >> +{ > >> + /* > >> + * Writing the command register and ROM BAR register may trigger > >> + * modify_bars to run which in turn may access multiple pdevs while > >> + * checking for the existing BAR's overlap. The overlapping check, if done > >> + * under the read lock, requires vpci->lock to be acquired on both devices > >> + * being compared, which may produce a deadlock. It is not possible to > >> + * upgrade read lock to write lock in such a case. So, in order to prevent > >> + * the deadlock, check which registers are going to be written and acquire > >> + * the lock in the appropriate mode from the beginning. > >> + */ > >> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) > >> + return true; > >> + > >> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) > > No need for the comparison if rom_reg is unset. Also you can OR both > > conditions into a single if. > If we open code vpci_offset_cmp with a single if all this is going > to be a bit clumsy: > > if ( r1_offset < r2_offset + r2_size && > r2_offset < r1_offset + r1_size ) > return 0; > This is a single check. > Now we need to check two registers with the code above and > also check that pdev->vpci->header.rom_reg != 0 > > I think it would be more readable if we have a tiny helper function > > static bool vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size, > unsigned int r2_offset, unsigned int r2_size) > { > /* Return 0 if registers overlap. 
*/ > if ( r1_offset < r2_offset + r2_size && > r2_offset < r1_offset + r1_size ) > return false; > return true; > } > > So, then we can have something like > > static bool vpci_header_write_lock(const struct pci_dev *pdev, > unsigned int start, unsigned int size) > { > if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) || > (pdev->vpci->header.rom_reg && > !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4)) ) > return true; > > return false; > } Just create an 'overlaps' static function in header.c. Thanks, Roger. >
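The `overlaps` helper Roger asks for is a plain half-open interval intersection test. The sketch below is not the committed Xen code: PCI_COMMAND (offset 0x04, 2 bytes) and the 4-byte ROM BAR are the registers named in the thread, and `header_write_needs_write_lock` is a hypothetical name for what the patch calls `vpci_header_write_lock`.

```c
#include <assert.h>
#include <stdbool.h>

#define PCI_COMMAND 0x04  /* standard PCI config-space offset */

/* True iff [r1_offset, r1_offset + r1_size) intersects
 * [r2_offset, r2_offset + r2_size). */
static bool overlaps(unsigned int r1_offset, unsigned int r1_size,
                     unsigned int r2_offset, unsigned int r2_size)
{
    return r1_offset < r2_offset + r2_size &&
           r2_offset < r1_offset + r1_size;
}

/*
 * Decide whether a config-space write needs the per-domain lock in
 * write mode: writes touching the command register or the (cached)
 * ROM BAR may trigger modify_bars, which inspects other devices'
 * BARs.  rom_reg == 0 means the header has no ROM BAR, so the second
 * check is skipped, as Roger requested.
 */
static bool header_write_needs_write_lock(unsigned int rom_reg,
                                          unsigned int start,
                                          unsigned int size)
{
    return overlaps(start, size, PCI_COMMAND, 2) ||
           (rom_reg && overlaps(start, size, rom_reg, 4));
}
```

Returning the overlap condition directly (true on overlap) also avoids the inverted `!vpci_offset_cmp(...)` calls that made the original proposal hard to read.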
On Fri, Feb 11, 2022 at 08:46:59AM +0000, Oleksandr Andrushchenko wrote: > > > On 10.02.22 18:16, Roger Pau Monné wrote: > > On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: > >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > >> > >> Introduce a per-domain read/write lock to check whether vpci is present, > >> so we are sure there are no accesses to the contents of the vpci struct > >> if not. This lock can be used (and in a few cases is used right away) > >> so that vpci removal can be performed while holding the lock in write > >> mode. Previously such removal could race with vpci_read for example. > > Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt > > pci_remove_device, and likely when vPCI gets also used in > > {de}assign_device I think. > > > How about the below? It seems to guarantee that we can access pdev > without issues and without requiring pcidevs_lock to be used? Hm, I'm unsure this is correct. It's in general a bad idea to use a per-domain lock approach to protect the consistency of elements moving between domains. In order for this to be safe you will likely need to hold both the source and the destination per-domain locks, and then you could also get into ABBA lock issues unless you always take the lock in the same order. I think it's safer to use a global lock in this case (pcidevs_lock). 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c > index e8b09d77d880..fd464a58b3b3 100644 > --- a/xen/drivers/passthrough/pci.c > +++ b/xen/drivers/passthrough/pci.c > @@ -937,8 +937,14 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, > } > > devfn = pdev->devfn; > +#ifdef CONFIG_HAS_VPCI > + write_lock(&d->vpci_rwlock); > +#endif > ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn, > pci_to_dev(pdev)); > +#ifdef CONFIG_HAS_VPCI > + write_unlock(&d->vpci_rwlock); > +#endif > if ( ret ) > goto out; > > @@ -1474,6 +1480,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > const struct domain_iommu *hd = dom_iommu(d); > struct pci_dev *pdev; > int rc = 0; > +#ifdef CONFIG_HAS_VPCI > + struct domain *old_d; > +#endif > > if ( !is_iommu_enabled(d) ) > return 0; > @@ -1487,15 +1496,34 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) > ASSERT(pdev && (pdev->domain == hardware_domain || > pdev->domain == dom_io)); > > +#ifdef CONFIG_HAS_VPCI > + /* pdev->domain is either hwdom or dom_io. We do not want the later. */ > + old_d = pdev->domain == hardware_domain ? pdev->domain : NULL; > + if ( old_d ) > + write_lock(&old_d->vpci_rwlock); > +#endif > + > rc = pdev_msix_assign(d, pdev); I don't think you need the vpci lock for this operation. > if ( rc ) > + { > +#ifdef CONFIG_HAS_VPCI > + if ( old_d ) > + write_unlock(&old_d->vpci_rwlock); > +#endif > goto done; > + } > > pdev->fault.count = 0; > > if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn, > pci_to_dev(pdev), flag)) ) > + { > +#ifdef CONFIG_HAS_VPCI > + if ( old_d ) > + write_unlock(&old_d->vpci_rwlock); > +#endif Like I've mentioned above, I'm unsure this is correct. You are holding the lock of the previous domain, but at some point the device will be assigned to a new domain, so that change won't be protected by the lock of the new domain. Thanks, Roger.
On 11.02.22 13:40, Roger Pau Monné wrote: > On Fri, Feb 11, 2022 at 07:27:39AM +0000, Oleksandr Andrushchenko wrote: >> Hi, Roger! >> >> On 10.02.22 18:16, Roger Pau Monné wrote: >>> On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: >>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>> >>>> Introduce a per-domain read/write lock to check whether vpci is present, >>>> so we are sure there are no accesses to the contents of the vpci struct >>>> if not. This lock can be used (and in a few cases is used right away) >>>> so that vpci removal can be performed while holding the lock in write >>>> mode. Previously such removal could race with vpci_read for example. >>> Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt >>> pci_remove_device, and likely when vPCI gets also used in >>> {de}assign_device I think. >> Yes, this is indeed an issue, but I was not trying to solve it in >> context of vPCI locking yet. I think we should discuss how do >> we approach pdev locking, so I can create a patch for that. >> that being said, I would like not to solve pdev in this patch yet >> >> ...I do understand we do want to avoid that, but at the moment >> a single reliable way for making sure pdev is alive seems to >> be pcidevs_lock.... > I think we will need to make pcidevs_lock a rwlock and take it in read > mode for pci_get_pdev_by_domain. > > We didn't have this scenario before where PCI emulation is done in the > hypervisor, and hence the locking around those data structures has not > been designed for those use-cases. Yes, I do understand that. I hope pcidevs lock move to rwlock can be done as a separate patch. While this is not done, do you think we can proceed with vPCI series and pcidevs locking re-work being done after? > >>>> 1. Per-domain's vpci_rwlock is used to protect pdev->vpci structure >>>> from being removed. >>>> >>>> 2. 
Writing the command register and ROM BAR register may trigger >>>> modify_bars to run, which in turn may access multiple pdevs while >>>> checking for the existing BAR's overlap. The overlapping check, if done >>>> under the read lock, requires vpci->lock to be acquired on both devices >>>> being compared, which may produce a deadlock. It is not possible to >>>> upgrade read lock to write lock in such a case. So, in order to prevent >>>> the deadlock, check which registers are going to be written and acquire >>>> the lock in the appropriate mode from the beginning. >>>> >>>> All other code, which doesn't lead to pdev->vpci destruction and does not >>>> access multiple pdevs at the same time, can still use a combination of the >>>> read lock and pdev->vpci->lock. >>>> >>>> 3. Optimize if ROM BAR write lock required detection by caching offset >>>> of the ROM BAR register in vpci->header->rom_reg which depends on >>>> header's type. >>>> >>>> 4. Reduce locked region in vpci_remove_device as it is now possible >>>> to set pdev->vpci to NULL early right after the write lock is acquired. >>>> >>>> 5. Reduce locked region in vpci_add_handlers as it is possible to >>>> initialize many more fields of the struct vpci before assigning it to >>>> pdev->vpci. >>>> >>>> 6. vpci_{add|remove}_register are required to be called with the write lock >>>> held, but it is not feasible to add an assert there as it requires >>>> struct domain to be passed for that. So, add a comment about this requirement >>>> to these and other functions with the equivalent constraints. >>>> >>>> 7. Drop const qualifier where the new rwlock is used and this is appropriate. >>>> >>>> 8. This is based on the discussion at [1]. 
>>>> >>>> [1] https://lore.kernel.org/all/20220204063459.680961-4-andr2000@gmail.com/ >>>> >>>> Suggested-by: Roger Pau Monné <roger.pau@citrix.com> >>>> Suggested-by: Jan Beulich <jbeulich@suse.com> >>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>> >>>> --- >>>> This was checked on x86: with and without PVH Dom0. >>>> --- >>>> xen/arch/x86/hvm/vmsi.c | 2 + >>>> xen/common/domain.c | 3 + >>>> xen/drivers/vpci/header.c | 8 +++ >>>> xen/drivers/vpci/msi.c | 8 ++- >>>> xen/drivers/vpci/msix.c | 40 +++++++++++-- >>>> xen/drivers/vpci/vpci.c | 114 ++++++++++++++++++++++++++++---------- >>>> xen/include/xen/sched.h | 3 + >>>> xen/include/xen/vpci.h | 2 + >>>> 8 files changed, 146 insertions(+), 34 deletions(-) >>>> >>>> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c >>>> index 13e2a190b439..351cb968a423 100644 >>>> --- a/xen/arch/x86/hvm/vmsi.c >>>> +++ b/xen/arch/x86/hvm/vmsi.c >>>> @@ -893,6 +893,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix) >>>> { >>>> unsigned int i; >>>> >>>> + ASSERT(!!rw_is_locked(&msix->pdev->domain->vpci_rwlock)); >>> ^ no need for the double negation. >> Ok, will update all asserts which use !! >>> Also this asserts that the lock is taken, but could be by a different >>> pCPU. I guess it's better than nothing. >> Fair enough. Do you still want the asserts or should I remove them? > Likely fine to leave them. Ok > >>>> + >>>> for ( i = 0; i < msix->max_entries; i++ ) >>>> { >>>> const struct vpci_msix_entry *entry = &msix->entries[i]; >>> Since this function is now called with the per-domain rwlock read >>> locked it's likely not appropriate to call process_pending_softirqs >>> while holding such lock (check below).
>> You are right, as it is possible that: >> >> process_pending_softirqs -> vpci_process_pending -> read_lock >> >> Even more, vpci_process_pending may also >> >> read_unlock -> vpci_remove_device -> write_lock >> >> in its error path. So, any invocation of process_pending_softirqs >> must not hold d->vpci_rwlock at least. >> >> And also we need to check that pdev->vpci was not removed >> in between or *re-created* >>> We will likely need to re-iterate over the list of pdevs assigned to >>> the domain and assert that the pdev is still assigned to the same >>> domain. >> So, do you mean a pattern like the below should be used at all >> places where we need to call process_pending_softirqs? >> >> read_unlock >> process_pending_softirqs >> read_lock >> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); >> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) >> <continue processing> > Something along those lines. You likely need to continue iterate using > for_each_pdev. I'll try to play around it and see what will it look like > >>>> +{ >>>> + /* >>>> + * Writing the command register and ROM BAR register may trigger >>>> + * modify_bars to run which in turn may access multiple pdevs while >>>> + * checking for the existing BAR's overlap. The overlapping check, if done >>>> + * under the read lock, requires vpci->lock to be acquired on both devices >>>> + * being compared, which may produce a deadlock. It is not possible to >>>> + * upgrade read lock to write lock in such a case. So, in order to prevent >>>> + * the deadlock, check which registers are going to be written and acquire >>>> + * the lock in the appropriate mode from the beginning. >>>> + */ >>>> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) >>>> + return true; >>>> + >>>> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) >>> No need for the comparison if rom_reg is unset. Also you can OR both >>> conditions into a single if. 
>> If we open code vpci_offset_cmp with a single if all this is going >> to be a bit clumsy: >> >> if ( r1_offset < r2_offset + r2_size && >> r2_offset < r1_offset + r1_size ) >> return 0; >> This is a single check. >> Now we need to check two registers with the code above and >> also check that pdev->vpci->header.rom_reg != 0 >> >> I think it would be more readable if we have a tiny helper function >> >> static bool vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size, >> unsigned int r2_offset, unsigned int r2_size) >> { >> /* Return 0 if registers overlap. */ >> if ( r1_offset < r2_offset + r2_size && >> r2_offset < r1_offset + r1_size ) >> return false; >> return true; >> } Do you mean this helper to be converted into static bool overlaps(unsigned int r1_offset, unsigned int r1_size, unsigned int r2_offset, unsigned int r2_size) >> >> So, then we can have something like >> >> static bool vpci_header_write_lock(const struct pci_dev *pdev, >> unsigned int start, unsigned int size) >> { >> if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) || >> (pdev->vpci->header.rom_reg && >> !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4)) ) >> return true; >> >> return false; >> } > Just create an 'overlaps' static function in header.c. Please see above > > Thanks, Roger. Thank you, Oleksandr
On 11.02.22 13:51, Roger Pau Monné wrote: > On Fri, Feb 11, 2022 at 08:46:59AM +0000, Oleksandr Andrushchenko wrote: >> >> On 10.02.22 18:16, Roger Pau Monné wrote: >>> On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: >>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>> >>>> Introduce a per-domain read/write lock to check whether vpci is present, >>>> so we are sure there are no accesses to the contents of the vpci struct >>>> if not. This lock can be used (and in a few cases is used right away) >>>> so that vpci removal can be performed while holding the lock in write >>>> mode. Previously such removal could race with vpci_read for example. >>> Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt >>> pci_remove_device, and likely when vPCI gets also used in >>> {de}assign_device I think. >>> >> How about the below? It seems to guarantee that we can access pdev >> without issues and without requiring pcidevs_lock to be used? > Hm, I'm unsure this is correct. Yes, we need pcidevs as rwlock in order to solve this reliably... > It's in general a bad idea to use a > per-domain lock approach to protect the consistency of elements moving > between domains. > > In order for this to be safe you will likely need to hold both the > source and the destination per-domain locks, and then you could also > get into ABBA lock issues unless you always take the lock in the same > order. > > I think it's safer to use a global lock in this case (pcidevs_lock). 
> >> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c >> index e8b09d77d880..fd464a58b3b3 100644 >> --- a/xen/drivers/passthrough/pci.c >> +++ b/xen/drivers/passthrough/pci.c >> @@ -937,8 +937,14 @@ static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus, >> } >> >> devfn = pdev->devfn; >> +#ifdef CONFIG_HAS_VPCI >> + write_lock(&d->vpci_rwlock); >> +#endif >> ret = iommu_call(hd->platform_ops, reassign_device, d, target, devfn, >> pci_to_dev(pdev)); >> +#ifdef CONFIG_HAS_VPCI >> + write_unlock(&d->vpci_rwlock); >> +#endif >> if ( ret ) >> goto out; >> >> @@ -1474,6 +1480,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> const struct domain_iommu *hd = dom_iommu(d); >> struct pci_dev *pdev; >> int rc = 0; >> +#ifdef CONFIG_HAS_VPCI >> + struct domain *old_d; >> +#endif >> >> if ( !is_iommu_enabled(d) ) >> return 0; >> @@ -1487,15 +1496,34 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag) >> ASSERT(pdev && (pdev->domain == hardware_domain || >> pdev->domain == dom_io)); >> >> +#ifdef CONFIG_HAS_VPCI >> + /* pdev->domain is either hwdom or dom_io. We do not want the later. */ >> + old_d = pdev->domain == hardware_domain ? pdev->domain : NULL; >> + if ( old_d ) >> + write_lock(&old_d->vpci_rwlock); >> +#endif >> + >> rc = pdev_msix_assign(d, pdev); > I don't think you need the vpci lock for this operation. > >> if ( rc ) >> + { >> +#ifdef CONFIG_HAS_VPCI >> + if ( old_d ) >> + write_unlock(&old_d->vpci_rwlock); >> +#endif >> goto done; >> + } >> >> pdev->fault.count = 0; >> >> if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn, >> pci_to_dev(pdev), flag)) ) >> + { >> +#ifdef CONFIG_HAS_VPCI >> + if ( old_d ) >> + write_unlock(&old_d->vpci_rwlock); >> +#endif > Like I've mentioned above, I'm unsure this is correct. 
You are holding > the lock of the previous domain, but at some point the device will be > assigned to a new domain, so that change won't be protected by the > lock of the new domain. > > Thanks, Roger.
On Fri, Feb 11, 2022 at 12:13:38PM +0000, Oleksandr Andrushchenko wrote: > > > On 11.02.22 13:40, Roger Pau Monné wrote: > > On Fri, Feb 11, 2022 at 07:27:39AM +0000, Oleksandr Andrushchenko wrote: > >> Hi, Roger! > >> > >> On 10.02.22 18:16, Roger Pau Monné wrote: > >>> On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: > >>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > >>>> > >>>> Introduce a per-domain read/write lock to check whether vpci is present, > >>>> so we are sure there are no accesses to the contents of the vpci struct > >>>> if not. This lock can be used (and in a few cases is used right away) > >>>> so that vpci removal can be performed while holding the lock in write > >>>> mode. Previously such removal could race with vpci_read for example. > >>> Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt > >>> pci_remove_device, and likely when vPCI gets also used in > >>> {de}assign_device I think. > >> Yes, this is indeed an issue, but I was not trying to solve it in > >> context of vPCI locking yet. I think we should discuss how do > >> we approach pdev locking, so I can create a patch for that. > >> that being said, I would like not to solve pdev in this patch yet > >> > >> ...I do understand we do want to avoid that, but at the moment > >> a single reliable way for making sure pdev is alive seems to > >> be pcidevs_lock.... > > I think we will need to make pcidevs_lock a rwlock and take it in read > > mode for pci_get_pdev_by_domain. > > > > We didn't have this scenario before where PCI emulation is done in the > > hypervisor, and hence the locking around those data structures has not > > been designed for those use-cases. > Yes, I do understand that. > I hope pcidevs lock move to rwlock can be done as a separate > patch. While this is not done, do you think we can proceed with > vPCI series and pcidevs locking re-work being done after? 
Ideally we would like to sort out the locking once and for all. I would like to be sure that what we introduce now doesn't turn out to interact badly when we decide to look at the pcidevs locking issue. Thanks, Roger.
On 11.02.22 17:44, Roger Pau Monné wrote: > On Fri, Feb 11, 2022 at 12:13:38PM +0000, Oleksandr Andrushchenko wrote: >> >> On 11.02.22 13:40, Roger Pau Monné wrote: >>> On Fri, Feb 11, 2022 at 07:27:39AM +0000, Oleksandr Andrushchenko wrote: >>>> Hi, Roger! >>>> >>>> On 10.02.22 18:16, Roger Pau Monné wrote: >>>>> On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: >>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>> >>>>>> Introduce a per-domain read/write lock to check whether vpci is present, >>>>>> so we are sure there are no accesses to the contents of the vpci struct >>>>>> if not. This lock can be used (and in a few cases is used right away) >>>>>> so that vpci removal can be performed while holding the lock in write >>>>>> mode. Previously such removal could race with vpci_read for example. >>>>> Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt >>>>> pci_remove_device, and likely when vPCI gets also used in >>>>> {de}assign_device I think. >>>> Yes, this is indeed an issue, but I was not trying to solve it in >>>> context of vPCI locking yet. I think we should discuss how do >>>> we approach pdev locking, so I can create a patch for that. >>>> that being said, I would like not to solve pdev in this patch yet >>>> >>>> ...I do understand we do want to avoid that, but at the moment >>>> a single reliable way for making sure pdev is alive seems to >>>> be pcidevs_lock.... >>> I think we will need to make pcidevs_lock a rwlock and take it in read >>> mode for pci_get_pdev_by_domain. >>> >>> We didn't have this scenario before where PCI emulation is done in the >>> hypervisor, and hence the locking around those data structures has not >>> been designed for those use-cases. >> Yes, I do understand that. >> I hope pcidevs lock move to rwlock can be done as a separate >> patch. While this is not done, do you think we can proceed with >> vPCI series and pcidevs locking re-work being done after? 
> Ideally we would like to sort out the locking once and for all. I > would like to be sure that what we introduce now doesn't turn out to > interact badly when we decide to look at the pcidevs locking issue. Ok, so I'll start converting pcidevs into rwlock then > > Thanks, Roger. Thank you, Oleksandr
On Mon, Feb 14, 2022 at 06:33:07AM +0000, Oleksandr Andrushchenko wrote: > > > On 11.02.22 17:44, Roger Pau Monné wrote: > > On Fri, Feb 11, 2022 at 12:13:38PM +0000, Oleksandr Andrushchenko wrote: > >> > >> On 11.02.22 13:40, Roger Pau Monné wrote: > >>> On Fri, Feb 11, 2022 at 07:27:39AM +0000, Oleksandr Andrushchenko wrote: > >>>> Hi, Roger! > >>>> > >>>> On 10.02.22 18:16, Roger Pau Monné wrote: > >>>>> On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: > >>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > >>>>>> > >>>>>> Introduce a per-domain read/write lock to check whether vpci is present, > >>>>>> so we are sure there are no accesses to the contents of the vpci struct > >>>>>> if not. This lock can be used (and in a few cases is used right away) > >>>>>> so that vpci removal can be performed while holding the lock in write > >>>>>> mode. Previously such removal could race with vpci_read for example. > >>>>> Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt > >>>>> pci_remove_device, and likely when vPCI gets also used in > >>>>> {de}assign_device I think. > >>>> Yes, this is indeed an issue, but I was not trying to solve it in > >>>> context of vPCI locking yet. I think we should discuss how do > >>>> we approach pdev locking, so I can create a patch for that. > >>>> that being said, I would like not to solve pdev in this patch yet > >>>> > >>>> ...I do understand we do want to avoid that, but at the moment > >>>> a single reliable way for making sure pdev is alive seems to > >>>> be pcidevs_lock.... > >>> I think we will need to make pcidevs_lock a rwlock and take it in read > >>> mode for pci_get_pdev_by_domain. > >>> > >>> We didn't have this scenario before where PCI emulation is done in the > >>> hypervisor, and hence the locking around those data structures has not > >>> been designed for those use-cases. > >> Yes, I do understand that. 
> >> I hope pcidevs lock move to rwlock can be done as a separate > >> patch. While this is not done, do you think we can proceed with > >> vPCI series and pcidevs locking re-work being done after? > > Ideally we would like to sort out the locking once and for all. I > > would like to be sure that what we introduce now doesn't turn out to > > interact badly when we decide to look at the pcidevs locking issue. > Ok, so I'll start converting pcidevs into rwlock then Sorry, maybe I didn't express myself correctly, since the current series doesn't lead to a functional implementation of vPCI for domUs I would be fine with postponing the locking work, as long as the currently introduced code doesn't make it any worse or extend the locking scheme into new paths, but maybe that's not very helpful. Thanks, Roger.
On 14.02.22 10:47, Roger Pau Monné wrote: > On Mon, Feb 14, 2022 at 06:33:07AM +0000, Oleksandr Andrushchenko wrote: >> >> On 11.02.22 17:44, Roger Pau Monné wrote: >>> On Fri, Feb 11, 2022 at 12:13:38PM +0000, Oleksandr Andrushchenko wrote: >>>> On 11.02.22 13:40, Roger Pau Monné wrote: >>>>> On Fri, Feb 11, 2022 at 07:27:39AM +0000, Oleksandr Andrushchenko wrote: >>>>>> Hi, Roger! >>>>>> >>>>>> On 10.02.22 18:16, Roger Pau Monné wrote: >>>>>>> On Wed, Feb 09, 2022 at 03:36:27PM +0200, Oleksandr Andrushchenko wrote: >>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>>> >>>>>>>> Introduce a per-domain read/write lock to check whether vpci is present, >>>>>>>> so we are sure there are no accesses to the contents of the vpci struct >>>>>>>> if not. This lock can be used (and in a few cases is used right away) >>>>>>>> so that vpci removal can be performed while holding the lock in write >>>>>>>> mode. Previously such removal could race with vpci_read for example. >>>>>>> Sadly there's still a race in the usage of pci_get_pdev_by_domain wrt >>>>>>> pci_remove_device, and likely when vPCI gets also used in >>>>>>> {de}assign_device I think. >>>>>> Yes, this is indeed an issue, but I was not trying to solve it in >>>>>> context of vPCI locking yet. I think we should discuss how do >>>>>> we approach pdev locking, so I can create a patch for that. >>>>>> that being said, I would like not to solve pdev in this patch yet >>>>>> >>>>>> ...I do understand we do want to avoid that, but at the moment >>>>>> a single reliable way for making sure pdev is alive seems to >>>>>> be pcidevs_lock.... >>>>> I think we will need to make pcidevs_lock a rwlock and take it in read >>>>> mode for pci_get_pdev_by_domain. >>>>> >>>>> We didn't have this scenario before where PCI emulation is done in the >>>>> hypervisor, and hence the locking around those data structures has not >>>>> been designed for those use-cases. >>>> Yes, I do understand that. 
>>>> I hope pcidevs lock move to rwlock can be done as a separate >>>> patch. While this is not done, do you think we can proceed with >>>> vPCI series and pcidevs locking re-work being done after? >>> Ideally we would like to sort out the locking once and for all. I >>> would like to be sure that what we introduce now doesn't turn out to >>> interact badly when we decide to look at the pcidevs locking issue. >> Ok, so I'll start converting pcidevs into rwlock then > Sorry, maybe I didn't express myself correctly, since the current > series doesn't lead to a functional implementation of vPCI for domUs I > would be fine with postponing the locking work, as long as the > currently introduced code doesn't make it any worse or extend the > locking scheme into new paths, but maybe that's not very helpful. Indeed, I probably misunderstood you. Great, so we can continue working on the vPCI series, which, once accepted, will unblock the MSI/MSI-X work that depends on vPCI. Then, in parallel with the MSI work, we can start re-working pcidevs locking. > > Thanks, Roger. Thank you, Oleksandr
On 11.02.22 13:40, Roger Pau Monné wrote: > + >>>> for ( i = 0; i < msix->max_entries; i++ ) >>>> { >>>> const struct vpci_msix_entry *entry = &msix->entries[i]; >>> Since this function is now called with the per-domain rwlock read >>> locked it's likely not appropriate to call process_pending_softirqs >>> while holding such lock (check below). >> You are right, as it is possible that: >> >> process_pending_softirqs -> vpci_process_pending -> read_lock >> >> Even more, vpci_process_pending may also >> >> read_unlock -> vpci_remove_device -> write_lock >> >> in its error path. So, any invocation of process_pending_softirqs >> must not hold d->vpci_rwlock at least. >> >> And also we need to check that pdev->vpci was not removed >> in between or *re-created* >>> We will likely need to re-iterate over the list of pdevs assigned to >>> the domain and assert that the pdev is still assigned to the same >>> domain. >> So, do you mean a pattern like the below should be used at all >> places where we need to call process_pending_softirqs? >> >> read_unlock >> process_pending_softirqs >> read_lock >> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); >> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) >> <continue processing> > Something along those lines. You likely need to continue iterate using > for_each_pdev. How do we tell if pdev->vpci is the same? Jan has already brought this question before [1] and I was about to use some ID for that purpose: pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks Thank you, Oleksandr [1] https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg113790.html
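The generation-ID idea floated here (pdev->vpci->id = d->vpci_id++) can be sketched as follows. All the toy_* names are illustrative, not Xen's structures; the point is only that a re-created vpci gets a fresh id from a per-domain counter, so an id saved before dropping the lock detects both removal and removal-plus-re-creation, which a bare pointer comparison cannot:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative model of the proposed generation check; not Xen code. */
struct toy_vpci {
    unsigned int id;          /* generation, from the per-domain counter */
};

struct toy_domain {
    unsigned int vpci_id;     /* monotonically increasing generation counter */
};

struct toy_pdev {
    struct toy_domain *domain;
    struct toy_vpci *vpci;
};

/* vpci creation stamps the structure with the next generation number. */
static void toy_vpci_add(struct toy_pdev *pdev)
{
    pdev->vpci = calloc(1, sizeof(*pdev->vpci));
    pdev->vpci->id = pdev->domain->vpci_id++;
}

static void toy_vpci_remove(struct toy_pdev *pdev)
{
    free(pdev->vpci);
    pdev->vpci = NULL;
}

/*
 * After dropping and re-acquiring the lock, the caller compares the saved
 * generation with the current one: a re-created vpci carries a different
 * id even though the pdev pointer (and even the vpci allocation address)
 * may be unchanged.
 */
static bool toy_vpci_is_same(const struct toy_pdev *pdev, unsigned int saved_id)
{
    return pdev->vpci && pdev->vpci->id == saved_id;
}
```

As the follow-up messages conclude, for the debug-dump path a full generation check turned out to be more machinery than needed; this is what the check would look like if it were ever required on a hot path.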
On Mon, Feb 14, 2022 at 09:36:39AM +0000, Oleksandr Andrushchenko wrote: > > > On 11.02.22 13:40, Roger Pau Monné wrote: > > + > >>>> for ( i = 0; i < msix->max_entries; i++ ) > >>>> { > >>>> const struct vpci_msix_entry *entry = &msix->entries[i]; > >>> Since this function is now called with the per-domain rwlock read > >>> locked it's likely not appropriate to call process_pending_softirqs > >>> while holding such lock (check below). > >> You are right, as it is possible that: > >> > >> process_pending_softirqs -> vpci_process_pending -> read_lock > >> > >> Even more, vpci_process_pending may also > >> > >> read_unlock -> vpci_remove_device -> write_lock > >> > >> in its error path. So, any invocation of process_pending_softirqs > >> must not hold d->vpci_rwlock at least. > >> > >> And also we need to check that pdev->vpci was not removed > >> in between or *re-created* > >>> We will likely need to re-iterate over the list of pdevs assigned to > >>> the domain and assert that the pdev is still assigned to the same > >>> domain. > >> So, do you mean a pattern like the below should be used at all > >> places where we need to call process_pending_softirqs? > >> > >> read_unlock > >> process_pending_softirqs > >> read_lock > >> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); > >> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) > >> <continue processing> > > Something along those lines. You likely need to continue iterate using > > for_each_pdev. > How do we tell if pdev->vpci is the same? Jan has already brought > this question before [1] and I was about to use some ID for that purpose: > pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks Given this is a debug message I would be OK with just doing the minimal checks to prevent Xen from crashing (ie: pdev->vpci exists) and that the resume MSI entry is not past the current limit. Otherwise just print a message and move on to the next device. 
The recreating of pdev->vpci only occurs as a result of some admin operations, and doing it while also trying to print the current MSI status is not a reliable approach. So dumping an incomplete or incoherent state as a result of ongoing admin operations would be fine. Thanks, Roger.
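The "minimal checks" approach Roger describes for the debug dump might look like the sketch below (toy structures, not Xen's): before touching any entry, verify that pdev->vpci still exists and that the saved resume index is still within the (possibly re-created) table, and otherwise bail out so the caller can move on to the next device.

```c
#include <stdio.h>

/* Illustrative structures; field names mirror the discussion, not Xen's. */
struct toy_msix {
    unsigned int max_entries;
    int entries[8];
};

struct toy_vpci {
    struct toy_msix *msix;
};

struct toy_pdev {
    struct toy_vpci *vpci;
};

/*
 * Dump MSI-X state with only the minimal checks needed to avoid crashing:
 * vpci may have been removed while the lock was dropped, and a saved resume
 * index may now lie past the limit of a re-created table.  Returns the
 * number of entries dumped, or -1 if the state was gone or incoherent and
 * the caller should just print a note and move to the next device.
 */
static int toy_dump_msix(const struct toy_pdev *pdev, unsigned int resume_idx)
{
    const struct toy_msix *msix;
    unsigned int i;
    int dumped = 0;

    if ( !pdev->vpci || !pdev->vpci->msix )
        return -1;                    /* vpci was removed under our feet */

    msix = pdev->vpci->msix;
    if ( resume_idx >= msix->max_entries )
        return -1;                    /* table shrank: state is incoherent */

    for ( i = resume_idx; i < msix->max_entries; i++ )
    {
        printf("entry %u: %d\n", i, msix->entries[i]);
        dumped++;
    }
    return dumped;
}
```

Since ongoing admin operations make the dumped state best-effort by definition, an incomplete dump plus a diagnostic message is an acceptable outcome here, which is what keeps the checks this small.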
On 14.02.22 12:34, Roger Pau Monné wrote: > On Mon, Feb 14, 2022 at 09:36:39AM +0000, Oleksandr Andrushchenko wrote: >> >> On 11.02.22 13:40, Roger Pau Monné wrote: >>> + >>>>>> for ( i = 0; i < msix->max_entries; i++ ) >>>>>> { >>>>>> const struct vpci_msix_entry *entry = &msix->entries[i]; >>>>> Since this function is now called with the per-domain rwlock read >>>>> locked it's likely not appropriate to call process_pending_softirqs >>>>> while holding such lock (check below). >>>> You are right, as it is possible that: >>>> >>>> process_pending_softirqs -> vpci_process_pending -> read_lock >>>> >>>> Even more, vpci_process_pending may also >>>> >>>> read_unlock -> vpci_remove_device -> write_lock >>>> >>>> in its error path. So, any invocation of process_pending_softirqs >>>> must not hold d->vpci_rwlock at least. >>>> >>>> And also we need to check that pdev->vpci was not removed >>>> in between or *re-created* >>>>> We will likely need to re-iterate over the list of pdevs assigned to >>>>> the domain and assert that the pdev is still assigned to the same >>>>> domain. >>>> So, do you mean a pattern like the below should be used at all >>>> places where we need to call process_pending_softirqs? >>>> >>>> read_unlock >>>> process_pending_softirqs >>>> read_lock >>>> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); >>>> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) >>>> <continue processing> >>> Something along those lines. You likely need to continue iterate using >>> for_each_pdev. >> How do we tell if pdev->vpci is the same? Jan has already brought >> this question before [1] and I was about to use some ID for that purpose: >> pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks > Given this is a debug message I would be OK with just doing the > minimal checks to prevent Xen from crashing (ie: pdev->vpci exists) > and that the resume MSI entry is not past the current limit. 
Otherwise > just print a message and move on to the next device. Agree, I see no big issue (probably) if we are not able to print How about this one: diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c index 809a6b4773e1..50373f04da82 100644 --- a/xen/drivers/vpci/header.c +++ b/xen/drivers/vpci/header.c @@ -171,10 +171,31 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev, struct rangeset *mem, uint16_t cmd) { struct map_data data = { .d = d, .map = true }; + pci_sbdf_t sbdf = pdev->sbdf; int rc; + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock)); + while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) + { + + /* + * process_pending_softirqs may trigger vpci_process_pending which + * may need to acquire pdev->domain->vpci_rwlock in read mode. + */ + write_unlock(&pdev->domain->vpci_rwlock); process_pending_softirqs(); + write_lock(&pdev->domain->vpci_rwlock); + + /* Check if pdev still exists and vPCI was not removed or re-created. */ + if (pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn) != pdev) + if ( vpci is NOT the same ) + { + rc = 0; + break; + } + } + rangeset_destroy(mem); if ( !rc ) modify_decoding(pdev, cmd, false); This one also wants process_pending_softirqs to run so it *might* want pdev and vpci checks. But at the same time apply_map runs at ( system_state < SYS_STATE_active ), so defer_map won't be running yet, thus no vpci_process_pending is possible yet (in terms it has something to do yet). So, I think we just need: write_unlock(&pdev->domain->vpci_rwlock); process_pending_softirqs(); write_lock(&pdev->domain->vpci_rwlock); and this should be enough > > The recreating of pdev->vpci only occurs as a result of some admin > operations, and doing it while also trying to print the current MSI > status is not a reliable approach. So dumping an incomplete or > incoherent state as a result of ongoing admin operations would be > fine. Ok > > Thanks, Roger. > Thank you, Oleksandr
On Mon, Feb 14, 2022 at 10:53:43AM +0000, Oleksandr Andrushchenko wrote: > > > On 14.02.22 12:34, Roger Pau Monné wrote: > > On Mon, Feb 14, 2022 at 09:36:39AM +0000, Oleksandr Andrushchenko wrote: > >> > >> On 11.02.22 13:40, Roger Pau Monné wrote: > >>> + > >>>>>> for ( i = 0; i < msix->max_entries; i++ ) > >>>>>> { > >>>>>> const struct vpci_msix_entry *entry = &msix->entries[i]; > >>>>> Since this function is now called with the per-domain rwlock read > >>>>> locked it's likely not appropriate to call process_pending_softirqs > >>>>> while holding such lock (check below). > >>>> You are right, as it is possible that: > >>>> > >>>> process_pending_softirqs -> vpci_process_pending -> read_lock > >>>> > >>>> Even more, vpci_process_pending may also > >>>> > >>>> read_unlock -> vpci_remove_device -> write_lock > >>>> > >>>> in its error path. So, any invocation of process_pending_softirqs > >>>> must not hold d->vpci_rwlock at least. > >>>> > >>>> And also we need to check that pdev->vpci was not removed > >>>> in between or *re-created* > >>>>> We will likely need to re-iterate over the list of pdevs assigned to > >>>>> the domain and assert that the pdev is still assigned to the same > >>>>> domain. > >>>> So, do you mean a pattern like the below should be used at all > >>>> places where we need to call process_pending_softirqs? > >>>> > >>>> read_unlock > >>>> process_pending_softirqs > >>>> read_lock > >>>> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); > >>>> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) > >>>> <continue processing> > >>> Something along those lines. You likely need to continue iterate using > >>> for_each_pdev. > >> How do we tell if pdev->vpci is the same? 
Jan has already brought > >> this question before [1] and I was about to use some ID for that purpose: > >> pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks > > Given this is a debug message I would be OK with just doing the > > minimal checks to prevent Xen from crashing (ie: pdev->vpci exists) > > and that the resume MSI entry is not past the current limit. Otherwise > > just print a message and move on to the next device. > Agree, I see no big issue (probably) if we are not able to print > > How about this one: > > diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c > index 809a6b4773e1..50373f04da82 100644 > --- a/xen/drivers/vpci/header.c > +++ b/xen/drivers/vpci/header.c > @@ -171,10 +171,31 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev, > struct rangeset *mem, uint16_t cmd) > { > struct map_data data = { .d = d, .map = true }; > + pci_sbdf_t sbdf = pdev->sbdf; > int rc; > > + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock)); > + > while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) > + { > + > + /* > + * process_pending_softirqs may trigger vpci_process_pending which > + * may need to acquire pdev->domain->vpci_rwlock in read mode. > + */ > + write_unlock(&pdev->domain->vpci_rwlock); > process_pending_softirqs(); > + write_lock(&pdev->domain->vpci_rwlock); > + > + /* Check if pdev still exists and vPCI was not removed or re-created. */ > + if (pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn) != pdev) > + if ( vpci is NOT the same ) > + { > + rc = 0; > + break; > + } > + } > + > rangeset_destroy(mem); > if ( !rc ) > modify_decoding(pdev, cmd, false); > > This one also wants process_pending_softirqs to run so it *might* > want pdev and vpci checks. But at the same time apply_map runs > at ( system_state < SYS_STATE_active ), so defer_map won't be > running yet, thus no vpci_process_pending is possible yet (in terms > it has something to do yet). 
So, I think we just need: > > write_unlock(&pdev->domain->vpci_rwlock); > process_pending_softirqs(); > write_lock(&pdev->domain->vpci_rwlock); > > and this should be enough Given the context apply_map is called from (dom0 specific init code), there's no need to check for the pdev to still exits, or whether vpci has been recreated, as it's not possible. Just add a comment to explicitly note that the context of the function is special, and thus there's no possibility of either the device or vpci going away. Thanks, Roger.
On 14.02.22 13:11, Roger Pau Monné wrote: > On Mon, Feb 14, 2022 at 10:53:43AM +0000, Oleksandr Andrushchenko wrote: >> >> On 14.02.22 12:34, Roger Pau Monné wrote: >>> On Mon, Feb 14, 2022 at 09:36:39AM +0000, Oleksandr Andrushchenko wrote: >>>> On 11.02.22 13:40, Roger Pau Monné wrote: >>>>> + >>>>>>>> for ( i = 0; i < msix->max_entries; i++ ) >>>>>>>> { >>>>>>>> const struct vpci_msix_entry *entry = &msix->entries[i]; >>>>>>> Since this function is now called with the per-domain rwlock read >>>>>>> locked it's likely not appropriate to call process_pending_softirqs >>>>>>> while holding such lock (check below). >>>>>> You are right, as it is possible that: >>>>>> >>>>>> process_pending_softirqs -> vpci_process_pending -> read_lock >>>>>> >>>>>> Even more, vpci_process_pending may also >>>>>> >>>>>> read_unlock -> vpci_remove_device -> write_lock >>>>>> >>>>>> in its error path. So, any invocation of process_pending_softirqs >>>>>> must not hold d->vpci_rwlock at least. >>>>>> >>>>>> And also we need to check that pdev->vpci was not removed >>>>>> in between or *re-created* >>>>>>> We will likely need to re-iterate over the list of pdevs assigned to >>>>>>> the domain and assert that the pdev is still assigned to the same >>>>>>> domain. >>>>>> So, do you mean a pattern like the below should be used at all >>>>>> places where we need to call process_pending_softirqs? >>>>>> >>>>>> read_unlock >>>>>> process_pending_softirqs >>>>>> read_lock >>>>>> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); >>>>>> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) >>>>>> <continue processing> >>>>> Something along those lines. You likely need to continue iterate using >>>>> for_each_pdev. >>>> How do we tell if pdev->vpci is the same? 
Jan has already brought >>>> this question before [1] and I was about to use some ID for that purpose: >>>> pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks >>> Given this is a debug message I would be OK with just doing the >>> minimal checks to prevent Xen from crashing (ie: pdev->vpci exists) >>> and that the resume MSI entry is not past the current limit. Otherwise >>> just print a message and move on to the next device. >> Agree, I see no big issue (probably) if we are not able to print >> >> How about this one: >> >> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c >> index 809a6b4773e1..50373f04da82 100644 >> --- a/xen/drivers/vpci/header.c >> +++ b/xen/drivers/vpci/header.c >> @@ -171,10 +171,31 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev, >> struct rangeset *mem, uint16_t cmd) >> { >> struct map_data data = { .d = d, .map = true }; >> + pci_sbdf_t sbdf = pdev->sbdf; >> int rc; >> >> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock)); >> + >> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) >> + { >> + >> + /* >> + * process_pending_softirqs may trigger vpci_process_pending which >> + * may need to acquire pdev->domain->vpci_rwlock in read mode. >> + */ >> + write_unlock(&pdev->domain->vpci_rwlock); >> process_pending_softirqs(); >> + write_lock(&pdev->domain->vpci_rwlock); >> + >> + /* Check if pdev still exists and vPCI was not removed or re-created. */ >> + if (pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn) != pdev) >> + if ( vpci is NOT the same ) >> + { >> + rc = 0; >> + break; >> + } >> + } >> + >> rangeset_destroy(mem); >> if ( !rc ) >> modify_decoding(pdev, cmd, false); >> >> This one also wants process_pending_softirqs to run so it *might* >> want pdev and vpci checks. 
But at the same time apply_map runs >> at ( system_state < SYS_STATE_active ), so defer_map won't be >> running yet, thus no vpci_process_pending is possible yet (in terms >> it has something to do yet). So, I think we just need: >> >> write_unlock(&pdev->domain->vpci_rwlock); >> process_pending_softirqs(); >> write_lock(&pdev->domain->vpci_rwlock); >> >> and this should be enough > Given the context apply_map is called from (dom0 specific init code), > there's no need to check for the pdev to still exits, or whether vpci > has been recreated, as it's not possible. Just add a comment to > explicitly note that the context of the function is special, and thus > there's no possibility of either the device or vpci going away. Does it really need write_unlock/write_lock given the context?... I think it doesn't as there is no chance defer_map is called, thus process_pending_softirqs -> vpci_process_pending -> read_lock is not possible I'll just add a comment about that > Thanks, Roger. Thank you, Oleksandr
On Mon, Feb 14, 2022 at 11:15:27AM +0000, Oleksandr Andrushchenko wrote: > > > On 14.02.22 13:11, Roger Pau Monné wrote: > > On Mon, Feb 14, 2022 at 10:53:43AM +0000, Oleksandr Andrushchenko wrote: > >> > >> On 14.02.22 12:34, Roger Pau Monné wrote: > >>> On Mon, Feb 14, 2022 at 09:36:39AM +0000, Oleksandr Andrushchenko wrote: > >>>> On 11.02.22 13:40, Roger Pau Monné wrote: > >>>>> + > >>>>>>>> for ( i = 0; i < msix->max_entries; i++ ) > >>>>>>>> { > >>>>>>>> const struct vpci_msix_entry *entry = &msix->entries[i]; > >>>>>>> Since this function is now called with the per-domain rwlock read > >>>>>>> locked it's likely not appropriate to call process_pending_softirqs > >>>>>>> while holding such lock (check below). > >>>>>> You are right, as it is possible that: > >>>>>> > >>>>>> process_pending_softirqs -> vpci_process_pending -> read_lock > >>>>>> > >>>>>> Even more, vpci_process_pending may also > >>>>>> > >>>>>> read_unlock -> vpci_remove_device -> write_lock > >>>>>> > >>>>>> in its error path. So, any invocation of process_pending_softirqs > >>>>>> must not hold d->vpci_rwlock at least. > >>>>>> > >>>>>> And also we need to check that pdev->vpci was not removed > >>>>>> in between or *re-created* > >>>>>>> We will likely need to re-iterate over the list of pdevs assigned to > >>>>>>> the domain and assert that the pdev is still assigned to the same > >>>>>>> domain. > >>>>>> So, do you mean a pattern like the below should be used at all > >>>>>> places where we need to call process_pending_softirqs? > >>>>>> > >>>>>> read_unlock > >>>>>> process_pending_softirqs > >>>>>> read_lock > >>>>>> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); > >>>>>> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) > >>>>>> <continue processing> > >>>>> Something along those lines. You likely need to continue iterate using > >>>>> for_each_pdev. > >>>> How do we tell if pdev->vpci is the same? 
Jan has already brought > >>>> this question before [1] and I was about to use some ID for that purpose: > >>>> pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks > >>> Given this is a debug message I would be OK with just doing the > >>> minimal checks to prevent Xen from crashing (ie: pdev->vpci exists) > >>> and that the resume MSI entry is not past the current limit. Otherwise > >>> just print a message and move on to the next device. > >> Agree, I see no big issue (probably) if we are not able to print > >> > >> How about this one: > >> > >> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c > >> index 809a6b4773e1..50373f04da82 100644 > >> --- a/xen/drivers/vpci/header.c > >> +++ b/xen/drivers/vpci/header.c > >> @@ -171,10 +171,31 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev, > >> struct rangeset *mem, uint16_t cmd) > >> { > >> struct map_data data = { .d = d, .map = true }; > >> + pci_sbdf_t sbdf = pdev->sbdf; > >> int rc; > >> > >> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock)); > >> + > >> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) > >> + { > >> + > >> + /* > >> + * process_pending_softirqs may trigger vpci_process_pending which > >> + * may need to acquire pdev->domain->vpci_rwlock in read mode. > >> + */ > >> + write_unlock(&pdev->domain->vpci_rwlock); > >> process_pending_softirqs(); > >> + write_lock(&pdev->domain->vpci_rwlock); > >> + > >> + /* Check if pdev still exists and vPCI was not removed or re-created. */ > >> + if (pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn) != pdev) > >> + if ( vpci is NOT the same ) > >> + { > >> + rc = 0; > >> + break; > >> + } > >> + } > >> + > >> rangeset_destroy(mem); > >> if ( !rc ) > >> modify_decoding(pdev, cmd, false); > >> > >> This one also wants process_pending_softirqs to run so it *might* > >> want pdev and vpci checks. 
But at the same time apply_map runs > >> at ( system_state < SYS_STATE_active ), so defer_map won't be > >> running yet, thus no vpci_process_pending is possible yet (in terms > >> it has something to do yet). So, I think we just need: > >> > >> write_unlock(&pdev->domain->vpci_rwlock); > >> process_pending_softirqs(); > >> write_lock(&pdev->domain->vpci_rwlock); > >> > >> and this should be enough > > Given the context apply_map is called from (dom0 specific init code), > > there's no need to check for the pdev to still exits, or whether vpci > > has been recreated, as it's not possible. Just add a comment to > > explicitly note that the context of the function is special, and thus > > there's no possibility of either the device or vpci going away. > Does it really need write_unlock/write_lock given the context?... I think it's bad practice to call process_pending_softirqs while holding any locks. This is a very specific context so it's likely fine to not drop the lock, but would still seem incorrect to me. > I think it doesn't as there is no chance defer_map is called, thus > process_pending_softirqs -> vpci_process_pending -> read_lock Indeed, there's no chance of that because process_pending_softirqs will never try to do a scheduling operation that would result in our context being scheduled out. Thanks, Roger.
On 14.02.22 13:25, Roger Pau Monné wrote:
> On Mon, Feb 14, 2022 at 11:15:27AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 14.02.22 13:11, Roger Pau Monné wrote:
>>> On Mon, Feb 14, 2022 at 10:53:43AM +0000, Oleksandr Andrushchenko wrote:
>>>> On 14.02.22 12:34, Roger Pau Monné wrote:
>>>>> On Mon, Feb 14, 2022 at 09:36:39AM +0000, Oleksandr Andrushchenko wrote:
>>>>>> On 11.02.22 13:40, Roger Pau Monné wrote:
>>>>>>> +
>>>>>>>>>>     for ( i = 0; i < msix->max_entries; i++ )
>>>>>>>>>>     {
>>>>>>>>>>         const struct vpci_msix_entry *entry = &msix->entries[i];
>>>>>>>>> Since this function is now called with the per-domain rwlock read
>>>>>>>>> locked it's likely not appropriate to call process_pending_softirqs
>>>>>>>>> while holding such lock (check below).
>>>>>>>> You are right, as it is possible that:
>>>>>>>>
>>>>>>>> process_pending_softirqs -> vpci_process_pending -> read_lock
>>>>>>>>
>>>>>>>> Even more, vpci_process_pending may also
>>>>>>>>
>>>>>>>> read_unlock -> vpci_remove_device -> write_lock
>>>>>>>>
>>>>>>>> in its error path. So, any invocation of process_pending_softirqs
>>>>>>>> must not hold d->vpci_rwlock at least.
>>>>>>>>
>>>>>>>> And also we need to check that pdev->vpci was not removed
>>>>>>>> in between or *re-created*
>>>>>>>>> We will likely need to re-iterate over the list of pdevs assigned to
>>>>>>>>> the domain and assert that the pdev is still assigned to the same
>>>>>>>>> domain.
>>>>>>>> So, do you mean a pattern like the below should be used at all
>>>>>>>> places where we need to call process_pending_softirqs?
>>>>>>>>
>>>>>>>> read_unlock
>>>>>>>> process_pending_softirqs
>>>>>>>> read_lock
>>>>>>>> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
>>>>>>>> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) )
>>>>>>>>     <continue processing>
>>>>>>> Something along those lines. You likely need to continue iterating
>>>>>>> using for_each_pdev.
>>>>>> How do we tell if pdev->vpci is the same? Jan has already brought
>>>>>> this question before [1] and I was about to use some ID for that purpose:
>>>>>> pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks
>>>>> Given this is a debug message I would be OK with just doing the
>>>>> minimal checks to prevent Xen from crashing (ie: pdev->vpci exists)
>>>>> and that the resume MSI entry is not past the current limit. Otherwise
>>>>> just print a message and move on to the next device.
>>>> Agree, I see no big issue (probably) if we are not able to print
>>>>
>>>> How about this one:
>>>>
>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>>>> index 809a6b4773e1..50373f04da82 100644
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -171,10 +171,31 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
>>>>                                 struct rangeset *mem, uint16_t cmd)
>>>>  {
>>>>      struct map_data data = { .d = d, .map = true };
>>>> +    pci_sbdf_t sbdf = pdev->sbdf;
>>>>      int rc;
>>>>
>>>> +    ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock));
>>>> +
>>>>      while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
>>>> +    {
>>>> +
>>>> +        /*
>>>> +         * process_pending_softirqs may trigger vpci_process_pending which
>>>> +         * may need to acquire pdev->domain->vpci_rwlock in read mode.
>>>> +         */
>>>> +        write_unlock(&pdev->domain->vpci_rwlock);
>>>>          process_pending_softirqs();
>>>> +        write_lock(&pdev->domain->vpci_rwlock);
>>>> +
>>>> +        /* Check if pdev still exists and vPCI was not removed or re-created. */
>>>> +        if ( pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn) != pdev )
>>>> +            if ( vpci is NOT the same )
>>>> +            {
>>>> +                rc = 0;
>>>> +                break;
>>>> +            }
>>>> +    }
>>>> +
>>>>      rangeset_destroy(mem);
>>>>      if ( !rc )
>>>>          modify_decoding(pdev, cmd, false);
>>>>
>>>> This one also wants process_pending_softirqs to run so it *might*
>>>> want pdev and vpci checks. But at the same time apply_map runs
>>>> at ( system_state < SYS_STATE_active ), so defer_map won't be
>>>> running yet, thus no vpci_process_pending is possible yet (in terms
>>>> it has something to do yet). So, I think we just need:
>>>>
>>>> write_unlock(&pdev->domain->vpci_rwlock);
>>>> process_pending_softirqs();
>>>> write_lock(&pdev->domain->vpci_rwlock);
>>>>
>>>> and this should be enough
>>> Given the context apply_map is called from (dom0 specific init code),
>>> there's no need to check for the pdev to still exist, or whether vpci
>>> has been recreated, as it's not possible. Just add a comment to
>>> explicitly note that the context of the function is special, and thus
>>> there's no possibility of either the device or vpci going away.
>> Does it really need write_unlock/write_lock given the context?...
> I think it's bad practice to call process_pending_softirqs while
> holding any locks. This is a very specific context so it's likely fine
> to not drop the lock, but would still seem incorrect to me.

Ok

>> I think it doesn't as there is no chance defer_map is called, thus
>> process_pending_softirqs -> vpci_process_pending -> read_lock
> Indeed, there's no chance of that because process_pending_softirqs
> will never try to do a scheduling operation that would result in our
> context being scheduled out.

while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
{
    /*
     * FIXME: Given the context apply_map is called from (dom0 specific
     * init code at system_state < SYS_STATE_active) it is not strictly
     * required that pdev->domain->vpci_rwlock is unlocked before calling
     * process_pending_softirqs as there is no contention possible between
     * this code and vpci_process_pending trying to acquire the lock in
     * read mode. But running process_pending_softirqs with any lock held
     * doesn't seem to be a good practice, so drop the lock and re-acquire
     * it right again.
     */
    write_unlock(&pdev->domain->vpci_rwlock);
    process_pending_softirqs();
    write_lock(&pdev->domain->vpci_rwlock);
}

Will this be good enough?

> Thanks, Roger.

Thank you,
Oleksandr
On 14.02.2022 12:37, Oleksandr Andrushchenko wrote:
[...]
>
> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
> {
>     /*
>      * FIXME: Given the context apply_map is called from (dom0 specific
>      * init code at system_state < SYS_STATE_active) it is not strictly
>      * required that pdev->domain->vpci_rwlock is unlocked before calling
>      * process_pending_softirqs as there is no contention possible between
>      * this code and vpci_process_pending trying to acquire the lock in
>      * read mode. But running process_pending_softirqs with any lock held
>      * doesn't seem to be a good practice, so drop the lock and re-acquire
>      * it right again.
>      */
>     write_unlock(&pdev->domain->vpci_rwlock);
>     process_pending_softirqs();
>     write_lock(&pdev->domain->vpci_rwlock);
> }

I'm afraid that's misleading at best. apply_map() is merely a specific
example where you know the lock is going to be taken. But really any
softirq handler could be acquiring any lock, so requesting to process
softirqs cannot ever be done with any lock held.

What you instead want to explain is why, after re-acquiring the lock,
no further checking is needed for potentially changed state.

Jan
On 14.02.22 14:57, Jan Beulich wrote:
[...]
> What you instead want to explain is why, after re-acquiring the lock,
> no further checking is needed for potentially changed state.

How about:

/*
 * FIXME: Given the context apply_map is called from (dom0 specific
 * init code at system_state < SYS_STATE_active) there is no contention
 * possible between this code and vpci_process_pending trying to acquire
 * the lock in read mode and destroy pdev->vpci in its error path.
 * Neither pdev may be disposed yet, so it is not required to check if the
 * relevant pdev still exists after re-acquiring the lock.
 */
On 14.02.2022 14:13, Oleksandr Andrushchenko wrote:
[...]
>  */

I'm not sure I follow the first sentence; I guess a comma or two may help,
and/or using "as well as" in place of one of the two "and". I also don't
think you mean contention, but rather a race between the named entities?

Jan
On 14.02.22 15:22, Jan Beulich wrote:
[...]
>> */ > I'm not sure I follow the first sentence; I guess a comma or two may help, > and or using "as well as" in place of one of the two "and". I also don't > think you mean contention, but rather a race between the named entities? /* * FIXME: Given the context from which apply_map is called (dom0 specific * init code at system_state < SYS_STATE_active) there is no race condition * possible between this code and vpci_process_pending which may try to acquire * the lock in read mode and also try to destroy pdev->vpci in its error path. * Neither pdev may be disposed yet, so it is not required to check if the * relevant pdev still exists after re-acquiring the lock. */ > > Jan > Thank you, Oleksandr
On 14.02.2022 14:27, Oleksandr Andrushchenko wrote: > > > On 14.02.22 15:22, Jan Beulich wrote: >> On 14.02.2022 14:13, Oleksandr Andrushchenko wrote: >>> >>> On 14.02.22 14:57, Jan Beulich wrote: >>>> On 14.02.2022 12:37, Oleksandr Andrushchenko wrote: >>>>> On 14.02.22 13:25, Roger Pau Monné wrote: >>>>>> On Mon, Feb 14, 2022 at 11:15:27AM +0000, Oleksandr Andrushchenko wrote: >>>>>>> On 14.02.22 13:11, Roger Pau Monné wrote: >>>>>>>> On Mon, Feb 14, 2022 at 10:53:43AM +0000, Oleksandr Andrushchenko wrote: >>>>>>>>> On 14.02.22 12:34, Roger Pau Monné wrote: >>>>>>>>>> On Mon, Feb 14, 2022 at 09:36:39AM +0000, Oleksandr Andrushchenko wrote: >>>>>>>>>>> On 11.02.22 13:40, Roger Pau Monné wrote: >>>>>>>>>>>> + >>>>>>>>>>>>>>> for ( i = 0; i < msix->max_entries; i++ ) >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> const struct vpci_msix_entry *entry = &msix->entries[i]; >>>>>>>>>>>>>> Since this function is now called with the per-domain rwlock read >>>>>>>>>>>>>> locked it's likely not appropriate to call process_pending_softirqs >>>>>>>>>>>>>> while holding such lock (check below). >>>>>>>>>>>>> You are right, as it is possible that: >>>>>>>>>>>>> >>>>>>>>>>>>> process_pending_softirqs -> vpci_process_pending -> read_lock >>>>>>>>>>>>> >>>>>>>>>>>>> Even more, vpci_process_pending may also >>>>>>>>>>>>> >>>>>>>>>>>>> read_unlock -> vpci_remove_device -> write_lock >>>>>>>>>>>>> >>>>>>>>>>>>> in its error path. So, any invocation of process_pending_softirqs >>>>>>>>>>>>> must not hold d->vpci_rwlock at least. >>>>>>>>>>>>> >>>>>>>>>>>>> And also we need to check that pdev->vpci was not removed >>>>>>>>>>>>> in between or *re-created* >>>>>>>>>>>>>> We will likely need to re-iterate over the list of pdevs assigned to >>>>>>>>>>>>>> the domain and assert that the pdev is still assigned to the same >>>>>>>>>>>>>> domain. >>>>>>>>>>>>> So, do you mean a pattern like the below should be used at all >>>>>>>>>>>>> places where we need to call process_pending_softirqs? 
>>>>>>>>>>>>> >>>>>>>>>>>>> read_unlock >>>>>>>>>>>>> process_pending_softirqs >>>>>>>>>>>>> read_lock >>>>>>>>>>>>> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); >>>>>>>>>>>>> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) >>>>>>>>>>>>> <continue processing> >>>>>>>>>>>> Something along those lines. You likely need to continue iterate using >>>>>>>>>>>> for_each_pdev. >>>>>>>>>>> How do we tell if pdev->vpci is the same? Jan has already brought >>>>>>>>>>> this question before [1] and I was about to use some ID for that purpose: >>>>>>>>>>> pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks >>>>>>>>>> Given this is a debug message I would be OK with just doing the >>>>>>>>>> minimal checks to prevent Xen from crashing (ie: pdev->vpci exists) >>>>>>>>>> and that the resume MSI entry is not past the current limit. Otherwise >>>>>>>>>> just print a message and move on to the next device. >>>>>>>>> Agree, I see no big issue (probably) if we are not able to print >>>>>>>>> >>>>>>>>> How about this one: >>>>>>>>> >>>>>>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c >>>>>>>>> index 809a6b4773e1..50373f04da82 100644 >>>>>>>>> --- a/xen/drivers/vpci/header.c >>>>>>>>> +++ b/xen/drivers/vpci/header.c >>>>>>>>> @@ -171,10 +171,31 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev, >>>>>>>>> struct rangeset *mem, uint16_t cmd) >>>>>>>>> { >>>>>>>>> struct map_data data = { .d = d, .map = true }; >>>>>>>>> + pci_sbdf_t sbdf = pdev->sbdf; >>>>>>>>> int rc; >>>>>>>>> >>>>>>>>> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock)); >>>>>>>>> + >>>>>>>>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) >>>>>>>>> + { >>>>>>>>> + >>>>>>>>> + /* >>>>>>>>> + * process_pending_softirqs may trigger vpci_process_pending which >>>>>>>>> + * may need to acquire pdev->domain->vpci_rwlock in read mode. 
>>>>>>>>> + */ >>>>>>>>> + write_unlock(&pdev->domain->vpci_rwlock); >>>>>>>>> process_pending_softirqs(); >>>>>>>>> + write_lock(&pdev->domain->vpci_rwlock); >>>>>>>>> + >>>>>>>>> + /* Check if pdev still exists and vPCI was not removed or re-created. */ >>>>>>>>> + if (pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn) != pdev) >>>>>>>>> + if ( vpci is NOT the same ) >>>>>>>>> + { >>>>>>>>> + rc = 0; >>>>>>>>> + break; >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> rangeset_destroy(mem); >>>>>>>>> if ( !rc ) >>>>>>>>> modify_decoding(pdev, cmd, false); >>>>>>>>> >>>>>>>>> This one also wants process_pending_softirqs to run so it *might* >>>>>>>>> want pdev and vpci checks. But at the same time apply_map runs >>>>>>>>> at ( system_state < SYS_STATE_active ), so defer_map won't be >>>>>>>>> running yet, thus no vpci_process_pending is possible yet (in terms >>>>>>>>> it has something to do yet). So, I think we just need: >>>>>>>>> >>>>>>>>> write_unlock(&pdev->domain->vpci_rwlock); >>>>>>>>> process_pending_softirqs(); >>>>>>>>> write_lock(&pdev->domain->vpci_rwlock); >>>>>>>>> >>>>>>>>> and this should be enough >>>>>>>> Given the context apply_map is called from (dom0 specific init code), >>>>>>>> there's no need to check for the pdev to still exits, or whether vpci >>>>>>>> has been recreated, as it's not possible. Just add a comment to >>>>>>>> explicitly note that the context of the function is special, and thus >>>>>>>> there's no possibility of either the device or vpci going away. >>>>>>> Does it really need write_unlock/write_lock given the context?... >>>>>> I think it's bad practice to call process_pending_softirqs while >>>>>> holding any locks. This is a very specific context so it's likely fine >>>>>> to not drop the lock, but would still seem incorrect to me. 
>>>>> Ok >>>>>>> I think it doesn't as there is no chance defer_map is called, thus >>>>>>> process_pending_softirqs -> vpci_process_pending -> read_lock >>>>>> Indeed, there's no chance of that because process_pending_softirqs >>>>>> will never try to do a scheduling operation that would result in our >>>>>> context being scheduled out. >>>>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) >>>>> { >>>>> /* >>>>> * FIXME: Given the context apply_map is called from (dom0 specific >>>>> * init code at system_state < SYS_STATE_active) it is not strictly >>>>> * required that pdev->domain->vpci_rwlock is unlocked before calling >>>>> * process_pending_softirqs as there is no contention possible between >>>>> * this code and vpci_process_pending trying to acquire the lock in >>>>> * read mode. But running process_pending_softirqs with any lock held >>>>> * doesn't seem to be a good practice, so drop the lock and re-acquire >>>>> * it right again. >>>>> */ >>>>> write_unlock(&pdev->domain->vpci_rwlock); >>>>> process_pending_softirqs(); >>>>> write_lock(&pdev->domain->vpci_rwlock); >>>>> } >>>> I'm afraid that's misleading at best. apply_map() is merely a specific >>>> example where you know the lock is going to be taken. But really any >>>> softirq handler could be acquiring any lock, so requesting to process >>>> softirqs cannot ever be done with any lock held. >>>> >>>> What you instead want to explain is why, after re-acquiring the lock, >>>> no further checking is needed for potentially changed state. >>> How about: >>> >>> /* >>> * FIXME: Given the context apply_map is called from (dom0 specific >>> * init code at system_state < SYS_STATE_active) there is no contention >>> * possible between this code and vpci_process_pending trying to acquire >>> * the lock in read mode and destroy pdev->vpci in its error path. 
>>> * Neither pdev may be disposed yet, so it is not required to check if the >>> * relevant pdev still exists after re-acquiring the lock. >>> */ >> I'm not sure I follow the first sentence; I guess a comma or two may help, >> and or using "as well as" in place of one of the two "and". I also don't >> think you mean contention, but rather a race between the named entities? > /* > * FIXME: Given the context from which apply_map is called (dom0 specific > * init code at system_state < SYS_STATE_active) there is no race condition > * possible between this code and vpci_process_pending which may try to acquire > * the lock in read mode and also try to destroy pdev->vpci in its error path. > * Neither pdev may be disposed yet, so it is not required to check if the > * relevant pdev still exists after re-acquiring the lock. > */ I'm still struggling with the language, sorry. You look to only have replaced "contention"? Reading it again I'd also like to mention that to me (not a native speaker) "Neither pdev may be ..." expresses "None of the pdev-s may be ...", when I think you mean "Nor may pdev be ..." Jan
On 14.02.22 15:48, Jan Beulich wrote: > On 14.02.2022 14:27, Oleksandr Andrushchenko wrote: >> >> On 14.02.22 15:22, Jan Beulich wrote: >>> On 14.02.2022 14:13, Oleksandr Andrushchenko wrote: >>>> On 14.02.22 14:57, Jan Beulich wrote: >>>>> On 14.02.2022 12:37, Oleksandr Andrushchenko wrote: >>>>>> On 14.02.22 13:25, Roger Pau Monné wrote: >>>>>>> On Mon, Feb 14, 2022 at 11:15:27AM +0000, Oleksandr Andrushchenko wrote: >>>>>>>> On 14.02.22 13:11, Roger Pau Monné wrote: >>>>>>>>> On Mon, Feb 14, 2022 at 10:53:43AM +0000, Oleksandr Andrushchenko wrote: >>>>>>>>>> On 14.02.22 12:34, Roger Pau Monné wrote: >>>>>>>>>>> On Mon, Feb 14, 2022 at 09:36:39AM +0000, Oleksandr Andrushchenko wrote: >>>>>>>>>>>> On 11.02.22 13:40, Roger Pau Monné wrote: >>>>>>>>>>>>> + >>>>>>>>>>>>>>>> for ( i = 0; i < msix->max_entries; i++ ) >>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>> const struct vpci_msix_entry *entry = &msix->entries[i]; >>>>>>>>>>>>>>> Since this function is now called with the per-domain rwlock read >>>>>>>>>>>>>>> locked it's likely not appropriate to call process_pending_softirqs >>>>>>>>>>>>>>> while holding such lock (check below). >>>>>>>>>>>>>> You are right, as it is possible that: >>>>>>>>>>>>>> >>>>>>>>>>>>>> process_pending_softirqs -> vpci_process_pending -> read_lock >>>>>>>>>>>>>> >>>>>>>>>>>>>> Even more, vpci_process_pending may also >>>>>>>>>>>>>> >>>>>>>>>>>>>> read_unlock -> vpci_remove_device -> write_lock >>>>>>>>>>>>>> >>>>>>>>>>>>>> in its error path. So, any invocation of process_pending_softirqs >>>>>>>>>>>>>> must not hold d->vpci_rwlock at least. >>>>>>>>>>>>>> >>>>>>>>>>>>>> And also we need to check that pdev->vpci was not removed >>>>>>>>>>>>>> in between or *re-created* >>>>>>>>>>>>>>> We will likely need to re-iterate over the list of pdevs assigned to >>>>>>>>>>>>>>> the domain and assert that the pdev is still assigned to the same >>>>>>>>>>>>>>> domain. 
>>>>>>>>>>>>>> So, do you mean a pattern like the below should be used at all >>>>>>>>>>>>>> places where we need to call process_pending_softirqs? >>>>>>>>>>>>>> >>>>>>>>>>>>>> read_unlock >>>>>>>>>>>>>> process_pending_softirqs >>>>>>>>>>>>>> read_lock >>>>>>>>>>>>>> pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn); >>>>>>>>>>>>>> if ( pdev && pdev->vpci && is_the_same_vpci(pdev->vpci) ) >>>>>>>>>>>>>> <continue processing> >>>>>>>>>>>>> Something along those lines. You likely need to continue iterate using >>>>>>>>>>>>> for_each_pdev. >>>>>>>>>>>> How do we tell if pdev->vpci is the same? Jan has already brought >>>>>>>>>>>> this question before [1] and I was about to use some ID for that purpose: >>>>>>>>>>>> pdev->vpci->id = d->vpci_id++ and then we use pdev->vpci->id for checks >>>>>>>>>>> Given this is a debug message I would be OK with just doing the >>>>>>>>>>> minimal checks to prevent Xen from crashing (ie: pdev->vpci exists) >>>>>>>>>>> and that the resume MSI entry is not past the current limit. Otherwise >>>>>>>>>>> just print a message and move on to the next device. 
>>>>>>>>>> Agree, I see no big issue (probably) if we are not able to print >>>>>>>>>> >>>>>>>>>> How about this one: >>>>>>>>>> >>>>>>>>>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c >>>>>>>>>> index 809a6b4773e1..50373f04da82 100644 >>>>>>>>>> --- a/xen/drivers/vpci/header.c >>>>>>>>>> +++ b/xen/drivers/vpci/header.c >>>>>>>>>> @@ -171,10 +171,31 @@ static int __init apply_map(struct domain *d, const struct pci_dev *pdev, >>>>>>>>>> struct rangeset *mem, uint16_t cmd) >>>>>>>>>> { >>>>>>>>>> struct map_data data = { .d = d, .map = true }; >>>>>>>>>> + pci_sbdf_t sbdf = pdev->sbdf; >>>>>>>>>> int rc; >>>>>>>>>> >>>>>>>>>> + ASSERT(rw_is_write_locked(&pdev->domain->vpci_rwlock)); >>>>>>>>>> + >>>>>>>>>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) >>>>>>>>>> + { >>>>>>>>>> + >>>>>>>>>> + /* >>>>>>>>>> + * process_pending_softirqs may trigger vpci_process_pending which >>>>>>>>>> + * may need to acquire pdev->domain->vpci_rwlock in read mode. >>>>>>>>>> + */ >>>>>>>>>> + write_unlock(&pdev->domain->vpci_rwlock); >>>>>>>>>> process_pending_softirqs(); >>>>>>>>>> + write_lock(&pdev->domain->vpci_rwlock); >>>>>>>>>> + >>>>>>>>>> + /* Check if pdev still exists and vPCI was not removed or re-created. */ >>>>>>>>>> + if (pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn) != pdev) >>>>>>>>>> + if ( vpci is NOT the same ) >>>>>>>>>> + { >>>>>>>>>> + rc = 0; >>>>>>>>>> + break; >>>>>>>>>> + } >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> rangeset_destroy(mem); >>>>>>>>>> if ( !rc ) >>>>>>>>>> modify_decoding(pdev, cmd, false); >>>>>>>>>> >>>>>>>>>> This one also wants process_pending_softirqs to run so it *might* >>>>>>>>>> want pdev and vpci checks. But at the same time apply_map runs >>>>>>>>>> at ( system_state < SYS_STATE_active ), so defer_map won't be >>>>>>>>>> running yet, thus no vpci_process_pending is possible yet (in terms >>>>>>>>>> it has something to do yet). 
So, I think we just need: >>>>>>>>>> >>>>>>>>>> write_unlock(&pdev->domain->vpci_rwlock); >>>>>>>>>> process_pending_softirqs(); >>>>>>>>>> write_lock(&pdev->domain->vpci_rwlock); >>>>>>>>>> >>>>>>>>>> and this should be enough >>>>>>>>> Given the context apply_map is called from (dom0 specific init code), >>>>>>>>> there's no need to check for the pdev to still exits, or whether vpci >>>>>>>>> has been recreated, as it's not possible. Just add a comment to >>>>>>>>> explicitly note that the context of the function is special, and thus >>>>>>>>> there's no possibility of either the device or vpci going away. >>>>>>>> Does it really need write_unlock/write_lock given the context?... >>>>>>> I think it's bad practice to call process_pending_softirqs while >>>>>>> holding any locks. This is a very specific context so it's likely fine >>>>>>> to not drop the lock, but would still seem incorrect to me. >>>>>> Ok >>>>>>>> I think it doesn't as there is no chance defer_map is called, thus >>>>>>>> process_pending_softirqs -> vpci_process_pending -> read_lock >>>>>>> Indeed, there's no chance of that because process_pending_softirqs >>>>>>> will never try to do a scheduling operation that would result in our >>>>>>> context being scheduled out. >>>>>> while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART ) >>>>>> { >>>>>> /* >>>>>> * FIXME: Given the context apply_map is called from (dom0 specific >>>>>> * init code at system_state < SYS_STATE_active) it is not strictly >>>>>> * required that pdev->domain->vpci_rwlock is unlocked before calling >>>>>> * process_pending_softirqs as there is no contention possible between >>>>>> * this code and vpci_process_pending trying to acquire the lock in >>>>>> * read mode. But running process_pending_softirqs with any lock held >>>>>> * doesn't seem to be a good practice, so drop the lock and re-acquire >>>>>> * it right again. 
>>>>>> */ >>>>>> write_unlock(&pdev->domain->vpci_rwlock); >>>>>> process_pending_softirqs(); >>>>>> write_lock(&pdev->domain->vpci_rwlock); >>>>>> } >>>>> I'm afraid that's misleading at best. apply_map() is merely a specific >>>>> example where you know the lock is going to be taken. But really any >>>>> softirq handler could be acquiring any lock, so requesting to process >>>>> softirqs cannot ever be done with any lock held. >>>>> >>>>> What you instead want to explain is why, after re-acquiring the lock, >>>>> no further checking is needed for potentially changed state. >>>> How about: >>>> >>>> /* >>>> * FIXME: Given the context apply_map is called from (dom0 specific >>>> * init code at system_state < SYS_STATE_active) there is no contention >>>> * possible between this code and vpci_process_pending trying to acquire >>>> * the lock in read mode and destroy pdev->vpci in its error path. >>>> * Neither pdev may be disposed yet, so it is not required to check if the >>>> * relevant pdev still exists after re-acquiring the lock. >>>> */ >>> I'm not sure I follow the first sentence; I guess a comma or two may help, >>> and or using "as well as" in place of one of the two "and". I also don't >>> think you mean contention, but rather a race between the named entities? >> /* >> * FIXME: Given the context from which apply_map is called (dom0 specific >> * init code at system_state < SYS_STATE_active) there is no race condition >> * possible between this code and vpci_process_pending which may try to acquire >> * the lock in read mode and also try to destroy pdev->vpci in its error path. >> * Neither pdev may be disposed yet, so it is not required to check if the >> * relevant pdev still exists after re-acquiring the lock. >> */ > I'm still struggling with the language, sorry. You look to only have replaced > "contention"? Reading it again I'd also like to mention that to me (not a > native speaker) "Neither pdev may be ..." 
expresses "None of the pdev-s may be ...", when I think you mean "Nor may pdev be ..." /* * FIXME: apply_map is called from dom0 specific init code when * system_state < SYS_STATE_active, so there is no race condition * possible between this code and vpci_process_pending. So, neither * vpci_process_pending may try to acquire the lock in read mode and * also destroy pdev->vpci in its error path nor pdev may be disposed yet. * This means that it is not required to check if the relevant pdev * still exists after re-acquiring the lock. */ > Jan > Thank you, Oleksandr
On 14.02.2022 15:00, Oleksandr Andrushchenko wrote: > /* > * FIXME: apply_map is called from dom0 specific init code when > * system_state < SYS_STATE_active, so there is no race condition > * possible between this code and vpci_process_pending. So, neither > * vpci_process_pending may try to acquire the lock in read mode and > * also destroy pdev->vpci in its error path nor pdev may be disposed yet. > * This means that it is not required to check if the relevant pdev > * still exists after re-acquiring the lock. > */ I think I'm okay with this variant, pending me seeing it in context. Jan
On 09.02.2022 14:36, Oleksandr Andrushchenko wrote: > @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev *pdev, > r->private); > } > > +static bool vpci_header_write_lock(const struct pci_dev *pdev, > + unsigned int start, unsigned int size) > +{ > + /* > + * Writing the command register and ROM BAR register may trigger > + * modify_bars to run which in turn may access multiple pdevs while > + * checking for the existing BAR's overlap. The overlapping check, if done > + * under the read lock, requires vpci->lock to be acquired on both devices > + * being compared, which may produce a deadlock. It is not possible to > + * upgrade read lock to write lock in such a case. So, in order to prevent > + * the deadlock, check which registers are going to be written and acquire > + * the lock in the appropriate mode from the beginning. > + */ > + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) > + return true; > + > + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) > + return true; > + > + return false; > +} A function of this name gives (especially at the call site(s)) the impression of acquiring a lock. Considering that of the prefixes neither "vpci" nor "header" are really relevant here, may I suggest to use need_write_lock()? May I further suggest that you either split the comment or combine the two if()-s (perhaps even straight into single return statement)? Personally I'd prefer the single return statement approach here ... Jan
On 14.02.22 16:19, Jan Beulich wrote: > On 09.02.2022 14:36, Oleksandr Andrushchenko wrote: >> @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev *pdev, >> r->private); >> } >> >> +static bool vpci_header_write_lock(const struct pci_dev *pdev, >> + unsigned int start, unsigned int size) >> +{ >> + /* >> + * Writing the command register and ROM BAR register may trigger >> + * modify_bars to run which in turn may access multiple pdevs while >> + * checking for the existing BAR's overlap. The overlapping check, if done >> + * under the read lock, requires vpci->lock to be acquired on both devices >> + * being compared, which may produce a deadlock. It is not possible to >> + * upgrade read lock to write lock in such a case. So, in order to prevent >> + * the deadlock, check which registers are going to be written and acquire >> + * the lock in the appropriate mode from the beginning. >> + */ >> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) >> + return true; >> + >> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) >> + return true; >> + >> + return false; >> +} > A function of this name gives (especially at the call site(s)) the > impression of acquiring a lock. Considering that of the prefixes > neither "vpci" nor "header" are really relevant here, may I suggest > to use need_write_lock()? > > May I further suggest that you either split the comment or combine > the two if()-s (perhaps even straight into single return statement)? > Personally I'd prefer the single return statement approach here ... That was already questioned by Roger and now it looks like: static bool overlap(unsigned int r1_offset, unsigned int r1_size, unsigned int r2_offset, unsigned int r2_size) { /* Return true if there is an overlap. 
*/ return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size; } bool vpci_header_write_lock(const struct pci_dev *pdev, unsigned int start, unsigned int size) { /* * Writing the command register and ROM BAR register may trigger * modify_bars to run which in turn may access multiple pdevs while * checking for the existing BAR's overlap. The overlapping check, if done * under the read lock, requires vpci->lock to be acquired on both devices * being compared, which may produce a deadlock. It is not possible to * upgrade read lock to write lock in such a case. So, in order to prevent * the deadlock, check which registers are going to be written and acquire * the lock in the appropriate mode from the beginning. */ if ( overlap(start, size, PCI_COMMAND, 2) || (pdev->vpci->header.rom_reg && overlap(start, size, pdev->vpci->header.rom_reg, 4)) ) return true; return false; } vpci_header_write_lock moved to header.c and is not static anymore. So, sitting in header.c, the name seems to be appropriate now > > Jan > Thank you, Oleksandr
On 14.02.2022 15:26, Oleksandr Andrushchenko wrote: > > > On 14.02.22 16:19, Jan Beulich wrote: >> On 09.02.2022 14:36, Oleksandr Andrushchenko wrote: >>> @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev *pdev, >>> r->private); >>> } >>> >>> +static bool vpci_header_write_lock(const struct pci_dev *pdev, >>> + unsigned int start, unsigned int size) >>> +{ >>> + /* >>> + * Writing the command register and ROM BAR register may trigger >>> + * modify_bars to run which in turn may access multiple pdevs while >>> + * checking for the existing BAR's overlap. The overlapping check, if done >>> + * under the read lock, requires vpci->lock to be acquired on both devices >>> + * being compared, which may produce a deadlock. It is not possible to >>> + * upgrade read lock to write lock in such a case. So, in order to prevent >>> + * the deadlock, check which registers are going to be written and acquire >>> + * the lock in the appropriate mode from the beginning. >>> + */ >>> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) >>> + return true; >>> + >>> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) >>> + return true; >>> + >>> + return false; >>> +} >> A function of this name gives (especially at the call site(s)) the >> impression of acquiring a lock. Considering that of the prefixes >> neither "vpci" nor "header" are really relevant here, may I suggest >> to use need_write_lock()? >> >> May I further suggest that you either split the comment or combine >> the two if()-s (perhaps even straight into single return statement)? >> Personally I'd prefer the single return statement approach here ... > That was already questioned by Roger and now it looks like: > > static bool overlap(unsigned int r1_offset, unsigned int r1_size, > unsigned int r2_offset, unsigned int r2_size) > { > /* Return true if there is an overlap. 
*/ > return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size; > } > > bool vpci_header_write_lock(const struct pci_dev *pdev, > unsigned int start, unsigned int size) > { > /* > * Writing the command register and ROM BAR register may trigger > * modify_bars to run which in turn may access multiple pdevs while > * checking for the existing BAR's overlap. The overlapping check, if done > * under the read lock, requires vpci->lock to be acquired on both devices > * being compared, which may produce a deadlock. It is not possible to > * upgrade read lock to write lock in such a case. So, in order to prevent > * the deadlock, check which registers are going to be written and acquire > * the lock in the appropriate mode from the beginning. > */ > if ( overlap(start, size, PCI_COMMAND, 2) || > (pdev->vpci->header.rom_reg && > overlap(start, size, pdev->vpci->header.rom_reg, 4)) ) > return true; > > return false; > } > > vpci_header_write_lock moved to header.c and is not static anymore. > So, sitting in header.c, the name seems to be appropriate now The prefix of the name - yes. But as said, a function of this name looks as if it would acquire a lock. Imo you want to insert "need" or some such. Jan
On 14.02.22 16:31, Jan Beulich wrote: > On 14.02.2022 15:26, Oleksandr Andrushchenko wrote: >> >> On 14.02.22 16:19, Jan Beulich wrote: >>> On 09.02.2022 14:36, Oleksandr Andrushchenko wrote: >>>> @@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev *pdev, >>>> r->private); >>>> } >>>> >>>> +static bool vpci_header_write_lock(const struct pci_dev *pdev, >>>> + unsigned int start, unsigned int size) >>>> +{ >>>> + /* >>>> + * Writing the command register and ROM BAR register may trigger >>>> + * modify_bars to run which in turn may access multiple pdevs while >>>> + * checking for the existing BAR's overlap. The overlapping check, if done >>>> + * under the read lock, requires vpci->lock to be acquired on both devices >>>> + * being compared, which may produce a deadlock. It is not possible to >>>> + * upgrade read lock to write lock in such a case. So, in order to prevent >>>> + * the deadlock, check which registers are going to be written and acquire >>>> + * the lock in the appropriate mode from the beginning. >>>> + */ >>>> + if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) ) >>>> + return true; >>>> + >>>> + if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) ) >>>> + return true; >>>> + >>>> + return false; >>>> +} >>> A function of this name gives (especially at the call site(s)) the >>> impression of acquiring a lock. Considering that of the prefixes >>> neither "vpci" nor "header" are really relevant here, may I suggest >>> to use need_write_lock()? >>> >>> May I further suggest that you either split the comment or combine >>> the two if()-s (perhaps even straight into single return statement)? >>> Personally I'd prefer the single return statement approach here ... >> That was already questioned by Roger and now it looks like: >> >> static bool overlap(unsigned int r1_offset, unsigned int r1_size, >> unsigned int r2_offset, unsigned int r2_size) >> { >> /* Return true if there is an overlap. 
*/ >> return r1_offset < r2_offset + r2_size && r2_offset < r1_offset + r1_size; >> } >> >> bool vpci_header_write_lock(const struct pci_dev *pdev, >> unsigned int start, unsigned int size) >> { >> /* >> * Writing the command register and ROM BAR register may trigger >> * modify_bars to run which in turn may access multiple pdevs while >> * checking for the existing BAR's overlap. The overlapping check, if done >> * under the read lock, requires vpci->lock to be acquired on both devices >> * being compared, which may produce a deadlock. It is not possible to >> * upgrade read lock to write lock in such a case. So, in order to prevent >> * the deadlock, check which registers are going to be written and acquire >> * the lock in the appropriate mode from the beginning. >> */ >> if ( overlap(start, size, PCI_COMMAND, 2) || >> (pdev->vpci->header.rom_reg && >> overlap(start, size, pdev->vpci->header.rom_reg, 4)) ) >> return true; >> >> return false; >> } >> >> vpci_header_write_lock moved to header.c and is not static anymore. >> So, sitting in header.c, the name seems to be appropriate now > The prefix of the name - yes. But as said, a function of this name looks > as if it would acquire a lock. Imo you want to insert "need" or some > such. Agree. Then vpci_header_need_write_lock. I will also update the comment because it makes an impression that the function acquires the lock > > Jan > Thank you, Oleksandr
On Mon, Feb 14, 2022 at 02:00:26PM +0000, Oleksandr Andrushchenko wrote: > /* > * FIXME: apply_map is called from dom0 specific init code when > * system_state < SYS_STATE_active, so there is no race condition > * possible between this code and vpci_process_pending. So, neither > * vpci_process_pending may try to acquire the lock in read mode and > * also destroy pdev->vpci in its error path nor pdev may be disposed yet. > * This means that it is not required to check if the relevant pdev > * still exists after re-acquiring the lock. I'm not sure why you need to mention vpci_process_pending here: apply_map and defer_map are mutually exclusive, so given the current code it's impossible to get in a situation where apply_map is called while there's pending work on the vCPU (ie: v->vpci.mem != NULL). Also there's no need for a FIXME tag: the current approach doesn't require any fixes unless we start using apply_map in a different context. Hence I think the comment should be along the lines of: /* * It's safe to drop and reacquire the lock in this context without * risking pdev disappearing because devices cannot be removed until the * initial domain has been started. */ Thanks, Roger.
On 15.02.22 10:30, Roger Pau Monné wrote: > On Mon, Feb 14, 2022 at 02:00:26PM +0000, Oleksandr Andrushchenko wrote: >> /* >> * FIXME: apply_map is called from dom0 specific init code when >> * system_state < SYS_STATE_active, so there is no race condition >> * possible between this code and vpci_process_pending. So, neither >> * vpci_process_pending may try to acquire the lock in read mode and >> * also destroy pdev->vpci in its error path nor pdev may be disposed yet. >> * This means that it is not required to check if the relevant pdev >> * still exists after re-acquiring the lock. > I'm not sure why you need to mention vpci_process_pending here: > apply_map and defer_map are mutually exclusive, so given the current > code it's impossible to get in a situation where apply_map is called > while there's pending work on the vCPU (ie: v->vpci.mem != NULL). > > Also there's no need for a FIXME tag: the current approach doesn't > require any fixes unless we start using apply_map in a different > context. > > Hence I think the comment should be along the lines of: > > /* > * It's safe to drop and reacquire the lock in this context without > * risking pdev disappearing because devices cannot be removed until the > * initial domain has been started. > */ Urgh, I've just sent v2. I'll move this there and answer > > Thanks, Roger. > Thank you, Oleksandr
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 13e2a190b439..351cb968a423 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -893,6 +893,8 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 {
     unsigned int i;
 
+    ASSERT(!!rw_is_locked(&msix->pdev->domain->vpci_rwlock));
+
     for ( i = 0; i < msix->max_entries; i++ )
     {
         const struct vpci_msix_entry *entry = &msix->entries[i];
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 2048ebad86ff..10558c22285d 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -616,6 +616,9 @@ struct domain *domain_create(domid_t domid,
 
 #ifdef CONFIG_HAS_PCI
     INIT_LIST_HEAD(&d->pdev_list);
+#ifdef CONFIG_HAS_VPCI
+    rwlock_init(&d->vpci_rwlock);
+#endif
 #endif
 
     /* All error paths can depend on the above setup. */
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 40ff79c33f8f..9e2aeb2055c9 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -142,12 +142,14 @@ bool vpci_process_pending(struct vcpu *v)
         if ( rc == -ERESTART )
             return true;
 
+        read_lock(&v->domain->vpci_rwlock);
         spin_lock(&v->vpci.pdev->vpci->lock);
         /* Disable memory decoding unconditionally on failure. */
         modify_decoding(v->vpci.pdev,
                         rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
                         !rc && v->vpci.rom_only);
         spin_unlock(&v->vpci.pdev->vpci->lock);
+        read_unlock(&v->domain->vpci_rwlock);
 
         rangeset_destroy(v->vpci.mem);
         v->vpci.mem = NULL;
@@ -203,6 +205,7 @@ static void defer_map(struct domain *d, struct pci_dev *pdev,
     raise_softirq(SCHEDULE_SOFTIRQ);
 }
 
+/* This must hold domain's vpci_rwlock in write mode. */
 static int modify_bars(const struct pci_dev *pdev, uint16_t cmd, bool rom_only)
 {
     struct vpci_header *header = &pdev->vpci->header;
@@ -454,6 +457,8 @@ static int init_bars(struct pci_dev *pdev)
     struct vpci_bar *bars = header->bars;
     int rc;
 
+    ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
     case PCI_HEADER_TYPE_NORMAL:
@@ -548,6 +553,7 @@ static int init_bars(struct pci_dev *pdev)
     {
         struct vpci_bar *rom = &header->bars[num_bars];
 
+        header->rom_reg = rom_reg;
         rom->type = VPCI_BAR_ROM;
         rom->size = size;
         rom->addr = addr;
@@ -559,6 +565,8 @@ static int init_bars(struct pci_dev *pdev)
         if ( rc )
             rom->type = VPCI_BAR_EMPTY;
     }
+    else
+        header->rom_reg = ~(unsigned int)0;
 
     return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
 }
diff --git a/xen/drivers/vpci/msi.c b/xen/drivers/vpci/msi.c
index 5757a7aed20f..5df3dfa8243c 100644
--- a/xen/drivers/vpci/msi.c
+++ b/xen/drivers/vpci/msi.c
@@ -190,6 +190,8 @@ static int init_msi(struct pci_dev *pdev)
     uint16_t control;
     int ret;
 
+    ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     if ( !pos )
         return 0;
 
@@ -265,7 +267,7 @@ REGISTER_VPCI_INIT(init_msi, VPCI_PRIORITY_LOW);
 
 void vpci_dump_msi(void)
 {
-    const struct domain *d;
+    struct domain *d;
 
     rcu_read_lock(&domlist_read_lock);
     for_each_domain ( d )
@@ -277,6 +279,9 @@ void vpci_dump_msi(void)
 
         printk("vPCI MSI/MSI-X d%d\n", d->domain_id);
 
+        if ( !read_trylock(&d->vpci_rwlock) )
+            continue;
+
         for_each_pdev ( d, pdev )
         {
             const struct vpci_msi *msi;
@@ -326,6 +331,7 @@ void vpci_dump_msi(void)
                 spin_unlock(&pdev->vpci->lock);
             process_pending_softirqs();
         }
+        read_unlock(&d->vpci_rwlock);
     }
     rcu_read_unlock(&domlist_read_lock);
 }
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index 846f1b8d7038..5296d6025d8e 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -138,6 +138,7 @@ static void control_write(const struct pci_dev *pdev, unsigned int reg,
     pci_conf_write16(pdev->sbdf, reg, val);
 }
 
+/* This must hold domain's vpci_rwlock in write mode. */
 static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 {
     struct vpci_msix *msix;
@@ -158,7 +159,12 @@ static struct vpci_msix *msix_find(const struct domain *d, unsigned long addr)
 
 static int msix_accept(struct vcpu *v, unsigned long addr)
 {
-    return !!msix_find(v->domain, addr);
+    int rc;
+
+    read_lock(&v->domain->vpci_rwlock);
+    rc = !!msix_find(v->domain, addr);
+    read_unlock(&v->domain->vpci_rwlock);
+    return rc;
 }
 
 static bool access_allowed(const struct pci_dev *pdev, unsigned long addr,
@@ -185,18 +191,27 @@ static struct vpci_msix_entry *get_entry(struct vpci_msix *msix,
 static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
                      unsigned long *data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     const struct vpci_msix_entry *entry;
     unsigned int offset;
 
     *data = ~0ul;
 
+    read_lock(&d->vpci_rwlock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_RETRY;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_OKAY;
+    }
 
     if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
     {
@@ -222,6 +237,7 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
             break;
         }
 
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_OKAY;
     }
 
@@ -255,6 +271,7 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     return X86EMUL_OKAY;
 }
@@ -262,16 +279,25 @@ static int msix_read(struct vcpu *v, unsigned long addr, unsigned int len,
 static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
                       unsigned long data)
 {
-    const struct domain *d = v->domain;
-    struct vpci_msix *msix = msix_find(d, addr);
+    struct domain *d = v->domain;
+    struct vpci_msix *msix;
     struct vpci_msix_entry *entry;
     unsigned int offset;
 
+    read_lock(&d->vpci_rwlock);
+
+    msix = msix_find(d, addr);
     if ( !msix )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_RETRY;
+    }
 
     if ( !access_allowed(msix->pdev, addr, len) )
+    {
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_OKAY;
+    }
 
     if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
     {
@@ -294,6 +320,7 @@ static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
             }
         }
 
+        read_unlock(&d->vpci_rwlock);
         return X86EMUL_OKAY;
     }
 
@@ -371,6 +398,7 @@ static int msix_write(struct vcpu *v, unsigned long addr, unsigned int len,
         break;
     }
     spin_unlock(&msix->pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     return X86EMUL_OKAY;
 }
@@ -437,6 +465,8 @@ static int init_msix(struct pci_dev *pdev)
     struct vpci_msix *msix;
     int rc;
 
+    ASSERT(!!rw_is_write_locked(&pdev->domain->vpci_rwlock));
+
     msix_offset = pci_find_cap_offset(pdev->seg, pdev->bus, slot, func,
                                       PCI_CAP_ID_MSIX);
     if ( !msix_offset )
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index fb0947179b79..16bb3b832e6a 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -37,28 +37,39 @@ extern vpci_register_init_t *const __end_vpci_array[];
 
 void vpci_remove_device(struct pci_dev *pdev)
 {
+    struct vpci *vpci;
+
     if ( !has_vpci(pdev->domain) )
         return;
 
-    spin_lock(&pdev->vpci->lock);
+    write_lock(&pdev->domain->vpci_rwlock);
+    if ( !pdev->vpci )
+    {
+        write_unlock(&pdev->domain->vpci_rwlock);
+        return;
+    }
+
+    vpci = pdev->vpci;
+    pdev->vpci = NULL;
+    write_unlock(&pdev->domain->vpci_rwlock);
+
     while ( !list_empty(&pdev->vpci->handlers) )
     {
-        struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
+        struct vpci_register *r = list_first_entry(&vpci->handlers,
                                                    struct vpci_register,
                                                    node);
 
         list_del(&r->node);
         xfree(r);
     }
-    spin_unlock(&pdev->vpci->lock);
-    xfree(pdev->vpci->msix);
-    xfree(pdev->vpci->msi);
-    xfree(pdev->vpci);
-    pdev->vpci = NULL;
+    xfree(vpci->msix);
+    xfree(vpci->msi);
+    xfree(vpci);
 }
 
 int vpci_add_handlers(struct pci_dev *pdev)
 {
+    struct vpci *vpci;
     unsigned int i;
     int rc = 0;
 
@@ -68,12 +79,15 @@ int vpci_add_handlers(struct pci_dev *pdev)
     /* We should not get here twice for the same device. */
     ASSERT(!pdev->vpci);
 
-    pdev->vpci = xzalloc(struct vpci);
-    if ( !pdev->vpci )
+    vpci = xzalloc(struct vpci);
+    if ( !vpci )
         return -ENOMEM;
 
-    INIT_LIST_HEAD(&pdev->vpci->handlers);
-    spin_lock_init(&pdev->vpci->lock);
+    INIT_LIST_HEAD(&vpci->handlers);
+    spin_lock_init(&vpci->lock);
+
+    write_lock(&pdev->domain->vpci_rwlock);
+    pdev->vpci = vpci;
 
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
@@ -81,6 +95,7 @@ int vpci_add_handlers(struct pci_dev *pdev)
         if ( rc )
             break;
     }
+    write_unlock(&pdev->domain->vpci_rwlock);
 
     if ( rc )
         vpci_remove_device(pdev);
@@ -89,22 +104,28 @@ int vpci_add_handlers(struct pci_dev *pdev)
 }
 #endif /* __XEN__ */
 
-static int vpci_register_cmp(const struct vpci_register *r1,
-                             const struct vpci_register *r2)
+static int vpci_offset_cmp(unsigned int r1_offset, unsigned int r1_size,
+                           unsigned int r2_offset, unsigned int r2_size)
 {
     /* Return 0 if registers overlap. */
-    if ( r1->offset < r2->offset + r2->size &&
-         r2->offset < r1->offset + r1->size )
+    if ( r1_offset < r2_offset + r2_size &&
+         r2_offset < r1_offset + r1_size )
         return 0;
-    if ( r1->offset < r2->offset )
+    if ( r1_offset < r2_offset )
        return -1;
-    if ( r1->offset > r2->offset )
+    if ( r1_offset > r2_offset )
         return 1;
 
     ASSERT_UNREACHABLE();
     return 0;
 }
 
+static int vpci_register_cmp(const struct vpci_register *r1,
+                             const struct vpci_register *r2)
+{
+    return vpci_offset_cmp(r1->offset, r1->size, r2->offset, r2->size);
+}
+
 /* Dummy hooks, writes are ignored, reads return 1's */
 static uint32_t vpci_ignored_read(const struct pci_dev *pdev, unsigned int reg,
                                   void *data)
@@ -129,6 +150,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
     return pci_conf_read32(pdev->sbdf, reg);
 }
 
+/* This must hold domain's vpci_rwlock in write mode. */
 int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                       vpci_write_t *write_handler, unsigned int offset,
                       unsigned int size, void *data)
@@ -152,8 +174,6 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
     r->offset = offset;
     r->private = data;
 
-    spin_lock(&vpci->lock);
-
     /* The list of handlers must be kept sorted at all times. */
     list_for_each ( prev, &vpci->handlers )
     {
@@ -165,25 +185,23 @@ int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
             break;
         if ( cmp == 0 )
         {
-            spin_unlock(&vpci->lock);
             xfree(r);
             return -EEXIST;
         }
     }
 
     list_add_tail(&r->node, prev);
-    spin_unlock(&vpci->lock);
 
     return 0;
 }
 
+/* This must hold domain's vpci_rwlock in write mode. */
 int vpci_remove_register(struct vpci *vpci, unsigned int offset,
                          unsigned int size)
 {
     const struct vpci_register r = { .offset = offset, .size = size };
     struct vpci_register *rm;
 
-    spin_lock(&vpci->lock);
     list_for_each_entry ( rm, &vpci->handlers, node )
     {
         int cmp = vpci_register_cmp(&r, rm);
@@ -195,14 +213,12 @@ int vpci_remove_register(struct vpci *vpci, unsigned int offset,
         if ( !cmp && rm->offset == offset && rm->size == size )
         {
             list_del(&rm->node);
-            spin_unlock(&vpci->lock);
             xfree(rm);
             return 0;
         }
         if ( cmp <= 0 )
             break;
     }
-    spin_unlock(&vpci->lock);
 
     return -ENOENT;
 }
@@ -310,7 +326,7 @@ static uint32_t merge_result(uint32_t data, uint32_t new, unsigned int size,
 
 uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
@@ -327,6 +343,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
     if ( !pdev )
         return vpci_read_hw(sbdf, reg, size);
 
+    read_lock(&d->vpci_rwlock);
     spin_lock(&pdev->vpci->lock);
 
     /* Read from the hardware or the emulated register handlers. */
@@ -371,6 +388,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
         ASSERT(data_offset < size);
     }
     spin_unlock(&pdev->vpci->lock);
+    read_unlock(&d->vpci_rwlock);
 
     if ( data_offset < size )
     {
@@ -410,14 +428,37 @@ static void vpci_write_helper(const struct pci_dev *pdev,
              r->private);
 }
 
+static bool vpci_header_write_lock(const struct pci_dev *pdev,
+                                   unsigned int start, unsigned int size)
+{
+    /*
+     * Writing the command register and ROM BAR register may trigger
+     * modify_bars to run which in turn may access multiple pdevs while
+     * checking for the existing BAR's overlap. The overlapping check, if done
+     * under the read lock, requires vpci->lock to be acquired on both devices
+     * being compared, which may produce a deadlock. It is not possible to
+     * upgrade read lock to write lock in such a case. So, in order to prevent
+     * the deadlock, check which registers are going to be written and acquire
+     * the lock in the appropriate mode from the beginning.
+     */
+    if ( !vpci_offset_cmp(start, size, PCI_COMMAND, 2) )
+        return true;
+
+    if ( !vpci_offset_cmp(start, size, pdev->vpci->header.rom_reg, 4) )
+        return true;
+
+    return false;
+}
+
 void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
 {
-    const struct domain *d = current->domain;
+    struct domain *d = current->domain;
     const struct pci_dev *pdev;
     const struct vpci_register *r;
     unsigned int data_offset = 0;
     const unsigned long *ro_map = pci_get_ro_map(sbdf.seg);
+    bool write_locked = false;
 
     if ( !size )
     {
@@ -440,7 +481,17 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
         return;
     }
 
-    spin_lock(&pdev->vpci->lock);
+    if ( vpci_header_write_lock(pdev, reg, size) )
+    {
+        /* Gain exclusive access to all of the domain pdevs vpci. */
+        write_lock(&d->vpci_rwlock);
+        write_locked = true;
+    }
+    else
+    {
+        read_lock(&d->vpci_rwlock);
+        spin_lock(&pdev->vpci->lock);
+    }
 
     /* Write the value to the hardware or emulated registers. */
     list_for_each_entry ( r, &pdev->vpci->handlers, node )
@@ -475,7 +526,14 @@ void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
             break;
         ASSERT(data_offset < size);
     }
-    spin_unlock(&pdev->vpci->lock);
+
+    if ( write_locked )
+        write_unlock(&d->vpci_rwlock);
+    else
+    {
+        spin_unlock(&pdev->vpci->lock);
+        read_unlock(&d->vpci_rwlock);
+    }
 
     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 37f78cc4c4c9..ecd34481a7af 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,9 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+#ifdef CONFIG_HAS_VPCI
+    rwlock_t vpci_rwlock;
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index e8ac1eb39513..e19e462ee5cb 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -88,6 +88,8 @@ struct vpci {
          * is mapped into guest p2m) if there's a ROM BAR on the device.
          */
         bool rom_enabled : 1;
+        /* Offset to the ROM BAR register if any. */
+        unsigned int rom_reg;
         /* FIXME: currently there's no support for SR-IOV. */
     } header;