Message ID | 20211105065629.940943-3-andr2000@gmail.com (mailing list archive)
---|---
State | Superseded
Series | PCI devices passthrough on Arm, part 3
On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>
> When a vPCI is removed for a PCI device it is possible that we have
> scheduled a delayed work for map/unmap operations for that device.
> For example, the following scenario can illustrate the problem:
>
> pci_physdev_op
>     pci_add_device
>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>     iommu_add_device <- FAILS
>     vpci_remove_device -> xfree(pdev->vpci)
>
> leave_hypervisor_to_guest
>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>
> For the hardware domain we continue execution as the worse that
> could happen is that MMIO mappings are left in place when the
> device has been deassigned

Is continuing safe in this case? I.e. isn't there the risk of a NULL
deref?

> For unprivileged domains that get a failure in the middle of a vPCI
> {un}map operation we need to destroy them, as we don't know in which
> state the p2m is. This can only happen in vpci_process_pending for
> DomUs as they won't be allowed to call pci_add_device.

You saying "we need to destroy them" made me look for a new domain_crash()
that you add, but there is none. What is this about?

> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>       return false;
>   }
>
> +void vpci_cancel_pending(const struct pci_dev *pdev)
> +{
> +    struct vcpu *v = current;
> +
> +    /* Cancel any pending work now. */

Doesn't "any" include pending work on all vCPU-s of the guest, not
just current? Is current even relevant (as in: part of the correct
domain), considering ...

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>           xfree(r);
>       }
>       spin_unlock(&pdev->vpci->lock);
> +
> +    vpci_cancel_pending(pdev);

... this code path, when coming here from pci_{add,remove}_device()?

I can agree that there's a problem here, but I think you need to
properly (i.e. in a race free manner) drain pending work.

Jan
On 15.11.21 18:56, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> When a vPCI is removed for a PCI device it is possible that we have
>> scheduled a delayed work for map/unmap operations for that device.
>> For example, the following scenario can illustrate the problem:
>>
>> pci_physdev_op
>>     pci_add_device
>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>     iommu_add_device <- FAILS
>>     vpci_remove_device -> xfree(pdev->vpci)
>>
>> leave_hypervisor_to_guest
>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>
>> For the hardware domain we continue execution as the worse that
>> could happen is that MMIO mappings are left in place when the
>> device has been deassigned
> Is continuing safe in this case? I.e. isn't there the risk of a NULL
> deref?
I think it is safe to continue
>
>> For unprivileged domains that get a failure in the middle of a vPCI
>> {un}map operation we need to destroy them, as we don't know in which
>> state the p2m is. This can only happen in vpci_process_pending for
>> DomUs as they won't be allowed to call pci_add_device.
> You saying "we need to destroy them" made me look for a new domain_crash()
> that you add, but there is none. What is this about?
Yes, I guess we need to implicitly destroy the domain,
@Roger are you ok with that?
>
>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>       return false;
>>   }
>>
>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>> +{
>> +    struct vcpu *v = current;
>> +
>> +    /* Cancel any pending work now. */
> Doesn't "any" include pending work on all vCPU-s of the guest, not
> just current? Is current even relevant (as in: part of the correct
> domain), considering ...
>
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>           xfree(r);
>>       }
>>       spin_unlock(&pdev->vpci->lock);
>> +
>> +    vpci_cancel_pending(pdev);
> ... this code path, when coming here from pci_{add,remove}_device()?
>
> I can agree that there's a problem here, but I think you need to
> properly (i.e. in a race free manner) drain pending work.
Yes, the code is inconsistent in this respect. I am thinking about:

void vpci_cancel_pending(const struct pci_dev *pdev)
{
    struct domain *d = pdev->domain;
    struct vcpu *v;

    /* Cancel any pending work now. */
    domain_lock(d);
    for_each_vcpu ( d, v )
    {
        vcpu_pause(v);
        if ( v->vpci.mem && v->vpci.pdev == pdev)
        {
            rangeset_destroy(v->vpci.mem);
            v->vpci.mem = NULL;
        }
        vcpu_unpause(v);
    }
    domain_unlock(d);
}

which seems to solve all the concerns. Is this what you mean?
>
> Jan
>
Thank you,
Oleksandr
On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
> On 15.11.21 18:56, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> When a vPCI is removed for a PCI device it is possible that we have
>>> scheduled a delayed work for map/unmap operations for that device.
>>> For example, the following scenario can illustrate the problem:
>>>
>>> pci_physdev_op
>>>     pci_add_device
>>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>     iommu_add_device <- FAILS
>>>     vpci_remove_device -> xfree(pdev->vpci)
>>>
>>> leave_hypervisor_to_guest
>>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>
>>> For the hardware domain we continue execution as the worse that
>>> could happen is that MMIO mappings are left in place when the
>>> device has been deassigned
>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>> deref?
> I think it is safe to continue

And why do you think so? I.e. why is there no race for Dom0 when there
is one for DomU?

>>> For unprivileged domains that get a failure in the middle of a vPCI
>>> {un}map operation we need to destroy them, as we don't know in which
>>> state the p2m is. This can only happen in vpci_process_pending for
>>> DomUs as they won't be allowed to call pci_add_device.
>> You saying "we need to destroy them" made me look for a new domain_crash()
>> that you add, but there is none. What is this about?
> Yes, I guess we need to implicitly destroy the domain,

What do you mean by "implicitly"?

>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>       return false;
>>>   }
>>>
>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>> +{
>>> +    struct vcpu *v = current;
>>> +
>>> +    /* Cancel any pending work now. */
>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>> just current? Is current even relevant (as in: part of the correct
>> domain), considering ...
>>
>>> --- a/xen/drivers/vpci/vpci.c
>>> +++ b/xen/drivers/vpci/vpci.c
>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>           xfree(r);
>>>       }
>>>       spin_unlock(&pdev->vpci->lock);
>>> +
>>> +    vpci_cancel_pending(pdev);
>> ... this code path, when coming here from pci_{add,remove}_device()?
>>
>> I can agree that there's a problem here, but I think you need to
>> properly (i.e. in a race free manner) drain pending work.
> Yes, the code is inconsistent with this respect. I am thinking about:
>
> void vpci_cancel_pending(const struct pci_dev *pdev)
> {
>     struct domain *d = pdev->domain;
>     struct vcpu *v;
>
>     /* Cancel any pending work now. */
>     domain_lock(d);
>     for_each_vcpu ( d, v )
>     {
>         vcpu_pause(v);
>         if ( v->vpci.mem && v->vpci.pdev == pdev)

Nit: Same style issue as in the original patch.

>         {
>             rangeset_destroy(v->vpci.mem);
>             v->vpci.mem = NULL;
>         }
>         vcpu_unpause(v);
>     }
>     domain_unlock(d);
> }
>
> which seems to solve all the concerns. Is this what you mean?

Something along these lines. I expect you will want to make use of
domain_pause_except_self(), and I don't understand the purpose of
acquiring the domain lock.

Jan
On 16.11.21 10:01, Jan Beulich wrote:
> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>> On 15.11.21 18:56, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>> scheduled a delayed work for map/unmap operations for that device.
>>>> For example, the following scenario can illustrate the problem:
>>>>
>>>> pci_physdev_op
>>>>     pci_add_device
>>>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>     iommu_add_device <- FAILS
>>>>     vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> leave_hypervisor_to_guest
>>>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>
>>>> For the hardware domain we continue execution as the worse that
>>>> could happen is that MMIO mappings are left in place when the
>>>> device has been deassigned
>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>> deref?
>> I think it is safe to continue
> And why do you think so? I.e. why is there no race for Dom0 when there
> is one for DomU?
Well, then we need to use a lock to synchronize the two.
I guess this needs to be pci devs lock unfortunately
>
>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>> {un}map operation we need to destroy them, as we don't know in which
>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>> DomUs as they won't be allowed to call pci_add_device.
>>> You saying "we need to destroy them" made me look for a new domain_crash()
>>> that you add, but there is none. What is this about?
>> Yes, I guess we need to implicitly destroy the domain,
> What do you mean by "implicitly"?

@@ -151,14 +151,18 @@ bool vpci_process_pending(struct vcpu *v)
         vpci_cancel_pending(v->vpci.pdev);

         if ( rc )
+        {
             /*
              * FIXME: in case of failure remove the device from the domain.
              * Note that there might still be leftover mappings. While this is
+             * safe for Dom0, for DomUs the domain needs to be killed in order
+             * to avoid leaking stale p2m mappings on failure.
              */
             vpci_remove_device(v->vpci.pdev);
+
+            if ( !is_hardware_domain(v->domain) )
+                domain_crash(v->domain);
>
>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>       return false;
>>>>   }
>>>>
>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>> +{
>>>> +    struct vcpu *v = current;
>>>> +
>>>> +    /* Cancel any pending work now. */
>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>> just current? Is current even relevant (as in: part of the correct
>>> domain), considering ...
>>>
>>>> --- a/xen/drivers/vpci/vpci.c
>>>> +++ b/xen/drivers/vpci/vpci.c
>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>           xfree(r);
>>>>       }
>>>>       spin_unlock(&pdev->vpci->lock);
>>>> +
>>>> +    vpci_cancel_pending(pdev);
>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>
>>> I can agree that there's a problem here, but I think you need to
>>> properly (i.e. in a race free manner) drain pending work.
>> Yes, the code is inconsistent with this respect. I am thinking about:
>>
>> void vpci_cancel_pending(const struct pci_dev *pdev)
>> {
>>     struct domain *d = pdev->domain;
>>     struct vcpu *v;
>>
>>     /* Cancel any pending work now. */
>>     domain_lock(d);
>>     for_each_vcpu ( d, v )
>>     {
>>         vcpu_pause(v);
>>         if ( v->vpci.mem && v->vpci.pdev == pdev)
> Nit: Same style issue as in the original patch.
Will fix
>
>>         {
>>             rangeset_destroy(v->vpci.mem);
>>             v->vpci.mem = NULL;
>>         }
>>         vcpu_unpause(v);
>>     }
>>     domain_unlock(d);
>> }
>>
>> which seems to solve all the concerns. Is this what you mean?
> Something along these lines. I expect you will want to make use of
> domain_pause_except_self(),
Yes, this is what is needed here, thanks.
The only question is that

int domain_pause_except_self(struct domain *d)
{
    [snip]
    /* Avoid racing with other vcpus which may want to be pausing us */
    if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
        return -ERESTART;

so it is not clear what we do in case of -ERESTART: do we want to spin?
Otherwise we will leave the job not done, effectively not canceling the
pending work. Any idea other than spinning?
> and I don't understand the purpose of
> acquiring the domain lock.
You are right, no need
>
> Jan
>
Thank you,
Oleksandr
On 16.11.2021 09:23, Oleksandr Andrushchenko wrote: > > > On 16.11.21 10:01, Jan Beulich wrote: >> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>> On 15.11.21 18:56, Jan Beulich wrote: >>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v) >>>>> return false; >>>>> } >>>>> >>>>> +void vpci_cancel_pending(const struct pci_dev *pdev) >>>>> +{ >>>>> + struct vcpu *v = current; >>>>> + >>>>> + /* Cancel any pending work now. */ >>>> Doesn't "any" include pending work on all vCPU-s of the guest, not >>>> just current? Is current even relevant (as in: part of the correct >>>> domain), considering ... >>>> >>>>> --- a/xen/drivers/vpci/vpci.c >>>>> +++ b/xen/drivers/vpci/vpci.c >>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev) >>>>> xfree(r); >>>>> } >>>>> spin_unlock(&pdev->vpci->lock); >>>>> + >>>>> + vpci_cancel_pending(pdev); >>>> ... this code path, when coming here from pci_{add,remove}_device()? >>>> >>>> I can agree that there's a problem here, but I think you need to >>>> properly (i.e. in a race free manner) drain pending work. >>> Yes, the code is inconsistent with this respect. I am thinking about: >>> >>> void vpci_cancel_pending(const struct pci_dev *pdev) >>> { >>> struct domain *d = pdev->domain; >>> struct vcpu *v; >>> >>> /* Cancel any pending work now. */ >>> domain_lock(d); >>> for_each_vcpu ( d, v ) >>> { >>> vcpu_pause(v); >>> if ( v->vpci.mem && v->vpci.pdev == pdev) >> Nit: Same style issue as in the original patch. > Will fix >> >>> { >>> rangeset_destroy(v->vpci.mem); >>> v->vpci.mem = NULL; >>> } >>> vcpu_unpause(v); >>> } >>> domain_unlock(d); >>> } >>> >>> which seems to solve all the concerns. Is this what you mean? >> Something along these lines. I expect you will want to make use of >> domain_pause_except_self(), > Yes, this is what is needed here, thanks. 
> The only question is that
>
> int domain_pause_except_self(struct domain *d)
> {
> [snip]
>     /* Avoid racing with other vcpus which may want to be pausing us */
>     if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>         return -ERESTART;
>
> so it is not clear what do we do in case of -ERESTART: do we want to spin?
> Otherwise we will leave the job not done effectively not canceling the
> pending work. Any idea other then spinning?

Depends on the call chain you come through. There may need to be some
rearrangements such that you may be able to preempt the enclosing
hypercall.

Jan
On 16.11.21 13:38, Jan Beulich wrote: > On 16.11.2021 09:23, Oleksandr Andrushchenko wrote: >> >> On 16.11.21 10:01, Jan Beulich wrote: >>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>>> On 15.11.21 18:56, Jan Beulich wrote: >>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v) >>>>>> return false; >>>>>> } >>>>>> >>>>>> +void vpci_cancel_pending(const struct pci_dev *pdev) >>>>>> +{ >>>>>> + struct vcpu *v = current; >>>>>> + >>>>>> + /* Cancel any pending work now. */ >>>>> Doesn't "any" include pending work on all vCPU-s of the guest, not >>>>> just current? Is current even relevant (as in: part of the correct >>>>> domain), considering ... >>>>> >>>>>> --- a/xen/drivers/vpci/vpci.c >>>>>> +++ b/xen/drivers/vpci/vpci.c >>>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev) >>>>>> xfree(r); >>>>>> } >>>>>> spin_unlock(&pdev->vpci->lock); >>>>>> + >>>>>> + vpci_cancel_pending(pdev); >>>>> ... this code path, when coming here from pci_{add,remove}_device()? >>>>> >>>>> I can agree that there's a problem here, but I think you need to >>>>> properly (i.e. in a race free manner) drain pending work. >>>> Yes, the code is inconsistent with this respect. I am thinking about: >>>> >>>> void vpci_cancel_pending(const struct pci_dev *pdev) >>>> { >>>> struct domain *d = pdev->domain; >>>> struct vcpu *v; >>>> >>>> /* Cancel any pending work now. */ >>>> domain_lock(d); >>>> for_each_vcpu ( d, v ) >>>> { >>>> vcpu_pause(v); >>>> if ( v->vpci.mem && v->vpci.pdev == pdev) >>> Nit: Same style issue as in the original patch. >> Will fix >>>> { >>>> rangeset_destroy(v->vpci.mem); >>>> v->vpci.mem = NULL; >>>> } >>>> vcpu_unpause(v); >>>> } >>>> domain_unlock(d); >>>> } >>>> >>>> which seems to solve all the concerns. Is this what you mean? >>> Something along these lines. 
>>> I expect you will want to make use of
>>> domain_pause_except_self(),
>> Yes, this is what is needed here, thanks. The only question is that
>>
>> int domain_pause_except_self(struct domain *d)
>> {
>> [snip]
>>     /* Avoid racing with other vcpus which may want to be pausing us */
>>     if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>>         return -ERESTART;
>>
>> so it is not clear what do we do in case of -ERESTART: do we want to spin?
>> Otherwise we will leave the job not done effectively not canceling the
>> pending work. Any idea other then spinning?
> Depends on the call chain you come through. There may need to be some
> rearrangements such that you may be able to preempt the enclosing
> hypercall.
Well, there are three places which may lead to the pending work
needing to be canceled:

MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> vpci_cancel_pending (on the modify_bars failure path)

PHYSDEVOP_pci_device_add -> pci_add_device (error path) -> vpci_remove_device -> vpci_cancel_pending

PHYSDEVOP_pci_device_remove -> pci_remove_device -> vpci_remove_device -> vpci_cancel_pending

So, taking into account the MMIO path, I think about the below code:

/*
 * Cancel any pending work now.
 *
 * FIXME: this can be called from an MMIO trap handler's error
 * path, so we cannot just return an error code here for upper
 * layers to handle. The best we can do is to still try
 * removing the range sets.
 */
while ( (rc = domain_pause_except_self(d)) == -ERESTART )
    cpu_relax();

if ( rc )
    printk(XENLOG_G_ERR
           "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
           &pdev->sbdf, pdev->domain, rc);

I am not sure how to handle this otherwise.
@Roger, do you see any other good way?
>
> Jan
>
Thank you,
Oleksandr
On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>
> On 16.11.21 10:01, Jan Beulich wrote:
>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>> For example, the following scenario can illustrate the problem:
>>>>>
>>>>> pci_physdev_op
>>>>>     pci_add_device
>>>>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>     iommu_add_device <- FAILS
>>>>>     vpci_remove_device -> xfree(pdev->vpci)
>>>>>
>>>>> leave_hypervisor_to_guest
>>>>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>
>>>>> For the hardware domain we continue execution as the worse that
>>>>> could happen is that MMIO mappings are left in place when the
>>>>> device has been deassigned
>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>> deref?
>>> I think it is safe to continue
>> And why do you think so? I.e. why is there no race for Dom0 when there
>> is one for DomU?
> Well, then we need to use a lock to synchronize the two.
> I guess this needs to be pci devs lock unfortunately
The parties involved in deferred work and its cancellation:

MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map

Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending

x86: two places -> hvm_do_resume -> vpci_process_pending

So, both defer_map and vpci_process_pending need to be synchronized with
pcidevs_{lock|unlock}. As both functions existed before the code I introduce,
I would prefer this to be a dedicated patch for v5 of the series.

Thank you,
Oleksandr
On 16.11.2021 14:27, Oleksandr Andrushchenko wrote: > > > On 16.11.21 13:38, Jan Beulich wrote: >> On 16.11.2021 09:23, Oleksandr Andrushchenko wrote: >>> >>> On 16.11.21 10:01, Jan Beulich wrote: >>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>>>> On 15.11.21 18:56, Jan Beulich wrote: >>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v) >>>>>>> return false; >>>>>>> } >>>>>>> >>>>>>> +void vpci_cancel_pending(const struct pci_dev *pdev) >>>>>>> +{ >>>>>>> + struct vcpu *v = current; >>>>>>> + >>>>>>> + /* Cancel any pending work now. */ >>>>>> Doesn't "any" include pending work on all vCPU-s of the guest, not >>>>>> just current? Is current even relevant (as in: part of the correct >>>>>> domain), considering ... >>>>>> >>>>>>> --- a/xen/drivers/vpci/vpci.c >>>>>>> +++ b/xen/drivers/vpci/vpci.c >>>>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev) >>>>>>> xfree(r); >>>>>>> } >>>>>>> spin_unlock(&pdev->vpci->lock); >>>>>>> + >>>>>>> + vpci_cancel_pending(pdev); >>>>>> ... this code path, when coming here from pci_{add,remove}_device()? >>>>>> >>>>>> I can agree that there's a problem here, but I think you need to >>>>>> properly (i.e. in a race free manner) drain pending work. >>>>> Yes, the code is inconsistent with this respect. I am thinking about: >>>>> >>>>> void vpci_cancel_pending(const struct pci_dev *pdev) >>>>> { >>>>> struct domain *d = pdev->domain; >>>>> struct vcpu *v; >>>>> >>>>> /* Cancel any pending work now. */ >>>>> domain_lock(d); >>>>> for_each_vcpu ( d, v ) >>>>> { >>>>> vcpu_pause(v); >>>>> if ( v->vpci.mem && v->vpci.pdev == pdev) >>>> Nit: Same style issue as in the original patch. >>> Will fix >>>>> { >>>>> rangeset_destroy(v->vpci.mem); >>>>> v->vpci.mem = NULL; >>>>> } >>>>> vcpu_unpause(v); >>>>> } >>>>> domain_unlock(d); >>>>> } >>>>> >>>>> which seems to solve all the concerns. Is this what you mean? 
>>>> Something along these lines. I expect you will want to make use of >>>> domain_pause_except_self(), >>> Yes, this is what is needed here, thanks. The only question is that >>> >>> int domain_pause_except_self(struct domain *d) >>> { >>> [snip] >>> /* Avoid racing with other vcpus which may want to be pausing us */ >>> if ( !spin_trylock(&d->hypercall_deadlock_mutex) ) >>> return -ERESTART; >>> >>> so it is not clear what do we do in case of -ERESTART: do we want to spin? >>> Otherwise we will leave the job not done effectively not canceling the >>> pending work. Any idea other then spinning? >> Depends on the call chain you come through. There may need to be some >> rearrangements such that you may be able to preempt the enclosing >> hypercall. > Well, there are three places which may lead to the pending work > needs to be canceled: > > MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> vpci_cancel_pending (on modify_bars fail path) > > PHYSDEVOP_pci_device_add -> pci_add_device (error path) -> vpci_remove_device -> vpci_cancel_pending > > PHYSDEVOP_pci_device_remove -> pci_remove_device -> vpci_remove_device -> vpci_cancel_pending > > So, taking into account the MMIO path, I think about the below code > > /* > * Cancel any pending work now. > * > * FIXME: this can be called from an MMIO trap handler's error > * path, so we cannot just return an error code here, so upper > * layers can handle it. The best we can do is to still try > * removing the range sets. > */ The MMIO trap path should simply exit to the guest to have the insn retried. With the vPCI removal a subsequent emulation attempt will no longer be able to resolve the address to BDF, and hence do whatever it would do for an attempted access to config space not belonging to any device. For the two physdevops it may not be possible to preempt the hypercalls without further code adjustments. 
Jan > while ( (rc = domain_pause_except_self(d)) == -ERESTART ) > cpu_relax(); > > if ( rc ) > printk(XENLOG_G_ERR > "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n", > &pdev->sbdf, pdev->domain, rc); > > I am not sure how to handle this otherwise > @Roger, do you see any other good way? >> >> Jan >> > Thank you, > Oleksandr >
On 16.11.2021 14:41, Oleksandr Andrushchenko wrote: > > > On 16.11.21 10:23, Oleksandr Andrushchenko wrote: >> >> On 16.11.21 10:01, Jan Beulich wrote: >>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>>> On 15.11.21 18:56, Jan Beulich wrote: >>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>> >>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>> For example, the following scenario can illustrate the problem: >>>>>> >>>>>> pci_physdev_op >>>>>> pci_add_device >>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>> iommu_add_device <- FAILS >>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>> >>>>>> leave_hypervisor_to_guest >>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>> >>>>>> For the hardware domain we continue execution as the worse that >>>>>> could happen is that MMIO mappings are left in place when the >>>>>> device has been deassigned >>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL >>>>> deref? >>>> I think it is safe to continue >>> And why do you think so? I.e. why is there no race for Dom0 when there >>> is one for DomU? >> Well, then we need to use a lock to synchronize the two. >> I guess this needs to be pci devs lock unfortunately > The parties involved in deferred work and its cancellation: > > MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map > > Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending > > x86: two places -> hvm_do_resume -> vpci_process_pending > > So, both defer_map and vpci_process_pending need to be synchronized with > pcidevs_{lock|unlock). If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock getting used in leave_hypervisor_to_guest. Jan
On 16.11.21 16:12, Jan Beulich wrote: > On 16.11.2021 14:41, Oleksandr Andrushchenko wrote: >> >> On 16.11.21 10:23, Oleksandr Andrushchenko wrote: >>> On 16.11.21 10:01, Jan Beulich wrote: >>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>>>> On 15.11.21 18:56, Jan Beulich wrote: >>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>> >>>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>>> For example, the following scenario can illustrate the problem: >>>>>>> >>>>>>> pci_physdev_op >>>>>>> pci_add_device >>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>>> iommu_add_device <- FAILS >>>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>>> >>>>>>> leave_hypervisor_to_guest >>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>>> >>>>>>> For the hardware domain we continue execution as the worse that >>>>>>> could happen is that MMIO mappings are left in place when the >>>>>>> device has been deassigned >>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL >>>>>> deref? >>>>> I think it is safe to continue >>>> And why do you think so? I.e. why is there no race for Dom0 when there >>>> is one for DomU? >>> Well, then we need to use a lock to synchronize the two. >>> I guess this needs to be pci devs lock unfortunately >> The parties involved in deferred work and its cancellation: >> >> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map >> >> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending >> >> x86: two places -> hvm_do_resume -> vpci_process_pending >> >> So, both defer_map and vpci_process_pending need to be synchronized with >> pcidevs_{lock|unlock). 
> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock > getting used in leave_hypervisor_to_guest. I do agree this is really not good, but it seems I am limited in choices. @Stefano, @Julien, do you see any better way of doing that? We were thinking about introducing a dedicated lock for vpci [1], but finally decided to use pcidevs_lock for now > Jan > [1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/
On 16.11.21 16:24, Oleksandr Andrushchenko wrote: > > On 16.11.21 16:12, Jan Beulich wrote: >> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote: >>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote: >>>> On 16.11.21 10:01, Jan Beulich wrote: >>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>>>>> On 15.11.21 18:56, Jan Beulich wrote: >>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>>> >>>>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>>>> For example, the following scenario can illustrate the problem: >>>>>>>> >>>>>>>> pci_physdev_op >>>>>>>> pci_add_device >>>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>>>> iommu_add_device <- FAILS >>>>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>>>> >>>>>>>> leave_hypervisor_to_guest >>>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>>>> >>>>>>>> For the hardware domain we continue execution as the worse that >>>>>>>> could happen is that MMIO mappings are left in place when the >>>>>>>> device has been deassigned >>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL >>>>>>> deref? >>>>>> I think it is safe to continue >>>>> And why do you think so? I.e. why is there no race for Dom0 when there >>>>> is one for DomU? >>>> Well, then we need to use a lock to synchronize the two. 
>>>> I guess this needs to be pci devs lock unfortunately >>> The parties involved in deferred work and its cancellation: >>> >>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map >>> >>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending >>> >>> x86: two places -> hvm_do_resume -> vpci_process_pending >>> >>> So, both defer_map and vpci_process_pending need to be synchronized with >>> pcidevs_{lock|unlock). >> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock >> getting used in leave_hypervisor_to_guest. > I do agree this is really not good, but it seems I am limited in choices. > @Stefano, @Julien, do you see any better way of doing that? > > We were thinking about introducing a dedicated lock for vpci [1], > but finally decided to use pcidevs_lock for now Or even better and simpler: we just use pdev->vpci->lock to protect vpci_process_pending vs MMIO trap handlers which already use it. >> Jan >> > [1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/
On 16.11.2021 15:24, Oleksandr Andrushchenko wrote: > > > On 16.11.21 16:12, Jan Beulich wrote: >> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote: >>> >>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote: >>>> On 16.11.21 10:01, Jan Beulich wrote: >>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>>>>> On 15.11.21 18:56, Jan Beulich wrote: >>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>>> >>>>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>>>> For example, the following scenario can illustrate the problem: >>>>>>>> >>>>>>>> pci_physdev_op >>>>>>>> pci_add_device >>>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>>>> iommu_add_device <- FAILS >>>>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>>>> >>>>>>>> leave_hypervisor_to_guest >>>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>>>> >>>>>>>> For the hardware domain we continue execution as the worse that >>>>>>>> could happen is that MMIO mappings are left in place when the >>>>>>>> device has been deassigned >>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL >>>>>>> deref? >>>>>> I think it is safe to continue >>>>> And why do you think so? I.e. why is there no race for Dom0 when there >>>>> is one for DomU? >>>> Well, then we need to use a lock to synchronize the two. 
>>>> I guess this needs to be pci devs lock unfortunately >>> The parties involved in deferred work and its cancellation: >>> >>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map >>> >>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending >>> >>> x86: two places -> hvm_do_resume -> vpci_process_pending >>> >>> So, both defer_map and vpci_process_pending need to be synchronized with >>> pcidevs_{lock|unlock). >> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock >> getting used in leave_hypervisor_to_guest. > I do agree this is really not good, but it seems I am limited in choices. > @Stefano, @Julien, do you see any better way of doing that? > > We were thinking about introducing a dedicated lock for vpci [1], > but finally decided to use pcidevs_lock for now Even that locking model might be too heavyweight for this purpose, unless an r/w lock was intended. The problem would still be that all guest exiting would be serialized within a domain. (That's still better than serializing all guest exiting on the host, of course.) Jan > [1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/ >
Hi Oleksandr, On 16/11/2021 14:24, Oleksandr Andrushchenko wrote: > > > On 16.11.21 16:12, Jan Beulich wrote: >> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote: >>> >>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote: >>>> On 16.11.21 10:01, Jan Beulich wrote: >>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>>>>> On 15.11.21 18:56, Jan Beulich wrote: >>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>>> >>>>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>>>> For example, the following scenario can illustrate the problem: >>>>>>>> >>>>>>>> pci_physdev_op >>>>>>>> pci_add_device >>>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>>>> iommu_add_device <- FAILS >>>>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>>>> >>>>>>>> leave_hypervisor_to_guest >>>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>>>> >>>>>>>> For the hardware domain we continue execution as the worse that >>>>>>>> could happen is that MMIO mappings are left in place when the >>>>>>>> device has been deassigned >>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL >>>>>>> deref? >>>>>> I think it is safe to continue >>>>> And why do you think so? I.e. why is there no race for Dom0 when there >>>>> is one for DomU? >>>> Well, then we need to use a lock to synchronize the two. 
>>>> I guess this needs to be pci devs lock unfortunately >>> The parties involved in deferred work and its cancellation: >>> >>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map >>> >>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending >>> >>> x86: two places -> hvm_do_resume -> vpci_process_pending >>> >>> So, both defer_map and vpci_process_pending need to be synchronized with >>> pcidevs_{lock|unlock). >> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock >> getting used in leave_hypervisor_to_guest. > I do agree this is really not good, but it seems I am limited in choices. > @Stefano, @Julien, do you see any better way of doing that? I agree with Jan's concern about using pcidevs_{lock|unlock}: the lock is not fine-grained enough to be taken in vpci_process_pending(). I haven't yet looked at the rest of the series to be able to suggest the exact lock. But we at least want a per-domain spinlock. > > We were thinking about introducing a dedicated lock for vpci [1], > but finally decided to use pcidevs_lock for now Skimming through the thread, you decided to use pcidevs_lock because it was simpler and sufficient for the use case discussed back then. Now, we have a use case where it would be a problem to use pcidevs_lock. So I think the extra complexity is justified. Cheers,
On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: > From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > > When a vPCI is removed for a PCI device it is possible that we have > scheduled a delayed work for map/unmap operations for that device. > For example, the following scenario can illustrate the problem: > > pci_physdev_op > pci_add_device > init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) > iommu_add_device <- FAILS > vpci_remove_device -> xfree(pdev->vpci) > > leave_hypervisor_to_guest > vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL > > For the hardware domain we continue execution as the worse that > could happen is that MMIO mappings are left in place when the > device has been deassigned > > For unprivileged domains that get a failure in the middle of a vPCI > {un}map operation we need to destroy them, as we don't know in which > state the p2m is. This can only happen in vpci_process_pending for > DomUs as they won't be allowed to call pci_add_device. > > Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Thinking about it some more, I'm not convinced any of this is really needed in the presented form. Removal of a vPCI device is the analogue of hot-unplug on baremetal. That's not a "behind the backs of everything" operation. Instead the host admin has to prepare the device for removal, which will result in it being quiescent (which in particular means no BAR adjustments anymore). The act of removing the device from the system has as its virtual counterpart "xl pci-detach". I think it ought to be in this context when pending requests get drained, and an indicator be set that no further changes to that device are permitted. This would mean invoking from vpci_deassign_device() as added by patch 4, not from vpci_remove_device(). This would yield removal of a device from the host being independent of removal of a device from a guest. 
The need for vpci_remove_device() seems questionable in the first place: Even for hot-unplug on the host it may be better to require a pci-detach from (PVH) Dom0 before the actual device removal. This would involve an adjustment to the de-assignment logic for the case of no quarantining: We'd need to make sure explicit de-assignment from Dom0 actually removes the device from there; right now de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending on quarantining mode). Thoughts? Jan
On 17.11.21 10:28, Jan Beulich wrote: > On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >> >> When a vPCI is removed for a PCI device it is possible that we have >> scheduled a delayed work for map/unmap operations for that device. >> For example, the following scenario can illustrate the problem: >> >> pci_physdev_op >> pci_add_device >> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >> iommu_add_device <- FAILS >> vpci_remove_device -> xfree(pdev->vpci) >> >> leave_hypervisor_to_guest >> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >> >> For the hardware domain we continue execution as the worse that >> could happen is that MMIO mappings are left in place when the >> device has been deassigned >> >> For unprivileged domains that get a failure in the middle of a vPCI >> {un}map operation we need to destroy them, as we don't know in which >> state the p2m is. This can only happen in vpci_process_pending for >> DomUs as they won't be allowed to call pci_add_device. >> >> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > Thinking about it some more, I'm not convinced any of this is really > needed in the presented form. The intention of this patch was to handle error conditions which are abnormal, e.g. when iommu_add_device fails and we are in the middle of initialization. So, I am trying to cancel all pending work which might already be there and not to crash. > Removal of a vPCI device is the analogue > of hot-unplug on baremetal. That's not a "behind the backs of > everything" operation. Instead the host admin has to prepare the > device for removal, which will result in it being quiescent (which in > particular means no BAR adjustments anymore). The act of removing the > device from the system has as its virtual counterpart "xl pci-detach". 
> I think it ought to be in this context when pending requests get > drained, and an indicator be set that no further changes to that > device are permitted. This would mean invoking from > vpci_deassign_device() as added by patch 4, not from > vpci_remove_device(). This would yield removal of a device from the > host being independent of removal of a device from a guest. > > The need for vpci_remove_device() seems questionable in the first > place: Even for hot-unplug on the host it may be better to require a > pci-detach from (PVH) Dom0 before the actual device removal. This > would involve an adjustment to the de-assignment logic for the case > of no quarantining: We'd need to make sure explicit de-assignment > from Dom0 actually removes the device from there; right now > de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending > on quarantining mode). Please see above. What you wrote might be perfectly fine for the "expected" removals, but what about the errors which are out of administrator's control? > > Thoughts? > > Jan > Thank you, Oleksandr
On 18.11.2021 08:49, Oleksandr Andrushchenko wrote: > > > On 17.11.21 10:28, Jan Beulich wrote: >> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>> >>> When a vPCI is removed for a PCI device it is possible that we have >>> scheduled a delayed work for map/unmap operations for that device. >>> For example, the following scenario can illustrate the problem: >>> >>> pci_physdev_op >>> pci_add_device >>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>> iommu_add_device <- FAILS >>> vpci_remove_device -> xfree(pdev->vpci) >>> >>> leave_hypervisor_to_guest >>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>> >>> For the hardware domain we continue execution as the worse that >>> could happen is that MMIO mappings are left in place when the >>> device has been deassigned >>> >>> For unprivileged domains that get a failure in the middle of a vPCI >>> {un}map operation we need to destroy them, as we don't know in which >>> state the p2m is. This can only happen in vpci_process_pending for >>> DomUs as they won't be allowed to call pci_add_device. >>> >>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >> Thinking about it some more, I'm not convinced any of this is really >> needed in the presented form. > The intention of this patch was to handle error conditions which are > abnormal, e.g. when iommu_add_device fails and we are in the middle > of initialization. So, I am trying to cancel all pending work which might > already be there and not to crash. Only Dom0 may be able to prematurely access the device during "add". Yet unlike for DomU-s we generally expect Dom0 to be well-behaved. Hence I'm not sure I see the need for dealing with these. >> Removal of a vPCI device is the analogue >> of hot-unplug on baremetal. That's not a "behind the backs of >> everything" operation. 
Instead the host admin has to prepare the >> device for removal, which will result in it being quiescent (which in >> particular means no BAR adjustments anymore). The act of removing the >> device from the system has as its virtual counterpart "xl pci-detach". >> I think it ought to be in this context when pending requests get >> drained, and an indicator be set that no further changes to that >> device are permitted. This would mean invoking from >> vpci_deassign_device() as added by patch 4, not from >> vpci_remove_device(). This would yield removal of a device from the >> host being independent of removal of a device from a guest. >> >> The need for vpci_remove_device() seems questionable in the first >> place: Even for hot-unplug on the host it may be better to require a >> pci-detach from (PVH) Dom0 before the actual device removal. This >> would involve an adjustment to the de-assignment logic for the case >> of no quarantining: We'd need to make sure explicit de-assignment >> from Dom0 actually removes the device from there; right now >> de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending >> on quarantining mode). As to this, I meanwhile think that add/remove can very well have Dom0 related vPCI init/teardown. But for DomU all of that should happen during assign/de-assign. A device still assigned to a DomU simply should never be subject to physical hot-unplug in the first place. Jan
On 18.11.21 10:36, Jan Beulich wrote: > On 18.11.2021 08:49, Oleksandr Andrushchenko wrote: >> >> On 17.11.21 10:28, Jan Beulich wrote: >>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>> >>>> When a vPCI is removed for a PCI device it is possible that we have >>>> scheduled a delayed work for map/unmap operations for that device. >>>> For example, the following scenario can illustrate the problem: >>>> >>>> pci_physdev_op >>>> pci_add_device >>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>> iommu_add_device <- FAILS >>>> vpci_remove_device -> xfree(pdev->vpci) >>>> >>>> leave_hypervisor_to_guest >>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>> >>>> For the hardware domain we continue execution as the worse that >>>> could happen is that MMIO mappings are left in place when the >>>> device has been deassigned >>>> >>>> For unprivileged domains that get a failure in the middle of a vPCI >>>> {un}map operation we need to destroy them, as we don't know in which >>>> state the p2m is. This can only happen in vpci_process_pending for >>>> DomUs as they won't be allowed to call pci_add_device. >>>> >>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>> Thinking about it some more, I'm not convinced any of this is really >>> needed in the presented form. >> The intention of this patch was to handle error conditions which are >> abnormal, e.g. when iommu_add_device fails and we are in the middle >> of initialization. So, I am trying to cancel all pending work which might >> already be there and not to crash. > Only Dom0 may be able to prematurely access the device during "add". > Yet unlike for DomU-s we generally expect Dom0 to be well-behaved. > Hence I'm not sure I see the need for dealing with these. Probably I don't follow you here. The issue I am facing is Dom0 related, e.g. 
Xen was not able to initialize during "add" and thus wanted to clean up the leftovers. As a result, the already scheduled work crashes, as it was neither canceled nor interrupted in a safe manner. So, this sounds like something we need to take care of, thus this patch. Another point, of which I am getting more convinced, is that the proper locking should be a dedicated patch as it should not only add locks as required by this patch, but also probably revisit the existing locking schemes to be acceptable for new use-cases. > >>> Removal of a vPCI device is the analogue >>> of hot-unplug on baremetal. That's not a "behind the backs of >>> everything" operation. Instead the host admin has to prepare the >>> device for removal, which will result in it being quiescent (which in >>> particular means no BAR adjustments anymore). The act of removing the >>> device from the system has as its virtual counterpart "xl pci-detach". >>> I think it ought to be in this context when pending requests get >>> drained, and an indicator be set that no further changes to that >>> device are permitted. This would mean invoking from >>> vpci_deassign_device() as added by patch 4, not from >>> vpci_remove_device(). This would yield removal of a device from the >>> host being independent of removal of a device from a guest. >>> >>> The need for vpci_remove_device() seems questionable in the first >>> place: Even for hot-unplug on the host it may be better to require a >>> pci-detach from (PVH) Dom0 before the actual device removal. This >>> would involve an adjustment to the de-assignment logic for the case >>> of no quarantining: We'd need to make sure explicit de-assignment >>> from Dom0 actually removes the device from there; right now >>> de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending >>> on quarantining mode). > As to this, I meanwhile think that add/remove can very well have Dom0 > related vPCI init/teardown. 
But for DomU all of that should happen > during assign/de-assign. Yes, I agree. The model I also see is: - for Dom0 we use add/remove - for DomUs we use assign/de-assign > A device still assigned to a DomU simply > should never be subject to physical hot-unplug in the first place. Double that > > Jan > Thank you, Oleksandr
On 18.11.2021 09:54, Oleksandr Andrushchenko wrote: > On 18.11.21 10:36, Jan Beulich wrote: >> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote: >>> On 17.11.21 10:28, Jan Beulich wrote: >>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>> >>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>> scheduled a delayed work for map/unmap operations for that device. >>>>> For example, the following scenario can illustrate the problem: >>>>> >>>>> pci_physdev_op >>>>> pci_add_device >>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>> iommu_add_device <- FAILS >>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>> >>>>> leave_hypervisor_to_guest >>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>> >>>>> For the hardware domain we continue execution as the worse that >>>>> could happen is that MMIO mappings are left in place when the >>>>> device has been deassigned >>>>> >>>>> For unprivileged domains that get a failure in the middle of a vPCI >>>>> {un}map operation we need to destroy them, as we don't know in which >>>>> state the p2m is. This can only happen in vpci_process_pending for >>>>> DomUs as they won't be allowed to call pci_add_device. >>>>> >>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>> Thinking about it some more, I'm not convinced any of this is really >>>> needed in the presented form. >>> The intention of this patch was to handle error conditions which are >>> abnormal, e.g. when iommu_add_device fails and we are in the middle >>> of initialization. So, I am trying to cancel all pending work which might >>> already be there and not to crash. >> Only Dom0 may be able to prematurely access the device during "add". >> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved. >> Hence I'm not sure I see the need for dealing with these. 
> Probably I don't follow you here. The issue I am facing is Dom0 > related, e.g. Xen was not able to initialize during "add" and thus > wanted to clean up the leftovers. As the result the already > scheduled work crashes as it was not neither canceled nor interrupted > in some safe manner. So, this sounds like something we need to take > care of, thus this patch. But my point was the question of why there would be any pending work in the first place in this case, when we expect Dom0 to be well-behaved. Jan
On 18.11.21 11:15, Jan Beulich wrote: > On 18.11.2021 09:54, Oleksandr Andrushchenko wrote: >> On 18.11.21 10:36, Jan Beulich wrote: >>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote: >>>> On 17.11.21 10:28, Jan Beulich wrote: >>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>> >>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>> For example, the following scenario can illustrate the problem: >>>>>> >>>>>> pci_physdev_op >>>>>> pci_add_device >>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>> iommu_add_device <- FAILS >>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>> >>>>>> leave_hypervisor_to_guest >>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>> >>>>>> For the hardware domain we continue execution as the worse that >>>>>> could happen is that MMIO mappings are left in place when the >>>>>> device has been deassigned >>>>>> >>>>>> For unprivileged domains that get a failure in the middle of a vPCI >>>>>> {un}map operation we need to destroy them, as we don't know in which >>>>>> state the p2m is. This can only happen in vpci_process_pending for >>>>>> DomUs as they won't be allowed to call pci_add_device. >>>>>> >>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>> Thinking about it some more, I'm not convinced any of this is really >>>>> needed in the presented form. >>>> The intention of this patch was to handle error conditions which are >>>> abnormal, e.g. when iommu_add_device fails and we are in the middle >>>> of initialization. So, I am trying to cancel all pending work which might >>>> already be there and not to crash. >>> Only Dom0 may be able to prematurely access the device during "add". 
>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved. >>> Hence I'm not sure I see the need for dealing with these. >> Probably I don't follow you here. The issue I am facing is Dom0 >> related, e.g. Xen was not able to initialize during "add" and thus >> wanted to clean up the leftovers. As the result the already >> scheduled work crashes as it was not neither canceled nor interrupted >> in some safe manner. So, this sounds like something we need to take >> care of, thus this patch. > But my point was the question of why there would be any pending work > in the first place in this case, when we expect Dom0 to be well-behaved. I am not saying Dom0 misbehaves here. This is my real use-case (as in the commit message): pci_physdev_op pci_add_device init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) iommu_add_device <- FAILS vpci_remove_device -> xfree(pdev->vpci) leave_hypervisor_to_guest vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL So, this made me implement the patch. Then we decided that it is also possible that other vCPUs may also have some pending work and I agreed that this is a good point and we want to remove all pending work for all vCPUs. So, if you doubt the patch and we still have the scenario above, what would you suggest in order to make sure we do not crash? > Jan > Thank you, Oleksandr
Hi, Julien! On 16.11.21 20:02, Julien Grall wrote: > Hi Oleksandr, > > On 16/11/2021 14:24, Oleksandr Andrushchenko wrote: >> >> >> On 16.11.21 16:12, Jan Beulich wrote: >>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote: >>>> >>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote: >>>>> On 16.11.21 10:01, Jan Beulich wrote: >>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote: >>>>>>> On 15.11.21 18:56, Jan Beulich wrote: >>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>>>> >>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>>>>> For example, the following scenario can illustrate the problem: >>>>>>>>> >>>>>>>>> pci_physdev_op >>>>>>>>> pci_add_device >>>>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>>>>> iommu_add_device <- FAILS >>>>>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>>>>> >>>>>>>>> leave_hypervisor_to_guest >>>>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>>>>> >>>>>>>>> For the hardware domain we continue execution as the worse that >>>>>>>>> could happen is that MMIO mappings are left in place when the >>>>>>>>> device has been deassigned >>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL >>>>>>>> deref? >>>>>>> I think it is safe to continue >>>>>> And why do you think so? I.e. why is there no race for Dom0 when there >>>>>> is one for DomU? >>>>> Well, then we need to use a lock to synchronize the two. 
>>>>> I guess this needs to be pci devs lock unfortunately >>>> The parties involved in deferred work and its cancellation: >>>> >>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map >>>> >>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending >>>> >>>> x86: two places -> hvm_do_resume -> vpci_process_pending >>>> >>>> So, both defer_map and vpci_process_pending need to be synchronized with >>>> pcidevs_{lock|unlock). >>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock >>> getting used in leave_hypervisor_to_guest. >> I do agree this is really not good, but it seems I am limited in choices. >> @Stefano, @Julien, do you see any better way of doing that? > > I agree with Jan about using the pcidevs_{lock|unlock}. The lock is not fine-grained enough for been call in vpci_process_pending(). > > I haven't yet looked at the rest of the series to be able to suggest the exact lock. But we at least want a per-domain spinlock. > >> >> We were thinking about introducing a dedicated lock for vpci [1], >> but finally decided to use pcidevs_lock for now > > Skimming through the thread, you decided to use pcidevs_lock because it was simpler and sufficient for the use case discussed back then. Now, we have a use case where it would be a problem to use pcidevs_lock. So I think the extra complexity is justified.

I would like to understand what this lock should be so I can implement it properly. We have the following options as I can see:

1. pcidevs_{lock|unlock} - considered too heavy, per host
2. pdev->vpci->lock - better, but still heavy, per PCI device
3. We may convert pdev->vpci->lock into an r/w lock
4. We may introduce a specific lock

To better understand the scope of the lock:

1. MMIO trap handlers (vpci_{read|write}) - already protected with pdev->vpci->lock
2. vpci_process_pending (SOFTIRQ context)
3. Hypercalls which call pci_{add|remove|assign|deassign}_device
4. @Roger, did I miss something? 
And I feel this needs a dedicated patch: I am not sure it is a good idea to fold the locking change into this patch, to which it is not directly relevant. > > Cheers, >
On 18.11.2021 10:32, Oleksandr Andrushchenko wrote: > > > On 18.11.21 11:15, Jan Beulich wrote: >> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote: >>> On 18.11.21 10:36, Jan Beulich wrote: >>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote: >>>>> On 17.11.21 10:28, Jan Beulich wrote: >>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>> >>>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>>> For example, the following scenario can illustrate the problem: >>>>>>> >>>>>>> pci_physdev_op >>>>>>> pci_add_device >>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>>> iommu_add_device <- FAILS >>>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>>> >>>>>>> leave_hypervisor_to_guest >>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>>> >>>>>>> For the hardware domain we continue execution as the worse that >>>>>>> could happen is that MMIO mappings are left in place when the >>>>>>> device has been deassigned >>>>>>> >>>>>>> For unprivileged domains that get a failure in the middle of a vPCI >>>>>>> {un}map operation we need to destroy them, as we don't know in which >>>>>>> state the p2m is. This can only happen in vpci_process_pending for >>>>>>> DomUs as they won't be allowed to call pci_add_device. >>>>>>> >>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>> Thinking about it some more, I'm not convinced any of this is really >>>>>> needed in the presented form. >>>>> The intention of this patch was to handle error conditions which are >>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle >>>>> of initialization. So, I am trying to cancel all pending work which might >>>>> already be there and not to crash. 
>>>> Only Dom0 may be able to prematurely access the device during "add". >>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved. >>>> Hence I'm not sure I see the need for dealing with these. >>> Probably I don't follow you here. The issue I am facing is Dom0 >>> related, e.g. Xen was not able to initialize during "add" and thus >>> wanted to clean up the leftovers. As the result the already >>> scheduled work crashes as it was not neither canceled nor interrupted >>> in some safe manner. So, this sounds like something we need to take >>> care of, thus this patch. >> But my point was the question of why there would be any pending work >> in the first place in this case, when we expect Dom0 to be well-behaved. > I am not saying Dom0 misbehaves here. This is my real use-case > (as in the commit message): > > pci_physdev_op > pci_add_device > init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) > iommu_add_device <- FAILS > vpci_remove_device -> xfree(pdev->vpci) > > leave_hypervisor_to_guest > vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL First of all I'm sorry for having lost track of that particular case in the course of the discussion. I wonder though whether that's something we really need to take care of. At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but use apply_map() instead. I wonder whether this wouldn't be appropriate generally in the context of init_bars() when used for Dom0 (not sure whether init_bars() would find some form of use for DomU-s as well). This is even more so as it would better be the exception that devices discovered post-boot start out with memory decoding enabled (which is a prereq for modify_bars() to be called in the first place). Jan
On 18.11.21 15:25, Jan Beulich wrote: > On 18.11.2021 10:32, Oleksandr Andrushchenko wrote: >> >> On 18.11.21 11:15, Jan Beulich wrote: >>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote: >>>> On 18.11.21 10:36, Jan Beulich wrote: >>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote: >>>>>> On 17.11.21 10:28, Jan Beulich wrote: >>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>>> >>>>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>>>> For example, the following scenario can illustrate the problem: >>>>>>>> >>>>>>>> pci_physdev_op >>>>>>>> pci_add_device >>>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>>>> iommu_add_device <- FAILS >>>>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>>>> >>>>>>>> leave_hypervisor_to_guest >>>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>>>> >>>>>>>> For the hardware domain we continue execution as the worse that >>>>>>>> could happen is that MMIO mappings are left in place when the >>>>>>>> device has been deassigned >>>>>>>> >>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI >>>>>>>> {un}map operation we need to destroy them, as we don't know in which >>>>>>>> state the p2m is. This can only happen in vpci_process_pending for >>>>>>>> DomUs as they won't be allowed to call pci_add_device. >>>>>>>> >>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>> Thinking about it some more, I'm not convinced any of this is really >>>>>>> needed in the presented form. >>>>>> The intention of this patch was to handle error conditions which are >>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle >>>>>> of initialization. 
So, I am trying to cancel all pending work which might >>>>>> already be there and not to crash. >>>>> Only Dom0 may be able to prematurely access the device during "add". >>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved. >>>>> Hence I'm not sure I see the need for dealing with these. >>>> Probably I don't follow you here. The issue I am facing is Dom0 >>>> related, e.g. Xen was not able to initialize during "add" and thus >>>> wanted to clean up the leftovers. As the result the already >>>> scheduled work crashes as it was not neither canceled nor interrupted >>>> in some safe manner. So, this sounds like something we need to take >>>> care of, thus this patch. >>> But my point was the question of why there would be any pending work >>> in the first place in this case, when we expect Dom0 to be well-behaved. >> I am not saying Dom0 misbehaves here. This is my real use-case >> (as in the commit message): >> >> pci_physdev_op >> pci_add_device >> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >> iommu_add_device <- FAILS >> vpci_remove_device -> xfree(pdev->vpci) >> >> leave_hypervisor_to_guest >> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL > First of all I'm sorry for having lost track of that particular case in > the course of the discussion. No problem, I see on the ML how much you review every day... > > I wonder though whether that's something we really need to take care of. > At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but > use apply_map() instead. I wonder whether this wouldn't be appropriate > generally in the context of init_bars() when used for Dom0 (not sure > whether init_bars() would find some form of use for DomU-s as well). > This is even more so as it would better be the exception that devices > discovered post-boot start out with memory decoding enabled (which is a > prereq for modify_bars() to be called in the first place). 
Well, first of all init_bars is going to be used for DomUs as well:
that was discussed previously and is reflected in this series.

But the real use-case for the deferred mapping would be the one
from the PCI_COMMAND register write emulation:

void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
                    uint32_t cmd, void *data)
{
    [snip]
    modify_bars(pdev, cmd, false);

which in turn calls defer_map.

I believe Roger did that for a good reason: not to stall Xen while doing
map/unmap (which may take quite some time), and moved that to
vpci_process_pending and SOFTIRQ context.

>
> Jan
>
Thank you,
Oleksandr
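[The defer_map()/vpci_process_pending() hand-off being discussed can be modelled as a small standalone program. This is only a sketch of the pattern: the structures below are simplified stand-ins for Xen's struct vcpu, struct pci_dev and struct vpci, not the real declarations, and defer_map's softirq raising is elided.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vpci_state { bool mapped; };

struct pci_dev_model { struct vpci_state *vpci; };

struct vcpu_model {
    struct pci_dev_model *pdev;   /* device with deferred work, if any */
    bool map_pending;             /* mirrors v->vpci.mem != NULL */
};

/* defer_map(): record the work on the current vCPU; in Xen this also
 * raises SCHEDULE_SOFTIRQ so vpci_process_pending() runs at guest entry. */
static void defer_map_model(struct vcpu_model *v, struct pci_dev_model *pdev)
{
    v->pdev = pdev;
    v->map_pending = true;
}

/* vpci_process_pending(): consume the deferred work at guest entry.
 * Returns true when the device's vpci state was freed underneath it --
 * the exact window the thread is about; without the NULL check this
 * would dereference freed memory. */
static bool process_pending_model(struct vcpu_model *v)
{
    if ( !v->map_pending )
        return false;

    if ( v->pdev->vpci == NULL )
        return true;              /* vpci_remove_device() ran in between */

    v->pdev->vpci->mapped = true; /* the actual {un}mapping would go here */
    v->map_pending = false;
    return false;
}
```

Running the two functions back to back with the device's vpci state freed in between reproduces the race from the commit message in miniature.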
Sorry, I've been quite busy with other stuff.

On Thu, Nov 18, 2021 at 01:48:06PM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 18.11.21 15:25, Jan Beulich wrote:
> > On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
> >>
> >> On 18.11.21 11:15, Jan Beulich wrote:
> >>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
> >>>> On 18.11.21 10:36, Jan Beulich wrote:
> >>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
> >>>>>> On 17.11.21 10:28, Jan Beulich wrote:
> >>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> >>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>>>>>>
> >>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
> >>>>>>>> scheduled a delayed work for map/unmap operations for that device.
> >>>>>>>> For example, the following scenario can illustrate the problem:
> >>>>>>>>
> >>>>>>>> pci_physdev_op
> >>>>>>>> pci_add_device
> >>>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
> >>>>>>>> iommu_add_device <- FAILS
> >>>>>>>> vpci_remove_device -> xfree(pdev->vpci)
> >>>>>>>>
> >>>>>>>> leave_hypervisor_to_guest
> >>>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> >>>>>>>>
> >>>>>>>> For the hardware domain we continue execution as the worse that
> >>>>>>>> could happen is that MMIO mappings are left in place when the
> >>>>>>>> device has been deassigned
> >>>>>>>>
> >>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
> >>>>>>>> {un}map operation we need to destroy them, as we don't know in which
> >>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
> >>>>>>>> DomUs as they won't be allowed to call pci_add_device.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>>>>> Thinking about it some more, I'm not convinced any of this is really
> >>>>>>> needed in the presented form.
> >>>>>> The intention of this patch was to handle error conditions which are > >>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle > >>>>>> of initialization. So, I am trying to cancel all pending work which might > >>>>>> already be there and not to crash. > >>>>> Only Dom0 may be able to prematurely access the device during "add". > >>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved. > >>>>> Hence I'm not sure I see the need for dealing with these. > >>>> Probably I don't follow you here. The issue I am facing is Dom0 > >>>> related, e.g. Xen was not able to initialize during "add" and thus > >>>> wanted to clean up the leftovers. As the result the already > >>>> scheduled work crashes as it was not neither canceled nor interrupted > >>>> in some safe manner. So, this sounds like something we need to take > >>>> care of, thus this patch. > >>> But my point was the question of why there would be any pending work > >>> in the first place in this case, when we expect Dom0 to be well-behaved. > >> I am not saying Dom0 misbehaves here. This is my real use-case > >> (as in the commit message): > >> > >> pci_physdev_op > >> pci_add_device > >> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) > >> iommu_add_device <- FAILS > >> vpci_remove_device -> xfree(pdev->vpci) > >> > >> leave_hypervisor_to_guest > >> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL > > First of all I'm sorry for having lost track of that particular case in > > the course of the discussion. > No problem, I see on the ML how much you review every day... > > > > I wonder though whether that's something we really need to take care of. > > At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but > > use apply_map() instead. 
> > I wonder whether this wouldn't be appropriate
> > generally in the context of init_bars() when used for Dom0 (not sure
> > whether init_bars() would find some form of use for DomU-s as well).
> > This is even more so as it would better be the exception that devices
> > discovered post-boot start out with memory decoding enabled (which is a
> > prereq for modify_bars() to be called in the first place).
> Well, first of all init_bars is going to be used for DomUs as well:
> that was discussed previously and is reflected in this series.
> 
> But the real use-case for the deferred mapping would be the one
> from PCI_COMMAND register write emulation:
> 
> void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>                     uint32_t cmd, void *data)
> {
>     [snip]
>     modify_bars(pdev, cmd, false);
> 
> which in turn calls defer_map.
> 
> I believe Roger did that for a good reason not to stall Xen while doing
> map/unmap (which may take quite some time) and moved that to
> vpci_process_pending and SOFTIRQ context.

Indeed. In the physdevop failure case this comes from a hypercall
context, so maybe you could do the mapping in place using hypercall
continuations if required. Not sure how complex that would be,
compared to just deferring to the guest entry point and then dealing
with the possible cleanup on failure.

Thanks, Roger.
On 18.11.21 16:04, Roger Pau Monné wrote: > Sorry, I've been quite busy with other staff. > > On Thu, Nov 18, 2021 at 01:48:06PM +0000, Oleksandr Andrushchenko wrote: >> >> On 18.11.21 15:25, Jan Beulich wrote: >>> On 18.11.2021 10:32, Oleksandr Andrushchenko wrote: >>>> On 18.11.21 11:15, Jan Beulich wrote: >>>>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote: >>>>>> On 18.11.21 10:36, Jan Beulich wrote: >>>>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote: >>>>>>>> On 17.11.21 10:28, Jan Beulich wrote: >>>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote: >>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>>>>> >>>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have >>>>>>>>>> scheduled a delayed work for map/unmap operations for that device. >>>>>>>>>> For example, the following scenario can illustrate the problem: >>>>>>>>>> >>>>>>>>>> pci_physdev_op >>>>>>>>>> pci_add_device >>>>>>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>>>>>> iommu_add_device <- FAILS >>>>>>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>>>>>> >>>>>>>>>> leave_hypervisor_to_guest >>>>>>>>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>>>>>>>>> >>>>>>>>>> For the hardware domain we continue execution as the worse that >>>>>>>>>> could happen is that MMIO mappings are left in place when the >>>>>>>>>> device has been deassigned >>>>>>>>>> >>>>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI >>>>>>>>>> {un}map operation we need to destroy them, as we don't know in which >>>>>>>>>> state the p2m is. This can only happen in vpci_process_pending for >>>>>>>>>> DomUs as they won't be allowed to call pci_add_device. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> >>>>>>>>> Thinking about it some more, I'm not convinced any of this is really >>>>>>>>> needed in the presented form. 
>>>>>>>> The intention of this patch was to handle error conditions which are >>>>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle >>>>>>>> of initialization. So, I am trying to cancel all pending work which might >>>>>>>> already be there and not to crash. >>>>>>> Only Dom0 may be able to prematurely access the device during "add". >>>>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved. >>>>>>> Hence I'm not sure I see the need for dealing with these. >>>>>> Probably I don't follow you here. The issue I am facing is Dom0 >>>>>> related, e.g. Xen was not able to initialize during "add" and thus >>>>>> wanted to clean up the leftovers. As the result the already >>>>>> scheduled work crashes as it was not neither canceled nor interrupted >>>>>> in some safe manner. So, this sounds like something we need to take >>>>>> care of, thus this patch. >>>>> But my point was the question of why there would be any pending work >>>>> in the first place in this case, when we expect Dom0 to be well-behaved. >>>> I am not saying Dom0 misbehaves here. This is my real use-case >>>> (as in the commit message): >>>> >>>> pci_physdev_op >>>> pci_add_device >>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>> iommu_add_device <- FAILS >>>> vpci_remove_device -> xfree(pdev->vpci) >>>> >>>> leave_hypervisor_to_guest >>>> vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL >>> First of all I'm sorry for having lost track of that particular case in >>> the course of the discussion. >> No problem, I see on the ML how much you review every day... >>> I wonder though whether that's something we really need to take care of. >>> At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but >>> use apply_map() instead. 
>>> I wonder whether this wouldn't be appropriate
>>> generally in the context of init_bars() when used for Dom0 (not sure
>>> whether init_bars() would find some form of use for DomU-s as well).
>>> This is even more so as it would better be the exception that devices
>>> discovered post-boot start out with memory decoding enabled (which is a
>>> prereq for modify_bars() to be called in the first place).
>> Well, first of all init_bars is going to be used for DomUs as well:
>> that was discussed previously and is reflected in this series.
>>
>> But the real use-case for the deferred mapping would be the one
>> from PCI_COMMAND register write emulation:
>>
>> void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>                     uint32_t cmd, void *data)
>> {
>>     [snip]
>>     modify_bars(pdev, cmd, false);
>>
>> which in turn calls defer_map.
>>
>> I believe Roger did that for a good reason not to stall Xen while doing
>> map/unmap (which may take quite some time) and moved that to
>> vpci_process_pending and SOFTIRQ context.
> Indeed. In the physdevop failure case this comes from an hypercall
> context, so maybe you could do the mapping in place using hypercall
> continuations if required. Not sure how complex that would be,
> compared to just deferring to guest entry point and then dealing with
> the possible cleanup on failure.

This will solve one part of the equation:

pci_physdev_op
pci_add_device
init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
iommu_add_device <- FAILS
vpci_remove_device -> xfree(pdev->vpci)

But what about the other one, e.g. vpci_process_pending is triggered in
parallel with PCI device de-assign for example?

>
> Thanks, Roger.

Thank you,
Oleksandr
On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
> On 18.11.21 16:04, Roger Pau Monné wrote:
>> Indeed. In the physdevop failure case this comes from an hypercall
>> context, so maybe you could do the mapping in place using hypercall
>> continuations if required. Not sure how complex that would be,
>> compared to just deferring to guest entry point and then dealing with
>> the possible cleanup on failure.
> This will solve one part of the equation:
>
> pci_physdev_op
> pci_add_device
> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
> iommu_add_device <- FAILS
> vpci_remove_device -> xfree(pdev->vpci)
>
> But what about the other one, e.g. vpci_process_pending is triggered in
> parallel with PCI device de-assign for example?

Well, that's again in hypercall context, so using hypercall continuations
may again be an option. Of course at the point a de-assign is initiated,
you "only" need to drain requests (for that device, but that's unlikely
to be worthwhile optimizing for), while ensuring no new requests can be
issued. Again, for the device in question, but here this is relevant -
a flag may want setting to refuse all further requests. Or maybe the
register handling hooks may want tearing down before draining pending
BAR mapping requests; without the hooks in place no new such requests
can possibly appear.

Jan
On 18.11.21 16:35, Jan Beulich wrote: > On 18.11.2021 15:14, Oleksandr Andrushchenko wrote: >> On 18.11.21 16:04, Roger Pau Monné wrote: >>> Indeed. In the physdevop failure case this comes from an hypercall >>> context, so maybe you could do the mapping in place using hypercall >>> continuations if required. Not sure how complex that would be, >>> compared to just deferring to guest entry point and then dealing with >>> the possible cleanup on failure. >> This will solve one part of the equation: >> >> pci_physdev_op >> pci_add_device >> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >> iommu_add_device <- FAILS >> vpci_remove_device -> xfree(pdev->vpci) >> >> But what about the other one, e.g. vpci_process_pending is triggered in >> parallel with PCI device de-assign for example? > Well, that's again in hypercall context, so using hypercall continuations > may again be an option. Of course at the point a de-assign is initiated, > you "only" need to drain requests (for that device, but that's unlikely > to be worthwhile optimizing for), while ensuring no new requests can be > issued. Again, for the device in question, but here this is relevant - > a flag may want setting to refuse all further requests. Or maybe the > register handling hooks may want tearing down before draining pending > BAR mapping requests; without the hooks in place no new such requests > can possibly appear. 
This can be probably even easier to solve as we were talking about
pausing all vCPUs:

void vpci_cancel_pending(const struct pci_dev *pdev)
{
    struct domain *d = pdev->domain;
    struct vcpu *v;
    int rc;

    while ( (rc = domain_pause_except_self(d)) == -ERESTART )
        cpu_relax();

    if ( rc )
        printk(XENLOG_G_ERR
               "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
               &pdev->sbdf, pdev->domain, rc);

    for_each_vcpu ( d, v )
    {
        if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )

This will prevent all vCPUs but the current one from running, thus
making it impossible for vpci_process_pending to run in parallel with
any hypercall. So, even without locking in vpci_process_pending the
above should be enough. The only concern here is that
domain_pause_except_self may return the error code we somehow need to
handle...

> Jan
>
Thank you,
Oleksandr
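[For reference, the cancellation loop sketched above, completed as a standalone model. This is only a sketch under assumptions: the structures are simplified stand-ins, and the pause/unpause booleans stand in for domain_pause_except_self()/domain_unpause(), whose actual semantics (and failure modes) are exactly what the thread goes on to debate.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_VCPUS 4

struct pci_dev_model { int id; };

struct vcpu_model {
    bool paused;
    struct pci_dev_model *pdev;   /* device with deferred work, if any */
    bool map_pending;
};

struct domain_model { struct vcpu_model vcpu[NR_VCPUS]; };

/* Completed cancellation loop: pause every vCPU (standing in for
 * domain_pause_except_self()), drop any deferred map/unmap work
 * recorded for @pdev, then unpause.  The pause is what keeps
 * vpci_process_pending() from racing with the cancellation on
 * another pCPU. */
static void cancel_pending_model(struct domain_model *d,
                                 struct pci_dev_model *pdev)
{
    unsigned int i;

    for ( i = 0; i < NR_VCPUS; i++ )
        d->vcpu[i].paused = true;         /* domain_pause_except_self(d) */

    for ( i = 0; i < NR_VCPUS; i++ )
    {
        struct vcpu_model *v = &d->vcpu[i];

        if ( v->map_pending && v->pdev == pdev )
        {
            v->pdev = NULL;               /* i.e. v->vpci.mem = NULL */
            v->map_pending = false;
        }
    }

    for ( i = 0; i < NR_VCPUS; i++ )
        d->vcpu[i].paused = false;        /* domain_unpause(d) */
}
```

Note the model deliberately only clears work for the device being cancelled; pending work for other devices of the same domain must survive.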
On 18.11.2021 16:11, Oleksandr Andrushchenko wrote: > > > On 18.11.21 16:35, Jan Beulich wrote: >> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote: >>> On 18.11.21 16:04, Roger Pau Monné wrote: >>>> Indeed. In the physdevop failure case this comes from an hypercall >>>> context, so maybe you could do the mapping in place using hypercall >>>> continuations if required. Not sure how complex that would be, >>>> compared to just deferring to guest entry point and then dealing with >>>> the possible cleanup on failure. >>> This will solve one part of the equation: >>> >>> pci_physdev_op >>> pci_add_device >>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>> iommu_add_device <- FAILS >>> vpci_remove_device -> xfree(pdev->vpci) >>> >>> But what about the other one, e.g. vpci_process_pending is triggered in >>> parallel with PCI device de-assign for example? >> Well, that's again in hypercall context, so using hypercall continuations >> may again be an option. Of course at the point a de-assign is initiated, >> you "only" need to drain requests (for that device, but that's unlikely >> to be worthwhile optimizing for), while ensuring no new requests can be >> issued. Again, for the device in question, but here this is relevant - >> a flag may want setting to refuse all further requests. Or maybe the >> register handling hooks may want tearing down before draining pending >> BAR mapping requests; without the hooks in place no new such requests >> can possibly appear. > This can be probably even easier to solve as we were talking about > pausing all vCPUs: I have to admit I'm not sure. It might be easier, but it may also be less desirable. 
> void vpci_cancel_pending(const struct pci_dev *pdev)
> {
>     struct domain *d = pdev->domain;
>     struct vcpu *v;
>     int rc;
>
>     while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>         cpu_relax();
>
>     if ( rc )
>         printk(XENLOG_G_ERR
>                "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>                &pdev->sbdf, pdev->domain, rc);
>
>     for_each_vcpu ( d, v )
>     {
>         if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>
> This will prevent all vCPUs to run, but current, thus making it impossible
> to run vpci_process_pending in parallel with any hypercall.
> So, even without locking in vpci_process_pending the above should
> be enough.
> The only concern here is that domain_pause_except_self may return
> the error code we somehow need to handle...

Not just this. The -ERESTART handling isn't appropriate this way
either. For the moment I can't help thinking that draining would
be preferable over canceling.

Jan
On 18.11.21 17:16, Jan Beulich wrote: > On 18.11.2021 16:11, Oleksandr Andrushchenko wrote: >> >> On 18.11.21 16:35, Jan Beulich wrote: >>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote: >>>> On 18.11.21 16:04, Roger Pau Monné wrote: >>>>> Indeed. In the physdevop failure case this comes from an hypercall >>>>> context, so maybe you could do the mapping in place using hypercall >>>>> continuations if required. Not sure how complex that would be, >>>>> compared to just deferring to guest entry point and then dealing with >>>>> the possible cleanup on failure. >>>> This will solve one part of the equation: >>>> >>>> pci_physdev_op >>>> pci_add_device >>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>> iommu_add_device <- FAILS >>>> vpci_remove_device -> xfree(pdev->vpci) >>>> >>>> But what about the other one, e.g. vpci_process_pending is triggered in >>>> parallel with PCI device de-assign for example? >>> Well, that's again in hypercall context, so using hypercall continuations >>> may again be an option. Of course at the point a de-assign is initiated, >>> you "only" need to drain requests (for that device, but that's unlikely >>> to be worthwhile optimizing for), while ensuring no new requests can be >>> issued. Again, for the device in question, but here this is relevant - >>> a flag may want setting to refuse all further requests. Or maybe the >>> register handling hooks may want tearing down before draining pending >>> BAR mapping requests; without the hooks in place no new such requests >>> can possibly appear. >> This can be probably even easier to solve as we were talking about >> pausing all vCPUs: > I have to admit I'm not sure. It might be easier, but it may also be > less desirable. 
>
>> void vpci_cancel_pending(const struct pci_dev *pdev)
>> {
>>     struct domain *d = pdev->domain;
>>     struct vcpu *v;
>>     int rc;
>>
>>     while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>         cpu_relax();
>>
>>     if ( rc )
>>         printk(XENLOG_G_ERR
>>                "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>                &pdev->sbdf, pdev->domain, rc);
>>
>>     for_each_vcpu ( d, v )
>>     {
>>         if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>
>> This will prevent all vCPUs to run, but current, thus making it impossible
>> to run vpci_process_pending in parallel with any hypercall.
>> So, even without locking in vpci_process_pending the above should
>> be enough.
>> The only concern here is that domain_pause_except_self may return
>> the error code we somehow need to handle...
> Not just this. The -ERESTART handling isn't appropriate this way
> either.

Are you talking about cpu_relax()?

> For the moment I can't help thinking that draining would
> be preferable over canceling.

Given that cancellation is going to happen on error path or
on device de-assign/remove I think this can be acceptable.
Any reason why not?

>
> Jan
>
Thank you,
Oleksandr
On 18.11.2021 16:21, Oleksandr Andrushchenko wrote: > On 18.11.21 17:16, Jan Beulich wrote: >> On 18.11.2021 16:11, Oleksandr Andrushchenko wrote: >>> On 18.11.21 16:35, Jan Beulich wrote: >>>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote: >>>>> On 18.11.21 16:04, Roger Pau Monné wrote: >>>>>> Indeed. In the physdevop failure case this comes from an hypercall >>>>>> context, so maybe you could do the mapping in place using hypercall >>>>>> continuations if required. Not sure how complex that would be, >>>>>> compared to just deferring to guest entry point and then dealing with >>>>>> the possible cleanup on failure. >>>>> This will solve one part of the equation: >>>>> >>>>> pci_physdev_op >>>>> pci_add_device >>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>> iommu_add_device <- FAILS >>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>> >>>>> But what about the other one, e.g. vpci_process_pending is triggered in >>>>> parallel with PCI device de-assign for example? >>>> Well, that's again in hypercall context, so using hypercall continuations >>>> may again be an option. Of course at the point a de-assign is initiated, >>>> you "only" need to drain requests (for that device, but that's unlikely >>>> to be worthwhile optimizing for), while ensuring no new requests can be >>>> issued. Again, for the device in question, but here this is relevant - >>>> a flag may want setting to refuse all further requests. Or maybe the >>>> register handling hooks may want tearing down before draining pending >>>> BAR mapping requests; without the hooks in place no new such requests >>>> can possibly appear. >>> This can be probably even easier to solve as we were talking about >>> pausing all vCPUs: >> I have to admit I'm not sure. It might be easier, but it may also be >> less desirable. 
>>
>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>> {
>>>     struct domain *d = pdev->domain;
>>>     struct vcpu *v;
>>>     int rc;
>>>
>>>     while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>>         cpu_relax();
>>>
>>>     if ( rc )
>>>         printk(XENLOG_G_ERR
>>>                "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>>                &pdev->sbdf, pdev->domain, rc);
>>>
>>>     for_each_vcpu ( d, v )
>>>     {
>>>         if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>>
>>> This will prevent all vCPUs to run, but current, thus making it impossible
>>> to run vpci_process_pending in parallel with any hypercall.
>>> So, even without locking in vpci_process_pending the above should
>>> be enough.
>>> The only concern here is that domain_pause_except_self may return
>>> the error code we somehow need to handle...
>> Not just this. The -ERESTART handling isn't appropriate this way
>> either.
> Are you talking about cpu_relax()?

I'm talking about that spin-waiting loop as a whole.

>> For the moment I can't help thinking that draining would
>> be preferable over canceling.
> Given that cancellation is going to happen on error path or
> on device de-assign/remove I think this can be acceptable.
> Any reason why not?

It would seem to me that the correctness of a draining approach is
going to be easier to prove than that of a canceling one, where I
expect races to be a bigger risk. Especially something that gets
executed infrequently, if ever (error paths in particular), knowing
things are well from testing isn't typically possible.

Jan
On 18.11.21 17:41, Jan Beulich wrote: > On 18.11.2021 16:21, Oleksandr Andrushchenko wrote: >> On 18.11.21 17:16, Jan Beulich wrote: >>> On 18.11.2021 16:11, Oleksandr Andrushchenko wrote: >>>> On 18.11.21 16:35, Jan Beulich wrote: >>>>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote: >>>>>> On 18.11.21 16:04, Roger Pau Monné wrote: >>>>>>> Indeed. In the physdevop failure case this comes from an hypercall >>>>>>> context, so maybe you could do the mapping in place using hypercall >>>>>>> continuations if required. Not sure how complex that would be, >>>>>>> compared to just deferring to guest entry point and then dealing with >>>>>>> the possible cleanup on failure. >>>>>> This will solve one part of the equation: >>>>>> >>>>>> pci_physdev_op >>>>>> pci_add_device >>>>>> init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ) >>>>>> iommu_add_device <- FAILS >>>>>> vpci_remove_device -> xfree(pdev->vpci) >>>>>> >>>>>> But what about the other one, e.g. vpci_process_pending is triggered in >>>>>> parallel with PCI device de-assign for example? >>>>> Well, that's again in hypercall context, so using hypercall continuations >>>>> may again be an option. Of course at the point a de-assign is initiated, >>>>> you "only" need to drain requests (for that device, but that's unlikely >>>>> to be worthwhile optimizing for), while ensuring no new requests can be >>>>> issued. Again, for the device in question, but here this is relevant - >>>>> a flag may want setting to refuse all further requests. Or maybe the >>>>> register handling hooks may want tearing down before draining pending >>>>> BAR mapping requests; without the hooks in place no new such requests >>>>> can possibly appear. >>>> This can be probably even easier to solve as we were talking about >>>> pausing all vCPUs: >>> I have to admit I'm not sure. It might be easier, but it may also be >>> less desirable. 
>>>
>>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>>> {
>>>>     struct domain *d = pdev->domain;
>>>>     struct vcpu *v;
>>>>     int rc;
>>>>
>>>>     while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>>>         cpu_relax();
>>>>
>>>>     if ( rc )
>>>>         printk(XENLOG_G_ERR
>>>>                "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>>>                &pdev->sbdf, pdev->domain, rc);
>>>>
>>>>     for_each_vcpu ( d, v )
>>>>     {
>>>>         if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>>>
>>>> This will prevent all vCPUs to run, but current, thus making it impossible
>>>> to run vpci_process_pending in parallel with any hypercall.
>>>> So, even without locking in vpci_process_pending the above should
>>>> be enough.
>>>> The only concern here is that domain_pause_except_self may return
>>>> the error code we somehow need to handle...
>>> Not just this. The -ERESTART handling isn't appropriate this way
>>> either.
>> Are you talking about cpu_relax()?
> I'm talking about that spin-waiting loop as a whole.
>
>>> For the moment I can't help thinking that draining would
>>> be preferable over canceling.
>> Given that cancellation is going to happen on error path or
>> on device de-assign/remove I think this can be acceptable.
>> Any reason why not?
> It would seem to me that the correctness of a draining approach is
> going to be easier to prove than that of a canceling one, where I
> expect races to be a bigger risk. Especially something that gets
> executed infrequently, if ever (error paths in particular), knowing
> things are well from testing isn't typically possible.

Could you please then give me a hint how to do that:
1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
2. We have de-assign/remove on vCPU1

How do we drain that? Do you mean some atomic variable to be
used in vpci_process_pending to flag it is running, and de-assign/remove
needs to wait, spinning and checking that?

>
> Jan
>
Thank you,
Oleksandr
On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
> On 18.11.21 17:41, Jan Beulich wrote:
>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>> For the moment I can't help thinking that draining would
>>>> be preferable over canceling.
>>> Given that cancellation is going to happen on error path or
>>> on device de-assign/remove I think this can be acceptable.
>>> Any reason why not?
>> It would seem to me that the correctness of a draining approach is
>> going to be easier to prove than that of a canceling one, where I
>> expect races to be a bigger risk. Especially something that gets
>> executed infrequently, if ever (error paths in particular), knowing
>> things are well from testing isn't typically possible.
> Could you please then give me a hint how to do that:
> 1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
> 2. We have de-assign/remove on vCPU1
>
> How do we drain that? Do you mean some atomic variable to be
> used in vpci_process_pending to flag it is running and de-assign/remove
> needs to wait and spinning checking that?

First of all let's please keep remove and de-assign separate. I think we
have largely reached agreement that remove may need handling differently,
for being a Dom0-only operation.

As to draining during de-assign: I did suggest before that removing the
register handling hooks first would guarantee no new requests to appear.
Then it should be merely a matter of using hypercall continuations until
the respective domain has no pending requests anymore for the device in
question. Some locking (or lock barrier) may of course be needed to
make sure another CPU isn't just about to pend a new request.

Jan
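[The draining scheme Jan describes — tear down the register handling hooks first, then take hypercall continuations until no work is pending — can be sketched as a standalone model. Everything here is a hypothetical stand-in for the real Xen structures; in Xen the -ERESTART return would be turned into a continuation via hypercall_create_continuation(), and the locking Jan mentions is not modelled.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_VCPUS 4
#define ERESTART_MODEL 85

struct pci_dev_model { bool handlers_active; };

struct vcpu_model {
    struct pci_dev_model *pdev;   /* device with deferred work, if any */
    bool map_pending;
};

struct domain_model { struct vcpu_model vcpu[NR_VCPUS]; };

/* One step of a draining de-assign: first disable the register handling
 * hooks so no new defer_map() can be issued for the device, then report
 * "continue" for as long as any vCPU still has work pending for it.
 * Only once this returns 0 is it safe to free pdev->vpci. */
static int deassign_drain_step(struct domain_model *d,
                               struct pci_dev_model *pdev)
{
    unsigned int i;

    pdev->handlers_active = false;    /* no new requests can appear */

    for ( i = 0; i < NR_VCPUS; i++ )
        if ( d->vcpu[i].map_pending && d->vcpu[i].pdev == pdev )
            return -ERESTART_MODEL;   /* hypercall continuation point */

    return 0;                         /* drained */
}
```

The point of the ordering is that after the first step no vCPU can pend new work, so the loop is guaranteed to terminate once the already-pending vpci_process_pending() runs have completed.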
Hi, Roger, Jan!

On 18.11.21 17:53, Jan Beulich wrote:
> On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
>> On 18.11.21 17:41, Jan Beulich wrote:
>>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>>> For the moment I can't help thinking that draining would
>>>>> be preferable over canceling.
>>>> Given that cancellation is going to happen on error path or
>>>> on device de-assign/remove I think this can be acceptable.
>>>> Any reason why not?
>>> It would seem to me that the correctness of a draining approach is
>>> going to be easier to prove than that of a canceling one, where I
>>> expect races to be a bigger risk. Especially something that gets
>>> executed infrequently, if ever (error paths in particular), knowing
>>> things are well from testing isn't typically possible.
>> Could you please then give me a hint how to do that:
>> 1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
>> 2. We have de-assign/remove on vCPU1
>>
>> How do we drain that? Do you mean some atomic variable to be
>> used in vpci_process_pending to flag it is running and de-assign/remove
>> needs to wait and spinning checking that?
> First of all let's please keep remove and de-assign separate. I think we
> have largely reached agreement that remove may need handling differently,
> for being a Dom0-only operation.
>
> As to draining during de-assign: I did suggest before that removing the
> register handling hooks first would guarantee no new requests to appear.
> Then it should be merely a matter of using hypercall continuations until
> the respective domain has no pending requests anymore for the device in
> question. Some locking (or lock barrier) may of course be needed to
> make sure another CPU isn't just about to pend a new request.
>
> Jan
>

Too long, but please read. The below is some simplified analysis of
what is happening with respect to deferred mapping.
First let us look at which hypercalls may run in parallel with vpci_process_pending, and which locks they hold:

1. do_physdev_op(PHYSDEVOP_pci_device_add): failure during PHYSDEVOP_pci_device_add
===============================================================================
pci_physdev_op: <- no hypercall_create_continuation
  pci_add_device <- pcidevs_lock()
    vpci_add_handlers
      init_bars
        cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
        modify_bars <- if cmd & PCI_COMMAND_MEMORY
          struct rangeset *mem = rangeset_new(NULL, NULL, 0);
          if ( system_state < SYS_STATE_active ) <- Dom0 is being created
              return apply_map(pdev->domain, pdev, mem, cmd);
          defer_map(dev->domain, dev, mem, cmd, rom_only); <- Dom0 is running
            curr->vpci.pdev = pdev;
            curr->vpci.mem = mem;
    ret = iommu_add_device(pdev); <- FAIL
    if ( ret )
        vpci_remove_device
          remove vPCI register handlers
          xfree(pdev->vpci);
          pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                                scheduled and has yet to run

2. do_physdev_op(PHYSDEVOP_pci_device_remove)
===============================================================================
pci_physdev_op: <- no hypercall_create_continuation
  pci_remove_device <- pcidevs_lock()
    vpci_remove_device
      pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                            scheduled and has yet to run

3. iommu_do_pci_domctl(XEN_DOMCTL_assign_device)
===============================================================================
case XEN_DOMCTL_assign_device: <- pcidevs_lock()
    ret = assign_device(d, seg, bus, devfn, flags);
    if ( ret == -ERESTART )
        ret = hypercall_create_continuation(__HYPERVISOR_domctl, "h", u_domctl);

4. iommu_do_pci_domctl(XEN_DOMCTL_deassign_device)
===============================================================================
case XEN_DOMCTL_deassign_device: <- pcidevs_lock(), no hypercall_create_continuation
    ret = deassign_device(d, seg, bus, devfn);

5. vPCI MMIO trap for PCI_COMMAND
===============================================================================
vpci_mmio_{read|write}
  vpci_ecam_{read|write}
    vpci_{read|write} <- NO locking yet
      pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
      spin_lock(&pdev->vpci->lock);
      cmd_write
        modify_bars
          defer_map

6. SoftIRQ processing
===============================================================================
hvm_do_resume
  vcpu_ioreq_handle_completion
    vpci_process_pending
      if ( v->vpci.mem )
          rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
          if ( rc == -ERESTART )
              return true; <- re-scheduling
          =====================================================================
          spin_lock(&v->vpci.pdev->vpci->lock); <- v->vpci.pdev->vpci can be NULL
          =====================================================================
          spin_unlock(&v->vpci.pdev->vpci->lock);
          v->vpci.mem = NULL;
          if ( rc ) <- rc is from rangeset_consume_ranges
              vpci_remove_device <- this is a BUG, as vpci_process_pending may be
                                    running on another vCPU at the same time

So, from the above it is clearly seen that a mapping triggered by a write to PCI_COMMAND may be in progress on another vCPU in parallel with a hypercall.

Some analysis of the hypercalls with respect to which domains are eligible targets:

1. Dom0 (hardware domain) only: PHYSDEVOP_pci_device_add, PHYSDEVOP_pci_device_remove
2. Any domain: XEN_DOMCTL_assign_device, XEN_DOMCTL_deassign_device

Possible crash paths
===============================================================================
1. Failure in PHYSDEVOP_pci_device_add after defer_map may make
   vpci_process_pending crash because of pdev->vpci == NULL
2. vpci_process_pending may crash if it runs in parallel with itself on another
   vCPU, because vpci_remove_device may be called
3. Crash in vpci_mmio_{read|write} after PHYSDEVOP_pci_device_remove:
   vpci_remove_device makes pdev->vpci == NULL
4. Both XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device seem to be
   unaffected

Synchronization is needed between:
 - vpci_remove_device
 - vpci_process_pending
 - vpci_mmio_{read|write}

Possible locking and other work needed:
=======================================

1. pcidevs_{lock|unlock} is too heavy and is per-host
2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
3. We may want a dedicated per-domain rw lock to be implemented:

diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 28146ee404e6..ebf071893b21 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,7 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+    rwlock_t vpci_rwlock;
+    bool vpci_terminating; <- atomic?
 #endif

Then vpci_remove_device is a writer (cold path), and vpci_process_pending and
vpci_mmio_{read|write} are readers (hot path).

do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
to be implemented, so that removal can be re-started if need be:

vpci_remove_device()
{
    d->vpci_terminating = true;
    remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
    if ( !write_trylock(d->vpci_rwlock) )
        return -ERESTART;
    xfree(pdev->vpci);
    pdev->vpci = NULL;
}

Then this d->vpci_rwlock becomes a dedicated per-domain vpci lock for other
operations which may require it, e.g. the virtual bus topology can use it when
assigning vSBDF etc.

4. vpci_remove_device needs to be removed from vpci_process_pending, which should
do nothing for Dom0 and crash a DomU otherwise:

if ( rc )
{
    /*
     * FIXME: in case of failure remove the device from the domain.
     * Note that there might still be leftover mappings. While this is
     * safe for Dom0, for DomUs the domain needs to be killed in order
     * to avoid leaking stale p2m mappings on failure.
     */
    if ( !is_hardware_domain(v->domain) )
        domain_crash(v->domain);

I do hope we can finally come up with some decision which I can implement...

Thank you,
Oleksandr
On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: > Possible locking and other work needed: > ======================================= > > 1. pcidevs_{lock|unlock} is too heavy and is per-host > 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device > 3. We may want a dedicated per-domain rw lock to be implemented: > > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > index 28146ee404e6..ebf071893b21 100644 > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -444,6 +444,7 @@ struct domain > > #ifdef CONFIG_HAS_PCI > struct list_head pdev_list; > + rwlock_t vpci_rwlock; > + bool vpci_terminating; <- atomic? > #endif > then vpci_remove_device is a writer (cold path) and vpci_process_pending and > vpci_mmio_{read|write} are readers (hot path). Right - you need such a lock for other purposes anyway, as per the discussion with Julien. > do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation > to be implemented, so when re-start removal if need be: > > vpci_remove_device() > { > d->vpci_terminating = true; > remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others > if ( !write_trylock(d->vpci_rwlock) ) > return -ERESTART; > xfree(pdev->vpci); > pdev->vpci = NULL; > } > > Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for > other operations which may require it, e.g. virtual bus topology can > use it when assigning vSBDF etc. > > 4. vpci_remove_device needs to be removed from vpci_process_pending > and do nothing for Dom0 and crash DomU otherwise: Why is this? I'm not outright opposed, but I don't immediately see why trying to remove the problematic device wouldn't be a reasonable course of action anymore. vpci_remove_device() may need to become more careful as to not crashing, though. Jan
On 19.11.21 15:00, Jan Beulich wrote: > On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: >> Possible locking and other work needed: >> ======================================= >> >> 1. pcidevs_{lock|unlock} is too heavy and is per-host >> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device >> 3. We may want a dedicated per-domain rw lock to be implemented: >> >> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >> index 28146ee404e6..ebf071893b21 100644 >> --- a/xen/include/xen/sched.h >> +++ b/xen/include/xen/sched.h >> @@ -444,6 +444,7 @@ struct domain >> >> #ifdef CONFIG_HAS_PCI >> struct list_head pdev_list; >> + rwlock_t vpci_rwlock; >> + bool vpci_terminating; <- atomic? >> #endif >> then vpci_remove_device is a writer (cold path) and vpci_process_pending and >> vpci_mmio_{read|write} are readers (hot path). > Right - you need such a lock for other purposes anyway, as per the > discussion with Julien. What about bool vpci_terminating? Do you see it as an atomic type or just bool? > >> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation >> to be implemented, so when re-start removal if need be: >> >> vpci_remove_device() >> { >> d->vpci_terminating = true; >> remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others >> if ( !write_trylock(d->vpci_rwlock) ) >> return -ERESTART; >> xfree(pdev->vpci); >> pdev->vpci = NULL; >> } >> >> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for >> other operations which may require it, e.g. virtual bus topology can >> use it when assigning vSBDF etc. >> >> 4. vpci_remove_device needs to be removed from vpci_process_pending >> and do nothing for Dom0 and crash DomU otherwise: > Why is this? I'm not outright opposed, but I don't immediately see why > trying to remove the problematic device wouldn't be a reasonable course > of action anymore. 
vpci_remove_device() may need to become more careful > as to not crashing, vpci_remove_device does not crash, vpci_process_pending does > though. Assume we are in an error state in vpci_process_pending *on one of the vCPUs* and we call vpci_remove_device. vpci_remove_device tries to acquire the lock and cannot, simply because some other vpci code is running on another vCPU. Then what do we do here? We are in SoftIRQ context now and we can't spin trying to acquire d->vpci_rwlock forever. Nor can we blindly free the vpci structure, because it is seen by all vCPUs and may crash them. If vpci_remove_device is in hypercall context it just returns -ERESTART, and hypercall continuation helps here. But not in SoftIRQ context. Thus, I think we need to remove the vpci_remove_device call from vpci_process_pending and crash the domain if it is a guest domain; leave the partially done map/unmap in place if it is the hardware domain, as per Roger's comment in the code. > > Jan > Thank you, Oleksandr
On 19.11.2021 14:16, Oleksandr Andrushchenko wrote: > On 19.11.21 15:00, Jan Beulich wrote: >> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: >>> Possible locking and other work needed: >>> ======================================= >>> >>> 1. pcidevs_{lock|unlock} is too heavy and is per-host >>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device >>> 3. We may want a dedicated per-domain rw lock to be implemented: >>> >>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >>> index 28146ee404e6..ebf071893b21 100644 >>> --- a/xen/include/xen/sched.h >>> +++ b/xen/include/xen/sched.h >>> @@ -444,6 +444,7 @@ struct domain >>> >>> #ifdef CONFIG_HAS_PCI >>> struct list_head pdev_list; >>> + rwlock_t vpci_rwlock; >>> + bool vpci_terminating; <- atomic? >>> #endif >>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and >>> vpci_mmio_{read|write} are readers (hot path). >> Right - you need such a lock for other purposes anyway, as per the >> discussion with Julien. > What about bool vpci_terminating? Do you see it as an atomic type or just bool? Having seen only ... >>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation >>> to be implemented, so when re-start removal if need be: >>> >>> vpci_remove_device() >>> { >>> d->vpci_terminating = true; ... this use so far, I can't tell yet. But at a first glance a boolean looks to be what you need. >>> remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others >>> if ( !write_trylock(d->vpci_rwlock) ) >>> return -ERESTART; >>> xfree(pdev->vpci); >>> pdev->vpci = NULL; >>> } >>> >>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for >>> other operations which may require it, e.g. virtual bus topology can >>> use it when assigning vSBDF etc. >>> >>> 4. vpci_remove_device needs to be removed from vpci_process_pending >>> and do nothing for Dom0 and crash DomU otherwise: >> Why is this? 
I'm not outright opposed, but I don't immediately see why >> trying to remove the problematic device wouldn't be a reasonable course >> of action anymore. vpci_remove_device() may need to become more careful >> as to not crashing, > vpci_remove_device does not crash, vpci_process_pending does >> though. > Assume we are in an error state in vpci_process_pending *on one of the vCPUs* > and we call vpci_remove_device. vpci_remove_device tries to acquire the > lock and it can't just because there are some other vpci code is running on other vCPU. > Then what do we do here? We are in SoftIRQ context now and we can't spin > trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci > structure because it is seen by all vCPUs and may crash them. > > If vpci_remove_device is in hypercall context it just returns -ERESTART and > hypercall continuation helps here. But not in SoftIRQ context. Maybe then you want to invoke this cleanup from RCU context (whether vpci_remove_device() itself or a suitable clone thereof is TBD)? (I will admit though that I didn't check whether that would satisfy all constraints.) Then again it also hasn't become clear to me why you use write_trylock() there. The lock contention you describe doesn't, on the surface, look any different from situations elsewhere. Jan
On 19.11.21 15:25, Jan Beulich wrote: > On 19.11.2021 14:16, Oleksandr Andrushchenko wrote: >> On 19.11.21 15:00, Jan Beulich wrote: >>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: >>>> Possible locking and other work needed: >>>> ======================================= >>>> >>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host >>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device >>>> 3. We may want a dedicated per-domain rw lock to be implemented: >>>> >>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >>>> index 28146ee404e6..ebf071893b21 100644 >>>> --- a/xen/include/xen/sched.h >>>> +++ b/xen/include/xen/sched.h >>>> @@ -444,6 +444,7 @@ struct domain >>>> >>>> #ifdef CONFIG_HAS_PCI >>>> struct list_head pdev_list; >>>> + rwlock_t vpci_rwlock; >>>> + bool vpci_terminating; <- atomic? >>>> #endif >>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and >>>> vpci_mmio_{read|write} are readers (hot path). >>> Right - you need such a lock for other purposes anyway, as per the >>> discussion with Julien. >> What about bool vpci_terminating? Do you see it as an atomic type or just bool? > Having seen only ... > >>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation >>>> to be implemented, so when re-start removal if need be: >>>> >>>> vpci_remove_device() >>>> { >>>> d->vpci_terminating = true; > ... this use so far, I can't tell yet. But at a first glance a boolean > looks to be what you need. > >>>> remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others >>>> if ( !write_trylock(d->vpci_rwlock) ) >>>> return -ERESTART; >>>> xfree(pdev->vpci); >>>> pdev->vpci = NULL; >>>> } >>>> >>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for >>>> other operations which may require it, e.g. virtual bus topology can >>>> use it when assigning vSBDF etc. >>>> >>>> 4. 
vpci_remove_device needs to be removed from vpci_process_pending >>>> and do nothing for Dom0 and crash DomU otherwise: >>> Why is this? I'm not outright opposed, but I don't immediately see why >>> trying to remove the problematic device wouldn't be a reasonable course >>> of action anymore. vpci_remove_device() may need to become more careful >>> as to not crashing, >> vpci_remove_device does not crash, vpci_process_pending does >>> though. >> Assume we are in an error state in vpci_process_pending *on one of the vCPUs* >> and we call vpci_remove_device. vpci_remove_device tries to acquire the >> lock and it can't just because there are some other vpci code is running on other vCPU. >> Then what do we do here? We are in SoftIRQ context now and we can't spin >> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci >> structure because it is seen by all vCPUs and may crash them. >> >> If vpci_remove_device is in hypercall context it just returns -ERESTART and >> hypercall continuation helps here. But not in SoftIRQ context. > Maybe then you want to invoke this cleanup from RCU context (whether > vpci_remove_device() itself or a suitable clone there of is TBD)? (I > will admit though that I didn't check whether that would satisfy all > constraints.) > > Then again it also hasn't become clear to me why you use write_trylock() > there. The lock contention you describe doesn't, on the surface, look > any different from situations elsewhere. I use write_trylock in vpci_remove_device because if we can't acquire the lock then we defer device removal. This would work well if called from a hypercall which will employ hypercall continuation. But SoftIRQ getting -ERESTART is something that we can't probably handle by restarting as hypercall can, thus I only see that vpci_process_pending will need to spin and wait until vpci_remove_device succeeds. > > Jan > Thank you, Oleksandr
On 19.11.21 15:34, Oleksandr Andrushchenko wrote: > > On 19.11.21 15:25, Jan Beulich wrote: >> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote: >>> On 19.11.21 15:00, Jan Beulich wrote: >>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: >>>>> Possible locking and other work needed: >>>>> ======================================= >>>>> >>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host >>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device >>>>> 3. We may want a dedicated per-domain rw lock to be implemented: >>>>> >>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >>>>> index 28146ee404e6..ebf071893b21 100644 >>>>> --- a/xen/include/xen/sched.h >>>>> +++ b/xen/include/xen/sched.h >>>>> @@ -444,6 +444,7 @@ struct domain >>>>> >>>>> #ifdef CONFIG_HAS_PCI >>>>> struct list_head pdev_list; >>>>> + rwlock_t vpci_rwlock; >>>>> + bool vpci_terminating; <- atomic? >>>>> #endif >>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and >>>>> vpci_mmio_{read|write} are readers (hot path). >>>> Right - you need such a lock for other purposes anyway, as per the >>>> discussion with Julien. >>> What about bool vpci_terminating? Do you see it as an atomic type or just bool? >> Having seen only ... >> >>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation >>>>> to be implemented, so when re-start removal if need be: >>>>> >>>>> vpci_remove_device() >>>>> { >>>>> d->vpci_terminating = true; >> ... this use so far, I can't tell yet. But at a first glance a boolean >> looks to be what you need. >> >>>>> remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others >>>>> if ( !write_trylock(d->vpci_rwlock) ) >>>>> return -ERESTART; >>>>> xfree(pdev->vpci); >>>>> pdev->vpci = NULL; >>>>> } >>>>> >>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for >>>>> other operations which may require it, e.g. 
virtual bus topology can >>>>> use it when assigning vSBDF etc. >>>>> >>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending >>>>> and do nothing for Dom0 and crash DomU otherwise: >>>> Why is this? I'm not outright opposed, but I don't immediately see why >>>> trying to remove the problematic device wouldn't be a reasonable course >>>> of action anymore. vpci_remove_device() may need to become more careful >>>> as to not crashing, >>> vpci_remove_device does not crash, vpci_process_pending does >>>> though. >>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs* >>> and we call vpci_remove_device. vpci_remove_device tries to acquire the >>> lock and it can't just because there are some other vpci code is running on other vCPU. >>> Then what do we do here? We are in SoftIRQ context now and we can't spin >>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci >>> structure because it is seen by all vCPUs and may crash them. >>> >>> If vpci_remove_device is in hypercall context it just returns -ERESTART and >>> hypercall continuation helps here. But not in SoftIRQ context. >> Maybe then you want to invoke this cleanup from RCU context (whether >> vpci_remove_device() itself or a suitable clone there of is TBD)? (I >> will admit though that I didn't check whether that would satisfy all >> constraints.) >> >> Then again it also hasn't become clear to me why you use write_trylock() >> there. The lock contention you describe doesn't, on the surface, look >> any different from situations elsewhere. > I use write_trylock in vpci_remove_device because if we can't > acquire the lock then we defer device removal. This would work > well if called from a hypercall which will employ hypercall continuation. 
> But SoftIRQ getting -ERESTART is something that we can't probably > handle by restarting as hypercall can, thus I only see that vpci_process_pending > will need to spin and wait until vpci_remove_device succeeds. Does anybody have any better solution for preventing SoftIRQ from spinning on vpci_remove_device and -ERESTART? >> Jan >> > Thank you, > Oleksandr Thank you, Oleksandr
On 22.11.2021 15:21, Oleksandr Andrushchenko wrote: > On 19.11.21 15:34, Oleksandr Andrushchenko wrote: >> On 19.11.21 15:25, Jan Beulich wrote: >>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote: >>>> On 19.11.21 15:00, Jan Beulich wrote: >>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: >>>>>> Possible locking and other work needed: >>>>>> ======================================= >>>>>> >>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host >>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device >>>>>> 3. We may want a dedicated per-domain rw lock to be implemented: >>>>>> >>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >>>>>> index 28146ee404e6..ebf071893b21 100644 >>>>>> --- a/xen/include/xen/sched.h >>>>>> +++ b/xen/include/xen/sched.h >>>>>> @@ -444,6 +444,7 @@ struct domain >>>>>> >>>>>> #ifdef CONFIG_HAS_PCI >>>>>> struct list_head pdev_list; >>>>>> + rwlock_t vpci_rwlock; >>>>>> + bool vpci_terminating; <- atomic? >>>>>> #endif >>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and >>>>>> vpci_mmio_{read|write} are readers (hot path). >>>>> Right - you need such a lock for other purposes anyway, as per the >>>>> discussion with Julien. >>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool? >>> Having seen only ... >>> >>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation >>>>>> to be implemented, so when re-start removal if need be: >>>>>> >>>>>> vpci_remove_device() >>>>>> { >>>>>> d->vpci_terminating = true; >>> ... this use so far, I can't tell yet. But at a first glance a boolean >>> looks to be what you need. 
>>> >>>>>> remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others >>>>>> if ( !write_trylock(d->vpci_rwlock) ) >>>>>> return -ERESTART; >>>>>> xfree(pdev->vpci); >>>>>> pdev->vpci = NULL; >>>>>> } >>>>>> >>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for >>>>>> other operations which may require it, e.g. virtual bus topology can >>>>>> use it when assigning vSBDF etc. >>>>>> >>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending >>>>>> and do nothing for Dom0 and crash DomU otherwise: >>>>> Why is this? I'm not outright opposed, but I don't immediately see why >>>>> trying to remove the problematic device wouldn't be a reasonable course >>>>> of action anymore. vpci_remove_device() may need to become more careful >>>>> as to not crashing, >>>> vpci_remove_device does not crash, vpci_process_pending does >>>>> though. >>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs* >>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the >>>> lock and it can't just because there are some other vpci code is running on other vCPU. >>>> Then what do we do here? We are in SoftIRQ context now and we can't spin >>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci >>>> structure because it is seen by all vCPUs and may crash them. >>>> >>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and >>>> hypercall continuation helps here. But not in SoftIRQ context. >>> Maybe then you want to invoke this cleanup from RCU context (whether >>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I >>> will admit though that I didn't check whether that would satisfy all >>> constraints.) >>> >>> Then again it also hasn't become clear to me why you use write_trylock() >>> there. The lock contention you describe doesn't, on the surface, look >>> any different from situations elsewhere. 
>> I use write_trylock in vpci_remove_device because if we can't >> acquire the lock then we defer device removal. This would work >> well if called from a hypercall which will employ hypercall continuation. >> But SoftIRQ getting -ERESTART is something that we can't probably >> handle by restarting as hypercall can, thus I only see that vpci_process_pending >> will need to spin and wait until vpci_remove_device succeeds. > Does anybody have any better solution for preventing SoftIRQ from > spinning on vpci_remove_device and -ERESTART? Well, at this point I can suggest only a marginal improvement: Instead of spinning inside the softirq handler, you want to re-raise the softirq and exit the handler. That way at least higher "priority" softirqs won't be starved. Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such an event, with the work deferred to a tasklet? Yet I don't think my earlier question regarding the use of write_trylock() was really answered. What you said in reply doesn't explain (to me at least) why write_lock() is not an option. Jan
On 22.11.21 16:37, Jan Beulich wrote: > On 22.11.2021 15:21, Oleksandr Andrushchenko wrote: >> On 19.11.21 15:34, Oleksandr Andrushchenko wrote: >>> On 19.11.21 15:25, Jan Beulich wrote: >>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote: >>>>> On 19.11.21 15:00, Jan Beulich wrote: >>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: >>>>>>> Possible locking and other work needed: >>>>>>> ======================================= >>>>>>> >>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host >>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device >>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented: >>>>>>> >>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >>>>>>> index 28146ee404e6..ebf071893b21 100644 >>>>>>> --- a/xen/include/xen/sched.h >>>>>>> +++ b/xen/include/xen/sched.h >>>>>>> @@ -444,6 +444,7 @@ struct domain >>>>>>> >>>>>>> #ifdef CONFIG_HAS_PCI >>>>>>> struct list_head pdev_list; >>>>>>> + rwlock_t vpci_rwlock; >>>>>>> + bool vpci_terminating; <- atomic? >>>>>>> #endif >>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and >>>>>>> vpci_mmio_{read|write} are readers (hot path). >>>>>> Right - you need such a lock for other purposes anyway, as per the >>>>>> discussion with Julien. >>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool? >>>> Having seen only ... >>>> >>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation >>>>>>> to be implemented, so when re-start removal if need be: >>>>>>> >>>>>>> vpci_remove_device() >>>>>>> { >>>>>>> d->vpci_terminating = true; >>>> ... this use so far, I can't tell yet. But at a first glance a boolean >>>> looks to be what you need. 
>>>> >>>>>>> remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others >>>>>>> if ( !write_trylock(d->vpci_rwlock) ) >>>>>>> return -ERESTART; >>>>>>> xfree(pdev->vpci); >>>>>>> pdev->vpci = NULL; >>>>>>> } >>>>>>> >>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for >>>>>>> other operations which may require it, e.g. virtual bus topology can >>>>>>> use it when assigning vSBDF etc. >>>>>>> >>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending >>>>>>> and do nothing for Dom0 and crash DomU otherwise: >>>>>> Why is this? I'm not outright opposed, but I don't immediately see why >>>>>> trying to remove the problematic device wouldn't be a reasonable course >>>>>> of action anymore. vpci_remove_device() may need to become more careful >>>>>> as to not crashing, >>>>> vpci_remove_device does not crash, vpci_process_pending does >>>>>> though. >>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs* >>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the >>>>> lock and it can't just because there are some other vpci code is running on other vCPU. >>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin >>>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci >>>>> structure because it is seen by all vCPUs and may crash them. >>>>> >>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and >>>>> hypercall continuation helps here. But not in SoftIRQ context. >>>> Maybe then you want to invoke this cleanup from RCU context (whether >>>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I >>>> will admit though that I didn't check whether that would satisfy all >>>> constraints.) >>>> >>>> Then again it also hasn't become clear to me why you use write_trylock() >>>> there. 
The lock contention you describe doesn't, on the surface, look >>>> any different from situations elsewhere. >>> I use write_trylock in vpci_remove_device because if we can't >>> acquire the lock then we defer device removal. This would work >>> well if called from a hypercall which will employ hypercall continuation. >>> But SoftIRQ getting -ERESTART is something that we can't probably >>> handle by restarting as hypercall can, thus I only see that vpci_process_pending >>> will need to spin and wait until vpci_remove_device succeeds. >> Does anybody have any better solution for preventing SoftIRQ from >> spinning on vpci_remove_device and -ERESTART? > Well, at this point I can suggest only a marginal improvement: Instead of > spinning inside the softirq handler, you want to re-raise the softirq and > exit the handler. That way at least higher "priority" softirqs won't be > starved. > > Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such > an event, with the work deferred to a tasklet? > > Yet I don't think my earlier question regarding the use of write_trylock() > was really answered. What you said in reply doesn't explain (to me at > least) why write_lock() is not an option. I was thinking that we do not want to freeze when calling vpci_remove_device from SoftIRQ context, thus we try to take the lock and, if we can't, we return -ERESTART, indicating that the removal needs to be deferred. If we use write_lock, then SoftIRQ -> write_lock will spin there waiting for the readers to release the lock. write_lock actually makes things a lot easier, but I just don't know if it is ok to use it. If so, then vpci_remove_device becomes synchronous and there is no need for hypercall continuation and the other heavy machinery for re-scheduling the SoftIRQ... > > Jan > Thank you, Oleksandr
On 22.11.2021 15:45, Oleksandr Andrushchenko wrote: > > > On 22.11.21 16:37, Jan Beulich wrote: >> On 22.11.2021 15:21, Oleksandr Andrushchenko wrote: >>> On 19.11.21 15:34, Oleksandr Andrushchenko wrote: >>>> On 19.11.21 15:25, Jan Beulich wrote: >>>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote: >>>>>> On 19.11.21 15:00, Jan Beulich wrote: >>>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: >>>>>>>> Possible locking and other work needed: >>>>>>>> ======================================= >>>>>>>> >>>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host >>>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device >>>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented: >>>>>>>> >>>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >>>>>>>> index 28146ee404e6..ebf071893b21 100644 >>>>>>>> --- a/xen/include/xen/sched.h >>>>>>>> +++ b/xen/include/xen/sched.h >>>>>>>> @@ -444,6 +444,7 @@ struct domain >>>>>>>> >>>>>>>> #ifdef CONFIG_HAS_PCI >>>>>>>> struct list_head pdev_list; >>>>>>>> + rwlock_t vpci_rwlock; >>>>>>>> + bool vpci_terminating; <- atomic? >>>>>>>> #endif >>>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and >>>>>>>> vpci_mmio_{read|write} are readers (hot path). >>>>>>> Right - you need such a lock for other purposes anyway, as per the >>>>>>> discussion with Julien. >>>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool? >>>>> Having seen only ... >>>>> >>>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation >>>>>>>> to be implemented, so when re-start removal if need be: >>>>>>>> >>>>>>>> vpci_remove_device() >>>>>>>> { >>>>>>>> d->vpci_terminating = true; >>>>> ... this use so far, I can't tell yet. But at a first glance a boolean >>>>> looks to be what you need. 
>>>>> >>>>>>>> remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others >>>>>>>> if ( !write_trylock(d->vpci_rwlock) ) >>>>>>>> return -ERESTART; >>>>>>>> xfree(pdev->vpci); >>>>>>>> pdev->vpci = NULL; >>>>>>>> } >>>>>>>> >>>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for >>>>>>>> other operations which may require it, e.g. virtual bus topology can >>>>>>>> use it when assigning vSBDF etc. >>>>>>>> >>>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending >>>>>>>> and do nothing for Dom0 and crash DomU otherwise: >>>>>>> Why is this? I'm not outright opposed, but I don't immediately see why >>>>>>> trying to remove the problematic device wouldn't be a reasonable course >>>>>>> of action anymore. vpci_remove_device() may need to become more careful >>>>>>> as to not crashing, >>>>>> vpci_remove_device does not crash, vpci_process_pending does >>>>>>> though. >>>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs* >>>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the >>>>>> lock and it can't just because there are some other vpci code is running on other vCPU. >>>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin >>>>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci >>>>>> structure because it is seen by all vCPUs and may crash them. >>>>>> >>>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and >>>>>> hypercall continuation helps here. But not in SoftIRQ context. >>>>> Maybe then you want to invoke this cleanup from RCU context (whether >>>>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I >>>>> will admit though that I didn't check whether that would satisfy all >>>>> constraints.) >>>>> >>>>> Then again it also hasn't become clear to me why you use write_trylock() >>>>> there. 
The lock contention you describe doesn't, on the surface, look >>>>> any different from situations elsewhere. >>>> I use write_trylock in vpci_remove_device because if we can't >>>> acquire the lock then we defer device removal. This would work >>>> well if called from a hypercall which will employ hypercall continuation. >>>> But SoftIRQ getting -ERESTART is something that we can't probably >>>> handle by restarting as hypercall can, thus I only see that vpci_process_pending >>>> will need to spin and wait until vpci_remove_device succeeds. >>> Does anybody have any better solution for preventing SoftIRQ from >>> spinning on vpci_remove_device and -ERESTART? >> Well, at this point I can suggest only a marginal improvement: Instead of >> spinning inside the softirq handler, you want to re-raise the softirq and >> exit the handler. That way at least higher "priority" softirqs won't be >> starved. >> >> Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such >> an event, with the work deferred to a tasklet? >> >> Yet I don't think my earlier question regarding the use of write_trylock() >> was really answered. What you said in reply doesn't explain (to me at >> least) why write_lock() is not an option. > I was thinking that we do not want to freeze in case we are calling vpci_remove_device > from SoftIRQ context, thus we try to lock and if we can't we return -ERESTART > indicating that the removal needs to be deferred. If we use write_lock, then > SoftIRQ -> write_lock will spin there waiting for readers to release the lock. > > write_lock actually makes things a lot easier, but I just don't know if it > is ok to use it. If so, then vpci_remove_device becomes synchronous and > there is no need in hypercall continuation and other heavy machinery for > re-scheduling SoftIRQ... I'm inclined to ask: If it wasn't okay to use here, then where would it be okay to use? Of course I realize there are cases when long spinning times can be a problem. 
But I expect there aren't going to be excessively long lock holding regions for this lock, and I also would expect average contention to not be overly bad. But in the end you know better the code that you're writing (and which may lead to issues with the lock usage) than I do ... Jan
On 22.11.21 16:57, Jan Beulich wrote: > On 22.11.2021 15:45, Oleksandr Andrushchenko wrote: >> >> On 22.11.21 16:37, Jan Beulich wrote: >>> On 22.11.2021 15:21, Oleksandr Andrushchenko wrote: >>>> On 19.11.21 15:34, Oleksandr Andrushchenko wrote: >>>>> On 19.11.21 15:25, Jan Beulich wrote: >>>>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote: >>>>>>> On 19.11.21 15:00, Jan Beulich wrote: >>>>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote: >>>>>>>>> Possible locking and other work needed: >>>>>>>>> ======================================= >>>>>>>>> >>>>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host >>>>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device >>>>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented: >>>>>>>>> >>>>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h >>>>>>>>> index 28146ee404e6..ebf071893b21 100644 >>>>>>>>> --- a/xen/include/xen/sched.h >>>>>>>>> +++ b/xen/include/xen/sched.h >>>>>>>>> @@ -444,6 +444,7 @@ struct domain >>>>>>>>> >>>>>>>>> #ifdef CONFIG_HAS_PCI >>>>>>>>> struct list_head pdev_list; >>>>>>>>> + rwlock_t vpci_rwlock; >>>>>>>>> + bool vpci_terminating; <- atomic? >>>>>>>>> #endif >>>>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and >>>>>>>>> vpci_mmio_{read|write} are readers (hot path). >>>>>>>> Right - you need such a lock for other purposes anyway, as per the >>>>>>>> discussion with Julien. >>>>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool? >>>>>> Having seen only ... >>>>>> >>>>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation >>>>>>>>> to be implemented, so when re-start removal if need be: >>>>>>>>> >>>>>>>>> vpci_remove_device() >>>>>>>>> { >>>>>>>>> d->vpci_terminating = true; >>>>>> ... this use so far, I can't tell yet. But at a first glance a boolean >>>>>> looks to be what you need. 
>>>>>> >>>>>>>>> remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others >>>>>>>>> if ( !write_trylock(d->vpci_rwlock) ) >>>>>>>>> return -ERESTART; >>>>>>>>> xfree(pdev->vpci); >>>>>>>>> pdev->vpci = NULL; >>>>>>>>> } >>>>>>>>> >>>>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for >>>>>>>>> other operations which may require it, e.g. virtual bus topology can >>>>>>>>> use it when assigning vSBDF etc. >>>>>>>>> >>>>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending >>>>>>>>> and do nothing for Dom0 and crash DomU otherwise: >>>>>>>> Why is this? I'm not outright opposed, but I don't immediately see why >>>>>>>> trying to remove the problematic device wouldn't be a reasonable course >>>>>>>> of action anymore. vpci_remove_device() may need to become more careful >>>>>>>> as to not crashing, >>>>>>> vpci_remove_device does not crash, vpci_process_pending does >>>>>>>> though. >>>>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs* >>>>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the >>>>>>> lock and it can't just because there are some other vpci code is running on other vCPU. >>>>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin >>>>>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci >>>>>>> structure because it is seen by all vCPUs and may crash them. >>>>>>> >>>>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and >>>>>>> hypercall continuation helps here. But not in SoftIRQ context. >>>>>> Maybe then you want to invoke this cleanup from RCU context (whether >>>>>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I >>>>>> will admit though that I didn't check whether that would satisfy all >>>>>> constraints.) >>>>>> >>>>>> Then again it also hasn't become clear to me why you use write_trylock() >>>>>> there. 
The lock contention you describe doesn't, on the surface, look >>>>>> any different from situations elsewhere. >>>>> I use write_trylock in vpci_remove_device because if we can't >>>>> acquire the lock then we defer device removal. This would work >>>>> well if called from a hypercall which will employ hypercall continuation. >>>>> But SoftIRQ getting -ERESTART is something that we can't probably >>>>> handle by restarting as hypercall can, thus I only see that vpci_process_pending >>>>> will need to spin and wait until vpci_remove_device succeeds. >>>> Does anybody have any better solution for preventing SoftIRQ from >>>> spinning on vpci_remove_device and -ERESTART? >>> Well, at this point I can suggest only a marginal improvement: Instead of >>> spinning inside the softirq handler, you want to re-raise the softirq and >>> exit the handler. That way at least higher "priority" softirqs won't be >>> starved. >>> >>> Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such >>> an event, with the work deferred to a tasklet? >>> >>> Yet I don't think my earlier question regarding the use of write_trylock() >>> was really answered. What you said in reply doesn't explain (to me at >>> least) why write_lock() is not an option. >> I was thinking that we do not want to freeze in case we are calling vpci_remove_device >> from SoftIRQ context, thus we try to lock and if we can't we return -ERESTART >> indicating that the removal needs to be deferred. If we use write_lock, then >> SoftIRQ -> write_lock will spin there waiting for readers to release the lock. >> >> write_lock actually makes things a lot easier, but I just don't know if it >> is ok to use it. If so, then vpci_remove_device becomes synchronous and >> there is no need in hypercall continuation and other heavy machinery for >> re-scheduling SoftIRQ... > I'm inclined to ask: If it wasn't okay to use here, then where would it be > okay to use? 
Of course I realize there are cases when long spinning times > can be a problem. I can't prove, but I have a feeling that write_lock could be less "harmful" in case of a hypercall rather than SoftIRQ > But I expect there aren't going to be excessively long > lock holding regions for this lock, and I also would expect average > contention to not be overly bad. Yes, this is my impression as well > But in the end you know better the code > that you're writing (and which may lead to issues with the lock usage) than > I do ... I am pretty much ok with write_lock as it does make things way easier. So I'll go with write_lock then and if we spot great contention then we can improve that > > Jan > Thank you!! Oleksandr
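[Editor's sketch: with the blocking write_lock approach agreed on above, the removal path becomes synchronous. A rough outline, again using the proposed, not yet merged, d->vpci_rwlock field and eliding the handler teardown:]

```c
void vpci_remove_device(struct pci_dev *pdev)
{
    struct domain *d = pdev->domain;

    /*
     * Writer side: blocks until in-flight readers
     * (vpci_process_pending, vpci_mmio_{read,write}) drain.
     * No -ERESTART, no hypercall continuation needed.
     */
    write_lock(&d->vpci_rwlock);
    /* ... remove register handlers, cancel pending map/unmap work ... */
    xfree(pdev->vpci);
    pdev->vpci = NULL;
    write_unlock(&d->vpci_rwlock);
}
```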
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c index 40ff79c33f8f..ef538386e95d 100644 --- a/xen/drivers/vpci/header.c +++ b/xen/drivers/vpci/header.c @@ -149,8 +149,7 @@ bool vpci_process_pending(struct vcpu *v) !rc && v->vpci.rom_only); spin_unlock(&v->vpci.pdev->vpci->lock); - rangeset_destroy(v->vpci.mem); - v->vpci.mem = NULL; + vpci_cancel_pending(v->vpci.pdev); if ( rc ) /* * FIXME: in case of failure remove the device from the domain. @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v) return false; } +void vpci_cancel_pending(const struct pci_dev *pdev) +{ + struct vcpu *v = current; + + /* Cancel any pending work now. */ + if ( v->vpci.mem && v->vpci.pdev == pdev) + { + rangeset_destroy(v->vpci.mem); + v->vpci.mem = NULL; + } +} + static int __init apply_map(struct domain *d, const struct pci_dev *pdev, struct rangeset *mem, uint16_t cmd) { diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c index 657697fe3406..4e24956419aa 100644 --- a/xen/drivers/vpci/vpci.c +++ b/xen/drivers/vpci/vpci.c @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev) xfree(r); } spin_unlock(&pdev->vpci->lock); + + vpci_cancel_pending(pdev); xfree(pdev->vpci->msix); xfree(pdev->vpci->msi); xfree(pdev->vpci); diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h index 3f32de9d7eb3..609d6383b252 100644 --- a/xen/include/xen/vpci.h +++ b/xen/include/xen/vpci.h @@ -56,6 +56,7 @@ uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg, * should not run. */ bool __must_check vpci_process_pending(struct vcpu *v); +void vpci_cancel_pending(const struct pci_dev *pdev); struct vpci { /* List of vPCI handlers for a device. */ @@ -252,6 +253,11 @@ static inline bool __must_check vpci_process_pending(struct vcpu *v) ASSERT_UNREACHABLE(); return false; } + +static inline void vpci_cancel_pending(const struct pci_dev *pdev) +{ + ASSERT_UNREACHABLE(); +} #endif #endif