diff mbox series

[v4,02/11] vpci: cancel pending map/unmap on vpci removal

Message ID 20211105065629.940943-3-andr2000@gmail.com (mailing list archive)
State Superseded
Headers show
Series PCI devices passthrough on Arm, part 3 | expand

Commit Message

Oleksandr Andrushchenko Nov. 5, 2021, 6:56 a.m. UTC
From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a vPCI is removed for a PCI device it is possible that we have
scheduled a delayed work for map/unmap operations for that device.
For example, the following scenario can illustrate the problem:

pci_physdev_op
   pci_add_device
       init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
   iommu_add_device <- FAILS
   vpci_remove_device -> xfree(pdev->vpci)

leave_hypervisor_to_guest
   vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL

For the hardware domain we continue execution as the worse that
could happen is that MMIO mappings are left in place when the
device has been deassigned

For unprivileged domains that get a failure in the middle of a vPCI
{un}map operation we need to destroy them, as we don't know in which
state the p2m is. This can only happen in vpci_process_pending for
DomUs as they won't be allowed to call pci_add_device.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

---
Cc: Roger Pau Monné <roger.pau@citrix.com>

New in v4
---
 xen/drivers/vpci/header.c | 15 +++++++++++++--
 xen/drivers/vpci/vpci.c   |  2 ++
 xen/include/xen/vpci.h    |  6 ++++++
 3 files changed, 21 insertions(+), 2 deletions(-)

Comments

Jan Beulich Nov. 15, 2021, 4:56 p.m. UTC | #1
On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a vPCI is removed for a PCI device it is possible that we have
> scheduled a delayed work for map/unmap operations for that device.
> For example, the following scenario can illustrate the problem:
> 
> pci_physdev_op
>    pci_add_device
>        init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>    iommu_add_device <- FAILS
>    vpci_remove_device -> xfree(pdev->vpci)
> 
> leave_hypervisor_to_guest
>    vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> 
> For the hardware domain we continue execution as the worse that
> could happen is that MMIO mappings are left in place when the
> device has been deassigned

Is continuing safe in this case? I.e. isn't there the risk of a NULL
deref?

> For unprivileged domains that get a failure in the middle of a vPCI
> {un}map operation we need to destroy them, as we don't know in which
> state the p2m is. This can only happen in vpci_process_pending for
> DomUs as they won't be allowed to call pci_add_device.

You saying "we need to destroy them" made me look for a new domain_crash()
that you add, but there is none. What is this about?

> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>      return false;
>  }
>  
> +void vpci_cancel_pending(const struct pci_dev *pdev)
> +{
> +    struct vcpu *v = current;
> +
> +    /* Cancel any pending work now. */

Doesn't "any" include pending work on all vCPU-s of the guest, not
just current? Is current even relevant (as in: part of the correct
domain), considering ...

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>          xfree(r);
>      }
>      spin_unlock(&pdev->vpci->lock);
> +
> +    vpci_cancel_pending(pdev);

... this code path, when coming here from pci_{add,remove}_device()?

I can agree that there's a problem here, but I think you need to
properly (i.e. in a race free manner) drain pending work.

Jan
Oleksandr Andrushchenko Nov. 16, 2021, 7:32 a.m. UTC | #2
On 15.11.21 18:56, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> When a vPCI is removed for a PCI device it is possible that we have
>> scheduled a delayed work for map/unmap operations for that device.
>> For example, the following scenario can illustrate the problem:
>>
>> pci_physdev_op
>>     pci_add_device
>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>     iommu_add_device <- FAILS
>>     vpci_remove_device -> xfree(pdev->vpci)
>>
>> leave_hypervisor_to_guest
>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>
>> For the hardware domain we continue execution as the worse that
>> could happen is that MMIO mappings are left in place when the
>> device has been deassigned
> Is continuing safe in this case? I.e. isn't there the risk of a NULL
> deref?
I think it is safe to continue
>
>> For unprivileged domains that get a failure in the middle of a vPCI
>> {un}map operation we need to destroy them, as we don't know in which
>> state the p2m is. This can only happen in vpci_process_pending for
>> DomUs as they won't be allowed to call pci_add_device.
> You saying "we need to destroy them" made me look for a new domain_crash()
> that you add, but there is none. What is this about?
Yes, I guess we need to implicitly destroy the domain,
@Roger are you ok with that?
>
>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>       return false;
>>   }
>>   
>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>> +{
>> +    struct vcpu *v = current;
>> +
>> +    /* Cancel any pending work now. */
> Doesn't "any" include pending work on all vCPU-s of the guest, not
> just current? Is current even relevant (as in: part of the correct
> domain), considering ...
>
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>           xfree(r);
>>       }
>>       spin_unlock(&pdev->vpci->lock);
>> +
>> +    vpci_cancel_pending(pdev);
> ... this code path, when coming here from pci_{add,remove}_device()?
>
> I can agree that there's a problem here, but I think you need to
> properly (i.e. in a race free manner) drain pending work.
Yes, the code is inconsistent with this respect. I am thinking about:

void vpci_cancel_pending(const struct pci_dev *pdev)
{
     struct domain *d = pdev->domain;
     struct vcpu *v;

     /* Cancel any pending work now. */
     domain_lock(d);
     for_each_vcpu ( d, v )
     {
         vcpu_pause(v);
         if ( v->vpci.mem && v->vpci.pdev == pdev)
         {
             rangeset_destroy(v->vpci.mem);
             v->vpci.mem = NULL;
         }
         vcpu_unpause(v);
     }
     domain_unlock(d);
}

which seems to solve all the concerns. Is this what you mean?
>
> Jan
>
Thank you,
Oleksandr
Jan Beulich Nov. 16, 2021, 8:01 a.m. UTC | #3
On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
> On 15.11.21 18:56, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> When a vPCI is removed for a PCI device it is possible that we have
>>> scheduled a delayed work for map/unmap operations for that device.
>>> For example, the following scenario can illustrate the problem:
>>>
>>> pci_physdev_op
>>>     pci_add_device
>>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>     iommu_add_device <- FAILS
>>>     vpci_remove_device -> xfree(pdev->vpci)
>>>
>>> leave_hypervisor_to_guest
>>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>
>>> For the hardware domain we continue execution as the worse that
>>> could happen is that MMIO mappings are left in place when the
>>> device has been deassigned
>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>> deref?
> I think it is safe to continue

And why do you think so? I.e. why is there no race for Dom0 when there
is one for DomU?

>>> For unprivileged domains that get a failure in the middle of a vPCI
>>> {un}map operation we need to destroy them, as we don't know in which
>>> state the p2m is. This can only happen in vpci_process_pending for
>>> DomUs as they won't be allowed to call pci_add_device.
>> You saying "we need to destroy them" made me look for a new domain_crash()
>> that you add, but there is none. What is this about?
> Yes, I guess we need to implicitly destroy the domain,

What do you mean by "implicitly"?

>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>       return false;
>>>   }
>>>   
>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>> +{
>>> +    struct vcpu *v = current;
>>> +
>>> +    /* Cancel any pending work now. */
>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>> just current? Is current even relevant (as in: part of the correct
>> domain), considering ...
>>
>>> --- a/xen/drivers/vpci/vpci.c
>>> +++ b/xen/drivers/vpci/vpci.c
>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>           xfree(r);
>>>       }
>>>       spin_unlock(&pdev->vpci->lock);
>>> +
>>> +    vpci_cancel_pending(pdev);
>> ... this code path, when coming here from pci_{add,remove}_device()?
>>
>> I can agree that there's a problem here, but I think you need to
>> properly (i.e. in a race free manner) drain pending work.
> Yes, the code is inconsistent with this respect. I am thinking about:
> 
> void vpci_cancel_pending(const struct pci_dev *pdev)
> {
>      struct domain *d = pdev->domain;
>      struct vcpu *v;
> 
>      /* Cancel any pending work now. */
>      domain_lock(d);
>      for_each_vcpu ( d, v )
>      {
>          vcpu_pause(v);
>          if ( v->vpci.mem && v->vpci.pdev == pdev)

Nit: Same style issue as in the original patch.

>          {
>              rangeset_destroy(v->vpci.mem);
>              v->vpci.mem = NULL;
>          }
>          vcpu_unpause(v);
>      }
>      domain_unlock(d);
> }
> 
> which seems to solve all the concerns. Is this what you mean?

Something along these lines. I expect you will want to make use of
domain_pause_except_self(), and I don't understand the purpose of
acquiring the domain lock.

Jan
Oleksandr Andrushchenko Nov. 16, 2021, 8:23 a.m. UTC | #4
On 16.11.21 10:01, Jan Beulich wrote:
> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>> On 15.11.21 18:56, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>> scheduled a delayed work for map/unmap operations for that device.
>>>> For example, the following scenario can illustrate the problem:
>>>>
>>>> pci_physdev_op
>>>>      pci_add_device
>>>>          init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>      iommu_add_device <- FAILS
>>>>      vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> leave_hypervisor_to_guest
>>>>      vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>
>>>> For the hardware domain we continue execution as the worse that
>>>> could happen is that MMIO mappings are left in place when the
>>>> device has been deassigned
>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>> deref?
>> I think it is safe to continue
> And why do you think so? I.e. why is there no race for Dom0 when there
> is one for DomU?
Well, then we need to use a lock to synchronize the two.
I guess this needs to be pci devs lock unfortunately
>
>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>> {un}map operation we need to destroy them, as we don't know in which
>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>> DomUs as they won't be allowed to call pci_add_device.
>>> You saying "we need to destroy them" made me look for a new domain_crash()
>>> that you add, but there is none. What is this about?
>> Yes, I guess we need to implicitly destroy the domain,
> What do you mean by "implicitly"?
@@ -151,14 +151,18 @@ bool vpci_process_pending(struct vcpu *v)

          vpci_cancel_pending(v->vpci.pdev);
          if ( rc )
+        {
              /*
               * FIXME: in case of failure remove the device from the domain.
               * Note that there might still be leftover mappings. While this is
+             * safe for Dom0, for DomUs the domain needs to be killed in order
+             * to avoid leaking stale p2m mappings on failure.
               */
              vpci_remove_device(v->vpci.pdev);
+
+            if ( !is_hardware_domain(v->domain) )
+                domain_crash(v->domain);

>
>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>        return false;
>>>>    }
>>>>    
>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>> +{
>>>> +    struct vcpu *v = current;
>>>> +
>>>> +    /* Cancel any pending work now. */
>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>> just current? Is current even relevant (as in: part of the correct
>>> domain), considering ...
>>>
>>>> --- a/xen/drivers/vpci/vpci.c
>>>> +++ b/xen/drivers/vpci/vpci.c
>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>            xfree(r);
>>>>        }
>>>>        spin_unlock(&pdev->vpci->lock);
>>>> +
>>>> +    vpci_cancel_pending(pdev);
>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>
>>> I can agree that there's a problem here, but I think you need to
>>> properly (i.e. in a race free manner) drain pending work.
>> Yes, the code is inconsistent with this respect. I am thinking about:
>>
>> void vpci_cancel_pending(const struct pci_dev *pdev)
>> {
>>       struct domain *d = pdev->domain;
>>       struct vcpu *v;
>>
>>       /* Cancel any pending work now. */
>>       domain_lock(d);
>>       for_each_vcpu ( d, v )
>>       {
>>           vcpu_pause(v);
>>           if ( v->vpci.mem && v->vpci.pdev == pdev)
> Nit: Same style issue as in the original patch.
Will fix
>
>>           {
>>               rangeset_destroy(v->vpci.mem);
>>               v->vpci.mem = NULL;
>>           }
>>           vcpu_unpause(v);
>>       }
>>       domain_unlock(d);
>> }
>>
>> which seems to solve all the concerns. Is this what you mean?
> Something along these lines. I expect you will want to make use of
> domain_pause_except_self(),
Yes, this is what is needed here, thanks. The only question is that

int domain_pause_except_self(struct domain *d)
{
[snip]
         /* Avoid racing with other vcpus which may want to be pausing us */
         if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
             return -ERESTART;

so it is not clear what do we do in case of -ERESTART: do we want to spin?
Otherwise we will leave the job not done effectively not canceling the
pending work. Any idea other then spinning?
>   and I don't understand the purpose of
> acquiring the domain lock.
You are right, no need
>
> Jan
>
Thank you,
Oleksandr
Jan Beulich Nov. 16, 2021, 11:38 a.m. UTC | #5
On 16.11.2021 09:23, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 10:01, Jan Beulich wrote:
>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>>        return false;
>>>>>    }
>>>>>    
>>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>>> +{
>>>>> +    struct vcpu *v = current;
>>>>> +
>>>>> +    /* Cancel any pending work now. */
>>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>>> just current? Is current even relevant (as in: part of the correct
>>>> domain), considering ...
>>>>
>>>>> --- a/xen/drivers/vpci/vpci.c
>>>>> +++ b/xen/drivers/vpci/vpci.c
>>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>>            xfree(r);
>>>>>        }
>>>>>        spin_unlock(&pdev->vpci->lock);
>>>>> +
>>>>> +    vpci_cancel_pending(pdev);
>>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>>
>>>> I can agree that there's a problem here, but I think you need to
>>>> properly (i.e. in a race free manner) drain pending work.
>>> Yes, the code is inconsistent with this respect. I am thinking about:
>>>
>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>> {
>>>       struct domain *d = pdev->domain;
>>>       struct vcpu *v;
>>>
>>>       /* Cancel any pending work now. */
>>>       domain_lock(d);
>>>       for_each_vcpu ( d, v )
>>>       {
>>>           vcpu_pause(v);
>>>           if ( v->vpci.mem && v->vpci.pdev == pdev)
>> Nit: Same style issue as in the original patch.
> Will fix
>>
>>>           {
>>>               rangeset_destroy(v->vpci.mem);
>>>               v->vpci.mem = NULL;
>>>           }
>>>           vcpu_unpause(v);
>>>       }
>>>       domain_unlock(d);
>>> }
>>>
>>> which seems to solve all the concerns. Is this what you mean?
>> Something along these lines. I expect you will want to make use of
>> domain_pause_except_self(),
> Yes, this is what is needed here, thanks. The only question is that
> 
> int domain_pause_except_self(struct domain *d)
> {
> [snip]
>          /* Avoid racing with other vcpus which may want to be pausing us */
>          if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>              return -ERESTART;
> 
> so it is not clear what do we do in case of -ERESTART: do we want to spin?
> Otherwise we will leave the job not done effectively not canceling the
> pending work. Any idea other then spinning?

Depends on the call chain you come through. There may need to be some
rearrangements such that you may be able to preempt the enclosing
hypercall.

Jan
Oleksandr Andrushchenko Nov. 16, 2021, 1:27 p.m. UTC | #6
On 16.11.21 13:38, Jan Beulich wrote:
> On 16.11.2021 09:23, Oleksandr Andrushchenko wrote:
>>
>> On 16.11.21 10:01, Jan Beulich wrote:
>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>>>         return false;
>>>>>>     }
>>>>>>     
>>>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>>>> +{
>>>>>> +    struct vcpu *v = current;
>>>>>> +
>>>>>> +    /* Cancel any pending work now. */
>>>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>>>> just current? Is current even relevant (as in: part of the correct
>>>>> domain), considering ...
>>>>>
>>>>>> --- a/xen/drivers/vpci/vpci.c
>>>>>> +++ b/xen/drivers/vpci/vpci.c
>>>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>>>             xfree(r);
>>>>>>         }
>>>>>>         spin_unlock(&pdev->vpci->lock);
>>>>>> +
>>>>>> +    vpci_cancel_pending(pdev);
>>>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>>>
>>>>> I can agree that there's a problem here, but I think you need to
>>>>> properly (i.e. in a race free manner) drain pending work.
>>>> Yes, the code is inconsistent with this respect. I am thinking about:
>>>>
>>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>>> {
>>>>        struct domain *d = pdev->domain;
>>>>        struct vcpu *v;
>>>>
>>>>        /* Cancel any pending work now. */
>>>>        domain_lock(d);
>>>>        for_each_vcpu ( d, v )
>>>>        {
>>>>            vcpu_pause(v);
>>>>            if ( v->vpci.mem && v->vpci.pdev == pdev)
>>> Nit: Same style issue as in the original patch.
>> Will fix
>>>>            {
>>>>                rangeset_destroy(v->vpci.mem);
>>>>                v->vpci.mem = NULL;
>>>>            }
>>>>            vcpu_unpause(v);
>>>>        }
>>>>        domain_unlock(d);
>>>> }
>>>>
>>>> which seems to solve all the concerns. Is this what you mean?
>>> Something along these lines. I expect you will want to make use of
>>> domain_pause_except_self(),
>> Yes, this is what is needed here, thanks. The only question is that
>>
>> int domain_pause_except_self(struct domain *d)
>> {
>> [snip]
>>           /* Avoid racing with other vcpus which may want to be pausing us */
>>           if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>>               return -ERESTART;
>>
>> so it is not clear what do we do in case of -ERESTART: do we want to spin?
>> Otherwise we will leave the job not done effectively not canceling the
>> pending work. Any idea other then spinning?
> Depends on the call chain you come through. There may need to be some
> rearrangements such that you may be able to preempt the enclosing
> hypercall.
Well, there are three places which may lead to the pending work
needs to be canceled:

MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> vpci_cancel_pending (on modify_bars fail path)

PHYSDEVOP_pci_device_add -> pci_add_device (error path) -> vpci_remove_device -> vpci_cancel_pending

PHYSDEVOP_pci_device_remove -> pci_remove_device -> vpci_remove_device -> vpci_cancel_pending

So, taking into account the MMIO path, I think about the below code

     /*
      * Cancel any pending work now.
      *
      * FIXME: this can be called from an MMIO trap handler's error
      * path, so we cannot just return an error code here, so upper
      * layers can handle it. The best we can do is to still try
      * removing the range sets.
      */
     while ( (rc = domain_pause_except_self(d)) == -ERESTART )
         cpu_relax();

     if ( rc )
         printk(XENLOG_G_ERR
                "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
                &pdev->sbdf, pdev->domain, rc);

I am not sure how to handle this otherwise
@Roger, do you see any other good way?
>
> Jan
>
Thank you,
Oleksandr
Oleksandr Andrushchenko Nov. 16, 2021, 1:41 p.m. UTC | #7
On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>
> On 16.11.21 10:01, Jan Beulich wrote:
>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>> For example, the following scenario can illustrate the problem:
>>>>>
>>>>> pci_physdev_op
>>>>>       pci_add_device
>>>>>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>       iommu_add_device <- FAILS
>>>>>       vpci_remove_device -> xfree(pdev->vpci)
>>>>>
>>>>> leave_hypervisor_to_guest
>>>>>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>
>>>>> For the hardware domain we continue execution as the worse that
>>>>> could happen is that MMIO mappings are left in place when the
>>>>> device has been deassigned
>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>> deref?
>>> I think it is safe to continue
>> And why do you think so? I.e. why is there no race for Dom0 when there
>> is one for DomU?
> Well, then we need to use a lock to synchronize the two.
> I guess this needs to be pci devs lock unfortunately
The parties involved in deferred work and its cancellation:

MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map

Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending

x86: two places -> hvm_do_resume -> vpci_process_pending

So, both defer_map and vpci_process_pending need to be synchronized with
pcidevs_{lock|unlock).
As both functions existed before the code I introduce I would prefer this to be
a dedicated patch for v5 of the series.

Thank you,
Oleksandr
Jan Beulich Nov. 16, 2021, 2:11 p.m. UTC | #8
On 16.11.2021 14:27, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 13:38, Jan Beulich wrote:
>> On 16.11.2021 09:23, Oleksandr Andrushchenko wrote:
>>>
>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>> @@ -165,6 +164,18 @@ bool vpci_process_pending(struct vcpu *v)
>>>>>>>         return false;
>>>>>>>     }
>>>>>>>     
>>>>>>> +void vpci_cancel_pending(const struct pci_dev *pdev)
>>>>>>> +{
>>>>>>> +    struct vcpu *v = current;
>>>>>>> +
>>>>>>> +    /* Cancel any pending work now. */
>>>>>> Doesn't "any" include pending work on all vCPU-s of the guest, not
>>>>>> just current? Is current even relevant (as in: part of the correct
>>>>>> domain), considering ...
>>>>>>
>>>>>>> --- a/xen/drivers/vpci/vpci.c
>>>>>>> +++ b/xen/drivers/vpci/vpci.c
>>>>>>> @@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
>>>>>>>             xfree(r);
>>>>>>>         }
>>>>>>>         spin_unlock(&pdev->vpci->lock);
>>>>>>> +
>>>>>>> +    vpci_cancel_pending(pdev);
>>>>>> ... this code path, when coming here from pci_{add,remove}_device()?
>>>>>>
>>>>>> I can agree that there's a problem here, but I think you need to
>>>>>> properly (i.e. in a race free manner) drain pending work.
>>>>> Yes, the code is inconsistent with this respect. I am thinking about:
>>>>>
>>>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>>>> {
>>>>>        struct domain *d = pdev->domain;
>>>>>        struct vcpu *v;
>>>>>
>>>>>        /* Cancel any pending work now. */
>>>>>        domain_lock(d);
>>>>>        for_each_vcpu ( d, v )
>>>>>        {
>>>>>            vcpu_pause(v);
>>>>>            if ( v->vpci.mem && v->vpci.pdev == pdev)
>>>> Nit: Same style issue as in the original patch.
>>> Will fix
>>>>>            {
>>>>>                rangeset_destroy(v->vpci.mem);
>>>>>                v->vpci.mem = NULL;
>>>>>            }
>>>>>            vcpu_unpause(v);
>>>>>        }
>>>>>        domain_unlock(d);
>>>>> }
>>>>>
>>>>> which seems to solve all the concerns. Is this what you mean?
>>>> Something along these lines. I expect you will want to make use of
>>>> domain_pause_except_self(),
>>> Yes, this is what is needed here, thanks. The only question is that
>>>
>>> int domain_pause_except_self(struct domain *d)
>>> {
>>> [snip]
>>>           /* Avoid racing with other vcpus which may want to be pausing us */
>>>           if ( !spin_trylock(&d->hypercall_deadlock_mutex) )
>>>               return -ERESTART;
>>>
>>> so it is not clear what do we do in case of -ERESTART: do we want to spin?
>>> Otherwise we will leave the job not done effectively not canceling the
>>> pending work. Any idea other then spinning?
>> Depends on the call chain you come through. There may need to be some
>> rearrangements such that you may be able to preempt the enclosing
>> hypercall.
> Well, there are three places which may lead to the pending work
> needs to be canceled:
> 
> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> vpci_cancel_pending (on modify_bars fail path)
> 
> PHYSDEVOP_pci_device_add -> pci_add_device (error path) -> vpci_remove_device -> vpci_cancel_pending
> 
> PHYSDEVOP_pci_device_remove -> pci_remove_device -> vpci_remove_device -> vpci_cancel_pending
> 
> So, taking into account the MMIO path, I think about the below code
> 
>      /*
>       * Cancel any pending work now.
>       *
>       * FIXME: this can be called from an MMIO trap handler's error
>       * path, so we cannot just return an error code here, so upper
>       * layers can handle it. The best we can do is to still try
>       * removing the range sets.
>       */

The MMIO trap path should simply exit to the guest to have the insn
retried. With the vPCI removal a subsequent emulation attempt will
no longer be able to resolve the address to BDF, and hence do
whatever it would do for an attempted access to config space not
belonging to any device.

For the two physdevops it may not be possible to preempt the
hypercalls without further code adjustments.

Jan

>      while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>          cpu_relax();
> 
>      if ( rc )
>          printk(XENLOG_G_ERR
>                 "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>                 &pdev->sbdf, pdev->domain, rc);
> 
> I am not sure how to handle this otherwise
> @Roger, do you see any other good way?
>>
>> Jan
>>
> Thank you,
> Oleksandr
>
Jan Beulich Nov. 16, 2021, 2:12 p.m. UTC | #9
On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>
>> On 16.11.21 10:01, Jan Beulich wrote:
>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>
>>>>>> pci_physdev_op
>>>>>>       pci_add_device
>>>>>>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>       iommu_add_device <- FAILS
>>>>>>       vpci_remove_device -> xfree(pdev->vpci)
>>>>>>
>>>>>> leave_hypervisor_to_guest
>>>>>>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>
>>>>>> For the hardware domain we continue execution as the worse that
>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>> device has been deassigned
>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>> deref?
>>>> I think it is safe to continue
>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>> is one for DomU?
>> Well, then we need to use a lock to synchronize the two.
>> I guess this needs to be pci devs lock unfortunately
> The parties involved in deferred work and its cancellation:
> 
> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
> 
> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
> 
> x86: two places -> hvm_do_resume -> vpci_process_pending
> 
> So, both defer_map and vpci_process_pending need to be synchronized with
> pcidevs_{lock|unlock).

If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
getting used in leave_hypervisor_to_guest.

Jan
Oleksandr Andrushchenko Nov. 16, 2021, 2:24 p.m. UTC | #10
On 16.11.21 16:12, Jan Beulich wrote:
> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>
>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>
>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>
>>>>>>> pci_physdev_op
>>>>>>>        pci_add_device
>>>>>>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>        iommu_add_device <- FAILS
>>>>>>>        vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>
>>>>>>> leave_hypervisor_to_guest
>>>>>>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>
>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>> device has been deassigned
>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>> deref?
>>>>> I think it is safe to continue
>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>> is one for DomU?
>>> Well, then we need to use a lock to synchronize the two.
>>> I guess this needs to be pci devs lock unfortunately
>> The parties involved in deferred work and its cancellation:
>>
>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>
>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>
>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>
>> So, both defer_map and vpci_process_pending need to be synchronized with
>> pcidevs_{lock|unlock).
> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
> getting used in leave_hypervisor_to_guest.
I do agree this is really not good, but it seems I am limited in choices.
@Stefano, @Julien, do you see any better way of doing that?

We were thinking about introducing a dedicated lock for vpci [1],
but finally decided to use pcidevs_lock for now
> Jan
>

[1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/
Oleksandr Andrushchenko Nov. 16, 2021, 2:37 p.m. UTC | #11
On 16.11.21 16:24, Oleksandr Andrushchenko wrote:
>
> On 16.11.21 16:12, Jan Beulich wrote:
>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>
>>>>>>>> pci_physdev_op
>>>>>>>>         pci_add_device
>>>>>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>         iommu_add_device <- FAILS
>>>>>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>
>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>
>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>> device has been deassigned
>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>>> deref?
>>>>>> I think it is safe to continue
>>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>>> is one for DomU?
>>>> Well, then we need to use a lock to synchronize the two.
>>>> I guess this needs to be pci devs lock unfortunately
>>> The parties involved in deferred work and its cancellation:
>>>
>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>>
>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>>
>>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>>
>>> So, both defer_map and vpci_process_pending need to be synchronized with
>>> pcidevs_{lock|unlock).
>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
>> getting used in leave_hypervisor_to_guest.
> I do agree this is really not good, but it seems I am limited in choices.
> @Stefano, @Julien, do you see any better way of doing that?
>
> We were thinking about introducing a dedicated lock for vpci [1],
> but finally decided to use pcidevs_lock for now
Or even better and simpler: we just use pdev->vpci->lock to
protect vpci_process_pending vs MMIO trap handlers which  already
use it.
>> Jan
>>
> [1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/
Jan Beulich Nov. 16, 2021, 4:09 p.m. UTC | #12
On 16.11.2021 15:24, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 16:12, Jan Beulich wrote:
>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>>
>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>
>>>>>>>> pci_physdev_op
>>>>>>>>        pci_add_device
>>>>>>>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>        iommu_add_device <- FAILS
>>>>>>>>        vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>
>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>
>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>> device has been deassigned
>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>>> deref?
>>>>>> I think it is safe to continue
>>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>>> is one for DomU?
>>>> Well, then we need to use a lock to synchronize the two.
>>>> I guess this needs to be pci devs lock unfortunately
>>> The parties involved in deferred work and its cancellation:
>>>
>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>>
>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>>
>>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>>
>>> So, both defer_map and vpci_process_pending need to be synchronized with
>>> pcidevs_{lock|unlock).
>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
>> getting used in leave_hypervisor_to_guest.
> I do agree this is really not good, but it seems I am limited in choices.
> @Stefano, @Julien, do you see any better way of doing that?
> 
> We were thinking about introducing a dedicated lock for vpci [1],
> but finally decided to use pcidevs_lock for now

Even that locking model might be too heavyweight for this purpose,
unless an r/w lock was intended. The problem would still be that
all guest exiting would be serialized within a domain. (That's still
better than serializing all guest exiting on the host, of course.)

Jan

> [1] https://lore.kernel.org/all/afe47397-a792-6b0c-0a89-b47c523e50d9@epam.com/
>
Julien Grall Nov. 16, 2021, 6:02 p.m. UTC | #13
Hi Oleksandr,

On 16/11/2021 14:24, Oleksandr Andrushchenko wrote:
> 
> 
> On 16.11.21 16:12, Jan Beulich wrote:
>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>>
>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>
>>>>>>>> pci_physdev_op
>>>>>>>>         pci_add_device
>>>>>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>         iommu_add_device <- FAILS
>>>>>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>
>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>
>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>> device has been deassigned
>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>>> deref?
>>>>>> I think it is safe to continue
>>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>>> is one for DomU?
>>>> Well, then we need to use a lock to synchronize the two.
>>>> I guess this needs to be pci devs lock unfortunately
>>> The parties involved in deferred work and its cancellation:
>>>
>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>>
>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>>
>>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>>
>>> So, both defer_map and vpci_process_pending need to be synchronized with
>>> pcidevs_{lock|unlock).
>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
>> getting used in leave_hypervisor_to_guest.
> I do agree this is really not good, but it seems I am limited in choices.
> @Stefano, @Julien, do you see any better way of doing that?

I agree with Jan about using the pcidevs_{lock|unlock}. The lock is not 
fine-grained enough for been call in vpci_process_pending().

I haven't yet looked at the rest of the series to be able to suggest the 
exact lock. But we at least want a per-domain spinlock.

> 
> We were thinking about introducing a dedicated lock for vpci [1],
> but finally decided to use pcidevs_lock for now

Skimming through the thread, you decided to use pcidevs_lock because it 
was simpler and sufficient for the use case discussed back then. Now, we 
have a use case where it would be a problem to use pcidevs_lock. So I 
think the extra complexity is justified.

Cheers,
Jan Beulich Nov. 17, 2021, 8:28 a.m. UTC | #14
On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> When a vPCI is removed for a PCI device it is possible that we have
> scheduled a delayed work for map/unmap operations for that device.
> For example, the following scenario can illustrate the problem:
> 
> pci_physdev_op
>    pci_add_device
>        init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>    iommu_add_device <- FAILS
>    vpci_remove_device -> xfree(pdev->vpci)
> 
> leave_hypervisor_to_guest
>    vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> 
> For the hardware domain we continue execution as the worse that
> could happen is that MMIO mappings are left in place when the
> device has been deassigned
> 
> For unprivileged domains that get a failure in the middle of a vPCI
> {un}map operation we need to destroy them, as we don't know in which
> state the p2m is. This can only happen in vpci_process_pending for
> DomUs as they won't be allowed to call pci_add_device.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Thinking about it some more, I'm not convinced any of this is really
needed in the presented form. Removal of a vPCI device is the analogue
of hot-unplug on baremetal. That's not a "behind the backs of
everything" operation. Instead the host admin has to prepare the
device for removal, which will result in it being quiescent (which in
particular means no BAR adjustments anymore). The act of removing the
device from the system has as its virtual counterpart "xl pci-detach".
I think it ought to be in this context when pending requests get
drained, and an indicator be set that no further changes to that
device are permitted. This would mean invoking from
vpci_deassign_device() as added by patch 4, not from
vpci_remove_device(). This would yield removal of a device from the
host being independent of removal of a device from a guest.

The need for vpci_remove_device() seems questionable in the first
place: Even for hot-unplug on the host it may be better to require a
pci-detach from (PVH) Dom0 before the actual device removal. This
would involve an adjustment to the de-assignment logic for the case
of no quarantining: We'd need to make sure explicit de-assignment
from Dom0 actually removes the device from there; right now
de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending
on quarantining mode).

Thoughts?

Jan
Oleksandr Andrushchenko Nov. 18, 2021, 7:49 a.m. UTC | #15
On 17.11.21 10:28, Jan Beulich wrote:
> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> When a vPCI is removed for a PCI device it is possible that we have
>> scheduled a delayed work for map/unmap operations for that device.
>> For example, the following scenario can illustrate the problem:
>>
>> pci_physdev_op
>>     pci_add_device
>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>     iommu_add_device <- FAILS
>>     vpci_remove_device -> xfree(pdev->vpci)
>>
>> leave_hypervisor_to_guest
>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>
>> For the hardware domain we continue execution as the worse that
>> could happen is that MMIO mappings are left in place when the
>> device has been deassigned
>>
>> For unprivileged domains that get a failure in the middle of a vPCI
>> {un}map operation we need to destroy them, as we don't know in which
>> state the p2m is. This can only happen in vpci_process_pending for
>> DomUs as they won't be allowed to call pci_add_device.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Thinking about it some more, I'm not convinced any of this is really
> needed in the presented form.
The intention of this patch was to handle error conditions which are
abnormal, e.g. when iommu_add_device fails and we are in the middle
of initialization. So, I am trying to cancel all pending work which might
already be there and not to crash.
>   Removal of a vPCI device is the analogue
> of hot-unplug on baremetal. That's not a "behind the backs of
> everything" operation. Instead the host admin has to prepare the
> device for removal, which will result in it being quiescent (which in
> particular means no BAR adjustments anymore). The act of removing the
> device from the system has as its virtual counterpart "xl pci-detach".
> I think it ought to be in this context when pending requests get
> drained, and an indicator be set that no further changes to that
> device are permitted. This would mean invoking from
> vpci_deassign_device() as added by patch 4, not from
> vpci_remove_device(). This would yield removal of a device from the
> host being independent of removal of a device from a guest.
>
> The need for vpci_remove_device() seems questionable in the first
> place: Even for hot-unplug on the host it may be better to require a
> pci-detach from (PVH) Dom0 before the actual device removal. This
> would involve an adjustment to the de-assignment logic for the case
> of no quarantining: We'd need to make sure explicit de-assignment
> from Dom0 actually removes the device from there; right now
> de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending
> on quarantining mode).
Please see above. What you wrote might be perfectly fine for
the "expected" removals, but what about the errors which are
out of administrator's control?
>
> Thoughts?
>
> Jan
>
Thank you,
Oleksandr
Jan Beulich Nov. 18, 2021, 8:36 a.m. UTC | #16
On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
> 
> 
> On 17.11.21 10:28, Jan Beulich wrote:
>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> When a vPCI is removed for a PCI device it is possible that we have
>>> scheduled a delayed work for map/unmap operations for that device.
>>> For example, the following scenario can illustrate the problem:
>>>
>>> pci_physdev_op
>>>     pci_add_device
>>>         init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>     iommu_add_device <- FAILS
>>>     vpci_remove_device -> xfree(pdev->vpci)
>>>
>>> leave_hypervisor_to_guest
>>>     vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>
>>> For the hardware domain we continue execution as the worse that
>>> could happen is that MMIO mappings are left in place when the
>>> device has been deassigned
>>>
>>> For unprivileged domains that get a failure in the middle of a vPCI
>>> {un}map operation we need to destroy them, as we don't know in which
>>> state the p2m is. This can only happen in vpci_process_pending for
>>> DomUs as they won't be allowed to call pci_add_device.
>>>
>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Thinking about it some more, I'm not convinced any of this is really
>> needed in the presented form.
> The intention of this patch was to handle error conditions which are
> abnormal, e.g. when iommu_add_device fails and we are in the middle
> of initialization. So, I am trying to cancel all pending work which might
> already be there and not to crash.

Only Dom0 may be able to prematurely access the device during "add".
Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
Hence I'm not sure I see the need for dealing with these.

>>   Removal of a vPCI device is the analogue
>> of hot-unplug on baremetal. That's not a "behind the backs of
>> everything" operation. Instead the host admin has to prepare the
>> device for removal, which will result in it being quiescent (which in
>> particular means no BAR adjustments anymore). The act of removing the
>> device from the system has as its virtual counterpart "xl pci-detach".
>> I think it ought to be in this context when pending requests get
>> drained, and an indicator be set that no further changes to that
>> device are permitted. This would mean invoking from
>> vpci_deassign_device() as added by patch 4, not from
>> vpci_remove_device(). This would yield removal of a device from the
>> host being independent of removal of a device from a guest.
>>
>> The need for vpci_remove_device() seems questionable in the first
>> place: Even for hot-unplug on the host it may be better to require a
>> pci-detach from (PVH) Dom0 before the actual device removal. This
>> would involve an adjustment to the de-assignment logic for the case
>> of no quarantining: We'd need to make sure explicit de-assignment
>> from Dom0 actually removes the device from there; right now
>> de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending
>> on quarantining mode).

As to this, I meanwhile think that add/remove can very well have Dom0
related vPCI init/teardown. But for DomU all of that should happen
during assign/de-assign. A device still assigned to a DomU simply
should never be subject to physical hot-unplug in the first place.

Jan
Oleksandr Andrushchenko Nov. 18, 2021, 8:54 a.m. UTC | #17
On 18.11.21 10:36, Jan Beulich wrote:
> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>
>> On 17.11.21 10:28, Jan Beulich wrote:
>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>> scheduled a delayed work for map/unmap operations for that device.
>>>> For example, the following scenario can illustrate the problem:
>>>>
>>>> pci_physdev_op
>>>>      pci_add_device
>>>>          init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>      iommu_add_device <- FAILS
>>>>      vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> leave_hypervisor_to_guest
>>>>      vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>
>>>> For the hardware domain we continue execution as the worse that
>>>> could happen is that MMIO mappings are left in place when the
>>>> device has been deassigned
>>>>
>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>> {un}map operation we need to destroy them, as we don't know in which
>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>
>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>> Thinking about it some more, I'm not convinced any of this is really
>>> needed in the presented form.
>> The intention of this patch was to handle error conditions which are
>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>> of initialization. So, I am trying to cancel all pending work which might
>> already be there and not to crash.
> Only Dom0 may be able to prematurely access the device during "add".
> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
> Hence I'm not sure I see the need for dealing with these.
Probably I don't follow you here. The issue I am facing is Dom0
related, e.g. Xen was not able to initialize during "add" and thus
wanted to clean up the leftovers. As the result the already
scheduled work crashes as it was not neither canceled nor interrupted
in some safe manner. So, this sounds like something we need to take
care of, thus this patch.

Another story, which I am getting more convinced now, is that
the proper locking should be a dedicated patch as it should not only
add locks as required by this patch, but also probably revisit the
existing locking schemes to be acceptable for new use-cases.
>
>>>    Removal of a vPCI device is the analogue
>>> of hot-unplug on baremetal. That's not a "behind the backs of
>>> everything" operation. Instead the host admin has to prepare the
>>> device for removal, which will result in it being quiescent (which in
>>> particular means no BAR adjustments anymore). The act of removing the
>>> device from the system has as its virtual counterpart "xl pci-detach".
>>> I think it ought to be in this context when pending requests get
>>> drained, and an indicator be set that no further changes to that
>>> device are permitted. This would mean invoking from
>>> vpci_deassign_device() as added by patch 4, not from
>>> vpci_remove_device(). This would yield removal of a device from the
>>> host being independent of removal of a device from a guest.
>>>
>>> The need for vpci_remove_device() seems questionable in the first
>>> place: Even for hot-unplug on the host it may be better to require a
>>> pci-detach from (PVH) Dom0 before the actual device removal. This
>>> would involve an adjustment to the de-assignment logic for the case
>>> of no quarantining: We'd need to make sure explicit de-assignment
>>> from Dom0 actually removes the device from there; right now
>>> de-assignment assumes "from DomU" and "to Dom0 or DomIO" (depending
>>> on quarantining mode).
> As to this, I meanwhile think that add/remove can very well have Dom0
> related vPCI init/teardown. But for DomU all of that should happen
> during assign/de-assign.
Yes, I agree. The model I also see is:
- for Dom0 we use add/remove
- for DomUs we use assign/de-assign
>   A device still assigned to a DomU simply
> should never be subject to physical hot-unplug in the first place.
Double that
>
> Jan
>
Thank you,
Oleksandr
Jan Beulich Nov. 18, 2021, 9:15 a.m. UTC | #18
On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
> On 18.11.21 10:36, Jan Beulich wrote:
>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>> For example, the following scenario can illustrate the problem:
>>>>>
>>>>> pci_physdev_op
>>>>>      pci_add_device
>>>>>          init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>      iommu_add_device <- FAILS
>>>>>      vpci_remove_device -> xfree(pdev->vpci)
>>>>>
>>>>> leave_hypervisor_to_guest
>>>>>      vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>
>>>>> For the hardware domain we continue execution as the worse that
>>>>> could happen is that MMIO mappings are left in place when the
>>>>> device has been deassigned
>>>>>
>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>
>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>> Thinking about it some more, I'm not convinced any of this is really
>>>> needed in the presented form.
>>> The intention of this patch was to handle error conditions which are
>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>> of initialization. So, I am trying to cancel all pending work which might
>>> already be there and not to crash.
>> Only Dom0 may be able to prematurely access the device during "add".
>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>> Hence I'm not sure I see the need for dealing with these.
> Probably I don't follow you here. The issue I am facing is Dom0
> related, e.g. Xen was not able to initialize during "add" and thus
> wanted to clean up the leftovers. As the result the already
> scheduled work crashes as it was not neither canceled nor interrupted
> in some safe manner. So, this sounds like something we need to take
> care of, thus this patch.

But my point was the question of why there would be any pending work
in the first place in this case, when we expect Dom0 to be well-behaved.

Jan
Oleksandr Andrushchenko Nov. 18, 2021, 9:32 a.m. UTC | #19
On 18.11.21 11:15, Jan Beulich wrote:
> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
>> On 18.11.21 10:36, Jan Beulich wrote:
>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>
>>>>>> pci_physdev_op
>>>>>>       pci_add_device
>>>>>>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>       iommu_add_device <- FAILS
>>>>>>       vpci_remove_device -> xfree(pdev->vpci)
>>>>>>
>>>>>> leave_hypervisor_to_guest
>>>>>>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>
>>>>>> For the hardware domain we continue execution as the worse that
>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>> device has been deassigned
>>>>>>
>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>>
>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>> Thinking about it some more, I'm not convinced any of this is really
>>>>> needed in the presented form.
>>>> The intention of this patch was to handle error conditions which are
>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>>> of initialization. So, I am trying to cancel all pending work which might
>>>> already be there and not to crash.
>>> Only Dom0 may be able to prematurely access the device during "add".
>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>>> Hence I'm not sure I see the need for dealing with these.
>> Probably I don't follow you here. The issue I am facing is Dom0
>> related, e.g. Xen was not able to initialize during "add" and thus
>> wanted to clean up the leftovers. As the result the already
>> scheduled work crashes as it was not neither canceled nor interrupted
>> in some safe manner. So, this sounds like something we need to take
>> care of, thus this patch.
> But my point was the question of why there would be any pending work
> in the first place in this case, when we expect Dom0 to be well-behaved.
I am not saying Dom0 misbehaves here. This is my real use-case
(as in the commit message):

pci_physdev_op
      pci_add_device
          init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
      iommu_add_device <- FAILS
      vpci_remove_device -> xfree(pdev->vpci)

leave_hypervisor_to_guest
      vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL

So, this made me implement the patch. Then we decided that it is also
possible that other vCPUs may also have some pending work and I
agreed that this is a good point and we want to remove all pending work
for all vCPUs.

So, if you doubt the patch and we still have the scenario above, what
would you suggest in order to make sure we do not crash?
> Jan
>
Thank you,
Oleksandr
Oleksandr Andrushchenko Nov. 18, 2021, 12:57 p.m. UTC | #20
Hi, Julien!

On 16.11.21 20:02, Julien Grall wrote:
> Hi Oleksandr,
>
> On 16/11/2021 14:24, Oleksandr Andrushchenko wrote:
>>
>>
>> On 16.11.21 16:12, Jan Beulich wrote:
>>> On 16.11.2021 14:41, Oleksandr Andrushchenko wrote:
>>>>
>>>> On 16.11.21 10:23, Oleksandr Andrushchenko wrote:
>>>>> On 16.11.21 10:01, Jan Beulich wrote:
>>>>>> On 16.11.2021 08:32, Oleksandr Andrushchenko wrote:
>>>>>>> On 15.11.21 18:56, Jan Beulich wrote:
>>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>>
>>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>>
>>>>>>>>> pci_physdev_op
>>>>>>>>>         pci_add_device
>>>>>>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>>         iommu_add_device <- FAILS
>>>>>>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>>
>>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>>
>>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>>> device has been deassigned
>>>>>>>> Is continuing safe in this case? I.e. isn't there the risk of a NULL
>>>>>>>> deref?
>>>>>>> I think it is safe to continue
>>>>>> And why do you think so? I.e. why is there no race for Dom0 when there
>>>>>> is one for DomU?
>>>>> Well, then we need to use a lock to synchronize the two.
>>>>> I guess this needs to be pci devs lock unfortunately
>>>> The parties involved in deferred work and its cancellation:
>>>>
>>>> MMIO trap -> vpci_write -> vpci_cmd_write -> modify_bars -> defer_map
>>>>
>>>> Arm: leave_hypervisor_to_guest -> check_for_vcpu_work -> vpci_process_pending
>>>>
>>>> x86: two places -> hvm_do_resume -> vpci_process_pending
>>>>
>>>> So, both defer_map and vpci_process_pending need to be synchronized with
>>>> pcidevs_{lock|unlock).
>>> If I was an Arm maintainer, I'm afraid I would object to the pcidevs lock
>>> getting used in leave_hypervisor_to_guest.
>> I do agree this is really not good, but it seems I am limited in choices.
>> @Stefano, @Julien, do you see any better way of doing that?
>
> I agree with Jan about using the pcidevs_{lock|unlock}. The lock is not fine-grained enough for been call in vpci_process_pending().
>
> I haven't yet looked at the rest of the series to be able to suggest the exact lock. But we at least want a per-domain spinlock.
>
>>
>> We were thinking about introducing a dedicated lock for vpci [1],
>> but finally decided to use pcidevs_lock for now
>
> Skimming through the thread, you decided to use pcidevs_lock because it was simpler and sufficient for the use case discussed back then. Now, we have a use case where it would be a problem to use pcidevs_lock. So I think the extra complexity is justified.
I would like to understand what is this lock so I can implement that properly.
We have the following options as I can see:

1. pcidevs_{lock|unlock} - considered too heavy, per host
2. pdev->vpci->lock - better, but still heavy, per PCI device
3. We may convert pdev->vpci->lock into r/w lock
4. We may introduce a specific lock

To better understand the scope of the lock:
1. MMIO trap handlers (vpci_{read|write} - already protected with pdev->vpci->lock
2. vpci_process_pending (SOFTIRQ context)
3. Hypercalls which call pci_{add|remove|assign|deassign}_device
4. @Roger, did I miss something?

And I feel that this needs a dedicated patch for that: I am not sure it is a
good idea to add this locking change into this patch which seems not so relevant
>
> Cheers,
>
Jan Beulich Nov. 18, 2021, 1:25 p.m. UTC | #21
On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
> 
> 
> On 18.11.21 11:15, Jan Beulich wrote:
>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 10:36, Jan Beulich wrote:
>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>
>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>
>>>>>>> pci_physdev_op
>>>>>>>       pci_add_device
>>>>>>>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>       iommu_add_device <- FAILS
>>>>>>>       vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>
>>>>>>> leave_hypervisor_to_guest
>>>>>>>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>
>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>> device has been deassigned
>>>>>>>
>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>>>
>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>> Thinking about it some more, I'm not convinced any of this is really
>>>>>> needed in the presented form.
>>>>> The intention of this patch was to handle error conditions which are
>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>>>> of initialization. So, I am trying to cancel all pending work which might
>>>>> already be there and not to crash.
>>>> Only Dom0 may be able to prematurely access the device during "add".
>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>>>> Hence I'm not sure I see the need for dealing with these.
>>> Probably I don't follow you here. The issue I am facing is Dom0
>>> related, e.g. Xen was not able to initialize during "add" and thus
>>> wanted to clean up the leftovers. As the result the already
>>> scheduled work crashes as it was not neither canceled nor interrupted
>>> in some safe manner. So, this sounds like something we need to take
>>> care of, thus this patch.
>> But my point was the question of why there would be any pending work
>> in the first place in this case, when we expect Dom0 to be well-behaved.
> I am not saying Dom0 misbehaves here. This is my real use-case
> (as in the commit message):
> 
> pci_physdev_op
>       pci_add_device
>           init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>       iommu_add_device <- FAILS
>       vpci_remove_device -> xfree(pdev->vpci)
> 
> leave_hypervisor_to_guest
>       vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL

First of all I'm sorry for having lost track of that particular case in
the course of the discussion.

I wonder though whether that's something we really need to take care of.
At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but
use apply_map() instead. I wonder whether this wouldn't be appropriate
generally in the context of init_bars() when used for Dom0 (not sure
whether init_bars() would find some form of use for DomU-s as well).
This is even more so as it would better be the exception that devices
discovered post-boot start out with memory decoding enabled (which is a
prereq for modify_bars() to be called in the first place).

Jan
Oleksandr Andrushchenko Nov. 18, 2021, 1:48 p.m. UTC | #22
On 18.11.21 15:25, Jan Beulich wrote:
> On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
>>
>> On 18.11.21 11:15, Jan Beulich wrote:
>>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 10:36, Jan Beulich wrote:
>>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>>>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>
>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>
>>>>>>>> pci_physdev_op
>>>>>>>>        pci_add_device
>>>>>>>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>        iommu_add_device <- FAILS
>>>>>>>>        vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>
>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>
>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>> device has been deassigned
>>>>>>>>
>>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>>>>
>>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>> Thinking about it some more, I'm not convinced any of this is really
>>>>>>> needed in the presented form.
>>>>>> The intention of this patch was to handle error conditions which are
>>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>>>>> of initialization. So, I am trying to cancel all pending work which might
>>>>>> already be there and not to crash.
>>>>> Only Dom0 may be able to prematurely access the device during "add".
>>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>>>>> Hence I'm not sure I see the need for dealing with these.
>>>> Probably I don't follow you here. The issue I am facing is Dom0
>>>> related, e.g. Xen was not able to initialize during "add" and thus
>>>> wanted to clean up the leftovers. As the result the already
>>>> scheduled work crashes as it was not neither canceled nor interrupted
>>>> in some safe manner. So, this sounds like something we need to take
>>>> care of, thus this patch.
>>> But my point was the question of why there would be any pending work
>>> in the first place in this case, when we expect Dom0 to be well-behaved.
>> I am not saying Dom0 misbehaves here. This is my real use-case
>> (as in the commit message):
>>
>> pci_physdev_op
>>        pci_add_device
>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>        iommu_add_device <- FAILS
>>        vpci_remove_device -> xfree(pdev->vpci)
>>
>> leave_hypervisor_to_guest
>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> First of all I'm sorry for having lost track of that particular case in
> the course of the discussion.
No problem, I see on the ML how much you review every day...
>
> I wonder though whether that's something we really need to take care of.
> At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but
> use apply_map() instead. I wonder whether this wouldn't be appropriate
> generally in the context of init_bars() when used for Dom0 (not sure
> whether init_bars() would find some form of use for DomU-s as well).
> This is even more so as it would better be the exception that devices
> discovered post-boot start out with memory decoding enabled (which is a
> prereq for modify_bars() to be called in the first place).
Well, first of all init_bars is going to be used for DomUs as well:
that was discussed previously and is reflected in this series.

But the real use-case for the deferred mapping would be the one
from PCI_COMMAND register write emulation:

void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
                     uint32_t cmd, void *data)
{
[snip]
         modify_bars(pdev, cmd, false);

which in turn calls defer_map.

I believe Roger did that for a good reason not to stall Xen while doing
map/unmap (which may take quite some time) and moved that to
vpci_process_pending and SOFTIRQ context.
>
> Jan
>
Thank you,
Oleksandr
Roger Pau Monné Nov. 18, 2021, 2:04 p.m. UTC | #23
Sorry, I've been quite busy with other staff.

On Thu, Nov 18, 2021 at 01:48:06PM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 18.11.21 15:25, Jan Beulich wrote:
> > On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
> >>
> >> On 18.11.21 11:15, Jan Beulich wrote:
> >>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
> >>>> On 18.11.21 10:36, Jan Beulich wrote:
> >>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
> >>>>>> On 17.11.21 10:28, Jan Beulich wrote:
> >>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
> >>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>>>>>>
> >>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
> >>>>>>>> scheduled a delayed work for map/unmap operations for that device.
> >>>>>>>> For example, the following scenario can illustrate the problem:
> >>>>>>>>
> >>>>>>>> pci_physdev_op
> >>>>>>>>        pci_add_device
> >>>>>>>>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
> >>>>>>>>        iommu_add_device <- FAILS
> >>>>>>>>        vpci_remove_device -> xfree(pdev->vpci)
> >>>>>>>>
> >>>>>>>> leave_hypervisor_to_guest
> >>>>>>>>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> >>>>>>>>
> >>>>>>>> For the hardware domain we continue execution as the worse that
> >>>>>>>> could happen is that MMIO mappings are left in place when the
> >>>>>>>> device has been deassigned
> >>>>>>>>
> >>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
> >>>>>>>> {un}map operation we need to destroy them, as we don't know in which
> >>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
> >>>>>>>> DomUs as they won't be allowed to call pci_add_device.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >>>>>>> Thinking about it some more, I'm not convinced any of this is really
> >>>>>>> needed in the presented form.
> >>>>>> The intention of this patch was to handle error conditions which are
> >>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
> >>>>>> of initialization. So, I am trying to cancel all pending work which might
> >>>>>> already be there and not to crash.
> >>>>> Only Dom0 may be able to prematurely access the device during "add".
> >>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
> >>>>> Hence I'm not sure I see the need for dealing with these.
> >>>> Probably I don't follow you here. The issue I am facing is Dom0
> >>>> related, e.g. Xen was not able to initialize during "add" and thus
> >>>> wanted to clean up the leftovers. As the result the already
> >>>> scheduled work crashes as it was not neither canceled nor interrupted
> >>>> in some safe manner. So, this sounds like something we need to take
> >>>> care of, thus this patch.
> >>> But my point was the question of why there would be any pending work
> >>> in the first place in this case, when we expect Dom0 to be well-behaved.
> >> I am not saying Dom0 misbehaves here. This is my real use-case
> >> (as in the commit message):
> >>
> >> pci_physdev_op
> >>        pci_add_device
> >>            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
> >>        iommu_add_device <- FAILS
> >>        vpci_remove_device -> xfree(pdev->vpci)
> >>
> >> leave_hypervisor_to_guest
> >>        vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
> > First of all I'm sorry for having lost track of that particular case in
> > the course of the discussion.
> No problem, I see on the ML how much you review every day...
> >
> > I wonder though whether that's something we really need to take care of.
> > At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but
> > use apply_map() instead. I wonder whether this wouldn't be appropriate
> > generally in the context of init_bars() when used for Dom0 (not sure
> > whether init_bars() would find some form of use for DomU-s as well).
> > This is even more so as it would better be the exception that devices
> > discovered post-boot start out with memory decoding enabled (which is a
> > prereq for modify_bars() to be called in the first place).
> Well, first of all init_bars is going to be used for DomUs as well:
> that was discussed previously and is reflected in this series.
> 
> But the real use-case for the deferred mapping would be the one
> from PCI_COMMAND register write emulation:
> 
> void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>                      uint32_t cmd, void *data)
> {
> [snip]
>          modify_bars(pdev, cmd, false);
> 
> which in turn calls defer_map.
> 
> I believe Roger did that for a good reason not to stall Xen while doing
> map/unmap (which may take quite some time) and moved that to
> vpci_process_pending and SOFTIRQ context.

Indeed. In the physdevop failure case this comes from an hypercall
context, so maybe you could do the mapping in place using hypercall
continuations if required. Not sure how complex that would be,
compared to just deferring to guest entry point and then dealing with
the possible cleanup on failure.

Thanks, Roger.
Oleksandr Andrushchenko Nov. 18, 2021, 2:14 p.m. UTC | #24
On 18.11.21 16:04, Roger Pau Monné wrote:
> Sorry, I've been quite busy with other staff.
>
> On Thu, Nov 18, 2021 at 01:48:06PM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 18.11.21 15:25, Jan Beulich wrote:
>>> On 18.11.2021 10:32, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 11:15, Jan Beulich wrote:
>>>>> On 18.11.2021 09:54, Oleksandr Andrushchenko wrote:
>>>>>> On 18.11.21 10:36, Jan Beulich wrote:
>>>>>>> On 18.11.2021 08:49, Oleksandr Andrushchenko wrote:
>>>>>>>> On 17.11.21 10:28, Jan Beulich wrote:
>>>>>>>>> On 05.11.2021 07:56, Oleksandr Andrushchenko wrote:
>>>>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>>>
>>>>>>>>>> When a vPCI is removed for a PCI device it is possible that we have
>>>>>>>>>> scheduled a delayed work for map/unmap operations for that device.
>>>>>>>>>> For example, the following scenario can illustrate the problem:
>>>>>>>>>>
>>>>>>>>>> pci_physdev_op
>>>>>>>>>>         pci_add_device
>>>>>>>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>>>>>         iommu_add_device <- FAILS
>>>>>>>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>>>>>>>
>>>>>>>>>> leave_hypervisor_to_guest
>>>>>>>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>>>>>>>>>
>>>>>>>>>> For the hardware domain we continue execution as the worse that
>>>>>>>>>> could happen is that MMIO mappings are left in place when the
>>>>>>>>>> device has been deassigned
>>>>>>>>>>
>>>>>>>>>> For unprivileged domains that get a failure in the middle of a vPCI
>>>>>>>>>> {un}map operation we need to destroy them, as we don't know in which
>>>>>>>>>> state the p2m is. This can only happen in vpci_process_pending for
>>>>>>>>>> DomUs as they won't be allowed to call pci_add_device.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>>> Thinking about it some more, I'm not convinced any of this is really
>>>>>>>>> needed in the presented form.
>>>>>>>> The intention of this patch was to handle error conditions which are
>>>>>>>> abnormal, e.g. when iommu_add_device fails and we are in the middle
>>>>>>>> of initialization. So, I am trying to cancel all pending work which might
>>>>>>>> already be there and not to crash.
>>>>>>> Only Dom0 may be able to prematurely access the device during "add".
>>>>>>> Yet unlike for DomU-s we generally expect Dom0 to be well-behaved.
>>>>>>> Hence I'm not sure I see the need for dealing with these.
>>>>>> Probably I don't follow you here. The issue I am facing is Dom0
>>>>>> related, e.g. Xen was not able to initialize during "add" and thus
>>>>>> wanted to clean up the leftovers. As the result the already
>>>>>> scheduled work crashes as it was not neither canceled nor interrupted
>>>>>> in some safe manner. So, this sounds like something we need to take
>>>>>> care of, thus this patch.
>>>>> But my point was the question of why there would be any pending work
>>>>> in the first place in this case, when we expect Dom0 to be well-behaved.
>>>> I am not saying Dom0 misbehaves here. This is my real use-case
>>>> (as in the commit message):
>>>>
>>>> pci_physdev_op
>>>>         pci_add_device
>>>>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>         iommu_add_device <- FAILS
>>>>         vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> leave_hypervisor_to_guest
>>>>         vpci_process_pending: v->vpci.mem != NULL; v->vpci.pdev->vpci == NULL
>>> First of all I'm sorry for having lost track of that particular case in
>>> the course of the discussion.
>> No problem, I see on the ML how much you review every day...
>>> I wonder though whether that's something we really need to take care of.
>>> At boot (on x86) modify_bars() wouldn't call defer_map() anyway, but
>>> use apply_map() instead. I wonder whether this wouldn't be appropriate
>>> generally in the context of init_bars() when used for Dom0 (not sure
>>> whether init_bars() would find some form of use for DomU-s as well).
>>> This is even more so as it would better be the exception that devices
>>> discovered post-boot start out with memory decoding enabled (which is a
>>> prereq for modify_bars() to be called in the first place).
>> Well, first of all init_bars is going to be used for DomUs as well:
>> that was discussed previously and is reflected in this series.
>>
>> But the real use-case for the deferred mapping would be the one
>> from PCI_COMMAND register write emulation:
>>
>> void vpci_cmd_write(const struct pci_dev *pdev, unsigned int reg,
>>                       uint32_t cmd, void *data)
>> {
>> [snip]
>>           modify_bars(pdev, cmd, false);
>>
>> which in turn calls defer_map.
>>
>> I believe Roger did that for a good reason not to stall Xen while doing
>> map/unmap (which may take quite some time) and moved that to
>> vpci_process_pending and SOFTIRQ context.
> Indeed. In the physdevop failure case this comes from an hypercall
> context, so maybe you could do the mapping in place using hypercall
> continuations if required. Not sure how complex that would be,
> compared to just deferring to guest entry point and then dealing with
> the possible cleanup on failure.
This will solve one part of the equation:

pci_physdev_op
        pci_add_device
            init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
        iommu_add_device <- FAILS
        vpci_remove_device -> xfree(pdev->vpci)

But what about the other one, e.g. vpci_process_pending is triggered in
parallel with PCI device de-assign for example?

>
> Thanks, Roger.
Thank you,
Oleksandr
Jan Beulich Nov. 18, 2021, 2:35 p.m. UTC | #25
On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
> On 18.11.21 16:04, Roger Pau Monné wrote:
>> Indeed. In the physdevop failure case this comes from an hypercall
>> context, so maybe you could do the mapping in place using hypercall
>> continuations if required. Not sure how complex that would be,
>> compared to just deferring to guest entry point and then dealing with
>> the possible cleanup on failure.
> This will solve one part of the equation:
> 
> pci_physdev_op
>         pci_add_device
>             init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>         iommu_add_device <- FAILS
>         vpci_remove_device -> xfree(pdev->vpci)
> 
> But what about the other one, e.g. vpci_process_pending is triggered in
> parallel with PCI device de-assign for example?

Well, that's again in hypercall context, so using hypercall continuations
may again be an option. Of course at the point a de-assign is initiated,
you "only" need to drain requests (for that device, but that's unlikely
to be worthwhile optimizing for), while ensuring no new requests can be
issued. Again, for the device in question, but here this is relevant -
a flag may want setting to refuse all further requests. Or maybe the
register handling hooks may want tearing down before draining pending
BAR mapping requests; without the hooks in place no new such requests
can possibly appear.

Jan
Oleksandr Andrushchenko Nov. 18, 2021, 3:11 p.m. UTC | #26
On 18.11.21 16:35, Jan Beulich wrote:
> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>> Indeed. In the physdevop failure case this comes from an hypercall
>>> context, so maybe you could do the mapping in place using hypercall
>>> continuations if required. Not sure how complex that would be,
>>> compared to just deferring to guest entry point and then dealing with
>>> the possible cleanup on failure.
>> This will solve one part of the equation:
>>
>> pci_physdev_op
>>          pci_add_device
>>              init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>          iommu_add_device <- FAILS
>>          vpci_remove_device -> xfree(pdev->vpci)
>>
>> But what about the other one, e.g. vpci_process_pending is triggered in
>> parallel with PCI device de-assign for example?
> Well, that's again in hypercall context, so using hypercall continuations
> may again be an option. Of course at the point a de-assign is initiated,
> you "only" need to drain requests (for that device, but that's unlikely
> to be worthwhile optimizing for), while ensuring no new requests can be
> issued. Again, for the device in question, but here this is relevant -
> a flag may want setting to refuse all further requests. Or maybe the
> register handling hooks may want tearing down before draining pending
> BAR mapping requests; without the hooks in place no new such requests
> can possibly appear.
This can be probably even easier to solve as we were talking about
pausing all vCPUs:

void vpci_cancel_pending(const struct pci_dev *pdev)
{
     struct domain *d = pdev->domain;
     struct vcpu *v;
     int rc;

     while ( (rc = domain_pause_except_self(d)) == -ERESTART )
         cpu_relax();

     if ( rc )
         printk(XENLOG_G_ERR
                "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
                &pdev->sbdf, pdev->domain, rc);

     for_each_vcpu ( d, v )
     {
         if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )

This will prevent all vCPUs to run, but current, thus making it impossible
to run vpci_process_pending in parallel with any hypercall.
So, even without locking in vpci_process_pending the above should
be enough.
The only concern here is that domain_pause_except_self may return
the error code we somehow need to handle...
> Jan
>
Thank you,
Oleksandr
Jan Beulich Nov. 18, 2021, 3:16 p.m. UTC | #27
On 18.11.2021 16:11, Oleksandr Andrushchenko wrote:
> 
> 
> On 18.11.21 16:35, Jan Beulich wrote:
>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>>> Indeed. In the physdevop failure case this comes from an hypercall
>>>> context, so maybe you could do the mapping in place using hypercall
>>>> continuations if required. Not sure how complex that would be,
>>>> compared to just deferring to guest entry point and then dealing with
>>>> the possible cleanup on failure.
>>> This will solve one part of the equation:
>>>
>>> pci_physdev_op
>>>          pci_add_device
>>>              init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>          iommu_add_device <- FAILS
>>>          vpci_remove_device -> xfree(pdev->vpci)
>>>
>>> But what about the other one, e.g. vpci_process_pending is triggered in
>>> parallel with PCI device de-assign for example?
>> Well, that's again in hypercall context, so using hypercall continuations
>> may again be an option. Of course at the point a de-assign is initiated,
>> you "only" need to drain requests (for that device, but that's unlikely
>> to be worthwhile optimizing for), while ensuring no new requests can be
>> issued. Again, for the device in question, but here this is relevant -
>> a flag may want setting to refuse all further requests. Or maybe the
>> register handling hooks may want tearing down before draining pending
>> BAR mapping requests; without the hooks in place no new such requests
>> can possibly appear.
> This can be probably even easier to solve as we were talking about
> pausing all vCPUs:

I have to admit I'm not sure. It might be easier, but it may also be
less desirable.

> void vpci_cancel_pending(const struct pci_dev *pdev)
> {
>      struct domain *d = pdev->domain;
>      struct vcpu *v;
>      int rc;
> 
>      while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>          cpu_relax();
> 
>      if ( rc )
>          printk(XENLOG_G_ERR
>                 "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>                 &pdev->sbdf, pdev->domain, rc);
> 
>      for_each_vcpu ( d, v )
>      {
>          if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
> 
> This will prevent all vCPUs to run, but current, thus making it impossible
> to run vpci_process_pending in parallel with any hypercall.
> So, even without locking in vpci_process_pending the above should
> be enough.
> The only concern here is that domain_pause_except_self may return
> the error code we somehow need to handle...

Not just this. The -ERESTART handling isn't appropriate this way
either. For the moment I can't help thinking that draining would
be preferable over canceling.

Jan
Oleksandr Andrushchenko Nov. 18, 2021, 3:21 p.m. UTC | #28
On 18.11.21 17:16, Jan Beulich wrote:
> On 18.11.2021 16:11, Oleksandr Andrushchenko wrote:
>>
>> On 18.11.21 16:35, Jan Beulich wrote:
>>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>>>> Indeed. In the physdevop failure case this comes from an hypercall
>>>>> context, so maybe you could do the mapping in place using hypercall
>>>>> continuations if required. Not sure how complex that would be,
>>>>> compared to just deferring to guest entry point and then dealing with
>>>>> the possible cleanup on failure.
>>>> This will solve one part of the equation:
>>>>
>>>> pci_physdev_op
>>>>           pci_add_device
>>>>               init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>           iommu_add_device <- FAILS
>>>>           vpci_remove_device -> xfree(pdev->vpci)
>>>>
>>>> But what about the other one, e.g. vpci_process_pending is triggered in
>>>> parallel with PCI device de-assign for example?
>>> Well, that's again in hypercall context, so using hypercall continuations
>>> may again be an option. Of course at the point a de-assign is initiated,
>>> you "only" need to drain requests (for that device, but that's unlikely
>>> to be worthwhile optimizing for), while ensuring no new requests can be
>>> issued. Again, for the device in question, but here this is relevant -
>>> a flag may want setting to refuse all further requests. Or maybe the
>>> register handling hooks may want tearing down before draining pending
>>> BAR mapping requests; without the hooks in place no new such requests
>>> can possibly appear.
>> This can be probably even easier to solve as we were talking about
>> pausing all vCPUs:
> I have to admit I'm not sure. It might be easier, but it may also be
> less desirable.
>
>> void vpci_cancel_pending(const struct pci_dev *pdev)
>> {
>>       struct domain *d = pdev->domain;
>>       struct vcpu *v;
>>       int rc;
>>
>>       while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>           cpu_relax();
>>
>>       if ( rc )
>>           printk(XENLOG_G_ERR
>>                  "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>                  &pdev->sbdf, pdev->domain, rc);
>>
>>       for_each_vcpu ( d, v )
>>       {
>>           if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>
>> This will prevent all vCPUs to run, but current, thus making it impossible
>> to run vpci_process_pending in parallel with any hypercall.
>> So, even without locking in vpci_process_pending the above should
>> be enough.
>> The only concern here is that domain_pause_except_self may return
>> the error code we somehow need to handle...
> Not just this. The -ERESTART handling isn't appropriate this way
> either.
Are you talking about cpu_relax()?
>   For the moment I can't help thinking that draining would
> be preferable over canceling.
Given that cancellation is going to happen on error path or
on device de-assign/remove I think this can be acceptable.
Any reason why not?
>
> Jan
>
Thank you,
Oleksandr
Jan Beulich Nov. 18, 2021, 3:41 p.m. UTC | #29
On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
> On 18.11.21 17:16, Jan Beulich wrote:
>> On 18.11.2021 16:11, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 16:35, Jan Beulich wrote:
>>>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>>>>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>>>>> Indeed. In the physdevop failure case this comes from an hypercall
>>>>>> context, so maybe you could do the mapping in place using hypercall
>>>>>> continuations if required. Not sure how complex that would be,
>>>>>> compared to just deferring to guest entry point and then dealing with
>>>>>> the possible cleanup on failure.
>>>>> This will solve one part of the equation:
>>>>>
>>>>> pci_physdev_op
>>>>>           pci_add_device
>>>>>               init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>           iommu_add_device <- FAILS
>>>>>           vpci_remove_device -> xfree(pdev->vpci)
>>>>>
>>>>> But what about the other one, e.g. vpci_process_pending is triggered in
>>>>> parallel with PCI device de-assign for example?
>>>> Well, that's again in hypercall context, so using hypercall continuations
>>>> may again be an option. Of course at the point a de-assign is initiated,
>>>> you "only" need to drain requests (for that device, but that's unlikely
>>>> to be worthwhile optimizing for), while ensuring no new requests can be
>>>> issued. Again, for the device in question, but here this is relevant -
>>>> a flag may want setting to refuse all further requests. Or maybe the
>>>> register handling hooks may want tearing down before draining pending
>>>> BAR mapping requests; without the hooks in place no new such requests
>>>> can possibly appear.
>>> This can be probably even easier to solve as we were talking about
>>> pausing all vCPUs:
>> I have to admit I'm not sure. It might be easier, but it may also be
>> less desirable.
>>
>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>> {
>>>       struct domain *d = pdev->domain;
>>>       struct vcpu *v;
>>>       int rc;
>>>
>>>       while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>>           cpu_relax();
>>>
>>>       if ( rc )
>>>           printk(XENLOG_G_ERR
>>>                  "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>>                  &pdev->sbdf, pdev->domain, rc);
>>>
>>>       for_each_vcpu ( d, v )
>>>       {
>>>           if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>>
>>> This will prevent all vCPUs to run, but current, thus making it impossible
>>> to run vpci_process_pending in parallel with any hypercall.
>>> So, even without locking in vpci_process_pending the above should
>>> be enough.
>>> The only concern here is that domain_pause_except_self may return
>>> the error code we somehow need to handle...
>> Not just this. The -ERESTART handling isn't appropriate this way
>> either.
> Are you talking about cpu_relax()?

I'm talking about that spin-waiting loop as a whole.

>>   For the moment I can't help thinking that draining would
>> be preferable over canceling.
> Given that cancellation is going to happen on error path or
> on device de-assign/remove I think this can be acceptable.
> Any reason why not?

It would seem to me that the correctness of a draining approach is
going to be easier to prove than that of a canceling one, where I
expect races to be a bigger risk. Especially something that gets
executed infrequently, if ever (error paths in particular), knowing
things are well from testing isn't typically possible.

Jan
Oleksandr Andrushchenko Nov. 18, 2021, 3:46 p.m. UTC | #30
On 18.11.21 17:41, Jan Beulich wrote:
> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>> On 18.11.21 17:16, Jan Beulich wrote:
>>> On 18.11.2021 16:11, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 16:35, Jan Beulich wrote:
>>>>> On 18.11.2021 15:14, Oleksandr Andrushchenko wrote:
>>>>>> On 18.11.21 16:04, Roger Pau Monné wrote:
>>>>>>> Indeed. In the physdevop failure case this comes from an hypercall
>>>>>>> context, so maybe you could do the mapping in place using hypercall
>>>>>>> continuations if required. Not sure how complex that would be,
>>>>>>> compared to just deferring to guest entry point and then dealing with
>>>>>>> the possible cleanup on failure.
>>>>>> This will solve one part of the equation:
>>>>>>
>>>>>> pci_physdev_op
>>>>>>            pci_add_device
>>>>>>                init_bars -> modify_bars -> defer_map -> raise_softirq(SCHEDULE_SOFTIRQ)
>>>>>>            iommu_add_device <- FAILS
>>>>>>            vpci_remove_device -> xfree(pdev->vpci)
>>>>>>
>>>>>> But what about the other one, e.g. vpci_process_pending is triggered in
>>>>>> parallel with PCI device de-assign for example?
>>>>> Well, that's again in hypercall context, so using hypercall continuations
>>>>> may again be an option. Of course at the point a de-assign is initiated,
>>>>> you "only" need to drain requests (for that device, but that's unlikely
>>>>> to be worthwhile optimizing for), while ensuring no new requests can be
>>>>> issued. Again, for the device in question, but here this is relevant -
>>>>> a flag may want setting to refuse all further requests. Or maybe the
>>>>> register handling hooks may want tearing down before draining pending
>>>>> BAR mapping requests; without the hooks in place no new such requests
>>>>> can possibly appear.
>>>> This can be probably even easier to solve as we were talking about
>>>> pausing all vCPUs:
>>> I have to admit I'm not sure. It might be easier, but it may also be
>>> less desirable.
>>>
>>>> void vpci_cancel_pending(const struct pci_dev *pdev)
>>>> {
>>>>        struct domain *d = pdev->domain;
>>>>        struct vcpu *v;
>>>>        int rc;
>>>>
>>>>        while ( (rc = domain_pause_except_self(d)) == -ERESTART )
>>>>            cpu_relax();
>>>>
>>>>        if ( rc )
>>>>            printk(XENLOG_G_ERR
>>>>                   "Failed to pause vCPUs while canceling vPCI map/unmap for %pp %pd: %d\n",
>>>>                   &pdev->sbdf, pdev->domain, rc);
>>>>
>>>>        for_each_vcpu ( d, v )
>>>>        {
>>>>            if ( v->vpci.map_pending && (v->vpci.pdev == pdev) )
>>>>
>>>> This will prevent all vCPUs to run, but current, thus making it impossible
>>>> to run vpci_process_pending in parallel with any hypercall.
>>>> So, even without locking in vpci_process_pending the above should
>>>> be enough.
>>>> The only concern here is that domain_pause_except_self may return
>>>> the error code we somehow need to handle...
>>> Not just this. The -ERESTART handling isn't appropriate this way
>>> either.
>> Are you talking about cpu_relax()?
> I'm talking about that spin-waiting loop as a whole.
>
>>>    For the moment I can't help thinking that draining would
>>> be preferable over canceling.
>> Given that cancellation is going to happen on error path or
>> on device de-assign/remove I think this can be acceptable.
>> Any reason why not?
> It would seem to me that the correctness of a draining approach is
> going to be easier to prove than that of a canceling one, where I
> expect races to be a bigger risk. Especially something that gets
> executed infrequently, if ever (error paths in particular), knowing
> things are well from testing isn't typically possible.
Could you please then give me a hint how to do that:
1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
2. We have de-assign/remove on vCPU1

How do we drain that? Do you mean some atomic variable to be
used in vpci_process_pending to flag it is running and de-assign/remove
needs to wait and spinning checking that?
>
> Jan
>
>
Thank you,
Oleksandr
Jan Beulich Nov. 18, 2021, 3:53 p.m. UTC | #31
On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
> On 18.11.21 17:41, Jan Beulich wrote:
>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>>    For the moment I can't help thinking that draining would
>>>> be preferable over canceling.
>>> Given that cancellation is going to happen on error path or
>>> on device de-assign/remove I think this can be acceptable.
>>> Any reason why not?
>> It would seem to me that the correctness of a draining approach is
>> going to be easier to prove than that of a canceling one, where I
>> expect races to be a bigger risk. Especially something that gets
>> executed infrequently, if ever (error paths in particular), knowing
>> things are well from testing isn't typically possible.
> Could you please then give me a hint how to do that:
> 1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
> 2. We have de-assign/remove on vCPU1
> 
> How do we drain that? Do you mean some atomic variable to be
> used in vpci_process_pending to flag it is running and de-assign/remove
> needs to wait and spinning checking that?

First of all let's please keep remove and de-assign separate. I think we
have largely reached agreement that remove may need handling differently,
for being a Dom0-only operation.

As to draining during de-assign: I did suggest before that removing the
register handling hooks first would guarantee no new requests to appear.
Then it should be merely a matter of using hypercall continuations until
the respective domain has no pending requests anymore for the device in
question. Some locking (or lock barrier) may of course be needed to
make sure another CPU isn't just about to pend a new request.

Jan
Oleksandr Andrushchenko Nov. 19, 2021, 12:34 p.m. UTC | #32
Hi, Roger, Jan!

On 18.11.21 17:53, Jan Beulich wrote:
> On 18.11.2021 16:46, Oleksandr Andrushchenko wrote:
>> On 18.11.21 17:41, Jan Beulich wrote:
>>> On 18.11.2021 16:21, Oleksandr Andrushchenko wrote:
>>>> On 18.11.21 17:16, Jan Beulich wrote:
>>>>>     For the moment I can't help thinking that draining would
>>>>> be preferable over canceling.
>>>> Given that cancellation is going to happen on error path or
>>>> on device de-assign/remove I think this can be acceptable.
>>>> Any reason why not?
>>> It would seem to me that the correctness of a draining approach is
>>> going to be easier to prove than that of a canceling one, where I
>>> expect races to be a bigger risk. Especially something that gets
>>> executed infrequently, if ever (error paths in particular), knowing
>>> things are well from testing isn't typically possible.
>> Could you please then give me a hint how to do that:
>> 1. We have scheduled SOFTIRQ on vCPU0 and it is about to touch pdev->vpci
>> 2. We have de-assign/remove on vCPU1
>>
>> How do we drain that? Do you mean some atomic variable to be
>> used in vpci_process_pending to flag it is running and de-assign/remove
>> needs to wait and spinning checking that?
> First of all let's please keep remove and de-assign separate. I think we
> have largely reached agreement that remove may need handling differently,
> for being a Dom0-only operation.
>
> As to draining during de-assign: I did suggest before that removing the
> register handling hooks first would guarantee no new requests to appear.
> Then it should be merely a matter of using hypercall continuations until
> the respective domain has no pending requests anymore for the device in
> question. Some locking (or lock barrier) may of course be needed to
> make sure another CPU isn't just about to pend a new request.
>
> Jan
>
>
Too long, but please read.

The below is some simplified analysis of what is happening with
respect to deferred mapping. First we start from looking at what
hypercals are used which may run in parallel with vpci_process_pending,
which lock they hold:

1. do_physdev_op(PHYSDEVOP_pci_device_add): failure during PHYSDEVOP_pci_device_add
===============================================================================
   pci_physdev_op: <- no hypercall_create_continuation
     pci_add_device  <- pcidevs_lock()
       vpci_add_handlers
         init_bars
           cmd = pci_conf_read16(pdev->sbdf, PCI_COMMAND);
             modify_bars <- if cmd & PCI_COMMAND_MEMORY
               struct rangeset *mem = rangeset_new(NULL, NULL, 0);

               if ( system_state < SYS_STATE_active ) <- Dom0 is being created
                  return apply_map(pdev->domain, pdev, mem, cmd);

               defer_map(dev->domain, dev, mem, cmd, rom_only); < Dom0 is running
                 curr->vpci.pdev = pdev;
                 curr->vpci.mem = mem;

       ret = iommu_add_device(pdev); <- FAIL
       if ( ret )
           vpci_remove_device
             remove vPCI register handlers
             xfree(pdev->vpci);
             pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                                   scheduled and yet to run

2. do_physdev_op(PHYSDEVOP_pci_device_remove)
===============================================================================
   pci_physdev_op: <- no hypercall_create_continuation
     pci_remove_device <- pcidevs_lock()
       vpci_remove_device
         pdev->vpci = NULL; <- this will crash vpci_process_pending if it was
                               scheduled and yet to run

3. iommu_do_pci_domctl(XEN_DOMCTL_assign_device)
===============================================================================
case XEN_DOMCTL_assign_device <- pcidevs_lock();
   ret = assign_device(d, seg, bus, devfn, flags);
   if ( ret == -ERESTART )
     ret = hypercall_create_continuation(__HYPERVISOR_domctl, "h", u_domctl);


4. iommu_do_pci_domctl(XEN_DOMCTL_deassign_device) <- pcidevs_lock();
===============================================================================
case XEN_DOMCTL_deassign_device: <- no hypercall_create_continuation
   ret = deassign_device(d, seg, bus, devfn);


5. vPCI MMIO trap for PCI_COMMAND
===============================================================================
vpci_mmio_{read|write}
   vpci_ecam_{read|write}
     vpci_{read|write} <- NO locking yet
       pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
       spin_lock(&pdev->vpci->lock);
         cmd_write
           modify_bars
             defer_map

6. SoftIRQ processing
===============================================================================
hvm_do_resume
   vcpu_ioreq_handle_completion
     vpci_process_pending
       if ( v->vpci.mem )
         rc = rangeset_consume_ranges(v->vpci.mem, map_range, &data);
         if ( rc == -ERESTART )
             return true; <- re-scheduling

=========================================================================
         spin_lock(&v->vpci.pdev->vpci->lock); <- v->vpci.pdev->vpci can be NULL
=========================================================================
         spin_unlock(&v->vpci.pdev->vpci->lock);
         v->vpci.mem = NULL;
       if ( rc ) <- rc is from rangeset_consume_ranges
         vpci_remove_device <- this is a BUG as it is potentially possible that
                               vpci_process_pending is running on other vCPU

So, from the above it is clearly seen that it might be that there is a
PCI_COMMAND's write triggered mapping is happening on other vCPU in parallel
with a hypercall.

Some analysis on the hypercalls with respect to domains which are eligible targets.
1. Dom0 (hardware domain) only: PHYSDEVOP_pci_device_add, PHYSDEVOP_pci_device_remove
2. Any domain: XEN_DOMCTL_assign_device, XEN_DOMCTL_deassign_device

Possible crash paths
===============================================================================
1. Failure in PHYSDEVOP_pci_device_add after defer_map may make
vpci_process_pending crash because of pdev->vpci == NULL
2. vpci_process_pending on other vCPU may crash if runs in parallel with itself
because of vpci_remove_device may be called
3. Crash in vpci_mmio_{read|write} after PHYSDEVOP_pci_device_remove
vpci_remove_device makes pdev->vpci == NULL
4. Both XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device seem to be
unaffected.

Synchronization is needed between:
  - vpci_remove_device
  - vpci_process_pending
  - vpci_mmio_{read|write}

Possible locking and other work needed:
=======================================

1. pcidevs_{lock|unlock} is too heavy and is per-host
2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
3. We may want a dedicated per-domain rw lock to be implemented:

diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 28146ee404e6..ebf071893b21 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -444,6 +444,7 @@ struct domain

  #ifdef CONFIG_HAS_PCI
      struct list_head pdev_list;
+    rwlock_t vpci_rwlock;
+    bool vpci_terminating; <- atomic?
  #endif
then vpci_remove_device is a writer (cold path) and vpci_process_pending and
vpci_mmio_{read|write} are readers (hot path).

do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
to be implemented, so when re-start removal if need be:

vpci_remove_device()
{
   d->vpci_terminating = true;
   remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
   if ( !write_trylock(d->vpci_rwlock) )
     return -ERESTART;
   xfree(pdev->vpci);
   pdev->vpci = NULL;
}

Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
other operations which may require it, e.g. virtual bus topology can
use it when assigning vSBDF etc.

4. vpci_remove_device needs to be removed from vpci_process_pending
and do nothing for Dom0 and crash DomU otherwise:

if ( rc )
{
   /*
    * FIXME: in case of failure remove the device from the domain.
    * Note that there might still be leftover mappings. While this is
    * safe for Dom0, for DomUs the domain needs to be killed in order
    * to avoid leaking stale p2m mappings on failure.
    */
   if ( !is_hardware_domain(v->domain) )
     domain_crash(v->domain);

I do hope we can finally come up with some decision which I can implement...

Thank you,
Oleksandr
Jan Beulich Nov. 19, 2021, 1 p.m. UTC | #33
On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
> Possible locking and other work needed:
> =======================================
> 
> 1. pcidevs_{lock|unlock} is too heavy and is per-host
> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
> 3. We may want a dedicated per-domain rw lock to be implemented:
> 
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 28146ee404e6..ebf071893b21 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -444,6 +444,7 @@ struct domain
> 
>   #ifdef CONFIG_HAS_PCI
>       struct list_head pdev_list;
> +    rwlock_t vpci_rwlock;
> +    bool vpci_terminating; <- atomic?
>   #endif
> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
> vpci_mmio_{read|write} are readers (hot path).

Right - you need such a lock for other purposes anyway, as per the
discussion with Julien.

> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
> to be implemented, so when re-start removal if need be:
> 
> vpci_remove_device()
> {
>    d->vpci_terminating = true;
>    remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>    if ( !write_trylock(d->vpci_rwlock) )
>      return -ERESTART;
>    xfree(pdev->vpci);
>    pdev->vpci = NULL;
> }
> 
> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
> other operations which may require it, e.g. virtual bus topology can
> use it when assigning vSBDF etc.
> 
> 4. vpci_remove_device needs to be removed from vpci_process_pending
> and do nothing for Dom0 and crash DomU otherwise:

Why is this? I'm not outright opposed, but I don't immediately see why
trying to remove the problematic device wouldn't be a reasonable course
of action anymore. vpci_remove_device() may need to become more careful
as to not crashing, though.

Jan
Oleksandr Andrushchenko Nov. 19, 2021, 1:16 p.m. UTC | #34
On 19.11.21 15:00, Jan Beulich wrote:
> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>> Possible locking and other work needed:
>> =======================================
>>
>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 28146ee404e6..ebf071893b21 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -444,6 +444,7 @@ struct domain
>>
>>    #ifdef CONFIG_HAS_PCI
>>        struct list_head pdev_list;
>> +    rwlock_t vpci_rwlock;
>> +    bool vpci_terminating; <- atomic?
>>    #endif
>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>> vpci_mmio_{read|write} are readers (hot path).
> Right - you need such a lock for other purposes anyway, as per the
> discussion with Julien.
What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>
>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>> to be implemented, so when re-start removal if need be:
>>
>> vpci_remove_device()
>> {
>>     d->vpci_terminating = true;
>>     remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>     if ( !write_trylock(d->vpci_rwlock) )
>>       return -ERESTART;
>>     xfree(pdev->vpci);
>>     pdev->vpci = NULL;
>> }
>>
>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>> other operations which may require it, e.g. virtual bus topology can
>> use it when assigning vSBDF etc.
>>
>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>> and do nothing for Dom0 and crash DomU otherwise:
> Why is this? I'm not outright opposed, but I don't immediately see why
> trying to remove the problematic device wouldn't be a reasonable course
> of action anymore. vpci_remove_device() may need to become more careful
> as to not crashing,
vpci_remove_device does not crash, vpci_process_pending does
>   though.
Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
and we call vpci_remove_device. vpci_remove_device tries to acquire the
lock and it can't just because there are some other vpci code is running on other vCPU.
Then what do we do here? We are in SoftIRQ context now and we can't spin
trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci
structure because it is seen by all vCPUs and may crash them.

If vpci_remove_device is in hypercall context it just returns -ERESTART and
hypercall continuation helps here. But not in SoftIRQ context.

Thus, I think we need to remove vpci_remove_device call from vpci_process_pending
and crash the domain if it is a guest domain. Leave with partially done map/unmap if
it is the hardware domain as per Roger's comment in the code.
>
> Jan
>
Thank you,
Oleksandr
Jan Beulich Nov. 19, 2021, 1:25 p.m. UTC | #35
On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
> On 19.11.21 15:00, Jan Beulich wrote:
>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>> Possible locking and other work needed:
>>> =======================================
>>>
>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>
>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>> index 28146ee404e6..ebf071893b21 100644
>>> --- a/xen/include/xen/sched.h
>>> +++ b/xen/include/xen/sched.h
>>> @@ -444,6 +444,7 @@ struct domain
>>>
>>>    #ifdef CONFIG_HAS_PCI
>>>        struct list_head pdev_list;
>>> +    rwlock_t vpci_rwlock;
>>> +    bool vpci_terminating; <- atomic?
>>>    #endif
>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>> vpci_mmio_{read|write} are readers (hot path).
>> Right - you need such a lock for other purposes anyway, as per the
>> discussion with Julien.
> What about bool vpci_terminating? Do you see it as an atomic type or just bool?

Having seen only ...

>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>> to be implemented, so when re-start removal if need be:
>>>
>>> vpci_remove_device()
>>> {
>>>     d->vpci_terminating = true;

... this use so far, I can't tell yet. But at a first glance a boolean
looks to be what you need.

>>>     remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>     if ( !write_trylock(d->vpci_rwlock) )
>>>       return -ERESTART;
>>>     xfree(pdev->vpci);
>>>     pdev->vpci = NULL;
>>> }
>>>
>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>> other operations which may require it, e.g. virtual bus topology can
>>> use it when assigning vSBDF etc.
>>>
>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>> and do nothing for Dom0 and crash DomU otherwise:
>> Why is this? I'm not outright opposed, but I don't immediately see why
>> trying to remove the problematic device wouldn't be a reasonable course
>> of action anymore. vpci_remove_device() may need to become more careful
>> as to not crashing,
> vpci_remove_device does not crash, vpci_process_pending does
>>   though.
> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
> and we call vpci_remove_device. vpci_remove_device tries to acquire the
> lock and it can't just because there are some other vpci code is running on other vCPU.
> Then what do we do here? We are in SoftIRQ context now and we can't spin
> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci
> structure because it is seen by all vCPUs and may crash them.
> 
> If vpci_remove_device is in hypercall context it just returns -ERESTART and
> hypercall continuation helps here. But not in SoftIRQ context.

Maybe then you want to invoke this cleanup from RCU context (whether
vpci_remove_device() itself or a suitable clone there of is TBD)? (I
will admit though that I didn't check whether that would satisfy all
constraints.)

Then again it also hasn't become clear to me why you use write_trylock()
there. The lock contention you describe doesn't, on the surface, look
any different from situations elsewhere.

Jan
Oleksandr Andrushchenko Nov. 19, 2021, 1:34 p.m. UTC | #36
On 19.11.21 15:25, Jan Beulich wrote:
> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>> On 19.11.21 15:00, Jan Beulich wrote:
>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>> Possible locking and other work needed:
>>>> =======================================
>>>>
>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>
>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>> index 28146ee404e6..ebf071893b21 100644
>>>> --- a/xen/include/xen/sched.h
>>>> +++ b/xen/include/xen/sched.h
>>>> @@ -444,6 +444,7 @@ struct domain
>>>>
>>>>     #ifdef CONFIG_HAS_PCI
>>>>         struct list_head pdev_list;
>>>> +    rwlock_t vpci_rwlock;
>>>> +    bool vpci_terminating; <- atomic?
>>>>     #endif
>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>> vpci_mmio_{read|write} are readers (hot path).
>>> Right - you need such a lock for other purposes anyway, as per the
>>> discussion with Julien.
>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
> Having seen only ...
>
>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>> to be implemented, so when re-start removal if need be:
>>>>
>>>> vpci_remove_device()
>>>> {
>>>>      d->vpci_terminating = true;
> ... this use so far, I can't tell yet. But at a first glance a boolean
> looks to be what you need.
>
>>>>      remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>      if ( !write_trylock(d->vpci_rwlock) )
>>>>        return -ERESTART;
>>>>      xfree(pdev->vpci);
>>>>      pdev->vpci = NULL;
>>>> }
>>>>
>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>> other operations which may require it, e.g. virtual bus topology can
>>>> use it when assigning vSBDF etc.
>>>>
>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>> and do nothing for Dom0 and crash DomU otherwise:
>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>> trying to remove the problematic device wouldn't be a reasonable course
>>> of action anymore. vpci_remove_device() may need to become more careful
>>> as to not crashing,
>> vpci_remove_device does not crash, vpci_process_pending does
>>>    though.
>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>> lock and it can't just because there are some other vpci code is running on other vCPU.
>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci
>> structure because it is seen by all vCPUs and may crash them.
>>
>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>> hypercall continuation helps here. But not in SoftIRQ context.
> Maybe then you want to invoke this cleanup from RCU context (whether
> vpci_remove_device() itself or a suitable clone there of is TBD)? (I
> will admit though that I didn't check whether that would satisfy all
> constraints.)
>
> Then again it also hasn't become clear to me why you use write_trylock()
> there. The lock contention you describe doesn't, on the surface, look
> any different from situations elsewhere.
I use write_trylock in vpci_remove_device because if we can't
acquire the lock then we defer device removal. This would work
well if called from a hypercall which will employ hypercall continuation.
But SoftIRQ getting -ERESTART is something that we can't probably
handle by restarting as hypercall can, thus I only see that vpci_process_pending
will need to spin and wait until vpci_remove_device succeeds.
>
> Jan
>
Thank you,
Oleksandr
Oleksandr Andrushchenko Nov. 22, 2021, 2:21 p.m. UTC | #37
On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>
> On 19.11.21 15:25, Jan Beulich wrote:
>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>> Possible locking and other work needed:
>>>>> =======================================
>>>>>
>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>
>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>> --- a/xen/include/xen/sched.h
>>>>> +++ b/xen/include/xen/sched.h
>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>
>>>>>      #ifdef CONFIG_HAS_PCI
>>>>>          struct list_head pdev_list;
>>>>> +    rwlock_t vpci_rwlock;
>>>>> +    bool vpci_terminating; <- atomic?
>>>>>      #endif
>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>> Right - you need such a lock for other purposes anyway, as per the
>>>> discussion with Julien.
>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>> Having seen only ...
>>
>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>> to be implemented, so when re-start removal if need be:
>>>>>
>>>>> vpci_remove_device()
>>>>> {
>>>>>       d->vpci_terminating = true;
>> ... this use so far, I can't tell yet. But at a first glance a boolean
>> looks to be what you need.
>>
>>>>>       remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>       if ( !write_trylock(d->vpci_rwlock) )
>>>>>         return -ERESTART;
>>>>>       xfree(pdev->vpci);
>>>>>       pdev->vpci = NULL;
>>>>> }
>>>>>
>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>> use it when assigning vSBDF etc.
>>>>>
>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>> as to not crashing,
>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>     though.
>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>> lock and it can't just because there are some other vpci code is running on other vCPU.
>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci
>>> structure because it is seen by all vCPUs and may crash them.
>>>
>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>> hypercall continuation helps here. But not in SoftIRQ context.
>> Maybe then you want to invoke this cleanup from RCU context (whether
>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I
>> will admit though that I didn't check whether that would satisfy all
>> constraints.)
>>
>> Then again it also hasn't become clear to me why you use write_trylock()
>> there. The lock contention you describe doesn't, on the surface, look
>> any different from situations elsewhere.
> I use write_trylock in vpci_remove_device because if we can't
> acquire the lock then we defer device removal. This would work
> well if called from a hypercall which will employ hypercall continuation.
> But SoftIRQ getting -ERESTART is something that we can't probably
> handle by restarting as hypercall can, thus I only see that vpci_process_pending
> will need to spin and wait until vpci_remove_device succeeds.
Does anybody have any better solution for preventing SoftIRQ from
spinning on vpci_remove_device and -ERESTART?
>> Jan
>>
> Thank you,
> Oleksandr
Thank you,
Oleksandr
Jan Beulich Nov. 22, 2021, 2:37 p.m. UTC | #38
On 22.11.2021 15:21, Oleksandr Andrushchenko wrote:
> On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>> On 19.11.21 15:25, Jan Beulich wrote:
>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>>> Possible locking and other work needed:
>>>>>> =======================================
>>>>>>
>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>>
>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>>> --- a/xen/include/xen/sched.h
>>>>>> +++ b/xen/include/xen/sched.h
>>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>>
>>>>>>      #ifdef CONFIG_HAS_PCI
>>>>>>          struct list_head pdev_list;
>>>>>> +    rwlock_t vpci_rwlock;
>>>>>> +    bool vpci_terminating; <- atomic?
>>>>>>      #endif
>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>>> Right - you need such a lock for other purposes anyway, as per the
>>>>> discussion with Julien.
>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>>> Having seen only ...
>>>
>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>>> to be implemented, so when re-start removal if need be:
>>>>>>
>>>>>> vpci_remove_device()
>>>>>> {
>>>>>>       d->vpci_terminating = true;
>>> ... this use so far, I can't tell yet. But at a first glance a boolean
>>> looks to be what you need.
>>>
>>>>>>       remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>>       if ( !write_trylock(d->vpci_rwlock) )
>>>>>>         return -ERESTART;
>>>>>>       xfree(pdev->vpci);
>>>>>>       pdev->vpci = NULL;
>>>>>> }
>>>>>>
>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>>> use it when assigning vSBDF etc.
>>>>>>
>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>>> as to not crashing,
>>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>>     though.
>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>>> lock and it can't just because there are some other vpci code is running on other vCPU.
>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci
>>>> structure because it is seen by all vCPUs and may crash them.
>>>>
>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>>> hypercall continuation helps here. But not in SoftIRQ context.
>>> Maybe then you want to invoke this cleanup from RCU context (whether
>>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I
>>> will admit though that I didn't check whether that would satisfy all
>>> constraints.)
>>>
>>> Then again it also hasn't become clear to me why you use write_trylock()
>>> there. The lock contention you describe doesn't, on the surface, look
>>> any different from situations elsewhere.
>> I use write_trylock in vpci_remove_device because if we can't
>> acquire the lock then we defer device removal. This would work
>> well if called from a hypercall which will employ hypercall continuation.
>> But SoftIRQ getting -ERESTART is something that we can't probably
>> handle by restarting as hypercall can, thus I only see that vpci_process_pending
>> will need to spin and wait until vpci_remove_device succeeds.
> Does anybody have any better solution for preventing SoftIRQ from
> spinning on vpci_remove_device and -ERESTART?

Well, at this point I can suggest only a marginal improvement: Instead of
spinning inside the softirq handler, you want to re-raise the softirq and
exit the handler. That way at least higher "priority" softirqs won't be
starved.

Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such
an event, with the work deferred to a tasklet?

Yet I don't think my earlier question regarding the use of write_trylock()
was really answered. What you said in reply doesn't explain (to me at
least) why write_lock() is not an option.

Jan
Oleksandr Andrushchenko Nov. 22, 2021, 2:45 p.m. UTC | #39
On 22.11.21 16:37, Jan Beulich wrote:
> On 22.11.2021 15:21, Oleksandr Andrushchenko wrote:
>> On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 15:25, Jan Beulich wrote:
>>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>>>> Possible locking and other work needed:
>>>>>>> =======================================
>>>>>>>
>>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>>>
>>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>>>> --- a/xen/include/xen/sched.h
>>>>>>> +++ b/xen/include/xen/sched.h
>>>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>>>
>>>>>>>       #ifdef CONFIG_HAS_PCI
>>>>>>>           struct list_head pdev_list;
>>>>>>> +    rwlock_t vpci_rwlock;
>>>>>>> +    bool vpci_terminating; <- atomic?
>>>>>>>       #endif
>>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>>>> Right - you need such a lock for other purposes anyway, as per the
>>>>>> discussion with Julien.
>>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>>>> Having seen only ...
>>>>
>>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>>>> to be implemented, so when re-start removal if need be:
>>>>>>>
>>>>>>> vpci_remove_device()
>>>>>>> {
>>>>>>>        d->vpci_terminating = true;
>>>> ... this use so far, I can't tell yet. But at a first glance a boolean
>>>> looks to be what you need.
>>>>
>>>>>>>        remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>>>        if ( !write_trylock(d->vpci_rwlock) )
>>>>>>>          return -ERESTART;
>>>>>>>        xfree(pdev->vpci);
>>>>>>>        pdev->vpci = NULL;
>>>>>>> }
>>>>>>>
>>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>>>> use it when assigning vSBDF etc.
>>>>>>>
>>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>>>> as to not crashing,
>>>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>>>      though.
>>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>>>> lock and it can't just because there are some other vpci code is running on other vCPU.
>>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci
>>>>> structure because it is seen by all vCPUs and may crash them.
>>>>>
>>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>>>> hypercall continuation helps here. But not in SoftIRQ context.
>>>> Maybe then you want to invoke this cleanup from RCU context (whether
>>>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I
>>>> will admit though that I didn't check whether that would satisfy all
>>>> constraints.)
>>>>
>>>> Then again it also hasn't become clear to me why you use write_trylock()
>>>> there. The lock contention you describe doesn't, on the surface, look
>>>> any different from situations elsewhere.
>>> I use write_trylock in vpci_remove_device because if we can't
>>> acquire the lock then we defer device removal. This would work
>>> well if called from a hypercall which will employ hypercall continuation.
>>> But SoftIRQ getting -ERESTART is something that we can't probably
>>> handle by restarting as hypercall can, thus I only see that vpci_process_pending
>>> will need to spin and wait until vpci_remove_device succeeds.
>> Does anybody have any better solution for preventing SoftIRQ from
>> spinning on vpci_remove_device and -ERESTART?
> Well, at this point I can suggest only a marginal improvement: Instead of
> spinning inside the softirq handler, you want to re-raise the softirq and
> exit the handler. That way at least higher "priority" softirqs won't be
> starved.
>
> Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such
> an event, with the work deferred to a tasklet?
>
> Yet I don't think my earlier question regarding the use of write_trylock()
> was really answered. What you said in reply doesn't explain (to me at
> least) why write_lock() is not an option.
I was thinking that we do not want to freeze in case we are calling vpci_remove_device
from SoftIRQ context, thus we try to lock and if we can't we return -ERESTART
indicating that the removal needs to be deferred. If we use write_lock, then
SoftIRQ -> write_lock will spin there waiting for readers to release the lock.

write_lock actually makes things a lot easier, but I just don't know if it
is ok to use it. If so, then vpci_remove_device becomes synchronous and
there is no need in hypercall continuation and other heavy machinery for
re-scheduling SoftIRQ...
>
> Jan
>
Thank you,
Oleksandr
Jan Beulich Nov. 22, 2021, 2:57 p.m. UTC | #40
On 22.11.2021 15:45, Oleksandr Andrushchenko wrote:
> 
> 
> On 22.11.21 16:37, Jan Beulich wrote:
>> On 22.11.2021 15:21, Oleksandr Andrushchenko wrote:
>>> On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 15:25, Jan Beulich wrote:
>>>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>>>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>>>>> Possible locking and other work needed:
>>>>>>>> =======================================
>>>>>>>>
>>>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>>>>
>>>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>>>>> --- a/xen/include/xen/sched.h
>>>>>>>> +++ b/xen/include/xen/sched.h
>>>>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>>>>
>>>>>>>>       #ifdef CONFIG_HAS_PCI
>>>>>>>>           struct list_head pdev_list;
>>>>>>>> +    rwlock_t vpci_rwlock;
>>>>>>>> +    bool vpci_terminating; <- atomic?
>>>>>>>>       #endif
>>>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>>>>> Right - you need such a lock for other purposes anyway, as per the
>>>>>>> discussion with Julien.
>>>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>>>>> Having seen only ...
>>>>>
>>>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>>>>> to be implemented, so when re-start removal if need be:
>>>>>>>>
>>>>>>>> vpci_remove_device()
>>>>>>>> {
>>>>>>>>        d->vpci_terminating = true;
>>>>> ... this use so far, I can't tell yet. But at a first glance a boolean
>>>>> looks to be what you need.
>>>>>
>>>>>>>>        remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>>>>        if ( !write_trylock(d->vpci_rwlock) )
>>>>>>>>          return -ERESTART;
>>>>>>>>        xfree(pdev->vpci);
>>>>>>>>        pdev->vpci = NULL;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>>>>> use it when assigning vSBDF etc.
>>>>>>>>
>>>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>>>>> as to not crashing,
>>>>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>>>>      though.
>>>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>>>>> lock and it can't just because there are some other vpci code is running on other vCPU.
>>>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>>>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci
>>>>>> structure because it is seen by all vCPUs and may crash them.
>>>>>>
>>>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>>>>> hypercall continuation helps here. But not in SoftIRQ context.
>>>>> Maybe then you want to invoke this cleanup from RCU context (whether
>>>>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I
>>>>> will admit though that I didn't check whether that would satisfy all
>>>>> constraints.)
>>>>>
>>>>> Then again it also hasn't become clear to me why you use write_trylock()
>>>>> there. The lock contention you describe doesn't, on the surface, look
>>>>> any different from situations elsewhere.
>>>> I use write_trylock in vpci_remove_device because if we can't
>>>> acquire the lock then we defer device removal. This would work
>>>> well if called from a hypercall which will employ hypercall continuation.
>>>> But SoftIRQ getting -ERESTART is something that we can't probably
>>>> handle by restarting as hypercall can, thus I only see that vpci_process_pending
>>>> will need to spin and wait until vpci_remove_device succeeds.
>>> Does anybody have any better solution for preventing SoftIRQ from
>>> spinning on vpci_remove_device and -ERESTART?
>> Well, at this point I can suggest only a marginal improvement: Instead of
>> spinning inside the softirq handler, you want to re-raise the softirq and
>> exit the handler. That way at least higher "priority" softirqs won't be
>> starved.
>>
>> Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such
>> an event, with the work deferred to a tasklet?
>>
>> Yet I don't think my earlier question regarding the use of write_trylock()
>> was really answered. What you said in reply doesn't explain (to me at
>> least) why write_lock() is not an option.
> I was thinking that we do not want to freeze in case we are calling vpci_remove_device
> from SoftIRQ context, thus we try to lock and if we can't we return -ERESTART
> indicating that the removal needs to be deferred. If we use write_lock, then
> SoftIRQ -> write_lock will spin there waiting for readers to release the lock.
> 
> write_lock actually makes things a lot easier, but I just don't know if it
> is ok to use it. If so, then vpci_remove_device becomes synchronous and
> there is no need in hypercall continuation and other heavy machinery for
> re-scheduling SoftIRQ...

I'm inclined to ask: If it wasn't okay to use here, then where would it be
okay to use? Of course I realize there are cases when long spinning times
can be a problem. But I expect there aren't going to be excessively long
lock holding regions for this lock, and I also would expect average
contention to not be overly bad. But in the end you know better the code
that you're writing (and which may lead to issues with the lock usage) than
I do ...

Jan
Oleksandr Andrushchenko Nov. 22, 2021, 3:02 p.m. UTC | #41
On 22.11.21 16:57, Jan Beulich wrote:
> On 22.11.2021 15:45, Oleksandr Andrushchenko wrote:
>>
>> On 22.11.21 16:37, Jan Beulich wrote:
>>> On 22.11.2021 15:21, Oleksandr Andrushchenko wrote:
>>>> On 19.11.21 15:34, Oleksandr Andrushchenko wrote:
>>>>> On 19.11.21 15:25, Jan Beulich wrote:
>>>>>> On 19.11.2021 14:16, Oleksandr Andrushchenko wrote:
>>>>>>> On 19.11.21 15:00, Jan Beulich wrote:
>>>>>>>> On 19.11.2021 13:34, Oleksandr Andrushchenko wrote:
>>>>>>>>> Possible locking and other work needed:
>>>>>>>>> =======================================
>>>>>>>>>
>>>>>>>>> 1. pcidevs_{lock|unlock} is too heavy and is per-host
>>>>>>>>> 2. pdev->vpci->lock cannot be used as vpci is freed by vpci_remove_device
>>>>>>>>> 3. We may want a dedicated per-domain rw lock to be implemented:
>>>>>>>>>
>>>>>>>>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>>>>>>>>> index 28146ee404e6..ebf071893b21 100644
>>>>>>>>> --- a/xen/include/xen/sched.h
>>>>>>>>> +++ b/xen/include/xen/sched.h
>>>>>>>>> @@ -444,6 +444,7 @@ struct domain
>>>>>>>>>
>>>>>>>>>        #ifdef CONFIG_HAS_PCI
>>>>>>>>>            struct list_head pdev_list;
>>>>>>>>> +    rwlock_t vpci_rwlock;
>>>>>>>>> +    bool vpci_terminating; <- atomic?
>>>>>>>>>        #endif
>>>>>>>>> then vpci_remove_device is a writer (cold path) and vpci_process_pending and
>>>>>>>>> vpci_mmio_{read|write} are readers (hot path).
>>>>>>>> Right - you need such a lock for other purposes anyway, as per the
>>>>>>>> discussion with Julien.
>>>>>>> What about bool vpci_terminating? Do you see it as an atomic type or just bool?
>>>>>> Having seen only ...
>>>>>>
>>>>>>>>> do_physdev_op(PHYSDEVOP_pci_device_remove) will need hypercall_create_continuation
>>>>>>>>> to be implemented, so when re-start removal if need be:
>>>>>>>>>
>>>>>>>>> vpci_remove_device()
>>>>>>>>> {
>>>>>>>>>         d->vpci_terminating = true;
>>>>>> ... this use so far, I can't tell yet. But at a first glance a boolean
>>>>>> looks to be what you need.
>>>>>>
>>>>>>>>>         remove vPCI register handlers <- this will cut off PCI_COMMAND emulation among others
>>>>>>>>>         if ( !write_trylock(d->vpci_rwlock) )
>>>>>>>>>           return -ERESTART;
>>>>>>>>>         xfree(pdev->vpci);
>>>>>>>>>         pdev->vpci = NULL;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Then this d->vpci_rwlock becomes a dedicated vpci per-domain lock for
>>>>>>>>> other operations which may require it, e.g. virtual bus topology can
>>>>>>>>> use it when assigning vSBDF etc.
>>>>>>>>>
>>>>>>>>> 4. vpci_remove_device needs to be removed from vpci_process_pending
>>>>>>>>> and do nothing for Dom0 and crash DomU otherwise:
>>>>>>>> Why is this? I'm not outright opposed, but I don't immediately see why
>>>>>>>> trying to remove the problematic device wouldn't be a reasonable course
>>>>>>>> of action anymore. vpci_remove_device() may need to become more careful
>>>>>>>> as to not crashing,
>>>>>>> vpci_remove_device does not crash, vpci_process_pending does
>>>>>>>>       though.
>>>>>>> Assume we are in an error state in vpci_process_pending *on one of the vCPUs*
>>>>>>> and we call vpci_remove_device. vpci_remove_device tries to acquire the
>>>>>>> lock and it can't just because there are some other vpci code is running on other vCPU.
>>>>>>> Then what do we do here? We are in SoftIRQ context now and we can't spin
>>>>>>> trying to acquire d->vpci_rwlock forever. Neither we can blindly free vpci
>>>>>>> structure because it is seen by all vCPUs and may crash them.
>>>>>>>
>>>>>>> If vpci_remove_device is in hypercall context it just returns -ERESTART and
>>>>>>> hypercall continuation helps here. But not in SoftIRQ context.
>>>>>> Maybe then you want to invoke this cleanup from RCU context (whether
>>>>>> vpci_remove_device() itself or a suitable clone there of is TBD)? (I
>>>>>> will admit though that I didn't check whether that would satisfy all
>>>>>> constraints.)
>>>>>>
>>>>>> Then again it also hasn't become clear to me why you use write_trylock()
>>>>>> there. The lock contention you describe doesn't, on the surface, look
>>>>>> any different from situations elsewhere.
>>>>> I use write_trylock in vpci_remove_device because if we can't
>>>>> acquire the lock then we defer device removal. This would work
>>>>> well if called from a hypercall which will employ hypercall continuation.
>>>>> But SoftIRQ getting -ERESTART is something that we can't probably
>>>>> handle by restarting as hypercall can, thus I only see that vpci_process_pending
>>>>> will need to spin and wait until vpci_remove_device succeeds.
>>>> Does anybody have any better solution for preventing SoftIRQ from
>>>> spinning on vpci_remove_device and -ERESTART?
>>> Well, at this point I can suggest only a marginal improvement: Instead of
>>> spinning inside the softirq handler, you want to re-raise the softirq and
>>> exit the handler. That way at least higher "priority" softirqs won't be
>>> starved.
>>>
>>> Beyond that - maybe the guest (or just a vcpu of it) needs pausing in such
>>> an event, with the work deferred to a tasklet?
>>>
>>> Yet I don't think my earlier question regarding the use of write_trylock()
>>> was really answered. What you said in reply doesn't explain (to me at
>>> least) why write_lock() is not an option.
>> I was thinking that we do not want to freeze in case we are calling vpci_remove_device
>> from SoftIRQ context, thus we try to lock and if we can't we return -ERESTART
>> indicating that the removal needs to be deferred. If we use write_lock, then
>> SoftIRQ -> write_lock will spin there waiting for readers to release the lock.
>>
>> write_lock actually makes things a lot easier, but I just don't know if it
>> is ok to use it. If so, then vpci_remove_device becomes synchronous and
>> there is no need in hypercall continuation and other heavy machinery for
>> re-scheduling SoftIRQ...
> I'm inclined to ask: If it wasn't okay to use here, then where would it be
> okay to use? Of course I realize there are cases when long spinning times
> can be a problem.
I can't prove, but I have a feeling that write_lock could be less
"harmful" in case of a hypercall rather than SoftIRQ
>   But I expect there aren't going to be excessively long
> lock holding regions for this lock, and I also would expect average
> contention to not be overly bad.
Yes, this is my impression as well
>   But in the end you know better the code
> that you're writing (and which may lead to issues with the lock usage) than
> I do ...
I am pretty much ok with write_lock as it does make things way easier.
So I'll got with write_lock then and if we spot great contention then we
can improve that
>
> Jan
>
Thank you!!
Oleksandr
diff mbox series

Patch

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 40ff79c33f8f..ef538386e95d 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -149,8 +149,7 @@  bool vpci_process_pending(struct vcpu *v)
                         !rc && v->vpci.rom_only);
         spin_unlock(&v->vpci.pdev->vpci->lock);
 
-        rangeset_destroy(v->vpci.mem);
-        v->vpci.mem = NULL;
+        vpci_cancel_pending(v->vpci.pdev);
         if ( rc )
             /*
              * FIXME: in case of failure remove the device from the domain.
@@ -165,6 +164,18 @@  bool vpci_process_pending(struct vcpu *v)
     return false;
 }
 
+void vpci_cancel_pending(const struct pci_dev *pdev)
+{
+    struct vcpu *v = current;
+
+    /* Cancel any pending work now. */
+    if ( v->vpci.mem && v->vpci.pdev == pdev)
+    {
+        rangeset_destroy(v->vpci.mem);
+        v->vpci.mem = NULL;
+    }
+}
+
 static int __init apply_map(struct domain *d, const struct pci_dev *pdev,
                             struct rangeset *mem, uint16_t cmd)
 {
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 657697fe3406..4e24956419aa 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -51,6 +51,8 @@  void vpci_remove_device(struct pci_dev *pdev)
         xfree(r);
     }
     spin_unlock(&pdev->vpci->lock);
+
+    vpci_cancel_pending(pdev);
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 3f32de9d7eb3..609d6383b252 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -56,6 +56,7 @@  uint32_t vpci_hw_read32(const struct pci_dev *pdev, unsigned int reg,
  * should not run.
  */
 bool __must_check vpci_process_pending(struct vcpu *v);
+void vpci_cancel_pending(const struct pci_dev *pdev);
 
 struct vpci {
     /* List of vPCI handlers for a device. */
@@ -252,6 +253,11 @@  static inline bool __must_check vpci_process_pending(struct vcpu *v)
     ASSERT_UNREACHABLE();
     return false;
 }
+
+static inline void vpci_cancel_pending(const struct pci_dev *pdev)
+{
+    ASSERT_UNREACHABLE();
+}
 #endif
 
 #endif