[v6,05/13] vpci: add hooks for PCI device assign/de-assign

Message ID	20220204063459.680961-6-andr2000@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> From: Oleksandr Andrushchenko <andr2000@gmail.com> To: xen-devel@lists.xenproject.org Cc: julien@xen.org, sstabellini@kernel.org, oleksandr_tyshchenko@epam.com, volodymyr_babchuk@epam.com, artem_mygaiev@epam.com, roger.pau@citrix.com, jbeulich@suse.com, andrew.cooper3@citrix.com, george.dunlap@citrix.com, paul@xen.org, bertrand.marquis@arm.com, rahul.singh@arm.com, Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Subject: [PATCH v6 05/13] vpci: add hooks for PCI device assign/de-assign Date: Fri, 4 Feb 2022 08:34:51 +0200 Message-Id: <20220204063459.680961-6-andr2000@gmail.com> In-Reply-To: <20220204063459.680961-1-andr2000@gmail.com> References: <20220204063459.680961-1-andr2000@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	PCI devices passthrough on Arm, part 3 \| expand [v6,00/13] PCI devices passthrough on Arm, part 3 [v6,01/13] xen/pci: arm: add stub for is_memory_hole [v6,02/13] rangeset: add RANGESETF_no_print flag [v6,03/13] vpci: move lock outside of struct vpci [v6,04/13] vpci: restrict unhandled read/write operations for guests [v6,05/13] vpci: add hooks for PCI device assign/de-assign [v6,06/13] vpci/header: implement guest BAR register handlers [v6,07/13] vpci/header: handle p2m range sets per BAR [v6,08/13] vpci/header: program p2m with guest BAR view [v6,09/13] vpci/header: emulate PCI_COMMAND register for guests [v6,10/13] vpci/header: reset the command register when adding devices [v6,11/13] vpci: add initial support for virtual PCI bus topology [v6,12/13] xen/arm: translate virtual PCI bus topology for guests [v6,13/13] xen/arm: account IO handlers for emulated PCI MSI-X

Oleksandr Andrushchenko Feb. 4, 2022, 6:34 a.m. UTC

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

When a PCI device gets assigned/de-assigned some work on vPCI side needs
to be done for that device. Introduce a pair of hooks so vPCI can handle
that.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v5:
- do not split code into run_vpci_init
- do not check for is_system_domain in vpci_{de}assign_device
- do not use vpci_remove_device_handlers_locked and re-allocate
  pdev->vpci completely
- make vpci_deassign_device void
Since v4:
 - de-assign vPCI from the previous domain on device assignment
 - do not remove handlers in vpci_assign_device as those must not
   exist at that point
Since v3:
 - remove toolstack roll-back description from the commit message
   as error are to be handled with proper cleanup in Xen itself
 - remove __must_check
 - remove redundant rc check while assigning devices
 - fix redundant CONFIG_HAS_VPCI check for CONFIG_HAS_VPCI_GUEST_SUPPORT
 - use REGISTER_VPCI_INIT machinery to run required steps on device
   init/assign: add run_vpci_init helper
Since v2:
- define CONFIG_HAS_VPCI_GUEST_SUPPORT so dead code is not compiled
  for x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - extended the commit message
---
 xen/drivers/Kconfig           |  4 ++++
 xen/drivers/passthrough/pci.c |  6 ++++++
 xen/drivers/vpci/vpci.c       | 27 +++++++++++++++++++++++++++
 xen/include/xen/vpci.h        | 15 +++++++++++++++
 4 files changed, 52 insertions(+)

Jan Beulich Feb. 7, 2022, 4:28 p.m. UTC | #1

On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>                          pci_to_dev(pdev), flag);
>      }
>  
> +    rc = vpci_assign_device(d, pdev);
> +
>   done:
>      if ( rc )
>          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",

There's no attempt to undo anything in the case of getting back an
error. ISTR this being deemed okay on the basis that the tool stack
would then take whatever action, but whatever it is that is supposed
to deal with errors here wants spelling out in the description.
What's important is that no caller up the call tree may be left with
the impression that the device is still owned by the original
domain. With how you have it, the device is going to be owned by the
new domain, but not really usable.

> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
>  
>      return rc;
>  }
> +
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +/* Notify vPCI that device is assigned to guest. */
> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    if ( !has_vpci(d) )
> +        return 0;
> +
> +    rc = vpci_add_handlers(pdev);
> +    if ( rc )
> +        vpci_deassign_device(d, pdev);
> +
> +    return rc;
> +}
> +
> +/* Notify vPCI that device is de-assigned from guest. */
> +void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
> +{
> +    if ( !has_vpci(d) )
> +        return;
> +
> +    vpci_remove_device(pdev);
> +}
> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */

While for the latter function you look to need two parameters, do you
really need them also in the former one?

Symmetry considerations make me wonder though whether the de-assign
hook shouldn't be called earlier, when pdev->domain still has the
original owner. At which point the 2nd parameter could disappear there
as well.

Jan

Oleksandr Andrushchenko Feb. 8, 2022, 8:32 a.m. UTC | #2

On 07.02.22 18:28, Jan Beulich wrote:
> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>                           pci_to_dev(pdev), flag);
>>       }
>>   
>> +    rc = vpci_assign_device(d, pdev);
>> +
>>    done:
>>       if ( rc )
>>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
> There's no attempt to undo anything in the case of getting back an
> error. ISTR this being deemed okay on the basis that the tool stack
> would then take whatever action, but whatever it is that is supposed
> to deal with errors here wants spelling out in the description.
Why? I don't change the previously expected decision and implementation
of the assign_device function: I use error paths as they were used before
for the existing code. So, I see no clear reason to stress that the existing
and new code relies on the toolstack
> What's important is that no caller up the call tree may be left with
> the impression that the device is still owned by the original
> domain. With how you have it, the device is going to be owned by the
> new domain, but not really usable.
This is not true: vpci_assign_device will call vpci_deassign_device
internally if it fails. So, the device won't be assigned in this case
>
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>   
>>       return rc;
>>   }
>> +
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +/* Notify vPCI that device is assigned to guest. */
>> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
>> +{
>> +    int rc;
>> +
>> +    if ( !has_vpci(d) )
>> +        return 0;
>> +
>> +    rc = vpci_add_handlers(pdev);
>> +    if ( rc )
>> +        vpci_deassign_device(d, pdev);
>> +
>> +    return rc;
>> +}
>> +
>> +/* Notify vPCI that device is de-assigned from guest. */
>> +void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
>> +{
>> +    if ( !has_vpci(d) )
>> +        return;
>> +
>> +    vpci_remove_device(pdev);
>> +}
>> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
> While for the latter function you look to need two parameters, do you
> really need them also in the former one?
Do you mean instead of passing d we could just use pdev->domain?
int vpci_assign_device(struct pci_dev *pdev)
+{
+    int rc;
+
+    if ( !has_vpci(pdev->domain) )
+        return 0;
Yes, we probably can, but the rest of functions called from assign_device
are accepting both d and pdev, so not sure why would we want these
two be any different. Any good reason not to change others as well then?
> Symmetry considerations make me wonder though whether the de-assign
> hook shouldn't be called earlier, when pdev->domain still has the
> original owner. At which point the 2nd parameter could disappear there
> as well.
static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
{
[snip]
     vpci_deassign_device(pdev->domain, pdev);
[snip]
     rc = vpci_assign_device(d, pdev);

It looks ok to me
> Jan
>
Thank you,
Oleksandr

Jan Beulich Feb. 8, 2022, 9:13 a.m. UTC | #3

On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
> On 07.02.22 18:28, Jan Beulich wrote:
>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>                           pci_to_dev(pdev), flag);
>>>       }
>>>   
>>> +    rc = vpci_assign_device(d, pdev);
>>> +
>>>    done:
>>>       if ( rc )
>>>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>> There's no attempt to undo anything in the case of getting back an
>> error. ISTR this being deemed okay on the basis that the tool stack
>> would then take whatever action, but whatever it is that is supposed
>> to deal with errors here wants spelling out in the description.
> Why? I don't change the previously expected decision and implementation
> of the assign_device function: I use error paths as they were used before
> for the existing code. So, I see no clear reason to stress that the existing
> and new code relies on the toolstack

Saying half a sentence on this is helping review.

>> What's important is that no caller up the call tree may be left with
>> the impression that the device is still owned by the original
>> domain. With how you have it, the device is going to be owned by the
>> new domain, but not really usable.
> This is not true: vpci_assign_device will call vpci_deassign_device
> internally if it fails. So, the device won't be assigned in this case

No. The device is assigned to whatever pdev->domain holds. Calling
vpci_deassign_device() there merely makes sure that the device will
have _no_ vPCI data and hooks in place, rather than something
partial.

>>> --- a/xen/drivers/vpci/vpci.c
>>> +++ b/xen/drivers/vpci/vpci.c
>>> @@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>>   
>>>       return rc;
>>>   }
>>> +
>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>> +/* Notify vPCI that device is assigned to guest. */
>>> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
>>> +{
>>> +    int rc;
>>> +
>>> +    if ( !has_vpci(d) )
>>> +        return 0;
>>> +
>>> +    rc = vpci_add_handlers(pdev);
>>> +    if ( rc )
>>> +        vpci_deassign_device(d, pdev);
>>> +
>>> +    return rc;
>>> +}
>>> +
>>> +/* Notify vPCI that device is de-assigned from guest. */
>>> +void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
>>> +{
>>> +    if ( !has_vpci(d) )
>>> +        return;
>>> +
>>> +    vpci_remove_device(pdev);
>>> +}
>>> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
>> While for the latter function you look to need two parameters, do you
>> really need them also in the former one?
> Do you mean instead of passing d we could just use pdev->domain?
> int vpci_assign_device(struct pci_dev *pdev)
> +{
> +    int rc;
> +
> +    if ( !has_vpci(pdev->domain) )
> +        return 0;

Yes.

> Yes, we probably can, but the rest of functions called from assign_device
> are accepting both d and pdev, so not sure why would we want these
> two be any different. Any good reason not to change others as well then?

Yes: Prior to the call of the ->assign_device() hook, d != pdev->domain.
It is the _purpose_ of this function to change ownership of the device.

Jan

Oleksandr Andrushchenko Feb. 8, 2022, 9:27 a.m. UTC | #4

On 08.02.22 11:13, Jan Beulich wrote:
> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>> On 07.02.22 18:28, Jan Beulich wrote:
>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>                            pci_to_dev(pdev), flag);
>>>>        }
>>>>    
>>>> +    rc = vpci_assign_device(d, pdev);
>>>> +
>>>>     done:
>>>>        if ( rc )
>>>>            printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>> There's no attempt to undo anything in the case of getting back an
>>> error. ISTR this being deemed okay on the basis that the tool stack
>>> would then take whatever action, but whatever it is that is supposed
>>> to deal with errors here wants spelling out in the description.
>> Why? I don't change the previously expected decision and implementation
>> of the assign_device function: I use error paths as they were used before
>> for the existing code. So, I see no clear reason to stress that the existing
>> and new code relies on the toolstack
> Saying half a sentence on this is helping review.
Ok
>
>>> What's important is that no caller up the call tree may be left with
>>> the impression that the device is still owned by the original
>>> domain. With how you have it, the device is going to be owned by the
>>> new domain, but not really usable.
>> This is not true: vpci_assign_device will call vpci_deassign_device
>> internally if it fails. So, the device won't be assigned in this case
> No. The device is assigned to whatever pdev->domain holds. Calling
> vpci_deassign_device() there merely makes sure that the device will
> have _no_ vPCI data and hooks in place, rather than something
> partial.
So, this patch is only dealing with vpci assign/de-assign
And it rolls back what it did in case of a failure
It also returns rc in assign_device to signal it has failed
What else is expected from this patch??
>
>>>> --- a/xen/drivers/vpci/vpci.c
>>>> +++ b/xen/drivers/vpci/vpci.c
>>>> @@ -99,6 +99,33 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>>>    
>>>>        return rc;
>>>>    }
>>>> +
>>>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>>>> +/* Notify vPCI that device is assigned to guest. */
>>>> +int vpci_assign_device(struct domain *d, struct pci_dev *pdev)
>>>> +{
>>>> +    int rc;
>>>> +
>>>> +    if ( !has_vpci(d) )
>>>> +        return 0;
>>>> +
>>>> +    rc = vpci_add_handlers(pdev);
>>>> +    if ( rc )
>>>> +        vpci_deassign_device(d, pdev);
>>>> +
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +/* Notify vPCI that device is de-assigned from guest. */
>>>> +void vpci_deassign_device(struct domain *d, struct pci_dev *pdev)
>>>> +{
>>>> +    if ( !has_vpci(d) )
>>>> +        return;
>>>> +
>>>> +    vpci_remove_device(pdev);
>>>> +}
>>>> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
>>> While for the latter function you look to need two parameters, do you
>>> really need them also in the former one?
>> Do you mean instead of passing d we could just use pdev->domain?
>> int vpci_assign_device(struct pci_dev *pdev)
>> +{
>> +    int rc;
>> +
>> +    if ( !has_vpci(pdev->domain) )
>> +        return 0;
> Yes.
>
>> Yes, we probably can, but the rest of functions called from assign_device
>> are accepting both d and pdev, so not sure why would we want these
>> two be any different. Any good reason not to change others as well then?
> Yes: Prior to the call of the ->assign_device() hook, d != pdev->domain.
> It is the _purpose_ of this function to change ownership of the device.
This can be done and makes sense.
@Roger which way do you want this?
>
> Jan
>
Thank you,
Oleksandr

Jan Beulich Feb. 8, 2022, 9:44 a.m. UTC | #5

On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
> On 08.02.22 11:13, Jan Beulich wrote:
>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>                            pci_to_dev(pdev), flag);
>>>>>        }
>>>>>    
>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>> +
>>>>>     done:
>>>>>        if ( rc )
>>>>>            printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>> There's no attempt to undo anything in the case of getting back an
>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>> would then take whatever action, but whatever it is that is supposed
>>>> to deal with errors here wants spelling out in the description.
>>> Why? I don't change the previously expected decision and implementation
>>> of the assign_device function: I use error paths as they were used before
>>> for the existing code. So, I see no clear reason to stress that the existing
>>> and new code relies on the toolstack
>> Saying half a sentence on this is helping review.
> Ok
>>
>>>> What's important is that no caller up the call tree may be left with
>>>> the impression that the device is still owned by the original
>>>> domain. With how you have it, the device is going to be owned by the
>>>> new domain, but not really usable.
>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>> internally if it fails. So, the device won't be assigned in this case
>> No. The device is assigned to whatever pdev->domain holds. Calling
>> vpci_deassign_device() there merely makes sure that the device will
>> have _no_ vPCI data and hooks in place, rather than something
>> partial.
> So, this patch is only dealing with vpci assign/de-assign
> And it rolls back what it did in case of a failure
> It also returns rc in assign_device to signal it has failed
> What else is expected from this patch??

Until now if assign_device() returns an error, this tells the caller
that the device did not change ownership; in the worst case it either
only moved to the quarantine domain, or the new owner may have been
crashed. In no case is the device owned by an alive DomU. You're
changing this property, and hence you need to make clear/sure that
this isn't colliding with assumptions made elsewhere.

Jan

Oleksandr Andrushchenko Feb. 8, 2022, 9:55 a.m. UTC | #6

On 08.02.22 11:44, Jan Beulich wrote:
> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>> On 08.02.22 11:13, Jan Beulich wrote:
>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>                             pci_to_dev(pdev), flag);
>>>>>>         }
>>>>>>     
>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>> +
>>>>>>      done:
>>>>>>         if ( rc )
>>>>>>             printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>> There's no attempt to undo anything in the case of getting back an
>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>> would then take whatever action, but whatever it is that is supposed
>>>>> to deal with errors here wants spelling out in the description.
>>>> Why? I don't change the previously expected decision and implementation
>>>> of the assign_device function: I use error paths as they were used before
>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>> and new code relies on the toolstack
>>> Saying half a sentence on this is helping review.
>> Ok
>>>>> What's important is that no caller up the call tree may be left with
>>>>> the impression that the device is still owned by the original
>>>>> domain. With how you have it, the device is going to be owned by the
>>>>> new domain, but not really usable.
>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>> internally if it fails. So, the device won't be assigned in this case
>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>> vpci_deassign_device() there merely makes sure that the device will
>>> have _no_ vPCI data and hooks in place, rather than something
>>> partial.
>> So, this patch is only dealing with vpci assign/de-assign
>> And it rolls back what it did in case of a failure
>> It also returns rc in assign_device to signal it has failed
>> What else is expected from this patch??
> Until now if assign_device() returns an error, this tells the caller
> that the device did not change ownership;
Not sure this is the case:
     if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
                           pci_to_dev(pdev), flag)) )
iommu_call can leave the new ownership even now without
vpci_assign_device. My understanding is that the roll-back is
expected to be performed by the toolstack and vpci_assign_device
doesn't prevent that by returning rc. Even more, before we discussed
that it would be good for vpci_assign_device to try recovering from
a possible error early which is done by calling vpci_deassign_device
internally.

So, if you want the things to be clearly handled without relying on the
toolstack then it is not vpci_assign_device introduced issue, but the
existing one, which needs (if there is a good reason) to be fixed
separately.
I think that new code doesn't make things worse. At least

>   in the worst case it either
> only moved to the quarantine domain, or the new owner may have been
> crashed. In no case is the device owned by an alive DomU. You're
> changing this property, and hence you need to make clear/sure that
> this isn't colliding with assumptions made elsewhere.
>
> Jan
>
>

Jan Beulich Feb. 8, 2022, 10:09 a.m. UTC | #7

On 08.02.2022 10:55, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 11:44, Jan Beulich wrote:
>> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>>> On 08.02.22 11:13, Jan Beulich wrote:
>>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>>                             pci_to_dev(pdev), flag);
>>>>>>>         }
>>>>>>>     
>>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>>> +
>>>>>>>      done:
>>>>>>>         if ( rc )
>>>>>>>             printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>> There's no attempt to undo anything in the case of getting back an
>>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>>> would then take whatever action, but whatever it is that is supposed
>>>>>> to deal with errors here wants spelling out in the description.
>>>>> Why? I don't change the previously expected decision and implementation
>>>>> of the assign_device function: I use error paths as they were used before
>>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>>> and new code relies on the toolstack
>>>> Saying half a sentence on this is helping review.
>>> Ok
>>>>>> What's important is that no caller up the call tree may be left with
>>>>>> the impression that the device is still owned by the original
>>>>>> domain. With how you have it, the device is going to be owned by the
>>>>>> new domain, but not really usable.
>>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>>> internally if it fails. So, the device won't be assigned in this case
>>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>>> vpci_deassign_device() there merely makes sure that the device will
>>>> have _no_ vPCI data and hooks in place, rather than something
>>>> partial.
>>> So, this patch is only dealing with vpci assign/de-assign
>>> And it rolls back what it did in case of a failure
>>> It also returns rc in assign_device to signal it has failed
>>> What else is expected from this patch??
>> Until now if assign_device() returns an error, this tells the caller
>> that the device did not change ownership;
> Not sure this is the case:
>      if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>                            pci_to_dev(pdev), flag)) )
> iommu_call can leave the new ownership even now without
> vpci_assign_device.

Did you check the actual hook functions for when exactly the ownership
change happens. For both VT-d and AMD it is the last thing they do,
when no error can occur anymore.

 My understanding is that the roll-back is
> expected to be performed by the toolstack and vpci_assign_device
> doesn't prevent that by returning rc. Even more, before we discussed
> that it would be good for vpci_assign_device to try recovering from
> a possible error early which is done by calling vpci_deassign_device
> internally.

Yes, but that's only part of it. It at least needs considering what
effects have resulted from operations prior to vpci_assign_device().

Jan

Oleksandr Andrushchenko Feb. 8, 2022, 10:22 a.m. UTC | #8

On 08.02.22 12:09, Jan Beulich wrote:
> On 08.02.2022 10:55, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 11:44, Jan Beulich wrote:
>>> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>>>> On 08.02.22 11:13, Jan Beulich wrote:
>>>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>>>                              pci_to_dev(pdev), flag);
>>>>>>>>          }
>>>>>>>>      
>>>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>>>> +
>>>>>>>>       done:
>>>>>>>>          if ( rc )
>>>>>>>>              printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>>> There's no attempt to undo anything in the case of getting back an
>>>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>>>> would then take whatever action, but whatever it is that is supposed
>>>>>>> to deal with errors here wants spelling out in the description.
>>>>>> Why? I don't change the previously expected decision and implementation
>>>>>> of the assign_device function: I use error paths as they were used before
>>>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>>>> and new code relies on the toolstack
>>>>> Saying half a sentence on this is helping review.
>>>> Ok
>>>>>>> What's important is that no caller up the call tree may be left with
>>>>>>> the impression that the device is still owned by the original
>>>>>>> domain. With how you have it, the device is going to be owned by the
>>>>>>> new domain, but not really usable.
>>>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>>>> internally if it fails. So, the device won't be assigned in this case
>>>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>>>> vpci_deassign_device() there merely makes sure that the device will
>>>>> have _no_ vPCI data and hooks in place, rather than something
>>>>> partial.
>>>> So, this patch is only dealing with vpci assign/de-assign
>>>> And it rolls back what it did in case of a failure
>>>> It also returns rc in assign_device to signal it has failed
>>>> What else is expected from this patch??
>>> Until now if assign_device() returns an error, this tells the caller
>>> that the device did not change ownership;
>> Not sure this is the case:
>>       if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>                             pci_to_dev(pdev), flag)) )
>> iommu_call can leave the new ownership even now without
>> vpci_assign_device.
> Did you check the actual hook functions for when exactly the ownership
> change happens. For both VT-d and AMD it is the last thing they do,
> when no error can occur anymore.
This functionality does not exist for Arm yet, so this is up to the
future series to add that.

WRT to the existing code:

static int amd_iommu_assign_device(struct domain *d, u8 devfn,
                                    struct pci_dev *pdev,
                                    u32 flag)
{
     if ( !rc )
         rc = reassign_device(pdev->domain, d, devfn, pdev); <<<<< this will set pdev->domain

     if ( rc && !is_hardware_domain(d) )
     {
         int ret = amd_iommu_reserve_domain_unity_unmap(
                       d, ivrs_mappings[req_id].unity_map);

         if ( ret )
         {
             printk(XENLOG_ERR "AMD-Vi: "
                    "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
                    d, pdev->seg, pdev->bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
             domain_crash(d);
         }
So....

This is IMO wrong in the first place to let IOMMU code assign pdev->domain.
This is something that needs to be done by the PCI code itself and
not relying on each IOMMU callback implementation
>
>   My understanding is that the roll-back is
>> expected to be performed by the toolstack and vpci_assign_device
>> doesn't prevent that by returning rc. Even more, before we discussed
>> that it would be good for vpci_assign_device to try recovering from
>> a possible error early which is done by calling vpci_deassign_device
>> internally.
> Yes, but that's only part of it. It at least needs considering what
> effects have resulted from operations prior to vpci_assign_device().
Taking into account the code snippet above: what is your expectation
from this patch with this respect?

>
> Jan
>
Thank you,
Oleksandr

Jan Beulich Feb. 8, 2022, 10:29 a.m. UTC | #9

On 08.02.2022 11:22, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 12:09, Jan Beulich wrote:
>> On 08.02.2022 10:55, Oleksandr Andrushchenko wrote:
>>>
>>> On 08.02.22 11:44, Jan Beulich wrote:
>>>> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>>>>> On 08.02.22 11:13, Jan Beulich wrote:
>>>>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>>>>                              pci_to_dev(pdev), flag);
>>>>>>>>>          }
>>>>>>>>>      
>>>>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>>>>> +
>>>>>>>>>       done:
>>>>>>>>>          if ( rc )
>>>>>>>>>              printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>>>> There's no attempt to undo anything in the case of getting back an
>>>>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>>>>> would then take whatever action, but whatever it is that is supposed
>>>>>>>> to deal with errors here wants spelling out in the description.
>>>>>>> Why? I don't change the previously expected decision and implementation
>>>>>>> of the assign_device function: I use error paths as they were used before
>>>>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>>>>> and new code relies on the toolstack
>>>>>> Saying half a sentence on this is helping review.
>>>>> Ok
>>>>>>>> What's important is that no caller up the call tree may be left with
>>>>>>>> the impression that the device is still owned by the original
>>>>>>>> domain. With how you have it, the device is going to be owned by the
>>>>>>>> new domain, but not really usable.
>>>>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>>>>> internally if it fails. So, the device won't be assigned in this case
>>>>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>>>>> vpci_deassign_device() there merely makes sure that the device will
>>>>>> have _no_ vPCI data and hooks in place, rather than something
>>>>>> partial.
>>>>> So, this patch is only dealing with vpci assign/de-assign
>>>>> And it rolls back what it did in case of a failure
>>>>> It also returns rc in assign_device to signal it has failed
>>>>> What else is expected from this patch??
>>>> Until now if assign_device() returns an error, this tells the caller
>>>> that the device did not change ownership;
>>> Not sure this is the case:
>>>       if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>>                             pci_to_dev(pdev), flag)) )
>>> iommu_call can leave the new ownership even now without
>>> vpci_assign_device.
>> Did you check the actual hook functions for when exactly the ownership
>> change happens. For both VT-d and AMD it is the last thing they do,
>> when no error can occur anymore.
> This functionality does not exist for Arm yet, so this is up to the
> future series to add that.
> 
> WRT to the existing code:
> 
> static int amd_iommu_assign_device(struct domain *d, u8 devfn,
>                                     struct pci_dev *pdev,
>                                     u32 flag)
> {
>      if ( !rc )
>          rc = reassign_device(pdev->domain, d, devfn, pdev); <<<<< this will set pdev->domain
> 
>      if ( rc && !is_hardware_domain(d) )
>      {
>          int ret = amd_iommu_reserve_domain_unity_unmap(
>                        d, ivrs_mappings[req_id].unity_map);
> 
>          if ( ret )
>          {
>              printk(XENLOG_ERR "AMD-Vi: "
>                     "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
>                     d, pdev->seg, pdev->bus,
>                     PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
>              domain_crash(d);
>          }
> So....
> 
> This is IMO wrong in the first place to let IOMMU code assign pdev->domain.
> This is something that needs to be done by the PCI code itself and
> not relying on each IOMMU callback implementation
>>
>>   My understanding is that the roll-back is
>>> expected to be performed by the toolstack and vpci_assign_device
>>> doesn't prevent that by returning rc. Even more, before we discussed
>>> that it would be good for vpci_assign_device to try recovering from
>>> a possible error early which is done by calling vpci_deassign_device
>>> internally.
>> Yes, but that's only part of it. It at least needs considering what
>> effects have resulted from operations prior to vpci_assign_device().
> Taking into account the code snippet above: what is your expectation
> from this patch with this respect?

You did note the domain_crash() in there, didn't you? The snippet above
still matches the "device not assigned to an alive DomU" criteria (which
can be translated to "no exposure of a device to an untrusted entity in
case of error"). Such domain_crash() uses aren't nice, and I'd prefer to
see them go away, but said property needs to be retained with any
alternative solutions.

Jan

Oleksandr Andrushchenko Feb. 8, 2022, 10:52 a.m. UTC | #10

On 08.02.22 12:29, Jan Beulich wrote:
> On 08.02.2022 11:22, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 12:09, Jan Beulich wrote:
>>> On 08.02.2022 10:55, Oleksandr Andrushchenko wrote:
>>>> On 08.02.22 11:44, Jan Beulich wrote:
>>>>> On 08.02.2022 10:27, Oleksandr Andrushchenko wrote:
>>>>>> On 08.02.22 11:13, Jan Beulich wrote:
>>>>>>> On 08.02.2022 09:32, Oleksandr Andrushchenko wrote:
>>>>>>>> On 07.02.22 18:28, Jan Beulich wrote:
>>>>>>>>> On 04.02.2022 07:34, Oleksandr Andrushchenko wrote:
>>>>>>>>>> @@ -1507,6 +1511,8 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>>>>>>                               pci_to_dev(pdev), flag);
>>>>>>>>>>           }
>>>>>>>>>>       
>>>>>>>>>> +    rc = vpci_assign_device(d, pdev);
>>>>>>>>>> +
>>>>>>>>>>        done:
>>>>>>>>>>           if ( rc )
>>>>>>>>>>               printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>>>>> There's no attempt to undo anything in the case of getting back an
>>>>>>>>> error. ISTR this being deemed okay on the basis that the tool stack
>>>>>>>>> would then take whatever action, but whatever it is that is supposed
>>>>>>>>> to deal with errors here wants spelling out in the description.
>>>>>>>> Why? I don't change the previously expected decision and implementation
>>>>>>>> of the assign_device function: I use error paths as they were used before
>>>>>>>> for the existing code. So, I see no clear reason to stress that the existing
>>>>>>>> and new code relies on the toolstack
>>>>>>> Saying half a sentence on this is helping review.
>>>>>> Ok
>>>>>>>>> What's important is that no caller up the call tree may be left with
>>>>>>>>> the impression that the device is still owned by the original
>>>>>>>>> domain. With how you have it, the device is going to be owned by the
>>>>>>>>> new domain, but not really usable.
>>>>>>>> This is not true: vpci_assign_device will call vpci_deassign_device
>>>>>>>> internally if it fails. So, the device won't be assigned in this case
>>>>>>> No. The device is assigned to whatever pdev->domain holds. Calling
>>>>>>> vpci_deassign_device() there merely makes sure that the device will
>>>>>>> have _no_ vPCI data and hooks in place, rather than something
>>>>>>> partial.
>>>>>> So, this patch is only dealing with vpci assign/de-assign
>>>>>> And it rolls back what it did in case of a failure
>>>>>> It also returns rc in assign_device to signal it has failed
>>>>>> What else is expected from this patch??
>>>>> Until now if assign_device() returns an error, this tells the caller
>>>>> that the device did not change ownership;
>>>> Not sure this is the case:
>>>>        if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
>>>>                              pci_to_dev(pdev), flag)) )
>>>> iommu_call can leave the new ownership even now without
>>>> vpci_assign_device.
>>> Did you check the actual hook functions for when exactly the ownership
>>> change happens. For both VT-d and AMD it is the last thing they do,
>>> when no error can occur anymore.
>> This functionality does not exist for Arm yet, so this is up to the
>> future series to add that.
>>
>> WRT to the existing code:
>>
>> static int amd_iommu_assign_device(struct domain *d, u8 devfn,
>>                                      struct pci_dev *pdev,
>>                                      u32 flag)
>> {
>>       if ( !rc )
>>           rc = reassign_device(pdev->domain, d, devfn, pdev); <<<<< this will set pdev->domain
>>
>>       if ( rc && !is_hardware_domain(d) )
>>       {
>>           int ret = amd_iommu_reserve_domain_unity_unmap(
>>                         d, ivrs_mappings[req_id].unity_map);
>>
>>           if ( ret )
>>           {
>>               printk(XENLOG_ERR "AMD-Vi: "
>>                      "unity-unmap for %pd/%04x:%02x:%02x.%u failed (%d)\n",
>>                      d, pdev->seg, pdev->bus,
>>                      PCI_SLOT(devfn), PCI_FUNC(devfn), ret);
>>               domain_crash(d);
>>           }
>> So....
>>
>> This is IMO wrong in the first place to let IOMMU code assign pdev->domain.
>> This is something that needs to be done by the PCI code itself and
>> not relying on each IOMMU callback implementation
>>>    My understanding is that the roll-back is
>>>> expected to be performed by the toolstack and vpci_assign_device
>>>> doesn't prevent that by returning rc. Even more, before we discussed
>>>> that it would be good for vpci_assign_device to try recovering from
>>>> a possible error early which is done by calling vpci_deassign_device
>>>> internally.
>>> Yes, but that's only part of it. It at least needs considering what
>>> effects have resulted from operations prior to vpci_assign_device().
>> Taking into account the code snippet above: what is your expectation
>> from this patch with this respect?
> You did note the domain_crash() in there, didn't you?
Which is AMD specific implementation which can be different for
other IOMMUs. Yes, I did.
> The snippet above
> still matches the "device not assigned to an alive DomU" criteria (which
> can be translated to "no exposure of a device to an untrusted entity in
> case of error"). Such domain_crash() uses aren't nice, and I'd prefer to
> see them go away, but said property needs to be retained with any
> alternative solutions.
This smells like we first need to fix the existing code, so
pdev->domain is not assigned by specific IOMMU implementations,
but instead controlled by the code which relies on that, assign_device.

I can have something like:

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 88836aab6baf..cc7790709a50 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
  static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
  {
      const struct domain_iommu *hd = dom_iommu(d);
+    struct domain *old_owner;
      struct pci_dev *pdev;
      int rc = 0;

@@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
      ASSERT(pdev && (pdev->domain == hardware_domain ||
                      pdev->domain == dom_io));

+    /* We need to restore the old owner in case of an error. */
+    old_owner = pdev->domain;
+
      vpci_deassign_device(pdev->domain, pdev);

      rc = pdev_msix_assign(d, pdev);
@@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)

   done:
      if ( rc )
+    {
          printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
                 d, &PCI_SBDF3(seg, bus, devfn), rc);
+        /* We failed to assign, so restore the previous owner. */
+        pdev->domain = old_owner;
+    }
      /* The device is assigned to dom_io so mark it as quarantined */
      else if ( d == dom_io )
          pdev->quarantine = true;

But I do not think this belongs to this patch
>
> Jan
>
Thank you,
Oleksandr

Jan Beulich Feb. 8, 2022, 11 a.m. UTC | #11

On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
> This smells like we first need to fix the existing code, so
> pdev->domain is not assigned by specific IOMMU implementations,
> but instead controlled by the code which relies on that, assign_device.

Feel free to come up with proposals how to cleanly do so. Moving the
assignment to pdev->domain may even be possible now, but if you go
back you may find that the code was quite different earlier on.

> I can have something like:
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 88836aab6baf..cc7790709a50 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>   static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>   {
>       const struct domain_iommu *hd = dom_iommu(d);
> +    struct domain *old_owner;
>       struct pci_dev *pdev;
>       int rc = 0;
> 
> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>       ASSERT(pdev && (pdev->domain == hardware_domain ||
>                       pdev->domain == dom_io));
> 
> +    /* We need to restore the old owner in case of an error. */
> +    old_owner = pdev->domain;
> +
>       vpci_deassign_device(pdev->domain, pdev);
> 
>       rc = pdev_msix_assign(d, pdev);
> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
> 
>    done:
>       if ( rc )
> +    {
>           printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>                  d, &PCI_SBDF3(seg, bus, devfn), rc);
> +        /* We failed to assign, so restore the previous owner. */
> +        pdev->domain = old_owner;
> +    }
>       /* The device is assigned to dom_io so mark it as quarantined */
>       else if ( d == dom_io )
>           pdev->quarantine = true;
> 
> But I do not think this belongs to this patch

Indeed. Plus I'm sure you understand that it's not that simple. Assigning
to pdev->domain is only the last step of assignment. Restoring the original
owner would entail putting in place the original IOMMU table entries as
well, which in turn can fail. Hence why you'll find a number of uses of
domain_crash() in places where rolling back is far from easy.

Jan

Oleksandr Andrushchenko Feb. 8, 2022, 11:25 a.m. UTC | #12

On 08.02.22 13:00, Jan Beulich wrote:
> On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
>> This smells like we first need to fix the existing code, so
>> pdev->domain is not assigned by specific IOMMU implementations,
>> but instead controlled by the code which relies on that, assign_device.
> Feel free to come up with proposals how to cleanly do so. Moving the
> assignment to pdev->domain may even be possible now, but if you go
> back you may find that the code was quite different earlier on.
I do understand that as the code evolves new use cases bring
new issues.
>
>> I can have something like:
>>
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index 88836aab6baf..cc7790709a50 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>>    static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>    {
>>        const struct domain_iommu *hd = dom_iommu(d);
>> +    struct domain *old_owner;
>>        struct pci_dev *pdev;
>>        int rc = 0;
>>
>> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>        ASSERT(pdev && (pdev->domain == hardware_domain ||
>>                        pdev->domain == dom_io));
>>
>> +    /* We need to restore the old owner in case of an error. */
>> +    old_owner = pdev->domain;
>> +
>>        vpci_deassign_device(pdev->domain, pdev);
>>
>>        rc = pdev_msix_assign(d, pdev);
>> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>
>>     done:
>>        if ( rc )
>> +    {
>>            printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>                   d, &PCI_SBDF3(seg, bus, devfn), rc);
>> +        /* We failed to assign, so restore the previous owner. */
>> +        pdev->domain = old_owner;
>> +    }
>>        /* The device is assigned to dom_io so mark it as quarantined */
>>        else if ( d == dom_io )
>>            pdev->quarantine = true;
>>
>> But I do not think this belongs to this patch
> Indeed. Plus I'm sure you understand that it's not that simple. Assigning
> to pdev->domain is only the last step of assignment. Restoring the original
> owner would entail putting in place the original IOMMU table entries as
> well, which in turn can fail. Hence why you'll find a number of uses of
> domain_crash() in places where rolling back is far from easy.
So, why don't we just rely on the toolstack to do the roll back then?
This way we won't add new domain_crash() calls.
I do understand though that we may live Xen in a wrong state though.
So, do you think it is possible if we just call deassign_device from
assign_device on the error path? This is just like I do in vpci_assign_device:
I call vpci_deassign_device if the former fails.
> Jan
>
Thank you,
Oleksandr

Oleksandr Andrushchenko Feb. 10, 2022, 8:21 a.m. UTC | #13

On 08.02.22 13:25, Oleksandr Andrushchenko wrote:
>
> On 08.02.22 13:00, Jan Beulich wrote:
>> On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
>>> This smells like we first need to fix the existing code, so
>>> pdev->domain is not assigned by specific IOMMU implementations,
>>> but instead controlled by the code which relies on that, assign_device.
>> Feel free to come up with proposals how to cleanly do so. Moving the
>> assignment to pdev->domain may even be possible now, but if you go
>> back you may find that the code was quite different earlier on.
> I do understand that as the code evolves new use cases bring
> new issues.
>>> I can have something like:
>>>
>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>> index 88836aab6baf..cc7790709a50 100644
>>> --- a/xen/drivers/passthrough/pci.c
>>> +++ b/xen/drivers/passthrough/pci.c
>>> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>>>     static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>     {
>>>         const struct domain_iommu *hd = dom_iommu(d);
>>> +    struct domain *old_owner;
>>>         struct pci_dev *pdev;
>>>         int rc = 0;
>>>
>>> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>         ASSERT(pdev && (pdev->domain == hardware_domain ||
>>>                         pdev->domain == dom_io));
>>>
>>> +    /* We need to restore the old owner in case of an error. */
>>> +    old_owner = pdev->domain;
>>> +
>>>         vpci_deassign_device(pdev->domain, pdev);
>>>
>>>         rc = pdev_msix_assign(d, pdev);
>>> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>
>>>      done:
>>>         if ( rc )
>>> +    {
>>>             printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>                    d, &PCI_SBDF3(seg, bus, devfn), rc);
>>> +        /* We failed to assign, so restore the previous owner. */
>>> +        pdev->domain = old_owner;
>>> +    }
>>>         /* The device is assigned to dom_io so mark it as quarantined */
>>>         else if ( d == dom_io )
>>>             pdev->quarantine = true;
>>>
>>> But I do not think this belongs to this patch
>> Indeed. Plus I'm sure you understand that it's not that simple. Assigning
>> to pdev->domain is only the last step of assignment. Restoring the original
>> owner would entail putting in place the original IOMMU table entries as
>> well, which in turn can fail. Hence why you'll find a number of uses of
>> domain_crash() in places where rolling back is far from easy.
> So, why don't we just rely on the toolstack to do the roll back then?
> This way we won't add new domain_crash() calls.
> I do understand though that we may live Xen in a wrong state though.
> So, do you think it is possible if we just call deassign_device from
> assign_device on the error path? This is just like I do in vpci_assign_device:
> I call vpci_deassign_device if the former fails.
With the following addition:

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index c4ae22aeefcd..d6c00449193c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1511,6 +1511,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
      }

      rc = vpci_assign_device(pdev);
+    if ( rc )
+        /*
+         * Ignore the return code as we want to preserve the one from the
+         * failed assign operation.
+         */
+        deassign_device(d, seg, bus, devfn);

   done:
      if ( rc )

I see the following logs (PV Dom0):

(XEN) assign_device seg 0 bus 3 devfn 0
(XEN) [VT-D]d[IO]:PCIe: unmap 0000:03:00.0
(XEN) [VT-D]d4:PCIe: map 0000:03:00.0
(XEN) assign_device vpci_assign rc -22 from d[IO] to d4
(XEN) deassign_device current d4 to d[IO]
(XEN) [VT-D]d4:PCIe: unmap 0000:03:00.0
(XEN) [VT-D]d[IO]:PCIe: map 0000:03:00.0
(XEN) deassign_device ret 0
(XEN) d4: assign (0000:03:00.0) failed (-22)
libxl: error: libxl_pci.c:1498:pci_add_dm_done: Domain 4:xc_assign_device failed: Invalid argument
libxl: error: libxl_pci.c:1781:device_pci_add_done: Domain 4:libxl__device_pci_add failed for PCI device 0:3:0.0 (rc -3)
libxl: error: libxl_create.c:1895:domcreate_attach_devices: Domain 4:unable to add pci devices
libxl: error: libxl_domain.c:1183:libxl__destroy_domid: Domain 4:Non-existant domain
libxl: error: libxl_domain.c:1137:domain_destroy_callback: Domain 4:Unable to destroy guest
libxl: error: libxl_domain.c:1064:domain_destroy_cb: Domain 4:Destruction of domain failed

So, it seems to properly solve the issue with pdev->domain left
set to the domain we couldn't create.

@Jan, will this address your concern?

Thank you,
Oleksandr

Jan Beulich Feb. 10, 2022, 9:22 a.m. UTC | #14

On 10.02.2022 09:21, Oleksandr Andrushchenko wrote:
> 
> 
> On 08.02.22 13:25, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 13:00, Jan Beulich wrote:
>>> On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
>>>> This smells like we first need to fix the existing code, so
>>>> pdev->domain is not assigned by specific IOMMU implementations,
>>>> but instead controlled by the code which relies on that, assign_device.
>>> Feel free to come up with proposals how to cleanly do so. Moving the
>>> assignment to pdev->domain may even be possible now, but if you go
>>> back you may find that the code was quite different earlier on.
>> I do understand that as the code evolves new use cases bring
>> new issues.
>>>> I can have something like:
>>>>
>>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>>> index 88836aab6baf..cc7790709a50 100644
>>>> --- a/xen/drivers/passthrough/pci.c
>>>> +++ b/xen/drivers/passthrough/pci.c
>>>> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>>>>     static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>     {
>>>>         const struct domain_iommu *hd = dom_iommu(d);
>>>> +    struct domain *old_owner;
>>>>         struct pci_dev *pdev;
>>>>         int rc = 0;
>>>>
>>>> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>         ASSERT(pdev && (pdev->domain == hardware_domain ||
>>>>                         pdev->domain == dom_io));
>>>>
>>>> +    /* We need to restore the old owner in case of an error. */
>>>> +    old_owner = pdev->domain;
>>>> +
>>>>         vpci_deassign_device(pdev->domain, pdev);
>>>>
>>>>         rc = pdev_msix_assign(d, pdev);
>>>> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>
>>>>      done:
>>>>         if ( rc )
>>>> +    {
>>>>             printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>                    d, &PCI_SBDF3(seg, bus, devfn), rc);
>>>> +        /* We failed to assign, so restore the previous owner. */
>>>> +        pdev->domain = old_owner;
>>>> +    }
>>>>         /* The device is assigned to dom_io so mark it as quarantined */
>>>>         else if ( d == dom_io )
>>>>             pdev->quarantine = true;
>>>>
>>>> But I do not think this belongs to this patch
>>> Indeed. Plus I'm sure you understand that it's not that simple. Assigning
>>> to pdev->domain is only the last step of assignment. Restoring the original
>>> owner would entail putting in place the original IOMMU table entries as
>>> well, which in turn can fail. Hence why you'll find a number of uses of
>>> domain_crash() in places where rolling back is far from easy.
>> So, why don't we just rely on the toolstack to do the roll back then?
>> This way we won't add new domain_crash() calls.
>> I do understand though that we may live Xen in a wrong state though.
>> So, do you think it is possible if we just call deassign_device from
>> assign_device on the error path? This is just like I do in vpci_assign_device:
>> I call vpci_deassign_device if the former fails.
> With the following addition:
> 
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index c4ae22aeefcd..d6c00449193c 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -1511,6 +1511,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>       }
> 
>       rc = vpci_assign_device(pdev);
> +    if ( rc )
> +        /*
> +         * Ignore the return code as we want to preserve the one from the
> +         * failed assign operation.
> +         */
> +        deassign_device(d, seg, bus, devfn);
> 
>    done:
>       if ( rc )
> 
> I see the following logs (PV Dom0):
> 
> (XEN) assign_device seg 0 bus 3 devfn 0
> (XEN) [VT-D]d[IO]:PCIe: unmap 0000:03:00.0
> (XEN) [VT-D]d4:PCIe: map 0000:03:00.0
> (XEN) assign_device vpci_assign rc -22 from d[IO] to d4
> (XEN) deassign_device current d4 to d[IO]
> (XEN) [VT-D]d4:PCIe: unmap 0000:03:00.0
> (XEN) [VT-D]d[IO]:PCIe: map 0000:03:00.0
> (XEN) deassign_device ret 0
> (XEN) d4: assign (0000:03:00.0) failed (-22)
> libxl: error: libxl_pci.c:1498:pci_add_dm_done: Domain 4:xc_assign_device failed: Invalid argument
> libxl: error: libxl_pci.c:1781:device_pci_add_done: Domain 4:libxl__device_pci_add failed for PCI device 0:3:0.0 (rc -3)
> libxl: error: libxl_create.c:1895:domcreate_attach_devices: Domain 4:unable to add pci devices
> libxl: error: libxl_domain.c:1183:libxl__destroy_domid: Domain 4:Non-existant domain
> libxl: error: libxl_domain.c:1137:domain_destroy_callback: Domain 4:Unable to destroy guest
> libxl: error: libxl_domain.c:1064:domain_destroy_cb: Domain 4:Destruction of domain failed
> 
> So, it seems to properly solve the issue with pdev->domain left
> set to the domain we couldn't create.
> 
> @Jan, will this address your concern?

Partly: For one I'd have to think through what further implications there
are from going this route. And then completely ignoring the return value
is unlikely to be correct: You certainly want to retain the original
error code for returning to the caller, but you can't leave the error
unhandled. That's likely another case where the "best" choice is to crash
the guest.

Jan

Oleksandr Andrushchenko Feb. 10, 2022, 9:33 a.m. UTC | #15

On 10.02.22 11:22, Jan Beulich wrote:
> On 10.02.2022 09:21, Oleksandr Andrushchenko wrote:
>>
>> On 08.02.22 13:25, Oleksandr Andrushchenko wrote:
>>> On 08.02.22 13:00, Jan Beulich wrote:
>>>> On 08.02.2022 11:52, Oleksandr Andrushchenko wrote:
>>>>> This smells like we first need to fix the existing code, so
>>>>> pdev->domain is not assigned by specific IOMMU implementations,
>>>>> but instead controlled by the code which relies on that, assign_device.
>>>> Feel free to come up with proposals how to cleanly do so. Moving the
>>>> assignment to pdev->domain may even be possible now, but if you go
>>>> back you may find that the code was quite different earlier on.
>>> I do understand that as the code evolves new use cases bring
>>> new issues.
>>>>> I can have something like:
>>>>>
>>>>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>>>>> index 88836aab6baf..cc7790709a50 100644
>>>>> --- a/xen/drivers/passthrough/pci.c
>>>>> +++ b/xen/drivers/passthrough/pci.c
>>>>> @@ -1475,6 +1475,7 @@ static int device_assigned(u16 seg, u8 bus, u8 devfn)
>>>>>      static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>      {
>>>>>          const struct domain_iommu *hd = dom_iommu(d);
>>>>> +    struct domain *old_owner;
>>>>>          struct pci_dev *pdev;
>>>>>          int rc = 0;
>>>>>
>>>>> @@ -1490,6 +1491,9 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>          ASSERT(pdev && (pdev->domain == hardware_domain ||
>>>>>                          pdev->domain == dom_io));
>>>>>
>>>>> +    /* We need to restore the old owner in case of an error. */
>>>>> +    old_owner = pdev->domain;
>>>>> +
>>>>>          vpci_deassign_device(pdev->domain, pdev);
>>>>>
>>>>>          rc = pdev_msix_assign(d, pdev);
>>>>> @@ -1515,8 +1519,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>>>>
>>>>>       done:
>>>>>          if ( rc )
>>>>> +    {
>>>>>              printk(XENLOG_G_WARNING "%pd: assign (%pp) failed (%d)\n",
>>>>>                     d, &PCI_SBDF3(seg, bus, devfn), rc);
>>>>> +        /* We failed to assign, so restore the previous owner. */
>>>>> +        pdev->domain = old_owner;
>>>>> +    }
>>>>>          /* The device is assigned to dom_io so mark it as quarantined */
>>>>>          else if ( d == dom_io )
>>>>>              pdev->quarantine = true;
>>>>>
>>>>> But I do not think this belongs to this patch
>>>> Indeed. Plus I'm sure you understand that it's not that simple. Assigning
>>>> to pdev->domain is only the last step of assignment. Restoring the original
>>>> owner would entail putting in place the original IOMMU table entries as
>>>> well, which in turn can fail. Hence why you'll find a number of uses of
>>>> domain_crash() in places where rolling back is far from easy.
>>> So, why don't we just rely on the toolstack to do the roll back then?
>>> This way we won't add new domain_crash() calls.
>>> I do understand though that we may live Xen in a wrong state though.
>>> So, do you think it is possible if we just call deassign_device from
>>> assign_device on the error path? This is just like I do in vpci_assign_device:
>>> I call vpci_deassign_device if the former fails.
>> With the following addition:
>>
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index c4ae22aeefcd..d6c00449193c 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -1511,6 +1511,12 @@ static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
>>        }
>>
>>        rc = vpci_assign_device(pdev);
>> +    if ( rc )
>> +        /*
>> +         * Ignore the return code as we want to preserve the one from the
>> +         * failed assign operation.
>> +         */
>> +        deassign_device(d, seg, bus, devfn);
This needs devfn to be preserved as it can be modified by the loop above:
     for ( ; pdev->phantom_stride; rc = 0 )
     {
         devfn += pdev->phantom_stride;

>>
>>     done:
>>        if ( rc )
>>
>> I see the following logs (PV Dom0):
>>
>> (XEN) assign_device seg 0 bus 3 devfn 0
>> (XEN) [VT-D]d[IO]:PCIe: unmap 0000:03:00.0
>> (XEN) [VT-D]d4:PCIe: map 0000:03:00.0
>> (XEN) assign_device vpci_assign rc -22 from d[IO] to d4
>> (XEN) deassign_device current d4 to d[IO]
>> (XEN) [VT-D]d4:PCIe: unmap 0000:03:00.0
>> (XEN) [VT-D]d[IO]:PCIe: map 0000:03:00.0
>> (XEN) deassign_device ret 0
>> (XEN) d4: assign (0000:03:00.0) failed (-22)
>> libxl: error: libxl_pci.c:1498:pci_add_dm_done: Domain 4:xc_assign_device failed: Invalid argument
>> libxl: error: libxl_pci.c:1781:device_pci_add_done: Domain 4:libxl__device_pci_add failed for PCI device 0:3:0.0 (rc -3)
>> libxl: error: libxl_create.c:1895:domcreate_attach_devices: Domain 4:unable to add pci devices
>> libxl: error: libxl_domain.c:1183:libxl__destroy_domid: Domain 4:Non-existant domain
>> libxl: error: libxl_domain.c:1137:domain_destroy_callback: Domain 4:Unable to destroy guest
>> libxl: error: libxl_domain.c:1064:domain_destroy_cb: Domain 4:Destruction of domain failed
>>
>> So, it seems to properly solve the issue with pdev->domain left
>> set to the domain we couldn't create.
>>
>> @Jan, will this address your concern?
> Partly: For one I'd have to think through what further implications there
> are from going this route. And then completely ignoring the return value
> is unlikely to be correct: You certainly want to retain the original
> error code for returning to the caller, but you can't leave the error
> unhandled. That's likely another case where the "best" choice is to crash
> the guest.
Ok, then I'll crash the domain...
>
> Jan
>
>
Thank you,
Oleksandr

[v6,05/13] vpci: add hooks for PCI device assign/de-assign

Commit Message

Comments

Patch