
[v10,13/17] vpci: add initial support for virtual PCI bus topology

Message ID 20231012220854.2736994-14-volodymyr_babchuk@epam.com (mailing list archive)
State Superseded
Series PCI devices passthrough on Arm, part 3

Commit Message

Volodymyr Babchuk Oct. 12, 2023, 10:09 p.m. UTC
From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign SBDF to the PCI devices being passed through with bus 0.
The resulting topology is one where PCIe devices reside on bus 0 of the
root complex itself (embedded endpoints).
This implementation is limited to the 32 devices which are allowed on
a single PCI bus.

Please note, that at the moment only function 0 of a multifunction
device can be passed through.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
In v10:
- Removed ASSERT(pcidevs_locked())
- Removed redundant code (local sbdf variable, clearing sbdf during
device removal, etc)
- Added __maybe_unused attribute to "out:" label
- Introduced HAS_VPCI_GUEST_SUPPORT Kconfig option, as this is the
  first patch where it is used (previously was in "vpci: add hooks for
  PCI device assign/de-assign")
In v9:
- Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
In v8:
- Added write lock in add_virtual_device
Since v6:
- re-work wrt new locking scheme
- OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
Since v5:
- s/vpci_add_virtual_device/add_virtual_device and make it static
- call add_virtual_device from vpci_assign_device and do not use
  REGISTER_VPCI_INIT machinery
- add pcidevs_locked ASSERT
- use DECLARE_BITMAP for vpci_dev_assigned_map
Since v4:
- moved and re-worked guest sbdf initializers
- s/set_bit/__set_bit
- s/clear_bit/__clear_bit
- minor comment fix s/Virtual/Guest/
- added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
  later for counting the number of MMIO handlers required for a guest
  (Julien)
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/Kconfig     |  4 +++
 xen/drivers/vpci/vpci.c | 63 +++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h |  8 ++++++
 xen/include/xen/vpci.h  | 11 +++++++
 4 files changed, 86 insertions(+)

Comments

Julien Grall Nov. 16, 2023, 4:06 p.m. UTC | #1
Hi Volodymyr,

This patch was mentioned in another context about allocating the BDF. So 
I thought I would comment here as well.

On 12/10/2023 23:09, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Your Signed-off-by should be added even if you are only sending the 
patch on behalf of Oleksandr. This is part of the DCO [1].

> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 5e34d0092a..7c46a2d3f4 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -36,6 +36,52 @@ extern vpci_register_init_t *const __start_vpci_array[];
>   extern vpci_register_init_t *const __end_vpci_array[];
>   #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>   
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    unsigned long new_dev_number;
> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);

IIUC, this means that Xen will allocate the BDF. I think this will 
become a problem quite quickly as some PCI devices may need to be 
assigned a specific vBDF (I have the Intel graphics card in mind).

Also, xl allows you to specify the slot (e.g. <bdf>@<vslot>) which 
would not work with this approach.

For dom0less passthrough, I feel the virtual BDF should always be 
specified in device-tree. When a domain is created after boot, then I 
think you want to support <bdf>@<vslot> where <vslot> is optional.

[1] https://cert-manager.io/docs/contributing/sign-off/
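
For reference, the 32-device limit and the VPCI_MAX_VIRT_DEV = PCI_SLOT(~0) + 1
constant mentioned in the changelog follow from how a config address packs an
8-bit bus, a 5-bit device (slot) and a 3-bit function. Below is a minimal
standalone C illustration of that split; the macro names are invented for the
example and are not the Xen ones.

#include <stdint.h>
#include <stdio.h>

/* Illustrative B/D/F split: bus 8 bits, device (slot) 5 bits, function 3 bits.
 * These helpers mirror the usual PCI_SLOT/PCI_FUNC idea but are not Xen code. */
#define EX_DEVFN(dev, fn)    ((((dev) & 0x1f) << 3) | ((fn) & 0x07))
#define EX_SLOT(devfn)       (((devfn) >> 3) & 0x1f)
#define EX_BDF(bus, dev, fn) ((((uint16_t)(bus)) << 8) | EX_DEVFN(dev, fn))

int main(void)
{
    /* EX_SLOT(~0) == 31, hence 31 + 1 == 32 slots on one bus. */
    printf("max slot = %u, slots per bus = %u\n", EX_SLOT(~0u), EX_SLOT(~0u) + 1);
    printf("00:11.5 -> devfn 0x%02x, bdf 0x%04x\n",
           (unsigned)EX_DEVFN(0x11, 5), (unsigned)EX_BDF(0, 0x11, 5));
    return 0;
}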
Stefano Stabellini Nov. 16, 2023, 11:28 p.m. UTC | #2
On Thu, 16 Nov 2023, Julien Grall wrote:
> IIUC, this means that Xen will allocate the BDF. I think this will become a
> problem quite quickly as some of the PCI may need to be assigned at a specific
> vBDF (I have the intel graphic card in mind).
> 
> Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would not
> work with this approach.
> 
> For dom0less passthrough, I feel the virtual BDF should always be specified in
> device-tree. When a domain is created after boot, then I think you want to
> support <bdf>@<vslot> where <vslot> is optional.

Hi Julien,

I also think there should be a way to specify the virtual BDF, but if
possible (meaning: it is not super difficult to implement) I think it
would be very convenient if we could let Xen pick whatever virtual BDF
Xen wants when the user doesn't specify the virtual BDF. That's
because it would make it easier to specify the configuration for the
user. Typically the user doesn't care about the virtual BDF; they only want
to expose a specific host device to the VM. There are exceptions of course
and that's why I think we should also have a way for the user to
request a specific virtual BDF. One of these exceptions is integrated
GPUs: the OS drivers used to have hardcoded BDFs, so it wouldn't work if
the device shows up at a different virtual BDF compared to the host.

Thinking more about this, one way to simplify the problem would be if we
always reuse the physical BDF as virtual BDF for passthrough devices. I
think that would solve the problem and make it much less likely to
run into driver bugs.

And we allocate a "special" virtual BDF space for emulated devices, with
the Root Complex still emulated in Xen. For instance, we could reserve
ff:xx:xx and in case of clashes we could refuse to continue. Or we could
allocate the first free virtual BDF, after all the passthrough devices.

Example:
- the user wants to assign physical 00:11.5 and b3:00.1 to the guest
- Xen creates virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices
- Xen allocates the next virtual BDF for emulated devices: b4:xx.x
- If more virtual BDFs are needed for emulated devices, Xen allocates
  b5:xx.x

I still think, no matter the BDF allocation scheme, that we should try
to avoid as much as possible to have two different PCI Root Complex
emulators. Ideally we would have only one PCI Root Complex emulated by
Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
tolerable but not ideal. The worst case I would like to avoid is to have
two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
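
A toy sketch of the bus split in the example above, assuming a single segment:
passthrough devices keep their physical BDF, and emulated devices go on the
first bus after the highest passthrough bus. This is only an illustration of
the idea, not proposed code.

#include <stddef.h>
#include <stdint.h>

/*
 * Toy illustration (single segment assumed): return the first bus number
 * available for emulated devices, one past the highest passthrough bus.
 * NB: a real implementation would need to handle max_bus == 0xff.
 */
static uint8_t first_emulated_bus(const uint16_t *pt_bdfs, size_t n)
{
    uint8_t max_bus = 0;

    for ( size_t i = 0; i < n; i++ )
    {
        uint8_t bus = pt_bdfs[i] >> 8;   /* bus is the top byte of a BDF */

        if ( bus > max_bus )
            max_bus = bus;
    }

    return max_bus + 1;   /* e.g. 00:11.5 and b3:00.1 -> emulated bus b4 */
}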
Julien Grall Nov. 17, 2023, 12:06 a.m. UTC | #3
Hi Stefano,

On 16/11/2023 23:28, Stefano Stabellini wrote:
> On Thu, 16 Nov 2023, Julien Grall wrote:
>> IIUC, this means that Xen will allocate the BDF. I think this will become a
>> problem quite quickly as some of the PCI may need to be assigned at a specific
>> vBDF (I have the intel graphic card in mind).
>>
>> Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would not
>> work with this approach.
>>
>> For dom0less passthrough, I feel the virtual BDF should always be specified in
>> device-tree. When a domain is created after boot, then I think you want to
>> support <bdf>@<vslot> where <vslot> is optional.
> 
> Hi Julien,
> 
> I also think there should be a way to specify the virtual BDF, but if
> possible (meaning: it is not super difficult to implement) I think it
> would be very convenient if we could let Xen pick whatever virtual BDF
> Xen wants when the user doesn't specify the virtual BDF. That's
> because it would make it easier to specify the configuration for the
> user. Typically the user doesn't care about the virtual BDF, only to
> expose a specific host device to the VM. There are exceptions of course
> and that's why I think we should also have a way for the user to
> request a specific virtual BDF. One of these exceptions are integrated
> GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if
> the device shows up at a different virtual BDF compared to the host.

If you let Xen allocate the vBDF, then wouldn't you need a way to tell 
the toolstack/Device Models which vBDF was allocated?

> 
> Thinking more about this, one way to simplify the problem would be if we
> always reuse the physical BDF as virtual BDF for passthrough devices. I
> think that would solve the problem and makes it much more unlikely to
> run into drivers bugs.

This works so long as you have only one physical segment (i.e. hostbridge). 
If you have multiple ones, then you either have to expose multiple 
hostbridges to the guest (which is not great) or need someone to allocate 
the vBDF.

> 
> And we allocate a "special" virtual BDF space for emulated devices, with
> the Root Complex still emulated in Xen. For instance, we could reserve
> ff:xx:xx.
Hmmm... Wouldn't this mean reserving ECAM space for 256 buses? 
Obviously, we could use 5 (just as a random number). Yet, it still 
requires reserving more memory than necessary.

 > and in case of clashes we could refuse to continue.

Urgh. And what would be the solution for users triggering this clash?

> Or we could
> allocate the first free virtual BDF, after all the pasthrough devices.

This only works if you don't want to support PCI hotplug. It may not 
be a thing for embedded, but it is used by cloud. So you need a 
mechanism that works with hotplug as well.

> 
> Example:
> - the user wants to assign physical 00:11.5 and b3:00.1 to the guest
> - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices
> - Xen allocates the next virtual BDF for emulated devices: b4:xx.x
> - If more virtual BDFs are needed for emulated devices, Xen allocates
>    b5:xx.x
> 
> I still think, no matter the BDF allocation scheme, that we should try
> to avoid as much as possible to have two different PCI Root Complex
> emulators. Ideally we would have only one PCI Root Complex emulated by
> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> tolerable but not ideal. The worst case I would like to avoid is to have
> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.

So while I agree that one emulated hostbridge is the best solution, I 
don't think your proposal would work. As I wrote above, you may have a 
system with multiple physical hostbridges. It would not be possible to 
assign two PCI devices with the same BDF but from different segments.

I agree it is unlikely, but if we can avoid it then it would be best. There 
is one scheme which fits that:
   1. If the vBDF is not specified, then pick a free one.
   2. Otherwise check if the specified vBDF is free. If not return an error.

This scheme should be used for both virtual and physical. This is pretty 
much the algorithm used by QEMU today. It works, so what would be the 
benefit of doing something different?
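
A rough standalone sketch of the two rules above (pick a free slot when none
was requested, otherwise honour the requested slot only if it is free), using
a 32-slot bitmap like the one the patch introduces for bus 0. This is a
simplified model for illustration, not the vPCI code itself.

#include <errno.h>
#include <stdint.h>

#define SLOTS_PER_BUS 32
#define SLOT_ANY      (-1)

/* One bit per slot on the single emulated bus 0 (simplified model). */
static uint32_t assigned_map;

/*
 * requested: SLOT_ANY to let the allocator pick, or 0..31 for a fixed slot.
 * Returns the allocated slot, or a negative errno value on failure.
 */
static int allocate_vslot(int requested)
{
    if ( requested != SLOT_ANY )
    {
        if ( requested < 0 || requested >= SLOTS_PER_BUS )
            return -EINVAL;
        if ( assigned_map & (1u << requested) )
            return -EBUSY;            /* rule 2: requested slot already taken */
        assigned_map |= 1u << requested;
        return requested;
    }

    for ( int slot = 0; slot < SLOTS_PER_BUS; slot++ )  /* rule 1: first free */
        if ( !(assigned_map & (1u << slot)) )
        {
            assigned_map |= 1u << slot;
            return slot;
        }

    return -ENOSPC;                   /* bus 0 is full */
}

The same check-or-pick logic applies unchanged whether the requested slot
comes from <bdf>@<vslot>, a device-tree property, or a domctl.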

Cheers,
Volodymyr Babchuk Nov. 17, 2023, 12:21 a.m. UTC | #4
Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Thu, 16 Nov 2023, Julien Grall wrote:
>> IIUC, this means that Xen will allocate the BDF. I think this will become a
>> problem quite quickly as some of the PCI may need to be assigned at a specific
>> vBDF (I have the intel graphic card in mind).
>> 
>> Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would not
>> work with this approach.
>> 
>> For dom0less passthrough, I feel the virtual BDF should always be specified in
>> device-tree. When a domain is created after boot, then I think you want to
>> support <bdf>@<vslot> where <vslot> is optional.
>
> Hi Julien,
>
> I also think there should be a way to specify the virtual BDF, but if
> possible (meaning: it is not super difficult to implement) I think it
> would be very convenient if we could let Xen pick whatever virtual BDF
> Xen wants when the user doesn't specify the virtual BDF. That's
> because it would make it easier to specify the configuration for the
> user. Typically the user doesn't care about the virtual BDF, only to
> expose a specific host device to the VM. There are exceptions of course
> and that's why I think we should also have a way for the user to
> request a specific virtual BDF. One of these exceptions are integrated
> GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if
> the device shows up at a different virtual BDF compared to the host.
>
> Thinking more about this, one way to simplify the problem would be if we
> always reuse the physical BDF as virtual BDF for passthrough devices. I
> think that would solve the problem and makes it much more unlikely to
> run into drivers bugs.

I'm not sure that this is possible. AFAIK, if we have a device with B>0,
we need to have a bridge device for it. So, if I want to pass through
device 08:00.0, I need to provide a virtual bridge with BDF 0:NN.0. This
unnecessarily complicates things.

Also, there can be a funny situation with conflicting BDF numbers exposed
by different domains. I know that this is not your typical setup, but
imagine that Domain A acts as a driver domain for PCI controller A and
Domain B acts as a driver domain for PCI controller B. They may expose
devices with the same BDFs but with different segments.

> And we allocate a "special" virtual BDF space for emulated devices, with
> the Root Complex still emulated in Xen. For instance, we could reserve
> ff:xx:xx and in case of clashes we could refuse to continue. Or we could
> allocate the first free virtual BDF, after all the pasthrough devices.

Again, I may be wrong here, but we need an emulated PCI bridge device if we
want to use bus numbers > 0.

>
> Example:
> - the user wants to assign physical 00:11.5 and b3:00.1 to the guest
> - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices
> - Xen allocates the next virtual BDF for emulated devices: b4:xx.x
> - If more virtual BDFs are needed for emulated devices, Xen allocates
>   b5:xx.x
>
> I still think, no matter the BDF allocation scheme, that we should try
> to avoid as much as possible to have two different PCI Root Complex
> emulators. Ideally we would have only one PCI Root Complex emulated by
> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> tolerable but not ideal.

But what is exactly wrong with this setup?

> The worst case I would like to avoid is to have
> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.

This is how our setup works right now.

I agree that we need some way to provide static vBDF numbers. But I am
wondering what is the best way to do this. We need some entity that
manages and assigns those vBDFs. It should reside in Xen, because there
is the Dom0less use case. Probably we need to extend
xen_domctl_assign_device so we can either request a free vBDF or a
specific vBDF. And in the first case, Xen should return the assigned vBDF so
the toolstack can give it to a backend, if the PCI device is purely virtual.
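
A purely hypothetical sketch of what that extension could look like; the
struct, field and flag names below are invented for illustration, and the
real struct xen_domctl_assign_device layout differs.

/* Hypothetical, for illustration only -- not the real domctl layout. */
#include <stdint.h>

#define ILLUSTRATIVE_ASSIGN_ANY_VSBDF  (1u << 0)   /* "Xen, pick a free vBDF" */

struct illustrative_assign_device {
    uint32_t machine_sbdf;  /* IN:  physical device to assign                 */
    uint32_t guest_sbdf;    /* IN:  requested vSBDF when the flag is not set;
                             * OUT: the vSBDF actually allocated by Xen        */
    uint32_t flags;         /* IN:  e.g. ILLUSTRATIVE_ASSIGN_ANY_VSBDF         */
};

The toolstack would then read guest_sbdf back and hand it to the backend
(e.g. via xenstore) when the device is purely virtual.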
Stefano Stabellini Nov. 17, 2023, 12:51 a.m. UTC | #5
On Fri, 17 Nov 2023, Julien Grall wrote:
> Hi Stefano,
> 
> On 16/11/2023 23:28, Stefano Stabellini wrote:
> > On Thu, 16 Nov 2023, Julien Grall wrote:
> > > IIUC, this means that Xen will allocate the BDF. I think this will become
> > > a
> > > problem quite quickly as some of the PCI may need to be assigned at a
> > > specific
> > > vBDF (I have the intel graphic card in mind).
> > > 
> > > Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would
> > > not
> > > work with this approach.
> > > 
> > > For dom0less passthrough, I feel the virtual BDF should always be
> > > specified in
> > > device-tree. When a domain is created after boot, then I think you want to
> > > support <bdf>@<vslot> where <vslot> is optional.
> > 
> > Hi Julien,
> > 
> > I also think there should be a way to specify the virtual BDF, but if
> > possible (meaning: it is not super difficult to implement) I think it
> > would be very convenient if we could let Xen pick whatever virtual BDF
> > Xen wants when the user doesn't specify the virtual BDF. That's
> > because it would make it easier to specify the configuration for the
> > user. Typically the user doesn't care about the virtual BDF, only to
> > expose a specific host device to the VM. There are exceptions of course
> > and that's why I think we should also have a way for the user to
> > request a specific virtual BDF. One of these exceptions are integrated
> > GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if
> > the device shows up at a different virtual BDF compared to the host.
> 
> If you let Xen allocating the vBDF, then wouldn't you need a way to tell the
> toolstack/Device Models which vBDF was allocated?
> 
> > 
> > Thinking more about this, one way to simplify the problem would be if we
> > always reuse the physical BDF as virtual BDF for passthrough devices. I
> > think that would solve the problem and makes it much more unlikely to
> > run into drivers bugs.
> 
> This works so long you have only one physical segment (i.e. hostbridge). If
> you have multiple one, then you either have to expose multiple hostbridge to
> the guest (which is not great) or need someone to allocate the vBDF.
> 
> > 
> > And we allocate a "special" virtual BDF space for emulated devices, with
> > the Root Complex still emulated in Xen. For instance, we could reserve
> > ff:xx:xx.
> Hmmm... Wouldn't this means reserving ECAM space for 256 buses? Obviously, we
> could use 5 (just as random number). Yet, it still requires to reserve more
> memory than necessary.
> 
> > and in case of clashes we could refuse to continue.
> 
> Urgh. And what would be the solution users triggering this clash?
> 
> > Or we could
> > allocate the first free virtual BDF, after all the pasthrough devices.
> 
> This is only works if you don't want to support PCI hotplug. It may not be a
> thing for embedded, but it is used by cloud. So you need a mechanism that
> works with hotplug as well.
> 
> > 
> > Example:
> > - the user wants to assign physical 00:11.5 and b3:00.1 to the guest
> > - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices
> > - Xen allocates the next virtual BDF for emulated devices: b4:xx.x
> > - If more virtual BDFs are needed for emulated devices, Xen allocates
> >    b5:xx.x
> > 
> > I still think, no matter the BDF allocation scheme, that we should try
> > to avoid as much as possible to have two different PCI Root Complex
> > emulators. Ideally we would have only one PCI Root Complex emulated by
> > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> > tolerable but not ideal. The worst case I would like to avoid is to have
> > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> 
> So while I agree that one emulated hostbridge is the best solution, I don't
> think your proposal would work. As I wrote above, you may have a system with
> multiple physical hostbridge. It would not be possible to assign two PCI
> devices with the same BDF but from different segment.
> 
> I agree unlikely, but if we can avoid it then it would be best. There are one
> scheme which fits that:
>   1. If the vBDF is not specified, then pick a free one.
>   2. Otherwise check if the specified vBDF is free. If not return an error.
> 
> This scheme should be used for both virtual and physical. This is pretty much
> the algorithm used by QEMU today. It works, so what's would be the benefits to
> do something different?

I am OK with that. I was trying to find a way that could work without
user intervention in almost 100% of the cases. I think both 1. and 2.
you proposed are fine.
Stefano Stabellini Nov. 17, 2023, 12:58 a.m. UTC | #6
On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > I still think, no matter the BDF allocation scheme, that we should try
> > to avoid as much as possible to have two different PCI Root Complex
> > emulators. Ideally we would have only one PCI Root Complex emulated by
> > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> > tolerable but not ideal.
> 
> But what is exactly wrong with this setup?

[...]

> > The worst case I would like to avoid is to have
> > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> 
> This is how our setup works right now.

If we have:
- a single PCI Root Complex emulated in Xen
- Xen is safety certified
- individual Virtio devices emulated by QEMU with grants for memory

We can go very far in terms of being able to use Virtio in safety
use-cases. We might even be able to use Virtio (frontends) in a SafeOS.

On the other hand if we put an additional Root Complex in QEMU:
- we pay a price in terms of complexity of the codebase
- we pay a price in terms of resource utilization
- we have one additional problem in terms of using this setup with a
  SafeOS (one more device emulated by a non-safe component)

Having 2 PCI Root Complexes both emulated in Xen is a middle ground
solution because:
- we still pay a price in terms of resource utilization
- the code complexity goes up a bit but hopefully not by much
- there is no impact on safety compared to the ideal scenario

This is why I wrote that it is tolerable.


> I agree that we need some way to provide static vBDF numbers. But I am
> wondering what is the best way to do this. We need some entity that
> manages and assigns those vBDFs. It should reside in Xen, because there
> is Dom0less use case. Probably we need to extend
> xen_domctl_assign_device so we can either request a free vBDF or a
> specific vBDF. And in the first case, Xen should return assigned vBDF so
> toolstack can give it to a backend, if PCI device is purely virtual.

I think that would be fine.
Volodymyr Babchuk Nov. 17, 2023, 2:09 p.m. UTC | #7
Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> > I still think, no matter the BDF allocation scheme, that we should try
>> > to avoid as much as possible to have two different PCI Root Complex
>> > emulators. Ideally we would have only one PCI Root Complex emulated by
>> > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>> > tolerable but not ideal.
>> 
>> But what is exactly wrong with this setup?
>
> [...]
>
>> > The worst case I would like to avoid is to have
>> > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>> 
>> This is how our setup works right now.
>
> If we have:
> - a single PCI Root Complex emulated in Xen
> - Xen is safety certified
> - individual Virtio devices emulated by QEMU with grants for memory
>
> We can go very far in terms of being able to use Virtio in safety
> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>
> On the other hand if we put an additional Root Complex in QEMU:
> - we pay a price in terms of complexity of the codebase
> - we pay a price in terms of resource utilization
> - we have one additional problem in terms of using this setup with a
>   SafeOS (one more device emulated by a non-safe component)
>
> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> solution because:
> - we still pay a price in terms of resource utilization
> - the code complexity goes up a bit but hopefully not by much
> - there is no impact on safety compared to the ideal scenario
>
> This is why I wrote that it is tolerable.

Ah, I see now. Yes, I agree with this. Also I want to add some more
points:

- There is ongoing work on implementing virtio backends as separate
  applications, written in Rust. Linaro is doing this part. Right now
  they are implementing only virtio-mmio, but if they want to provide
  virtio-pci as well, they will need a mechanism to plug in only
  virtio-pci, without a Root Complex. This is an argument for using a single
  Root Complex emulated in Xen.

- As far as I know (actually, Oleksandr told this to me), QEMU has no
  mechanism for exposing virtio-pci backends without exposing a PCI root
  complex as well. Architecturally, there should be a PCI bus to which
  virtio-pci devices are connected. Or we need to make some changes to
  QEMU internals to be able to create virtio-pci backends that are not
  connected to any bus. Also, an added benefit is that the PCI Root Complex
  emulator in QEMU handles legacy PCI interrupts for us. This is an
  argument for a separate Root Complex for QEMU.

As right now the only virtio-pci backends we have are provided by QEMU and
this setup is already working, I propose to stick to this solution,
especially taking into account that it does not require any
changes to the hypervisor code.
Julien Grall Nov. 17, 2023, 6:30 p.m. UTC | #8
Hi Volodymyr,

On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> 
> Hi Stefano,
> 
> Stefano Stabellini <sstabellini@kernel.org> writes:
> 
>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>>>> I still think, no matter the BDF allocation scheme, that we should try
>>>> to avoid as much as possible to have two different PCI Root Complex
>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>>>> tolerable but not ideal.
>>>
>>> But what is exactly wrong with this setup?
>>
>> [...]
>>
>>>> The worst case I would like to avoid is to have
>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>>>
>>> This is how our setup works right now.
>>
>> If we have:
>> - a single PCI Root Complex emulated in Xen
>> - Xen is safety certified
>> - individual Virtio devices emulated by QEMU with grants for memory
>>
>> We can go very far in terms of being able to use Virtio in safety
>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>>
>> On the other hand if we put an additional Root Complex in QEMU:
>> - we pay a price in terms of complexity of the codebase
>> - we pay a price in terms of resource utilization
>> - we have one additional problem in terms of using this setup with a
>>    SafeOS (one more device emulated by a non-safe component)
>>
>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
>> solution because:
>> - we still pay a price in terms of resource utilization
>> - the code complexity goes up a bit but hopefully not by much
>> - there is no impact on safety compared to the ideal scenario
>>
>> This is why I wrote that it is tolerable.
> 
> Ah, I see now. Yes, I am agree with this. Also I want to add some more
> points:
> 
> - There is ongoing work on implementing virtio backends as a separate
>    applications, written in Rust. Linaro are doing this part. Right now
>    they are implementing only virtio-mmio, but if they want to provide
>    virtio-pci as well, they will need a mechanism to plug only
>    virtio-pci, without Root Complex. This is argument for using single Root
>    Complex emulated in Xen.
> 
> - As far as I know (actually, Oleksandr told this to me), QEMU has no
>    mechanism for exposing virtio-pci backends without exposing PCI root
>    complex as well. Architecturally, there should be a PCI bus to which
>    virtio-pci devices are connected. Or we need to make some changes to
>    QEMU internals to be able to create virtio-pci backends that are not
>    connected to any bus. Also, added benefit that PCI Root Complex
>    emulator in QEMU handles legacy PCI interrupts for us. This is
>    argument for separate Root Complex for QEMU.
> 
> As right now we have only virtio-pci backends provided by QEMU and this
> setup is already working, I propose to stick to this
> solution. Especially, taking into account that it does not require any
> changes to hypervisor code.

I am not against two hostbridges as a temporary solution as long as this 
is not a one-way-door decision. I am not concerned about the hypervisor 
itself; I am more concerned about the interface exposed by the toolstack 
and QEMU.

To clarify, I don't particularly want to have to maintain the two 
hostbridges solution once we can use a single hostbridge. So we need to 
be able to get rid of it without impacting the interface too much.

Cheers,
Volodymyr Babchuk Nov. 17, 2023, 8:08 p.m. UTC | #9
Hi Julien,

Julien Grall <julien@xen.org> writes:

> Hi Volodymyr,
>
> On 17/11/2023 14:09, Volodymyr Babchuk wrote:
>> Hi Stefano,
>> Stefano Stabellini <sstabellini@kernel.org> writes:
>> 
>>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>>>>> I still think, no matter the BDF allocation scheme, that we should try
>>>>> to avoid as much as possible to have two different PCI Root Complex
>>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
>>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>>>>> tolerable but not ideal.
>>>>
>>>> But what is exactly wrong with this setup?
>>>
>>> [...]
>>>
>>>>> The worst case I would like to avoid is to have
>>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>>>>
>>>> This is how our setup works right now.
>>>
>>> If we have:
>>> - a single PCI Root Complex emulated in Xen
>>> - Xen is safety certified
>>> - individual Virtio devices emulated by QEMU with grants for memory
>>>
>>> We can go very far in terms of being able to use Virtio in safety
>>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>>>
>>> On the other hand if we put an additional Root Complex in QEMU:
>>> - we pay a price in terms of complexity of the codebase
>>> - we pay a price in terms of resource utilization
>>> - we have one additional problem in terms of using this setup with a
>>>    SafeOS (one more device emulated by a non-safe component)
>>>
>>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
>>> solution because:
>>> - we still pay a price in terms of resource utilization
>>> - the code complexity goes up a bit but hopefully not by much
>>> - there is no impact on safety compared to the ideal scenario
>>>
>>> This is why I wrote that it is tolerable.
>> Ah, I see now. Yes, I am agree with this. Also I want to add some
>> more
>> points:
>> - There is ongoing work on implementing virtio backends as a
>> separate
>>    applications, written in Rust. Linaro are doing this part. Right now
>>    they are implementing only virtio-mmio, but if they want to provide
>>    virtio-pci as well, they will need a mechanism to plug only
>>    virtio-pci, without Root Complex. This is argument for using single Root
>>    Complex emulated in Xen.
>> - As far as I know (actually, Oleksandr told this to me), QEMU has
>> no
>>    mechanism for exposing virtio-pci backends without exposing PCI root
>>    complex as well. Architecturally, there should be a PCI bus to which
>>    virtio-pci devices are connected. Or we need to make some changes to
>>    QEMU internals to be able to create virtio-pci backends that are not
>>    connected to any bus. Also, added benefit that PCI Root Complex
>>    emulator in QEMU handles legacy PCI interrupts for us. This is
>>    argument for separate Root Complex for QEMU.
>> As right now we have only virtio-pci backends provided by QEMU and
>> this
>> setup is already working, I propose to stick to this
>> solution. Especially, taking into account that it does not require any
>> changes to hypervisor code.
>
> I am not against two hostbridge as a temporary solution as long as
> this is not a one way door decision. I am not concerned about the
> hypervisor itself, I am more concerned about the interface exposed by
> the toolstack and QEMU.
>
> To clarify, I don't particular want to have to maintain the two
> hostbridges solution once we can use a single hostbridge. So we need
> to be able to get rid of it without impacting the interface too much.

This depends on virtio-pci backend availability. AFAIK, right now the only
option is to use QEMU, and QEMU provides its own host bridge. So if we want to
get rid of the second host bridge we need either another virtio-pci
backend or we need to alter the QEMU code so it can live without a host
bridge.

As for interfaces, it appears that the QEMU case does not require any changes
to the hypervisor itself; it just boils down to writing a couple of xenstore
entries and spawning QEMU with the correct command line arguments.

From the user perspective, all this is configured via an xl domain config entry like

virtio=[
'backend=DomD, type=virtio,device, transport=pci, bdf=0000:00:01.0, grant_usage=1, backend_type=qemu',
]

In the future we can add backend_type=standalone for non-QEMU-based
backends. If there are no QEMU-based backends, there will be no
second host bridge.
Stefano Stabellini Nov. 17, 2023, 9:43 p.m. UTC | #10
On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> Hi Julien,
> 
> Julien Grall <julien@xen.org> writes:
> 
> > Hi Volodymyr,
> >
> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> >> Hi Stefano,
> >> Stefano Stabellini <sstabellini@kernel.org> writes:
> >> 
> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >>>>> I still think, no matter the BDF allocation scheme, that we should try
> >>>>> to avoid as much as possible to have two different PCI Root Complex
> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> >>>>> tolerable but not ideal.
> >>>>
> >>>> But what is exactly wrong with this setup?
> >>>
> >>> [...]
> >>>
> >>>>> The worst case I would like to avoid is to have
> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> >>>>
> >>>> This is how our setup works right now.
> >>>
> >>> If we have:
> >>> - a single PCI Root Complex emulated in Xen
> >>> - Xen is safety certified
> >>> - individual Virtio devices emulated by QEMU with grants for memory
> >>>
> >>> We can go very far in terms of being able to use Virtio in safety
> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
> >>>
> >>> On the other hand if we put an additional Root Complex in QEMU:
> >>> - we pay a price in terms of complexity of the codebase
> >>> - we pay a price in terms of resource utilization
> >>> - we have one additional problem in terms of using this setup with a
> >>>    SafeOS (one more device emulated by a non-safe component)
> >>>
> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> >>> solution because:
> >>> - we still pay a price in terms of resource utilization
> >>> - the code complexity goes up a bit but hopefully not by much
> >>> - there is no impact on safety compared to the ideal scenario
> >>>
> >>> This is why I wrote that it is tolerable.
> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
> >> more
> >> points:
> >> - There is ongoing work on implementing virtio backends as a
> >> separate
> >>    applications, written in Rust. Linaro are doing this part. Right now
> >>    they are implementing only virtio-mmio, but if they want to provide
> >>    virtio-pci as well, they will need a mechanism to plug only
> >>    virtio-pci, without Root Complex. This is argument for using single Root
> >>    Complex emulated in Xen.
> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
> >> no
> >>    mechanism for exposing virtio-pci backends without exposing PCI root
> >>    complex as well. Architecturally, there should be a PCI bus to which
> >>    virtio-pci devices are connected. Or we need to make some changes to
> >>    QEMU internals to be able to create virtio-pci backends that are not
> >>    connected to any bus. Also, added benefit that PCI Root Complex
> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
> >>    argument for separate Root Complex for QEMU.
> >> As right now we have only virtio-pci backends provided by QEMU and
> >> this
> >> setup is already working, I propose to stick to this
> >> solution. Especially, taking into account that it does not require any
> >> changes to hypervisor code.
> >
> > I am not against two hostbridge as a temporary solution as long as
> > this is not a one way door decision. I am not concerned about the
> > hypervisor itself, I am more concerned about the interface exposed by
> > the toolstack and QEMU.

I agree with this...


> > To clarify, I don't particular want to have to maintain the two
> > hostbridges solution once we can use a single hostbridge. So we need
> > to be able to get rid of it without impacting the interface too much.

...and this


> This depends on virtio-pci backends availability. AFAIK, now only one
> option is to use QEMU and QEMU provides own host bridge. So if we want
> get rid of the second host bridge we need either another virtio-pci
> backend or we need to alter QEMU code so it can live without host
> bridge.
> 
> As for interfaces, it appears that QEMU case does not require any changes
> into hypervisor itself, it just boils down to writing couple of xenstore
> entries and spawning QEMU with correct command line arguments.

One thing that Stewart wrote in his reply that is important: it doesn't
matter if QEMU thinks it is emulating a PCI Root Complex because that's
required from QEMU's point of view to emulate an individual PCI device.

If we can arrange it so the QEMU PCI Root Complex is not registered
against Xen as part of the ioreq interface, then QEMU's emulated PCI
Root Complex is going to be left unused. I think that would be great
because we still have a clean QEMU-Xen-tools interface and the only
downside is some extra unused emulation in QEMU. It would be a
fantastic starting point.
Volodymyr Babchuk Nov. 17, 2023, 10:22 p.m. UTC | #11
Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> Hi Julien,
>> 
>> Julien Grall <julien@xen.org> writes:
>> 
>> > Hi Volodymyr,
>> >
>> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
>> >> Hi Stefano,
>> >> Stefano Stabellini <sstabellini@kernel.org> writes:
>> >> 
>> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> >>>>> I still think, no matter the BDF allocation scheme, that we should try
>> >>>>> to avoid as much as possible to have two different PCI Root Complex
>> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
>> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>> >>>>> tolerable but not ideal.
>> >>>>
>> >>>> But what is exactly wrong with this setup?
>> >>>
>> >>> [...]
>> >>>
>> >>>>> The worst case I would like to avoid is to have
>> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>> >>>>
>> >>>> This is how our setup works right now.
>> >>>
>> >>> If we have:
>> >>> - a single PCI Root Complex emulated in Xen
>> >>> - Xen is safety certified
>> >>> - individual Virtio devices emulated by QEMU with grants for memory
>> >>>
>> >>> We can go very far in terms of being able to use Virtio in safety
>> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>> >>>
>> >>> On the other hand if we put an additional Root Complex in QEMU:
>> >>> - we pay a price in terms of complexity of the codebase
>> >>> - we pay a price in terms of resource utilization
>> >>> - we have one additional problem in terms of using this setup with a
>> >>>    SafeOS (one more device emulated by a non-safe component)
>> >>>
>> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
>> >>> solution because:
>> >>> - we still pay a price in terms of resource utilization
>> >>> - the code complexity goes up a bit but hopefully not by much
>> >>> - there is no impact on safety compared to the ideal scenario
>> >>>
>> >>> This is why I wrote that it is tolerable.
>> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
>> >> more
>> >> points:
>> >> - There is ongoing work on implementing virtio backends as a
>> >> separate
>> >>    applications, written in Rust. Linaro are doing this part. Right now
>> >>    they are implementing only virtio-mmio, but if they want to provide
>> >>    virtio-pci as well, they will need a mechanism to plug only
>> >>    virtio-pci, without Root Complex. This is argument for using single Root
>> >>    Complex emulated in Xen.
>> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
>> >> no
>> >>    mechanism for exposing virtio-pci backends without exposing PCI root
>> >>    complex as well. Architecturally, there should be a PCI bus to which
>> >>    virtio-pci devices are connected. Or we need to make some changes to
>> >>    QEMU internals to be able to create virtio-pci backends that are not
>> >>    connected to any bus. Also, added benefit that PCI Root Complex
>> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
>> >>    argument for separate Root Complex for QEMU.
>> >> As right now we have only virtio-pci backends provided by QEMU and
>> >> this
>> >> setup is already working, I propose to stick to this
>> >> solution. Especially, taking into account that it does not require any
>> >> changes to hypervisor code.
>> >
>> > I am not against two hostbridge as a temporary solution as long as
>> > this is not a one way door decision. I am not concerned about the
>> > hypervisor itself, I am more concerned about the interface exposed by
>> > the toolstack and QEMU.
>
> I agree with this...
>
>
>> > To clarify, I don't particular want to have to maintain the two
>> > hostbridges solution once we can use a single hostbridge. So we need
>> > to be able to get rid of it without impacting the interface too much.
>
> ...and this
>
>
>> This depends on virtio-pci backends availability. AFAIK, now only one
>> option is to use QEMU and QEMU provides own host bridge. So if we want
>> get rid of the second host bridge we need either another virtio-pci
>> backend or we need to alter QEMU code so it can live without host
>> bridge.
>> 
>> As for interfaces, it appears that QEMU case does not require any changes
>> into hypervisor itself, it just boils down to writing couple of xenstore
>> entries and spawning QEMU with correct command line arguments.
>
> One thing that Stewart wrote in his reply that is important: it doesn't
> matter if QEMU thinks it is emulating a PCI Root Complex because that's
> required from QEMU's point of view to emulate an individual PCI device.
>
> If we can arrange it so the QEMU PCI Root Complex is not registered
> against Xen as part of the ioreq interface, then QEMU's emulated PCI
> Root Complex is going to be left unused. I think that would be great
> because we still have a clean QEMU-Xen-tools interface and the only
> downside is some extra unused emulation in QEMU. It would be a
> fantastic starting point.

I believe that in this case we need to set up manual ioreq handlers, like
what was done in the patch "xen/arm: Intercept vPCI config accesses and
forward them to emulator", because we need to route ECAM accesses either
to a virtio-pci backend or to a real PCI device. Also we need to tell
QEMU not to install its own ioreq handlers for the ECAM space.

Another point is PCI legacy interrupts, which should be emulated on the Xen
side. And unless I am missing something, we will need some new mechanism to
signal those interrupts from QEMU or another backend. I am not sure if we can
use the already existing IRQ signaling mechanism, because PCI interrupts are
ORed for all devices on a bridge and are level-sensitive.

Of course, we will need all of this anyway if we want to support
standalone virtio-pci backends, but for me it sounds more like a "finish
point" :)
Stefano Stabellini Nov. 18, 2023, 12:45 a.m. UTC | #12
On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> Hi Julien,
> >> 
> >> Julien Grall <julien@xen.org> writes:
> >> 
> >> > Hi Volodymyr,
> >> >
> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> >> >> Hi Stefano,
> >> >> Stefano Stabellini <sstabellini@kernel.org> writes:
> >> >> 
> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try
> >> >>>>> to avoid as much as possible to have two different PCI Root Complex
> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> >> >>>>> tolerable but not ideal.
> >> >>>>
> >> >>>> But what is exactly wrong with this setup?
> >> >>>
> >> >>> [...]
> >> >>>
> >> >>>>> The worst case I would like to avoid is to have
> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> >> >>>>
> >> >>>> This is how our setup works right now.
> >> >>>
> >> >>> If we have:
> >> >>> - a single PCI Root Complex emulated in Xen
> >> >>> - Xen is safety certified
> >> >>> - individual Virtio devices emulated by QEMU with grants for memory
> >> >>>
> >> >>> We can go very far in terms of being able to use Virtio in safety
> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
> >> >>>
> >> >>> On the other hand if we put an additional Root Complex in QEMU:
> >> >>> - we pay a price in terms of complexity of the codebase
> >> >>> - we pay a price in terms of resource utilization
> >> >>> - we have one additional problem in terms of using this setup with a
> >> >>>    SafeOS (one more device emulated by a non-safe component)
> >> >>>
> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> >> >>> solution because:
> >> >>> - we still pay a price in terms of resource utilization
> >> >>> - the code complexity goes up a bit but hopefully not by much
> >> >>> - there is no impact on safety compared to the ideal scenario
> >> >>>
> >> >>> This is why I wrote that it is tolerable.
> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
> >> >> more
> >> >> points:
> >> >> - There is ongoing work on implementing virtio backends as a
> >> >> separate
> >> >>    applications, written in Rust. Linaro are doing this part. Right now
> >> >>    they are implementing only virtio-mmio, but if they want to provide
> >> >>    virtio-pci as well, they will need a mechanism to plug only
> >> >>    virtio-pci, without Root Complex. This is argument for using single Root
> >> >>    Complex emulated in Xen.
> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
> >> >> no
> >> >>    mechanism for exposing virtio-pci backends without exposing PCI root
> >> >>    complex as well. Architecturally, there should be a PCI bus to which
> >> >>    virtio-pci devices are connected. Or we need to make some changes to
> >> >>    QEMU internals to be able to create virtio-pci backends that are not
> >> >>    connected to any bus. Also, added benefit that PCI Root Complex
> >> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
> >> >>    argument for separate Root Complex for QEMU.
> >> >> As right now we have only virtio-pci backends provided by QEMU and
> >> >> this
> >> >> setup is already working, I propose to stick to this
> >> >> solution. Especially, taking into account that it does not require any
> >> >> changes to hypervisor code.
> >> >
> >> > I am not against two hostbridge as a temporary solution as long as
> >> > this is not a one way door decision. I am not concerned about the
> >> > hypervisor itself, I am more concerned about the interface exposed by
> >> > the toolstack and QEMU.
> >
> > I agree with this...
> >
> >
> >> > To clarify, I don't particular want to have to maintain the two
> >> > hostbridges solution once we can use a single hostbridge. So we need
> >> > to be able to get rid of it without impacting the interface too much.
> >
> > ...and this
> >
> >
> >> This depends on virtio-pci backends availability. AFAIK, now only one
> >> option is to use QEMU and QEMU provides own host bridge. So if we want
> >> get rid of the second host bridge we need either another virtio-pci
> >> backend or we need to alter QEMU code so it can live without host
> >> bridge.
> >> 
> >> As for interfaces, it appears that QEMU case does not require any changes
> >> into hypervisor itself, it just boils down to writing couple of xenstore
> >> entries and spawning QEMU with correct command line arguments.
> >
> > One thing that Stewart wrote in his reply that is important: it doesn't
> > matter if QEMU thinks it is emulating a PCI Root Complex because that's
> > required from QEMU's point of view to emulate an individual PCI device.
> >
> > If we can arrange it so the QEMU PCI Root Complex is not registered
> > against Xen as part of the ioreq interface, then QEMU's emulated PCI
> > Root Complex is going to be left unused. I think that would be great
> > because we still have a clean QEMU-Xen-tools interface and the only
> > downside is some extra unused emulation in QEMU. It would be a
> > fantastic starting point.
> 
> I believe, that in this case we need to set manual ioreq handlers, like
> what was done in patch "xen/arm: Intercept vPCI config accesses and
> forward them to emulator", because we need to route ECAM accesses
> either to a virtio-pci backend or to a real PCI device. Also we need
> to tell QEMU to not install own ioreq handles for ECAM space.

I was imagining that the interface would look like this: QEMU registers
a PCI BDF and Xen automatically starts forwarding to QEMU the ECAM
read/write requests for the PCI config space of that BDF only. It
would not be the entire ECAM space but only the individual PCI config
reads/writes for that one BDF.


> Another point is PCI legacy interrupts, which should be emulated on Xen
> side. And unless I miss something, we will need some new mechanism to
> signal those interrupts from QEMU/other backend. I am not sure if we can
> use already existing IRQ signaling mechanism, because PCI interrupts are
> ORed for all devices on a bridge and are level-sensitive.

I hope we can reuse xc_hvm_set_pci_intx_level or another XEN_DMOP
hypercall
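
A hedged sketch of what reusing that DMOP could look like from a backend,
using the libxendevicemodel wrapper; the prototype below is quoted from memory
and should be checked against xendevicemodel.h, and the domid/vBDF values are
placeholders.

#include <stdio.h>
#include <xendevicemodel.h>

/* Assert (level = 1) or deassert (level = 0) a level-sensitive INTx line of
 * a virtual device.  Placeholder vBDF 0000:00:03.0, INTA.  Prototype assumed
 * from libxendevicemodel; double-check against xendevicemodel.h. */
static int set_intx(xendevicemodel_handle *dmod, domid_t domid, int level)
{
    return xendevicemodel_set_pci_intx_level(dmod, domid,
                                             0 /* segment */, 0 /* bus */,
                                             3 /* device */, 0 /* INTA */,
                                             level);
}

int main(void)
{
    xendevicemodel_handle *dmod = xendevicemodel_open(NULL, 0);
    domid_t domid = 1;                       /* placeholder guest domid */

    if ( !dmod )
        return 1;

    if ( set_intx(dmod, domid, 1) )          /* raise INTA# */
        perror("set_pci_intx_level");
    set_intx(dmod, domid, 0);                /* lower INTA# */

    xendevicemodel_close(dmod);
    return 0;
}

As noted above, with several devices sharing a line the callers would still
need to OR their individual levels before issuing the call.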


> Of course, we will need all of this anyways, if we want to support
> standalone virtio-pci backends, but for me it sounds more like "finish
> point" :)
Volodymyr Babchuk Nov. 21, 2023, 12:42 a.m. UTC | #13
Hi Stefano,

Stefano Stabellini <sstabellini@kernel.org> writes:

> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> >> Hi Julien,
>> >> 
>> >> Julien Grall <julien@xen.org> writes:
>> >> 
>> >> > Hi Volodymyr,
>> >> >
>> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
>> >> >> Hi Stefano,
>> >> >> Stefano Stabellini <sstabellini@kernel.org> writes:
>> >> >> 
>> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
>> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try
>> >> >>>>> to avoid as much as possible to have two different PCI Root Complex
>> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
>> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
>> >> >>>>> tolerable but not ideal.
>> >> >>>>
>> >> >>>> But what is exactly wrong with this setup?
>> >> >>>
>> >> >>> [...]
>> >> >>>
>> >> >>>>> The worst case I would like to avoid is to have
>> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
>> >> >>>>
>> >> >>>> This is how our setup works right now.
>> >> >>>
>> >> >>> If we have:
>> >> >>> - a single PCI Root Complex emulated in Xen
>> >> >>> - Xen is safety certified
>> >> >>> - individual Virtio devices emulated by QEMU with grants for memory
>> >> >>>
>> >> >>> We can go very far in terms of being able to use Virtio in safety
>> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
>> >> >>>
>> >> >>> On the other hand if we put an additional Root Complex in QEMU:
>> >> >>> - we pay a price in terms of complexity of the codebase
>> >> >>> - we pay a price in terms of resource utilization
>> >> >>> - we have one additional problem in terms of using this setup with a
>> >> >>>    SafeOS (one more device emulated by a non-safe component)
>> >> >>>
>> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
>> >> >>> solution because:
>> >> >>> - we still pay a price in terms of resource utilization
>> >> >>> - the code complexity goes up a bit but hopefully not by much
>> >> >>> - there is no impact on safety compared to the ideal scenario
>> >> >>>
>> >> >>> This is why I wrote that it is tolerable.
>> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
>> >> >> more
>> >> >> points:
>> >> >> - There is ongoing work on implementing virtio backends as a
>> >> >> separate
>> >> >>    applications, written in Rust. Linaro are doing this part. Right now
>> >> >>    they are implementing only virtio-mmio, but if they want to provide
>> >> >>    virtio-pci as well, they will need a mechanism to plug only
>> >> >>    virtio-pci, without Root Complex. This is argument for using single Root
>> >> >>    Complex emulated in Xen.
>> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
>> >> >> no
>> >> >>    mechanism for exposing virtio-pci backends without exposing PCI root
>> >> >>    complex as well. Architecturally, there should be a PCI bus to which
>> >> >>    virtio-pci devices are connected. Or we need to make some changes to
>> >> >>    QEMU internals to be able to create virtio-pci backends that are not
>> >> >>    connected to any bus. Also, added benefit that PCI Root Complex
>> >> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
>> >> >>    argument for separate Root Complex for QEMU.
>> >> >> As right now we have only virtio-pci backends provided by QEMU and
>> >> >> this
>> >> >> setup is already working, I propose to stick to this
>> >> >> solution. Especially, taking into account that it does not require any
>> >> >> changes to hypervisor code.
>> >> >
>> >> > I am not against two hostbridge as a temporary solution as long as
>> >> > this is not a one way door decision. I am not concerned about the
>> >> > hypervisor itself, I am more concerned about the interface exposed by
>> >> > the toolstack and QEMU.
>> >
>> > I agree with this...
>> >
>> >
>> >> > To clarify, I don't particular want to have to maintain the two
>> >> > hostbridges solution once we can use a single hostbridge. So we need
>> >> > to be able to get rid of it without impacting the interface too much.
>> >
>> > ...and this
>> >
>> >
>> >> This depends on virtio-pci backends availability. AFAIK, now only one
>> >> option is to use QEMU and QEMU provides own host bridge. So if we want
>> >> get rid of the second host bridge we need either another virtio-pci
>> >> backend or we need to alter QEMU code so it can live without host
>> >> bridge.
>> >> 
>> >> As for interfaces, it appears that QEMU case does not require any changes
>> >> into hypervisor itself, it just boils down to writing couple of xenstore
>> >> entries and spawning QEMU with correct command line arguments.
>> >
>> > One thing that Stewart wrote in his reply that is important: it doesn't
>> > matter if QEMU thinks it is emulating a PCI Root Complex because that's
>> > required from QEMU's point of view to emulate an individual PCI device.
>> >
>> > If we can arrange it so the QEMU PCI Root Complex is not registered
>> > against Xen as part of the ioreq interface, then QEMU's emulated PCI
>> > Root Complex is going to be left unused. I think that would be great
>> > because we still have a clean QEMU-Xen-tools interface and the only
>> > downside is some extra unused emulation in QEMU. It would be a
>> > fantastic starting point.
>> 
>> I believe, that in this case we need to set manual ioreq handlers, like
>> what was done in patch "xen/arm: Intercept vPCI config accesses and
>> forward them to emulator", because we need to route ECAM accesses
>> either to a virtio-pci backend or to a real PCI device. Also we need
>> to tell QEMU to not install own ioreq handles for ECAM space.
>
> I was imagining that the interface would look like this: QEMU registers
> a PCI BDF and Xen automatically starts forwarding to QEMU ECAM
> reads/writes requests for the PCI config space of that BDF only. It
> would not be the entire ECAM space but only individual PCI conf
> reads/writes that the BDF only.
>

Okay, I see that there is the
xendevicemodel_map_pcidev_to_ioreq_server() function and the
corresponding IOREQ_TYPE_PCI_CONFIG ioreq type. Is this what you
propose to use to register a PCI BDF?

I see that xen-hvm-common.c in QEMU is able to handle only the standard
256-byte configuration space, but I hope that will be an easy fix.
Roger Pau Monné Nov. 21, 2023, 2:40 p.m. UTC | #14
On Thu, Oct 12, 2023 at 10:09:18PM +0000, Volodymyr Babchuk wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> In v10:
> - Removed ASSERT(pcidevs_locked())
> - Removed redundant code (local sbdf variable, clearing sbdf during
> device removal, etc)
> - Added __maybe_unused attribute to "out:" label
> - Introduced HAS_VPCI_GUEST_SUPPORT Kconfig option, as this is the
>   first patch where it is used (previously was in "vpci: add hooks for
>   PCI device assign/de-assign")
> In v9:
> - Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart)
> In v8:
> - Added write lock in add_virtual_device
> Since v6:
> - re-work wrt new locking scheme
> - OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
> Since v5:
> - s/vpci_add_virtual_device/add_virtual_device and make it static
> - call add_virtual_device from vpci_assign_device and do not use
>   REGISTER_VPCI_INIT machinery
> - add pcidevs_locked ASSERT
> - use DECLARE_BITMAP for vpci_dev_assigned_map
> Since v4:
> - moved and re-worked guest sbdf initializers
> - s/set_bit/__set_bit
> - s/clear_bit/__clear_bit
> - minor comment fix s/Virtual/Guest/
> - added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
>   later for counting the number of MMIO handlers required for a guest
>   (Julien)
> Since v3:
>  - make use of VPCI_INIT
>  - moved all new code to vpci.c which belongs to it
>  - changed open-coded 31 to PCI_SLOT(~0)
>  - added comments and code to reject multifunction devices with
>    functions other than 0
>  - updated comment about vpci_dev_next and made it unsigned int
>  - implement roll back in case of error while assigning/deassigning devices
>  - s/dom%pd/%pd
> Since v2:
>  - remove casts that are (a) malformed and (b) unnecessary
>  - add new line for better readability
>  - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
>     functions are now completely gated with this config
>  - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
> New in v2
> ---
>  xen/drivers/Kconfig     |  4 +++
>  xen/drivers/vpci/vpci.c | 63 +++++++++++++++++++++++++++++++++++++++++
>  xen/include/xen/sched.h |  8 ++++++
>  xen/include/xen/vpci.h  | 11 +++++++
>  4 files changed, 86 insertions(+)
> 
> diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
> index db94393f47..780490cf8e 100644
> --- a/xen/drivers/Kconfig
> +++ b/xen/drivers/Kconfig
> @@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
>  config HAS_VPCI
>  	bool
>  
> +config HAS_VPCI_GUEST_SUPPORT
> +	bool
> +	depends on HAS_VPCI
> +
>  endmenu
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index 5e34d0092a..7c46a2d3f4 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -36,6 +36,52 @@ extern vpci_register_init_t *const __start_vpci_array[];
>  extern vpci_register_init_t *const __end_vpci_array[];
>  #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    unsigned long new_dev_number;

Why unsigned long?  unsigned int seems more than enough to account for
all possible dev numbers [0, 31].
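
For illustration, the declaration being suggested here would just be:

    unsigned int new_dev_number;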

> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);
> +    if ( new_dev_number == VPCI_MAX_VIRT_DEV )
> +    {
> +        write_unlock(&pdev->domain->pci_lock);

This write_unlock() looks bogus, as the lock is not taken by this
function.  Won't this create an unlock imbalance when the caller of
vpci_assign_device() also attempts to write-unlock d->pci_lock?
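
An untested sketch of what the error path would then look like, leaving
the unlock entirely to whichever caller actually took the lock:

    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
                                         VPCI_MAX_VIRT_DEV);
    if ( new_dev_number == VPCI_MAX_VIRT_DEV )
        return -ENOSPC;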

> +        return -ENOSPC;
> +    }
> +
> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
> +
> +    /*
> +     * Both segment and bus number are 0:
> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
> +     *  - with bus 0 the virtual devices are seen as embedded
> +     *    endpoints behind the root complex
> +     *
> +     * TODO: add support for multi-function devices.
> +     */
> +    pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0);
> +
> +    return 0;
> +}
> +
> +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
> +
>  void vpci_deassign_device(struct pci_dev *pdev)
>  {
>      unsigned int i;
> @@ -46,6 +92,13 @@ void vpci_deassign_device(struct pci_dev *pdev)
>          return;
>  
>      spin_lock(&pdev->vpci->lock);
> +
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
> +                    &pdev->domain->vpci_dev_assigned_map);
> +#endif

This chunk could in principle be outside of the vpci->lock region
AFAICT.
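
An untested illustration of the suggested movement, simply doing the
bitmap update before taking pdev->vpci->lock:

    #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
        if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
            __clear_bit(pdev->vpci->guest_sbdf.dev,
                        &pdev->domain->vpci_dev_assigned_map);
    #endif

        spin_lock(&pdev->vpci->lock);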

> +
>      while ( !list_empty(&pdev->vpci->handlers) )
>      {
>          struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
> @@ -101,6 +154,13 @@ int vpci_assign_device(struct pci_dev *pdev)
>      INIT_LIST_HEAD(&pdev->vpci->handlers);
>      spin_lock_init(&pdev->vpci->lock);
>  
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    pdev->vpci->guest_sbdf.sbdf = ~0;
> +    rc = add_virtual_device(pdev);
> +    if ( rc )
> +        goto out;
> +#endif
> +
>      for ( i = 0; i < NUM_VPCI_INIT; i++ )
>      {
>          rc = __start_vpci_array[i](pdev);
> @@ -108,11 +168,14 @@ int vpci_assign_device(struct pci_dev *pdev)
>              break;
>      }
>  
> + out:
> +    __maybe_unused;

Can you place it on the same line as the out: label, please?
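
I.e.:

     out: __maybe_unused;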

>      if ( rc )
>          vpci_deassign_device(pdev);
>  
>      return rc;
>  }
> +

Stray newline?

Thanks, Roger.
Stefano Stabellini Nov. 22, 2023, 1:12 a.m. UTC | #15
On Tue, 20 Nov 2023, Volodymyr Babchuk wrote:
> Stefano Stabellini <sstabellini@kernel.org> writes:
> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> >> Hi Julien,
> >> >> 
> >> >> Julien Grall <julien@xen.org> writes:
> >> >> 
> >> >> > Hi Volodymyr,
> >> >> >
> >> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote:
> >> >> >> Hi Stefano,
> >> >> >> Stefano Stabellini <sstabellini@kernel.org> writes:
> >> >> >> 
> >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote:
> >> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try
> >> >> >>>>> to avoid as much as possible to have two different PCI Root Complex
> >> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by
> >> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be
> >> >> >>>>> tolerable but not ideal.
> >> >> >>>>
> >> >> >>>> But what is exactly wrong with this setup?
> >> >> >>>
> >> >> >>> [...]
> >> >> >>>
> >> >> >>>>> The worst case I would like to avoid is to have
> >> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
> >> >> >>>>
> >> >> >>>> This is how our setup works right now.
> >> >> >>>
> >> >> >>> If we have:
> >> >> >>> - a single PCI Root Complex emulated in Xen
> >> >> >>> - Xen is safety certified
> >> >> >>> - individual Virtio devices emulated by QEMU with grants for memory
> >> >> >>>
> >> >> >>> We can go very far in terms of being able to use Virtio in safety
> >> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS.
> >> >> >>>
> >> >> >>> On the other hand if we put an additional Root Complex in QEMU:
> >> >> >>> - we pay a price in terms of complexity of the codebase
> >> >> >>> - we pay a price in terms of resource utilization
> >> >> >>> - we have one additional problem in terms of using this setup with a
> >> >> >>>    SafeOS (one more device emulated by a non-safe component)
> >> >> >>>
> >> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground
> >> >> >>> solution because:
> >> >> >>> - we still pay a price in terms of resource utilization
> >> >> >>> - the code complexity goes up a bit but hopefully not by much
> >> >> >>> - there is no impact on safety compared to the ideal scenario
> >> >> >>>
> >> >> >>> This is why I wrote that it is tolerable.
> >> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some
> >> >> >> more
> >> >> >> points:
> >> >> >> - There is ongoing work on implementing virtio backends as a
> >> >> >> separate
> >> >> >>    applications, written in Rust. Linaro are doing this part. Right now
> >> >> >>    they are implementing only virtio-mmio, but if they want to provide
> >> >> >>    virtio-pci as well, they will need a mechanism to plug only
> >> >> >>    virtio-pci, without Root Complex. This is argument for using single Root
> >> >> >>    Complex emulated in Xen.
> >> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has
> >> >> >> no
> >> >> >>    mechanism for exposing virtio-pci backends without exposing PCI root
> >> >> >>    complex as well. Architecturally, there should be a PCI bus to which
> >> >> >>    virtio-pci devices are connected. Or we need to make some changes to
> >> >> >>    QEMU internals to be able to create virtio-pci backends that are not
> >> >> >>    connected to any bus. Also, added benefit that PCI Root Complex
> >> >> >>    emulator in QEMU handles legacy PCI interrupts for us. This is
> >> >> >>    argument for separate Root Complex for QEMU.
> >> >> >> As right now we have only virtio-pci backends provided by QEMU and
> >> >> >> this
> >> >> >> setup is already working, I propose to stick to this
> >> >> >> solution. Especially, taking into account that it does not require any
> >> >> >> changes to hypervisor code.
> >> >> >
> >> >> > I am not against two hostbridge as a temporary solution as long as
> >> >> > this is not a one way door decision. I am not concerned about the
> >> >> > hypervisor itself, I am more concerned about the interface exposed by
> >> >> > the toolstack and QEMU.
> >> >
> >> > I agree with this...
> >> >
> >> >
> >> >> > To clarify, I don't particular want to have to maintain the two
> >> >> > hostbridges solution once we can use a single hostbridge. So we need
> >> >> > to be able to get rid of it without impacting the interface too much.
> >> >
> >> > ...and this
> >> >
> >> >
> >> >> This depends on virtio-pci backends availability. AFAIK, now only one
> >> >> option is to use QEMU and QEMU provides own host bridge. So if we want
> >> >> get rid of the second host bridge we need either another virtio-pci
> >> >> backend or we need to alter QEMU code so it can live without host
> >> >> bridge.
> >> >> 
> >> >> As for interfaces, it appears that QEMU case does not require any changes
> >> >> into hypervisor itself, it just boils down to writing couple of xenstore
> >> >> entries and spawning QEMU with correct command line arguments.
> >> >
> >> > One thing that Stewart wrote in his reply that is important: it doesn't
> >> > matter if QEMU thinks it is emulating a PCI Root Complex because that's
> >> > required from QEMU's point of view to emulate an individual PCI device.
> >> >
> >> > If we can arrange it so the QEMU PCI Root Complex is not registered
> >> > against Xen as part of the ioreq interface, then QEMU's emulated PCI
> >> > Root Complex is going to be left unused. I think that would be great
> >> > because we still have a clean QEMU-Xen-tools interface and the only
> >> > downside is some extra unused emulation in QEMU. It would be a
> >> > fantastic starting point.
> >> 
> >> I believe, that in this case we need to set manual ioreq handlers, like
> >> what was done in patch "xen/arm: Intercept vPCI config accesses and
> >> forward them to emulator", because we need to route ECAM accesses
> >> either to a virtio-pci backend or to a real PCI device. Also we need
> >> to tell QEMU to not install own ioreq handles for ECAM space.
> >
> > I was imagining that the interface would look like this: QEMU registers
> > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM
> > reads/writes requests for the PCI config space of that BDF only. It
> > would not be the entire ECAM space but only individual PCI conf
> > reads/writes that the BDF only.
> >
> 
> Okay, I see that there is the
> xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding
> IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register
> PCI BDF?

Yes, I think that's best.

Let me expand on this. Like I wrote above, I think it is important that
Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
QEMU implementation easier, it is OK if QEMU emulates an unneeded and
unused PCI Root Complex. From Xen's point of view, it doesn't exist.

In terms of ioreq registration, QEMU calls
xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
emulate. That way, Xen vPCI knows exactly which PCI config space
reads/writes to forward to QEMU.

Let's say that:
- 00:02.0 is PCI passthrough device
- 00:03.0 is a PCI emulated device

QEMU would register 00:03.0 and vPCI would know to forward anything
related to 00:03.0 to QEMU, but not 00:02.0.
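
As a purely illustrative sketch (error handling omitted; the
xendevicemodel handle, the domid and the ioreq server id are assumed to
have been set up by the device model already), claiming 00:03.0 would
boil down to something like:

#include <xendevicemodel.h>

static int claim_emulated_bdf(xendevicemodel_handle *dmod, domid_t domid,
                              ioservid_t id)
{
    /*
     * Ask Xen to forward PCI config space accesses for 0000:00:03.0 to
     * this ioreq server; 00:02.0 is not registered and stays with vPCI.
     */
    return xendevicemodel_map_pcidev_to_ioreq_server(dmod, domid, id,
                                                     0 /* segment */,
                                                     0 /* bus */,
                                                     3 /* device */,
                                                     0 /* function */);
}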



> I see that xen-hvm-common.c in QEMU is able to handle only standard 256
> bytes configuration space, but I hope that it will be easy fix.
Roger Pau Monné Nov. 22, 2023, 11:53 a.m. UTC | #16
On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
> On Tue, 20 Nov 2023, Volodymyr Babchuk wrote:
> > Stefano Stabellini <sstabellini@kernel.org> writes:
> > > [...]
> > >
> > > I was imagining that the interface would look like this: QEMU registers
> > > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM
> > > reads/writes requests for the PCI config space of that BDF only. It
> > > would not be the entire ECAM space but only individual PCI conf
> > > reads/writes that the BDF only.
> > >
> > 
> > Okay, I see that there is the
> > xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding
> > IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register
> > PCI BDF?
> 
> Yes, I think that's best.
> 
> Let me expand on this. Like I wrote above, I think it is important that
> Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
> QEMU implementation easier, it is OK if QEMU emulates an unneeded and
> unused PCI Root Complex. From Xen point of view, it doesn't exist.
> 
> In terms if ioreq registration, QEMU calls
> xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
> emulate. That way, Xen vPCI knows exactly what PCI config space
> reads/writes to forward to QEMU.
> 
> Let's say that:
> - 00:02.0 is PCI passthrough device
> - 00:03.0 is a PCI emulated device
> 
> QEMU would register 00:03.0 and vPCI would know to forward anything
> related to 00:03.0 to QEMU, but not 00:02.0.

I think there's some work here so that we have a proper hierarchy
inside of Xen.  Right now both ioreq and vpci expect to decode the
accesses to the PCI config space, and setup (MM)IO handlers to trap
ECAM, see vpci_ecam_{read,write}().

I think we want to move to a model where vPCI doesn't setup MMIO traps
itself, and instead relies on ioreq to do the decoding and forwarding
of accesses.  We need some work in order to represent an internal
ioreq handler, but that shouldn't be too complicated.  IOW: vpci
should register devices it's handling with ioreq, much like QEMU does.

Thanks, Roger.
Stefano Stabellini Nov. 22, 2023, 9:18 p.m. UTC | #17
On Wed, 22 Nov 2023, Roger Pau Monné wrote:
> On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
> > On Tue, 20 Nov 2023, Volodymyr Babchuk wrote:
> > > Stefano Stabellini <sstabellini@kernel.org> writes:
> > > > [...]
> > > >
> > > > I was imagining that the interface would look like this: QEMU registers
> > > > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM
> > > > reads/writes requests for the PCI config space of that BDF only. It
> > > > would not be the entire ECAM space but only individual PCI conf
> > > > reads/writes that the BDF only.
> > > >
> > > 
> > > Okay, I see that there is the
> > > xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding
> > > IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register
> > > PCI BDF?
> > 
> > Yes, I think that's best.
> > 
> > Let me expand on this. Like I wrote above, I think it is important that
> > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
> > QEMU implementation easier, it is OK if QEMU emulates an unneeded and
> > unused PCI Root Complex. From Xen point of view, it doesn't exist.
> > 
> > In terms if ioreq registration, QEMU calls
> > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
> > emulate. That way, Xen vPCI knows exactly what PCI config space
> > reads/writes to forward to QEMU.
> > 
> > Let's say that:
> > - 00:02.0 is PCI passthrough device
> > - 00:03.0 is a PCI emulated device
> > 
> > QEMU would register 00:03.0 and vPCI would know to forward anything
> > related to 00:03.0 to QEMU, but not 00:02.0.
> 
> I think there's some work here so that we have a proper hierarchy
> inside of Xen.  Right now both ioreq and vpci expect to decode the
> accesses to the PCI config space, and setup (MM)IO handlers to trap
> ECAM, see vpci_ecam_{read,write}().
> 
> I think we want to move to a model where vPCI doesn't setup MMIO traps
> itself, and instead relies on ioreq to do the decoding and forwarding
> of accesses.  We need some work in order to represent an internal
> ioreq handler, but that shouldn't be too complicated.  IOW: vpci
> should register devices it's handling with ioreq, much like QEMU does.

I think this could be a good idea.

This would be the very first IOREQ handler implemented in Xen itself,
rather than outside of Xen. Some code refactoring might be required,
which worries me given that vPCI is at v10 and has been pending for
years. I think it could make sense as a follow-up series, not v11.

I think this idea would be beneficial if, in the example above, vPCI
doesn't really need to know about device 00:03.0. vPCI registers via
IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers
00:03.0, and everything works. vPCI is not involved at all in PCI config
space reads and writes for 00:03.0. If this is the case, then moving
vPCI to IOREQ could be good.

On the other hand if vPCI actually needs to know that 00:03.0 exists,
perhaps because it changes something in the PCI Root Complex emulation
or vPCI needs to take some action when PCI config space registers of
00:03.0 are written to, then I think this model doesn't work well. If
this is the case, then I think it would be best to keep vPCI as MMIO
handler and let vPCI forward to IOREQ when appropriate.

I haven't run any experiments, but my gut feeling tells me that we'll
have to follow the second approach because the first is too limiting.
Roger Pau Monné Nov. 23, 2023, 8:29 a.m. UTC | #18
On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote:
> On Wed, 22 Nov 2023, Roger Pau Monné wrote:
> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
> > > Let me expand on this. Like I wrote above, I think it is important that
> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and
> > > unused PCI Root Complex. From Xen point of view, it doesn't exist.
> > > 
> > > In terms if ioreq registration, QEMU calls
> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
> > > emulate. That way, Xen vPCI knows exactly what PCI config space
> > > reads/writes to forward to QEMU.
> > > 
> > > Let's say that:
> > > - 00:02.0 is PCI passthrough device
> > > - 00:03.0 is a PCI emulated device
> > > 
> > > QEMU would register 00:03.0 and vPCI would know to forward anything
> > > related to 00:03.0 to QEMU, but not 00:02.0.
> > 
> > I think there's some work here so that we have a proper hierarchy
> > inside of Xen.  Right now both ioreq and vpci expect to decode the
> > accesses to the PCI config space, and setup (MM)IO handlers to trap
> > ECAM, see vpci_ecam_{read,write}().
> > 
> > I think we want to move to a model where vPCI doesn't setup MMIO traps
> > itself, and instead relies on ioreq to do the decoding and forwarding
> > of accesses.  We need some work in order to represent an internal
> > ioreq handler, but that shouldn't be too complicated.  IOW: vpci
> > should register devices it's handling with ioreq, much like QEMU does.
> 
> I think this could be a good idea.
> 
> This would be the very first IOREQ handler implemented in Xen itself,
> rather than outside of Xen. Some code refactoring might be required,
> which worries me given that vPCI is at v10 and has been pending for
> years. I think it could make sense as a follow-up series, not v11.

That's perfectly fine for me; most of the series here just deals with
the logic to intercept guest accesses to the config space and is
completely agnostic as to how the accesses are intercepted.

> I think this idea would be beneficial if, in the example above, vPCI
> doesn't really need to know about device 00:03.0. vPCI registers via
> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers
> 00:03.0, and everything works. vPCI is not involved at all in PCI config
> space reads and writes for 00:03.0. If this is the case, then moving
> vPCI to IOREQ could be good.

Given your description above, with the root complex implemented in
vPCI, we would need to mandate vPCI together with ioreqs even if no
passthrough devices are using vPCI itself (just for the emulation of
the root complex).  Which is fine, just wanted to mention the
dependency.

> On the other hand if vPCI actually needs to know that 00:03.0 exists,
> perhaps because it changes something in the PCI Root Complex emulation
> or vPCI needs to take some action when PCI config space registers of
> 00:03.0 are written to, then I think this model doesn't work well. If
> this is the case, then I think it would be best to keep vPCI as MMIO
> handler and let vPCI forward to IOREQ when appropriate.

At first approximation I don't think we would have such interactions,
otherwise the whole premise of ioreq being able to register individual
PCI devices would be broken.

XenServer already has scenarios with two different user-space emulators
(ie: two different ioreq servers) handling accesses to different
devices on the same PCI bus, and there's no interaction with the root
complex required.

> I haven't run any experiements, but my gut feeling tells me that we'll
> have to follow the second approach because the first is too limiting.

Iff there's some case where a change to a downstream PCI device needs
to be propagated to the root complex, then such a mechanism should be
implemented as part of the ioreq interface; otherwise the whole model
is broken.

Thanks, Roger.
Volodymyr Babchuk Nov. 28, 2023, 11:45 p.m. UTC | #19
Hi Roger,

Roger Pau Monné <roger.pau@citrix.com> writes:

> On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote:
>> On Wed, 22 Nov 2023, Roger Pau Monné wrote:
>> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote:
>> > > Let me expand on this. Like I wrote above, I think it is important that
>> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the
>> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and
>> > > unused PCI Root Complex. From Xen point of view, it doesn't exist.
>> > > 
>> > > In terms if ioreq registration, QEMU calls
>> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to
>> > > emulate. That way, Xen vPCI knows exactly what PCI config space
>> > > reads/writes to forward to QEMU.
>> > > 
>> > > Let's say that:
>> > > - 00:02.0 is PCI passthrough device
>> > > - 00:03.0 is a PCI emulated device
>> > > 
>> > > QEMU would register 00:03.0 and vPCI would know to forward anything
>> > > related to 00:03.0 to QEMU, but not 00:02.0.
>> > 
>> > I think there's some work here so that we have a proper hierarchy
>> > inside of Xen.  Right now both ioreq and vpci expect to decode the
>> > accesses to the PCI config space, and setup (MM)IO handlers to trap
>> > ECAM, see vpci_ecam_{read,write}().
>> > 
>> > I think we want to move to a model where vPCI doesn't setup MMIO traps
>> > itself, and instead relies on ioreq to do the decoding and forwarding
>> > of accesses.  We need some work in order to represent an internal
>> > ioreq handler, but that shouldn't be too complicated.  IOW: vpci
>> > should register devices it's handling with ioreq, much like QEMU does.
>> 
>> I think this could be a good idea.
>> 
>> This would be the very first IOREQ handler implemented in Xen itself,
>> rather than outside of Xen. Some code refactoring might be required,
>> which worries me given that vPCI is at v10 and has been pending for
>> years. I think it could make sense as a follow-up series, not v11.
>
> That's perfectly fine for me, most of the series here just deal with
> the logic to intercept guest access to the config space and is
> completely agnostic as to how the accesses are intercepted.
>
>> I think this idea would be beneficial if, in the example above, vPCI
>> doesn't really need to know about device 00:03.0. vPCI registers via
>> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers
>> 00:03.0, and everything works. vPCI is not involved at all in PCI config
>> space reads and writes for 00:03.0. If this is the case, then moving
>> vPCI to IOREQ could be good.
>
> Given your description above, with the root complex implemented in
> vPCI, we would need to mandate vPCI together with ioreqs even if no
> passthrough devices are using vPCI itself (just for the emulation of
> the root complex).  Which is fine, just wanted to mention the
> dependency.
>
>> On the other hand if vPCI actually needs to know that 00:03.0 exists,
>> perhaps because it changes something in the PCI Root Complex emulation
>> or vPCI needs to take some action when PCI config space registers of
>> 00:03.0 are written to, then I think this model doesn't work well. If
>> this is the case, then I think it would be best to keep vPCI as MMIO
>> handler and let vPCI forward to IOREQ when appropriate.
>
> At first approximation I don't think we would have such interactions,
> otherwise the whole premise of ioreq being able to register individual
> PCI devices would be broken.
>
> XenSever already has scenarios with two different user-space emulators
> (ie: two different ioreq servers) handling accesses to different
> devices in the same PCI bus, and there's no interaction with the root
> complex required.
>

Out of curiosity: how are legacy PCI interrupts handled in this case? In
my understanding, it is the Root Complex's responsibility to propagate
the correct IRQ levels to the interrupt controller?

[...]
Roger Pau Monné Nov. 29, 2023, 8:33 a.m. UTC | #20
On Tue, Nov 28, 2023 at 11:45:34PM +0000, Volodymyr Babchuk wrote:
> Hi Roger,
> 
> Roger Pau Monné <roger.pau@citrix.com> writes:
> 
> >> [...]
> >
> >> On the other hand if vPCI actually needs to know that 00:03.0 exists,
> >> perhaps because it changes something in the PCI Root Complex emulation
> >> or vPCI needs to take some action when PCI config space registers of
> >> 00:03.0 are written to, then I think this model doesn't work well. If
> >> this is the case, then I think it would be best to keep vPCI as MMIO
> >> handler and let vPCI forward to IOREQ when appropriate.
> >
> > At first approximation I don't think we would have such interactions,
> > otherwise the whole premise of ioreq being able to register individual
> > PCI devices would be broken.
> >
> > XenSever already has scenarios with two different user-space emulators
> > (ie: two different ioreq servers) handling accesses to different
> > devices in the same PCI bus, and there's no interaction with the root
> > complex required.
> >
> 
> Out of curiosity: how legacy PCI interrupts are handled in this case? In
> my understanding, it is Root Complex's responsibility to propagate
> correct IRQ levels to an interrupt controller?

I'm unsure whether my understanding of the question is correct, so my
reply might not be what you are asking for, sorry.

Legacy IRQs (GSIs on x86) are set up directly by the toolstack when the
device is assigned to the guest, using PHYSDEVOP_map_pirq +
XEN_DOMCTL_bind_pt_irq.  Those hypercalls bind a host IO-APIC pin to a
guest IO-APIC pin, so that interrupts originating from that host
IO-APIC pin are always forwarded to the guest and injected as
originating from the guest IO-APIC pin.

Note that the device will always use the same IO-APIC pin; this is not
configured by the OS.
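
For reference, a rough and untested sketch of the toolstack side using
libxenctrl, with the GSI and guest BDF values made up purely for
illustration:

#include <xenctrl.h>

static int bind_legacy_intx(xc_interface *xch, uint32_t domid)
{
    int gsi = 36;       /* host GSI the device's INTx pin is wired to */
    int pirq = gsi;
    int rc;

    /* PHYSDEVOP_map_pirq: make the host GSI available to the domain. */
    rc = xc_physdev_map_pirq(xch, domid, gsi, &pirq);
    if ( rc )
        return rc;

    /* XEN_DOMCTL_bind_pt_irq: route the pirq to the guest's 00:02.0 INTA#. */
    return xc_domain_bind_pt_pci_irq(xch, domid, pirq, 0 /* bus */,
                                     2 /* device */, 0 /* INTA */);
}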

Thanks, Roger.
Stefano Stabellini Nov. 30, 2023, 2:28 a.m. UTC | #21
On Wed, 29 Nov 2023, Roger Pau Monné wrote:
> On Tue, Nov 28, 2023 at 11:45:34PM +0000, Volodymyr Babchuk wrote:
> > Hi Roger,
> > 
> > Roger Pau Monné <roger.pau@citrix.com> writes:
> > 
> > >> [...]
> > >
> > >> On the other hand if vPCI actually needs to know that 00:03.0 exists,
> > >> perhaps because it changes something in the PCI Root Complex emulation
> > >> or vPCI needs to take some action when PCI config space registers of
> > >> 00:03.0 are written to, then I think this model doesn't work well. If
> > >> this is the case, then I think it would be best to keep vPCI as MMIO
> > >> handler and let vPCI forward to IOREQ when appropriate.
> > >
> > > At first approximation I don't think we would have such interactions,
> > > otherwise the whole premise of ioreq being able to register individual
> > > PCI devices would be broken.
> > >
> > > XenSever already has scenarios with two different user-space emulators
> > > (ie: two different ioreq servers) handling accesses to different
> > > devices in the same PCI bus, and there's no interaction with the root
> > > complex required.

Good to hear

 
> > Out of curiosity: how legacy PCI interrupts are handled in this case? In
> > my understanding, it is Root Complex's responsibility to propagate
> > correct IRQ levels to an interrupt controller?
> 
> I'm unsure whether my understanding of the question is correct, so my
> reply might not be what you are asking for, sorry.
> 
> Legacy IRQs (GSI on x86) are setup directly by the toolstack when the
> device is assigned to the guest, using PHYSDEVOP_map_pirq +
> XEN_DOMCTL_bind_pt_irq.  Those hypercalls bind together a host IO-APIC
> pin to a guest IO-APIC pin, so that interrupts originating from that
> host IO-APIC pin are always forwarded to the guest an injected as
> originating from the guest IO-APIC pin.
> 
> Note that the device will always use the same IO-APIC pin, this is not
> configured by the OS.

QEMU calls xen_set_pci_intx_level, which is implemented via
xendevicemodel_set_pci_intx_level, i.e. the XEN_DMOP_set_pci_intx_level
DMOP, which ends up in set_pci_intx_level. Eventually that calls
hvm_pci_intx_assert and hvm_pci_intx_deassert.
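
A minimal sketch of the device model side of that chain (the BDF values
are purely illustrative):

#include <xendevicemodel.h>

static void set_guest_intx(xendevicemodel_handle *dmod, domid_t domid,
                           unsigned int level)
{
    /* Assert (level = 1) or deassert (level = 0) INTA# of 0000:00:03.0. */
    xendevicemodel_set_pci_intx_level(dmod, domid, 0 /* segment */,
                                      0 /* bus */, 3 /* device */,
                                      0 /* INTA */, level);
}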

I don't think any of this goes via the Root Complex; otherwise, as
Roger pointed out, it wouldn't be possible to emulate individual PCI
devices in separate IOREQ servers.
diff mbox series

Patch

diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47..780490cf8e 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@  source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 5e34d0092a..7c46a2d3f4 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -36,6 +36,52 @@  extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    unsigned long new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
+    /*
+     * Each PCI bus supports 32 devices/slots at max or up to 256 when
+     * there are multi-function ones which are not yet supported.
+     */
+    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number == VPCI_MAX_VIRT_DEV )
+    {
+        write_unlock(&pdev->domain->pci_lock);
+        return -ENOSPC;
+    }
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0);
+
+    return 0;
+}
+
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 void vpci_deassign_device(struct pci_dev *pdev)
 {
     unsigned int i;
@@ -46,6 +92,13 @@  void vpci_deassign_device(struct pci_dev *pdev)
         return;
 
     spin_lock(&pdev->vpci->lock);
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+#endif
+
     while ( !list_empty(&pdev->vpci->handlers) )
     {
         struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
@@ -101,6 +154,13 @@  int vpci_assign_device(struct pci_dev *pdev)
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto out;
+#endif
+
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
         rc = __start_vpci_array[i](pdev);
@@ -108,11 +168,14 @@  int vpci_assign_device(struct pci_dev *pdev)
             break;
     }
 
+ out:
+    __maybe_unused;
     if ( rc )
         vpci_deassign_device(pdev);
 
     return rc;
 }
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 57391e74b6..84e608f670 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,6 +462,14 @@  struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
     rwlock_t pci_lock;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 60bdc10c13..4a53936447 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@  typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports 32 devices/slots at max or up to 256 when
+ * there are multi-function ones which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -155,6 +162,10 @@  struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };