Message ID | 20231012220854.2736994-14-volodymyr_babchuk@epam.com
---|---
State | Superseded
Series | PCI devices passthrough on Arm, part 3
Hi Volodymyr, This patch was mentioned in another context about allocating the BDF. So I thought I would comment here as well. On 12/10/2023 23:09, Volodymyr Babchuk wrote: > From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > > Assign SBDF to the PCI devices being passed through with bus 0. > The resulting topology is where PCIe devices reside on the bus 0 of the > root complex itself (embedded endpoints). > This implementation is limited to 32 devices which are allowed on > a single PCI bus. > > Please note, that at the moment only function 0 of a multifunction > device can be passed through. > > Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Your signed-off-by should be added even if you are only sending the patch on behalf of Oleksandr. This is part of the DCO [1] > diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c > index 5e34d0092a..7c46a2d3f4 100644 > --- a/xen/drivers/vpci/vpci.c > +++ b/xen/drivers/vpci/vpci.c > @@ -36,6 +36,52 @@ extern vpci_register_init_t *const __start_vpci_array[]; > extern vpci_register_init_t *const __end_vpci_array[]; > #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array) > > +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT > +static int add_virtual_device(struct pci_dev *pdev) > +{ > + struct domain *d = pdev->domain; > + unsigned long new_dev_number; > + > + if ( is_hardware_domain(d) ) > + return 0; > + > + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock)); > + > + /* > + * Each PCI bus supports 32 devices/slots at max or up to 256 when > + * there are multi-function ones which are not yet supported. > + */ > + if ( pdev->info.is_extfn && !pdev->info.is_virtfn ) > + { > + gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n", > + &pdev->sbdf); > + return -EOPNOTSUPP; > + } > + new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map, > + VPCI_MAX_VIRT_DEV); IIUC, this means that Xen will allocate the BDF. I think this will become a problem quite quickly as some of the PCI may need to be assigned at a specific vBDF (I have the intel graphic card in mind). Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would not work with this approach. For dom0less passthrough, I feel the virtual BDF should always be specified in device-tree. When a domain is created after boot, then I think you want to support <bdf>@<vslot> where <vslot> is optional. [1] https://cert-manager.io/docs/contributing/sign-off/
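For context, the <bdf>@<vslot> notation Julien refers to is the existing xl PCI passthrough syntax; a minimal example (the BDF and slot number below are made up):

```
pci = [ "0000:b3:00.1@8" ]    # ask for virtual slot 8 for this passthrough device
```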
On Thu, 16 Nov 2023, Julien Grall wrote: > IIUC, this means that Xen will allocate the BDF. I think this will become a > problem quite quickly as some of the PCI may need to be assigned at a specific > vBDF (I have the intel graphic card in mind). > > Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would not > work with this approach. > > For dom0less passthrough, I feel the virtual BDF should always be specified in > device-tree. When a domain is created after boot, then I think you want to > support <bdf>@<vslot> where <vslot> is optional. Hi Julien, I also think there should be a way to specify the virtual BDF, but if possible (meaning: it is not super difficult to implement) I think it would be very convenient if we could let Xen pick whatever virtual BDF Xen wants when the user doesn't specify the virtual BDF. That's because it would make it easier to specify the configuration for the user. Typically the user doesn't care about the virtual BDF, only to expose a specific host device to the VM. There are exceptions of course and that's why I think we should also have a way for the user to request a specific virtual BDF. One of these exceptions are integrated GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if the device shows up at a different virtual BDF compared to the host. Thinking more about this, one way to simplify the problem would be if we always reuse the physical BDF as virtual BDF for passthrough devices. I think that would solve the problem and makes it much more unlikely to run into drivers bugs. And we allocate a "special" virtual BDF space for emulated devices, with the Root Complex still emulated in Xen. For instance, we could reserve ff:xx:xx and in case of clashes we could refuse to continue. Or we could allocate the first free virtual BDF, after all the pasthrough devices. Example: - the user wants to assign physical 00:11.5 and b3:00.1 to the guest - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices - Xen allocates the next virtual BDF for emulated devices: b4:xx.x - If more virtual BDFs are needed for emulated devices, Xen allocates b5:xx.x I still think, no matter the BDF allocation scheme, that we should try to avoid as much as possible to have two different PCI Root Complex emulators. Ideally we would have only one PCI Root Complex emulated by Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be tolerable but not ideal. The worst case I would like to avoid is to have two PCI Root Complexes, one emulated by Xen and one emulated by QEMU.
Hi Stefano, On 16/11/2023 23:28, Stefano Stabellini wrote: > On Thu, 16 Nov 2023, Julien Grall wrote: >> IIUC, this means that Xen will allocate the BDF. I think this will become a >> problem quite quickly as some of the PCI may need to be assigned at a specific >> vBDF (I have the intel graphic card in mind). >> >> Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would not >> work with this approach. >> >> For dom0less passthrough, I feel the virtual BDF should always be specified in >> device-tree. When a domain is created after boot, then I think you want to >> support <bdf>@<vslot> where <vslot> is optional. > > Hi Julien, > > I also think there should be a way to specify the virtual BDF, but if > possible (meaning: it is not super difficult to implement) I think it > would be very convenient if we could let Xen pick whatever virtual BDF > Xen wants when the user doesn't specify the virtual BDF. That's > because it would make it easier to specify the configuration for the > user. Typically the user doesn't care about the virtual BDF, only to > expose a specific host device to the VM. There are exceptions of course > and that's why I think we should also have a way for the user to > request a specific virtual BDF. One of these exceptions are integrated > GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if > the device shows up at a different virtual BDF compared to the host. If you let Xen allocating the vBDF, then wouldn't you need a way to tell the toolstack/Device Models which vBDF was allocated? > > Thinking more about this, one way to simplify the problem would be if we > always reuse the physical BDF as virtual BDF for passthrough devices. I > think that would solve the problem and makes it much more unlikely to > run into drivers bugs. This works so long you have only one physical segment (i.e. hostbridge). If you have multiple one, then you either have to expose multiple hostbridge to the guest (which is not great) or need someone to allocate the vBDF. > > And we allocate a "special" virtual BDF space for emulated devices, with > the Root Complex still emulated in Xen. For instance, we could reserve > ff:xx:xx. Hmmm... Wouldn't this means reserving ECAM space for 256 buses? Obviously, we could use 5 (just as random number). Yet, it still requires to reserve more memory than necessary. > and in case of clashes we could refuse to continue. Urgh. And what would be the solution users triggering this clash? > Or we could > allocate the first free virtual BDF, after all the pasthrough devices. This is only works if you don't want to support PCI hotplug. It may not be a thing for embedded, but it is used by cloud. So you need a mechanism that works with hotplug as well. > > Example: > - the user wants to assign physical 00:11.5 and b3:00.1 to the guest > - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices > - Xen allocates the next virtual BDF for emulated devices: b4:xx.x > - If more virtual BDFs are needed for emulated devices, Xen allocates > b5:xx.x > > I still think, no matter the BDF allocation scheme, that we should try > to avoid as much as possible to have two different PCI Root Complex > emulators. Ideally we would have only one PCI Root Complex emulated by > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be > tolerable but not ideal. The worst case I would like to avoid is to have > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. 
So while I agree that one emulated hostbridge is the best solution, I don't think your proposal would work. As I wrote above, you may have a system with multiple physical hostbridges. It would not be possible to assign two PCI devices that have the same BDF but come from different segments. I agree this is unlikely, but if we can avoid it then it would be best. There is one scheme which fits that: 1. If the vBDF is not specified, then pick a free one. 2. Otherwise check whether the specified vBDF is free. If not, return an error. This scheme should be used for both virtual and physical devices. This is pretty much the algorithm used by QEMU today. It works, so what would be the benefit of doing something different? Cheers,
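A minimal sketch of that two-rule scheme, reusing the vpci_dev_assigned_map bitmap introduced by this patch; the requested_dev parameter (~0 meaning "pick any free slot") and the function itself are illustrative only, not existing Xen code:

```c
/*
 * Sketch of the two-rule vBDF allocation scheme described above.
 * Assumes d->pci_lock is already write-locked by the caller, as in the
 * patch under review.
 */
static int allocate_virt_slot(struct domain *d, unsigned int requested_dev,
                              unsigned int *out_dev)
{
    unsigned int slot;

    ASSERT(rw_is_write_locked(&d->pci_lock));

    if ( requested_dev == ~0U )
    {
        /* Rule 1: no vBDF specified, pick the first free slot. */
        slot = find_first_zero_bit(d->vpci_dev_assigned_map,
                                   VPCI_MAX_VIRT_DEV);
        if ( slot == VPCI_MAX_VIRT_DEV )
            return -ENOSPC;
    }
    else
    {
        /* Rule 2: a specific vBDF was requested, fail if it is taken. */
        if ( requested_dev >= VPCI_MAX_VIRT_DEV ||
             test_bit(requested_dev, d->vpci_dev_assigned_map) )
            return -EEXIST;
        slot = requested_dev;
    }

    __set_bit(slot, d->vpci_dev_assigned_map);
    *out_dev = slot;

    return 0;
}
```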
Hi Stefano, Stefano Stabellini <sstabellini@kernel.org> writes: > On Thu, 16 Nov 2023, Julien Grall wrote: >> IIUC, this means that Xen will allocate the BDF. I think this will become a >> problem quite quickly as some of the PCI may need to be assigned at a specific >> vBDF (I have the intel graphic card in mind). >> >> Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would not >> work with this approach. >> >> For dom0less passthrough, I feel the virtual BDF should always be specified in >> device-tree. When a domain is created after boot, then I think you want to >> support <bdf>@<vslot> where <vslot> is optional. > > Hi Julien, > > I also think there should be a way to specify the virtual BDF, but if > possible (meaning: it is not super difficult to implement) I think it > would be very convenient if we could let Xen pick whatever virtual BDF > Xen wants when the user doesn't specify the virtual BDF. That's > because it would make it easier to specify the configuration for the > user. Typically the user doesn't care about the virtual BDF, only to > expose a specific host device to the VM. There are exceptions of course > and that's why I think we should also have a way for the user to > request a specific virtual BDF. One of these exceptions are integrated > GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if > the device shows up at a different virtual BDF compared to the host. > > Thinking more about this, one way to simplify the problem would be if we > always reuse the physical BDF as virtual BDF for passthrough devices. I > think that would solve the problem and makes it much more unlikely to > run into drivers bugs. I'm not sure that this is possible. AFAIK, if we have device with B>0, we need to have bridge device for it. So, if I want to passthrough device 08:00.0, I need to provide a virtual bridge with BDF 0:NN.0. This unnecessary complicates things. Also, there can be funny situation with conflicting BFD numbers exposed by different domains. I know that this is not your typical setup, but imagine that Domain A acts as a driver domain for PCI controller A and Domain B acts as a driver domain for PCI controller B. They may expose devices with same BDFs but with different segments. > And we allocate a "special" virtual BDF space for emulated devices, with > the Root Complex still emulated in Xen. For instance, we could reserve > ff:xx:xx and in case of clashes we could refuse to continue. Or we could > allocate the first free virtual BDF, after all the pasthrough devices. Again, I may be wrong there, but we need an emulated PCI bridge device if we want to use Bus numbers > 0. > > Example: > - the user wants to assign physical 00:11.5 and b3:00.1 to the guest > - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices > - Xen allocates the next virtual BDF for emulated devices: b4:xx.x > - If more virtual BDFs are needed for emulated devices, Xen allocates > b5:xx.x > > I still think, no matter the BDF allocation scheme, that we should try > to avoid as much as possible to have two different PCI Root Complex > emulators. Ideally we would have only one PCI Root Complex emulated by > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be > tolerable but not ideal. But what is exactly wrong with this setup? > The worst case I would like to avoid is to have > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. This is how our setup works right now. 
I agree that we need some way to provide static vBDF numbers. But I am wondering what the best way to do this is. We need some entity that manages and assigns those vBDFs. It should reside in Xen, because of the dom0less use case. Probably we need to extend xen_domctl_assign_device so we can either request a free vBDF or a specific vBDF. And in the first case, Xen should return the assigned vBDF so the toolstack can pass it to a backend if the PCI device is purely virtual.
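A rough sketch of what that extension could look like, based on the existing struct xen_domctl_assign_device in xen/include/public/domctl.h; the vsbdf field and the XEN_DOMCTL_DEV_ANY_VSBDF flag are hypothetical additions:

```c
/*
 * Hypothetical extension of the existing domctl: the toolstack either
 * requests a specific virtual SBDF or asks Xen to pick one, and Xen
 * writes the assigned value back on return.
 */
struct xen_domctl_assign_device {
    uint32_t dev;                       /* XEN_DOMCTL_DEV_* */
    uint32_t flags;
#define XEN_DOMCTL_DEV_RDM_RELAXED  1   /* assign only (existing) */
#define XEN_DOMCTL_DEV_ANY_VSBDF    2   /* hypothetical: let Xen pick the vBDF */
    union {
        struct {
            uint32_t machine_sbdf;      /* physical SBDF, as today */
            uint32_t vsbdf;             /* IN: requested vBDF (unless "any"),
                                           OUT: vBDF actually assigned
                                           (hypothetical field) */
        } pci;
        struct {
            uint32_t size;
            XEN_GUEST_HANDLE_64(char) path;
        } dt;
    } u;
};
```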
On Fri, 17 Nov 2023, Julien Grall wrote: > Hi Stefano, > > On 16/11/2023 23:28, Stefano Stabellini wrote: > > On Thu, 16 Nov 2023, Julien Grall wrote: > > > IIUC, this means that Xen will allocate the BDF. I think this will become > > > a > > > problem quite quickly as some of the PCI may need to be assigned at a > > > specific > > > vBDF (I have the intel graphic card in mind). > > > > > > Also, xl allows you to specificy the slot (e.g. <bdf>@<vslot>) which would > > > not > > > work with this approach. > > > > > > For dom0less passthrough, I feel the virtual BDF should always be > > > specified in > > > device-tree. When a domain is created after boot, then I think you want to > > > support <bdf>@<vslot> where <vslot> is optional. > > > > Hi Julien, > > > > I also think there should be a way to specify the virtual BDF, but if > > possible (meaning: it is not super difficult to implement) I think it > > would be very convenient if we could let Xen pick whatever virtual BDF > > Xen wants when the user doesn't specify the virtual BDF. That's > > because it would make it easier to specify the configuration for the > > user. Typically the user doesn't care about the virtual BDF, only to > > expose a specific host device to the VM. There are exceptions of course > > and that's why I think we should also have a way for the user to > > request a specific virtual BDF. One of these exceptions are integrated > > GPUs: the OS drivers used to have hardcoded BDFs. So it wouldn't work if > > the device shows up at a different virtual BDF compared to the host. > > If you let Xen allocating the vBDF, then wouldn't you need a way to tell the > toolstack/Device Models which vBDF was allocated? > > > > > Thinking more about this, one way to simplify the problem would be if we > > always reuse the physical BDF as virtual BDF for passthrough devices. I > > think that would solve the problem and makes it much more unlikely to > > run into drivers bugs. > > This works so long you have only one physical segment (i.e. hostbridge). If > you have multiple one, then you either have to expose multiple hostbridge to > the guest (which is not great) or need someone to allocate the vBDF. > > > > > And we allocate a "special" virtual BDF space for emulated devices, with > > the Root Complex still emulated in Xen. For instance, we could reserve > > ff:xx:xx. > Hmmm... Wouldn't this means reserving ECAM space for 256 buses? Obviously, we > could use 5 (just as random number). Yet, it still requires to reserve more > memory than necessary. > > > and in case of clashes we could refuse to continue. > > Urgh. And what would be the solution users triggering this clash? > > > Or we could > > allocate the first free virtual BDF, after all the pasthrough devices. > > This is only works if you don't want to support PCI hotplug. It may not be a > thing for embedded, but it is used by cloud. So you need a mechanism that > works with hotplug as well. > > > > > Example: > > - the user wants to assign physical 00:11.5 and b3:00.1 to the guest > > - Xen create virtual BDFs 00:11.5 and b3:00.1 for the passthrough devices > > - Xen allocates the next virtual BDF for emulated devices: b4:xx.x > > - If more virtual BDFs are needed for emulated devices, Xen allocates > > b5:xx.x > > > I still think, no matter the BDF allocation scheme, that we should try > > to avoid as much as possible to have two different PCI Root Complex > > emulators. Ideally we would have only one PCI Root Complex emulated by > > Xen. 
Having 2 PCI Root Complexes both of them emulated by Xen would be > > tolerable but not ideal. The worst case I would like to avoid is to have > > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. > > So while I agree that one emulated hostbridge is the best solution, I don't > think your proposal would work. As I wrote above, you may have a system with > multiple physical hostbridge. It would not be possible to assign two PCI > devices with the same BDF but from different segment. > > I agree unlikely, but if we can avoid it then it would be best. There are one > scheme which fits that: > 1. If the vBDF is not specified, then pick a free one. > 2. Otherwise check if the specified vBDF is free. If not return an error. > > This scheme should be used for both virtual and physical. This is pretty much > the algorithm used by QEMU today. It works, so what's would be the benefits to > do something different? I am OK with that. I was trying to find a way that could work without user intervention in almost 100% of the cases. I think both 1. and 2. you proposed are fine.
On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > > I still think, no matter the BDF allocation scheme, that we should try > > to avoid as much as possible to have two different PCI Root Complex > > emulators. Ideally we would have only one PCI Root Complex emulated by > > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be > > tolerable but not ideal. > > But what is exactly wrong with this setup? [...] > > The worst case I would like to avoid is to have > > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. > > This is how our setup works right now. If we have: - a single PCI Root Complex emulated in Xen - Xen is safety certified - individual Virtio devices emulated by QEMU with grants for memory We can go very far in terms of being able to use Virtio in safety use-cases. We might even be able to use Virtio (frontends) in a SafeOS. On the other hand if we put an additional Root Complex in QEMU: - we pay a price in terms of complexity of the codebase - we pay a price in terms of resource utilization - we have one additional problem in terms of using this setup with a SafeOS (one more device emulated by a non-safe component) Having 2 PCI Root Complexes both emulated in Xen is a middle ground solution because: - we still pay a price in terms of resource utilization - the code complexity goes up a bit but hopefully not by much - there is no impact on safety compared to the ideal scenario This is why I wrote that it is tolerable. > I agree that we need some way to provide static vBDF numbers. But I am > wondering what is the best way to do this. We need some entity that > manages and assigns those vBDFs. It should reside in Xen, because there > is Dom0less use case. Probably we need to extend > xen_domctl_assign_device so we can either request a free vBDF or a > specific vBDF. And in the first case, Xen should return assigned vBDF so > toolstack can give it to a backend, if PCI device is purely virtual. I think that would be fine.
Hi Stefano, Stefano Stabellini <sstabellini@kernel.org> writes: > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: >> > I still think, no matter the BDF allocation scheme, that we should try >> > to avoid as much as possible to have two different PCI Root Complex >> > emulators. Ideally we would have only one PCI Root Complex emulated by >> > Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be >> > tolerable but not ideal. >> >> But what is exactly wrong with this setup? > > [...] > >> > The worst case I would like to avoid is to have >> > two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. >> >> This is how our setup works right now. > > If we have: > - a single PCI Root Complex emulated in Xen > - Xen is safety certified > - individual Virtio devices emulated by QEMU with grants for memory > > We can go very far in terms of being able to use Virtio in safety > use-cases. We might even be able to use Virtio (frontends) in a SafeOS. > > On the other hand if we put an additional Root Complex in QEMU: > - we pay a price in terms of complexity of the codebase > - we pay a price in terms of resource utilization > - we have one additional problem in terms of using this setup with a > SafeOS (one more device emulated by a non-safe component) > > Having 2 PCI Root Complexes both emulated in Xen is a middle ground > solution because: > - we still pay a price in terms of resource utilization > - the code complexity goes up a bit but hopefully not by much > - there is no impact on safety compared to the ideal scenario > > This is why I wrote that it is tolerable. Ah, I see now. Yes, I am agree with this. Also I want to add some more points: - There is ongoing work on implementing virtio backends as a separate applications, written in Rust. Linaro are doing this part. Right now they are implementing only virtio-mmio, but if they want to provide virtio-pci as well, they will need a mechanism to plug only virtio-pci, without Root Complex. This is argument for using single Root Complex emulated in Xen. - As far as I know (actually, Oleksandr told this to me), QEMU has no mechanism for exposing virtio-pci backends without exposing PCI root complex as well. Architecturally, there should be a PCI bus to which virtio-pci devices are connected. Or we need to make some changes to QEMU internals to be able to create virtio-pci backends that are not connected to any bus. Also, added benefit that PCI Root Complex emulator in QEMU handles legacy PCI interrupts for us. This is argument for separate Root Complex for QEMU. As right now we have only virtio-pci backends provided by QEMU and this setup is already working, I propose to stick to this solution. Especially, taking into account that it does not require any changes to hypervisor code.
Hi Volodymyr, On 17/11/2023 14:09, Volodymyr Babchuk wrote: > > Hi Stefano, > > Stefano Stabellini <sstabellini@kernel.org> writes: > >> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: >>>> I still think, no matter the BDF allocation scheme, that we should try >>>> to avoid as much as possible to have two different PCI Root Complex >>>> emulators. Ideally we would have only one PCI Root Complex emulated by >>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be >>>> tolerable but not ideal. >>> >>> But what is exactly wrong with this setup? >> >> [...] >> >>>> The worst case I would like to avoid is to have >>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. >>> >>> This is how our setup works right now. >> >> If we have: >> - a single PCI Root Complex emulated in Xen >> - Xen is safety certified >> - individual Virtio devices emulated by QEMU with grants for memory >> >> We can go very far in terms of being able to use Virtio in safety >> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. >> >> On the other hand if we put an additional Root Complex in QEMU: >> - we pay a price in terms of complexity of the codebase >> - we pay a price in terms of resource utilization >> - we have one additional problem in terms of using this setup with a >> SafeOS (one more device emulated by a non-safe component) >> >> Having 2 PCI Root Complexes both emulated in Xen is a middle ground >> solution because: >> - we still pay a price in terms of resource utilization >> - the code complexity goes up a bit but hopefully not by much >> - there is no impact on safety compared to the ideal scenario >> >> This is why I wrote that it is tolerable. > > Ah, I see now. Yes, I am agree with this. Also I want to add some more > points: > > - There is ongoing work on implementing virtio backends as a separate > applications, written in Rust. Linaro are doing this part. Right now > they are implementing only virtio-mmio, but if they want to provide > virtio-pci as well, they will need a mechanism to plug only > virtio-pci, without Root Complex. This is argument for using single Root > Complex emulated in Xen. > > - As far as I know (actually, Oleksandr told this to me), QEMU has no > mechanism for exposing virtio-pci backends without exposing PCI root > complex as well. Architecturally, there should be a PCI bus to which > virtio-pci devices are connected. Or we need to make some changes to > QEMU internals to be able to create virtio-pci backends that are not > connected to any bus. Also, added benefit that PCI Root Complex > emulator in QEMU handles legacy PCI interrupts for us. This is > argument for separate Root Complex for QEMU. > > As right now we have only virtio-pci backends provided by QEMU and this > setup is already working, I propose to stick to this > solution. Especially, taking into account that it does not require any > changes to hypervisor code. I am not against two hostbridge as a temporary solution as long as this is not a one way door decision. I am not concerned about the hypervisor itself, I am more concerned about the interface exposed by the toolstack and QEMU. To clarify, I don't particular want to have to maintain the two hostbridges solution once we can use a single hostbridge. So we need to be able to get rid of it without impacting the interface too much. Cheers,
Hi Julien, Julien Grall <julien@xen.org> writes: > Hi Volodymyr, > > On 17/11/2023 14:09, Volodymyr Babchuk wrote: >> Hi Stefano, >> Stefano Stabellini <sstabellini@kernel.org> writes: >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: >>>>> I still think, no matter the BDF allocation scheme, that we should try >>>>> to avoid as much as possible to have two different PCI Root Complex >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be >>>>> tolerable but not ideal. >>>> >>>> But what is exactly wrong with this setup? >>> >>> [...] >>> >>>>> The worst case I would like to avoid is to have >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. >>>> >>>> This is how our setup works right now. >>> >>> If we have: >>> - a single PCI Root Complex emulated in Xen >>> - Xen is safety certified >>> - individual Virtio devices emulated by QEMU with grants for memory >>> >>> We can go very far in terms of being able to use Virtio in safety >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. >>> >>> On the other hand if we put an additional Root Complex in QEMU: >>> - we pay a price in terms of complexity of the codebase >>> - we pay a price in terms of resource utilization >>> - we have one additional problem in terms of using this setup with a >>> SafeOS (one more device emulated by a non-safe component) >>> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground >>> solution because: >>> - we still pay a price in terms of resource utilization >>> - the code complexity goes up a bit but hopefully not by much >>> - there is no impact on safety compared to the ideal scenario >>> >>> This is why I wrote that it is tolerable. >> Ah, I see now. Yes, I am agree with this. Also I want to add some >> more >> points: >> - There is ongoing work on implementing virtio backends as a >> separate >> applications, written in Rust. Linaro are doing this part. Right now >> they are implementing only virtio-mmio, but if they want to provide >> virtio-pci as well, they will need a mechanism to plug only >> virtio-pci, without Root Complex. This is argument for using single Root >> Complex emulated in Xen. >> - As far as I know (actually, Oleksandr told this to me), QEMU has >> no >> mechanism for exposing virtio-pci backends without exposing PCI root >> complex as well. Architecturally, there should be a PCI bus to which >> virtio-pci devices are connected. Or we need to make some changes to >> QEMU internals to be able to create virtio-pci backends that are not >> connected to any bus. Also, added benefit that PCI Root Complex >> emulator in QEMU handles legacy PCI interrupts for us. This is >> argument for separate Root Complex for QEMU. >> As right now we have only virtio-pci backends provided by QEMU and >> this >> setup is already working, I propose to stick to this >> solution. Especially, taking into account that it does not require any >> changes to hypervisor code. > > I am not against two hostbridge as a temporary solution as long as > this is not a one way door decision. I am not concerned about the > hypervisor itself, I am more concerned about the interface exposed by > the toolstack and QEMU. > > To clarify, I don't particular want to have to maintain the two > hostbridges solution once we can use a single hostbridge. So we need > to be able to get rid of it without impacting the interface too much. 
This depends on virtio-pci backend availability. AFAIK, right now the only option is to use QEMU, and QEMU provides its own host bridge. So if we want to get rid of the second host bridge we need either another virtio-pci backend, or we need to alter the QEMU code so it can live without a host bridge. As for interfaces, it appears that the QEMU case does not require any changes to the hypervisor itself; it just boils down to writing a couple of xenstore entries and spawning QEMU with the correct command line arguments. From the user perspective, all of this is configured via an xl domain config entry like virtio=[ 'backend=DomD, type=virtio,device, transport=pci, bdf=0000:00:01.0, grant_usage=1, backend_type=qemu', ] In the future we can add backend_type=standalone for non-QEMU-based backends. If there are no QEMU-based backends, there will be no second host bridge.
On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > Hi Julien, > > Julien Grall <julien@xen.org> writes: > > > Hi Volodymyr, > > > > On 17/11/2023 14:09, Volodymyr Babchuk wrote: > >> Hi Stefano, > >> Stefano Stabellini <sstabellini@kernel.org> writes: > >> > >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > >>>>> I still think, no matter the BDF allocation scheme, that we should try > >>>>> to avoid as much as possible to have two different PCI Root Complex > >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by > >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be > >>>>> tolerable but not ideal. > >>>> > >>>> But what is exactly wrong with this setup? > >>> > >>> [...] > >>> > >>>>> The worst case I would like to avoid is to have > >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. > >>>> > >>>> This is how our setup works right now. > >>> > >>> If we have: > >>> - a single PCI Root Complex emulated in Xen > >>> - Xen is safety certified > >>> - individual Virtio devices emulated by QEMU with grants for memory > >>> > >>> We can go very far in terms of being able to use Virtio in safety > >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. > >>> > >>> On the other hand if we put an additional Root Complex in QEMU: > >>> - we pay a price in terms of complexity of the codebase > >>> - we pay a price in terms of resource utilization > >>> - we have one additional problem in terms of using this setup with a > >>> SafeOS (one more device emulated by a non-safe component) > >>> > >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground > >>> solution because: > >>> - we still pay a price in terms of resource utilization > >>> - the code complexity goes up a bit but hopefully not by much > >>> - there is no impact on safety compared to the ideal scenario > >>> > >>> This is why I wrote that it is tolerable. > >> Ah, I see now. Yes, I am agree with this. Also I want to add some > >> more > >> points: > >> - There is ongoing work on implementing virtio backends as a > >> separate > >> applications, written in Rust. Linaro are doing this part. Right now > >> they are implementing only virtio-mmio, but if they want to provide > >> virtio-pci as well, they will need a mechanism to plug only > >> virtio-pci, without Root Complex. This is argument for using single Root > >> Complex emulated in Xen. > >> - As far as I know (actually, Oleksandr told this to me), QEMU has > >> no > >> mechanism for exposing virtio-pci backends without exposing PCI root > >> complex as well. Architecturally, there should be a PCI bus to which > >> virtio-pci devices are connected. Or we need to make some changes to > >> QEMU internals to be able to create virtio-pci backends that are not > >> connected to any bus. Also, added benefit that PCI Root Complex > >> emulator in QEMU handles legacy PCI interrupts for us. This is > >> argument for separate Root Complex for QEMU. > >> As right now we have only virtio-pci backends provided by QEMU and > >> this > >> setup is already working, I propose to stick to this > >> solution. Especially, taking into account that it does not require any > >> changes to hypervisor code. > > > > I am not against two hostbridge as a temporary solution as long as > > this is not a one way door decision. I am not concerned about the > > hypervisor itself, I am more concerned about the interface exposed by > > the toolstack and QEMU. I agree with this... 
> > To clarify, I don't particular want to have to maintain the two > > hostbridges solution once we can use a single hostbridge. So we need > > to be able to get rid of it without impacting the interface too much. ...and this > This depends on virtio-pci backends availability. AFAIK, now only one > option is to use QEMU and QEMU provides own host bridge. So if we want > get rid of the second host bridge we need either another virtio-pci > backend or we need to alter QEMU code so it can live without host > bridge. > > As for interfaces, it appears that QEMU case does not require any changes > into hypervisor itself, it just boils down to writing couple of xenstore > entries and spawning QEMU with correct command line arguments. One thing that Stewart wrote in his reply that is important: it doesn't matter if QEMU thinks it is emulating a PCI Root Complex because that's required from QEMU's point of view to emulate an individual PCI device. If we can arrange it so the QEMU PCI Root Complex is not registered against Xen as part of the ioreq interface, then QEMU's emulated PCI Root Complex is going to be left unused. I think that would be great because we still have a clean QEMU-Xen-tools interface and the only downside is some extra unused emulation in QEMU. It would be a fantastic starting point.
Hi Stefano, Stefano Stabellini <sstabellini@kernel.org> writes: > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: >> Hi Julien, >> >> Julien Grall <julien@xen.org> writes: >> >> > Hi Volodymyr, >> > >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote: >> >> Hi Stefano, >> >> Stefano Stabellini <sstabellini@kernel.org> writes: >> >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: >> >>>>> I still think, no matter the BDF allocation scheme, that we should try >> >>>>> to avoid as much as possible to have two different PCI Root Complex >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be >> >>>>> tolerable but not ideal. >> >>>> >> >>>> But what is exactly wrong with this setup? >> >>> >> >>> [...] >> >>> >> >>>>> The worst case I would like to avoid is to have >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. >> >>>> >> >>>> This is how our setup works right now. >> >>> >> >>> If we have: >> >>> - a single PCI Root Complex emulated in Xen >> >>> - Xen is safety certified >> >>> - individual Virtio devices emulated by QEMU with grants for memory >> >>> >> >>> We can go very far in terms of being able to use Virtio in safety >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. >> >>> >> >>> On the other hand if we put an additional Root Complex in QEMU: >> >>> - we pay a price in terms of complexity of the codebase >> >>> - we pay a price in terms of resource utilization >> >>> - we have one additional problem in terms of using this setup with a >> >>> SafeOS (one more device emulated by a non-safe component) >> >>> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground >> >>> solution because: >> >>> - we still pay a price in terms of resource utilization >> >>> - the code complexity goes up a bit but hopefully not by much >> >>> - there is no impact on safety compared to the ideal scenario >> >>> >> >>> This is why I wrote that it is tolerable. >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some >> >> more >> >> points: >> >> - There is ongoing work on implementing virtio backends as a >> >> separate >> >> applications, written in Rust. Linaro are doing this part. Right now >> >> they are implementing only virtio-mmio, but if they want to provide >> >> virtio-pci as well, they will need a mechanism to plug only >> >> virtio-pci, without Root Complex. This is argument for using single Root >> >> Complex emulated in Xen. >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has >> >> no >> >> mechanism for exposing virtio-pci backends without exposing PCI root >> >> complex as well. Architecturally, there should be a PCI bus to which >> >> virtio-pci devices are connected. Or we need to make some changes to >> >> QEMU internals to be able to create virtio-pci backends that are not >> >> connected to any bus. Also, added benefit that PCI Root Complex >> >> emulator in QEMU handles legacy PCI interrupts for us. This is >> >> argument for separate Root Complex for QEMU. >> >> As right now we have only virtio-pci backends provided by QEMU and >> >> this >> >> setup is already working, I propose to stick to this >> >> solution. Especially, taking into account that it does not require any >> >> changes to hypervisor code. >> > >> > I am not against two hostbridge as a temporary solution as long as >> > this is not a one way door decision. 
I am not concerned about the >> > hypervisor itself, I am more concerned about the interface exposed by >> > the toolstack and QEMU. > > I agree with this... > > >> > To clarify, I don't particular want to have to maintain the two >> > hostbridges solution once we can use a single hostbridge. So we need >> > to be able to get rid of it without impacting the interface too much. > > ...and this > > >> This depends on virtio-pci backends availability. AFAIK, now only one >> option is to use QEMU and QEMU provides own host bridge. So if we want >> get rid of the second host bridge we need either another virtio-pci >> backend or we need to alter QEMU code so it can live without host >> bridge. >> >> As for interfaces, it appears that QEMU case does not require any changes >> into hypervisor itself, it just boils down to writing couple of xenstore >> entries and spawning QEMU with correct command line arguments. > > One thing that Stewart wrote in his reply that is important: it doesn't > matter if QEMU thinks it is emulating a PCI Root Complex because that's > required from QEMU's point of view to emulate an individual PCI device. > > If we can arrange it so the QEMU PCI Root Complex is not registered > against Xen as part of the ioreq interface, then QEMU's emulated PCI > Root Complex is going to be left unused. I think that would be great > because we still have a clean QEMU-Xen-tools interface and the only > downside is some extra unused emulation in QEMU. It would be a > fantastic starting point. I believe, that in this case we need to set manual ioreq handlers, like what was done in patch "xen/arm: Intercept vPCI config accesses and forward them to emulator", because we need to route ECAM accesses either to a virtio-pci backend or to a real PCI device. Also we need to tell QEMU to not install own ioreq handles for ECAM space. Another point is PCI legacy interrupts, which should be emulated on Xen side. And unless I miss something, we will need some new mechanism to signal those interrupts from QEMU/other backend. I am not sure if we can use already existing IRQ signaling mechanism, because PCI interrupts are ORed for all devices on a bridge and are level-sensitive. Of course, we will need all of this anyways, if we want to support standalone virtio-pci backends, but for me it sounds more like "finish point" :)
On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > >> Hi Julien, > >> > >> Julien Grall <julien@xen.org> writes: > >> > >> > Hi Volodymyr, > >> > > >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote: > >> >> Hi Stefano, > >> >> Stefano Stabellini <sstabellini@kernel.org> writes: > >> >> > >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > >> >>>>> I still think, no matter the BDF allocation scheme, that we should try > >> >>>>> to avoid as much as possible to have two different PCI Root Complex > >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by > >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be > >> >>>>> tolerable but not ideal. > >> >>>> > >> >>>> But what is exactly wrong with this setup? > >> >>> > >> >>> [...] > >> >>> > >> >>>>> The worst case I would like to avoid is to have > >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. > >> >>>> > >> >>>> This is how our setup works right now. > >> >>> > >> >>> If we have: > >> >>> - a single PCI Root Complex emulated in Xen > >> >>> - Xen is safety certified > >> >>> - individual Virtio devices emulated by QEMU with grants for memory > >> >>> > >> >>> We can go very far in terms of being able to use Virtio in safety > >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. > >> >>> > >> >>> On the other hand if we put an additional Root Complex in QEMU: > >> >>> - we pay a price in terms of complexity of the codebase > >> >>> - we pay a price in terms of resource utilization > >> >>> - we have one additional problem in terms of using this setup with a > >> >>> SafeOS (one more device emulated by a non-safe component) > >> >>> > >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground > >> >>> solution because: > >> >>> - we still pay a price in terms of resource utilization > >> >>> - the code complexity goes up a bit but hopefully not by much > >> >>> - there is no impact on safety compared to the ideal scenario > >> >>> > >> >>> This is why I wrote that it is tolerable. > >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some > >> >> more > >> >> points: > >> >> - There is ongoing work on implementing virtio backends as a > >> >> separate > >> >> applications, written in Rust. Linaro are doing this part. Right now > >> >> they are implementing only virtio-mmio, but if they want to provide > >> >> virtio-pci as well, they will need a mechanism to plug only > >> >> virtio-pci, without Root Complex. This is argument for using single Root > >> >> Complex emulated in Xen. > >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has > >> >> no > >> >> mechanism for exposing virtio-pci backends without exposing PCI root > >> >> complex as well. Architecturally, there should be a PCI bus to which > >> >> virtio-pci devices are connected. Or we need to make some changes to > >> >> QEMU internals to be able to create virtio-pci backends that are not > >> >> connected to any bus. Also, added benefit that PCI Root Complex > >> >> emulator in QEMU handles legacy PCI interrupts for us. This is > >> >> argument for separate Root Complex for QEMU. > >> >> As right now we have only virtio-pci backends provided by QEMU and > >> >> this > >> >> setup is already working, I propose to stick to this > >> >> solution. Especially, taking into account that it does not require any > >> >> changes to hypervisor code. 
> >> > > >> > I am not against two hostbridge as a temporary solution as long as > >> > this is not a one way door decision. I am not concerned about the > >> > hypervisor itself, I am more concerned about the interface exposed by > >> > the toolstack and QEMU. > > > > I agree with this... > > > > > >> > To clarify, I don't particular want to have to maintain the two > >> > hostbridges solution once we can use a single hostbridge. So we need > >> > to be able to get rid of it without impacting the interface too much. > > > > ...and this > > > > > >> This depends on virtio-pci backends availability. AFAIK, now only one > >> option is to use QEMU and QEMU provides own host bridge. So if we want > >> get rid of the second host bridge we need either another virtio-pci > >> backend or we need to alter QEMU code so it can live without host > >> bridge. > >> > >> As for interfaces, it appears that QEMU case does not require any changes > >> into hypervisor itself, it just boils down to writing couple of xenstore > >> entries and spawning QEMU with correct command line arguments. > > > > One thing that Stewart wrote in his reply that is important: it doesn't > > matter if QEMU thinks it is emulating a PCI Root Complex because that's > > required from QEMU's point of view to emulate an individual PCI device. > > > > If we can arrange it so the QEMU PCI Root Complex is not registered > > against Xen as part of the ioreq interface, then QEMU's emulated PCI > > Root Complex is going to be left unused. I think that would be great > > because we still have a clean QEMU-Xen-tools interface and the only > > downside is some extra unused emulation in QEMU. It would be a > > fantastic starting point. > > I believe, that in this case we need to set manual ioreq handlers, like > what was done in patch "xen/arm: Intercept vPCI config accesses and > forward them to emulator", because we need to route ECAM accesses > either to a virtio-pci backend or to a real PCI device. Also we need > to tell QEMU to not install own ioreq handles for ECAM space. I was imagining that the interface would look like this: QEMU registers a PCI BDF and Xen automatically starts forwarding to QEMU ECAM reads/writes requests for the PCI config space of that BDF only. It would not be the entire ECAM space but only individual PCI conf reads/writes that the BDF only. > Another point is PCI legacy interrupts, which should be emulated on Xen > side. And unless I miss something, we will need some new mechanism to > signal those interrupts from QEMU/other backend. I am not sure if we can > use already existing IRQ signaling mechanism, because PCI interrupts are > ORed for all devices on a bridge and are level-sensitive. I hope we can reuse xc_hvm_set_pci_intx_level or another XEN_DMOP hypercall > Of course, we will need all of this anyways, if we want to support > standalone virtio-pci backends, but for me it sounds more like "finish > point" :)
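For illustration, the DMOP-based interface Stefano refers to already exists in libxendevicemodel; a hedged sketch of how a backend could signal legacy INTx for a virtual device (the domid wiring and the BDF below are assumptions):

```c
#include <xendevicemodel.h>

/* Assert or deassert INTA of a hypothetical virtual device 0000:00:03.0. */
static int set_virtio_intx(xendevicemodel_handle *dmod, domid_t domid,
                           unsigned int level)
{
    return xendevicemodel_set_pci_intx_level(dmod, domid,
                                             0 /* segment */, 0 /* bus */,
                                             3 /* device */, 0 /* INTA */,
                                             level);
}
```

Whether this maps cleanly onto a single Xen-emulated hostbridge, where Xen would have to combine the levels reported by several backends sharing an INTx line, is exactly the open question raised above.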
Hi Stefano, Stefano Stabellini <sstabellini@kernel.org> writes: > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: >> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: >> >> Hi Julien, >> >> >> >> Julien Grall <julien@xen.org> writes: >> >> >> >> > Hi Volodymyr, >> >> > >> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote: >> >> >> Hi Stefano, >> >> >> Stefano Stabellini <sstabellini@kernel.org> writes: >> >> >> >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: >> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try >> >> >>>>> to avoid as much as possible to have two different PCI Root Complex >> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by >> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be >> >> >>>>> tolerable but not ideal. >> >> >>>> >> >> >>>> But what is exactly wrong with this setup? >> >> >>> >> >> >>> [...] >> >> >>> >> >> >>>>> The worst case I would like to avoid is to have >> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. >> >> >>>> >> >> >>>> This is how our setup works right now. >> >> >>> >> >> >>> If we have: >> >> >>> - a single PCI Root Complex emulated in Xen >> >> >>> - Xen is safety certified >> >> >>> - individual Virtio devices emulated by QEMU with grants for memory >> >> >>> >> >> >>> We can go very far in terms of being able to use Virtio in safety >> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. >> >> >>> >> >> >>> On the other hand if we put an additional Root Complex in QEMU: >> >> >>> - we pay a price in terms of complexity of the codebase >> >> >>> - we pay a price in terms of resource utilization >> >> >>> - we have one additional problem in terms of using this setup with a >> >> >>> SafeOS (one more device emulated by a non-safe component) >> >> >>> >> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground >> >> >>> solution because: >> >> >>> - we still pay a price in terms of resource utilization >> >> >>> - the code complexity goes up a bit but hopefully not by much >> >> >>> - there is no impact on safety compared to the ideal scenario >> >> >>> >> >> >>> This is why I wrote that it is tolerable. >> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some >> >> >> more >> >> >> points: >> >> >> - There is ongoing work on implementing virtio backends as a >> >> >> separate >> >> >> applications, written in Rust. Linaro are doing this part. Right now >> >> >> they are implementing only virtio-mmio, but if they want to provide >> >> >> virtio-pci as well, they will need a mechanism to plug only >> >> >> virtio-pci, without Root Complex. This is argument for using single Root >> >> >> Complex emulated in Xen. >> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has >> >> >> no >> >> >> mechanism for exposing virtio-pci backends without exposing PCI root >> >> >> complex as well. Architecturally, there should be a PCI bus to which >> >> >> virtio-pci devices are connected. Or we need to make some changes to >> >> >> QEMU internals to be able to create virtio-pci backends that are not >> >> >> connected to any bus. Also, added benefit that PCI Root Complex >> >> >> emulator in QEMU handles legacy PCI interrupts for us. This is >> >> >> argument for separate Root Complex for QEMU. >> >> >> As right now we have only virtio-pci backends provided by QEMU and >> >> >> this >> >> >> setup is already working, I propose to stick to this >> >> >> solution. 
Especially, taking into account that it does not require any >> >> >> changes to hypervisor code. >> >> > >> >> > I am not against two hostbridge as a temporary solution as long as >> >> > this is not a one way door decision. I am not concerned about the >> >> > hypervisor itself, I am more concerned about the interface exposed by >> >> > the toolstack and QEMU. >> > >> > I agree with this... >> > >> > >> >> > To clarify, I don't particular want to have to maintain the two >> >> > hostbridges solution once we can use a single hostbridge. So we need >> >> > to be able to get rid of it without impacting the interface too much. >> > >> > ...and this >> > >> > >> >> This depends on virtio-pci backends availability. AFAIK, now only one >> >> option is to use QEMU and QEMU provides own host bridge. So if we want >> >> get rid of the second host bridge we need either another virtio-pci >> >> backend or we need to alter QEMU code so it can live without host >> >> bridge. >> >> >> >> As for interfaces, it appears that QEMU case does not require any changes >> >> into hypervisor itself, it just boils down to writing couple of xenstore >> >> entries and spawning QEMU with correct command line arguments. >> > >> > One thing that Stewart wrote in his reply that is important: it doesn't >> > matter if QEMU thinks it is emulating a PCI Root Complex because that's >> > required from QEMU's point of view to emulate an individual PCI device. >> > >> > If we can arrange it so the QEMU PCI Root Complex is not registered >> > against Xen as part of the ioreq interface, then QEMU's emulated PCI >> > Root Complex is going to be left unused. I think that would be great >> > because we still have a clean QEMU-Xen-tools interface and the only >> > downside is some extra unused emulation in QEMU. It would be a >> > fantastic starting point. >> >> I believe, that in this case we need to set manual ioreq handlers, like >> what was done in patch "xen/arm: Intercept vPCI config accesses and >> forward them to emulator", because we need to route ECAM accesses >> either to a virtio-pci backend or to a real PCI device. Also we need >> to tell QEMU to not install own ioreq handles for ECAM space. > > I was imagining that the interface would look like this: QEMU registers > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM > reads/writes requests for the PCI config space of that BDF only. It > would not be the entire ECAM space but only individual PCI conf > reads/writes that the BDF only. > Okay, I see that there is the xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register PCI BDF? I see that xen-hvm-common.c in QEMU is able to handle only standard 256 bytes configuration space, but I hope that it will be easy fix.
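For reference, xendevicemodel_map_pcidev_to_ioreq_server() does exist in libxendevicemodel today; a minimal sketch of the per-BDF registration being discussed (the ioreq-server id and the BDF are made up):

```c
#include <xendevicemodel.h>

/*
 * Ask Xen to forward config-space accesses for a single virtual BDF
 * (here 0000:00:01.0) to this ioreq server as IOREQ_TYPE_PCI_CONFIG
 * requests, instead of the backend claiming the whole ECAM window.
 */
static int claim_vpci_device(xendevicemodel_handle *dmod, domid_t domid,
                             ioservid_t ioservid)
{
    return xendevicemodel_map_pcidev_to_ioreq_server(dmod, domid, ioservid,
                                                     0 /* segment */,
                                                     0 /* bus */,
                                                     1 /* device */,
                                                     0 /* function */);
}
```

Whether the Arm vPCI code forwards intercepted ECAM accesses to such a registration is what the series under discussion would still need to wire up.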
On Thu, Oct 12, 2023 at 10:09:18PM +0000, Volodymyr Babchuk wrote: > From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > > Assign SBDF to the PCI devices being passed through with bus 0. > The resulting topology is where PCIe devices reside on the bus 0 of the > root complex itself (embedded endpoints). > This implementation is limited to 32 devices which are allowed on > a single PCI bus. > > Please note, that at the moment only function 0 of a multifunction > device can be passed through. > > Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> > --- > In v10: > - Removed ASSERT(pcidevs_locked()) > - Removed redundant code (local sbdf variable, clearing sbdf during > device removal, etc) > - Added __maybe_unused attribute to "out:" label > - Introduced HAS_VPCI_GUEST_SUPPORT Kconfig option, as this is the > first patch where it is used (previously was in "vpci: add hooks for > PCI device assign/de-assign") > In v9: > - Lock in add_virtual_device() replaced with ASSERT (thanks, Stewart) > In v8: > - Added write lock in add_virtual_device > Since v6: > - re-work wrt new locking scheme > - OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device() > Since v5: > - s/vpci_add_virtual_device/add_virtual_device and make it static > - call add_virtual_device from vpci_assign_device and do not use > REGISTER_VPCI_INIT machinery > - add pcidevs_locked ASSERT > - use DECLARE_BITMAP for vpci_dev_assigned_map > Since v4: > - moved and re-worked guest sbdf initializers > - s/set_bit/__set_bit > - s/clear_bit/__clear_bit > - minor comment fix s/Virtual/Guest/ > - added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used > later for counting the number of MMIO handlers required for a guest > (Julien) > Since v3: > - make use of VPCI_INIT > - moved all new code to vpci.c which belongs to it > - changed open-coded 31 to PCI_SLOT(~0) > - added comments and code to reject multifunction devices with > functions other than 0 > - updated comment about vpci_dev_next and made it unsigned int > - implement roll back in case of error while assigning/deassigning devices > - s/dom%pd/%pd > Since v2: > - remove casts that are (a) malformed and (b) unnecessary > - add new line for better readability > - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI > functions are now completely gated with this config > - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT > New in v2 > --- > xen/drivers/Kconfig | 4 +++ > xen/drivers/vpci/vpci.c | 63 +++++++++++++++++++++++++++++++++++++++++ > xen/include/xen/sched.h | 8 ++++++ > xen/include/xen/vpci.h | 11 +++++++ > 4 files changed, 86 insertions(+) > > diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig > index db94393f47..780490cf8e 100644 > --- a/xen/drivers/Kconfig > +++ b/xen/drivers/Kconfig > @@ -15,4 +15,8 @@ source "drivers/video/Kconfig" > config HAS_VPCI > bool > > +config HAS_VPCI_GUEST_SUPPORT > + bool > + depends on HAS_VPCI > + > endmenu > diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c > index 5e34d0092a..7c46a2d3f4 100644 > --- a/xen/drivers/vpci/vpci.c > +++ b/xen/drivers/vpci/vpci.c > @@ -36,6 +36,52 @@ extern vpci_register_init_t *const __start_vpci_array[]; > extern vpci_register_init_t *const __end_vpci_array[]; > #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array) > > +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT > +static int add_virtual_device(struct pci_dev *pdev) > +{ > + struct domain *d = pdev->domain; > + unsigned long new_dev_number; Why unsigned long? 
unsigned int seems more than enough to account for all possible dev numbers [0, 31]. > + > + if ( is_hardware_domain(d) ) > + return 0; > + > + ASSERT(rw_is_write_locked(&pdev->domain->pci_lock)); > + > + /* > + * Each PCI bus supports 32 devices/slots at max or up to 256 when > + * there are multi-function ones which are not yet supported. > + */ > + if ( pdev->info.is_extfn && !pdev->info.is_virtfn ) > + { > + gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n", > + &pdev->sbdf); > + return -EOPNOTSUPP; > + } > + new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map, > + VPCI_MAX_VIRT_DEV); > + if ( new_dev_number == VPCI_MAX_VIRT_DEV ) > + { > + write_unlock(&pdev->domain->pci_lock); This write_unlock() looks bogus, as the lock is not taken by this function. Won't this create an unlock imbalance when the caller of vpci_assign_device() also attempts to write-unlock d->pci_lock? > + return -ENOSPC; > + } > + > + __set_bit(new_dev_number, &d->vpci_dev_assigned_map); > + > + /* > + * Both segment and bus number are 0: > + * - we emulate a single host bridge for the guest, e.g. segment 0 > + * - with bus 0 the virtual devices are seen as embedded > + * endpoints behind the root complex > + * > + * TODO: add support for multi-function devices. > + */ > + pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0); > + > + return 0; > +} > + > +#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */ > + > void vpci_deassign_device(struct pci_dev *pdev) > { > unsigned int i; > @@ -46,6 +92,13 @@ void vpci_deassign_device(struct pci_dev *pdev) > return; > > spin_lock(&pdev->vpci->lock); > + > +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT > + if ( pdev->vpci->guest_sbdf.sbdf != ~0 ) > + __clear_bit(pdev->vpci->guest_sbdf.dev, > + &pdev->domain->vpci_dev_assigned_map); > +#endif This chunk could in principle be outside of the vpci->lock region AFAICT. > + > while ( !list_empty(&pdev->vpci->handlers) ) > { > struct vpci_register *r = list_first_entry(&pdev->vpci->handlers, > @@ -101,6 +154,13 @@ int vpci_assign_device(struct pci_dev *pdev) > INIT_LIST_HEAD(&pdev->vpci->handlers); > spin_lock_init(&pdev->vpci->lock); > > +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT > + pdev->vpci->guest_sbdf.sbdf = ~0; > + rc = add_virtual_device(pdev); > + if ( rc ) > + goto out; > +#endif > + > for ( i = 0; i < NUM_VPCI_INIT; i++ ) > { > rc = __start_vpci_array[i](pdev); > @@ -108,11 +168,14 @@ int vpci_assign_device(struct pci_dev *pdev) > break; > } > > + out: > + __maybe_unused; Can you place it in the same line as the out: label please? > if ( rc ) > vpci_deassign_device(pdev); > > return rc; > } > + Stray newline? Thanks, Roger.
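For illustration only: if the review points above were addressed (unsigned int for the slot number, no unlocking of a lock this function never took, and the label plus attribute kept on one line), the helper could end up looking roughly like the sketch below. This is a hedged sketch against the patch as posted, not the final committed code, and it relies on the same Xen-internal types and helpers the patch already uses.

/* Sketch only: assumes the Xen-internal headers/types used by the patch. */
static int add_virtual_device(struct pci_dev *pdev)
{
    struct domain *d = pdev->domain;
    unsigned int new_dev_number;    /* slot numbers are in [0, 31] */

    if ( is_hardware_domain(d) )
        return 0;

    ASSERT(rw_is_write_locked(&d->pci_lock));

    /* Only function 0 of a multi-function device can be passed through. */
    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
    {
        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
                 &pdev->sbdf);
        return -EOPNOTSUPP;
    }

    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
                                         VPCI_MAX_VIRT_DEV);
    if ( new_dev_number == VPCI_MAX_VIRT_DEV )
        return -ENOSPC;    /* d->pci_lock is dropped by the caller, not here */

    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
    pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0);

    return 0;
}

In vpci_assign_device() the label would then read " out: __maybe_unused;" on a single line, as requested.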
On Tue, 20 Nov 2023, Volodymyr Babchuk wrote: > Stefano Stabellini <sstabellini@kernel.org> writes: > > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > >> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > >> >> Hi Julien, > >> >> > >> >> Julien Grall <julien@xen.org> writes: > >> >> > >> >> > Hi Volodymyr, > >> >> > > >> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote: > >> >> >> Hi Stefano, > >> >> >> Stefano Stabellini <sstabellini@kernel.org> writes: > >> >> >> > >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > >> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try > >> >> >>>>> to avoid as much as possible to have two different PCI Root Complex > >> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by > >> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be > >> >> >>>>> tolerable but not ideal. > >> >> >>>> > >> >> >>>> But what is exactly wrong with this setup? > >> >> >>> > >> >> >>> [...] > >> >> >>> > >> >> >>>>> The worst case I would like to avoid is to have > >> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. > >> >> >>>> > >> >> >>>> This is how our setup works right now. > >> >> >>> > >> >> >>> If we have: > >> >> >>> - a single PCI Root Complex emulated in Xen > >> >> >>> - Xen is safety certified > >> >> >>> - individual Virtio devices emulated by QEMU with grants for memory > >> >> >>> > >> >> >>> We can go very far in terms of being able to use Virtio in safety > >> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. > >> >> >>> > >> >> >>> On the other hand if we put an additional Root Complex in QEMU: > >> >> >>> - we pay a price in terms of complexity of the codebase > >> >> >>> - we pay a price in terms of resource utilization > >> >> >>> - we have one additional problem in terms of using this setup with a > >> >> >>> SafeOS (one more device emulated by a non-safe component) > >> >> >>> > >> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground > >> >> >>> solution because: > >> >> >>> - we still pay a price in terms of resource utilization > >> >> >>> - the code complexity goes up a bit but hopefully not by much > >> >> >>> - there is no impact on safety compared to the ideal scenario > >> >> >>> > >> >> >>> This is why I wrote that it is tolerable. > >> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some > >> >> >> more > >> >> >> points: > >> >> >> - There is ongoing work on implementing virtio backends as a > >> >> >> separate > >> >> >> applications, written in Rust. Linaro are doing this part. Right now > >> >> >> they are implementing only virtio-mmio, but if they want to provide > >> >> >> virtio-pci as well, they will need a mechanism to plug only > >> >> >> virtio-pci, without Root Complex. This is argument for using single Root > >> >> >> Complex emulated in Xen. > >> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has > >> >> >> no > >> >> >> mechanism for exposing virtio-pci backends without exposing PCI root > >> >> >> complex as well. Architecturally, there should be a PCI bus to which > >> >> >> virtio-pci devices are connected. Or we need to make some changes to > >> >> >> QEMU internals to be able to create virtio-pci backends that are not > >> >> >> connected to any bus. Also, added benefit that PCI Root Complex > >> >> >> emulator in QEMU handles legacy PCI interrupts for us. This is > >> >> >> argument for separate Root Complex for QEMU. 
> >> >> >> As right now we have only virtio-pci backends provided by QEMU and > >> >> >> this > >> >> >> setup is already working, I propose to stick to this > >> >> >> solution. Especially, taking into account that it does not require any > >> >> >> changes to hypervisor code. > >> >> > > >> >> > I am not against two hostbridge as a temporary solution as long as > >> >> > this is not a one way door decision. I am not concerned about the > >> >> > hypervisor itself, I am more concerned about the interface exposed by > >> >> > the toolstack and QEMU. > >> > > >> > I agree with this... > >> > > >> > > >> >> > To clarify, I don't particular want to have to maintain the two > >> >> > hostbridges solution once we can use a single hostbridge. So we need > >> >> > to be able to get rid of it without impacting the interface too much. > >> > > >> > ...and this > >> > > >> > > >> >> This depends on virtio-pci backends availability. AFAIK, now only one > >> >> option is to use QEMU and QEMU provides own host bridge. So if we want > >> >> get rid of the second host bridge we need either another virtio-pci > >> >> backend or we need to alter QEMU code so it can live without host > >> >> bridge. > >> >> > >> >> As for interfaces, it appears that QEMU case does not require any changes > >> >> into hypervisor itself, it just boils down to writing couple of xenstore > >> >> entries and spawning QEMU with correct command line arguments. > >> > > >> > One thing that Stewart wrote in his reply that is important: it doesn't > >> > matter if QEMU thinks it is emulating a PCI Root Complex because that's > >> > required from QEMU's point of view to emulate an individual PCI device. > >> > > >> > If we can arrange it so the QEMU PCI Root Complex is not registered > >> > against Xen as part of the ioreq interface, then QEMU's emulated PCI > >> > Root Complex is going to be left unused. I think that would be great > >> > because we still have a clean QEMU-Xen-tools interface and the only > >> > downside is some extra unused emulation in QEMU. It would be a > >> > fantastic starting point. > >> > >> I believe, that in this case we need to set manual ioreq handlers, like > >> what was done in patch "xen/arm: Intercept vPCI config accesses and > >> forward them to emulator", because we need to route ECAM accesses > >> either to a virtio-pci backend or to a real PCI device. Also we need > >> to tell QEMU to not install own ioreq handles for ECAM space. > > > > I was imagining that the interface would look like this: QEMU registers > > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM > > reads/writes requests for the PCI config space of that BDF only. It > > would not be the entire ECAM space but only individual PCI conf > > reads/writes that the BDF only. > > > > Okay, I see that there is the > xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding > IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register > PCI BDF? Yes, I think that's best. Let me expand on this. Like I wrote above, I think it is important that Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the QEMU implementation easier, it is OK if QEMU emulates an unneeded and unused PCI Root Complex. From Xen point of view, it doesn't exist. In terms if ioreq registration, QEMU calls xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to emulate. That way, Xen vPCI knows exactly what PCI config space reads/writes to forward to QEMU. 
Let's say that:
- 00:02.0 is a PCI passthrough device
- 00:03.0 is a PCI emulated device

QEMU would register 00:03.0 and vPCI would know to forward anything related to 00:03.0 to QEMU, but not 00:02.0.

> I see that xen-hvm-common.c in QEMU is able to handle only standard 256
> bytes configuration space, but I hope that it will be easy fix.
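To make the proposed registration flow concrete, here is a minimal device-model-side sketch using the libxendevicemodel calls named above; the domain ID and the 00:03.0 BDF are made-up example values, and a real backend would additionally map the ioreq pages and run an event loop to service the forwarded accesses.

/* Sketch: ask Xen to forward config space accesses for 0000:00:03.0 only. */
#include <stdio.h>
#include <stdlib.h>
#include <xendevicemodel.h>

int main(void)
{
    domid_t domid = 1;                          /* example guest domain */
    xendevicemodel_handle *dmod = xendevicemodel_open(NULL, 0);
    ioservid_t id;

    if ( !dmod )
        return EXIT_FAILURE;

    /* Create an ioreq server (no buffered ioreq support needed here). */
    if ( xendevicemodel_create_ioreq_server(dmod, domid, 0, &id) ||
         /* Register only the emulated device 0000:00:03.0. */
         xendevicemodel_map_pcidev_to_ioreq_server(dmod, domid, id,
                                                   0 /* seg */, 0 /* bus */,
                                                   3 /* dev */, 0 /* fn */) ||
         xendevicemodel_set_ioreq_server_state(dmod, domid, id, 1) )
    {
        xendevicemodel_close(dmod);
        return EXIT_FAILURE;
    }

    printf("ioreq server %u now receives config accesses for 00:03.0\n",
           (unsigned int)id);
    xendevicemodel_close(dmod);
    return EXIT_SUCCESS;
}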
On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote: > On Tue, 20 Nov 2023, Volodymyr Babchuk wrote: > > Stefano Stabellini <sstabellini@kernel.org> writes: > > > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > > >> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > > >> >> Hi Julien, > > >> >> > > >> >> Julien Grall <julien@xen.org> writes: > > >> >> > > >> >> > Hi Volodymyr, > > >> >> > > > >> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote: > > >> >> >> Hi Stefano, > > >> >> >> Stefano Stabellini <sstabellini@kernel.org> writes: > > >> >> >> > > >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > > >> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try > > >> >> >>>>> to avoid as much as possible to have two different PCI Root Complex > > >> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by > > >> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be > > >> >> >>>>> tolerable but not ideal. > > >> >> >>>> > > >> >> >>>> But what is exactly wrong with this setup? > > >> >> >>> > > >> >> >>> [...] > > >> >> >>> > > >> >> >>>>> The worst case I would like to avoid is to have > > >> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. > > >> >> >>>> > > >> >> >>>> This is how our setup works right now. > > >> >> >>> > > >> >> >>> If we have: > > >> >> >>> - a single PCI Root Complex emulated in Xen > > >> >> >>> - Xen is safety certified > > >> >> >>> - individual Virtio devices emulated by QEMU with grants for memory > > >> >> >>> > > >> >> >>> We can go very far in terms of being able to use Virtio in safety > > >> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. > > >> >> >>> > > >> >> >>> On the other hand if we put an additional Root Complex in QEMU: > > >> >> >>> - we pay a price in terms of complexity of the codebase > > >> >> >>> - we pay a price in terms of resource utilization > > >> >> >>> - we have one additional problem in terms of using this setup with a > > >> >> >>> SafeOS (one more device emulated by a non-safe component) > > >> >> >>> > > >> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground > > >> >> >>> solution because: > > >> >> >>> - we still pay a price in terms of resource utilization > > >> >> >>> - the code complexity goes up a bit but hopefully not by much > > >> >> >>> - there is no impact on safety compared to the ideal scenario > > >> >> >>> > > >> >> >>> This is why I wrote that it is tolerable. > > >> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some > > >> >> >> more > > >> >> >> points: > > >> >> >> - There is ongoing work on implementing virtio backends as a > > >> >> >> separate > > >> >> >> applications, written in Rust. Linaro are doing this part. Right now > > >> >> >> they are implementing only virtio-mmio, but if they want to provide > > >> >> >> virtio-pci as well, they will need a mechanism to plug only > > >> >> >> virtio-pci, without Root Complex. This is argument for using single Root > > >> >> >> Complex emulated in Xen. > > >> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has > > >> >> >> no > > >> >> >> mechanism for exposing virtio-pci backends without exposing PCI root > > >> >> >> complex as well. Architecturally, there should be a PCI bus to which > > >> >> >> virtio-pci devices are connected. 
Or we need to make some changes to > > >> >> >> QEMU internals to be able to create virtio-pci backends that are not > > >> >> >> connected to any bus. Also, added benefit that PCI Root Complex > > >> >> >> emulator in QEMU handles legacy PCI interrupts for us. This is > > >> >> >> argument for separate Root Complex for QEMU. > > >> >> >> As right now we have only virtio-pci backends provided by QEMU and > > >> >> >> this > > >> >> >> setup is already working, I propose to stick to this > > >> >> >> solution. Especially, taking into account that it does not require any > > >> >> >> changes to hypervisor code. > > >> >> > > > >> >> > I am not against two hostbridge as a temporary solution as long as > > >> >> > this is not a one way door decision. I am not concerned about the > > >> >> > hypervisor itself, I am more concerned about the interface exposed by > > >> >> > the toolstack and QEMU. > > >> > > > >> > I agree with this... > > >> > > > >> > > > >> >> > To clarify, I don't particular want to have to maintain the two > > >> >> > hostbridges solution once we can use a single hostbridge. So we need > > >> >> > to be able to get rid of it without impacting the interface too much. > > >> > > > >> > ...and this > > >> > > > >> > > > >> >> This depends on virtio-pci backends availability. AFAIK, now only one > > >> >> option is to use QEMU and QEMU provides own host bridge. So if we want > > >> >> get rid of the second host bridge we need either another virtio-pci > > >> >> backend or we need to alter QEMU code so it can live without host > > >> >> bridge. > > >> >> > > >> >> As for interfaces, it appears that QEMU case does not require any changes > > >> >> into hypervisor itself, it just boils down to writing couple of xenstore > > >> >> entries and spawning QEMU with correct command line arguments. > > >> > > > >> > One thing that Stewart wrote in his reply that is important: it doesn't > > >> > matter if QEMU thinks it is emulating a PCI Root Complex because that's > > >> > required from QEMU's point of view to emulate an individual PCI device. > > >> > > > >> > If we can arrange it so the QEMU PCI Root Complex is not registered > > >> > against Xen as part of the ioreq interface, then QEMU's emulated PCI > > >> > Root Complex is going to be left unused. I think that would be great > > >> > because we still have a clean QEMU-Xen-tools interface and the only > > >> > downside is some extra unused emulation in QEMU. It would be a > > >> > fantastic starting point. > > >> > > >> I believe, that in this case we need to set manual ioreq handlers, like > > >> what was done in patch "xen/arm: Intercept vPCI config accesses and > > >> forward them to emulator", because we need to route ECAM accesses > > >> either to a virtio-pci backend or to a real PCI device. Also we need > > >> to tell QEMU to not install own ioreq handles for ECAM space. > > > > > > I was imagining that the interface would look like this: QEMU registers > > > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM > > > reads/writes requests for the PCI config space of that BDF only. It > > > would not be the entire ECAM space but only individual PCI conf > > > reads/writes that the BDF only. > > > > > > > Okay, I see that there is the > > xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding > > IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register > > PCI BDF? > > Yes, I think that's best. > > Let me expand on this. 
Like I wrote above, I think it is important that > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the > QEMU implementation easier, it is OK if QEMU emulates an unneeded and > unused PCI Root Complex. From Xen point of view, it doesn't exist. > > In terms if ioreq registration, QEMU calls > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to > emulate. That way, Xen vPCI knows exactly what PCI config space > reads/writes to forward to QEMU. > > Let's say that: > - 00:02.0 is PCI passthrough device > - 00:03.0 is a PCI emulated device > > QEMU would register 00:03.0 and vPCI would know to forward anything > related to 00:03.0 to QEMU, but not 00:02.0. I think there's some work here so that we have a proper hierarchy inside of Xen. Right now both ioreq and vpci expect to decode the accesses to the PCI config space, and setup (MM)IO handlers to trap ECAM, see vpci_ecam_{read,write}(). I think we want to move to a model where vPCI doesn't setup MMIO traps itself, and instead relies on ioreq to do the decoding and forwarding of accesses. We need some work in order to represent an internal ioreq handler, but that shouldn't be too complicated. IOW: vpci should register devices it's handling with ioreq, much like QEMU does. Thanks, Roger.
On Wed, 22 Nov 2023, Roger Pau Monné wrote: > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote: > > On Tue, 20 Nov 2023, Volodymyr Babchuk wrote: > > > Stefano Stabellini <sstabellini@kernel.org> writes: > > > > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > > > >> > On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > > > >> >> Hi Julien, > > > >> >> > > > >> >> Julien Grall <julien@xen.org> writes: > > > >> >> > > > >> >> > Hi Volodymyr, > > > >> >> > > > > >> >> > On 17/11/2023 14:09, Volodymyr Babchuk wrote: > > > >> >> >> Hi Stefano, > > > >> >> >> Stefano Stabellini <sstabellini@kernel.org> writes: > > > >> >> >> > > > >> >> >>> On Fri, 17 Nov 2023, Volodymyr Babchuk wrote: > > > >> >> >>>>> I still think, no matter the BDF allocation scheme, that we should try > > > >> >> >>>>> to avoid as much as possible to have two different PCI Root Complex > > > >> >> >>>>> emulators. Ideally we would have only one PCI Root Complex emulated by > > > >> >> >>>>> Xen. Having 2 PCI Root Complexes both of them emulated by Xen would be > > > >> >> >>>>> tolerable but not ideal. > > > >> >> >>>> > > > >> >> >>>> But what is exactly wrong with this setup? > > > >> >> >>> > > > >> >> >>> [...] > > > >> >> >>> > > > >> >> >>>>> The worst case I would like to avoid is to have > > > >> >> >>>>> two PCI Root Complexes, one emulated by Xen and one emulated by QEMU. > > > >> >> >>>> > > > >> >> >>>> This is how our setup works right now. > > > >> >> >>> > > > >> >> >>> If we have: > > > >> >> >>> - a single PCI Root Complex emulated in Xen > > > >> >> >>> - Xen is safety certified > > > >> >> >>> - individual Virtio devices emulated by QEMU with grants for memory > > > >> >> >>> > > > >> >> >>> We can go very far in terms of being able to use Virtio in safety > > > >> >> >>> use-cases. We might even be able to use Virtio (frontends) in a SafeOS. > > > >> >> >>> > > > >> >> >>> On the other hand if we put an additional Root Complex in QEMU: > > > >> >> >>> - we pay a price in terms of complexity of the codebase > > > >> >> >>> - we pay a price in terms of resource utilization > > > >> >> >>> - we have one additional problem in terms of using this setup with a > > > >> >> >>> SafeOS (one more device emulated by a non-safe component) > > > >> >> >>> > > > >> >> >>> Having 2 PCI Root Complexes both emulated in Xen is a middle ground > > > >> >> >>> solution because: > > > >> >> >>> - we still pay a price in terms of resource utilization > > > >> >> >>> - the code complexity goes up a bit but hopefully not by much > > > >> >> >>> - there is no impact on safety compared to the ideal scenario > > > >> >> >>> > > > >> >> >>> This is why I wrote that it is tolerable. > > > >> >> >> Ah, I see now. Yes, I am agree with this. Also I want to add some > > > >> >> >> more > > > >> >> >> points: > > > >> >> >> - There is ongoing work on implementing virtio backends as a > > > >> >> >> separate > > > >> >> >> applications, written in Rust. Linaro are doing this part. Right now > > > >> >> >> they are implementing only virtio-mmio, but if they want to provide > > > >> >> >> virtio-pci as well, they will need a mechanism to plug only > > > >> >> >> virtio-pci, without Root Complex. This is argument for using single Root > > > >> >> >> Complex emulated in Xen. > > > >> >> >> - As far as I know (actually, Oleksandr told this to me), QEMU has > > > >> >> >> no > > > >> >> >> mechanism for exposing virtio-pci backends without exposing PCI root > > > >> >> >> complex as well. 
Architecturally, there should be a PCI bus to which > > > >> >> >> virtio-pci devices are connected. Or we need to make some changes to > > > >> >> >> QEMU internals to be able to create virtio-pci backends that are not > > > >> >> >> connected to any bus. Also, added benefit that PCI Root Complex > > > >> >> >> emulator in QEMU handles legacy PCI interrupts for us. This is > > > >> >> >> argument for separate Root Complex for QEMU. > > > >> >> >> As right now we have only virtio-pci backends provided by QEMU and > > > >> >> >> this > > > >> >> >> setup is already working, I propose to stick to this > > > >> >> >> solution. Especially, taking into account that it does not require any > > > >> >> >> changes to hypervisor code. > > > >> >> > > > > >> >> > I am not against two hostbridge as a temporary solution as long as > > > >> >> > this is not a one way door decision. I am not concerned about the > > > >> >> > hypervisor itself, I am more concerned about the interface exposed by > > > >> >> > the toolstack and QEMU. > > > >> > > > > >> > I agree with this... > > > >> > > > > >> > > > > >> >> > To clarify, I don't particular want to have to maintain the two > > > >> >> > hostbridges solution once we can use a single hostbridge. So we need > > > >> >> > to be able to get rid of it without impacting the interface too much. > > > >> > > > > >> > ...and this > > > >> > > > > >> > > > > >> >> This depends on virtio-pci backends availability. AFAIK, now only one > > > >> >> option is to use QEMU and QEMU provides own host bridge. So if we want > > > >> >> get rid of the second host bridge we need either another virtio-pci > > > >> >> backend or we need to alter QEMU code so it can live without host > > > >> >> bridge. > > > >> >> > > > >> >> As for interfaces, it appears that QEMU case does not require any changes > > > >> >> into hypervisor itself, it just boils down to writing couple of xenstore > > > >> >> entries and spawning QEMU with correct command line arguments. > > > >> > > > > >> > One thing that Stewart wrote in his reply that is important: it doesn't > > > >> > matter if QEMU thinks it is emulating a PCI Root Complex because that's > > > >> > required from QEMU's point of view to emulate an individual PCI device. > > > >> > > > > >> > If we can arrange it so the QEMU PCI Root Complex is not registered > > > >> > against Xen as part of the ioreq interface, then QEMU's emulated PCI > > > >> > Root Complex is going to be left unused. I think that would be great > > > >> > because we still have a clean QEMU-Xen-tools interface and the only > > > >> > downside is some extra unused emulation in QEMU. It would be a > > > >> > fantastic starting point. > > > >> > > > >> I believe, that in this case we need to set manual ioreq handlers, like > > > >> what was done in patch "xen/arm: Intercept vPCI config accesses and > > > >> forward them to emulator", because we need to route ECAM accesses > > > >> either to a virtio-pci backend or to a real PCI device. Also we need > > > >> to tell QEMU to not install own ioreq handles for ECAM space. > > > > > > > > I was imagining that the interface would look like this: QEMU registers > > > > a PCI BDF and Xen automatically starts forwarding to QEMU ECAM > > > > reads/writes requests for the PCI config space of that BDF only. It > > > > would not be the entire ECAM space but only individual PCI conf > > > > reads/writes that the BDF only. 
> > > > > > > > > > Okay, I see that there is the > > > xendevicemodel_map_pcidev_to_ioreq_server() function and corresponding > > > IOREQ_TYPE_PCI_CONFIG call. Is this what you propose to use to register > > > PCI BDF? > > > > Yes, I think that's best. > > > > Let me expand on this. Like I wrote above, I think it is important that > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and > > unused PCI Root Complex. From Xen point of view, it doesn't exist. > > > > In terms if ioreq registration, QEMU calls > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to > > emulate. That way, Xen vPCI knows exactly what PCI config space > > reads/writes to forward to QEMU. > > > > Let's say that: > > - 00:02.0 is PCI passthrough device > > - 00:03.0 is a PCI emulated device > > > > QEMU would register 00:03.0 and vPCI would know to forward anything > > related to 00:03.0 to QEMU, but not 00:02.0. > > I think there's some work here so that we have a proper hierarchy > inside of Xen. Right now both ioreq and vpci expect to decode the > accesses to the PCI config space, and setup (MM)IO handlers to trap > ECAM, see vpci_ecam_{read,write}(). > > I think we want to move to a model where vPCI doesn't setup MMIO traps > itself, and instead relies on ioreq to do the decoding and forwarding > of accesses. We need some work in order to represent an internal > ioreq handler, but that shouldn't be too complicated. IOW: vpci > should register devices it's handling with ioreq, much like QEMU does. I think this could be a good idea. This would be the very first IOREQ handler implemented in Xen itself, rather than outside of Xen. Some code refactoring might be required, which worries me given that vPCI is at v10 and has been pending for years. I think it could make sense as a follow-up series, not v11. I think this idea would be beneficial if, in the example above, vPCI doesn't really need to know about device 00:03.0. vPCI registers via IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers 00:03.0, and everything works. vPCI is not involved at all in PCI config space reads and writes for 00:03.0. If this is the case, then moving vPCI to IOREQ could be good. On the other hand if vPCI actually needs to know that 00:03.0 exists, perhaps because it changes something in the PCI Root Complex emulation or vPCI needs to take some action when PCI config space registers of 00:03.0 are written to, then I think this model doesn't work well. If this is the case, then I think it would be best to keep vPCI as MMIO handler and let vPCI forward to IOREQ when appropriate. I haven't run any experiements, but my gut feeling tells me that we'll have to follow the second approach because the first is too limiting.
On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote: > On Wed, 22 Nov 2023, Roger Pau Monné wrote: > > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote: > > > Let me expand on this. Like I wrote above, I think it is important that > > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the > > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and > > > unused PCI Root Complex. From Xen point of view, it doesn't exist. > > > > > > In terms if ioreq registration, QEMU calls > > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to > > > emulate. That way, Xen vPCI knows exactly what PCI config space > > > reads/writes to forward to QEMU. > > > > > > Let's say that: > > > - 00:02.0 is PCI passthrough device > > > - 00:03.0 is a PCI emulated device > > > > > > QEMU would register 00:03.0 and vPCI would know to forward anything > > > related to 00:03.0 to QEMU, but not 00:02.0. > > > > I think there's some work here so that we have a proper hierarchy > > inside of Xen. Right now both ioreq and vpci expect to decode the > > accesses to the PCI config space, and setup (MM)IO handlers to trap > > ECAM, see vpci_ecam_{read,write}(). > > > > I think we want to move to a model where vPCI doesn't setup MMIO traps > > itself, and instead relies on ioreq to do the decoding and forwarding > > of accesses. We need some work in order to represent an internal > > ioreq handler, but that shouldn't be too complicated. IOW: vpci > > should register devices it's handling with ioreq, much like QEMU does. > > I think this could be a good idea. > > This would be the very first IOREQ handler implemented in Xen itself, > rather than outside of Xen. Some code refactoring might be required, > which worries me given that vPCI is at v10 and has been pending for > years. I think it could make sense as a follow-up series, not v11. That's perfectly fine for me, most of the series here just deal with the logic to intercept guest access to the config space and is completely agnostic as to how the accesses are intercepted. > I think this idea would be beneficial if, in the example above, vPCI > doesn't really need to know about device 00:03.0. vPCI registers via > IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers > 00:03.0, and everything works. vPCI is not involved at all in PCI config > space reads and writes for 00:03.0. If this is the case, then moving > vPCI to IOREQ could be good. Given your description above, with the root complex implemented in vPCI, we would need to mandate vPCI together with ioreqs even if no passthrough devices are using vPCI itself (just for the emulation of the root complex). Which is fine, just wanted to mention the dependency. > On the other hand if vPCI actually needs to know that 00:03.0 exists, > perhaps because it changes something in the PCI Root Complex emulation > or vPCI needs to take some action when PCI config space registers of > 00:03.0 are written to, then I think this model doesn't work well. If > this is the case, then I think it would be best to keep vPCI as MMIO > handler and let vPCI forward to IOREQ when appropriate. At first approximation I don't think we would have such interactions, otherwise the whole premise of ioreq being able to register individual PCI devices would be broken. 
XenServer already has scenarios with two different user-space emulators (i.e. two different ioreq servers) handling accesses to different devices in the same PCI bus, and there's no interaction with the root complex required. > I haven't run any experiements, but my gut feeling tells me that we'll > have to follow the second approach because the first is too limiting. Iff there's some case where a change to a downstream PCI device needs to be propagated to the root complex, then such a mechanism should be implemented as part of the ioreq interface; otherwise the whole model is broken. Thanks, Roger.
Hi Roger, Roger Pau Monné <roger.pau@citrix.com> writes: > On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote: >> On Wed, 22 Nov 2023, Roger Pau Monné wrote: >> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote: >> > > Let me expand on this. Like I wrote above, I think it is important that >> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the >> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and >> > > unused PCI Root Complex. From Xen point of view, it doesn't exist. >> > > >> > > In terms if ioreq registration, QEMU calls >> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to >> > > emulate. That way, Xen vPCI knows exactly what PCI config space >> > > reads/writes to forward to QEMU. >> > > >> > > Let's say that: >> > > - 00:02.0 is PCI passthrough device >> > > - 00:03.0 is a PCI emulated device >> > > >> > > QEMU would register 00:03.0 and vPCI would know to forward anything >> > > related to 00:03.0 to QEMU, but not 00:02.0. >> > >> > I think there's some work here so that we have a proper hierarchy >> > inside of Xen. Right now both ioreq and vpci expect to decode the >> > accesses to the PCI config space, and setup (MM)IO handlers to trap >> > ECAM, see vpci_ecam_{read,write}(). >> > >> > I think we want to move to a model where vPCI doesn't setup MMIO traps >> > itself, and instead relies on ioreq to do the decoding and forwarding >> > of accesses. We need some work in order to represent an internal >> > ioreq handler, but that shouldn't be too complicated. IOW: vpci >> > should register devices it's handling with ioreq, much like QEMU does. >> >> I think this could be a good idea. >> >> This would be the very first IOREQ handler implemented in Xen itself, >> rather than outside of Xen. Some code refactoring might be required, >> which worries me given that vPCI is at v10 and has been pending for >> years. I think it could make sense as a follow-up series, not v11. > > That's perfectly fine for me, most of the series here just deal with > the logic to intercept guest access to the config space and is > completely agnostic as to how the accesses are intercepted. > >> I think this idea would be beneficial if, in the example above, vPCI >> doesn't really need to know about device 00:03.0. vPCI registers via >> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers >> 00:03.0, and everything works. vPCI is not involved at all in PCI config >> space reads and writes for 00:03.0. If this is the case, then moving >> vPCI to IOREQ could be good. > > Given your description above, with the root complex implemented in > vPCI, we would need to mandate vPCI together with ioreqs even if no > passthrough devices are using vPCI itself (just for the emulation of > the root complex). Which is fine, just wanted to mention the > dependency. > >> On the other hand if vPCI actually needs to know that 00:03.0 exists, >> perhaps because it changes something in the PCI Root Complex emulation >> or vPCI needs to take some action when PCI config space registers of >> 00:03.0 are written to, then I think this model doesn't work well. If >> this is the case, then I think it would be best to keep vPCI as MMIO >> handler and let vPCI forward to IOREQ when appropriate. > > At first approximation I don't think we would have such interactions, > otherwise the whole premise of ioreq being able to register individual > PCI devices would be broken. 
> > XenServer already has scenarios with two different user-space emulators > (i.e. two different ioreq servers) handling accesses to different > devices in the same PCI bus, and there's no interaction with the root > complex required. > Out of curiosity: how are legacy PCI interrupts handled in this case? In my understanding, it is the Root Complex's responsibility to propagate correct IRQ levels to an interrupt controller? [...]
On Tue, Nov 28, 2023 at 11:45:34PM +0000, Volodymyr Babchuk wrote: > Hi Roger, > > Roger Pau Monné <roger.pau@citrix.com> writes: > > > On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote: > >> On Wed, 22 Nov 2023, Roger Pau Monné wrote: > >> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote: > >> > > Let me expand on this. Like I wrote above, I think it is important that > >> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the > >> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and > >> > > unused PCI Root Complex. From Xen point of view, it doesn't exist. > >> > > > >> > > In terms if ioreq registration, QEMU calls > >> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to > >> > > emulate. That way, Xen vPCI knows exactly what PCI config space > >> > > reads/writes to forward to QEMU. > >> > > > >> > > Let's say that: > >> > > - 00:02.0 is PCI passthrough device > >> > > - 00:03.0 is a PCI emulated device > >> > > > >> > > QEMU would register 00:03.0 and vPCI would know to forward anything > >> > > related to 00:03.0 to QEMU, but not 00:02.0. > >> > > >> > I think there's some work here so that we have a proper hierarchy > >> > inside of Xen. Right now both ioreq and vpci expect to decode the > >> > accesses to the PCI config space, and setup (MM)IO handlers to trap > >> > ECAM, see vpci_ecam_{read,write}(). > >> > > >> > I think we want to move to a model where vPCI doesn't setup MMIO traps > >> > itself, and instead relies on ioreq to do the decoding and forwarding > >> > of accesses. We need some work in order to represent an internal > >> > ioreq handler, but that shouldn't be too complicated. IOW: vpci > >> > should register devices it's handling with ioreq, much like QEMU does. > >> > >> I think this could be a good idea. > >> > >> This would be the very first IOREQ handler implemented in Xen itself, > >> rather than outside of Xen. Some code refactoring might be required, > >> which worries me given that vPCI is at v10 and has been pending for > >> years. I think it could make sense as a follow-up series, not v11. > > > > That's perfectly fine for me, most of the series here just deal with > > the logic to intercept guest access to the config space and is > > completely agnostic as to how the accesses are intercepted. > > > >> I think this idea would be beneficial if, in the example above, vPCI > >> doesn't really need to know about device 00:03.0. vPCI registers via > >> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers > >> 00:03.0, and everything works. vPCI is not involved at all in PCI config > >> space reads and writes for 00:03.0. If this is the case, then moving > >> vPCI to IOREQ could be good. > > > > Given your description above, with the root complex implemented in > > vPCI, we would need to mandate vPCI together with ioreqs even if no > > passthrough devices are using vPCI itself (just for the emulation of > > the root complex). Which is fine, just wanted to mention the > > dependency. > > > >> On the other hand if vPCI actually needs to know that 00:03.0 exists, > >> perhaps because it changes something in the PCI Root Complex emulation > >> or vPCI needs to take some action when PCI config space registers of > >> 00:03.0 are written to, then I think this model doesn't work well. If > >> this is the case, then I think it would be best to keep vPCI as MMIO > >> handler and let vPCI forward to IOREQ when appropriate. 
> > At first approximation I don't think we would have such interactions, > > otherwise the whole premise of ioreq being able to register individual > > PCI devices would be broken. > > > > XenServer already has scenarios with two different user-space emulators > > (i.e. two different ioreq servers) handling accesses to different > > devices in the same PCI bus, and there's no interaction with the root > > complex required. > > > > Out of curiosity: how are legacy PCI interrupts handled in this case? In > my understanding, it is the Root Complex's responsibility to propagate > correct IRQ levels to an interrupt controller? I'm unsure whether my understanding of the question is correct, so my reply might not be what you are asking for, sorry. Legacy IRQs (GSI on x86) are set up directly by the toolstack when the device is assigned to the guest, using PHYSDEVOP_map_pirq + XEN_DOMCTL_bind_pt_irq. Those hypercalls bind a host IO-APIC pin to a guest IO-APIC pin, so that interrupts originating from that host IO-APIC pin are always forwarded to the guest and injected as originating from the guest IO-APIC pin. Note that the device will always use the same IO-APIC pin; this is not configured by the OS. Thanks, Roger.
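For reference, the toolstack-side binding described above boils down to two libxenctrl calls wrapping exactly those hypercalls; the GSI, domain ID and guest BDF below are made-up example values and error handling is kept minimal, so this is a sketch rather than what libxl actually ships.

/* Sketch: route a passed-through device's legacy INTx (host GSI) to a guest. */
#include <stdio.h>
#include <xenctrl.h>

int main(void)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    uint32_t domid = 1;         /* example guest */
    int gsi = 28, pirq = 28;    /* example host GSI */

    if ( !xch )
        return 1;

    /* PHYSDEVOP_map_pirq: make the host GSI usable as a pirq for the domain. */
    if ( xc_physdev_map_pirq(xch, domid, gsi, &pirq) ||
         /* XEN_DOMCTL_bind_pt_irq: deliver that pirq as INTA# of guest 00:05.0. */
         xc_domain_bind_pt_pci_irq(xch, domid, pirq, 0 /* bus */,
                                   5 /* dev */, 0 /* INTA# */) )
        fprintf(stderr, "legacy IRQ binding failed\n");

    xc_interface_close(xch);
    return 0;
}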
On Wed, 29 Nov 2023, Roger Pau Monné wrote: > On Tue, Nov 28, 2023 at 11:45:34PM +0000, Volodymyr Babchuk wrote: > > Hi Roger, > > > > Roger Pau Monné <roger.pau@citrix.com> writes: > > > > > On Wed, Nov 22, 2023 at 01:18:32PM -0800, Stefano Stabellini wrote: > > >> On Wed, 22 Nov 2023, Roger Pau Monné wrote: > > >> > On Tue, Nov 21, 2023 at 05:12:15PM -0800, Stefano Stabellini wrote: > > >> > > Let me expand on this. Like I wrote above, I think it is important that > > >> > > Xen vPCI is the only in-use PCI Root Complex emulator. If it makes the > > >> > > QEMU implementation easier, it is OK if QEMU emulates an unneeded and > > >> > > unused PCI Root Complex. From Xen point of view, it doesn't exist. > > >> > > > > >> > > In terms if ioreq registration, QEMU calls > > >> > > xendevicemodel_map_pcidev_to_ioreq_server for each PCI BDF it wants to > > >> > > emulate. That way, Xen vPCI knows exactly what PCI config space > > >> > > reads/writes to forward to QEMU. > > >> > > > > >> > > Let's say that: > > >> > > - 00:02.0 is PCI passthrough device > > >> > > - 00:03.0 is a PCI emulated device > > >> > > > > >> > > QEMU would register 00:03.0 and vPCI would know to forward anything > > >> > > related to 00:03.0 to QEMU, but not 00:02.0. > > >> > > > >> > I think there's some work here so that we have a proper hierarchy > > >> > inside of Xen. Right now both ioreq and vpci expect to decode the > > >> > accesses to the PCI config space, and setup (MM)IO handlers to trap > > >> > ECAM, see vpci_ecam_{read,write}(). > > >> > > > >> > I think we want to move to a model where vPCI doesn't setup MMIO traps > > >> > itself, and instead relies on ioreq to do the decoding and forwarding > > >> > of accesses. We need some work in order to represent an internal > > >> > ioreq handler, but that shouldn't be too complicated. IOW: vpci > > >> > should register devices it's handling with ioreq, much like QEMU does. > > >> > > >> I think this could be a good idea. > > >> > > >> This would be the very first IOREQ handler implemented in Xen itself, > > >> rather than outside of Xen. Some code refactoring might be required, > > >> which worries me given that vPCI is at v10 and has been pending for > > >> years. I think it could make sense as a follow-up series, not v11. > > > > > > That's perfectly fine for me, most of the series here just deal with > > > the logic to intercept guest access to the config space and is > > > completely agnostic as to how the accesses are intercepted. > > > > > >> I think this idea would be beneficial if, in the example above, vPCI > > >> doesn't really need to know about device 00:03.0. vPCI registers via > > >> IOREQ the PCI Root Complex and device 00:02.0 only, QEMU registers > > >> 00:03.0, and everything works. vPCI is not involved at all in PCI config > > >> space reads and writes for 00:03.0. If this is the case, then moving > > >> vPCI to IOREQ could be good. > > > > > > Given your description above, with the root complex implemented in > > > vPCI, we would need to mandate vPCI together with ioreqs even if no > > > passthrough devices are using vPCI itself (just for the emulation of > > > the root complex). Which is fine, just wanted to mention the > > > dependency. > > > > > >> On the other hand if vPCI actually needs to know that 00:03.0 exists, > > >> perhaps because it changes something in the PCI Root Complex emulation > > >> or vPCI needs to take some action when PCI config space registers of > > >> 00:03.0 are written to, then I think this model doesn't work well. 
If > > >> this is the case, then I think it would be best to keep vPCI as MMIO > > >> handler and let vPCI forward to IOREQ when appropriate. > > > > > > At first approximation I don't think we would have such interactions, > > > otherwise the whole premise of ioreq being able to register individual > > > PCI devices would be broken. > > > > > > XenServer already has scenarios with two different user-space emulators > > > (i.e. two different ioreq servers) handling accesses to different > > > devices in the same PCI bus, and there's no interaction with the root > > > complex required. Good to hear. > > Out of curiosity: how are legacy PCI interrupts handled in this case? In > > my understanding, it is the Root Complex's responsibility to propagate > > correct IRQ levels to an interrupt controller? > > I'm unsure whether my understanding of the question is correct, so my > reply might not be what you are asking for, sorry. > > Legacy IRQs (GSI on x86) are set up directly by the toolstack when the > device is assigned to the guest, using PHYSDEVOP_map_pirq + > XEN_DOMCTL_bind_pt_irq. Those hypercalls bind a host IO-APIC > pin to a guest IO-APIC pin, so that interrupts originating from that > host IO-APIC pin are always forwarded to the guest and injected as > originating from the guest IO-APIC pin. > > Note that the device will always use the same IO-APIC pin; this is not > configured by the OS. QEMU calls xen_set_pci_intx_level, which is implemented by xendevicemodel_set_pci_intx_level, which issues XEN_DMOP_set_pci_intx_level, which ends up in set_pci_intx_level. Eventually it calls hvm_pci_intx_assert and hvm_pci_intx_deassert. I don't think any of this goes via the Root Complex; otherwise, as Roger pointed out, it wouldn't be possible to emulate individual PCI devices in separate IOREQ servers.
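For completeness, the QEMU-side entry point of that chain is a single libxendevicemodel call. The sketch below uses a made-up BDF and assumes a xendevicemodel handle opened as in the earlier sketch; it asserts or deasserts INTA# of the emulated device 00:03.0, and on the Xen side this ends up in set_pci_intx_level() and hvm_pci_intx_assert()/hvm_pci_intx_deassert(), with no root-complex emulation in QEMU involved in the delivery path.

#include <xendevicemodel.h>

/* Sketch: drive legacy INTA# of the emulated device 0000:00:03.0. */
static int set_intx_level(xendevicemodel_handle *dmod, domid_t domid,
                          unsigned int level /* 0 = deassert, 1 = assert */)
{
    /* Issues XEN_DMOP_set_pci_intx_level for segment 0, bus 0, dev 3, INTA#. */
    return xendevicemodel_set_pci_intx_level(dmod, domid, 0, 0, 3, 0, level);
}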
diff --git a/xen/drivers/Kconfig b/xen/drivers/Kconfig
index db94393f47..780490cf8e 100644
--- a/xen/drivers/Kconfig
+++ b/xen/drivers/Kconfig
@@ -15,4 +15,8 @@ source "drivers/video/Kconfig"
 config HAS_VPCI
 	bool
 
+config HAS_VPCI_GUEST_SUPPORT
+	bool
+	depends on HAS_VPCI
+
 endmenu
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 5e34d0092a..7c46a2d3f4 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -36,6 +36,52 @@ extern vpci_register_init_t *const __start_vpci_array[];
 extern vpci_register_init_t *const __end_vpci_array[];
 #define NUM_VPCI_INIT (__end_vpci_array - __start_vpci_array)
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    unsigned long new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(rw_is_write_locked(&pdev->domain->pci_lock));
+
+    /*
+     * Each PCI bus supports 32 devices/slots at max or up to 256 when
+     * there are multi-function ones which are not yet supported.
+     */
+    if ( pdev->info.is_extfn && !pdev->info.is_virtfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number == VPCI_MAX_VIRT_DEV )
+    {
+        write_unlock(&pdev->domain->pci_lock);
+        return -ENOSPC;
+    }
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0);
+
+    return 0;
+}
+
+#endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
+
 void vpci_deassign_device(struct pci_dev *pdev)
 {
     unsigned int i;
@@ -46,6 +92,13 @@ void vpci_deassign_device(struct pci_dev *pdev)
         return;
 
     spin_lock(&pdev->vpci->lock);
+
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    if ( pdev->vpci->guest_sbdf.sbdf != ~0 )
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+#endif
+
     while ( !list_empty(&pdev->vpci->handlers) )
     {
         struct vpci_register *r = list_first_entry(&pdev->vpci->handlers,
@@ -101,6 +154,13 @@ int vpci_assign_device(struct pci_dev *pdev)
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
 
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto out;
+#endif
+
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
         rc = __start_vpci_array[i](pdev);
@@ -108,11 +168,14 @@ int vpci_assign_device(struct pci_dev *pdev)
             break;
     }
 
+ out:
+    __maybe_unused;
     if ( rc )
         vpci_deassign_device(pdev);
 
     return rc;
 }
+
 #endif /* __XEN__ */
 
 static int vpci_register_cmp(const struct vpci_register *r1,
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 57391e74b6..84e608f670 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -462,6 +462,14 @@ struct domain
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
     rwlock_t pci_lock;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 60bdc10c13..4a53936447 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports 32 devices/slots at max or up to 256 when
+ * there are multi-function ones which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -155,6 +162,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };