Message ID | 20231115112611.3865905-3-Sergiy_Kibrik@epam.com (mailing list archive)
---|---
State | Superseded
Series | ARM virtio-pci initial support
Hi,

Thanks for adding support for virtio-pci in Xen. I have some questions.

On 15/11/2023 11:26, Sergiy Kibrik wrote:
> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>
> In order to enable more use-cases such as having multiple
> device-models (Qemu) running in different backend domains which provide
> virtio-pci devices for the same guest, we allocate and expose one
> PCI host bridge for every virtio backend domain for that guest.

OOI, why do you need to expose one PCI host bridge for every stubdomain?

In fact, looking at the next patch, it seems you are handling some of the hostbridge requests in Xen. This adds a bit more confusion.

I was expecting the virtual PCI device would be in the vPCI and each Device emulator would advertise which BDF they are covering.

> For that purpose, reserve separate virtio-pci resources (memory and SPI range
> for Legacy PCI interrupts) up to 8 possible PCI hosts (to be aligned with

Do you mean host bridge rather than host?

> MAX_NR_IOREQ_SERVERS) and allocate a host per backend domain. The PCI host
> details including its host_id to be written to dedicated Xenstore node for
> the device-model to retrieve.

So with this approach, who decides which BDF will be used for a given virtio PCI device?

> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
> ---
> xen/include/public/arch-arm.h | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
> index a25e87dbda..e6c9cd5335 100644
> --- a/xen/include/public/arch-arm.h
> +++ b/xen/include/public/arch-arm.h
> @@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t;
> #define GUEST_VPCI_MEM_ADDR xen_mk_ullong(0x23000000)
> #define GUEST_VPCI_MEM_SIZE xen_mk_ullong(0x10000000)
>
> +/*
> + * 16 MB is reserved for virtio-pci configuration space based on calculation
> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB

Can you explain how you decided on the "2"?

> + */
> +#define GUEST_VIRTIO_PCI_ECAM_BASE xen_mk_ullong(0x33000000)
> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE xen_mk_ullong(0x01000000)
> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE xen_mk_ullong(0x00200000)
> +
> +/* 64 MB is reserved for virtio-pci memory */
> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM xen_mk_ullong(0x02000000)
> +#define GUEST_VIRTIO_PCI_MEM_ADDR xen_mk_ullong(0x34000000)
> +#define GUEST_VIRTIO_PCI_MEM_SIZE xen_mk_ullong(0x04000000)
> +
> /*
> * 16MB == 4096 pages reserved for guest to use as a region to map its
> * grant table in.
> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
> #define GUEST_MAGIC_BASE xen_mk_ullong(0x39000000)
> #define GUEST_MAGIC_SIZE xen_mk_ullong(0x01000000)
>
> +/* 64 MB is reserved for virtio-pci Prefetch memory */

This doesn't seem a lot depending on your use case. Can you detail how you came up with "64 MB"?

> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_PREFETCH_MEM xen_mk_ullong(0x42000000)
> +#define GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR xen_mk_ullong(0x3a000000)
> +#define GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE xen_mk_ullong(0x04000000)
> +
> #define GUEST_RAM_BANKS 2
>
> /*
> @@ -515,6 +533,9 @@ typedef uint64_t xen_callback_t;
> #define GUEST_VIRTIO_MMIO_SPI_FIRST 33
> #define GUEST_VIRTIO_MMIO_SPI_LAST 43
>
> +#define GUEST_VIRTIO_PCI_SPI_FIRST 44
> +#define GUEST_VIRTIO_PCI_SPI_LAST 76

Just to confirm, is this 4 interrupts per PCI host bridge? If so, can this be clarified in a comment?
Also, above you said that the host ID will be written to Xenstore. But who will decide which SPI belongs to which host bridge? Cheers,
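For reference, the sizing being questioned here follows directly from the standard ECAM rule of 4 KB of config space per PCI function. The snippet below only spells out that arithmetic; it is an illustrative sketch, and the constant names are stand-ins for the patch's macros rather than code from the series.

#include <stdint.h>
#include <stdio.h>

#define ECAM_BASE         0x33000000ULL  /* GUEST_VIRTIO_PCI_ECAM_BASE */
#define NR_HOST_BRIDGES   8              /* aligned with MAX_NR_IOREQ_SERVERS */
#define BUSES_PER_HOST    2
#define DEVS_PER_BUS      32
#define FNS_PER_DEV       8
#define CFG_PER_FN        0x1000ULL      /* 4 KB of config space per function */

int main(void)
{
    uint64_t host_ecam  = BUSES_PER_HOST * DEVS_PER_BUS * FNS_PER_DEV * CFG_PER_FN;
    uint64_t total_ecam = NR_HOST_BRIDGES * host_ecam;

    /* 0x200000 (2 MB): matches GUEST_VIRTIO_PCI_HOST_ECAM_SIZE */
    printf("per-host ECAM size: %#llx\n", (unsigned long long)host_ecam);
    /* 0x1000000 (16 MB): matches GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE */
    printf("total ECAM size:    %#llx\n", (unsigned long long)total_ecam);
    /* Host bridge N gets the window starting at ECAM_BASE + N * host_ecam. */
    printf("host 1 ECAM base:   %#llx\n",
           (unsigned long long)(ECAM_BASE + 1 * host_ecam));
    return 0;
}

In other words, each of the up-to-8 host bridges gets a 2 MB ECAM window covering 2 buses; dropping to a single bus, as discussed later in the thread, would halve that.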
On 15.11.23 14:33, Julien Grall wrote:
> Hi,

Hello Julien

Let me please try to explain some bits.

>
> Thanks for adding support for virtio-pci in Xen. I have some questions.
>
> On 15/11/2023 11:26, Sergiy Kibrik wrote:
>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>
>> In order to enable more use-cases such as having multiple
>> device-models (Qemu) running in different backend domains which provide
>> virtio-pci devices for the same guest, we allocate and expose one
>> PCI host bridge for every virtio backend domain for that guest.
>
> OOI, why do you need to expose one PCI host bridge for every stubdomain?
>
> In fact looking at the next patch, it seems you are handling some of the
> hostbridge request in Xen. This is adds a bit more confusion.
>
> I was expecting the virtual PCI device would be in the vPCI and each
> Device emulator would advertise which BDF they are covering.

This patch series only covers use-cases where the device emulator handles the *entire* PCI Host bridge and the PCI (virtio-pci) devices behind it (i.e. Qemu). Also, this patch series doesn't touch vPCI/PCI pass-through resources, handling, accounting, nothing. From the hypervisor we only need help to intercept the config space accesses that happen in the range [GUEST_VIRTIO_PCI_ECAM_BASE ... GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and forward them to the linked device emulator (if any), that's all.

It is not possible (with the current series) to run device emulators that emulate only separate PCI (virtio-pci) devices. For that to be possible, I think, many more changes are required than the current patch series makes. There would at least need to be special PCI Host bridge emulation in Xen (or reuse of vPCI) for the integration. Also, Xen would have to be in charge of forming the resulting PCI interrupt based on each PCI device's level signaling (if we use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, etc. Please note, I am not saying this is not possible in general, likely it is possible, but the initial patch series doesn't cover these use-cases.

We expose one PCI host bridge per virtio backend domain. This is a separate PCI host bridge to combine all virtio-pci devices running in the same backend domain (in the same device emulator currently). The examples:
- if only one domain runs Qemu which serves virtio-blk, virtio-net and virtio-console devices for DomU - only a single PCI Host bridge will be exposed for DomU
- if we add another domain running Qemu to additionally serve virtio-gpu, virtio-input and virtio-snd for the *same* DomU - we expose a second PCI Host bridge for DomU

I am afraid we cannot end up exposing only a single PCI Host bridge with the current model (if we use device emulators running in different domains that handle the *entire* PCI Host bridges); this won't work.

Please note, I might be missing some bits of this enabling work.

>
>> For that purpose, reserve separate virtio-pci resources (memory and
>> SPI range
>> for Legacy PCI interrupts) up to 8 possible PCI hosts (to be aligned with
>
> Do you mean host bridge rather than host?

yes

>
>> MAX_NR_IOREQ_SERVERS) and allocate a host per backend domain. The PCI
>> host
>> details including its host_id to be written to dedicated Xenstore node
>> for
>> the device-model to retrieve.
>
> So which with approach, who is decide which BDF will be used for a given
> virtio PCI device?

toolstack (via configuration file)

>
>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
>> ---
>> xen/include/public/arch-arm.h | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>>
>> diff --git a/xen/include/public/arch-arm.h
>> b/xen/include/public/arch-arm.h
>> index a25e87dbda..e6c9cd5335 100644
>> --- a/xen/include/public/arch-arm.h
>> +++ b/xen/include/public/arch-arm.h
>> @@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t;
>> #define GUEST_VPCI_MEM_ADDR xen_mk_ullong(0x23000000)
>> #define GUEST_VPCI_MEM_SIZE xen_mk_ullong(0x10000000)
>> +/*
>> + * 16 MB is reserved for virtio-pci configuration space based on
>> calculation
>> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB
>
> Can you explain how youd ecided the "2"?

good question, we have limited free space available in the memory layout (we had difficulties finding suitable holes); also we don't expect a lot of virtio-pci devices, so the "256" used by vPCI would be too much. It was decided to reduce it significantly, but select the maximum that fits into the free space; with "2" buses we still fit into the chosen holes.

>
>> + */
>> +#define GUEST_VIRTIO_PCI_ECAM_BASE xen_mk_ullong(0x33000000)
>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE xen_mk_ullong(0x01000000)
>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE xen_mk_ullong(0x00200000)
>> +
>> +/* 64 MB is reserved for virtio-pci memory */
>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM xen_mk_ullong(0x02000000)
>> +#define GUEST_VIRTIO_PCI_MEM_ADDR xen_mk_ullong(0x34000000)
>> +#define GUEST_VIRTIO_PCI_MEM_SIZE xen_mk_ullong(0x04000000)
>> +
>> /*
>> * 16MB == 4096 pages reserved for guest to use as a region to map its
>> * grant table in.
>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
>> #define GUEST_MAGIC_BASE xen_mk_ullong(0x39000000)
>> #define GUEST_MAGIC_SIZE xen_mk_ullong(0x01000000)
>> +/* 64 MB is reserved for virtio-pci Prefetch memory */
>
> This doesn't seem a lot depending on your use case. Can you details how
> you can up with "64 MB"?

the same calculation as was done for the configuration space. It was observed that only 16K is used per virtio-pci device (maybe it can be bigger for a usual PCI device, I don't know).
Please look at the example of DomU log below (to strings that contain "*BAR 4: assigned*"): > root@h3ulcb:~# dmesg | grep pci > [ 0.444163] pci-host-generic 33000000.pcie: host bridge /pcie@33000000 ranges: > [ 0.444257] pci-host-generic 33000000.pcie: MEM 0x0034000000..0x00347fffff -> 0x0034000000 > [ 0.444304] pci-host-generic 33000000.pcie: MEM 0x003a000000..0x003a7fffff -> 0x003a000000 > [ 0.444396] pci-host-generic 33000000.pcie: ECAM at [mem 0x33000000-0x331fffff] for [bus 00-01] > [ 0.444595] pci-host-generic 33000000.pcie: PCI host bridge to bus 0000:00 > [ 0.444635] pci_bus 0000:00: root bus resource [bus 00-01] > [ 0.444662] pci_bus 0000:00: root bus resource [mem 0x34000000-0x347fffff] > [ 0.444692] pci_bus 0000:00: root bus resource [mem 0x3a000000-0x3a7fffff pref] > [ 0.445193] pci 0000:00:00.0: [1b36:0008] type 00 class 0x060000 > [ 0.449493] pci 0000:00:08.0: [1af4:1042] type 00 class 0x010000 > [ 0.451760] pci 0000:00:08.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref] > [ 0.455985] pci 0000:00:08.0: BAR 4: assigned [mem 0x3a000000-0x3a003fff 64bit pref] > [ 0.456678] pci-host-generic 33200000.pcie: host bridge /pcie@33200000 ranges: > [ 0.456748] pci-host-generic 33200000.pcie: MEM 0x0034800000..0x0034ffffff -> 0x0034800000 > [ 0.456793] pci-host-generic 33200000.pcie: MEM 0x003a800000..0x003affffff -> 0x003a800000 > [ 0.456879] pci-host-generic 33200000.pcie: ECAM at [mem 0x33200000-0x333fffff] for [bus 00-01] > [ 0.457038] pci-host-generic 33200000.pcie: PCI host bridge to bus 0001:00 > [ 0.457079] pci_bus 0001:00: root bus resource [bus 00-01] > [ 0.457106] pci_bus 0001:00: root bus resource [mem 0x34800000-0x34ffffff] > [ 0.457136] pci_bus 0001:00: root bus resource [mem 0x3a800000-0x3affffff pref] > [ 0.457808] pci 0001:00:00.0: [1b36:0008] type 00 class 0x060000 > [ 0.461401] pci 0001:00:01.0: [1af4:1041] type 00 class 0x020000 > [ 0.463537] pci 0001:00:01.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref] > [ 0.468135] pci 0001:00:02.0: [1af4:1042] type 00 class 0x010000 > [ 0.470185] pci 0001:00:02.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref] > [ 0.474575] pci 0001:00:03.0: [1af4:1050] type 00 class 0x038000 > [ 0.476534] pci 0001:00:03.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref] > [ 0.480942] pci 0001:00:04.0: [1af4:1052] type 00 class 0x090000 > [ 0.483096] pci 0001:00:04.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref] > [ 0.487631] pci 0001:00:06.0: [1af4:1053] type 00 class 0x078000 > [ 0.489693] pci 0001:00:06.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref] > [ 0.493669] pci 0001:00:01.0: BAR 4: assigned [mem 0x3a800000-0x3a803fff 64bit pref] > [ 0.495840] pci 0001:00:02.0: BAR 4: assigned [mem 0x3a804000-0x3a807fff 64bit pref] > [ 0.496656] pci 0001:00:03.0: BAR 4: assigned [mem 0x3a808000-0x3a80bfff 64bit pref] > [ 0.497671] pci 0001:00:04.0: BAR 4: assigned [mem 0x3a80c000-0x3a80ffff 64bit pref] > [ 0.498320] pci 0001:00:06.0: BAR 4: assigned [mem 0x3a810000-0x3a813fff 64bit pref] > [ 0.500732] virtio-pci 0000:00:08.0: enabling device (0000 -> 0002) > [ 0.509728] virtio-pci 0001:00:01.0: enabling device (0000 -> 0002) > [ 0.529757] virtio-pci 0001:00:02.0: enabling device (0000 -> 0002) > [ 0.550208] virtio-pci 0001:00:03.0: enabling device (0000 -> 0002) > [ 0.571012] virtio-pci 0001:00:04.0: enabling device (0000 -> 0002) > [ 0.591042] virtio-pci 0001:00:06.0: enabling device (0000 -> 0002) > >> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_PREFETCH_MEM >> xen_mk_ullong(0x42000000) >> +#define GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR >> 
xen_mk_ullong(0x3a000000) >> +#define GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE >> xen_mk_ullong(0x04000000) >> + >> #define GUEST_RAM_BANKS 2 >> /* >> @@ -515,6 +533,9 @@ typedef uint64_t xen_callback_t; >> #define GUEST_VIRTIO_MMIO_SPI_FIRST 33 >> #define GUEST_VIRTIO_MMIO_SPI_LAST 43 >> +#define GUEST_VIRTIO_PCI_SPI_FIRST 44 >> +#define GUEST_VIRTIO_PCI_SPI_LAST 76 > > Just to confirm this is 4 interrupts per PCI host bridge? If so, can > this be clarified in a comment? yes (INTA - INTD) > > Also, above you said that the host ID will be written to Xenstore. But > who will decide which SPI belongs to which host bridge? the toolstack, it is in charge of PCI host bridge resource allocation (as we need to both properly create the PCI Host bridge device tree node and inform the device emulator about the PCI host bridge details). Please take a look at: https://patchwork.kernel.org/project/xen-devel/patch/20231115112611.3865905-4-Sergiy_Kibrik@epam.com/ > > Cheers, >
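The reply above says the toolstack hands out the legacy interrupt lines, four per host bridge (INTA..INTD), from the GUEST_VIRTIO_PCI_SPI_* range. Below is a minimal sketch of what such an allocation could look like, assuming a simple linear scheme; the helper name is made up, and the real allocation lives in the toolstack patch linked above.

#define GUEST_VIRTIO_PCI_SPI_FIRST   44
#define GUEST_VIRTIO_PCI_SPI_LAST    76
#define VIRTIO_PCI_INTX_PER_HOST     4   /* INTA, INTB, INTC, INTD */

/* SPI used for legacy interrupt 'intx' (0..3) of host bridge 'host_id',
 * or -1 if it would fall outside the reserved range. */
static inline int virtio_pci_spi(unsigned int host_id, unsigned int intx)
{
    int spi = GUEST_VIRTIO_PCI_SPI_FIRST +
              host_id * VIRTIO_PCI_INTX_PER_HOST + intx;

    return spi <= GUEST_VIRTIO_PCI_SPI_LAST ? spi : -1;
}

With such a scheme host bridge 0 would use SPIs 44..47, host bridge 1 SPIs 48..51, and so on.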
Hi Oleksandr, On 15/11/2023 16:51, Oleksandr Tyshchenko wrote: > > > On 15.11.23 14:33, Julien Grall wrote: >> Hi, > > > Hello Julien > > Let me please try to explain some bits. > > >> >> Thanks for adding support for virtio-pci in Xen. I have some questions. >> >> On 15/11/2023 11:26, Sergiy Kibrik wrote: >>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >>> >>> In order to enable more use-cases such as having multiple >>> device-models (Qemu) running in different backend domains which provide >>> virtio-pci devices for the same guest, we allocate and expose one >>> PCI host bridge for every virtio backend domain for that guest. >> >> OOI, why do you need to expose one PCI host bridge for every stubdomain? >> >> In fact looking at the next patch, it seems you are handling some of the >> hostbridge request in Xen. This is adds a bit more confusion. >> >> I was expecting the virtual PCI device would be in the vPCI and each >> Device emulator would advertise which BDF they are covering. > > > This patch series only covers use-cases where the device emulator > handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind > it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI > pass-through resources, handling, accounting, nothing. I understood you want to one Device Emulator to handle the entire PCI host bridge. But... From the > hypervisor we only need a help to intercept the config space accesses > happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... > GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and > forward them to the linked device emulator (if any), that's all. ... I really don't see why you need to add code in Xen to trap the region. If QEMU is dealing with the hostbridge, then it should be able to register the MMIO region and then do the translation. > > It is not possible (with current series) to run device emulators what > emulate only separate PCI (virtio-pci) devices. For it to be possible, I > think, much more changes are required than current patch series does. > There at least should be special PCI Host bridge emulation in Xen (or > reuse vPCI) for the integration. Also Xen should be in charge of forming > resulting PCI interrupt based on each PCI device level signaling (if we > use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, > etc. Please note, I am not saying this is not possible in general, > likely it is possible, but initial patch series doesn't cover these > use-cases) > > We expose one PCI host bridge per virtio backend domain. This is a > separate PCI host bridge to combine all virtio-pci devices running in > the same backend domain (in the same device emulator currently). > The examples: > - if only one domain runs Qemu which servers virtio-blk, virtio-net, > virtio-console devices for DomU - only single PCI Host bridge will be > exposed for DomU > - if we add another domain to run Qemu to serve additionally virtio-gpu, > virtio-input and virtio-snd for the *same* DomU - we expose second PCI > Host bridge for DomU > > I am afraid, we cannot end up exposing only single PCI Host bridge with > current model (if we use device emulators running in different domains > that handles the *entire* PCI Host bridges), this won't work. That makes sense and it is fine. But see above, I think only the #2 is necessary for the hypervisor. Patch #5 should not be necessary at all. [...] 
>>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >>> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com> >>> --- >>> xen/include/public/arch-arm.h | 21 +++++++++++++++++++++ >>> 1 file changed, 21 insertions(+) >>> >>> diff --git a/xen/include/public/arch-arm.h >>> b/xen/include/public/arch-arm.h >>> index a25e87dbda..e6c9cd5335 100644 >>> --- a/xen/include/public/arch-arm.h >>> +++ b/xen/include/public/arch-arm.h >>> @@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t; >>> #define GUEST_VPCI_MEM_ADDR xen_mk_ullong(0x23000000) >>> #define GUEST_VPCI_MEM_SIZE xen_mk_ullong(0x10000000) >>> +/* >>> + * 16 MB is reserved for virtio-pci configuration space based on >>> calculation >>> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB >> >> Can you explain how youd ecided the "2"? > > good question, we have a limited free space available in memory layout > (we had difficulties to find a suitable holes) also we don't expect a > lot of virtio-pci devices, so "256" used vPCI would be too much. It was > decided to reduce significantly, but select maximum to fit into free > space, with having "2" buses we still fit into the chosen holes. If you don't expect a lot of virtio devices, then why do you need two buses? Wouldn't one be sufficient? > > >> >>> + */ >>> +#define GUEST_VIRTIO_PCI_ECAM_BASE xen_mk_ullong(0x33000000) >>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE xen_mk_ullong(0x01000000) >>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE xen_mk_ullong(0x00200000) >>> + >>> +/* 64 MB is reserved for virtio-pci memory */ >>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM xen_mk_ullong(0x02000000) >>> +#define GUEST_VIRTIO_PCI_MEM_ADDR xen_mk_ullong(0x34000000) >>> +#define GUEST_VIRTIO_PCI_MEM_SIZE xen_mk_ullong(0x04000000) >>> + >>> /* >>> * 16MB == 4096 pages reserved for guest to use as a region to map its >>> * grant table in. >>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t; >>> #define GUEST_MAGIC_BASE xen_mk_ullong(0x39000000) >>> #define GUEST_MAGIC_SIZE xen_mk_ullong(0x01000000) >>> +/* 64 MB is reserved for virtio-pci Prefetch memory */ >> >> This doesn't seem a lot depending on your use case. Can you details how >> you can up with "64 MB"? > > the same calculation as it was done configuration space. It was observed > that only 16K is used per virtio-pci device (maybe it can be bigger for > usual PCI device, I don't know). Please look at the example of DomU log > below (to strings that contain "*BAR 4: assigned*"): What about virtio-gpu? I would expect a bit more memory is necessary for that use case. Any case, what I am looking for is for some explanation in the commit message of the limits. I don't particularly care about the exact limit because this is not part of a stable ABI. Cheers,
On 15.11.23 19:31, Julien Grall wrote: > Hi Oleksandr, Hello Julien > > On 15/11/2023 16:51, Oleksandr Tyshchenko wrote: >> >> >> On 15.11.23 14:33, Julien Grall wrote: >>> Hi, >> >> >> Hello Julien >> >> Let me please try to explain some bits. >> >> >>> >>> Thanks for adding support for virtio-pci in Xen. I have some questions. >>> >>> On 15/11/2023 11:26, Sergiy Kibrik wrote: >>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >>>> >>>> In order to enable more use-cases such as having multiple >>>> device-models (Qemu) running in different backend domains which provide >>>> virtio-pci devices for the same guest, we allocate and expose one >>>> PCI host bridge for every virtio backend domain for that guest. >>> >>> OOI, why do you need to expose one PCI host bridge for every stubdomain? >>> >>> In fact looking at the next patch, it seems you are handling some of the >>> hostbridge request in Xen. This is adds a bit more confusion. >>> >>> I was expecting the virtual PCI device would be in the vPCI and each >>> Device emulator would advertise which BDF they are covering. >> >> >> This patch series only covers use-cases where the device emulator >> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind >> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI >> pass-through resources, handling, accounting, nothing. > > I understood you want to one Device Emulator to handle the entire PCI > host bridge. But... > > From the >> hypervisor we only need a help to intercept the config space accesses >> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... >> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and >> forward them to the linked device emulator (if any), that's all. > > ... I really don't see why you need to add code in Xen to trap the > region. If QEMU is dealing with the hostbridge, then it should be able > to register the MMIO region and then do the translation. Hmm, sounds surprising I would say. Are you saying that unmodified Qemu will work if we drop #5? I think this wants to be re-checked (@Sergiy can you please investigate?). If indeed so, than #5 will be dropped of course from the that series (I would say, postponed until more use-cases). > >> >> It is not possible (with current series) to run device emulators what >> emulate only separate PCI (virtio-pci) devices. For it to be possible, I >> think, much more changes are required than current patch series does. >> There at least should be special PCI Host bridge emulation in Xen (or >> reuse vPCI) for the integration. Also Xen should be in charge of forming >> resulting PCI interrupt based on each PCI device level signaling (if we >> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, >> etc. Please note, I am not saying this is not possible in general, >> likely it is possible, but initial patch series doesn't cover these >> use-cases) >> >> We expose one PCI host bridge per virtio backend domain. This is a >> separate PCI host bridge to combine all virtio-pci devices running in >> the same backend domain (in the same device emulator currently). 
>> The examples: >> - if only one domain runs Qemu which servers virtio-blk, virtio-net, >> virtio-console devices for DomU - only single PCI Host bridge will be >> exposed for DomU >> - if we add another domain to run Qemu to serve additionally virtio-gpu, >> virtio-input and virtio-snd for the *same* DomU - we expose second PCI >> Host bridge for DomU >> >> I am afraid, we cannot end up exposing only single PCI Host bridge with >> current model (if we use device emulators running in different domains >> that handles the *entire* PCI Host bridges), this won't work. > > That makes sense and it is fine. But see above, I think only the #2 is > necessary for the hypervisor. Patch #5 should not be necessary at all. Good, it should be re-checked without #5 sure. > > [...] > >>>> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >>>> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com> >>>> --- >>>> xen/include/public/arch-arm.h | 21 +++++++++++++++++++++ >>>> 1 file changed, 21 insertions(+) >>>> >>>> diff --git a/xen/include/public/arch-arm.h >>>> b/xen/include/public/arch-arm.h >>>> index a25e87dbda..e6c9cd5335 100644 >>>> --- a/xen/include/public/arch-arm.h >>>> +++ b/xen/include/public/arch-arm.h >>>> @@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t; >>>> #define GUEST_VPCI_MEM_ADDR >>>> xen_mk_ullong(0x23000000) >>>> #define GUEST_VPCI_MEM_SIZE >>>> xen_mk_ullong(0x10000000) >>>> +/* >>>> + * 16 MB is reserved for virtio-pci configuration space based on >>>> calculation >>>> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB >>> >>> Can you explain how youd ecided the "2"? >> >> good question, we have a limited free space available in memory layout >> (we had difficulties to find a suitable holes) also we don't expect a >> lot of virtio-pci devices, so "256" used vPCI would be too much. It was >> decided to reduce significantly, but select maximum to fit into free >> space, with having "2" buses we still fit into the chosen holes. > > If you don't expect a lot of virtio devices, then why do you need two > buses? Wouldn't one be sufficient? For now one would sufficient I think. @Sergiy if you reduce to a single bus here, don't forget to update "bus-range" property in device-tree node. > >> >> >>> >>>> + */ >>>> +#define GUEST_VIRTIO_PCI_ECAM_BASE xen_mk_ullong(0x33000000) >>>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE xen_mk_ullong(0x01000000) >>>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE xen_mk_ullong(0x00200000) >>>> + >>>> +/* 64 MB is reserved for virtio-pci memory */ >>>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM xen_mk_ullong(0x02000000) >>>> +#define GUEST_VIRTIO_PCI_MEM_ADDR xen_mk_ullong(0x34000000) >>>> +#define GUEST_VIRTIO_PCI_MEM_SIZE xen_mk_ullong(0x04000000) >>>> + >>>> /* >>>> * 16MB == 4096 pages reserved for guest to use as a region to >>>> map its >>>> * grant table in. >>>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t; >>>> #define GUEST_MAGIC_BASE xen_mk_ullong(0x39000000) >>>> #define GUEST_MAGIC_SIZE xen_mk_ullong(0x01000000) >>>> +/* 64 MB is reserved for virtio-pci Prefetch memory */ >>> >>> This doesn't seem a lot depending on your use case. Can you details how >>> you can up with "64 MB"? >> >> the same calculation as it was done configuration space. It was observed >> that only 16K is used per virtio-pci device (maybe it can be bigger for >> usual PCI device, I don't know). Please look at the example of DomU log >> below (to strings that contain "*BAR 4: assigned*"): > > What about virtio-gpu? 
I would expect a bit more memory is necessary for > that use case. In the DomU log I provided in my last response, virtio-gpu was also present among the 5 virtio-pci devices and it worked at runtime: [ 0.474575] pci 0001:00:03.0: [1af4:1050] type 00 class 0x038000 [ 0.476534] pci 0001:00:03.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref] .... [ 0.496656] pci 0001:00:03.0: BAR 4: assigned [mem 0x3a808000-0x3a80bfff 64bit pref] .... [ 0.550208] virtio-pci 0001:00:03.0: enabling device (0000 -> 0002) .... 0001:00:03.0 Display controller: Red Hat, Inc. Virtio GPU (rev 01) I guess it indeed needs more memory, but that is related to the I/O descriptors passed via the virtqueue at runtime. > > Any case, what I am looking for is for some explanation in the commit > message of the limits. I don't particularly care about the exact limit > because this is not part of a stable ABI. ok, sure > > Cheers, >
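To put the numbers from this exchange side by side: 64 MB of prefetchable space is shared by up to 8 host bridges, and roughly 16 KB of BAR space was observed per virtio-pci device. The sketch below is only that back-of-the-envelope estimate, assuming the even per-bridge split visible in the DomU log; it is not a claim about what heavier devices such as virtio-gpu may need for their runtime I/O.

#define PREFETCH_MEM_SIZE  0x04000000UL  /* GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE, 64 MB */
#define NR_HOST_BRIDGES    8
#define OBSERVED_BAR_SIZE  0x4000UL      /* 16 KB, per the "BAR 4: assigned" lines */

/* 8 MB per bridge, as seen in the log (e.g. 0x3a000000-0x3a7fffff). */
static const unsigned long prefetch_per_bridge = PREFETCH_MEM_SIZE / NR_HOST_BRIDGES;

/* Roughly 512 such BARs would fit per bridge. */
static const unsigned long bars_per_bridge =
    (PREFETCH_MEM_SIZE / NR_HOST_BRIDGES) / OBSERVED_BAR_SIZE;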
Hi Oleksandr, On 15/11/2023 18:14, Oleksandr Tyshchenko wrote: > On 15.11.23 19:31, Julien Grall wrote: >> On 15/11/2023 16:51, Oleksandr Tyshchenko wrote: >>> On 15.11.23 14:33, Julien Grall wrote: >>>> Thanks for adding support for virtio-pci in Xen. I have some questions. >>>> >>>> On 15/11/2023 11:26, Sergiy Kibrik wrote: >>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >>>>> >>>>> In order to enable more use-cases such as having multiple >>>>> device-models (Qemu) running in different backend domains which provide >>>>> virtio-pci devices for the same guest, we allocate and expose one >>>>> PCI host bridge for every virtio backend domain for that guest. >>>> >>>> OOI, why do you need to expose one PCI host bridge for every stubdomain? >>>> >>>> In fact looking at the next patch, it seems you are handling some of the >>>> hostbridge request in Xen. This is adds a bit more confusion. >>>> >>>> I was expecting the virtual PCI device would be in the vPCI and each >>>> Device emulator would advertise which BDF they are covering. >>> >>> >>> This patch series only covers use-cases where the device emulator >>> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind >>> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI >>> pass-through resources, handling, accounting, nothing. >> >> I understood you want to one Device Emulator to handle the entire PCI >> host bridge. But... >> >> From the >>> hypervisor we only need a help to intercept the config space accesses >>> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... >>> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and >>> forward them to the linked device emulator (if any), that's all. >> >> ... I really don't see why you need to add code in Xen to trap the >> region. If QEMU is dealing with the hostbridge, then it should be able >> to register the MMIO region and then do the translation. > > > Hmm, sounds surprising I would say. Are you saying that unmodified Qemu > will work if we drop #5? I don't know if an unmodified QEMU will work. My point is I don't view the patch in Xen necessary. You should be able to tell QEMU "here is the ECAM region, please emulate an hostbridge". QEMU will then register the region to Xen and all the accesses will be forwarded. In the future we may need a patch similar to #5 if we want to have multiple DM using the same hostbridge. But this is a different discussion and the patch would need some rework. The ioreq.c code was always meant to be generic and is always for every emulated MMIO. So you want to limit any change in it. Checking the MMIO region belongs to the hostbridge and doing the translation is IMHO not a good idea to do in ioreq.c. Instead you want to do the conversion from MMIO to (sbdf, offset) in virtio_pci_mmio{read, write}(). So the job of ioreq.c is to simply find the correct Device Model and forward it. I also don't see why the feature is gated by has_vcpi(). They are two distinct features (at least in your current model). Cheers,
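A rough illustration of the split suggested above: decode the guest physical address inside one host bridge's ECAM window into a (bdf, register offset) pair in the virtio-pci MMIO handler itself, so that the generic ioreq.c path only has to find the right device model and forward the access. The structure and function names below are assumptions made for the sketch, not the actual Xen or series code.

#include <stdint.h>

#define GUEST_VIRTIO_PCI_ECAM_BASE       0x33000000ULL
#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE  0x00200000ULL

struct ecam_addr {
    uint16_t bdf;   /* bus[15:8] device[7:3] function[2:0] */
    uint16_t reg;   /* offset within that function's 4 KB config space */
};

/* Standard ECAM layout: 4 KB of config space per (bus, dev, fn), so the
 * bdf is simply the window offset shifted down by 12 bits. */
static struct ecam_addr ecam_decode(unsigned int host_id, uint64_t gpa)
{
    uint64_t base = GUEST_VIRTIO_PCI_ECAM_BASE +
                    (uint64_t)host_id * GUEST_VIRTIO_PCI_HOST_ECAM_SIZE;
    uint64_t off  = gpa - base;
    struct ecam_addr a = {
        .bdf = (uint16_t)(off >> 12),
        .reg = (uint16_t)(off & 0xfff),
    };

    return a;
}

A virtio_pci_mmio{read,write}() handler registered for a given host bridge's window could call something like this and hand the (sbdf, offset) pair to the device model, keeping ioreq.c itself untouched.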
On 15.11.23 20:33, Julien Grall wrote: > Hi Oleksandr, Hello Julien > > On 15/11/2023 18:14, Oleksandr Tyshchenko wrote: >> On 15.11.23 19:31, Julien Grall wrote: >>> On 15/11/2023 16:51, Oleksandr Tyshchenko wrote: >>>> On 15.11.23 14:33, Julien Grall wrote: >>>>> Thanks for adding support for virtio-pci in Xen. I have some >>>>> questions. >>>>> >>>>> On 15/11/2023 11:26, Sergiy Kibrik wrote: >>>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >>>>>> >>>>>> In order to enable more use-cases such as having multiple >>>>>> device-models (Qemu) running in different backend domains which >>>>>> provide >>>>>> virtio-pci devices for the same guest, we allocate and expose one >>>>>> PCI host bridge for every virtio backend domain for that guest. >>>>> >>>>> OOI, why do you need to expose one PCI host bridge for every >>>>> stubdomain? >>>>> >>>>> In fact looking at the next patch, it seems you are handling some >>>>> of the >>>>> hostbridge request in Xen. This is adds a bit more confusion. >>>>> >>>>> I was expecting the virtual PCI device would be in the vPCI and each >>>>> Device emulator would advertise which BDF they are covering. >>>> >>>> >>>> This patch series only covers use-cases where the device emulator >>>> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices >>>> behind >>>> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI >>>> pass-through resources, handling, accounting, nothing. >>> >>> I understood you want to one Device Emulator to handle the entire PCI >>> host bridge. But... >>> >>> From the >>>> hypervisor we only need a help to intercept the config space accesses >>>> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... >>>> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and >>>> forward them to the linked device emulator (if any), that's all. >>> >>> ... I really don't see why you need to add code in Xen to trap the >>> region. If QEMU is dealing with the hostbridge, then it should be able >>> to register the MMIO region and then do the translation. >> >> >> Hmm, sounds surprising I would say. Are you saying that unmodified Qemu >> will work if we drop #5? > > I don't know if an unmodified QEMU will work. My point is I don't view > the patch in Xen necessary. You should be able to tell QEMU "here is the > ECAM region, please emulate an hostbridge". QEMU will then register the > region to Xen and all the accesses will be forwarded. > > In the future we may need a patch similar to #5 if we want to have > multiple DM using the same hostbridge. But this is a different > discussion and the patch would need some rework. ok > > The ioreq.c code was always meant to be generic and is always for every > emulated MMIO. So you want to limit any change in it. Checking the MMIO > region belongs to the hostbridge and doing the translation is IMHO not a > good idea to do in ioreq.c. Instead you want to do the conversion from > MMIO to (sbdf, offset) in virtio_pci_mmio{read, write}(). So the job of > ioreq.c is to simply find the correct Device Model and forward it. Are you about virtio_pci_ioreq_server_get_addr() called from arch_ioreq_server_get_type_addr()? If so and if I am not mistaken the x86 also check what PCI device is targeted there. But, I am not against the suggestion, I agree with it. > > I also don't see why the feature is gated by has_vcpi(). They are two > distinct features (at least in your current model). yes, you are correct. 
In #5 the virtio-pci MMIO handlers are still registered in domain_vpci_init() (which is gated by has_vpci()), etc. > > Cheers, >
Hi Oleksandr, On 15/11/2023 19:38, Oleksandr Tyshchenko wrote: >> The ioreq.c code was always meant to be generic and is always for every >> emulated MMIO. So you want to limit any change in it. Checking the MMIO >> region belongs to the hostbridge and doing the translation is IMHO not a >> good idea to do in ioreq.c. Instead you want to do the conversion from >> MMIO to (sbdf, offset) in virtio_pci_mmio{read, write}(). So the job of >> ioreq.c is to simply find the correct Device Model and forward it. > > > > Are you about virtio_pci_ioreq_server_get_addr() called from > arch_ioreq_server_get_type_addr()? If so and if I am not mistaken the > x86 also check what PCI device is targeted there. Well yes. We can do better to avoid extra complexity for each MMIO. Note that the x86 version is somewhat more light-weight. Cheers,
+ Stewart, Vikram On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote: > On 15.11.23 14:33, Julien Grall wrote: > > Thanks for adding support for virtio-pci in Xen. I have some questions. > > > > On 15/11/2023 11:26, Sergiy Kibrik wrote: > >> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> > >> > >> In order to enable more use-cases such as having multiple > >> device-models (Qemu) running in different backend domains which provide > >> virtio-pci devices for the same guest, we allocate and expose one > >> PCI host bridge for every virtio backend domain for that guest. > > > > OOI, why do you need to expose one PCI host bridge for every stubdomain? > > > > In fact looking at the next patch, it seems you are handling some of the > > hostbridge request in Xen. This is adds a bit more confusion. > > > > I was expecting the virtual PCI device would be in the vPCI and each > > Device emulator would advertise which BDF they are covering. > > > This patch series only covers use-cases where the device emulator > handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind > it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI > pass-through resources, handling, accounting, nothing. From the > hypervisor we only need a help to intercept the config space accesses > happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... > GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and > forward them to the linked device emulator (if any), that's all. > > It is not possible (with current series) to run device emulators what > emulate only separate PCI (virtio-pci) devices. For it to be possible, I > think, much more changes are required than current patch series does. > There at least should be special PCI Host bridge emulation in Xen (or > reuse vPCI) for the integration. Also Xen should be in charge of forming > resulting PCI interrupt based on each PCI device level signaling (if we > use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, > etc. Please note, I am not saying this is not possible in general, > likely it is possible, but initial patch series doesn't cover these > use-cases) > > We expose one PCI host bridge per virtio backend domain. This is a > separate PCI host bridge to combine all virtio-pci devices running in > the same backend domain (in the same device emulator currently). > The examples: > - if only one domain runs Qemu which servers virtio-blk, virtio-net, > virtio-console devices for DomU - only single PCI Host bridge will be > exposed for DomU > - if we add another domain to run Qemu to serve additionally virtio-gpu, > virtio-input and virtio-snd for the *same* DomU - we expose second PCI > Host bridge for DomU > > I am afraid, we cannot end up exposing only single PCI Host bridge with > current model (if we use device emulators running in different domains > that handles the *entire* PCI Host bridges), this won't work. We were discussing the topic of vPCI and Virtio PCI just this morning with Stewart and Vikram. We also intend to make them work well together in the next couple of months (great timing!!) However, our thinking is to go with the other approach Julien suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would register individual PCI devices against it. Vikram, Stewart, please comment. Our understanding is that it should be possible to make QEMU virtio-pci work with vPCI with relatively minor efforts and AMD volunteers to do the work in the next couple of months on the vPCI side. 
Although it should be possible to make both approaches work at the same time, given that it would seem that EPAM and AMD have very similar requirements, I suggest we work together and collaborate on a single approach going forward that works best for everyone. Let me start by saying that if we can get away with it, I think that a single PCI Root Complex in Xen would be best because it requires less complexity. Why emulate 2 or 3 PCI Root Complexes if we can emulate only one? Stewart, you are deep into vPCI, what's your thinking?
Hi Stefano, Stefano Stabellini <sstabellini@kernel.org> writes: > + Stewart, Vikram > > On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote: >> On 15.11.23 14:33, Julien Grall wrote: >> > Thanks for adding support for virtio-pci in Xen. I have some questions. >> > >> > On 15/11/2023 11:26, Sergiy Kibrik wrote: >> >> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >> >> >> >> In order to enable more use-cases such as having multiple >> >> device-models (Qemu) running in different backend domains which provide >> >> virtio-pci devices for the same guest, we allocate and expose one >> >> PCI host bridge for every virtio backend domain for that guest. >> > >> > OOI, why do you need to expose one PCI host bridge for every stubdomain? >> > >> > In fact looking at the next patch, it seems you are handling some of the >> > hostbridge request in Xen. This is adds a bit more confusion. >> > >> > I was expecting the virtual PCI device would be in the vPCI and each >> > Device emulator would advertise which BDF they are covering. >> >> >> This patch series only covers use-cases where the device emulator >> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind >> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI >> pass-through resources, handling, accounting, nothing. From the >> hypervisor we only need a help to intercept the config space accesses >> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... >> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and >> forward them to the linked device emulator (if any), that's all. >> >> It is not possible (with current series) to run device emulators what >> emulate only separate PCI (virtio-pci) devices. For it to be possible, I >> think, much more changes are required than current patch series does. >> There at least should be special PCI Host bridge emulation in Xen (or >> reuse vPCI) for the integration. Also Xen should be in charge of forming >> resulting PCI interrupt based on each PCI device level signaling (if we >> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, >> etc. Please note, I am not saying this is not possible in general, >> likely it is possible, but initial patch series doesn't cover these >> use-cases) >> >> We expose one PCI host bridge per virtio backend domain. This is a >> separate PCI host bridge to combine all virtio-pci devices running in >> the same backend domain (in the same device emulator currently). >> The examples: >> - if only one domain runs Qemu which servers virtio-blk, virtio-net, >> virtio-console devices for DomU - only single PCI Host bridge will be >> exposed for DomU >> - if we add another domain to run Qemu to serve additionally virtio-gpu, >> virtio-input and virtio-snd for the *same* DomU - we expose second PCI >> Host bridge for DomU >> >> I am afraid, we cannot end up exposing only single PCI Host bridge with >> current model (if we use device emulators running in different domains >> that handles the *entire* PCI Host bridges), this won't work. > > > We were discussing the topic of vPCI and Virtio PCI just this morning > with Stewart and Vikram. We also intend to make them work well together > in the next couple of months (great timing!!) > > However, our thinking is to go with the other approach Julien > suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would > register individual PCI devices against it. > > Vikram, Stewart, please comment. 
Our understanding is that it should be > possible to make QEMU virtio-pci work with vPCI with relatively minor > efforts and AMD volunteers to do the work in the next couple of months > on the vPCI side. > > > Although it should be possible to make both approaches work at the same > time, given that it would seem that EPAM and AMD have very similar > requirements, I suggest we work together and collaborate on a single > approach going forward that works best for everyone. > > > Let me start by saying that if we can get away with it, I think that a > single PCI Root Complex in Xen would be best because it requires less > complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only > one? Well, in fact we tried a similar setup; this was in the first version of virtio-pci support. But we had a couple of issues with it. For instance, this might conflict with PCI passthrough devices, with virtio devices that have back-ends in different domains, etc. I am not saying that this is impossible, but it just involves more moving parts. With my vPCI patch series in place, the hypervisor itself assigns BDFs for passed-through devices. The toolstack needs to get this information to know which BDFs are free and can be used by virtio-pci.
Hi Volodymyr, On 16/11/2023 15:07, Volodymyr Babchuk wrote: > With my vPCI patch series in place, hypervisor itself assigns BDFs for > passed-through devices. Toolstack needs to get this information to know > which BDFs are free and can be used by virtio-pci. It sounds a bit odd to let the hypervisor assign the BDFs, at least because there might be cases where you want a specific vBDF (for instance, this is the case with some Intel graphics cards). It should be the toolstack's job to say "I want to assign the pBDF to this vBDF". Do you have a link to the patch adding the logic in the hypervisor? I will comment there as well. Cheers,
On 11/16/23 10:12, Julien Grall wrote: > Hi Volodymyr, > > On 16/11/2023 15:07, Volodymyr Babchuk wrote: >> With my vPCI patch series in place, hypervisor itself assigns BDFs for >> passed-through devices. Toolstack needs to get this information to know >> which BDFs are free and can be used by virtio-pci. > > It sounds a bit odd to let the hypervisor to assign the BDFs. At least because there might be case where you want to specific vBDF (for instance this is the case with some intel graphic cards). This should be the toolstack job to say "I want to assign the pBDF to this vBDF". Keep in mind we are also supporting dom0less PCI passthrough. > > Do you have a link to the patch adding the logic in the hypervisor? I will comment there as well. See add_virtual_device() in [1], specifically the line: pdev->vpci->guest_sbdf = PCI_SBDF(0, 0, new_dev_number, 0); [1] https://lists.xenproject.org/archives/html/xen-devel/2023-10/msg00673.html
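To unpack the quoted line for readers following along: every passed-through device ends up on virtual segment 0, bus 0, function 0, and only the device (slot) number is picked. The snippet below is a simplified, self-contained illustration of that idea, not the real add_virtual_device() from [1]; the bitmap and helper names are invented for the example.

#include <stdint.h>

#define VPCI_MAX_VIRT_DEV  32  /* 32 slots on the single virtual bus */

/* Same encoding as PCI_SBDF(seg, bus, dev, fn): seg:bus:dev.fn packed
 * into 32 bits. */
static uint32_t make_sbdf(uint16_t seg, uint8_t bus, uint8_t dev, uint8_t fn)
{
    return ((uint32_t)seg << 16) | ((uint32_t)bus << 8) |
           ((uint32_t)(dev & 0x1f) << 3) | (fn & 0x7);
}

/* Pick the first free slot from a per-domain bitmap and build the guest
 * SBDF for it, mirroring "PCI_SBDF(0, 0, new_dev_number, 0)" above. */
static int assign_guest_sbdf(uint32_t *used_slots, uint32_t *out_sbdf)
{
    unsigned int dev;

    for ( dev = 0; dev < VPCI_MAX_VIRT_DEV; dev++ )
    {
        if ( !(*used_slots & (1u << dev)) )
        {
            *used_slots |= 1u << dev;
            *out_sbdf = make_sbdf(0, 0, dev, 0);
            return 0;
        }
    }

    return -1; /* no free slot left on the virtual bus */
}

Whether this choice should live in the hypervisor or be driven by the toolstack/device tree, as debated in the following replies, is orthogonal to the mechanics sketched here.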
On 16/11/2023 15:26, Stewart Hildebrand wrote: > On 11/16/23 10:12, Julien Grall wrote: >> Hi Volodymyr, >> >> On 16/11/2023 15:07, Volodymyr Babchuk wrote: >>> With my vPCI patch series in place, hypervisor itself assigns BDFs for >>> passed-through devices. Toolstack needs to get this information to know >>> which BDFs are free and can be used by virtio-pci. >> >> It sounds a bit odd to let the hypervisor to assign the BDFs. At least because there might be case where you want to specific vBDF (for instance this is the case with some intel graphic cards). This should be the toolstack job to say "I want to assign the pBDF to this vBDF". > > Keep in mind we are also supporting dom0less PCI passthrough. Right, but even with that in mind, I expect the Device-Tree to provide the details where a given PCI is assigned. You could have logic for assigning the BDF automagically. But that should be part of dom0less, not deep into the vPCI code. Cheers,
Hi Julien, Julien Grall <julien@xen.org> writes: > On 16/11/2023 15:26, Stewart Hildebrand wrote: >> On 11/16/23 10:12, Julien Grall wrote: >>> Hi Volodymyr, >>> >>> On 16/11/2023 15:07, Volodymyr Babchuk wrote: >>>> With my vPCI patch series in place, hypervisor itself assigns BDFs for >>>> passed-through devices. Toolstack needs to get this information to know >>>> which BDFs are free and can be used by virtio-pci. >>> >>> It sounds a bit odd to let the hypervisor to assign the BDFs. At >>> least because there might be case where you want to specific vBDF >>> (for instance this is the case with some intel graphic cards). This >>> should be the toolstack job to say "I want to assign the pBDF to >>> this vBDF". >> Keep in mind we are also supporting dom0less PCI passthrough. > Right, but even with that in mind, I expect the Device-Tree to provide > the details where a given PCI is assigned. > > You could have logic for assigning the BDF automagically. But that > should be part of dom0less, not deep into the vPCI code. As far as I know, right now the toolstack does not allow you to assign BDFs in any form. For x86 there are two options, and they are controlled by the "passthrough" option of the xen-pciback driver in Linux: "Option to specify how to export PCI topology to guest:" " 0 - (default) Hide the true PCI topology and makes the frontend" " there is a single PCI bus with only the exported devices on it." " For example, a device at 03:05.0 will be re-assigned to 00:00.0" " while second device at 02:1a.1 will be re-assigned to 00:01.1." " 1 - Passthrough provides a real view of the PCI topology to the" " frontend (for example, a device at 06:01.b will still appear at" " 06:01.b to the frontend). This is similar to how Xen 2.0.x" " exposed PCI devices to its driver domains. This may be required" " for drivers which depend on finding their hardward in certain" " bus/slot locations."); Also, doesn't a strict dependency on BDF break the PCI specification? I mean, of course, you can assign the Function at random, but what about the Bus and Device parts? I mean, we can make the toolstack responsible for assigning BDFs, but this might break existing x86 setups...
Hi Volodymyr, On 16/11/2023 16:53, Volodymyr Babchuk wrote: > Julien Grall <julien@xen.org> writes: >> On 16/11/2023 15:26, Stewart Hildebrand wrote: >>> On 11/16/23 10:12, Julien Grall wrote: >>>> Hi Volodymyr, >>>> >>>> On 16/11/2023 15:07, Volodymyr Babchuk wrote: >>>>> With my vPCI patch series in place, hypervisor itself assigns BDFs for >>>>> passed-through devices. Toolstack needs to get this information to know >>>>> which BDFs are free and can be used by virtio-pci. >>>> >>>> It sounds a bit odd to let the hypervisor to assign the BDFs. At >>>> least because there might be case where you want to specific vBDF >>>> (for instance this is the case with some intel graphic cards). This >>>> should be the toolstack job to say "I want to assign the pBDF to >>>> this vBDF". >>> Keep in mind we are also supporting dom0less PCI passthrough. >> Right, but even with that in mind, I expect the Device-Tree to provide >> the details where a given PCI is assigned. >> >> You could have logic for assigning the BDF automagically. But that >> should be part of dom0less, not deep into the vPCI code. > > As far as I know, right now toolstack does not allow you to assign BDF > in any form. For x86 there are two options, and they are controlled by > "passthrough" option of xen-pciback driver in Linux: Are you talking about HVM or PV? I am not very familiar with the latter, but for the former, QEMU is today responsible for finding a free slot. You can also specify which virtual slot you want to use in XL: =item B<vslot>=I<NUMBER> =over 4 =item Description Specifies the virtual slot (device) number where the guest will see this device. For example, running L<lspci(1)> in a Linux guest where B<vslot> was specified as C<8> would identify the device as C<00:08.0>. Virtual domain and bus numbers are always 0. > > "Option to specify how to export PCI topology to guest:" " 0 - (default) Hide the true PCI topology and makes the frontend" " there is a single PCI bus with only the exported devices on it." " For example, a device at 03:05.0 will be re-assigned to 00:00.0" " while second device at 02:1a.1 will be re-assigned to 00:01.1." " 1 - Passthrough provides a real view of the PCI topology to the" " frontend (for example, a device at 06:01.b will still appear at" " 06:01.b to the frontend). This is similar to how Xen 2.0.x" " exposed PCI devices to its driver domains. This may be required" " for drivers which depend on finding their hardward in certain" " bus/slot locations."); > > Also, isn't strict dependency on BDF breaks the PCI specification? I > mean, of course, you can assign Function on random, but what about Bus > and Device parts? I am not well-versed in the PCI specification. However, what the spec says and what users do tend to be different :). I know a few cases where you need to specify the slot. I have mentioned one earlier with the Intel graphics cards. > I mean, we can make toolstack responsible of assigning BDFs, but this > might break existing x86 setups... It is not clear to me how letting the toolstack select the vslot would be a problem for HVM (see above). However, it is clear that you may break some x86 setups if you always allocate a "random" slot. Cheers,
On Thu, 16 Nov 2023, Volodymyr Babchuk wrote: > Hi Stefano, > > Stefano Stabellini <sstabellini@kernel.org> writes: > > > + Stewart, Vikram > > > > On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote: > >> On 15.11.23 14:33, Julien Grall wrote: > >> > Thanks for adding support for virtio-pci in Xen. I have some questions. > >> > > >> > On 15/11/2023 11:26, Sergiy Kibrik wrote: > >> >> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> > >> >> > >> >> In order to enable more use-cases such as having multiple > >> >> device-models (Qemu) running in different backend domains which provide > >> >> virtio-pci devices for the same guest, we allocate and expose one > >> >> PCI host bridge for every virtio backend domain for that guest. > >> > > >> > OOI, why do you need to expose one PCI host bridge for every stubdomain? > >> > > >> > In fact looking at the next patch, it seems you are handling some of the > >> > hostbridge request in Xen. This is adds a bit more confusion. > >> > > >> > I was expecting the virtual PCI device would be in the vPCI and each > >> > Device emulator would advertise which BDF they are covering. > >> > >> > >> This patch series only covers use-cases where the device emulator > >> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind > >> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI > >> pass-through resources, handling, accounting, nothing. From the > >> hypervisor we only need a help to intercept the config space accesses > >> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... > >> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and > >> forward them to the linked device emulator (if any), that's all. > >> > >> It is not possible (with current series) to run device emulators what > >> emulate only separate PCI (virtio-pci) devices. For it to be possible, I > >> think, much more changes are required than current patch series does. > >> There at least should be special PCI Host bridge emulation in Xen (or > >> reuse vPCI) for the integration. Also Xen should be in charge of forming > >> resulting PCI interrupt based on each PCI device level signaling (if we > >> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, > >> etc. Please note, I am not saying this is not possible in general, > >> likely it is possible, but initial patch series doesn't cover these > >> use-cases) > >> > >> We expose one PCI host bridge per virtio backend domain. This is a > >> separate PCI host bridge to combine all virtio-pci devices running in > >> the same backend domain (in the same device emulator currently). > >> The examples: > >> - if only one domain runs Qemu which servers virtio-blk, virtio-net, > >> virtio-console devices for DomU - only single PCI Host bridge will be > >> exposed for DomU > >> - if we add another domain to run Qemu to serve additionally virtio-gpu, > >> virtio-input and virtio-snd for the *same* DomU - we expose second PCI > >> Host bridge for DomU > >> > >> I am afraid, we cannot end up exposing only single PCI Host bridge with > >> current model (if we use device emulators running in different domains > >> that handles the *entire* PCI Host bridges), this won't work. > > > > > > We were discussing the topic of vPCI and Virtio PCI just this morning > > with Stewart and Vikram. We also intend to make them work well together > > in the next couple of months (great timing!!) 
> > > > However, our thinking is to go with the other approach Julien > > suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would > > register individual PCI devices against it. > > > > Vikram, Stewart, please comment. Our understanding is that it should be > > possible to make QEMU virtio-pci work with vPCI with relatively minor > > efforts and AMD volunteers to do the work in the next couple of months > > on the vPCI side. > > > > > > Although it should be possible to make both approaches work at the same > > time, given that it would seem that EPAM and AMD have very similar > > requirements, I suggest we work together and collaborate on a single > > approach going forward that works best for everyone. > > > > > > Let me start by saying that if we can get away with it, I think that a > > single PCI Root Complex in Xen would be best because it requires less > > complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only > > one? > > Well, in fact we tried similar setup, this was in the first version of > virtio-pci support. But we had a couple of issues with this. For > instance, this might conflict with PCI passthrough devices, with virtio > devices that have back-ends in different domains, etc. I am no saying > that this is impossible, but this just involves more moving parts. > > With my vPCI patch series in place, hypervisor itself assigns BDFs for > passed-through devices. Toolstack needs to get this information to know > which BDFs are free and can be used by virtio-pci. I'll premise that I don't really have an opinion on how the virtual BDF allocation should happen. But I'll ask the opposite question that Julien asked: if it is Xen that does the allocation, that's fine, then couldn't we arrange so that Xen also does the allocation in the toolstack case too (simply by picking the first available virtual BDF)? One way or the other it should be OK as long as we are consistent.
Hi Stefano, Stefano Stabellini <sstabellini@kernel.org> writes: > On Thu, 16 Nov 2023, Volodymyr Babchuk wrote: >> Hi Stefano, >> >> Stefano Stabellini <sstabellini@kernel.org> writes: >> >> > + Stewart, Vikram >> > >> > On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote: >> >> On 15.11.23 14:33, Julien Grall wrote: >> >> > Thanks for adding support for virtio-pci in Xen. I have some questions. >> >> > >> >> > On 15/11/2023 11:26, Sergiy Kibrik wrote: >> >> >> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >> >> >> >> >> >> In order to enable more use-cases such as having multiple >> >> >> device-models (Qemu) running in different backend domains which provide >> >> >> virtio-pci devices for the same guest, we allocate and expose one >> >> >> PCI host bridge for every virtio backend domain for that guest. >> >> > >> >> > OOI, why do you need to expose one PCI host bridge for every stubdomain? >> >> > >> >> > In fact looking at the next patch, it seems you are handling some of the >> >> > hostbridge request in Xen. This is adds a bit more confusion. >> >> > >> >> > I was expecting the virtual PCI device would be in the vPCI and each >> >> > Device emulator would advertise which BDF they are covering. >> >> >> >> >> >> This patch series only covers use-cases where the device emulator >> >> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind >> >> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI >> >> pass-through resources, handling, accounting, nothing. From the >> >> hypervisor we only need a help to intercept the config space accesses >> >> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... >> >> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and >> >> forward them to the linked device emulator (if any), that's all. >> >> >> >> It is not possible (with current series) to run device emulators what >> >> emulate only separate PCI (virtio-pci) devices. For it to be possible, I >> >> think, much more changes are required than current patch series does. >> >> There at least should be special PCI Host bridge emulation in Xen (or >> >> reuse vPCI) for the integration. Also Xen should be in charge of forming >> >> resulting PCI interrupt based on each PCI device level signaling (if we >> >> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, >> >> etc. Please note, I am not saying this is not possible in general, >> >> likely it is possible, but initial patch series doesn't cover these >> >> use-cases) >> >> >> >> We expose one PCI host bridge per virtio backend domain. This is a >> >> separate PCI host bridge to combine all virtio-pci devices running in >> >> the same backend domain (in the same device emulator currently). >> >> The examples: >> >> - if only one domain runs Qemu which servers virtio-blk, virtio-net, >> >> virtio-console devices for DomU - only single PCI Host bridge will be >> >> exposed for DomU >> >> - if we add another domain to run Qemu to serve additionally virtio-gpu, >> >> virtio-input and virtio-snd for the *same* DomU - we expose second PCI >> >> Host bridge for DomU >> >> >> >> I am afraid, we cannot end up exposing only single PCI Host bridge with >> >> current model (if we use device emulators running in different domains >> >> that handles the *entire* PCI Host bridges), this won't work. >> > >> > >> > We were discussing the topic of vPCI and Virtio PCI just this morning >> > with Stewart and Vikram. 
We also intend to make them work well together >> > in the next couple of months (great timing!!) >> > >> > However, our thinking is to go with the other approach Julien >> > suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would >> > register individual PCI devices against it. >> > >> > Vikram, Stewart, please comment. Our understanding is that it should be >> > possible to make QEMU virtio-pci work with vPCI with relatively minor >> > efforts and AMD volunteers to do the work in the next couple of months >> > on the vPCI side. >> > >> > >> > Although it should be possible to make both approaches work at the same >> > time, given that it would seem that EPAM and AMD have very similar >> > requirements, I suggest we work together and collaborate on a single >> > approach going forward that works best for everyone. >> > >> > >> > Let me start by saying that if we can get away with it, I think that a >> > single PCI Root Complex in Xen would be best because it requires less >> > complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only >> > one? >> >> Well, in fact we tried similar setup, this was in the first version of >> virtio-pci support. But we had a couple of issues with this. For >> instance, this might conflict with PCI passthrough devices, with virtio >> devices that have back-ends in different domains, etc. I am no saying >> that this is impossible, but this just involves more moving parts. >> >> With my vPCI patch series in place, hypervisor itself assigns BDFs for >> passed-through devices. Toolstack needs to get this information to know >> which BDFs are free and can be used by virtio-pci. > > I'll premise that I don't really have an opinion on how the virtual BDF > allocation should happen. > > But I'll ask the opposite question that Julien asked: if it is Xen that > does the allocation, that's fine, then couldn't we arrange so that Xen > also does the allocation in the toolstack case too (simply by picking > the first available virtual BDF)? Actually, this was my intention as well. As I said in the another email, we just need to extend or add another domctl to manage vBFDs.
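The domctl Volodymyr alludes to could carry something as small as the following. This is purely a hypothetical sketch (neither the name nor the fields exist in Xen today), only to make the discussion concrete: the toolstack either asks for the first free vBDF or reserves/releases a specific one, while vPCI draws from the same allocator for passthrough devices.

    #include <stdint.h>

    /* Hypothetical payload for a vBDF-management domctl -- not an existing
     * Xen interface, just an illustration of what it would need to convey. */
    struct xen_domctl_vbdf_manage {
    #define VBDF_OP_RESERVE_ANY    0   /* pick and return the first free vBDF */
    #define VBDF_OP_RESERVE_EXACT  1   /* reserve the vBDF given in sbdf */
    #define VBDF_OP_RELEASE        2   /* free a previously reserved vBDF */
        uint32_t op;     /* IN */
        uint32_t sbdf;   /* IN for RESERVE_EXACT/RELEASE, OUT for RESERVE_ANY */
    };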
On 11/15/23 18:28, Stefano Stabellini wrote: > + Stewart, Vikram > > On Wed, 15 Nov 2023, Oleksandr Tyshchenko wrote: >> On 15.11.23 14:33, Julien Grall wrote: >>> Thanks for adding support for virtio-pci in Xen. I have some questions. >>> >>> On 15/11/2023 11:26, Sergiy Kibrik wrote: >>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> >>>> >>>> In order to enable more use-cases such as having multiple >>>> device-models (Qemu) running in different backend domains which provide >>>> virtio-pci devices for the same guest, we allocate and expose one >>>> PCI host bridge for every virtio backend domain for that guest. >>> >>> OOI, why do you need to expose one PCI host bridge for every stubdomain? >>> >>> In fact looking at the next patch, it seems you are handling some of the >>> hostbridge request in Xen. This is adds a bit more confusion. >>> >>> I was expecting the virtual PCI device would be in the vPCI and each >>> Device emulator would advertise which BDF they are covering. >> >> >> This patch series only covers use-cases where the device emulator >> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind >> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI >> pass-through resources, handling, accounting, nothing. From the >> hypervisor we only need a help to intercept the config space accesses >> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... >> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and >> forward them to the linked device emulator (if any), that's all. >> >> It is not possible (with current series) to run device emulators what >> emulate only separate PCI (virtio-pci) devices. For it to be possible, I >> think, much more changes are required than current patch series does. >> There at least should be special PCI Host bridge emulation in Xen (or >> reuse vPCI) for the integration. Also Xen should be in charge of forming >> resulting PCI interrupt based on each PCI device level signaling (if we >> use legacy interrupts), some kind of x86's XEN_DMOP_set_pci_intx_level, >> etc. Please note, I am not saying this is not possible in general, >> likely it is possible, but initial patch series doesn't cover these >> use-cases) >> >> We expose one PCI host bridge per virtio backend domain. This is a >> separate PCI host bridge to combine all virtio-pci devices running in >> the same backend domain (in the same device emulator currently). >> The examples: >> - if only one domain runs Qemu which servers virtio-blk, virtio-net, >> virtio-console devices for DomU - only single PCI Host bridge will be >> exposed for DomU >> - if we add another domain to run Qemu to serve additionally virtio-gpu, >> virtio-input and virtio-snd for the *same* DomU - we expose second PCI >> Host bridge for DomU >> >> I am afraid, we cannot end up exposing only single PCI Host bridge with >> current model (if we use device emulators running in different domains >> that handles the *entire* PCI Host bridges), this won't work. > > > We were discussing the topic of vPCI and Virtio PCI just this morning > with Stewart and Vikram. We also intend to make them work well together > in the next couple of months (great timing!!) > > However, our thinking is to go with the other approach Julien > suggested: a single PCI Root Complex emulated in Xen by vPCI. QEMU would > register individual PCI devices against it. > > Vikram, Stewart, please comment. 
Our understanding is that it should be > possible to make QEMU virtio-pci work with vPCI with relatively minor > efforts and AMD volunteers to do the work in the next couple of months > on the vPCI side. > > > Although it should be possible to make both approaches work at the same > time, given that it would seem that EPAM and AMD have very similar > requirements, I suggest we work together and collaborate on a single > approach going forward that works best for everyone. > > > Let me start by saying that if we can get away with it, I think that a > single PCI Root Complex in Xen would be best because it requires less > complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only > one? > > Stewart, you are deep into vPCI, what's your thinking? First allow me explain the moving pieces in a bit more detail (skip ahead to "Back to the question: " if you don't want to be bored with the details). I played around with this series, and I passed through a PCI device (with vPCI) and enabled virtio-pci: virtio = [ "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ] device_model_args = [ "-device", "virtio-serial-pci" ] pci = [ "01:00.0" ] Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) from the domU point of view: pcie@10000000 { compatible = "pci-host-ecam-generic"; device_type = "pci"; reg = <0x00 0x10000000 0x00 0x10000000>; bus-range = <0x00 0xff>; #address-cells = <0x03>; #size-cells = <0x02>; status = "okay"; ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>; #interrupt-cells = <0x01>; interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>; interrupt-map-mask = <0x00 0x00 0x00 0x07>; }; pcie@33000000 { compatible = "pci-host-ecam-generic"; device_type = "pci"; reg = <0x00 0x33000000 0x00 0x200000>; bus-range = <0x00 0x01>; #address-cells = <0x03>; #size-cells = <0x02>; status = "okay"; ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>; dma-coherent; #interrupt-cells = <0x01>; interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>; interrupt-map-mask = <0x1800 0x00 0x00 0x07>; }; Xen vPCI doesn't currently expose a host bridge (i.e. a device with base class 0x06). As an aside, we may eventually want to expose a virtual/emulated host bridge in vPCI, because Linux's x86 PCI probe expects one [0]. Qemu exposes an emulated host bridge, along with any requested emulated devices. Running lspci -v in the domU yields the following: 0000:00:00.0 Network controller: Ralink corp. RT2790 Wireless 802.11n 1T/2R PCIe Subsystem: ASUSTeK Computer Inc. 
RT2790 Wireless 802.11n 1T/2R PCIe Flags: bus master, fast devsel, latency 0, IRQ 13 Memory at 23000000 (32-bit, non-prefetchable) [size=64K] Capabilities: [50] MSI: Enable- Count=1/128 Maskable- 64bit+ Kernel driver in use: rt2800pci 0001:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge Subsystem: Red Hat, Inc. QEMU PCIe Host bridge Flags: fast devsel 0001:00:01.0 Communication controller: Red Hat, Inc. Virtio console Subsystem: Red Hat, Inc. Virtio console Flags: bus master, fast devsel, latency 0, IRQ 14 Memory at 3a000000 (64-bit, prefetchable) [size=16K] Capabilities: [84] Vendor Specific Information: VirtIO: <unknown> Capabilities: [70] Vendor Specific Information: VirtIO: Notify Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg Capabilities: [50] Vendor Specific Information: VirtIO: ISR Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg Kernel driver in use: virtio-pci 0000:00:00.0 is a real passed through device (corresponding to 0000:01:00.0 in dom0). 0001:00:00.0 is the qemu host bridge (base class 0x06). They are on different segments because they are associated with different root complexes. Back to the question: Sure, avoiding reserving more memory from the preciously small lowmem virtual memory layout is probably a good idea. With everything in a single virtual root complex (segment), it's probably possible to come up with some vBDF-picking algorithm (+ user ability to specify) that works for most use cases as discussed elsewhere. It will always be in a single fixed segment as far as I can tell. Some more observations assuming a single virtual root complex: We should probably hide the qemu host bridge(s) from the guest. In other words, hide all devices with base class 0x06, except eventually vPCI's own virtual host bridge. If we don't hide them, we would likely end up with multiple emulated host bridges on a single root complex (segment). That sounds messy and hard to manage. We have a need to control the vBDF exposed to the guest - can we force qemu to use particular BDFs for its emulated devices? If not, we will need to do SBDF translation in vPCI for virtio-pci devices. Meaning virtio-pci would then be dependent on the vPCI series [1]. I think that's okay. Hmm - if a qemu SBDF clashes with a real hardware SBDF, is that an issue? vPCI and virtio-pci devices will be sharing a single ECAM range, so it looks like we must rely on vPCI to intercept the accesses that need to be forwarded to ioreq. Initially, it looks like we would be adding some sort of support for legacy interrupts in vPCI, but for virtio-pci devices only. Eventually, we may also want to consider virtio-pci with MSI/MSI-X, but it's probably okay to wait. [0] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/pci/direct.c?h=v6.6.1#n186 [1] https://lists.xenproject.org/archives/html/xen-devel/2023-10/msg00660.html
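To make the "single ECAM range shared by vPCI and the device models" idea concrete, here is a rough C sketch (assumed names, not actual vPCI code) of how a config-space read could be routed: decode the ECAM offset into a BDF, look the BDF up in a per-domain ownership table, and either handle it in vPCI, forward it to the registered emulator, or hide it (e.g. an emulator's own host bridge, base class 0x06), in which case the guest reads all-ones as if no device were present.

    #include <stdint.h>

    /* ECAM: 4 KiB of config space per function; bus/dev/fn live in offset[27:12]. */
    #define ECAM_BDF(off)  ((uint32_t)(((off) >> 12) & 0xffff)) /* bus[15:8] dev[7:3] fn[2:0] */
    #define ECAM_REG(off)  ((uint32_t)((off) & 0xfff))

    enum vdev_owner {
        OWNER_NONE,    /* nothing behind this BDF */
        OWNER_VPCI,    /* passthrough device handled by vPCI itself */
        OWNER_IOREQ,   /* virtio-pci device handled by a registered device model */
        OWNER_HIDDEN,  /* e.g. an emulator's own host bridge (base class 0x06) */
    };

    /* Per-domain ownership table, filled in as devices are assigned. */
    static enum vdev_owner owner[1 << 16];

    /* Stand-ins for the real handlers. */
    static uint32_t vpci_cfg_read(uint32_t bdf, uint32_t reg, unsigned int size)
    { (void)bdf; (void)reg; (void)size; return 0; }
    static uint32_t ioreq_cfg_read(uint32_t bdf, uint32_t reg, unsigned int size)
    { (void)bdf; (void)reg; (void)size; return 0; }

    uint32_t route_cfg_read(uint64_t ecam_off, unsigned int size)
    {
        uint32_t bdf = ECAM_BDF(ecam_off), reg = ECAM_REG(ecam_off);

        switch (owner[bdf]) {
        case OWNER_VPCI:  return vpci_cfg_read(bdf, reg, size);
        case OWNER_IOREQ: return ioreq_cfg_read(bdf, reg, size);
        default:          return 0xffffffff; /* hidden or absent */
        }
    }

Writes would follow the same dispatch, and the SBDF translation mentioned above would happen at the OWNER_IOREQ step if the emulator's internal BDF differs from the guest-visible one.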
On 17.11.23 05:31, Stewart Hildebrand wrote: Hello Stewart [answering only for virtio-pci bits as for vPCI I am only familiar with code responsible for trapping config space accesses] [snip] >> >> >> Let me start by saying that if we can get away with it, I think that a >> single PCI Root Complex in Xen would be best because it requires less >> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only >> one? >> >> Stewart, you are deep into vPCI, what's your thinking? > > First allow me explain the moving pieces in a bit more detail (skip ahead to "Back to the question: " if you don't want to be bored with the details). I played around with this series, and I passed through a PCI device (with vPCI) and enabled virtio-pci: > > virtio = [ "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ] > device_model_args = [ "-device", "virtio-serial-pci" ] > pci = [ "01:00.0" ] > > Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) from the domU point of view: > > pcie@10000000 { > compatible = "pci-host-ecam-generic"; > device_type = "pci"; > reg = <0x00 0x10000000 0x00 0x10000000>; > bus-range = <0x00 0xff>; > #address-cells = <0x03>; > #size-cells = <0x02>; > status = "okay"; > ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>; > #interrupt-cells = <0x01>; > interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>; > interrupt-map-mask = <0x00 0x00 0x00 0x07>; I am wondering how you got interrupt-map here? AFAIR upstream toolstack doesn't add that property for vpci dt node. > }; > > pcie@33000000 { > compatible = "pci-host-ecam-generic"; > device_type = "pci"; > reg = <0x00 0x33000000 0x00 0x200000>; > bus-range = <0x00 0x01>; > #address-cells = <0x03>; > #size-cells = <0x02>; > status = "okay"; > ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>; > dma-coherent; > #interrupt-cells = <0x01>; > interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>; > interrupt-map-mask = <0x1800 0x00 0x00 0x07>; that is correct dump. BTW, if you added "grant_usage=1" (it is disabled by default for dom0) to virtio configuration you would get iommu-map property here as well [1]. This is another point to think about when considering combined approach (single PCI Host bridge node -> single virtual root complex), I guess usual PCI device doesn't want grant based DMA addresses, correct? If so, it shouldn't be specified in the property. > }; > > Xen vPCI doesn't currently expose a host bridge (i.e. a device with base class 0x06). As an aside, we may eventually want to expose a virtual/emulated host bridge in vPCI, because Linux's x86 PCI probe expects one [0]. > > Qemu exposes an emulated host bridge, along with any requested emulated devices. 
> > Running lspci -v in the domU yields the following: > > 0000:00:00.0 Network controller: Ralink corp. RT2790 Wireless 802.11n 1T/2R PCIe > Subsystem: ASUSTeK Computer Inc. RT2790 Wireless 802.11n 1T/2R PCIe > Flags: bus master, fast devsel, latency 0, IRQ 13 > Memory at 23000000 (32-bit, non-prefetchable) [size=64K] > Capabilities: [50] MSI: Enable- Count=1/128 Maskable- 64bit+ > Kernel driver in use: rt2800pci > > 0001:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge > Subsystem: Red Hat, Inc. QEMU PCIe Host bridge > Flags: fast devsel > > 0001:00:01.0 Communication controller: Red Hat, Inc. Virtio console > Subsystem: Red Hat, Inc. Virtio console > Flags: bus master, fast devsel, latency 0, IRQ 14 > Memory at 3a000000 (64-bit, prefetchable) [size=16K] > Capabilities: [84] Vendor Specific Information: VirtIO: <unknown> > Capabilities: [70] Vendor Specific Information: VirtIO: Notify > Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg > Capabilities: [50] Vendor Specific Information: VirtIO: ISR > Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg > Kernel driver in use: virtio-pci > > 0000:00:00.0 is a real passed through device (corresponding to 0000:01:00.0 in dom0). > 0001:00:00.0 is the qemu host bridge (base class 0x06). > They are on different segments because they are associated with different root complexes. Glad to hear this patch series doesn't seem to break PCI passthrough in your environment. > > > Back to the question: Sure, avoiding reserving more memory from the preciously small lowmem virtual memory layout is probably a good idea. With everything in a single virtual root complex (segment), it's probably possible to come up with some vBDF-picking algorithm (+ user ability to specify) that works for most use cases as discussed elsewhere. It will always be in a single fixed segment as far as I can tell. > > Some more observations assuming a single virtual root complex: > > We should probably hide the qemu host bridge(s) from the guest. In other words, hide all devices with base class 0x06, except eventually vPCI's own virtual host bridge. If we don't hide them, we would likely end up with multiple emulated host bridges on a single root complex (segment). That sounds messy and hard to manage. > > We have a need to control the vBDF exposed to the guest - can we force qemu to use particular BDFs for its emulated devices? Yes, it is possible. Maybe there is a better way, but at least *bus* and *addr* can be specified and Qemu indeed follows that. device_model_args=[ '-device', 'virtio-blk-pci,scsi=off,disable-legacy=on,iommu_platform=on,bus=pcie.0,addr=2,drive=image', '-drive', 'if=none,id=image,format=raw,file=/dev/mmcblk1p3' ] virtio=[ "backend=Domain-0, type=virtio,device, transport=pci, bdf=0000:00:02.0, grant_usage=1, backend_type=qemu" ] root@h3ulcb-domd:~# dmesg | grep virtio [ 0.660789] virtio-pci 0000:00:02.0: enabling device (0000 -> 0002) [ 0.715876] virtio_blk virtio0: [vda] 4096 512-byte logical blocks (2.10 MB/2.00 MiB) root@h3ulcb-domd:~# lspci 00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge 00:02.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01) Also there is one moment for current series: bdf specified for virtio-pci device only makes sense for iommu-map property. So bdf=0000:00:02.0 in virtio property and bus=pcie.0,addr=2 in device_model_args property should be in sync. [1] https://patchwork.kernel.org/project/xen-devel/patch/20231115112611.3865905-5-Sergiy_Kibrik@epam.com/ [snip]
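Since the bdf= in the virtio property and the bus=/addr= passed to QEMU currently have to be kept in sync by hand, a toolstack-side consistency check is cheap. A minimal sketch (hypothetical helper, plain C):

    #include <stdio.h>

    /* Parse "SSSS:BB:DD.F" and check the device number matches QEMU's addr=<dev>. */
    static int bdf_matches_qemu_addr(const char *bdf, unsigned int qemu_addr)
    {
        unsigned int seg, bus, dev, fn;

        if (sscanf(bdf, "%4x:%2x:%2x.%1x", &seg, &bus, &dev, &fn) != 4)
            return 0;
        return bus == 0 && dev == qemu_addr;   /* bus=pcie.0 implies virtual bus 0 */
    }

    int main(void)
    {
        /* Matches the example above: bdf=0000:00:02.0 with bus=pcie.0,addr=2. */
        printf("%s\n", bdf_matches_qemu_addr("0000:00:02.0", 2) ? "in sync" : "mismatch");
        return 0;
    }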
hi Julien, Oleksandr, [..] >> >> >> This patch series only covers use-cases where the device emulator >> handles the *entire* PCI Host bridge and PCI (virtio-pci) devices behind >> it (i.e. Qemu). Also this patch series doesn't touch vPCI/PCI >> pass-through resources, handling, accounting, nothing. > > I understood you want to one Device Emulator to handle the entire PCI > host bridge. But... > > From the >> hypervisor we only need a help to intercept the config space accesses >> happen in a range [GUEST_VIRTIO_PCI_ECAM_BASE ... >> GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE] and >> forward them to the linked device emulator (if any), that's all. > > ... I really don't see why you need to add code in Xen to trap the > region. If QEMU is dealing with the hostbridge, then it should be able > to register the MMIO region and then do the translation. > >> [..] >> >> I am afraid, we cannot end up exposing only single PCI Host bridge with >> current model (if we use device emulators running in different domains >> that handles the *entire* PCI Host bridges), this won't work. > > That makes sense and it is fine. But see above, I think only the #2 is > necessary for the hypervisor. Patch #5 should not be necessary at all. > > [...] I did checks w/o patch #5 and can confirm that indeed -- qemu & xen can do this work without additional modifications to qemu code. So I'll drop this patch from this series. [..] >>>> +/* >>>> + * 16 MB is reserved for virtio-pci configuration space based on >>>> calculation >>>> + * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB >>> >>> Can you explain how youd ecided the "2"? >> >> good question, we have a limited free space available in memory layout >> (we had difficulties to find a suitable holes) also we don't expect a >> lot of virtio-pci devices, so "256" used vPCI would be too much. It was >> decided to reduce significantly, but select maximum to fit into free >> space, with having "2" buses we still fit into the chosen holes. > > If you don't expect a lot of virtio devices, then why do you need two > buses? Wouldn't one be sufficient? > one should be reasonably sufficient, I agree >> >> >>> >>>> + */ >>>> +#define GUEST_VIRTIO_PCI_ECAM_BASE xen_mk_ullong(0x33000000) >>>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE xen_mk_ullong(0x01000000) >>>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE xen_mk_ullong(0x00200000) >>>> + >>>> +/* 64 MB is reserved for virtio-pci memory */ >>>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM xen_mk_ullong(0x02000000) >>>> +#define GUEST_VIRTIO_PCI_MEM_ADDR xen_mk_ullong(0x34000000) >>>> +#define GUEST_VIRTIO_PCI_MEM_SIZE xen_mk_ullong(0x04000000) >>>> + >>>> /* >>>> * 16MB == 4096 pages reserved for guest to use as a region to >>>> map its >>>> * grant table in. >>>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t; >>>> #define GUEST_MAGIC_BASE xen_mk_ullong(0x39000000) >>>> #define GUEST_MAGIC_SIZE xen_mk_ullong(0x01000000) >>>> +/* 64 MB is reserved for virtio-pci Prefetch memory */ >>> >>> This doesn't seem a lot depending on your use case. Can you details how >>> you can up with "64 MB"? >> >> the same calculation as it was done configuration space. It was observed >> that only 16K is used per virtio-pci device (maybe it can be bigger for >> usual PCI device, I don't know). Please look at the example of DomU log >> below (to strings that contain "*BAR 4: assigned*"): > > What about virtio-gpu? I would expect a bit more memory is necessary for > that use case. 
> > Any case, what I am looking for is for some explanation in the commit > message of the limits. I don't particularly care about the exact limit > because this is not part of a stable ABI. Sure, I'll put a bit more explanation in both the comment and the commit message. Should I post an updated patch series, with updated resources and without patch #5, or shall we wait for some more comments here? -- regards, Sergiy
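For reference, the sizing discussed above works out as follows. The arithmetic only uses the constants from this patch plus the observed ~16 KiB of BAR space per virtio-pci device quoted earlier (an observation, not a hard limit); the 8 MiB-per-bridge split matches the 0x800000 ranges visible in the domU device tree dump earlier in the thread.

    #include <stdio.h>

    #define KiB(x) ((unsigned long long)(x) << 10)
    #define MiB(x) ((unsigned long long)(x) << 20)

    int main(void)
    {
        /* ECAM: 8 host bridges x 2 buses x 32 devices x 8 functions x 4 KiB. */
        unsigned long long ecam_per_host = 2ULL * 32 * 8 * KiB(4);
        unsigned long long ecam_total    = 8 * ecam_per_host;

        /* 64 MiB non-prefetchable and 64 MiB prefetchable, split across 8 bridges. */
        unsigned long long mem_per_host  = MiB(64) / 8;

        printf("ECAM per host bridge: %llu MiB (matches 0x00200000)\n", ecam_per_host >> 20);
        printf("ECAM total:           %llu MiB (matches 0x01000000)\n", ecam_total >> 20);
        printf("MMIO per host bridge: %llu MiB -> ~%llu devices at 16 KiB of BARs each\n",
               mem_per_host >> 20, mem_per_host / KiB(16));
        return 0;
    }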
Hi Sergiy, On 17/11/2023 13:19, Sergiy Kibrik wrote: >>>>> + */ >>>>> +#define GUEST_VIRTIO_PCI_ECAM_BASE xen_mk_ullong(0x33000000) >>>>> +#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE xen_mk_ullong(0x01000000) >>>>> +#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE xen_mk_ullong(0x00200000) >>>>> + >>>>> +/* 64 MB is reserved for virtio-pci memory */ >>>>> +#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM xen_mk_ullong(0x02000000) >>>>> +#define GUEST_VIRTIO_PCI_MEM_ADDR xen_mk_ullong(0x34000000) >>>>> +#define GUEST_VIRTIO_PCI_MEM_SIZE xen_mk_ullong(0x04000000) >>>>> + >>>>> /* >>>>> * 16MB == 4096 pages reserved for guest to use as a region to >>>>> map its >>>>> * grant table in. >>>>> @@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t; >>>>> #define GUEST_MAGIC_BASE xen_mk_ullong(0x39000000) >>>>> #define GUEST_MAGIC_SIZE xen_mk_ullong(0x01000000) >>>>> +/* 64 MB is reserved for virtio-pci Prefetch memory */ >>>> >>>> This doesn't seem a lot depending on your use case. Can you details how >>>> you can up with "64 MB"? >>> >>> the same calculation as it was done configuration space. It was observed >>> that only 16K is used per virtio-pci device (maybe it can be bigger for >>> usual PCI device, I don't know). Please look at the example of DomU log >>> below (to strings that contain "*BAR 4: assigned*"): >> >> What about virtio-gpu? I would expect a bit more memory is necessary >> for that use case. >> >> Any case, what I am looking for is for some explanation in the commit >> message of the limits. I don't particularly care about the exact limit >> because this is not part of a stable ABI. > > sure, I'll put a bit more explanation in both comment and commit > message. Should I post updated patch series, with updated resources and > without patch #5, or shall we wait for some more comments here? I would wait for comments before posting in particular if you haven't yet received any comment on the tools side. Cheers,
On 11/17/23 03:11, Oleksandr Tyshchenko wrote: > > > On 17.11.23 05:31, Stewart Hildebrand wrote: > > Hello Stewart > > [answering only for virtio-pci bits as for vPCI I am only familiar with > code responsible for trapping config space accesses] > > [snip] > >>> >>> >>> Let me start by saying that if we can get away with it, I think that a >>> single PCI Root Complex in Xen would be best because it requires less >>> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only >>> one? >>> >>> Stewart, you are deep into vPCI, what's your thinking? >> >> First allow me explain the moving pieces in a bit more detail (skip ahead to "Back to the question: " if you don't want to be bored with the details). I played around with this series, and I passed through a PCI device (with vPCI) and enabled virtio-pci: >> >> virtio = [ "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ] >> device_model_args = [ "-device", "virtio-serial-pci" ] >> pci = [ "01:00.0" ] >> >> Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) from the domU point of view: >> >> pcie@10000000 { >> compatible = "pci-host-ecam-generic"; >> device_type = "pci"; >> reg = <0x00 0x10000000 0x00 0x10000000>; >> bus-range = <0x00 0xff>; >> #address-cells = <0x03>; >> #size-cells = <0x02>; >> status = "okay"; >> ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>; >> #interrupt-cells = <0x01>; >> interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>; >> interrupt-map-mask = <0x00 0x00 0x00 0x07>; > > > I am wondering how you got interrupt-map here? AFAIR upstream toolstack > doesn't add that property for vpci dt node. You are correct. I'm airing my dirty laundry here: this is a hack to get a legacy interrupt passed through on a ZCU102 board (Zynq UltraScale+), which doesn't have GICv3/ITS. In a system with a GICv3/ITS, we would have an msi-map property (this is also not yet upstream, but is in a couple of downstreams). > >> }; >> >> pcie@33000000 { >> compatible = "pci-host-ecam-generic"; >> device_type = "pci"; >> reg = <0x00 0x33000000 0x00 0x200000>; >> bus-range = <0x00 0x01>; >> #address-cells = <0x03>; >> #size-cells = <0x02>; >> status = "okay"; >> ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>; >> dma-coherent; >> #interrupt-cells = <0x01>; >> interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>; >> interrupt-map-mask = <0x1800 0x00 0x00 0x07>; > > > that is correct dump. > > BTW, if you added "grant_usage=1" (it is disabled by default for dom0) > to virtio configuration you would get iommu-map property here as well > [1]. 
This is another point to think about when considering combined > approach (single PCI Host bridge node -> single virtual root complex), I > guess usual PCI device doesn't want grant based DMA addresses, correct? > If so, it shouldn't be specified in the property. Right, a usual PCI device doesn't want/need grant based DMA addresses. The iommu-map property [2] is flexible enough to make it apply only to certain vBDFs or ranges of vBDFs. Actually, it looks like ("libxl/arm: Reuse generic PCI-IOMMU bindings for virtio-pci devices") already has the logic for doing exactly this. So it should be fine to have the iommu-map property in the single root complex and still do regular PCI passthrough. Presumably, we would also want msi-map [3] to apply to some vBDFs, not others. msi-map has the same flexible BDF matching as iommu-map (these two bindings actually share the same parsing function), so we should be able to use a similar strategy here if needed. We would also want interrupt-map [4] to apply to some vBDFs, not others. The BDF matching with interrupt-map behaves slightly differently, but I believe it is still flexible enough to achieve this. In this case, the function create_virtio_pci_irq_map(), added in ("libxl/arm: Add basic virtio-pci support"), would need some re-thinking. In summary, with a single virtual root complex, the toolstack needs to know which vBDFs are virtio-pci, and which vBDFs are passthrough, so it can create the [iommu,msi,interrupt]-map properties accordingly. [2] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-iommu.txt [3] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-msi.txt [4] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/host-generic-pci.yaml > > >> }; >> >> Xen vPCI doesn't currently expose a host bridge (i.e. a device with base class 0x06). As an aside, we may eventually want to expose a virtual/emulated host bridge in vPCI, because Linux's x86 PCI probe expects one [0]. >> >> Qemu exposes an emulated host bridge, along with any requested emulated devices. >> >> Running lspci -v in the domU yields the following: >> >> 0000:00:00.0 Network controller: Ralink corp. RT2790 Wireless 802.11n 1T/2R PCIe >> Subsystem: ASUSTeK Computer Inc. RT2790 Wireless 802.11n 1T/2R PCIe >> Flags: bus master, fast devsel, latency 0, IRQ 13 >> Memory at 23000000 (32-bit, non-prefetchable) [size=64K] >> Capabilities: [50] MSI: Enable- Count=1/128 Maskable- 64bit+ >> Kernel driver in use: rt2800pci >> >> 0001:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge >> Subsystem: Red Hat, Inc. QEMU PCIe Host bridge >> Flags: fast devsel >> >> 0001:00:01.0 Communication controller: Red Hat, Inc. Virtio console >> Subsystem: Red Hat, Inc. Virtio console >> Flags: bus master, fast devsel, latency 0, IRQ 14 >> Memory at 3a000000 (64-bit, prefetchable) [size=16K] >> Capabilities: [84] Vendor Specific Information: VirtIO: <unknown> >> Capabilities: [70] Vendor Specific Information: VirtIO: Notify >> Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg >> Capabilities: [50] Vendor Specific Information: VirtIO: ISR >> Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg >> Kernel driver in use: virtio-pci >> >> 0000:00:00.0 is a real passed through device (corresponding to 0000:01:00.0 in dom0). >> 0001:00:00.0 is the qemu host bridge (base class 0x06). >> They are on different segments because they are associated with different root complexes. 
> > > Glad to hear this patch series doesn't seem to break PCI passthrough in > your environment. > > >> >> >> Back to the question: Sure, avoiding reserving more memory from the preciously small lowmem virtual memory layout is probably a good idea. With everything in a single virtual root complex (segment), it's probably possible to come up with some vBDF-picking algorithm (+ user ability to specify) that works for most use cases as discussed elsewhere. It will always be in a single fixed segment as far as I can tell. >> >> Some more observations assuming a single virtual root complex: >> >> We should probably hide the qemu host bridge(s) from the guest. In other words, hide all devices with base class 0x06, except eventually vPCI's own virtual host bridge. If we don't hide them, we would likely end up with multiple emulated host bridges on a single root complex (segment). That sounds messy and hard to manage. >> >> We have a need to control the vBDF exposed to the guest - can we force qemu to use particular BDFs for its emulated devices? > > > Yes, it is possible. Maybe there is a better way, but at > least *bus* and *addr* can be specified and Qemu indeed follows that. > > device_model_args=[ '-device', > 'virtio-blk-pci,scsi=off,disable-legacy=on,iommu_platform=on,bus=pcie.0,addr=2,drive=image', > '-drive', 'if=none,id=image,format=raw,file=/dev/mmcblk1p3' ] > > virtio=[ "backend=Domain-0, type=virtio,device, transport=pci, > bdf=0000:00:02.0, grant_usage=1, backend_type=qemu" ] > > root@h3ulcb-domd:~# dmesg | grep virtio > [ 0.660789] virtio-pci 0000:00:02.0: enabling device (0000 -> 0002) > [ 0.715876] virtio_blk virtio0: [vda] 4096 512-byte logical blocks > (2.10 MB/2.00 MiB) > > root@h3ulcb-domd:~# lspci > 00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge > 00:02.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01) > > Also there is one moment for current series: bdf specified for > virtio-pci device only makes sense for iommu-map property. So > bdf=0000:00:02.0 in virtio property and bus=pcie.0,addr=2 in > device_model_args property should be in sync. > > [1] > https://patchwork.kernel.org/project/xen-devel/patch/20231115112611.3865905-5-Sergiy_Kibrik@epam.com/ > > > [snip]
On Tue, 21 Nov 2023, Stewart Hildebrand wrote: > On 11/17/23 03:11, Oleksandr Tyshchenko wrote: > > > > > > On 17.11.23 05:31, Stewart Hildebrand wrote: > > > > Hello Stewart > > > > [answering only for virtio-pci bits as for vPCI I am only familiar with > > code responsible for trapping config space accesses] > > > > [snip] > > > >>> > >>> > >>> Let me start by saying that if we can get away with it, I think that a > >>> single PCI Root Complex in Xen would be best because it requires less > >>> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only > >>> one? > >>> > >>> Stewart, you are deep into vPCI, what's your thinking? > >> > >> First allow me explain the moving pieces in a bit more detail (skip ahead to "Back to the question: " if you don't want to be bored with the details). I played around with this series, and I passed through a PCI device (with vPCI) and enabled virtio-pci: > >> > >> virtio = [ "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ] > >> device_model_args = [ "-device", "virtio-serial-pci" ] > >> pci = [ "01:00.0" ] > >> > >> Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) from the domU point of view: > >> > >> pcie@10000000 { > >> compatible = "pci-host-ecam-generic"; > >> device_type = "pci"; > >> reg = <0x00 0x10000000 0x00 0x10000000>; > >> bus-range = <0x00 0xff>; > >> #address-cells = <0x03>; > >> #size-cells = <0x02>; > >> status = "okay"; > >> ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>; > >> #interrupt-cells = <0x01>; > >> interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>; > >> interrupt-map-mask = <0x00 0x00 0x00 0x07>; > > > > > > I am wondering how you got interrupt-map here? AFAIR upstream toolstack > > doesn't add that property for vpci dt node. > > You are correct. I'm airing my dirty laundry here: this is a hack to get a legacy interrupt passed through on a ZCU102 board (Zynq UltraScale+), which doesn't have GICv3/ITS. In a system with a GICv3/ITS, we would have an msi-map property (this is also not yet upstream, but is in a couple of downstreams). > > > > >> }; > >> > >> pcie@33000000 { > >> compatible = "pci-host-ecam-generic"; > >> device_type = "pci"; > >> reg = <0x00 0x33000000 0x00 0x200000>; > >> bus-range = <0x00 0x01>; > >> #address-cells = <0x03>; > >> #size-cells = <0x02>; > >> status = "okay"; > >> ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>; > >> dma-coherent; > >> #interrupt-cells = <0x01>; > >> interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>; > >> interrupt-map-mask = <0x1800 0x00 0x00 0x07>; > > > > > > that is correct dump. 
> > > > BTW, if you added "grant_usage=1" (it is disabled by default for dom0) > > to virtio configuration you would get iommu-map property here as well > > [1]. This is another point to think about when considering combined > > approach (single PCI Host bridge node -> single virtual root complex), I > > guess usual PCI device doesn't want grant based DMA addresses, correct? > > If so, it shouldn't be specified in the property. > > Right, a usual PCI device doesn't want/need grant based DMA addresses. The iommu-map property [2] is flexible enough to make it apply only to certain vBDFs or ranges of vBDFs. Actually, it looks like ("libxl/arm: Reuse generic PCI-IOMMU bindings for virtio-pci devices") already has the logic for doing exactly this. So it should be fine to have the iommu-map property in the single root complex and still do regular PCI passthrough. > > Presumably, we would also want msi-map [3] to apply to some vBDFs, not others. msi-map has the same flexible BDF matching as iommu-map (these two bindings actually share the same parsing function), so we should be able to use a similar strategy here if needed. > > We would also want interrupt-map [4] to apply to some vBDFs, not others. The BDF matching with interrupt-map behaves slightly differently, but I believe it is still flexible enough to achieve this. In this case, the function create_virtio_pci_irq_map(), added in ("libxl/arm: Add basic virtio-pci support"), would need some re-thinking. > > In summary, with a single virtual root complex, the toolstack needs to know which vBDFs are virtio-pci, and which vBDFs are passthrough, so it can create the [iommu,msi,interrupt]-map properties accordingly. That should be fine given that the toolstack also knows all the PCI Passthrough devices too.
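The [iommu,msi]-map matching discussed above is the standard RID-mapping scheme from the pci-iommu.txt / pci-msi.txt bindings; in C it boils down to the lookup below (a generic sketch of the binding's rule, not Linux or Xen code). RIDs that fall outside every row simply get no translation, which is exactly how a single map on one root complex can cover the virtio-pci vBDFs while leaving the passthrough ones alone.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    /* One (rid-base, output-base, length) row of an iommu-map / msi-map property.
     * The output is an IOMMU specifier (e.g. a grant-DMA stream ID) or an MSI
     * specifier, depending on which property the row comes from. */
    struct rid_map_entry {
        uint32_t rid_base;
        uint32_t out_base;
        uint32_t length;
    };

    bool rid_map_lookup(uint32_t rid, uint32_t map_mask,
                        const struct rid_map_entry *map, size_t n,
                        uint32_t *out)
    {
        uint32_t masked = rid & map_mask;

        for (size_t i = 0; i < n; i++) {
            if (masked >= map[i].rid_base &&
                masked < map[i].rid_base + map[i].length) {
                *out = map[i].out_base + (masked - map[i].rid_base);
                return true;
            }
        }
        return false;   /* RID not covered: no IOMMU/MSI specifier for it */
    }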
diff --git a/xen/include/public/arch-arm.h b/xen/include/public/arch-arm.h
index a25e87dbda..e6c9cd5335 100644
--- a/xen/include/public/arch-arm.h
+++ b/xen/include/public/arch-arm.h
@@ -466,6 +466,19 @@ typedef uint64_t xen_callback_t;
 #define GUEST_VPCI_MEM_ADDR xen_mk_ullong(0x23000000)
 #define GUEST_VPCI_MEM_SIZE xen_mk_ullong(0x10000000)
 
+/*
+ * 16 MB is reserved for virtio-pci configuration space based on calculation
+ * 8 bridges * 2 buses x 32 devices x 8 functions x 4 KB = 16 MB
+ */
+#define GUEST_VIRTIO_PCI_ECAM_BASE xen_mk_ullong(0x33000000)
+#define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE xen_mk_ullong(0x01000000)
+#define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE xen_mk_ullong(0x00200000)
+
+/* 64 MB is reserved for virtio-pci memory */
+#define GUEST_VIRTIO_PCI_ADDR_TYPE_MEM xen_mk_ullong(0x02000000)
+#define GUEST_VIRTIO_PCI_MEM_ADDR xen_mk_ullong(0x34000000)
+#define GUEST_VIRTIO_PCI_MEM_SIZE xen_mk_ullong(0x04000000)
+
 /*
  * 16MB == 4096 pages reserved for guest to use as a region to map its
  * grant table in.
@@ -476,6 +489,11 @@ typedef uint64_t xen_callback_t;
 #define GUEST_MAGIC_BASE xen_mk_ullong(0x39000000)
 #define GUEST_MAGIC_SIZE xen_mk_ullong(0x01000000)
 
+/* 64 MB is reserved for virtio-pci Prefetch memory */
+#define GUEST_VIRTIO_PCI_ADDR_TYPE_PREFETCH_MEM xen_mk_ullong(0x42000000)
+#define GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR xen_mk_ullong(0x3a000000)
+#define GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE xen_mk_ullong(0x04000000)
+
 #define GUEST_RAM_BANKS 2
 
 /*
@@ -515,6 +533,9 @@ typedef uint64_t xen_callback_t;
 #define GUEST_VIRTIO_MMIO_SPI_FIRST 33
 #define GUEST_VIRTIO_MMIO_SPI_LAST 43
 
+#define GUEST_VIRTIO_PCI_SPI_FIRST 44
+#define GUEST_VIRTIO_PCI_SPI_LAST 76
+
 /* PSCI functions */
 #define PSCI_cpu_suspend 0
 #define PSCI_cpu_off 1
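One way to document (and mechanically check) that the newly reserved regions fit between the existing ones is a handful of compile-time assertions over the constants above. A sketch using the values from this patch; the only assumption is that the 16 MiB grant-table window mentioned in the header comment occupies the space immediately below GUEST_MAGIC_BASE:

    #include <assert.h>

    #define xen_mk_ullong(x) x##ULL

    #define GUEST_VIRTIO_PCI_ECAM_BASE          xen_mk_ullong(0x33000000)
    #define GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE    xen_mk_ullong(0x01000000)
    #define GUEST_VIRTIO_PCI_HOST_ECAM_SIZE     xen_mk_ullong(0x00200000)
    #define GUEST_VIRTIO_PCI_MEM_ADDR           xen_mk_ullong(0x34000000)
    #define GUEST_VIRTIO_PCI_MEM_SIZE           xen_mk_ullong(0x04000000)
    #define GUEST_MAGIC_BASE                    xen_mk_ullong(0x39000000)
    #define GUEST_MAGIC_SIZE                    xen_mk_ullong(0x01000000)
    #define GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR  xen_mk_ullong(0x3a000000)
    #define GUEST_VIRTIO_PCI_PREFETCH_MEM_SIZE  xen_mk_ullong(0x04000000)

    /* 8 host bridges fill the ECAM window exactly. */
    static_assert(8 * GUEST_VIRTIO_PCI_HOST_ECAM_SIZE ==
                  GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE, "ECAM split");
    /* The ECAM window is immediately followed by the non-prefetchable window. */
    static_assert(GUEST_VIRTIO_PCI_ECAM_BASE + GUEST_VIRTIO_PCI_TOTAL_ECAM_SIZE ==
                  GUEST_VIRTIO_PCI_MEM_ADDR, "ECAM/MEM adjacency");
    /* ...which stops below the (assumed) 16 MiB grant-table window at 0x38000000. */
    static_assert(GUEST_VIRTIO_PCI_MEM_ADDR + GUEST_VIRTIO_PCI_MEM_SIZE <=
                  GUEST_MAGIC_BASE - xen_mk_ullong(0x01000000), "MEM overlap");
    /* The prefetchable window starts right after the magic pages. */
    static_assert(GUEST_MAGIC_BASE + GUEST_MAGIC_SIZE ==
                  GUEST_VIRTIO_PCI_PREFETCH_MEM_ADDR, "PREFETCH adjacency");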