Message ID: 20220225153956.1078-3-roger.pau@citrix.com (mailing list archive)
State: New, archived
Series: vpci/msix: fix PBA accesses
I think there is an issue in the spin_lock handling of patch 2 for the
"msix_write" function, as it results in the lock being taken a second time
while held (hangs).

The lock taken before checking "VMSIX_ADDR_IN_RANGE" isn't unlocked for the
non-PBA case, and a second lock is attempted just before the call to
get_entry() later in the same function. It looks like either the added lock
should be moved inside the PBA case or the lock before get_entry() should be
removed.

On my server, upon loading the ioatdma driver, it now successfully attempts a
PBA write (which no longer crashes the system), but I'm not sure I have a way
to fully exercise it...

I also see a different (related) issue in which modify_bars is called on a
virtual function seemingly before the BAR addresses are initialized/known,
and will start a different thread for that topic.

Regards,

-Alex

On Fri, 2022-02-25 at 16:39 +0100, Roger Pau Monne wrote:
> Map the PBA in order to access it from the MSI-X read and write
> handlers. Note that previously the handlers would pass the physical
> host address into the {read,write}{l,q} handlers, which is wrong as
> those expect a linear address.
>
> Map the PBA using ioremap when the first access is performed. Note
> that 32bit arches might want to abstract the call to ioremap into a
> vPCI arch handler, so they can use a fixmap range to map the PBA.
>
> Reported-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Alex Olson <this.is.a0lson@gmail.com>
> ---
> Changes since v1:
>  - Also handle writes.
>
> I don't seem to have a box with a driver that will try to access the
> PBA, so I would consider this specific code path only build tested. At
> least it doesn't seem to regress the current state of vPCI.
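The hang described above can be modeled outside of Xen. The sketch below is illustrative only: `toy_lock` is a hypothetical stand-in for Xen's non-recursive `spinlock_t` (a real `spin_lock()` on an already-held lock spins forever instead of returning), and the two `msix_write_*` functions model only the control flow of the v2 and v2.1 patches, not the real handlers.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock; returns failure instead of spinning forever. */
struct toy_lock {
    bool held;
};

static bool toy_lock_acquire(struct toy_lock *l)
{
    if (l->held)
        return false;   /* a real spin_lock() would spin here: deadlock */
    l->held = true;
    return true;
}

static void toy_lock_release(struct toy_lock *l)
{
    l->held = false;
}

/* Buggy v2 flow: lock at entry, and the old lock before get_entry() remains. */
static bool msix_write_v2(struct toy_lock *vpci_lock, bool is_pba)
{
    if (!toy_lock_acquire(vpci_lock))
        return false;
    if (is_pba) {
        /* PBA path releases the lock correctly in v2. */
        toy_lock_release(vpci_lock);
        return true;
    }
    /* Non-PBA path: second acquisition of the same held lock. */
    if (!toy_lock_acquire(vpci_lock))
        return false;           /* models the hang in the real code */
    toy_lock_release(vpci_lock);
    return true;
}

/* Fixed v2.1 flow: one acquisition, released on every exit path. */
static bool msix_write_v2_1(struct toy_lock *vpci_lock, bool is_pba)
{
    if (!toy_lock_acquire(vpci_lock))
        return false;
    /* ... PBA or table-entry handling would go here ... */
    (void)is_pba;
    toy_lock_release(vpci_lock);
    return true;
}
```

The model makes the failure mode concrete: only the non-PBA write path attempts the double acquisition, which matches the report that table writes (not PBA writes) hang with v2.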
On Fri, Feb 25, 2022 at 11:57:05AM -0600, Alex Olson wrote:
> I think there is an issue in the spin_lock handling of patch 2 for the
> "msix_write" function as it results in the lock being taken a second time
> while held (hangs).
>
> The lock taken before checking "VMSIX_ADDR_IN_RANGE" isn't unlocked for the
> non-PBA case and a second lock is attempted just before the call to
> get_entry() later in the same function. It looks like either the added lock
> should be moved inside the PBA case or the lock before get_entry() should
> be removed.

Sorry, I was in a rush to send this before leaving yesterday and didn't
refresh the commit before generating the patch; v2.1 should be fixed.

Could you provide a 'Tested-by' if it works for you?

> On my server, upon loading the ioatdma driver, it now successfully attempts
> a PBA write (which now doesn't crash the system), but I'm not sure I have a
> way to fully exercise it...

Urg, that's weird: the PBA should be read-only according to the spec.
Writes to the PBA have undefined behavior.

> I also see a different (related) issue in which modify_bars is called on a
> virtual function seemingly before the BAR addresses are initialized/known
> and will start a different thread for that topic.

SR-IOV is not supported on PVH dom0 yet, so that's not going to work.
I posted a series in 2018 to enable it, but sadly have had no time to
work on it anymore:

https://lore.kernel.org/xen-devel/20180717094830.54806-1-roger.pau@citrix.com/

It's likely not going to apply cleanly, and there are a lot of comments
to be fixed up there.

Thanks, Roger.
Hi Roger,

The revised patch looks good. The PBA writes seen during ioatdma driver
initialization (self-test) complete successfully and the driver doesn't
complain (I see two interrupts per ioatdma device). The driver has a
self-test feature which appears to exercise MSI-X interrupts and has code
which appears to cause a PBA write.

Feel free to add "Tested-by: Alex.Olson@starlab.io" to your patchset.

Thanks also for the pointer to your 2018 work on SR-IOV, I'll give it a try.

FYI, with this patch, I was seeing msix_read() and msix_write() being called
during the driver's self-test on physical address 0xfbc01800, corresponding
to the beginning of the PBA (lspci excerpt below):

02:00.0 System peripheral: Intel Corporation Xeon Processor D Family QuickData Technology Register DMA Channel 0
	...
	Region 0: Memory at fbc06000 (64-bit, non-prefetchable) [size=8K]
	...
	Capabilities: [ac] MSI-X: Enable+ Count=1 Masked-
		Vector table: BAR=0 offset=00001000
		PBA: BAR=0 offset=00001800
	...
	Kernel modules: ioatdma

The functions involved on the Linux kernel side are:

ioat_probe()
 -> ioat3_dma_self_test()
    -> ioat_dma_self_test()
       -> ioat_free_chan_resources()
          -> ioat_reset_hw()

drivers/dma/ioat/dma.c: ioat_reset_hw()
	...
	ioat_dma->msixpba = readq(ioat_dma->reg_base + 0x1800);
	...
	writeq(ioat_dma->msixpba, ioat_dma->reg_base + 0x1800);

Thanks,

-Alex
On 28.02.2022 19:19, Alex Olson wrote:
> FYI, with this patch, I was seeing msix_read() and msix_write() being called
> during the driver's self-test on physical address 0xfbc01800, corresponding to
> the beginning of the PBA (lspci excerpt below):
>
> 02:00.0 System peripheral: Intel Corporation Xeon Processor D Family QuickData Technology Register DMA Channel 0
> 	...
> 	Region 0: Memory at fbc06000 (64-bit, non-prefetchable) [size=8K]
> 	...
> 	Capabilities: [ac] MSI-X: Enable+ Count=1 Masked-
> 		Vector table: BAR=0 offset=00001000
> 		PBA: BAR=0 offset=00001800
> 	...
> 	Kernel modules: ioatdma
>
> The functions involved on the Linux kernel side are:
>
> ioat_probe()
>  -> ioat3_dma_self_test()
>     -> ioat_dma_self_test()
>        -> ioat_free_chan_resources()
>           -> ioat_reset_hw()
>
> drivers/dma/ioat/dma.c: ioat_reset_hw()
> 	...
> 	ioat_dma->msixpba = readq(ioat_dma->reg_base + 0x1800);
> 	...
> 	writeq(ioat_dma->msixpba, ioat_dma->reg_base + 0x1800);

Wow, a clear and apparently intentional violation of the PCI spec. There was
a workaround for a reset issue introduced by commit 8a52b9ff1154, which was
then revised by c997e30e7f65 to take the present shape. However, both commits
claim this only affects certain Atoms, albeit the latter less explicitly by
having "CB3.3" in the subject. Yet you're seeing this on a Xeon D ...

Jan
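The access pattern Alex traced (save the 8-byte PBA with readq() before a device reset, write it back afterwards) can be sketched in isolation. This is not the real driver code: MMIO is modeled as a plain buffer, and `fake_readq`/`fake_writeq`/`fake_reset` are hypothetical stand-ins for the kernel's readq()/writeq() and the IOAT hardware reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IOAT_PBA_OFFSET 0x1800  /* PBA: BAR=0 offset=00001800 per lspci */

/* Stand-ins for readq()/writeq() against a fake 8KiB BAR. */
static uint64_t fake_readq(const uint8_t *bar, unsigned int off)
{
    uint64_t v;
    memcpy(&v, bar + off, sizeof(v));
    return v;
}

static void fake_writeq(uint64_t v, uint8_t *bar, unsigned int off)
{
    memcpy(bar + off, &v, sizeof(v));
}

/* Hypothetical reset that clobbers the whole BAR, PBA included. */
static void fake_reset(uint8_t *bar, size_t len)
{
    memset(bar, 0, len);
}

/* The save/reset/restore sequence ioat_reset_hw() performs on the PBA. */
static uint64_t reset_preserving_pba(uint8_t *bar, size_t len)
{
    uint64_t pba = fake_readq(bar, IOAT_PBA_OFFSET);
    fake_reset(bar, len);
    fake_writeq(pba, bar, IOAT_PBA_OFFSET);
    return pba;
}
```

This is exactly why the write path matters for vPCI: even though the PBA is architecturally read-only, the driver's restore step issues a real write, so a handler that only maps the PBA for reads would still crash here.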
diff --git a/xen/drivers/vpci/msix.c b/xen/drivers/vpci/msix.c
index a1fa7a5f13..9fbc111ecc 100644
--- a/xen/drivers/vpci/msix.c
+++ b/xen/drivers/vpci/msix.c
@@ -198,8 +198,13 @@ static int cf_check msix_read(
     if ( !access_allowed(msix->pdev, addr, len) )
         return X86EMUL_OKAY;
 
+    spin_lock(&msix->pdev->vpci->lock);
     if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
     {
+        struct vpci *vpci = msix->pdev->vpci;
+        paddr_t base = vmsix_table_addr(vpci, VPCI_MSIX_PBA);
+        unsigned int idx = addr - base;
+
         /*
          * Access to PBA.
          *
@@ -207,25 +212,43 @@ static int cf_check msix_read(
          * guest address space. If this changes the address will need to be
          * translated.
          */
+
+        if ( !msix->pba )
+        {
+            msix->pba = ioremap(base, vmsix_table_size(vpci, VPCI_MSIX_PBA));
+            if ( !msix->pba )
+            {
+                /*
+                 * If unable to map the PBA return all 1s (all pending): it's
+                 * likely better to trigger spurious events than drop them.
+                 */
+                spin_unlock(&vpci->lock);
+                gprintk(XENLOG_WARNING,
+                        "%pp: unable to map MSI-X PBA, report all pending\n",
+                        msix->pdev);
+                return X86EMUL_OKAY;
+            }
+        }
+
         switch ( len )
         {
         case 4:
-            *data = readl(addr);
+            *data = readl(msix->pba + idx);
             break;
 
         case 8:
-            *data = readq(addr);
+            *data = readq(msix->pba + idx);
             break;
 
         default:
             ASSERT_UNREACHABLE();
             break;
         }
+        spin_unlock(&vpci->lock);
 
         return X86EMUL_OKAY;
     }
 
-    spin_lock(&msix->pdev->vpci->lock);
     entry = get_entry(msix, addr);
     offset = addr & (PCI_MSIX_ENTRY_SIZE - 1);
 
@@ -273,27 +296,49 @@ static int cf_check msix_write(
     if ( !access_allowed(msix->pdev, addr, len) )
         return X86EMUL_OKAY;
 
+    spin_lock(&msix->pdev->vpci->lock);
     if ( VMSIX_ADDR_IN_RANGE(addr, msix->pdev->vpci, VPCI_MSIX_PBA) )
     {
+        struct vpci *vpci = msix->pdev->vpci;
+        paddr_t base = vmsix_table_addr(vpci, VPCI_MSIX_PBA);
+        unsigned int idx = addr - base;
 
         if ( !is_hardware_domain(d) )
+        {
             /* Ignore writes to PBA for DomUs, it's behavior is undefined. */
+            spin_unlock(&vpci->lock);
             return X86EMUL_OKAY;
+        }
+
+        if ( !msix->pba )
+        {
+            msix->pba = ioremap(base, vmsix_table_size(vpci, VPCI_MSIX_PBA));
+            if ( !msix->pba )
+            {
+                /* Unable to map the PBA, ignore write. */
+                spin_unlock(&vpci->lock);
+                gprintk(XENLOG_WARNING,
+                        "%pp: unable to map MSI-X PBA, write ignored\n",
+                        msix->pdev);
+                return X86EMUL_OKAY;
+            }
+        }
 
         switch ( len )
         {
         case 4:
-            writel(data, addr);
+            writel(data, msix->pba + idx);
             break;
 
         case 8:
-            writeq(data, addr);
+            writeq(data, msix->pba + idx);
             break;
 
         default:
             ASSERT_UNREACHABLE();
             break;
         }
+        spin_unlock(&vpci->lock);
 
         return X86EMUL_OKAY;
     }
diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index f3b32d66cb..9fb3c05b2b 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -51,6 +51,8 @@ void vpci_remove_device(struct pci_dev *pdev)
         xfree(r);
     }
     spin_unlock(&pdev->vpci->lock);
+    if ( pdev->vpci->msix && pdev->vpci->msix->pba )
+        iounmap(pdev->vpci->msix->pba);
     xfree(pdev->vpci->msix);
     xfree(pdev->vpci->msi);
     xfree(pdev->vpci);
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index bcad1516ae..c399b101ee 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -127,6 +127,8 @@ struct vpci {
         bool enabled : 1;
         /* Masked? */
         bool masked : 1;
+        /* PBA map */
+        void *pba;
         /* Entries. */
         struct vpci_msix_entry {
             uint64_t addr;
Map the PBA in order to access it from the MSI-X read and write
handlers. Note that previously the handlers would pass the physical
host address into the {read,write}{l,q} handlers, which is wrong as
those expect a linear address.

Map the PBA using ioremap when the first access is performed. Note
that 32bit arches might want to abstract the call to ioremap into a
vPCI arch handler, so they can use a fixmap range to map the PBA.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Alex Olson <this.is.a0lson@gmail.com>
---
Changes since v1:
 - Also handle writes.

I don't seem to have a box with a driver that will try to access the
PBA, so I would consider this specific code path only build tested. At
least it doesn't seem to regress the current state of vPCI.
---
 xen/drivers/vpci/msix.c | 55 +++++++++++++++++++++++++++++++++++++----
 xen/drivers/vpci/vpci.c |  2 ++
 xen/include/xen/vpci.h  |  2 ++
 3 files changed, 54 insertions(+), 5 deletions(-)
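The patch's central pattern (map the PBA lazily on first access, reuse the mapping, tear it down once on device removal) can be sketched on its own. This is a minimal model, not the Xen code: `fake_ioremap()`/`fake_iounmap()` are stand-ins for Xen's ioremap()/iounmap(), the call counters exist only for the sketch, and the caller is assumed to hold the vpci lock as in the real handlers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int map_calls, unmap_calls;

/* Stand-ins for ioremap()/iounmap(); counters track mapping lifecycle. */
static void *fake_ioremap(size_t size)
{
    map_calls++;
    return malloc(size);
}

static void fake_iounmap(void *p)
{
    unmap_calls++;
    free(p);
}

struct toy_msix {
    void *pba;          /* NULL until the first PBA access */
};

/* Returns the mapping, creating it on first use (caller holds the lock). */
static void *pba_get(struct toy_msix *msix, size_t size)
{
    if (!msix->pba)
        msix->pba = fake_ioremap(size);
    return msix->pba;   /* may still be NULL if the mapping failed */
}

/* Mirrors the vpci_remove_device() hunk: unmap only if ever mapped. */
static void pba_teardown(struct toy_msix *msix)
{
    if (msix->pba)
        fake_iounmap(msix->pba);
    msix->pba = NULL;
}
```

Mapping on first access rather than at device setup matches the commit's rationale: most systems never touch the PBA through vPCI, so the linear-address mapping is only established for the rare driver (such as ioatdma above) that actually reads or writes it.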