Message ID | 20240529-vfio_pci_mmap-v3-2-cd217d019218@linux.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | vfio/pci: s390: Fix issues preventing VFIO_PCI_MMAP=y for s390 and enable it | expand |
On Wed, 29 May 2024 13:36:25 +0200 Niklas Schnelle <schnelle@linux.ibm.com> wrote: > On s390 there is a virtual PCI device called ISM which has a few rather > annoying oddities. For one it claims to have a 256 TiB PCI BAR (not > a typo) which leads to any attempt to mmap() it failing during vmap. > > Even if one tried to map this "BAR" only partially the mapping would not > be usable on systems with MIO support enabled however. This is because > of another oddity in that this virtual PCI device does not support the > newer memory I/O (MIO) PCI instructions and legacy PCI instructions are > not accessible by user-space when MIO is in use. If this device needs to > be accessed by user-space it will thus need a vfio-pci variant driver. > Until then work around both issues by excluding resources which don't > fit between IOREMAP_START and IOREMAP_END in vfio_pci_probe_mmaps(). > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> > Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> > --- > drivers/vfio/pci/vfio_pci_core.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index 80cae87fff36..0f1ddf2d3ef2 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -28,6 +28,7 @@ > #include <linux/nospec.h> > #include <linux/sched/mm.h> > #include <linux/iommufd.h> > +#include <linux/ioremap.h> > #if IS_ENABLED(CONFIG_EEH) > #include <asm/eeh.h> > #endif > @@ -129,9 +130,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev) > /* > * The PCI core shouldn't set up a resource with a > * type but zero size. But there may be bugs that > - * cause us to do that. > + * cause us to do that. There is also at least one > + * device which advertises a resource too large to > + * ioremap(). > */ > - if (!resource_size(res)) > + if (!resource_size(res) || > + resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START)) > goto no_mmap; > > if (resource_size(res) >= PAGE_SIZE) { > A powerpc build reports: ERROR: modpost: "__kernel_io_end" [drivers/vfio/pci/vfio-pci-core.ko] undefined! Looks like only __kernel_io_start is exported. Thanks, Alex
On Tue, Jun 18, 2024 at 09:51:34AM -0600, Alex Williamson wrote: > > - if (!resource_size(res)) > > + if (!resource_size(res) || > > + resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START)) > > goto no_mmap; > > > > if (resource_size(res) >= PAGE_SIZE) { > > > > A powerpc build reports: > > ERROR: modpost: "__kernel_io_end" [drivers/vfio/pci/vfio-pci-core.ko] undefined! > > Looks like only __kernel_io_start is exported. Thanks, And exported code has no business looking at either one. I think the right thing here is a core PCI quirk to fix the BAR size of the ISM device instead of this hack in vfio.
On Wed, 2024-06-19 at 00:11 -0700, Christoph Hellwig wrote: > On Tue, Jun 18, 2024 at 09:51:34AM -0600, Alex Williamson wrote: > > > - if (!resource_size(res)) > > > + if (!resource_size(res) || > > > + resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START)) > > > goto no_mmap; > > > > > > if (resource_size(res) >= PAGE_SIZE) { > > > > > > > A powerpc build reports: > > > > ERROR: modpost: "__kernel_io_end" [drivers/vfio/pci/vfio-pci-core.ko] undefined! > > > > Looks like only __kernel_io_start is exported. Thanks, > > And exported code has no business looking at either one. > > I think the right thing here is a core PCI quirk to fix the BAR > size of the ISM device instead of this hack in vfio. > I see your point. Sadly the situation with this oversized BAR is somewhat complex and while it's certainly quirky, I'm not sure a PCI quirk is a good fit. The reason the ISM device claims the 256 TiB BAR size is that it uses the offset into the BAR via our PCI Store Block instruction to encode additional information. The data encoded there called a DMB request is used to identify the target buffer in which the ISM device stores the data. This allows the device to do an entire data transfer with a single synchronous PCI Store Block instruction and without having to IOMMU map the data being sent or storing it somewhere else in between. This works as conceptually on the send side the data is simply stored at an offset into the BAR while on the receiving side it comes in as a DMA from the device all as a single instruction execution. And yes I'm aware that such synchronous end-to-end operations aren't something actual PCI devices can do. Don't shoot the messenger. In short, the ISM BAR 0 is stupidly large but this is intentional. It not fitting in the VMAP is simply the least crazy filter I could come up with to keep the ISM device from causing trouble for use of vfio-pci mmap() for other, normal, PCI devices. Thanks, Niklas
On Wed, Jun 19, 2024 at 12:56:47PM +0200, Niklas Schnelle wrote: > In short, the ISM BAR 0 is stupidly large but this is intentional. It > not fitting in the VMAP is simply the least crazy filter I could come > up with to keep the ISM device from causing trouble for use of vfio-pci > mmap() for other, normal, PCI devices. Then maybe add a PCI quirk to prevent mapping it. This would also affect the sysfs resource0 file unless I'm missing something.
On Wed, 2024-06-19 at 21:09 -0700, Christoph Hellwig wrote: > On Wed, Jun 19, 2024 at 12:56:47PM +0200, Niklas Schnelle wrote: > > In short, the ISM BAR 0 is stupidly large but this is intentional. It > > not fitting in the VMAP is simply the least crazy filter I could come > > up with to keep the ISM device from causing trouble for use of vfio-pci > > mmap() for other, normal, PCI devices. > > Then maybe add a PCI quirk to prevent mapping it. This would also > affect the sysfs resource0 file unless I'm missing something. > The resource<N> files are globally disabled on s390x due to lack of HAVE_PCI_MMAP/ARCH_GENERIC_PCI_MMAP_RESOURCE at the moment. I might change that in the future with a analogous argument as for VFIO_PCI_MMAP but for its not there. Once we add it we of course need the same special ISM treatment there too. Looking at the existing PCI quirks I don't see anything that would fit so I'm guessing you would want to add a new quirk? As I understand it we would then have to export something like a is_pci_mmap_broken(pdev) function while currently the only quirk function that seems to be exported is pxi_fixup_device(). But then what happens if CONFIG_PCI_QUIRKS=n? Also the header comment in drivers/pci/quirks.c says that platform specific devices shouldn't go in there and ISM is platform specific. Instead of exporting IOREMAP_START/IOREMAP_END maybe there is another reasonable maximum resource length? Or maybe we could create a size_t ioremap_area_size() helper similar to is_ioremap_addr() but not inlined. The latter already uses IOREMAP_START/IOREMAP_END so not sure how that works when IOREMAP_END is not exported? Thanks, Niklas
On Thu, 2024-06-20 at 14:06 +0200, Niklas Schnelle wrote: > On Wed, 2024-06-19 at 21:09 -0700, Christoph Hellwig wrote: > > On Wed, Jun 19, 2024 at 12:56:47PM +0200, Niklas Schnelle wrote: > > > In short, the ISM BAR 0 is stupidly large but this is > > > intentional. It not fitting in the VMAP is simply the least crazy > > > filter I could come up with to keep the ISM device from causing > > > trouble for use of vfio-pci mmap() for other, normal, PCI > > > devices. > > > > Then maybe add a PCI quirk to prevent mapping it. This would also > > affect the sysfs resource0 file unless I'm missing something. > > > > The resource<N> files are globally disabled on s390x due to lack of > HAVE_PCI_MMAP/ARCH_GENERIC_PCI_MMAP_RESOURCE at the moment. I might > change that in the future with a analogous argument as for > VFIO_PCI_MMAP but for its not there. Once we add it we of course need > the same special ISM treatment there too. > > Looking at the existing PCI quirks I don't see anything that would > fit so I'm guessing you would want to add a new quirk? As I > understand it we would then have to export something like a > is_pci_mmap_broken(pdev) function while currently the only quirk > function that seems to be exported is pxi_fixup_device(). But then > what happens if CONFIG_PCI_QUIRKS=n? Also the header comment in > drivers/pci/quirks.c says that platform specific devices shouldn't go > in there and ISM is platform specific. I took a cursory look at that file as well and arrived at a similar conclusion that drivers/pci/quirks.c (as is) does not sound like a good match for the required functionality > Instead of exporting IOREMAP_START/IOREMAP_END maybe there is another > reasonable maximum resource length? Or maybe we could create a size_t > ioremap_area_size() helper similar to is_ioremap_addr() but not > inlined. The latter already uses IOREMAP_START/IOREMAP_END so not > sure how that works when IOREMAP_END is not exported? But seeing how the PCI quirks filter against Vendor and Device ID, I just had the idea if it would make sense to create a similar infrastructure in the form of VFIO PCI quirks - and simply reject mmap on ISM devices regardless of the size of iomap'able address spaces? Sound rather coarse-grained, though... > Thanks, > Niklas Just a thought, Gerd
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 80cae87fff36..0f1ddf2d3ef2 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -28,6 +28,7 @@ #include <linux/nospec.h> #include <linux/sched/mm.h> #include <linux/iommufd.h> +#include <linux/ioremap.h> #if IS_ENABLED(CONFIG_EEH) #include <asm/eeh.h> #endif @@ -129,9 +130,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev) /* * The PCI core shouldn't set up a resource with a * type but zero size. But there may be bugs that - * cause us to do that. + * cause us to do that. There is also at least one + * device which advertises a resource too large to + * ioremap(). */ - if (!resource_size(res)) + if (!resource_size(res) || + resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START)) goto no_mmap; if (resource_size(res) >= PAGE_SIZE) {