diff mbox series

[v3,2/3] vfio/pci: Tolerate oversized BARs by disallowing mmap

Message ID 20240529-vfio_pci_mmap-v3-2-cd217d019218@linux.ibm.com (mailing list archive)
State New, archived
Headers show
Series vfio/pci: s390: Fix issues preventing VFIO_PCI_MMAP=y for s390 and enable it | expand

Commit Message

Niklas Schnelle May 29, 2024, 11:36 a.m. UTC
On s390 there is a virtual PCI device called ISM which has a few rather
annoying oddities. For one it claims to have a 256 TiB PCI BAR (not
a typo) which leads to any attempt to mmap() it failing during vmap.

Even if one tried to map this "BAR" only partially the mapping would not
be usable on systems with MIO support enabled however. This is because
of another oddity in that this virtual PCI device does not support the
newer memory I/O (MIO) PCI instructions and legacy PCI instructions are
not accessible by user-space when MIO is in use. If this device needs to
be accessed by user-space it will thus need a vfio-pci variant driver.
Until then work around both issues by excluding resources which don't
fit between IOREMAP_START and IOREMAP_END in vfio_pci_probe_mmaps().

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Alex Williamson June 18, 2024, 3:51 p.m. UTC | #1
On Wed, 29 May 2024 13:36:25 +0200
Niklas Schnelle <schnelle@linux.ibm.com> wrote:

> On s390 there is a virtual PCI device called ISM which has a few rather
> annoying oddities. For one it claims to have a 256 TiB PCI BAR (not
> a typo) which leads to any attempt to mmap() it failing during vmap.
> 
> Even if one tried to map this "BAR" only partially the mapping would not
> be usable on systems with MIO support enabled however. This is because
> of another oddity in that this virtual PCI device does not support the
> newer memory I/O (MIO) PCI instructions and legacy PCI instructions are
> not accessible by user-space when MIO is in use. If this device needs to
> be accessed by user-space it will thus need a vfio-pci variant driver.
> Until then work around both issues by excluding resources which don't
> fit between IOREMAP_START and IOREMAP_END in vfio_pci_probe_mmaps().
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 80cae87fff36..0f1ddf2d3ef2 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -28,6 +28,7 @@
>  #include <linux/nospec.h>
>  #include <linux/sched/mm.h>
>  #include <linux/iommufd.h>
> +#include <linux/ioremap.h>
>  #if IS_ENABLED(CONFIG_EEH)
>  #include <asm/eeh.h>
>  #endif
> @@ -129,9 +130,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  		/*
>  		 * The PCI core shouldn't set up a resource with a
>  		 * type but zero size. But there may be bugs that
> -		 * cause us to do that.
> +		 * cause us to do that. There is also at least one
> +		 * device which advertises a resource too large to
> +		 * ioremap().
>  		 */
> -		if (!resource_size(res))
> +		if (!resource_size(res) ||
> +		    resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START))
>  			goto no_mmap;
>  
>  		if (resource_size(res) >= PAGE_SIZE) {
> 

A powerpc build reports:

ERROR: modpost: "__kernel_io_end" [drivers/vfio/pci/vfio-pci-core.ko] undefined!

Looks like only __kernel_io_start is exported.  Thanks,

Alex
Christoph Hellwig June 19, 2024, 7:11 a.m. UTC | #2
On Tue, Jun 18, 2024 at 09:51:34AM -0600, Alex Williamson wrote:
> > -		if (!resource_size(res))
> > +		if (!resource_size(res) ||
> > +		    resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START))
> >  			goto no_mmap;
> >  
> >  		if (resource_size(res) >= PAGE_SIZE) {
> > 
> 
> A powerpc build reports:
> 
> ERROR: modpost: "__kernel_io_end" [drivers/vfio/pci/vfio-pci-core.ko] undefined!
> 
> Looks like only __kernel_io_start is exported.  Thanks,

And exported code has no business looking at either one.

I think the right thing here is a core PCI quirk to fix the BAR
size of the ISM device instead of this hack in vfio.
Niklas Schnelle June 19, 2024, 10:56 a.m. UTC | #3
On Wed, 2024-06-19 at 00:11 -0700, Christoph Hellwig wrote:
> On Tue, Jun 18, 2024 at 09:51:34AM -0600, Alex Williamson wrote:
> > > -		if (!resource_size(res))
> > > +		if (!resource_size(res) ||
> > > +		    resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START))
> > >  			goto no_mmap;
> > >  
> > >  		if (resource_size(res) >= PAGE_SIZE) {
> > > 
> > 
> > A powerpc build reports:
> > 
> > ERROR: modpost: "__kernel_io_end" [drivers/vfio/pci/vfio-pci-core.ko] undefined!
> > 
> > Looks like only __kernel_io_start is exported.  Thanks,
> 
> And exported code has no business looking at either one.
> 
> I think the right thing here is a core PCI quirk to fix the BAR
> size of the ISM device instead of this hack in vfio.
> 

I see your point. Sadly the situation with this oversized BAR is
somewhat complex and while it's certainly quirky, I'm not sure a PCI
quirk is a good fit. The reason the ISM device claims the 256 TiB BAR
size is that it uses the offset into the BAR via our PCI Store Block
instruction to encode additional information. The data encoded there
called a DMB request is used to identify the target buffer in which the
ISM device stores the data. This allows the device to do an entire data
transfer with a single synchronous PCI Store Block instruction and
without having to IOMMU map the data being sent or storing it somewhere
else in between. This works as conceptually on the send side the data
is simply stored at an offset into the BAR while on the receiving side
it comes in as a DMA from the device all as a single instruction
execution. And yes I'm aware that such synchronous end-to-end
operations aren't something actual PCI devices can do. Don't shoot the
messenger.

In short, the ISM BAR 0 is stupidly large but this is intentional. It
not fitting in the VMAP is simply the least crazy filter I could come
up with to keep the ISM device from causing trouble for use of vfio-pci
mmap() for other, normal, PCI devices.

Thanks,
Niklas
Christoph Hellwig June 20, 2024, 4:09 a.m. UTC | #4
On Wed, Jun 19, 2024 at 12:56:47PM +0200, Niklas Schnelle wrote:
> In short, the ISM BAR 0 is stupidly large but this is intentional. It
> not fitting in the VMAP is simply the least crazy filter I could come
> up with to keep the ISM device from causing trouble for use of vfio-pci
> mmap() for other, normal, PCI devices.

Then maybe add a PCI quirk to prevent mapping it.  This would also
affect the sysfs resource0 file unless I'm missing something.
Niklas Schnelle June 20, 2024, 12:06 p.m. UTC | #5
On Wed, 2024-06-19 at 21:09 -0700, Christoph Hellwig wrote:
> On Wed, Jun 19, 2024 at 12:56:47PM +0200, Niklas Schnelle wrote:
> > In short, the ISM BAR 0 is stupidly large but this is intentional. It
> > not fitting in the VMAP is simply the least crazy filter I could come
> > up with to keep the ISM device from causing trouble for use of vfio-pci
> > mmap() for other, normal, PCI devices.
> 
> Then maybe add a PCI quirk to prevent mapping it.  This would also
> affect the sysfs resource0 file unless I'm missing something.
> 

The resource<N> files are globally disabled on s390x due to lack of
HAVE_PCI_MMAP/ARCH_GENERIC_PCI_MMAP_RESOURCE at the moment. I might
change that in the future with a analogous argument as for
VFIO_PCI_MMAP but for its not there. Once we add it we of course need
the same special ISM treatment there too.

Looking at the existing PCI quirks I don't see anything that would fit
so I'm guessing you would want to add a new quirk? As I understand it
we would then have to export something like a is_pci_mmap_broken(pdev)
function while currently the only quirk function that seems to be
exported is pxi_fixup_device(). But then what happens if
CONFIG_PCI_QUIRKS=n? Also the header comment in drivers/pci/quirks.c
says that platform specific devices shouldn't go in there and ISM is
platform specific.

Instead of exporting IOREMAP_START/IOREMAP_END maybe there is another
reasonable maximum resource length? Or maybe we could create a size_t
ioremap_area_size() helper similar to is_ioremap_addr() but not
inlined. The latter already uses IOREMAP_START/IOREMAP_END so not sure
how that works when IOREMAP_END is not exported?

Thanks,
Niklas
Gerd Bayer June 20, 2024, 12:29 p.m. UTC | #6
On Thu, 2024-06-20 at 14:06 +0200, Niklas Schnelle wrote:
> On Wed, 2024-06-19 at 21:09 -0700, Christoph Hellwig wrote:
> > On Wed, Jun 19, 2024 at 12:56:47PM +0200, Niklas Schnelle wrote:
> > > In short, the ISM BAR 0 is stupidly large but this is
> > > intentional. It not fitting in the VMAP is simply the least crazy
> > > filter I could come up with to keep the ISM device from causing
> > > trouble for use of vfio-pci mmap() for other, normal, PCI
> > > devices.
> > 
> > Then maybe add a PCI quirk to prevent mapping it.  This would also
> > affect the sysfs resource0 file unless I'm missing something.
> > 
> 
> The resource<N> files are globally disabled on s390x due to lack of
> HAVE_PCI_MMAP/ARCH_GENERIC_PCI_MMAP_RESOURCE at the moment. I might
> change that in the future with a analogous argument as for
> VFIO_PCI_MMAP but for its not there. Once we add it we of course need
> the same special ISM treatment there too.
> 
> Looking at the existing PCI quirks I don't see anything that would
> fit so I'm guessing you would want to add a new quirk? As I
> understand it we would then have to export something like a
> is_pci_mmap_broken(pdev) function while currently the only quirk
> function that seems to be exported is pxi_fixup_device(). But then
> what happens if CONFIG_PCI_QUIRKS=n? Also the header comment in
> drivers/pci/quirks.c says that platform specific devices shouldn't go
> in there and ISM is platform specific.

I took a cursory look at that file as well and arrived at a similar
conclusion that drivers/pci/quirks.c (as is) does not sound like a good
match for the required functionality


> Instead of exporting IOREMAP_START/IOREMAP_END maybe there is another
> reasonable maximum resource length? Or maybe we could create a size_t
> ioremap_area_size() helper similar to is_ioremap_addr() but not
> inlined. The latter already uses IOREMAP_START/IOREMAP_END so not
> sure how that works when IOREMAP_END is not exported?

But seeing how the PCI quirks filter against Vendor and Device ID,
I just had the idea if it would make sense to create a similar
infrastructure in the form of VFIO PCI quirks - and simply reject mmap
on ISM devices regardless of the size of iomap'able address spaces?
Sound rather coarse-grained, though...

> Thanks,
> Niklas

Just a thought,
Gerd
diff mbox series

Patch

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 80cae87fff36..0f1ddf2d3ef2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -28,6 +28,7 @@ 
 #include <linux/nospec.h>
 #include <linux/sched/mm.h>
 #include <linux/iommufd.h>
+#include <linux/ioremap.h>
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
@@ -129,9 +130,12 @@  static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 		/*
 		 * The PCI core shouldn't set up a resource with a
 		 * type but zero size. But there may be bugs that
-		 * cause us to do that.
+		 * cause us to do that. There is also at least one
+		 * device which advertises a resource too large to
+		 * ioremap().
 		 */
-		if (!resource_size(res))
+		if (!resource_size(res) ||
+		    resource_size(res) > (IOREMAP_END + 1 - IOREMAP_START))
 			goto no_mmap;
 
 		if (resource_size(res) >= PAGE_SIZE) {