
[v7,20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

Message ID 20220615161233.17527-21-logang@deltatee.com
State New
Series Userspace P2PDMA with O_DIRECT NVMe devices

Commit Message

Logan Gunthorpe June 15, 2022, 4:12 p.m. UTC
Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.

Pages are allocated from the genalloc in bulk with their reference
count set to one. They are returned to the genalloc when the page is put
through p2pdma_page_free() (the reference count is once again set to 1
in free_zone_device_page()).
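
As a rough userspace illustration of that lifecycle (not kernel code, and
every toy_* name below is invented for this sketch): a page leaves the
pool holding one reference, extra users take and drop references, and the
final put returns the page to the pool with its count reset to one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8

/* Hypothetical stand-in for a ZONE_DEVICE page; not the kernel's struct page. */
struct toy_page {
	int refcount;	/* idle pages rest at 1, as the commit message describes */
	bool in_pool;	/* true while the allocator owns the page */
};

struct toy_pool {
	struct toy_page pages[TOY_NPAGES];
	size_t nfree;
};

static void toy_pool_init(struct toy_pool *pool)
{
	for (size_t i = 0; i < TOY_NPAGES; i++) {
		pool->pages[i].refcount = 1;
		pool->pages[i].in_pool = true;
	}
	pool->nfree = TOY_NPAGES;
}

/* Hand out one page; it leaves the pool still holding its single reference. */
static struct toy_page *toy_pool_alloc(struct toy_pool *pool)
{
	for (size_t i = 0; i < TOY_NPAGES; i++) {
		if (pool->pages[i].in_pool) {
			pool->pages[i].in_pool = false;
			pool->nfree--;
			return &pool->pages[i];
		}
	}
	return NULL;
}

static void toy_page_get(struct toy_page *page)
{
	assert(!page->in_pool);
	page->refcount++;
}

/* The last put returns the page to the pool and resets its count to 1. */
static void toy_page_put(struct toy_pool *pool, struct toy_page *page)
{
	assert(!page->in_pool && page->refcount > 0);
	if (--page->refcount == 0) {
		page->refcount = 1;
		page->in_pool = true;
		pool->nfree++;
	}
}
```

In this model toy_page_put() plays the combined role of the final
put_page() and the p2pdma_page_free() callback; the real code splits
those steps between the mm core and the pgmap ops.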

The VMA does not take a reference to the pages when they are inserted
with vmf_insert_mixed() (which is necessary for zone device pages) so
the backing P2P memory is stored in a structure in vm_private_data.

A pseudo mount is used to allocate an inode for each PCI device. The
inode's address_space is used in the file doing the mmap so that all
VMAs are collected and can be unmapped if the PCI device is unbound.
After unmapping, the VMAs are iterated through and their pages are
put so the device can continue to be unbound. An active flag is used
to signal to VMAs not to allocate any further P2P memory once the
removal process starts. The flag is synchronized with concurrent
access with an RCU lock.

The VMAs and inode will survive after the unbind of the device, but no
pages will be present in the VMA and a subsequent access will result
in a SIGBUS error.
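
The userspace-visible effect can be approximated with an ordinary file,
as an analogy only: truncating a file also invalidates the pages behind
an existing shared mapping, so the mapping survives but the next access
faults with SIGBUS. The sketch below assumes Linux and a writable /tmp;
the function name is made up for illustration and nothing here touches
P2PDMA itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(sigbus_env, 1);
}

/*
 * Returns 1 if touching the mapping after its backing pages were removed
 * raised SIGBUS, 0 if the access succeeded, -1 on setup failure.
 */
static int sigbus_after_backing_removed(void)
{
	char path[] = "/tmp/p2p-sigbus-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	volatile char *map;
	int fd, result;

	assert(page > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	if (ftruncate(fd, page) != 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	map[0] = 1;	/* backing page present: this access works */

	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(sigbus_env, 1) == 0) {
		/* Stand-in for the unbind: drop the backing pages. */
		if (ftruncate(fd, 0) != 0) {
			result = -1;
		} else {
			map[0] = 2;	/* the VMA still exists, but this faults */
			result = 0;
		}
	} else {
		result = 1;	/* arrived here from the SIGBUS handler */
	}

	sigaction(SIGBUS, &old, NULL);
	munmap((void *)map, (size_t)page);
	close(fd);
	return result;
}
```

A program that maps P2P memory and wants to survive an unbind would need
a SIGBUS handler along these lines, or would have to avoid touching the
mapping after the device goes away.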

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/p2pdma.c       | 210 ++++++++++++++++++++++++++++++++++++-
 include/linux/pci-p2pdma.h |  16 +++
 include/uapi/linux/magic.h |   1 +
 3 files changed, 225 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig June 29, 2022, 6:48 a.m. UTC | #1
On Wed, Jun 15, 2022 at 10:12:32AM -0600, Logan Gunthorpe wrote:
> A pseudo mount is used to allocate an inode for each PCI device. The
> inode's address_space is used in the file doing the mmap so that all
> VMAs are collected and can be unmapped if the PCI device is unbound.
> After unmapping, the VMAs are iterated through and their pages are
> put so the device can continue to be unbound. An active flag is used
> to signal to VMAs not to allocate any further P2P memory once the
> removal process starts. The flag is synchronized with concurrent
> access with an RCU lock.

Can't we come up with a way of doing this without all the pseudo-fs
garbage?  I really hate all the overhead for that in the next
nvme patch as well.
Logan Gunthorpe June 29, 2022, 4 p.m. UTC | #2
On 2022-06-29 00:48, Christoph Hellwig wrote:
> On Wed, Jun 15, 2022 at 10:12:32AM -0600, Logan Gunthorpe wrote:
>> A pseudo mount is used to allocate an inode for each PCI device. The
>> inode's address_space is used in the file doing the mmap so that all
>> VMAs are collected and can be unmapped if the PCI device is unbound.
>> After unmapping, the VMAs are iterated through and their pages are
>> put so the device can continue to be unbound. An active flag is used
>> to signal to VMAs not to allocate any further P2P memory once the
>> removal process starts. The flag is synchronized with concurrent
>> access with an RCU lock.
> 
> Can't we come up with a way of doing this without all the pseudo-fs
> garbage?  I really hate all the overhead for that in the next
> nvme patch as well.

I assume you still want to be able to unmap the VMAs on unbind and not
just hang?

I'll see if I can come up with something to do a similar thing using
vm_private_data or some such.

I was not a fan of the extra code for this either, but I was given to
understand that it was the standard way to collect and cleanup VMAs.

Thanks for the reviews,

Logan
Jason Gunthorpe June 29, 2022, 5:59 p.m. UTC | #3
On Wed, Jun 29, 2022 at 10:00:09AM -0600, Logan Gunthorpe wrote:
> 
> 
> 
> On 2022-06-29 00:48, Christoph Hellwig wrote:
> > On Wed, Jun 15, 2022 at 10:12:32AM -0600, Logan Gunthorpe wrote:
> >> A pseudo mount is used to allocate an inode for each PCI device. The
> >> inode's address_space is used in the file doing the mmap so that all
> >> VMAs are collected and can be unmapped if the PCI device is unbound.
> >> After unmapping, the VMAs are iterated through and their pages are
> >> put so the device can continue to be unbound. An active flag is used
> >> to signal to VMAs not to allocate any further P2P memory once the
> >> removal process starts. The flag is synchronized with concurrent
> >> access with an RCU lock.
> > 
> > Can't we come up with a way of doing this without all the pseudo-fs
> > garbage?  I really hate all the overhead for that in the next
> > nvme patch as well.
> 
> I assume you still want to be able to unmap the VMAs on unbind and not
> just hang?
> 
> I'll see if I can come up with something to do a similar thing using
> vm_private_data or some such.

I've tried in the past, this is not a good idea. There is no way to
handle failures when a VMA is dup'd and if you rely on private_data
you almost certainly have to alloc here.

Then there is the issue of making the locking work on invalidation
which is crazy ugly.

> I was not a fan of the extra code for this either, but I was given to
> understand that it was the standard way to collect and cleanup VMAs.

Christoph you tried to clean it once globally, what happened to
that?

All that is needed here is a way to get a unique inode for the PCI
memory.

Jason
Christoph Hellwig July 5, 2022, 7:51 a.m. UTC | #4
On Wed, Jun 29, 2022 at 02:59:06PM -0300, Jason Gunthorpe wrote:
> I've tried in the past, this is not a good idea. There is no way to
> handle failures when a VMA is dup'd and if you rely on private_data
> you almost certainly have to alloc here.
> 
> Then there is the issue of making the locking work on invalidation
> which is crazy ugly.
> 
> > I was not a fan of the extra code for this either, but I was given to
> > understand that it was the standard way to collect and cleanup VMAs.
> 
> Christoph you tried to clean it once globally, what happened to
> that?

Al pointed out that there are various places that rely on having a
separate file system.  I might be able to go back to it and see
if we could at least do it for some users.

But what also really matters here:  I don't want every user that
wants to be able to mmap a character device to do all this work.
The layering is simply wrong, it needs some character device
based helpers, not be open code everywhere.

In fact I'm not even sure this should be a character device, it seems
to fit in way better with the PCI sysfs hierarchy, just like how we
map MMIO resources, which these are anyway.  And once it is on sysfs
we do have a unique inode and need none of the pseudofs stuff, and
don't need all the glue code in nvme either.
Jason Gunthorpe July 5, 2022, 1:51 p.m. UTC | #5
On Tue, Jul 05, 2022 at 09:51:08AM +0200, Christoph Hellwig wrote:

> But what also really matters here:  I don't want every user that
> wants to be able to mmap a character device to do all this work.
> The layering is simply wrong, it needs some character device
> based helpers, not be open code everywhere.

I think a lot of (all?) cases would be happy if the inode was 1:1 with the
cdev struct device. I suppose the cdev code would still have to create
pseudo fs, but at least that is hidden.

> In fact I'm not even sure this should be a character device, it seems
> to fit in way better with the PCI sysfs hierarchy, just like how we
> map MMIO resources, which these are anyway.  And once it is on sysfs
> we do have a unique inode and need none of the pseudofs stuff, and
> don't need all the glue code in nvme either.

Shouldn't there be an allocator here? It feels a bit weird that the
entire CMB is given to a single process, it is a sharable resource,
isn't it?

Jason
Christoph Hellwig July 5, 2022, 4:12 p.m. UTC | #6
On Tue, Jul 05, 2022 at 10:51:02AM -0300, Jason Gunthorpe wrote:
> > In fact I'm not even sure this should be a character device, it seems
> > to fit in way better with the PCI sysfs hierarchy, just like how we
> > map MMIO resources, which these are anyway.  And once it is on sysfs
> > we do have a unique inode and need none of the pseudofs stuff, and
> > don't need all the glue code in nvme either.
> 
> Shouldn't there be an allocator here? It feels a bit weird that the
> entire CMB is given to a single process, it is a sharable resource,
> isn't it?

Making the entire area given by the device to the p2p allocator available
to user space seems sensible to me.  That is what the current series does,
and what a sysfs interface would do as well.
Jason Gunthorpe July 5, 2022, 4:29 p.m. UTC | #7
On Tue, Jul 05, 2022 at 06:12:40PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 05, 2022 at 10:51:02AM -0300, Jason Gunthorpe wrote:
> > > In fact I'm not even sure this should be a character device, it seems
> > > to fit in way better with the PCI sysfs hierarchy, just like how we
> > > map MMIO resources, which these are anyway.  And once it is on sysfs
> > > we do have a unique inode and need none of the pseudofs stuff, and
> > > don't need all the glue code in nvme either.
> > 
> > Shouldn't there be an allocator here? It feels a bit weird that the
> > entire CMB is given to a single process, it is a sharable resource,
> > isn't it?
> 
> Making the entire area given by the device to the p2p allocator available
> to user space seems sensible to me.  That is what the current series does,
> and what a sysfs interface would do as well.

That makes opening the mmap exclusive with the in-kernel allocator -
so it means opening the mmap fails if something else is using a P2P
page and once the mmap is open all kernel side P2P allocations will
fail?

Which seems inelegant, I would expect the mmap operation to
request some pages from the P2P allocator and provide them to
userspace so user and kernel workflows can co-exist using the same
CMB.

Jason
Christoph Hellwig July 5, 2022, 4:40 p.m. UTC | #8
On Tue, Jul 05, 2022 at 01:29:59PM -0300, Jason Gunthorpe wrote:
> > Making the entire area given by the device to the p2p allocator available
> > to user space seems sensible to me.  That is what the current series does,
> > and what a sysfs interface would do as well.
> 
> That makes opening the mmap exclusive with the in-kernel allocator -
> so it means opening the mmap fails if something else is using a P2P
> page and once the mmap is open all kernel side P2P allocations will
> fail?

No.  Just as in the current patchset you can mmap the file and will get
len / PAGE_SIZE pages from the per-device p2pdma pool, or the mmap will
fail if none are available.  A kernel consumer (or multiple) can use
other pages in the pool at the same time.
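
The sharing model described in this reply can be sketched with a toy
first-fit page pool. This is a userspace illustration under loud
assumptions: the toy_* names, the 16-page pool and the 4096-byte page
size are invented here, and the real pool is a genalloc instance with
its own allocation policy.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE	4096u
#define TOY_POOL_PAGES	16u

/* Hypothetical per-device pool: one flag per page, true when handed out. */
struct toy_p2p_pool {
	bool used[TOY_POOL_PAGES];
};

/*
 * First-fit search for a contiguous run of npages free pages.
 * Returns the index of the first page, or -1 if no run is available.
 */
static int toy_pool_alloc_pages(struct toy_p2p_pool *pool, size_t npages)
{
	if (npages == 0 || npages > TOY_POOL_PAGES)
		return -1;

	for (size_t start = 0; start + npages <= TOY_POOL_PAGES; start++) {
		size_t run = 0;

		while (run < npages && !pool->used[start + run])
			run++;
		if (run == npages) {
			for (size_t i = 0; i < npages; i++)
				pool->used[start + i] = true;
			return (int)start;
		}
	}
	return -1;
}

/*
 * Model of the mmap path: a mapping of len bytes consumes
 * len / TOY_PAGE_SIZE pages from the shared pool, or fails outright.
 */
static int toy_mmap(struct toy_p2p_pool *pool, size_t len)
{
	if (len == 0 || len % TOY_PAGE_SIZE)
		return -1;
	return toy_pool_alloc_pages(pool, len / TOY_PAGE_SIZE);
}
```

Kernel consumers would call the same allocator directly, so a userspace
mapping only competes for pages; it does not take the pool away from
anyone else.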
Logan Gunthorpe July 5, 2022, 4:41 p.m. UTC | #9
On 2022-07-05 10:12, Christoph Hellwig wrote:
> On Tue, Jul 05, 2022 at 10:51:02AM -0300, Jason Gunthorpe wrote:
>>> In fact I'm not even sure this should be a character device, it seems
>>> to fit in way better with the PCI sysfs hierarchy, just like how we
>>> map MMIO resources, which these are anyway.  And once it is on sysfs
>>> we do have a unique inode and need none of the pseudofs stuff, and
>>> don't need all the glue code in nvme either.
>>
>> Shouldn't there be an allocator here? It feels a bit weird that the
>> entire CMB is given to a single process, it is a sharable resource,
>> isn't it?
> 
> Making the entire area given by the device to the p2p allocator available
> to user space seems sensible to me.  That is what the current series does,
> and what a sysfs interface would do as well.

Yes, I think Jason is assuming the sysfs file would behave like the
existing mmio resource files where the process doing the mapping
specifies the offset and length into the BAR. That is not what we want
here, but I don't see why we can't do the same thing in
sysfs as we do with the char device with a bin_attribute->mmap() callback.

mmapping the char device was convenient in user space, but it's not much
more work to dig through sysfs and mmap an attribute from there.

Using sysfs means we don't need all the messy callbacks from the nvme
driver, which is a plus. But I'm not sure how we'd get or unmap the
mapping of a sysfs file or avoid the anonymous inode. Seems with the
existing PCI resources, it uses a bin_attribute->f_mapping() callback
to pass back the iomem_get_mapping() mapping on file open.
revoke_iomem() is then used to nuke the VMAs. I don't think we can use
the same infrastructure here as that would add a dependency on
CONFIG_IO_STRICT_DEVMEM; which would be odd. And I'm not sure whether
there is a better way.

Logan
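
From userspace, the sysfs variant discussed here might look roughly like
the sketch below. The helper name and any attribute path passed to it
are hypothetical, since no sysfs ABI had been defined at this point in
the thread. The shared mapping and zero offset mirror the checks in
pci_mmap_p2pmem() in this patch, which rejects private mappings and
non-zero offsets with -EINVAL.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes from the file at path, which is assumed to be a
 * (hypothetical) p2pmem sysfs attribute. Returns the mapping, or NULL
 * on failure.
 */
static void *map_p2pmem(const char *path, size_t len)
{
	void *addr;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	/* MAP_SHARED and offset 0: the patch returns -EINVAL otherwise. */
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping stays valid after the descriptor is closed. */
	close(fd);

	return addr == MAP_FAILED ? NULL : addr;
}
```

The returned buffer would then be handed to an O_DIRECT read or write on
the NVMe device, which is the use case the series title describes.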
Christoph Hellwig July 5, 2022, 4:43 p.m. UTC | #10
On Tue, Jul 05, 2022 at 10:41:52AM -0600, Logan Gunthorpe wrote:
> Using sysfs means we don't need all the messy callbacks from the nvme
> driver, which is a plus. But I'm not sure how we'd get or unmap the
> mapping of a sysfs file or avoid the anonymous inode. Seems with the
> existing PCI resources, it uses a bin_attribute->f_mapping() callback
> to pass back the iomem_get_mapping() mapping on file open.
> revoke_iomem() is then used to nuke the VMAs. I don't think we can use
> the same infrastructure here as that would add a dependency on
> CONFIG_IO_STRICT_DEVMEM; which would be odd. And I'm not sure whether
> there is a better way.

Why can't we do the revoke on the actual sysfs inode?
Logan Gunthorpe July 5, 2022, 4:44 p.m. UTC | #11
On 2022-07-05 10:43, Christoph Hellwig wrote:
> On Tue, Jul 05, 2022 at 10:41:52AM -0600, Logan Gunthorpe wrote:
>> Using sysfs means we don't need all the messy callbacks from the nvme
>> driver, which is a plus. But I'm not sure how we'd get or unmap the
>> mapping of a sysfs file or avoid the anonymous inode. Seems with the
>> existing PCI resources, it uses a bin_attribute->f_mapping() callback
>> to pass back the iomem_get_mapping() mapping on file open.
>> revoke_iomem() is then used to nuke the VMAs. I don't think we can use
>> the same infrastructure here as that would add a dependency on
>> CONFIG_IO_STRICT_DEVMEM; which would be odd. And I'm not sure whether
>> there is a better way.
> 
> Why can't we do the revoke on the actual sysfs inode?

We might be able to. I'm not sure. I'll have to figure out how to find
that inode from the p2pdma code. I haven't found an obvious interface to
do that.

Logan
Christoph Hellwig July 5, 2022, 4:50 p.m. UTC | #12
[note for the newcomers, this is about allowing mmap()ing the PCIe
P2P memory from the generic PCI P2P code through sysfs, and more
importantly how to revoke it on device removal]

On Tue, Jul 05, 2022 at 10:44:49AM -0600, Logan Gunthorpe wrote:
> We might be able to. I'm not sure. I'll have to figure out how to find
> that inode from the p2pdma code. I haven't found an obvious interface to
> do that.

I think the right way to approach this would be a new sysfs API
that calls unmap_mapping_range internally instead of
exposing the inode. I suspect that might actually be the right thing
to do for iomem_inode as well.
Greg KH July 5, 2022, 5:21 p.m. UTC | #13
On Tue, Jul 05, 2022 at 06:50:39PM +0200, Christoph Hellwig wrote:
> [note for the newcomers, this is about allowing mmap()ing the PCIe
> P2P memory from the generic PCI P2P code through sysfs, and more
> importantly how to revoke it on device removal]

We allow mmap on PCIe config space today, right?  Why is this different
from what pci_create_legacy_files() does today?

> On Tue, Jul 05, 2022 at 10:44:49AM -0600, Logan Gunthorpe wrote:
> > We might be able to. I'm not sure. I'll have to figure out how to find
> > that inode from the p2pdma code. I haven't found an obvious interface to
> > do that.
> 
> I think the right way to approach this would be a new sysfs API
> that calls unmap_mapping_range internally instead of
> exposing the inode. I suspect that might actually be the right thing
> to do for iomem_inode as well.

Why do we need something new and how is this any different from the PCI
binary files I mention above?  We have supported PCI hotplug for a very
long time, do the current PCI binary sysfs files not work properly with
mmap and removing a device?

thanks,

greg k-h
Logan Gunthorpe July 5, 2022, 5:32 p.m. UTC | #14
On 2022-07-05 11:21, Greg Kroah-Hartman wrote:
> On Tue, Jul 05, 2022 at 06:50:39PM +0200, Christoph Hellwig wrote:
>> [note for the newcomers, this is about allowing mmap()ing the PCIe
>> P2P memory from the generic PCI P2P code through sysfs, and more
>> importantly how to revoke it on device removal]
> 
> We allow mmap on PCIe config space today, right?  Why is this different
> from what pci_create_legacy_files() does today?
> 
>> On Tue, Jul 05, 2022 at 10:44:49AM -0600, Logan Gunthorpe wrote:
>>> We might be able to. I'm not sure. I'll have to figure out how to find
>>> that inode from the p2pdma code. I haven't found an obvious interface to
>>> do that.
>>
>> I think the right way to approach this would be a new sysfs API
>> that calls unmap_mapping_range internally instead of
>> exposing the inode. I suspect that might actually be the right thing
>> to do for iomem_inode as well.
> 
> Why do we need something new and how is this any different from the PCI
> binary files I mention above?  We have supported PCI hotplug for a very
> long time, do the current PCI binary sysfs files not work properly with
> mmap and removing a device?

The P2PDMA code allocates and hands out struct pages to userspace that
are backed with ZONE_DEVICE memory from a device's BAR. This is quite
different from the existing binary files mentioned above which neither
support struct pages nor allocation.

Logan
Greg KH July 5, 2022, 5:42 p.m. UTC | #15
On Tue, Jul 05, 2022 at 11:32:23AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2022-07-05 11:21, Greg Kroah-Hartman wrote:
> > On Tue, Jul 05, 2022 at 06:50:39PM +0200, Christoph Hellwig wrote:
> >> [note for the newcomers, this is about allowing mmap()ing the PCIe
> >> P2P memory from the generic PCI P2P code through sysfs, and more
> >> importantly how to revoke it on device removal]
> > 
> > We allow mmap on PCIe config space today, right?  Why is this different
> > from what pci_create_legacy_files() does today?
> > 
> >> On Tue, Jul 05, 2022 at 10:44:49AM -0600, Logan Gunthorpe wrote:
> >>> We might be able to. I'm not sure. I'll have to figure out how to find
> >>> that inode from the p2pdma code. I haven't found an obvious interface to
> >>> do that.
> >>
> >> I think the right way to approach this would be a new sysfs API
> >> that calls unmap_mapping_range internally instead of
> >> exposing the inode. I suspect that might actually be the right thing
> >> to do for iomem_inode as well.
> > 
> > Why do we need something new and how is this any different from the PCI
> > binary files I mention above?  We have supported PCI hotplug for a very
> > long time, do the current PCI binary sysfs files not work properly with
> > mmap and removing a device?
> 
> The P2PDMA code allocates and hands out struct pages to userspace that
> are backed with ZONE_DEVICE memory from a device's BAR. This is quite
> different from the existing binary files mentioned above which neither
> support struct pages nor allocation.

Why would you want to do this through a sysfs interface?  that feels
horrid...
Logan Gunthorpe July 5, 2022, 6:16 p.m. UTC | #16
On 2022-07-05 11:42, Greg Kroah-Hartman wrote:
> On Tue, Jul 05, 2022 at 11:32:23AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-07-05 11:21, Greg Kroah-Hartman wrote:
>>> On Tue, Jul 05, 2022 at 06:50:39PM +0200, Christoph Hellwig wrote:
>>>> [note for the newcomers, this is about allowing mmap()ing the PCIe
>>>> P2P memory from the generic PCI P2P code through sysfs, and more
>>>> importantly how to revoke it on device removal]
>>>
>>> We allow mmap on PCIe config space today, right?  Why is this different
>>> from what pci_create_legacy_files() does today?
>>>
>>>> On Tue, Jul 05, 2022 at 10:44:49AM -0600, Logan Gunthorpe wrote:
>>>>> We might be able to. I'm not sure. I'll have to figure out how to find
>>>>> that inode from the p2pdma code. I haven't found an obvious interface to
>>>>> do that.
>>>>
>>>> I think the right way to approach this would be a new sysfs API
>>>> that calls unmap_mapping_range internally instead of
>>>> exposing the inode. I suspect that might actually be the right thing
>>>> to do for iomem_inode as well.
>>>
>>> Why do we need something new and how is this any different from the PCI
>>> binary files I mention above?  We have supported PCI hotplug for a very
>>> long time, do the current PCI binary sysfs files not work properly with
>>> mmap and removing a device?
>>
>> The P2PDMA code allocates and hands out struct pages to userspace that
>> are backed with ZONE_DEVICE memory from a device's BAR. This is quite
>> different from the existing binary files mentioned above which neither
>> support struct pages nor allocation.
> 
> Why would you want to do this through a sysfs interface?  that feels
> horrid...

The current version does it through a char device, but that requires
creating a simple_fs and anon_inode for teardown on driver removal, plus
a bunch of hooks through the driver that exposes it (NVMe, in this case)
to set this all up.

Christoph is suggesting a sysfs interface which could potentially avoid
the anon_inode and all of the extra hooks. It has some significant
benefits and maybe some small downsides, but I wouldn't describe it as
horrid.

Logan
Christoph Hellwig July 6, 2022, 6:51 a.m. UTC | #17
On Tue, Jul 05, 2022 at 12:16:45PM -0600, Logan Gunthorpe wrote:
> The current version does it through a char device, but that requires
> creating a simple_fs and anon_inode for teardown on driver removal, plus
> a bunch of hooks through the driver that exposes it (NVMe, in this case)
> to set this all up.
> 
> Christoph is suggesting a sysfs interface which could potentially avoid
> the anon_inode and all of the extra hooks. It has some significant
> benefits and maybe some small downsides, but I wouldn't describe it as
> horrid.

Yeah, I don't think it is horrible, it fits in with the resource files
for the BARs, and solves a lot of problems.  Greg, can you explain
what would be so bad about it?
Greg KH July 6, 2022, 7:04 a.m. UTC | #18
On Wed, Jul 06, 2022 at 08:51:27AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 05, 2022 at 12:16:45PM -0600, Logan Gunthorpe wrote:
> > The current version does it through a char device, but that requires
> > creating a simple_fs and anon_inode for teardown on driver removal, plus
> > a bunch of hooks through the driver that exposes it (NVMe, in this case)
> > to set this all up.
> > 
> > Christoph is suggesting a sysfs interface which could potentially avoid
> > the anon_inode and all of the extra hooks. It has some significant
> > benefits and maybe some small downsides, but I wouldn't describe it as
> > horrid.
> 
> Yeah, I don't think it is horrible, it fits in with the resource files
> for the BARs, and solves a lot of problems.  Greg, can you explain
> what would be so bad about it?

As you mention, you will have to pass different things down into sysfs
in order for that to be possible.  If it matches the resource files like
we currently have today, that might not be that bad, but it still feels
odd to me.  Let's see an implementation and a Documentation/ABI/ entry
first though.

thanks,

greg k-h
Logan Gunthorpe July 6, 2022, 9:30 p.m. UTC | #19
On 2022-07-06 01:04, Greg Kroah-Hartman wrote:
> On Wed, Jul 06, 2022 at 08:51:27AM +0200, Christoph Hellwig wrote:
>> On Tue, Jul 05, 2022 at 12:16:45PM -0600, Logan Gunthorpe wrote:
>>> The current version does it through a char device, but that requires
>>> creating a simple_fs and anon_inode for teardown on driver removal, plus
>>> a bunch of hooks through the driver that exposes it (NVMe, in this case)
>>> to set this all up.
>>>
>>> Christoph is suggesting a sysfs interface which could potentially avoid
>>> the anon_inode and all of the extra hooks. It has some significant
>>> benefits and maybe some small downsides, but I wouldn't describe it as
>>> horrid.
>>
>> Yeah, I don't think it is horrible, it fits in with the resource files
>> for the BARs, and solves a lot of problems.  Greg, can you explain
>> what would be so bad about it?
> 
> As you mention, you will have to pass different things down into sysfs
> in order for that to be possible.  If it matches the resource files like
> we currently have today, that might not be that bad, but it still feels
> odd to me.  Let's see an implementation and a Documentation/ABI/ entry
> first though.

I'll work something up in the coming weeks.

Thanks,

Logan

Patch

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index d4e635012ffe..a6572069008b 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -17,14 +17,19 @@ 
 #include <linux/genalloc.h>
 #include <linux/memremap.h>
 #include <linux/percpu-refcount.h>
+#include <linux/pfn_t.h>
+#include <linux/pseudo_fs.h>
 #include <linux/random.h>
 #include <linux/seq_buf.h>
 #include <linux/xarray.h>
+#include <uapi/linux/magic.h>
 
 struct pci_p2pdma {
 	struct gen_pool *pool;
 	bool p2pmem_published;
 	struct xarray map_types;
+	struct inode *inode;
+	bool active;
 };
 
 struct pci_p2pdma_pagemap {
@@ -101,6 +106,41 @@  static const struct attribute_group p2pmem_group = {
 	.name = "p2pmem",
 };
 
+/*
+ * P2PDMA internal mount
+ * Fake an internal VFS mount-point in order to allocate struct address_space
+ * mappings to remove VMAs on unbind events.
+ */
+static int pci_p2pdma_fs_cnt;
+static struct vfsmount *pci_p2pdma_fs_mnt;
+
+static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type pci_p2pdma_fs_type = {
+	.name = "p2dma",
+	.owner = THIS_MODULE,
+	.init_fs_context = pci_p2pdma_fs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static void p2pdma_page_free(struct page *page)
+{
+	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+	struct percpu_ref *ref;
+
+	gen_pool_free_owner(pgmap->provider->p2pdma->pool,
+			    (uintptr_t)page_to_virt(page), PAGE_SIZE,
+			    (void **)&ref);
+	percpu_ref_put(ref);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+	.page_free = p2pdma_page_free,
+};
+
 static void pci_p2pdma_release(void *data)
 {
 	struct pci_dev *pdev = data;
@@ -117,6 +157,9 @@  static void pci_p2pdma_release(void *data)
 	gen_pool_destroy(p2pdma->pool);
 	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
 	xa_destroy(&p2pdma->map_types);
+
+	iput(p2pdma->inode);
+	simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -134,17 +177,32 @@  static int pci_p2pdma_setup(struct pci_dev *pdev)
 	if (!p2p->pool)
 		goto out;
 
-	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	error = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+			      &pci_p2pdma_fs_cnt);
 	if (error)
 		goto out_pool_destroy;
 
+	p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
+	if (IS_ERR(p2p->inode)) {
+		error = -ENOMEM;
+		goto out_unpin_fs;
+	}
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_put_inode;
+
 	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
 	if (error)
-		goto out_pool_destroy;
+		goto out_put_inode;
 
 	rcu_assign_pointer(pdev->p2pdma, p2p);
 	return 0;
 
+out_put_inode:
+	iput(p2p->inode);
+out_unpin_fs:
+	simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 out_pool_destroy:
 	gen_pool_destroy(p2p->pool);
 out:
@@ -152,6 +210,18 @@  static int pci_p2pdma_setup(struct pci_dev *pdev)
 	return error;
 }
 
+static void pci_p2pdma_unmap_mappings(void *data)
+{
+	struct pci_dev *pdev = data;
+	struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+
+	/* Ensure no new pages can be allocated in mappings */
+	p2pdma->active = false;
+	synchronize_rcu();
+
+	unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
+}
+
 /**
  * pci_p2pdma_add_resource - add memory for use as p2p memory
  * @pdev: the device to add the memory to
@@ -198,6 +268,7 @@  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->range.end = pgmap->range.start + size - 1;
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+	pgmap->ops = &p2pdma_pgmap_ops;
 
 	p2p_pgmap->provider = pdev;
 	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
@@ -209,6 +280,11 @@  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		goto pgmap_free;
 	}
 
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
+					 pdev);
+	if (error)
+		goto pages_free;
+
 	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
 	error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
 			pci_bus_address(pdev, bar) + offset,
@@ -217,6 +293,7 @@  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	if (error)
 		goto pages_free;
 
+	p2pdma->active = true;
 	pci_info(pdev, "added peer-to-peer DMA memory %#llx-%#llx\n",
 		 pgmap->range.start, pgmap->range.end);
 
@@ -1023,3 +1100,132 @@  ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 	return sprintf(page, "%s\n", pci_name(p2p_dev));
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
+
+/**
+ * pci_p2pdma_file_open - setup file mapping to store P2PMEM VMAs
+ * @pdev: the device to allocate memory from
+ * @file: the file to open
+ *
+ * Set f_mapping of the file to the p2pdma inode so that mappings
+ * can be torn down on device unbind.
+ */
+int pci_p2pdma_file_open(struct pci_dev *pdev, struct file *file)
+{
+	struct pci_p2pdma *p2pdma;
+	int ret;
+
+	ret = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+			    &pci_p2pdma_fs_cnt);
+	if (ret)
+		return ret;
+
+	rcu_read_lock();
+	p2pdma = rcu_dereference(pdev->p2pdma);
+	if (p2pdma) {
+		ihold(p2pdma->inode);
+		file->f_mapping = p2pdma->inode->i_mapping;
+		rcu_read_unlock();
+	} else {
+		rcu_read_unlock();
+		simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_file_open);
+
+/**
+ * pci_p2pdma_file_release - release a file opened with pci_p2pdma_file_open()
+ * @file: the file that was opened with pci_p2pdma_file_open()
+ *
+ * Release the reference to f_mapping set by pci_p2pdma_file_open()
+ */
+void pci_p2pdma_file_release(struct file *file)
+{
+	if (file->f_mapping->host != file->f_inode) {
+		iput(file->f_mapping->host);
+		simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
+	}
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_file_release);
+
+/**
+ * pci_mmap_p2pmem - setup an mmap region to be backed with P2PDMA memory
+ *	that was registered with pci_p2pdma_add_resource()
+ * @pdev: the device to allocate memory from
+ * @vma: the userspace vma to map the memory to
+ *
+ * The file must call pci_p2pdma_file_open() in its open() operation.
+ *
+ * Returns 0 on success, or a negative error code on failure
+ */
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
+{
+	size_t len = vma->vm_end - vma->vm_start;
+	struct pci_p2pdma *p2pdma;
+	struct percpu_ref *ref;
+	unsigned long vaddr;
+	void *kaddr;
+	int ret;
+
+	/* prevent private mappings from being established */
+	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+		pci_info_ratelimited(pdev,
+				     "%s: fail, attempted private mapping\n",
+				     current->comm);
+		return -EINVAL;
+	}
+
+	if (vma->vm_pgoff) {
+		pci_info_ratelimited(pdev,
+				     "%s: fail, attempted mapping with non-zero offset\n",
+				     current->comm);
+		return -EINVAL;
+	}
+
+	rcu_read_lock();
+	p2pdma = rcu_dereference(pdev->p2pdma);
+	if (!p2pdma || !p2pdma->active) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
+	if (!kaddr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * vm_insert_page() can sleep, so a reference is taken to mapping
+	 * such that rcu_read_unlock() can be done before inserting the
+	 * pages
+	 */
+	if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
+		ret = -ENODEV;
+		goto out_free_mem;
+	}
+	rcu_read_unlock();
+
+	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
+		ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
+		if (ret) {
+			gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+			return ret;
+		}
+		percpu_ref_get(ref);
+		put_page(virt_to_page(kaddr));
+		kaddr += PAGE_SIZE;
+		len -= PAGE_SIZE;
+	}
+
+	percpu_ref_put(ref);
+
+	return 0;
+out_free_mem:
+	gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+out:
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_mmap_p2pmem);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 2c07aa6b7665..0ffe782940da 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -34,6 +34,9 @@  int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 			       bool use_p2pdma);
+int pci_p2pdma_file_open(struct pci_dev *pdev, struct file *file);
+void pci_p2pdma_file_release(struct file *file);
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
@@ -90,6 +93,19 @@  static inline ssize_t pci_p2pdma_enable_show(char *page,
 {
 	return sprintf(page, "none\n");
 }
+static inline int pci_p2pdma_file_open(struct pci_dev *pdev,
+				       struct file *file)
+{
+	return 0;
+}
+static inline void pci_p2pdma_file_release(struct file *file)
+{
+}
+static inline int pci_mmap_p2pmem(struct pci_dev *pdev,
+				  struct vm_area_struct *vma)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_PCI_P2PDMA */
 
 
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f724129c0425..59ba2e60dc03 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -95,6 +95,7 @@ 
 #define BPF_FS_MAGIC		0xcafe4a11
 #define AAFS_MAGIC		0x5a3c69f0
 #define ZONEFS_MAGIC		0x5a4f4653
+#define P2PDMA_MAGIC		0x70327064
 
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC		0x15013346