[RFC,1/8] Introduce Peer-to-Peer memory (p2pmem) device

Message ID 1490911959-5146-2-git-send-email-logang@deltatee.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Commit Message

Logan Gunthorpe March 30, 2017, 10:12 p.m. UTC
A p2pmem device is simply a PCI card with a BAR space that points to
regular memory. This may be an independent PCI card or part of another
completely unrelated device (like an IB card or an NVMe card). The
p2pmem device is designed such that other drivers may register p2pmem
memory for use by the system.

p2pmem devices then provide a kernel interface so that other subsystems
can allocate chunks of this memory as necessary to facilitate transfers
between two PCI peers. Depending on hardware, this may reduce the
bandwidth of the transfer but could significantly reduce pressure
on system memory. This may be desirable in many cases: for example a
system could be designed with a small CPU connected to a PCI switch by a
small number of lanes which would maximize the number of lanes available
to connect to NVMe devices.

Since using p2p memory can often have negative effects, especially
with older PCI root complexes, the code is designed to only utilize the
p2pmem device if all the devices involved in a transfer are behind the
same PCI switch. Other cases may still work or be desirable for some
end users but it was decided this would be the best course of action
to prevent users enabling it and wondering why their performance
dropped.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/memory/Kconfig  |   5 +
 drivers/memory/Makefile |   2 +
 drivers/memory/p2pmem.c | 403 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/p2pmem.h  | 103 +++++++++++++
 4 files changed, 513 insertions(+)
 create mode 100644 drivers/memory/p2pmem.c
 create mode 100644 include/linux/p2pmem.h

Comments

Sinan Kaya March 31, 2017, 6:49 p.m. UTC | #1
Hi Logan,

> +/**
> + * p2pmem_unregister() - unregister a p2pmem device
> + * @p: the device to unregister
> + *
> + * The device will remain until all users are done with it
> + */
> +void p2pmem_unregister(struct p2pmem_dev *p)
> +{
> +	if (!p)
> +		return;
> +
> +	dev_info(&p->dev, "unregistered");
> +	device_del(&p->dev);
> +	ida_simple_remove(&p2pmem_ida, p->id);

Don't you need to clean up the p->pool here?

> +	put_device(&p->dev);
> +}
> +EXPORT_SYMBOL(p2pmem_unregister);
> +

I don't like the ugliness around the switch port to be honest. 

Going to whitelist/blacklist looks simpler in my opinion.

Sinan
Logan Gunthorpe March 31, 2017, 9:23 p.m. UTC | #2
On 31/03/17 12:49 PM, Sinan Kaya wrote:
> Don't you need to clean up the p->pool here?

See Patch 7 in the series.

>> +	put_device(&p->dev);
>> +}
>> +EXPORT_SYMBOL(p2pmem_unregister);
>> +
> 
> I don't like the ugliness around the switch port to be honest. 
> 
> Going to whitelist/blacklist looks simpler in my opinion.

What exactly would you white/black list? It can't be the NIC or the
disk. If it's going to be a white/black list on the switch or root port
then you'd need essentially the same code to ensure they are all behind
the same switch or root port. So you could add a white/black list on top
of the current scheme but you couldn't get rid of it.

Our original plan was to just punt the decision to userspace but we had
pushback on that at LSF.

Thanks,

Logan
Sinan Kaya March 31, 2017, 9:38 p.m. UTC | #3
On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
> What exactly would you white/black list? It can't be the NIC or the
> disk. If it's going to be a white/black list on the switch or root port
> then you'd need essentially the same code to ensure they are all behind
> the same switch or root port.

What is so special about being connected to the same switch?

Why don't we allow the feature by default and blacklist the root ports
that don't work with a quirk?

I'm looking at this from portability perspective to be honest.

I'd rather see the feature enabled by default without any assumptions.
Using it with a switch is just a use case that you happened to test.
It can allow new architectures to use your code tomorrow.

Sinan
Logan Gunthorpe March 31, 2017, 10:42 p.m. UTC | #4
On 31/03/17 03:38 PM, Sinan Kaya wrote:
> On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
>> What exactly would you white/black list? It can't be the NIC or the
>> disk. If it's going to be a white/black list on the switch or root port
>> then you'd need essentially the same code to ensure they are all behind
>> the same switch or root port.
> 
> What is so special about being connected to the same switch?
> 
> Why don't we allow the feature by default and blacklist the root ports
> that don't work with a quirk?

Well root ports have the same issue here. There may be more than one
root port or other buses (ie QPI) between the devices in question. So
you can't just say "this system has root port X therefore we can always
use p2pmem". In the end, if you want to do any kind of restrictions
you're going to have to walk the tree, as the code currently does, and
figure out what's between the devices being used and black or white list
accordingly. Then seeing there's just such a vast number of devices out
there you'd almost certainly have to use some kind of white list and not
a black list. Then the question becomes which devices will be white
listed? The first to be listed would be switches seeing they will always
work. This is pretty much what we have (though it doesn't currently
cover multiple levels of switches). The next step, if someone wanted to
test with specific hardware, might be to allow the case where all the
devices are behind the same root port which is Intel Ivy Bridge or newer.
However, I don't think a comprehensive white list should be a
requirement for this work to go forward and I don't think anything
you've suggested will remove any of the "ugliness".

What we discussed at LSF was that only allowing cases with a switch was
the simplest way to be sure any given setup would actually work.
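
The same-switch rule described above can be sketched in plain userspace C.
This is only an illustration, not the code from the patch: struct node and
both helpers are hypothetical stand-ins for struct pci_dev and the two
pci_upstream_bridge() calls that the patch's comment on
get_upstream_switch_port() describes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: only the upstream link matters. */
struct node {
	const char *name;
	struct node *upstream;	/* bridge directly above, NULL above a root port */
};

/*
 * Two hops up: the first lands on the switch downstream port the device
 * is plugged into, the second on the switch upstream port. A device
 * attached directly to a root port has nothing two hops up, so it is
 * rejected (NULL).
 */
static struct node *upstream_switch_port(struct node *dev)
{
	if (!dev || !dev->upstream)
		return NULL;

	return dev->upstream->upstream;
}

/* True only if every device sits behind the same switch upstream port. */
static bool all_behind_same_switch(struct node **devs, size_t n)
{
	struct node *first;
	size_t i;

	if (n == 0)
		return false;

	first = upstream_switch_port(devs[0]);
	if (!first)
		return false;

	for (i = 1; i < n; i++)
		if (upstream_switch_port(devs[i]) != first)
			return false;

	return true;
}
```

Like the scheme discussed here, the sketch covers a single level of switch
only; devices behind nested switches would compare unequal.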

> I'm looking at this from portability perspective to be honest.

I'm looking at this from the fact that there's a vast number of
topologies and devices involved, and figuring out which will work is
very complicated and could require a lot of hardware testing. The LSF
folks were primarily concerned with not having users enable the feature
and see breakage or terrible performance.

> I'd rather see the feature enabled by default without any assumptions.
> Using it with a switch is just a use case that you happened to test.
> It can allow new architectures to use your code tomorrow.

That's why I was advocating for letting userspace decide such that if
you're setting up a system with this you say to use a specific p2pmem
device and then you are responsible to test and benchmark it and decide
whether to use it going forward. However, this has received a lot of push back.

Logan
Sinan Kaya March 31, 2017, 11:51 p.m. UTC | #5
On 3/31/2017 6:42 PM, Logan Gunthorpe wrote:
> 
> 
> On 31/03/17 03:38 PM, Sinan Kaya wrote:
>> On 3/31/2017 5:23 PM, Logan Gunthorpe wrote:
>>> What exactly would you white/black list? It can't be the NIC or the
>>> disk. If it's going to be a white/black list on the switch or root port
>>> then you'd need essentially the same code to ensure they are all behind
>>> the same switch or root port.
>>
>> What is so special about being connected to the same switch?
>>
>> Why don't we allow the feature by default and blacklist the root ports
>> that don't work with a quirk?
> 
> Well root ports have the same issue here. There may be more than one
> root port or other buses (ie QPI) between the devices in question. So
> you can't just say "this system has root port X therefore we can always
> use p2pmem". 

We only care about devices on the data path between two devices.

> In the end, if you want to do any kind of restrictions
> you're going to have to walk the tree, as the code currently does, and
> figure out what's between the devices being used and black or white list
> accordingly. Then seeing there's just such a vast number of devices out
> there you'd almost certainly have to use some kind of white list and not
> a black list. Then the question becomes which devices will be white
> listed? 

How about a combination of blacklist + time bomb + peer-to-peer feature?

You can put a restriction with DMI/SMBIOS such that all devices from 2016
work else they belong to blacklist.

> The first to be listed would be switches seeing they will always
> work. This is pretty much what we have (though it doesn't currently
> cover multiple levels of switches). The next step, if someone wanted to
> test with specific hardware, might be to allow the case where all the
> devices are behind the same root port which is Intel Ivy Bridge or newer.

Sorry, I'm not familiar with Intel architecture. Based on what you just
wrote, I think I see your point. 

I'm trying to generalize what you are doing to a little
bigger context so that I can use it on another architecture like arm64
where I may or may not have a switch.

This text below is sort of repeating what you are writing above. 

How about this:

The goal is to find a common parent between any two devices that need to
use your code. 

- all bridges/switches on the data path need to support peer-to-peer,
otherwise stop.

- Make sure that all devices on the data path are not blacklisted via your
code.

- If there is at least somebody blacklisted, we stop and the feature is
not allowed.

- If we find a common parent and no errors, you are good to go.

- We don't care about devices above the common parent whether they have
some feature X, Y, Z or not. 

Maybe, a little bit less code than what you have but it is flexible and
not too hard to implement.
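
The steps listed above can be sketched in userspace C as follows. This is
an illustration under stated assumptions: struct node and its p2p_ok and
blacklisted flags are hypothetical, and how a real kernel would fill them
in is exactly the open question in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PCI hierarchy; none of these fields exist in the kernel. */
struct node {
	struct node *parent;	/* NULL at the top of the hierarchy */
	bool is_bridge;
	bool p2p_ok;		/* bridge forwards peer-to-peer traffic */
	bool blacklisted;
};

static int depth(const struct node *n)
{
	int d = 0;

	while (n->parent) {
		n = n->parent;
		d++;
	}
	return d;
}

/* Lowest node that is an ancestor of (or equal to) both a and b. */
static struct node *common_parent(struct node *a, struct node *b)
{
	int da, db;

	if (!a || !b)
		return NULL;

	da = depth(a);
	db = depth(b);
	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;

	while (a != b) {
		a = a->parent;
		b = b->parent;
		if (!a || !b)
			return NULL;	/* separate hierarchies */
	}
	return a;
}

/* Check every node from 'from' up to and including 'top'. */
static bool path_ok(const struct node *from, const struct node *top)
{
	const struct node *n;

	for (n = from; n; n = n->parent) {
		if (n->blacklisted)
			return false;
		if (n->is_bridge && !n->p2p_ok)
			return false;
		if (n == top)
			return true;
	}
	return false;
}

/* Nodes above the common parent are deliberately never inspected. */
static bool p2p_allowed(struct node *a, struct node *b)
{
	struct node *top = common_parent(a, b);

	return top && path_ok(a, top) && path_ok(b, top);
}
```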

Well, the code is in RFC. I don't see why we can't remove some restrictions
and still have your code move forward. 

> However, I don't think a comprehensive white list should be a
> requirement for this work to go forward and I don't think anything
> you've suggested will remove any of the "ugliness".

I don't think the ask above is a very big deal. If you feel like
addressing this on another patchset like you suggested in your cover letter,
I'm fine with that too.

> 
> What we discussed at LSF was that only allowing cases with a switch was
> the simplest way to be sure any given setup would actually work.
> 
>> I'm looking at this from portability perspective to be honest.
> 
> I'm looking at this from the fact that there's a vast number of
> topologies and devices involved, and figuring out which will work is
> very complicated and could require a lot of hardware testing. The LSF
> folks were primarily concerned with not having users enable the feature
> and see breakage or terrible performance.
> 
>> I'd rather see the feature enabled by default without any assumptions.
>> Using it with a switch is just a use case that you happened to test.
>> It can allow new architectures to use your code tomorrow.
> 
> That's why I was advocating for letting userspace decide such that if
> you're setting up a system with this you say to use a specific p2pmem
> device and then you are responsible to test and benchmark it and decide
> whether to use it going forward. However, this has received a lot of push back.

Yeah, we shouldn't trust the userspace for such things.

> 
> Logan
>
Logan Gunthorpe April 1, 2017, 1:57 a.m. UTC | #6
On 31/03/17 05:51 PM, Sinan Kaya wrote:
> You can put a restriction with DMI/SMBIOS such that all devices from 2016
> work else they belong to blacklist.

How do you get a manufacturing date for a given device within the
kernel? Is this actually something generically available?

Logan
Sinan Kaya April 1, 2017, 2:17 a.m. UTC | #7
On 2017-03-31 21:57, Logan Gunthorpe wrote:
> On 31/03/17 05:51 PM, Sinan Kaya wrote:
>> You can put a restriction with DMI/SMBIOS such that all devices from 
>> 2016
>> work else they belong to blacklist.
> 
> How do you get a manufacturing date for a given device within the
> kernel? Is this actually something generically available?
> 
> Logan

SMBIOS calls are used all over the place in the kernel for introducing new
functionality while maintaining backwards compatibility.

See drivers/pci and drivers/acpi directory.
Logan Gunthorpe April 1, 2017, 10:16 p.m. UTC | #8
Hey,

On 31/03/17 08:17 PM, okaya@codeaurora.org wrote:
> See drivers/pci and drivers/acpi directory.

The best I could find was the date of the firmware/bios. I really don't
think it makes sense to tie the two together. And really the more that
I think about it trying to do a date cutoff for this seems crazy without
very comprehensive hardware testing done. I have no idea which AMD chips
have decent root ports for this and then if we include all of ARM and
POWERPC, etc there's a huge amount of unknown hardware. Saying that the
system's firmware has to be written after 2016 seems like an arbitrary
restriction that isn't likely to correlate to any working systems.

I still say the only sane thing to do is allow all switches and then add
a whitelist of root ports that are known to work well. If we care about
preventing broken systems in a comprehensive way then that's the only
thing that is going to work.

Logan
Sinan Kaya April 2, 2017, 2:26 a.m. UTC | #9
Hi Logan,

I added Alex and Bjorn above.

On 4/1/2017 6:16 PM, Logan Gunthorpe wrote:
> Hey,
> 
> On 31/03/17 08:17 PM, okaya@codeaurora.org wrote:
>> See drivers/pci and drivers/acpi directory.
> 
> The best I could find was the date of the firmware/bios. I really don't
> think it makes sense to tie the two together. And really the more that
> I think about it trying to do a date cutoff for this seems crazy without
> very comprehensive hardware testing done. I have no idea which AMD chips
> have decent root ports for this and then if we include all of ARM and
> POWERPC, etc there's a huge amount of unknown hardware. Saying that the
> system's firmware has to be written after 2016 seems like an arbitrary
> restriction that isn't likely to correlate to any working systems.

I recommended a combination of blacklist + p2p capability + BIOS date.
Not just BIOS date. BIOS date by itself is useless.

As you may or may not be aware, PCI defines capability registers for
discovering features. Unfortunately, there is no direct p2p capability
register. 

However, Access Control Services (ACS) capability register has flags
indicating p2p functionality. p2p feature needs to be discovered from ACS. 

https://pdos.csail.mit.edu/~sbw/links/ECN_access_control_061011.pdf

This is just one of the many P2P capability flags.

"ACS P2P Request Redirect: must be implemented by Root Ports that support peer-to-peer
traffic with other Root Ports; must be implemented by Switch Downstream Ports."

If the root port or a switch does not have ACS capability, p2p is not allowed.
If these p2p flags are not set, don't allow p2p feature.

The normal expectation from any system (root port/switch) is not to set these
bits unless p2p feature is present/working.

However, there could be systems in the field with ACS capability but broken HW
or broken FW. 

This is when the BIOS date helps so that you don't break existing systems.

The right thing in my opinion is 

1. blacklist by pci vendor/device id like any other pci quirk in quirks.c.
2. Require this feature for recent HW/BIOS by checking the BIOS date.
3. Check the p2p capability from ACS. 
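
The three checks above could be combined roughly as in the userspace sketch
below. It illustrates the proposal as stated, nothing more: struct port,
struct quirk and p2p_policy_allows() are hypothetical names, the cutoff year
is a caller-supplied assumption, and only the bit value 0x04 is taken from
the kernel's PCI_ACS_RR definition in pci_regs.h. Whether that ACS bit says
anything about a port's ability to forward peer-to-peer traffic well is
disputed in the reply that follows.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as PCI_ACS_RR (ACS P2P Request Redirect) in pci_regs.h. */
#define ACS_P2P_RR 0x04

/* Hypothetical snapshot of one root port or switch port. */
struct port {
	uint16_t vendor;
	uint16_t device;
	bool has_acs;		/* ACS extended capability present */
	uint16_t acs_cap;	/* contents of the ACS capability register */
};

/* Hypothetical blacklist entry, keyed like a PCI quirk. */
struct quirk {
	uint16_t vendor;
	uint16_t device;
};

static bool is_blacklisted(const struct port *p,
			   const struct quirk *bl, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (bl[i].vendor == p->vendor && bl[i].device == p->device)
			return true;
	return false;
}

static bool p2p_policy_allows(const struct port *p, int bios_year,
			      const struct quirk *bl, size_t n,
			      int cutoff_year)
{
	/* 1. Known-bad hardware is rejected outright. */
	if (is_blacklisted(p, bl, n))
		return false;

	/* 2. Firmware older than the cutoff is not trusted. */
	if (bios_year < cutoff_year)
		return false;

	/* 3. The port must advertise ACS with the P2P flag set. */
	return p->has_acs && (p->acs_cap & ACS_P2P_RR);
}
```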

> 
> I still say the only sane thing to do is allow all switches and then add
> a whitelist of root ports that are known to work well. If we care about
> preventing broken systems in a comprehensive way then that's the only
> thing that is going to work.

We can't guarantee all switches will work either. See above for instructions
on when this feature should be enabled.

Let's step back for a moment.

If we think about logical blocks here, p2pmem is a pci user. It should
not walk the bus and search for possible good things by itself. We don't
usually put code into the kernel's driver directory for specific arch/
specific devices. There are hundreds of device drivers in the kernel. 
None of them are guaranteed to work in any architecture but they don't
prohibit use either.

System integrators like me test these drivers against their own systems,
find bugs to remove arch specific assumptions and post patches.

p2pmem is potentially just one of the many users of p2p capability in the
system.

This p2p detection needs to be done by some p2p driver inside the 
drivers/pci directory or inside drivers/pci/probe.c.

This p2p driver needs to verify ACS permissions similar to what
pci_device_group() does.

If the system is p2p capable, this p2p driver sets p2p_capable bit in 
struct pci_dev.

p2pmem driver then uses this bit to decide when it should enable its feature.

Bjorn and Alex need to decide on the final solution as they maintain
PCI and virtualization (ACS) respectively.

Sinan
Logan Gunthorpe April 2, 2017, 5:21 p.m. UTC | #10
On 01/04/17 08:26 PM, Sinan Kaya wrote:
> I recommended a combination of blacklist + p2p capability + BIOS date.
> Not just BIOS date. BIOS date by itself is useless.

Well this proposal doesn't work for me at all. None of my hardware has
the p2p ACS capability and my BIOS date is in 2013 and yet my switch
works perfectly fine. You're going to have to make the case that ACS p2p
capabilities are somehow correlated with a device's ability to move TLPs
between ports with reasonable performance. (For example my sandy bridge
CPU does support p2p transactions fine, it just doesn't have great
performance.) The documentation doesn't suggest this nor can I even find
(via google) any lspci dump that suggest there is hardware that sets
this p2p capability. The ACS P2P flag is meant to indicate something
completely different from what you are proposing using it for: it's
meant to indicate the ability to manage permissions of p2p destined TLPs
not the ability to efficiently transfer them.

> This is when the BIOS date helps so that you don't break existing systems.

I'm not that worried about this code breaking existing systems. There
are significant trade-offs with using p2pmem (ie. you are quite likely
sacrificing performance for memory QOS or upstream PCI bandwidth), and
therefore the user _has_ to specifically say to use it. This is why
we've put a flag in the nvme target code that defaults to off. Thus we
are not going to have a situation where people upgrade their kernels and
see broken or slow systems. People _have_ to make the decision to turn
it on and decide based on their use case whether it's appropriate.

> We can't guarantee all switches will work either. See above for instructions
> on when this feature should be enabled.

It's a lot easier to say that all switches will work than it is for root
ports. This is essentially what switches are designed for, so I'd be
surprised to find one that doesn't work. Root ports are the trouble here
seeing it's a lot more likely for them to be designed without
considering that traffic needs to move between ports efficiently. If we
do find extremely broken switches that don't support this then we'd
probably want to create a black list for that. Also, there's
significantly fewer PCI switch products on the market than there are
root port instances, so a black list would be much easier to manage there.

> If we think about logical blocks here, p2pmem is a pci user. 

Well technically, the only thing that ties p2pmem to pci is the concept
of which devices to allow its use with. There's absolutely no reason
why any other bus couldn't use the same code and just say any devices on
that bus allow p2pmem.

>It should
> not walk the bus and search for possible good things by itself. We don't
> usually put code into the kernel's driver directory for specific arch/
> specific devices. There are hundreds of device drivers in the kernel. 
> None of them are guaranteed to work in any architecture but they don't
> prohibit use either.

I'd agree that the final code for determining p2p capability should
belong in the pci code. Or more likely an even more generic interface
with struct device that is bus agnostic. Though, I'd hope that a lot of
this could happen later when there are more kernel users actually
wanting to use this code. It's hard to design a generic interface when
you only have one user at present.

> p2pmem is potentially just one of the many users of p2p capability in the
> system.

Yup, we've had similar feedback from Max. However, without knowing the
needs of a generic p2p device at this point, it's hard to consider this
at all. I am open to it though.

Logan
Sinan Kaya April 2, 2017, 9:03 p.m. UTC | #11
On 4/2/2017 1:21 PM, Logan Gunthorpe wrote:
>> This is when the BIOS date helps so that you don't break existing systems.
> I'm not that worried about this code breaking existing systems. There
> are significant trade-offs with using p2pmem (ie. you are quite likely
> sacrificing performance for memory QOS or upstream PCI bandwidth), and
> therefore the user _has_ to specifically say to use it. This is why
> we've put a flag in the nvme target code that defaults to off. Thus we
> are not going to have a situation where people upgrade their kernels and
> see broken or slow systems. People _have_ to make the decision to turn
> it on and decide based on their use case whether it's appropriate.
> 

OK. I didn't know the feature was not enabled by default. This is even 
easier now. 

Push the decision all the way to the user. Let them decide whether they
want this feature to work on a root port connected port or under the
switch.

>> We can't guarantee all switches will work either. See above for instructions
>> on when this feature should be enabled.
> It's a lot easier to say that all switches will work than it is for root
> ports. This is essentially what switches are designed for, so I'd be
> surprised to find one that doesn't work. Root ports are the trouble here
> seeing it's a lot more likely for them to be designed without
> considering that traffic needs to move between ports efficiently. If we
> do find extremely broken switches that don't support this then we'd
> probably want to create a black list for that. Also, there's
> significantly fewer PCI switch products on the market than there are
> root port instances, so a black list would be much easier to manage there.
> 

I thought the issue was that the feature didn't work at all with some
root ports or there was some kind of memory corruption issue that you
were trying to avoid with the existing systems.

If you are just worried about performance, the switch recommendation belongs
in your particular product tuning guide or a howto document, not in the
actual code itself.

I think you should get rid of all pci searching business in your code.
Logan Gunthorpe April 3, 2017, 4:26 a.m. UTC | #12
On 02/04/17 03:03 PM, Sinan Kaya wrote:
> Push the decision all the way to the user. Let them decide whether they
> want this feature to work on a root port connected port or under the
> switch.

Yes, I prefer this too. If other folks agree with that I'd be very happy
to go back to user chooses. I think Sagi was the most vocal proponent
for kernel chooses at LSF so hopefully he will read this thread and
offer some opinion.

> I thought the issue was that the feature didn't work at all with some
> root ports or there was some kind of memory corruption issue that you
> were trying to avoid with the existing systems.

I *think* there are some much older root ports where P2P TLPs don't even
get through. But it doesn't really change the situation: in the nvmet
case, the user would enable p2pmem and then be unable to connect and
thus choose to disable it going forward. Not a big difference from the
user seeing bad performance and not choosing to enable it.

> I think you should get rid of all pci searching business in your code.

Yes, my original proposal was when you configure the nvme target you
choose the specific p2pmem device to use. That code had no tie ins to PCI
code and could, in theory, work generically with any device and bus.

Logan
Marta Rybczynska April 25, 2017, 11:58 a.m. UTC | #13
> On 02/04/17 03:03 PM, Sinan Kaya wrote:
>> Push the decision all the way to the user. Let them decide whether they
>> want this feature to work on a root port connected port or under the
>> switch.
> 
> Yes, I prefer this too. If other folks agree with that I'd be very happy
> to go back to user chooses. I think Sagi was the most vocal proponent
> for kernel chooses at LSF so hopefully he will read this thread and
> offer some opinion.
> 
>> I thought the issue was that the feature didn't work at all with some
>> root ports or there was some kind of memory corruption issue that you
>> were trying to avoid with the existing systems.
> 
> I *think* there are some much older root ports where P2P TLPs don't even
> get through. But it doesn't really change the situation: in the nvmet
> case, the user would enable p2pmem and then be unable to connect and
> thus choose to disable it going forward. Not a big difference from the
> user seeing bad performance and not choosing to enable it.
> 
>> I think you should get rid of all pci searching business in your code.
> 
> Yes, my original proposal was when you configure the nvme target you
> choose the specific p2pmem device to use. That code had no tie ins to PCI
> code and could, in theory, work generically with any device and bus.
> 

I would add one issue that doesn't seem to be addressed: in my experience
P2P doesn't work when the IOMMU is activated. It works best with deactivation
at the BIOS level; even the kernel options are not enough in some cases.

This is another argument to leave the choice to the user/integrator.

Marta
Logan Gunthorpe April 25, 2017, 4:58 p.m. UTC | #14
On 25/04/17 05:58 AM, Marta Rybczynska wrote:
> I would add one issue that doesn't seem to be addressed: in my experience
> P2P doesn't work when the IOMMU is activated. It works best with deactivation
> at the BIOS level; even the kernel options are not enough in some cases.

Well this would likely be addressed by 'arch_p2p_cross_segment' as
proposed by Jason.

Logan
Patch

diff --git a/drivers/memory/Kconfig b/drivers/memory/Kconfig
index ec80e35..4a02cd3 100644
--- a/drivers/memory/Kconfig
+++ b/drivers/memory/Kconfig
@@ -146,3 +146,8 @@  source "drivers/memory/samsung/Kconfig"
 source "drivers/memory/tegra/Kconfig"
 
 endif
+
+config P2PMEM
+	bool "Peer 2 Peer Memory Device Support"
+	help
+	  This driver is for peer 2 peer memory device managers.
diff --git a/drivers/memory/Makefile b/drivers/memory/Makefile
index e88097fb..260bfe9 100644
--- a/drivers/memory/Makefile
+++ b/drivers/memory/Makefile
@@ -21,3 +21,5 @@  obj-$(CONFIG_DA8XX_DDRCTL)	+= da8xx-ddrctl.o
 
 obj-$(CONFIG_SAMSUNG_MC)	+= samsung/
 obj-$(CONFIG_TEGRA_MC)		+= tegra/
+
+obj-$(CONFIG_P2PMEM)        += p2pmem.o
diff --git a/drivers/memory/p2pmem.c b/drivers/memory/p2pmem.c
new file mode 100644
index 0000000..c4ea311
--- /dev/null
+++ b/drivers/memory/p2pmem.c
@@ -0,0 +1,403 @@ 
+/*
+ * Peer 2 Peer Memory Device
+ * Copyright (c) 2016, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <linux/p2pmem.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+
+MODULE_DESCRIPTION("Peer 2 Peer Memory Device");
+MODULE_VERSION("0.1");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Microsemi Corporation");
+
+static struct class *p2pmem_class;
+static DEFINE_IDA(p2pmem_ida);
+
+static struct p2pmem_dev *to_p2pmem(struct device *dev)
+{
+	return container_of(dev, struct p2pmem_dev, dev);
+}
+
+static void p2pmem_percpu_release(struct percpu_ref *ref)
+{
+	struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref);
+
+	complete_all(&p->cmp);
+}
+
+static void p2pmem_percpu_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	percpu_ref_exit(ref);
+}
+
+static void p2pmem_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct p2pmem_dev *p = container_of(ref, struct p2pmem_dev, ref);
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+	wait_for_completion(&p->cmp);
+}
+
+static void p2pmem_release(struct device *dev)
+{
+	struct p2pmem_dev *p = to_p2pmem(dev);
+
+	if (p->pool)
+		gen_pool_destroy(p->pool);
+
+	kfree(p);
+}
+
+/**
+ * p2pmem_create() - create a new p2pmem device
+ * @parent: the parent device to create it under
+ *
+ * Return value is a pointer to the new device or an ERR_PTR
+ * on failure.
+ */
+struct p2pmem_dev *p2pmem_create(struct device *parent)
+{
+	struct p2pmem_dev *p;
+	int nid = dev_to_node(parent);
+	int rc;
+
+	p = kzalloc_node(sizeof(*p), GFP_KERNEL, nid);
+	if (!p)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&p->cmp);
+	device_initialize(&p->dev);
+	p->dev.class = p2pmem_class;
+	p->dev.parent = parent;
+	p->dev.release = p2pmem_release;
+
+	p->id = ida_simple_get(&p2pmem_ida, 0, 0, GFP_KERNEL);
+	if (p->id < 0) {
+		rc = p->id;
+		goto err_free;
+	}
+
+	dev_set_name(&p->dev, "p2pmem%d", p->id);
+
+	p->pool = gen_pool_create(PAGE_SHIFT, nid);
+	if (!p->pool) {
+		rc = -ENOMEM;
+		goto err_id;
+	}
+
+	rc = percpu_ref_init(&p->ref, p2pmem_percpu_release, 0,
+			     GFP_KERNEL);
+	if (rc)
+		goto err_id;
+
+	rc = devm_add_action_or_reset(&p->dev, p2pmem_percpu_exit, &p->ref);
+	if (rc)
+		goto err_id;
+
+	rc = device_add(&p->dev);
+	if (rc)
+		goto err_id;
+
+	dev_info(&p->dev, "registered\n");
+
+	return p;
+
+err_id:
+	ida_simple_remove(&p2pmem_ida, p->id);
+err_free:
+	put_device(&p->dev);
+	return ERR_PTR(rc);
+}
+EXPORT_SYMBOL(p2pmem_create);
+
+/**
+ * p2pmem_unregister() - unregister a p2pmem device
+ * @p: the device to unregister
+ *
+ * The device will remain until all users are done with it
+ */
+void p2pmem_unregister(struct p2pmem_dev *p)
+{
+	if (!p)
+		return;
+
+	dev_info(&p->dev, "unregistered\n");
+	device_del(&p->dev);
+	ida_simple_remove(&p2pmem_ida, p->id);
+	put_device(&p->dev);
+}
+EXPORT_SYMBOL(p2pmem_unregister);
+
+/**
+ * p2pmem_add_resource() - add memory for use as p2pmem to the device
+ * @p: the device to add the memory to
+ * @res: resource describing the memory
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int p2pmem_add_resource(struct p2pmem_dev *p, struct resource *res)
+{
+	int rc;
+	void *addr;
+	int nid = dev_to_node(&p->dev);
+
+	addr = devm_memremap_pages(&p->dev, res, &p->ref, NULL);
+	if (IS_ERR(addr))
+		return PTR_ERR(addr);
+
+	rc = gen_pool_add_virt(p->pool, (unsigned long)addr,
+			       res->start, resource_size(res), nid);
+	if (rc)
+		return rc;
+
+	rc = devm_add_action_or_reset(&p->dev, p2pmem_percpu_kill, &p->ref);
+	if (rc)
+		return rc;
+
+	dev_info(&p->dev, "added %pR\n", res);
+
+	return 0;
+}
+EXPORT_SYMBOL(p2pmem_add_resource);
+
+struct pci_region {
+	struct pci_dev *pdev;
+	int bar;
+};
+
+static void p2pmem_release_pci_region(void *data)
+{
+	struct pci_region *r = data;
+
+	pci_release_region(r->pdev, r->bar);
+	kfree(r);
+}
+
+/**
+ * p2pmem_add_pci_region() - request and add an entire PCI region to the
+ *	specified p2pmem device
+ * @p: the device to add the memory to
+ * @pdev: pci device to register the bar from
+ * @bar: the bar number to add
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any dma request.
+ */
+int p2pmem_add_pci_region(struct p2pmem_dev *p, struct pci_dev *pdev, int bar)
+{
+	int rc;
+	struct pci_region *r;
+
+	r = kzalloc(sizeof(*r), GFP_KERNEL);
+	if (!r)
+		return -ENOMEM;
+
+	r->pdev = pdev;
+	r->bar = bar;
+
+	rc = pci_request_region(pdev, bar, dev_name(&p->dev));
+	if (rc < 0)
+		goto err_pci;
+
+	rc = p2pmem_add_resource(p, &pdev->resource[bar]);
+	if (rc < 0)
+		goto err_add;
+
+	rc = devm_add_action_or_reset(&p->dev, p2pmem_release_pci_region, r);
+	if (rc)
+		return rc;
+
+	return 0;
+
+err_add:
+	pci_release_region(pdev, bar);
+err_pci:
+	kfree(r);
+	return rc;
+}
+EXPORT_SYMBOL(p2pmem_add_pci_region);
+
+/**
+ * p2pmem_alloc() - allocate some p2p memory
+ * @p: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error
+ */
+void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
+{
+	return (void *)gen_pool_alloc(p->pool, size);
+}
+EXPORT_SYMBOL(p2pmem_alloc);
+
+/**
+ * p2pmem_free() - free allocated p2p memory
+ * @p: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size)
+{
+	gen_pool_free(p->pool, (unsigned long)addr, size);
+}
+EXPORT_SYMBOL(p2pmem_free);
+
+static struct device *find_parent_pci_dev(struct device *dev)
+{
+	while (dev) {
+		if (dev_is_pci(dev))
+			return dev;
+
+		dev = dev->parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge:
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_switch_port(struct device *dev)
+{
+	struct device *dpci;
+	struct pci_dev *pci;
+
+	dpci = find_parent_pci_dev(dev);
+	if (!dpci)
+		return NULL;
+
+	pci = pci_upstream_bridge(to_pci_dev(dpci));
+	if (!pci)
+		return NULL;
+
+	return pci_upstream_bridge(pci);
+}
+
+static int upstream_bridges_match(struct device *p2pmem,
+				  const void *data)
+{
+	struct device * const *dma_devices = data;
+	struct pci_dev *p2p_up;
+	struct pci_dev *dma_up;
+
+	p2p_up = get_upstream_switch_port(p2pmem);
+	if (!p2p_up) {
+		dev_warn(p2pmem, "p2pmem is not behind a pci switch\n");
+		return false;
+	}
+
+	while (*dma_devices) {
+		dma_up = get_upstream_switch_port(*dma_devices);
+
+		if (!dma_up) {
+			dev_dbg(p2pmem, "%s is not a pci device behind a switch\n",
+				dev_name(*dma_devices));
+			return false;
+		}
+
+		if (p2p_up != dma_up) {
+			dev_dbg(p2pmem,
+				"%s does not reside on the same upstream bridge\n",
+				dev_name(*dma_devices));
+			return false;
+		}
+
+		dev_dbg(p2pmem, "%s is compatible\n", dev_name(*dma_devices));
+		dma_devices++;
+	}
+
+	return true;
+}
+
+/**
+ * p2pmem_find_compat() - find a p2pmem device compatible with the
+ *	specified devices
+ * @dma_devices: a NULL-terminated array of device pointers which
+ *	all must be compatible with the returned p2pmem device
+ *
+ * For now, we only support cases where all the devices that
+ * will transfer to the p2pmem device are on the same switch.
+ * This excludes some topologies that may work, but it is the
+ * safest default for the user. We also do not presently support
+ * devices that sit behind multiple levels of switches, even
+ * though this would likely work fine.
+ *
+ * Future work could whitelist root ports that are known to be
+ * good and support multiple levels of switches. Additionally,
+ * it would make sense to choose the topologically closest p2pmem
+ * for a given setup. (Presently we only return the first that matches.)
+ *
+ * Returns a pointer to the p2pmem device with the reference taken
+ * (use p2pmem_put to return the reference) or NULL if no compatible
+ * p2pmem device is found.
+ */
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices)
+{
+	struct device *dev;
+
+	dev = class_find_device(p2pmem_class, NULL, dma_devices,
+				upstream_bridges_match);
+
+	if (!dev)
+		return NULL;
+
+	return to_p2pmem(dev);
+}
+EXPORT_SYMBOL(p2pmem_find_compat);
+
+/**
+ * p2pmem_put() - decrement a p2pmem device reference
+ * @p: p2pmem device to return
+ *
+ * Drop a reference to the device, freeing it if this was the last one.
+ * It's safe to pass a NULL pointer to this function.
+ */
+void p2pmem_put(struct p2pmem_dev *p)
+{
+	if (p)
+		put_device(&p->dev);
+}
+EXPORT_SYMBOL(p2pmem_put);
+
+static int __init p2pmem_init(void)
+{
+	p2pmem_class = class_create(THIS_MODULE, "p2pmem");
+	if (IS_ERR(p2pmem_class))
+		return PTR_ERR(p2pmem_class);
+
+	return 0;
+}
+module_init(p2pmem_init);
+
+static void __exit p2pmem_exit(void)
+{
+	class_destroy(p2pmem_class);
+
+	pr_info(KBUILD_MODNAME ": unloaded.\n");
+}
+module_exit(p2pmem_exit);
diff --git a/include/linux/p2pmem.h b/include/linux/p2pmem.h
new file mode 100644
index 0000000..71dc1e1
--- /dev/null
+++ b/include/linux/p2pmem.h
@@ -0,0 +1,103 @@ 
+/*
+ * Peer 2 Peer Memory Device
+ * Copyright (c) 2016, Microsemi Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#ifndef __P2PMEM_H__
+#define __P2PMEM_H__
+
+#include <linux/device.h>
+#include <linux/pci.h>
+
+struct p2pmem_dev {
+	struct device dev;
+	int id;
+
+	struct percpu_ref ref;
+	struct completion cmp;
+	struct gen_pool *pool;
+};
+
+#ifdef CONFIG_P2PMEM
+
+struct p2pmem_dev *p2pmem_create(struct device *parent);
+void p2pmem_unregister(struct p2pmem_dev *p);
+
+int p2pmem_add_resource(struct p2pmem_dev *p, struct resource *res);
+int p2pmem_add_pci_region(struct p2pmem_dev *p, struct pci_dev *pdev, int bar);
+
+void *p2pmem_alloc(struct p2pmem_dev *p, size_t size);
+void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size);
+
+struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devices);
+void p2pmem_put(struct p2pmem_dev *p);
+
+#else
+
+static inline struct p2pmem_dev *p2pmem_create(struct device *parent)
+{
+	return NULL;
+}
+
+static inline void p2pmem_unregister(struct p2pmem_dev *p)
+{
+}
+
+static inline int p2pmem_add_resource(struct p2pmem_dev *p,
+				      struct resource *res)
+{
+	return -ENODEV;
+}
+
+static inline int p2pmem_add_pci_region(struct p2pmem_dev *p,
+					struct pci_dev *pdev, int bar)
+{
+	return -ENODEV;
+}
+
+static inline void *p2pmem_alloc(struct p2pmem_dev *p, size_t size)
+{
+	return NULL;
+}
+
+static inline void p2pmem_free(struct p2pmem_dev *p, void *addr, size_t size)
+{
+}
+
+static inline struct p2pmem_dev *p2pmem_find_compat(struct device **dma_devs)
+{
+	return NULL;
+}
+
+static inline void p2pmem_put(struct p2pmem_dev *p)
+{
+}
+
+#endif
+
+static inline struct page *p2pmem_alloc_page(struct p2pmem_dev *p)
+{
+	void *addr = p2pmem_alloc(p, PAGE_SIZE);
+
+	if (addr)
+		return virt_to_page(addr);
+
+	return NULL;
+}
+
+static inline void p2pmem_free_page(struct p2pmem_dev *p, struct page *pg)
+{
+	p2pmem_free(p, page_to_virt(pg), PAGE_SIZE);
+}
+
+#endif
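
For anyone who wants to reason about the same-switch rule in
upstream_bridges_match() without a kernel tree at hand, here is a minimal
userspace sketch of that check. It is not part of the patch. struct mock_dev,
mock_upstream_switch_port() and mock_bridges_match() are hypothetical names
made up for illustration; the real code walks struct pci_dev with
pci_upstream_bridge(). The only assumption is that each device has at most
one upstream bridge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: only the upstream link is modeled. */
struct mock_dev {
	const char *name;
	struct mock_dev *upstream;	/* bridge directly above, NULL at the top */
};

/*
 * Models get_upstream_switch_port(): hop once to the port the device
 * hangs off, then once more to the upstream port of the switch.
 */
static struct mock_dev *mock_upstream_switch_port(struct mock_dev *dev)
{
	struct mock_dev *bridge;

	if (!dev)
		return NULL;

	bridge = dev->upstream;
	if (!bridge)
		return NULL;

	return bridge->upstream;
}

/*
 * Models upstream_bridges_match(): every entry of the NULL-terminated
 * array must share the p2pmem device's upstream switch port.
 */
static bool mock_bridges_match(struct mock_dev *p2pmem,
			       struct mock_dev **dma_devices)
{
	struct mock_dev *p2p_up = mock_upstream_switch_port(p2pmem);

	assert(dma_devices);

	if (!p2p_up)
		return false;

	for (; *dma_devices; dma_devices++) {
		/* A NULL result (no switch above) never equals p2p_up. */
		if (mock_upstream_switch_port(*dma_devices) != p2p_up)
			return false;
	}

	return true;
}
```

In this model a device attached directly to a root port has a NULL second
hop, so it is rejected, which matches the comment above
get_upstream_switch_port(). An empty device array matches any p2pmem device
that sits behind a switch.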