Message ID: 20180312193525.2855-2-logang@deltatee.com (mailing list archive)
State: New, archived
On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> +int pci_p2pdma_add_client(struct list_head *head, struct device *dev)

It feels like this code tried to be a generic p2pdma provider first, then got converted to PCI, yet all the dev parameters are still struct device. Maybe the dev parameter should also be struct pci_dev, so that you can get rid of all the to_pci_dev() calls in this code, including the find_parent_pci_dev() function.

Regarding the switch business, it is amazing how much trouble you went to to limit this functionality to very specific hardware.

I thought we had reached an agreement that the code would not impose any limits on what the user wants.

What happened to all the emails we exchanged?

I understand you are coming from what you have tested. Considering that you are signing up to provide generic PCI functionality in the kernel, why don't you just blacklist the products that you had problems with and still allow other architectures to use your code with their root ports?
On 12/03/18 09:28 PM, Sinan Kaya wrote:
> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> Regarding the switch business, it is amazing how much trouble you went to to limit this functionality to very specific hardware.
>
> I thought we had reached an agreement that the code would not impose any limits on what the user wants.
>
> What happened to all the emails we exchanged?

It turns out that root ports that support P2P are far less common than anyone thought. So it will likely have to be a white list. Nobody else seems keen on allowing the user to enable this on hardware that doesn't work. The easiest solution is still limiting it to using a switch. From there, if someone wants to start creating a white list, then that's probably the way forward to support root ports.

And there's also the ACS problem, which means that if you want to use P2P on the root ports you'll have to disable ACS on the entire system. (Or, preferably, the IOMMU groups need to get more sophisticated to allow for dynamic changes.)

Additionally, once you allow for root ports you may find the IOMMU getting in the way.

So there are a great deal more issues to sort out if you don't restrict to devices behind switches.

Logan
On 3/13/2018 12:43 PM, Logan Gunthorpe wrote:
> On 12/03/18 09:28 PM, Sinan Kaya wrote:
>> Regarding the switch business, it is amazing how much trouble you went to to limit this functionality to very specific hardware.
>>
>> I thought we had reached an agreement that the code would not impose any limits on what the user wants.
>>
>> What happened to all the emails we exchanged?
>
> It turns out that root ports that support P2P are far less common than anyone thought. So it will likely have to be a white list. Nobody else seems keen on allowing the user to enable this on hardware that doesn't work. The easiest solution is still limiting it to using a switch. From there, if someone wants to start creating a white list, then that's probably the way forward to support root ports.

I thought only a few root ports had the problem. Thanks for clarifying that.

> And there's also the ACS problem, which means that if you want to use P2P on the root ports you'll have to disable ACS on the entire system. (Or, preferably, the IOMMU groups need to get more sophisticated to allow for dynamic changes.)

Do you think you can keep a pointer to the parent bridge instead of querying it via get_upstream_bridge_port() here, so that we can reuse your pci_p2pdma_disable_acs() in the future?

+int pci_p2pdma_disable_acs(struct pci_dev *pdev)
+{
+	struct pci_dev *up;
+	int pos;
+	u16 ctrl;
+
+	up = get_upstream_bridge_port(pdev);
+	if (!up)
+		return 0;

Some future device would then only deal with pci_p2pdma_add_client() for whitelisting instead of changing all of your code. We should at least limit the usage of the get_upstream_bridge_port() family of functions to probe time only.

> Additionally, once you allow for root ports you may find the IOMMU getting in the way.

We can tell the IOMMU to do one-to-one translation by passing iommu.passthrough=1 on the kernel command line, to get behavior identical to your switch case.

> So there are a great deal more issues to sort out if you don't restrict to devices behind switches.
>
> Logan
On 12/03/18 09:28 PM, Sinan Kaya wrote:
> Maybe the dev parameter should also be struct pci_dev, so that you can get rid of all the to_pci_dev() calls in this code, including the find_parent_pci_dev() function.

No, this was mentioned in v2. find_parent_pci_dev() is necessary because the calling drivers aren't likely to have a bunch of struct pci_devs lying around for all the devices they are working with. It's much nicer from an API standpoint to take struct devs, and not worth it just to have a PCI API that only takes struct pci_devs.

Logan
On 13/03/18 11:49 AM, Sinan Kaya wrote:
>> And there's also the ACS problem, which means that if you want to use P2P on the root ports you'll have to disable ACS on the entire system. (Or, preferably, the IOMMU groups need to get more sophisticated to allow for dynamic changes.)
>
> Do you think you can keep a pointer to the parent bridge instead of querying it via get_upstream_bridge_port() here, so that we can reuse your pci_p2pdma_disable_acs() in the future?

Keep a pointer where? pci_p2pdma_disable_acs() and pci_p2pdma_add_client() are used in completely different cases on completely different devices. There really is no overlap and no obvious place to store the port pointer (except in the struct pci_dev itself, in which case why not just call the function again).

> +int pci_p2pdma_disable_acs(struct pci_dev *pdev)
> +{
> +	struct pci_dev *up;
> +	int pos;
> +	u16 ctrl;
> +
> +	up = get_upstream_bridge_port(pdev);
> +	if (!up)
> +		return 0;
>
> Some future device would then only deal with pci_p2pdma_add_client() for whitelisting instead of changing all of your code.

That's a problem for whoever writes the future code.

> We should at least limit the usage of the get_upstream_bridge_port() family of functions to probe time only.

Why?

> We can tell the IOMMU to do one-to-one translation by passing iommu.passthrough=1 on the kernel command line, to get behavior identical to your switch case.

Well, someone will need to write code for all available IOMMUs to support this. That's a very big project.

Logan
On 3/13/2018 2:44 PM, Logan Gunthorpe wrote:
>> Do you think you can keep a pointer to the parent bridge instead of querying it via get_upstream_bridge_port() here, so that we can reuse your pci_p2pdma_disable_acs() in the future?
>
> Keep a pointer where? pci_p2pdma_disable_acs() and pci_p2pdma_add_client() are used in completely different cases on completely different devices. There really is no overlap and no obvious place to store the port pointer (except in the struct pci_dev itself, in which case why not just call the function again).

I was thinking of this for the pci_p2pdma_add_client() case, for the parent pointer:

+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *provider;
+};

You are right, the second piece of code seems to be there to disable ACS altogether when this kernel configuration is selected. It doesn't have any relationship to the client code.

But then, why bother searching for the switch at all? Even if the switch is there, there is no guarantee that it is currently being used for P2P.

It seems that we are going with the assumption that enabling this config option implies you want P2P; in that case we can simplify this code as well.
On 13/03/18 01:10 PM, Sinan Kaya wrote:
> I was thinking of this for the pci_p2pdma_add_client() case, for the parent pointer:
>
> +struct pci_p2pdma_client {
> +	struct list_head list;
> +	struct pci_dev *client;
> +	struct pci_dev *provider;
> +};

Yeah, that structure only exists in a list owned by the client, and we only check the upstream bridge once per entry, so I don't see the point.

> But then, why bother searching for the switch at all?

Huh? We have to make sure all the client and provider devices are behind the same switch. How can we do that without "searching" for the switch?

In the ACS case, we only disable ACS on downstream ports of switches. No sense disabling it globally, as that's worse from an isolation point of view and not worth it given we require all P2P transactions to be behind a switch.

> Even if the switch is there, there is no guarantee that it is currently being used for P2P.

IOMMU groups are set at boot time and, at present, there's no way to dynamically change ACS bits without messing up the groups. So switches not used for P2P will not have ACS enabled when CONFIG_PCI_P2PDMA is set, and I don't know of any good solution to that. Please see the ACS discussions in v1 and v2.

> It seems that we are going with the assumption that enabling this config option implies you want P2P; in that case we can simplify this code as well.

How so?

Logan
On 3/13/2018 3:19 PM, Logan Gunthorpe wrote:
> On 13/03/18 01:10 PM, Sinan Kaya wrote:
>> I was thinking of this for the pci_p2pdma_add_client() case, for the parent pointer:
>>
>> +struct pci_p2pdma_client {
>> +	struct list_head list;
>> +	struct pci_dev *client;
>> +	struct pci_dev *provider;
>> +};
>
> Yeah, that structure only exists in a list owned by the client, and we only check the upstream bridge once per entry, so I don't see the point.
>
>> But then, why bother searching for the switch at all?
>
> Huh? We have to make sure all the client and provider devices are behind the same switch. How can we do that without "searching" for the switch?

Sorry, I was thinking of the ACS case you described below. The only thing the code cares about at this moment is whether the device is behind a switch or not.

> In the ACS case, we only disable ACS on downstream ports of switches. No sense disabling it globally, as that's worse from an isolation point of view and not worth it given we require all P2P transactions to be behind a switch.

I agree disabling globally would be bad. But somebody can always say: I have ten switches on my system, and I want to do peer-to-peer on one switch only. Now this change has weakened security for the other switches that I had no intention of doing P2P with. Isn't this a problem?

Can we specify the BDF of the downstream device we want P2P with during boot, via the kernel command line?

>> Even if the switch is there, there is no guarantee that it is currently being used for P2P.
>
> IOMMU groups are set at boot time and, at present, there's no way to dynamically change ACS bits without messing up the groups. So switches not used for P2P will not have ACS enabled when CONFIG_PCI_P2PDMA is set, and I don't know of any good solution to that. Please see the ACS discussions in v1 and v2.

Given the implementation limitations, this might be OK as a short-term solution. It depends on whether Alex is comfortable with it.

>> It seems that we are going with the assumption that enabling this config option implies you want P2P; in that case we can simplify this code as well.
>
> How so?
>
> Logan
On 13/03/18 01:53 PM, Sinan Kaya wrote:
> I agree disabling globally would be bad. But somebody can always say: I have ten switches on my system, and I want to do peer-to-peer on one switch only. Now this change has weakened security for the other switches that I had no intention of doing P2P with.
>
> Isn't this a problem?

Well, if it's a problem for someone, they'll have to solve it. We're targeting JBOFs that have no use for ACS / IOMMU groups at all.

> Can we specify the BDF of the downstream device we want P2P with during boot, via the kernel command line?

That's a painful configuration burden. And then things might stop working if you change your topology at all and have to change boot parameters.

Logan
On 3/13/2018 4:46 PM, Logan Gunthorpe wrote:
> On 13/03/18 01:53 PM, Sinan Kaya wrote:
>> I agree disabling globally would be bad. But somebody can always say: I have ten switches on my system, and I want to do peer-to-peer on one switch only. Now this change has weakened security for the other switches that I had no intention of doing P2P with.
>>
>> Isn't this a problem?
>
> Well, if it's a problem for someone, they'll have to solve it. We're targeting JBOFs that have no use for ACS / IOMMU groups at all.
>
>> Can we specify the BDF of the downstream device we want P2P with during boot, via the kernel command line?
>
> That's a painful configuration burden. And then things might stop working if you change your topology at all and have to change boot parameters.

It sounds like you have very tight hardware expectations for this to work at this moment. You also don't want to generalize this code for others and address the shortcomings.

To get you going, you should limit this change to the switch products that you have validated, via white-listing of PCI vendor/device IDs. Please do not enable this feature for all other PCI devices or by default. I think your code qualifies as a virus until this issue is resolved (so NAK).

Another option is for your CONFIG to depend on BROKEN/EXPERT.

You are delivering general-purpose P2P code with a lot of holes in it and expecting people to jump through them.

Turning security off by default is also not acceptable. Linux requires ACS support even though you don't care about it for your particular application. I'd hate for ACS to be broken because some operating system enabled your CONFIG option.
On 13/03/18 03:22 PM, Sinan Kaya wrote:
> It sounds like you have very tight hardware expectations for this to work at this moment. You also don't want to generalize this code for others and address the shortcomings.

No, that's the way the community has pushed this work. Our original work was very general, and we were told it was unacceptable to put the onus on the user and have things break if the hardware doesn't support it. I think that's a reasonable requirement. So the hardware use cases were whittled down to the ones we can be confident about and support with reasonable changes to the kernel today.

> To get you going, you should limit this change to the switch products that you have validated, via white-listing of PCI vendor/device IDs. Please do not enable this feature for all other PCI devices or by default.

This doesn't seem necessary. We know that all PCIe switches available today support P2P, and we are highly confident that any switch that will ever be produced will support P2P. As long as devices are behind a switch, you don't need any white-listing. This is what the current patch set implements. If you want to start including root ports, then you will need a white list (and to solve all the other problems I mentioned earlier).

> I think your code qualifies as a virus until this issue is resolved (so NAK).

That seems a bit hyperbolic... "a virus"??!... please keep things constructive.

> You are delivering general-purpose P2P code with a lot of holes in it and expecting people to jump through them.

No, the code prevents users from screwing it up. It just requires a switch in the hardware, which is hardly a high bar to jump through (provided you are putting some design thought into your hardware). And given that it requires semi-custom hardware today, it's not something that needs to be on by default in any distribution kernel.

> Turning security off by default is also not acceptable. Linux requires ACS support even though you don't care about it for your particular application.

That's not true. Linux does not require ACS support. In fact, it's already off by default until you specifically turn on the IOMMU (which is not always the most obvious thing to enable). And the only thing it really supports is enforcing isolation between VMs that are using different pieces of hardware in the system.

> I'd hate for ACS to be broken because some operating system enabled your CONFIG option.

ACS isn't "broken" by enabling the config option. It just makes the IOMMU groups and isolation less granular (i.e., devices behind a switch will be in the same group and not isolated from each other).

Logan
On 3/13/2018 6:00 PM, Logan Gunthorpe wrote:
> On 13/03/18 03:22 PM, Sinan Kaya wrote:
>> It sounds like you have very tight hardware expectations for this to work at this moment. You also don't want to generalize this code for others and address the shortcomings.
>
> No, that's the way the community has pushed this work. Our original work was very general, and we were told it was unacceptable to put the onus on the user and have things break if the hardware doesn't support it. I think that's a reasonable requirement. So the hardware use cases were whittled down to the ones we can be confident about and support with reasonable changes to the kernel today.

If hardware doesn't support it, blacklisting would have been the right path, and I still think that you should remove all the switch business from the code. I have not heard enough justification for having a switch requirement for P2P.

You are also saying that root ports have issues not because of functionality but because of performance. If you know what is bad, create a list and keep adding to it. You can't assume universally that all root ports are broken or have bad performance.

> This doesn't seem necessary. We know that all PCIe switches available today support P2P, and we are highly confident that any switch that will ever be produced will support P2P. As long as devices are behind a switch, you don't need any white-listing. This is what the current patch set implements. If you want to start including root ports, then you will need a white list (and to solve all the other problems I mentioned earlier).

What if I come up with a very cheap/crappy switch (something like those used in data mining)? What guarantees that P2P will work with this device? You are making an assumption here that all switches have good performance. How is that any different from good switch vs. bad switch and good root port vs. bad root port? If it is universally broken, why don't you list the things that work?

> That seems a bit hyperbolic... "a virus"??!... please keep things constructive.

Sorry, that was harsh. I'm taking the word "virus" back. I apologize. But I hold onto my NAK opinion.

I have been doing my best to provide feedback. To be honest, it feels like you are throwing it over the wall. You keep implying "not my problem".

>> Isn't this a problem?
>
> Well, if it's a problem for someone, they'll have to solve it. We're targeting JBOFs that have no use for ACS / IOMMU groups at all.

IMO, you (not somebody) should address this one way or the other before this series lands upstream.

> That's not true. Linux does not require ACS support. In fact, it's already off by default until you specifically turn on the IOMMU (which is not always the most obvious thing to enable). And the only thing it really supports is enforcing isolation between VMs that are using different pieces of hardware in the system.

Another assumption: there are other architectures, like arm64, where the IOMMU is enabled by default for security reasons even if you don't use VMs. The IOMMU blocks stray transactions.

> ACS isn't "broken" by enabling the config option. It just makes the IOMMU groups and isolation less granular (i.e., devices behind a switch will be in the same group and not isolated from each other).

Didn't the ACS behavior change suddenly, for no good reason, when we enabled your code, even though I might not be using P2P but happen to have a kernel with the P2P config option?
>> It sounds like you have very tight hardware expectations for this to work at this moment. You also don't want to generalize this code for others and address the shortcomings.
>
> No, that's the way the community has pushed this work.

Hi Sinan,

Thanks for all the input. As Logan has pointed out, the switch requirement is something that has evolved over time based on input from the community. You are more than welcome to have an opinion on this (and you have made that opinion clear ;-)). Over time the patchset may evolve from its current requirements, but right now we are aligned with the feedback from the community.

Cheers

Stephen
Hi Sinan,

> If hardware doesn't support it, blacklisting would have been the right path, and I still think that you should remove all the switch business from the code. I have not heard enough justification for having a switch requirement for P2P.

We disagree, as does the community, based on feedback on previous versions of this patch set.

> You are also saying that root ports have issues not because of functionality but because of performance.

I have seen both. Some systems fail entirely to route TLPs across root ports. Others do route them, but with poor performance. Having a requirement for a switch is therefore a reasonable and sensible minimum criterion for enabling P2P.

> If you know what is bad, create a list and keep adding to it. You can't assume universally that all root ports are broken or have bad performance.

I actually wanted to do a blacklist originally, but that idea was not accepted by the community.

> What if I come up with a very cheap/crappy switch (something like those used in data mining)? What guarantees that P2P will work with this device? You are making an assumption here that all switches have good performance.

A switch must support P2P or it is in violation of the PCIe specification.

> I have been doing my best to provide feedback. It feels like you are throwing it over the wall.

I think one issue, Sinan, is that a lot of your ideas have been discussed in previous versions of the series. So you are asking us to make changes which, in some cases, we know are not acceptable to others in the community. The input is welcome, but it flies in the face of other input we have received. However, if other developers feel strongly about the blacklist/whitelist and switch criteria, please speak up! ;-)

Stephen
On 13/03/18 04:29 PM, Sinan Kaya wrote:
> If hardware doesn't support it, blacklisting would have been the right path, and I still think that you should remove all the switch business from the code. I have not heard enough justification for having a switch requirement for P2P.

I disagree.

> You are also saying that root ports have issues not because of functionality but because of performance.

No... performance can also be an issue, but the main justification for this is that many root ports do not support P2P at all and can fail in different ways (usually by just dropping the TLPs).

> What if I come up with a very cheap/crappy switch (something like those used in data mining)?

Good luck. That's not how hardware is designed. PCIe switches that have any hope of competing in the existing market will need features like NTB, non-transparent ports, etc., and that implies a certain symmetry (i.e., there isn't a special host port, because there may be more than one and it may move around), which implies that packets will always be able to forward between ports, which implies P2P will work.

> I have been doing my best to provide feedback. It feels like you are throwing it over the wall.
>
> You keep implying "not my problem".

Well, the fact of the matter is that extending this in all the ways people like you want faces huge problems on all sides. These are not trivial issues, and holding back work that works for our problem because it doesn't solve your problem is, IMO, just going to grind development in this area to a halt. We have to have something we can agree on which is safe to start building on. The building can't just emerge fully formed in one go.

P2P proposals go back a long time and have never gotten anywhere, because there are limitations and people want them to do things that are hard but don't want to contribute the work to solving those problems.

>> Well, if it's a problem for someone, they'll have to solve it. We're targeting JBOFs that have no use for ACS / IOMMU groups at all.
>
> IMO, you (not somebody) should address this one way or the other before this series lands upstream.

The real way to address this (as I've mentioned before) is with some way of doing ACS and IOMMU groups dynamically. But this is a huge project in itself and is never going to be part of the P2P patchset.

> Another assumption: there are other architectures, like arm64, where the IOMMU is enabled by default for security reasons even if you don't use VMs. The IOMMU blocks stray transactions.

True, but it doesn't change my point: ACS is not a requirement for Linux; many, many systems do not have it on at all or by default.

> Didn't the ACS behavior change suddenly, for no good reason, when we enabled your code, even though I might not be using P2P but happen to have a kernel with the P2P config option?

Well, no; presumably you made a conscious choice to turn the config option on and build a custom kernel for your box. That doesn't seem very sudden, and the reason is that the two concepts are very much at odds with each other: you can't have isolation and still have transactions between devices.

Logan
On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen Bates wrote:
> Thanks for all the input. As Logan has pointed out, the switch requirement is something that has evolved over time based on input from the community. You are more than welcome to have an opinion on this (and you have made that opinion clear ;-)). Over time the patchset may evolve from its current requirements, but right now we are aligned with the feedback from the community.

This part of the community hasn't been convinced of the need to have two bridges, e.g., both an Upstream Port and a Downstream Port, or two conventional PCI bridges, above the peers.

Every PCI-to-PCI bridge is required to support routing transactions between devices on its secondary side. Therefore, I think it is sufficient to verify that the potential peers share a single common upstream bridge. This could be a conventional PCI bridge, a Switch Downstream Port, or a Root Port.

I've seen the response that peers directly below a Root Port could not DMA to each other through the Root Port because of the "route to self" issue, and I'm not disputing that. But enforcing a requirement for two upstream bridges introduces a weird restriction on conventional PCI topologies, makes the code hard to read, and I don't think it's necessary.

If it *is* necessary because Root Ports and devices below them behave differently than in conventional PCI, I think you should include a reference to the relevant section of the spec and check directly for a Root Port. I would prefer that over trying to exclude Root Ports by looking for two upstream bridges.

Bjorn
On 3/13/2018 6:48 PM, Logan Gunthorpe wrote: > > > On 13/03/18 04:29 PM, Sinan Kaya wrote: >> If hardware doesn't support it, blacklisting should have been the right >> path and I still think that you should remove all switch business from the code. >> I did not hear enough justification for having a switch requirement >> for P2P. > > I disagree. > >> You are also saying that root ports have issues not because of functionality but >> because of performance. > > No... performance can also be an issue but the main justification for > this is that many root ports do not support P2P at all and can fail in > different ways (usually just dropping the TLPs). > >> What if I come up with a very cheap/crappy switch (something like used in data >> mining)? > > Good luck. That's not how hardware is designed. PCIe switches that have > any hope to compete with the existing market will need features like > NTB, non-transparent ports, etc... and that implies a certain symmetry > (ie there isn't a special host port because there may be more than one > and it may move around) which implies that packets will always be able > to forward between each ports which implies P2P will work. > It is still a switch it can move packets but, maybe it can move data at 100kbps speed. What prevents that switch from trying P2P and having a bad user experience? If everything is so broken, I was suggesting to at least list the switches you have tested. What's the problem with this? Why do you want to assume that all switches are good and all root ports are bad? >> I have been doing my best to provide feedback. It feels like you are throwing >> them over the wall to be honest. >> >> You keep implying "not my problem". > > Well, the fact of the matter is that extending this in all the ways > people like you want face huge problems on all sides. 
These are not > trivial issues and holding back work that works for our problem because > it doesn't solve your problem is IMO just going to grind development in > this area to a halt. We have to have something we can agree on which is > safe to start building on. The building can't just emerge fully formed > in one go. > What if the design is so canned that you can't change anything? I have been asking things like getting rid of switch search in ACS enablement towards achieving generic P2P. You seem to be pushing back. You said yourself P2P and isolation doesn't go together at this point but you also care about isolation for other devices that are not doing P2P. Confusing... > P2P proposal go back a long time and have never gotten anywhere because > there are limitations and people want it to do things that are hard but > don't want to contribute the work to solving those problems. > >>> Well, if it's a problem for someone they'll have to solve it. We're >>> targeting JBOFs that have no use for ACS / IOMMU groups at all. >> >> IMO, you (not somebody) should address this one way or the other before this >> series land in upstream. > > The real way to address this (as I've mentioned before) is with some way > of doing ACS and iomem groups dynamically. But this is a huge project in > itself and is never going to be part of the P2P patchset. fair enough. > >> Another assumption: There are other architectures like ARM64 where IOMMU >> is enabled by default even if you don't use VMs for security reasons. >> IOMMU blocks stray transactions. > > True, but it doesn't change my point: ACS is not a requirement for Linux > many many systems do not have it on at all or by default. I don't think so. It is not a requirement for you but it is a requirement for me (ARM64 guy). Linux happens to run on multiple architectures. One exception invalidates your point. 
>> Didn't the ACS behavior change suddenly for no good reason when we enabled your code, even though I might not be using P2P but happen to have a kernel with the P2P config option?

If you are assuming that your kernel option should not be used by general distributions like Ubuntu/Red Hat etc. and requires a kernel compilation, creating a dependency on EXPERT is the right way to do it. Distributions assume that no damage is done by enabling PCI bus options under normal circumstances.

> Well no, presumably you made a conscious choice to turn the config option on and build a custom kernel for your box. That doesn't seem very sudden, and the reason is that the two concepts are very much at odds with each other: you can't have isolation and still have transactions between devices.
>
> Logan
On 13/03/18 05:08 PM, Bjorn Helgaas wrote:
> On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen Bates wrote:
> If it *is* necessary because Root Ports and devices below them behave differently than in conventional PCI, I think you should include a reference to the relevant section of the spec and check directly for a Root Port. I would prefer that over trying to exclude Root Ports by looking for two upstream bridges.

Well, we've established that we don't want to allow root ports. But we need to, at a minimum, do two pci_upstream_bridge() calls...

Remember, we need to check that a number of devices are behind the same switch. So we need to find a _common_ upstream port for several devices. Excluding the multifunction device case (which I don't think is applicable for reasons we've discussed before), this will *never* be the first upstream port for a given device. We need to find the upstream port of the switch and ensure all devices share that same port.

If a device is behind a switch, its pci_upstream_bridge() is the downstream switch port, which is unique to that device. So a peer device would have a different pci_upstream_bridge() port but share the same pci_upstream_bridge(pci_upstream_bridge()) port (assuming they are on the same switch).

The alternative is to walk the entire tree of upstream ports and check that all devices have a common port somewhere in the tree that isn't the root complex. Arguably this is the right way to do it and would support a network of switches, but it is a fair bit more complex to code.

Logan
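Logan's two-call check can be sketched in plain userspace C with a toy stand-in for struct pci_dev. The struct and helper names below are hypothetical, for illustration only; they are not the kernel API, which works on real struct pci_dev objects and takes reference counts.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct pci_dev: only the upstream link matters here.
 * 'upstream' plays the role of what pci_upstream_bridge() would return. */
struct toy_dev {
    struct toy_dev *upstream;
};

/* Two hops up: endpoint -> downstream switch port -> switch upstream port,
 * mirroring the two pci_upstream_bridge() calls described above. */
static struct toy_dev *upstream_switch_port(struct toy_dev *dev)
{
    if (!dev || !dev->upstream)
        return NULL;
    return dev->upstream->upstream;
}

/* Peers qualify only if both resolve to the same non-NULL upstream port. */
static int behind_same_switch(struct toy_dev *a, struct toy_dev *b)
{
    struct toy_dev *port = upstream_switch_port(a);

    return port != NULL && port == upstream_switch_port(b);
}

/* Build the topology from the discussion: two endpoints behind one switch,
 * plus a third endpoint directly below the root port. */
static int demo(void)
{
    struct toy_dev root_port = { NULL };
    struct toy_dev switch_up = { &root_port };
    struct toy_dev down1 = { &switch_up }, down2 = { &switch_up };
    struct toy_dev ep1 = { &down1 }, ep2 = { &down2 };
    struct toy_dev ep3 = { &root_port };

    return behind_same_switch(&ep1, &ep2) == 1 &&
           behind_same_switch(&ep1, &ep3) == 0;
}
```

Note how the endpoint directly below the root port fails the check: its grandparent bridge is NULL, so it can never share an upstream switch port with a peer.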
On 13/03/18 05:19 PM, Sinan Kaya wrote: > It is still a switch it can move packets but, maybe it can move data at > 100kbps speed. As Stephen pointed out, it's a requirement of the PCIe spec that a switch supports P2P. If you want to sell a switch that does P2P with bad performance then that's on you to deal with. > What prevents that switch from trying P2P and having a bad user experience? The fact that no one would buy such a switch as it would have a bad user experience with regular PCI transfers. A P2P TLP is not special on a switch as it's routed in just the same way as any others. There'd also be no cost gain in making such a broken-by-design device. > If everything is so broken, I was suggesting to at least list the switches > you have tested. > > What's the problem with this? More complexity for zero gain. > Why do you want to assume that all switches are good and all root ports are > bad? Because the assumption that all switches are good is more than reasonable and simplifies the code and maintenance significantly. No one wants to maintain a white list when they don't have to. > What if the design is so canned that you can't change anything? Based on the feedback we've got so far and the developers that have participated in getting it to where it is, it is not "canned". > I have been asking things like getting rid of switch search in ACS > enablement towards achieving generic P2P. You seem to be pushing back. > You said yourself P2P and isolation doesn't go together at this point > but you also care about isolation for other devices that are not doing > P2P. P2P and isolation will never be compatible at any point. They are two opposite concepts. So we could just disable isolation across the whole system and for our purposes that would be fine. But it's relatively simple to limit this and only disable it behind switches. So why wouldn't we? 
It enables use cases like having an isolated card on the root complex used in a VM while having P2P on cards behind switches. I personally have no interest in doing this but I also have no reason to prevent it with my code. > It is not a requirement for you but it is a requirement for me (ARM64 guy). > Linux happens to run on multiple architectures. One exception invalidates your > point. It is not a requirement of an architecture or people that use a specific architecture. It is a requirement of the use-case and you have not said any use case or how we could do better. If you're running VMs that require isolation you will need to be *very* careful if you also want to do P2P between cards which requires no isolation. But, based on my understanding, most people will want to do one or the other -- not both. If you want to do P2P you enable the P2P config option, if you want isolation you don't. > If you are assuming that your kernel option should not be used by general > distributions like Ubuntu/redhat etc. and requires a kernel compilation, > creating a dependency to EXPERT is the right way to do. I don't have a problem adding EXPERT as a dependency. We can do that for v4. I'd rather hope that distros actually read and understand the kconfig help text before enabling an off-by-default option. But maybe I'm asking too much. Logan
On Tue, Mar 13, 2018 at 05:21:20PM -0600, Logan Gunthorpe wrote:
> On 13/03/18 05:08 PM, Bjorn Helgaas wrote:
> > On Tue, Mar 13, 2018 at 10:31:55PM +0000, Stephen Bates wrote:
> > If it *is* necessary because Root Ports and devices below them behave differently than in conventional PCI, I think you should include a reference to the relevant section of the spec and check directly for a Root Port. I would prefer that over trying to exclude Root Ports by looking for two upstream bridges.
>
> Well we've established that we don't want to allow root ports.

I assume you want to exclude Root Ports because of multi-function devices and the "route to self" error. I was hoping for a reference to that so I could learn more about it.

While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS Functions in SR-IOV Capable and Multi-Function Devices", which seems relevant. It talks about "peer-to-peer Requests (between Functions of the device)". That says to me that multi-function devices can DMA between themselves.

> But we need to, at a minimum, do two pci_upstream_bridge() calls...
>
> Remember, we need to check that a number of devices are behind the same switch. So we need to find a _common_ upstream port for several devices.

I agree that peers need to have a common upstream bridge. I think you're saying peers need to have *two* common upstream bridges. If I understand correctly, requiring two common bridges is a way to ensure that peers directly below Root Ports don't try to DMA to each other.

So I guess the first order of business is to nail down whether peers below a Root Port are prohibited from DMAing to each other. My assumption, based on 6.12.1.2 and the fact that I haven't yet found a prohibition, is that they can.

> Excluding the multifunction device case (which I don't think is applicable for reasons we've discussed before), this will *never* be the first upstream port for a given device.
If you're looking for a common upstream bridge, you don't need to make assumptions about whether the hierarchy is conventional PCI or PCIe or how many levels are in the hierarchy. You already have upstream_bridges_match(), which takes two pci_devs. I think it should walk up the PCI hierarchy from the first device, checking whether the bridge at each level is also a parent of the second device. Bjorn
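The walk Bjorn describes can be sketched the same way with a toy hierarchy; again the struct and function below are hypothetical userspace stand-ins, not the real upstream_bridges_match() signature from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a PCI device or bridge; 'up' is its upstream bridge. */
struct node {
    struct node *up;
};

/* Walk up from 'a'; at each level, check whether that bridge is also an
 * ancestor of 'b'. Returns the lowest common upstream bridge, or NULL if
 * the two paths only meet above the hierarchy (i.e., at the root complex). */
static struct node *common_upstream_bridge(struct node *a, struct node *b)
{
    struct node *pa, *pb;

    for (pa = a->up; pa; pa = pa->up)
        for (pb = b->up; pb; pb = pb->up)
            if (pa == pb)
                return pa;
    return NULL;
}

/* Conventional-PCI-style check: two devices directly below the same host
 * bridge match at one level; devices in separate hierarchies never match. */
static int demo_walk(void)
{
    struct node host_bridge = { NULL };
    struct node dev_a = { &host_bridge }, dev_b = { &host_bridge };
    struct node other_root = { NULL };
    struct node dev_c = { &other_root };

    return common_upstream_bridge(&dev_a, &dev_b) == &host_bridge &&
           common_upstream_bridge(&dev_a, &dev_c) == NULL;
}
```

Because the walk accepts a match at any depth, it makes no assumption about how many bridge levels sit between the peers and their common ancestor, which is exactly the generalization Bjorn is asking for.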
From: Logan Gunthorpe
> Sent: 13 March 2018 23:46
...
> As Stephen pointed out, it's a requirement of the PCIe spec that a switch supports P2P. If you want to sell a switch that does P2P with bad performance then that's on you to deal with.

That surprises me (unless I missed something last time I read the spec). While P2P writes are relatively easy to handle, reads and any other TLPs that require acks are a completely different proposition. There are no additional fields that can be set in the read TLP and reflected back in the ack(s) that could be used to route the acks back to the correct initiator.

I'm pretty sure that to support P2P reads a switch would have to save the received read TLP and (possibly later on) issue a read TLP of its own for the required data. I'm not even sure it is easy to interleave the P2P reads with those coming from the root. That requires a potentially infinite queue of pending requests.

Some x86 root ports support P2P writes (maybe with a BIOS option). It would be a shame not to be able to do P2P writes on such systems even though P2P reads won't work.

(We looked at using P2P transfers for some data, but in the end used a different scheme. For our use case P2P writes were enough. An alternative would be to access the same host memory buffer from two different devices - but there isn't an API that lets you do that.)

David
>I assume you want to exclude Root Ports because of multi-function > devices and the "route to self" error. I was hoping for a reference > to that so I could learn more about it. Apologies Bjorn. This slipped through my net. I will try and get you a reference for RTS in the next couple of days. > While I was looking for it, I found sec 6.12.1.2 (PCIe r4.0), "ACS > Functions in SR-IOV Capable and Multi-Function Devices", which seems > relevant. It talks about "peer-to-peer Requests (between Functions of > the device)". Thay says to me that multi-function devices can DMA > between themselves. I will go take a look. Appreciate the link. Stephen
On 13/03/18 08:56 PM, Bjorn Helgaas wrote:
> I assume you want to exclude Root Ports because of multi-function devices and the "route to self" error. I was hoping for a reference to that so I could learn more about it.

I haven't been able to find where in the spec it forbids route to self. But I was told this by developers who work on switches. Hopefully Stephen can find the reference.

But it's a bit of a moot point. Devices can DMA to themselves if they are designed to do so. For example, some NVMe cards can read and write their own CMB for certain types of DMA request. There is a register in the spec (CMBSZ) which specifies which types of requests are supported. (See 3.1.12 in NVMe 1.3a.)

> I agree that peers need to have a common upstream bridge. I think you're saying peers need to have *two* common upstream bridges. If I understand correctly, requiring two common bridges is a way to ensure that peers directly below Root Ports don't try to DMA to each other.

No, I don't get where you think we need to have two common upstream bridges. I'm not sure when such a case would ever happen. But you seem to understand based on what you wrote below.

> So I guess the first order of business is to nail down whether peers below a Root Port are prohibited from DMAing to each other. My assumption, based on 6.12.1.2 and the fact that I haven't yet found a prohibition, is that they can.

If you have a multifunction device designed to DMA to itself below a root port, it can. But determining this is on a device-by-device basis, just as determining whether a root complex can do peer-to-peer is on a per-device basis. So I'd say we don't want to allow it by default and let someone who has such a device figure out what's necessary if and when one comes along.

> You already have upstream_bridges_match(), which takes two pci_devs.
> I think it should walk up the PCI hierarchy from the first device, > checking whether the bridge at each level is also a parent of the > second device. Yes, this is what I meant when I said walking the entire tree. I've been kicking the can down the road on implementing this as getting ref counting right and testing it is going to be quite tricky. The single switch approach we implemented now is just a simplification which works for a single switch. But I guess we can look at implementing it this way for v4. Logan
On 14/03/18 06:16 AM, David Laight wrote: > That surprises me (unless I missed something last time I read the spec). > While P2P writes are relatively easy to handle, reads and any other TLP that > require acks are a completely different proposition. > There are no additional fields that can be set in the read TLP and will be > reflected back in the ack(s) than can be used to route the acks back to the > correct initiator. > > I'm pretty sure that to support P2P reads a switch would have to save > the received read TLP and (possibly later on) issue read TLP of its own > for the required data. > I'm not even sure it is easy to interleave the P2P reads with those > coming from the root. > That requires a potentially infinite queue of pending requests. This is wrong. A completion is a TLP just like any other and makes use of the Destination ID field in the header to route it back to the original requester. > Some x86 root ports support P2P writes (maybe with a bios option). > It would be a shame not to be able to do P2P writes on such systems > even though P2P reads won't work. Yes, and this has been discussed many times. It won't be changing in the near term. Logan
On Wed, Mar 14, 2018 at 10:17:34AM -0600, Logan Gunthorpe wrote: > On 13/03/18 08:56 PM, Bjorn Helgaas wrote: > > I agree that peers need to have a common upstream bridge. I think > > you're saying peers need to have *two* common upstream bridges. If I > > understand correctly, requiring two common bridges is a way to ensure > > that peers directly below Root Ports don't try to DMA to each other. > > No, I don't get where you think we need to have two common upstream > bridges. I'm not sure when such a case would ever happen. But you seem > to understand based on what you wrote below. Sorry, I phrased that wrong. You don't require two common upstream bridges; you require two upstream bridges, with the upper one being common, i.e., static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev) { struct pci_dev *up1, *up2; up1 = pci_dev_get(pci_upstream_bridge(pdev)); up2 = pci_dev_get(pci_upstream_bridge(up1)); return up2; } So if you're starting with pdev, up1 is the immediately upstream bridge and up2 is the second upstream bridge. If this is PCIe, up1 may be a Root Port and there is no up2, or up1 and up2 are in a switch. This is more restrictive than the spec requires. As long as there is a single common upstream bridge, peer-to-peer DMA should work. In fact, in conventional PCI, I think the upstream bridge could even be the host bridge (not a PCI-to-PCI bridge). You are focused on PCIe systems, and in those systems, most topologies do have an upstream switch, which means two upstream bridges. I'm trying to remove that assumption because I don't think there's a requirement for it in the spec. Enforcing this assumption complicates the code and makes it harder to understand because the reader says "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*, so why do we need these two bridges?" [*] For conventional PCI, this means anything below the same host bridge. 
Two devices on a conventional PCI root bus should be able to DMA to each other, even though there's no PCI-to-PCI bridge above them. For PCIe, it means a "hierarchy domain" as used in PCIe r4.0, sec 1.3.1, i.e., anything below the same Root Port. > > So I guess the first order of business is to nail down whether peers > > below a Root Port are prohibited from DMAing to each other. My > > assumption, based on 6.12.1.2 and the fact that I haven't yet found > > a prohibition, is that they can. > > If you have a multifunction device designed to DMA to itself below a > root port, it can. But determining this is on a device by device basis, > just as determining whether a root complex can do peer to peer is on a > per device basis. So I'd say we don't want to allow it by default and > let someone who has such a device figure out what's necessary if and > when one comes along. It's not the job of this infrastructure to answer the device-dependent question of whether DMA initiators or targets support peer-to-peer DMA. All we want to do here is figure out whether the PCI topology supports it, using the mechanisms guaranteed by the spec. We can derive that from the basic rules about how PCI bridges work, i.e., from the PCI-to-PCI Bridge spec r1.2, sec 4.3: A bridge forwards PCI memory transactions from its primary interface to its secondary interface (downstream) if a memory address is in the range defined by the Memory Base and Memory Limit registers (when the base is less than or equal to the limit) as illustrated in Figure 4-3. Conversely, a memory transaction on the secondary interface that is within this address range will not be forwarded upstream to the primary interface. Any memory transactions on the secondary interface that are outside this address range will be forwarded upstream to the primary interface (provided they are not in the address range defined by the prefetchable memory address range registers). This works for either PCI or PCIe. 
The only wrinkle PCIe adds is that the very top of the hierarchy is a Root Port, and we can't rely on it to route traffic to other Root Ports. I also doubt Root Complex Integrated Endpoints can participate in peer-to-peer DMA. Thanks for your patience in working through all this. I know it sometimes feels like being bounced around in all directions. It's just a normal consequence of trying to add complex functionality to an already complex system, with interest and expertise spread unevenly across a crowd of people. Bjorn
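The bridge decode rule Bjorn quotes (PCI-to-PCI Bridge spec r1.2, sec 4.3) can be sketched as a simple window check. The struct below is a hypothetical illustration with the Memory Base/Limit values already expanded to full addresses; the real registers hold only the upper address bits at 1 MB granularity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bridge with its non-prefetchable memory window expanded
 * to full 64-bit addresses for clarity. */
struct toy_bridge {
    uint64_t mem_base;
    uint64_t mem_limit;
};

/* Per sec 4.3: the bridge forwards a memory transaction downstream iff
 * the address is inside [base, limit] (and base <= limit, i.e., the
 * window is enabled). A secondary-side transaction inside the window
 * stays below the bridge; anything outside is forwarded upstream. */
static int routed_downstream(const struct toy_bridge *br, uint64_t addr)
{
    return br->mem_base <= br->mem_limit &&
           addr >= br->mem_base && addr <= br->mem_limit;
}

/* A peer's BAR inside the window is claimed below the bridge; an address
 * outside it (e.g., host RAM) is sent upstream. */
static int demo_window(void)
{
    struct toy_bridge br = { 0x90000000ull, 0x9fffffffull };

    return routed_downstream(&br, 0x90001000ull) == 1 &&
           routed_downstream(&br, 0x80000000ull) == 0;
}
```

This is what makes peer-to-peer routing fall out of the basic bridge rules: a TLP from one device aimed at a sibling's BAR never needs to leave the shared upstream bridge's subtree.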
On 14/03/18 12:51 PM, Bjorn Helgaas wrote: > You are focused on PCIe systems, and in those systems, most topologies > do have an upstream switch, which means two upstream bridges. I'm > trying to remove that assumption because I don't think there's a > requirement for it in the spec. Enforcing this assumption complicates > the code and makes it harder to understand because the reader says > "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*, > so why do we need these two bridges?" Yes, as I've said, we focused on being behind a single PCIe Switch because it's easier and vaguely safer (we *know* switches will work but other types of topology we have to assume will work based on the spec). Also, I have my doubts that anyone will ever have a use for this with non-PCIe devices. A switch shows up as two or more virtual bridges (per the PCIe v4 Spec 1.3.3) which explains the existing get_upstream_bridge_port() function. In any case, we'll look at generalizing this by looking for a common upstream port in the next revision of the patch set. Logan
On Wed, Mar 14, 2018 at 12:03 PM, Logan Gunthorpe <logang@deltatee.com> wrote: > > > On 14/03/18 12:51 PM, Bjorn Helgaas wrote: >> You are focused on PCIe systems, and in those systems, most topologies >> do have an upstream switch, which means two upstream bridges. I'm >> trying to remove that assumption because I don't think there's a >> requirement for it in the spec. Enforcing this assumption complicates >> the code and makes it harder to understand because the reader says >> "huh, I know peer-to-peer DMA should work inside any PCI hierarchy*, >> so why do we need these two bridges?" > > Yes, as I've said, we focused on being behind a single PCIe Switch > because it's easier and vaguely safer (we *know* switches will work but > other types of topology we have to assume will work based on the spec). > Also, I have my doubts that anyone will ever have a use for this with > non-PCIe devices. P2P over PCI/PCI-X is quite common in devices like raid controllers. It would be useful if those configurations were not left behind so that Linux could feasibly deploy offload code to a controller in the PCI domain.
On 14/03/18 01:28 PM, Dan Williams wrote:
> P2P over PCI/PCI-X is quite common in devices like raid controllers. It would be useful if those configurations were not left behind so that Linux could feasibly deploy offload code to a controller in the PCI domain.

Thanks for the note. Neat. In the end, nothing is getting left behind; it's just work for someone to add support. Even if I wasn't already going to make the change I mentioned, it all fits into the architecture and APIs quite easily.

Logan
> P2P over PCI/PCI-X is quite common in devices like raid controllers. Hi Dan Do you mean between PCIe devices below the RAID controller? Isn't it pretty novel to be able to support PCIe EPs below a RAID controller (as opposed to SCSI based devices)? > It would be useful if those configurations were not left behind so > that Linux could feasibly deploy offload code to a controller in the > PCI domain. Agreed. I think this would be great. Kind of like the XCOPY framework that was proposed a while back for SCSI devices [1] but updated to also include NVMe devices. That is definitely a use case we would like this framework to support. Stephen [1] https://lwn.net/Articles/592094/
Stephen, >> It would be useful if those configurations were not left behind so >> that Linux could feasibly deploy offload code to a controller in the >> PCI domain. > > Agreed. I think this would be great. Kind of like the XCOPY framework > that was proposed a while back for SCSI devices [1] but updated to also > include NVMe devices. That is definitely a use case we would like this > framework to support. I'm on my umpteenth rewrite of the block/SCSI offload code. It is not as protocol-agnostic as I would like in the block layer facing downwards. It has proven quite hard to reconcile token-based and EXTENDED COPY semantics along with the desire to support stacking. But from an application/filesystem perspective everything looks the same regardless of the intricacies of the device. Nothing is preventing us from supporting other protocols...
On Wed, Mar 14, 2018 at 12:34 PM, Stephen Bates <sbates@raithlin.com> wrote: >> P2P over PCI/PCI-X is quite common in devices like raid controllers. > > Hi Dan > > Do you mean between PCIe devices below the RAID controller? Isn't it pretty novel to be able to support PCIe EPs below a RAID controller (as opposed to SCSI based devices)? I'm thinking of the classic I/O offload card where there's an NTB to an internal PCI bus that has a storage controller and raid offload engines.
> I've seen the response that peers directly below a Root Port could not > DMA to each other through the Root Port because of the "route to self" > issue, and I'm not disputing that. Bjorn You asked me for a reference to RTS in the PCIe specification. As luck would have it I ended up in an Irish bar with Peter Onufryk this week at OCP Summit. We discussed the topic. It is not explicitly referred to as "Route to Self" and it's certainly not explicit (or obvious) but r6.2.8.1 of the PCIe 4.0 specification discusses error conditions for virtual PCI bridges. One of these conditions (given in the very first bullet in that section) applies to a request that is destined for the same port it came in on. When this occurs the request must be terminated as a UR. Stephen
On Thu, Mar 22, 2018 at 10:57:32PM +0000, Stephen Bates wrote: > > I've seen the response that peers directly below a Root Port could not > > DMA to each other through the Root Port because of the "route to self" > > issue, and I'm not disputing that. > > Bjorn > > You asked me for a reference to RTS in the PCIe specification. As > luck would have it I ended up in an Irish bar with Peter Onufryk > this week at OCP Summit. We discussed the topic. It is not > explicitly referred to as "Route to Self" and it's certainly not > explicit (or obvious) but r6.2.8.1 of the PCIe 4.0 specification > discusses error conditions for virtual PCI bridges. One of these > conditions (given in the very first bullet in that section) applies > to a request that is destined for the same port it came in on. When > this occurs the request must be terminated as a UR. Thanks for that reference! I suspect figure 10-3 in sec 10.1.1 might also be relevant, although it's buried in the ATS section. It shows internal routing between functions of a multifunction device. That suggests that the functions should be able to DMA to each other without those transactions ever appearing on the link. Popping way up the stack, my original point was that I'm trying to remove restrictions on what devices can participate in peer-to-peer DMA. I think it's fairly clear that in conventional PCI, any devices in the same PCI hierarchy, i.e., below the same host-to-PCI bridge, should be able to DMA to each other. The routing behavior of PCIe is supposed to be compatible with conventional PCI, and I would argue that this effectively requires multi-function PCIe devices to have the internal routing required to avoid the route-to-self issue. Bjorn
On 23/03/18 03:50 PM, Bjorn Helgaas wrote:
> Popping way up the stack, my original point was that I'm trying to remove restrictions on what devices can participate in peer-to-peer DMA. I think it's fairly clear that in conventional PCI, any devices in the same PCI hierarchy, i.e., below the same host-to-PCI bridge, should be able to DMA to each other.

Yup, we are working on this.

> The routing behavior of PCIe is supposed to be compatible with conventional PCI, and I would argue that this effectively requires multi-function PCIe devices to have the internal routing required to avoid the route-to-self issue.

That would be very nice, but many devices do not support the internal route. We've had to work around this in the past and, as I mentioned earlier, NVMe devices have a flag indicating support. However, if a device wants to be involved in P2P it must support it, and we can exclude devices that don't support it by simply not enabling their drivers.

Logan
On Fri, Mar 23, 2018 at 03:59:14PM -0600, Logan Gunthorpe wrote: > On 23/03/18 03:50 PM, Bjorn Helgaas wrote: > > Popping way up the stack, my original point was that I'm trying to > > remove restrictions on what devices can participate in > > peer-to-peer DMA. I think it's fairly clear that in conventional > > PCI, any devices in the same PCI hierarchy, i.e., below the same > > host-to-PCI bridge, should be able to DMA to each other. > > Yup, we are working on this. > > > The routing behavior of PCIe is supposed to be compatible with > > conventional PCI, and I would argue that this effectively requires > > multi-function PCIe devices to have the internal routing required > > to avoid the route-to-self issue. > > That would be very nice but many devices do not support the internal > route. We've had to work around this in the past and as I mentioned > earlier that NVMe devices have a flag indicating support. However, > if a device wants to be involved in P2P it must support it and we > can exclude devices that don't support it by simply not enabling > their drivers. Do you think these devices that don't support internal DMA between functions are within spec, or should we handle them as exceptions, e.g., via quirks? If NVMe defines a flag indicating peer-to-peer support, that would suggest to me that these devices are within spec. I looked up the CMBSZ register you mentioned (NVMe 1.3a, sec 3.1.12). You must be referring to the WDS, RDS, LISTS, CQS, and SQS bits. If WDS is set, the controller supports having Write-related data and metadata in the Controller Memory Buffer. That would mean the driver could put certain queues in controller memory instead of in host memory. The controller could then read the queue from its own internal memory rather than issuing a PCIe transaction to read it from host memory. That makes sense to me, but I don't see the connection to peer-to-peer. 
There's no multi-function device in this picture, so it's not about internal DMA between functions. WDS, etc., tell us about capabilities of the controller. If WDS is set, the CPU (or a peer PCIe device) can write things to controller memory. If it is clear, neither the CPU nor a peer device can put things there. So it doesn't seem to tell us anything about peer-to-peer specifically. It looks like information needed by the NVMe driver, but not by the PCI core. Bjorn
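For reference, decoding the CMBSZ bits under discussion might look like the sketch below. The bit positions follow my reading of NVMe 1.3 sec 3.1.12 and should be verified against the spec before being relied on.

```c
#include <assert.h>
#include <stdint.h>

/* CMBSZ support bits per NVMe 1.3 sec 3.1.12 (positions per my reading
 * of the spec; verify before relying on them). */
#define CMBSZ_SQS   (1u << 0)  /* Submission Queues may live in the CMB */
#define CMBSZ_CQS   (1u << 1)  /* Completion Queues */
#define CMBSZ_LISTS (1u << 2)  /* PRP/SGL lists */
#define CMBSZ_RDS   (1u << 3)  /* Read-related data and metadata */
#define CMBSZ_WDS   (1u << 4)  /* Write-related data and metadata */

/* WDS is the capability Bjorn describes: whether write-related data may
 * be placed in the controller's memory buffer by the CPU or a peer. */
static int cmb_supports_write_data(uint32_t cmbsz)
{
    return !!(cmbsz & CMBSZ_WDS);
}
```

As Bjorn notes, this tells the NVMe driver what may be placed in the CMB; it says nothing about PCI topology, so it belongs in the driver rather than the PCI core.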
> That would be very nice but many devices do not support the internal route.

But Logan, in the NVMe case we are discussing movement within a single function (i.e. from an NVMe namespace to an NVMe CMB on the same function). Bjorn is discussing movement between two functions (PFs or VFs) in the same PCIe EP. In the case of multi-function endpoints I think the standard requires those devices to support internal DMAs for transfers between those functions (but does not require it within a function).

So I think the summary is:

1. There is no requirement for a single function to support internal DMAs, but in the case of NVMe we do have a protocol-specific way for an NVMe function to indicate that it supports this via the CMB BAR. Other protocols may also have such methods but I am not aware of them at this time.

2. For multi-function endpoints I think it is a requirement that DMAs *between* functions are supported via an internal path, but this can be overridden by ACS when supported in the EP.

3. For multi-function endpoints there is no requirement to support internal DMA within each individual function (i.e. a la point 1 but extended to each function in a MF device).

Based on my review of the specification, I concur with Bjorn that p2pdma between functions in a MF endpoint should be assured to be supported via the standard. However, if the p2pdma involves only a single function in a MF device then we can only support NVMe CMBs for now. Let's review and see what the options are for supporting this in the next respin.

Stephen
On Tue, 13 Mar 2018 10:43:55 -0600 Logan Gunthorpe <logang@deltatee.com> wrote: > On 12/03/18 09:28 PM, Sinan Kaya wrote: > > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote: > > Regarding the switch business, It is amazing how much trouble you went into > > limit this functionality into very specific hardware. > > > > I thought that we reached to an agreement that code would not impose > > any limits on what user wants. > > > > What happened to all the emails we exchanged? > > It turns out that root ports that support P2P are far less common than > anyone thought. So it will likely have to be a white list. This came as a bit of a surprise to our PCIe architect. His follow up was whether it was worth raising an ECR for the PCIe spec to add a capability bit to allow this to be discovered. This might long term avoid the need to maintain the white list for new devices. So is it worth having a long term solution for making this discoverable? Jonathan > Nobody else > seems keen on allowing the user to enable this on hardware that doesn't > work. The easiest solution is still limiting it to using a switch. From > there, if someone wants to start creating a white-list then that's > probably the way forward to support root ports. > > And there's also the ACS problem which means if you want to use P2P on > the root ports you'll have to disable ACS on the entire system. (Or > preferably, the IOMMU groups need to get more sophisticated to allow for > dynamic changes). > > Additionally, once you allow for root ports you may find the IOMMU > getting in the way. > > So there are great deal more issues to sort out if you don't restrict to > devices behind switches. > > Logan
On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote: > On Tue, 13 Mar 2018 10:43:55 -0600 > Logan Gunthorpe <logang@deltatee.com> wrote: > > It turns out that root ports that support P2P are far less common than > > anyone thought. So it will likely have to be a white list. > > This came as a bit of a surprise to our PCIe architect. > > His follow up was whether it was worth raising an ECR for the PCIe spec > to add a capability bit to allow this to be discovered. This might > long term avoid the need to maintain the white list for new devices. > > So is it worth having a long term solution for making this discoverable? It was surprising to me that there's no architected way to discover this. It seems like such an obvious thing that I guess I assumed the omission was intentional, i.e., maybe there's something that makes it impractical, but it would be worth at least asking somebody in the SIG. It seems like for root ports in the same root complex, at least, there could be a bit somewhere in the root port or the RCRB (which Linux doesn't support yet).
On 24/03/18 09:28 AM, Stephen Bates wrote:
> 1. There is no requirement for a single function to support internal DMAs but in the case of NVMe we do have a protocol specific way for a NVMe function to indicate it supports via the CMB BAR. Other protocols may also have such methods but I am not aware of them at this time.
>
> 2. For multi-function end-points I think it is a requirement that DMAs *between* functions are supported via an internal path but this can be over-ridden by ACS when supported in the EP.
>
> 3. For multi-function end-points there is no requirement to support internal DMA within each individual function (i.e. a la point 1 but extended to each function in a MF device).

Yeah, that makes sense. I wasn't making a huge distinction between
single and multi-function devices because from an implementation point
of view it's not that different.

Logan
On 26/03/18 08:01 AM, Bjorn Helgaas wrote:
> On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
>> On Tue, 13 Mar 2018 10:43:55 -0600
>> Logan Gunthorpe <logang@deltatee.com> wrote:
>>> It turns out that root ports that support P2P are far less common than
>>> anyone thought. So it will likely have to be a white list.
>>
>> This came as a bit of a surprise to our PCIe architect.
>>
>> His follow up was whether it was worth raising an ECR for the PCIe spec
>> to add a capability bit to allow this to be discovered. This might
>> long term avoid the need to maintain the white list for new devices.
>>
>> So is it worth having a long term solution for making this discoverable?
>
> It was surprising to me that there's no architected way to discover
> this. It seems like such an obvious thing that I guess I assumed the
> omission was intentional, i.e., maybe there's something that makes it
> impractical, but it would be worth at least asking somebody in the
> SIG. It seems like for root ports in the same root complex, at least,
> there could be a bit somewhere in the root port or the RCRB (which
> Linux doesn't support yet).

Yes, I agree. It would be a good long term solution to have this bit in
the spec. That would avoid us needing to create a white list for new
hardware. However, I expect it would be years before we can rely on it
so someone may yet implement that white list.

Logan
On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> On Tue, 13 Mar 2018 10:43:55 -0600
> Logan Gunthorpe <logang@deltatee.com> wrote:
>
> > On 12/03/18 09:28 PM, Sinan Kaya wrote:
> > > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> > > Regarding the switch business, It is amazing how much trouble you went into
> > > limit this functionality into very specific hardware.
> > >
> > > I thought that we reached to an agreement that code would not impose
> > > any limits on what user wants.
> > >
> > > What happened to all the emails we exchanged?
> >
> > It turns out that root ports that support P2P are far less common than
> > anyone thought. So it will likely have to be a white list.
>
> This came as a bit of a surprise to our PCIe architect.

I don't think it is a hardware problem. I know Mellanox and Nvidia have
been doing p2p on Intel root complexes for something like 5-6 years
now.. I don't have the details, but it does seem to work. I have heard
some chips give poor performance..

Also AMD GPU SLI uses P2P these days, so this isn't exactly a niche
feature in Intel/AMD land.

I think the main issue here is that there is some BIOS involvement to
set things up properly. Eg in GPU land motherboards certify for
'crossfire' support.

> His follow up was whether it was worth raising an ECR for the PCIe spec
> to add a capability bit to allow this to be discovered. This might
> long term avoid the need to maintain the white list for new devices.

If it is primarily a BIOS issue then it should be an ACPI thing, right?

Jason
On 26/03/18 10:41 AM, Jason Gunthorpe wrote:
> On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
>> On Tue, 13 Mar 2018 10:43:55 -0600
>> Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>> On 12/03/18 09:28 PM, Sinan Kaya wrote:
>>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
>>>> Regarding the switch business, It is amazing how much trouble you went into
>>>> limit this functionality into very specific hardware.
>>>>
>>>> I thought that we reached to an agreement that code would not impose
>>>> any limits on what user wants.
>>>>
>>>> What happened to all the emails we exchanged?
>>>
>>> It turns out that root ports that support P2P are far less common than
>>> anyone thought. So it will likely have to be a white list.
>>
>> This came as a bit of a surprise to our PCIe architect.
>
> I don't think it is a hardware problem.

The latest and greatest Power9 CPUs still explicitly do not support
this. And, if I recall correctly, the ARM64 device we played with did
not either -- but I suspect that will differ depending on vendor.

The latest Intel devices of course do support it, but go back 3-4 years
or so and the performance is pretty much unusable (at least for the
purposes of what this patchset implements). Even older CPUs did not
support it.

We haven't done any testing on AMD devices but I assume they are
similar to Intel.

In any case, even if every CPU manufactured today supported it well,
there are still older devices out there without support that we need to
ensure are handled properly.

Logan
On Mon, Mar 26, 2018 at 11:30:38AM -0600, Logan Gunthorpe wrote:
>
>
> On 26/03/18 10:41 AM, Jason Gunthorpe wrote:
> > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> >> On Tue, 13 Mar 2018 10:43:55 -0600
> >> Logan Gunthorpe <logang@deltatee.com> wrote:
> >>
> >>> On 12/03/18 09:28 PM, Sinan Kaya wrote:
> >>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> >>>> Regarding the switch business, It is amazing how much trouble you went into
> >>>> limit this functionality into very specific hardware.
> >>>>
> >>>> I thought that we reached to an agreement that code would not impose
> >>>> any limits on what user wants.
> >>>>
> >>>> What happened to all the emails we exchanged?
> >>>
> >>> It turns out that root ports that support P2P are far less common than
> >>> anyone thought. So it will likely have to be a white list.
> >>
> >> This came as a bit of a surprise to our PCIe architect.
> >
> > I don't think it is a hardware problem.
>
> The latest and greatest Power9 CPUs still explicitly do not support
> this.

I think this is another case of the HW can do it but the SW support is
missing. IOMMU configuration and maybe firmware too, for instance.

If I recall I saw a presentation that Coral was expected to use P2P
between the network and GPU.

> And, if I recall correctly, the ARM64 device we played with did
> not either -- but I suspect that will differ depending on vendor.

Wouldn't surprise me at all to see broken implementations in ARM64..
But even there it needs IOMMU enablement to work at all if I recall.

Basically, this is probably not a HW problem that needs a HW bit, but a
OS/firmware problem to do all the enablement..

Jason
On 26/03/18 01:35 PM, Jason Gunthorpe wrote:
> I think this is another case of the HW can do it but the SW support is
> missing. IOMMU configuration and maybe firmware too, for instance.

Nope, not sure how you can make this leap. We've been specifically told
that peer-to-peer PCIe DMA is not supported on P9. We've also been told
there is a very limited (ie. store only), low performance mode designed
for poking mailboxes, but it is disabled by default and must be enabled
with some kind of special firmware calls.

Logan
On Mon, 26 Mar 2018 09:46:24 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 26/03/18 08:01 AM, Bjorn Helgaas wrote:
> > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> >> On Tue, 13 Mar 2018 10:43:55 -0600
> >> Logan Gunthorpe <logang@deltatee.com> wrote:
> >>> It turns out that root ports that support P2P are far less common than
> >>> anyone thought. So it will likely have to be a white list.
> >>
> >> This came as a bit of a surprise to our PCIe architect.
> >>
> >> His follow up was whether it was worth raising an ECR for the PCIe spec
> >> to add a capability bit to allow this to be discovered. This might
> >> long term avoid the need to maintain the white list for new devices.
> >>
> >> So is it worth having a long term solution for making this discoverable?
> >
> > It was surprising to me that there's no architected way to discover
> > this. It seems like such an obvious thing that I guess I assumed the
> > omission was intentional, i.e., maybe there's something that makes it
> > impractical, but it would be worth at least asking somebody in the
> > SIG. It seems like for root ports in the same root complex, at least,
> > there could be a bit somewhere in the root port or the RCRB (which
> > Linux doesn't support yet).
>
> Yes, I agree. It would be a good long term solution to have this bit in
> the spec. That would avoid us needing to create a white list for new
> hardware. However, I expect it would be years before we can rely on it
> so someone may yet implement that white list.
>
> Logan

I'll see if I can get our PCI SIG people to follow this through and see
if it is just an omission or, as Bjorn suggested, there is some reason
we aren't thinking of that makes it hard.

Agreed that it is a somewhat tangential question to what to do with
current hardware, but let's get the ball rolling if possible.

Jonathan
On 27/03/18 02:47 AM, Jonathan Cameron wrote:
> I'll see if I can get our PCI SIG people to follow this through and see if
> it is just an omission or as Bjorn suggested, there is some reason we
> aren't thinking of that makes it hard.

That would be great! Thanks!

Logan
> I'll see if I can get our PCI SIG people to follow this through
Hi Jonathan
Can you let me know if this moves forward within PCI-SIG? I would like to track it. I can see this being doable between Root Ports that reside in the same Root Complex but might become more challenging to standardize for RPs that reside in different RCs in the same (potentially multi-socket) system. I know in the past we have seen MemWr TLPs cross the QPI bus in Intel systems but I am sure that is not something that would work in all systems and must fall outside the remit of PCI-SIG ;-).
I agree such a capability bit would be very useful but it's going to be quite some time before we can rely on hardware being available that supports it.
Stephen
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 34b56a8f8480..d59f6f5ddfcd 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -124,6 +124,22 @@ config PCI_PASID
 	  If unsure, say N.
 
+config PCI_P2PDMA
+	bool "PCI peer-to-peer transfer support"
+	depends on ZONE_DEVICE
+	select GENERIC_ALLOCATOR
+	help
+	  Enables drivers to do PCI peer-to-peer transactions to and from
+	  BARs that are exposed in other devices that are part of
+	  the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+	  specification to work (ie. anything below a single PCI bridge).
+
+	  Many PCIe root complexes do not support P2P transactions and
+	  it's hard to tell which support it at all, so at this time you
+	  will need a PCIe switch.
+
+	  If unsure, say N.
+
 config PCI_LABEL
 	def_bool y if (DMI || ACPI)
 	depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 941970936840..45e0ff6f3213 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o
 obj-$(CONFIG_PCI_ATS) += ats.o
 obj-$(CONFIG_PCI_IOV) += iov.o
+obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
 
 #
 # ACPI Related PCI FW Functions
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index 000000000000..0ee917381dce
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,679 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#include <linux/pci-p2pdma.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/percpu-refcount.h>
+#include <linux/random.h>
+
+struct pci_p2pdma {
+	struct percpu_ref devmap_ref;
+	struct completion devmap_ref_done;
+	struct gen_pool *pool;
+	bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+	struct pci_p2pdma *p2p =
+		container_of(ref, struct pci_p2pdma, devmap_ref);
+
+	complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+
+	if (percpu_ref_is_dying(ref))
+		return;
+
+	percpu_ref_kill(ref);
+}
+
+static void pci_p2pdma_release(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	if (!pdev->p2pdma)
+		return;
+
+	wait_for_completion(&pdev->p2pdma->devmap_ref_done);
+	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
+
+	gen_pool_destroy(pdev->p2pdma->pool);
+	pdev->p2pdma = NULL;
+}
+
+static int pci_p2pdma_setup(struct pci_dev *pdev)
+{
+	int error = -ENOMEM;
+	struct pci_p2pdma *p2p;
+
+	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+	if (!p2p)
+		return -ENOMEM;
+
+	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2p->pool)
+		goto out;
+
+	init_completion(&p2p->devmap_ref_done);
+	error = percpu_ref_init(&p2p->devmap_ref,
+			pci_p2pdma_percpu_release, 0, GFP_KERNEL);
+	if (error)
+		goto out_pool_destroy;
+
+	percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref);
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_pool_destroy;
+
+	pdev->p2pdma = p2p;
+
+	return 0;
+
+out_pool_destroy:
+	gen_pool_destroy(p2p->pool);
+out:
+	devm_kfree(&pdev->dev, p2p);
+	return error;
+}
+
+/**
+ * pci_p2pdma_add_resource - add memory for use as p2p memory
+ * @pdev: the device to add the memory to
+ * @bar: PCI BAR to add
+ * @size: size of the memory to add, may be zero to use the whole BAR
+ * @offset: offset into the PCI BAR
+ *
+ * The memory will be given ZONE_DEVICE struct pages so that it may
+ * be used with any DMA request.
+ */
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+			    u64 offset)
+{
+	struct dev_pagemap *pgmap;
+	void *addr;
+	int error;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	if (offset >= pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!size)
+		size = pci_resource_len(pdev, bar) - offset;
+
+	if (size + offset > pci_resource_len(pdev, bar))
+		return -EINVAL;
+
+	if (!pdev->p2pdma) {
+		error = pci_p2pdma_setup(pdev);
+		if (error)
+			return error;
+	}
+
+	pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL);
+	if (!pgmap)
+		return -ENOMEM;
+
+	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
+	pgmap->res.end = pgmap->res.start + size - 1;
+	pgmap->res.flags = pci_resource_flags(pdev, bar);
+	pgmap->ref = &pdev->p2pdma->devmap_ref;
+	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+
+	addr = devm_memremap_pages(&pdev->dev, pgmap);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
+		goto pgmap_free;
+	}
+
+	error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr,
+			pci_bus_address(pdev, bar) + offset,
+			resource_size(&pgmap->res), dev_to_node(&pdev->dev));
+	if (error)
+		goto pgmap_free;
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill,
+			&pdev->p2pdma->devmap_ref);
+	if (error)
+		goto pgmap_free;
+
+	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
+		 &pgmap->res);
+
+	return 0;
+
+pgmap_free:
+	devres_free(pgmap);
+	return error;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
+
+static struct pci_dev *find_parent_pci_dev(struct device *dev)
+{
+	struct device *parent;
+
+	dev = get_device(dev);
+
+	while (dev) {
+		if (dev_is_pci(dev))
+			return to_pci_dev(dev);
+
+		parent = get_device(dev->parent);
+		put_device(dev);
+		dev = parent;
+	}
+
+	return NULL;
+}
+
+/*
+ * If a device is behind a switch, we try to find the upstream bridge
+ * port of the switch. This requires two calls to pci_upstream_bridge():
+ * one for the upstream port on the switch, one on the upstream port
+ * for the next level in the hierarchy. Because of this, devices connected
+ * to the root port will be rejected.
+ */
+static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev)
+{
+	struct pci_dev *up1, *up2;
+
+	if (!pdev)
+		return NULL;
+
+	up1 = pci_dev_get(pci_upstream_bridge(pdev));
+	if (!up1)
+		return NULL;
+
+	up2 = pci_dev_get(pci_upstream_bridge(up1));
+	pci_dev_put(up1);
+
+	return up2;
+}
+
+/*
+ * This function checks if two PCI devices are behind the same switch.
+ * (ie. they share the same second upstream port as returned by
+ * get_upstream_bridge_port().)
+ *
+ * Future work could expand this to handle hierarchies of switches
+ * so any devices whose traffic can be routed without going through
+ * the root complex could be used. For now, we limit it to just one
+ * level of switch.
+ *
+ * This function returns the "distance" between the devices. 0 meaning
+ * they are the same device, 1 meaning they are behind the same switch.
+ * If they are not behind the same switch, -1 is returned.
+ */
+static int __upstream_bridges_match(struct pci_dev *upstream,
+				    struct pci_dev *client)
+{
+	struct pci_dev *dma_up;
+	int ret = 1;
+
+	dma_up = get_upstream_bridge_port(client);
+
+	if (!dma_up) {
+		dev_dbg(&client->dev, "not a PCI device behind a bridge\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (upstream != dma_up) {
+		dev_dbg(&client->dev,
+			"does not reside on the same upstream bridge\n");
+		ret = -1;
+		goto out;
+	}
+
+out:
+	pci_dev_put(dma_up);
+	return ret;
+}
+
+static int upstream_bridges_match(struct pci_dev *provider,
+				  struct pci_dev *client)
+{
+	struct pci_dev *upstream;
+	int ret;
+
+	if (provider == client)
+		return 0;
+
+	upstream = get_upstream_bridge_port(provider);
+	if (!upstream) {
+		pci_warn(provider, "not behind a PCI bridge\n");
+		return -1;
+	}
+
+	ret = __upstream_bridges_match(upstream, client);
+
+	pci_dev_put(upstream);
+
+	return ret;
+}
+
+struct pci_p2pdma_client {
+	struct list_head list;
+	struct pci_dev *client;
+	struct pci_dev *provider;
+};
+
+/**
+ * pci_p2pdma_add_client - allocate a new element in a client device list
+ * @head: list head of p2pdma clients
+ * @dev: device to add to the list
+ *
+ * This adds @dev to a list of clients used by a p2pdma device.
+ * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has
+ * been called successfully, the list will be bound to a specific p2pdma
+ * device and new clients can only be added to the list if they are
+ * supported by that p2pdma device.
+ *
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ *
+ * Returns 0 if the client was successfully added.
+ */
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *item, *new_item;
+	struct pci_dev *provider = NULL;
+	struct pci_dev *client;
+	int ret;
+
+	if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) {
+		dev_warn(dev,
+			 "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n");
+		return -ENODEV;
+	}
+
+	client = find_parent_pci_dev(dev);
+	if (!client) {
+		dev_warn(dev,
+			 "cannot be used for peer-to-peer DMA as it is not a PCI device\n");
+		return -ENODEV;
+	}
+
+	item = list_first_entry_or_null(head, struct pci_p2pdma_client, list);
+	if (item && item->provider) {
+		provider = item->provider;
+
+		if (upstream_bridges_match(provider, client) < 0) {
+			ret = -EXDEV;
+			goto put_client;
+		}
+	}
+
+	new_item = kzalloc(sizeof(*new_item), GFP_KERNEL);
+	if (!new_item) {
+		ret = -ENOMEM;
+		goto put_client;
+	}
+
+	new_item->client = client;
+	new_item->provider = pci_dev_get(provider);
+
+	list_add_tail(&new_item->list, head);
+
+	return 0;
+
+put_client:
+	pci_dev_put(client);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_add_client);
+
+static void pci_p2pdma_client_free(struct pci_p2pdma_client *item)
+{
+	list_del(&item->list);
+	pci_dev_put(item->client);
+	pci_dev_put(item->provider);
+	kfree(item);
+}
+
+/**
+ * pci_p2pdma_remove_client - remove and free a p2pdma client
+ * @head: list head of p2pdma clients
+ * @dev: device to remove from the list
+ *
+ * This removes @dev from a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2p functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+	struct pci_dev *pdev;
+
+	pdev = find_parent_pci_dev(dev);
+	if (!pdev)
+		return;
+
+	list_for_each_entry_safe(pos, tmp, head, list) {
+		if (pos->client != pdev)
+			continue;
+
+		pci_p2pdma_client_free(pos);
+	}
+
+	pci_dev_put(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client);
+
+/**
+ * pci_p2pdma_client_list_free - free an entire list of p2pdma clients
+ * @head: list head of p2pdma clients
+ *
+ * This removes all devices in a list of clients used by a p2pdma device.
+ * The caller is expected to have a lock which protects @head as necessary
+ * so that none of the pci_p2pdma functions can be called concurrently
+ * on that list.
+ */
+void pci_p2pdma_client_list_free(struct list_head *head)
+{
+	struct pci_p2pdma_client *pos, *tmp;
+
+	list_for_each_entry_safe(pos, tmp, head, list)
+		pci_p2pdma_client_free(pos);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free);
+
+/**
+ * pci_p2pdma_distance - Determine the cumulative distance between
+ *	a p2pdma provider and the clients in use.
+ * @provider: p2pdma provider to check against the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns -1 if any of the clients are not compatible (behind the same
+ * switch as the provider), otherwise returns a positive number where
+ * the lower number is the preferable choice. (If there's one client
+ * that's the same as the provider it will return 0, which is the best
+ * choice).
+ *
+ * For now, "compatible" means the provider and the clients are all behind
+ * the same switch. This cuts out cases that may work but is safest for the
+ * user. Future work can expand this to cases with nested switches.
+ */
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+	struct pci_dev *upstream;
+	int ret;
+	int distance = 0;
+
+	upstream = get_upstream_bridge_port(provider);
+	if (!upstream) {
+		pci_warn(provider, "not behind a PCI bridge\n");
+		return -1;
+	}
+
+	list_for_each_entry(pos, clients, list) {
+		if (pos->client == provider)
+			continue;
+
+		ret = __upstream_bridges_match(upstream, pos->client);
+		if (ret < 0)
+			goto no_match;
+
+		distance += ret;
+	}
+
+	ret = distance;
+
+no_match:
+	pci_dev_put(upstream);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_distance);
+
+/**
+ * pci_p2pdma_assign_provider - Check compatibility (as per
+ *	pci_p2pdma_distance) and assign a provider to a list of clients
+ * @provider: p2pdma provider to assign to the client list
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * Returns false if any of the clients are not compatible, true if the
+ * provider was successfully assigned to the clients.
+ */
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients)
+{
+	struct pci_p2pdma_client *pos;
+
+	if (pci_p2pdma_distance(provider, clients) < 0)
+		return false;
+
+	list_for_each_entry(pos, clients, list)
+		pos->provider = provider;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_assign_provider);
+
+/**
+ * pci_has_p2pmem - check if a given PCI device has published any p2pmem
+ * @pdev: PCI device to check
+ */
+bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return pdev->p2pdma && pdev->p2pdma->p2pmem_published;
+}
+EXPORT_SYMBOL_GPL(pci_has_p2pmem);
+
+/**
+ * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with
+ *	the specified list of clients and shortest distance (as determined
+ *	by pci_p2pdma_distance())
+ * @clients: list of devices to check (NULL-terminated)
+ *
+ * If multiple devices are behind the same switch, the one "closest" to the
+ * client devices in use will be chosen first. (So if one of the providers is
+ * the same as one of the clients, that provider will be used ahead of any
+ * other providers that are unrelated). If multiple providers are an equal
+ * distance away, one will be chosen at random.
+ *
+ * Returns a pointer to the PCI device with a reference taken (use pci_dev_put
+ * to return the reference) or NULL if no compatible device is found. The
+ * found provider will also be assigned to the client list.
+ */
+struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	struct pci_dev *pdev = NULL;
+	struct pci_p2pdma_client *pos;
+	int distance;
+	int closest_distance = INT_MAX;
+	struct pci_dev **closest_pdevs;
+	int ties = 0;
+	const int max_ties = PAGE_SIZE / sizeof(*closest_pdevs);
+	int i;
+
+	closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!closest_pdevs)
+		return NULL;
+
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) {
+		if (!pci_has_p2pmem(pdev))
+			continue;
+
+		distance = pci_p2pdma_distance(pdev, clients);
+		if (distance < 0 || distance > closest_distance)
+			continue;
+
+		if (distance == closest_distance && ties >= max_ties)
+			continue;
+
+		if (distance < closest_distance) {
+			for (i = 0; i < ties; i++)
+				pci_dev_put(closest_pdevs[i]);
+
+			ties = 0;
+			closest_distance = distance;
+		}
+
+		closest_pdevs[ties++] = pci_dev_get(pdev);
+	}
+
+	if (ties)
+		pdev = pci_dev_get(closest_pdevs[prandom_u32_max(ties)]);
+
+	for (i = 0; i < ties; i++)
+		pci_dev_put(closest_pdevs[i]);
+
+	if (pdev)
+		list_for_each_entry(pos, clients, list)
+			pos->provider = pdev;
+
+	kfree(closest_pdevs);
+	return pdev;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_find);
+
+/**
+ * pci_alloc_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @size: number of bytes to allocate
+ *
+ * Returns the allocated memory or NULL on error.
+ */
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	void *ret;
+
+	if (unlikely(!pdev->p2pdma))
+		return NULL;
+
+	if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref)))
+		return NULL;
+
+	ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size);
+
+	if (unlikely(!ret))
+		percpu_ref_put(&pdev->p2pdma->devmap_ref);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_alloc_p2pmem);
+
+/**
+ * pci_free_p2pmem - free peer-to-peer DMA memory
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ * @size: number of bytes that was allocated
+ */
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size)
+{
+	gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size);
+	percpu_ref_put(&pdev->p2pdma->devmap_ref);
+}
+EXPORT_SYMBOL_GPL(pci_free_p2pmem);
+
+/**
+ * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual
+ *	address obtained with pci_alloc_p2pmem()
+ * @pdev: the device the memory was allocated from
+ * @addr: address of the memory that was allocated
+ */
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr)
+{
+	if (!addr)
+		return 0;
+	if (!pdev->p2pdma)
+		return 0;
+
+	/*
+	 * Note: when we added the memory to the pool we used the PCI
+	 * bus address as the physical address. So gen_pool_virt_to_phys()
+	 * actually returns the bus address despite the misleading name.
+	 */
+	return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus);
+
+/**
+ * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist
+ * @pdev: the device to allocate memory from
+ * @sgl: the allocated scatterlist
+ * @nents: the number of SG entries in the list
+ * @length: number of bytes to allocate
+ *
+ * Returns 0 on success
+ */
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+		unsigned int *nents, u32 length)
+{
+	struct scatterlist *sg;
+	void *addr;
+
+	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
+	if (!sg)
+		return -ENOMEM;
+
+	sg_init_table(sg, 1);
+
+	addr = pci_alloc_p2pmem(pdev, length);
+	if (!addr)
+		goto out_free_sg;
+
+	sg_set_buf(sg, addr, length);
+	*sgl = sg;
+	*nents = 1;
+	return 0;
+
+out_free_sg:
+	kfree(sg);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl);
+
+/**
+ * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl()
+ * @pdev: the device the memory was allocated from
+ * @sgl: the allocated scatterlist
+ * @nents: the number of SG entries in the list
+ */
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+		unsigned int nents)
+{
+	struct scatterlist *sg;
+	int count;
+
+	if (!sgl || !nents)
+		return;
+
+	for_each_sg(sgl, sg, nents, count)
+		pci_free_p2pmem(pdev, sg_virt(sg), sg->length);
+	kfree(sgl);
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl);
+
+/**
+ * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by
+ *	other devices with pci_p2pmem_find()
+ * @pdev: the device with peer-to-peer DMA memory to publish
+ * @publish: set to true to publish the memory, false to unpublish it
+ */
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+	if (!pdev->p2pdma)
+		return;
+
+	pdev->p2pdma->p2pmem_published = publish;
+}
+EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..9e907c338a44 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,11 +53,16 @@ struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_PCI_P2PDMA:
+ * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
+ * transactions.
  */
 enum memory_type {
 	MEMORY_DEVICE_HOST = 0,
 	MEMORY_DEVICE_PRIVATE,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_PCI_P2PDMA,
 };
 
 /*
@@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
+#ifdef CONFIG_PCI_P2PDMA
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+}
+#else /* CONFIG_PCI_P2PDMA */
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
 static inline bool is_device_private_page(const struct page *page)
 {
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
new file mode 100644
index 000000000000..1f7856ff098b
--- /dev/null
+++ b/include/linux/pci-p2pdma.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ *
+ */
+
+#ifndef _LINUX_PCI_P2PDMA_H
+#define _LINUX_PCI_P2PDMA_H
+
+#include <linux/pci.h>
+
+struct block_device;
+struct scatterlist;
+
+#ifdef CONFIG_PCI_P2PDMA
+int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
+		u64 offset);
+int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
+void pci_p2pdma_client_list_free(struct list_head *head);
+int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients);
+bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+				struct list_head *clients);
+bool pci_has_p2pmem(struct pci_dev *pdev);
+struct pci_dev *pci_p2pmem_find(struct list_head *clients);
+void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
+void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
+pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
+int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl,
+		unsigned int *nents, u32 length);
+void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl,
+		unsigned int nents);
+void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+#else /* CONFIG_PCI_P2PDMA */
+static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
+		size_t size, u64 offset)
+{
+	return 0;
+}
+static inline int pci_p2pdma_add_client(struct list_head *head,
+		struct device *dev)
+{
+	return 0;
+}
+static inline void pci_p2pdma_remove_client(struct list_head *head,
+		struct device *dev)
+{
+}
+static inline void pci_p2pdma_client_list_free(struct list_head *head)
+{
+}
+static inline int pci_p2pdma_distance(struct pci_dev *provider,
+		struct list_head *clients)
+{
+	return -1;
+}
+static inline bool pci_p2pdma_assign_provider(struct pci_dev *provider,
+		struct list_head *clients)
+{
+	return false;
+}
+static inline bool pci_has_p2pmem(struct pci_dev *pdev)
+{
+	return false;
+}
+static inline struct pci_dev *pci_p2pmem_find(struct list_head *clients)
+{
+	return NULL;
+}
+static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
+{
+	return NULL;
+}
+static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr,
+		size_t size)
+{
+}
+static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev,
+		void *addr)
+{
+	return 0;
+}
+static inline int pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
+		struct scatterlist **sgl, unsigned int *nents, u32 length)
+{
+	return -ENODEV;
+}
+static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
+		struct scatterlist *sgl, unsigned int nents)
+{
+}
+static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
+{
+}
+#endif /* CONFIG_PCI_P2PDMA */
+#endif /* _LINUX_PCI_P2PDMA_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1beda008..437e42615896 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -276,6 +276,7 @@ struct pcie_link_state;
 struct pci_vpd;
 struct pci_sriov;
 struct pci_ats;
+struct pci_p2pdma;
 
 /* The pci_dev structure describes PCI devices */
 struct pci_dev {
@@ -429,6 +430,9 @@ struct pci_dev {
 #ifdef CONFIG_PCI_PASID
 	u16 pasid_features;
 #endif
+#ifdef CONFIG_PCI_P2PDMA
+	struct pci_p2pdma *p2pdma;
+#endif
 	phys_addr_t rom;	/* Physical address if not from BAR */
 	size_t romlen;		/* Length if not from BAR */
 	char *driver_override;	/* Driver name to force a match */