Message ID | 20180104190137.7654-7-logang@deltatee.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to | Bjorn Helgaas |
On Thu, Jan 04, 2018 at 12:01:31PM -0700, Logan Gunthorpe wrote:
> @@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
>   * @remote_addr:remote address to read/write (relative to @rkey)
>   * @rkey:	remote key to operate on
>   * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
> + * @flags:	any of the RDMA_RW_CTX_FLAG_* flags
>   *
>   * Returns the number of WQEs that will be needed on the workqueue if
>   * successful, or a negative error code.
>   */
>  int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
>  		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
> -		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> +		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
> +		unsigned int flags)
>  {
>  	struct ib_device *dev = qp->pd->device;
>  	int ret;
>
> -	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
> +	if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
> +		ret = pci_p2pmem_map_sg(sg, sg_cnt);
> +	else
> +		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);

This seems really clunky since we are going to want to do this same logic all over the place.

I'd be much happier if dma_map_sg can tell the memory is P2P or not from the scatterlist or dir arguments and not require the callers to have this.

For instance adding DMA_TO_P2P_DEVICE and DMA_FROM_P2P_DEVICE, or adding another bit to the page flags in scatterlist.

The signature of pci_p2pmem_map_sg also doesn't look very good, it should mirror the usual dma_map_sg, otherwise it needs more revision to extend this to do P2P through a root complex.

Jason
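[For illustration: the "encode P2P in the direction" alternative Jason floats would amount to something like the sketch below. DMA_TO_P2P_DEVICE and DMA_FROM_P2P_DEVICE exist in no kernel tree; this is purely a hypothetical rendering of the idea, not part of the patch.]

	/* Hypothetical extension of enum dma_data_direction -- the two
	 * P2P values are invented here. A mapping provider could then
	 * pick the P2P path purely from @dir, with no extra parameter. */
	enum dma_data_direction {
		DMA_BIDIRECTIONAL	= 0,
		DMA_TO_DEVICE		= 1,
		DMA_FROM_DEVICE		= 2,
		DMA_NONE		= 3,
		DMA_TO_P2P_DEVICE	= 4,	/* invented */
		DMA_FROM_P2P_DEVICE	= 5,	/* invented */
	};

	static inline bool dma_dir_is_p2p(enum dma_data_direction dir)
	{
		return dir == DMA_TO_P2P_DEVICE || dir == DMA_FROM_P2P_DEVICE;
	}

As Logan notes in his reply, this only helps callers that already carry a direction argument.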
On 04/01/18 12:22 PM, Jason Gunthorpe wrote:
> This seems really clunky since we are going to want to do this same logic all over the place.
>
> I'd be much happier if dma_map_sg can tell the memory is P2P or not from the scatterlist or dir arguments and not require the callers to have this.

We tried things like this in an earlier iteration[1] which assumed the SG was homogeneous (all P2P or all regular memory). This required serious ugliness to try and ensure SGs were in fact homogeneous[2]. If you don't assume homogeneity, you need to change every DMA mapping provider or have a way to efficiently know whether the SGL contains any P2P pages.

Unfortunately, it's hard to do any of this without significant improvements to the SGL code (like [3]), but I understand there's significant push back against such changes at this time.

> For instance adding DMA_TO_P2P_DEVICE and DMA_FROM_P2P_DEVICE, or adding another bit to the page flags in scatterlist.

Creating new direction flags doesn't really solve the problem you point out. It would only really work in the rdma_rw_ctx_init() case because that function already takes the direction argument; it doesn't work in the similar block device case because the DMA direction isn't passed through the request. Though, if there's agreement to do something like that, I'm not against it.

> The signature of pci_p2pmem_map_sg also doesn't look very good, it should mirror the usual dma_map_sg, otherwise it needs more revision to extend this to do P2P through a root complex.

Generally, the feedback I get is to not make changes that try to anticipate the future. It would be easy enough to add those arguments when the code for going through the RC is written (which I expect will be very involved anyway).

Logan

[1] https://github.com/sbates130272/linux-p2pmem/commit/af3d3742669511c343c2c5bfbbfaccc325bee80a
[2] https://github.com/sbates130272/linux-p2pmem/commit/61ec04fa63c79dab14570ea0e97432d9325ad127
[3] https://github.com/sbates130272/linux-p2pmem/commit/d8635a4de3c674b577b95653d431fd61c2504ccd
On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
>> I'd be much happier if dma_map_sg can tell the memory is P2P or not from the scatterlist or dir arguments and not require the callers to have this.
>
> We tried things like this in an earlier iteration[1] which assumed the SG was homogeneous (all P2P or all regular memory). This required serious ugliness to try and ensure SGs were in fact homogeneous[2].

I'm confused: these patches already assume the SG is homogeneous, right? Sure looks that way. So [2] is just debugging??

What I meant is to rely on the fact that it is already homogeneous and create a function like this:

	static inline bool sg_is_table_p2p(struct scatterlist *sg)
	{
		return is_pci_p2p_page(sg_page(sg));
	}

And then insert

	if (sg_is_table_p2p(sg))
		return pci_p2pmem_map_sg(...);

at the top of dma_map_sg_attrs() (and similar for unmap).

Then we don't need to patch RDMA, because RDMA is not special when it comes to P2P. P2P should work with everything. This has been my objection to every P2P iteration so far, for years. RDMA is not special; P2P should not be patching RDMA's DMA path. This is an issue between the DMA and PCI subsystems, not for RDMA.

>> The signature of pci_p2pmem_map_sg also doesn't look very good, it should mirror the usual dma_map_sg, otherwise it needs more revision to extend this to do P2P through a root complex.
>
> Generally, the feedback that I get is to not make changes that try to anticipate the future. It would be easy enough to add those arguments when the code for going through the RC is written (which I expect will be very involved anyway).

Yes, but DMA mapping fundamentally requires the arguments to dma_map_sg, so it is hard to accept an API called dma_map without them. Maybe it's just a naming issue.

[2] https://github.com/sbates130272/linux-p2pmem/commit/61ec04fa63c79dab14570ea0e97432d9325ad127
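[To make the placement concrete: the hook Jason describes could sit in the generic wrapper roughly as below. This is a sketch only - it assumes the SGL is homogeneous, that is_pci_p2p_page() from this series exists, and that pci_p2pmem_map_sg() keeps its current two-argument form; the real dma_map_sg_attrs() of this era simply dispatches to ops->map_sg().]

	/* How dma_map_sg_attrs() might read with the proposed early-out.
	 * The SGL is homogeneous by convention, so the first page decides
	 * for the whole table: P2P pages never go through the arch/IOMMU
	 * ops, they just get PCI bus addresses. */
	static inline int dma_map_sg_attrs(struct device *dev,
			struct scatterlist *sg, int nents,
			enum dma_data_direction dir, unsigned long attrs)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		if (sg_is_table_p2p(sg))
			return pci_p2pmem_map_sg(sg, nents);

		return ops->map_sg(dev, sg, nents, dir, attrs);
	}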
On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
>> We tried things like this in an earlier iteration[1] which assumed the SG was homogeneous (all P2P or all regular memory). This required serious ugliness to try and ensure SGs were in fact homogeneous[2].
>
> I'm confused: these patches already assume the SG is homogeneous, right? Sure looks that way. So [2] is just debugging??

Yes, but it's a bit different to expect that someone calling pci_p2pmem_map_sg() will know what they're doing and provide a homogeneous SG. It is relatively clear by convention that the entire SG must be homogeneous given they're calling a pci_p2pmem function. Whereas allowing P2P SGs into the core DMA code means all we can do is hope that future developers don't screw it up and allow P2P pages to mix in with regular pages.

It's also very difficult to add similar functionality to dma_map_page, seeing as dma_unmap_page won't have any way to know what it's dealing with. It just seems confusing to support P2P in the SG version and not the page version.

> What I meant is to rely on the fact that it is already homogeneous and create a function like this:
>
> 	static inline bool sg_is_table_p2p(struct scatterlist *sg)
> 	{
> 		return is_pci_p2p_page(sg_page(sg));
> 	}
>
> And then insert
>
> 	if (sg_is_table_p2p(sg))
> 		return pci_p2pmem_map_sg(...);

Yes, this is almost identical to the earlier patch I referenced...

> Then we don't need to patch RDMA, because RDMA is not special when it comes to P2P. P2P should work with everything.

Yes, I agree this would be very nice. But it's challenging at this time to do it safely, and something like this can still be done in future work after this patch set gets merged. The changes in this set to the RDMA and block layers are fairly small, and the majority of the block layer changes would still be required anyway to ensure a given queue/driver supports P2P requests.

>>> The signature of pci_p2pmem_map_sg also doesn't look very good, it should mirror the usual dma_map_sg, otherwise it needs more revision to extend this to do P2P through a root complex.
>>
>> Generally, the feedback that I get is to not make changes that try to anticipate the future. It would be easy enough to add those arguments when the code for going through the RC is written (which I expect will be very involved anyway).
>
> Yes, but DMA mapping fundamentally requires the arguments to dma_map_sg, so it is hard to accept an API called dma_map without them. Maybe it's just a naming issue.

I don't agree that this is a fundamental requirement, but if no one else objects to adding unused arguments, I'd be fine with adding them.

Logan
On Thu, Jan 04, 2018 at 04:44:00PM -0700, Logan Gunthorpe wrote:
> On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
>> On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
>>> We tried things like this in an earlier iteration[1] which assumed the SG was homogeneous (all P2P or all regular memory). This required serious ugliness to try and ensure SGs were in fact homogeneous[2].
>>
>> I'm confused: these patches already assume the SG is homogeneous, right? Sure looks that way. So [2] is just debugging??
>
> Yes, but it's a bit different to expect that someone calling pci_p2pmem_map_sg() will know what they're doing and provide a homogeneous SG. It is relatively clear by convention that the entire SG must be homogeneous given they're calling a pci_p2pmem function. Whereas allowing P2P SGs into the core DMA code means all we can do is hope that future developers don't screw it up and allow P2P pages to mix in with regular pages.

Well, that argument applies equally to the RDMA RW API wrappers around the DMA API. I think it is fine if SGLs are defined to only have P2P or not, and that debugging support seemed reasonable to me.

> It's also very difficult to add similar functionality to dma_map_page, seeing as dma_unmap_page won't have any way to know what it's dealing with. It just seems confusing to support P2P in the SG version and not the page version.

Well, this proposal is to support P2P in only some RDMA APIs and not others, so it seems about as confusing to me.

>> Then we don't need to patch RDMA, because RDMA is not special when it comes to P2P. P2P should work with everything.
>
> Yes, I agree this would be very nice.

Well, it is more than very nice. We have to keep RDMA working after all, and if you make it even more special, things become harder for us.

It is already the case that DMA in RDMA is very strange. We have drivers that provide their own DMA ops, for instance.

And on that topic, does this scheme work with HFI? On first glance, it looks like no. The PCI device the HFI device is attached to may be able to do P2P, so it should be able to trigger the support. However, substituting the p2p_dma_map for the real device op dma_map will cause a kernel crash when working with HFI. HFI uses custom DMA ops that return CPU addresses in the dma_addr_t, which the driver handles in various special ways. One cannot just replace them with PCI bus addresses.

So, this kinda looks to me like it causes bad breakage for some RDMA drivers??

This is why P2P must fit into the common DMA framework somehow; we rely on these abstractions to work properly and fully in RDMA.

I think you should consider pushing this directly into the dma_ops implementations. Add a p2p_supported flag to struct dma_map_ops, and only if it is true can a caller pass a homogeneous SGL to ops->map_sg. Only map_sg would be supported for P2P. Upgraded implementations can call the helper function.

Jason
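[A sketch of what Jason's suggestion could look like. p2p_supported is his proposed name; no such field exists in struct dma_map_ops, and the gating helper shown is only one plausible placement.]

	/* Proposed shape: the ops advertise whether their map_sg can
	 * take a homogeneous P2P SGL. Providers of fake DMA ops (the
	 * dma_virt_ops used by hfi1/qib/rxe) would leave it false. */
	struct dma_map_ops {
		int (*map_sg)(struct device *dev, struct scatterlist *sg,
			      int nents, enum dma_data_direction dir,
			      unsigned long attrs);
		/* ... the other existing ops elided ... */
		bool p2p_supported;	/* invented field */
	};

	/* A caller (or the dma_map_sg wrapper) would then gate P2P SGLs: */
	static int check_p2p_capable(struct device *dev, struct scatterlist *sg)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		if (sg_is_table_p2p(sg) && !ops->p2p_supported)
			return -EOPNOTSUPP;
		return 0;
	}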
On Thu, Jan 04, 2018 at 09:50:31PM -0700, Jason Gunthorpe wrote:
> Well, that argument applies equally to the RDMA RW API wrappers around the DMA API. I think it is fine if SGLs are defined to only have P2P or not, and that debugging support seemed reasonable to me.
>
>> It's also very difficult to add similar functionality to dma_map_page, seeing as dma_unmap_page won't have any way to know what it's dealing with. It just seems confusing to support P2P in the SG version and not the page version.
>
> Well, this proposal is to support P2P in only some RDMA APIs and not others, so it seems about as confusing to me.

As usual, we implement what actually has a consumer. On top of that, the R/W API is the only core RDMA API that actually does DMA mapping for the ULP at the moment.

For SENDs and everything else, DMA maps are done by the ULP (I'd like to eventually change that, though - e.g. sends that are inline to the workqueue don't need a DMA map to start with).

> Well, it is more than very nice. We have to keep RDMA working after all, and if you make it even more special, things become harder for us.
>
> It is already the case that DMA in RDMA is very strange. We have drivers that provide their own DMA ops, for instance.

That's because the initial design was to let the ULPs do the DMA mappings, which fundamentally is wrong. I've fixed it for the R/W API when adding it, but no one has started work on SENDs and atomics.

> And on that topic, does this scheme work with HFI?

No, and I guess we need an opt-out. HFI generally seems to be extremely weird.

> This is why P2P must fit into the common DMA framework somehow; we rely on these abstractions to work properly and fully in RDMA.

Moving P2P up to common RDMA code isn't going to fix this. For that we need to stop pretending that something that isn't DMA can abuse the DMA mapping framework, and until then opt them out of behavior that assumes actual DMA, like P2P.

> I think you should consider pushing this directly into the dma_ops implementations. Add a p2p_supported flag to struct dma_map_ops, and only if it is true can a caller pass a homogeneous SGL to ops->map_sg. Only map_sg would be supported for P2P. Upgraded implementations can call the helper function.

If at all, it should be in the dma_map* wrappers, but for that we'd need a good identifier. And it still would not solve the whole fake DMA ops issue. So for now I'd much prefer to let the drivers handle it, and once usage grows and we know usage patterns better (and I make progress refactoring both the dma-mapping subsystem itself and the RDMA DMA map code to not be a complete mess), we can move it to the core.
On Mon, Jan 08, 2018 at 03:59:01PM +0100, Christoph Hellwig wrote:
> On Thu, Jan 04, 2018 at 09:50:31PM -0700, Jason Gunthorpe wrote:
>> Well, that argument applies equally to the RDMA RW API wrappers around the DMA API. I think it is fine if SGLs are defined to only have P2P or not, and that debugging support seemed reasonable to me.
>>
>>> It's also very difficult to add similar functionality to dma_map_page, seeing as dma_unmap_page won't have any way to know what it's dealing with. It just seems confusing to support P2P in the SG version and not the page version.
>>
>> Well, this proposal is to support P2P in only some RDMA APIs and not others, so it seems about as confusing to me.
>
> As usual, we implement what actually has a consumer. On top of that, the R/W API is the only core RDMA API that actually does DMA mapping for the ULP at the moment.

Well, again, the same can be said for dma_map_page vs dma_map_sg...

> For SENDs and everything else, DMA maps are done by the ULP (I'd like to eventually change that, though - e.g. sends that are inline to the workqueue don't need a DMA map to start with).
>
> That's because the initial design was to let the ULPs do the DMA mappings, which fundamentally is wrong. I've fixed it for the R/W API when adding it, but no one has started work on SENDs and atomics.

Well, you know why it is like this, and it is very complicated to unwind - the HW driver does not have enough information during CQ processing to properly do any unmaps, let alone serious error tear-down unmaps, so we'd need a bunch of new APIs developed first, like RW did. :\

>> And on that topic, does this scheme work with HFI?
>
> No, and I guess we need an opt-out. HFI generally seems to be extremely weird.

This series needs some kind of fix so HFI, QIB, rxe, etc. don't get broken, and it shouldn't be 'fixed' at the RDMA level.

>> This is why P2P must fit into the common DMA framework somehow; we rely on these abstractions to work properly and fully in RDMA.
>
> Moving P2P up to common RDMA code isn't going to fix this. For that we need to stop pretending that something that isn't DMA can abuse the DMA mapping framework, and until then opt them out of behavior that assumes actual DMA, like P2P.

It could, if we had a DMA op for P2P: then the drivers that provide their own ops can implement it appropriately or not at all.

E.g. the correct implementation for rxe to support P2P memory is probably somewhat straightforward.

>> I think you should consider pushing this directly into the dma_ops implementations. Add a p2p_supported flag to struct dma_map_ops, and only if it is true can a caller pass a homogeneous SGL to ops->map_sg. Only map_sg would be supported for P2P. Upgraded implementations can call the helper function.
>
> If at all, it should be in the dma_map* wrappers, but for that we'd need a good identifier. And it still would not solve the whole fake DMA ops issue.

Very long term, the IOMMUs under the ops will need to care about this, so the wrapper is not an optimal place to put it - but I wouldn't object if it gets it out of RDMA :)

Jason
On 08/01/18 11:09 AM, Jason Gunthorpe wrote:
> It could, if we had a DMA op for P2P: then the drivers that provide their own ops can implement it appropriately or not at all.

I was thinking of doing something like this. I'll probably rough out a patch and send it along today or tomorrow.

>> If at all, it should be in the dma_map* wrappers, but for that we'd need a good identifier. And it still would not solve the whole fake DMA ops issue.
>
> Very long term, the IOMMUs under the ops will need to care about this, so the wrapper is not an optimal place to put it - but I wouldn't object if it gets it out of RDMA :)

Well, creating the extra op doesn't really change anything in the RDMA patch in this series. The point is, for the time being, we need to track whether we are doing a P2P or normal mapping using a side channel, as we don't have a good way of tracking it in the SGL at this time. Adding an extra DMA op is just one way to allow the existing DMA providers to opt in or out.

Logan
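[The "extra DMA op" Logan describes could take roughly this shape; the map_sg_p2p op and the wrapper are invented for illustration, and providers with fake DMA ops such as dma_virt_ops would simply leave the op NULL and thereby opt out.]

	/* Hypothetical wrapper around a new ops->map_sg_p2p hook. A NULL
	 * hook means the provider cannot handle P2P SGLs, which is
	 * exactly the opt-out hfi1/qib need. Returning 0 is treated as
	 * a mapping failure by callers, mirroring dma_map_sg. */
	int dma_map_sg_p2p(struct device *dev, struct scatterlist *sg,
			   int nents, enum dma_data_direction dir)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		if (!ops->map_sg_p2p)
			return 0;

		return ops->map_sg_p2p(dev, sg, nents, dir, 0);
	}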
On Mon, Jan 08, 2018 at 11:17:38AM -0700, Logan Gunthorpe wrote:
>>> If at all, it should be in the dma_map* wrappers, but for that we'd need a good identifier. And it still would not solve the whole fake DMA ops issue.
>>
>> Very long term, the IOMMUs under the ops will need to care about this, so the wrapper is not an optimal place to put it - but I wouldn't object if it gets it out of RDMA :)
>
> Well, creating the extra op doesn't really change anything in the RDMA patch in this series.

Not fundamentally, but it lets us solve the bugs the patch introduces with hfi etc.

> The point is, for the time being, we need to track whether we are doing a P2P or normal mapping using a side channel, as we don't have a good way of tracking it in the SGL at this time.

Well, that is disappointing for now, but I'm OK with the flag in the RW interface - as long as we all sort of agree it is not desirable, and that the SG should self-identify in an ideal someday future world.

Jason
On Mon, Jan 08, 2018 at 11:09:17AM -0700, Jason Gunthorpe wrote:
>> As usual, we implement what actually has a consumer. On top of that, the R/W API is the only core RDMA API that actually does DMA mapping for the ULP at the moment.
>
> Well, again, the same can be said for dma_map_page vs dma_map_sg...

I don't understand this comment.

>> For SENDs and everything else, DMA maps are done by the ULP (I'd like to eventually change that, though - e.g. sends that are inline to the workqueue don't need a DMA map to start with).
>>
>> That's because the initial design was to let the ULPs do the DMA mappings, which fundamentally is wrong. I've fixed it for the R/W API when adding it, but no one has started work on SENDs and atomics.
>
> Well, you know why it is like this, and it is very complicated to unwind - the HW driver does not have enough information during CQ processing to properly do any unmaps, let alone serious error tear-down unmaps, so we'd need a bunch of new APIs developed first, like RW did. :\

Yes, if it was trivial we would have done it already.

>>> And on that topic, does this scheme work with HFI?
>>
>> No, and I guess we need an opt-out. HFI generally seems to be extremely weird.
>
> This series needs some kind of fix so HFI, QIB, rxe, etc. don't get broken, and it shouldn't be 'fixed' at the RDMA level.

I don't think rxe is a problem, as it won't show up as a PCI device.

HFI and QIB do show up as PCI devices, and could be used for P2P transfers from the PCI point of view. It's just that they have a layer of software indirection between their hardware and what is exposed at the RDMA layer.

So I very much disagree about where to place that workaround - the RDMA code is exactly the right place.

>>> This is why P2P must fit into the common DMA framework somehow; we rely on these abstractions to work properly and fully in RDMA.
>>
>> Moving P2P up to common RDMA code isn't going to fix this. For that we need to stop pretending that something that isn't DMA can abuse the DMA mapping framework, and until then opt them out of behavior that assumes actual DMA, like P2P.
>
> It could, if we had a DMA op for P2P: then the drivers that provide their own ops can implement it appropriately or not at all.
>
> E.g. the correct implementation for rxe to support P2P memory is probably somewhat straightforward.

But P2P is _not_ a factor of the dma_ops implementation at all; it is something that happens behind the dma_map implementation.

Think about what the DMA mapping routines do:

 (a) translate from host addresses to bus addresses

and

 (b) flush caches (in non-coherent architectures)

Both are obviously not needed for P2P transfers, as they never reach the host.

> Very long term, the IOMMUs under the ops will need to care about this, so the wrapper is not an optimal place to put it - but I wouldn't object if it gets it out of RDMA :)

Unless you have an IOMMU on your PCIe switch, and not before/inside the root complex, that is not correct.
On 08/01/18 11:34 AM, Christoph Hellwig wrote:
> But P2P is _not_ a factor of the dma_ops implementation at all; it is something that happens behind the dma_map implementation.
>
> Think about what the DMA mapping routines do:
>
>  (a) translate from host addresses to bus addresses
>
> and
>
>  (b) flush caches (in non-coherent architectures)
>
> Both are obviously not needed for P2P transfers, as they never reach the host.

Isn't pci_p2pdma_map_sg doing (a)? It's just translating from a host address to a PCI bus address.

>> Very long term, the IOMMUs under the ops will need to care about this, so the wrapper is not an optimal place to put it - but I wouldn't object if it gets it out of RDMA :)
>
> Unless you have an IOMMU on your PCIe switch, and not before/inside the root complex, that is not correct.

Per the ACS discussion, in the future we might want to implement "ACS Direct Translated P2P" as Alex described. I expect it would be the IOMMU that needs to set that up. So, I think, we also have the dma_map implementations doing something like:

 (c) setup/manage any security permissions on mappings

which P2P may at some point be concerned with.

Logan
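[For reference, the translation Logan refers to is essentially constant-offset arithmetic rather than a full DMA mapping. A minimal sketch, with the bus offset passed in explicitly because the way this series actually records it is an internal detail:]

	/* Sketch: "map" a homogeneous P2P SGL by adding the offset
	 * between the BAR's physical address and its PCI bus address.
	 * No IOMMU is programmed and no caches are flushed - the
	 * traffic never reaches the host, which is Christoph's point
	 * about (a) and (b) above. */
	static int p2pmem_map_sg_sketch(struct scatterlist *sgl, int nents,
					u64 bus_offset)
	{
		struct scatterlist *sg;
		int i;

		for_each_sg(sgl, sg, nents, i) {
			sg->dma_address = sg_phys(sg) + bus_offset;
			sg_dma_len(sg) = sg->length;
		}
		return nents;
	}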
On Mon, Jan 08, 2018 at 11:44:41AM -0700, Logan Gunthorpe wrote:
>> Think about what the DMA mapping routines do:
>>
>>  (a) translate from host addresses to bus addresses
>>
>> and
>>
>>  (b) flush caches (in non-coherent architectures)
>>
>> Both are obviously not needed for P2P transfers, as they never reach the host.
>
> Isn't pci_p2pdma_map_sg doing (a)? It's just translating from a host address to a PCI bus address.

It does, sort of - but in a different way than the normal DMA map ops. And only to work around the fact that we need to map our P2P space into struct pages. Without that we could just pass the bus address around, but the Linux stack and VM aren't anywhere near ready for something that advanced.

>>> Very long term, the IOMMUs under the ops will need to care about this, so the wrapper is not an optimal place to put it - but I wouldn't object if it gets it out of RDMA :)
>>
>> Unless you have an IOMMU on your PCIe switch, and not before/inside the root complex, that is not correct.
>
> Per the ACS discussion, in the future we might want to implement "ACS Direct Translated P2P" as Alex described. I expect it would be the IOMMU that needs to set that up. So, I think, we also have the dma_map implementations doing something like:
>
>  (c) setup/manage any security permissions on mappings
>
> which P2P may at some point be concerned with.

Maybe once root complexes with IOMMUs actually support P2P. But until then we have a lot more important problems to solve.
On Mon, Jan 08, 2018 at 07:34:34PM +0100, Christoph Hellwig wrote:
>>>> And on that topic, does this scheme work with HFI?
>>>
>>> No, and I guess we need an opt-out. HFI generally seems to be extremely weird.
>>
>> This series needs some kind of fix so HFI, QIB, rxe, etc. don't get broken, and it shouldn't be 'fixed' at the RDMA level.
>
> I don't think rxe is a problem, as it won't show up as a PCI device.

Right, today's restrictions save us.

> HFI and QIB do show up as PCI devices, and could be used for P2P transfers from the PCI point of view. It's just that they have a layer of software indirection between their hardware and what is exposed at the RDMA layer.
>
> So I very much disagree about where to place that workaround - the RDMA code is exactly the right place.

But why? RDMA is using core code to do this. It uses dma_ops in struct device and it uses the normal dma_map_sg. How is it RDMA's problem that some PCI drivers provide strange DMA ops?

Admittedly they are RDMA drivers, but it is a core mechanism they (ab)use these days.

>> It could, if we had a DMA op for P2P: then the drivers that provide their own ops can implement it appropriately or not at all.
>>
>> E.g. the correct implementation for rxe to support P2P memory is probably somewhat straightforward.
>
> But P2P is _not_ a factor of the dma_ops implementation at all; it is something that happens behind the dma_map implementation.

Only as long as the !ACS and switch limitations are present. Those limitations are fine to get things started, but there is going to be a push to improve the system to remove them.

>> Very long term, the IOMMUs under the ops will need to care about this, so the wrapper is not an optimal place to put it - but I wouldn't object if it gets it out of RDMA :)
>
> Unless you have an IOMMU on your PCIe switch, and not before/inside the root complex, that is not correct.

I understand the proposed patches restrict things to require a switch and not transit the IOMMU. But *very long term* P2P will need to work with paths that transit the system IOMMU and root complex.

This already exists as out-of-tree functionality that has been deployed in production for years and years; it does P2P through the root complex with the IOMMU turned off.

Jason
On 08/01/18 11:57 AM, Christoph Hellwig wrote:
> It does, sort of - but in a different way than the normal DMA map ops. And only to work around the fact that we need to map our P2P space into struct pages. Without that we could just pass the bus address around, but the Linux stack and VM aren't anywhere near ready for something that advanced.

Yeah, wouldn't that be nice.

Ok, so if we shouldn't touch the dma_map infrastructure, how should the workaround to opt out HFI and QIB look? I'm not that familiar with the RDMA code.

Thanks,

Logan
On Mon, Jan 8, 2018 at 11:57 AM, Christoph Hellwig <hch@lst.de> wrote:
>>  (c) setup/manage any security permissions on mappings
>>
>> which P2P may at some point be concerned with.
>
> Maybe once root complexes with IOMMUs actually support P2P. But until then we have a lot more important problems to solve.

Pretty sure P2P capable IOMMU hardware exists.

With SOCs we also have the scenario that a DMA originated from an on-die device wishes to target an off-die PCI BAR (through the IOMMU); that definitely exists today, and people care about it :)

Jason
On Mon, Jan 08, 2018 at 12:49:50PM -0700, Jason Gunthorpe wrote:
> Pretty sure P2P capable IOMMU hardware exists.
>
> With SOCs we also have the scenario that a DMA originated from an on-die device wishes to target an off-die PCI BAR (through the IOMMU); that definitely exists today, and people care about it :)

Then people will have to help and contribute support for it.
On Mon, Jan 08, 2018 at 12:05:57PM -0700, Logan Gunthorpe wrote:
> Ok, so if we shouldn't touch the dma_map infrastructure, how should the workaround to opt out HFI and QIB look? I'm not that familiar with the RDMA code.

We can add a no_p2p quirk; I'm just not sure what the right place for it is. As said, these devices don't really mind P2P at the PCIe level - it's just that their RDMA implementation is really strange. So I'd be tempted to do it in RDMA.
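[One minimal shape such an RDMA-level opt-out could take. The flag name and bit are invented; nothing in this series defines them. The RW API would simply refuse P2P contexts for devices that set the quirk:]

	/* Hypothetical per-device quirk: hfi1/qib return CPU addresses
	 * from their fake DMA ops, so a PCI bus address must never
	 * reach them. The bit value here is an arbitrary placeholder. */
	#define IB_DEVICE_QUIRK_NO_P2P	(1ULL << 63)	/* invented */

	static int rdma_rw_check_p2p(struct ib_qp *qp, unsigned int flags)
	{
		u64 caps = qp->pd->device->attrs.device_cap_flags;

		if ((flags & RDMA_RW_CTX_FLAG_PCI_P2P) &&
		    (caps & IB_DEVICE_QUIRK_NO_P2P))
			return -EOPNOTSUPP;

		return 0;
	}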
On Mon, Jan 08, 2018 at 12:01:16PM -0700, Jason Gunthorpe wrote:
>> So I very much disagree about where to place that workaround - the RDMA code is exactly the right place.
>
> But why? RDMA is using core code to do this. It uses dma_ops in struct device and it uses the normal dma_map_sg. How is it RDMA's problem that some PCI drivers provide strange DMA ops?

Because RDMA uses dma_virt_ops to pretend a device does DMA when in fact it doesn't - at least not for the exact data mapped, or, as far as I can tell, often not at all; e.g. the QIB/HFI devices might do MMIO access for data mapped. This whole problem only exists because a few RDMA HCA drivers lie with the help of the RDMA core.
On Tue, Jan 09, 2018 at 05:46:40PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 08, 2018 at 12:49:50PM -0700, Jason Gunthorpe wrote:
>> Pretty sure P2P capable IOMMU hardware exists.
>>
>> With SOCs we also have the scenario that a DMA originated from an on-die device wishes to target an off-die PCI BAR (through the IOMMU); that definitely exists today, and people care about it :)
>
> Then people will have to help and contribute support for it.

Sure. But my point was that all of this will have to migrate under the dma_ops for that to work, so why the resistance to using dma_ops right away?

Jason
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..7956484da082 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
  */
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
+#include <linux/pci-p2p.h>
 #include <rdma/mr_pool.h>
 #include <rdma/rw.h>
@@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
  * @remote_addr:remote address to read/write (relative to @rkey)
  * @rkey:	remote key to operate on
  * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags:	any of the RDMA_RW_CTX_FLAG_* flags
  *
  * Returns the number of WQEs that will be needed on the workqueue if
  * successful, or a negative error code.
  */
 int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
-		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+		unsigned int flags)
 {
 	struct ib_device *dev = qp->pd->device;
 	int ret;
 
-	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+	if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
+		ret = pci_p2pmem_map_sg(sg, sg_cnt);
+	else
+		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
 	if (!ret)
 		return -ENOMEM;
 	sg_cnt = ret;
@@ -499,7 +506,7 @@ struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		rdma_rw_update_lkey(&ctx->sig->data, true);
 		if (ctx->sig->prot.mr)
 			rdma_rw_update_lkey(&ctx->sig->prot, true);
-
+
 		ctx->sig->sig_mr->need_inval = true;
 		ib_update_fast_reg_key(ctx->sig->sig_mr,
 			ib_inc_rkey(ctx->sig->sig_mr->lkey));
@@ -579,9 +586,11 @@ EXPORT_SYMBOL(rdma_rw_ctx_post);
  * @sg: scatterlist that was used for the READ/WRITE
  * @sg_cnt: number of entries in @sg
  * @dir: %DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags: the same flags used to init the context
  */
 void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
-		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir,
+		unsigned int flags)
 {
 	int i;
 
@@ -602,7 +611,10 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		break;
 	}
 
-	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+	if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
+		pci_p2pmem_unmap_sg(sg, sg_cnt);
+	else
+		ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 720dfb3a1ac2..a076da2ead16 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1496,7 +1496,8 @@ isert_rdma_rw_ctx_destroy(struct isert_cmd *cmd, struct isert_conn *conn)
 				se_cmd->t_prot_nents, dir);
 	} else {
 		rdma_rw_ctx_destroy(&cmd->rw, conn->qp, conn->cm_id->port_num,
-				se_cmd->t_data_sg, se_cmd->t_data_nents, dir);
+				se_cmd->t_data_sg, se_cmd->t_data_nents,
+				dir, 0);
 	}
 
 	cmd->rw.nr_ops = 0;
@@ -2148,7 +2149,7 @@ isert_rdma_rw_ctx_post(struct isert_cmd *cmd, struct isert_conn *conn,
 	} else {
 		ret = rdma_rw_ctx_init(&cmd->rw, conn->qp, port_num,
 				se_cmd->t_data_sg, se_cmd->t_data_nents,
-				offset, addr, rkey, dir);
+				offset, addr, rkey, dir, 0);
 	}
 	if (ret < 0) {
 		isert_err("Cmd: %p failed to prepare RDMA res\n", cmd);
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 8a1bd354b1cc..c5371ab2e47d 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -854,7 +854,8 @@ static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
 			goto unwind;
 
 		ret = rdma_rw_ctx_init(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, 0, remote_addr, rkey, dir);
+				ctx->sg, ctx->nents, 0, remote_addr, rkey,
+				dir, 0);
 		if (ret < 0) {
 			target_free_sgl(ctx->sg, ctx->nents);
 			goto unwind;
@@ -883,7 +884,7 @@ static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
 		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
 
 		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, dir);
+				ctx->sg, ctx->nents, dir, 0);
 		target_free_sgl(ctx->sg, ctx->nents);
 	}
 	if (ioctx->rw_ctxs != &ioctx->s_rw_ctx)
@@ -901,7 +902,7 @@ static void srpt_free_rw_ctxs(struct srpt_rdma_ch *ch,
 		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
 
 		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, dir);
+				ctx->sg, ctx->nents, dir, 0);
 		target_free_sgl(ctx->sg, ctx->nents);
 	}
 
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 49912909c298..d4d0662ab071 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -480,7 +480,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 	if (rsp->n_rdma) {
 		rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
 				queue->cm_id->port_num, rsp->req.sg,
-				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
 	}
 
 	if (rsp->req.sg != &rsp->cmd->inline_sg)
@@ -563,7 +563,7 @@ static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
 	rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
 			queue->cm_id->port_num, rsp->req.sg,
-			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
 	rsp->n_rdma = 0;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -634,7 +634,7 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 
 	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
 			rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
-			nvmet_data_dir(&rsp->req));
+			nvmet_data_dir(&rsp->req), 0);
 	if (ret < 0)
 		return NVME_SC_INTERNAL;
 	rsp->req.transfer_len += len;
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index a3cbbc7b6417..ba8050434667 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -59,12 +59,15 @@ struct rdma_rw_ctx {
 	};
 };
 
+#define RDMA_RW_CTX_FLAG_PCI_P2P	(1 << 0)
+
 int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
-		u64 remote_addr, u32 rkey, enum dma_data_direction dir);
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+		unsigned int flags);
 void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt,
-		enum dma_data_direction dir);
+		enum dma_data_direction dir, unsigned int flags);
 
 int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		u8 port_num, struct scatterlist *sg, u32 sg_cnt,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 9bd04549a1ad..5f46c35e6707 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -140,7 +140,7 @@ static void svc_rdma_cc_release(struct svc_rdma_chunk_ctxt *cc,
 		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
 				    rdma->sc_port_num, ctxt->rw_sg_table.sgl,
-				    ctxt->rw_nents, dir);
+				    ctxt->rw_nents, dir, 0);
 		svc_rdma_put_rw_ctxt(rdma, ctxt);
 	}
 	svc_xprt_put(&rdma->sc_xprt);
@@ -433,7 +433,7 @@ svc_rdma_build_writes(struct svc_rdma_write_info *info,
 		ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp,
 				       rdma->sc_port_num, ctxt->rw_sg_table.sgl,
 				       ctxt->rw_nents, 0, seg_offset,
-				       seg_handle, DMA_TO_DEVICE);
+				       seg_handle, DMA_TO_DEVICE, 0);
 		if (ret < 0)
 			goto out_initerr;
@@ -639,7 +639,7 @@ static int svc_rdma_build_read_segment(struct svc_rdma_read_info *info,
 	ret = rdma_rw_ctx_init(&ctxt->rw_ctx, cc->cc_rdma->sc_qp,
 			       cc->cc_rdma->sc_port_num,
 			       ctxt->rw_sg_table.sgl, ctxt->rw_nents,
-			       0, offset, rkey, DMA_FROM_DEVICE);
+			       0, offset, rkey, DMA_FROM_DEVICE, 0);
 	if (ret < 0)
 		goto out_initerr;
In order to use PCI P2P memory, the pci_p2pmem_[un]map_sg() functions must be called to map the correct DMA address. To do this, we add a flags variable and the RDMA_RW_CTX_FLAG_PCI_P2P flag. When the flag is specified, the appropriate map function is used.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/infiniband/core/rw.c            | 22 +++++++++++++++++-----
 drivers/infiniband/ulp/isert/ib_isert.c |  5 +++--
 drivers/infiniband/ulp/srpt/ib_srpt.c   |  7 ++++---
 drivers/nvme/target/rdma.c              |  6 +++---
 include/rdma/rw.h                       |  7 +++++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c       |  6 +++---
 6 files changed, 35 insertions(+), 18 deletions(-)
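[For context, a ULP that had allocated its buffers from p2pmem would opt in by passing the new flag. A sketch modelled on the nvmet-rdma hunk above - the surrounding variables are as in nvmet_rdma_map_sgl_keyed(), and whether nvmet actually sets the flag is decided elsewhere in the series:]

	/* Same call as in nvmet_rdma_map_sgl_keyed() above, but for an
	 * SGL known to be entirely PCI P2P memory. The flag is the only
	 * difference: with it set, rdma_rw_ctx_init() calls
	 * pci_p2pmem_map_sg() instead of ib_dma_map_sg(). */
	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
			       rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
			       nvmet_data_dir(&rsp->req),
			       RDMA_RW_CTX_FLAG_PCI_P2P);
	if (ret < 0)
		return NVME_SC_INTERNAL;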