diff mbox

[06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

Message ID 20180104190137.7654-7-logang@deltatee.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Logan Gunthorpe Jan. 4, 2018, 7:01 p.m. UTC
In order to use PCI P2P memory pci_p2pmem_[un]map_sg() functions must be
called to map the correct DMA address. To do this, we add a flags
variable and the RDMA_RW_CTX_FLAG_PCI_P2P flag. When the flag is
specified use the appropriate map function.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/infiniband/core/rw.c            | 22 +++++++++++++++++-----
 drivers/infiniband/ulp/isert/ib_isert.c |  5 +++--
 drivers/infiniband/ulp/srpt/ib_srpt.c   |  7 ++++---
 drivers/nvme/target/rdma.c              |  6 +++---
 include/rdma/rw.h                       |  7 +++++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c       |  6 +++---
 6 files changed, 35 insertions(+), 18 deletions(-)

Comments

Jason Gunthorpe Jan. 4, 2018, 7:22 p.m. UTC | #1
On Thu, Jan 04, 2018 at 12:01:31PM -0700, Logan Gunthorpe wrote:
>  
> @@ -269,18 +270,24 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
>   * @remote_addr:remote address to read/write (relative to @rkey)
>   * @rkey:	remote key to operate on
>   * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
> + * @flags:      any of the RDMA_RW_CTX_FLAG_* flags
>   *
>   * Returns the number of WQEs that will be needed on the workqueue if
>   * successful, or a negative error code.
>   */
>  int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
>  		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
> -		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
> +		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
> +		unsigned int flags)
>  {
>  	struct ib_device *dev = qp->pd->device;
>  	int ret;
>  
> -	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
> +	if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
> +		ret = pci_p2pmem_map_sg(sg, sg_cnt);
> +	else
> +		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);

This seems really clunky since we are going to want to do this same
logic all over the place.

I'd be much happier if dma_map_sg can tell the memory is P2P or not
from the scatterlist or dir arguments and not require the callers to
have this.

For instance adding DMA_TO_P2P_DEVICE and DMA_FROM_P2P_DEVICE, or
adding another bit to the page flags in scatterlist.

The signature of pci_p2pmem_map_sg also doesn't look very good, it
should mirror the usual dma_map_sg, otherwise it needs more revision
to extend this to do P2P through a root complex.

Jason
Logan Gunthorpe Jan. 4, 2018, 7:52 p.m. UTC | #2
On 04/01/18 12:22 PM, Jason Gunthorpe wrote:
> This seems really clunky since we are going to want to do this same
> logic all over the place.
> 
> I'd be much happier if dma_map_sg can tell the memory is P2P or not
> from the scatterlist or dir arguments and not require the callers to
> have this.

We tried things like this in an earlier iteration[1] which assumed the 
SG was homogenous (all P2P or all regular memory). This required serious 
ugliness to try and ensure SGs were in fact homogenous[2]. If you don't 
assume homogenousness you need change every DMA mapping provider or have 
a way to efficiently know if the SGL contains any P2P pages.

Unfortunately, it's hard to do any of this without significant 
improvements to the SGL code (like [3]) but I understand there's 
significant push back against such changes at this time.

> For instance adding DMA_TO_P2P_DEVICE and DMA_FROM_P2P_DEVICE, or
> adding another bit to the page flags in scatterlist.

Creating new direction flags doesn't really solve the problem you point 
out. It'd only really work in the rdma_rw_ctx_init() case because that 
function already takes the direction argument. It doesn't work in the 
similar block device case because the DMA direction isn't passed through 
the request.

Though, if there's agreement to do something like that I'm not against it.

> The signature of pci_p2pmem_map_sg also doesn't look very good, it
> should mirror the usual dma_map_sg, otherwise it needs more revision
> to extend this to do P2P through a root complex.

Generally, the feedback that I get is to not make changes that try to 
anticipate the future. It would be easy enough to add those arguments 
when the code for going through the RC is made (which I expect will be 
very involved anyway).

Logan

[1] 
https://github.com/sbates130272/linux-p2pmem/commit/af3d3742669511c343c2c5bfbbfaccc325bee80a
[2] 
https://github.com/sbates130272/linux-p2pmem/commit/61ec04fa63c79dab14570ea0e97432d9325ad127
[3] 
https://github.com/sbates130272/linux-p2pmem/commit/d8635a4de3c674b577b95653d431fd61c2504ccd
Jason Gunthorpe Jan. 4, 2018, 10:13 p.m. UTC | #3
On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >I'd be much happier if dma_map_sg can tell the memory is P2P or not
> >from the scatterlist or dir arguments and not require the callers to
> >have this.
> 
> We tried things like this in an earlier iteration[1] which assumed the SG
> was homogenous (all P2P or all regular memory). This required serious
> ugliness to try and ensure SGs were in fact homogenous[2].

I'm confused, these patches already assume the sg is homogenous,
right? Sure looks that way. So [2] is just debugging??

What I meant is to rely on the fact it is already homogenous and
create a function like this:

static inline bool sg_is_table_p2p(struct scatterlist *sg)
{
    return is_pci_p2p_page(sg_page(sg));
}

And then insert

  if (sg_is_table_p2p(sg))
       return pci_p2pmem_map_sg(...);

At the top of dma_map_sg_attrs() (and similar for unmap)

Then we don't need to patch RDMA because RDMA is not special when it
comes to P2P. P2P should work with everything.

This has been my objection to every P2P iteration so far, for
years. RDMA is not special, P2P should not be patching RDMA's DMA
path. This an issue between the DMA and PCI subsystems, not for RDMA.

> >The signature of pci_p2pmem_map_sg also doesn't look very good, it
> >should mirror the usual dma_map_sg, otherwise it needs more revision
> >to extend this to do P2P through a root complex.
> 
> Generally, the feedback that I get is to not make changes that try to
> anticipate the future. It would be easy enough to add those arguments when
> the code for going through the RC is made (which I expect will be very
> involved anyway).

Yes, but DMA mapping fundamentally requires the arguments to
dma_map_sg, so it is hard to accept an API called dma_map without
them.. Maybe just a naming issue.

[2] https://github.com/sbates130272/linux-p2pmem/commit/61ec04fa63c79dab14570ea0e97432d9325ad127
Logan Gunthorpe Jan. 4, 2018, 11:44 p.m. UTC | #4
On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
>> We tried things like this in an earlier iteration[1] which assumed the SG
>> was homogenous (all P2P or all regular memory). This required serious
>> ugliness to try and ensure SGs were in fact homogenous[2].
> 
> I'm confused, these patches already assume the sg is homogenous,
> right? Sure looks that way. So [2] is just debugging??

Yes, but it's a bit different to expect that someone calling 
pci_p2pmem_map_sg() will know what they're doing and provide a 
homogenous SG. It is relatively clear by convention that the entire SG 
must be homogenous given they're calling a pci_p2pmem function. Where 
as, allowing P2P SGs into the core DMA code means all we can do is hope 
that future developers don't screw it up and allow P2P pages to mix in 
with regular pages.

It's also very difficult to add similar functionality to dma_map_page 
seeing dma_unmap_page won't have any way to know what it's dealing with. 
It just seems confusing to support P2P in the SG version and not the 
page version.

> What I meant is to rely on the fact it is already homogenous and
> create a function like this:
> 
> static inline bool sg_is_table_p2p(struct scatterlist *sg)
> {
>      return is_pci_p2p_page(sg_page(sg));
> }
> 
> And then insert
> 
>    if (sg_is_table_p2p(sg))
>         return pci_p2pmem_map_sg(...);

Yes, this is almost identical to the earlier patch I referenced...

> Then we don't need to patch RDMA because RDMA is not special when it
> comes to P2P. P2P should work with everything.

Yes, I agree this would be very nice. But it's challenging at this time 
to do it safely. And something like this can still be done in future 
work after this patch set gets merged. The changes in this set to the 
RDMA and block layers are fairly small and the majority of the block 
layer changes would still be required anyway to ensure a given 
queue/driver supports P2P requests.
>>> The signature of pci_p2pmem_map_sg also doesn't look very good, it
>>> should mirror the usual dma_map_sg, otherwise it needs more revision
>>> to extend this to do P2P through a root complex.
>>
>> Generally, the feedback that I get is to not make changes that try to
>> anticipate the future. It would be easy enough to add those arguments when
>> the code for going through the RC is made (which I expect will be very
>> involved anyway).
> 
> Yes, but DMA mapping fundamentally requires the arguments to
> dma_map_sg, so it is hard to accept an API called dma_map without
> them.. Maybe just a naming issue.

I don't agree that this is a fundamental requirement. But if no one else 
objects to adding unused arguments I'd be fine with adding them.

Logan
Jason Gunthorpe Jan. 5, 2018, 4:50 a.m. UTC | #5
On Thu, Jan 04, 2018 at 04:44:00PM -0700, Logan Gunthorpe wrote:
> On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> >On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >>We tried things like this in an earlier iteration[1] which assumed the SG
> >>was homogenous (all P2P or all regular memory). This required serious
> >>ugliness to try and ensure SGs were in fact homogenous[2].
> >
> >I'm confused, these patches already assume the sg is homogenous,
> >right? Sure looks that way. So [2] is just debugging??
> 
> Yes, but it's a bit different to expect that someone calling
> pci_p2pmem_map_sg() will know what they're doing and provide a homogenous
> SG. It is relatively clear by convention that the entire SG must be
> homogenous given they're calling a pci_p2pmem function. Where as, allowing
> P2P SGs into the core DMA code means all we can do is hope that future
> developers don't screw it up and allow P2P pages to mix in with regular
> pages.

Well that argument applies equally to the RDMA RW API wrappers around
the DMA API. I think it is fine if sgl are defined to only have P2P or
not, and that debugging support seemed reasonable to me..

> It's also very difficult to add similar functionality to dma_map_page seeing
> dma_unmap_page won't have any way to know what it's dealing with. It just
> seems confusing to support P2P in the SG version and not the page version.

Well, this proposal is to support P2P in only some RDMA APIs and not
others, so it seems about as confusing to me..

> >Then we don't need to patch RDMA because RDMA is not special when it
> >comes to P2P. P2P should work with everything.
> 
> Yes, I agree this would be very nice.

Well, it is more than very nice. We have to keep RDMA working after
all, and if you make it even more special things become harder for us.

It is already the case that DMA in RDMA is very strange. We have
drivers that provide their own DMA ops, for instance.

And on that topic, does this scheme work with HFI?

On first glance, it looks like no. The PCI device the HFI device is
attached to may be able to do P2P, so it should be able to trigger the
support.

However, substituting the p2p_dma_map for the real device op dma_map
will cause a kernel crash when working with HFI. HFI uses a custom DMA
ops that returns CPU addreses in the dma_addr_t which the driver
handles in various special ways. One cannot just replace them with PCI
bus addresses.

So, this kinda looks to me like it causes bad breakage for some RDMA
drivers??

This is why P2P must fit in to the common DMA framework somehow, we
rely on these abstractions to work properly and fully in RDMA.

I think you should consider pushing this directly into the dma_ops
implementations. Add a p2p_supported flag to struct dma_map_ops, and
only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
Only map_sg would be supported for P2P. Upgraded implementations can
call the helper function.

Jason
Christoph Hellwig Jan. 8, 2018, 2:59 p.m. UTC | #6
On Thu, Jan 04, 2018 at 09:50:31PM -0700, Jason Gunthorpe wrote:
> Well that argument applies equally to the RDMA RW API wrappers around
> the DMA API. I think it is fine if sgl are defined to only have P2P or
> not, and that debugging support seemed reasonable to me..
> 
> > It's also very difficult to add similar functionality to dma_map_page seeing
> > dma_unmap_page won't have any way to know what it's dealing with. It just
> > seems confusing to support P2P in the SG version and not the page version.
> 
> Well, this proposal is to support P2P in only some RDMA APIs and not
> others, so it seems about as confusing to me..

As usual we implement what actually has a consumer.  On top of that the
R/W API is the only core RDMA API that actually does DMA mapping for the
ULP at the moment.  For SENDs and everything else dma maps are done by
the ULP (I'd like to eventually change that, though - e.g. sends through
that are inline to the workqueue don't need a dma map to start with).

> Well, it is more than very nice. We have to keep RDMA working after
> all, and if you make it even more special things become harder for us.
> 
> It is already the case that DMA in RDMA is very strange. We have
> drivers that provide their own DMA ops, for instance.

That's because the initial design was to let the ULPs do the DMA
mappings, which fundamentally is wrong.  I've fixed it for the R/W
API when adding it, but no one has started work on SENDs and atomics.

> And on that topic, does this scheme work with HFI?

No, and I guess we need an opt-out.  HFI generally seems to be
extremely weird.

> This is why P2P must fit in to the common DMA framework somehow, we
> rely on these abstractions to work properly and fully in RDMA.

Moving P2P up to common RDMA code isn't going to fix this.  For that
we need to stop preting that something that isn't DMA can abuse the
dma mapping framework, and until then opt them out of behavior that
assumes actual DMA like P2P.

> I think you should consider pushing this directly into the dma_ops
> implementations. Add a p2p_supported flag to struct dma_map_ops, and
> only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
> Only map_sg would be supported for P2P. Upgraded implementations can
> call the helper function.

If at all it should be in the dma_map* wrappers, but for that we'd need
a good identifier.  And it still would not solve the whole fake dma
ops issue.

So for now I'd much prefer to let the drivers handle it, and once usage
grows and we know usage patterns better (and I make progress refactoring
both the dma-mapping subsystem itself and the RDMA dma map code to not
be a complete mess) we can move it to the core.
Jason Gunthorpe Jan. 8, 2018, 6:09 p.m. UTC | #7
On Mon, Jan 08, 2018 at 03:59:01PM +0100, Christoph Hellwig wrote:
> On Thu, Jan 04, 2018 at 09:50:31PM -0700, Jason Gunthorpe wrote:
> > Well that argument applies equally to the RDMA RW API wrappers around
> > the DMA API. I think it is fine if sgl are defined to only have P2P or
> > not, and that debugging support seemed reasonable to me..
> > 
> > > It's also very difficult to add similar functionality to dma_map_page seeing
> > > dma_unmap_page won't have any way to know what it's dealing with. It just
> > > seems confusing to support P2P in the SG version and not the page version.
> > 
> > Well, this proposal is to support P2P in only some RDMA APIs and not
> > others, so it seems about as confusing to me..
> 
> As usual we implement what actually has a consumer.  On top of that the
> R/W API is the only core RDMA API that actually does DMA mapping for the
> ULP at the moment.

Well again the same can be said for dma_map_page vs dma_map_sg...

> For SENDs and everything else dma maps are done by the ULP (I'd like
> to eventually change that, though - e.g. sends through that are
> inline to the workqueue don't need a dma map to start with).


> That's because the initial design was to let the ULPs do the DMA
> mappings, which fundamentally is wrong.  I've fixed it for the R/W
> API when adding it, but no one has started work on SENDs and atomics.

Well, you know why it is like this, and it is very complicated to
unwind - the HW driver does not have enough information during CQ
processing to properly do any unmaps, let alone serious error tear
down unmaps, so we'd need a bunch of new APIs developed first, like RW
did. :\

> > And on that topic, does this scheme work with HFI?
> 
> No, and I guess we need an opt-out.  HFI generally seems to be
> extremely weird.

This series needs some kind of fix so HFI, QIB, rxe, etc don't get
broken, and it shouldn't be 'fixed' at the RDMA level.

> > This is why P2P must fit in to the common DMA framework somehow, we
> > rely on these abstractions to work properly and fully in RDMA.
> 
> Moving P2P up to common RDMA code isn't going to fix this.  For that
> we need to stop preting that something that isn't DMA can abuse the
> dma mapping framework, and until then opt them out of behavior that
> assumes actual DMA like P2P.

It could, if we had a DMA op for p2p then the drivers that provide
their own ops can implement it appropriately or not at all.

Eg the correct implementation for rxe to support p2p memory is
probably somewhat straightfoward.

> > I think you should consider pushing this directly into the dma_ops
> > implementations. Add a p2p_supported flag to struct dma_map_ops, and
> > only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
> > Only map_sg would be supported for P2P. Upgraded implementations can
> > call the helper function.
> 
> If at all it should be in the dma_map* wrappers, but for that we'd need
> a good identifier.  And it still would not solve the whole fake dma
> ops issue.

Very long term the IOMMUs under the ops will need to care about this,
so the wrapper is not an optimal place to put it - but I wouldn't
object if it gets it out of RDMA :)

Jason
Logan Gunthorpe Jan. 8, 2018, 6:17 p.m. UTC | #8
On 08/01/18 11:09 AM, Jason Gunthorpe wrote:
> It could, if we had a DMA op for p2p then the drivers that provide
> their own ops can implement it appropriately or not at all.

I was thinking of doing something like this. I'll probably rough out a 
patch and send it along today or tomorrow.

>> If at all it should be in the dma_map* wrappers, but for that we'd need
>> a good identifier.  And it still would not solve the whole fake dma
>> ops issue.
> 
> Very long term the IOMMUs under the ops will need to care about this,
> so the wrapper is not an optimal place to put it - but I wouldn't
> object if it gets it out of RDMA :)

Well, creating the extra op doesn't really change anything to the RDMA 
patch in this series. The point is, for the time being, we need to track 
whether we are doing a P2P or normal mapping using a side channel as we 
don't have a good way of tracking it in the SGL at this time.

Adding an extra DMA op is just one way to allow the existing DMA 
providers to opt-in/out.

Logan
Jason Gunthorpe Jan. 8, 2018, 6:29 p.m. UTC | #9
On Mon, Jan 08, 2018 at 11:17:38AM -0700, Logan Gunthorpe wrote:
 
> >>If at all it should be in the dma_map* wrappers, but for that we'd need
> >>a good identifier.  And it still would not solve the whole fake dma
> >>ops issue.
> >
> >Very long term the IOMMUs under the ops will need to care about this,
> >so the wrapper is not an optimal place to put it - but I wouldn't
> >object if it gets it out of RDMA :)
> 
> Well, creating the extra op doesn't really change anything to the RDMA patch
> in this series.

Not fundamentally, but it lets us solve the bugs the patch introduces
with hfi/etc

> The point is, for the time being, we need to track whether we are
> doing a P2P or normal mapping using a side channel as we don't have
> a good way of tracking it in the SGL at this time.

Well, that is disappointing for now, but I'm OK with the flag in the
RW interface - as long as we all sort of agree it is not desirable and
the SG should self-identify in an ideal someday future world..

Jason
Christoph Hellwig Jan. 8, 2018, 6:34 p.m. UTC | #10
On Mon, Jan 08, 2018 at 11:09:17AM -0700, Jason Gunthorpe wrote:
> > As usual we implement what actually has a consumer.  On top of that the
> > R/W API is the only core RDMA API that actually does DMA mapping for the
> > ULP at the moment.
> 
> Well again the same can be said for dma_map_page vs dma_map_sg...

I don't understand this comment.

> 
> > For SENDs and everything else dma maps are done by the ULP (I'd like
> > to eventually change that, though - e.g. sends through that are
> > inline to the workqueue don't need a dma map to start with).
> 
> 
> > That's because the initial design was to let the ULPs do the DMA
> > mappings, which fundamentally is wrong.  I've fixed it for the R/W
> > API when adding it, but no one has started work on SENDs and atomics.
> 
> Well, you know why it is like this, and it is very complicated to
> unwind - the HW driver does not have enough information during CQ
> processing to properly do any unmaps, let alone serious error tear
> down unmaps, so we'd need a bunch of new APIs developed first, like RW
> did. :\

Yes, if it was trivial we would have done it already.

> > > And on that topic, does this scheme work with HFI?
> > 
> > No, and I guess we need an opt-out.  HFI generally seems to be
> > extremely weird.
> 
> This series needs some kind of fix so HFI, QIB, rxe, etc don't get
> broken, and it shouldn't be 'fixed' at the RDMA level.

I don't think rxe is a problem as it won't show up a pci device.
HFI and QIB do show as PCI devices, and could be used for P2P transfers
from the PCI point of view.  It's just that they have a layer of
software indirection between their hardware and what is exposed at
the RDMA layer.

So I very much disagree about where to place that workaround - the
RDMA code is exactly the right place.

> > > This is why P2P must fit in to the common DMA framework somehow, we
> > > rely on these abstractions to work properly and fully in RDMA.
> > 
> > Moving P2P up to common RDMA code isn't going to fix this.  For that
> > we need to stop preting that something that isn't DMA can abuse the
> > dma mapping framework, and until then opt them out of behavior that
> > assumes actual DMA like P2P.
> 
> It could, if we had a DMA op for p2p then the drivers that provide
> their own ops can implement it appropriately or not at all.
> 
> Eg the correct implementation for rxe to support p2p memory is
> probably somewhat straightfoward.

But P2P is _not_ a factor of the dma_ops implementation at all,
it is something that happens behind the dma_map implementation.

Think about what the dma mapping routines do:

 (a) translate from host address to bus addresses

and

 (b) flush caches (in non-coherent architectures)

Both are obviously not needed for P2P transfers, as they never reach
the host.

> Very long term the IOMMUs under the ops will need to care about this,
> so the wrapper is not an optimal place to put it - but I wouldn't
> object if it gets it out of RDMA :)

Unless you have an IOMMU on your PCIe switch and not before/inside
the root complex that is not correct.
Logan Gunthorpe Jan. 8, 2018, 6:44 p.m. UTC | #11
On 08/01/18 11:34 AM, Christoph Hellwig wrote:
> But P2P is _not_ a factor of the dma_ops implementation at all,
> it is something that happens behind the dma_map implementation.
> 
> Think about what the dma mapping routines do:
> 
>   (a) translate from host address to bus addresses
> 
> and
> 
>   (b) flush caches (in non-coherent architectures)
> 
> Both are obviously not needed for P2P transfers, as they never reach
> the host.

Isn't pci_p2pdma_map_sg doing (a)? It's just translating from a host 
address to a PCI bus address.

>> Very long term the IOMMUs under the ops will need to care about this,
>> so the wrapper is not an optimal place to put it - but I wouldn't
>> object if it gets it out of RDMA :)
> 
> Unless you have an IOMMU on your PCIe switch and not before/inside
> the root complex that is not correct.

Per the ACS discussion, in the future we might want to implement "ACS 
Direct Translated P2P" as Alex described. I expect it would be the IOMMU 
that needs to set that up. So, I think, we also have the dma_map 
implementations also doing something like:

(c) setup/manage any security permissions on mappings
Which P2P may at some point be concerned with.

Logan
Christoph Hellwig Jan. 8, 2018, 6:57 p.m. UTC | #12
On Mon, Jan 08, 2018 at 11:44:41AM -0700, Logan Gunthorpe wrote:
>> Think about what the dma mapping routines do:
>>
>>   (a) translate from host address to bus addresses
>>
>> and
>>
>>   (b) flush caches (in non-coherent architectures)
>>
>> Both are obviously not needed for P2P transfers, as they never reach
>> the host.
>
> Isn't pci_p2pdma_map_sg doing (a)? It's just translating from a host 
> address to a PCI bus address.

It does, sort of - but in a different way then the normal DMA map
ops.  And only to work around the fact that we need to map our
P2P space into struct pages.  Without that we could just pass the
bus address around, but the Linux stack and VM isn't anywhere near
ready for something that advanced.

>>> Very long term the IOMMUs under the ops will need to care about this,
>>> so the wrapper is not an optimal place to put it - but I wouldn't
>>> object if it gets it out of RDMA :)
>>
>> Unless you have an IOMMU on your PCIe switch and not before/inside
>> the root complex that is not correct.
>
> Per the ACS discussion, in the future we might want to implement "ACS 
> Direct Translated P2P" as Alex described. I expect it would be the IOMMU 
> that needs to set that up. So, I think, we also have the dma_map 
> implementations also doing something like:
>
> (c) setup/manage any security permissions on mappings
> Which P2P may at some point be concerned with.

Maybe once root complexes with iommus actually support P2P.  But until
then we have a lot more more important problems to solve.
Jason Gunthorpe Jan. 8, 2018, 7:01 p.m. UTC | #13
On Mon, Jan 08, 2018 at 07:34:34PM +0100, Christoph Hellwig wrote:
> > > > And on that topic, does this scheme work with HFI?
> > > 
> > > No, and I guess we need an opt-out.  HFI generally seems to be
> > > extremely weird.
> > 
> > This series needs some kind of fix so HFI, QIB, rxe, etc don't get
> > broken, and it shouldn't be 'fixed' at the RDMA level.
> 
> I don't think rxe is a problem as it won't show up a pci device.

Right today's restrictions save us..

> HFI and QIB do show as PCI devices, and could be used for P2P transfers
> from the PCI point of view.  It's just that they have a layer of
> software indirection between their hardware and what is exposed at
> the RDMA layer.
> 
> So I very much disagree about where to place that workaround - the
> RDMA code is exactly the right place.

But why? RDMA is using core code to do this. It uses dma_ops in struct
device and it uses normal dma_map SG. How is it RDMA's problem that
some PCI drivers provide strange DMA ops?

Admittedly they are RDMA drivers, but it is a core mechanism they
(ab)use these days..

> > It could, if we had a DMA op for p2p then the drivers that provide
> > their own ops can implement it appropriately or not at all.
> > 
> > Eg the correct implementation for rxe to support p2p memory is
> > probably somewhat straightfoward.
> 
> But P2P is _not_ a factor of the dma_ops implementation at all,
> it is something that happens behind the dma_map implementation.

Only as long as the !ACS and switch limitations are present.

Those limitations are fine to get things started, but there is going
to a be a push improve the system to remove them.

> > Very long term the IOMMUs under the ops will need to care about this,
> > so the wrapper is not an optimal place to put it - but I wouldn't
> > object if it gets it out of RDMA :)
> 
> Unless you have an IOMMU on your PCIe switch and not before/inside
> the root complex that is not correct.

I understand the proposed patches restrict things to require a switch
and not transit the IOMMU.

But *very long term* P2P will need to work with paths that transit the
system IOMMU and root complex.

This already exists as out-of-tree funtionality that has been deployed
in production for years and years that does P2P through the root
complex with the IOMMU turned off.

Jason
Logan Gunthorpe Jan. 8, 2018, 7:05 p.m. UTC | #14
On 08/01/18 11:57 AM, Christoph Hellwig wrote:
> It does, sort of - but in a different way then the normal DMA map
> ops.  And only to work around the fact that we need to map our
> P2P space into struct pages.  Without that we could just pass the
> bus address around, but the Linux stack and VM isn't anywhere near
> ready for something that advanced.

Yeah, wouldn't that be nice.

Ok, so if we shouldn't touch the dma_map infrastructure how should the 
workaround to opt-out HFI and QIB look? I'm not that familiar with the 
RDMA code.

Thanks,

Logan
Jason Gunthorpe Jan. 8, 2018, 7:49 p.m. UTC | #15
On Mon, Jan 8, 2018 at 11:57 AM, Christoph Hellwig <hch@lst.de> wrote:

>> (c) setup/manage any security permissions on mappings
>> Which P2P may at some point be concerned with.
>
> Maybe once root complexes with iommus actually support P2P.  But until
> then we have a lot more more important problems to solve.

Pretty sure P2P capable IOMMU hardware exists.

With SOC's we also have the scenario that an DMA originated from an
on-die device wishes to target an off-die PCI BAR (through the IOMMU),
that definitely exists today, and people care about it :)

Jason
Christoph Hellwig Jan. 9, 2018, 4:46 p.m. UTC | #16
On Mon, Jan 08, 2018 at 12:49:50PM -0700, Jason Gunthorpe wrote:
> Pretty sure P2P capable IOMMU hardware exists.
> 
> With SOC's we also have the scenario that an DMA originated from an
> on-die device wishes to target an off-die PCI BAR (through the IOMMU),
> that definitely exists today, and people care about it :)

Then people will have to help and contribute support for it.
Christoph Hellwig Jan. 9, 2018, 4:47 p.m. UTC | #17
On Mon, Jan 08, 2018 at 12:05:57PM -0700, Logan Gunthorpe wrote:
> Ok, so if we shouldn't touch the dma_map infrastructure how should the 
> workaround to opt-out HFI and QIB look? I'm not that familiar with the RDMA 
> code.

We can add a no_p2p quirk, I'm just not sure what the right place
for it is.  As said they device don't really mind P2P at the PCIe
level, it's just that their RDMA implementation is really strange.

So I'd be tempted to do it in RDMA.
Christoph Hellwig Jan. 9, 2018, 4:55 p.m. UTC | #18
On Mon, Jan 08, 2018 at 12:01:16PM -0700, Jason Gunthorpe wrote:
> > So I very much disagree about where to place that workaround - the
> > RDMA code is exactly the right place.
> 
> But why? RDMA is using core code to do this. It uses dma_ops in struct
> device and it uses normal dma_map SG. How is it RDMA's problem that
> some PCI drivers provide strange DMA ops?

Because RDMA uses the dma_virt_ops to pretend a device does DMA when
in fact it doesn't - at least not for the exact data mapped, or
as far as I can tell often not all - e.g. the QIB/HFI devices
might do mmio access for data mapped.

This whole problem only exist because a few RDMA HCA drivers lie
with the help of the RDMA core.
Jason Gunthorpe Jan. 9, 2018, 5:10 p.m. UTC | #19
On Tue, Jan 09, 2018 at 05:46:40PM +0100, Christoph Hellwig wrote:
> On Mon, Jan 08, 2018 at 12:49:50PM -0700, Jason Gunthorpe wrote:
> > Pretty sure P2P capable IOMMU hardware exists.
> > 
> > With SOC's we also have the scenario that an DMA originated from an
> > on-die device wishes to target an off-die PCI BAR (through the IOMMU),
> > that definitely exists today, and people care about it :)
> 
> Then people will have to help and contribute support for it.

Sure. But my point was all this will have to migrate under the dma_ops
for that to work, so why the resistance to using dma_ops right away?

Jason
diff mbox

Patch

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index c8963e91f92a..7956484da082 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@ 
  */
 #include <linux/moduleparam.h>
 #include <linux/slab.h>
+#include <linux/pci-p2p.h>
 #include <rdma/mr_pool.h>
 #include <rdma/rw.h>
 
@@ -269,18 +270,24 @@  static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
  * @remote_addr:remote address to read/write (relative to @rkey)
  * @rkey:	remote key to operate on
  * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags:      any of the RDMA_RW_CTX_FLAG_* flags
  *
  * Returns the number of WQEs that will be needed on the workqueue if
  * successful, or a negative error code.
  */
 int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
-		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+		unsigned int flags)
 {
 	struct ib_device *dev = qp->pd->device;
 	int ret;
 
-	ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+	if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
+		ret = pci_p2pmem_map_sg(sg, sg_cnt);
+	else
+		ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
 	if (!ret)
 		return -ENOMEM;
 	sg_cnt = ret;
@@ -499,7 +506,7 @@  struct ib_send_wr *rdma_rw_ctx_wrs(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		rdma_rw_update_lkey(&ctx->sig->data, true);
 		if (ctx->sig->prot.mr)
 			rdma_rw_update_lkey(&ctx->sig->prot, true);
-	
+
 		ctx->sig->sig_mr->need_inval = true;
 		ib_update_fast_reg_key(ctx->sig->sig_mr,
 			ib_inc_rkey(ctx->sig->sig_mr->lkey));
@@ -579,9 +586,11 @@  EXPORT_SYMBOL(rdma_rw_ctx_post);
  * @sg:		scatterlist that was used for the READ/WRITE
  * @sg_cnt:	number of entries in @sg
  * @dir:	%DMA_TO_DEVICE for RDMA WRITE, %DMA_FROM_DEVICE for RDMA READ
+ * @flags:      the same flags used to init the context
  */
 void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
-		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir)
+		struct scatterlist *sg, u32 sg_cnt, enum dma_data_direction dir,
+		unsigned int flags)
 {
 	int i;
 
@@ -602,7 +611,10 @@  void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		break;
 	}
 
-	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+	if (flags & RDMA_RW_CTX_FLAG_PCI_P2P)
+		pci_p2pmem_unmap_sg(sg, sg_cnt);
+	else
+		ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 720dfb3a1ac2..a076da2ead16 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1496,7 +1496,8 @@  isert_rdma_rw_ctx_destroy(struct isert_cmd *cmd, struct isert_conn *conn)
 				se_cmd->t_prot_nents, dir);
 	} else {
 		rdma_rw_ctx_destroy(&cmd->rw, conn->qp, conn->cm_id->port_num,
-				se_cmd->t_data_sg, se_cmd->t_data_nents, dir);
+				se_cmd->t_data_sg, se_cmd->t_data_nents,
+				dir, 0);
 	}
 
 	cmd->rw.nr_ops = 0;
@@ -2148,7 +2149,7 @@  isert_rdma_rw_ctx_post(struct isert_cmd *cmd, struct isert_conn *conn,
 	} else {
 		ret = rdma_rw_ctx_init(&cmd->rw, conn->qp, port_num,
 				se_cmd->t_data_sg, se_cmd->t_data_nents,
-				offset, addr, rkey, dir);
+				offset, addr, rkey, dir, 0);
 	}
 	if (ret < 0) {
 		isert_err("Cmd: %p failed to prepare RDMA res\n", cmd);
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 8a1bd354b1cc..c5371ab2e47d 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -854,7 +854,8 @@  static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
 			goto unwind;
 
 		ret = rdma_rw_ctx_init(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, 0, remote_addr, rkey, dir);
+				ctx->sg, ctx->nents, 0, remote_addr, rkey,
+				dir, 0);
 		if (ret < 0) {
 			target_free_sgl(ctx->sg, ctx->nents);
 			goto unwind;
@@ -883,7 +884,7 @@  static int srpt_alloc_rw_ctxs(struct srpt_send_ioctx *ioctx,
 		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
 
 		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, dir);
+				ctx->sg, ctx->nents, dir, 0);
 		target_free_sgl(ctx->sg, ctx->nents);
 	}
 	if (ioctx->rw_ctxs != &ioctx->s_rw_ctx)
@@ -901,7 +902,7 @@  static void srpt_free_rw_ctxs(struct srpt_rdma_ch *ch,
 		struct srpt_rw_ctx *ctx = &ioctx->rw_ctxs[i];
 
 		rdma_rw_ctx_destroy(&ctx->rw, ch->qp, ch->sport->port,
-				ctx->sg, ctx->nents, dir);
+				ctx->sg, ctx->nents, dir, 0);
 		target_free_sgl(ctx->sg, ctx->nents);
 	}
 
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 49912909c298..d4d0662ab071 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -480,7 +480,7 @@  static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 	if (rsp->n_rdma) {
 		rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
 				queue->cm_id->port_num, rsp->req.sg,
-				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+				rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
 	}
 
 	if (rsp->req.sg != &rsp->cmd->inline_sg)
@@ -563,7 +563,7 @@  static void nvmet_rdma_read_data_done(struct ib_cq *cq, struct ib_wc *wc)
 	atomic_add(rsp->n_rdma, &queue->sq_wr_avail);
 	rdma_rw_ctx_destroy(&rsp->rw, queue->cm_id->qp,
 			queue->cm_id->port_num, rsp->req.sg,
-			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req));
+			rsp->req.sg_cnt, nvmet_data_dir(&rsp->req), 0);
 	rsp->n_rdma = 0;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
@@ -634,7 +634,7 @@  static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 
 	ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
 			rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
-			nvmet_data_dir(&rsp->req));
+			nvmet_data_dir(&rsp->req), 0);
 	if (ret < 0)
 		return NVME_SC_INTERNAL;
 	rsp->req.transfer_len += len;
diff --git a/include/rdma/rw.h b/include/rdma/rw.h
index a3cbbc7b6417..ba8050434667 100644
--- a/include/rdma/rw.h
+++ b/include/rdma/rw.h
@@ -59,12 +59,15 @@  struct rdma_rw_ctx {
 	};
 };
 
+#define RDMA_RW_CTX_FLAG_PCI_P2P  (1 << 0)
+
 int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt, u32 sg_offset,
-		u64 remote_addr, u32 rkey, enum dma_data_direction dir);
+		u64 remote_addr, u32 rkey, enum dma_data_direction dir,
+		unsigned int flags);
 void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
 		struct scatterlist *sg, u32 sg_cnt,
-		enum dma_data_direction dir);
+		enum dma_data_direction dir, unsigned int flags);
 
 int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		u8 port_num, struct scatterlist *sg, u32 sg_cnt,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 9bd04549a1ad..5f46c35e6707 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -140,7 +140,7 @@  static void svc_rdma_cc_release(struct svc_rdma_chunk_ctxt *cc,
 
 		rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
 				    rdma->sc_port_num, ctxt->rw_sg_table.sgl,
-				    ctxt->rw_nents, dir);
+				    ctxt->rw_nents, dir, 0);
 		svc_rdma_put_rw_ctxt(rdma, ctxt);
 	}
 	svc_xprt_put(&rdma->sc_xprt);
@@ -433,7 +433,7 @@  svc_rdma_build_writes(struct svc_rdma_write_info *info,
 		ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp,
 				       rdma->sc_port_num, ctxt->rw_sg_table.sgl,
 				       ctxt->rw_nents, 0, seg_offset,
-				       seg_handle, DMA_TO_DEVICE);
+				       seg_handle, DMA_TO_DEVICE, 0);
 		if (ret < 0)
 			goto out_initerr;
 
@@ -639,7 +639,7 @@  static int svc_rdma_build_read_segment(struct svc_rdma_read_info *info,
 	ret = rdma_rw_ctx_init(&ctxt->rw_ctx, cc->cc_rdma->sc_qp,
 			       cc->cc_rdma->sc_port_num,
 			       ctxt->rw_sg_table.sgl, ctxt->rw_nents,
-			       0, offset, rkey, DMA_FROM_DEVICE);
+			       0, offset, rkey, DMA_FROM_DEVICE, 0);
 	if (ret < 0)
 		goto out_initerr;