
[for-next,v2,06/18] RDMA/rxe: Add rxe_add_frag() to rxe_mr.c

Message ID: 20221031202805.19138-6-rpearsonhpe@gmail.com (mailing list archive)
State: Changes Requested
Delegated to: Jason Gunthorpe
Series: RDMA/rxe: Enable scatter/gather support for skbs

Commit Message

Bob Pearson Oct. 31, 2022, 8:27 p.m. UTC
Add the subroutine rxe_add_frag() to add a fragment to an skb.

This is in preparation for supporting fragmented skbs.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h |  2 ++
 drivers/infiniband/sw/rxe/rxe_mr.c  | 34 +++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)
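
A rough usage sketch (not part of the posted patch; the caller name and the use
of the rxe_phys_buf 'size' field are assumptions, since the actual callers come
later in the series) of how a sender might attach MR buffers to an skb with
rxe_add_frag() instead of copying them into the linear data area:

	static int rxe_skb_add_mr_frags(struct sk_buff *skb,
					struct rxe_phys_buf *buf,
					int length, int offset)
	{
		/* walk the MR's physical buffers, adding one frag per buffer
		 * until the requested length has been covered
		 */
		while (length > 0) {
			int bytes = min_t(int, length, buf->size - offset);
			int err = rxe_add_frag(skb, buf, bytes, offset);

			if (err)
				return err;

			length -= bytes;
			offset = 0;	/* only the first buffer is offset */
			buf++;
		}

		return 0;
	}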

Comments

Jason Gunthorpe Nov. 24, 2022, 7:10 p.m. UTC | #1
On Mon, Oct 31, 2022 at 03:27:55PM -0500, Bob Pearson wrote:
> +int rxe_add_frag(struct sk_buff *skb, struct rxe_phys_buf *buf,
> +		 int length, int offset)
> +{
> +	int nr_frags = skb_shinfo(skb)->nr_frags;
> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[nr_frags];
> +
> +	if (nr_frags >= MAX_SKB_FRAGS) {
> +		pr_debug("%s: nr_frags (%d) >= MAX_SKB_FRAGS\n",
> +			__func__, nr_frags);
> +		return -EINVAL;
> +	}
> +
> +	frag->bv_len = length;
> +	frag->bv_offset = offset;
> +	frag->bv_page = virt_to_page(buf->addr);

Assuming this is even OK to do, then please do the xarray conversion
I sketched first:

https://lore.kernel.org/linux-rdma/Y3gvZr6%2FNCii9Avy@nvidia.com/

And this operation is basically a xa_for_each loop taking 'struct page
*' off of the MR's xarray, slicing it, then stuffing into the
skb. Don't call virt_to_page()
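
A minimal sketch of what such an xa_for_each loop could look like (the
'page_list' xarray field and the helper name are assumptions loosely following
the linked xarray conversion sketch, not code from this patch):

	static int rxe_skb_frags_from_xarray(struct sk_buff *skb, struct rxe_mr *mr,
					     unsigned long first, unsigned long last,
					     int offset, int length)
	{
		struct page *page;
		unsigned long index;

		xa_for_each_range(&mr->page_list, index, page, first, last) {
			/* slice this page against the remaining length */
			int bytes = min_t(int, length, PAGE_SIZE - offset);

			if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS)
				return -EINVAL;

			/* kfree_skb() will put_page() each frag */
			get_page(page);
			skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
					   page, offset, bytes);
			skb->len += bytes;
			skb->data_len += bytes;

			length -= bytes;
			offset = 0;	/* only the first page is offset */
		}

		return 0;
	}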

*However* I have no idea if it is even safe to stuff unstable pages
into a skb. Are there other examples doing this? Eg zero copy tcp?

Jason
Bob Pearson Nov. 30, 2022, 8:53 p.m. UTC | #2
On 11/24/22 13:10, Jason Gunthorpe wrote:
> On Mon, Oct 31, 2022 at 03:27:55PM -0500, Bob Pearson wrote:
>> +int rxe_add_frag(struct sk_buff *skb, struct rxe_phys_buf *buf,
>> +		 int length, int offset)
>> +{
>> +	int nr_frags = skb_shinfo(skb)->nr_frags;
>> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[nr_frags];
>> +
>> +	if (nr_frags >= MAX_SKB_FRAGS) {
>> +		pr_debug("%s: nr_frags (%d) >= MAX_SKB_FRAGS\n",
>> +			__func__, nr_frags);
>> +		return -EINVAL;
>> +	}
>> +
>> +	frag->bv_len = length;
>> +	frag->bv_offset = offset;
>> +	frag->bv_page = virt_to_page(buf->addr);
> 
> Assuming this is even OK to do, then please do the xarray conversion
> I sketched first:
> 
> https://lore.kernel.org/linux-rdma/Y3gvZr6%2FNCii9Avy@nvidia.com/

I've been looking at this. Seems incorrect for IB_MR_TYPE_DMA which
do not carry a page map but simply convert iova to kernel virtual addresses.
This breaks in the mr_copy routine and in the atomic ops in the responder.
There is also a missing rxe_mr_iova_to_index() function. It looks simple enough,
just the number of pages starting from 0.
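
For reference, a sketch of what such a helper could look like (assuming the MR
keeps its starting iova in ibmr.iova and a hypothetical page_shift field, as in
the xarray conversion sketch):

	/* page index within the MR of the page backing iova, counting from 0 */
	static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
	{
		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
	}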

I am curious what the benefit of the 'advanced' API for xarrays buys here. Is it just
preallocating all the memory before it gets used?

I am happy to go in this direction, and if we do, it should go ahead of the other
outstanding changes that touch MRs.

I will try to submit a patch to do that.

Bob

> 
> And this operation is basically a xa_for_each loop taking 'struct page
> *' off of the MR's xarray, slicing it, then stuffing into the
> skb. Don't call virt_to_page()
> 
> *However* I have no idea if it is even safe to stuff unstable pages
> into a skb. Are there other examples doing this? Eg zero copy tcp?
> 
> Jason
Jason Gunthorpe Nov. 30, 2022, 11:36 p.m. UTC | #3
On Wed, Nov 30, 2022 at 02:53:22PM -0600, Bob Pearson wrote:
> On 11/24/22 13:10, Jason Gunthorpe wrote:
> > On Mon, Oct 31, 2022 at 03:27:55PM -0500, Bob Pearson wrote:
> >> +int rxe_add_frag(struct sk_buff *skb, struct rxe_phys_buf *buf,
> >> +		 int length, int offset)
> >> +{
> >> +	int nr_frags = skb_shinfo(skb)->nr_frags;
> >> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[nr_frags];
> >> +
> >> +	if (nr_frags >= MAX_SKB_FRAGS) {
> >> +		pr_debug("%s: nr_frags (%d) >= MAX_SKB_FRAGS\n",
> >> +			__func__, nr_frags);
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	frag->bv_len = length;
> >> +	frag->bv_offset = offset;
> >> +	frag->bv_page = virt_to_page(buf->addr);
> > 
> > Assuming this is even OK to do, then please do the xarray conversion
> > I sketched first:
> > 
> > https://lore.kernel.org/linux-rdma/Y3gvZr6%2FNCii9Avy@nvidia.com/
> 
> I've been looking at this. Seems incorrect for IB_MR_TYPE_DMA which
> do not carry a page map but simply convert iova to kernel virtual
> addresses.

There is always a struct page involved, even in the kernel case. You
can do virt_to_page on kernel addresses
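
For example (a minimal sketch; only vmalloc'ed memory would need
vmalloc_to_page() instead of virt_to_page()):

	void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
	/* for any direct-mapped (non-vmalloc) kernel address: */
	struct page *page = buf ? virt_to_page(buf) : NULL;
	unsigned int off = offset_in_page(buf);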

> I am curious what the benefit of the 'advanced' API for xarrays buys here. Is it just
> preallocating all the memory before it gets used?

It runs quite a bit faster
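
Roughly, the 'advanced' API lets the store loop look like this (a sketch with
assumed names; the point is one lock round trip and at most one allocation
retry per batch instead of a locked tree walk per xa_store() call):

	static int rxe_mr_store_pages(struct rxe_mr *mr, struct page **pages,
				      int npages)
	{
		XA_STATE(xas, &mr->page_list, 0);
		int i = 0;

		do {
			xas_lock(&xas);
			while (i < npages) {
				xas_store(&xas, pages[i]);
				if (xas_error(&xas))
					break;
				xas_next(&xas);
				i++;
			}
			xas_unlock(&xas);
			/* xas_nomem() allocates outside the lock and asks us to retry */
		} while (xas_nomem(&xas, GFP_KERNEL));

		return xas_error(&xas);
	}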

Jason
Bob Pearson Dec. 1, 2022, 12:16 a.m. UTC | #4
On 11/30/22 17:36, Jason Gunthorpe wrote:
> On Wed, Nov 30, 2022 at 02:53:22PM -0600, Bob Pearson wrote:
>> On 11/24/22 13:10, Jason Gunthorpe wrote:
>>> On Mon, Oct 31, 2022 at 03:27:55PM -0500, Bob Pearson wrote:
>>>> +int rxe_add_frag(struct sk_buff *skb, struct rxe_phys_buf *buf,
>>>> +		 int length, int offset)
>>>> +{
>>>> +	int nr_frags = skb_shinfo(skb)->nr_frags;
>>>> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[nr_frags];
>>>> +
>>>> +	if (nr_frags >= MAX_SKB_FRAGS) {
>>>> +		pr_debug("%s: nr_frags (%d) >= MAX_SKB_FRAGS\n",
>>>> +			__func__, nr_frags);
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	frag->bv_len = length;
>>>> +	frag->bv_offset = offset;
>>>> +	frag->bv_page = virt_to_page(buf->addr);
>>>
>>> Assuming this is even OK to do, then please do the xarray conversion
>>> I sketched first:
>>>
>>> https://lore.kernel.org/linux-rdma/Y3gvZr6%2FNCii9Avy@nvidia.com/
>>
>> I've been looking at this. Seems incorrect for IB_MR_TYPE_DMA which
>> do not carry a page map but simply convert iova to kernel virtual
>> addresses.
> 
> There is always a struct page involved, even in the kernel case. You
> can do virt_to_page on kernel addresses
Agreed but there isn't a page map set up for DMA mr's. You just get the whole kernel
address space. So the call to rxe_mr_copy_xarray() won't work. There isn't an
xarray to copy to/from. Much easier to just leave the DMA mr code in place since
it does what we want very simply. Also you have to treat the DMA mr separately for
atomic ops.

Bob
> 
>> I am curious what the benefit of the 'advanced' API for xarrays buys here. Is it just
>> preallocating all the memory before it gets used?
> 
> It runs quite a bit faster
> 
> Jason
Jason Gunthorpe Dec. 1, 2022, 12:20 a.m. UTC | #5
On Wed, Nov 30, 2022 at 06:16:53PM -0600, Bob Pearson wrote:
> On 11/30/22 17:36, Jason Gunthorpe wrote:
> > On Wed, Nov 30, 2022 at 02:53:22PM -0600, Bob Pearson wrote:
> >> On 11/24/22 13:10, Jason Gunthorpe wrote:
> >>> On Mon, Oct 31, 2022 at 03:27:55PM -0500, Bob Pearson wrote:
> >>>> +int rxe_add_frag(struct sk_buff *skb, struct rxe_phys_buf *buf,
> >>>> +		 int length, int offset)
> >>>> +{
> >>>> +	int nr_frags = skb_shinfo(skb)->nr_frags;
> >>>> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[nr_frags];
> >>>> +
> >>>> +	if (nr_frags >= MAX_SKB_FRAGS) {
> >>>> +		pr_debug("%s: nr_frags (%d) >= MAX_SKB_FRAGS\n",
> >>>> +			__func__, nr_frags);
> >>>> +		return -EINVAL;
> >>>> +	}
> >>>> +
> >>>> +	frag->bv_len = length;
> >>>> +	frag->bv_offset = offset;
> >>>> +	frag->bv_page = virt_to_page(buf->addr);
> >>>
> >>> Assuming this is even OK to do, then please do the xarray conversion
> >>> I sketched first:
> >>>
> >>> https://lore.kernel.org/linux-rdma/Y3gvZr6%2FNCii9Avy@nvidia.com/
> >>
> >> I've been looking at this. Seems incorrect for IB_MR_TYPE_DMA which
> >> do not carry a page map but simply convert iova to kernel virtual
> >> addresses.
> > 
> > There is always a struct page involved, even in the kernel case. You
> > can do virt_to_page on kernel addresses

> Agreed but there isn't a page map set up for DMA mr's. You just get the whole kernel
> address space. So the call to rxe_mr_copy_xarray() won't work. There isn't an
> xarray to copy to/from. Much easier to just leave the DMA mr code in place since
> it does what we want very simply. Also you have to treat the DMA mr separately for
> atomic ops.

You mean the all-physical-memory MR type? It is true, but you still
have to add the kmap and so on. It should be a similar function that
assumes the IOVA is a kernel-mapped address and does
virt_to_page()/etc. instead of the xarray loop.
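
A rough sketch of such a function, under the assumption that for a DMA MR the
iova is a direct-mapped kernel address (fragments crossing a page boundary are
ignored here for brevity):

	static void rxe_dma_mr_copy(u64 iova, void *addr, unsigned int length,
				    bool to_mr)
	{
		/* still go through struct page + kmap rather than
		 * dereferencing the iova directly
		 */
		struct page *page = virt_to_page((void *)(uintptr_t)iova);
		unsigned int offset = offset_in_page(iova);
		void *va = kmap_local_page(page);

		if (to_mr)
			memcpy(va + offset, addr, length);
		else
			memcpy(addr, va + offset, length);

		kunmap_local(va);
	}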

Jason
Bob Pearson Dec. 1, 2022, 12:36 a.m. UTC | #6
On 11/30/22 18:20, Jason Gunthorpe wrote:
> On Wed, Nov 30, 2022 at 06:16:53PM -0600, Bob Pearson wrote:
>> On 11/30/22 17:36, Jason Gunthorpe wrote:
>>> On Wed, Nov 30, 2022 at 02:53:22PM -0600, Bob Pearson wrote:
>>>> On 11/24/22 13:10, Jason Gunthorpe wrote:
>>>>> On Mon, Oct 31, 2022 at 03:27:55PM -0500, Bob Pearson wrote:
>>>>>> +int rxe_add_frag(struct sk_buff *skb, struct rxe_phys_buf *buf,
>>>>>> +		 int length, int offset)
>>>>>> +{
>>>>>> +	int nr_frags = skb_shinfo(skb)->nr_frags;
>>>>>> +	skb_frag_t *frag = &skb_shinfo(skb)->frags[nr_frags];
>>>>>> +
>>>>>> +	if (nr_frags >= MAX_SKB_FRAGS) {
>>>>>> +		pr_debug("%s: nr_frags (%d) >= MAX_SKB_FRAGS\n",
>>>>>> +			__func__, nr_frags);
>>>>>> +		return -EINVAL;
>>>>>> +	}
>>>>>> +
>>>>>> +	frag->bv_len = length;
>>>>>> +	frag->bv_offset = offset;
>>>>>> +	frag->bv_page = virt_to_page(buf->addr);
>>>>>
>>>>> Assuming this is even OK to do, then please do the xarray conversion
>>>>> I sketched first:
>>>>>
>>>>> https://lore.kernel.org/linux-rdma/Y3gvZr6%2FNCii9Avy@nvidia.com/
>>>>
>>>> I've been looking at this. Seems incorrect for IB_MR_TYPE_DMA which
>>>> do not carry a page map but simply convert iova to kernel virtual
>>>> addresses.
>>>
>>> There is always a struct page involved, even in the kernel case. You
>>> can do virt_to_page on kernel addresses
> 
>> Agreed but there isn't a page map set up for DMA mr's. You just get the whole kernel
>> address space. So the call to rxe_mr_copy_xarray() won't work. There isn't an
>> xarray to copy to/from. Much easier to just leave the DMA mr code in place since
>> it does what we want very simply. Also you have to treat the DMA mr separately for
>> atomic ops.
> 
> You mean the all-physical-memory MR type? It is true, but you still
> have to add the kmap and so on. It should be a similar function that
> assumes the IOVA is a kernel-mapped address and does
> virt_to_page()/etc. instead of the xarray loop.
> 
> Jason

I'm not looking at my patch you responded to but the one you posted to replace maps
by xarrays. The existing rxe driver assumes that if ibmr->type == IB_MR_TYPE_DMA that
the iova is just a kernel (virtual) address that is already mapped. Maybe this is
not correct but it has always worked this way. These are heavily used by storage stacks
(e.g. Lustre) which always use DMA mr's. Since we don't actually do any DMAs we don't
need to setup the iommu for these and just do memcpy's without dealing with pages.

Bob
Jason Gunthorpe Dec. 1, 2022, 12:41 a.m. UTC | #7
On Wed, Nov 30, 2022 at 06:36:56PM -0600, Bob Pearson wrote:
> I'm not looking at my patch you responded to but the one you posted to replace maps
> by xarrays.

I see, I botched that part

> The existing rxe driver assumes that if ibmr->type == IB_MR_TYPE_DMA
> that the iova is just a kernel (virtual) address that is already
> mapped.

No, it is not correct

> Maybe this is not correct but it has always worked this way. These
> are heavily used by storage stacks (e.g. Lustre) which always use
> DMA mr's. Since we don't actually do any DMAs we don't need to setup
> the iommu for these and just do memcpy's without dealing with pages.

You still should be doing the kmap

Jason
Bob Pearson Dec. 1, 2022, 5:05 a.m. UTC | #8
On 11/30/22 18:41, Jason Gunthorpe wrote:
> On Wed, Nov 30, 2022 at 06:36:56PM -0600, Bob Pearson wrote:
>> I'm not looking at my patch you responded to but the one you posted to replace maps
>> by xarrays.
> 
> I see, I botched that part
> 
>> The existing rxe driver assumes that if ibmr->type == IB_MR_TYPE_DMA
>> that the iova is just a kernel (virtual) address that is already
>> mapped.
> 
> No, it is not correct
> 
>> Maybe this is not correct but it has always worked this way. These
>> are heavily used by storage stacks (e.g. Lustre) which always use
>> DMA mr's. Since we don't actually do any DMAs we don't need to setup
>> the iommu for these and just do memcpy's without dealing with pages.
> 
> You still should be doing the kmap
> 
> Jason

Does this have to do with 32 bit machines? I have always tested on 64 bit machines and
dma mr's are always already mapped.

Bob
Jason Gunthorpe Dec. 1, 2022, 12:51 p.m. UTC | #9
On Wed, Nov 30, 2022 at 11:05:04PM -0600, Bob Pearson wrote:
> On 11/30/22 18:41, Jason Gunthorpe wrote:
> > On Wed, Nov 30, 2022 at 06:36:56PM -0600, Bob Pearson wrote:
> >> I'm not looking at my patch you responded to but the one you posted to replace maps
> >> by xarrays.
> > 
> > I see, I botched that part
> > 
> >> The existing rxe driver assumes that if ibmr->type == IB_MR_TYPE_DMA
> >> that the iova is just a kernel (virtual) address that is already
> >> mapped.
> > 
> > No, it is not correct
> > 
> >> Maybe this is not correct but it has always worked this way. These
> >> are heavily used by storage stacks (e.g. Lustre) which always use
> >> DMA mr's. Since we don't actually do any DMAs we don't need to setup
> >> the iommu for these and just do memcpy's without dealing with pages.
> > 
> > You still should be doing the kmap
> > 
> > Jason
> 
> Does this have to do with 32 bit machines? I have always tested on 64 bit machines and
> dma mr's are always already mapped.

Originally, but people are making 64 bit machines require kmap for
security purposes

Jason
Bob Pearson Dec. 1, 2022, 3:04 p.m. UTC | #10
On 11/30/22 18:41, Jason Gunthorpe wrote:
> On Wed, Nov 30, 2022 at 06:36:56PM -0600, Bob Pearson wrote:
>> I'm not looking at my patch you responded to but the one you posted to replace maps
>> by xarrays.
> 
> I see, I botched that part
> 
>> The existing rxe driver assumes that if ibmr->type == IB_MR_TYPE_DMA
>> that the iova is just a kernel (virtual) address that is already
>> mapped.
> 
> No, it is not correct
> 
>> Maybe this is not correct but it has always worked this way. These
>> are heavily used by storage stacks (e.g. Lustre) which always use
>> DMA mr's. Since we don't actually do any DMAs we don't need to setup
>> the iommu for these and just do memcpy's without dealing with pages.
> 
> You still should be doing the kmap
> 
> Jason

Something was disconnected in my memory. So I went back and looked at lustre.
Turns out it never uses IB_MR_TYPE_DMA and for that matter I can't find any
use cases in the rdma tree or online. So, the implementation in rxe has almost
certainly never been used.

So I need to choose to 'fix' the current implementation or just delete type dma support.
I get the idea that I need to convert the iova to a page and kmap it, but I'm not
clear how to do that. This 64-bit number (iova) needs to be converted to a struct page *.
Without a use case to look at I don't know how to interpret it. Apparently it's not a
virtual address.

Bob
Bob Pearson Dec. 1, 2022, 3:16 p.m. UTC | #11
On 12/1/22 09:04, Bob Pearson wrote:
> On 11/30/22 18:41, Jason Gunthorpe wrote:
>> On Wed, Nov 30, 2022 at 06:36:56PM -0600, Bob Pearson wrote:
>>> I'm not looking at my patch you responded to but the one you posted to replace maps
>>> by xarrays.
>>
>> I see, I botched that part
>>
>>> The existing rxe driver assumes that if ibmr->type == IB_MR_TYPE_DMA
>>> that the iova is just a kernel (virtual) address that is already
>>> mapped.
>>
>> No, it is not correct
>>
>>> Maybe this is not correct but it has always worked this way. These
>>> are heavily used by storage stacks (e.g. Lustre) which always use
>>> DMA mr's. Since we don't actually do any DMAs we don't need to setup
>>> the iommu for these and just do memcpy's without dealing with pages.
>>
>> You still should be doing the kmap
>>
>> Jason
> 
> Something was disconnected in my memory. So I went back and looked at lustre.
> Turns out it never uses IB_MR_TYPE_DMA and for that matter I can't find any
> use cases in the rdma tree or online. So, the implementation in rxe has almost
> certainly never been used.
> 
> So I need to choose to 'fix' the current implementation or just delete type dma support.
> I get the idea that I need to convert the iova to a page and kmap it, but I'm not
> clear how to do that. This 64-bit number (iova) needs to be converted to a struct page *.
> Without a use case to look at I don't know how to interpret it. Apparently it's not a
> virtual address.
> 
> Bob
> 

I did find a single use case for the mr created during alloc_pd. The comments seem
to imply that the use is just access to local kernel memory with va=pa. So I am back
to my previous thoughts. Memcpy should just work.

Bob
Bob Pearson Dec. 1, 2022, 3:38 p.m. UTC | #12
On 12/1/22 09:16, Bob Pearson wrote:
> On 12/1/22 09:04, Bob Pearson wrote:
>> On 11/30/22 18:41, Jason Gunthorpe wrote:
>>> On Wed, Nov 30, 2022 at 06:36:56PM -0600, Bob Pearson wrote:
>>>> I'm not looking at my patch you responded to but the one you posted to replace maps
>>>> by xarrays.
>>>
>>> I see, I botched that part
>>>
>>>> The existing rxe driver assumes that if ibmr->type == IB_MR_TYPE_DMA
>>>> that the iova is just a kernel (virtual) address that is already
>>>> mapped.
>>>
>>> No, it is not correct
>>>
>>>> Maybe this is not correct but it has always worked this way. These
>>>> are heavily used by storage stacks (e.g. Lustre) which always use
>>>> DMA mr's. Since we don't actually do any DMAs we don't need to setup
>>>> the iommu for these and just do memcpy's without dealing with pages.
>>>
>>> You still should be doing the kmap
>>>
>>> Jason
>>
>> Something was disconnected in my memory. So I went back and looked at lustre.
>> Turns out it never uses IB_MR_TYPE_DMA and for that matter I can't find any
>> use cases in the rdma tree or online. So, the implementation in rxe has almost
>> certainly never been used.
>>
>> So I need to choose to 'fix' the current implementation or just delete type dma support.
>> I get the idea that I need to convert the iova to a page and kmap it, but I'm not
>> clear how to do that. This 64-bit number (iova) needs to be converted to a struct page *.
>> Without a use case to look at I don't know how to interpret it. Apparently it's not a
>> virtual address.
>>
>> Bob
>>
> 
> I did find a single use case for the mr created during alloc_pd. The comments seem
> to imply that the use is just access to local kernel memory with va=pa. So I am back
> to my previous thoughts. Memcpy should just work.
> 
> Bob

Further, looking at ipoib as an example, it builds sge lists using the lkey from get_dma_mr()
and sets the sge->addr to a kernel virtual memory address after previously calling
ib_dma_map_single() so the addresses are mapped for dma access and visible before use.
They are unmapped after the read/write operation completes. What is the point of kmapping
the address after dma mapping them?
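
The pattern being described is roughly the following (a sketch, not ipoib's
actual code; 'dev', 'pd', 'buf' and 'len' are assumed to exist in the caller):

	u64 dma_addr;
	struct ib_sge sge;

	dma_addr = ib_dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (ib_dma_mapping_error(dev, dma_addr))
		return -ENOMEM;

	sge.addr = dma_addr;
	sge.length = len;
	sge.lkey = pd->local_dma_lkey;
	/* build and post the work request with this sge, then
	 * ib_dma_unmap_single() once the completion arrives
	 */

For a software device like rxe the dma mapping is a no-op and the returned
address is just the kernel virtual address, which is what the question above
turns on.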

Bob
Jason Gunthorpe Dec. 1, 2022, 3:39 p.m. UTC | #13
On Thu, Dec 01, 2022 at 09:38:10AM -0600, Bob Pearson wrote:

> Further, looking at ipoib as an example, it builds sge lists using the lkey from get_dma_mr()
> and sets the sge->addr to a kernel virtual memory address after previously calling
> ib_dma_map_single() so the addresses are mapped for dma access and visible before use.
> They are unmapped after the read/write operation completes. What is the point of kmapping
> the address after dma mapping them?

Because not everything is ipoib, and things like block will map sgls
with struct pages, not kva.

Jason
Bob Pearson Dec. 1, 2022, 5:11 p.m. UTC | #14
On 12/1/22 09:39, Jason Gunthorpe wrote:
> On Thu, Dec 01, 2022 at 09:38:10AM -0600, Bob Pearson wrote:
> 
>> Further, looking at ipoib as an example, it builds sge lists using the lkey from get_dma_mr()
>> and sets the sge->addr to a kernel virtual memory address after previously calling
>> ib_dma_map_single() so the addresses are mapped for dma access and visible before use.
>> They are unmapped after the read/write operation completes. What is the point of kmapping
>> the address after dma mapping them?
> 
> Because not everything is ipoib, and things like block will map sgls
> with struct pages, not kva.
> 
> Jason

OK it's working now but there is a bug in your rxe_mr_fill_pages_from_sgt() routine.
You have a

	if (xas_xa_index && WARN_ON(sg_iter.sg_pgoffset % PAGE_SIZE)) {...}

which seems to assume that sg_pgoffset contains the byte offset in the current page.
But looking at __sg_page_iter_next() it appears that it is the number of pages offset
in the current sg entry which results in a splat when I run ib_send_bw.
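
For comparison, a sketch of how such an SGL is usually walked page by page
('sgt' is assumed to be the MR's struct sg_table); sg_page_iter_page() already
folds sg_pgoffset in as a page count within the current entry:

	struct sg_page_iter sg_iter;

	for_each_sgtable_page(sgt, &sg_iter, 0) {
		struct page *page = sg_page_iter_page(&sg_iter);

		/* store 'page' into the MR's xarray here */
	}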

Bob
Jason Gunthorpe Dec. 1, 2022, 6 p.m. UTC | #15
On Thu, Dec 01, 2022 at 11:11:10AM -0600, Bob Pearson wrote:
> On 12/1/22 09:39, Jason Gunthorpe wrote:
> > On Thu, Dec 01, 2022 at 09:38:10AM -0600, Bob Pearson wrote:
> > 
> >> Further, looking at ipoib as an example, it builds sge lists using the lkey from get_dma_mr()
> >> and sets the sge->addr to a kernel virtual memory address after previously calling
> >> ib_dma_map_single() so the addresses are mapped for dma access and visible before use.
> >> They are unmapped after the read/write operation completes. What is the point of kmapping
> >> the address after dma mapping them?
> > 
> > Because not everything is ipoib, and things like block will map sgls
> > with struct pages, not kva.
> > 
> > Jason
> 
> OK it's working now but there is a bug in your rxe_mr_fill_pages_from_sgt() routine.
> You have a
> 
> 	if (xas_xa_index && WARN_ON(sg_iter.sg_pgoffset % PAGE_SIZE)) {...}
> 
> which seems to assume that sg_pgoffset contains the byte offset in the current page.
> But looking at __sg_page_iter_next() it appears that it is the number of pages offset
> in the current sg entry which results in a splat when I run
> ib_send_bw.

The pgoffset depends on the creator of the SGL; I guess something must
be pointing the SGL at a folio and using the sg_pgoffset as a sub-folio
index?

If so, then the calculation of the kmap has to be adjusted to extract
the number of pages from the sg_pgoffset.

Jason

Patch

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index ff803a957ac1..81a611778d44 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -68,6 +68,8 @@  void rxe_mr_init_dma(int access, struct rxe_mr *mr);
 int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr);
 int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr);
+int rxe_add_frag(struct sk_buff *skb, struct rxe_phys_buf *buf,
+		 int length, int offset);
 int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		enum rxe_mr_copy_op op);
 int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 60a8034f1416..2dcf37f32330 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -286,6 +286,40 @@  void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length)
 	return addr;
 }
 
+/**
+ * rxe_add_frag() - Add a frag to a nonlinear packet
+ * @skb: The packet buffer
+ * @buf: Kernel buffer info
+ * @length: Length of fragment
+ * @offset: Offset of fragment in buf
+ *
+ * Returns: 0 on success else a negative errno
+ */
+int rxe_add_frag(struct sk_buff *skb, struct rxe_phys_buf *buf,
+		 int length, int offset)
+{
+	int nr_frags = skb_shinfo(skb)->nr_frags;
+	skb_frag_t *frag = &skb_shinfo(skb)->frags[nr_frags];
+
+	if (nr_frags >= MAX_SKB_FRAGS) {
+		pr_debug("%s: nr_frags (%d) >= MAX_SKB_FRAGS\n",
+			__func__, nr_frags);
+		return -EINVAL;
+	}
+
+	frag->bv_len = length;
+	frag->bv_offset = offset;
+	frag->bv_page = virt_to_page(buf->addr);
+	/* because kfree_skb will call put_page() */
+	get_page(frag->bv_page);
+	skb_shinfo(skb)->nr_frags++;
+
+	skb->data_len += length;
+	skb->len += length;
+
+	return 0;
+}
+
 /* copy data from a range (vaddr, vaddr+length-1) to or from
  * a mr object starting at iova.
  */