diff mbox series

[net-next,v4,4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag

Message ID 20230612130256.4572-5-linyunsheng@huawei.com (mailing list archive)
State Not Applicable
Headers show
Series [net-next,v4,1/5] page_pool: frag API support for 32-bit arch with 64-bit DMA | expand

Commit Message

Yunsheng Lin June 12, 2023, 1:02 p.m. UTC
PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count
handling is unified and page_pool_alloc_frag() is supported
in 32-bit arch with 64-bit DMA, so remove it.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c          | 3 +--
 drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c        | 2 +-
 drivers/net/wireless/mediatek/mt76/mac80211.c            | 2 +-
 include/net/page_pool.h                                  | 7 ++-----
 net/core/skbuff.c                                        | 2 +-
 6 files changed, 7 insertions(+), 11 deletions(-)

Comments

Jakub Kicinski June 14, 2023, 5:19 p.m. UTC | #1
On Mon, 12 Jun 2023 21:02:55 +0800 Yunsheng Lin wrote:
>  	struct page_pool_params pp_params = {
> -		.flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
> -				PP_FLAG_DMA_SYNC_DEV,
> +		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>  		.order = hns3_page_order(ring),

Does hns3_page_order() set a good example for the users?

static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
{
#if (PAGE_SIZE < 8192)
	if (ring->buf_size > (PAGE_SIZE / 2))
		return 1;
#endif
	return 0;
}

Why allocate order 1 pages for buffers which would fit in a single page?
I feel like this soft of heuristic should be built into the API itself.
Yunsheng Lin June 15, 2023, 7:17 a.m. UTC | #2
On 2023/6/15 1:19, Jakub Kicinski wrote:
> On Mon, 12 Jun 2023 21:02:55 +0800 Yunsheng Lin wrote:
>>  	struct page_pool_params pp_params = {
>> -		.flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
>> -				PP_FLAG_DMA_SYNC_DEV,
>> +		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>  		.order = hns3_page_order(ring),
> 
> Does hns3_page_order() set a good example for the users?
> 
> static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
> {
> #if (PAGE_SIZE < 8192)
> 	if (ring->buf_size > (PAGE_SIZE / 2))
> 		return 1;
> #endif
> 	return 0;
> }
> 
> Why allocate order 1 pages for buffers which would fit in a single page?
> I feel like this soft of heuristic should be built into the API itself.

hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes
4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size
with 4096 bytes and system page size with 4K, as hns3 driver still support the
per-desc ping-pong way of page splitting when page_pool_enabled is false.

With page pool enabled, you are right that order 0 pages is enough, and I am not
sure about the exact reason we use the some order as the ping-pong way of page
splitting now.
As 2048 bytes buf size seems to be the default one, and I has not heard any one
changing it. Also, it caculates the pool_size using something as below, so the
memory usage is almost the same for order 0 and order 1:

.pool_size = ring->desc_num * hns3_buf_size(ring) /
		(PAGE_SIZE << hns3_page_order(ring)),

I am not sure it worth changing it, maybe just change it to set good example for
the users:) anyway I need to discuss this with other colleague internally and do
some testing before doing the change.

> .
>
Alexander Lobakin June 15, 2023, 1:59 p.m. UTC | #3
From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 14 Jun 2023 10:19:54 -0700

> On Mon, 12 Jun 2023 21:02:55 +0800 Yunsheng Lin wrote:
>>  	struct page_pool_params pp_params = {
>> -		.flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
>> -				PP_FLAG_DMA_SYNC_DEV,
>> +		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>  		.order = hns3_page_order(ring),
> 
> Does hns3_page_order() set a good example for the users?
> 
> static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
> {
> #if (PAGE_SIZE < 8192)
> 	if (ring->buf_size > (PAGE_SIZE / 2))
> 		return 1;
> #endif
> 	return 0;
> }

Oh lol, just what Intel drivers do. They don't have a pool to keep some
bunch of pages (they can recycle a page only within its buffer), so in
order to still recycle them, they allocate order-1 pages to be able to
flip the halves >_<

> 
> Why allocate order 1 pages for buffers which would fit in a single page?
> I feel like this soft of heuristic should be built into the API itself.

Offtop:

I tested this series with IAVF: very little perf regression* (almost
stddev) comparing to just 1-page-per-frame Page Pool series, but 21 Mb
less RAM taken comparing to both "old" PP series and baseline, nice :D

(+Cc David Christensen, he'll be glad to hear we're stopping eating 64Kb
 pages)

* this might be caused by that in the previous version I was hardcoding
truesize, but now it depends on what page_pool_alloc() returns. Same for
Rx offset: it was always 0 previously, as every frame was placed at the
start of page, now depends on how PP places** it.
With MTU of 1500 and no XDP, two frames fit into one 4k page. With XDP
on (increased headroom) or increased MTU, PP starts effectively do
1-frame-per-page with literally no changes in performance (increased RAM
usage obviously -- I mean, it gets restored to the baseline numbers).

** BTW, instead of 2048 + 2048, I'm getting 1920 + 2176. Maybe the stack
would be happier to see more consistent truesize for cache purposes.
I'll try to play with it.

Thanks,
Olek
Jakub Kicinski June 15, 2023, 4:51 p.m. UTC | #4
On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote:
> > Does hns3_page_order() set a good example for the users?
> > 
> > static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
> > {
> > #if (PAGE_SIZE < 8192)
> > 	if (ring->buf_size > (PAGE_SIZE / 2))
> > 		return 1;
> > #endif
> > 	return 0;
> > }
> > 
> > Why allocate order 1 pages for buffers which would fit in a single page?
> > I feel like this soft of heuristic should be built into the API itself.  
> 
> hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes
> 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size
> with 4096 bytes and system page size with 4K, as hns3 driver still support the
> per-desc ping-pong way of page splitting when page_pool_enabled is false.
> 
> With page pool enabled, you are right that order 0 pages is enough, and I am not
> sure about the exact reason we use the some order as the ping-pong way of page
> splitting now.
> As 2048 bytes buf size seems to be the default one, and I has not heard any one
> changing it. Also, it caculates the pool_size using something as below, so the
> memory usage is almost the same for order 0 and order 1:
> 
> .pool_size = ring->desc_num * hns3_buf_size(ring) /
> 		(PAGE_SIZE << hns3_page_order(ring)),
> 
> I am not sure it worth changing it, maybe just change it to set good example for
> the users:) anyway I need to discuss this with other colleague internally and do
> some testing before doing the change.

Right, I think this may be a leftover from the page flipping mode of
operation. But AFAIU we should leave the recycling fully to the page
pool now. If we make any improvements try to make them at the page pool
level.

I like your patches as they isolate the drivers from having to make the
fragmentation decisions based on the system page size (4k vs 64k but
we're hearing more and more about ARM w/ 16k pages). For that use case
this is great. 

What we don't want is drivers to start requesting larger page sizes
because it looks good in iperf on a freshly booted, idle system :(
Alexander Duyck June 15, 2023, 6:26 p.m. UTC | #5
On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote:
> > > Does hns3_page_order() set a good example for the users?
> > >
> > > static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
> > > {
> > > #if (PAGE_SIZE < 8192)
> > >     if (ring->buf_size > (PAGE_SIZE / 2))
> > >             return 1;
> > > #endif
> > >     return 0;
> > > }
> > >
> > > Why allocate order 1 pages for buffers which would fit in a single page?
> > > I feel like this soft of heuristic should be built into the API itself.
> >
> > hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes
> > 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size
> > with 4096 bytes and system page size with 4K, as hns3 driver still support the
> > per-desc ping-pong way of page splitting when page_pool_enabled is false.
> >
> > With page pool enabled, you are right that order 0 pages is enough, and I am not
> > sure about the exact reason we use the some order as the ping-pong way of page
> > splitting now.
> > As 2048 bytes buf size seems to be the default one, and I has not heard any one
> > changing it. Also, it caculates the pool_size using something as below, so the
> > memory usage is almost the same for order 0 and order 1:
> >
> > .pool_size = ring->desc_num * hns3_buf_size(ring) /
> >               (PAGE_SIZE << hns3_page_order(ring)),
> >
> > I am not sure it worth changing it, maybe just change it to set good example for
> > the users:) anyway I need to discuss this with other colleague internally and do
> > some testing before doing the change.
>
> Right, I think this may be a leftover from the page flipping mode of
> operation. But AFAIU we should leave the recycling fully to the page
> pool now. If we make any improvements try to make them at the page pool
> level.
>
> I like your patches as they isolate the drivers from having to make the
> fragmentation decisions based on the system page size (4k vs 64k but
> we're hearing more and more about ARM w/ 16k pages). For that use case
> this is great.
>
> What we don't want is drivers to start requesting larger page sizes
> because it looks good in iperf on a freshly booted, idle system :(

Actually that would be a really good direction for this patch set to
look at going into. Rather than having us always allocate a "page" it
would make sense for most drivers to allocate a 4K fragment or the
like in the case that the base page size is larger than 4K. That might
be a good use case to justify doing away with the standard page pool
page and look at making them all fragmented.

In the case of the standard page size being 4K a standard page would
just have to take on the CPU overhead of the atomic_set and
atomic_read for pp_ref_count (new name) which should be minimal as on
most sane systems those just end up being a memory write and read.
Yunsheng Lin June 16, 2023, 12:20 p.m. UTC | #6
On 2023/6/16 2:26, Alexander Duyck wrote:
> On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote:
>>>> Does hns3_page_order() set a good example for the users?
>>>>
>>>> static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
>>>> {
>>>> #if (PAGE_SIZE < 8192)
>>>>     if (ring->buf_size > (PAGE_SIZE / 2))
>>>>             return 1;
>>>> #endif
>>>>     return 0;
>>>> }
>>>>
>>>> Why allocate order 1 pages for buffers which would fit in a single page?
>>>> I feel like this soft of heuristic should be built into the API itself.
>>>
>>> hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes
>>> 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size
>>> with 4096 bytes and system page size with 4K, as hns3 driver still support the
>>> per-desc ping-pong way of page splitting when page_pool_enabled is false.
>>>
>>> With page pool enabled, you are right that order 0 pages is enough, and I am not
>>> sure about the exact reason we use the some order as the ping-pong way of page
>>> splitting now.
>>> As 2048 bytes buf size seems to be the default one, and I has not heard any one
>>> changing it. Also, it caculates the pool_size using something as below, so the
>>> memory usage is almost the same for order 0 and order 1:
>>>
>>> .pool_size = ring->desc_num * hns3_buf_size(ring) /
>>>               (PAGE_SIZE << hns3_page_order(ring)),
>>>
>>> I am not sure it worth changing it, maybe just change it to set good example for
>>> the users:) anyway I need to discuss this with other colleague internally and do
>>> some testing before doing the change.
>>
>> Right, I think this may be a leftover from the page flipping mode of
>> operation. But AFAIU we should leave the recycling fully to the page
>> pool now. If we make any improvements try to make them at the page pool
>> level.

I checked, the per-desc buf with 4096 bytes for hnse does not seem to
be used mainly because of the larger memory usage you mentioned below.

>>
>> I like your patches as they isolate the drivers from having to make the
>> fragmentation decisions based on the system page size (4k vs 64k but
>> we're hearing more and more about ARM w/ 16k pages). For that use case
>> this is great.

Yes, That is my point. For hw case, the page splitting in page pool is
mainly to enble multi-descs to use the same page as my understanding.

>>
>> What we don't want is drivers to start requesting larger page sizes
>> because it looks good in iperf on a freshly booted, idle system :(
> 
> Actually that would be a really good direction for this patch set to
> look at going into. Rather than having us always allocate a "page" it
> would make sense for most drivers to allocate a 4K fragment or the
> like in the case that the base page size is larger than 4K. That might
> be a good use case to justify doing away with the standard page pool
> page and look at making them all fragmented.

I am not sure if I understand the above, isn't the frag API able to
support allocating a 4K fragment when base page size is larger than
4K before or after this patch? what more do we need to do?

> 
> In the case of the standard page size being 4K a standard page would
> just have to take on the CPU overhead of the atomic_set and
> atomic_read for pp_ref_count (new name) which should be minimal as on
> most sane systems those just end up being a memory write and read.

If I understand you correctly, I think what you are trying to do
may break some of Jesper' benchmarking:)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

> .
>
Alexander Duyck June 16, 2023, 3:01 p.m. UTC | #7
On Fri, Jun 16, 2023 at 5:21 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/16 2:26, Alexander Duyck wrote:
> > On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote:
> >>>> Does hns3_page_order() set a good example for the users?
> >>>>
> >>>> static inline unsigned int hns3_page_order(struct hns3_enet_ring *ring)
> >>>> {
> >>>> #if (PAGE_SIZE < 8192)
> >>>>     if (ring->buf_size > (PAGE_SIZE / 2))
> >>>>             return 1;
> >>>> #endif
> >>>>     return 0;
> >>>> }
> >>>>
> >>>> Why allocate order 1 pages for buffers which would fit in a single page?
> >>>> I feel like this soft of heuristic should be built into the API itself.
> >>>
> >>> hns3 only support fixed buf size per desc by 512 byte, 1024 bytes, 2048 bytes
> >>> 4096 bytes, see hns3_buf_size2type(), I think the order 1 pages is for buf size
> >>> with 4096 bytes and system page size with 4K, as hns3 driver still support the
> >>> per-desc ping-pong way of page splitting when page_pool_enabled is false.
> >>>
> >>> With page pool enabled, you are right that order 0 pages is enough, and I am not
> >>> sure about the exact reason we use the some order as the ping-pong way of page
> >>> splitting now.
> >>> As 2048 bytes buf size seems to be the default one, and I has not heard any one
> >>> changing it. Also, it caculates the pool_size using something as below, so the
> >>> memory usage is almost the same for order 0 and order 1:
> >>>
> >>> .pool_size = ring->desc_num * hns3_buf_size(ring) /
> >>>               (PAGE_SIZE << hns3_page_order(ring)),
> >>>
> >>> I am not sure it worth changing it, maybe just change it to set good example for
> >>> the users:) anyway I need to discuss this with other colleague internally and do
> >>> some testing before doing the change.
> >>
> >> Right, I think this may be a leftover from the page flipping mode of
> >> operation. But AFAIU we should leave the recycling fully to the page
> >> pool now. If we make any improvements try to make them at the page pool
> >> level.
>
> I checked, the per-desc buf with 4096 bytes for hnse does not seem to
> be used mainly because of the larger memory usage you mentioned below.
>
> >>
> >> I like your patches as they isolate the drivers from having to make the
> >> fragmentation decisions based on the system page size (4k vs 64k but
> >> we're hearing more and more about ARM w/ 16k pages). For that use case
> >> this is great.
>
> Yes, That is my point. For hw case, the page splitting in page pool is
> mainly to enble multi-descs to use the same page as my understanding.
>
> >>
> >> What we don't want is drivers to start requesting larger page sizes
> >> because it looks good in iperf on a freshly booted, idle system :(
> >
> > Actually that would be a really good direction for this patch set to
> > look at going into. Rather than having us always allocate a "page" it
> > would make sense for most drivers to allocate a 4K fragment or the
> > like in the case that the base page size is larger than 4K. That might
> > be a good use case to justify doing away with the standard page pool
> > page and look at making them all fragmented.
>
> I am not sure if I understand the above, isn't the frag API able to
> support allocating a 4K fragment when base page size is larger than
> 4K before or after this patch? what more do we need to do?

I'm not talking about the frag API. I am talking about the
non-fragmented case. Right now standard page_pool will allocate an
order 0 page. So if a driver is using just pages expecting 4K pages
that isn't true on these ARM or PowerPC systems where the page size is
larger than 4K.

For a bit of historical reference on igb/ixgbe they had a known issue
where they would potentially run a system out of memory when page size
was larger than 4K. I had originally implemented things with just the
refcounting hack and at the time it worked great on systems with 4K
pages. However on a PowerPC it would trigger OOM errors because they
could run with 64K pages. To fix that I started adding all the
PAGE_SIZE checks in the driver and moved over to a striping model for
those that would free the page when it reached the end in order to
force it to free the page and make better use of the available memory.

> >
> > In the case of the standard page size being 4K a standard page would
> > just have to take on the CPU overhead of the atomic_set and
> > atomic_read for pp_ref_count (new name) which should be minimal as on
> > most sane systems those just end up being a memory write and read.
>
> If I understand you correctly, I think what you are trying to do
> may break some of Jesper' benchmarking:)
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

So? If it breaks an out-of-tree benchmark the benchmark can always be
fixed. The point is enabling a use case that can add value across the
board instead of trying to force the community to support a niche use
case.

Ideally we should get away from using the pages directly for most
cases in page pool. In my mind the page pool should start operating
more like __get_free_pages where what you get is a virtual address
instead of the actual page. That way we could start abstracting it
away and eventually get to something more like a true page_pool api
instead of what feels like a set of add-ons for the page allocator.
Although at the end of the day this still feels more like we are just
reimplementing slab so it is hard for me to say this is necessarily
the best solution either.
Jesper Dangaard Brouer June 16, 2023, 6:59 p.m. UTC | #8
On 16/06/2023 17.01, Alexander Duyck wrote:
> On Fri, Jun 16, 2023 at 5:21 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2023/6/16 2:26, Alexander Duyck wrote:
>>> On Thu, Jun 15, 2023 at 9:51 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>
>>>> On Thu, 15 Jun 2023 15:17:39 +0800 Yunsheng Lin wrote:
[...]
>>>>
>>>> I like your patches as they isolate the drivers from having to make the
>>>> fragmentation decisions based on the system page size (4k vs 64k but
>>>> we're hearing more and more about ARM w/ 16k pages). For that use case
>>>> this is great.

+1

[...]
>>>
>>> In the case of the standard page size being 4K a standard page would
>>> just have to take on the CPU overhead of the atomic_set and
>>> atomic_read for pp_ref_count (new name) which should be minimal as on
>>> most sane systems those just end up being a memory write and read.
>>
>> If I understand you correctly, I think what you are trying to do
>> may break some of Jesper' benchmarking:)
>>
>> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> 
> So? If it breaks an out-of-tree benchmark the benchmark can always be
> fixed. 

It doesn't matter if this is out-of-tree (I should have upstreamed it 
when AKPM asked me to.)

Point is don't break my page_pool fast-path!!! :-P

> The point is enabling a use case that can add value across the
> board instead of trying to force the community to support a niche use
> case.

I'm all for creating a new API, lets call it netmem, that takes care of 
this use-case.
I'm *not* okay with this new API slowing down the page_pool fast-path.

Why not multiplex on a MEM_TYPE, like XDP_MEM_TYPE is prepared for?!?
Meaning the caller can choose which is the correct API call.
(thus, we can stay away from adding code to fast-path case)

See below, copy-paste of code that shows what I mean by multiplex on a 
MEM_TYPE.

> 
> Ideally we should get away from using the pages directly for most
> cases in page pool. In my mind the page pool should start operating
> more like __get_free_pages where what you get is a virtual address
> instead of the actual page. That way we could start abstracting it
> away and eventually get to something more like a true page_pool api
> instead of what feels like a set of add-ons for the page allocator.

Yes, I agree with Alex Duyck here.
Like when I looked at veth proposed changes, it also felt like a virtual 
address would be better than a page.

  addr = netmem_alloc(rq->page_pool, &truesize);

> Although at the end of the day this still feels more like we are just
> reimplementing slab so it is hard for me to say this is necessarily
> the best solution either.

Yes, we have to be careful not to re-implement the MM layer in network 
land ;-)

(below code copy-paste broke whitespaces)

$ git show
commit fe38c642d629f8361f76b25aa8732e5e331d0925 (HEAD -> pp_rm_workqueue04)
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Fri Jun 16 20:54:08 2023 +0200

     page_pool: code examplifying multiplexing on mem_type

     Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

diff --git a/include/net/xdp.h b/include/net/xdp.h
index d1c5381fc95f..c02ac82a1d79 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -42,6 +42,7 @@ enum xdp_mem_type {
         MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
         MEM_TYPE_PAGE_POOL,
         MEM_TYPE_XSK_BUFF_POOL,
+       MEM_TYPE_PP_NETMEM,
         MEM_TYPE_MAX,
  };

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d03448a4c411..68be76efef00 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -353,7 +353,7 @@ static void page_pool_set_pp_info(struct page_pool 
*pool,
                                   struct page *page)
  {
         page->pp = pool;
-       page->pp_magic |= PP_SIGNATURE;
+       page->pp_magic |= PP_SIGNATURE | (MEM_TYPE_PAGE_POOL << 8);
         if (pool->p.init_callback)
                 pool->p.init_callback(page, pool->p.init_arg);
  }
@@ -981,6 +981,7 @@ bool page_pool_return_skb_page(struct page *page, 
bool napi_safe)
         struct napi_struct *napi;
         struct page_pool *pp;
         bool allow_direct;
+       int mem_type;

         page = compound_head(page);

@@ -991,9 +992,10 @@ bool page_pool_return_skb_page(struct page *page, 
bool napi_safe)
          * and page_is_pfmemalloc() is checked in __page_pool_put_page()
          * to avoid recycling the pfmemalloc page.
          */
-       if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+       if (unlikely((page->pp_magic & ~0xF03UL) != PP_SIGNATURE))
                 return false;

+       mem_type = (page->pp_magic & 0xF00) >> 8;
         pp = page->pp;

         /* Allow direct recycle if we have reasons to believe that we are
@@ -1009,7 +1011,10 @@ bool page_pool_return_skb_page(struct page *page, 
bool napi_safe)
          * The page will be returned to the pool here regardless of the
          * 'flipped' fragment being in use or not.
          */
-       page_pool_put_full_page(pp, page, allow_direct);
+       if (mem_type == MEM_TYPE_PP_NETMEM)
+               pp_netmem_put_page(pp, page, allow_direct);
+       else
+               page_pool_put_full_page(pp, page, allow_direct);

         return true;
  }
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 41e5ca8643ec..dc4bfbe8f002 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -380,6 +380,11 @@ void __xdp_return(void *data, struct xdp_mem_info 
*mem, bool napi_direct,
         struct page *page;

         switch (mem->type) {
+       case MEM_TYPE_PP_NETMEM:
+               if (napi_direct && xdp_return_frame_no_direct())
+                       napi_direct = false;
+               pp_netmem_put(page->pp, data, napi_direct);
+               break;
         case MEM_TYPE_PAGE_POOL:
                 page = virt_to_head_page(data);
                 if (napi_direct && xdp_return_frame_no_direct())
Jakub Kicinski June 16, 2023, 7:21 p.m. UTC | #9
On Fri, 16 Jun 2023 20:59:12 +0200 Jesper Dangaard Brouer wrote:
> +       if (mem_type == MEM_TYPE_PP_NETMEM)
> +               pp_netmem_put_page(pp, page, allow_direct);
> +       else
> +               page_pool_put_full_page(pp, page, allow_direct);

Interesting, what is the netmem type? I was thinking about extending
page pool for other mem providers and what came to mind was either
optionally replacing the free / alloc with a function pointer:

https://github.com/torvalds/linux/commit/578ebda5607781c0abb26c1feae7ec8b83840768

or wrapping the PP calls with static inlines which can direct to 
a different implementation completely (like zctap / io_uring zc).

Former is better for huge pages, latter is better for IO mem
(peer-to-peer DMA). I wonder if you have different use case which
requires a different model :(
Jesper Dangaard Brouer June 16, 2023, 8:42 p.m. UTC | #10
On 16/06/2023 21.21, Jakub Kicinski wrote:
> On Fri, 16 Jun 2023 20:59:12 +0200 Jesper Dangaard Brouer wrote:
>> +       if (mem_type == MEM_TYPE_PP_NETMEM)
>> +               pp_netmem_put_page(pp, page, allow_direct);
>> +       else
>> +               page_pool_put_full_page(pp, page, allow_direct);
> 
> Interesting, what is the netmem type? I was thinking about extending
> page pool for other mem providers and what came to mind was either
> optionally replacing the free / alloc with a function pointer:
> 
> https://github.com/torvalds/linux/commit/578ebda5607781c0abb26c1feae7ec8b83840768
> 
> or wrapping the PP calls with static inlines which can direct to
> a different implementation completely (like zctap / io_uring zc).
> 

I *LOVE* this idea!!!
It have been my master plan since day-1 to have other mem providers.
Notice how ZC xsk/AF_XDP have it's own memory allocator implementation.

The page_pool was never meant to be the final and best solution, I want
to see other, better and faster solutions competing with page_pool and
maybe some day replacing page_pool (I even see it as a success if PP get
depreciated and remove from the kernel due to a better solution).

See[1] how net/core/xdp.c simply have a switch statement
(is fast, because ASM wise it becomes a jump table):

  [1] 
https://github.com/torvalds/linux/blob/v6.4-rc6/net/core/xdp.c#L382-L402

> Former is better for huge pages, latter is better for IO mem
> (peer-to-peer DMA). I wonder if you have different use case which
> requires a different model :(
> 

I want for the network stack SKBs (and XDP) to support different memory
types for the "head" frame and "data-frags". Eric have described this
idea before, that hardware will do header-split, and we/he can get TCP
data part is another page/frag, making it faster for TCP-streams, but
this can be used for much more.

My proposed use-cases involves more that TCP.  We can easily imagine
NVMe protocol header-split, and the data-frag could be a mem_type that
actually belongs to the harddisk (maybe CPU cannot even read this).  The
same scenario goes for GPU memory, which is for the AI use-case.  IIRC
then Jonathan have previously send patches for the GPU use-case.

I really hope we can work in this direction together,
--Jesper
Yunsheng Lin June 17, 2023, 12:19 p.m. UTC | #11
On 2023/6/16 23:01, Alexander Duyck wrote:

...

>>> Actually that would be a really good direction for this patch set to
>>> look at going into. Rather than having us always allocate a "page" it
>>> would make sense for most drivers to allocate a 4K fragment or the
>>> like in the case that the base page size is larger than 4K. That might
>>> be a good use case to justify doing away with the standard page pool
>>> page and look at making them all fragmented.
>>
>> I am not sure if I understand the above, isn't the frag API able to
>> support allocating a 4K fragment when base page size is larger than
>> 4K before or after this patch? what more do we need to do?
> 
> I'm not talking about the frag API. I am talking about the
> non-fragmented case. Right now standard page_pool will allocate an
> order 0 page. So if a driver is using just pages expecting 4K pages
> that isn't true on these ARM or PowerPC systems where the page size is
> larger than 4K.
> 
> For a bit of historical reference on igb/ixgbe they had a known issue
> where they would potentially run a system out of memory when page size
> was larger than 4K. I had originally implemented things with just the
> refcounting hack and at the time it worked great on systems with 4K
> pages. However on a PowerPC it would trigger OOM errors because they
> could run with 64K pages. To fix that I started adding all the
> PAGE_SIZE checks in the driver and moved over to a striping model for
> those that would free the page when it reached the end in order to
> force it to free the page and make better use of the available memory.

Isn't the page_pool_alloc() or page_pool_alloc_frag() API also solve
the above problem?
I think what you really want is another layer of subdividing support
in the driver on top of the above, right?
Jakub Kicinski June 19, 2023, 6:07 p.m. UTC | #12
On Fri, 16 Jun 2023 22:42:35 +0200 Jesper Dangaard Brouer wrote:
> > Former is better for huge pages, latter is better for IO mem
> > (peer-to-peer DMA). I wonder if you have different use case which
> > requires a different model :(
> 
> I want for the network stack SKBs (and XDP) to support different memory
> types for the "head" frame and "data-frags". Eric have described this
> idea before, that hardware will do header-split, and we/he can get TCP
> data part is another page/frag, making it faster for TCP-streams, but
> this can be used for much more.
> 
> My proposed use-cases involves more that TCP.  We can easily imagine
> NVMe protocol header-split, and the data-frag could be a mem_type that
> actually belongs to the harddisk (maybe CPU cannot even read this).  The
> same scenario goes for GPU memory, which is for the AI use-case.  IIRC
> then Jonathan have previously send patches for the GPU use-case.
> 
> I really hope we can work in this direction together,

Perfect, that's also the use case I had in mind. The huge page thing
was just a quick thing to implement as a PoC (although useful in its
own right, one day I'll find the time to finish it, sigh).

That said I couldn't convince myself that for a peer-to-peer setup we
have enough space in struct page to store all the information we need.
Or that we'd get a struct page at all, and not just a region of memory
with no struct page * allocated :S

That'd require serious surgery on the page pool's fast paths to work
around.

I haven't dug into the details, tho. If you think we can use page pool
as a frontend for iouring and/or p2p memory that'd be awesome!

The workaround solution I had in mind would be to create a narrower API
for just data pages. Since we'd need to sprinkle ifs anyway, pull them
up close to the call site. Allowing to switch page pool for a
completely different implementation, like the one Jonathan coded up for
iouring. Basically

$name_alloc_page(queue)
{
	if (queue->pp)
		return page_pool_dev_alloc_pages(queue->pp);
	else if (queue->iouring..)
		...
}
Jesper Dangaard Brouer June 20, 2023, 3:12 p.m. UTC | #13
On 19/06/2023 20.07, Jakub Kicinski wrote:
> On Fri, 16 Jun 2023 22:42:35 +0200 Jesper Dangaard Brouer wrote:
>>> Former is better for huge pages, latter is better for IO mem
>>> (peer-to-peer DMA). I wonder if you have different use case which
>>> requires a different model :(
>>
>> I want for the network stack SKBs (and XDP) to support different memory
>> types for the "head" frame and "data-frags". Eric have described this
>> idea before, that hardware will do header-split, and we/he can get TCP
>> data part is another page/frag, making it faster for TCP-streams, but
>> this can be used for much more.
>>
>> My proposed use-cases involves more that TCP.  We can easily imagine
>> NVMe protocol header-split, and the data-frag could be a mem_type that
>> actually belongs to the harddisk (maybe CPU cannot even read this).  The
>> same scenario goes for GPU memory, which is for the AI use-case.  IIRC
>> then Jonathan have previously send patches for the GPU use-case.
>>
>> I really hope we can work in this direction together,
> 
> Perfect, that's also the use case I had in mind. The huge page thing
> was just a quick thing to implement as a PoC (although useful in its
> own right, one day I'll find the time to finish it, sigh).
> 
> That said I couldn't convince myself that for a peer-to-peer setup we
> have enough space in struct page to store all the information we need.
> Or that we'd get a struct page at all, and not just a region of memory
> with no struct page * allocated :S

Good with big ideas, but I think we should start smaller and evolve.

> 
> That'd require serious surgery on the page pool's fast paths to work
> around.
> 
> I haven't dug into the details, tho. If you think we can use page pool
> as a frontend for iouring and/or p2p memory that'd be awesome!
> 

Hmm... I don't like the sound of this.
My point is that we should create a more plug-able memory system for
netstack.  And NOT try to extend page_pool to cover all use-cases.

> The workaround solution I had in mind would be to create a narrower API
> for just data pages. Since we'd need to sprinkle ifs anyway, pull them
> up close to the call site. Allowing to switch page pool for a
> completely different implementation, like the one Jonathan coded up for
> iouring. Basically
> 
> $name_alloc_page(queue)
> {
> 	if (queue->pp)
> 		return page_pool_dev_alloc_pages(queue->pp);
> 	else if (queue->iouring..)
> 		...
> }

Yes, this is more the direction I'm thinking.
In many cases, you don't need this if-statement helper in the driver, as
driver RX side code will know the API used upfront.

The TX completion side will need this kind of multiplexing return
helper, to return the pages to the correct memory allocator type (e.g.
page_pool being one).  See concept in [1] __xdp_return().

Performance wise, function pointers are slow due to RETPOLINE, but
switch-case statements (below certain size) becomes a jump table, which
is fast.  See[1].

[1] https://elixir.bootlin.com/linux/v6.4-rc7/source/net/core/xdp.c#L377

Regarding room in "struct page", notice that page->pp_magic will have
plenty room for e.g. storing xdp_mem_type or even xdp_mem_info (which
also contains an ID).

--Jesper
Jakub Kicinski June 20, 2023, 3:39 p.m. UTC | #14
On Tue, 20 Jun 2023 17:12:41 +0200 Jesper Dangaard Brouer wrote:
> > The workaround solution I had in mind would be to create a narrower API
> > for just data pages. Since we'd need to sprinkle ifs anyway, pull them
> > up close to the call site. Allowing to switch page pool for a
> > completely different implementation, like the one Jonathan coded up for
> > iouring. Basically
> > 
> > $name_alloc_page(queue)
> > {
> > 	if (queue->pp)
> > 		return page_pool_dev_alloc_pages(queue->pp);
> > 	else if (queue->iouring..)
> > 		...
> > }  
> 
> Yes, this is more the direction I'm thinking.
> In many cases, you don't need this if-statement helper in the driver, as
> driver RX side code will know the API used upfront.

Meaning that the driver "knows" if it's in the XDP, AF_XDP, iouring 
or "normal" Rx path?  I hope we can avoid extra code in the driver
completely, for data pages.

> The TX completion side will need this kind of multiplexing return
> helper, to return the pages to the correct memory allocator type (e.g.
> page_pool being one).  See concept in [1] __xdp_return().
> 
> Performance wise, function pointers are slow due to RETPOLINE, but
> switch-case statements (below certain size) becomes a jump table, which
> is fast.  See[1].
> 
> [1] https://elixir.bootlin.com/linux/v6.4-rc7/source/net/core/xdp.c#L377

SG!

> Regarding room in "struct page", notice that page->pp_magic will have
> plenty room for e.g. storing xdp_mem_type or even xdp_mem_info (which
> also contains an ID).

I was worried about fitting the DMA address, if the pages code from user
space.
Mina Almasry June 30, 2023, 2:27 a.m. UTC | #15
On Mon, Jun 19, 2023 at 11:07 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 16 Jun 2023 22:42:35 +0200 Jesper Dangaard Brouer wrote:
> > > Former is better for huge pages, latter is better for IO mem
> > > (peer-to-peer DMA). I wonder if you have different use case which
> > > requires a different model :(
> >
> > I want for the network stack SKBs (and XDP) to support different memory
> > types for the "head" frame and "data-frags". Eric have described this
> > idea before, that hardware will do header-split, and we/he can get TCP
> > data part is another page/frag, making it faster for TCP-streams, but
> > this can be used for much more.
> >
> > My proposed use-cases involves more that TCP.  We can easily imagine
> > NVMe protocol header-split, and the data-frag could be a mem_type that
> > actually belongs to the harddisk (maybe CPU cannot even read this).  The
> > same scenario goes for GPU memory, which is for the AI use-case.  IIRC
> > then Jonathan have previously send patches for the GPU use-case.
> >
> > I really hope we can work in this direction together,
>
> Perfect, that's also the use case I had in mind. The huge page thing
> was just a quick thing to implement as a PoC (although useful in its
> own right, one day I'll find the time to finish it, sigh).
>
> That said I couldn't convince myself that for a peer-to-peer setup we
> have enough space in struct page to store all the information we need.
> Or that we'd get a struct page at all, and not just a region of memory
> with no struct page * allocated :S
>
> That'd require serious surgery on the page pool's fast paths to work
> around.
>
> I haven't dug into the details, tho. If you think we can use page pool
> as a frontend for iouring and/or p2p memory that'd be awesome!
>

Hello Jakub, I'm looking into device memory (peer-to-peer) networking
actually, and I plan to pursue using the page pool as a front end.

Quick description of what I have so far:
current implementation uses device memory with struct pages; I am
putting all those pages in a gen_pool, and we have written an
allocator that allocates pages from the gen_pool. In the driver, we
use this allocator instead of alloc_page() (the driver in question is
gve which currently doesn't use the page pool). When the driver is
done with the p2p page, it simply decrements the refcount on it and
the page is freed back to the gen_pool.

Test results are good; our best results we're able to achieve ~96%
line rate with incoming packets going straight to device memory and
without bouncing the memory to a host buffer (albeit these results are
on our slighly older, production LTS, I need to work on getting
results from linus/master).

I've discussed your page pool frontend idea with our gve owners and
the idea is attractive. In particular it would be good not to insert
much custom code into the driver to support device memory pages or
other page types. I plan on trying to change my approach to match the
page pool provider you have in progress here:
https://github.com/kuba-moo/linux/tree/pp-providers

In particular the main challenge right now seems to be that my device
memory pages are ZONE_DEVICE pages, which can't be inserted to the
page pool as-is due to the union in struct page between the page pool
entries and the ZONE_DEVICE entries. I have some ideas on how to work
around that I'm looking into.

It sounds like you don't have the time at the moment to work on the
page pool provider idea; I plan to try and get my code working with
that model and propose it if it's successful. Let me know if you have
concerns here.

> The workaround solution I had in mind would be to create a narrower API
> for just data pages. Since we'd need to sprinkle ifs anyway, pull them
> up close to the call site. Allowing to switch page pool for a
> completely different implementation, like the one Jonathan coded up for
> iouring. Basically
>
> $name_alloc_page(queue)
> {
>         if (queue->pp)
>                 return page_pool_dev_alloc_pages(queue->pp);
>         else if (queue->iouring..)
>                 ...
> }
>
David Ahern July 3, 2023, 4:20 a.m. UTC | #16
On 6/29/23 8:27 PM, Mina Almasry wrote:
> 
> Hello Jakub, I'm looking into device memory (peer-to-peer) networking
> actually, and I plan to pursue using the page pool as a front end.
> 
> Quick description of what I have so far:
> current implementation uses device memory with struct pages; I am
> putting all those pages in a gen_pool, and we have written an
> allocator that allocates pages from the gen_pool. In the driver, we
> use this allocator instead of alloc_page() (the driver in question is
> gve which currently doesn't use the page pool). When the driver is
> done with the p2p page, it simply decrements the refcount on it and
> the page is freed back to the gen_pool.

I take it these are typical Linux networking applications using standard
socket APIs (not dpdk or xdp sockets or such)? If so, what does tcpdump
show for those skbs with pages for the device memory?
Mina Almasry July 3, 2023, 6:22 a.m. UTC | #17
On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 6/29/23 8:27 PM, Mina Almasry wrote:
> >
> > Hello Jakub, I'm looking into device memory (peer-to-peer) networking
> > actually, and I plan to pursue using the page pool as a front end.
> >
> > Quick description of what I have so far:
> > current implementation uses device memory with struct pages; I am
> > putting all those pages in a gen_pool, and we have written an
> > allocator that allocates pages from the gen_pool. In the driver, we
> > use this allocator instead of alloc_page() (the driver in question is
> > gve which currently doesn't use the page pool). When the driver is
> > done with the p2p page, it simply decrements the refcount on it and
> > the page is freed back to the gen_pool.

Quick update here, I was able to get my implementation working with
the page pool as a front end with the memory provider API Jakub wrote
here:
https://github.com/kuba-moo/linux/tree/pp-providers

The main complication indeed was the fact that my device memory pages
are ZONE_DEVICE pages, which are incompatible with the page_pool due
to the union in struct page. I thought of a couple of approaches to
resolve that.

1. Make my device memory pages non-ZONE_DEVICE pages. The issue with
that is that if the page is not ZONE_DEVICE, put_page(page) will
attempt to free it to the buddy allocator I think, which is not
correct. The only places where the mm stack currently allow custom
freeing callback (AFAIK), are for ZONE_DEVICE page where
free_zone_device_page() will call the provided callback in
page->pgmap->ops->page_free, and compound pages where the
compound_dtor is specified. My device memory pages aren't compound
pages so only ZONE_DEVICE pages do what I want.

2. Convert the pages from ZONE_DEVICE pages to page_pool pages and
vice versa as they're being inserted and removed from the page pool.
This, I think, works elegantly without any issue, and is the option I
went with. The info from ZONE_DEVICE that I care about for device
memory TCP is the page->zone_device_data which holds the dma_addr, and
the page->pgmap which holds the page_free op. I'm able to store both
in my memory provider so I can swap pages from ZONE_DEVICE and
page_pool back and forth.

So far I haven't needed to make any modifications to the memory
provider implementation Jakub has pretty much, and my functionality
tests are passing. If there are no major objections I'll look into
cleaning up the interface a bit and propose it for merge. This is a
prerequisite of device memory TCP via the page_pool.

>
> I take it these are typical Linux networking applications using standard
> socket APIs (not dpdk or xdp sockets or such)? If so, what does tcpdump
> show for those skbs with pages for the device memory?
>

Yes these are using (mostly) standing socket APIs. We have small
extensions to sendmsg() and recvmsg() to pass a reference to the
device memory in both these cases, but that's about it.

tcpdump is able to access the header of these skbs which is in host
memory, but not the payload in device memory. Here is an example
session with my netcat-like test for device memory TCP:
https://pastebin.com/raw/FRjKf0kv

tcpdump seems to work, and the length of the packets above is correct.
tcpdump -A however isn't able to print the payload of the packets:
https://pastebin.com/raw/2PcNxaZV
David Ahern July 3, 2023, 2:45 p.m. UTC | #18
On 7/3/23 12:22 AM, Mina Almasry wrote:
> tcpdump is able to access the header of these skbs which is in host
> memory, but not the payload in device memory. Here is an example
> session with my netcat-like test for device memory TCP:
> https://pastebin.com/raw/FRjKf0kv
> 
> tcpdump seems to work, and the length of the packets above is correct.
> tcpdump -A however isn't able to print the payload of the packets:
> https://pastebin.com/raw/2PcNxaZV

That is my expectation. The tcpdump is just an easy example of accessing
the skb page frags. skb_copy_and_csum_bits used by icmp is another
example that can walk frags wanting access to device memory. You did not
cause a panic or trip a WARN_ON for example with the tcpdump?

Thanks for checking.
Eric Dumazet July 3, 2023, 5:13 p.m. UTC | #19
On Mon, Jul 3, 2023 at 4:45 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 7/3/23 12:22 AM, Mina Almasry wrote:
> > tcpdump is able to access the header of these skbs which is in host
> > memory, but not the payload in device memory. Here is an example
> > session with my netcat-like test for device memory TCP:
> > https://pastebin.com/raw/FRjKf0kv
> >
> > tcpdump seems to work, and the length of the packets above is correct.
> > tcpdump -A however isn't able to print the payload of the packets:
> > https://pastebin.com/raw/2PcNxaZV
>
> That is my expectation. The tcpdump is just an easy example of accessing
> the skb page frags. skb_copy_and_csum_bits used by icmp is another
> example that can walk frags wanting access to device memory. You did not
> cause a panic or trip a WARN_ON for example with the tcpdump?
>

Change for af_packet was not too hard :)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a2dbeb264f260e5b8923ece9aac99fe19ddfeb62..aa4133d1b1e0676e408499ea4534b51262394432
100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb,
struct net_device *dev,
                }
        }

-       snaplen = skb->len;
+       snaplen = skb->devmem ? skb_headlen(skb) : skb->len;

        res = run_filter(skb, sk, snaplen);
        if (!res)
@@ -2275,7 +2275,7 @@ static int tpacket_rcv(struct sk_buff *skb,
struct net_device *dev,
                }
        }

-       snaplen = skb->len;
+       snaplen = skb->devmem ? skb_headlen(skb) : skb->len;

        res = run_filter(skb, sk, snaplen);
        if (!res)


And a generic change in pskb_may_pull() ( __pskb_pull_tail() more
exactly) was enough to cover most other cases.
Eric Dumazet July 3, 2023, 5:15 p.m. UTC | #20
On Mon, Jul 3, 2023 at 4:45 PM David Ahern <dsahern@kernel.org> wrote:

> That is my expectation. The tcpdump is just an easy example of accessing
> the skb page frags. skb_copy_and_csum_bits used by icmp is another
> example that can walk frags wanting access to device memory. You did not
> cause a panic or trip a WARN_ON for example with the tcpdump?
>

ICMP packets do not land on the queues having devmem buffers, for
obvious reasons.

Only chosen TCP flows are steered to these queues.
David Ahern July 3, 2023, 5:23 p.m. UTC | #21
On 7/3/23 11:13 AM, Eric Dumazet wrote:
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index a2dbeb264f260e5b8923ece9aac99fe19ddfeb62..aa4133d1b1e0676e408499ea4534b51262394432
> 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb,
> struct net_device *dev,
>                 }
>         }
> 
> -       snaplen = skb->len;
> +       snaplen = skb->devmem ? skb_headlen(skb) : skb->len;
> 

Ok, so you expect a flag on the skb noting the use of 'untouchable'
memory. That aligns with my expectations based on POCs.

Based on the above: 1) skb->head is expected to be host memory, and 2)
the flag is a global for all frags, so no mix and match.
David Ahern July 3, 2023, 5:25 p.m. UTC | #22
On 7/3/23 11:15 AM, Eric Dumazet wrote:
> On Mon, Jul 3, 2023 at 4:45 PM David Ahern <dsahern@kernel.org> wrote:
> 
>> That is my expectation. The tcpdump is just an easy example of accessing
>> the skb page frags. skb_copy_and_csum_bits used by icmp is another
>> example that can walk frags wanting access to device memory. You did not
>> cause a panic or trip a WARN_ON for example with the tcpdump?
>>
> 
> ICMP packets do not land on the queues having devmem buffers, for
> obvious reasons.

I was not referring to ICMP packets coming in to the nic, but rather an
icmp getting triggered during stack processing.

> 
> Only chosen TCP flows are steered to these queues.

of course.
Jason Gunthorpe July 3, 2023, 9:43 p.m. UTC | #23
On Sun, Jul 02, 2023 at 11:22:33PM -0700, Mina Almasry wrote:
> On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@kernel.org> wrote:
> >
> > On 6/29/23 8:27 PM, Mina Almasry wrote:
> > >
> > > Hello Jakub, I'm looking into device memory (peer-to-peer) networking
> > > actually, and I plan to pursue using the page pool as a front end.
> > >
> > > Quick description of what I have so far:
> > > current implementation uses device memory with struct pages; I am
> > > putting all those pages in a gen_pool, and we have written an
> > > allocator that allocates pages from the gen_pool. In the driver, we
> > > use this allocator instead of alloc_page() (the driver in question is
> > > gve which currently doesn't use the page pool). When the driver is
> > > done with the p2p page, it simply decrements the refcount on it and
> > > the page is freed back to the gen_pool.
> 
> Quick update here, I was able to get my implementation working with
> the page pool as a front end with the memory provider API Jakub wrote
> here:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> The main complication indeed was the fact that my device memory pages
> are ZONE_DEVICE pages, which are incompatible with the page_pool due
> to the union in struct page. I thought of a couple of approaches to
> resolve that.
> 
> 1. Make my device memory pages non-ZONE_DEVICE pages. 

Hard no on this from a mm perspective.. We need P2P memory to be
properly tagged and have the expected struct pages to be DMA mappable
and otherwise, you totally break everything if you try to do this..

> 2. Convert the pages from ZONE_DEVICE pages to page_pool pages and
> vice versa as they're being inserted and removed from the page pool.

This is kind of scary, it is very, very, fragile to rework the pages
like this. Eg what happens when the owning device unplugs and needs to
revoke these pages? I think it would likely crash..

I think it also technically breaks the DMA API as we may need to look
into the pgmap to do cache ops on some architectures.

I suggest you try to work with 8k folios and then the tail page's
struct page is empty enough to store the information you need..
Or allocate per page memory and do a memdesc like thing..

Though overall, you won't find devices creating struct pages for their
P2P memory today, so I'm not sure what the purpose is. Jonathan
already got highly slammed for proposing code to the kernel that was
unusable. Please don't repeat that. Other than a special NVMe use case
the interface for P2P is DMABUF right now and it is not struct page
backed.

Even if we did get to struct pages for device memory, it is highly
likely cases you are interested in will be using larger than 4k
folios, so page pool would need to cope with this nicely as well.

Jason
Mina Almasry July 6, 2023, 1:17 a.m. UTC | #24
On Mon, Jul 3, 2023 at 2:43 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sun, Jul 02, 2023 at 11:22:33PM -0700, Mina Almasry wrote:
> > On Sun, Jul 2, 2023 at 9:20 PM David Ahern <dsahern@kernel.org> wrote:
> > >
> > > On 6/29/23 8:27 PM, Mina Almasry wrote:
> > > >
> > > > Hello Jakub, I'm looking into device memory (peer-to-peer) networking
> > > > actually, and I plan to pursue using the page pool as a front end.
> > > >
> > > > Quick description of what I have so far:
> > > > current implementation uses device memory with struct pages; I am
> > > > putting all those pages in a gen_pool, and we have written an
> > > > allocator that allocates pages from the gen_pool. In the driver, we
> > > > use this allocator instead of alloc_page() (the driver in question is
> > > > gve which currently doesn't use the page pool). When the driver is
> > > > done with the p2p page, it simply decrements the refcount on it and
> > > > the page is freed back to the gen_pool.
> >
> > Quick update here, I was able to get my implementation working with
> > the page pool as a front end with the memory provider API Jakub wrote
> > here:
> > https://github.com/kuba-moo/linux/tree/pp-providers
> >
> > The main complication indeed was the fact that my device memory pages
> > are ZONE_DEVICE pages, which are incompatible with the page_pool due
> > to the union in struct page. I thought of a couple of approaches to
> > resolve that.
> >
> > 1. Make my device memory pages non-ZONE_DEVICE pages.
>
> Hard no on this from a mm perspective.. We need P2P memory to be
> properly tagged and have the expected struct pages to be DMA mappable
> and otherwise, you totally break everything if you try to do this..
>
> > 2. Convert the pages from ZONE_DEVICE pages to page_pool pages and
> > vice versa as they're being inserted and removed from the page pool.
>
> This is kind of scary, it is very, very, fragile to rework the pages
> like this. Eg what happens when the owning device unplugs and needs to
> revoke these pages? I think it would likely crash..
>
> I think it also technically breaks the DMA API as we may need to look
> into the pgmap to do cache ops on some architectures.
>
> I suggest you try to work with 8k folios and then the tail page's
> struct page is empty enough to store the information you need..

Hi Jason, sorry for the late reply,

I think this could work, and the page pool already supports > order 0
allocations. It may end up being a big change to the GVE driver which
as I understand currently deals with order 0 allocations exclusively.

Another issue is that in networks with low MTU, we could be DMAing
1400/1500 bytes into each allocation, which is problematic if the
allocation is 8K+. I would need to investigate a bit to see if/how to
solve that, and we may end up having to split the page and again run
into the 'not enough room in struct page' problem.

> Or allocate per page memory and do a memdesc like thing..
>

I need to review memdesc more closely. Do you imagine I add a pointer
in struct page that points to the memdesc? Or implement a page to
memdesc mapping in the page_pool? Either approach could work. I think
the concern would be accessing the memdesc entries may be a cache miss
unacceptable in fast paths, but I think I already dereference
page->pgmap in a few places and it doesn't seem to be an issue.

> Though overall, you won't find devices creating struct pages for their
> P2P memory today, so I'm not sure what the purpose is. Jonathan
> already got highly slammed for proposing code to the kernel that was
> unusable. Please don't repeat that. Other than a special NVMe use case
> the interface for P2P is DMABUF right now and it is not struct page
> backed.
>

Our approach is actually to extend DMABUF to provide struct page
backed attachment mappings, which as far as I understand sidesteps the
issues Jonathan ran into. Our code is fully functional with any device
that supports dmabuf and in fact a lot of my tests use udmabuf to
minimize the dependencies. The RFC may come with a udmabuf selftest to
showcase that any dmabuf, even a mocked one, would be supported.

> Even if we did get to struct pages for device memory, it is highly
> likely cases you are interested in will be using larger than 4k
> folios, so page pool would need to cope with this nicely as well.
>

--
Thanks,
Mina
Mina Almasry July 6, 2023, 1:19 a.m. UTC | #25
On Mon, Jul 3, 2023 at 10:23 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 7/3/23 11:13 AM, Eric Dumazet wrote:
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index a2dbeb264f260e5b8923ece9aac99fe19ddfeb62..aa4133d1b1e0676e408499ea4534b51262394432
> > 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -2152,7 +2152,7 @@ static int packet_rcv(struct sk_buff *skb,
> > struct net_device *dev,
> >                 }
> >         }
> >
> > -       snaplen = skb->len;
> > +       snaplen = skb->devmem ? skb_headlen(skb) : skb->len;
> >
>
> Ok, so you expect a flag on the skb noting the use of 'untouchable'
> memory. That aligns with my expectations based on POCs.
>
> Based on the above: 1) skb->head is expected to be host memory, and 2)
> the flag is a global for all frags, so no mix and match.

Yes, both are correct (i.e. what I plan to propose).
Jakub Kicinski July 6, 2023, 4:50 p.m. UTC | #26
On Thu, 29 Jun 2023 19:27:46 -0700 Mina Almasry wrote:
> I've discussed your page pool frontend idea with our gve owners and
> the idea is attractive. In particular it would be good not to insert
> much custom code into the driver to support device memory pages or
> other page types. I plan on trying to change my approach to match the
> page pool provider you have in progress here:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> In particular the main challenge right now seems to be that my device
> memory pages are ZONE_DEVICE pages, which can't be inserted to the
> page pool as-is due to the union in struct page between the page pool
> entries and the ZONE_DEVICE entries. I have some ideas on how to work
> around that I'm looking into.

Right, have you talked to Willem? I mentioned to him that I initially
pursued the idea of using the page pool as an abstraction but I
realized that fitting both the dma addr and the frag reference count
into struct page which isn't just a basic page will be a challenge.

So the second best thing seems to be to create an API which matches 
the page pool API, and have it select between a page pool and another
provider based on user configuration. As you said the main goal is to
be able to feed kernel / user / device memory to the driver without
having to modify driver code.

I thought this would live "above" the page pool (perhaps that's what
you mean by gen_pool) but Jesper brought up integrating it into the
(middle of?) page pool and have page pool switch on the pp_magic.
No strong preference on which on is better, seems like something that
can only be ironed out by modifying a couple of drivers to find out
what fits best :(

I'm hoping to jump back into the pp_provider work and finish that up
over the next few days, to at least post a PoC. Even if it doesn't work
for user / device memory - being able to feed the page pool from huge
pages is already a big win for the IOTLB.
Jason Gunthorpe July 10, 2023, 5:44 p.m. UTC | #27
On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote:

> Another issue is that in networks with low MTU, we could be DMAing
> 1400/1500 bytes into each allocation, which is problematic if the
> allocation is 8K+. I would need to investigate a bit to see if/how to
> solve that, and we may end up having to split the page and again run
> into the 'not enough room in struct page' problem.

You don't have an intree driver to use this with, so who knows, but
the out of tree GPU drivers tend to use a 64k memory management page
size, and I don't expect you'd make progress with a design where a 64K
naturaly sized allocator is producing 4k/8k non-compound pages just
for netdev. We are still struggling with pagemap support for variable
page size folios, so there is a bunch of technical blockers before
drivers could do this.

This is why it is so important to come with a complete in-tree
solution, as we cannot review this design if your work is done with
hacked up out of tree drivers.

Fully and properly adding P2P ZONE_DEVICE to a real world driver is a
pretty big ask still.

> > Or allocate per page memory and do a memdesc like thing..
> 
> I need to review memdesc more closely. Do you imagine I add a pointer
> in struct page that points to the memdesc? 

Pointer to extra memory from the PFN has been the usual meaning of
memdesc, so doing an interm where the pointer is in the struct page is
a reasonable starting point.

> > Though overall, you won't find devices creating struct pages for their
> > P2P memory today, so I'm not sure what the purpose is. Jonathan
> > already got highly slammed for proposing code to the kernel that was
> > unusable. Please don't repeat that. Other than a special NVMe use case
> > the interface for P2P is DMABUF right now and it is not struct page
> > backed.
> >
> 
> Our approach is actually to extend DMABUF to provide struct page
> backed attachment mappings, which as far as I understand sidesteps the
> issues Jonathan ran into.

No DMABUF exporters do this today, so your patch series is just as
incomplete as the prior ones. Please don't post it as non-RFC,
unusable code like this must not be merged.

> that supports dmabuf and in fact a lot of my tests use udmabuf to
> minimize the dependencies. The RFC may come with a udmabuf selftest to
> showcase that any dmabuf, even a mocked one, would be supported.

That is not good enough to get merged. You need to get agreement and
coded merged from actual driver owners of dmabuf exporters that they
want to support this direction. As above it has surprising road
blocks outside netdev :\

Jason
Mina Almasry July 10, 2023, 11:02 p.m. UTC | #28
On Mon, Jul 10, 2023 at 10:44 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote:
>
> > Another issue is that in networks with low MTU, we could be DMAing
> > 1400/1500 bytes into each allocation, which is problematic if the
> > allocation is 8K+. I would need to investigate a bit to see if/how to
> > solve that, and we may end up having to split the page and again run
> > into the 'not enough room in struct page' problem.
>
> You don't have an intree driver to use this with, so who knows, but
> the out of tree GPU drivers tend to use a 64k memory management page
> size, and I don't expect you'd make progress with a design where a 64K
> naturaly sized allocator is producing 4k/8k non-compound pages just
> for netdev. We are still struggling with pagemap support for variable
> page size folios, so there is a bunch of technical blockers before
> drivers could do this.
>
> This is why it is so important to come with a complete in-tree
> solution, as we cannot review this design if your work is done with
> hacked up out of tree drivers.
>

I think you're assuming the proposal requires dma-buf exporter driver
changes, and I have a 'hacked up out of tree driver' not visible to
you. Both are not quite right. The proposal requires no changes to the
dma-buf exporter, and works with udmabuf _as is_, proving that. Please
do review the proposal:
https://lore.kernel.org/netdev/20230710223304.1174642-1-almasrymina@google.com/

If you still don't like the approach, we can try something else.

> Fully and properly adding P2P ZONE_DEVICE to a real world driver is a
> pretty big ask still.
>

There is no such ask.

> > > Or allocate per page memory and do a memdesc like thing..
> >
> > I need to review memdesc more closely. Do you imagine I add a pointer
> > in struct page that points to the memdesc?
>
> Pointer to extra memory from the PFN has been the usual meaning of
> memdesc, so doing an interm where the pointer is in the struct page is
> a reasonable starting point.
>
> > > Though overall, you won't find devices creating struct pages for their
> > > P2P memory today, so I'm not sure what the purpose is. Jonathan
> > > already got highly slammed for proposing code to the kernel that was
> > > unusable. Please don't repeat that. Other than a special NVMe use case
> > > the interface for P2P is DMABUF right now and it is not struct page
> > > backed.
> > >
> >
> > Our approach is actually to extend DMABUF to provide struct page
> > backed attachment mappings, which as far as I understand sidesteps the
> > issues Jonathan ran into.
>
> No DMABUF exporters do this today, so your patch series is just as
> incomplete as the prior ones. Please don't post it as non-RFC,
> unusable code like this must not be merged.
>
> > that supports dmabuf and in fact a lot of my tests use udmabuf to
> > minimize the dependencies. The RFC may come with a udmabuf selftest to
> > showcase that any dmabuf, even a mocked one, would be supported.
>
> That is not good enough to get merged. You need to get agreement and
> coded merged from actual driver owners of dmabuf exporters that they
> want to support this direction. As above it has surprising road
> blocks outside netdev :\
>

The current proposal requires no changes to the dma-buf exporters:
https://lore.kernel.org/netdev/20230710223304.1174642-1-almasrymina@google.com/

On dma-buf changes required. I do need approval from the dma-buf
maintainers, but AFAICT, no approval from the dma-buf exporters (all I
need is already supported). If we need to change direction to a
proposal that needs additional support from the driver owners, yes,
we'd need their approval, but this is not the case at the moment.
Jason Gunthorpe July 10, 2023, 11:49 p.m. UTC | #29
On Mon, Jul 10, 2023 at 04:02:59PM -0700, Mina Almasry wrote:
> On Mon, Jul 10, 2023 at 10:44 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote:
> >
> > > Another issue is that in networks with low MTU, we could be DMAing
> > > 1400/1500 bytes into each allocation, which is problematic if the
> > > allocation is 8K+. I would need to investigate a bit to see if/how to
> > > solve that, and we may end up having to split the page and again run
> > > into the 'not enough room in struct page' problem.
> >
> > You don't have an intree driver to use this with, so who knows, but
> > the out of tree GPU drivers tend to use a 64k memory management page
> > size, and I don't expect you'd make progress with a design where a 64K
> > naturaly sized allocator is producing 4k/8k non-compound pages just
> > for netdev. We are still struggling with pagemap support for variable
> > page size folios, so there is a bunch of technical blockers before
> > drivers could do this.
> >
> > This is why it is so important to come with a complete in-tree
> > solution, as we cannot review this design if your work is done with
> > hacked up out of tree drivers.
> >
> 
> I think you're assuming the proposal requires dma-buf exporter driver
> changes, and I have a 'hacked up out of tree driver' not visible to
> you.

Oh, I thought it was obvious what you did in patch 1 was a total
non-starter when I said you can't abuse the ZONE_DEVICE pages like
this.

You must create ZONE_DEVICE P2P pages, not MEMORY_DEVICE_PRIVATE to
represent P2P memory, and you can't do that automatically from the
dmabuf core code.

Without doing this the DMA API doesn't actually work properly because
it doesn't have enough information to know about what the underlying
exporter is.

The entire point of DEVICE_PRIVATE is that the page content, and
access to the page's physical location, is *explicitly* unavailable to
anyone but the pgmap owner.

> > Fully and properly adding P2P ZONE_DEVICE to a real world driver is a
> > pretty big ask still.
> 
> There is no such ask.

Well, there is from me if you want to use stuct pages as handles for
P2P memory. That is what we have defined in the pgmap area.

Also I should warn you that your 'option 2' is being NAK'd by
Christoph for now, we are not adding any new code around DMABUF's
hacky use of NULL sg_page scatterlist for P2P memory either. I've been
working on solutions here but it is slow going.

> On dma-buf changes required. I do need approval from the dma-buf
> maintainers,

It is a neat hack, of sorts, to abuse DEVICE_PRIVATE to create struct
pages for the exclusive use of pagepool - but you need more approval
than just dmabuf maintainers to abuse the pgmap framework like
this.

At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used
to represent P2P memory. You haven't CC'd anyone from the mm community
so I've added some people here to see if there are other opinions.

To be clear, you need an explicit ack from mm people on the abusive
use of pgmap in patch 1.

I know it is not what you want to hear, but there are actual reasons
why the P2P DMA problem has been festering for so long, and hacky
quick fixes like this are not going to be enough..

Jason
Mina Almasry July 11, 2023, 12:45 a.m. UTC | #30
On Mon, Jul 10, 2023 at 4:49 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Jul 10, 2023 at 04:02:59PM -0700, Mina Almasry wrote:
> > On Mon, Jul 10, 2023 at 10:44 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote:
> > >
> > > > Another issue is that in networks with low MTU, we could be DMAing
> > > > 1400/1500 bytes into each allocation, which is problematic if the
> > > > allocation is 8K+. I would need to investigate a bit to see if/how to
> > > > solve that, and we may end up having to split the page and again run
> > > > into the 'not enough room in struct page' problem.
> > >
> > > You don't have an intree driver to use this with, so who knows, but
> > > the out of tree GPU drivers tend to use a 64k memory management page
> > > size, and I don't expect you'd make progress with a design where a 64K
> > > naturaly sized allocator is producing 4k/8k non-compound pages just
> > > for netdev. We are still struggling with pagemap support for variable
> > > page size folios, so there is a bunch of technical blockers before
> > > drivers could do this.
> > >
> > > This is why it is so important to come with a complete in-tree
> > > solution, as we cannot review this design if your work is done with
> > > hacked up out of tree drivers.
> > >
> >
> > I think you're assuming the proposal requires dma-buf exporter driver
> > changes, and I have a 'hacked up out of tree driver' not visible to
> > you.
>
> Oh, I thought it was obvious what you did in patch 1 was a total
> non-starter when I said you can't abuse the ZONE_DEVICE pages like
> this.
>
> You must create ZONE_DEVICE P2P pages, not MEMORY_DEVICE_PRIVATE to
> represent P2P memory,

We can extend the memory_type enum, with something like MEMORY_DEVICE_DMABUF?

> and you can't do that automatically from the
> dmabuf core code.
>

Sorry, why not?

> Without doing this the DMA API doesn't actually work properly because
> it doesn't have enough information to know about what the underlying
> exporter is.
>

I didn't think it would matter that the DMA API doesn't work with
these pages, because the mapping already exists courtesy of
dma_buf_map_attachment(), and these dma-buf pages don't need to be
mapped (the dma_addr is already available). I'm providing
dma_buf_map_sg() in that patch to use instead of the DMA API. Would be
nice to help me understand why we would care why the DMA API doesn't
work with these pages when we're asking code using these pages to use
the provided API instead?

> The entire point of DEVICE_PRIVATE is that the page content, and
> access to the page's physical location, is *explicitly* unavailable to
> anyone but the pgmap owner.
>

This seems to be a semantics issue. Would setting the pages as
MEMORY_DEVICE_DMABUF address this?

> > > Fully and properly adding P2P ZONE_DEVICE to a real world driver is a
> > > pretty big ask still.
> >
> > There is no such ask.
>
> Well, there is from me if you want to use stuct pages as handles for
> P2P memory. That is what we have defined in the pgmap area.
>
> Also I should warn you that your 'option 2' is being NAK'd by
> Christoph for now, we are not adding any new code around DMABUF's
> hacky use of NULL sg_page scatterlist for P2P memory either. I've been
> working on solutions here but it is slow going.
>

Option 2, I think, doesn't care about the sg_page NULL. But it may be
pointless to discuss, I don't see how we could modify the page pool,
networking stack, and all the networking drivers to do anything other
than struct page (someone feel free to correct).

> > On dma-buf changes required. I do need approval from the dma-buf
> > maintainers,
>
> It is a neat hack, of sorts, to abuse DEVICE_PRIVATE to create struct
> pages for the exclusive use of pagepool - but you need more approval
> than just dmabuf maintainers to abuse the pgmap framework like
> this.
>

Who? I'd love to add them on subsequent proposals.

> At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used
> to represent P2P memory.

Would using p2pdma API instead of dmabuf be an acceptable direction?

The nice thing about what I'm doing is that any existing dmabuf
exported would be supported, AFAICT. To use p2pdma I think I would
need to modify the drivers providing the device memory to do a
pci_p2pdma_add_resource(), and use the pci_p2pdma_map_sg() on the
importer driver (NIC), but that may work, with driver changes.

Would that be an acceptable direction to you? Or are you against the
approach in the RFC and don't have any alternative path forward for
this? Is this not a use case we want to support in some way in the
kernel?

> You haven't CC'd anyone from the mm community
> so I've added some people here to see if there are other opinions.
>

I CC'd get_maintainers.pl output and you because you showed interest
and get_maintainers did not add you automatically.

> To be clear, you need an explicit ack from mm people on the abusive
> use of pgmap in patch 1.
>

Who? I would love to add them on subsequent proposals.

> I know it is not what you want to hear, but there are actual reasons
> why the P2P DMA problem has been festering for so long, and hacky
> quick fixes like this are not going to be enough..
>

--
Thanks,
Mina
Christoph Hellwig July 11, 2023, 4:27 a.m. UTC | #31
On Mon, Jul 10, 2023 at 08:49:25PM -0300, Jason Gunthorpe wrote:
> The entire point of DEVICE_PRIVATE is that the page content, and
> access to the page's physical location, is *explicitly* unavailable to
> anyone but the pgmap owner.

DEVICE_PRIVATE is in fact availably to no one at all.  It is a place
holder for memory not addressable at all.

Not going to comment on the rest of this as it seems bat shit crazy
hacks for out of tree junk.  Why is anyone even wasting time on this?
Jakub Kicinski July 11, 2023, 4:59 a.m. UTC | #32
On Tue, 11 Jul 2023 06:27:08 +0200 Christoph Hellwig wrote:
> Not going to comment on the rest of this as it seems bat shit crazy
> hacks for out of tree junk.  Why is anyone even wasting time on this?

Noob question - how does RDMA integrate with the out of tree junk?
AFAIU it's possible to run the "in-tree" RDMA stack and get "GPU
direct".

Both Jonathan in the past (Meta) and now Mina (Google) are trying 
to move the needle and at least feed the GPUs over TCP, instead of
patented, proprietary and closed RDMA transports.
Christoph Hellwig July 11, 2023, 5:04 a.m. UTC | #33
On Mon, Jul 10, 2023 at 09:59:06PM -0700, Jakub Kicinski wrote:
> On Tue, 11 Jul 2023 06:27:08 +0200 Christoph Hellwig wrote:
> > Not going to comment on the rest of this as it seems bat shit crazy
> > hacks for out of tree junk.  Why is anyone even wasting time on this?
> 
> Noob question - how does RDMA integrate with the out of tree junk?
> AFAIU it's possible to run the "in-tree" RDMA stack and get "GPU
> direct".

I don't care and it has absolutel no business being discussed here.

FYI at leat iWarp is a totally open standard.
Dan Williams July 11, 2023, 6:52 a.m. UTC | #34
Jason Gunthorpe wrote:
> On Mon, Jul 10, 2023 at 04:02:59PM -0700, Mina Almasry wrote:
> > On Mon, Jul 10, 2023 at 10:44 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Wed, Jul 05, 2023 at 06:17:39PM -0700, Mina Almasry wrote:
> > >
> > > > Another issue is that in networks with low MTU, we could be DMAing
> > > > 1400/1500 bytes into each allocation, which is problematic if the
> > > > allocation is 8K+. I would need to investigate a bit to see if/how to
> > > > solve that, and we may end up having to split the page and again run
> > > > into the 'not enough room in struct page' problem.
> > >
> > > You don't have an intree driver to use this with, so who knows, but
> > > the out of tree GPU drivers tend to use a 64k memory management page
> > > size, and I don't expect you'd make progress with a design where a 64K
> > > naturaly sized allocator is producing 4k/8k non-compound pages just
> > > for netdev. We are still struggling with pagemap support for variable
> > > page size folios, so there is a bunch of technical blockers before
> > > drivers could do this.
> > >
> > > This is why it is so important to come with a complete in-tree
> > > solution, as we cannot review this design if your work is done with
> > > hacked up out of tree drivers.
> > >
> > 
> > I think you're assuming the proposal requires dma-buf exporter driver
> > changes, and I have a 'hacked up out of tree driver' not visible to
> > you.
> 
> Oh, I thought it was obvious what you did in patch 1 was a total
> non-starter when I said you can't abuse the ZONE_DEVICE pages like
> this.
> 
> You must create ZONE_DEVICE P2P pages, not MEMORY_DEVICE_PRIVATE to
> represent P2P memory, and you can't do that automatically from the
> dmabuf core code.
> 
> Without doing this the DMA API doesn't actually work properly because
> it doesn't have enough information to know about what the underlying
> exporter is.
> 
> The entire point of DEVICE_PRIVATE is that the page content, and
> access to the page's physical location, is *explicitly* unavailable to
> anyone but the pgmap owner.
> 
> > > Fully and properly adding P2P ZONE_DEVICE to a real world driver is a
> > > pretty big ask still.
> > 
> > There is no such ask.
> 
> Well, there is from me if you want to use stuct pages as handles for
> P2P memory. That is what we have defined in the pgmap area.
> 
> Also I should warn you that your 'option 2' is being NAK'd by
> Christoph for now, we are not adding any new code around DMABUF's
> hacky use of NULL sg_page scatterlist for P2P memory either. I've been
> working on solutions here but it is slow going.
> 
> > On dma-buf changes required. I do need approval from the dma-buf
> > maintainers,
> 
> It is a neat hack, of sorts, to abuse DEVICE_PRIVATE to create struct
> pages for the exclusive use of pagepool - but you need more approval
> than just dmabuf maintainers to abuse the pgmap framework like
> this.
> 
> At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used
> to represent P2P memory. You haven't CC'd anyone from the mm community
> so I've added some people here to see if there are other opinions.
> 
> To be clear, you need an explicit ack from mm people on the abusive
> use of pgmap in patch 1.

This thread is long so I am only reacting to this first message I am
copied on, but yes agree with Jason anything peer-to-peer DMA needs to
reuse p2pdma and it definitely needs in-tree producers and consumers for
all infrastructure.

One piece of technical debt standing in the way of any proposed
expansion of pgmap usage is the final resolution of this topic:

https://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/

I am also generally reticent to entertain taking on new pgmap
maintenance. I.e. now that accelerator memory attached over a coherent
link is an open hardware standard (CXL) that addresses what pgmap
infrastructure enabled in software.

> I know it is not what you want to hear, but there are actual reasons
> why the P2P DMA problem has been festering for so long, and hacky
> quick fixes like this are not going to be enough..
> 
> Jason
Jason Gunthorpe July 11, 2023, 12:05 p.m. UTC | #35
On Tue, Jul 11, 2023 at 07:04:45AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 10, 2023 at 09:59:06PM -0700, Jakub Kicinski wrote:
> > On Tue, 11 Jul 2023 06:27:08 +0200 Christoph Hellwig wrote:
> > > Not going to comment on the rest of this as it seems bat shit crazy
> > > hacks for out of tree junk.  Why is anyone even wasting time on this?
> > 
> > Noob question - how does RDMA integrate with the out of tree junk?
> > AFAIU it's possible to run the "in-tree" RDMA stack and get "GPU
> > direct".
> 
> I don't care and it has absolutel no business being discussed here.
> 
> FYI at leat iWarp is a totally open standard.

So is Infiniband, Jakub has a unique definition of "proprietary".

RDMA works with the AMD and Intel intree drivers using DMABUF without
requiring struct pages using the DRM hacky scatterlist approach.

Jason
Jason Gunthorpe July 11, 2023, 1:11 p.m. UTC | #36
On Mon, Jul 10, 2023 at 05:45:05PM -0700, Mina Almasry wrote:

> > At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used
> > to represent P2P memory.
> 
> Would using p2pdma API instead of dmabuf be an acceptable direction?

"p2pdma API" is really just using MEMORY_DEVICE_PCI_P2PDMA and
teaching the pagepool how to work with ZONE_DEVICE pages.

I suspect this will clash badly with Matthew's work here:

https://lore.kernel.org/all/20230111042214.907030-1-willy@infradead.org/

As from a mm side we haven't ever considered that ZONE_DEVICE and
"netmem" can be composed together. The entire point of netmem like
stuff is that the allocator hands over the majority of struct page to
the allocatee, and ZONE_DEVICE can't work like that. 

However, assuming that can be solved in some agreeable way then it
would be OK to go down this path.

But, I feel like this is just overall too hard a direction from the mm
perspective.

I don't know anything about page pool, but the main sticking point is
its reliance on struct page. If it can find another way to locate its
meta data (eg an xarray), at least for some cases, it would make
things alot easier.

Jason
Jakub Kicinski July 11, 2023, 4 p.m. UTC | #37
On Tue, 11 Jul 2023 09:05:02 -0300 Jason Gunthorpe wrote:
> On Tue, Jul 11, 2023 at 07:04:45AM +0200, Christoph Hellwig wrote:
> > On Mon, Jul 10, 2023 at 09:59:06PM -0700, Jakub Kicinski wrote:  
> > > Noob question - how does RDMA integrate with the out of tree junk?
> > > AFAIU it's possible to run the "in-tree" RDMA stack and get "GPU
> > > direct".  
> > 
> > I don't care and it has absolutel no business being discussed here.

My question was genuine. If you think the code is dog shit I will 
make no argument for merging it. I just get hives from looking at
proprietary code so I was hoping someone could explain how the
proprietary stacks get it done.

> > FYI at leat iWarp is a totally open standard.  

And I'm sure someone cares about that. I care about open source.

I think people in the networking world have a deeper understanding 
of standardization processes, and their practical implication for
open source and hack-ability. If all usable implementations are
proprietary the standard could as well not exist.

That may be a little hard to understand for folks coming from storage
and fabric worlds where the interesting bits of the implementation was
pretty much always closed.

> So is Infiniband, Jakub has a unique definition of "proprietary".

For IB AFAIU there's only one practically usable vendor, such an
impressive ecosystem!!

> RDMA works with the AMD and Intel intree drivers using DMABUF without
> requiring struct pages using the DRM hacky scatterlist approach.

I see, thanks. We need pages primarily for refcounting. Avoiding all
the infamous problems with memory pins. Oh well.
David Ahern July 11, 2023, 4:20 p.m. UTC | #38
On 7/11/23 10:00 AM, Jakub Kicinski wrote:
>> RDMA works with the AMD and Intel intree drivers using DMABUF without
>> requiring struct pages using the DRM hacky scatterlist approach.
> I see, thanks. We need pages primarily for refcounting. Avoiding all
> the infamous problems with memory pins. Oh well.

io_uring for example already manages the page pinning. An skb flag was
added for ZC Tx API to avoid refcounting in the core networking layer.
Any reason not to allow an alternative representation for skb frags than
struct page?
Jakub Kicinski July 11, 2023, 4:32 p.m. UTC | #39
On Tue, 11 Jul 2023 10:20:58 -0600 David Ahern wrote:
> On 7/11/23 10:00 AM, Jakub Kicinski wrote:
> >> RDMA works with the AMD and Intel intree drivers using DMABUF without
> >> requiring struct pages using the DRM hacky scatterlist approach.  
> > I see, thanks. We need pages primarily for refcounting. Avoiding all
> > the infamous problems with memory pins. Oh well.  
> 
> io_uring for example already manages the page pinning. An skb flag was
> added for ZC Tx API to avoid refcounting in the core networking layer.

Right, we can refcount in similar fashion. Still tracking explicitly
when buffers are handed over to the NIC.

> Any reason not to allow an alternative representation for skb frags than
> struct page?

I don't think there's a hard technical reason. We can make it work.
Jason Gunthorpe July 11, 2023, 4:42 p.m. UTC | #40
On Tue, Jul 11, 2023 at 09:00:47AM -0700, Jakub Kicinski wrote:

> > So is Infiniband, Jakub has a unique definition of "proprietary".
> 
> For IB AFAIU there's only one practically usable vendor, such an
> impressive ecosystem!!

IB has roce (RDMA Over Converged Ethernet, better thought of as IB
over Ethernet) under it's standization umbrella and every major
commodity NIC vendor (mellanox, broadcom, intel) now implements
roce. We also have the roce support in the switch from all major
switch vendors.

IB as a link layer "failed" with most implementors leaving the
ecosystem broadly because Ethernet link layer always wins in
networking - this is more of an economic outcome than a
standardization outcome in my mind.

Due to roce, IB as a transport and software protocol has a solid
multi-vendor ecosystem.

Jason
Mina Almasry July 11, 2023, 5:06 p.m. UTC | #41
On Tue, Jul 11, 2023 at 9:32 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 11 Jul 2023 10:20:58 -0600 David Ahern wrote:
> > On 7/11/23 10:00 AM, Jakub Kicinski wrote:
> > >> RDMA works with the AMD and Intel intree drivers using DMABUF without
> > >> requiring struct pages using the DRM hacky scatterlist approach.
> > > I see, thanks. We need pages primarily for refcounting. Avoiding all
> > > the infamous problems with memory pins. Oh well.
> >
> > io_uring for example already manages the page pinning. An skb flag was
> > added for ZC Tx API to avoid refcounting in the core networking layer.
>
> Right, we can refcount in similar fashion. Still tracking explicitly
> when buffers are handed over to the NIC.
>
> > Any reason not to allow an alternative representation for skb frags than
> > struct page?
>
> I don't think there's a hard technical reason. We can make it work.

I also think we can switch the representation for skb frags to
something else. However - please do correct me if I'm wrong - I don't
think that is sufficient for device memory TCP. My understanding is
that we also need to modify any NIC drivers that want to use device
memory TCP to understand a new memory type, and the page pool as well
if that's involved. I think in particular modifying the memory type in
all the NIC drivers that want to do device memory TCP is difficult. Do
you think this is feasible?

--
Thanks,
Mina
Jakub Kicinski July 11, 2023, 5:06 p.m. UTC | #42
On Tue, 11 Jul 2023 13:42:47 -0300 Jason Gunthorpe wrote:
> On Tue, Jul 11, 2023 at 09:00:47AM -0700, Jakub Kicinski wrote:
> > > So is Infiniband, Jakub has a unique definition of "proprietary".  
> > 
> > For IB AFAIU there's only one practically usable vendor, such an
> > impressive ecosystem!!  
> 
> IB has roce (RDMA Over Converged Ethernet, better thought of as IB
> over Ethernet) under it's standization umbrella and every major
> commodity NIC vendor (mellanox, broadcom, intel) now implements
> roce.

Now we're getting into our favorite argument and completely
sidetracking the conversation, aren't we? :) And as usual 
our ability to present facts is limited by various NDAs..

> We also have the roce support in the switch from all major
> switch vendors.

By which you mean all major switch vendors should support basic RoCE
requirements. But most vendors will try to put special features into
their switches trying to make the full NIC + switch solution as sticky
as possible.

> IB as a link layer "failed" with most implementors leaving the

Interesting.

> ecosystem broadly because Ethernet link layer always wins in
> networking - this is more of an economic outcome than a
> standardization outcome in my mind.
> 
> Due to roce, IB as a transport and software protocol has a solid
> multi-vendor ecosystem.
 
Last I checked every generation of HW from even a single vendor came out
with a new congestion control algorithm and add-ons. All closed source
and usually patented. That is the core of a transport protocol, the
intelligence, not the frame format (which may or may not itself deviate
thru various encaps and stolen bits).
Mina Almasry July 11, 2023, 5:24 p.m. UTC | #43
On Tue, Jul 11, 2023 at 6:11 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, Jul 10, 2023 at 05:45:05PM -0700, Mina Almasry wrote:
>
> > > At least from my position I want to see MEMORY_DEVICE_PCI_P2PDMA used
> > > to represent P2P memory.
> >
> > Would using p2pdma API instead of dmabuf be an acceptable direction?
>
> "p2pdma API" is really just using MEMORY_DEVICE_PCI_P2PDMA and
> teaching the pagepool how to work with ZONE_DEVICE pages.
>

Yes, that's what I have in mind. Roughly something along the lines
where the device memory provider (GPU or what not) does something like
a pci_p2pdma_add_resource(), and the NIC client driver allocates the
pages from the resource, and if is_pci_p2pdma_page() use
pci_p2pdma_map_sg() instead of dma_map_sg() and whatnot.

> I suspect this will clash badly with Matthew's work here:
>
> https://lore.kernel.org/all/20230111042214.907030-1-willy@infradead.org/
>
> As from a mm side we haven't ever considered that ZONE_DEVICE and
> "netmem" can be composed together. The entire point of netmem like
> stuff is that the allocator hands over the majority of struct page to
> the allocatee, and ZONE_DEVICE can't work like that.
>
> However, assuming that can be solved in some agreeable way then it
> would be OK to go down this path.
>

I think there is potential to solve this in an agreeable way. We may
be able to get netmem ZONE_DEVICE pages using the memdesc idea you
proposed, or something like the xarray meta data you mention below. If
not that, I think Jakub already said that he is considering coming up
with a 'page pool like' API that drivers can use, with an
implementation that is compatible with ZONE_DEVICE pages.

> But, I feel like this is just overall too hard a direction from the mm
> perspective.
>
> I don't know anything about page pool, but the main sticking point is
> its reliance on struct page. If it can find another way to locate its
> meta data (eg an xarray), at least for some cases, it would make
> things alot easier.
>
Jason Gunthorpe July 11, 2023, 6:52 p.m. UTC | #44
On Tue, Jul 11, 2023 at 10:06:36AM -0700, Jakub Kicinski wrote:

> Now we're getting into our favorite argument and completely
> sidetracking the conversation, aren't we? :) And as usual 
> our ability to present facts is limited by various NDAs..

Yes, well, maybe I should stop taking the bait everytime you write
"proprietary" :)
 
> > We also have the roce support in the switch from all major
> > switch vendors.
> 
> By which you mean all major switch vendors should support basic RoCE
> requirements. But most vendors will try to put special features into
> their switches trying to make the full NIC + switch solution as sticky
> as possible.

Yep. At the high end open standards based ethernet has also notably
"failed" as well. Every switch vendor now offers their own proprietary
ecosystem on a whole bunch of different axis. They all present
"ethernet" toward the host but the host often needs to work in a
special way to really take full advantage of the proprietary fabric
behaviors.

> Last I checked every generation of HW from even a single vendor came out
> with a new congestion control algorithm and add-ons. 

Probably, but I don't really view this as an IB or roce issue.

Back in the day, there was "data center ethernet" which was a
standardization effort to try and tame some of these problems. roce
was imagined as an important workload over DCE, but the effort was
ethernet focused and generic. Sadly DCE and successor standard based
congestion mangement approaches did not work, or were "standardized"
in a way that had a big hole that needed to be filled with proprietary
algorithms. Eventualy the interest in standardization seems to have
waned and several of the big network operators seem to be valuing
their unique congestion management as a proprietary element. From a
vendor perspective this is has turned into an interop train
wreck. Sigh.

roce is just highly sensitive to loss - which is managed in ethernet
through congestion management. This is why you see roce and congestion
management so tightly linked, and perhaps in some deployments becomes
the motivating reason to look at congestion management.

However, TCP under congestion management is also very interesting and
is a motivation to deploy congestion management in its own right in
some cases.

Jason
Jakub Kicinski July 11, 2023, 8:34 p.m. UTC | #45
On Tue, 11 Jul 2023 15:52:37 -0300 Jason Gunthorpe wrote:
> > Now we're getting into our favorite argument and completely
> > sidetracking the conversation, aren't we? :) And as usual 
> > our ability to present facts is limited by various NDAs..  
> 
> Yes, well, maybe I should stop taking the bait everytime you write
> "proprietary" :)
> 
> > > We also have the roce support in the switch from all major
> > > switch vendors.  
> > 
> > By which you mean all major switch vendors should support basic RoCE
> > requirements. But most vendors will try to put special features into
> > their switches trying to make the full NIC + switch solution as sticky
> > as possible.  
> 
> Yep. At the high end open standards based ethernet has also notably
> "failed" as well. Every switch vendor now offers their own proprietary
> ecosystem on a whole bunch of different axis. They all present
> "ethernet" toward the host but the host often needs to work in a
> special way to really take full advantage of the proprietary fabric
> behaviors.

I'm not familiar with "high end open standards based on ethernet", would
those be some RDMA / storage things? For TCP/IP networks pretty much
the only things that matter in a switch are bandwidth, size of buffers,
power... Implementation stuff.

> > Last I checked every generation of HW from even a single vendor came out
> > with a new congestion control algorithm and add-ons.   
> 
> Probably, but I don't really view this as an IB or roce issue.
> 
> Back in the day, there was "data center ethernet" which was a
> standardization effort to try and tame some of these problems. roce
> was imagined as an important workload over DCE, but the effort was
> ethernet focused and generic. Sadly DCE and successor standard based
> congestion mangement approaches did not work, or were "standardized"
> in a way that had a big hole that needed to be filled with proprietary
> algorithms. Eventualy the interest in standardization seems to have
> waned and several of the big network operators seem to be valuing
> their unique congestion management as a proprietary element. From a
> vendor perspective this is has turned into an interop train
> wreck. Sigh.
> 
> roce is just highly sensitive to loss - which is managed in ethernet
> through congestion management. This is why you see roce and congestion
> management so tightly linked, and perhaps in some deployments becomes
> the motivating reason to look at congestion management.

A lot of "standardization" efforts are just attempts to prove to 
a buyers that an ecosystem exists.

Open source the firmware. Let people actually hack on it and when
the users bring their own algorithms de facto standardization will
happen. Short of that it's all smoke and mirrors.
Jakub Kicinski July 11, 2023, 8:39 p.m. UTC | #46
On Tue, 11 Jul 2023 10:06:28 -0700 Mina Almasry wrote:
> > > Any reason not to allow an alternative representation for skb frags than
> > > struct page?  
> >
> > I don't think there's a hard technical reason. We can make it work.  
> 
> I also think we can switch the representation for skb frags to
> something else. However - please do correct me if I'm wrong - I don't
> think that is sufficient for device memory TCP. My understanding is
> that we also need to modify any NIC drivers that want to use device
> memory TCP to understand a new memory type, and the page pool as well
> if that's involved. I think in particular modifying the memory type in
> all the NIC drivers that want to do device memory TCP is difficult. Do
> you think this is feasible?

That's why I was thinking about adding an abstraction between 
the page pool and the driver. Instead of feeding driver pages
a new abstraction could feed the driver just an identifier and a PA.

Whether we want to support fragmentation in that model or not would 
have to be decided.

We can take pages from the page pool and feed them to drivers via
such an API, but drivers need to stop expecting pages.

That's for data buffers only, obviously. We can keep using pages 
and raw page pool for headers.
David Ahern July 11, 2023, 9:39 p.m. UTC | #47
On 7/11/23 2:39 PM, Jakub Kicinski wrote:
> On Tue, 11 Jul 2023 10:06:28 -0700 Mina Almasry wrote:
>>>> Any reason not to allow an alternative representation for skb frags than
>>>> struct page?  
>>>
>>> I don't think there's a hard technical reason. We can make it work.  
>>
>> I also think we can switch the representation for skb frags to
>> something else. However - please do correct me if I'm wrong - I don't
>> think that is sufficient for device memory TCP. My understanding is
>> that we also need to modify any NIC drivers that want to use device
>> memory TCP to understand a new memory type, and the page pool as well
>> if that's involved. I think in particular modifying the memory type in
>> all the NIC drivers that want to do device memory TCP is difficult. Do
>> you think this is feasible?
> 
> That's why I was thinking about adding an abstraction between 
> the page pool and the driver. Instead of feeding driver pages
> a new abstraction could feed the driver just an identifier and a PA.

skb frag is currently a bio_vec. Overloading the 'struct page' address
in that struct with another address is easy to do. Requiring a certain
alignment on the address gives you a few low bits to use a flags / magic
/ etc.

Overloading len and offset is not really possible - way too much code is
affected (e.g., iov walking and MSS / TSO segmenting).

ie., you could overload page address with a pointer to an object in your
new abstraction layer and the struct has the other meta data.

typedef struct skb_frag {
	union {
		struct bio_vec bvec;
		struct new_abstraction abs;
	};
} skb_frag_t;

where

struct new_abstraction {
	void *addr,
	unsigned int len;
	unsigned int offset;
};

I have been playing with a similar and it co-exists with the existing
code quite well with the constraint on location of len and offset.

> 
> Whether we want to support fragmentation in that model or not would 
> have to be decided.
> 
> We can take pages from the page pool and feed them to drivers via
> such an API, but drivers need to stop expecting pages.

yes, drivers would have to be updated to understand the new format. A
downside, but again relatively easy to manage.

> 
> That's for data buffers only, obviously. We can keep using pages 
> and raw page pool for headers.

yes.
Jason Gunthorpe July 11, 2023, 11:56 p.m. UTC | #48
On Tue, Jul 11, 2023 at 01:34:20PM -0700, Jakub Kicinski wrote:

> > Yep. At the high end open standards based ethernet has also notably
> > "failed" as well. Every switch vendor now offers their own proprietary
> > ecosystem on a whole bunch of different axis. They all present
> > "ethernet" toward the host but the host often needs to work in a
> > special way to really take full advantage of the proprietary fabric
> > behaviors.
> 
> I'm not familiar with "high end open standards based on ethernet", would
> those be some RDMA / storage things? For TCP/IP networks pretty much
> the only things that matter in a switch are bandwidth, size of buffers,
> power... Implementation stuff.

I would say when you are getting into ethernet deployments with 25 or
51 Tbps switches directly connected to hosts running at >100G you are
getting into the high end side of things.

These are very expensive networks. They run complex congestion
provoking workloads. They have sophisticated multi-pathing. They often
use use a non-blocking topology. Congestion management is important.

Making this work with good utilization, and low tail latency is a
really hard problem. Many of the solutions come with switch features
supporting it.

You'd proably say these are not TCP focused networks, even though they
are based on ethernet and IP.

So I think of them as high end "standards based" ethernet and IP
looking networks that have proprietary elements mixed in throughout.

Right now there is a bit of a press war between vendors on 'ethernet
for AI'. Both Broadcom and NVIDIA are taking techonlogies that were
originally built for TCP ethernet networks and remixing/speeding them
up to run roce workloads effectively. There is alot more information
available now without NDA that shows some detail on this space.

AWS's SRD multipathing, Broadcom "AI Ethernet" and NVIDIA's Spectrum-X
spring to mind as topical to what these sorts of ethernet networks
are.

> A lot of "standardization" efforts are just attempts to prove to 
> a buyers that an ecosystem exists.

Heh, that was probably more true years ago. These days it seems like
some standardization is also being done so the large hyperscalers can
improve their Approved Vendors List.

I suppose as long as the result is something we can implement openly
in Linux the motivation for standardization is less important.

Jason
Mina Almasry July 12, 2023, 3:42 a.m. UTC | #49
On Tue, Jul 11, 2023 at 2:39 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 7/11/23 2:39 PM, Jakub Kicinski wrote:
> > On Tue, 11 Jul 2023 10:06:28 -0700 Mina Almasry wrote:
> >>>> Any reason not to allow an alternative representation for skb frags than
> >>>> struct page?
> >>>
> >>> I don't think there's a hard technical reason. We can make it work.
> >>
> >> I also think we can switch the representation for skb frags to
> >> something else. However - please do correct me if I'm wrong - I don't
> >> think that is sufficient for device memory TCP. My understanding is
> >> that we also need to modify any NIC drivers that want to use device
> >> memory TCP to understand a new memory type, and the page pool as well
> >> if that's involved. I think in particular modifying the memory type in
> >> all the NIC drivers that want to do device memory TCP is difficult. Do
> >> you think this is feasible?
> >
> > That's why I was thinking about adding an abstraction between
> > the page pool and the driver. Instead of feeding driver pages
> > a new abstraction could feed the driver just an identifier and a PA.
>
> skb frag is currently a bio_vec. Overloading the 'struct page' address
> in that struct with another address is easy to do. Requiring a certain
> alignment on the address gives you a few low bits to use a flags / magic
> / etc.
>
> Overloading len and offset is not really possible - way too much code is
> affected (e.g., iov walking and MSS / TSO segmenting).
>
> ie., you could overload page address with a pointer to an object in your
> new abstraction layer and the struct has the other meta data.
>
> typedef struct skb_frag {
>         union {
>                 struct bio_vec bvec;
>                 struct new_abstraction abs;
>         };
> } skb_frag_t;
>
> where
>
> struct new_abstraction {
>         void *addr,
>         unsigned int len;
>         unsigned int offset;
> };
>
> I have been playing with a similar and it co-exists with the existing
> code quite well with the constraint on location of len and offset.
>
> >
> > Whether we want to support fragmentation in that model or not would
> > have to be decided.
> >
> > We can take pages from the page pool and feed them to drivers via
> > such an API, but drivers need to stop expecting pages.
>
> yes, drivers would have to be updated to understand the new format. A
> downside, but again relatively easy to manage.
>

I'm glad to see that you're open to this approach. As far as I
understand, getting device memory in a struct page form would still be
preferred, no? And the approach you point to would be a backup plan I
presume?

Since the good folks on this thread have pointed me to p2pdma to
address my use case, I've been doing some homework to see if it can
apply. AFACT so far, it applies, and Willem actually had a prototype
of it working a while back. The rough approach Willem and I are
thinking of would be something like:

1. The device memory driver would be the p2pdma provider. It would
expose a user API which allocates a device memory region, calls
pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns
a reference to it to the userspace.

2. The NIC driver would be the p2pdma client and orchestrator. It
would expose a user API which binds an rxq to a pci device. Prior to
the bind the user API would check that the pci device has published
p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible
to the driver (pci_p2pdma_distance() I think), etc.

3. The NIC would allocate pages from the p2pdma provider for incoming
packets, and create devmem skbs, and deliver the devmem skbs to the
user using the support in my RFC. AFACT all that code need not be
changed.

AFAICT, all the concerns brought up in this thread are sidestepped by
using p2pdma. I need not allocate struct pages in the core dma-buf
code anymore (or anywhere), and I need not allocate pgmaps. I would
just re-use the p2pdma support.

Anyone see any glaring issues with this approach? I plan on trying to
implement a PoC and sending an RFC v2.

The only pending concern is integration with the page pool, but we
already have some ideas on how to solve that.

> >
> > That's for data buffers only, obviously. We can keep using pages
> > and raw page pool for headers.
>
> yes.
Christian König July 12, 2023, 7:55 a.m. UTC | #50
Am 12.07.23 um 05:42 schrieb Mina Almasry:
> On Tue, Jul 11, 2023 at 2:39 PM David Ahern <dsahern@kernel.org> wrote:
>> On 7/11/23 2:39 PM, Jakub Kicinski wrote:
>>> On Tue, 11 Jul 2023 10:06:28 -0700 Mina Almasry wrote:
>>>>>> Any reason not to allow an alternative representation for skb frags than
>>>>>> struct page?
>>>>> I don't think there's a hard technical reason. We can make it work.
>>>> I also think we can switch the representation for skb frags to
>>>> something else. However - please do correct me if I'm wrong - I don't
>>>> think that is sufficient for device memory TCP. My understanding is
>>>> that we also need to modify any NIC drivers that want to use device
>>>> memory TCP to understand a new memory type, and the page pool as well
>>>> if that's involved. I think in particular modifying the memory type in
>>>> all the NIC drivers that want to do device memory TCP is difficult. Do
>>>> you think this is feasible?
>>> That's why I was thinking about adding an abstraction between
>>> the page pool and the driver. Instead of feeding driver pages
>>> a new abstraction could feed the driver just an identifier and a PA.
>> skb frag is currently a bio_vec. Overloading the 'struct page' address
>> in that struct with another address is easy to do. Requiring a certain
>> alignment on the address gives you a few low bits to use a flags / magic
>> / etc.
>>
>> Overloading len and offset is not really possible - way too much code is
>> affected (e.g., iov walking and MSS / TSO segmenting).
>>
>> ie., you could overload page address with a pointer to an object in your
>> new abstraction layer and the struct has the other meta data.
>>
>> typedef struct skb_frag {
>>          union {
>>                  struct bio_vec bvec;
>>                  struct new_abstraction abs;
>>          };
>> } skb_frag_t;
>>
>> where
>>
>> struct new_abstraction {
>>          void *addr,
>>          unsigned int len;
>>          unsigned int offset;
>> };
>>
>> I have been playing with a similar and it co-exists with the existing
>> code quite well with the constraint on location of len and offset.
>>
>>> Whether we want to support fragmentation in that model or not would
>>> have to be decided.
>>>
>>> We can take pages from the page pool and feed them to drivers via
>>> such an API, but drivers need to stop expecting pages.
>> yes, drivers would have to be updated to understand the new format. A
>> downside, but again relatively easy to manage.
>>
> I'm glad to see that you're open to this approach. As far as I
> understand, getting device memory in a struct page form would still be
> preferred, no? And the approach you point to would be a backup plan I
> presume?

Well yes and no, if you need struct pages depends on what you want to do.

struct pages are an approach to manage memory regions, but P2PDMA is 
essentially about pumping data between devices.

If your data is for example organized in files on a filesystem then 
having struct pages is a must have because you need to be able to manage 
references, address spaces and so one.

If a pages was acquired with alloc_pages() in a driver then you have 
struct pages, but you can't necessary use them the way you want to use 
them because the first few dwords have different meaning depending on 
the use case.

And then you have the use case where you have for example micro 
controllers using P2PDMA to talk with each other. In this case your DMA 
address might not be memory at all, but rather MMIO. E.g. doorbells on 
graphics hw is such a case as well as cameras and encoders which 
communicate directly with each other.

> Since the good folks on this thread have pointed me to p2pdma to
> address my use case, I've been doing some homework to see if it can
> apply. AFACT so far, it applies, and Willem actually had a prototype
> of it working a while back. The rough approach Willem and I are
> thinking of would be something like:
>
> 1. The device memory driver would be the p2pdma provider. It would
> expose a user API which allocates a device memory region, calls
> pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns
> a reference to it to the userspace.
>
> 2. The NIC driver would be the p2pdma client and orchestrator. It
> would expose a user API which binds an rxq to a pci device. Prior to
> the bind the user API would check that the pci device has published
> p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible
> to the driver (pci_p2pdma_distance() I think), etc.
>
> 3. The NIC would allocate pages from the p2pdma provider for incoming
> packets, and create devmem skbs, and deliver the devmem skbs to the
> user using the support in my RFC. AFACT all that code need not be
> changed.
>
> AFAICT, all the concerns brought up in this thread are sidestepped by
> using p2pdma. I need not allocate struct pages in the core dma-buf
> code anymore (or anywhere), and I need not allocate pgmaps. I would
> just re-use the p2pdma support.
>
> Anyone see any glaring issues with this approach? I plan on trying to
> implement a PoC and sending an RFC v2.

Well we already have DMA-buf as user API for this use case, which is 
perfectly supported by RDMA if I'm not completely mistaken.

So what problem do you try to solve here actually?

Regards,
Christian.

>
> The only pending concern is integration with the page pool, but we
> already have some ideas on how to solve that.
>
>>> That's for data buffers only, obviously. We can keep using pages
>>> and raw page pool for headers.
>> yes.
>
>
Jason Gunthorpe July 12, 2023, 1:01 p.m. UTC | #51
On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote:

> 1. The device memory driver would be the p2pdma provider. It would
> expose a user API which allocates a device memory region, calls
> pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns
> a reference to it to the userspace.

This is not quite right, if you convert any of the GPU drivers to use
P2PDMA you are going to need to restructure the p2pmem stuff to
seperate the genalloc. The GPU driver must continue to be the owner
and allocator of the MMIO memory it already controls, we can't have
two allocators working in parallel. 

The genalloc stuff supports the special NVMe use case, I don't know of
anything else that would like to work that way.

> 2. The NIC driver would be the p2pdma client and orchestrator. It
> would expose a user API which binds an rxq to a pci device. Prior to
> the bind the user API would check that the pci device has published
> p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible
> to the driver (pci_p2pdma_distance() I think), etc.

This doesn't fit the programming model for GPUs at all. You don't want
to get packets landing in random GPU memory that a kernel side
allocator selects, you want packets landing in GPU memory owned by a
specific process that owns the TCP connection.

This is why DMABUF is used here as it gives a handle to the GPU
memory. What you want is to get the P2P pages either directly from the
DMABUF or via pin_user_pages() on the DMABUF's mmap.

> AFAICT, all the concerns brought up in this thread are sidestepped by
> using p2pdma. I need not allocate struct pages in the core dma-buf
> code anymore (or anywhere), and I need not allocate pgmaps. I would
> just re-use the p2pdma support.

Well, as I said it is going to be a big ask to P2P enable any of the
DRM drivers.

And you still have the netmem vs zone_device struct page conflict to
figure out

But it is alot closer to reasonable than this RFC.

Jason
Jason Gunthorpe July 12, 2023, 1:03 p.m. UTC | #52
On Wed, Jul 12, 2023 at 09:55:51AM +0200, Christian König wrote:

> > Anyone see any glaring issues with this approach? I plan on trying to
> > implement a PoC and sending an RFC v2.
> 
> Well we already have DMA-buf as user API for this use case, which is
> perfectly supported by RDMA if I'm not completely mistaken.
> 
> So what problem do you try to solve here actually?

In a nutshell, netdev's design currently needs struct pages to do DMA
to it's packet buffers.

So it cannot consume the scatterlist that dmabuf puts out

RDMA doesn't need struct pages at all, so it is fine.

If Mina can go down the path of changing netdev to avoid needing
struct pages then no changes to DRM side things.

Otherwise a P2P struct page and a co-existance with netmem on a
ZONE_DEVICE page would be required. :\

Jason
Christian König July 12, 2023, 1:35 p.m. UTC | #53
Am 12.07.23 um 15:03 schrieb Jason Gunthorpe:
> On Wed, Jul 12, 2023 at 09:55:51AM +0200, Christian König wrote:
>
>>> Anyone see any glaring issues with this approach? I plan on trying to
>>> implement a PoC and sending an RFC v2.
>> Well we already have DMA-buf as user API for this use case, which is
>> perfectly supported by RDMA if I'm not completely mistaken.
>>
>> So what problem do you try to solve here actually?
> In a nutshell, netdev's design currently needs struct pages to do DMA
> to it's packet buffers.
>
> So it cannot consume the scatterlist that dmabuf puts out
>
> RDMA doesn't need struct pages at all, so it is fine.
>
> If Mina can go down the path of changing netdev to avoid needing
> struct pages then no changes to DRM side things.
>
> Otherwise a P2P struct page and a co-existance with netmem on a
> ZONE_DEVICE page would be required. :\

Uff, depending on why netdev needs struct page (I think I have a good 
idea why) this isn't really going to work generically either way.

What we maybe able to do is to allow copy_file_range() between DMA-buf 
file descriptor and a TCP socket.

If I'm not completely mistaken that should then end up in DMA-bufs 
file_operations->copy_file_range callback (maybe with some minor change 
to allows this).

The DMA-buf framework could then forward this to the exporter of the 
memory which owns the backing memory could then do the necessary steps.

Regards,
Christian.

>
> Jason
Mina Almasry July 12, 2023, 8:16 p.m. UTC | #54
On Wed, Jul 12, 2023 at 6:01 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote:
>
> > 1. The device memory driver would be the p2pdma provider. It would
> > expose a user API which allocates a device memory region, calls
> > pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns
> > a reference to it to the userspace.
>
> This is not quite right, if you convert any of the GPU drivers to use
> P2PDMA you are going to need to restructure the p2pmem stuff to
> seperate the genalloc. The GPU driver must continue to be the owner
> and allocator of the MMIO memory it already controls, we can't have
> two allocators working in parallel.
>
> The genalloc stuff supports the special NVMe use case, I don't know of
> anything else that would like to work that way.
>

I think maybe you misunderstood the proposal. AFAICT the genalloc
stuff works for us, although there are other issues with p2pdma that I
need to solve.

The proposal was that the uapi in step #1 allocates a region of GPU
memory, and sets up a p2pdma provider for this region of memory. From
the perspective of the GPU, the memory is allocated, and in use by the
user. The p2pdma provider uses genalloc to give out 4K regions with
struct pages to in-kernel allocators from this memory region. Why
would that not work? Looking at the code, that seems to be how p2pdma
works today. The p2pdma provider does p2pdma_add_resource() on a chunk
of its memory, and the genalloc allocates memory from that chunk?

The actual issues I see with this approach are:

1. There is no way for the p2pdma provider to relinquish back the
memory it has provided via pci_p2pdma_add_resource(), in the case that
the user crashed or would like to free the GPU buffer. I would need to
add a pci_p2pdma_remove_resource(). Would that be  acceptable?

2. The p2pdma semantics seem to be global to the pci device. I.e., 1
GPU can export 1 p2pdma resource at a time (the way I'm reading the
API). This is not usable for my use case. I would need multiple users
to be able to use the uapi in step #1 simultaneously. I would need the
same pci device to export different p2pdma resources simultaneously
and the p2pdma clients would need to be able to import some of the
resources. I would likely need to add an api like this:

diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..c9d754713fdc 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -19,6 +19,33 @@ struct scatterlist;
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
                u64 offset);
+
+/* Adds a resource similar to pci_p2pdma_add_resource, and returns a file
+ * handle referring to this resource. Multiple such resources can be exported
+ * by the same pci device.
+ */
+struct file *pci_p2pdma_add_resource_with_handle(struct pci_dev *pdev,
+               int bar,
+               size_t size,
+               u64 offset);
+
+/* Remove a resource added via pci_p2pdma_add_resource_with_handle() */
+struct file *pci_p2pdma_remove_resource_with_handle(
+               struct file *p2pdma_resource_file);
+
+/* Allocates memory from a resource created using
+ * pci_p2pdma_add_resource_with_handle()
+ */
+void *pci_alloc_p2pmem_from_handle(struct file *p2pdma_resource_file,
+               size_t size);
+
+/* Frees p2pmem to a resource created using
+ * pci_p2pdma_add_resource_with_handle()
+ */
+void pci_free_p2pmem_to_handle(struct pci_dev *p2pdma_resource_file,
+               void *addr,
+               size_t size);
+

Looking for feedback from anyone knowledgeable, but the p2pdma
maintainers as well if possibl.

> > 2. The NIC driver would be the p2pdma client and orchestrator. It
> > would expose a user API which binds an rxq to a pci device. Prior to
> > the bind the user API would check that the pci device has published
> > p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible
> > to the driver (pci_p2pdma_distance() I think), etc.
>
> This doesn't fit the programming model for GPUs at all. You don't want
> to get packets landing in random GPU memory that a kernel side
> allocator selects, you want packets landing in GPU memory owned by a
> specific process that owns the TCP connection.
>

I think this comment is maybe a side effect of the misunderstanding.
In the proposal, the user allocates a GPU buffer using the API in step
#1, and then binds the memory to the NIC rx queue using the API
specified in step #2. We use flow steering & RSS to steer this user's
TCP traffic to the buffer owned by them.

> This is why DMABUF is used here as it gives a handle to the GPU
> memory. What you want is to get the P2P pages either directly from the
> DMABUF or via pin_user_pages() on the DMABUF's mmap.
>
> > AFAICT, all the concerns brought up in this thread are sidestepped by
> > using p2pdma. I need not allocate struct pages in the core dma-buf
> > code anymore (or anywhere), and I need not allocate pgmaps. I would
> > just re-use the p2pdma support.
>
> Well, as I said it is going to be a big ask to P2P enable any of the
> DRM drivers.
>
> And you still have the netmem vs zone_device struct page conflict to
> figure out
>
> But it is alot closer to reasonable than this RFC.
>
> Jason
Mina Almasry July 12, 2023, 10:41 p.m. UTC | #55
On Wed, Jul 12, 2023 at 6:35 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 12.07.23 um 15:03 schrieb Jason Gunthorpe:
> > On Wed, Jul 12, 2023 at 09:55:51AM +0200, Christian König wrote:
> >
> >>> Anyone see any glaring issues with this approach? I plan on trying to
> >>> implement a PoC and sending an RFC v2.
> >> Well we already have DMA-buf as user API for this use case, which is
> >> perfectly supported by RDMA if I'm not completely mistaken.
> >>
> >> So what problem do you try to solve here actually?
> > In a nutshell, netdev's design currently needs struct pages to do DMA
> > to it's packet buffers.
> >
> > So it cannot consume the scatterlist that dmabuf puts out
> >
> > RDMA doesn't need struct pages at all, so it is fine.
> >
> > If Mina can go down the path of changing netdev to avoid needing
> > struct pages then no changes to DRM side things.
> >
> > Otherwise a P2P struct page and a co-existance with netmem on a
> > ZONE_DEVICE page would be required. :\
>
> Uff, depending on why netdev needs struct page (I think I have a good
> idea why) this isn't really going to work generically either way.
>
> What we maybe able to do is to allow copy_file_range() between DMA-buf
> file descriptor and a TCP socket.
>
> If I'm not completely mistaken that should then end up in DMA-bufs
> file_operations->copy_file_range callback (maybe with some minor change
> to allows this).
>
> The DMA-buf framework could then forward this to the exporter of the
> memory which owns the backing memory could then do the necessary steps.
>

I may be missing something, but the way it works on our end for
receive is that we give a list of buffers (dma_addr + length + other
metadata) to the network card, and the network card writes incoming
packets to these dma_addrs and gives us an rx completion pointing to
the data it DMA'd. Usually the network card does something like an
alloc_page() + dma_map_page() and provides the to the network card.
Transmit path works similarly. Not sure that adding copy_file_range()
support to dma-buf enables this in some way.
Jason Gunthorpe July 12, 2023, 11:57 p.m. UTC | #56
On Wed, Jul 12, 2023 at 01:16:00PM -0700, Mina Almasry wrote:

> The proposal was that the uapi in step #1 allocates a region of GPU
> memory, and sets up a p2pdma provider for this region of memory.

Honestly that feels too hacky, which is why I've said a few times
you'd actually need to integrate p2p pages into a DRM driver properly.

> 2. The p2pdma semantics seem to be global to the pci device. I.e., 1
> GPU can export 1 p2pdma resource at a time (the way I'm reading the
> API).

All of this is just a reflection that you are trying to use it in a
wrong hacky way. The driver is supposed to create a p2p registration
for its entire BAR at startup. Not on demand and not in fragments.

Jason
Christian König July 13, 2023, 7:56 a.m. UTC | #57
Am 12.07.23 um 22:16 schrieb Mina Almasry:
> On Wed, Jul 12, 2023 at 6:01 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote:
>>
>>> 1. The device memory driver would be the p2pdma provider. It would
>>> expose a user API which allocates a device memory region, calls
>>> pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns
>>> a reference to it to the userspace.
>> This is not quite right, if you convert any of the GPU drivers to use
>> P2PDMA you are going to need to restructure the p2pmem stuff to
>> seperate the genalloc. The GPU driver must continue to be the owner
>> and allocator of the MMIO memory it already controls, we can't have
>> two allocators working in parallel.
>>
>> The genalloc stuff supports the special NVMe use case, I don't know of
>> anything else that would like to work that way.
>>
> I think maybe you misunderstood the proposal. AFAICT the genalloc
> stuff works for us, although there are other issues with p2pdma that I
> need to solve.
>
> The proposal was that the uapi in step #1 allocates a region of GPU
> memory, and sets up a p2pdma provider for this region of memory. From
> the perspective of the GPU, the memory is allocated, and in use by the
> user. The p2pdma provider uses genalloc to give out 4K regions with
> struct pages to in-kernel allocators from this memory region. Why
> would that not work?

Oh well, where should I start.

struct page is used in the various I/O  subsystems instead of DMA 
addresses because it allows for a wider range of operations.

For example when a page is acquired using get_user_pages() somebody can 
use the rmap to figure out where a page is mapped and eventually unmap 
it, map it read only or change the caching attributes etc...

Then you have the ability to grab a reference to a page, this for 
example allows I/O operations to complete and not access freed memory 
even when the application has already long died.

Then a very common use case is that you need to fallback to a CPU copy 
because the data inside the page isn't aligned or outside the physical 
reach of a device.

The are just numerous issues with what I listed above, for example some 
of those use cases only work with pagecache pages.

Approaching it from the user side, with GPUs there is usually no 
guarantee that stuff is coherent. E.g. a network card wouldn't 
automatically see the results of a calculation.

Then GPUs usually have tons of memory which isn't CPU accessible or even 
PCIe bus accessible. So a bounce buffer done with a CPU copy won't work, 
you need to bounce this with a hw assisted copy. Or you have inter 
device connections. For example ethernet over HDMI links would be able 
to access all of the internal GPU resources.

Then GPUs often need to shuffle memory around, e.g. similar 
functionality to ___GFP_MOVABLE. Just with stuff not CPU accessible nor 
mapped into CPU page tables.

...

I mean I can go with this list for quite some time :)

> Looking at the code, that seems to be how p2pdma
> works today. The p2pdma provider does p2pdma_add_resource() on a chunk
> of its memory, and the genalloc allocates memory from that chunk?

Well this is how it works for NVMe, that doesn't mean this way works for 
GPUs or acceleration devices.

Regards,
Christian.

>
> The actual issues I see with this approach are:
>
> 1. There is no way for the p2pdma provider to relinquish back the
> memory it has provided via pci_p2pdma_add_resource(), in the case that
> the user crashed or would like to free the GPU buffer. I would need to
> add a pci_p2pdma_remove_resource(). Would that be  acceptable?
>
> 2. The p2pdma semantics seem to be global to the pci device. I.e., 1
> GPU can export 1 p2pdma resource at a time (the way I'm reading the
> API). This is not usable for my use case. I would need multiple users
> to be able to use the uapi in step #1 simultaneously. I would need the
> same pci device to export different p2pdma resources simultaneously
> and the p2pdma clients would need to be able to import some of the
> resources. I would likely need to add an api like this:
>
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 8318a97c9c61..c9d754713fdc 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -19,6 +19,33 @@ struct scatterlist;
>   #ifdef CONFIG_PCI_P2PDMA
>   int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>                  u64 offset);
> +
> +/* Adds a resource similar to pci_p2pdma_add_resource, and returns a file
> + * handle referring to this resource. Multiple such resources can be exported
> + * by the same pci device.
> + */
> +struct file *pci_p2pdma_add_resource_with_handle(struct pci_dev *pdev,
> +               int bar,
> +               size_t size,
> +               u64 offset);
> +
> +/* Remove a resource added via pci_p2pdma_add_resource_with_handle() */
> +struct file *pci_p2pdma_remove_resource_with_handle(
> +               struct file *p2pdma_resource_file);
> +
> +/* Allocates memory from a resource created using
> + * pci_p2pdma_add_resource_with_handle()
> + */
> +void *pci_alloc_p2pmem_from_handle(struct file *p2pdma_resource_file,
> +               size_t size);
> +
> +/* Frees p2pmem to a resource created using
> + * pci_p2pdma_add_resource_with_handle()
> + */
> +void pci_free_p2pmem_to_handle(struct pci_dev *p2pdma_resource_file,
> +               void *addr,
> +               size_t size);
> +
>
> Looking for feedback from anyone knowledgeable, but the p2pdma
> maintainers as well if possibl.
>
>>> 2. The NIC driver would be the p2pdma client and orchestrator. It
>>> would expose a user API which binds an rxq to a pci device. Prior to
>>> the bind the user API would check that the pci device has published
>>> p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible
>>> to the driver (pci_p2pdma_distance() I think), etc.
>> This doesn't fit the programming model for GPUs at all. You don't want
>> to get packets landing in random GPU memory that a kernel side
>> allocator selects, you want packets landing in GPU memory owned by a
>> specific process that owns the TCP connection.
>>
> I think this comment is maybe a side effect of the misunderstanding.
> In the proposal, the user allocates a GPU buffer using the API in step
> #1, and then binds the memory to the NIC rx queue using the API
> specified in step #2. We use flow steering & RSS to steer this user's
> TCP traffic to the buffer owned by them.
>
>> This is why DMABUF is used here as it gives a handle to the GPU
>> memory. What you want is to get the P2P pages either directly from the
>> DMABUF or via pin_user_pages() on the DMABUF's mmap.
>>
>>> AFAICT, all the concerns brought up in this thread are sidestepped by
>>> using p2pdma. I need not allocate struct pages in the core dma-buf
>>> code anymore (or anywhere), and I need not allocate pgmaps. I would
>>> just re-use the p2pdma support.
>> Well, as I said it is going to be a big ask to P2P enable any of the
>> DRM drivers.
>>
>> And you still have the netmem vs zone_device struct page conflict to
>> figure out
>>
>> But it is alot closer to reasonable than this RFC.
>>
>> Jason
Mina Almasry July 14, 2023, 2:55 p.m. UTC | #58
On Thu, Jul 13, 2023 at 12:56 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 12.07.23 um 22:16 schrieb Mina Almasry:
> > On Wed, Jul 12, 2023 at 6:01 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote:
> >>
> >>> 1. The device memory driver would be the p2pdma provider. It would
> >>> expose a user API which allocates a device memory region, calls
> >>> pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns
> >>> a reference to it to the userspace.
> >> This is not quite right, if you convert any of the GPU drivers to use
> >> P2PDMA you are going to need to restructure the p2pmem stuff to
> >> seperate the genalloc. The GPU driver must continue to be the owner
> >> and allocator of the MMIO memory it already controls, we can't have
> >> two allocators working in parallel.
> >>
> >> The genalloc stuff supports the special NVMe use case, I don't know of
> >> anything else that would like to work that way.
> >>
> > I think maybe you misunderstood the proposal. AFAICT the genalloc
> > stuff works for us, although there are other issues with p2pdma that I
> > need to solve.
> >
> > The proposal was that the uapi in step #1 allocates a region of GPU
> > memory, and sets up a p2pdma provider for this region of memory. From
> > the perspective of the GPU, the memory is allocated, and in use by the
> > user. The p2pdma provider uses genalloc to give out 4K regions with
> > struct pages to in-kernel allocators from this memory region. Why
> > would that not work?
>
> Oh well, where should I start.
>
> struct page is used in the various I/O  subsystems instead of DMA
> addresses because it allows for a wider range of operations.
>
> For example when a page is acquired using get_user_pages() somebody can
> use the rmap to figure out where a page is mapped and eventually unmap
> it, map it read only or change the caching attributes etc...
>
> Then you have the ability to grab a reference to a page, this for
> example allows I/O operations to complete and not access freed memory
> even when the application has already long died.
>
> Then a very common use case is that you need to fallback to a CPU copy
> because the data inside the page isn't aligned or outside the physical
> reach of a device.
>
> The are just numerous issues with what I listed above, for example some
> of those use cases only work with pagecache pages.
>
> Approaching it from the user side, with GPUs there is usually no
> guarantee that stuff is coherent. E.g. a network card wouldn't
> automatically see the results of a calculation.
>
> Then GPUs usually have tons of memory which isn't CPU accessible or even
> PCIe bus accessible. So a bounce buffer done with a CPU copy won't work,
> you need to bounce this with a hw assisted copy. Or you have inter
> device connections. For example ethernet over HDMI links would be able
> to access all of the internal GPU resources.
>
> Then GPUs often need to shuffle memory around, e.g. similar
> functionality to ___GFP_MOVABLE. Just with stuff not CPU accessible nor
> mapped into CPU page tables.
>
> ...
>
> I mean I can go with this list for quite some time :)
>
> > Looking at the code, that seems to be how p2pdma
> > works today. The p2pdma provider does p2pdma_add_resource() on a chunk
> > of its memory, and the genalloc allocates memory from that chunk?
>
> Well this is how it works for NVMe, that doesn't mean this way works for
> GPUs or acceleration devices.
>

Thanks Christian, yes, I misunderstood how this works and I apologize
for that. From reading the p2pdma interface it looked to me like it
takes a buffer as input and goes ahead and allocates the struct pages
for it and exports them as a provider. As you and Jason have
repeatedly tried to explain to me this bit is NVMe specific and as
Jason puts it it is a "big ask to P2P enable any of the DRM drivers".

I am facing some logistical issues as well. My use case requires NIC
with special features, and even getting access to the appropriate
hardware may be an issue for me.

I guess the remaining option not fully explored is the idea of getting
the networking stack to consume the scatterlist that
dma_buf_map_attachment() provides for the device memory. The very
rough approach I have in mind (for the RX path) is:

1. Some uapi that binds a dmabuf to an RX queue. It will do a
dma_buf_map_attachment() and get the sg table.

2. We need to feed the scratterlist entries to some allocator that
will chunk it up into pieces that can be allocated by the NIC for
incoming traffic. I'm thinking genalloc may work for this as-is, but I
may need to add one or use something else if I run into some issue.

3. We can implement a memory_provider that allocates these chunks and
wraps them in a struct new_abstraction (as David called it) and feeds
those into the page pool.

4. The page pool would need to be able to process these struct
new_abstraction alongside the struct pages it normally gets from
providers. This is maybe the most complicated part, but looking at the
page pool code it doesn't seem that big of a hurdle (but I have not
tried a POC yet).

5. The drivers (I looked at mlx5) seem to avoid making any mm calls on
the struct pages returned by the pool; the pool abstracts everything
already. The changes to the drivers may be minimal..?

6. We would need to add a new helper, skb_add_rx_new_abstraction_frag
that creates a frag out of new_abstraction rather than a struct page.

Once the skb frags with struct new_abstraction are in the TCP stack,
they will need some special handling in code accessing the frags. But
my RFC already addressed that somewhat because the frags were
inaccessible in that case. In this case the frags will be both
inaccessible and will not be struct pages at all (things like
get_page() will not work), so more special handling will be required,
maybe.

I imagine the TX path would be considerably less complicated because
the allocator and page pool are not involved (I think).

Anyone see any glaring issues with this approach?

> Regards,
> Christian.
>
> >
> > The actual issues I see with this approach are:
> >
> > 1. There is no way for the p2pdma provider to relinquish back the
> > memory it has provided via pci_p2pdma_add_resource(), in the case that
> > the user crashed or would like to free the GPU buffer. I would need to
> > add a pci_p2pdma_remove_resource(). Would that be  acceptable?
> >
> > 2. The p2pdma semantics seem to be global to the pci device. I.e., 1
> > GPU can export 1 p2pdma resource at a time (the way I'm reading the
> > API). This is not usable for my use case. I would need multiple users
> > to be able to use the uapi in step #1 simultaneously. I would need the
> > same pci device to export different p2pdma resources simultaneously
> > and the p2pdma clients would need to be able to import some of the
> > resources. I would likely need to add an api like this:
> >
> > diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> > index 8318a97c9c61..c9d754713fdc 100644
> > --- a/include/linux/pci-p2pdma.h
> > +++ b/include/linux/pci-p2pdma.h
> > @@ -19,6 +19,33 @@ struct scatterlist;
> >   #ifdef CONFIG_PCI_P2PDMA
> >   int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> >                  u64 offset);
> > +
> > +/* Adds a resource similar to pci_p2pdma_add_resource, and returns a file
> > + * handle referring to this resource. Multiple such resources can be exported
> > + * by the same pci device.
> > + */
> > +struct file *pci_p2pdma_add_resource_with_handle(struct pci_dev *pdev,
> > +               int bar,
> > +               size_t size,
> > +               u64 offset);
> > +
> > +/* Remove a resource added via pci_p2pdma_add_resource_with_handle() */
> > +struct file *pci_p2pdma_remove_resource_with_handle(
> > +               struct file *p2pdma_resource_file);
> > +
> > +/* Allocates memory from a resource created using
> > + * pci_p2pdma_add_resource_with_handle()
> > + */
> > +void *pci_alloc_p2pmem_from_handle(struct file *p2pdma_resource_file,
> > +               size_t size);
> > +
> > +/* Frees p2pmem to a resource created using
> > + * pci_p2pdma_add_resource_with_handle()
> > + */
> > +void pci_free_p2pmem_to_handle(struct pci_dev *p2pdma_resource_file,
> > +               void *addr,
> > +               size_t size);
> > +
> >
> > Looking for feedback from anyone knowledgeable, but the p2pdma
> > maintainers as well if possibl.
> >
> >>> 2. The NIC driver would be the p2pdma client and orchestrator. It
> >>> would expose a user API which binds an rxq to a pci device. Prior to
> >>> the bind the user API would check that the pci device has published
> >>> p2p memory (pci_has_p2pmem()), and check the the p2p mem is accessible
> >>> to the driver (pci_p2pdma_distance() I think), etc.
> >> This doesn't fit the programming model for GPUs at all. You don't want
> >> to get packets landing in random GPU memory that a kernel side
> >> allocator selects, you want packets landing in GPU memory owned by a
> >> specific process that owns the TCP connection.
> >>
> > I think this comment is maybe a side effect of the misunderstanding.
> > In the proposal, the user allocates a GPU buffer using the API in step
> > #1, and then binds the memory to the NIC rx queue using the API
> > specified in step #2. We use flow steering & RSS to steer this user's
> > TCP traffic to the buffer owned by them.
> >
> >> This is why DMABUF is used here as it gives a handle to the GPU
> >> memory. What you want is to get the P2P pages either directly from the
> >> DMABUF or via pin_user_pages() on the DMABUF's mmap.
> >>
> >>> AFAICT, all the concerns brought up in this thread are sidestepped by
> >>> using p2pdma. I need not allocate struct pages in the core dma-buf
> >>> code anymore (or anywhere), and I need not allocate pgmaps. I would
> >>> just re-use the p2pdma support.
> >> Well, as I said it is going to be a big ask to P2P enable any of the
> >> DRM drivers.
> >>
> >> And you still have the netmem vs zone_device struct page conflict to
> >> figure out
> >>
> >> But it is alot closer to reasonable than this RFC.
> >>
> >> Jason
>
David Ahern July 14, 2023, 3:18 p.m. UTC | #59
On 7/14/23 8:55 AM, Mina Almasry wrote:
> 
> I guess the remaining option not fully explored is the idea of getting
> the networking stack to consume the scatterlist that
> dma_buf_map_attachment() provides for the device memory. The very
> rough approach I have in mind (for the RX path) is:
> 
> 1. Some uapi that binds a dmabuf to an RX queue. It will do a
> dma_buf_map_attachment() and get the sg table.
> 
> 2. We need to feed the scratterlist entries to some allocator that
> will chunk it up into pieces that can be allocated by the NIC for
> incoming traffic. I'm thinking genalloc may work for this as-is, but I
> may need to add one or use something else if I run into some issue.
> 
> 3. We can implement a memory_provider that allocates these chunks and
> wraps them in a struct new_abstraction (as David called it) and feeds
> those into the page pool.
> 
> 4. The page pool would need to be able to process these struct
> new_abstraction alongside the struct pages it normally gets from
> providers. This is maybe the most complicated part, but looking at the
> page pool code it doesn't seem that big of a hurdle (but I have not
> tried a POC yet).
> 
> 5. The drivers (I looked at mlx5) seem to avoid making any mm calls on
> the struct pages returned by the pool; the pool abstracts everything
> already. The changes to the drivers may be minimal..?
> 
> 6. We would need to add a new helper, skb_add_rx_new_abstraction_frag
> that creates a frag out of new_abstraction rather than a struct page.
> 
> Once the skb frags with struct new_abstraction are in the TCP stack,
> they will need some special handling in code accessing the frags. But
> my RFC already addressed that somewhat because the frags were
> inaccessible in that case. In this case the frags will be both
> inaccessible and will not be struct pages at all (things like
> get_page() will not work), so more special handling will be required,
> maybe.
> 
> I imagine the TX path would be considerably less complicated because
> the allocator and page pool are not involved (I think).
> 
> Anyone see any glaring issues with this approach?

Moving skb_frags to an alternative scheme is essential to make this
work. The current page scheme to go from user virtual to pages to
physical is not needed for the dmabuf use case.

For the driver and hardware queue: don't you need a dedicated queue for
the flow(s) in question? If not, how can you properly handle the
teardown case (e.g., app crashes and you need to ensure all references
to GPU memory are removed from NIC descriptors)? If you agree on this
point, then you can require the dedicated queue management in the driver
to use and expect only the alternative frag addressing scheme. ie., it
knows the address is not struct page (validates by checking skb flag or
frag flag or address magic), but a reference to say a page_pool entry
(if you are using page_pool for management of the dmabuf slices) which
contains the metadata needed for the use case.
Jason Gunthorpe July 14, 2023, 3:55 p.m. UTC | #60
On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:

> Once the skb frags with struct new_abstraction are in the TCP stack,
> they will need some special handling in code accessing the frags. But
> my RFC already addressed that somewhat because the frags were
> inaccessible in that case. In this case the frags will be both
> inaccessible and will not be struct pages at all (things like
> get_page() will not work), so more special handling will be required,
> maybe.

It seems sort of reasonable, though there will be interesting concerns
about coherence and synchronization with generial purpose DMABUFs that
will need tackling.

Still it is such a lot of churn and weridness in the netdev side, I
think you'd do well to present an actual full application as
justification.

Yes, you showed you can stick unordered TCP data frags into GPU memory
sort of quickly, but have you gone further with this to actually show
it is useful for a real world GPU centric application?

BTW your cover letter said 96% utilization, the usual server
configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of
TCP BW with this?

Jason
Mina Almasry July 17, 2023, 1:53 a.m. UTC | #61
On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
>
> > Once the skb frags with struct new_abstraction are in the TCP stack,
> > they will need some special handling in code accessing the frags. But
> > my RFC already addressed that somewhat because the frags were
> > inaccessible in that case. In this case the frags will be both
> > inaccessible and will not be struct pages at all (things like
> > get_page() will not work), so more special handling will be required,
> > maybe.
>
> It seems sort of reasonable, though there will be interesting concerns
> about coherence and synchronization with generial purpose DMABUFs that
> will need tackling.
>
> Still it is such a lot of churn and weridness in the netdev side, I
> think you'd do well to present an actual full application as
> justification.
>
> Yes, you showed you can stick unordered TCP data frags into GPU memory
> sort of quickly, but have you gone further with this to actually show
> it is useful for a real world GPU centric application?
>
> BTW your cover letter said 96% utilization, the usual server
> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of
> TCP BW with this?
>

I do notice that the number of NICs is missing from our public
documentation so far, so I will refrain from specifying how many NICs
are on those A3 VMs until the information is public. But I think I can
confirm that your general thinking is correct, the perf that we're
getting is 96.6% line rate of each GPU/NIC pair, and scales linearly
for each NIC/GPU pair we've tested with so far. Line rate of each
NIC/GPU pair is 200 Gb/sec.

So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.
If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec
...
etc.
Mina Almasry July 17, 2023, 2:05 a.m. UTC | #62
On Fri, Jul 14, 2023 at 8:19 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 7/14/23 8:55 AM, Mina Almasry wrote:
> >
> > I guess the remaining option not fully explored is the idea of getting
> > the networking stack to consume the scatterlist that
> > dma_buf_map_attachment() provides for the device memory. The very
> > rough approach I have in mind (for the RX path) is:
> >
> > 1. Some uapi that binds a dmabuf to an RX queue. It will do a
> > dma_buf_map_attachment() and get the sg table.
> >
> > 2. We need to feed the scratterlist entries to some allocator that
> > will chunk it up into pieces that can be allocated by the NIC for
> > incoming traffic. I'm thinking genalloc may work for this as-is, but I
> > may need to add one or use something else if I run into some issue.
> >
> > 3. We can implement a memory_provider that allocates these chunks and
> > wraps them in a struct new_abstraction (as David called it) and feeds
> > those into the page pool.
> >
> > 4. The page pool would need to be able to process these struct
> > new_abstraction alongside the struct pages it normally gets from
> > providers. This is maybe the most complicated part, but looking at the
> > page pool code it doesn't seem that big of a hurdle (but I have not
> > tried a POC yet).
> >
> > 5. The drivers (I looked at mlx5) seem to avoid making any mm calls on
> > the struct pages returned by the pool; the pool abstracts everything
> > already. The changes to the drivers may be minimal..?
> >
> > 6. We would need to add a new helper, skb_add_rx_new_abstraction_frag
> > that creates a frag out of new_abstraction rather than a struct page.
> >
> > Once the skb frags with struct new_abstraction are in the TCP stack,
> > they will need some special handling in code accessing the frags. But
> > my RFC already addressed that somewhat because the frags were
> > inaccessible in that case. In this case the frags will be both
> > inaccessible and will not be struct pages at all (things like
> > get_page() will not work), so more special handling will be required,
> > maybe.
> >
> > I imagine the TX path would be considerably less complicated because
> > the allocator and page pool are not involved (I think).
> >
> > Anyone see any glaring issues with this approach?
>
> Moving skb_frags to an alternative scheme is essential to make this
> work. The current page scheme to go from user virtual to pages to
> physical is not needed for the dmabuf use case.
>
> For the driver and hardware queue: don't you need a dedicated queue for
> the flow(s) in question?

In the RFC and the implementation I'm thinking of, the queue is
'dedicated' in that each queue will be a devmem TCP queue or a regular
queue. devmem queues generate devmem skbs and non-devmem queues
generate non-devmem skbs. We support switching queues between devmem
mode and non-devmem mode via a uapi.

> If not, how can you properly handle the
> teardown case (e.g., app crashes and you need to ensure all references
> to GPU memory are removed from NIC descriptors)?

Jason and Christian will correct me if I'm wrong, but AFAICT the
dma-buf API requires the dma-buf provider to keep the attachment
mapping alive as long as the importer requires it. The dma-buf API
gives the importer dma_buf_map_attachment() and
dma_buf_unmap_attachment() APIs, but there is no callback for the
exporter to inform the importer that it has to take the mapping away.
The closest thing I saw was the move_notify() callback, but that is
optional.

In my mind the way it works is that there will be some uapi that binds
a dma-buf to an RX queue, that will create the attachment and the
mapping. If the user crashes or closes the dma-buf handle then that
will unbind the dma-buf from the RX queue, but the mapping will remain
alive (via some refcounting) until all the NIC descriptors are freed
and the mapping is not under use anymore. Usually this will happen
next driver reset which destroys and recreates rx queues thereby
freeing all the NIC descriptors (but could be a new API so that we
don't rely on a driver reset).

> If you agree on this
> point, then you can require the dedicated queue management in the driver
> to use and expect only the alternative frag addressing scheme. ie., it
> knows the address is not struct page (validates by checking skb flag or
> frag flag or address magic), but a reference to say a page_pool entry
> (if you are using page_pool for management of the dmabuf slices) which
> contains the metadata needed for the use case.

Honestly if my understanding above doesn't match what you want, I
could implement 'dedicated queues' instead, just let me know what you
want at some future iteration. Now, I'm more worried about this memory
format issue and I'm working on an RX prototype without struct pages.
So far purely technically speaking it seems possible.
David Ahern July 17, 2023, 3:08 a.m. UTC | #63
On 7/16/23 8:05 PM, Mina Almasry wrote:
>>
>> For the driver and hardware queue: don't you need a dedicated queue for
>> the flow(s) in question?
> 
> In the RFC and the implementation I'm thinking of, the queue is
> 'dedicated' in that each queue will be a devmem TCP queue or a regular
> queue. devmem queues generate devmem skbs and non-devmem queues
> generate non-devmem skbs. We support switching queues between devmem
> mode and non-devmem mode via a uapi.

ethtool APIs or something else?

> 
>> If not, how can you properly handle the
>> teardown case (e.g., app crashes and you need to ensure all references
>> to GPU memory are removed from NIC descriptors)?
> 
> Jason and Christian will correct me if I'm wrong, but AFAICT the
> dma-buf API requires the dma-buf provider to keep the attachment
> mapping alive as long as the importer requires it. The dma-buf API
> gives the importer dma_buf_map_attachment() and
> dma_buf_unmap_attachment() APIs, but there is no callback for the
> exporter to inform the importer that it has to take the mapping away.

Isn't the importer that application that terminated (cleanly or other)?
That was my thinking but I guess there are other designs that can cross
a single application.

> The closest thing I saw was the move_notify() callback, but that is
> optional.
> 
> In my mind the way it works is that there will be some uapi that binds
> a dma-buf to an RX queue, that will create the attachment and the
> mapping. If the user crashes or closes the dma-buf handle then that
> will unbind the dma-buf from the RX queue, but the mapping will remain
> alive (via some refcounting) until all the NIC descriptors are freed
> and the mapping is not under use anymore. Usually this will happen
> next driver reset which destroys and recreates rx queues thereby
> freeing all the NIC descriptors (but could be a new API so that we
> don't rely on a driver reset).
> 
>> If you agree on this
>> point, then you can require the dedicated queue management in the driver
>> to use and expect only the alternative frag addressing scheme. ie., it
>> knows the address is not struct page (validates by checking skb flag or
>> frag flag or address magic), but a reference to say a page_pool entry
>> (if you are using page_pool for management of the dmabuf slices) which
>> contains the metadata needed for the use case.
> 
> Honestly if my understanding above doesn't match what you want, I
> could implement 'dedicated queues' instead, just let me know what you
> want at some future iteration. Now, I'm more worried about this memory
> format issue and I'm working on an RX prototype without struct pages.
> So far purely technically speaking it seems possible.
> 
> 

My comment was only a suggestion on how to simplify driver changes. ie.,
a queue is either pages (based on standard page_pool or alloc_pages) or
some "special" page_pool (ie., new abstraction) but not mixed. In that
case it knows how to handle the overloaded 'address' in skb_frag in a
clean manner.
Jesper Dangaard Brouer July 24, 2023, 2:56 p.m. UTC | #64
On 17/07/2023 03.53, Mina Almasry wrote:
> On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
>>
>>> Once the skb frags with struct new_abstraction are in the TCP stack,
>>> they will need some special handling in code accessing the frags. But
>>> my RFC already addressed that somewhat because the frags were
>>> inaccessible in that case. In this case the frags will be both
>>> inaccessible and will not be struct pages at all (things like
>>> get_page() will not work), so more special handling will be required,
>>> maybe.
>>
>> It seems sort of reasonable, though there will be interesting concerns
>> about coherence and synchronization with generial purpose DMABUFs that
>> will need tackling.
>>
>> Still it is such a lot of churn and weridness in the netdev side, I
>> think you'd do well to present an actual full application as
>> justification.
>>
>> Yes, you showed you can stick unordered TCP data frags into GPU memory
>> sort of quickly, but have you gone further with this to actually show
>> it is useful for a real world GPU centric application?
>>
>> BTW your cover letter said 96% utilization, the usual server
>> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of
>> TCP BW with this?
>>
> 
> I do notice that the number of NICs is missing from our public
> documentation so far, so I will refrain from specifying how many NICs
> are on those A3 VMs until the information is public. But I think I can
> confirm that your general thinking is correct, the perf that we're
> getting is 96.6% line rate of each GPU/NIC pair, 

What do you mean by 96.6% "line rate".
Is is the Ethernet line-rate?

Is the measured throughput the measured TCP data "goodput"?
Assuming
  - MTU 1500 bytes (1514 on wire).
  - Ethernet header 14 bytes
  - IP header 20 bytes
  - TCP header 20 bytes

Due to header overhead the goodput will be approx 96.4%.
  - (1514-(14+20+20))/1514 = 0.9643
  - (Not taking Ethernet interframe gap into account).

Thus, maybe you have hit Ethernet wire line-rate already?

> and scales linearly
> for each NIC/GPU pair we've tested with so far. Line rate of each
> NIC/GPU pair is 200 Gb/sec.
> 
> So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.

Lets keep our units straight.
Here you mean 1545 Gbit/sec, which is 193 GBytes/s

> If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec

Here you mean 384 Gbit/sec, which is 48 GBytes/sec.

> ...
> etc.
> 

These massive throughput numbers are important, because they *exceed*
the physical host RAM/DIMM memory speeds.

This is the *real argument* why software cannot afford to do a single
copy of the data from host-RAM into GPU-memory, because the CPU memory
throughput to DRAM/DIMM are insufficient.

My testlab CPU E5-1650 have 4 DIMM slots DDR4
  - Data Width: 64 bits (= 8 bytes)
  - Configured Memory Speed: 2400 MT/s
  - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4)

Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for
the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs).

When testing this with lmbench tool bw_mem, the results (below
signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as
exceeding L3 cache size.  In practice it looks like main memory is
limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s)

Given DDIO can deliver network packets into L3, I also tried to figure
out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec
(339 Gbit/s), in hopes that it would be enough, but it was not.


--Jesper
(data below signature)

CPU under test:

  $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
  model name	: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
  cache size	: 15360 KB


Providing some cmdline outputs from lmbench "bw_mem" tool.
(Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
256.00 14924.50

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
256.00 9895.25

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
256.00 9737.54

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
256.00 12462.88

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
256.00 14869.89


Next output shows reducing size below L3 cache size, which shows an
increase in speed, likely the L3 bandwidth.

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
64.00 14840.58

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
32.00 14823.97

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
16.00 24743.86

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
8.00 40852.26

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
4.00 42545.65

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
2.00 42447.82

$ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
1.00 42447.82
Jason Gunthorpe July 24, 2023, 4:28 p.m. UTC | #65
On Mon, Jul 24, 2023 at 04:56:27PM +0200, Jesper Dangaard Brouer wrote:

> These massive throughput numbers are important, because they *exceed*
> the physical host RAM/DIMM memory speeds.

That's right, this HW is all designed to use the high memory bandwidth
of the parallel GPUs. The CPU must not be involved in the data
movement.

If you look at the reference block diagrams for a DGX-H100 you can see
that the each GPU is directly wired to a 400Gb/s NIC, and there is
even software that allows the GPU to directly operate its attached
NIC interface.

Each of the 8 GPUs in the block diagram has 800 Gb/s full duplex RDMA,
and 7200 Gb/sec full duplex on the nvlink interconnect directly
connected to it.

The GPU is the center of all the interconnect in these systems.

Jason
Mina Almasry July 25, 2023, 4:04 a.m. UTC | #66
On Mon, Jul 24, 2023 at 7:56 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 17/07/2023 03.53, Mina Almasry wrote:
> > On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >>
> >> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
> >>
> >>> Once the skb frags with struct new_abstraction are in the TCP stack,
> >>> they will need some special handling in code accessing the frags. But
> >>> my RFC already addressed that somewhat because the frags were
> >>> inaccessible in that case. In this case the frags will be both
> >>> inaccessible and will not be struct pages at all (things like
> >>> get_page() will not work), so more special handling will be required,
> >>> maybe.
> >>
> >> It seems sort of reasonable, though there will be interesting concerns
> >> about coherence and synchronization with generial purpose DMABUFs that
> >> will need tackling.
> >>
> >> Still it is such a lot of churn and weridness in the netdev side, I
> >> think you'd do well to present an actual full application as
> >> justification.
> >>
> >> Yes, you showed you can stick unordered TCP data frags into GPU memory
> >> sort of quickly, but have you gone further with this to actually show
> >> it is useful for a real world GPU centric application?
> >>
> >> BTW your cover letter said 96% utilization, the usual server
> >> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of
> >> TCP BW with this?
> >>
> >
> > I do notice that the number of NICs is missing from our public
> > documentation so far, so I will refrain from specifying how many NICs
> > are on those A3 VMs until the information is public. But I think I can
> > confirm that your general thinking is correct, the perf that we're
> > getting is 96.6% line rate of each GPU/NIC pair,
>
> What do you mean by 96.6% "line rate".
> Is is the Ethernet line-rate?
>

Yes I believe this is the ethernet line-rate. I.e. the 200 Gbits/sec
that my NICs run.

> Is the measured throughput the measured TCP data "goodput"?

Yes, it is goodput. Roughly I believe we add up the return values of
recvmsg() and divide that number by time (very roughly, I think).

> Assuming
>   - MTU 1500 bytes (1514 on wire).
>   - Ethernet header 14 bytes
>   - IP header 20 bytes
>   - TCP header 20 bytes
>
> Due to header overhead the goodput will be approx 96.4%.
>   - (1514-(14+20+20))/1514 = 0.9643
>   - (Not taking Ethernet interframe gap into account).
>
> Thus, maybe you have hit Ethernet wire line-rate already?

My MTU is 8244 actually, which gives me 8192 mss/payload for my
connections. By my math the theoretical max would be 1 - 52/8244 =
~99.3%. So it looks like I'm dropping ~3% line rate somewhere in the
implementation.

>
> > and scales linearly
> > for each NIC/GPU pair we've tested with so far. Line rate of each
> > NIC/GPU pair is 200 Gb/sec.
> >
> > So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.
>
> Lets keep our units straight.
> Here you mean 1545 Gbit/sec, which is 193 GBytes/s
>

Yes! Sorry! I definitely meant 1545 Gbits/sec, sorry!

> > If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec
>
> Here you mean 384 Gbit/sec, which is 48 GBytes/sec.
>

Correct again!

> > ...
> > etc.
> >
>
> These massive throughput numbers are important, because they *exceed*
> the physical host RAM/DIMM memory speeds.
>
> This is the *real argument* why software cannot afford to do a single
> copy of the data from host-RAM into GPU-memory, because the CPU memory
> throughput to DRAM/DIMM are insufficient.
>
> My testlab CPU E5-1650 have 4 DIMM slots DDR4
>   - Data Width: 64 bits (= 8 bytes)
>   - Configured Memory Speed: 2400 MT/s
>   - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4)
>
> Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for
> the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs).
>
> When testing this with lmbench tool bw_mem, the results (below
> signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as
> exceeding L3 cache size.  In practice it looks like main memory is
> limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s)
>
> Given DDIO can deliver network packets into L3, I also tried to figure
> out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec
> (339 Gbit/s), in hopes that it would be enough, but it was not.
>
>

Yes, avoiding any memory speed bottleneck as you note is important,
but the second point mentioned in my cover letter is also impactful:

" Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
  of the PCIe tree, compared to traditional path which sends data through the
  root complex."

Depending on the hardware, this is a bottleneck that we avoid with
device memory TCP. NIC/GPU copies occupy the PCIe link bandwidth. In a
hierarchy like this:

          root complex
                  | (uplink)
          PCIe switch
           /             \
       NIC           GPU

I believe the uplink from the PCIe switch to the root complex is used
up 2 times for TX and 2 times for RX if the data needs to go through
host memory:

RX: NIC -> root complex -> GPU
TX: GPU -> root complex -> NIC

With device memory TCP, and enabling PCI P2P communication between the
devices under the same PCIe switch, the payload flows directly from/to
the NIC/GPU through the PCIe switch, and the payload never goes to the
root complex, alleviating pressure/bottleneck on that link between the
PCIe switch/root complex. I believe this is a core reason we're able
to scale throughput linearly with NIC/GPU pairs, because we don't
stress share uplink connections and all the payload data transfer
happens beneath the PCIe switch.

> --Jesper
> (data below signature)
>
> CPU under test:
>
>   $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
>   model name    : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
>   cache size    : 15360 KB
>
>
> Providing some cmdline outputs from lmbench "bw_mem" tool.
> (Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
> 256.00 14924.50
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
> 256.00 9895.25
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
> 256.00 9737.54
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
> 256.00 12462.88
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
> 256.00 14869.89
>
>
> Next output shows reducing size below L3 cache size, which shows an
> increase in speed, likely the L3 bandwidth.
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
> 64.00 14840.58
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
> 32.00 14823.97
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
> 16.00 24743.86
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
> 8.00 40852.26
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
> 4.00 42545.65
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
> 2.00 42447.82
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
> 1.00 42447.82
>
Jesper Dangaard Brouer July 26, 2023, 5:36 p.m. UTC | #67
On 25/07/2023 06.04, Mina Almasry wrote:
> On Mon, Jul 24, 2023 at 7:56 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 17/07/2023 03.53, Mina Almasry wrote:
>>> On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>>
>>>> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
>>>>
>>>>> Once the skb frags with struct new_abstraction are in the TCP stack,
>>>>> they will need some special handling in code accessing the frags. But
>>>>> my RFC already addressed that somewhat because the frags were
>>>>> inaccessible in that case. In this case the frags will be both
>>>>> inaccessible and will not be struct pages at all (things like
>>>>> get_page() will not work), so more special handling will be required,
>>>>> maybe.
>>>>
>>>> It seems sort of reasonable, though there will be interesting concerns
>>>> about coherence and synchronization with generial purpose DMABUFs that
>>>> will need tackling.
>>>>
>>>> Still it is such a lot of churn and weridness in the netdev side, I
>>>> think you'd do well to present an actual full application as
>>>> justification.
>>>>
>>>> Yes, you showed you can stick unordered TCP data frags into GPU memory
>>>> sort of quickly, but have you gone further with this to actually show
>>>> it is useful for a real world GPU centric application?
>>>>
>>>> BTW your cover letter said 96% utilization, the usual server
>>>> configuation is one NIC per GPU, so you were able to hit 1500Gb/sec of
>>>> TCP BW with this?
>>>>
>>>
>>> I do notice that the number of NICs is missing from our public
>>> documentation so far, so I will refrain from specifying how many NICs
>>> are on those A3 VMs until the information is public. But I think I can
>>> confirm that your general thinking is correct, the perf that we're
>>> getting is 96.6% line rate of each GPU/NIC pair,
>>
>> What do you mean by 96.6% "line rate".
>> Is is the Ethernet line-rate?
>>
> 
> Yes I believe this is the ethernet line-rate. I.e. the 200 Gbits/sec
> that my NICs run.
> 
>> Is the measured throughput the measured TCP data "goodput"?
> 
> Yes, it is goodput. Roughly I believe we add up the return values of
> recvmsg() and divide that number by time (very roughly, I think).
> 
>> Assuming
>>    - MTU 1500 bytes (1514 on wire).
>>    - Ethernet header 14 bytes
>>    - IP header 20 bytes
>>    - TCP header 20 bytes
>>
>> Due to header overhead the goodput will be approx 96.4%.
>>    - (1514-(14+20+20))/1514 = 0.9643
>>    - (Not taking Ethernet interframe gap into account).
>>
>> Thus, maybe you have hit Ethernet wire line-rate already?
> 
> My MTU is 8244 actually, which gives me 8192 mss/payload for my
> connections. By my math the theoretical max would be 1 - 52/8244 =
> ~99.3%. So it looks like I'm dropping ~3% line rate somewhere in the
> implementation.
> 

Close enough, my math I would have added the 4 byte FCS checksum
(1-56/8244), but it makes no real difference at MTU 8244.


>>
>>> and scales linearly
>>> for each NIC/GPU pair we've tested with so far. Line rate of each
>>> NIC/GPU pair is 200 Gb/sec.
>>>
>>> So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.
>>
>> Lets keep our units straight.
>> Here you mean 1545 Gbit/sec, which is 193 GBytes/s
>>
> 
> Yes! Sorry! I definitely meant 1545 Gbits/sec, sorry!
> 
>>> If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec
>>
>> Here you mean 384 Gbit/sec, which is 48 GBytes/sec.
>>
> 
> Correct again!
> 
>>> ...
>>> etc.
>>>
>>
>> These massive throughput numbers are important, because they *exceed*
>> the physical host RAM/DIMM memory speeds.
>>
>> This is the *real argument* why software cannot afford to do a single
>> copy of the data from host-RAM into GPU-memory, because the CPU memory
>> throughput to DRAM/DIMM are insufficient.
>>
>> My testlab CPU E5-1650 have 4 DIMM slots DDR4
>>    - Data Width: 64 bits (= 8 bytes)
>>    - Configured Memory Speed: 2400 MT/s
>>    - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4)
>>
>> Even the theoretical max 76.8 GBytes/s (614 Gbit/s) is not enough for
>> the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs).
>>
>> When testing this with lmbench tool bw_mem, the results (below
>> signature) are in the area 14.8 GBytes/sec (118 Gbit/s), as soon as
>> exceeding L3 cache size.  In practice it looks like main memory is
>> limited to reading 118 Gbit/s *once*. (Mina's NICs run at 200 Gbit/s)
>>

Some more insights.  I couldn't believe this (single CPU) test was so
far from the theoretical max (76.8 vs. 14.8 GBytes/s).
This smells like a per CPU core limitation. The lmbench tool bw_mem have
an option for parallelism (-P) for testing this.
My testlab CPU only have 6 cores (as I have disabled HT).

Testing on more CPU cores show an increase in scaling mem bandwidth:

  Cores 1 = 15.0 GB/s - scale: 1.00 (one core as scale point)
  Cores 2 = 26.9 GB/s - scale: 1.79
  Cores 3 = 36.3 GB/s - scale: 2.42
  Cores 4 = 44.0 GB/s - scale: 2.93
  Cores 5 = 48.9 GB/s - scale: 3.26
  Cores 6 = 49.4 GB/s - scale: 3.29

Thus, the practical test show CPU memory DIMM read bandwidth scales to
49.4 GB/s (395.2 Gbit/s), so there is still hope of 400Gbit/s devices,
when utilizing more CPU cores.

I don't have a clear explanation why there is a limit per core.

The [toplev] tool with (bw_mem -P2) says:
  Backend_Bound = 90.7% of the time.
  Backend_Bound.Memory_Bound = 71.8% of these 90.7%
  Backend_Bound.Core_Bound = remaining 18.9%

Backend_Bound.Memory_Bound is split into two main "stalls":
  Backend_Bound.Memory_Bound.L3_Bound = 7.9%
  Backend_Bound.Memory_Bound.DRAM_Bound = 58.5%

[toplev] https://github.com/andikleen/pmu-tools

>> Given DDIO can deliver network packets into L3, I also tried to figure
>> out what the L3 read bandwidth, which I measured to be 42.4 GBits/sec
>> (339 Gbit/s), in hopes that it would be enough, but it was not.
>>

The memory bandwidth to L3 cache scales up per CPU core:

  Cores 1 =  42.35 GB/s = scale: 1.00 (one core as scale point)
  Cores 2 =  86.38 GB/s = scale: 2.04
  Cores 3 = 126.96 GB/s = scale: 3.00
  Cores 4 = 168.48 GB/s = scale: 3.98
  Cores 5 = 211.77 GB/s = scale: 5.00
  Cores 6 = 244.95 GB/s = scale: 5.78

Nice to see how well this scales up per core.
Fairly impressive total max L3 bandwidth of 244.95 GB/s (1959.6 Gbit/s).

>>
> 
> Yes, avoiding any memory speed bottleneck as you note is important,
> but the second point mentioned in my cover letter is also impactful:
> 
> " Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
>    of the PCIe tree, compared to traditional path which sends data through the
>    root complex."
> 

This is a good and important point.

> Depending on the hardware, this is a bottleneck that we avoid with
> device memory TCP. NIC/GPU copies occupy the PCIe link bandwidth. In a
> hierarchy like this:
> 
>            root complex
>                    | (uplink)
>            PCIe switch
>             /             \
>         NIC           GPU
> 
> I believe the uplink from the PCIe switch to the root complex is used
> up 2 times for TX and 2 times for RX if the data needs to go through
> host memory:
> 
> RX: NIC -> root complex -> GPU
> TX: GPU -> root complex -> NIC
> 
> With device memory TCP, and enabling PCI P2P communication between the
> devices under the same PCIe switch, the payload flows directly from/to
> the NIC/GPU through the PCIe switch, and the payload never goes to the
> root complex, alleviating pressure/bottleneck on that link between the
> PCIe switch/root complex. I believe this is a core reason we're able
> to scale throughput linearly with NIC/GPU pairs, because we don't
> stress share uplink connections and all the payload data transfer
> happens beneath the PCIe switch.
> 

Good points, and I guess this is what Jason was hinting to.
And illustrated in this picture[1] (I googled):

  [1] 
https://www.servethehome.com/wp-content/uploads/2022/08/HC34-NVIDIA-DGX-H100-Data-network-configuration.jpg

The GPUs "internally" have switched nvlink connections. As Jason said,
these nvlinks have an impressive bandwidth[2] of 900GB/s (7200 Gbit/s).

  [2] https://www.nvidia.com/en-us/data-center/nvlink/

>> --Jesper
>> (data below signature)
>>

Added raw commands and data below.

>> CPU under test:
>>
>>    $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
>>    model name    : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
>>    cache size    : 15360 KB
>>
>>
>> Providing some cmdline outputs from lmbench "bw_mem" tool.
>> (Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)
>>

Running bw_mem with parallelism utilizing more cores:

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P1 256m rd
268.44 15015.69

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P2 256m rd
268.44 26896.42

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P3 256m rd
268.44 36347.36

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P4 256m rd
268.44 44073.72

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P5 256m rd
268.44 48872.02

$ /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 -P6 256m rd
268.44 49426.76



>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
>> 256.00 14924.50
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
>> 256.00 9895.25
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
>> 256.00 9737.54
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
>> 256.00 12462.88
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
>> 256.00 14869.89
>>
>>
>> Next output shows reducing size below L3 cache size, which shows an
>> increase in speed, likely the L3 bandwidth.
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
>> 64.00 14840.58
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
>> 32.00 14823.97
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
>> 16.00 24743.86
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
>> 8.00 40852.26
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
>> 4.00 42545.65
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
>> 2.00 42447.82
>>
>> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
>> 1.00 42447.82
>>
> 

Tests for testing L3 per core scaling.

$ taskset -c 0   /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 
-P1  512K rd
0.512000 42357.43

$ taskset -c 0-1 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 
-P2  512K rd
0.512000 86380.09

$ taskset -c 0-2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 
-P3  512K rd
0.512000 126960.94

$ taskset -c 0-3 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 
-P4  512K rd
0.512000 168485.49

$ taskset -c 0-4 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 
-P5  512K rd
0.512000 211770.67

$ taskset -c 0-5 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem -W2 -N4 
-P6  512K rd
0.512000 244959.35
diff mbox series

Patch

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index b676496ec6d7..4e613d5bf1fd 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -4925,8 +4925,7 @@  static void hns3_put_ring_config(struct hns3_nic_priv *priv)
 static void hns3_alloc_page_pool(struct hns3_enet_ring *ring)
 {
 	struct page_pool_params pp_params = {
-		.flags = PP_FLAG_DMA_MAP | PP_FLAG_PAGE_FRAG |
-				PP_FLAG_DMA_SYNC_DEV,
+		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
 		.order = hns3_page_order(ring),
 		.pool_size = ring->desc_num * hns3_buf_size(ring) /
 				(PAGE_SIZE << hns3_page_order(ring)),
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index a79cb680bb23..404caec467af 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -1426,7 +1426,7 @@  int otx2_pool_init(struct otx2_nic *pfvf, u16 pool_id,
 		return 0;
 	}
 
-	pp_params.flags = PP_FLAG_PAGE_FRAG | PP_FLAG_DMA_MAP;
+	pp_params.flags = PP_FLAG_DMA_MAP;
 	pp_params.pool_size = numptrs;
 	pp_params.nid = NUMA_NO_NODE;
 	pp_params.dev = pfvf->dev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 593cdfbfc035..79f2f5e51ae0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -843,7 +843,7 @@  static int mlx5e_alloc_rq(struct mlx5e_params *params,
 		}
 
 		pp_params.order     = 0;
-		pp_params.flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | PP_FLAG_PAGE_FRAG;
+		pp_params.flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 		pp_params.pool_size = pool_size;
 		pp_params.nid       = node;
 		pp_params.dev       = rq->pdev;
diff --git a/drivers/net/wireless/mediatek/mt76/mac80211.c b/drivers/net/wireless/mediatek/mt76/mac80211.c
index 467afef98ba2..ee72869e5572 100644
--- a/drivers/net/wireless/mediatek/mt76/mac80211.c
+++ b/drivers/net/wireless/mediatek/mt76/mac80211.c
@@ -566,7 +566,7 @@  int mt76_create_page_pool(struct mt76_dev *dev, struct mt76_queue *q)
 {
 	struct page_pool_params pp_params = {
 		.order = 0,
-		.flags = PP_FLAG_PAGE_FRAG,
+		.flags = 0,
 		.nid = NUMA_NO_NODE,
 		.dev = dev->dma_dev,
 	};
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index c135cd157cea..f4fc339ff020 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -46,10 +46,8 @@ 
 					* Please note DMA-sync-for-CPU is still
 					* device driver responsibility
 					*/
-#define PP_FLAG_PAGE_FRAG	BIT(2) /* for page frag feature */
 #define PP_FLAG_ALL		(PP_FLAG_DMA_MAP |\
-				 PP_FLAG_DMA_SYNC_DEV |\
-				 PP_FLAG_PAGE_FRAG)
+				 PP_FLAG_DMA_SYNC_DEV)
 
 #define PAGE_POOL_DMA_USE_PP_FRAG_COUNT	\
 		(sizeof(dma_addr_t) > sizeof(unsigned long))
@@ -235,8 +233,7 @@  static inline struct page *page_pool_alloc_frag(struct page_pool *pool,
 
 	size = ALIGN(size, dma_get_cache_alignment());
 
-	if (WARN_ON(!(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
-		    size > max_size))
+	if (WARN_ON(size > max_size))
 		return NULL;
 
 	/* Don't allow page splitting and allocate one big frag
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7c4338221b17..ca2316cc1e7e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5652,7 +5652,7 @@  bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	/* In general, avoid mixing page_pool and non-page_pool allocated
 	 * pages within the same SKB. Additionally avoid dealing with clones
 	 * with page_pool pages, in case the SKB is using page_pool fragment
-	 * references (PP_FLAG_PAGE_FRAG). Since we only take full page
+	 * references (page_pool_alloc_frag()). Since we only take full page
 	 * references for cloned SKBs at the moment that would result in
 	 * inconsistent reference counts.
 	 * In theory we could take full references if @from is cloned and