[net-next,v4,0/8] bpf: cpumap: enable GRO for XDP_PASS frames

Message ID: 20250205163609.3208829-1-aleksander.lobakin@intel.com

Message

Alexander Lobakin Feb. 5, 2025, 4:36 p.m. UTC
Several months ago, I had been looking through my old XDP hints tree[0]
to check whether some patches not directly related to hints can be sent
standalone. Roughly at the same time, Daniel appeared and asked[1] about
GRO for cpumap from that tree.

Currently, cpumap uses its own kthread, which processes cpumap-redirected
frames in batches of 8, without any weighting (but with rescheduling
points). The resulting skbs get passed to the stack via
netif_receive_skb_list(), which means no GRO happens.
Even though we can't currently pass checksum status from the drivers,
in many cases GRO still performs better than listified Rx without
aggregation, as confirmed by tests.

In order to enable GRO in cpumap, we need to do the following:

* patches 1-2: decouple the GRO struct from the NAPI struct and allow
  using it outside of a NAPI entity within the kernel core code;
* patch 3: switch cpumap from netif_receive_skb_list() to
  gro_receive_skb(), as sketched below.
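
To make the resulting flow concrete, here is a minimal sketch of the
delivery path the kthread could use with a standalone GRO instance. The
helper names (gro_init(), gro_receive_skb(), gro_flush(),
gro_normal_list(), gro_cleanup()) are taken from the patch titles and
this cover letter; their exact signatures may differ in the patches, and
cpumap_gro_deliver() is a made-up name used only for illustration:

/* Sketch, not the actual patch: feed a batch of skbs built from
 * cpumap-redirected frames into a per-kthread gro_node instead of
 * chaining them onto a list for netif_receive_skb_list().
 */
static void cpumap_gro_deliver(struct gro_node *gro, struct sk_buff **skbs,
			       u32 n)
{
	u32 i;

	for (i = 0; i < n; i++)
		gro_receive_skb(gro, skbs[i]);

	/* push aggregated packets up the stack once per batch */
	gro_flush(gro, false);
	gro_normal_list(gro);
}

The kthread would call gro_init(&gro) once when it starts and
gro_cleanup(&gro) before exiting.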

Additional improvements:

* patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
  lists;
* patches 5-6: introduce and use a function to get skbs from the NAPI
  percpu caches in bulk rather than one at a time (see the sketch after
  this list);
* patches 7-8: use that function in veth as well and remove the
  now-superseded xdp_alloc_skb_bulk().
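
A short usage illustration for the bulk helper from patches 5-6 follows;
the name napi_skb_cache_get_bulk() comes from the patch titles, while
the assumed return convention (the number of skbs actually obtained) and
the surrounding code are only a sketch:

	void *skbs[CPUMAP_BATCH];	/* cpumap batches are 8 frames */
	u32 got, i;

	/* refill up to @n skb heads from the per-CPU NAPI cache in one
	 * call instead of allocating them one at a time (or via the
	 * to-be-removed xdp_alloc_skb_bulk())
	 */
	got = napi_skb_cache_get_bulk(skbs, n);

	for (i = 0; i < got; i++) {
		/* build each skb around its xdp_frame as before, e.g.
		 * with __xdp_build_skb_from_frame(), then hand it to GRO
		 */
	}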

My trafficgen UDP GRO tests, small frame sizes:

                GRO off    GRO on
baseline        2.7        N/A       Mpps
patch 3         2.3        4         Mpps
patch 8         2.4        4.7       Mpps

1...3 diff      -17        +48       %
1...8 diff      -11        +74       %

Daniel reported throughput gains from +14%[2] to +18%[3] in neper's TCP
RR tests. On my system, however, the same test gave me up to +100%.

Note that there's a series from Lorenzo[4] which achieves the same, but
in a different way. During the discussions, the approach using a
standalone GRO instance was preferred over the threaded NAPI.

[0] https://github.com/alobakin/linux/tree/xdp_hints
[1] https://lore.kernel.org/bpf/cadda351-6e93-4568-ba26-21a760bf9a57@app.fastmail.com
[2] https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh
[3] https://lore.kernel.org/bpf/yzda66wro5twmzpmjoxvy4si5zvkehlmgtpi6brheek3sj73tj@o7kd6nurr3o6
[4] https://lore.kernel.org/bpf/20241130-cpumap-gro-v1-0-c1180b1b5758@kernel.org

Alexander Lobakin (8):
  net: gro: decouple GRO from the NAPI layer
  net: gro: expose GRO init/cleanup to use outside of NAPI
  bpf: cpumap: switch to GRO from netif_receive_skb_list()
  bpf: cpumap: reuse skb array instead of a linked list to chain skbs
  net: skbuff: introduce napi_skb_cache_get_bulk()
  bpf: cpumap: switch to napi_skb_cache_get_bulk()
  veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
  xdp: remove xdp_alloc_skb_bulk()

 include/linux/netdevice.h                  |  26 ++--
 include/linux/skbuff.h                     |   1 +
 include/net/busy_poll.h                    |  11 +-
 include/net/gro.h                          |  38 ++++--
 include/net/xdp.h                          |   1 -
 drivers/net/ethernet/brocade/bna/bnad.c    |   1 +
 drivers/net/ethernet/cortina/gemini.c      |   1 +
 drivers/net/veth.c                         |   3 +-
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c |   1 +
 kernel/bpf/cpumap.c                        | 146 +++++++++++++--------
 net/core/dev.c                             |  77 +++--------
 net/core/gro.c                             | 101 +++++++++-----
 net/core/skbuff.c                          |  62 +++++++++
 net/core/xdp.c                             |  10 --
 14 files changed, 299 insertions(+), 180 deletions(-)

---
From v3[5]:
* series:
  * rebase on top of the latest net-next;
  * pick RBs from Toke;
* 1:
  * reduce possible false sharing of napi_struct (now gro_node is
    placed precisely at a 64-byte offset, was 56);
  * pick Acked-by from Jakub;
* 2: RB from Jakub;
* 3: move the assignment out of the condition check :p (Jakub).

From v2[6]:
* 1: remove napi_id duplication in both &gro_node and &napi_struct by
     using a tagged struct group. The most efficient approach I've
     found so far: no additional branches, no inline expansion, no tail
     calls / double calls, saves 8 bytes of &napi_struct in comparison
     with v2 (Jakub, Paolo, me), see the sketch below;
* 4: improve and streamline the handling of skb allocation failures
     (-1 branch per frame), skip more code for skb-only batches.
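
For readers who haven't met tagged struct groups, a rough sketch of the
layout item 1 above refers to; the member list is illustrative and the
exact fields in the patch may differ:

	struct napi_struct {
		...

		/* struct_group_tagged() declares the members once, but they
		 * are reachable both directly (napi->rx_count, napi->napi_id)
		 * and as a standalone struct gro_node (napi->gro.rx_count,
		 * napi->gro.napi_id), so cpumap can carry a bare gro_node
		 * without duplicating napi_id.
		 */
		struct_group_tagged(gro_node, gro,
			struct gro_list		hash[GRO_HASH_BUCKETS];
			struct list_head	rx_list;
			u32			rx_count;
			u32			napi_id;
		);

		...
	};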

From v1[7]:
* use a standalone GRO instance instead of the threaded NAPI (Jakub);
* rebase and send to net-next as it's now more networking than BPF.

[5] https://lore.kernel.org/netdev/20250115151901.2063909-1-aleksander.lobakin@intel.com
[6] https://lore.kernel.org/netdev/20250107152940.26530-1-aleksander.lobakin@intel.com
[7] https://lore.kernel.org/bpf/20240830162508.1009458-1-aleksander.lobakin@intel.com

Comments

Alexander Lobakin Feb. 10, 2025, 2:05 p.m. UTC | #1
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 5 Feb 2025 17:36:01 +0100

> Several months ago, I had been looking through my old XDP hints tree[0]
> to check whether some patches not directly related to hints can be sent
> standalone. Roughly at the same time, Daniel appeared and asked[1] about
> GRO for cpumap from that tree.

I see "Changes requested" on Patchwork. Which ones?

1/8 regarding gro_node? Nobody has proposed a solution that would be as
efficient while avoiding struct_group(), and I don't see one either.
I explain everything in the commit messages and the cover letter. Jakub
gave me his Acked-by on struct_group() in the v3 thread.

Thanks,
Olek
Eric Dumazet Feb. 10, 2025, 3:31 p.m. UTC | #2
On Mon, Feb 10, 2025 at 3:10 PM Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Wed, 5 Feb 2025 17:36:01 +0100
>
> > Several months ago, I had been looking through my old XDP hints tree[0]
> > to check whether some patches not directly related to hints can be sent
> > standalone. Roughly at the same time, Daniel appeared and asked[1] about
> > GRO for cpumap from that tree.
>
> I see "Changes requested" on Patchwork. Which ones?
>
> 1/8 regarding gro_node? Nobody has proposed a solution that would be as
> efficient while avoiding struct_group(), and I don't see one either.
> I explain everything in the commit messages and the cover letter. Jakub
> gave me his Acked-by on struct_group() in the v3 thread.

One of the points of your nice series is to dissociate GRO from NAPI,
so defining gro_node inside napi_struct is not appealing.

I suggested not putting napi_id in the new structure.

If you need to cache a copy in it for "performance/whatever reason",
you can cache napi_id, because napi->napi_id is only set once
in __napi_hash_add_with_id()

gro->napi_id_cache = napi->napi_id;
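
A sketch of the layout this suggestion implies; apart from the
napi_id_cache name and the assignment taken from the mail above, the
members and their placement are illustrative:

	struct gro_node {
		struct gro_list		hash[GRO_HASH_BUCKETS];
		struct list_head	rx_list;
		u32			rx_count;
		u32			napi_id_cache;	/* copy, set once */
	};

	struct napi_struct {
		...
		struct gro_node		gro;	/* plain member, no grouping */
		...
		unsigned int		napi_id;	/* the "real" one, set once */
		...
	};

	/* done once, in __napi_hash_add_with_id(): */
	napi->napi_id = napi_id;
	napi->gro.napi_id_cache = napi->napi_id;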
Alexander Lobakin Feb. 10, 2025, 3:31 p.m. UTC | #3
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 10 Feb 2025 16:31:22 +0100

> On Mon, Feb 10, 2025 at 3:10 PM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Wed, 5 Feb 2025 17:36:01 +0100
>>
>>> Several months ago, I had been looking through my old XDP hints tree[0]
>>> to check whether some patches not directly related to hints can be sent
>>> standalone. Roughly at the same time, Daniel appeared and asked[1] about
>>> GRO for cpumap from that tree.
>>
>> I see "Changes requested" on Patchwork. Which ones?
>>
>> 1/8 regarding gro_node? Nobody has proposed a solution that would be as
>> efficient while avoiding struct_group(), and I don't see one either.
>> I explain everything in the commit messages and the cover letter. Jakub
>> gave me his Acked-by on struct_group() in the v3 thread.
> 
> One of the points of your nice series is to dissociate GRO from NAPI,
> so defining gro_node inside napi_struct is not appealing.
> 
> I suggested not putting napi_id in the new structure.
> 
> If you need to cache a copy in it for "performance/whatever reason",
> you can cache napi_id, because napi->napi_id is only set once
> in __napi_hash_add_with_id()
> 
> gro->napi_id_cache = napi->napi_id;

This was rejected by Kuba in v2.
He didn't like to have napi_id two times within napi_struct (one inside
gro_node, one outside).

Thanks,
Olek
Jakub Kicinski Feb. 11, 2025, 12:35 a.m. UTC | #4
On Mon, 10 Feb 2025 16:31:52 +0100 Alexander Lobakin wrote:
> This was rejected by Kuba in v2.
> He didn't like to have napi_id two times within napi_struct (one inside
> gro_node, one outside).

Do you mean:

  the napi_id in gro sticks out..

https://lore.kernel.org/netdev/20250113130104.5c2c02e0@kernel.org/ ?

That's more of a nudge to try harder than a "no". We explored 
the alternatives, there's no perfect way to layer this. I think 
Eric's suggestion is probably as clean as we can get.
Alexander Lobakin Feb. 12, 2025, 2:22 p.m. UTC | #5
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 10 Feb 2025 16:35:29 -0800

> On Mon, 10 Feb 2025 16:31:52 +0100 Alexander Lobakin wrote:
>> This was rejected by Kuba in v2.
>> He didn't like to have napi_id two times within napi_struct (one inside
>> gro_node, one outside).
> 
> Do you mean:
> 
>   the napi_id in gro sticks out..
> 
> https://lore.kernel.org/netdev/20250113130104.5c2c02e0@kernel.org/ ?
> 
> That's more of a nudge to try harder than a "no". We explored 
> the alternatives, there's no perfect way to layer this. I think 
> Eric's suggestion is probably as clean as we can get.

You mean to cache napi_id in gro_node?

Then we get +8 bytes to sizeof(napi_struct) for little reason... Dunno,
if you really prefer, I can do it that way.

OTOH you gave Acked-by on struct_group(), then Eric appeared and you
took your ack back :D

Thanks,
Olek
Alexander Lobakin Feb. 12, 2025, 3:55 p.m. UTC | #6
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 12 Feb 2025 15:22:55 +0100

> From: Jakub Kicinski <kuba@kernel.org>
> Date: Mon, 10 Feb 2025 16:35:29 -0800
> 
>> On Mon, 10 Feb 2025 16:31:52 +0100 Alexander Lobakin wrote:
>>> This was rejected by Kuba in v2.
>>> He didn't like to have napi_id two times within napi_struct (one inside
>>> gro_node, one outside).
>>
>> Do you mean:
>>
>>   the napi_id in gro sticks out..
>>
>> https://lore.kernel.org/netdev/20250113130104.5c2c02e0@kernel.org/ ?
>>
>> That's more of a nudge to try harder than a "no". We explored 
>> the alternatives, there's no perfect way to layer this. I think 
>> Eric's suggestion is probably as clean as we can get.
> 
> You mean to cache napi_id in gro_node?
> 
> Then we get +8 bytes to sizeof(napi_struct) for little reason... Dunno,
> if you really prefer, I can do it that way.

Alternative to avoid +8 bytes:

struct napi_struct {
	...

	union {
		struct gro_node	gro;
		struct {
			u8 pad[offsetof(struct gro_node, napi_id)];
			u32 napi_id;
		};
	};

This is effectively the same as what struct_group() does, just uglier,
but it allows declaring gro_node separately.

> 
> OTOH you gave Acked-by on struct_group(), then Eric appeared and you
> took your ack back :D
Thanks,
Olek
Jakub Kicinski Feb. 12, 2025, 6:29 p.m. UTC | #7
On Wed, 12 Feb 2025 16:55:52 +0100 Alexander Lobakin wrote:
> > You mean to cache napi_id in gro_node?
> > 
> > Then we get +8 bytes to sizeof(napi_struct) for little reason...

Right but I think the expectation would be that we don't ever touch
that on the fast path, right? The "real" napi_id would basically
go down below:

	/* control-path-only fields follow */

8B of cold data doesn't matter at all. But I haven't checked if
we need the napi->napi_id access anywhere hot, do we?

> > Dunno, if you really prefer, I can do it that way.  
> 
> Alternative to avoid +8 bytes:
> 
> struct napi_struct {
> 	...
> 
> 	union {
> 		struct gro_node	gro;
> 		struct {
> 			u8 pad[offsetof(struct gro_node, napi_id)];
> 			u32 napi_id;
> 		};
> 	};
> 
>> This is effectively the same as what struct_group() does, just uglier,
>> but it allows declaring gro_node separately.
Alexander Lobakin Feb. 14, 2025, 3:43 p.m. UTC | #8
From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 12 Feb 2025 10:29:36 -0800

> On Wed, 12 Feb 2025 16:55:52 +0100 Alexander Lobakin wrote:
>>> You mean to cache napi_id in gro_node?
>>>
>>> Then we get +8 bytes to sizeof(napi_struct) for little reason...
> 
> Right but I think the expectation would be that we don't ever touch
> that on the fast path, right? The "real" napi_id would basically
> go down below:
> 
> 	/* control-path-only fields follow */
> 
> 8B of cold data doesn't matter at all. But I haven't checked if
> we need the napi->napi_id access anywhere hot, do we?

Hmm, if the "hot" napi_id will be cached in gro_node, then maybe
napi_struct::napi_id could really live in the cold part, let me recheck.

> 
>>> Dunno, if you really prefer, I can do it that way.  
>>
>> Alternative to avoid +8 bytes:
>>
>> struct napi_struct {
>> 	...
>>
>> 	union {
>> 		struct gro_node	gro;
>> 		struct {
>> 			u8 pad[offsetof(struct gro_node, napi_id)];
>> 			u32 napi_id;
>> 		};
>> 	};
>>
>> This is effectively the same as what struct_group() does, just uglier,
>> but it allows declaring gro_node separately.

Thanks,
Olek