[RFC/RFT,v2,0/3] Introduce GRO support to cpumap codebase

Message ID cover.1726480607.git.lorenzo@kernel.org (mailing list archive)

Message

Lorenzo Bianconi Sept. 16, 2024, 10:13 a.m. UTC
Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
NAPI-kthread pinned on the selected cpu.

Changes in rfc v2:
- get rid of dummy netdev dependency

Lorenzo Bianconi (3):
  net: Add napi_init_for_gro routine
  net: add napi_threaded_poll to netdevice.h
  bpf: cpumap: Add gro support

 include/linux/netdevice.h |   3 +
 kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
 net/core/dev.c            |  27 ++++++---
 3 files changed, 73 insertions(+), 80 deletions(-)

Comments

Alexander Lobakin Sept. 16, 2024, 3:10 p.m. UTC | #1
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Mon, 16 Sep 2024 12:13:42 +0200

> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
> 
> Changes in rfc v2:
> - get rid of dummy netdev dependency
> 
> Lorenzo Bianconi (3):
>   net: Add napi_init_for_gro routine
>   net: add napi_threaded_poll to netdevice.h
>   bpf: cpumap: Add gro support

Oh okay, so it still uses NAPI.
When I'm back from the conferences (next week), I might rebase and send
the solution where I only use the GRO part of it, i.e. no
napi_schedule()/poll()/napi_complete() logic.

> 
>  include/linux/netdevice.h |   3 +
>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
>  net/core/dev.c            |  27 ++++++---
>  3 files changed, 73 insertions(+), 80 deletions(-)

Thanks,
Olek
Daniel Xu Oct. 8, 2024, 10:39 p.m. UTC | #2
Hi Lorenzo,

On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
> 
> Changes in rfc v2:
> - get rid of dummy netdev dependency
> 
> Lorenzo Bianconi (3):
>   net: Add napi_init_for_gro routine
>   net: add napi_threaded_poll to netdevice.h
>   bpf: cpumap: Add gro support
> 
>  include/linux/netdevice.h |   3 +
>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
>  net/core/dev.c            |  27 ++++++---
>  3 files changed, 73 insertions(+), 80 deletions(-)
> 
> -- 
> 2.46.0
> 

Sorry about the long delay - finally caught up to everything after
conferences.

I re-ran my synthetic tests (including baseline). v2 is somehow showing
2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
variable I changed is kernel version - steering prog is active for both.


Baseline (again)							

./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30			        ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
							
	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
Run 1	2560252	        0.00009087	0.00010495	0.00011647		Run 1	15479.31
Run 2	2665517	        0.00008575	0.00010239	0.00013311		Run 2	15162.48
Run 3	2755939	        0.00008191	0.00010367	0.00012287		Run 3	14709.04
Run 4	2595680	        0.00008575	0.00011263	0.00012671		Run 4	15373.06
Run 5	2841865	        0.00007999	0.00009471	0.00012799		Run 5	15234.91
Average	2683850.6	0.000084854	0.00010367	0.00012543		Average	15191.76
							
cpumap NAPI patches v2							
							
	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
Run 1	2577838	        0.00008575	0.00012031	0.00013695		Run 1	19914.56
Run 2	2729237	        0.00007551	0.00013311	0.00017663		Run 2	20140.92
Run 3	2689442	        0.00008319	0.00010495	0.00013311		Run 3	19887.48
Run 4	2862366	        0.00008127	0.00009471	0.00010623		Run 4	19374.49
Run 5	2700538	        0.00008319	0.00010367	0.00012799		Run 5	19784.49
Average	2711884.2	0.000081782	0.00011135	0.000136182		Average	19820.388
Delta	1.04%	        -3.62%	        7.41%	        8.57%			        30.47%
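
For reference, the Delta row can be reproduced from the per-run numbers.
The following is a minimal sketch, not the script used for the
measurements: it assumes the delta is the relative change of the five-run
averages, and the names delta_percent, baseline and patched are made up
for illustration. Only the tcp_stream column is shown.

```python
from statistics import mean

# tcp_stream throughput in Mbit/s, copied from the two tables above.
baseline = [15479.31, 15162.48, 14709.04, 15373.06, 15234.91]
patched = [19914.56, 20140.92, 19887.48, 19374.49, 19784.49]


def delta_percent(before, after):
    """Relative change of mean(after) versus mean(before), in percent."""
    return (mean(after) - mean(before)) / mean(before) * 100.0


print(f"baseline avg: {mean(baseline):.2f} Mbit/s")
print(f"patched avg:  {mean(patched):.3f} Mbit/s")
print(f"delta:        {delta_percent(baseline, patched):.2f}%")
```

The same helper applies to the tcp_rr columns; for the latency columns a
negative delta means lower (better) latency.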

Thanks,
Daniel
Lorenzo Bianconi Oct. 9, 2024, 10:46 a.m. UTC | #3
Hi Daniel,

cool, thx for testing it.

@Olek: how do we want to proceed on it? Are you still working on it or do you want me
to send a regular patch for it?

Regards,
Lorenzo
Alexander Lobakin Oct. 9, 2024, 12:27 p.m. UTC | #4
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 12:46:00 +0200

> 
> Hi Daniel,
> 
> cool, thx for testing it.
> 
> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> to send a regular patch for it?

Hi,

I had a small vacation, sorry. I'm starting working on it again today.

> 
> Regards,
> Lorenzo

Thanks,
Olek
Lorenzo Bianconi Oct. 9, 2024, 12:47 p.m. UTC | #5
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Wed, 9 Oct 2024 12:46:00 +0200
> 
> > Hi Daniel,
> > 
> > cool, thx for testing it.
> > 
> > @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> > to send a regular patch for it?
> 
> Hi,
> 
> I had a small vacation, sorry. I'm starting working on it again today.

ack, no worries. Are you going to rebase the other patches on top of it
or are you going to try a different approach?

Regards,
Lorenzo

> 
> > 
> > Regards,
> > Lorenzo
> 
> Thanks,
> Olek
Alexander Lobakin Oct. 9, 2024, 12:50 p.m. UTC | #6
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 14:47:58 +0200

>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>
>>> Hi Daniel,
>>>
>>> cool, thx for testing it.
>>>
>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>> to send a regular patch for it?
>>
>> Hi,
>>
>> I had a small vacation, sorry. I'm starting working on it again today.
> 
> ack, no worries. Are you going to rebase the other patches on top of it
> or are you going to try a different approach?

I'll try the approach without NAPI as Kuba asks and let Daniel test it,
then we'll see.

BTW I'm curious how he got this boost on v2, from what I see you didn't
change the implementation that much?

Thanks,
Olek
Alexander Lobakin Oct. 22, 2024, 3:51 p.m. UTC | #7
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 9 Oct 2024 14:50:42 +0200

> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Wed, 9 Oct 2024 14:47:58 +0200
> 
>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>
>>>> Hi Daniel,
>>>>
>>>> cool, thx for testing it.
>>>>
>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>> to send a regular patch for it?
>>>
>>> Hi,
>>>
>>> I had a small vacation, sorry. I'm starting working on it again today.
>>
>> ack, no worries. Are you going to rebase the other patches on top of it
>> or are you going to try a different approach?
> 
> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> then we'll see.

For now, I have the same results without NAPI as with your series, so
I'll push it soon and let Daniel test.

(I simply decoupled GRO and NAPI and used the former in cpumap, but the
 kthread logic didn't change)

> 
> BTW I'm curious how he got this boost on v2, from what I see you didn't
> change the implementation that much?

Thanks,
Olek
Alexander Lobakin Nov. 12, 2024, 5:43 p.m. UTC | #8
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Tue, 22 Oct 2024 17:51:43 +0200

> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Wed, 9 Oct 2024 14:50:42 +0200
> 
>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>> Date: Wed, 9 Oct 2024 14:47:58 +0200
>>
>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> cool, thx for testing it.
>>>>>
>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>>> to send a regular patch for it?
>>>>
>>>> Hi,
>>>>
>>>> I had a small vacation, sorry. I'm starting working on it again today.
>>>
>>> ack, no worries. Are you going to rebase the other patches on top of it
>>> or are you going to try a different approach?
>>
>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
>> then we'll see.
> 
> For now, I have the same results without NAPI as with your series, so
> I'll push it soon and let Daniel test.
> 
> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
>  kthread logic didn't change)
> 
>>
>> BTW I'm curious how he got this boost on v2, from what I see you didn't
>> change the implementation that much?

Hi Daniel,

Sorry for the delay. Please test [0].

[0] https://github.com/alobakin/linux/commits/cpumap-old

Thanks,
Olek
Daniel Xu Nov. 13, 2024, 11:39 p.m. UTC | #9
On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Tue, 22 Oct 2024 17:51:43 +0200
>
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Wed, 9 Oct 2024 14:50:42 +0200
>> 
>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>> Date: Wed, 9 Oct 2024 14:47:58 +0200
>>>
>>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> cool, thx for testing it.
>>>>>>
>>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>>>> to send a regular patch for it?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I had a small vacation, sorry. I'm starting working on it again today.
>>>>
>>>> ack, no worries. Are you going to rebase the other patches on top of it
>>>> or are you going to try a different approach?
>>>
>>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
>>> then we'll see.
>> 
>> For now, I have the same results without NAPI as with your series, so
>> I'll push it soon and let Daniel test.
>> 
>> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
>>  kthread logic didn't change)
>> 
>>>
>>> BTW I'm curious how he got this boost on v2, from what I see you didn't
>>> change the implementation that much?
>
> Hi Daniel,
>
> Sorry for the delay. Please test [0].
>
> [0] https://github.com/alobakin/linux/commits/cpumap-old
>
> Thanks,
> Olek

Ack. Will do probably early next week.
Daniel Xu Nov. 23, 2024, 12:10 a.m. UTC | #10
Hi Olek,

Here are the results.

On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>
>
> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > From: Alexander Lobakin <aleksander.lobakin@intel.com>
> > Date: Tue, 22 Oct 2024 17:51:43 +0200
> >
> >> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> >> Date: Wed, 9 Oct 2024 14:50:42 +0200
> >>
> >>> From: Lorenzo Bianconi <lorenzo@kernel.org>
> >>> Date: Wed, 9 Oct 2024 14:47:58 +0200
> >>>
> >>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
> >>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
> >>>>>
> >>>>>> Hi Daniel,
> >>>>>>
> >>>>>> cool, thx for testing it.
> >>>>>>
> >>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> >>>>>> to send a regular patch for it?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I had a small vacation, sorry. I'm starting working on it again today.
> >>>>
> >>>> ack, no worries. Are you going to rebase the other patches on top of it
> >>>> or are you going to try a different approach?
> >>>
> >>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> >>> then we'll see.
> >>
> >> For now, I have the same results without NAPI as with your series, so
> >> I'll push it soon and let Daniel test.
> >>
> >> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> >>  kthread logic didn't change)
> >>
> >>>
> >>> BTW I'm curious how he got this boost on v2, from what I see you didn't
> >>> change the implementation that much?
> >
> > Hi Daniel,
> >
> > Sorry for the delay. Please test [0].
> >
> > [0] https://github.com/alobakin/linux/commits/cpumap-old
> >
> > Thanks,
> > Olek
>
> Ack. Will do probably early next week.
>

Baseline (again)

	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126

cpumap v2 Olek

	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
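
The throughput averages and the delta in the two tables above can be rechecked with a few lines of Python. This is only a sketch: the numbers are copied from the tables, and the helper name `delta_percent` is made up for illustration.

```python
from statistics import mean

# Throughput (Mbit/s) of the five runs, copied from the two tables above.
baseline = [21749.43, 21897.17, 21906.82, 21155.15, 21397.06]
patched = [13497.57, 12115.53, 12323.38, 12901.88, 12593.22]


def delta_percent(before, after):
    """Relative change of mean(after) versus mean(before), in percent."""
    return (mean(after) - mean(before)) / mean(before) * 100.0


print(f"baseline avg: {mean(baseline):.3f} Mbit/s")
print(f"patched avg:  {mean(patched):.3f} Mbit/s")
print(f"delta:        {delta_percent(baseline, patched):+.2f}%")
```

The same helper applies to the transaction and latency columns if their per-run values are substituted.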


It's very interesting that we see -40% tput w/ the patches. I went back
and double checked and it seems the numbers are right. Here's
some output from some profiles I took with:

    perf record -e cycles:k -a -- sleep 10
    perf --no-pager diff perf.data.baseline perf.data.withpatches > ...

    # Event 'cycles:k'
    # Baseline  Delta Abs  Shared Object                                                    Symbol
     6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
     3.57%     -2.56%  bpf_prog_954ab9c8c8b5e42f_latency                                [k] bpf_prog_954ab9c8c8b5e42f_latency
               +2.22%  bpf_prog_5c74b34eb24d5c9b_steering                               [k] bpf_prog_5c74b34eb24d5c9b_steering
     2.61%     -1.88%  [kernel.kallsyms]                                                [k] __skb_datagram_iter
     0.55%     +1.53%  [kernel.kallsyms]                                                [k] acpi_processor_ffh_cstate_enter
     4.52%     -1.46%  [kernel.kallsyms]                                                [k] read_tsc
     0.34%     +1.42%  [kernel.kallsyms]                                                [k] __slab_free
     0.97%     +1.18%  [kernel.kallsyms]                                                [k] do_idle
     1.35%     +1.17%  [kernel.kallsyms]                                                [k] cpuidle_enter_state
     1.89%     -1.15%  [kernel.kallsyms]                                                [k] tcp_ack
     2.08%     +1.14%  [kernel.kallsyms]                                                [k] _raw_spin_lock
               +1.13%  <redacted>
     0.22%     +1.02%  [kernel.kallsyms]                                                [k] __sock_wfree
     2.23%     -1.02%  [kernel.kallsyms]                                                [k] bpf_dynptr_slice
     0.00%     +0.98%  [kernel.kallsyms]                                                [k] tcp6_gro_receive
     2.91%     -0.98%  [kernel.kallsyms]                                                [k] csum_partial
     0.62%     +0.94%  [kernel.kallsyms]                                                [k] skb_release_data
               +0.81%  [kernel.kallsyms]                                                [k] memset
     0.16%     +0.74%  [kernel.kallsyms]                                                [k] bnxt_tx_int
     0.00%     +0.74%  [kernel.kallsyms]                                                [k] dev_gro_receive
     0.36%     +0.74%  [kernel.kallsyms]                                                [k] __tcp_transmit_skb
               +0.72%  [kernel.kallsyms]                                                [k] tcp_gro_receive
     1.10%     -0.66%  [kernel.kallsyms]                                                [k] ep_poll_callback
     1.52%     -0.65%  [kernel.kallsyms]                                                [k] page_pool_put_unrefed_netmem
     0.75%     -0.57%  [kernel.kallsyms]                                                [k] bnxt_rx_pkt
     1.10%     +0.56%  [kernel.kallsyms]                                                [k] native_sched_clock
     0.16%     +0.53%  <redacted>
     0.83%     -0.53%  [kernel.kallsyms]                                                [k] skb_try_coalesce
     0.60%     +0.53%  [kernel.kallsyms]                                                [k] eth_type_trans
     1.65%     -0.51%  [kernel.kallsyms]                                                [k] _raw_spin_lock_irqsave
     0.14%     +0.50%  [kernel.kallsyms]                                                [k] bnxt_start_xmit
     0.54%     -0.48%  [kernel.kallsyms]                                                [k] __skb_frag_unref
     0.91%     +0.48%  [cls_bpf]                                                        [k] 0x0000000000000010
     0.00%     +0.47%  [kernel.kallsyms]                                                [k] ipv6_gro_receive
     0.76%     -0.45%  [kernel.kallsyms]                                                [k] tcp_rcv_established
     0.94%     -0.45%  [kernel.kallsyms]                                                [k] __inet6_lookup_established
     0.31%     +0.43%  [kernel.kallsyms]                                                [k] __sched_text_start
     0.21%     +0.43%  [kernel.kallsyms]                                                [k] poll_idle
     0.91%     -0.42%  [kernel.kallsyms]                                                [k] tcp_try_coalesce
     0.91%     -0.42%  [kernel.kallsyms]                                                [k] kmem_cache_free
     1.13%     +0.42%  [kernel.kallsyms]                                                [k] __bnxt_poll_work
     0.48%     -0.41%  [kernel.kallsyms]                                                [k] tcp_urg
               +0.39%  [kernel.kallsyms]                                                [k] memcpy
     0.51%     -0.38%  [kernel.kallsyms]                                                [k] _raw_read_unlock_irqrestore
               +0.38%  [kernel.kallsyms]                                                [k] __skb_gro_checksum_complete
               +0.37%  [kernel.kallsyms]                                                [k] irq_entries_start
     0.16%     +0.36%  [kernel.kallsyms]                                                [k] bpf_sk_storage_get
     0.62%     -0.36%  [kernel.kallsyms]                                                [k] page_pool_refill_alloc_cache
     0.08%     +0.35%  [kernel.kallsyms]                                                [k] ip6_finish_output2
     0.14%     +0.34%  [kernel.kallsyms]                                                [k] bnxt_poll_p5
     0.06%     +0.33%  [sch_fq]                                                         [k] 0x0000000000000020
     0.04%     +0.32%  [kernel.kallsyms]                                                [k] __dev_queue_xmit
     0.75%     -0.32%  [kernel.kallsyms]                                                [k] __xdp_build_skb_from_frame
     0.67%     -0.31%  [kernel.kallsyms]                                                [k] sock_def_readable
     0.05%     +0.31%  [kernel.kallsyms]                                                [k] netif_skb_features
               +0.30%  [kernel.kallsyms]                                                [k] tcp_gro_pull_header
     0.49%     -0.29%  [kernel.kallsyms]                                                [k] napi_pp_put_page
     0.18%     +0.29%  [kernel.kallsyms]                                                [k] call_function_single_prep_ipi
     0.40%     -0.28%  [kernel.kallsyms]                                                [k] _raw_read_lock_irqsave
     0.11%     +0.27%  [kernel.kallsyms]                                                [k] raw6_local_deliver
     0.18%     +0.26%  [kernel.kallsyms]                                                [k] ip6_dst_check
     0.42%     -0.26%  [kernel.kallsyms]                                                [k] netif_receive_skb_list_internal
     0.05%     +0.26%  [kernel.kallsyms]                                                [k] __qdisc_run
     0.75%     +0.25%  [kernel.kallsyms]                                                [k] __build_skb_around
     0.05%     +0.25%  [kernel.kallsyms]                                                [k] htab_map_hash
     0.09%     +0.24%  [kernel.kallsyms]                                                [k] net_rx_action
     0.07%     +0.23%  <redacted>
     0.45%     -0.23%  [kernel.kallsyms]                                                [k] migrate_enable
     0.48%     -0.23%  [kernel.kallsyms]                                                [k] mem_cgroup_charge_skmem
     0.26%     +0.23%  [kernel.kallsyms]                                                [k] __switch_to
     0.15%     +0.22%  [kernel.kallsyms]                                                [k] sock_rfree
     0.30%     -0.22%  [kernel.kallsyms]                                                [k] tcp_add_backlog

     <snip>

     5.68%             bpf_prog_17fea1bb6503ed98_steering                               [k] bpf_prog_17fea1bb6503ed98_steering
     2.10%             [kernel.kallsyms]                                                [k] __skb_checksum_complete
     0.71%             [kernel.kallsyms]                                                [k] __memset
     0.54%             [kernel.kallsyms]                                                [k] __memcpy
     0.18%             [kernel.kallsyms]                                                [k] __irqentry_text_start

     <snip>

Please let me know if you want me to collect any other data.
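
To see how much of the shift in the perf diff above lands in GRO code, the rows can be parsed and the deltas of matching symbols summed. The following is a rough sketch, assuming the row layout shown above (optional baseline percentage, optional delta, shared object, `[k]`, symbol); the regular expression and the names `parse_rows` and `summed_delta` are illustrative, and the sample contains only a handful of rows copied from the output.

```python
import re

# One `perf diff` row: optional baseline %, optional delta %, DSO, [k], symbol.
ROW = re.compile(
    r"^\s*(?:(?P<base>\d+\.\d+)%)?\s*(?:(?P<delta>[+-]\d+\.\d+)%)?"
    r"\s+(?P<dso>\S+)\s+\[\w\]\s+(?P<sym>\S+)\s*$"
)

# A few rows copied from the perf diff output above (whitespace shortened).
SAMPLE = """\
     0.00%     +0.98%  [kernel.kallsyms]  [k] tcp6_gro_receive
     2.91%     -0.98%  [kernel.kallsyms]  [k] csum_partial
     0.00%     +0.74%  [kernel.kallsyms]  [k] dev_gro_receive
               +0.72%  [kernel.kallsyms]  [k] tcp_gro_receive
     0.00%     +0.47%  [kernel.kallsyms]  [k] ipv6_gro_receive
               +0.38%  [kernel.kallsyms]  [k] __skb_gro_checksum_complete
               +0.30%  [kernel.kallsyms]  [k] tcp_gro_pull_header
"""


def parse_rows(text):
    """Return (symbol, delta) pairs for every row that carries a delta."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m and m.group("delta") is not None:
            rows.append((m.group("sym"), float(m.group("delta"))))
    return rows


def summed_delta(rows, needle):
    """Sum the deltas of all symbols whose name contains `needle`."""
    return sum(delta for sym, delta in rows if needle in sym)


rows = parse_rows(SAMPLE)
print(f"GRO symbols: {summed_delta(rows, 'gro'):+.2f}% of cycles:k")
```

Rows without a symbol column, such as the `<redacted>` ones, do not match the pattern and are skipped.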

Thanks,
Daniel
Alexander Lobakin Nov. 25, 2024, 3:12 p.m. UTC | #11
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Fri, 22 Nov 2024 17:10:06 -0700

> Hi Olek,
> 
> Here are the results.
> 
> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>
>>
>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:

[...]

> Baseline (again)
> 
> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> 
> cpumap v2 Olek
> 
> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> 
> 
> It's very interesting that we see -40% tput w/ the patches. I went back

Oh no, I messed up something =\

Could you please also test not the whole series, but patches 1-3 (up to
"bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
array...")? Would be great to see whether this implementation works
worse right from the start or I just broke something later on.

> and double checked and it seems the numbers are right. Here's
> some output from some profiles I took with:
> 
>     perf record -e cycles:k -a -- sleep 10
>     perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> 
>     # Event 'cycles:k'
>     # Baseline  Delta Abs  Shared Object                                                    Symbol
>          6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter

BTW, what CONFIG_HZ do you have on the kernel you're testing with?

Thanks,
Olek
Daniel Xu Nov. 25, 2024, 5:03 p.m. UTC | #12
On Mon, Nov 25, 2024 at 04:12:24PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
> 
> > Hi Olek,
> > 
> > Here are the results.
> > 
> > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> >>
> >>
> >> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> 
> [...]
> 
> > Baseline (again)
> > 
> > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> > Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> > Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> > Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> > Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> > Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> > 
> > cpumap v2 Olek
> > 
> > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> > Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> > Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> > Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> > Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> > Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> > Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> > 
> > 
> > It's very interesting that we see -40% tput w/ the patches. I went back
> 
> Oh no, I messed up something =\
> 
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.

Will do.

> 
> > and double checked and it seems the numbers are right. Here's
> > some output from some profiles I took with:
> > 
> >     perf record -e cycles:k -a -- sleep 10
> >     perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > 
> >     # Event 'cycles:k'
> >     # Baseline  Delta Abs  Shared Object                                                    Symbol
> >          6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
> 
> BTW, what CONFIG_HZ do you have on the kernel you're testing with?

# zgrep CONFIG_HZ /proc/config.gz
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

Just curious - why do you ask?

Thanks,
Daniel
Jesper Dangaard Brouer Nov. 25, 2024, 6:50 p.m. UTC | #13
On 25/11/2024 16.12, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
> 
>> Hi Olek,
>>
>> Here are the results.
>>
>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>
>>>
>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> 
> [...]
> 
>> Baseline (again)
>>
>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
>> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
>> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
>> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
>> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
>> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
>>

We need to talk about what we are measuring, and how to control the
experiment setup to get reproducible results.
Especially controlling on what CPU cores our code paths are executing.

In above "baseline" case, we have two processes/tasks executing:
  (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
  (2) Userspace netserver process TCP receiving data from socket.

My experience is that you will see two noticeably different
throughput results depending on whether (1) and (2) are
executing on the *same* CPU (multi-tasking, context-switching),
or executing in parallel (e.g. pinned) on two different CPU cores.

The netperf command has an option:

  -T lcpu,remcpu
       Request that netperf be bound to local CPU lcpu and/or netserver
       be bound to remote CPU rcpu.

Verify the setting by listing the pinning like this:
   for PID in $(pidof netserver); do taskset -pc $PID ; done

You can also set the pinning at runtime like this:
  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done

For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
output and adjust the pinning at runtime to observe the effect quickly.
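
The same check and adjustment can be scripted. Below is a minimal Python sketch, assuming a Linux host; it uses the standard `os.sched_getaffinity()` and `os.sched_setaffinity()` calls, and the helper names `allowed_cpus` and `pin_to_cpu` are made up for illustration.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPUs that task `pid` may run on (0 = the caller)."""
    return sorted(os.sched_getaffinity(pid))


def pin_to_cpu(cpu, pid=0):
    """Restrict task `pid` to a single CPU, similar to `taskset -pc <cpu> <pid>`."""
    os.sched_setaffinity(pid, {cpu})


print("allowed CPUs:", allowed_cpus())
```

Pinning another user's process (for example netserver) needs the same privileges as the `sudo taskset` call above.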

My experience is unfortunately that TCP results have a lot of variation
(thanks for including 5 runs in your benchmarks), as they depend on task
timing, which can be affected by CPU sleep states. The system's CPU
latency setting can be seen in /dev/cpu_dma_latency, which can be read
like this:

  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
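
For scripting, the same value can be decoded in Python. This is a sketch under the assumption that the file returns a 4-byte signed integer in native byte order (which is what the `%d` hexdump format above reads) holding the latency limit in microseconds; the function names are illustrative.

```python
import struct


def parse_cpu_dma_latency(raw):
    """Decode the 4-byte native-endian signed integer read from the device."""
    (value,) = struct.unpack("=i", raw[:4])
    return value


def read_cpu_dma_latency(path="/dev/cpu_dma_latency"):
    """Read and decode the current setting (typically needs root)."""
    with open(path, "rb") as f:
        return parse_cpu_dma_latency(f.read(4))
```

Calling `read_cpu_dma_latency()` as root should report the same number as the hexdump command.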

To experiment with changing /dev/cpu_dma_latency I choose to use tuned-adm,
as changing it requires holding the file open. E.g. I play with these profiles:

  sudo tuned-adm profile throughput-performance
  sudo tuned-adm profile latency-performance
  sudo tuned-adm profile network-latency
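
The "holding the file open" requirement can also be met from a small script for the duration of one benchmark run. The sketch below is an assumption-laden illustration, not what tuned-adm itself does: it writes a 4-byte native-endian integer (the limit in microseconds) and keeps the descriptor open until the `with` block ends. The helper name and the `path` parameter are hypothetical.

```python
import contextlib
import os
import struct


@contextlib.contextmanager
def hold_cpu_dma_latency(max_latency_us, path="/dev/cpu_dma_latency"):
    """Request a latency limit and keep it active until the block exits."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # The request stays in effect only while this descriptor is open.
        os.write(fd, struct.pack("=i", max_latency_us))
        yield
    finally:
        os.close(fd)
```

A benchmark could then be started inside `with hold_cpu_dma_latency(0): ...` as root, so the setting is dropped again as soon as the run finishes.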


>> cpumap v2 Olek
>>
>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
>> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
>> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
>> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
>> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
>> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
>> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
>>
>>


We now have three processes/tasks executing:
  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
  (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
  (3) Userspace netserver process TCP receiving data from socket.

Again, the performance is now going to depend on which CPU
cores the processes/tasks are running on and whether some are sharing the
same CPU. (There are both wakeup timing and cache-line effects.)

There are now more combinations to test...
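
To make the growth concrete, the CPU-sharing layouts can be enumerated as set partitions of the task list. This is only a sketch: it counts which tasks share a CPU, and ignores which physical cores or caches are involved; the function name is made up for illustration.

```python
def cpu_sharing_layouts(tasks):
    """Yield every way to group `tasks` into sets that share one CPU."""
    if not tasks:
        yield []
        return
    first, rest = tasks[0], tasks[1:]
    for layout in cpu_sharing_layouts(rest):
        # `first` joins one of the existing groups ...
        for i in range(len(layout)):
            yield layout[:i] + [[first] + layout[i]] + layout[i + 1:]
        # ... or runs alone on its own CPU.
        yield [[first]] + layout


two = list(cpu_sharing_layouts(["rx-napi", "netserver"]))
three = list(cpu_sharing_layouts(["rx-napi", "cpumap-kthread", "netserver"]))
print(len(two), "layouts for two tasks,", len(three), "for three")
for layout in three:
    print(layout)
```

Each printed layout is one scenario to pin and measure separately.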

CPUmap is a CPU scaling facility, and you will likely also see different
CPU utilization on the different cores once you start to pin these to
control the scenarios.

>> It's very interesting that we see -40% tput w/ the patches. I went back
> 

Sad that we see -40% throughput...  but do we know which CPU cores the
three tasks/processes now run on?


> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.
> 
>> and double checked and it seems the numbers are right. Here's
>> some output from some profiles I took with:
>>
>>      perf record -e cycles:k -a -- sleep 10
>>      perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
>>
>>      # Event 'cycles:k'
>>      # Baseline  Delta Abs  Shared Object                                                    Symbol
>>           6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
>

I really appreciate that you provide perf data and perf diff, but as
described above, we need data and information on what CPU cores are
running which workload.

Fortunately, perf diff (and perf report) support sorting by CPU like this:
  perf diff --sort=cpu,symbol

But then you also need to control the CPUs used in experiment for the
diff to work.

I hope I made sense, as these kinds of CPU scaling benchmarks are tricky,
--Jesper
Daniel Xu Nov. 25, 2024, 9:53 p.m. UTC | #14
Hi Jesper,

On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
> 
> 
> On 25/11/2024 16.12, Alexander Lobakin wrote:
> > From: Daniel Xu <dxu@dxuuu.xyz>
> > Date: Fri, 22 Nov 2024 17:10:06 -0700
> > 
> > > Hi Olek,
> > > 
> > > Here are the results.
> > > 
> > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > > > 
> > > > 
> > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > 
> > [...]
> > 
> > > Baseline (again)
> > > 
> > > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > > Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> > > Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> > > Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> > > Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> > > Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> > > Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> > > 
> 
> We need to talk about what we are measuring, and how to control the
> experiment setup to get reproducible results.
> Especially controlling on what CPU cores our code paths are executing.
> 
> In above "baseline" case, we have two processes/tasks executing:
>  (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
>  (2) Userspace netserver process TCP receiving data from socket.

"baseline" in this case is still cpumap, just without these GRO patches.

> 
> My experience is that you will see two noticeable different
> throughput performance results depending on whether (1) and (2) is
> executing on the *same* CPU (multi-tasking context-switching),
> or executing in parallel (e.g. pinned) on two different CPU cores.
> 
> The netperf command have an option
> 
>  -T lcpu,remcpu
>       Request that netperf be bound to local CPU lcpu and/or netserver be
> bound to remote CPU rcpu.
> 
> Verify setting by listing pinning like this:
>   for PID in $(pidof netserver); do taskset -pc $PID ; done
> 
> You can also set pinning runtime like this:
>  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID;
> done
> 
> For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
> output and adjust pinning runtime to observe the effect quickly.
> 
> My experience is unfortunately that TCP results have a lot of variation
> (thanks for including 5 runs in your benchmarks), as it depends on tasks
> timing, that can get affected by CPU sleep states. The systems CPU
> latency setting can be seen in /dev/cpu_dma_latency, which can be read
> like this:
> 
>  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
> 
> For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm
> as it requires holding the file open. E.g I play with these profiles:
> 
>  sudo tuned-adm profile throughput-performance
>  sudo tuned-adm profile latency-performance
>  sudo tuned-adm profile network-latency

Appreciate the tips - I should keep this saved somewhere.

> 
> 
> > > cpumap v2 Olek
> > > 
> > > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > > Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> > > Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> > > Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> > > Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> > > Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> > > Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> > > Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> > > 
> > > 
> 
> 
> We now have three processes/tasks executing:
>  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
>  (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
>  (3) Userspace netserver process TCP receiving data from socket.
> 
> Again, now the performance is going to depend on which CPU
> cores the processes/tasks are running and whether some are sharing the
> same CPU. (There are both wakeup timing and cache-line effects).
> 
> There are now more combinations to test...
> 
> CPUmap is a CPU scaling facility, and you will likely also see different
> CPU utilization on the different cores once you start to pin these to
> control the scenarios.
> 
> > > It's very interesting that we see -40% tput w/ the patches. I went back
> > 
> 
> Sad that we see -40% throughput...  but do we know what CPU cores the
> now three different tasks/processes run on(?)
> 

Roughly, yes. For context, my primary use case for cpumap is to provide
some degree of isolation between colocated containers on a single host.
In particular, colocation occurs on AMD Bergamo. And containers are
CPU pinned to their own CCX (roughly). My RX steering program ensures
RX packets destined to a specific container are cpumap redirected to any
of the container's pinned CPUs. It not only provides a good measure of
isolation but ensures resources are properly accounted.

So to answer your question of which CPUs the 3 things run on: cpumap
kthread and application run on the same set of cores. More than that,
they share the same L3 cache by design. irq/softirq is effectively
random given default RSS config and IRQ affinities.


> 
> > Oh no, I messed up something =\
> >
> > Could you please also test not the whole series, but patches 1-3 (up to
> > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> > array...")? Would be great to see whether this implementation works
> > worse right from the start or I just broke something later on.
> > 
> > > and double checked and it seems the numbers are right. Here's
> > > some output from some profiles I took with:
> > > 
> > >      perf record -e cycles:k -a -- sleep 10
> > >      perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > > 
> > >      # Event 'cycles:k'
> > >      # Baseline  Delta Abs  Shared Object                                                    Symbol
> > >           6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
> > 
> 
> I really appreciate that you provide perf data and perf diff, but as
> described above, we need data and information on what CPU cores are
> running which workload.
> 
> Fortunately perf diff (and perf report) support doing like this:
>  perf diff --sort=cpu,symbol
> 
> But then you also need to control the CPUs used in experiment for the
> diff to work.
> 
> I hope I made sense as these kind of CPU scaling benchmarks are tricky,

Indeed, sounds quite tricky.

My understanding with GRO is that it's a powerful general purpose
optimization. Enough that it should rise above the usual noise on a
reasonably configured system (which mine is).

Maybe we can consider decoupling the cpumap GRO enablement from the
later optimizations?

So in Olek's above series, patches 1-3 seem like they would still
benefit from a simpler testbed. But the more targeted optimizations in
patch 4+ would probably justify a de-noised setup.  Possibly single host
with xdp-trafficgen or something.

Procedurally speaking, maybe it would save some wasted effort if
everyone agreed on the general approach before investing more time into
finer optimizations built on top of the basic GRO support?

Thanks,
Daniel
Lorenzo Bianconi Nov. 25, 2024, 10:19 p.m. UTC | #15
> Hi Jesper,
> 
> On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
> > 
> > 
> > On 25/11/2024 16.12, Alexander Lobakin wrote:
> > > From: Daniel Xu <dxu@dxuuu.xyz>
> > > Date: Fri, 22 Nov 2024 17:10:06 -0700
> > > 
> > > > Hi Olek,
> > > > 
> > > > Here are the results.
> > > > 
> > > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > > > > 
> > > > > 
> > > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > > 
> > > [...]
> > > 
> > > > Baseline (again)
> > > > 
> > > > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > > > Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> > > > Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> > > > Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> > > > Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> > > > Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> > > > Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> > > > 
> > 
> > We need to talk about what we are measuring, and how to control the
> > experiment setup to get reproducible results.
> > Especially controlling on what CPU cores our code paths are executing.
> > 
> > In above "baseline" case, we have two processes/tasks executing:
> >  (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
> >  (2) Userspace netserver process TCP receiving data from socket.
> 
> "baseline" in this case is still cpumap, just without these GRO patches.
> 
> > 
> > My experience is that you will see two noticeable different
> > throughput performance results depending on whether (1) and (2) is
> > executing on the *same* CPU (multi-tasking context-switching),
> > or executing in parallel (e.g. pinned) on two different CPU cores.
> > 
> > The netperf command have an option
> > 
> >  -T lcpu,remcpu
> >       Request that netperf be bound to local CPU lcpu and/or netserver be
> > bound to remote CPU rcpu.
> > 
> > Verify setting by listing pinning like this:
> >   for PID in $(pidof netserver); do taskset -pc $PID ; done
> > 
> > You can also set pinning runtime like this:
> >  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID;
> > done
> > 
> > For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
> > output and adjust pinning runtime to observe the effect quickly.
> > 
> > My experience is unfortunately that TCP results have a lot of variation
> > (thanks for including 5 runs in your benchmarks), as it depends on tasks
> > timing, that can get affected by CPU sleep states. The systems CPU
> > latency setting can be seen in /dev/cpu_dma_latency, which can be read
> > like this:
> > 
> >  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
> > 
> > For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm
> > as it requires holding the file open. E.g I play with these profiles:
> > 
> >  sudo tuned-adm profile throughput-performance
> >  sudo tuned-adm profile latency-performance
> >  sudo tuned-adm profile network-latency
> 
> Appreciate the tips - I should keep this saved somewhere.
> 
> > 
> > 
> > > > cpumap v2 Olek
> > > > 
> > > > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > > > Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> > > > Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> > > > Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> > > > Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> > > > Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> > > > Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> > > > Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> > > > 
> > > > 
> > 
> > 
> > We now have three processes/tasks executing:
> >  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
> >  (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
> >  (3) Userspace netserver process TCP receiving data from socket.
> > 
> > Again, the performance is now going to depend on which CPU
> > cores the processes/tasks are running on and whether some are sharing the
> > same CPU. (There are both wakeup timing and cache-line effects).
> > 
> > There are now more combinations to test...
> > 
> > CPUmap is a CPU scaling facility, and you will likely also see different
> > CPU utilization on the different cores once you start to pin these to
> > control the scenarios.
> > 
> > > > It's very interesting that we see -40% tput w/ the patches. I went back
> > > 
> > 
> > Sad that we see -40% throughput...  but do we know what CPU cores the
> > now three different tasks/processes run on(?)
> > 
> 
> Roughly, yes. For context, my primary use case for cpumap is to provide
> some degree of isolation between colocated containers on a single host.
> In particular, colocation occurs on AMD Bergamo. And containers are
> CPU pinned to their own CCX (roughly). My RX steering program ensures
> RX packets destined to a specific container are cpumap redirected to any
> of the container's pinned CPUs. It not only provides a good measure of
> isolation but ensures resources are properly accounted.
> 
> So to answer your question of which CPUs the 3 things run on: cpumap
> kthread and application run on the same set of cores. More than that,
> they share the same L3 cache by design. irq/softirq is effectively
> random given default RSS config and IRQ affinities.
> 
> 
> > 
> > > Oh no, I messed up something =\
> > > 
> > > Could you please also test not the whole series, but patches 1-3 (up to
> > > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> > > array...")? Would be great to see whether this implementation works
> > > worse right from the start or I just broke something later on.
> > > 
> > > > and double checked and it seems the numbers are right. Here's
> > > > some output from some profiles I took with:
> > > > 
> > > >      perf record -e cycles:k -a -- sleep 10
> > > >      perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > > > 
> > > >      # Event 'cycles:k'
> > > >      # Baseline  Delta Abs  Shared Object                                                    Symbol
> > > >           6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
> > > 
> > 
> > I really appreciate that you provide perf data and perf diff, but as
> > described above, we need data and information on what CPU cores are
> > running which workload.
> > 
> > Fortunately perf diff (and perf report) support doing like this:
> >  perf diff --sort=cpu,symbol
> > 
> > But then you also need to control the CPUs used in experiment for the
> > diff to work.
> > 
> > I hope I made sense as these kinds of CPU scaling benchmarks are tricky,
> 
> Indeed, sounds quite tricky.
> 
> My understanding with GRO is that it's a powerful general purpose
> optimization. Enough that it should rise above the usual noise on a
> reasonably configured system (which mine is).
> 
> Maybe we can consider decoupling the cpumap GRO enablement from the
> later optimizations?

I agree. First, we need to identify the best approach to enable GRO on cpumap
(between Olek's approach and what I have suggested) and then we can evaluate
subsequent optimizations.
@Olek: do you agree?

Regards,
Lorenzo

> 
> So in Olek's above series, patches 1-3 seem like they would still
> benefit from a simpler testbed. But the more targeted optimizations in
> patch 4+ would probably justify a de-noised setup.  Possibly single host
> with xdp-trafficgen or something.
> 
> Procedurally speaking, maybe it would save some wasted effort if
> everyone agreed on the general approach before investing more time into
> finer optimizations built on top of the basic GRO support?
> 
> Thanks,
> Daniel
>
Daniel Xu Nov. 25, 2024, 10:56 p.m. UTC | #16
On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
>> Hi Olek,
>> 
>> Here are the results.
>> 
>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>
>>>
>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>
> [...]
>
>> Baseline (again)
>> 
>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
>> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
>> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
>> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
>> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
>> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
>> 
>> cpumap v2 Olek
>> 
>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
>> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
>> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
>> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
>> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
>> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
>> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
>> 
>> 
>> It's very interesting that we see -40% tput w/ the patches. I went back
>
> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.

Patches 1-3 reproduce the -40% tput numbers.

With patches 1-4 the numbers get slightly worse (~1 Gbps lower) but it was noisy.

tcp_rr results were unaffected.
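
For anyone re-checking the arithmetic: the -41.32% delta in the quoted table
follows directly from the per-run throughput figures. A minimal sketch, with
the run values copied from the table above (the helper name pct_delta is only
for illustration):

```python
from statistics import mean

# Throughput (Mbit/s) per run, copied from the quoted tables.
baseline = [21749.43, 21897.17, 21906.82, 21155.15, 21397.06]
patched = [13497.57, 12115.53, 12323.38, 12901.88, 12593.22]


def pct_delta(old, new):
    """Relative change of mean(new) versus mean(old), in percent."""
    return (mean(new) - mean(old)) / mean(old) * 100.0


delta = pct_delta(baseline, patched)
print(f"baseline avg: {mean(baseline):.3f} Mbit/s")
print(f"patched avg:  {mean(patched):.3f} Mbit/s")
print(f"delta:        {delta:.2f}%")
```

The same helper can be pointed at the Transactions column to confirm that
tcp_rr barely moves between the two kernels.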

Thanks,
Daniel
Alexander Lobakin Nov. 26, 2024, 10:36 a.m. UTC | #17
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Mon, 25 Nov 2024 16:56:49 -0600

> 
> 
> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>> From: Daniel Xu <dxu@dxuuu.xyz>
>> Date: Fri, 22 Nov 2024 17:10:06 -0700
>>
>>> Hi Olek,
>>>
>>> Here are the results.
>>>
>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>>
>>>>
>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>>
>> [...]
>>
>>> Baseline (again)
>>>
>>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>>> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
>>> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
>>> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
>>> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
>>> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
>>> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
>>>
>>> cpumap v2 Olek
>>>
>>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>>> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
>>> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
>>> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
>>> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
>>> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
>>> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
>>> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
>>>
>>>
>>> It's very interesting that we see -40% tput w/ the patches. I went back
>>
>> Oh no, I messed up something =\
>>
>> Could you please also test not the whole series, but patches 1-3 (up to
>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>> array...")? Would be great to see whether this implementation works
>> worse right from the start or I just broke something later on.
> 
> Patches 1-3 reproduces the -40% tput numbers. 

Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
cpumap's kthreads instead of NAPI) really performs worse than switching
cpumap to NAPI.

> 
> With patches 1-4 the numbers get slightly worse (~1gbps lower) but it was noisy.

Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up on it.

> 
> tcp_rr results were unaffected.

@ Jakub,

Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
least for now =\ I took a look at the backlog NAPI and it could be used,
although we'd need a pointer in the backlog to the corresponding cpumap
+ also some synchronization point to make sure the backlog NAPI won't access
an already destroyed cpumap.

Maybe Lorenzo could take a look...

Thanks,
Olek
Lorenzo Bianconi Nov. 26, 2024, 5:02 p.m. UTC | #18
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Mon, 25 Nov 2024 16:56:49 -0600
> 
> > 
> > 
> > On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> >> From: Daniel Xu <dxu@dxuuu.xyz>
> >> Date: Fri, 22 Nov 2024 17:10:06 -0700
> >>
> >>> Hi Olek,
> >>>
> >>> Here are the results.
> >>>
> >>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> >>>>
> >>>>
> >>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> >>
> >> [...]
> >>
> >>> Baseline (again)
> >>>
> >>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> >>> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> >>> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> >>> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> >>> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> >>> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> >>> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> >>>
> >>> cpumap v2 Olek
> >>>
> >>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> >>> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> >>> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> >>> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> >>> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> >>> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> >>> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> >>> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> >>>
> >>>
> >>> It's very interesting that we see -40% tput w/ the patches. I went back
> >>
> >> Oh no, I messed up something =\
> >>
> >> Could you please also test not the whole series, but patches 1-3 (up to
> >> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> >> array...")? Would be great to see whether this implementation works
> >> worse right from the start or I just broke something later on.
> > 
> > Patches 1-3 reproduces the -40% tput numbers. 
> 
> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
> cpumap's kthreads instead of NAPI) really performs worse than switching
> cpumap to NAPI.
> 
> > 
> > With patches 1-4 the numbers get slightly worse (~1gbps lower) but it was noisy.
> 
> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up on it.
> 
> > 
> > tcp_rr results were unaffected.
> 
> @ Jakub,
> 
> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
> least for now =\ I took a look on the backlog NAPI and it could be used,
> although we'd need a pointer in the backlog to the corresponding cpumap
> + also some synchronization point to make sure backlog NAPI won't access
> already destroyed cpumap.
> 
> Maybe Lorenzo could take a look...

It seems to me the only difference would be that we would use the shared
backlog_napi kthreads instead of having a dedicated kthread for each cpumap
entry, but we would still need the NAPI poll logic. I can look into it if you
prefer the shared kthread approach.
@Jakub: what do you think?

Regards,
Lorenzo

> 
> Thanks,
> Olek
>
Jesper Dangaard Brouer Nov. 26, 2024, 5:12 p.m. UTC | #19
On 26/11/2024 18.02, Lorenzo Bianconi wrote:
>> From: Daniel Xu <dxu@dxuuu.xyz>
>> Date: Mon, 25 Nov 2024 16:56:49 -0600
>>
>>>
>>>
>>> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>> Date: Fri, 22 Nov 2024 17:10:06 -0700
>>>>
>>>>> Hi Olek,
>>>>>
>>>>> Here are the results.
>>>>>
>>>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>>>>
>>>> [...]
>>>>
>>>>> Baseline (again)
>>>>>
>>>>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>>>>> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
>>>>> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
>>>>> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
>>>>> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
>>>>> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
>>>>> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
>>>>>
>>>>> cpumap v2 Olek
>>>>>
>>>>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>>>>> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
>>>>> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
>>>>> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
>>>>> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
>>>>> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
>>>>> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
>>>>> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
>>>>>
>>>>>
>>>>> It's very interesting that we see -40% tput w/ the patches. I went back
>>>>
>>>> Oh no, I messed up something =\
>>>>
>>>> Could you please also test not the whole series, but patches 1-3 (up to
>>>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>>>> array...")? Would be great to see whether this implementation works
>>>> worse right from the start or I just broke something later on.
>>>
>>> Patches 1-3 reproduces the -40% tput numbers.
>>
>> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
>> cpumap's kthreads instead of NAPI) really performs worse than switching
>> cpumap to NAPI.
>>
>>>
>>> With patches 1-4 the numbers get slightly worse (~1gbps lower) but it was noisy.
>>
>> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up on it.
>>
>>>
>>> tcp_rr results were unaffected.
>>
>> @ Jakub,
>>
>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
>> least for now =\ I took a look on the backlog NAPI and it could be used,
>> although we'd need a pointer in the backlog to the corresponding cpumap
>> + also some synchronization point to make sure backlog NAPI won't access
>> already destroyed cpumap.
>>
>> Maybe Lorenzo could take a look...
> 
> it seems to me the only difference would be we will use the shared backlog_napi
> kthreads instead of having a dedicated kthread for each cpumap entry but we still
> need the napi poll logic. I can look into it if you prefer the shared kthread
> approach.

I don't like a shared kthread approach. For my use-case I want to give
the "remote" CPU-map kthreads higher scheduling priority (as they will be
running a 2nd XDP BPF DDoS program protecting against overload by
dropping packets).

Thus, I'm not a fan of using the shared backlog_napi, as I don't want
to give the backlog NAPI high priority in my use-case.
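
To illustrate what a dedicated kthread allows, here is a rough sketch of
boosting only the cpumap kthreads from userspace. It assumes the kthreads are
named "cpumap/<cpu>/map:<id>" (check this on your kernel), and SCHED_FIFO with
priority 10 is just one possible policy, not a recommendation. The helper
names are made up for this example:

```python
import os
import re

# Assumed naming scheme for cpumap kthreads ("cpumap/<cpu>/map:<id>");
# verify against your kernel before relying on it.
CPUMAP_COMM = re.compile(r"^cpumap/\d+/map:\d+$")


def find_cpumap_kthreads(proc_root="/proc"):
    """Return sorted PIDs whose comm matches the assumed cpumap pattern."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # task exited (or entry unreadable) while scanning
        if CPUMAP_COMM.match(comm):
            pids.append(int(entry))
    return sorted(pids)


def boost(pids, prio=10):
    """Move the given tasks to SCHED_FIFO at `prio` (needs CAP_SYS_NICE)."""
    for pid in pids:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(prio))
```

Calling boost(find_cpumap_kthreads()) as root would then raise only the cpumap
threads; with a shared backlog kthread there is no such per-cpumap task to
target.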

> @Jakub: what do you think?


--Jesper
Alexander Lobakin Nov. 28, 2024, 10:41 a.m. UTC | #20
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 26 Nov 2024 18:12:27 +0100

> 
> 
> 
> On 26/11/2024 18.02, Lorenzo Bianconi wrote:
>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>> Date: Mon, 25 Nov 2024 16:56:49 -0600
>>>
>>>>
>>>>
>>>> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>>> Date: Fri, 22 Nov 2024 17:10:06 -0700
>>>>>
>>>>>> Hi Olek,
>>>>>>
>>>>>> Here are the results.
>>>>>>
>>>>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>> Baseline (again)
>>>>>>
>>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
>>>>>> Run 1    3169917       0.00007295       0.00007871       0.00009343       21749.43
>>>>>> Run 2    3228290       0.00007103       0.00007679       0.00009215       21897.17
>>>>>> Run 3    3226746       0.00007231       0.00007871       0.00009087       21906.82
>>>>>> Run 4    3191258       0.00007231       0.00007743       0.00009087       21155.15
>>>>>> Run 5    3235653       0.00007231       0.00007743       0.00008703       21397.06
>>>>>> Average  3210372.8     0.000072182      0.000077814      0.00009087       21621.126
>>>>>>
>>>>>> cpumap v2 Olek
>>>>>>
>>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
>>>>>> Run 1    3253651       0.00007167       0.00007807       0.00009343       13497.57
>>>>>> Run 2    3221492       0.00007231       0.00007743       0.00009087       12115.53
>>>>>> Run 3    3296453       0.00007039       0.00007807       0.00009087       12323.38
>>>>>> Run 4    3254460       0.00007167       0.00007807       0.00009087       12901.88
>>>>>> Run 5    3173327       0.00007295       0.00007871       0.00009215       12593.22
>>>>>> Average  3239876.6     0.000071798      0.00007807       0.000091638      12686.316
>>>>>> Delta    0.92%         -0.53%           0.33%            0.85%            -41.32%
>>>>>>
>>>>>>
>>>>>> It's very interesting that we see -40% tput w/ the patches. I went
>>>>>> back
>>>>>
>>>>> Oh no, I messed up something =\
>>>>>
>>>>> Could you please also test not the whole series, but patches 1-3
>>>>> (up to
>>>>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>>>>> array...")? Would be great to see whether this implementation works
>>>>> worse right from the start or I just broke something later on.
>>>>
>>>> Patches 1-3 reproduces the -40% tput numbers.
>>>
>>> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
>>> cpumap's kthreads instead of NAPI) really performs worse than switching
>>> cpumap to NAPI.
>>>
>>>>
>>>> With patches 1-4 the numbers get slightly worse (~1gbps lower) but
>>>> it was noisy.
>>>
>>> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up
>>> on it.
>>>
>>>>
>>>> tcp_rr results were unaffected.
>>>
>>> @ Jakub,
>>>
>>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
>>> least for now =\ I took a look on the backlog NAPI and it could be used,
>>> although we'd need a pointer in the backlog to the corresponding cpumap
>>> + also some synchronization point to make sure backlog NAPI won't access
>>> already destroyed cpumap.
>>>
>>> Maybe Lorenzo could take a look...
>>
>> it seems to me the only difference would be we will use the shared
>> backlog_napi kthreads instead of having a dedicated kthread for each
>> cpumap entry but we still need the napi poll logic. I can look into it
>> if you prefer the shared kthread approach.
> 
> I don't like a shared kthread approach. For my use-case I want to give
> the "remote" CPU-map kthreads higher scheduling priority. (As it will be
> running a 2nd XDP BPF DDoS program protecting against overload by
> dropping packets).

Oh, that is also valid.
Let's see what Jakub replies; for now I'm leaning towards posting the
approach from this RFC with my bulk allocation from the NAPI cache.

> 
> Thus, I'm not a fan of using the shared backlog_napi.  As I don't want
> to give backlog NAPI high priority, in my use-case.
> 
>> @Jakub: what do you think?
> 
> 
> --Jesper

Thanks,
Olek
Lorenzo Bianconi Nov. 28, 2024, 10:56 a.m. UTC | #21
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> Date: Tue, 26 Nov 2024 18:12:27 +0100
> 
> > 
> > 
> > 
> > On 26/11/2024 18.02, Lorenzo Bianconi wrote:
> >>> From: Daniel Xu <dxu@dxuuu.xyz>
> >>> Date: Mon, 25 Nov 2024 16:56:49 -0600
> >>>
> >>>>
> >>>>
> >>>> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> >>>>> From: Daniel Xu <dxu@dxuuu.xyz>
> >>>>> Date: Fri, 22 Nov 2024 17:10:06 -0700
> >>>>>
> >>>>>> Hi Olek,
> >>>>>>
> >>>>>> Here are the results.
> >>>>>>
> >>>>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>>> Baseline (again)
> >>>>>>
> >>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
> >>>>>> Run 1    3169917       0.00007295       0.00007871       0.00009343       21749.43
> >>>>>> Run 2    3228290       0.00007103       0.00007679       0.00009215       21897.17
> >>>>>> Run 3    3226746       0.00007231       0.00007871       0.00009087       21906.82
> >>>>>> Run 4    3191258       0.00007231       0.00007743       0.00009087       21155.15
> >>>>>> Run 5    3235653       0.00007231       0.00007743       0.00008703       21397.06
> >>>>>> Average  3210372.8     0.000072182      0.000077814      0.00009087       21621.126
> >>>>>>
> >>>>>> cpumap v2 Olek
> >>>>>>
> >>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
> >>>>>> Run 1    3253651       0.00007167       0.00007807       0.00009343       13497.57
> >>>>>> Run 2    3221492       0.00007231       0.00007743       0.00009087       12115.53
> >>>>>> Run 3    3296453       0.00007039       0.00007807       0.00009087       12323.38
> >>>>>> Run 4    3254460       0.00007167       0.00007807       0.00009087       12901.88
> >>>>>> Run 5    3173327       0.00007295       0.00007871       0.00009215       12593.22
> >>>>>> Average  3239876.6     0.000071798      0.00007807       0.000091638      12686.316
> >>>>>> Delta    0.92%         -0.53%           0.33%            0.85%            -41.32%
> >>>>>>
> >>>>>>
> >>>>>> It's very interesting that we see -40% tput w/ the patches. I went
> >>>>>> back
> >>>>>
> >>>>> Oh no, I messed up something =\
> >>>>>
> >>>>> Could you please also test not the whole series, but patches 1-3
> >>>>> (up to
> >>>>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> >>>>> array...")? Would be great to see whether this implementation works
> >>>>> worse right from the start or I just broke something later on.
> >>>>
> >>>> Patches 1-3 reproduces the -40% tput numbers.
> >>>
> >>> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
> >>> cpumap's kthreads instead of NAPI) really performs worse than switching
> >>> cpumap to NAPI.
> >>>
> >>>>
> >>>> With patches 1-4 the numbers get slightly worse (~1gbps lower) but
> >>>> it was noisy.
> >>>
> >>> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up
> >>> on it.
> >>>
> >>>>
> >>>> tcp_rr results were unaffected.
> >>>
> >>> @ Jakub,
> >>>
> >>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
> >>> least for now =\ I took a look on the backlog NAPI and it could be used,
> >>> although we'd need a pointer in the backlog to the corresponding cpumap
> >>> + also some synchronization point to make sure backlog NAPI won't access
> >>> already destroyed cpumap.
> >>>
> >>> Maybe Lorenzo could take a look...
> >>
> >> it seems to me the only difference would be we will use the shared
> >> backlog_napi kthreads instead of having a dedicated kthread for each
> >> cpumap entry but we still need the napi poll logic. I can look into it
> >> if you prefer the shared kthread approach.
> > 
> > I don't like a shared kthread approach. For my use-case I want to give
> > the "remote" CPU-map kthreads higher scheduling priority. (As it will be
> > running a 2nd XDP BPF DDoS program protecting against overload by
> > dropping packets).
> 
> Oh, that is also valid.
> Let's see what Jakub replies, for now I'm leaning towards posting
> approach from this RFC with my bulk allocation from the NAPI cache.

I guess it would be better to keep them separate, to check what the effects
of each change (GRO for cpumap and bulk allocation) are. I guess you can post your
changes on top of mine if we all agree the proposed approach is fine.
What do you think?

Regards,
Lorenzo

> 
> > 
> > Thus, I'm not a fan of using the shared backlog_napi.  As I don't want
> > to give backlog NAPI high priority, in my use-case.
> > 
> >> @Jakub: what do you think?
> > 
> > 
> > --Jesper
> 
> Thanks,
> Olek
Alexander Lobakin Nov. 28, 2024, 10:57 a.m. UTC | #22
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Thu, 28 Nov 2024 11:56:24 +0100

>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>> Date: Tue, 26 Nov 2024 18:12:27 +0100
>>
>>>
>>>
>>>
>>> On 26/11/2024 18.02, Lorenzo Bianconi wrote:
>>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>>> Date: Mon, 25 Nov 2024 16:56:49 -0600
>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
>>>>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>>>>> Date: Fri, 22 Nov 2024 17:10:06 -0700
>>>>>>>
>>>>>>>> Hi Olek,
>>>>>>>>
>>>>>>>> Here are the results.
>>>>>>>>
>>>>>>>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>>>>>>>
>>>>>>> [...]
>>>>>>>
>>>>>>>> Baseline (again)
>>>>>>>>
>>>>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
>>>>>>>> Run 1    3169917       0.00007295       0.00007871       0.00009343       21749.43
>>>>>>>> Run 2    3228290       0.00007103       0.00007679       0.00009215       21897.17
>>>>>>>> Run 3    3226746       0.00007231       0.00007871       0.00009087       21906.82
>>>>>>>> Run 4    3191258       0.00007231       0.00007743       0.00009087       21155.15
>>>>>>>> Run 5    3235653       0.00007231       0.00007743       0.00008703       21397.06
>>>>>>>> Average  3210372.8     0.000072182      0.000077814      0.00009087       21621.126
>>>>>>>>
>>>>>>>> cpumap v2 Olek
>>>>>>>>
>>>>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
>>>>>>>> Run 1    3253651       0.00007167       0.00007807       0.00009343       13497.57
>>>>>>>> Run 2    3221492       0.00007231       0.00007743       0.00009087       12115.53
>>>>>>>> Run 3    3296453       0.00007039       0.00007807       0.00009087       12323.38
>>>>>>>> Run 4    3254460       0.00007167       0.00007807       0.00009087       12901.88
>>>>>>>> Run 5    3173327       0.00007295       0.00007871       0.00009215       12593.22
>>>>>>>> Average  3239876.6     0.000071798      0.00007807       0.000091638      12686.316
>>>>>>>> Delta    0.92%         -0.53%           0.33%            0.85%            -41.32%
>>>>>>>>
>>>>>>>>
>>>>>>>> It's very interesting that we see -40% tput w/ the patches. I went
>>>>>>>> back
>>>>>>>
>>>>>>> Oh no, I messed up something =\
>>>>>>>
>>>>>>> Could you please also test not the whole series, but patches 1-3
>>>>>>> (up to
>>>>>>> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
>>>>>>> array...")? Would be great to see whether this implementation works
>>>>>>> worse right from the start or I just broke something later on.
>>>>>>
>>>>>> Patches 1-3 reproduces the -40% tput numbers.
>>>>>
>>>>> Ok, thanks! Seems like using the hybrid approach (GRO, but on top of
>>>>> cpumap's kthreads instead of NAPI) really performs worse than switching
>>>>> cpumap to NAPI.
>>>>>
>>>>>>
>>>>>> With patches 1-4 the numbers get slightly worse (~1gbps lower) but
>>>>>> it was noisy.
>>>>>
>>>>> Interesting, I was sure patch 4 optimizes stuff... Maybe I'll give up
>>>>> on it.
>>>>>
>>>>>>
>>>>>> tcp_rr results were unaffected.
>>>>>
>>>>> @ Jakub,
>>>>>
>>>>> Looks like I can't just use GRO without Lorenzo's conversion to NAPI, at
>>>>> least for now =\ I took a look on the backlog NAPI and it could be used,
>>>>> although we'd need a pointer in the backlog to the corresponding cpumap
>>>>> + also some synchronization point to make sure backlog NAPI won't access
>>>>> already destroyed cpumap.
>>>>>
>>>>> Maybe Lorenzo could take a look...
>>>>
>>>> it seems to me the only difference would be we will use the shared
>>>> backlog_napi kthreads instead of having a dedicated kthread for each
>>>> cpumap entry but we still need the napi poll logic. I can look into it
>>>> if you prefer the shared kthread approach.
>>>
>>> I don't like a shared kthread approach. For my use-case I want to give
>>> the "remote" CPU-map kthreads higher scheduling priority. (As it will be
>>> running a 2nd XDP BPF DDoS program protecting against overload by
>>> dropping packets).
>>
>> Oh, that is also valid.
>> Let's see what Jakub replies, for now I'm leaning towards posting
>> approach from this RFC with my bulk allocation from the NAPI cache.
> 
> I guess it would be better to keep them separated to check what are the effects
> of each change (GRO for cpumap and bulk allocation). I guess you can post your
> changes on top of mine if we all agree the proposed approach is fine.
> What do you think?

Sounds good as well, I don't have any preference here.

> 
> Regards,
> Lorenzo

Thanks,
Olek
Jakub Kicinski Dec. 2, 2024, 10:47 p.m. UTC | #23
On Tue, 26 Nov 2024 11:36:53 +0100 Alexander Lobakin wrote:
> > tcp_rr results were unaffected.  
> 
> @ Jakub,

Context? What doesn't work and why?
Alexander Lobakin Dec. 3, 2024, 11:01 a.m. UTC | #24
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 2 Dec 2024 14:47:39 -0800

> On Tue, 26 Nov 2024 11:36:53 +0100 Alexander Lobakin wrote:
>>> tcp_rr results were unaffected.  
>>
>> @ Jakub,
> 
> Context? What doesn't work and why?

My tests show the same perf as on Lorenzo's series, but I test with UDP
trafficgen. Daniel tests TCP and the results are much worse than with
Lorenzo's implementation.
I suspect this is related to how NAPI performs flushes / decides
whether to repoll or exit vs. how the kthread does that (even though I
also try to flush only every 64 frames or when the ring is empty). Or
maybe to the fact that part of the kthread runs in process context outside
any softirq, while with NAPI the whole loop is inside the RX softirq.
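
To make the flush policy concrete, here is a toy userspace model (not the
kernel code) of "flush every 64 frames or when the ring is empty". The names
FLUSH_BATCH and drain are hypothetical and the deque merely stands in for the
cpumap ring:

```python
from collections import deque

FLUSH_BATCH = 64  # flush threshold mentioned above; the name is hypothetical


def drain(ring, flush_batch=FLUSH_BATCH):
    """Consume all frames from `ring`; return the size of each flushed batch.

    A flush happens once `flush_batch` frames are pending, or when the
    ring runs empty with frames still pending.
    """
    flushes = []
    pending = 0
    while ring:
        ring.popleft()  # stand-in for handing one frame to GRO
        pending += 1
        if pending == flush_batch or not ring:
            flushes.append(pending)
            pending = 0
    return flushes


print(drain(deque(range(150))))  # [64, 64, 22]
```

The model says nothing about when the loop is re-entered or in which context
it runs, which are exactly the parts that differ between the kthread and
NAPI variants.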

Jesper said that he'd like to see cpumap still using own kthread, so
that its priority can be boosted separately from the backlog. That's why
we asked you whether it would be fine to have cpumap as threaded NAPI in
regards to all this :D

Thanks,
Olek
Jakub Kicinski Dec. 4, 2024, 12:51 a.m. UTC | #25
On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
> >> @ Jakub,  
> > 
> > Context? What doesn't work and why?  
> 
> My tests show the same perf as on Lorenzo's series, but I test with UDP
> trafficgen. Daniel tests TCP and the results are much worse than with
> Lorenzo's implementation.
> I suspect this is related to that how NAPI performs flushes / decides
> whether to repoll again or exit vs how kthread does that (even though I
> also try to flush only every 64 frames or when the ring is empty). Or
> maybe to that part of the kthread happens in process context outside any
> softirq, while when using NAPI, the whole loop is inside RX softirq.
> 
> Jesper said that he'd like to see cpumap still using own kthread, so
> that its priority can be boosted separately from the backlog. That's why
> we asked you whether it would be fine to have cpumap as threaded NAPI in
> regards to all this :D

Certainly not without a clear understanding of what the problem with
a kthread is.
Alexander Lobakin Dec. 4, 2024, 4:42 p.m. UTC | #26
From: Jakub Kicinski <kuba@kernel.org>
Date: Tue, 3 Dec 2024 16:51:57 -0800

> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>> @ Jakub,  
>>>
>>> Context? What doesn't work and why?  
>>
>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>> trafficgen. Daniel tests TCP and the results are much worse than with
>> Lorenzo's implementation.
>> I suspect this is related to that how NAPI performs flushes / decides
>> whether to repoll again or exit vs how kthread does that (even though I
>> also try to flush only every 64 frames or when the ring is empty). Or
>> maybe to that part of the kthread happens in process context outside any
>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>
>> Jesper said that he'd like to see cpumap still using own kthread, so
>> that its priority can be boosted separately from the backlog. That's why
>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>> regards to all this :D
> 
> Certainly not without a clear understanding what the problem with 
> a kthread is.

Yes, sure thing.

Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
was testing with the UDP trafficgen and got up to 80% improvement over
the baseline. Now I tested TCP and got up to 70% improvement, no
regressions whatsoever =\

I don't know where this regression on Daniel's setup comes from. Is it
multi-thread or single-thread test? What app do you use: iperf, netperf,
neper, Microsoft's app (forgot the name)? Do you have multiple NUMA
nodes on your system, are you sure you didn't cross the node when
redirecting with the GRO patches / no other NUMA mismatches happened?
Some other random stuff like RSS hash key, which affects flow steering?

Thanks,
Olek
Daniel Xu Dec. 4, 2024, 9:51 p.m. UTC | #27
On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Tue, 3 Dec 2024 16:51:57 -0800
>
>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>> @ Jakub,  
>>>>
>>>> Context? What doesn't work and why?  
>>>
>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>> Lorenzo's implementation.
>>> I suspect this is related to that how NAPI performs flushes / decides
>>> whether to repoll again or exit vs how kthread does that (even though I
>>> also try to flush only every 64 frames or when the ring is empty). Or
>>> maybe to that part of the kthread happens in process context outside any
>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>
>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>> that its priority can be boosted separately from the backlog. That's why
>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>> regards to all this :D
>> 
>> Certainly not without a clear understanding what the problem with 
>> a kthread is.
>
> Yes, sure thing.
>
> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
> was testing with the UDP trafficgen and got up to 80% improvement over
> the baseline. Now I tested TCP and got up to 70% improvement, no
> regressions whatsoever =\
>
> I don't know where this regression on Daniel's setup comes from. Is it
> multi-thread or single-thread test? 

8 threads with 16 flows over them (-T8 -F16)

> What app do you use: iperf, netperf,
> neper, Microsoft's app (forgot the name)?

neper, tcp_stream.

> Do you have multiple NUMA
> nodes on your system, are you sure you didn't cross the node when
> redirecting with the GRO patches / no other NUMA mismatches happened?

Single node. Technically EPYC NPS=1. So there are some NUMA characteristics
but I think the interconnect is supposed to hide it fairly efficiently.

> Some other random stuff like RSS hash key, which affects flow steering?

Whatever the default is - I'd be willing to bet Kuba set up the configuration
at one point or another so it's probably sane. And with 5 runs it seems
unlikely the hashing would get unlucky and cause an imbalance.

>
> Thanks,
> Olek

Since I've got the setup handy and am motivated to see this work land,
do you have any other pointers for things I should look for? I'll spend some
time looking at profiles to see if I can identify any hot spots compared to
softirq based GRO.

Thanks,
Daniel
Alexander Lobakin Dec. 5, 2024, 10:38 a.m. UTC | #28
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Wed, 04 Dec 2024 13:51:08 -0800

> 
> 
> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>> From: Jakub Kicinski <kuba@kernel.org>
>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>
>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>> @ Jakub,  
>>>>>
>>>>> Context? What doesn't work and why?  
>>>>
>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>> Lorenzo's implementation.
>>>> I suspect this is related to that how NAPI performs flushes / decides
>>>> whether to repoll again or exit vs how kthread does that (even though I
>>>> also try to flush only every 64 frames or when the ring is empty). Or
>>>> maybe to that part of the kthread happens in process context outside any
>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>>
>>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>>> that its priority can be boosted separately from the backlog. That's why
>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>> regards to all this :D
>>>
>>> Certainly not without a clear understanding what the problem with 
>>> a kthread is.
>>
>> Yes, sure thing.
>>
>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
>> was testing with the UDP trafficgen and got up to 80% improvement over
>> the baseline. Now I tested TCP and got up to 70% improvement, no
>> regressions whatsoever =\
>>
>> I don't know where this regression on Daniel's setup comes from. Is it
>> multi-thread or single-thread test? 
> 
> 8 threads with 16 flows over them (-T8 -F16)
> 
>> What app do you use: iperf, netperf,
>> neper, Microsoft's app (forgot the name)?
> 
> neper, tcp_stream.

Let me recheck with neper -T8 -F16, I'll post my results soon.

> 
>> Do you have multiple NUMA
>> nodes on your system, are you sure you didn't cross the node when
>> redirecting with the GRO patches / no other NUMA mismatches happened?
> 
> Single node. Technically EPYC NPS=1. So there are some numa characteristics
> but I think the interconnect is supposed to hide it fairly efficiently.
> 
>> Some other random stuff like RSS hash key, which affects flow steering?
> 
> Whatever the default is - I'd be willing to bet Kuba set up the configuration
> at one point or another so it's probably sane. And with 5 runs it seems
> unlikely the hashing would get unlucky and cause an imbalance.
> 
>>
>> Thanks,
>> Olek
> 
> Since I've got the setup handy and am motivated to see this work land,
> do you have any other pointers for things I should look for? I'll spend some
> time looking at profiles to see if I can identify any hot spots compared to
> softirq based GRO.
> 
> Thanks,
> Daniel

Thanks for helping with this!
Olek
Alexander Lobakin Dec. 5, 2024, 11:06 a.m. UTC | #29
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu, 5 Dec 2024 11:38:11 +0100

> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Wed, 04 Dec 2024 13:51:08 -0800
> 
>>
>>
>> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>>> From: Jakub Kicinski <kuba@kernel.org>
>>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>>
>>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>>> @ Jakub,  
>>>>>>
>>>>>> Context? What doesn't work and why?  
>>>>>
>>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>>> Lorenzo's implementation.
>>>>> I suspect this is related to that how NAPI performs flushes / decides
>>>>> whether to repoll again or exit vs how kthread does that (even though I
>>>>> also try to flush only every 64 frames or when the ring is empty). Or
>>>>> maybe to that part of the kthread happens in process context outside any
>>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>>>
>>>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>>>> that its priority can be boosted separately from the backlog. That's why
>>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>>> regards to all this :D
>>>>
>>>> Certainly not without a clear understanding what the problem with 
>>>> a kthread is.
>>>
>>> Yes, sure thing.
>>>
>>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
>>> was testing with the UDP trafficgen and got up to 80% improvement over
>>> the baseline. Now I tested TCP and got up to 70% improvement, no
>>> regressions whatsoever =\
>>>
>>> I don't know where this regression on Daniel's setup comes from. Is it
>>> multi-thread or single-thread test? 
>>
>> 8 threads with 16 flows over them (-T8 -F16)
>>
>>> What app do you use: iperf, netperf,
>>> neper, Microsoft's app (forgot the name)?
>>
>> neper, tcp_stream.
> 
> Let me recheck with neper -T8 -F16, I'll post my results soon.

kernel     direct T1    direct T8F16    cpumap    cpumap T8F16
clean      28           51              13        9               Gbps
GRO        28           51              26        18              Gbps

100% gain, no regressions =\
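
The "100% gain" figure is the relative change in the two cpumap columns
of the table (13 -> 26 Gbps and 9 -> 18 Gbps). A minimal sketch of that
arithmetic, with `gain_percent()` being a hypothetical helper name used
only for illustration:

```c
#include <assert.h>

/* Relative throughput gain in percent: (after - before) / before * 100. */
static double gain_percent(double before, double after)
{
	assert(before > 0.0);	/* a zero baseline has no defined gain */

	return (after - before) / before * 100.0;
}
```

Calling `gain_percent(13, 26)` or `gain_percent(9, 18)` gives 100, while
the direct columns (28 -> 28 and 51 -> 51) give 0, i.e. no regression.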

My XDP prog is simple (upstream xdp-tools repo with no changes):

numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p no-touch ens802f0np0

IOW it simply redirects everything to CPU 23 (same NUMA node) from any
Rx queue without looking into headers or packet.
Do you test with a more sophisticated XDP prog?

Thanks,
Olek
Daniel Xu Dec. 6, 2024, 12:41 a.m. UTC | #30
On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 5 Dec 2024 11:38:11 +0100
> 
> > From: Daniel Xu <dxu@dxuuu.xyz>
> > Date: Wed, 04 Dec 2024 13:51:08 -0800
> > 
> >>
> >>
> >> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
> >>> From: Jakub Kicinski <kuba@kernel.org>
> >>> Date: Tue, 3 Dec 2024 16:51:57 -0800
> >>>
> >>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
> >>>>>>> @ Jakub,  
> >>>>>>
> >>>>>> Context? What doesn't work and why?  
> >>>>>
> >>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
> >>>>> trafficgen. Daniel tests TCP and the results are much worse than with
> >>>>> Lorenzo's implementation.
> >>>>> I suspect this is related to that how NAPI performs flushes / decides
> >>>>> whether to repoll again or exit vs how kthread does that (even though I
> >>>>> also try to flush only every 64 frames or when the ring is empty). Or
> >>>>> maybe to that part of the kthread happens in process context outside any
> >>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
> >>>>>
> >>>>> Jesper said that he'd like to see cpumap still using own kthread, so
> >>>>> that its priority can be boosted separately from the backlog. That's why
> >>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
> >>>>> regards to all this :D
> >>>>
> >>>> Certainly not without a clear understanding what the problem with 
> >>>> a kthread is.
> >>>
> >>> Yes, sure thing.
> >>>
> >>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
> >>> was testing with the UDP trafficgen and got up to 80% improvement over
> >>> the baseline. Now I tested TCP and got up to 70% improvement, no
> >>> regressions whatsoever =\
> >>>
> >>> I don't know where this regression on Daniel's setup comes from. Is it
> >>> multi-thread or single-thread test? 
> >>
> >> 8 threads with 16 flows over them (-T8 -F16)
> >>
> >>> What app do you use: iperf, netperf,
> >>> neper, Microsoft's app (forgot the name)?
> >>
> >> neper, tcp_stream.
> > 
> > Let me recheck with neper -T8 -F16, I'll post my results soon.
> 
> kernel     direct T1    direct T8F16    cpumap    cpumap T8F16
> clean      28           51              13        9               Gbps
> GRO        28           51              26        18              Gbps
> 
> 100% gain, no regressions =\
> 
> My XDP prog is simple (upstream xdp-tools repo with no changes):
> 
> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
> no-touch ens802f0np0
> 
> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
> Rx queue without looking into headers or packet.
> Do you test with more sophisticated XDP prog?

Great reminder... my prog is a bit more sophisticated. I forgot we were
doing latency tracking by inserting a timestamp into frame metadata, but
not clearing it after it was read on the remote CPU, which disables GRO.
So the previous test was paying the fixed GRO overhead without getting
any packet merges.

Once I fixed up the prog to reset the metadata pointer I could see the
wins. Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput.
No latency changes.
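
To illustrate why a leftover per-packet timestamp defeats merging, here
is a simplified userspace model. It assumes, as the explanation above
implies, that GRO treats packets whose metadata differs as belonging to
different flows. `toy_pkt`, `toy_same_flow()` and `toy_count_merges()`
are hypothetical names for illustration only; this is not kernel code and
not the actual XDP program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a received packet; not a kernel structure. */
struct toy_pkt {
	uint32_t flow_id;	/* stands in for the usual header comparison */
	uint8_t meta[8];	/* metadata bytes left in front of the data */
	size_t meta_len;	/* 0 once the program has reset the metadata */
};

/* Two packets are merge candidates only if flow and metadata both match. */
static bool toy_same_flow(const struct toy_pkt *a, const struct toy_pkt *b)
{
	assert(a->meta_len <= sizeof(a->meta));
	assert(b->meta_len <= sizeof(b->meta));

	if (a->flow_id != b->flow_id)
		return false;
	if (a->meta_len != b->meta_len)
		return false;

	return memcmp(a->meta, b->meta, a->meta_len) == 0;
}

/* Count how many of the later packets could be merged into the first. */
static size_t toy_count_merges(const struct toy_pkt *pkts, size_t n)
{
	size_t merged = 0;

	for (size_t i = 1; i < n; i++)
		if (toy_same_flow(&pkts[0], &pkts[i]))
			merged++;

	return merged;
}
```

In this model, three packets of one flow that each carry a different
8-byte timestamp produce zero merges; setting `meta_len` to 0 on all of
them (the equivalent of resetting the metadata pointer) lets the two
later packets merge into the first.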

Sorry about the churn.

Daniel
Alexander Lobakin Dec. 6, 2024, 3:06 p.m. UTC | #31
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Thu, 5 Dec 2024 17:41:27 -0700

> On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Thu, 5 Dec 2024 11:38:11 +0100
>>
>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>> Date: Wed, 04 Dec 2024 13:51:08 -0800
>>>
>>>>
>>>>
>>>> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>>>>> From: Jakub Kicinski <kuba@kernel.org>
>>>>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>>>>
>>>>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>>>>> @ Jakub,  
>>>>>>>>
>>>>>>>> Context? What doesn't work and why?  
>>>>>>>
>>>>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>>>>> Lorenzo's implementation.
>>>>>>> I suspect this is related to that how NAPI performs flushes / decides
>>>>>>> whether to repoll again or exit vs how kthread does that (even though I
>>>>>>> also try to flush only every 64 frames or when the ring is empty). Or
>>>>>>> maybe to that part of the kthread happens in process context outside any
>>>>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>>>>>
>>>>>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>>>>>> that its priority can be boosted separately from the backlog. That's why
>>>>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>>>>> regards to all this :D
>>>>>>
>>>>>> Certainly not without a clear understanding what the problem with 
>>>>>> a kthread is.
>>>>>
>>>>> Yes, sure thing.
>>>>>
>>>>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
>>>>> was testing with the UDP trafficgen and got up to 80% improvement over
>>>>> the baseline. Now I tested TCP and got up to 70% improvement, no
>>>>> regressions whatsoever =\
>>>>>
>>>>> I don't know where this regression on Daniel's setup comes from. Is it
>>>>> multi-thread or single-thread test? 
>>>>
>>>> 8 threads with 16 flows over them (-T8 -F16)
>>>>
>>>>> What app do you use: iperf, netperf,
>>>>> neper, Microsoft's app (forgot the name)?
>>>>
>>>> neper, tcp_stream.
>>>
>>> Let me recheck with neper -T8 -F16, I'll post my results soon.
>>
>> kernel     direct T1    direct T8F16    cpumap    cpumap T8F16
>> clean      28           51              13        9               Gbps
>> GRO        28           51              26        18              Gbps
>>
>> 100% gain, no regressions =\
>>
>> My XDP prog is simple (upstream xdp-tools repo with no changes):
>>
>> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
>> no-touch ens802f0np0
>>
>> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
>> Rx queue without looking into headers or packet.
>> Do you test with more sophisticated XDP prog?
> 
> Great reminder... my prog is a bit more sophisticated. I forgot we were
> doing latency tracking by inserting a timestamp into frame metadata. But
> not clearing it after it was read on remote CPU, which disables GRO. So
> previous test was paying the penalty of fixed GRO overhead without
> getting any packet merges.
> 
> Once I fixed up prog to reset metadata pointer I could see the wins.
> Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput. No
> latency changes.
> 
> Sorry about the churn.

No problem, crap happens sometimes :)

Let me send my implementation on Monday-Wednesday. I'll include my UDP
and TCP test results, as well as yours (+18%).

BTW would be great if you could give me a Tested-by tag, as I assume the
tests were fine and it works for you?

Thanks,
Olek
Daniel Xu Dec. 6, 2024, 11:36 p.m. UTC | #32
On Fri, Dec 6, 2024, at 7:06 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Thu, 5 Dec 2024 17:41:27 -0700
>
>> On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
>>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> Date: Thu, 5 Dec 2024 11:38:11 +0100
>>>
>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>> Date: Wed, 04 Dec 2024 13:51:08 -0800
>>>>
>>>>>
>>>>>
>>>>> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>>>>>> From: Jakub Kicinski <kuba@kernel.org>
>>>>>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>>>>>
>>>>>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>>>>>> @ Jakub,  
>>>>>>>>>
>>>>>>>>> Context? What doesn't work and why?  
>>>>>>>>
>>>>>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>>>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>>>>>> Lorenzo's implementation.
>>>>>>>> I suspect this is related to that how NAPI performs flushes / decides
>>>>>>>> whether to repoll again or exit vs how kthread does that (even though I
>>>>>>>> also try to flush only every 64 frames or when the ring is empty). Or
>>>>>>>> maybe to that part of the kthread happens in process context outside any
>>>>>>>> softirq, while when using NAPI, the whole loop is inside RX softirq.
>>>>>>>>
>>>>>>>> Jesper said that he'd like to see cpumap still using own kthread, so
>>>>>>>> that its priority can be boosted separately from the backlog. That's why
>>>>>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>>>>>> regards to all this :D
>>>>>>>
>>>>>>> Certainly not without a clear understanding what the problem with 
>>>>>>> a kthread is.
>>>>>>
>>>>>> Yes, sure thing.
>>>>>>
>>>>>> Bad thing's that I can't reproduce Daniel's problem >_< Previously, I
>>>>>> was testing with the UDP trafficgen and got up to 80% improvement over
>>>>>> the baseline. Now I tested TCP and got up to 70% improvement, no
>>>>>> regressions whatsoever =\
>>>>>>
>>>>>> I don't know where this regression on Daniel's setup comes from. Is it
>>>>>> multi-thread or single-thread test? 
>>>>>
>>>>> 8 threads with 16 flows over them (-T8 -F16)
>>>>>
>>>>>> What app do you use: iperf, netperf,
>>>>>> neper, Microsoft's app (forgot the name)?
>>>>>
>>>>> neper, tcp_stream.
>>>>
>>>> Let me recheck with neper -T8 -F16, I'll post my results soon.
>>>
>>> kernel     direct T1    direct T8F16    cpumap    cpumap T8F16
>>> clean      28           51              13        9               Gbps
>>> GRO        28           51              26        18              Gbps
>>>
>>> 100% gain, no regressions =\
>>>
>>> My XDP prog is simple (upstream xdp-tools repo with no changes):
>>>
>>> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
>>> no-touch ens802f0np0
>>>
>>> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
>>> Rx queue without looking into headers or packet.
>>> Do you test with more sophisticated XDP prog?
>> 
>> Great reminder... my prog is a bit more sophisticated. I forgot we were
>> doing latency tracking by inserting a timestamp into frame metadata. But
>> not clearing it after it was read on remote CPU, which disables GRO. So
>> previous test was paying the penalty of fixed GRO overhead without
>> getting any packet merges.
>> 
>> Once I fixed up prog to reset metadata pointer I could see the wins.
>> Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput. No
>> latency changes.
>> 
>> Sorry about the churn.
>
> No problem, crap happens sometimes :)
>
> Let me send my implementation on Monday-Wednesday. I'll include my UDP
> and TCP test results, as well as yours (+18%).
>
> BTW would be great if you could give me a Tested-by tag, as I assume the
> tests were fine and it works for you?

Yep, worked great for me.

Tested-by: Daniel Xu <dxu@dxuuu.xyz>