[RFC/RFT,v2,0/3] Introduce GRO support to cpumap codebase

Message ID: cover.1726480607.git.lorenzo@kernel.org

Message

Lorenzo Bianconi Sept. 16, 2024, 10:13 a.m. UTC
Add GRO support to the cpumap codebase by moving the cpu_map_entry kthread
to a NAPI kthread pinned to the selected CPU.

Changes in rfc v2:
- get rid of dummy netdev dependency

Lorenzo Bianconi (3):
  net: Add napi_init_for_gro routine
  net: add napi_threaded_poll to netdevice.h
  bpf: cpumap: Add gro support

 include/linux/netdevice.h |   3 +
 kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
 net/core/dev.c            |  27 ++++++---
 3 files changed, 73 insertions(+), 80 deletions(-)

Comments

Alexander Lobakin Sept. 16, 2024, 3:10 p.m. UTC | #1
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Mon, 16 Sep 2024 12:13:42 +0200

> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
> 
> Changes in rfc v2:
> - get rid of dummy netdev dependency
> 
> Lorenzo Bianconi (3):
>   net: Add napi_init_for_gro routine
>   net: add napi_threaded_poll to netdevice.h
>   bpf: cpumap: Add gro support

Oh okay, so it still uses NAPI.
When I'm back from the conferences (next week), I might rebase and send
the solution where I only use the GRO part of it, i.e. no
napi_schedule()/poll()/napi_complete() logic.

> 
>  include/linux/netdevice.h |   3 +
>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
>  net/core/dev.c            |  27 ++++++---
>  3 files changed, 73 insertions(+), 80 deletions(-)

Thanks,
Olek
Daniel Xu Oct. 8, 2024, 10:39 p.m. UTC | #2
Hi Lorenzo,

On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> Add GRO support to cpumap codebase moving the cpu_map_entry kthread to a
> NAPI-kthread pinned on the selected cpu.
> 
> Changes in rfc v2:
> - get rid of dummy netdev dependency
> 
> Lorenzo Bianconi (3):
>   net: Add napi_init_for_gro routine
>   net: add napi_threaded_poll to netdevice.h
>   bpf: cpumap: Add gro support
> 
>  include/linux/netdevice.h |   3 +
>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------------
>  net/core/dev.c            |  27 ++++++---
>  3 files changed, 73 insertions(+), 80 deletions(-)
> 
> -- 
> 2.46.0
> 

Sorry about the long delay - finally caught up to everything after
conferences.

I re-ran my synthetic tests (including baseline). v2 is somehow showing
2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
variable I changed is kernel version - steering prog is active for both.


Baseline (again)

./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30
./tcp_stream -c -H $TASK_IP -T8 -F16 -l30

         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)   Throughput (Mbit/s)
Run 1    2560252       0.00009087       0.00010495       0.00011647        15479.31
Run 2    2665517       0.00008575       0.00010239       0.00013311        15162.48
Run 3    2755939       0.00008191       0.00010367       0.00012287        14709.04
Run 4    2595680       0.00008575       0.00011263       0.00012671        15373.06
Run 5    2841865       0.00007999       0.00009471       0.00012799        15234.91
Average  2683850.6     0.000084854      0.00010367       0.00012543        15191.76

cpumap NAPI patches v2

         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)   Throughput (Mbit/s)
Run 1    2577838       0.00008575       0.00012031       0.00013695        19914.56
Run 2    2729237       0.00007551       0.00013311       0.00017663        20140.92
Run 3    2689442       0.00008319       0.00010495       0.00013311        19887.48
Run 4    2862366       0.00008127       0.00009471       0.00010623        19374.49
Run 5    2700538       0.00008319       0.00010367       0.00012799        19784.49
Average  2711884.2     0.000081782      0.00011135       0.000136182       19820.388
Delta    1.04%         -3.62%           7.41%            8.57%             30.47%
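
For readers who want to check the Delta row, here is a minimal sketch of
the arithmetic, assuming Delta is the relative change of the patched
average against the baseline average. The variable and function names are
made up for illustration; only the throughput numbers come from the tables
above.

```python
from statistics import mean

# tcp_stream throughput (Mbit/s) per run, copied from the two tables above.
baseline = [15479.31, 15162.48, 14709.04, 15373.06, 15234.91]
patched = [19914.56, 20140.92, 19887.48, 19374.49, 19784.49]


def delta_percent(before, after):
    """Relative change of mean(after) versus mean(before), in percent."""
    base_avg = mean(before)
    return (mean(after) - base_avg) / base_avg * 100.0


throughput_delta = delta_percent(baseline, patched)
print(f"baseline average: {mean(baseline):.2f} Mbit/s")
print(f"patched average:  {mean(patched):.3f} Mbit/s")
print(f"delta:            {throughput_delta:+.2f}%")
```

The same function applied to the latency columns gives the sign
convention used in the table: a negative delta means lower latency with
the patches.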

Thanks,
Daniel
Lorenzo Bianconi Oct. 9, 2024, 10:46 a.m. UTC | #3
Hi Daniel,

cool, thx for testing it.

@Olek: how do we want to proceed on it? Are you still working on it or do you want me
to send a regular patch for it?

Regards,
Lorenzo
Alexander Lobakin Oct. 9, 2024, 12:27 p.m. UTC | #4
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 12:46:00 +0200

> 
> Hi Daniel,
> 
> cool, thx for testing it.
> 
> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> to send a regular patch for it?

Hi,

I was on a short vacation, sorry. I'm starting to work on it again today.

> 
> Regards,
> Lorenzo

Thanks,
Olek
Lorenzo Bianconi Oct. 9, 2024, 12:47 p.m. UTC | #5
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Wed, 9 Oct 2024 12:46:00 +0200
> 
> > 
> > Hi Daniel,
> > 
> > cool, thx for testing it.
> > 
> > @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> > to send a regular patch for it?
> 
> Hi,
> 
> I had a small vacation, sorry. I'm starting working on it again today.

ack, no worries. Are you going to rebase the other patches on top of it
or are you going to try a different approach?

Regards,
Lorenzo

> 
> > 
> > Regards,
> > Lorenzo
> 
> Thanks,
> Olek
Alexander Lobakin Oct. 9, 2024, 12:50 p.m. UTC | #6
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Wed, 9 Oct 2024 14:47:58 +0200

>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>
>>>
>>> Hi Daniel,
>>>
>>> cool, thx for testing it.
>>>
>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>> to send a regular patch for it?
>>
>> Hi,
>>
>> I had a small vacation, sorry. I'm starting working on it again today.
> 
> ack, no worries. Are you going to rebase the other patches on top of it
> or are you going to try a different approach?

I'll try the approach without NAPI as Kuba asks and let Daniel test it,
then we'll see.

BTW I'm curious how he got this boost on v2, from what I see you didn't
change the implementation that much?

Thanks,
Olek
Alexander Lobakin Oct. 22, 2024, 3:51 p.m. UTC | #7
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Wed, 9 Oct 2024 14:50:42 +0200

> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Wed, 9 Oct 2024 14:47:58 +0200
> 
>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>
>>>>
>>>> Hi Daniel,
>>>>
>>>> cool, thx for testing it.
>>>>
>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>> to send a regular patch for it?
>>>
>>> Hi,
>>>
>>> I had a small vacation, sorry. I'm starting working on it again today.
>>
>> ack, no worries. Are you going to rebase the other patches on top of it
>> or are you going to try a different approach?
> 
> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> then we'll see.

For now, I have the same results without NAPI as with your series, so
I'll push it soon and let Daniel test.

(I simply decoupled GRO and NAPI and used the former in cpumap, but the
 kthread logic didn't change)
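
As a rough illustration of what "using GRO without NAPI" means
conceptually, here is a toy model, not kernel code and not taken from
either series. It assumes GRO can be viewed as merging back-to-back,
in-order segments of the same flow, and that the worker thread calls the
merge step directly on each batch it dequeues instead of going through a
schedule/poll/complete cycle. Segment, gro_coalesce and worker_batch are
hypothetical names; real GRO tracks several flows at once and checks far
more than a sequence number.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    flow: tuple   # hypothetical flow key: (src, dst, sport, dport)
    seq: int      # sequence number of the first payload byte
    length: int   # payload length in bytes


def gro_coalesce(segments):
    """Merge adjacent in-order segments of the same flow (toy model)."""
    merged = []
    for seg in segments:
        last = merged[-1] if merged else None
        if (last is not None and last.flow == seg.flow
                and last.seq + last.length == seg.seq):
            # Contiguous with the previous segment: grow it in place.
            last.length += seg.length
        else:
            # New flow or a gap: start a new aggregate.
            merged.append(Segment(seg.flow, seg.seq, seg.length))
    return merged


def worker_batch(batch):
    """Stand-in for one worker-thread iteration: coalesce, then hand up."""
    return gro_coalesce(batch)


flow_a = ("10.0.0.1", "10.0.0.2", 1234, 80)
flow_b = ("10.0.0.3", "10.0.0.2", 5678, 80)
batch = [
    Segment(flow_a, 0, 1448),
    Segment(flow_a, 1448, 1448),
    Segment(flow_b, 0, 100),
    Segment(flow_a, 2896, 1448),
]
for seg in worker_batch(batch):
    print(seg)
```

In this toy the last flow_a segment is not merged because a flow_b segment
arrived in between; that is a limitation of the sketch, not a statement
about how the kernel behaves.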

> 
> BTW I'm curious how he got this boost on v2, from what I see you didn't
> change the implementation that much?

Thanks,
Olek
Alexander Lobakin Nov. 12, 2024, 5:43 p.m. UTC | #8
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Tue, 22 Oct 2024 17:51:43 +0200

> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Wed, 9 Oct 2024 14:50:42 +0200
> 
>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>> Date: Wed, 9 Oct 2024 14:47:58 +0200
>>
>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>>
>>>>>
>>>>> Hi Daniel,
>>>>>
>>>>> cool, thx for testing it.
>>>>>
>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>>> to send a regular patch for it?
>>>>
>>>> Hi,
>>>>
>>>> I had a small vacation, sorry. I'm starting working on it again today.
>>>
>>> ack, no worries. Are you going to rebase the other patches on top of it
>>> or are you going to try a different approach?
>>
>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
>> then we'll see.
> 
> For now, I have the same results without NAPI as with your series, so
> I'll push it soon and let Daniel test.
> 
> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
>  kthread logic didn't change)
> 
>>
>> BTW I'm curious how he got this boost on v2, from what I see you didn't
>> change the implementation that much?

Hi Daniel,

Sorry for the delay. Please test [0].

[0] https://github.com/alobakin/linux/commits/cpumap-old

Thanks,
Olek
Daniel Xu Nov. 13, 2024, 11:39 p.m. UTC | #9
On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Tue, 22 Oct 2024 17:51:43 +0200
>
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Wed, 9 Oct 2024 14:50:42 +0200
>> 
>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>> Date: Wed, 9 Oct 2024 14:47:58 +0200
>>>
>>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
>>>>>
>>>>>>
>>>>>> Hi Daniel,
>>>>>>
>>>>>> cool, thx for testing it.
>>>>>>
>>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
>>>>>> to send a regular patch for it?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I had a small vacation, sorry. I'm starting working on it again today.
>>>>
>>>> ack, no worries. Are you going to rebase the other patches on top of it
>>>> or are you going to try a different approach?
>>>
>>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
>>> then we'll see.
>> 
>> For now, I have the same results without NAPI as with your series, so
>> I'll push it soon and let Daniel test.
>> 
>> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
>>  kthread logic didn't change)
>> 
>>>
>>> BTW I'm curious how he got this boost on v2, from what I see you didn't
>>> change the implementation that much?
>
> Hi Daniel,
>
> Sorry for the delay. Please test [0].
>
> [0] https://github.com/alobakin/linux/commits/cpumap-old
>
> Thanks,
> Olek

Ack. Will do probably early next week.
Daniel Xu Nov. 23, 2024, 12:10 a.m. UTC | #10
Hi Olek,

Here are the results.

On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>
>
> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > From: Alexander Lobakin <aleksander.lobakin@intel.com>
> > Date: Tue, 22 Oct 2024 17:51:43 +0200
> >
> >> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> >> Date: Wed, 9 Oct 2024 14:50:42 +0200
> >>
> >>> From: Lorenzo Bianconi <lorenzo@kernel.org>
> >>> Date: Wed, 9 Oct 2024 14:47:58 +0200
> >>>
> >>>>> From: Lorenzo Bianconi <lorenzo@kernel.org>
> >>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
> >>>>>
> >>>>>>
> >>>>>> Hi Daniel,
> >>>>>>
> >>>>>> cool, thx for testing it.
> >>>>>>
> >>>>>> @Olek: how do we want to proceed on it? Are you still working on it or do you want me
> >>>>>> to send a regular patch for it?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I had a small vacation, sorry. I'm starting working on it again today.
> >>>>
> >>>> ack, no worries. Are you going to rebase the other patches on top of it
> >>>> or are you going to try a different approach?
> >>>
> >>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> >>> then we'll see.
> >>
> >> For now, I have the same results without NAPI as with your series, so
> >> I'll push it soon and let Daniel test.
> >>
> >> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> >>  kthread logic didn't change)
> >>
> >>>
> >>> BTW I'm curious how he got this boost on v2, from what I see you didn't
> >>> change the implementation that much?
> >
> > Hi Daniel,
> >
> > Sorry for the delay. Please test [0].
> >
> > [0] https://github.com/alobakin/linux/commits/cpumap-old
> >
> > Thanks,
> > Olek
>
> Ack. Will do probably early next week.
>

Baseline (again)

	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126

cpumap v2 Olek

	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
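
The Average and Delta rows for throughput can be re-derived from the
per-run numbers. A minimal Python sketch, assuming the delta is the
relative change of the patched average against the baseline average
(names such as relative_delta are illustrative only):

```python
from statistics import mean

# Throughput (Mbit/s) per run, copied from the two tables above.
baseline = [21749.43, 21897.17, 21906.82, 21155.15, 21397.06]
patched = [13497.57, 12115.53, 12323.38, 12901.88, 12593.22]


def relative_delta(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


base_avg = mean(baseline)
patched_avg = mean(patched)
delta = relative_delta(base_avg, patched_avg)

print(f"baseline avg: {base_avg:.3f} Mbit/s")
print(f"patched avg:  {patched_avg:.3f} Mbit/s")
print(f"delta:        {delta:.2f}%")
```

The same helper applies to the transaction and latency columns.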


It's very interesting that we see -40% tput w/ the patches. I went back
and double checked and it seems the numbers are right. Here's some
output from profiles I took with:

    perf record -e cycles:k -a -- sleep 10
    perf --no-pager diff perf.data.baseline perf.data.withpatches > ...

    # Event 'cycles:k'
    # Baseline  Delta Abs  Shared Object                                                    Symbol
         6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
     3.57%     -2.56%  bpf_prog_954ab9c8c8b5e42f_latency                                [k] bpf_prog_954ab9c8c8b5e42f_latency
               +2.22%  bpf_prog_5c74b34eb24d5c9b_steering                               [k] bpf_prog_5c74b34eb24d5c9b_steering
     2.61%     -1.88%  [kernel.kallsyms]                                                [k] __skb_datagram_iter
     0.55%     +1.53%  [kernel.kallsyms]                                                [k] acpi_processor_ffh_cstate_enter
     4.52%     -1.46%  [kernel.kallsyms]                                                [k] read_tsc
     0.34%     +1.42%  [kernel.kallsyms]                                                [k] __slab_free
     0.97%     +1.18%  [kernel.kallsyms]                                                [k] do_idle
     1.35%     +1.17%  [kernel.kallsyms]                                                [k] cpuidle_enter_state
     1.89%     -1.15%  [kernel.kallsyms]                                                [k] tcp_ack
     2.08%     +1.14%  [kernel.kallsyms]                                                [k] _raw_spin_lock
               +1.13%  <redacted>
     0.22%     +1.02%  [kernel.kallsyms]                                                [k] __sock_wfree
     2.23%     -1.02%  [kernel.kallsyms]                                                [k] bpf_dynptr_slice
     0.00%     +0.98%  [kernel.kallsyms]                                                [k] tcp6_gro_receive
     2.91%     -0.98%  [kernel.kallsyms]                                                [k] csum_partial
     0.62%     +0.94%  [kernel.kallsyms]                                                [k] skb_release_data
               +0.81%  [kernel.kallsyms]                                                [k] memset
     0.16%     +0.74%  [kernel.kallsyms]                                                [k] bnxt_tx_int
     0.00%     +0.74%  [kernel.kallsyms]                                                [k] dev_gro_receive
     0.36%     +0.74%  [kernel.kallsyms]                                                [k] __tcp_transmit_skb
               +0.72%  [kernel.kallsyms]                                                [k] tcp_gro_receive
     1.10%     -0.66%  [kernel.kallsyms]                                                [k] ep_poll_callback
     1.52%     -0.65%  [kernel.kallsyms]                                                [k] page_pool_put_unrefed_netmem
     0.75%     -0.57%  [kernel.kallsyms]                                                [k] bnxt_rx_pkt
     1.10%     +0.56%  [kernel.kallsyms]                                                [k] native_sched_clock
     0.16%     +0.53%  <redacted>
     0.83%     -0.53%  [kernel.kallsyms]                                                [k] skb_try_coalesce
     0.60%     +0.53%  [kernel.kallsyms]                                                [k] eth_type_trans
     1.65%     -0.51%  [kernel.kallsyms]                                                [k] _raw_spin_lock_irqsave
     0.14%     +0.50%  [kernel.kallsyms]                                                [k] bnxt_start_xmit
     0.54%     -0.48%  [kernel.kallsyms]                                                [k] __skb_frag_unref
     0.91%     +0.48%  [cls_bpf]                                                        [k] 0x0000000000000010
     0.00%     +0.47%  [kernel.kallsyms]                                                [k] ipv6_gro_receive
     0.76%     -0.45%  [kernel.kallsyms]                                                [k] tcp_rcv_established
     0.94%     -0.45%  [kernel.kallsyms]                                                [k] __inet6_lookup_established
     0.31%     +0.43%  [kernel.kallsyms]                                                [k] __sched_text_start
     0.21%     +0.43%  [kernel.kallsyms]                                                [k] poll_idle
     0.91%     -0.42%  [kernel.kallsyms]                                                [k] tcp_try_coalesce
     0.91%     -0.42%  [kernel.kallsyms]                                                [k] kmem_cache_free
     1.13%     +0.42%  [kernel.kallsyms]                                                [k] __bnxt_poll_work
     0.48%     -0.41%  [kernel.kallsyms]                                                [k] tcp_urg
               +0.39%  [kernel.kallsyms]                                                [k] memcpy
     0.51%     -0.38%  [kernel.kallsyms]                                                [k] _raw_read_unlock_irqrestore
               +0.38%  [kernel.kallsyms]                                                [k] __skb_gro_checksum_complete
               +0.37%  [kernel.kallsyms]                                                [k] irq_entries_start
     0.16%     +0.36%  [kernel.kallsyms]                                                [k] bpf_sk_storage_get
     0.62%     -0.36%  [kernel.kallsyms]                                                [k] page_pool_refill_alloc_cache
     0.08%     +0.35%  [kernel.kallsyms]                                                [k] ip6_finish_output2
     0.14%     +0.34%  [kernel.kallsyms]                                                [k] bnxt_poll_p5
     0.06%     +0.33%  [sch_fq]                                                         [k] 0x0000000000000020
     0.04%     +0.32%  [kernel.kallsyms]                                                [k] __dev_queue_xmit
     0.75%     -0.32%  [kernel.kallsyms]                                                [k] __xdp_build_skb_from_frame
     0.67%     -0.31%  [kernel.kallsyms]                                                [k] sock_def_readable
     0.05%     +0.31%  [kernel.kallsyms]                                                [k] netif_skb_features
               +0.30%  [kernel.kallsyms]                                                [k] tcp_gro_pull_header
     0.49%     -0.29%  [kernel.kallsyms]                                                [k] napi_pp_put_page
     0.18%     +0.29%  [kernel.kallsyms]                                                [k] call_function_single_prep_ipi
     0.40%     -0.28%  [kernel.kallsyms]                                                [k] _raw_read_lock_irqsave
     0.11%     +0.27%  [kernel.kallsyms]                                                [k] raw6_local_deliver
     0.18%     +0.26%  [kernel.kallsyms]                                                [k] ip6_dst_check
     0.42%     -0.26%  [kernel.kallsyms]                                                [k] netif_receive_skb_list_internal
     0.05%     +0.26%  [kernel.kallsyms]                                                [k] __qdisc_run
     0.75%     +0.25%  [kernel.kallsyms]                                                [k] __build_skb_around
     0.05%     +0.25%  [kernel.kallsyms]                                                [k] htab_map_hash
     0.09%     +0.24%  [kernel.kallsyms]                                                [k] net_rx_action
     0.07%     +0.23%  <redacted>
     0.45%     -0.23%  [kernel.kallsyms]                                                [k] migrate_enable
     0.48%     -0.23%  [kernel.kallsyms]                                                [k] mem_cgroup_charge_skmem
     0.26%     +0.23%  [kernel.kallsyms]                                                [k] __switch_to
     0.15%     +0.22%  [kernel.kallsyms]                                                [k] sock_rfree
     0.30%     -0.22%  [kernel.kallsyms]                                                [k] tcp_add_backlog

     <snip>

     5.68%             bpf_prog_17fea1bb6503ed98_steering                               [k] bpf_prog_17fea1bb6503ed98_steering
     2.10%             [kernel.kallsyms]                                                [k] __skb_checksum_complete
     0.71%             [kernel.kallsyms]                                                [k] __memset
     0.54%             [kernel.kallsyms]                                                [k] __memcpy
     0.18%             [kernel.kallsyms]                                                [k] __irqentry_text_start

     <snip>

Please let me know if you want me to collect any other data.

Thanks,
Daniel
Alexander Lobakin Nov. 25, 2024, 3:12 p.m. UTC | #11
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Fri, 22 Nov 2024 17:10:06 -0700

> Hi Olek,
> 
> Here are the results.
> 
> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>
>>
>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:

[...]

> Baseline (again)
> 
> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> 
> cpumap v2 Olek
> 
> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> 
> 
> It's very interesting that we see -40% tput w/ the patches. I went back

Oh no, I messed up something =\

Could you please also test not the whole series, but patches 1-3 (up to
"bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
array...")? Would be great to see whether this implementation works
worse right from the start or I just broke something later on.

> and double checked and it seems the numbers are right. Here's the
> some output from some profiles I took with:
> 
>     perf record -e cycles:k -a -- sleep 10
>     perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> 
>     # Event 'cycles:k'
>     # Baseline  Delta Abs  Shared Object                                                    Symbol
>          6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter

BTW, what CONFIG_HZ do you have on the kernel you're testing with?

Thanks,
Olek
Daniel Xu Nov. 25, 2024, 5:03 p.m. UTC | #12
On Mon, Nov 25, 2024 at 04:12:24PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
> 
> > Hi Olek,
> > 
> > Here are the results.
> > 
> > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> >>
> >>
> >> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> 
> [...]
> 
> > Baseline (again)
> > 
> > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> > Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> > Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> > Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> > Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> > Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> > 
> > cpumap v2 Olek
> > 
> > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> > Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> > Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> > Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> > Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> > Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> > Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> > 
> > 
> > It's very interesting that we see -40% tput w/ the patches. I went back
> 
> Oh no, I messed up something =\
> 
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.

Will do.

> 
> > and double checked and it seems the numbers are right. Here's the
> > some output from some profiles I took with:
> > 
> >     perf record -e cycles:k -a -- sleep 10
> >     perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > 
> >     # Event 'cycles:k'
> >     # Baseline  Delta Abs  Shared Object                                                    Symbol
> >          6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
> 
> BTW, what CONFIG_HZ do you have on the kernel you're testing with?

# zgrep CONFIG_HZ /proc/config.gz
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000

Just curious - why do you ask?

Thanks,
Daniel
Jesper Dangaard Brouer Nov. 25, 2024, 6:50 p.m. UTC | #13
On 25/11/2024 16.12, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
> 
>> Hi Olek,
>>
>> Here are the results.
>>
>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>
>>>
>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> 
> [...]
> 
>> Baseline (again)
>>
>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
>> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
>> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
>> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
>> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
>> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
>>

We need to talk about what we are measuring, and how to control the
experiment setup to get reproducible results, especially which CPU
cores our code paths are executing on.

In the above "baseline" case, we have two processes/tasks executing:
  (1) RX-napi softirq/thread (until napi_gro_receive delivers to socket)
  (2) Userspace netserver process TCP receiving data from socket.

My experience is that you will see two noticeably different
throughput results depending on whether (1) and (2) are
executing on the *same* CPU (multi-tasking context-switching),
or executing in parallel (e.g. pinned) on two different CPU cores.

The netperf command has an option

  -T lcpu,remcpu
       Request that netperf be bound to local CPU lcpu and/or netserver
       be bound to remote CPU remcpu.

Verify the setting by listing the pinning like this:
   for PID in $(pidof netserver); do taskset -pc $PID ; done

You can also set the pinning at runtime like this:
  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done
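
The same pinning can be applied and checked programmatically. A minimal
sketch using Python's os.sched_setaffinity()/os.sched_getaffinity()
(Linux only); the helper name pin_to_cpu is made up for illustration,
and pinning another process such as netserver needs the appropriate
privileges:

```python
import os


def pin_to_cpu(pid: int, cpu: int) -> set[int]:
    """Pin `pid` (0 means the calling process) to a single CPU.

    Returns the affinity mask read back afterwards, which plays the
    role of the `taskset -pc $PID` verification step.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    print("allowed CPUs before:", sorted(allowed))
    # Pick a CPU that is already allowed, so the call cannot fail
    # inside a restricted cpuset.
    target = min(allowed)
    print("allowed CPUs after: ", sorted(pin_to_cpu(0, target)))
```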

For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
output and adjust the pinning at runtime to observe the effect quickly.

My experience is unfortunately that TCP results have a lot of variation
(thanks for including 5 runs in your benchmarks), as they depend on task
timing, which can be affected by CPU sleep states. The system's CPU
latency setting can be seen in /dev/cpu_dma_latency, which can be read
like this:

  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency

To change /dev/cpu_dma_latency I choose to use tuned-adm, since a
setting only stays in effect while the file is held open. E.g. I play
with these profiles:

  sudo tuned-adm profile throughput-performance
  sudo tuned-adm profile latency-performance
  sudo tuned-adm profile network-latency
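
For a one-off experiment without tuned-adm, the "hold the file open"
requirement can be sketched in Python. This assumes the documented PM
QoS interface, where the value is a native-endian signed 32-bit integer
in microseconds and a request lasts only while the descriptor stays
open; the helper names are illustrative, and writing needs root:

```python
import os
import struct
from contextlib import contextmanager

CPU_DMA_LATENCY = "/dev/cpu_dma_latency"


def encode_latency(usec: int) -> bytes:
    """Pack a latency target as a native-endian signed 32-bit value."""
    return struct.pack("=i", usec)


def decode_latency(raw: bytes) -> int:
    """Unpack the 4-byte value read from the device node."""
    return struct.unpack("=i", raw[:4])[0]


def read_latency() -> int:
    """Equivalent of the hexdump one-liner: current target in usec."""
    with open(CPU_DMA_LATENCY, "rb") as f:
        return decode_latency(f.read(4))


@contextmanager
def hold_latency(usec: int):
    """Request a latency target for as long as the `with` body runs."""
    fd = os.open(CPU_DMA_LATENCY, os.O_WRONLY)
    try:
        os.write(fd, encode_latency(usec))
        yield
    finally:
        # Closing the descriptor drops the request again.
        os.close(fd)
```

A benchmark would then run inside "with hold_latency(0): ...", and
read_latency() can be called before and during the run to confirm the
request took effect.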


>> cpumap v2 Olek
>>
>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
>> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
>> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
>> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
>> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
>> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
>> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
>>
>>


We now have three processes/tasks executing:
  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
  (2) CPUmap kthread (until gro_receive_skb/gro_flush delivers to socket)
  (3) Userspace netserver process TCP receiving data from socket.

Again, the performance is now going to depend on which CPU
cores the processes/tasks are running on and whether some are sharing
the same CPU. (There are both wakeup timing and cache-line effects).

There are now more combinations to test...

CPUmap is a CPU scaling facility, and you will likely also see different
CPU utilization on the different cores once you start to pin these to
control the scenarios.
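
To make "more combinations" concrete, here is a small Python sketch that
enumerates how the tasks can share or not share a CPU. It takes the task
lists above at face value and ignores which specific cores (and cache
domains) are picked; the task labels and the helper name placements are
illustrative only:

```python
from typing import Iterator


def placements(tasks: list[str]) -> Iterator[list[list[str]]]:
    """Yield every way to group `tasks`; each group shares one CPU."""
    if not tasks:
        yield []
        return
    first, rest = tasks[0], tasks[1:]
    for partial in placements(rest):
        # Put `first` on a CPU of its own ...
        yield [[first]] + partial
        # ... or co-locate it with each existing group in turn.
        for i, group in enumerate(partial):
            yield partial[:i] + [[first] + group] + partial[i + 1:]


baseline_tasks = ["rx-napi", "netserver"]
cpumap_tasks = ["rx-napi", "cpumap-kthread", "netserver"]

for name, tasks in (("baseline", baseline_tasks), ("cpumap", cpumap_tasks)):
    options = list(placements(tasks))
    print(f"{name}: {len(options)} placements")
    for option in options:
        # "a+b | c" means a and b share one CPU while c runs on another.
        print("   ", " | ".join("+".join(group) for group in option))
```

Two tasks give 2 placements (same CPU or different CPUs); three tasks
give 5.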

>> It's very interesting that we see -40% tput w/ the patches. I went back
> 

Sad that we see -40% throughput...  but do we know which CPU cores the
three tasks/processes now run on(?)


> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.
> 
>> and double checked and it seems the numbers are right. Here's the
>> some output from some profiles I took with:
>>
>>      perf record -e cycles:k -a -- sleep 10
>>      perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
>>
>>      # Event 'cycles:k'
>>      # Baseline  Delta Abs  Shared Object                                                    Symbol
>>           6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
>

I really appreciate that you provide perf data and perf diff, but as
described above, we need data and information on what CPU cores are
running which workload.

Fortunately perf diff (and perf report) supports sorting per CPU like this:
  perf diff --sort=cpu,symbol

But then you also need to control the CPUs used in the experiment for the
diff to work.

I hope I made sense, as these kinds of CPU scaling benchmarks are tricky,
--Jesper
Daniel Xu Nov. 25, 2024, 9:53 p.m. UTC | #14
Hi Jesper,

On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
> 
> 
> On 25/11/2024 16.12, Alexander Lobakin wrote:
> > From: Daniel Xu <dxu@dxuuu.xyz>
> > Date: Fri, 22 Nov 2024 17:10:06 -0700
> > 
> > > Hi Olek,
> > > 
> > > Here are the results.
> > > 
> > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > > > 
> > > > 
> > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > 
> > [...]
> > 
> > > Baseline (again)
> > > 
> > > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > > Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> > > Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> > > Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> > > Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> > > Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> > > Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> > > 
> 
> We need to talk about what we are measuring, and how to control the
> experiment setup to get reproducible results.
> Especially controlling on what CPU cores our code paths are executing.
> 
> In above "baseline" case, we have two processes/tasks executing:
>  (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
>  (2) Userspace netserver process TCP receiving data from socket.

"baseline" in this case is still cpumap, just without these GRO patches.

> 
> My experience is that you will see two noticeable different
> throughput performance results depending on whether (1) and (2) is
> executing on the *same* CPU (multi-tasking context-switching),
> or executing in parallel (e.g. pinned) on two different CPU cores.
> 
> The netperf command have an option
> 
>  -T lcpu,remcpu
>       Request that netperf be bound to local CPU lcpu and/or netserver be
> bound to remote CPU rcpu.
> 
> Verify setting by listing pinning like this:
>   for PID in $(pidof netserver); do taskset -pc $PID ; done
> 
> You can also set pinning runtime like this:
>  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID;
> done
> 
> For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
> output and adjust pinning runtime to observe the effect quickly.
> 
> My experience is unfortunately that TCP results have a lot of variation
> (thanks for incliding 5 runs in your benchmarks), as it depends on tasks
> timing, that can get affected by CPU sleep states. The systems CPU
> latency setting can be seen in /dev/cpu_dma_latency, which can be read
> like this:
> 
>  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
> 
> For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm
> as it requires holding the file open. E.g I play with these profiles:
> 
>  sudo tuned-adm profile throughput-performance
>  sudo tuned-adm profile latency-performance
>  sudo tuned-adm profile network-latency

Appreciate the tips - I should keep this saved somewhere.

> 
> 
> > > cpumap v2 Olek
> > > 
> > > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > > Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> > > Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> > > Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> > > Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> > > Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> > > Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> > > Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> > > 
> > > 
> 
> 
> We now three processes/tasks executing:
>  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
>  (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
>  (3) Userspace netserver process TCP receiving data from socket.
> 
> Again, now the performance is going to depend on depending on which CPU
> cores the processes/tasks are running and whether some are sharing the
> same CPU. (There are both wakeup timing and cache-line effects).
> 
> There are now more combinations to test...
> 
> CPUmap is a CPU scaling facility, and you will likely also see different
> CPU utilization on the difference cores one you start to pin these to
> control the scenarios.
> 
> > > It's very interesting that we see -40% tput w/ the patches. I went back
> > 
> 
> Sad that we see -40% throughput...  but do we know what CPU cores the
> now three different tasks/processes run on(?)
> 

Roughly, yes. For context, my primary use case for cpumap is to provide
some degree of isolation between colocated containers on a single host.
In particular, colocation occurs on AMD Bergamo. And containers are
CPU pinned to their own CCX (roughly). My RX steering program ensures
RX packets destined to a specific container are cpumap redirected to any
of the container's pinned CPUs. It not only provides a good measure of
isolation but ensures resources are properly accounted.

So to answer your question of which CPUs the 3 things run on: cpumap
kthread and application run on the same set of cores. More than that,
they share the same L3 cache by design. irq/softirq is effectively
random given default RSS config and IRQ affinities.


> 
> > Oh no, I messed up something =\
> >
> > Could you please also test not the whole series, but patches 1-3 (up to
> > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> > array...")? Would be great to see whether this implementation works
> > worse right from the start or I just broke something later on.
> > 
> > > and double checked and it seems the numbers are right. Here's the
> > > some output from some profiles I took with:
> > > 
> > >      perf record -e cycles:k -a -- sleep 10
> > >      perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > > 
> > >      # Event 'cycles:k'
> > >      # Baseline  Delta Abs  Shared Object                                                    Symbol
> > >           6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
> > 
> 
> I really appreciate that you provide perf data and perf diff, but as
> described above, we need data and information on what CPU cores are
> running which workload.
> 
> Fortunately perf diff (and perf report) support doing like this:
>  perf diff --sort=cpu,symbol
> 
> But then you also need to control the CPUs used in experiment for the
> diff to work.
> 
> I hope I made sense as these kind of CPU scaling benchmarks are tricky,

Indeed, sounds quite tricky.

My understanding with GRO is that it's a powerful general purpose
optimization. Enough that it should rise above the usual noise on a
reasonably configured system (which mine is).

Maybe we can consider decoupling the cpumap GRO enablement from the
later optimizations?

So in Olek's above series, patches 1-3 seem like they would still
benefit from a simpler testbed. But the more targeted optimizations in
patch 4+ would probably justify a de-noised setup.  Possibly single host
with xdp-trafficgen or something.

Procedurally speaking, maybe it would save some wasted effort if
everyone agreed on the general approach before investing more time into
finer optimizations built on top of the basic GRO support?

Thanks,
Daniel
Lorenzo Bianconi Nov. 25, 2024, 10:19 p.m. UTC | #15
> Hi Jesper,
> 
> On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
> > 
> > 
> > On 25/11/2024 16.12, Alexander Lobakin wrote:
> > > From: Daniel Xu <dxu@dxuuu.xyz>
> > > Date: Fri, 22 Nov 2024 17:10:06 -0700
> > > 
> > > > Hi Olek,
> > > > 
> > > > Here are the results.
> > > > 
> > > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > > > > 
> > > > > 
> > > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > > 
> > > [...]
> > > 
> > > > Baseline (again)
> > > > 
> > > > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > > > Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
> > > > Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
> > > > Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
> > > > Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
> > > > Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
> > > > Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
> > > > 
> > 
> > We need to talk about what we are measuring, and how to control the
> > experiment setup to get reproducible results.
> > Especially controlling on what CPU cores our code paths are executing.
> > 
> > In above "baseline" case, we have two processes/tasks executing:
> >  (1) RX-napi softirq/thread (until napi_gro_receive deliver to socket)
> >  (2) Userspace netserver process TCP receiving data from socket.
> 
> "baseline" in this case is still cpumap, just without these GRO patches.
> 
> > 
> > My experience is that you will see two noticeable different
> > throughput performance results depending on whether (1) and (2) is
> > executing on the *same* CPU (multi-tasking context-switching),
> > or executing in parallel (e.g. pinned) on two different CPU cores.
> > 
> > The netperf command have an option
> > 
> >  -T lcpu,remcpu
> >       Request that netperf be bound to local CPU lcpu and/or netserver be
> > bound to remote CPU rcpu.
> > 
> > Verify setting by listing pinning like this:
> >   for PID in $(pidof netserver); do taskset -pc $PID ; done
> > 
> > You can also set pinning runtime like this:
> >  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID;
> > done
> > 
> > For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
> > output and adjust pinning runtime to observe the effect quickly.
> > 
> > My experience is unfortunately that TCP results have a lot of variation
> > (thanks for incliding 5 runs in your benchmarks), as it depends on tasks
> > timing, that can get affected by CPU sleep states. The systems CPU
> > latency setting can be seen in /dev/cpu_dma_latency, which can be read
> > like this:
> > 
> >  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
> > 
> > For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm
> > as it requires holding the file open. E.g I play with these profiles:
> > 
> >  sudo tuned-adm profile throughput-performance
> >  sudo tuned-adm profile latency-performance
> >  sudo tuned-adm profile network-latency
> 
> Appreciate the tips - I should keep this saved somewhere.
> 
> > 
> > 
> > > > cpumap v2 Olek
> > > > 
> > > > 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
> > > > Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
> > > > Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
> > > > Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
> > > > Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
> > > > Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
> > > > Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
> > > > Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
> > > > 
> > > > 
> > 
> > 
> > We now have three processes/tasks executing:
> >  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
> >  (2) CPUmap kthread (until gro_receive_skb/gro_flush deliver to socket)
> >  (3) Userspace netserver process TCP receiving data from socket.
> > 
> > Again, now the performance is going to depend on which CPU
> > cores the processes/tasks are running and whether some are sharing the
> > same CPU. (There are both wakeup timing and cache-line effects).
> > 
> > There are now more combinations to test...
> > 
> > CPUmap is a CPU scaling facility, and you will likely also see different
> > CPU utilization on the different cores once you start to pin these to
> > control the scenarios.
> > 
> > > > It's very interesting that we see -40% tput w/ the patches. I went back
> > > 
> > 
> > Sad that we see -40% throughput...  but do we know what CPU cores the
> > now three different tasks/processes run on(?)
> > 
> 
> Roughly, yes. For context, my primary use case for cpumap is to provide
> some degree of isolation between colocated containers on a single host.
> In particular, colocation occurs on AMD Bergamo. And containers are
> CPU pinned to their own CCX (roughly). My RX steering program ensures
> RX packets destined to a specific container are cpumap redirected to any
> of the container's pinned CPUs. It not only provides a good measure of
> isolation but ensures resources are properly accounted.
> 
> So to answer your question of which CPUs the 3 things run on: cpumap
> kthread and application run on the same set of cores. More than that,
> they share the same L3 cache by design. irq/softirq is effectively
> random given default RSS config and IRQ affinities.
> 
> 
> > 
> > > Oh no, I messed up something =\
> > > 
> > > Could you please also test not the whole series, but patches 1-3 (up to
> > > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> > > array...")? Would be great to see whether this implementation works
> > > worse right from the start or I just broke something later on.
> > > 
> > > > and double checked and it seems the numbers are right. Here's
> > > > some output from some profiles I took with:
> > > > 
> > > >      perf record -e cycles:k -a -- sleep 10
> > > >      perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
> > > > 
> > > >      # Event 'cycles:k'
> > > >      # Baseline  Delta Abs  Shared Object                                                    Symbol
> > > >           6.13%     -3.60%  [kernel.kallsyms]                                                [k] _copy_to_iter
> > > 
> > 
> > I really appreciate that you provide perf data and perf diff, but as
> > described above, we need data and information on what CPU cores are
> > running which workload.
> > 
> > Fortunately perf diff (and perf report) support doing it like this:
> >  perf diff --sort=cpu,symbol
> > 
> > But then you also need to control the CPUs used in experiment for the
> > diff to work.
> > 
> > I hope I made sense as these kinds of CPU scaling benchmarks are tricky,
> 
> Indeed, sounds quite tricky.
> 
> My understanding with GRO is that it's a powerful general purpose
> optimization. Enough that it should rise above the usual noise on a
> reasonably configured system (which mine is).
> 
> Maybe we can consider decoupling the cpumap GRO enablement from the
> later optimizations?

I agree. First, we need to identify the best approach to enable GRO on cpumap
(between Olek's approach and what I have suggested) and then we can evaluate
subsequent optimizations.
@Olek: do you agree?

Regards,
Lorenzo

> 
> So in Olek's above series, patches 1-3 seem like they would still
> benefit from a simpler testbed. But the more targeted optimizations in
> patch 4+ would probably justify a de-noised setup.  Possibly single host
> with xdp-trafficgen or something.
> 
> Procedurally speaking, maybe it would save some wasted effort if
> everyone agreed on the general approach before investing more time into
> finer optimizations built on top of the basic GRO support?
> 
> Thanks,
> Daniel
>
Daniel Xu Nov. 25, 2024, 10:56 p.m. UTC | #16
On Mon, Nov 25, 2024, at 9:12 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Fri, 22 Nov 2024 17:10:06 -0700
>
>> Hi Olek,
>> 
>> Here are the results.
>> 
>> On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
>>>
>>>
>>> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
>
> [...]
>
>> Baseline (again)
>> 
>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>> Run 1	3169917	        0.00007295	0.00007871	0.00009343		Run 1	21749.43
>> Run 2	3228290	        0.00007103	0.00007679	0.00009215		Run 2	21897.17
>> Run 3	3226746	        0.00007231	0.00007871	0.00009087		Run 3	21906.82
>> Run 4	3191258	        0.00007231	0.00007743	0.00009087		Run 4	21155.15
>> Run 5	3235653	        0.00007231	0.00007743	0.00008703		Run 5	21397.06
>> Average	3210372.8	0.000072182	0.000077814	0.00009087		Average	21621.126
>> 
>> cpumap v2 Olek
>> 
>> 	Transactions	Latency P50 (s)	Latency P90 (s)	Latency P99 (s)			Throughput (Mbit/s)
>> Run 1	3253651	        0.00007167	0.00007807	0.00009343		Run 1	13497.57
>> Run 2	3221492	        0.00007231	0.00007743	0.00009087		Run 2	12115.53
>> Run 3	3296453	        0.00007039	0.00007807	0.00009087		Run 3	12323.38
>> Run 4	3254460	        0.00007167	0.00007807	0.00009087		Run 4	12901.88
>> Run 5	3173327	        0.00007295	0.00007871	0.00009215		Run 5	12593.22
>> Average	3239876.6	0.000071798	0.00007807	0.000091638		Average	12686.316
>> Delta	0.92%	        -0.53%	        0.33%	        0.85%			        -41.32%
>> 
>> 
>> It's very interesting that we see -40% tput w/ the patches. I went back
>
> Oh no, I messed up something =\
>
> Could you please also test not the whole series, but patches 1-3 (up to
> "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> array...")? Would be great to see whether this implementation works
> worse right from the start or I just broke something later on.

Patches 1-3 reproduce the -40% tput numbers.

With patches 1-4 the numbers get slightly worse (~1 Gbps lower), but it was noisy.

tcp_rr results were unaffected.
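
For reference, the Delta row in the quoted tables appears to be the relative change of the patched mean against the baseline mean. A minimal sketch of that arithmetic follows, using the five throughput runs quoted above; the function name `delta_percent` is only illustrative.

```python
from statistics import mean

# Throughput in Mbit/s, copied from the "Baseline (again)" and
# "cpumap v2 Olek" tables quoted above.
baseline_mbit = [21749.43, 21897.17, 21906.82, 21155.15, 21397.06]
patched_mbit = [13497.57, 12115.53, 12323.38, 12901.88, 12593.22]

def delta_percent(baseline, patched):
    """Relative change of the patched mean versus the baseline mean, in percent."""
    base = mean(baseline)
    return (mean(patched) - base) / base * 100.0

print(f"{delta_percent(baseline_mbit, patched_mbit):.2f}%")  # prints -41.32%
```

The same function applied to the transaction counts or latency percentiles should give the other entries of the Delta row.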

Thanks,
Daniel