Message ID | 20240729183038.1959-1-eladwf@gmail.com |
---|---|
Series | net: ethernet: mtk_eth_soc: improve RX performance |
> This small series includes two short and simple patches to improve RX performance
> on this driver.

Hi Elad,

What is the chip revision you are running?
If you are using a device that does not support HW-LRO (e.g. MT7986 or
MT7988), I guess we can try to use page_pool_dev_alloc_frag() APIs and
request a 2048B buffer. Doing so, we can use a single page for two
rx buffers, improving recycling with page_pool. What do you think?

Regards,
Lorenzo

>
> iperf3 result without these patches:
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-1.00 sec   563 MBytes   4.72 Gbits/sec
> [  4] 1.00-2.00 sec   563 MBytes   4.73 Gbits/sec
> [  4] 2.00-3.00 sec   552 MBytes   4.63 Gbits/sec
> [  4] 3.00-4.00 sec   561 MBytes   4.70 Gbits/sec
> [  4] 4.00-5.00 sec   562 MBytes   4.71 Gbits/sec
> [  4] 5.00-6.00 sec   565 MBytes   4.74 Gbits/sec
> [  4] 6.00-7.00 sec   563 MBytes   4.72 Gbits/sec
> [  4] 7.00-8.00 sec   565 MBytes   4.74 Gbits/sec
> [  4] 8.00-9.00 sec   562 MBytes   4.71 Gbits/sec
> [  4] 9.00-10.00 sec  558 MBytes   4.68 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-10.00 sec  5.48 GBytes  4.71 Gbits/sec  sender
> [  4] 0.00-10.00 sec  5.48 GBytes  4.71 Gbits/sec  receiver
>
> iperf3 result with "use prefetch methods" patch:
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-1.00 sec   598 MBytes   5.02 Gbits/sec
> [  4] 1.00-2.00 sec   588 MBytes   4.94 Gbits/sec
> [  4] 2.00-3.00 sec   592 MBytes   4.97 Gbits/sec
> [  4] 3.00-4.00 sec   594 MBytes   4.98 Gbits/sec
> [  4] 4.00-5.00 sec   590 MBytes   4.95 Gbits/sec
> [  4] 5.00-6.00 sec   594 MBytes   4.98 Gbits/sec
> [  4] 6.00-7.00 sec   594 MBytes   4.98 Gbits/sec
> [  4] 7.00-8.00 sec   593 MBytes   4.98 Gbits/sec
> [  4] 8.00-9.00 sec   593 MBytes   4.98 Gbits/sec
> [  4] 9.00-10.00 sec  594 MBytes   4.98 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-10.00 sec  5.79 GBytes  4.98 Gbits/sec  sender
> [  4] 0.00-10.00 sec  5.79 GBytes  4.98 Gbits/sec  receiver
>
> iperf3 result with "use PP exclusively for XDP programs" patch:
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-1.00 sec   635 MBytes   5.33 Gbits/sec
> [  4] 1.00-2.00 sec   636 MBytes   5.33 Gbits/sec
> [  4] 2.00-3.00 sec   637 MBytes   5.34 Gbits/sec
> [  4] 3.00-4.00 sec   636 MBytes   5.34 Gbits/sec
> [  4] 4.00-5.00 sec   637 MBytes   5.34 Gbits/sec
> [  4] 5.00-6.00 sec   637 MBytes   5.35 Gbits/sec
> [  4] 6.00-7.00 sec   637 MBytes   5.34 Gbits/sec
> [  4] 7.00-8.00 sec   636 MBytes   5.33 Gbits/sec
> [  4] 8.00-9.00 sec   634 MBytes   5.32 Gbits/sec
> [  4] 9.00-10.00 sec  637 MBytes   5.34 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-10.00 sec  6.21 GBytes  5.34 Gbits/sec  sender
> [  4] 0.00-10.00 sec  6.21 GBytes  5.34 Gbits/sec  receiver
>
> iperf3 result with both patches:
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-1.00 sec   652 MBytes   5.47 Gbits/sec
> [  4] 1.00-2.00 sec   653 MBytes   5.47 Gbits/sec
> [  4] 2.00-3.00 sec   654 MBytes   5.48 Gbits/sec
> [  4] 3.00-4.00 sec   654 MBytes   5.49 Gbits/sec
> [  4] 4.00-5.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 5.00-6.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 6.00-7.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 7.00-8.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 8.00-9.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 9.00-10.00 sec  654 MBytes   5.48 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-10.00 sec  6.38 GBytes  5.48 Gbits/sec  sender
> [  4] 0.00-10.00 sec  6.38 GBytes  5.48 Gbits/sec  receiver
>
> About 16% more packets/sec without XDP program loaded,
> and about 5% more packets/sec when using PP.
> Tested on Banana Pi BPI-R4 (MT7988A).
>
> ---
> Technically, this is version 2 of the "use prefetch methods" patch.
> Initially, I submitted it as a single patch for review (RFC),
> but later I decided to include a second patch, resulting in this series.
> Changes in v2:
> - Add "use PP exclusively for XDP programs" patch and create this series
> ---
> Elad Yifee (2):
>   net: ethernet: mtk_eth_soc: use prefetch methods
>   net: ethernet: mtk_eth_soc: use PP exclusively for XDP programs
>
>  drivers/net/ethernet/mediatek/mtk_eth_soc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> --
> 2.45.2
>
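For context on what the "use prefetch methods" patch refers to, here is a minimal, hypothetical illustration of the technique (prefetching the packet data and the next descriptor ahead of use in the RX loop). The helper name and call sites are assumptions for illustration, not the actual patch.

```c
/*
 * Hypothetical illustration of RX-loop prefetching, not the actual
 * mtk_eth_soc patch: warm the cache lines the CPU is about to touch.
 */
#include <linux/netdevice.h>	/* net_prefetch() */
#include <linux/prefetch.h>	/* prefetch() */

static void rx_prefetch_sketch(void *rx_data, const void *next_desc)
{
	/* packet headers that build_skb()/eth_type_trans() will read soon */
	net_prefetch(rx_data);
	/* descriptor the next loop iteration will poll */
	prefetch(next_desc);
}
```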
On Mon, Jul 29, 2024 at 10:10 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> > This small series includes two short and simple patches to improve RX performance
> > on this driver.
>
> Hi Elad,
>
> What is the chip revision you are running?
> If you are using a device that does not support HW-LRO (e.g. MT7986 or
> MT7988), I guess we can try to use page_pool_dev_alloc_frag() APIs and
> request a 2048B buffer. Doing so, we can use a single page for two
> rx buffers, improving recycling with page_pool. What do you think?
>
> Regards,
> Lorenzo
>
Hey Lorenzo,
It's Rev0. Why, do you have any info on the revisions?
Since it's probably the reason for the performance hit,
allocating full pages every time, I think your suggestion would improve the
performance and probably match it with the napi_alloc_frag path.
I'll give it a try when I have time.
You also mentioned HW-LRO, which makes me think we also need the second patch
if we want to allow HW-LRO to co-exist with XDP on NETSYS2/3 devices.
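To make the page_pool_dev_alloc_frag() suggestion concrete, a rough sketch follows of how an RX refill helper could hand out 2048-byte slices so that one 4 KiB page backs two RX buffers. The function name, RX_FRAG_SIZE and surrounding glue are assumptions for illustration, not mtk_eth_soc code.

```c
/*
 * Hypothetical sketch only: not the in-tree mtk_eth_soc code.  It shows
 * the general shape of an RX refill helper built on the page_pool frag
 * API, where two 2048-byte buffers share a single 4 KiB page.
 */
#include <linux/mm.h>
#include <net/page_pool/helpers.h>	/* <net/page_pool.h> on older kernels */

#define RX_FRAG_SIZE	2048		/* assumed RX buffer size, two per 4K page */

static void *rx_alloc_frag_sketch(struct page_pool *pp, dma_addr_t *dma_addr)
{
	unsigned int offset;
	struct page *page;

	/* page_pool returns a page plus an offset into it; consecutive
	 * calls carve 2048B slices out of the same page until it is full,
	 * which roughly halves the recycling work per packet.
	 */
	page = page_pool_dev_alloc_frag(pp, &offset, RX_FRAG_SIZE);
	if (!page)
		return NULL;

	*dma_addr = page_pool_get_dma_addr(page) + offset;
	return page_address(page) + offset;
}
```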
On Tue, 30 Jul 2024 08:29:58 +0300 Elad Yifee wrote:
> Since it's probably the reason for the performance hit,
> allocating full pages every time, I think your suggestion would improve the
> performance and probably match it with the napi_alloc_frag path.
> I'll give it a try when I have time.

This is a better direction than disabling PP.
Feel free to repost patch 1 separately.
On Thu, Aug 1, 2024 at 4:37 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 30 Jul 2024 08:29:58 +0300 Elad Yifee wrote:
> > Since it's probably the reason for the performance hit,
> > allocating full pages every time, I think your suggestion would improve the
> > performance and probably match it with the napi_alloc_frag path.
> > I'll give it a try when I have time.
>
> This is a better direction than disabling PP.
> Feel free to repost patch 1 separately.
> --
> pw-bot: cr
In this driver, the existence of PP is the condition to execute all
XDP-related operations, which aren't necessary on this hot path, so we
wouldn't want that anyway. On XDP program setup the rings are reallocated
and the PP would be created.
Other than that, for HWLRO we need contiguous pages of a different order
than the PP, so the creation of the PP basically prevents the use of HWLRO.
So we solve this LRO problem and get a performance boost with this
simple change.

Lorenzo's suggestion would probably improve the performance of the XDP
path and we should try that nonetheless.
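The argument above amounts to gating the page_pool on the presence of an XDP program. A hypothetical sketch of that idea follows; the ring layout and names are invented for illustration and differ from the actual patch.

```c
/*
 * Hypothetical sketch of the "use PP exclusively for XDP programs" idea:
 * the ring only owns a page_pool when an XDP program is attached, and the
 * plain SKB path falls back to the per-CPU page fragment allocator.
 * DMA mapping of the napi_alloc_frag() buffer is omitted for brevity.
 */
#include <linux/mm.h>
#include <linux/skbuff.h>
#include <net/page_pool/helpers.h>

struct rx_ring_sketch {			/* illustrative, not struct mtk_rx_ring */
	struct page_pool *page_pool;	/* non-NULL only while XDP is loaded */
	unsigned int frag_size;
	dma_addr_t dma_addr;
};

static void *rx_get_buff_sketch(struct rx_ring_sketch *ring, gfp_t gfp_mask)
{
	if (ring->page_pool) {	/* XDP attached: rings were rebuilt with a PP */
		struct page *page = page_pool_alloc_pages(ring->page_pool,
							  gfp_mask);

		if (!page)
			return NULL;
		ring->dma_addr = page_pool_get_dma_addr(page);
		return page_address(page);
	}

	/* non-XDP fast path: no page_pool and no XDP bookkeeping */
	return napi_alloc_frag(ring->frag_size);
}
```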
> On Thu, Aug 1, 2024 at 4:37 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 30 Jul 2024 08:29:58 +0300 Elad Yifee wrote:
> > > Since it's probably the reason for the performance hit,
> > > allocating full pages every time, I think your suggestion would improve the
> > > performance and probably match it with the napi_alloc_frag path.
> > > I'll give it a try when I have time.
> >
> > This is a better direction than disabling PP.
> > Feel free to repost patch 1 separately.
> > --
> > pw-bot: cr
> In this driver, the existence of PP is the condition to execute all
> XDP-related operations, which aren't necessary on this hot path, so we
> wouldn't want that anyway. On XDP program setup the rings are reallocated
> and the PP would be created.

nope, I added page_pool support even for non-XDP mode for hw that does
not support HW-LRO. I guess mtk folks can correct me if I am wrong but
IIRC there were some hw limitations on mt7986/mt7988 for HW-LRO, so I am
not sure if it can be supported.

> Other than that, for HWLRO we need contiguous pages of a different order
> than the PP, so the creation of the PP basically prevents the use of HWLRO.
> So we solve this LRO problem and get a performance boost with this
> simple change.
>
> Lorenzo's suggestion would probably improve the performance of the XDP
> path and we should try that nonetheless.

nope, I mean to improve performance even for the non-XDP case with page_pool
frag APIs.

Regards,
Lorenzo
On Thu, Aug 1, 2024 at 10:30 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> nope, I added page_pool support even for non-XDP mode for hw that does
> not support HW-LRO. I guess mtk folks can correct me if I am wrong but
> IIRC there were some hw limitations on mt7986/mt7988 for HW-LRO, so I am
> not sure if it can be supported.
I know, but if we want to add support for HWLRO alongside XDP on NETSYS2/3,
we need to prevent the PP use (for HWLRO allocations) and enable it
only when there's an XDP program.
I've been told HWLRO works on the MTK SDK version.

> > Other than that, for HWLRO we need contiguous pages of a different order
> > than the PP, so the creation of the PP basically prevents the use of HWLRO.
> > So we solve this LRO problem and get a performance boost with this
> > simple change.
> >
> > Lorenzo's suggestion would probably improve the performance of the XDP
> > path and we should try that nonetheless.
>
> nope, I mean to improve performance even for the non-XDP case with page_pool
> frag APIs.
>
> Regards,
> Lorenzo
Yes, of course it would improve the non-XDP case if we still use PP
for non-XDP, but my point is we shouldn't, mainly because of HWLRO
but also because of the extra unnecessary code.
> On Thu, Aug 1, 2024 at 10:30 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> >
> > nope, I added page_pool support even for non-XDP mode for hw that does
> > not support HW-LRO. I guess mtk folks can correct me if I am wrong but
> > IIRC there were some hw limitations on mt7986/mt7988 for HW-LRO, so I am
> > not sure if it can be supported.
> I know, but if we want to add support for HWLRO alongside XDP on NETSYS2/3,
> we need to prevent the PP use (for HWLRO allocations) and enable it
> only when there's an XDP program.
> I've been told HWLRO works on the MTK SDK version.

ack, but in this case, please also provide the HW-LRO support in the same
series. Moreover, I am not sure if it is performant enough or not; we could
increase the page_pool order. Moreover, I guess we should be sure the HW-LRO
works on all NETSYS2/3 hw revisions.

Regards,
Lorenzo

> > > Other than that, for HWLRO we need contiguous pages of a different order
> > > than the PP, so the creation of the PP basically prevents the use of HWLRO.
> > > So we solve this LRO problem and get a performance boost with this
> > > simple change.
> > >
> > > Lorenzo's suggestion would probably improve the performance of the XDP
> > > path and we should try that nonetheless.
> >
> > nope, I mean to improve performance even for the non-XDP case with page_pool
> > frag APIs.
> >
> > Regards,
> > Lorenzo
> Yes, of course it would improve the non-XDP case if we still use PP
> for non-XDP, but my point is we shouldn't, mainly because of HWLRO
> but also because of the extra unnecessary code.
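On the idea of increasing the page_pool order for HW-LRO-sized buffers, a hypothetical sketch follows; the order, pool size and flags are placeholders rather than tested mtk_eth_soc settings, and whether such a pool is performant enough is exactly the open question above.

```c
/*
 * Hypothetical sketch: a page_pool built from higher-order pages, one way
 * to provide the larger contiguous buffers HW-LRO wants.  All values are
 * placeholders, not tested mtk_eth_soc settings.
 */
#include <linux/dma-mapping.h>
#include <linux/numa.h>
#include <net/page_pool/helpers.h>	/* page_pool_create(), page_pool_params */

static struct page_pool *lro_page_pool_create_sketch(struct device *dev)
{
	struct page_pool_params pp_params = {
		.order		= 3,	/* 32 KiB per pool page with 4K PAGE_SIZE (assumption) */
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.pool_size	= 256,	/* placeholder ring depth */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
		.max_len	= PAGE_SIZE << 3,	/* sync the whole high-order page */
	};

	return page_pool_create(&pp_params);
}
```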