Message ID | 20240729183038.1959-1-eladwf@gmail.com |
---|---|
Series | net: ethernet: mtk_eth_soc: improve RX performance |
> This small series includes two short and simple patches to improve RX performance
> on this driver.

Hi Elad,

What is the chip revision you are running?
If you are using a device that does not support HW-LRO (e.g. MT7986 or
MT7988), I guess we can try to use page_pool_dev_alloc_frag() APIs and
request a 2048B buffer. Doing so, we can use a single page for two
rx buffers, improving recycling with page_pool. What do you think?

Regards,
Lorenzo

>
> iperf3 result without these patches:
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-1.00 sec   563 MBytes   4.72 Gbits/sec
> [  4] 1.00-2.00 sec   563 MBytes   4.73 Gbits/sec
> [  4] 2.00-3.00 sec   552 MBytes   4.63 Gbits/sec
> [  4] 3.00-4.00 sec   561 MBytes   4.70 Gbits/sec
> [  4] 4.00-5.00 sec   562 MBytes   4.71 Gbits/sec
> [  4] 5.00-6.00 sec   565 MBytes   4.74 Gbits/sec
> [  4] 6.00-7.00 sec   563 MBytes   4.72 Gbits/sec
> [  4] 7.00-8.00 sec   565 MBytes   4.74 Gbits/sec
> [  4] 8.00-9.00 sec   562 MBytes   4.71 Gbits/sec
> [  4] 9.00-10.00 sec  558 MBytes   4.68 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-10.00 sec  5.48 GBytes  4.71 Gbits/sec  sender
> [  4] 0.00-10.00 sec  5.48 GBytes  4.71 Gbits/sec  receiver
>
> iperf3 result with "use prefetch methods" patch:
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-1.00 sec   598 MBytes   5.02 Gbits/sec
> [  4] 1.00-2.00 sec   588 MBytes   4.94 Gbits/sec
> [  4] 2.00-3.00 sec   592 MBytes   4.97 Gbits/sec
> [  4] 3.00-4.00 sec   594 MBytes   4.98 Gbits/sec
> [  4] 4.00-5.00 sec   590 MBytes   4.95 Gbits/sec
> [  4] 5.00-6.00 sec   594 MBytes   4.98 Gbits/sec
> [  4] 6.00-7.00 sec   594 MBytes   4.98 Gbits/sec
> [  4] 7.00-8.00 sec   593 MBytes   4.98 Gbits/sec
> [  4] 8.00-9.00 sec   593 MBytes   4.98 Gbits/sec
> [  4] 9.00-10.00 sec  594 MBytes   4.98 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-10.00 sec  5.79 GBytes  4.98 Gbits/sec  sender
> [  4] 0.00-10.00 sec  5.79 GBytes  4.98 Gbits/sec  receiver
>
> iperf3 result with "use PP exclusively for XDP programs" patch:
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-1.00 sec   635 MBytes   5.33 Gbits/sec
> [  4] 1.00-2.00 sec   636 MBytes   5.33 Gbits/sec
> [  4] 2.00-3.00 sec   637 MBytes   5.34 Gbits/sec
> [  4] 3.00-4.00 sec   636 MBytes   5.34 Gbits/sec
> [  4] 4.00-5.00 sec   637 MBytes   5.34 Gbits/sec
> [  4] 5.00-6.00 sec   637 MBytes   5.35 Gbits/sec
> [  4] 6.00-7.00 sec   637 MBytes   5.34 Gbits/sec
> [  4] 7.00-8.00 sec   636 MBytes   5.33 Gbits/sec
> [  4] 8.00-9.00 sec   634 MBytes   5.32 Gbits/sec
> [  4] 9.00-10.00 sec  637 MBytes   5.34 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-10.00 sec  6.21 GBytes  5.34 Gbits/sec  sender
> [  4] 0.00-10.00 sec  6.21 GBytes  5.34 Gbits/sec  receiver
>
> iperf3 result with both patches:
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-1.00 sec   652 MBytes   5.47 Gbits/sec
> [  4] 1.00-2.00 sec   653 MBytes   5.47 Gbits/sec
> [  4] 2.00-3.00 sec   654 MBytes   5.48 Gbits/sec
> [  4] 3.00-4.00 sec   654 MBytes   5.49 Gbits/sec
> [  4] 4.00-5.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 5.00-6.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 6.00-7.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 7.00-8.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 8.00-9.00 sec   653 MBytes   5.48 Gbits/sec
> [  4] 9.00-10.00 sec  654 MBytes   5.48 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval        Transfer     Bandwidth
> [  4] 0.00-10.00 sec  6.38 GBytes  5.48 Gbits/sec  sender
> [  4] 0.00-10.00 sec  6.38 GBytes  5.48 Gbits/sec  receiver
>
> About 16% more packets/sec without XDP program loaded,
> and about 5% more packets/sec when using PP.
> Tested on Banana Pi BPI-R4 (MT7988A).
>
> ---
> Technically, this is version 2 of the "use prefetch methods" patch.
> Initially, I submitted it as a single patch for review (RFC),
> but later I decided to include a second patch, resulting in this series.
> Changes in v2:
> - Add "use PP exclusively for XDP programs" patch and create this series
> ---
> Elad Yifee (2):
>   net: ethernet: mtk_eth_soc: use prefetch methods
>   net: ethernet: mtk_eth_soc: use PP exclusively for XDP programs
>
>  drivers/net/ethernet/mediatek/mtk_eth_soc.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> --
> 2.45.2
>
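For context on what the "use prefetch methods" patch refers to, here is a minimal, hypothetical illustration of the technique (prefetching the packet data and the next descriptor ahead of use in the RX loop). The helper name and call sites are assumptions for illustration, not the actual patch.

```c
/*
 * Hypothetical illustration of RX-loop prefetching, not the actual
 * mtk_eth_soc patch: warm the cache lines the CPU is about to touch.
 */
#include <linux/netdevice.h>	/* net_prefetch() */
#include <linux/prefetch.h>	/* prefetch() */

static void rx_prefetch_sketch(void *rx_data, const void *next_desc)
{
	/* packet headers that build_skb()/eth_type_trans() will read soon */
	net_prefetch(rx_data);
	/* descriptor the next loop iteration will poll */
	prefetch(next_desc);
}
```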
On Mon, Jul 29, 2024 at 10:10 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> > This small series includes two short and simple patches to improve RX performance
> > on this driver.
>
> Hi Elad,
>
> What is the chip revision you are running?
> If you are using a device that does not support HW-LRO (e.g. MT7986 or
> MT7988), I guess we can try to use page_pool_dev_alloc_frag() APIs and
> request a 2048B buffer. Doing so, we can use a single page for two
> rx buffers, improving recycling with page_pool. What do you think?
>
> Regards,
> Lorenzo
>
Hey Lorenzo,
It's Rev0. Why, do you have any info on the revisions?
Since it's probably the reason for the performance hit,
allocating full pages every time, I think your suggestion would improve the
performance and probably match it with the napi_alloc_frag path.
I'll give it a try when I have time.
You also mentioned HW-LRO, which makes me think we also need the second patch
if we want to allow HW-LRO to co-exist with XDP on NETSYS2/3 devices.
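To make the page_pool_dev_alloc_frag() suggestion concrete, a rough sketch follows of how an RX refill helper could hand out 2048-byte slices so that one 4 KiB page backs two RX buffers. The function name, RX_FRAG_SIZE and surrounding glue are assumptions for illustration, not mtk_eth_soc code.

```c
/*
 * Hypothetical sketch only: not the in-tree mtk_eth_soc code.  It shows
 * the general shape of an RX refill helper built on the page_pool frag
 * API, where two 2048-byte buffers share a single 4 KiB page.
 */
#include <linux/mm.h>
#include <net/page_pool/helpers.h>	/* <net/page_pool.h> on older kernels */

#define RX_FRAG_SIZE	2048		/* assumed RX buffer size, two per 4K page */

static void *rx_alloc_frag_sketch(struct page_pool *pp, dma_addr_t *dma_addr)
{
	unsigned int offset;
	struct page *page;

	/* page_pool returns a page plus an offset into it; consecutive
	 * calls carve 2048B slices out of the same page until it is full,
	 * which roughly halves the recycling work per packet.
	 */
	page = page_pool_dev_alloc_frag(pp, &offset, RX_FRAG_SIZE);
	if (!page)
		return NULL;

	*dma_addr = page_pool_get_dma_addr(page) + offset;
	return page_address(page) + offset;
}
```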
On Tue, 30 Jul 2024 08:29:58 +0300 Elad Yifee wrote:
> Since it's probably the reason for the performance hit,
> allocating full pages every time, I think your suggestion would improve the
> performance and probably match it with the napi_alloc_frag path.
> I'll give it a try when I have time.

This is a better direction than disabling PP.
Feel free to repost patch 1 separately.
On Thu, Aug 1, 2024 at 4:37 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 30 Jul 2024 08:29:58 +0300 Elad Yifee wrote:
> > Since it's probably the reason for the performance hit,
> > allocating full pages every time, I think your suggestion would improve the
> > performance and probably match it with the napi_alloc_frag path.
> > I'll give it a try when I have time.
>
> This is a better direction than disabling PP.
> Feel free to repost patch 1 separately.
> --
> pw-bot: cr
In this driver, the existence of PP is the condition to execute all
XDP-related operations, which aren't necessary on this hot path, so we
wouldn't want that anyway. On XDP program setup the rings are reallocated
and the PP would be created.
Other than that, for HWLRO we need contiguous pages of a different order
than the PP, so the creation of the PP basically prevents the use of HWLRO.
So we solve this LRO problem and get a performance boost with this
simple change.

Lorenzo's suggestion would probably improve the performance of the XDP
path and we should try that nonetheless.
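The argument above amounts to gating the page_pool on the presence of an XDP program. A hypothetical sketch of that idea follows; the ring layout and names are invented for illustration and differ from the actual patch.

```c
/*
 * Hypothetical sketch of the "use PP exclusively for XDP programs" idea:
 * the ring only owns a page_pool when an XDP program is attached, and the
 * plain SKB path falls back to the per-CPU page fragment allocator.
 * DMA mapping of the napi_alloc_frag() buffer is omitted for brevity.
 */
#include <linux/mm.h>
#include <linux/skbuff.h>
#include <net/page_pool/helpers.h>

struct rx_ring_sketch {			/* illustrative, not struct mtk_rx_ring */
	struct page_pool *page_pool;	/* non-NULL only while XDP is loaded */
	unsigned int frag_size;
	dma_addr_t dma_addr;
};

static void *rx_get_buff_sketch(struct rx_ring_sketch *ring, gfp_t gfp_mask)
{
	if (ring->page_pool) {	/* XDP attached: rings were rebuilt with a PP */
		struct page *page = page_pool_alloc_pages(ring->page_pool,
							  gfp_mask);

		if (!page)
			return NULL;
		ring->dma_addr = page_pool_get_dma_addr(page);
		return page_address(page);
	}

	/* non-XDP fast path: no page_pool and no XDP bookkeeping */
	return napi_alloc_frag(ring->frag_size);
}
```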
> On Thu, Aug 1, 2024 at 4:37 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 30 Jul 2024 08:29:58 +0300 Elad Yifee wrote:
> > > Since it's probably the reason for the performance hit,
> > > allocating full pages every time, I think your suggestion would improve the
> > > performance and probably match it with the napi_alloc_frag path.
> > > I'll give it a try when I have time.
> >
> > This is a better direction than disabling PP.
> > Feel free to repost patch 1 separately.
> > --
> > pw-bot: cr
> In this driver, the existence of PP is the condition to execute all
> XDP-related operations, which aren't necessary on this hot path, so we
> wouldn't want that anyway. On XDP program setup the rings are reallocated
> and the PP would be created.

nope, I added page_pool support even for non-XDP mode for hw that does
not support HW-LRO. I guess mtk folks can correct me if I am wrong but
IIRC there were some hw limitations on mt7986/mt7988 for HW-LRO, so I am
not sure if it can be supported.

> Other than that, for HWLRO we need contiguous pages of a different order
> than the PP, so the creation of the PP basically prevents the use of HWLRO.
> So we solve this LRO problem and get a performance boost with this
> simple change.
>
> Lorenzo's suggestion would probably improve the performance of the XDP
> path and we should try that nonetheless.

nope, I mean to improve performance even for the non-XDP case with page_pool
frag APIs.

Regards,
Lorenzo
On Thu, Aug 1, 2024 at 10:30 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> nope, I added page_pool support even for non-XDP mode for hw that does
> not support HW-LRO. I guess mtk folks can correct me if I am wrong but
> IIRC there were some hw limitations on mt7986/mt7988 for HW-LRO, so I am
> not sure if it can be supported.
I know, but if we want to add support for HWLRO alongside XDP on NETSYS2/3,
we need to prevent the PP use (for HWLRO allocations) and enable it
only when there's an XDP program.
I've been told HWLRO works on the MTK SDK version.

> > Other than that, for HWLRO we need contiguous pages of a different order
> > than the PP, so the creation of the PP basically prevents the use of HWLRO.
> > So we solve this LRO problem and get a performance boost with this
> > simple change.
> >
> > Lorenzo's suggestion would probably improve the performance of the XDP
> > path and we should try that nonetheless.
>
> nope, I mean to improve performance even for the non-XDP case with page_pool
> frag APIs.
>
> Regards,
> Lorenzo
Yes, of course it would improve the non-XDP case if we still use PP
for non-XDP, but my point is we shouldn't, mainly because of HWLRO
but also because of the extra unnecessary code.
> On Thu, Aug 1, 2024 at 10:30 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> >
> > nope, I added page_pool support even for non-XDP mode for hw that does
> > not support HW-LRO. I guess mtk folks can correct me if I am wrong but
> > IIRC there were some hw limitations on mt7986/mt7988 for HW-LRO, so I am
> > not sure if it can be supported.
> I know, but if we want to add support for HWLRO alongside XDP on NETSYS2/3,
> we need to prevent the PP use (for HWLRO allocations) and enable it
> only when there's an XDP program.
> I've been told HWLRO works on the MTK SDK version.

ack, but in this case, please also provide the HW-LRO support in the same
series. Moreover, I am not sure if it is performant enough or not; we could
increase the page_pool order. Moreover, I guess we should be sure the HW-LRO
works on all NETSYS2/3 hw revisions.

Regards,
Lorenzo

> > > Other than that, for HWLRO we need contiguous pages of a different order
> > > than the PP, so the creation of the PP basically prevents the use of HWLRO.
> > > So we solve this LRO problem and get a performance boost with this
> > > simple change.
> > >
> > > Lorenzo's suggestion would probably improve the performance of the XDP
> > > path and we should try that nonetheless.
> >
> > nope, I mean to improve performance even for the non-XDP case with page_pool
> > frag APIs.
> >
> > Regards,
> > Lorenzo
> Yes, of course it would improve the non-XDP case if we still use PP
> for non-XDP, but my point is we shouldn't, mainly because of HWLRO
> but also because of the extra unnecessary code.
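On the idea of increasing the page_pool order for HW-LRO-sized buffers, a hypothetical sketch follows; the order, pool size and flags are placeholders rather than tested mtk_eth_soc settings, and whether such a pool is performant enough is exactly the open question above.

```c
/*
 * Hypothetical sketch: a page_pool built from higher-order pages, one way
 * to provide the larger contiguous buffers HW-LRO wants.  All values are
 * placeholders, not tested mtk_eth_soc settings.
 */
#include <linux/dma-mapping.h>
#include <linux/numa.h>
#include <net/page_pool/helpers.h>	/* page_pool_create(), page_pool_params */

static struct page_pool *lro_page_pool_create_sketch(struct device *dev)
{
	struct page_pool_params pp_params = {
		.order		= 3,	/* 32 KiB per pool page with 4K PAGE_SIZE (assumption) */
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.pool_size	= 256,	/* placeholder ring depth */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
		.max_len	= PAGE_SIZE << 3,	/* sync the whole high-order page */
	};

	return page_pool_create(&pp_params);
}
```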