[V3,net-next] net: fec: add XDP_TX feature support

Message ID	20230731060025.3117343-1-wei.fang@nxp.com (mailing list archive)
State	Superseded
Delegated to:	Netdev Maintainers
Headers	show Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A756A7F8; Mon, 31 Jul 2023 06:07:28 +0000 (UTC) From: Wei Fang <wei.fang@nxp.com> To: davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, shenwei.wang@nxp.com, xiaoning.wang@nxp.com, ast@kernel.org, daniel@iogearbox.net, hawk@kernel.org, john.fastabend@gmail.com, netdev@vger.kernel.org Cc: linux-imx@nxp.com, linux-kernel@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH V3 net-next] net: fec: add XDP_TX feature support Date: Mon, 31 Jul 2023 14:00:25 +0800 Message-Id: <20230731060025.3117343-1-wei.fang@nxp.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk MIME-Version: 1.0
Series	[V3,net-next] net: fec: add XDP_TX feature support \| expand [V3,net-next] net: fec: add XDP_TX feature support

Context	Check	Description
netdev/series_format	success	Single patches do not need cover letters
netdev/tree_selection	success	Clearly marked for net-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1328 this patch: 1328
netdev/cc_maintainers	success	CCed 14 of 14 maintainers
netdev/build_clang	success	Errors and warnings before: 1351 this patch: 1351
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1351 this patch: 1351
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 148 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Wei Fang July 31, 2023, 6 a.m. UTC

The XDP_TX feature is not supported before, and all the frames
which are deemed to do XDP_TX action actually do the XDP_DROP
action. So this patch adds the XDP_TX support to FEC driver.

I tested the performance of XDP_TX feature in XDP_DRV and XDP_SKB
modes on i.MX8MM-EVK and i.MX8MP-EVK platforms respectively, and
the test steps and results are as follows.

Step 1: Board A connects to the FEC port of the DUT and runs the
pktgen_sample03_burst_single_flow.sh script to generate and send
burst traffic to DUT. Note that the length of packet was set to
64 bytes and the procotol of packet was UDP in my test scenario.

Step 2: The DUT runs the xdp2 program to transmit received UDP
packets back out on the same port where they were received.

root@imx8mmevk:~# ./xdp2 eth0
proto 17:     150326 pkt/s
proto 17:     141920 pkt/s
proto 17:     147338 pkt/s
proto 17:     140783 pkt/s
proto 17:     150400 pkt/s
proto 17:     134651 pkt/s
proto 17:     134676 pkt/s
proto 17:     134959 pkt/s
proto 17:     148152 pkt/s
proto 17:     149885 pkt/s

root@imx8mmevk:~# ./xdp2 -S eth0
proto 17:     131094 pkt/s
proto 17:     134691 pkt/s
proto 17:     138930 pkt/s
proto 17:     129347 pkt/s
proto 17:     133050 pkt/s
proto 17:     132932 pkt/s
proto 17:     136628 pkt/s
proto 17:     132964 pkt/s
proto 17:     131265 pkt/s
proto 17:     135794 pkt/s

root@imx8mpevk:~# ./xdp2 eth0
proto 17:     135817 pkt/s
proto 17:     142776 pkt/s
proto 17:     142237 pkt/s
proto 17:     135673 pkt/s
proto 17:     139508 pkt/s
proto 17:     147340 pkt/s
proto 17:     133329 pkt/s
proto 17:     141171 pkt/s
proto 17:     146917 pkt/s
proto 17:     135488 pkt/s

root@imx8mpevk:~# ./xdp2 -S eth0
proto 17:     133150 pkt/s
proto 17:     133127 pkt/s
proto 17:     133538 pkt/s
proto 17:     133094 pkt/s
proto 17:     133690 pkt/s
proto 17:     133199 pkt/s
proto 17:     133905 pkt/s
proto 17:     132908 pkt/s
proto 17:     133292 pkt/s
proto 17:     133511 pkt/s

Signed-off-by: Wei Fang <wei.fang@nxp.com>
---
V2 changes:
According to Jakub's comments, the V2 patch adds two changes.
1. Call txq_trans_cond_update() in fec_enet_xdp_tx_xmit() to avoid
tx timeout as XDP shares the queues with kernel stack.
2. Tx processing shouldn't call any XDP (or page pool) APIs if the
"budget" is 0.

V3 changes:
1. Remove the second change in V2, because this change has been
separated into another patch and it has been submmitted to the
upstream [1].
[1] https://lore.kernel.org/r/20230725074148.2936402-1-wei.fang@nxp.com
---
 drivers/net/ethernet/freescale/fec.h      |  1 +
 drivers/net/ethernet/freescale/fec_main.c | 80 ++++++++++++++++++-----
 2 files changed, 65 insertions(+), 16 deletions(-)

Larysa Zaremba Aug. 1, 2023, 3:34 p.m. UTC | #1

On Mon, Jul 31, 2023 at 02:00:25PM +0800, Wei Fang wrote:
> The XDP_TX feature is not supported before, and all the frames
> which are deemed to do XDP_TX action actually do the XDP_DROP
> action. So this patch adds the XDP_TX support to FEC driver.
> 
> I tested the performance of XDP_TX feature in XDP_DRV and XDP_SKB
> modes on i.MX8MM-EVK and i.MX8MP-EVK platforms respectively, and
> the test steps and results are as follows.
> 
> Step 1: Board A connects to the FEC port of the DUT and runs the
> pktgen_sample03_burst_single_flow.sh script to generate and send
> burst traffic to DUT. Note that the length of packet was set to
> 64 bytes and the procotol of packet was UDP in my test scenario.
> 
> Step 2: The DUT runs the xdp2 program to transmit received UDP
> packets back out on the same port where they were received.
> 
> root@imx8mmevk:~# ./xdp2 eth0
> proto 17:     150326 pkt/s
> proto 17:     141920 pkt/s
> proto 17:     147338 pkt/s
> proto 17:     140783 pkt/s
> proto 17:     150400 pkt/s
> proto 17:     134651 pkt/s
> proto 17:     134676 pkt/s
> proto 17:     134959 pkt/s
> proto 17:     148152 pkt/s
> proto 17:     149885 pkt/s
> 
> root@imx8mmevk:~# ./xdp2 -S eth0
> proto 17:     131094 pkt/s
> proto 17:     134691 pkt/s
> proto 17:     138930 pkt/s
> proto 17:     129347 pkt/s
> proto 17:     133050 pkt/s
> proto 17:     132932 pkt/s
> proto 17:     136628 pkt/s
> proto 17:     132964 pkt/s
> proto 17:     131265 pkt/s
> proto 17:     135794 pkt/s
> 
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17:     135817 pkt/s
> proto 17:     142776 pkt/s
> proto 17:     142237 pkt/s
> proto 17:     135673 pkt/s
> proto 17:     139508 pkt/s
> proto 17:     147340 pkt/s
> proto 17:     133329 pkt/s
> proto 17:     141171 pkt/s
> proto 17:     146917 pkt/s
> proto 17:     135488 pkt/s
> 
> root@imx8mpevk:~# ./xdp2 -S eth0
> proto 17:     133150 pkt/s
> proto 17:     133127 pkt/s
> proto 17:     133538 pkt/s
> proto 17:     133094 pkt/s
> proto 17:     133690 pkt/s
> proto 17:     133199 pkt/s
> proto 17:     133905 pkt/s
> proto 17:     132908 pkt/s
> proto 17:     133292 pkt/s
> proto 17:     133511 pkt/s
> 
> Signed-off-by: Wei Fang <wei.fang@nxp.com>

Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>

But I thought XDP-related code should go to bpf-next :/
Could somebody clarify this?

> ---
> V2 changes:
> According to Jakub's comments, the V2 patch adds two changes.
> 1. Call txq_trans_cond_update() in fec_enet_xdp_tx_xmit() to avoid
> tx timeout as XDP shares the queues with kernel stack.
> 2. Tx processing shouldn't call any XDP (or page pool) APIs if the
> "budget" is 0.
> 
> V3 changes:
> 1. Remove the second change in V2, because this change has been
> separated into another patch and it has been submmitted to the
> upstream [1].
> [1] https://lore.kernel.org/r/20230725074148.2936402-1-wei.fang@nxp.com
> ---
>  drivers/net/ethernet/freescale/fec.h      |  1 +
>  drivers/net/ethernet/freescale/fec_main.c | 80 ++++++++++++++++++-----
>  2 files changed, 65 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 8f1edcca96c4..f35445bddc7a 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -547,6 +547,7 @@ enum {
>  enum fec_txbuf_type {
>  	FEC_TXBUF_T_SKB,
>  	FEC_TXBUF_T_XDP_NDO,
> +	FEC_TXBUF_T_XDP_TX,
>  };
>  
>  struct fec_tx_buffer {
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 14d0dc7ba3c9..2068fe95504e 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -75,6 +75,8 @@
>  
>  static void set_multicast_list(struct net_device *ndev);
>  static void fec_enet_itr_coal_set(struct net_device *ndev);
> +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> +				struct xdp_buff *xdp);
>  
>  #define DRIVER_NAME	"fec"
>  
> @@ -960,7 +962,8 @@ static void fec_enet_bd_init(struct net_device *dev)
>  					txq->tx_buf[i].skb = NULL;
>  				}
>  			} else {
> -				if (bdp->cbd_bufaddr)
> +				if (bdp->cbd_bufaddr &&
> +				    txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO)
>  					dma_unmap_single(&fep->pdev->dev,
>  							 fec32_to_cpu(bdp->cbd_bufaddr),
>  							 fec16_to_cpu(bdp->cbd_datlen),
> @@ -1423,7 +1426,8 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
>  				break;
>  
>  			xdpf = txq->tx_buf[index].xdp;
> -			if (bdp->cbd_bufaddr)
> +			if (bdp->cbd_bufaddr &&
> +			    txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
>  				dma_unmap_single(&fep->pdev->dev,
>  						 fec32_to_cpu(bdp->cbd_bufaddr),
>  						 fec16_to_cpu(bdp->cbd_datlen),
> @@ -1482,7 +1486,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
>  			/* Free the sk buffer associated with this last transmit */
>  			dev_kfree_skb_any(skb);
>  		} else {
> -			xdp_return_frame(xdpf);
> +			xdp_return_frame_rx_napi(xdpf);
>  
>  			txq->tx_buf[index].xdp = NULL;
>  			/* restore default tx buffer type: FEC_TXBUF_T_SKB */
> @@ -1573,11 +1577,18 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
>  		}
>  		break;
>  
> -	default:
> -		bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
> -		fallthrough;
> -
>  	case XDP_TX:
> +		err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
> +		if (err) {
> +			ret = FEC_ENET_XDP_CONSUMED;
> +			page = virt_to_head_page(xdp->data);
> +			page_pool_put_page(rxq->page_pool, page, sync, true);
> +		} else {
> +			ret = FEC_ENET_XDP_TX;
> +		}
> +		break;
> +
> +	default:
>  		bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
>  		fallthrough;
>  
> @@ -3793,7 +3804,8 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private *fep, int index)
>  
>  static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>  				   struct fec_enet_priv_tx_q *txq,
> -				   struct xdp_frame *frame)
> +				   struct xdp_frame *frame,
> +				   bool ndo_xmit)
>  {
>  	unsigned int index, status, estatus;
>  	struct bufdesc *bdp;
> @@ -3813,10 +3825,24 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>  
>  	index = fec_enet_get_bd_index(bdp, &txq->bd);
>  
> -	dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> -				  frame->len, DMA_TO_DEVICE);
> -	if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> -		return -ENOMEM;
> +	if (ndo_xmit) {
> +		dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> +					  frame->len, DMA_TO_DEVICE);
> +		if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> +			return -ENOMEM;
> +
> +		txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> +	} else {
> +		struct page *page = virt_to_page(frame->data);
> +
> +		dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
> +			   frame->headroom;
> +		dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
> +					   frame->len, DMA_BIDIRECTIONAL);
> +		txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
> +	}
> +
> +	txq->tx_buf[index].xdp = frame;
>  
>  	status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
>  	if (fep->bufdesc_ex)
> @@ -3835,9 +3861,6 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>  		ebdp->cbd_esc = cpu_to_fec32(estatus);
>  	}
>  
> -	txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> -	txq->tx_buf[index].xdp = frame;
> -
>  	/* Make sure the updates to rest of the descriptor are performed before
>  	 * transferring ownership.
>  	 */
> @@ -3863,6 +3886,31 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>  	return 0;
>  }
>  
> +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> +				struct xdp_buff *xdp)
> +{
> +	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
> +	struct fec_enet_private *fep = netdev_priv(ndev);
> +	struct fec_enet_priv_tx_q *txq;
> +	int cpu = smp_processor_id();
> +	struct netdev_queue *nq;
> +	int queue, ret;
> +
> +	queue = fec_enet_xdp_get_tx_queue(fep, cpu);
> +	txq = fep->tx_queue[queue];
> +	nq = netdev_get_tx_queue(fep->netdev, queue);
> +
> +	__netif_tx_lock(nq, cpu);
> +
> +	/* Avoid tx timeout as XDP shares the queue with kernel stack */
> +	txq_trans_cond_update(nq);
> +	ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, false);
> +
> +	__netif_tx_unlock(nq);
> +
> +	return ret;
> +}
> +
>  static int fec_enet_xdp_xmit(struct net_device *dev,
>  			     int num_frames,
>  			     struct xdp_frame **frames,
> @@ -3885,7 +3933,7 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
>  	/* Avoid tx timeout as XDP shares the queue with kernel stack */
>  	txq_trans_cond_update(nq);
>  	for (i = 0; i < num_frames; i++) {
> -		if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
> +		if (fec_enet_txq_xmit_frame(fep, txq, frames[i], true) < 0)
>  			break;
>  		sent_frames++;
>  	}
> -- 
> 2.25.1
> 
>

Jakub Kicinski Aug. 1, 2023, 7:51 p.m. UTC | #2

On Tue, 1 Aug 2023 17:34:00 +0200 Larysa Zaremba wrote:
> But I thought XDP-related code should go to bpf-next :/
> Could somebody clarify this?

AFAIU pure driver XDP implementations end up flowing directly into
net-next most of the time. If there are core changes or new API
additions in the series - that's when bpf-next becomes important.

Jakub Kicinski Aug. 1, 2023, 9:57 p.m. UTC | #3

On Mon, 31 Jul 2023 14:00:25 +0800 Wei Fang wrote:
>  	case XDP_TX:
> +		err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
> +		if (err) {
> +			ret = FEC_ENET_XDP_CONSUMED;
> +			page = virt_to_head_page(xdp->data);
> +			page_pool_put_page(rxq->page_pool, page, sync, true);

The error path should call trace_xdp_exception().

Wei Fang Aug. 2, 2023, 2:43 a.m. UTC | #4

> -----Original Message-----
> Subject: Re: [PATCH V3 net-next] net: fec: add XDP_TX feature support
> 
> On Mon, 31 Jul 2023 14:00:25 +0800 Wei Fang wrote:
> >  	case XDP_TX:
> > +		err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
> > +		if (err) {
> > +			ret = FEC_ENET_XDP_CONSUMED;
> > +			page = virt_to_head_page(xdp->data);
> > +			page_pool_put_page(rxq->page_pool, page, sync, true);
> 
> The error path should call trace_xdp_exception().

Thanks for your reminder, it made me realize that the error processing of
other XDP actions also needs to add exception tracing. I'll add the exception
tracing for XDP_TX in V4 patch, and add the tracing for other XDP actions in
a separate patch.

Jesper Dangaard Brouer Aug. 2, 2023, 9:26 a.m. UTC | #5

On 31/07/2023 08.00, Wei Fang wrote:
> The XDP_TX feature is not supported before, and all the frames
> which are deemed to do XDP_TX action actually do the XDP_DROP
> action. So this patch adds the XDP_TX support to FEC driver.
> 
> I tested the performance of XDP_TX feature in XDP_DRV and XDP_SKB
> modes on i.MX8MM-EVK and i.MX8MP-EVK platforms respectively, and
> the test steps and results are as follows.
> 
> Step 1: Board A connects to the FEC port of the DUT and runs the
> pktgen_sample03_burst_single_flow.sh script to generate and send
> burst traffic to DUT. Note that the length of packet was set to
> 64 bytes and the procotol of packet was UDP in my test scenario.
> 
> Step 2: The DUT runs the xdp2 program to transmit received UDP
> packets back out on the same port where they were received.
> 

Below test result runs should have some more explaination, please.
(more inline code comments below)

> root@imx8mmevk:~# ./xdp2 eth0
> proto 17:     150326 pkt/s
> proto 17:     141920 pkt/s
> proto 17:     147338 pkt/s
> proto 17:     140783 pkt/s
> proto 17:     150400 pkt/s
> proto 17:     134651 pkt/s
> proto 17:     134676 pkt/s
> proto 17:     134959 pkt/s
> proto 17:     148152 pkt/s
> proto 17:     149885 pkt/s
> 
> root@imx8mmevk:~# ./xdp2 -S eth0
> proto 17:     131094 pkt/s
> proto 17:     134691 pkt/s
> proto 17:     138930 pkt/s
> proto 17:     129347 pkt/s
> proto 17:     133050 pkt/s
> proto 17:     132932 pkt/s
> proto 17:     136628 pkt/s
> proto 17:     132964 pkt/s
> proto 17:     131265 pkt/s
> proto 17:     135794 pkt/s
> 
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17:     135817 pkt/s
> proto 17:     142776 pkt/s
> proto 17:     142237 pkt/s
> proto 17:     135673 pkt/s
> proto 17:     139508 pkt/s
> proto 17:     147340 pkt/s
> proto 17:     133329 pkt/s
> proto 17:     141171 pkt/s
> proto 17:     146917 pkt/s
> proto 17:     135488 pkt/s
> 
> root@imx8mpevk:~# ./xdp2 -S eth0
> proto 17:     133150 pkt/s
> proto 17:     133127 pkt/s
> proto 17:     133538 pkt/s
> proto 17:     133094 pkt/s
> proto 17:     133690 pkt/s
> proto 17:     133199 pkt/s
> proto 17:     133905 pkt/s
> proto 17:     132908 pkt/s
> proto 17:     133292 pkt/s
> proto 17:     133511 pkt/s
> 

For this driver, I would like to see a benchmark comparison between
XDP_TX and XDP_REDIRECT.

As below code does could create a situation where XDP_REDIRECT is just
as fast as XDP_TX.  (Note, that I expect XDP_TX to be faster than
XDP_REDIRECT.)

> Signed-off-by: Wei Fang <wei.fang@nxp.com>
> ---
> V2 changes:
> According to Jakub's comments, the V2 patch adds two changes.
> 1. Call txq_trans_cond_update() in fec_enet_xdp_tx_xmit() to avoid
> tx timeout as XDP shares the queues with kernel stack.
> 2. Tx processing shouldn't call any XDP (or page pool) APIs if the
> "budget" is 0.
> 
> V3 changes:
> 1. Remove the second change in V2, because this change has been
> separated into another patch and it has been submmitted to the
> upstream [1].
> [1] https://lore.kernel.org/r/20230725074148.2936402-1-wei.fang@nxp.com
> ---
>   drivers/net/ethernet/freescale/fec.h      |  1 +
>   drivers/net/ethernet/freescale/fec_main.c | 80 ++++++++++++++++++-----
>   2 files changed, 65 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
> index 8f1edcca96c4..f35445bddc7a 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -547,6 +547,7 @@ enum {
>   enum fec_txbuf_type {
>   	FEC_TXBUF_T_SKB,
>   	FEC_TXBUF_T_XDP_NDO,
> +	FEC_TXBUF_T_XDP_TX,
>   };
>   
>   struct fec_tx_buffer {
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 14d0dc7ba3c9..2068fe95504e 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -75,6 +75,8 @@
>   
>   static void set_multicast_list(struct net_device *ndev);
>   static void fec_enet_itr_coal_set(struct net_device *ndev);
> +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> +				struct xdp_buff *xdp);
>   
>   #define DRIVER_NAME	"fec"
>   
> @@ -960,7 +962,8 @@ static void fec_enet_bd_init(struct net_device *dev)
>   					txq->tx_buf[i].skb = NULL;
>   				}
>   			} else {
> -				if (bdp->cbd_bufaddr)
> +				if (bdp->cbd_bufaddr &&
> +				    txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO)
>   					dma_unmap_single(&fep->pdev->dev,
>   							 fec32_to_cpu(bdp->cbd_bufaddr),
>   							 fec16_to_cpu(bdp->cbd_datlen),
> @@ -1423,7 +1426,8 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
>   				break;
>   
>   			xdpf = txq->tx_buf[index].xdp;
> -			if (bdp->cbd_bufaddr)
> +			if (bdp->cbd_bufaddr &&
> +			    txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
>   				dma_unmap_single(&fep->pdev->dev,
>   						 fec32_to_cpu(bdp->cbd_bufaddr),
>   						 fec16_to_cpu(bdp->cbd_datlen),
> @@ -1482,7 +1486,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
>   			/* Free the sk buffer associated with this last transmit */
>   			dev_kfree_skb_any(skb);
>   		} else {
> -			xdp_return_frame(xdpf);
> +			xdp_return_frame_rx_napi(xdpf);
>   
>   			txq->tx_buf[index].xdp = NULL;
>   			/* restore default tx buffer type: FEC_TXBUF_T_SKB */
> @@ -1573,11 +1577,18 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
>   		}
>   		break;
>   
> -	default:
> -		bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
> -		fallthrough;
> -
>   	case XDP_TX:
> +		err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);

You should pass along the "sync" length value to fec_enet_xdp_tx_xmit().
Because we know DMA comes from same device (it is already DMA mapped
to), then we can do a DMA sync "to_device" with only the sync length.

> +		if (err) {

Add an unlikely(err) or do like above case XDP_REDIRECT, where it takes
the likely case "if (!err)" first.

> +			ret = FEC_ENET_XDP_CONSUMED;
> +			page = virt_to_head_page(xdp->data);
> +			page_pool_put_page(rxq->page_pool, page, sync, true);
> +		} else {
> +			ret = FEC_ENET_XDP_TX;
> +		}
> +		break;
> +
> +	default:
>   		bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
>   		fallthrough;
>   
> @@ -3793,7 +3804,8 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private *fep, int index)
>   
>   static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>   				   struct fec_enet_priv_tx_q *txq,
> -				   struct xdp_frame *frame)
> +				   struct xdp_frame *frame,
> +				   bool ndo_xmit)

E.g add parameter dma_sync_len.

>   {
>   	unsigned int index, status, estatus;
>   	struct bufdesc *bdp;
> @@ -3813,10 +3825,24 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>   
>   	index = fec_enet_get_bd_index(bdp, &txq->bd);
>   
> -	dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> -				  frame->len, DMA_TO_DEVICE);
> -	if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> -		return -ENOMEM;
> +	if (ndo_xmit) {
> +		dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> +					  frame->len, DMA_TO_DEVICE);
> +		if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> +			return -ENOMEM;
> +
> +		txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> +	} else {
> +		struct page *page = virt_to_page(frame->data);
> +
> +		dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
> +			   frame->headroom;
> +		dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
> +					   frame->len, DMA_BIDIRECTIONAL);

Optimization: use dma_sync_len here instead of frame->len.

> +		txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
> +	}
> +
> +	txq->tx_buf[index].xdp = frame;
>   
>   	status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
>   	if (fep->bufdesc_ex)
> @@ -3835,9 +3861,6 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>   		ebdp->cbd_esc = cpu_to_fec32(estatus);
>   	}
>   
> -	txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> -	txq->tx_buf[index].xdp = frame;
> -
>   	/* Make sure the updates to rest of the descriptor are performed before
>   	 * transferring ownership.
>   	 */
> @@ -3863,6 +3886,31 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>   	return 0;
>   }
>   
> +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> +				struct xdp_buff *xdp)
> +{

E.g add parameter dma_sync_len.

> +	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);

XDP_TX can avoid this conversion to xdp_frame.
It would requires some refactor of fec_enet_txq_xmit_frame().

> +	struct fec_enet_private *fep = netdev_priv(ndev);
> +	struct fec_enet_priv_tx_q *txq;
> +	int cpu = smp_processor_id();
> +	struct netdev_queue *nq;
> +	int queue, ret;
> +
> +	queue = fec_enet_xdp_get_tx_queue(fep, cpu);
> +	txq = fep->tx_queue[queue];
> +	nq = netdev_get_tx_queue(fep->netdev, queue);
> +
> +	__netif_tx_lock(nq, cpu);

It is sad that XDP_TX takes a lock for each frame.

> +
> +	/* Avoid tx timeout as XDP shares the queue with kernel stack */
> +	txq_trans_cond_update(nq);
> +	ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, false);

Add/pass parameter dma_sync_len to fec_enet_txq_xmit_frame().


> +
> +	__netif_tx_unlock(nq);
> +
> +	return ret;
> +}
> +
>   static int fec_enet_xdp_xmit(struct net_device *dev,
>   			     int num_frames,
>   			     struct xdp_frame **frames,
> @@ -3885,7 +3933,7 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
>   	/* Avoid tx timeout as XDP shares the queue with kernel stack */
>   	txq_trans_cond_update(nq);
>   	for (i = 0; i < num_frames; i++) {
> -		if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
> +		if (fec_enet_txq_xmit_frame(fep, txq, frames[i], true) < 0)
>   			break;
>   		sent_frames++;
>   	}

Wei Fang Aug. 2, 2023, 9:59 a.m. UTC | #6

>
> On 31/07/2023 08.00, Wei Fang wrote:
> > The XDP_TX feature is not supported before, and all the frames
> > which are deemed to do XDP_TX action actually do the XDP_DROP
> > action. So this patch adds the XDP_TX support to FEC driver.
> >
> > I tested the performance of XDP_TX feature in XDP_DRV and XDP_SKB
> > modes on i.MX8MM-EVK and i.MX8MP-EVK platforms respectively, and
> > the test steps and results are as follows.
> >
> > Step 1: Board A connects to the FEC port of the DUT and runs the
> > pktgen_sample03_burst_single_flow.sh script to generate and send
> > burst traffic to DUT. Note that the length of packet was set to
> > 64 bytes and the procotol of packet was UDP in my test scenario.
> >
> > Step 2: The DUT runs the xdp2 program to transmit received UDP
> > packets back out on the same port where they were received.
> >
>
> Below test result runs should have some more explaination, please.
> (more inline code comments below)
>
> > root@imx8mmevk:~# ./xdp2 eth0
> > proto 17:     150326 pkt/s
> > proto 17:     141920 pkt/s
> > proto 17:     147338 pkt/s
> > proto 17:     140783 pkt/s
> > proto 17:     150400 pkt/s
> > proto 17:     134651 pkt/s
> > proto 17:     134676 pkt/s
> > proto 17:     134959 pkt/s
> > proto 17:     148152 pkt/s
> > proto 17:     149885 pkt/s
> >
> > root@imx8mmevk:~# ./xdp2 -S eth0
> > proto 17:     131094 pkt/s
> > proto 17:     134691 pkt/s
> > proto 17:     138930 pkt/s
> > proto 17:     129347 pkt/s
> > proto 17:     133050 pkt/s
> > proto 17:     132932 pkt/s
> > proto 17:     136628 pkt/s
> > proto 17:     132964 pkt/s
> > proto 17:     131265 pkt/s
> > proto 17:     135794 pkt/s
> >
> > root@imx8mpevk:~# ./xdp2 eth0
> > proto 17:     135817 pkt/s
> > proto 17:     142776 pkt/s
> > proto 17:     142237 pkt/s
> > proto 17:     135673 pkt/s
> > proto 17:     139508 pkt/s
> > proto 17:     147340 pkt/s
> > proto 17:     133329 pkt/s
> > proto 17:     141171 pkt/s
> > proto 17:     146917 pkt/s
> > proto 17:     135488 pkt/s
> >
> > root@imx8mpevk:~# ./xdp2 -S eth0
> > proto 17:     133150 pkt/s
> > proto 17:     133127 pkt/s
> > proto 17:     133538 pkt/s
> > proto 17:     133094 pkt/s
> > proto 17:     133690 pkt/s
> > proto 17:     133199 pkt/s
> > proto 17:     133905 pkt/s
> > proto 17:     132908 pkt/s
> > proto 17:     133292 pkt/s
> > proto 17:     133511 pkt/s
> >
>
> For this driver, I would like to see a benchmark comparison between
> XDP_TX and XDP_REDIRECT.
>
Okay, I'll do a comparison test.

> As below code does could create a situation where XDP_REDIRECT is just
> as fast as XDP_TX.  (Note, that I expect XDP_TX to be faster than
> XDP_REDIRECT.)
>
Could you explain why you expect XDP_TX should be faster than XDP_REDIRECT?
What's the problem if XDP_TX is as fast ad XDP_REDIRECT?

> > Signed-off-by: Wei Fang <wei.fang@nxp.com>
> > ---
> > V2 changes:
> > According to Jakub's comments, the V2 patch adds two changes.
> > 1. Call txq_trans_cond_update() in fec_enet_xdp_tx_xmit() to avoid
> > tx timeout as XDP shares the queues with kernel stack.
> > 2. Tx processing shouldn't call any XDP (or page pool) APIs if the
> > "budget" is 0.
> >
> > V3 changes:
> > 1. Remove the second change in V2, because this change has been
> > separated into another patch and it has been submmitted to the
> > upstream [1].
> > [1]
> https://lore.k/
> ernel.org%2Fr%2F20230725074148.2936402-1-wei.fang%40nxp.com&data=
> 05%7C01%7Cwei.fang%40nxp.com%7C9a2fc5bab84947e4bea608db933aa5
> e9%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C638265652320
> 018962%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV
> 2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=wc
> xe8nBeLS9uQrbphuNI18owgDNHJq9478V53KybWB8%3D&reserved=0
> > ---
> >   drivers/net/ethernet/freescale/fec.h      |  1 +
> >   drivers/net/ethernet/freescale/fec_main.c | 80
> ++++++++++++++++++-----
> >   2 files changed, 65 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/freescale/fec.h
> b/drivers/net/ethernet/freescale/fec.h
> > index 8f1edcca96c4..f35445bddc7a 100644
> > --- a/drivers/net/ethernet/freescale/fec.h
> > +++ b/drivers/net/ethernet/freescale/fec.h
> > @@ -547,6 +547,7 @@ enum {
> >   enum fec_txbuf_type {
> >     FEC_TXBUF_T_SKB,
> >     FEC_TXBUF_T_XDP_NDO,
> > +   FEC_TXBUF_T_XDP_TX,
> >   };
> >
> >   struct fec_tx_buffer {
> > diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> > index 14d0dc7ba3c9..2068fe95504e 100644
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -75,6 +75,8 @@
> >
> >   static void set_multicast_list(struct net_device *ndev);
> >   static void fec_enet_itr_coal_set(struct net_device *ndev);
> > +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> > +                           struct xdp_buff *xdp);
> >
> >   #define DRIVER_NAME       "fec"
> >
> > @@ -960,7 +962,8 @@ static void fec_enet_bd_init(struct net_device
> *dev)
> >                                     txq->tx_buf[i].skb = NULL;
> >                             }
> >                     } else {
> > -                           if (bdp->cbd_bufaddr)
> > +                           if (bdp->cbd_bufaddr &&
> > +                               txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO)
> >                                     dma_unmap_single(&fep->pdev->dev,
> >                                                      fec32_to_cpu(bdp->cbd_bufaddr),
> >                                                      fec16_to_cpu(bdp->cbd_datlen),
> > @@ -1423,7 +1426,8 @@ fec_enet_tx_queue(struct net_device *ndev, u16
> queue_id, int budget)
> >                             break;
> >
> >                     xdpf = txq->tx_buf[index].xdp;
> > -                   if (bdp->cbd_bufaddr)
> > +                   if (bdp->cbd_bufaddr &&
> > +                       txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
> >                             dma_unmap_single(&fep->pdev->dev,
> >                                              fec32_to_cpu(bdp->cbd_bufaddr),
> >                                              fec16_to_cpu(bdp->cbd_datlen),
> > @@ -1482,7 +1486,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16
> queue_id, int budget)
> >                     /* Free the sk buffer associated with this last transmit */
> >                     dev_kfree_skb_any(skb);
> >             } else {
> > -                   xdp_return_frame(xdpf);
> > +                   xdp_return_frame_rx_napi(xdpf);
> >
> >                     txq->tx_buf[index].xdp = NULL;
> >                     /* restore default tx buffer type: FEC_TXBUF_T_SKB */
> > @@ -1573,11 +1577,18 @@ fec_enet_run_xdp(struct fec_enet_private
> *fep, struct bpf_prog *prog,
> >             }
> >             break;
> >
> > -   default:
> > -           bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
> > -           fallthrough;
> > -
> >     case XDP_TX:
> > +           err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
>
> You should pass along the "sync" length value to fec_enet_xdp_tx_xmit().
> Because we know DMA comes from same device (it is already DMA mapped
> to), then we can do a DMA sync "to_device" with only the sync length.
>
> > +           if (err) {
>
> Add an unlikely(err) or do like above case XDP_REDIRECT, where it takes
> the likely case "if (!err)" first.
>
> > +                   ret = FEC_ENET_XDP_CONSUMED;
> > +                   page = virt_to_head_page(xdp->data);
> > +                   page_pool_put_page(rxq->page_pool, page, sync, true);
> > +           } else {
> > +                   ret = FEC_ENET_XDP_TX;
> > +           }
> > +           break;
> > +
> > +   default:
> >             bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
> >             fallthrough;
> >
> > @@ -3793,7 +3804,8 @@ fec_enet_xdp_get_tx_queue(struct
> fec_enet_private *fep, int index)
> >
> >   static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> >                                struct fec_enet_priv_tx_q *txq,
> > -                              struct xdp_frame *frame)
> > +                              struct xdp_frame *frame,
> > +                              bool ndo_xmit)
>
> E.g add parameter dma_sync_len.
>
> >   {
> >     unsigned int index, status, estatus;
> >     struct bufdesc *bdp;
> > @@ -3813,10 +3825,24 @@ static int fec_enet_txq_xmit_frame(struct
> fec_enet_private *fep,
> >
> >     index = fec_enet_get_bd_index(bdp, &txq->bd);
> >
> > -   dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> > -                             frame->len, DMA_TO_DEVICE);
> > -   if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> > -           return -ENOMEM;
> > +   if (ndo_xmit) {
> > +           dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> > +                                     frame->len, DMA_TO_DEVICE);
> > +           if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> > +                   return -ENOMEM;
> > +
> > +           txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> > +   } else {
> > +           struct page *page = virt_to_page(frame->data);
> > +
> > +           dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
> > +                      frame->headroom;
> > +           dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
> > +                                      frame->len, DMA_BIDIRECTIONAL);
>
> Optimization: use dma_sync_len here instead of frame->len.
>
> > +           txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
> > +   }
> > +
> > +   txq->tx_buf[index].xdp = frame;
> >
> >     status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
> >     if (fep->bufdesc_ex)
> > @@ -3835,9 +3861,6 @@ static int fec_enet_txq_xmit_frame(struct
> fec_enet_private *fep,
> >             ebdp->cbd_esc = cpu_to_fec32(estatus);
> >     }
> >
> > -   txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> > -   txq->tx_buf[index].xdp = frame;
> > -
> >     /* Make sure the updates to rest of the descriptor are performed
> before
> >      * transferring ownership.
> >      */
> > @@ -3863,6 +3886,31 @@ static int fec_enet_txq_xmit_frame(struct
> fec_enet_private *fep,
> >     return 0;
> >   }
> >
> > +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> > +                           struct xdp_buff *xdp)
> > +{
>
> E.g add parameter dma_sync_len.
>
> > +   struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
>
> XDP_TX can avoid this conversion to xdp_frame.
> It would requires some refactor of fec_enet_txq_xmit_frame().
>
> > +   struct fec_enet_private *fep = netdev_priv(ndev);
> > +   struct fec_enet_priv_tx_q *txq;
> > +   int cpu = smp_processor_id();
> > +   struct netdev_queue *nq;
> > +   int queue, ret;
> > +
> > +   queue = fec_enet_xdp_get_tx_queue(fep, cpu);
> > +   txq = fep->tx_queue[queue];
> > +   nq = netdev_get_tx_queue(fep->netdev, queue);
> > +
> > +   __netif_tx_lock(nq, cpu);
>
> It is sad that XDP_TX takes a lock for each frame.
>
> > +
> > +   /* Avoid tx timeout as XDP shares the queue with kernel stack */
> > +   txq_trans_cond_update(nq);
> > +   ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, false);
>
> Add/pass parameter dma_sync_len to fec_enet_txq_xmit_frame().
>
>
> > +
> > +   __netif_tx_unlock(nq);
> > +
> > +   return ret;
> > +}
> > +
> >   static int fec_enet_xdp_xmit(struct net_device *dev,
> >                          int num_frames,
> >                          struct xdp_frame **frames,
> > @@ -3885,7 +3933,7 @@ static int fec_enet_xdp_xmit(struct net_device
> *dev,
> >     /* Avoid tx timeout as XDP shares the queue with kernel stack */
> >     txq_trans_cond_update(nq);
> >     for (i = 0; i < num_frames; i++) {
> > -           if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
> > +           if (fec_enet_txq_xmit_frame(fep, txq, frames[i], true) < 0)
> >                     break;
> >             sent_frames++;
> >     }

Larysa Zaremba Aug. 2, 2023, 11:46 a.m. UTC | #7

On Tue, Aug 01, 2023 at 12:51:36PM -0700, Jakub Kicinski wrote:
> On Tue, 1 Aug 2023 17:34:00 +0200 Larysa Zaremba wrote:
> > But I thought XDP-related code should go to bpf-next :/
> > Could somebody clarify this?
> 
> AFAIU pure driver XDP implementations end up flowing directly into
> net-next most of the time. If there are core changes or new API
> additions in the series - that's when bpf-next becomes important.
> 
Thanks for the explanation!

Wei Fang Aug. 2, 2023, 12:33 p.m. UTC | #8

Sorry, I missed some comments before.

> > @@ -1482,7 +1486,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16
> queue_id, int budget)
> >   			/* Free the sk buffer associated with this last transmit */
> >   			dev_kfree_skb_any(skb);
> >   		} else {
> > -			xdp_return_frame(xdpf);
> > +			xdp_return_frame_rx_napi(xdpf);
> >
> >   			txq->tx_buf[index].xdp = NULL;
> >   			/* restore default tx buffer type: FEC_TXBUF_T_SKB */
> > @@ -1573,11 +1577,18 @@ fec_enet_run_xdp(struct fec_enet_private
> *fep, struct bpf_prog *prog,
> >   		}
> >   		break;
> >
> > -	default:
> > -		bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
> > -		fallthrough;
> > -
> >   	case XDP_TX:
> > +		err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
> 
> You should pass along the "sync" length value to fec_enet_xdp_tx_xmit().
> Because we know DMA comes from same device (it is already DMA mapped
> to), then we can do a DMA sync "to_device" with only the sync length.
> 
I think it's okay if the frame length does not change, but if we increase the
length of the received frame, such as add a VLAN header into the frame, I
think the "sync" length value is not correct.

> > +		if (err) {
> 
> Add an unlikely(err) or do like above case XDP_REDIRECT, where it takes
> the likely case "if (!err)" first.
Yes, you are right, I will improve it.

> 
> > +			ret = FEC_ENET_XDP_CONSUMED;
> > +			page = virt_to_head_page(xdp->data);
> > +			page_pool_put_page(rxq->page_pool, page, sync, true);
> > +		} else {
> > +			ret = FEC_ENET_XDP_TX;
> > +		}
> > +		break;
> > +
> > +	default:
> >   		bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
> >   		fallthrough;
> >
> > @@ -3793,7 +3804,8 @@ fec_enet_xdp_get_tx_queue(struct
> fec_enet_private *fep, int index)
> >
> >   static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> >   				   struct fec_enet_priv_tx_q *txq,
> > -				   struct xdp_frame *frame)
> > +				   struct xdp_frame *frame,
> > +				   bool ndo_xmit)
> 
> E.g add parameter dma_sync_len.
> 
> >   {
> >   	unsigned int index, status, estatus;
> >   	struct bufdesc *bdp;
> > @@ -3813,10 +3825,24 @@ static int fec_enet_txq_xmit_frame(struct
> fec_enet_private *fep,
> >
> >   	index = fec_enet_get_bd_index(bdp, &txq->bd);
> >
> > -	dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> > -				  frame->len, DMA_TO_DEVICE);
> > -	if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> > -		return -ENOMEM;
> > +	if (ndo_xmit) {
> > +		dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
> > +					  frame->len, DMA_TO_DEVICE);
> > +		if (dma_mapping_error(&fep->pdev->dev, dma_addr))
> > +			return -ENOMEM;
> > +
> > +		txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> > +	} else {
> > +		struct page *page = virt_to_page(frame->data);
> > +
> > +		dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
> > +			   frame->headroom;
> > +		dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
> > +					   frame->len, DMA_BIDIRECTIONAL);
> 
> Optimization: use dma_sync_len here instead of frame->len.
> 
> > +		txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
> > +	}
> > +
> > +	txq->tx_buf[index].xdp = frame;
> >
> >   	status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
> >   	if (fep->bufdesc_ex)
> > @@ -3835,9 +3861,6 @@ static int fec_enet_txq_xmit_frame(struct
> fec_enet_private *fep,
> >   		ebdp->cbd_esc = cpu_to_fec32(estatus);
> >   	}
> >
> > -	txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
> > -	txq->tx_buf[index].xdp = frame;
> > -
> >   	/* Make sure the updates to rest of the descriptor are performed
> before
> >   	 * transferring ownership.
> >   	 */
> > @@ -3863,6 +3886,31 @@ static int fec_enet_txq_xmit_frame(struct
> fec_enet_private *fep,
> >   	return 0;
> >   }
> >
> > +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> > +				struct xdp_buff *xdp)
> > +{
> 
> E.g add parameter dma_sync_len.
> 
> > +	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
> 
> XDP_TX can avoid this conversion to xdp_frame.
> It would requires some refactor of fec_enet_txq_xmit_frame().
> 
Yes, but I'm not intend to change it, using the existing interface is enough.

> > +	struct fec_enet_private *fep = netdev_priv(ndev);
> > +	struct fec_enet_priv_tx_q *txq;
> > +	int cpu = smp_processor_id();
> > +	struct netdev_queue *nq;
> > +	int queue, ret;
> > +
> > +	queue = fec_enet_xdp_get_tx_queue(fep, cpu);
> > +	txq = fep->tx_queue[queue];
> > +	nq = netdev_get_tx_queue(fep->netdev, queue);
> > +
> > +	__netif_tx_lock(nq, cpu);
> 
> It is sad that XDP_TX takes a lock for each frame.
> 
Yes, but the XDP path share the queue with the kernel network stack, so
we need a lock here, unless there is a dedicated queue for XDP path. Do
you have a better solution?

> > +
> > +	/* Avoid tx timeout as XDP shares the queue with kernel stack */
> > +	txq_trans_cond_update(nq);
> > +	ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, false);
> 
> Add/pass parameter dma_sync_len to fec_enet_txq_xmit_frame().
> 
> 
> > +
> > +	__netif_tx_unlock(nq);
> > +
> > +	return ret;
> > +}
> > +
> >   static int fec_enet_xdp_xmit(struct net_device *dev,
> >   			     int num_frames,
> >   			     struct xdp_frame **frames,
> > @@ -3885,7 +3933,7 @@ static int fec_enet_xdp_xmit(struct net_device
> *dev,
> >   	/* Avoid tx timeout as XDP shares the queue with kernel stack */
> >   	txq_trans_cond_update(nq);
> >   	for (i = 0; i < num_frames; i++) {
> > -		if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
> > +		if (fec_enet_txq_xmit_frame(fep, txq, frames[i], true) < 0)
> >   			break;
> >   		sent_frames++;
> >   	}

Jesper Dangaard Brouer Aug. 2, 2023, 12:34 p.m. UTC | #9

On 02/08/2023 11.59, Wei Fang wrote:
>>
>> On 31/07/2023 08.00, Wei Fang wrote:
>>> The XDP_TX feature is not supported before, and all the frames
>>> which are deemed to do XDP_TX action actually do the XDP_DROP
>>> action. So this patch adds the XDP_TX support to FEC driver.
>>>
>>> I tested the performance of XDP_TX feature in XDP_DRV and XDP_SKB
>>> modes on i.MX8MM-EVK and i.MX8MP-EVK platforms respectively, and
>>> the test steps and results are as follows.
>>>
>>> Step 1: Board A connects to the FEC port of the DUT and runs the
>>> pktgen_sample03_burst_single_flow.sh script to generate and send
>>> burst traffic to DUT. Note that the length of packet was set to
>>> 64 bytes and the procotol of packet was UDP in my test scenario.
>>>
>>> Step 2: The DUT runs the xdp2 program to transmit received UDP
>>> packets back out on the same port where they were received.
>>>
>>
>> Below test result runs should have some more explaination, please.
>> (more inline code comments below)
>>
>>> root@imx8mmevk:~# ./xdp2 eth0
>>> proto 17:     150326 pkt/s
>>> proto 17:     141920 pkt/s
>>> proto 17:     147338 pkt/s
>>> proto 17:     140783 pkt/s
>>> proto 17:     150400 pkt/s
>>> proto 17:     134651 pkt/s
>>> proto 17:     134676 pkt/s
>>> proto 17:     134959 pkt/s
>>> proto 17:     148152 pkt/s
>>> proto 17:     149885 pkt/s
>>>
>>> root@imx8mmevk:~# ./xdp2 -S eth0
>>> proto 17:     131094 pkt/s
>>> proto 17:     134691 pkt/s
>>> proto 17:     138930 pkt/s
>>> proto 17:     129347 pkt/s
>>> proto 17:     133050 pkt/s
>>> proto 17:     132932 pkt/s
>>> proto 17:     136628 pkt/s
>>> proto 17:     132964 pkt/s
>>> proto 17:     131265 pkt/s
>>> proto 17:     135794 pkt/s
>>>
>>> root@imx8mpevk:~# ./xdp2 eth0
>>> proto 17:     135817 pkt/s
>>> proto 17:     142776 pkt/s
>>> proto 17:     142237 pkt/s
>>> proto 17:     135673 pkt/s
>>> proto 17:     139508 pkt/s
>>> proto 17:     147340 pkt/s
>>> proto 17:     133329 pkt/s
>>> proto 17:     141171 pkt/s
>>> proto 17:     146917 pkt/s
>>> proto 17:     135488 pkt/s
>>>
>>> root@imx8mpevk:~# ./xdp2 -S eth0
>>> proto 17:     133150 pkt/s
>>> proto 17:     133127 pkt/s
>>> proto 17:     133538 pkt/s
>>> proto 17:     133094 pkt/s
>>> proto 17:     133690 pkt/s
>>> proto 17:     133199 pkt/s
>>> proto 17:     133905 pkt/s
>>> proto 17:     132908 pkt/s
>>> proto 17:     133292 pkt/s
>>> proto 17:     133511 pkt/s
>>>
>>
>> For this driver, I would like to see a benchmark comparison between
>> XDP_TX and XDP_REDIRECT.
>>
> Okay, I'll do a comparison test.

Thanks.

> 
>> As below code does could create a situation where XDP_REDIRECT is just
>> as fast as XDP_TX.  (Note, that I expect XDP_TX to be faster than
>> XDP_REDIRECT.)
>>
> Could you explain why you expect XDP_TX should be faster than XDP_REDIRECT?

First of all: I explained the changes needed to improve XDP_TX, below as
comments on the code. Please read and address.

XDP_TX should always be faster than XDP_REDIRECT, because it happens
locally in the driver and doesn't have to go through any generic
xdp_do_redirect code.

Like benchmarks shown in our XDP-paper[1] on mlx5 the graph[2] shows
XDP_TX vs XDP_REDIRECT vs. DPDK.

  [1] https://dl.acm.org/doi/10.1145/3281411.3281443
  [2] 
https://github.com/xdp-project/xdp-paper/blob/master/figures/redirect-test.pdf


> What's the problem if XDP_TX is as fast ad XDP_REDIRECT?
> 
>>> Signed-off-by: Wei Fang <wei.fang@nxp.com>
>>> ---
>>> V2 changes:
>>> According to Jakub's comments, the V2 patch adds two changes.
>>> 1. Call txq_trans_cond_update() in fec_enet_xdp_tx_xmit() to avoid
>>> tx timeout as XDP shares the queues with kernel stack.
>>> 2. Tx processing shouldn't call any XDP (or page pool) APIs if the
>>> "budget" is 0.
>>>
>>> V3 changes:
>>> 1. Remove the second change in V2, because this change has been
>>> separated into another patch and it has been submmitted to the
>>> upstream [1].
>>> [1]
>> https://lore.k/
>> ernel.org%2Fr%2F20230725074148.2936402-1-wei.fang%40nxp.com&data=
>> 05%7C01%7Cwei.fang%40nxp.com%7C9a2fc5bab84947e4bea608db933aa5
>> e9%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C638265652320
>> 018962%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV
>> 2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=wc
>> xe8nBeLS9uQrbphuNI18owgDNHJq9478V53KybWB8%3D&reserved=0
>>> ---
>>>    drivers/net/ethernet/freescale/fec.h      |  1 +
>>>    drivers/net/ethernet/freescale/fec_main.c | 80
>> ++++++++++++++++++-----
>>>    2 files changed, 65 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/freescale/fec.h
>> b/drivers/net/ethernet/freescale/fec.h
>>> index 8f1edcca96c4..f35445bddc7a 100644
>>> --- a/drivers/net/ethernet/freescale/fec.h
>>> +++ b/drivers/net/ethernet/freescale/fec.h
>>> @@ -547,6 +547,7 @@ enum {
>>>    enum fec_txbuf_type {
>>>      FEC_TXBUF_T_SKB,
>>>      FEC_TXBUF_T_XDP_NDO,
>>> +   FEC_TXBUF_T_XDP_TX,
>>>    };
>>>
>>>    struct fec_tx_buffer {
>>> diff --git a/drivers/net/ethernet/freescale/fec_main.c
>> b/drivers/net/ethernet/freescale/fec_main.c
>>> index 14d0dc7ba3c9..2068fe95504e 100644
>>> --- a/drivers/net/ethernet/freescale/fec_main.c
>>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>>> @@ -75,6 +75,8 @@
>>>
>>>    static void set_multicast_list(struct net_device *ndev);
>>>    static void fec_enet_itr_coal_set(struct net_device *ndev);
>>> +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
>>> +                           struct xdp_buff *xdp);
>>>
>>>    #define DRIVER_NAME       "fec"
>>>
>>> @@ -960,7 +962,8 @@ static void fec_enet_bd_init(struct net_device
>> *dev)
>>>                                      txq->tx_buf[i].skb = NULL;
>>>                              }
>>>                      } else {
>>> -                           if (bdp->cbd_bufaddr)
>>> +                           if (bdp->cbd_bufaddr &&
>>> +                               txq->tx_buf[i].type == FEC_TXBUF_T_XDP_NDO)
>>>                                      dma_unmap_single(&fep->pdev->dev,
>>>                                                       fec32_to_cpu(bdp->cbd_bufaddr),
>>>                                                       fec16_to_cpu(bdp->cbd_datlen),
>>> @@ -1423,7 +1426,8 @@ fec_enet_tx_queue(struct net_device *ndev, u16
>> queue_id, int budget)
>>>                              break;
>>>
>>>                      xdpf = txq->tx_buf[index].xdp;
>>> -                   if (bdp->cbd_bufaddr)
>>> +                   if (bdp->cbd_bufaddr &&
>>> +                       txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
>>>                              dma_unmap_single(&fep->pdev->dev,
>>>                                               fec32_to_cpu(bdp->cbd_bufaddr),
>>>                                               fec16_to_cpu(bdp->cbd_datlen),
>>> @@ -1482,7 +1486,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16
>> queue_id, int budget)
>>>                      /* Free the sk buffer associated with this last transmit */
>>>                      dev_kfree_skb_any(skb);
>>>              } else {
>>> -                   xdp_return_frame(xdpf);
>>> +                   xdp_return_frame_rx_napi(xdpf);
>>>
>>>                      txq->tx_buf[index].xdp = NULL;
>>>                      /* restore default tx buffer type: FEC_TXBUF_T_SKB */
>>> @@ -1573,11 +1577,18 @@ fec_enet_run_xdp(struct fec_enet_private
>> *fep, struct bpf_prog *prog,
>>>              }
>>>              break;
>>>
>>> -   default:
>>> -           bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
>>> -           fallthrough;
>>> -
>>>      case XDP_TX:
>>> +           err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
>>
>> You should pass along the "sync" length value to fec_enet_xdp_tx_xmit().
>> Because we know DMA comes from same device (it is already DMA mapped
>> to), then we can do a DMA sync "to_device" with only the sync length.
>>
>>> +           if (err) {
>>
>> Add an unlikely(err) or do like above case XDP_REDIRECT, where it takes
>> the likely case "if (!err)" first.
>>
>>> +                   ret = FEC_ENET_XDP_CONSUMED;
>>> +                   page = virt_to_head_page(xdp->data);
>>> +                   page_pool_put_page(rxq->page_pool, page, sync, true);
>>> +           } else {
>>> +                   ret = FEC_ENET_XDP_TX;
>>> +           }
>>> +           break;
>>> +
>>> +   default:
>>>              bpf_warn_invalid_xdp_action(fep->netdev, prog, act);
>>>              fallthrough;
>>>
>>> @@ -3793,7 +3804,8 @@ fec_enet_xdp_get_tx_queue(struct
>> fec_enet_private *fep, int index)
>>>
>>>    static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>>>                                 struct fec_enet_priv_tx_q *txq,
>>> -                              struct xdp_frame *frame)
>>> +                              struct xdp_frame *frame,
>>> +                              bool ndo_xmit)
>>
>> E.g add parameter dma_sync_len.
>>
>>>    {
>>>      unsigned int index, status, estatus;
>>>      struct bufdesc *bdp;
>>> @@ -3813,10 +3825,24 @@ static int fec_enet_txq_xmit_frame(struct
>> fec_enet_private *fep,
>>>
>>>      index = fec_enet_get_bd_index(bdp, &txq->bd);
>>>
>>> -   dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
>>> -                             frame->len, DMA_TO_DEVICE);
>>> -   if (dma_mapping_error(&fep->pdev->dev, dma_addr))
>>> -           return -ENOMEM;
>>> +   if (ndo_xmit) {
>>> +           dma_addr = dma_map_single(&fep->pdev->dev, frame->data,
>>> +                                     frame->len, DMA_TO_DEVICE);
>>> +           if (dma_mapping_error(&fep->pdev->dev, dma_addr))
>>> +                   return -ENOMEM;
>>> +
>>> +           txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
>>> +   } else {
>>> +           struct page *page = virt_to_page(frame->data);
>>> +
>>> +           dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
>>> +                      frame->headroom;
>>> +           dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
>>> +                                      frame->len, DMA_BIDIRECTIONAL);
>>
>> Optimization: use dma_sync_len here instead of frame->len.
>>
>>> +           txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
>>> +   }
>>> +
>>> +   txq->tx_buf[index].xdp = frame;
>>>
>>>      status |= (BD_ENET_TX_INTR | BD_ENET_TX_LAST);
>>>      if (fep->bufdesc_ex)
>>> @@ -3835,9 +3861,6 @@ static int fec_enet_txq_xmit_frame(struct
>> fec_enet_private *fep,
>>>              ebdp->cbd_esc = cpu_to_fec32(estatus);
>>>      }
>>>
>>> -   txq->tx_buf[index].type = FEC_TXBUF_T_XDP_NDO;
>>> -   txq->tx_buf[index].xdp = frame;
>>> -
>>>      /* Make sure the updates to rest of the descriptor are performed
>> before
>>>       * transferring ownership.
>>>       */
>>> @@ -3863,6 +3886,31 @@ static int fec_enet_txq_xmit_frame(struct
>> fec_enet_private *fep,
>>>      return 0;
>>>    }
>>>
>>> +static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
>>> +                           struct xdp_buff *xdp)
>>> +{
>>
>> E.g add parameter dma_sync_len.
>>
>>> +   struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
>>
>> XDP_TX can avoid this conversion to xdp_frame.
>> It would requires some refactor of fec_enet_txq_xmit_frame().
>>
>>> +   struct fec_enet_private *fep = netdev_priv(ndev);
>>> +   struct fec_enet_priv_tx_q *txq;
>>> +   int cpu = smp_processor_id();
>>> +   struct netdev_queue *nq;
>>> +   int queue, ret;
>>> +
>>> +   queue = fec_enet_xdp_get_tx_queue(fep, cpu);
>>> +   txq = fep->tx_queue[queue];
>>> +   nq = netdev_get_tx_queue(fep->netdev, queue);
>>> +
>>> +   __netif_tx_lock(nq, cpu);
>>
>> It is sad that XDP_TX takes a lock for each frame.
>>
>>> +
>>> +   /* Avoid tx timeout as XDP shares the queue with kernel stack */
>>> +   txq_trans_cond_update(nq);
>>> +   ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, false);
>>
>> Add/pass parameter dma_sync_len to fec_enet_txq_xmit_frame().
>>
>>
>>> +
>>> +   __netif_tx_unlock(nq);
>>> +
>>> +   return ret;
>>> +}
>>> +
>>>    static int fec_enet_xdp_xmit(struct net_device *dev,
>>>                           int num_frames,
>>>                           struct xdp_frame **frames,
>>> @@ -3885,7 +3933,7 @@ static int fec_enet_xdp_xmit(struct net_device
>> *dev,
>>>      /* Avoid tx timeout as XDP shares the queue with kernel stack */
>>>      txq_trans_cond_update(nq);
>>>      for (i = 0; i < num_frames; i++) {
>>> -           if (fec_enet_txq_xmit_frame(fep, txq, frames[i]) < 0)
>>> +           if (fec_enet_txq_xmit_frame(fep, txq, frames[i], true) < 0)
>>>                      break;
>>>              sent_frames++;
>>>      }
>

Jakub Kicinski Aug. 2, 2023, 5:47 p.m. UTC | #10

On Mon, 31 Jul 2023 14:00:25 +0800 Wei Fang wrote:
>  		} else {
> -			xdp_return_frame(xdpf);
> +			xdp_return_frame_rx_napi(xdpf);

If you implement Jesper's syncing suggestions, I think you can use

	page_pool_put_page(pool, page, 0, true);

for XDP_TX here to avoid the DMA sync on page recycle.

Wei Fang Aug. 3, 2023, 3:58 a.m. UTC | #11

> >  		} else {
> > -			xdp_return_frame(xdpf);
> > +			xdp_return_frame_rx_napi(xdpf);
> 
> If you implement Jesper's syncing suggestions, I think you can use
> 
> 	page_pool_put_page(pool, page, 0, true);
> 
> for XDP_TX here to avoid the DMA sync on page recycle.

I tried Jasper's syncing suggestion and used page_pool_put_page() to recycle
pages, but the results does not seem to improve the performance of XDP_TX,
it even degrades the speed. 

The result of the current modification.
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     260180 pkt/s
proto 17:     260373 pkt/s
proto 17:     260363 pkt/s
proto 17:     259036 pkt/s
proto 17:     260180 pkt/s
proto 17:     260048 pkt/s
proto 17:     260029 pkt/s
proto 17:     260133 pkt/s
proto 17:     260021 pkt/s
proto 17:     260203 pkt/s
proto 17:     260293 pkt/s
proto 17:     259418 pkt/s

After using the sync suggestion, the result shows as follow.
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     255956 pkt/s
proto 17:     255841 pkt/s
proto 17:     255835 pkt/s
proto 17:     255381 pkt/s
proto 17:     255736 pkt/s
proto 17:     255779 pkt/s
proto 17:     254135 pkt/s
proto 17:     255584 pkt/s
proto 17:     255855 pkt/s
proto 17:     255664 pkt/s

Below are my changes, I don't know what cause it. Based on the results,
it's better to keep the current modification.

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index d5fda24a4c52..415c0cb83f84 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -77,7 +77,8 @@
 static void set_multicast_list(struct net_device *ndev);
 static void fec_enet_itr_coal_set(struct net_device *ndev);
 static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
-                               struct xdp_buff *xdp);
+                               struct xdp_buff *xdp,
+                               u32 dma_sync_len);

 #define DRIVER_NAME    "fec"

@@ -1487,7 +1488,14 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
                        /* Free the sk buffer associated with this last transmit */
                        dev_kfree_skb_any(skb);
                } else {
-                       xdp_return_frame_rx_napi(xdpf);
+                       if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
+                               xdp_return_frame_rx_napi(xdpf);
+                       else {
+                               struct page *page;
+
+                               page = virt_to_head_page(xdpf->data);
+                               page_pool_put_page(page->pp, page, 0, true);
+                       }

                        txq->tx_buf[index].xdp = NULL;
                        /* restore default tx buffer type: FEC_TXBUF_T_SKB */
@@ -1557,7 +1565,8 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
        act = bpf_prog_run_xdp(prog, xdp);

        /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
-       sync = xdp->data_end - xdp->data_hard_start - FEC_ENET_XDP_HEADROOM;
+       sync = xdp->data_end - xdp->data;
        sync = max(sync, len);

        switch (act) {
@@ -1579,7 +1588,7 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
                break;

        case XDP_TX:
-               err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
+               err = fec_enet_xdp_tx_xmit(fep->netdev, xdp, sync);
                if (unlikely(err)) {
                        ret = FEC_ENET_XDP_CONSUMED;
                        page = virt_to_head_page(xdp->data);
@@ -3807,6 +3816,7 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private *fep, int index)
 static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
                                   struct fec_enet_priv_tx_q *txq,
                                   struct xdp_frame *frame,
+                                  u32 dma_sync_len,
                                   bool ndo_xmit)
 {
        unsigned int index, status, estatus;
@@ -3840,7 +3850,7 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
                dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
                           frame->headroom;
                dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
-                                          frame->len, DMA_BIDIRECTIONAL);
+                                          dma_sync_len, DMA_BIDIRECTIONAL);
                txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
        }

@@ -3889,7 +3899,8 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
 }

 static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
-                               struct xdp_buff *xdp)
+                               struct xdp_buff *xdp,
+                               u32 dma_sync_len)
 {
        struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
        struct fec_enet_private *fep = netdev_priv(ndev);
@@ -3909,7 +3920,7 @@ static int fec_enet_xdp_tx_xmit(struct net_device *ndev,

        /* Avoid tx timeout as XDP shares the queue with kernel stack */
        txq_trans_cond_update(nq);
-       ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, false);
+       ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, dma_sync_len, false);

        __netif_tx_unlock(nq);

@@ -3938,7 +3949,7 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
        /* Avoid tx timeout as XDP shares the queue with kernel stack */
        txq_trans_cond_update(nq);
        for (i = 0; i < num_frames; i++) {
-               if (fec_enet_txq_xmit_frame(fep, txq, frames[i], true) < 0)
+               if (fec_enet_txq_xmit_frame(fep, txq, frames[i], 0, true) < 0)
                        break;
                sent_frames++;
        }

Jesper Dangaard Brouer Aug. 3, 2023, 7:36 a.m. UTC | #12

On 03/08/2023 05.58, Wei Fang wrote:
>>>   		} else {
>>> -			xdp_return_frame(xdpf);
>>> +			xdp_return_frame_rx_napi(xdpf);
>>
>> If you implement Jesper's syncing suggestions, I think you can use
>>
>> 	page_pool_put_page(pool, page, 0, true);

To Jakub, using 0 here you are trying to bypass the DMA-sync (which is
valid as driver knows XDP_TX have already done the sync).
The code will still call into DMA-sync calls with zero as size, so
wonder if we should detect size zero and skip that call?
(I mean is this something page_pool should support.)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 7ca456bfab71..778d061e4f2c 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -323,7 +323,8 @@ static void page_pool_dma_sync_for_device(struct 
page_pool *pool,
         dma_addr_t dma_addr = page_pool_get_dma_addr(page);

         dma_sync_size = min(dma_sync_size, pool->p.max_len);
-       dma_sync_single_range_for_device(pool->p.dev, dma_addr,
+       if (dma_sync_size)
+               dma_sync_single_range_for_device(pool->p.dev, dma_addr,
                                          pool->p.offset, dma_sync_size,
                                          pool->p.dma_dir);



>>
>> for XDP_TX here to avoid the DMA sync on page recycle.
> 
> I tried Jasper's syncing suggestion and used page_pool_put_page() to recycle
> pages, but the results does not seem to improve the performance of XDP_TX,

The optimization will only have effect on those devices which have
dev->dma_coherent=false else DMA function [1] (e.g.
dma_direct_sync_single_for_device) will skip the sync calls.

  [1] 
https://elixir.bootlin.com/linux/v6.5-rc4/source/kernel/dma/direct.h#L63

(Cc. Andrew Lunn)
Does any of the imx generations have dma-noncoherent memory?

And does any of these use the fec NIC driver?

> it even degrades the speed.

Could be low runs simply be a variation between your test runs?

The specific device (imx8mpevk) this was tested on, clearly have
dma_coherent=true, or else we would have seen a difference.
But the code change should not have any overhead for the
dma_coherent=true case, the only extra overhead is the extra empty DMA
sync call with size zero (as discussed in top).

> 
> The result of the current modification.
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17:     260180 pkt/s

These results are *significantly* better than reported in patch-1.
What happened?!?

e.g.
  root@imx8mpevk:~# ./xdp2 eth0
  proto 17:     135817 pkt/s
  proto 17:     142776 pkt/s

> proto 17:     260373 pkt/s
> proto 17:     260363 pkt/s
> proto 17:     259036 pkt/s
> proto 17:     260180 pkt/s
> proto 17:     260048 pkt/s
> proto 17:     260029 pkt/s
> proto 17:     260133 pkt/s
> proto 17:     260021 pkt/s
> proto 17:     260203 pkt/s
> proto 17:     260293 pkt/s
> proto 17:     259418 pkt/s
> 
> After using the sync suggestion, the result shows as follow.
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17:     255956 pkt/s
> proto 17:     255841 pkt/s
> proto 17:     255835 pkt/s
> proto 17:     255381 pkt/s
> proto 17:     255736 pkt/s
> proto 17:     255779 pkt/s
> proto 17:     254135 pkt/s
> proto 17:     255584 pkt/s
> proto 17:     255855 pkt/s
> proto 17:     255664 pkt/s
> 
> Below are my changes, I don't know what cause it. Based on the results,
> it's better to keep the current modification.
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index d5fda24a4c52..415c0cb83f84 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -77,7 +77,8 @@
>   static void set_multicast_list(struct net_device *ndev);
>   static void fec_enet_itr_coal_set(struct net_device *ndev);
>   static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> -                               struct xdp_buff *xdp);
> +                               struct xdp_buff *xdp,
> +                               u32 dma_sync_len);
> 
>   #define DRIVER_NAME    "fec"
> 
> @@ -1487,7 +1488,14 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
>                          /* Free the sk buffer associated with this last transmit */
>                          dev_kfree_skb_any(skb);
>                  } else {
> -                       xdp_return_frame_rx_napi(xdpf);
> +                       if (txq->tx_buf[index].type == FEC_TXBUF_T_XDP_NDO)
> +                               xdp_return_frame_rx_napi(xdpf);
> +                       else {
> +                               struct page *page;
> +
> +                               page = virt_to_head_page(xdpf->data);
> +                               page_pool_put_page(page->pp, page, 0, true);
> +                       }
> 
>                          txq->tx_buf[index].xdp = NULL;
>                          /* restore default tx buffer type: FEC_TXBUF_T_SKB */
> @@ -1557,7 +1565,8 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
>          act = bpf_prog_run_xdp(prog, xdp);
> 
>          /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
> -       sync = xdp->data_end - xdp->data_hard_start - FEC_ENET_XDP_HEADROOM;
> +       sync = xdp->data_end - xdp->data;
>          sync = max(sync, len);
> 
>          switch (act) {
> @@ -1579,7 +1588,7 @@ fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
>                  break;
> 
>          case XDP_TX:
> -               err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
> +               err = fec_enet_xdp_tx_xmit(fep->netdev, xdp, sync);
>                  if (unlikely(err)) {
>                          ret = FEC_ENET_XDP_CONSUMED;
>                          page = virt_to_head_page(xdp->data);
> @@ -3807,6 +3816,7 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private *fep, int index)
>   static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>                                     struct fec_enet_priv_tx_q *txq,
>                                     struct xdp_frame *frame,
> +                                  u32 dma_sync_len,
>                                     bool ndo_xmit)
>   {
>          unsigned int index, status, estatus;
> @@ -3840,7 +3850,7 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>                  dma_addr = page_pool_get_dma_addr(page) + sizeof(*frame) +
>                             frame->headroom;
>                  dma_sync_single_for_device(&fep->pdev->dev, dma_addr,
> -                                          frame->len, DMA_BIDIRECTIONAL);
> +                                          dma_sync_len, DMA_BIDIRECTIONAL);
>                  txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
>          }
> 
> @@ -3889,7 +3899,8 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
>   }
> 
>   static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> -                               struct xdp_buff *xdp)
> +                               struct xdp_buff *xdp,
> +                               u32 dma_sync_len)
>   {
>          struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
>          struct fec_enet_private *fep = netdev_priv(ndev);
> @@ -3909,7 +3920,7 @@ static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> 
>          /* Avoid tx timeout as XDP shares the queue with kernel stack */
>          txq_trans_cond_update(nq);
> -       ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, false);
> +       ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, dma_sync_len, false);
> 
>          __netif_tx_unlock(nq);
> 
> @@ -3938,7 +3949,7 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
>          /* Avoid tx timeout as XDP shares the queue with kernel stack */
>          txq_trans_cond_update(nq);
>          for (i = 0; i < num_frames; i++) {
> -               if (fec_enet_txq_xmit_frame(fep, txq, frames[i], true) < 0)
> +               if (fec_enet_txq_xmit_frame(fep, txq, frames[i], 0, true) < 0)
>                          break;
>                  sent_frames++;
>          }
>

Wei Fang Aug. 3, 2023, 11:18 a.m. UTC | #13

>
> On 03/08/2023 05.58, Wei Fang wrote:
> >>>                   } else {
> >>> -                 xdp_return_frame(xdpf);
> >>> +                 xdp_return_frame_rx_napi(xdpf);
> >>
> >> If you implement Jesper's syncing suggestions, I think you can use
> >>
> >>    page_pool_put_page(pool, page, 0, true);
>
> To Jakub, using 0 here you are trying to bypass the DMA-sync (which is valid
> as driver knows XDP_TX have already done the sync).
> The code will still call into DMA-sync calls with zero as size, so wonder if we
> should detect size zero and skip that call?
> (I mean is this something page_pool should support.)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c index
> 7ca456bfab71..778d061e4f2c 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -323,7 +323,8 @@ static void page_pool_dma_sync_for_device(struct
> page_pool *pool,
>          dma_addr_t dma_addr = page_pool_get_dma_addr(page);
>
>          dma_sync_size = min(dma_sync_size, pool->p.max_len);
> -       dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> +       if (dma_sync_size)
> +               dma_sync_single_range_for_device(pool->p.dev,
> dma_addr,
>                                           pool->p.offset,
> dma_sync_size,
>                                           pool->p.dma_dir);
>
>
>
> >>
> >> for XDP_TX here to avoid the DMA sync on page recycle.
> >
> > I tried Jasper's syncing suggestion and used page_pool_put_page() to
> > recycle pages, but the results does not seem to improve the
> > performance of XDP_TX,
>
> The optimization will only have effect on those devices which have
> dev->dma_coherent=false else DMA function [1] (e.g.
> dma_direct_sync_single_for_device) will skip the sync calls.
>
>   [1]
> https://elixir.b/
> ootlin.com%2Flinux%2Fv6.5-rc4%2Fsource%2Fkernel%2Fdma%2Fdirect.h%2
> 3L63&data=05%7C01%7Cwei.fang%40nxp.com%7Cb81436deb63d41dd41a1
> 08db93f454cb%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C63
> 8266449821982804%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwM
> DAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%
> 7C&sdata=CSi%2Fzld%2FucIYo6VDb9YdO91D8HKV1tbzPq8fB5X%2Bs3s%3D&
> reserved=0
>
> (Cc. Andrew Lunn)
> Does any of the imx generations have dma-noncoherent memory?
>
> And does any of these use the fec NIC driver?
>
> > it even degrades the speed.
>
> Could be low runs simply be a variation between your test runs?
>
Maybe, I just tested once before. So I test several times again, the
results of the two methods do not seem to be much different so far,
both about 255000 pkt/s.

> The specific device (imx8mpevk) this was tested on, clearly have
> dma_coherent=true, or else we would have seen a difference.
> But the code change should not have any overhead for the
> dma_coherent=true case, the only extra overhead is the extra empty DMA
> sync call with size zero (as discussed in top).
>
The FEC of i.MX8MP-EVK has dma_coherent=false, and as I mentioned
above, I did not see an obvious difference in the performance. :(

> >
> > The result of the current modification.
> > root@imx8mpevk:~# ./xdp2 eth0
> > proto 17:     260180 pkt/s
>
> These results are *significantly* better than reported in patch-1.
> What happened?!?
>
The test environment is slightly different, in patch-1, the FEC port was
directly connected to the port of another board. But in the latest test,
the ports of the two boards were connected to a switch, so the ports of
the two boards are not directly connected.

> e.g.
>   root@imx8mpevk:~# ./xdp2 eth0
>   proto 17:     135817 pkt/s
>   proto 17:     142776 pkt/s
>
> > proto 17:     260373 pkt/s
> > proto 17:     260363 pkt/s
> > proto 17:     259036 pkt/s
> > proto 17:     260180 pkt/s
> > proto 17:     260048 pkt/s
> > proto 17:     260029 pkt/s
> > proto 17:     260133 pkt/s
> > proto 17:     260021 pkt/s
> > proto 17:     260203 pkt/s
> > proto 17:     260293 pkt/s
> > proto 17:     259418 pkt/s
> >
> > After using the sync suggestion, the result shows as follow.
> > root@imx8mpevk:~# ./xdp2 eth0
> > proto 17:     255956 pkt/s
> > proto 17:     255841 pkt/s
> > proto 17:     255835 pkt/s
> > proto 17:     255381 pkt/s
> > proto 17:     255736 pkt/s
> > proto 17:     255779 pkt/s
> > proto 17:     254135 pkt/s
> > proto 17:     255584 pkt/s
> > proto 17:     255855 pkt/s
> > proto 17:     255664 pkt/s
> >
> > Below are my changes, I don't know what cause it. Based on the
> > results, it's better to keep the current modification.
> >
> > diff --git a/drivers/net/ethernet/freescale/fec_main.c
> > b/drivers/net/ethernet/freescale/fec_main.c
> > index d5fda24a4c52..415c0cb83f84 100644
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -77,7 +77,8 @@
> >   static void set_multicast_list(struct net_device *ndev);
> >   static void fec_enet_itr_coal_set(struct net_device *ndev);
> >   static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> > -                               struct xdp_buff *xdp);
> > +                               struct xdp_buff *xdp,
> > +                               u32 dma_sync_len);
> >
> >   #define DRIVER_NAME    "fec"
> >
> > @@ -1487,7 +1488,14 @@ fec_enet_tx_queue(struct net_device *ndev,
> u16 queue_id, int budget)
> >                          /* Free the sk buffer associated with this last
> transmit */
> >                          dev_kfree_skb_any(skb);
> >                  } else {
> > -                       xdp_return_frame_rx_napi(xdpf);
> > +                       if (txq->tx_buf[index].type ==
> FEC_TXBUF_T_XDP_NDO)
> > +                               xdp_return_frame_rx_napi(xdpf);
> > +                       else {
> > +                               struct page *page;
> > +
> > +                               page =
> virt_to_head_page(xdpf->data);
> > +                               page_pool_put_page(page->pp, page,
> 0, true);
> > +                       }
> >
> >                          txq->tx_buf[index].xdp = NULL;
> >                          /* restore default tx buffer type:
> > FEC_TXBUF_T_SKB */ @@ -1557,7 +1565,8 @@ fec_enet_run_xdp(struct
> fec_enet_private *fep, struct bpf_prog *prog,
> >          act = bpf_prog_run_xdp(prog, xdp);
> >
> >          /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU
> touch */
> > -       sync = xdp->data_end - xdp->data_hard_start -
> FEC_ENET_XDP_HEADROOM;
> > +       sync = xdp->data_end - xdp->data;
> >          sync = max(sync, len);
> >
> >          switch (act) {
> > @@ -1579,7 +1588,7 @@ fec_enet_run_xdp(struct fec_enet_private *fep,
> struct bpf_prog *prog,
> >                  break;
> >
> >          case XDP_TX:
> > -               err = fec_enet_xdp_tx_xmit(fep->netdev, xdp);
> > +               err = fec_enet_xdp_tx_xmit(fep->netdev, xdp, sync);
> >                  if (unlikely(err)) {
> >                          ret = FEC_ENET_XDP_CONSUMED;
> >                          page = virt_to_head_page(xdp->data); @@
> > -3807,6 +3816,7 @@ fec_enet_xdp_get_tx_queue(struct fec_enet_private
> *fep, int index)
> >   static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> >                                     struct fec_enet_priv_tx_q *txq,
> >                                     struct xdp_frame *frame,
> > +                                  u32 dma_sync_len,
> >                                     bool ndo_xmit)
> >   {
> >          unsigned int index, status, estatus; @@ -3840,7 +3850,7 @@
> > static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
> >                  dma_addr = page_pool_get_dma_addr(page) +
> sizeof(*frame) +
> >                             frame->headroom;
> >                  dma_sync_single_for_device(&fep->pdev->dev,
> dma_addr,
> > -                                          frame->len,
> DMA_BIDIRECTIONAL);
> > +                                          dma_sync_len,
> > + DMA_BIDIRECTIONAL);
> >                  txq->tx_buf[index].type = FEC_TXBUF_T_XDP_TX;
> >          }
> >
> > @@ -3889,7 +3899,8 @@ static int fec_enet_txq_xmit_frame(struct
> fec_enet_private *fep,
> >   }
> >
> >   static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> > -                               struct xdp_buff *xdp)
> > +                               struct xdp_buff *xdp,
> > +                               u32 dma_sync_len)
> >   {
> >          struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
> >          struct fec_enet_private *fep = netdev_priv(ndev); @@ -3909,7
> > +3920,7 @@ static int fec_enet_xdp_tx_xmit(struct net_device *ndev,
> >
> >          /* Avoid tx timeout as XDP shares the queue with kernel stack
> */
> >          txq_trans_cond_update(nq);
> > -       ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, false);
> > +       ret = fec_enet_txq_xmit_frame(fep, txq, xdpf, dma_sync_len,
> > + false);
> >
> >          __netif_tx_unlock(nq);
> >
> > @@ -3938,7 +3949,7 @@ static int fec_enet_xdp_xmit(struct net_device
> *dev,
> >          /* Avoid tx timeout as XDP shares the queue with kernel stack
> */
> >          txq_trans_cond_update(nq);
> >          for (i = 0; i < num_frames; i++) {
> > -               if (fec_enet_txq_xmit_frame(fep, txq, frames[i], true) < 0)
> > +               if (fec_enet_txq_xmit_frame(fep, txq, frames[i], 0,
> > + true) < 0)
> >                          break;
> >                  sent_frames++;
> >          }
> >

Jesper Dangaard Brouer Aug. 3, 2023, 12:55 p.m. UTC | #14

On 03/08/2023 13.18, Wei Fang wrote:
>> On 03/08/2023 05.58, Wei Fang wrote:
>>>>>                    } else {
>>>>> -                 xdp_return_frame(xdpf);
>>>>> +                 xdp_return_frame_rx_napi(xdpf);
>>>> If you implement Jesper's syncing suggestions, I think you can use
>>>>
>>>>     page_pool_put_page(pool, page, 0, true);
>> To Jakub, using 0 here you are trying to bypass the DMA-sync (which is valid
>> as driver knows XDP_TX have already done the sync).
>> The code will still call into DMA-sync calls with zero as size, so wonder if we
>> should detect size zero and skip that call?
>> (I mean is this something page_pool should support.)
>>
[...]
>>
>>
>>>> for XDP_TX here to avoid the DMA sync on page recycle.
>>> I tried Jasper's syncing suggestion and used page_pool_put_page() to
>>> recycle pages, but the results does not seem to improve the
>>> performance of XDP_TX,
>> The optimization will only have effect on those devices which have
>> dev->dma_coherent=false else DMA function [1] (e.g.
>> dma_direct_sync_single_for_device) will skip the sync calls.
>>
>>  [1] https://elixir.bootlin.com/linux/v6.5-rc4/source/kernel/dma/direct.h#L63
>>
>> (Cc. Andrew Lunn)
>> Does any of the imx generations have dma-noncoherent memory?
>>
>> And does any of these use the fec NIC driver?
>>
>>> it even degrades the speed.
>> 
>> Could be low runs simply be a variation between your test runs?
>>
> Maybe, I just tested once before. So I test several times again, the
> results of the two methods do not seem to be much different so far,
> both about 255000 pkt/s.
> 
>> The specific device (imx8mpevk) this was tested on, clearly have
>> dma_coherent=true, or else we would have seen a difference.
>> But the code change should not have any overhead for the
>> dma_coherent=true case, the only extra overhead is the extra empty DMA
>> sync call with size zero (as discussed in top).
>>
> The FEC of i.MX8MP-EVK has dma_coherent=false, and as I mentioned
> above, I did not see an obvious difference in the performance. :(

That is surprising - given the results.

(see below, lack of perf/diff might be caused by Ethernet flow-control).

> 
>>> The result of the current modification.
>>> root@imx8mpevk:~# ./xdp2 eth0
>>> proto 17:     260180 pkt/s
>>
>> These results are*significantly*  better than reported in patch-1.
>> What happened?!?
>>
> The test environment is slightly different, in patch-1, the FEC port was
> directly connected to the port of another board. But in the latest test,
> the ports of the two boards were connected to a switch, so the ports of
> the two boards are not directly connected.
> 

Hmm, I've seen this kind of perf behavior of direct-connected or via
switch before. The mistake I made was, that I had not disabled Ethernet
flow-control.  The xdp2 XDP_TX program will swap the mac addresses, and
send the packet back to the packet generator (running pktgen), which
will get overloaded itself and starts sending Ethernet flow-control
pause frames.

Command line to disable:
  # ethtool -A eth0 rx off tx off

Can I ask/get you to make sure that Ethernet flow-control is disabled
(on both generator and DUT (to be on safe-side)) and run the test again?

--Jesper

>> e.g.
>>    root@imx8mpevk:~# ./xdp2 eth0
>>    proto 17:     135817 pkt/s
>>    proto 17:     142776 pkt/s
>>
>>> proto 17:     260373 pkt/s
>>> proto 17:     260363 pkt/s
>>> proto 17:     259036 pkt/s
[...]
>>>
>>> After using the sync suggestion, the result shows as follow.
>>> root@imx8mpevk:~# ./xdp2 eth0
>>> proto 17:     255956 pkt/s
>>> proto 17:     255841 pkt/s
>>> proto 17:     255835 pkt/s

Wei Fang Aug. 4, 2023, 3:06 a.m. UTC | #15

> > The FEC of i.MX8MP-EVK has dma_coherent=false, and as I mentioned
> > above, I did not see an obvious difference in the performance. :(
> 
> That is surprising - given the results.
> 
> (see below, lack of perf/diff might be caused by Ethernet flow-control).
> 
> >
> >>> The result of the current modification.
> >>> root@imx8mpevk:~# ./xdp2 eth0
> >>> proto 17:     260180 pkt/s
> >>
> >> These results are*significantly*  better than reported in patch-1.
> >> What happened?!?
> >>
> > The test environment is slightly different, in patch-1, the FEC port
> > was directly connected to the port of another board. But in the latest
> > test, the ports of the two boards were connected to a switch, so the
> > ports of the two boards are not directly connected.
> >
> 
> Hmm, I've seen this kind of perf behavior of direct-connected or via switch
> before. The mistake I made was, that I had not disabled Ethernet flow-control.
> The xdp2 XDP_TX program will swap the mac addresses, and send the packet
> back to the packet generator (running pktgen), which will get overloaded
> itself and starts sending Ethernet flow-control pause frames.
> 
> Command line to disable:
>   # ethtool -A eth0 rx off tx off
> 
> Can I ask/get you to make sure that Ethernet flow-control is disabled (on
> both generator and DUT (to be on safe-side)) and run the test again?
> 
The flow-control was not disabled before, so according to your suggestion,
I disable the flow-control on the both boards and run the test again, the
performance is slightly improved, but still can not see a clear difference
between the two methods. Below are the results.

Result: use "sync_dma_len" method
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     258886 pkt/s
proto 17:     258879 pkt/s
proto 17:     258872 pkt/s
proto 17:     258312 pkt/s
proto 17:     258926 pkt/s
proto 17:     259057 pkt/s
proto 17:     258437 pkt/s
proto 17:     259242 pkt/s
proto 17:     258665 pkt/s
proto 17:     258779 pkt/s
proto 17:     259209 pkt/s

Result: use the current method
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     258752 pkt/s
proto 17:     258672 pkt/s
proto 17:     258317 pkt/s
proto 17:     258787 pkt/s
proto 17:     258757 pkt/s
proto 17:     258542 pkt/s
proto 17:     258829 pkt/s
proto 17:     258480 pkt/s
proto 17:     258859 pkt/s
proto 17:     258918 pkt/s
proto 17:     258782 pkt/s
proto 17:     259086 pkt/s
proto 17:     258337 pkt/s

Jesper Dangaard Brouer Aug. 4, 2023, 12:09 p.m. UTC | #16

On 02/08/2023 14.33, Wei Fang wrote:
>>> +	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
>> XDP_TX can avoid this conversion to xdp_frame.
>> It would requires some refactor of fec_enet_txq_xmit_frame().
>>
> Yes, but I'm not intend to change it, using the existing interface is enough.
> 
>>> +	struct fec_enet_private *fep = netdev_priv(ndev);
>>> +	struct fec_enet_priv_tx_q *txq;
>>> +	int cpu = smp_processor_id();
>>> +	struct netdev_queue *nq;
>>> +	int queue, ret;
>>> +
>>> +	queue = fec_enet_xdp_get_tx_queue(fep, cpu);
>>> +	txq = fep->tx_queue[queue];

Notice how TXQ gets selected based on CPU.
Thus it will be the same for all the frames.

>>> +	nq = netdev_get_tx_queue(fep->netdev, queue);
>>> +
>>> +	__netif_tx_lock(nq, cpu);
>> 
>> It is sad that XDP_TX takes a lock for each frame.
>>
> Yes, but the XDP path share the queue with the kernel network stack, so
> we need a lock here, unless there is a dedicated queue for XDP path. Do
> you have a better solution?
> 

Yes, the solution would be to keep a stack local (or per-CPU) queue for
all the XDP_TX frames, and send them at the xdp_do_flush_map() call
site. This is basically what happens with xdp_do_redirect() in cpumap.c
and devmap.c code, that have a per-CPU bulk queue and sends a bulk of
packets into fec_enet_xdp_xmit / ndo_xdp_xmit.

I understand if you don't want to add the complexity to the driver.
And I guess, it should be a followup patch to make sure this actually
improves performance.

--Jesper

Jesper Dangaard Brouer Aug. 4, 2023, 12:36 p.m. UTC | #17

On 04/08/2023 05.06, Wei Fang wrote:
>>> The FEC of i.MX8MP-EVK has dma_coherent=false, and as I mentioned
>>> above, I did not see an obvious difference in the performance. :(
>>
>> That is surprising - given the results.
>>
>> (see below, lack of perf/diff might be caused by Ethernet flow-control).
>>
>>>
>>>>> The result of the current modification.
>>>>> root@imx8mpevk:~# ./xdp2 eth0
>>>>> proto 17:     260180 pkt/s
>>>>
>>>> These results are*significantly*  better than reported in patch-1.
>>>> What happened?!?
>>>>
>>> The test environment is slightly different, in patch-1, the FEC port
>>> was directly connected to the port of another board. But in the latest
>>> test, the ports of the two boards were connected to a switch, so the
>>> ports of the two boards are not directly connected.
>>>
>>
>> Hmm, I've seen this kind of perf behavior of direct-connected or via switch
>> before. The mistake I made was, that I had not disabled Ethernet flow-control.
>> The xdp2 XDP_TX program will swap the mac addresses, and send the packet
>> back to the packet generator (running pktgen), which will get overloaded
>> itself and starts sending Ethernet flow-control pause frames.
>>
>> Command line to disable:
>>    # ethtool -A eth0 rx off tx off
>>
>> Can I ask/get you to make sure that Ethernet flow-control is disabled (on
>> both generator and DUT (to be on safe-side)) and run the test again?
>>
> The flow-control was not disabled before, so according to your suggestion,
> I disable the flow-control on the both boards and run the test again, the
> performance is slightly improved, but still can not see a clear difference
> between the two methods. Below are the results.

Something else must be stalling the CPU.
When looking at fec_main.c code, I noticed that 
fec_enet_txq_xmit_frame() will do a MMIO write for every xdp_frame (to 
trigger transmit start), which I believe will stall the CPU.
The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the 
function that does the MMIO write to trigger transmit start.

$ git diff
diff --git a/drivers/net/ethernet/freescale/fec_main.c 
b/drivers/net/ethernet/freescale/fec_main.c
index 03ac7690b5c4..57a6a3899b80 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct 
fec_enet_private *fep,

         txq->bd.cur = bdp;

-       /* Trigger transmission start */
-       writel(0, txq->bd.reg_desc_active);
-
         return 0;
  }

@@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device *dev,
                 sent_frames++;
         }

+       /* Trigger transmission start */
+       writel(0, txq->bd.reg_desc_active);
+
         __netif_tx_unlock(nq);

         return sent_frames;


> Result: use "sync_dma_len" method
> root@imx8mpevk:~# ./xdp2 eth0

The xdp2 (and xdp1) program(s) have a performance issue
(due to using

Can I ask you to test using xdp_rxq_info, like:

  sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX


> proto 17:     258886 pkt/s
> proto 17:     258879 pkt/s

If you provide numbers for xdp_redirect, then we could better evaluate
if changing the lock per xdp_frame, for XDP_TX also, is worth it.

And also find out of moving the MMIO write have any effect.

I also noticed driver does a MMIO write (on rxq) for every RX-packet in
fec_enet_rx_queue() napi-poll loop.  This also looks like a potential
performance stall.

--Jesper

Wei Fang Aug. 7, 2023, 8:41 a.m. UTC | #18

> 
> 
> On 02/08/2023 14.33, Wei Fang wrote:
> >>> +	struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);
> >> XDP_TX can avoid this conversion to xdp_frame.
> >> It would requires some refactor of fec_enet_txq_xmit_frame().
> >>
> > Yes, but I'm not intend to change it, using the existing interface is enough.
> >
> >>> +	struct fec_enet_private *fep = netdev_priv(ndev);
> >>> +	struct fec_enet_priv_tx_q *txq;
> >>> +	int cpu = smp_processor_id();
> >>> +	struct netdev_queue *nq;
> >>> +	int queue, ret;
> >>> +
> >>> +	queue = fec_enet_xdp_get_tx_queue(fep, cpu);
> >>> +	txq = fep->tx_queue[queue];
> 
> Notice how TXQ gets selected based on CPU.
> Thus it will be the same for all the frames.
> 
Yes, I'll optimize it, thanks!

> >>> +	nq = netdev_get_tx_queue(fep->netdev, queue);
> >>> +
> >>> +	__netif_tx_lock(nq, cpu);
> >>
> >> It is sad that XDP_TX takes a lock for each frame.
> >>
> > Yes, but the XDP path share the queue with the kernel network stack,
> > so we need a lock here, unless there is a dedicated queue for XDP
> > path. Do you have a better solution?
> >
> 
> Yes, the solution would be to keep a stack local (or per-CPU) queue for all the
> XDP_TX frames, and send them at the xdp_do_flush_map() call site. This is
> basically what happens with xdp_do_redirect() in cpumap.c and devmap.c
> code, that have a per-CPU bulk queue and sends a bulk of packets into
> fec_enet_xdp_xmit / ndo_xdp_xmit.
> 
> I understand if you don't want to add the complexity to the driver.
> And I guess, it should be a followup patch to make sure this actually
> improves performance.
> 
Thanks, I got it. I'll optimize in a followup patch if it really improves the performance.

Wei Fang Aug. 7, 2023, 10:30 a.m. UTC | #19

> > The flow-control was not disabled before, so according to your
> > suggestion, I disable the flow-control on the both boards and run the
> > test again, the performance is slightly improved, but still can not
> > see a clear difference between the two methods. Below are the results.
> 
> Something else must be stalling the CPU.
> When looking at fec_main.c code, I noticed that
> fec_enet_txq_xmit_frame() will do a MMIO write for every xdp_frame (to
> trigger transmit start), which I believe will stall the CPU.
> The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the
> function that does the MMIO write to trigger transmit start.
> 
We'd better keep a MMIO write for every xdp_frame on txq, as you know,
the txq will be inactive when no additional ready descriptors remain in the
tx-BDR. So it may increase the delay of the packets if we do a MMIO write
for multiple packets.

> $ git diff
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index 03ac7690b5c4..57a6a3899b80 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct
> fec_enet_private *fep,
> 
>          txq->bd.cur = bdp;
> 
> -       /* Trigger transmission start */
> -       writel(0, txq->bd.reg_desc_active);
> -
>          return 0;
>   }
> 
> @@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device
> *dev,
>                  sent_frames++;
>          }
> 
> +       /* Trigger transmission start */
> +       writel(0, txq->bd.reg_desc_active);
> +
>          __netif_tx_unlock(nq);
> 
>          return sent_frames;
> 
> 
> > Result: use "sync_dma_len" method
> > root@imx8mpevk:~# ./xdp2 eth0
> 
> The xdp2 (and xdp1) program(s) have a performance issue (due to using
> 
> Can I ask you to test using xdp_rxq_info, like:
> 
>   sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX
> 
Yes, below are the results, the results are also basically the same.
Result 1: current method
./xdp_rxq_info --dev eth0 --action XDP_TX
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,102     0
XDP-RX CPU      total   259,102
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,102     0
rx_queue_index    0:sum 259,102
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,498     0
XDP-RX CPU      total   259,498
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,496     0
rx_queue_index    0:sum 259,496
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,408     0
XDP-RX CPU      total   259,408

Result 2: dma_sync_len method
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       258,254     0
XDP-RX CPU      total   258,254
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   258,254     0
rx_queue_index    0:sum 258,254
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,316     0
XDP-RX CPU      total   259,316
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,318     0
rx_queue_index    0:sum 259,318
Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      0       259,554     0
XDP-RX CPU      total   259,554
RXQ stats       RXQ:CPU pps         issue-pps
rx_queue_index    0:0   259,553     0
rx_queue_index    0:sum 259,553

> 
> > proto 17:     258886 pkt/s
> > proto 17:     258879 pkt/s
> 
> If you provide numbers for xdp_redirect, then we could better evaluate if
> changing the lock per xdp_frame, for XDP_TX also, is worth it.
> 
For XDP_REDIRECT, the performance show as follow.
root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
eth1->eth0        221,642 rx/s       0 err,drop/s      221,643 xmit/s
eth1->eth0        221,761 rx/s       0 err,drop/s      221,760 xmit/s
eth1->eth0        221,793 rx/s       0 err,drop/s      221,794 xmit/s
eth1->eth0        221,825 rx/s       0 err,drop/s      221,825 xmit/s
eth1->eth0        221,823 rx/s       0 err,drop/s      221,821 xmit/s
eth1->eth0        221,815 rx/s       0 err,drop/s      221,816 xmit/s
eth1->eth0        222,016 rx/s       0 err,drop/s      222,016 xmit/s
eth1->eth0        222,059 rx/s       0 err,drop/s      222,059 xmit/s
eth1->eth0        222,085 rx/s       0 err,drop/s      222,089 xmit/s
eth1->eth0        221,956 rx/s       0 err,drop/s      221,952 xmit/s
eth1->eth0        222,070 rx/s       0 err,drop/s      222,071 xmit/s
eth1->eth0        222,017 rx/s       0 err,drop/s      222,017 xmit/s
eth1->eth0        222,069 rx/s       0 err,drop/s      222,067 xmit/s
eth1->eth0        221,986 rx/s       0 err,drop/s      221,987 xmit/s
eth1->eth0        221,932 rx/s       0 err,drop/s      221,936 xmit/s
eth1->eth0        222,045 rx/s       0 err,drop/s      222,041 xmit/s
eth1->eth0        222,014 rx/s       0 err,drop/s      222,014 xmit/s
  Packets received    : 3,772,908
  Average packets/s   : 221,936
  Packets transmitted : 3,772,908
  Average transmit/s  : 221,936

> And also find out of moving the MMIO write have any effect.
> 
I move the MMIO write to fec_enet_xdp_xmit(), the result shows as follow,
the performance is slightly improved.

root@imx8mpevk:~# ./xdp_redirect eth1 eth0
Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
eth1->eth0        222,666 rx/s        0 err,drop/s      222,668 xmit/s
eth1->eth0        221,663 rx/s        0 err,drop/s      221,664 xmit/s
eth1->eth0        222,743 rx/s        0 err,drop/s      222,741 xmit/s
eth1->eth0        222,917 rx/s        0 err,drop/s      222,923 xmit/s
eth1->eth0        221,810 rx/s        0 err,drop/s      221,808 xmit/s
eth1->eth0        222,891 rx/s        0 err,drop/s      222,888 xmit/s
eth1->eth0        222,983 rx/s        0 err,drop/s      222,984 xmit/s
eth1->eth0        221,655 rx/s        0 err,drop/s      221,653 xmit/s
eth1->eth0        222,827 rx/s        0 err,drop/s      222,827 xmit/s
eth1->eth0        221,728 rx/s        0 err,drop/s      221,728 xmit/s
eth1->eth0        222,790 rx/s        0 err,drop/s      222,789 xmit/s
eth1->eth0        222,874 rx/s        0 err,drop/s      222,874 xmit/s
eth1->eth0        221,888 rx/s        0 err,drop/s      221,887 xmit/s
eth1->eth0        223,057 rx/s        0 err,drop/s      223,056 xmit/s
eth1->eth0        222,219 rx/s        0 err,drop/s      222,220 xmit/s
  Packets received    : 3,336,711
  Average packets/s   : 222,447
  Packets transmitted : 3,336,710
  Average transmit/s  : 222,447

> I also noticed driver does a MMIO write (on rxq) for every RX-packet in
> fec_enet_rx_queue() napi-poll loop.  This also looks like a potential
> performance stall.
> 
The same as txq, the rxq will be inactive if the rx-BDR has no free BDs, so we'd
better do a MMIO write when we recycle a BD, so that the hardware can timely
attach the received pakcets on the rx-BDR.

In addition, I also tried to avoid using xdp_convert_buff_to_frame(), but the
performance of XDP_TX is still not improved. :(

After these days of testing, I think it's best to keep the solution in V3, and then
make some optimizations on the V3 patch.

Jesper Dangaard Brouer Aug. 7, 2023, 1:15 p.m. UTC | #20

On 07/08/2023 12.30, Wei Fang wrote:
>>> The flow-control was not disabled before, so according to your
>>> suggestion, I disable the flow-control on the both boards and run the
>>> test again, the performance is slightly improved, but still can not
>>> see a clear difference between the two methods. Below are the results.
>>
>> Something else must be stalling the CPU.
>> When looking at fec_main.c code, I noticed that
>> fec_enet_txq_xmit_frame() will do a MMIO write for every xdp_frame (to
>> trigger transmit start), which I believe will stall the CPU.
>> The ndo_xdp_xmit/fec_enet_xdp_xmit does bulking, and should be the
>> function that does the MMIO write to trigger transmit start.
>>
> We'd better keep a MMIO write for every xdp_frame on txq, as you know,
> the txq will be inactive when no additional ready descriptors remain in the
> tx-BDR. So it may increase the delay of the packets if we do a MMIO write
> for multiple packets.
> 

You know this hardware better than me, so I will leave to you.

>> $ git diff
>> diff --git a/drivers/net/ethernet/freescale/fec_main.c
>> b/drivers/net/ethernet/freescale/fec_main.c
>> index 03ac7690b5c4..57a6a3899b80 100644
>> --- a/drivers/net/ethernet/freescale/fec_main.c
>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>> @@ -3849,9 +3849,6 @@ static int fec_enet_txq_xmit_frame(struct
>> fec_enet_private *fep,
>>
>>           txq->bd.cur = bdp;
>>
>> -       /* Trigger transmission start */
>> -       writel(0, txq->bd.reg_desc_active);
>> -
>>           return 0;
>>    }
>>
>> @@ -3880,6 +3877,9 @@ static int fec_enet_xdp_xmit(struct net_device
>> *dev,
>>                   sent_frames++;
>>           }
>>
>> +       /* Trigger transmission start */
>> +       writel(0, txq->bd.reg_desc_active);
>> +
>>           __netif_tx_unlock(nq);
>>
>>           return sent_frames;
>>
>>
>>> Result: use "sync_dma_len" method
>>> root@imx8mpevk:~# ./xdp2 eth0
>>
>> The xdp2 (and xdp1) program(s) have a performance issue (due to using
>>
>> Can I ask you to test using xdp_rxq_info, like:
>>
>>    sudo ./xdp_rxq_info --dev mlx5p1 --action XDP_TX
>>
> Yes, below are the results, the results are also basically the same.
> Result 1: current method
> ./xdp_rxq_info --dev eth0 --action XDP_TX
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       259,102     0
> XDP-RX CPU      total   259,102
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   259,102     0
> rx_queue_index    0:sum 259,102
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       259,498     0
> XDP-RX CPU      total   259,498
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   259,496     0
> rx_queue_index    0:sum 259,496
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       259,408     0
> XDP-RX CPU      total   259,408
> 
> Result 2: dma_sync_len method
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       258,254     0
> XDP-RX CPU      total   258,254
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   258,254     0
> rx_queue_index    0:sum 258,254
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       259,316     0
> XDP-RX CPU      total   259,316
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   259,318     0
> rx_queue_index    0:sum 259,318
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       259,554     0
> XDP-RX CPU      total   259,554
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    0:0   259,553     0
> rx_queue_index    0:sum 259,553
> 

Thanks for running this.

>>
>>> proto 17:     258886 pkt/s
>>> proto 17:     258879 pkt/s
>>
>> If you provide numbers for xdp_redirect, then we could better evaluate if
>> changing the lock per xdp_frame, for XDP_TX also, is worth it.
>>
> For XDP_REDIRECT, the performance show as follow.
> root@imx8mpevk:~# ./xdp_redirect eth1 eth0
> Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)

This is not exactly the same as XDP_TX setup as here you choose to 
redirect between eth1 (driver st_gmac) and to eth0 (driver fec).

I would like to see eth0 to eth0 XDP_REDIRECT, so we can compare to 
XDP_TX performance.
Sorry for all the requests, but can you provide those numbers?

> eth1->eth0        221,642 rx/s       0 err,drop/s      221,643 xmit/s

So, XDP_REDIRECT is approx (1-(221825/259554))*100 = 14.53% slower.
But as this is 'eth1->eth0' this isn't true comparison to XDP_TX.

> eth1->eth0        221,761 rx/s       0 err,drop/s      221,760 xmit/s
> eth1->eth0        221,793 rx/s       0 err,drop/s      221,794 xmit/s
> eth1->eth0        221,825 rx/s       0 err,drop/s      221,825 xmit/s
> eth1->eth0        221,823 rx/s       0 err,drop/s      221,821 xmit/s
> eth1->eth0        221,815 rx/s       0 err,drop/s      221,816 xmit/s
> eth1->eth0        222,016 rx/s       0 err,drop/s      222,016 xmit/s
> eth1->eth0        222,059 rx/s       0 err,drop/s      222,059 xmit/s
> eth1->eth0        222,085 rx/s       0 err,drop/s      222,089 xmit/s
> eth1->eth0        221,956 rx/s       0 err,drop/s      221,952 xmit/s
> eth1->eth0        222,070 rx/s       0 err,drop/s      222,071 xmit/s
> eth1->eth0        222,017 rx/s       0 err,drop/s      222,017 xmit/s
> eth1->eth0        222,069 rx/s       0 err,drop/s      222,067 xmit/s
> eth1->eth0        221,986 rx/s       0 err,drop/s      221,987 xmit/s
> eth1->eth0        221,932 rx/s       0 err,drop/s      221,936 xmit/s
> eth1->eth0        222,045 rx/s       0 err,drop/s      222,041 xmit/s
> eth1->eth0        222,014 rx/s       0 err,drop/s      222,014 xmit/s
>    Packets received    : 3,772,908
>    Average packets/s   : 221,936
>    Packets transmitted : 3,772,908
>    Average transmit/s  : 221,936
>> And also find out of moving the MMIO write have any effect.
>>
> I move the MMIO write to fec_enet_xdp_xmit(), the result shows as follow,
> the performance is slightly improved.
> 

I'm puzzled that moving the MMIO write isn't change performance.

Can you please verify that the packet generator machine is sending more
frame than the system can handle?

(meaning the pktgen_sample03_burst_single_flow.sh script fast enough?)

> root@imx8mpevk:~# ./xdp_redirect eth1 eth0
> Redirecting from eth1 (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
> eth1->eth0        222,666 rx/s        0 err,drop/s      222,668 xmit/s
> eth1->eth0        221,663 rx/s        0 err,drop/s      221,664 xmit/s
> eth1->eth0        222,743 rx/s        0 err,drop/s      222,741 xmit/s
> eth1->eth0        222,917 rx/s        0 err,drop/s      222,923 xmit/s
> eth1->eth0        221,810 rx/s        0 err,drop/s      221,808 xmit/s
> eth1->eth0        222,891 rx/s        0 err,drop/s      222,888 xmit/s
> eth1->eth0        222,983 rx/s        0 err,drop/s      222,984 xmit/s
> eth1->eth0        221,655 rx/s        0 err,drop/s      221,653 xmit/s
> eth1->eth0        222,827 rx/s        0 err,drop/s      222,827 xmit/s
> eth1->eth0        221,728 rx/s        0 err,drop/s      221,728 xmit/s
> eth1->eth0        222,790 rx/s        0 err,drop/s      222,789 xmit/s
> eth1->eth0        222,874 rx/s        0 err,drop/s      222,874 xmit/s
> eth1->eth0        221,888 rx/s        0 err,drop/s      221,887 xmit/s
> eth1->eth0        223,057 rx/s        0 err,drop/s      223,056 xmit/s
> eth1->eth0        222,219 rx/s        0 err,drop/s      222,220 xmit/s
>    Packets received    : 3,336,711
>    Average packets/s   : 222,447
>    Packets transmitted : 3,336,710
>    Average transmit/s  : 222,447
> 
>> I also noticed driver does a MMIO write (on rxq) for every RX-packet in
>> fec_enet_rx_queue() napi-poll loop.  This also looks like a potential
>> performance stall.
>>
> The same as txq, the rxq will be inactive if the rx-BDR has no free BDs, so we'd
> better do a MMIO write when we recycle a BD, so that the hardware can timely
> attach the received pakcets on the rx-BDR.
> 
> In addition, I also tried to avoid using xdp_convert_buff_to_frame(), but the
> performance of XDP_TX is still not improved. :(
> 

I would not expect much performance improvement from this anyhow.

> After these days of testing, I think it's best to keep the solution in V3, and then
> make some optimizations on the V3 patch.

I agree.

I think you need to send a V5, and then I can ACK that.

Thanks for all this testing,
--Jesper

Jakub Kicinski Aug. 7, 2023, 4:33 p.m. UTC | #21

On Mon, 7 Aug 2023 10:30:34 +0000 Wei Fang wrote:
> ./xdp_rxq_info --dev eth0 --action XDP_TX
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       259,102     0
> XDP-RX CPU      total   259,102

> Result 2: dma_sync_len method
> Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      0       258,254     0
> XDP-RX CPU      total   258,254

Just to be clear are these number with xdp_return_frame() replaced
with page_pool_put_page(pool, page, 0, true); ?

Wei Fang Aug. 8, 2023, 12:19 a.m. UTC | #22

> On Mon, 7 Aug 2023 10:30:34 +0000 Wei Fang wrote:
> > ./xdp_rxq_info --dev eth0 --action XDP_TX Running XDP on dev:eth0
> > (ifindex:2) action:XDP_TX options:swapmac
> > XDP stats       CPU     pps         issue-pps
> > XDP-RX CPU      0       259,102     0
> > XDP-RX CPU      total   259,102
> 
> > Result 2: dma_sync_len method
> > Running XDP on dev:eth0 (ifindex:2) action:XDP_TX options:swapmac
> > XDP stats       CPU     pps         issue-pps
> > XDP-RX CPU      0       258,254     0
> > XDP-RX CPU      total   258,254
> 
> Just to be clear are these number with xdp_return_frame() replaced with
> page_pool_put_page(pool, page, 0, true); ?
Yes, I used the page_pool_put_page() to instead of xdp_return_frame() when
I tried the "dma_sync_len" method.

Wei Fang Aug. 8, 2023, 5:02 a.m. UTC | #23

> > For XDP_REDIRECT, the performance show as follow.
> > root@imx8mpevk:~# ./xdp_redirect eth1 eth0 Redirecting from eth1
> > (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
> 
> This is not exactly the same as XDP_TX setup as here you choose to redirect
> between eth1 (driver st_gmac) and to eth0 (driver fec).
> 
> I would like to see eth0 to eth0 XDP_REDIRECT, so we can compare to
> XDP_TX performance.
> Sorry for all the requests, but can you provide those numbers?
> 

Oh, sorry, I thought what you wanted were XDP_REDIRECT results for different
NICs. Below is the result of XDP_REDIRECT on the same NIC.
root@imx8mpevk:~# ./xdp_redirect eth0 eth0
Redirecting from eth0 (ifindex 2; driver fec) to eth0 (ifindex 2; driver fec)
Summary        232,302 rx/s        0 err,drop/s      232,344 xmit/s
Summary        234,579 rx/s        0 err,drop/s      234,577 xmit/s
Summary        235,548 rx/s        0 err,drop/s      235,549 xmit/s
Summary        234,704 rx/s        0 err,drop/s      234,703 xmit/s
Summary        235,504 rx/s        0 err,drop/s      235,504 xmit/s
Summary        235,223 rx/s        0 err,drop/s      235,224 xmit/s
Summary        234,509 rx/s        0 err,drop/s      234,507 xmit/s
Summary        235,481 rx/s        0 err,drop/s      235,482 xmit/s
Summary        234,684 rx/s        0 err,drop/s      234,683 xmit/s
Summary        235,520 rx/s        0 err,drop/s      235,520 xmit/s
Summary        235,461 rx/s        0 err,drop/s      235,461 xmit/s
Summary        234,627 rx/s        0 err,drop/s      234,627 xmit/s
Summary        235,611 rx/s        0 err,drop/s      235,611 xmit/s
  Packets received    : 3,053,753
  Average packets/s   : 234,904
  Packets transmitted : 3,053,792
  Average transmit/s  : 234,907
> 
> I'm puzzled that moving the MMIO write isn't change performance.
> 
> Can you please verify that the packet generator machine is sending more
> frame than the system can handle?
> 
> (meaning the pktgen_sample03_burst_single_flow.sh script fast enough?)
> 

Thanks very much!
You remind me, I always started the pktgen script first and then ran the xdp2
program in the previous tests. So I saw the transmit speed of the generator
was always greater than the speed of XDP_TX when I stopped the script. But
actually, the real-time transmit speed of the generator was degraded to as
equal to the speed of XDP_TX.

So I turned off the rx function of the generator in case of increasing the CPU
loading of the generator due to the returned traffic from xdp2. And I tested
the performance again. Below are the results.

Result 1: current method
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     326539 pkt/s
proto 17:     326464 pkt/s
proto 17:     326528 pkt/s
proto 17:     326465 pkt/s
proto 17:     326550 pkt/s

Result 2: sync_dma_len method
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     353918 pkt/s
proto 17:     352923 pkt/s
proto 17:     353900 pkt/s
proto 17:     352672 pkt/s
proto 17:     353912 pkt/s

Note: the speed of the generator is about 935397pps.

Compared result 1 with result 2. The "sync_dma_len" method actually improves
the performance of XDP_TX, so the conclusion from the previous tests is *incorrect*.
I'm so sorry for that. :(

In addition, I also tried the "dma_sync_len" + not use xdp_convert_buff_to_frame()
method, the performance has been further improved. Below is the result.

Result 3: sync_dma_len + not use xdp_convert_buff_to_frame() method
root@imx8mpevk:~# ./xdp2 eth0
proto 17:     369261 pkt/s
proto 17:     369267 pkt/s
proto 17:     369206 pkt/s
proto 17:     369214 pkt/s
proto 17:     369126 pkt/s

Therefore, I'm intend to use the "dma_sync_len"+ not use xdp_convert_buff_to_frame()
method in the V5 patch. Thank you again, Jesper and Jakub. You really helped me a lot. :)

Jesper Dangaard Brouer Aug. 8, 2023, 11:33 a.m. UTC | #24

On 08/08/2023 07.02, Wei Fang wrote:
>>> For XDP_REDIRECT, the performance show as follow.
>>> root@imx8mpevk:~# ./xdp_redirect eth1 eth0 Redirecting from eth1
>>> (ifindex 3; driver st_gmac) to eth0 (ifindex 2; driver fec)
>>
>> This is not exactly the same as XDP_TX setup as here you choose to redirect
>> between eth1 (driver st_gmac) and to eth0 (driver fec).
>>
>> I would like to see eth0 to eth0 XDP_REDIRECT, so we can compare to
>> XDP_TX performance.
>> Sorry for all the requests, but can you provide those numbers?
>>
> 
> Oh, sorry, I thought what you wanted were XDP_REDIRECT results for different
> NICs. Below is the result of XDP_REDIRECT on the same NIC.
> root@imx8mpevk:~# ./xdp_redirect eth0 eth0
> Redirecting from eth0 (ifindex 2; driver fec) to eth0 (ifindex 2; driver fec)
> Summary        232,302 rx/s        0 err,drop/s      232,344 xmit/s
> Summary        234,579 rx/s        0 err,drop/s      234,577 xmit/s
> Summary        235,548 rx/s        0 err,drop/s      235,549 xmit/s
> Summary        234,704 rx/s        0 err,drop/s      234,703 xmit/s
> Summary        235,504 rx/s        0 err,drop/s      235,504 xmit/s
> Summary        235,223 rx/s        0 err,drop/s      235,224 xmit/s
> Summary        234,509 rx/s        0 err,drop/s      234,507 xmit/s
> Summary        235,481 rx/s        0 err,drop/s      235,482 xmit/s
> Summary        234,684 rx/s        0 err,drop/s      234,683 xmit/s
> Summary        235,520 rx/s        0 err,drop/s      235,520 xmit/s
> Summary        235,461 rx/s        0 err,drop/s      235,461 xmit/s
> Summary        234,627 rx/s        0 err,drop/s      234,627 xmit/s
> Summary        235,611 rx/s        0 err,drop/s      235,611 xmit/s
>    Packets received    : 3,053,753
>    Average packets/s   : 234,904
>    Packets transmitted : 3,053,792
>    Average transmit/s  : 234,907
>>
>> I'm puzzled that moving the MMIO write isn't change performance.
>>
>> Can you please verify that the packet generator machine is sending more
>> frame than the system can handle?
>>
>> (meaning the pktgen_sample03_burst_single_flow.sh script fast enough?)
>>
> 
> Thanks very much!
> You remind me, I always started the pktgen script first and then ran the xdp2
> program in the previous tests. So I saw the transmit speed of the generator
> was always greater than the speed of XDP_TX when I stopped the script. But
> actually, the real-time transmit speed of the generator was degraded to as
> equal to the speed of XDP_TX.
> 

Good that we finally found the root-cause, that explains why it seems
our code changes didn't have any effect.  The generator gets affected
and slowed down due to the traffic that is bounced back to it. (I tried
to hint this earlier with the Ethernet Flow-Control settings).

> So I turned off the rx function of the generator in case of increasing the CPU
> loading of the generator due to the returned traffic from xdp2. 

How did you turned off the rx function of the generator?
(I a couple of tricks I use)

> And I tested
> the performance again. Below are the results.
> 
> Result 1: current method
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17:     326539 pkt/s
> proto 17:     326464 pkt/s
> proto 17:     326528 pkt/s
> proto 17:     326465 pkt/s
> proto 17:     326550 pkt/s
> 
> Result 2: sync_dma_len method
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17:     353918 pkt/s
> proto 17:     352923 pkt/s
> proto 17:     353900 pkt/s
> proto 17:     352672 pkt/s
> proto 17:     353912 pkt/s
> 

This looks more promising:
  ((353912/326550)-1)*100 = 8.37% faster.

Or gaining/saving approx 236 nanosec per packet ((1/326550-1/353912)*10^9).

> Note: the speed of the generator is about 935397pps.
> 
> Compared result 1 with result 2. The "sync_dma_len" method actually improves
> the performance of XDP_TX, so the conclusion from the previous tests is *incorrect*.
> I'm so sorry for that. :(
> 

I'm happy that we finally found the root-cause.
Thanks for doing all the requested tests I asked for.

> In addition, I also tried the "dma_sync_len" + not use xdp_convert_buff_to_frame()
> method, the performance has been further improved. Below is the result.
> 
> Result 3: sync_dma_len + not use xdp_convert_buff_to_frame() method
> root@imx8mpevk:~# ./xdp2 eth0
> proto 17:     369261 pkt/s
> proto 17:     369267 pkt/s
> proto 17:     369206 pkt/s
> proto 17:     369214 pkt/s
> proto 17:     369126 pkt/s
> 
> Therefore, I'm intend to use the "dma_sync_len"+ not use xdp_convert_buff_to_frame()
> method in the V5 patch. Thank you again, Jesper and Jakub. You really helped me a lot. :)
> 

I suggest, that V5 patch still use xdp_convert_buff_to_frame(), and then
you send followup patch (or as 2/2 patch) that remove the use of
xdp_convert_buff_to_frame() for XDP_TX.  This way it is easier to keep
track of the changes and improvements.

I would be very interested in knowing if the MMIO test change after this
correction to the testlab/generator.

--Jesper

Wei Fang Aug. 9, 2023, 6:22 a.m. UTC | #25

> > Thanks very much!
> > You remind me, I always started the pktgen script first and then ran
> > the xdp2 program in the previous tests. So I saw the transmit speed of
> > the generator was always greater than the speed of XDP_TX when I
> > stopped the script. But actually, the real-time transmit speed of the
> > generator was degraded to as equal to the speed of XDP_TX.
> >
> 
> Good that we finally found the root-cause, that explains why it seems our
> code changes didn't have any effect.  The generator gets affected and
> slowed down due to the traffic that is bounced back to it. (I tried to hint this
> earlier with the Ethernet Flow-Control settings).
> 
> > So I turned off the rx function of the generator in case of increasing
> > the CPU loading of the generator due to the returned traffic from xdp2.
> 
> How did you turned off the rx function of the generator?
> (I a couple of tricks I use)
> 
Actually, I didn't really disable the rx function of the generator, I just made
the generator hardware automatically discard the returned traffic from xdp2.
So I utilized the MAC filter feature of the hardware and did some modification
to the pktgen script to make the SMAC of the packet is different from the MAC
address of the generator.


> > And I tested
> > the performance again. Below are the results.
> >
> > Result 1: current method
> > root@imx8mpevk:~# ./xdp2 eth0
> > proto 17:     326539 pkt/s
> > proto 17:     326464 pkt/s
> > proto 17:     326528 pkt/s
> > proto 17:     326465 pkt/s
> > proto 17:     326550 pkt/s
> >
> > Result 2: sync_dma_len method
> > root@imx8mpevk:~# ./xdp2 eth0
> > proto 17:     353918 pkt/s
> > proto 17:     352923 pkt/s
> > proto 17:     353900 pkt/s
> > proto 17:     352672 pkt/s
> > proto 17:     353912 pkt/s
> >
> 
> This looks more promising:
>   ((353912/326550)-1)*100 = 8.37% faster.
> 
> Or gaining/saving approx 236 nanosec per packet
> ((1/326550-1/353912)*10^9).
> 
> > Note: the speed of the generator is about 935397pps.
> >
> > Compared result 1 with result 2. The "sync_dma_len" method actually
> > improves the performance of XDP_TX, so the conclusion from the previous
> tests is *incorrect*.
> > I'm so sorry for that. :(
> >
> 
> I'm happy that we finally found the root-cause.
> Thanks for doing all the requested tests I asked for.
> 
> > In addition, I also tried the "dma_sync_len" + not use
> > xdp_convert_buff_to_frame() method, the performance has been further
> improved. Below is the result.
> >
> > Result 3: sync_dma_len + not use xdp_convert_buff_to_frame() method
> > root@imx8mpevk:~# ./xdp2 eth0
> > proto 17:     369261 pkt/s
> > proto 17:     369267 pkt/s
> > proto 17:     369206 pkt/s
> > proto 17:     369214 pkt/s
> > proto 17:     369126 pkt/s
> >
> > Therefore, I'm intend to use the "dma_sync_len"+ not use
> > xdp_convert_buff_to_frame() method in the V5 patch. Thank you again,
> > Jesper and Jakub. You really helped me a lot. :)
> >
> 
> I suggest, that V5 patch still use xdp_convert_buff_to_frame(), and then you
> send followup patch (or as 2/2 patch) that remove the use of
> xdp_convert_buff_to_frame() for XDP_TX.  This way it is easier to keep track
> of the changes and improvements.
> 
Okay, I will do it.

> I would be very interested in knowing if the MMIO test change after this
> correction to the testlab/generator.
> 
The performance is significantly improved as you expected, but as I explained
before, I'm not sure whether there are the potential risks other than increase
latency. So I'm not going to modify it at the moment.

Below is the result that I changed the logic to do a MMIO-write on rx-BDR and
tx-BDR respectively in the end of the NPI callback.

root@imx8mpevk:~# ./xdp2 eth0
proto 17:     436020 pkt/s
proto 17:     436167 pkt/s
proto 17:     434205 pkt/s
proto 17:     436140 pkt/s
proto 17:     436115 pkt/s

[V3,net-next] net: fec: add XDP_TX feature support

Checks

Commit Message

Comments

Patch