
[net-next,v4,09/10] bnxt_en: Extend queue stop/start for TX rings

Message ID 20250208202916.1391614-10-michael.chan@broadcom.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series bnxt_en: Add NPAR 1.2 and TPH support

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 2 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 2 this patch: 2
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-02-11--00-00 (tests: 889)

Commit Message

Michael Chan Feb. 8, 2025, 8:29 p.m. UTC
From: Somnath Kotur <somnath.kotur@broadcom.com>

In order to use queue_stop/queue_start to support the new Steering
Tags, we need to free the TX ring and TX completion ring if it is a
combined channel with TX/RX sharing the same NAPI.  Otherwise
TX completions will not have the updated Steering Tag.  If TPH is
not enabled, we just stop the TX ring without freeing the TX/TX cmpl
rings.  With that we can now add napi_disable() and napi_enable()
during queue_stop()/queue_start().  This will guarantee that NAPI
will stop processing the completion entries in case there are
additional pending entries in the completion rings after queue_stop().

There could be some NQEs sitting unprocessed while NAPI is disabled
thereby leaving the NQ unarmed.  Explicitly re-arm the NQ after
napi_enable() in queue start so that NAPI will resume properly.

Error handling in bnxt_queue_start() requires a reset.  If a TX
ring cannot be allocated or initialized properly, it will cause
a TX timeout.  The reset will also free any partially allocated
rings.

Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
Cc: David Wei <dw@davidwei.uk>

v3:
Fix build bot warning.
Only free TX/TX cmpl rings when TPH is enabled.

v2:
Add reset for error handling in queue_start().
Fix compile error.

Discussion about adding napi_disable()/napi_enable():

https://lore.kernel.org/netdev/5336d624-8d8b-40a6-b732-b020e4a119a2@davidwei.uk/#t
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 132 ++++++++++++++++++++--
 1 file changed, 122 insertions(+), 10 deletions(-)
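
For orientation before the review comments, here is a condensed sketch of the
queue_start ordering the commit message describes, assembled from the patch at
the end of this page.  It omits the TPH completion-ring and aggregation-ring
allocations and the page-pool/VNIC handling, and the helper name is made up
for illustration; it is not meant to compile on its own:

	/* Condensed sketch of bnxt_queue_start(); see the full patch below. */
	static int bnxt_queue_start_sketch(struct bnxt *bp, void *qmem, int idx)
	{
		struct bnxt_rx_ring_info *rxr = &bp->rx_ring[idx];
		struct bnxt_napi *bnapi = rxr->bnapi;
		struct bnxt_cp_ring_info *cpr = &bnapi->cp_ring;
		int rc;

		rc = bnxt_hwrm_rx_ring_alloc(bp, rxr);	/* re-create RX ring with FW */
		if (rc)
			goto err_reset_rx;

		if (bp->flags & BNXT_FLAG_SHARED_RINGS) {
			/* Combined channel: restart TX so that TX completions
			 * use the updated Steering Tag.
			 */
			rc = bnxt_tx_queue_start(bp, idx);
			if (rc) {
				bnapi->tx_fault = 1;
				goto err_reset;
			}
		}

		napi_enable(&bnapi->napi);
		/* NQEs may be pending from before napi_disable(); re-arm the NQ
		 * so NAPI resumes processing them.
		 */
		bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
		return 0;

	err_reset_rx:
		rxr->bnapi->in_reset = true;
	err_reset:
		/* A failed restart requires a reset, which also frees any
		 * partially allocated rings.
		 */
		napi_enable(&bnapi->napi);
		bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
		bnxt_reset_task(bp, true);
		return rc;
	}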

Comments

Jakub Kicinski Feb. 12, 2025, 1:44 a.m. UTC | #1
On Sat,  8 Feb 2025 12:29:15 -0800 Michael Chan wrote:
> +		rc = bnxt_hwrm_cp_ring_alloc_p5(bp, txr->tx_cpr);
> +		if (rc)
> +			return rc;
> +
> +		rc = bnxt_hwrm_tx_ring_alloc(bp, txr, false);
> +		if (rc)
> +			return rc;

Under what circumstances can these alloc calls fail?
"alloc" sounds concerning in a start call.

> +		txr->tx_prod = 0;
> +		txr->tx_cons = 0;
> +		txr->tx_hw_cons = 0;

>  	cpr->sw_stats->rx.rx_resets++;
>  
> +	if (bp->flags & BNXT_FLAG_SHARED_RINGS) {
> +		cpr->sw_stats->tx.tx_resets++;

Is there a reason why queue op stop/start cycles are counted as resets?
IIUC previously only faults (~errors) would be counted as resets.
ifdown / ifup or ring reconfig (ethtool -L / -G) would not increment
resets. I think queue reconfig is more like ethtool -L than a fault.
It'd be more consistent with existing code not to increment these
counters.

> +		rc = bnxt_tx_queue_start(bp, idx);
> +		if (rc) {
> +			netdev_warn(bp->dev,
> +				    "tx queue restart failed: rc=%d\n", rc);
> +			bnapi->tx_fault = 1;
> +			goto err_reset;
> +		}
> +	}
> +
> +	napi_enable(&bnapi->napi);

Here you first start the queue then enable NAPI...

> +	bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
> +
>  	for (i = 0; i <= BNXT_VNIC_NTUPLE; i++) {
>  		vnic = &bp->vnic_info[i];
>  

> @@ -15716,17 +15820,25 @@ static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
>  	/* Make sure NAPI sees that the VNIC is disabled */
>  	synchronize_net();
>  	rxr = &bp->rx_ring[idx];
> -	cancel_work_sync(&rxr->bnapi->cp_ring.dim.work);
> +	bnapi = rxr->bnapi;
> +	cpr = &bnapi->cp_ring;
> +	cancel_work_sync(&cpr->dim.work);
>  	bnxt_hwrm_rx_ring_free(bp, rxr, false);
>  	bnxt_hwrm_rx_agg_ring_free(bp, rxr, false);
>  	page_pool_disable_direct_recycling(rxr->page_pool);
>  	if (bnxt_separate_head_pool())
>  		page_pool_disable_direct_recycling(rxr->head_pool);
>  
> +	if (bp->flags & BNXT_FLAG_SHARED_RINGS)
> +		bnxt_tx_queue_stop(bp, idx);
> +
> +	napi_disable(&bnapi->napi);

... but here you do the opposite, and require extra synchronization 
in bnxt_tx_queue_stop() to set your magic flag, sync the NAPI etc.
Why can't the start and stop paths be the mirror image?
Michael Chan Feb. 12, 2025, 2:31 a.m. UTC | #2
On Tue, Feb 11, 2025 at 5:44 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat,  8 Feb 2025 12:29:15 -0800 Michael Chan wrote:
> > +             rc = bnxt_hwrm_cp_ring_alloc_p5(bp, txr->tx_cpr);
> > +             if (rc)
> > +                     return rc;
> > +
> > +             rc = bnxt_hwrm_tx_ring_alloc(bp, txr, false);
> > +             if (rc)
> > +                     return rc;
>
> Under what circumstances can these alloc calls fail?
> "alloc" sounds concerning in a start call.

The ring has been previously reserved with FW, so it normally should
not fail.  I'll need to ask the FW team for some possible failure
scenarios.

>
> > +             txr->tx_prod = 0;
> > +             txr->tx_cons = 0;
> > +             txr->tx_hw_cons = 0;
>
> >       cpr->sw_stats->rx.rx_resets++;
> >
> > +     if (bp->flags & BNXT_FLAG_SHARED_RINGS) {
> > +             cpr->sw_stats->tx.tx_resets++;
>
> Is there a reason why queue op stop/start cycles are counted as resets?
> IIUC previously only faults (~errors) would be counted as resets.
> ifdown / ifup or ring reconfig (ethtool -L / -G) would not increment
> resets. I think queue reconfig is more like ethtool -L than a fault.
> It'd be more consistent with existing code not to increment these
> counters.

I think David's original code increments the rx_reset counter for
every queue_start.  We're just following that.  Maybe it came from the
original plan to use HWRM_RING_RESET to do the RX
queue_stop/queue_start.  We can remove the reset counters for all
queue_stop/queue_start if that makes more sense.

>
> > +             rc = bnxt_tx_queue_start(bp, idx);
> > +             if (rc) {
> > +                     netdev_warn(bp->dev,
> > +                                 "tx queue restart failed: rc=%d\n", rc);
> > +                     bnapi->tx_fault = 1;
> > +                     goto err_reset;
> > +             }
> > +     }
> > +
> > +     napi_enable(&bnapi->napi);
>
> Here you first start the queue then enable NAPI...
>
> > +     bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
> > +
> >       for (i = 0; i <= BNXT_VNIC_NTUPLE; i++) {
> >               vnic = &bp->vnic_info[i];
> >
>
> > @@ -15716,17 +15820,25 @@ static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
> >       /* Make sure NAPI sees that the VNIC is disabled */
> >       synchronize_net();
> >       rxr = &bp->rx_ring[idx];
> > -     cancel_work_sync(&rxr->bnapi->cp_ring.dim.work);
> > +     bnapi = rxr->bnapi;
> > +     cpr = &bnapi->cp_ring;
> > +     cancel_work_sync(&cpr->dim.work);
> >       bnxt_hwrm_rx_ring_free(bp, rxr, false);
> >       bnxt_hwrm_rx_agg_ring_free(bp, rxr, false);
> >       page_pool_disable_direct_recycling(rxr->page_pool);
> >       if (bnxt_separate_head_pool())
> >               page_pool_disable_direct_recycling(rxr->head_pool);
> >
> > +     if (bp->flags & BNXT_FLAG_SHARED_RINGS)
> > +             bnxt_tx_queue_stop(bp, idx);
> > +
> > +     napi_disable(&bnapi->napi);
>
> ... but here you do the opposite, and require extra synchronization
> in bnxt_tx_queue_stop() to set your magic flag, sync the NAPI etc.
> Why can't the start and stop paths be the mirror image?

The ring free operation requires interrupt/NAPI to be working.  FW
signals the completion of the ring free command on the completion ring
associated with the ring we're freeing.  When we see this completion
during NAPI, it guarantees that this is the last DMA on that ring.
Only ring free FW commands are handled this way, requiring NAPI.
Jakub Kicinski Feb. 12, 2025, 2:43 a.m. UTC | #3
On Tue, 11 Feb 2025 18:31:21 -0800 Michael Chan wrote:
> On Tue, Feb 11, 2025 at 5:44 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Sat,  8 Feb 2025 12:29:15 -0800 Michael Chan wrote:  
> > > +             rc = bnxt_hwrm_cp_ring_alloc_p5(bp, txr->tx_cpr);
> > > +             if (rc)
> > > +                     return rc;
> > > +
> > > +             rc = bnxt_hwrm_tx_ring_alloc(bp, txr, false);
> > > +             if (rc)
> > > +                     return rc;  
> >
> > Under what circumstances can these alloc calls fail?
> > "alloc" sounds concerning in a start call.  
> 
> The ring has been previously reserved with FW, so it normally should
> not fail.  I'll need to ask the FW team for some possible failure
> scenarios.

Thanks, expectation is that start never fails.
If the FW team comes back with "should never happen if rings 
are reserved" please add a comment to that effect here. Since
this is one of very few implementations people may read it
and incorrectly assume that allocating is okay.
If the FW team comes back with a list of possible but unlikely
scenarios I'm afraid a rework will be needed.

> > >       cpr->sw_stats->rx.rx_resets++;
> > >
> > > +     if (bp->flags & BNXT_FLAG_SHARED_RINGS) {
> > > +             cpr->sw_stats->tx.tx_resets++;  
> >
> > Is there a reason why queue op stop/start cycles are counted as resets?
> > IIUC previously only faults (~errors) would be counted as resets.
> > ifdown / ifup or ring reconfig (ethtool -L / -G) would not increment
> > resets. I think queue reconfig is more like ethtool -L than a fault.
> > It'd be more consistent with existing code not to increment these
> > counters.  
> 
> I think David's original code increments the rx_reset counter for
> every queue_start.  We're just following that.  Maybe it came from the
> original plan to use HWRM_RING_RESET to do the RX
> queue_stop/queue_start.  We can remove the reset counters for all
> queue_stop/queue_start if that makes more sense.

I vote remove, just to be crystal clear.

> > > @@ -15716,17 +15820,25 @@ static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
> > >       /* Make sure NAPI sees that the VNIC is disabled */
> > >       synchronize_net();
> > >       rxr = &bp->rx_ring[idx];
> > > -     cancel_work_sync(&rxr->bnapi->cp_ring.dim.work);
> > > +     bnapi = rxr->bnapi;
> > > +     cpr = &bnapi->cp_ring;
> > > +     cancel_work_sync(&cpr->dim.work);
> > >       bnxt_hwrm_rx_ring_free(bp, rxr, false);
> > >       bnxt_hwrm_rx_agg_ring_free(bp, rxr, false);
> > >       page_pool_disable_direct_recycling(rxr->page_pool);
> > >       if (bnxt_separate_head_pool())
> > >               page_pool_disable_direct_recycling(rxr->head_pool);
> > >
> > > +     if (bp->flags & BNXT_FLAG_SHARED_RINGS)
> > > +             bnxt_tx_queue_stop(bp, idx);
> > > +
> > > +     napi_disable(&bnapi->napi);  
> >
> > ... but here you do the opposite, and require extra synchronization
> > in bnxt_tx_queue_stop() to set your magic flag, sync the NAPI etc.
> > Why can't the start and stop paths be the mirror image?  
> 
> The ring free operation requires interrupt/NAPI to be working.  FW
> signals the completion of the ring free command on the completion ring
> associated with the ring we're freeing.  When we see this completion
> during NAPI, it guarantees that this is the last DMA on that ring.
> Only ring free FW commands are handled this way, requiring NAPI.

Ugh, I feel like this was explained to me before, sorry.
Again, a comment in the code would go a long way for non-Broadcom
readers.
Michael Chan Feb. 12, 2025, 10:59 p.m. UTC | #4
On Tue, Feb 11, 2025 at 6:43 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 11 Feb 2025 18:31:21 -0800 Michael Chan wrote:
> > The ring has been previously reserved with FW, so it normally should
> > not fail.  I'll need to ask the FW team for some possible failure
> > scenarios.
>
> Thanks, expectation is that start never fails.
> If the FW team comes back with "should never happen if rings
> are reserved" please add a comment to that effect here. Since
> this is one of very few implementations people may read it
> and incorrectly assume that allocating is okay.
> If the FW team comes back with a list of possible but unlikely
> scenarios I'm afraid a rework will be needed.
>

The FW team has confirmed that it should never fail in this case.  The
ring has been previously reserved and allocated.  Re-allocating it
with essentially the same parameters should never fail.  I will be
sending V5 soon.  Thanks.
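
Jakub asked above for in-code comments so that non-Broadcom readers do not
have to rediscover these two points.  A possible wording, condensed from the
explanations in this thread (illustrative only, not taken from the posted
patch):

	/* In bnxt_tx_queue_stop(), before napi_disable() is called:
	 * the ring free commands must run while NAPI is still active.  FW
	 * signals completion of a RING_FREE command on the completion ring
	 * associated with the ring being freed, and seeing that completion
	 * in NAPI guarantees there is no further DMA on the ring.
	 */
	bnxt_hwrm_tx_ring_free(bp, txr, true);
	bnxt_hwrm_cp_ring_free(bp, txr->tx_cpr);

	/* In bnxt_tx_queue_start():
	 * the TX ring and its completion ring were already reserved and
	 * allocated with FW; re-allocating them with the same parameters is
	 * not expected to fail, so queue_start remains effectively a
	 * cannot-fail operation.
	 */
	rc = bnxt_hwrm_cp_ring_alloc_p5(bp, txr->tx_cpr);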

Patch

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 019a8433b0d6..fee9baff9e5a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7368,6 +7368,22 @@  static int hwrm_ring_free_send_msg(struct bnxt *bp,
 	return 0;
 }
 
+static void bnxt_hwrm_tx_ring_free(struct bnxt *bp, struct bnxt_tx_ring_info *txr,
+				   bool close_path)
+{
+	struct bnxt_ring_struct *ring = &txr->tx_ring_struct;
+	u32 cmpl_ring_id;
+
+	if (ring->fw_ring_id == INVALID_HW_RING_ID)
+		return;
+
+	cmpl_ring_id = close_path ? bnxt_cp_ring_for_tx(bp, txr) :
+		       INVALID_HW_RING_ID;
+	hwrm_ring_free_send_msg(bp, ring, RING_FREE_REQ_RING_TYPE_TX,
+				cmpl_ring_id);
+	ring->fw_ring_id = INVALID_HW_RING_ID;
+}
+
 static void bnxt_hwrm_rx_ring_free(struct bnxt *bp,
 				   struct bnxt_rx_ring_info *rxr,
 				   bool close_path)
@@ -11274,6 +11290,75 @@  int bnxt_reserve_rings(struct bnxt *bp, bool irq_re_init)
 	return 0;
 }
 
+static void bnxt_tx_queue_stop(struct bnxt *bp, int idx)
+{
+	struct bnxt_tx_ring_info *txr;
+	struct netdev_queue *txq;
+	struct bnxt_napi *bnapi;
+	int i;
+
+	bnapi = bp->bnapi[idx];
+	bnxt_for_each_napi_tx(i, bnapi, txr) {
+		WRITE_ONCE(txr->dev_state, BNXT_DEV_STATE_CLOSING);
+		synchronize_net();
+
+		if (!(bnapi->flags & BNXT_NAPI_FLAG_XDP)) {
+			txq = netdev_get_tx_queue(bp->dev, txr->txq_index);
+			if (txq) {
+				__netif_tx_lock_bh(txq);
+				netif_tx_stop_queue(txq);
+				__netif_tx_unlock_bh(txq);
+			}
+		}
+
+		if (!bp->tph_mode)
+			continue;
+
+		bnxt_hwrm_tx_ring_free(bp, txr, true);
+		bnxt_hwrm_cp_ring_free(bp, txr->tx_cpr);
+		bnxt_free_one_tx_ring_skbs(bp, txr, txr->txq_index);
+		bnxt_clear_one_cp_ring(bp, txr->tx_cpr);
+	}
+}
+
+static int bnxt_tx_queue_start(struct bnxt *bp, int idx)
+{
+	struct bnxt_tx_ring_info *txr;
+	struct netdev_queue *txq;
+	struct bnxt_napi *bnapi;
+	int rc, i;
+
+	bnapi = bp->bnapi[idx];
+	bnxt_for_each_napi_tx(i, bnapi, txr) {
+		if (!bp->tph_mode)
+			goto start_tx;
+
+		rc = bnxt_hwrm_cp_ring_alloc_p5(bp, txr->tx_cpr);
+		if (rc)
+			return rc;
+
+		rc = bnxt_hwrm_tx_ring_alloc(bp, txr, false);
+		if (rc)
+			return rc;
+
+		txr->tx_prod = 0;
+		txr->tx_cons = 0;
+		txr->tx_hw_cons = 0;
+start_tx:
+		WRITE_ONCE(txr->dev_state, 0);
+		synchronize_net();
+
+		if (bnapi->flags & BNXT_NAPI_FLAG_XDP)
+			continue;
+
+		txq = netdev_get_tx_queue(bp->dev, txr->txq_index);
+		if (txq)
+			netif_tx_start_queue(txq);
+	}
+
+	return 0;
+}
+
 static void bnxt_free_irq(struct bnxt *bp)
 {
 	struct bnxt_irq *irq;
@@ -15638,6 +15723,7 @@  static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
 	struct bnxt_rx_ring_info *rxr, *clone;
 	struct bnxt_cp_ring_info *cpr;
 	struct bnxt_vnic_info *vnic;
+	struct bnxt_napi *bnapi;
 	int i, rc;
 
 	rxr = &bp->rx_ring[idx];
@@ -15655,27 +15741,42 @@  static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
 
 	bnxt_copy_rx_ring(bp, rxr, clone);
 
+	bnapi = rxr->bnapi;
+	cpr = &bnapi->cp_ring;
 	rc = bnxt_hwrm_rx_ring_alloc(bp, rxr);
 	if (rc)
-		return rc;
+		goto err_reset_rx;
 
 	if (bp->tph_mode) {
 		rc = bnxt_hwrm_cp_ring_alloc_p5(bp, rxr->rx_cpr);
 		if (rc)
-			goto err_free_hwrm_rx_ring;
+			goto err_reset_rx;
 	}
 
 	rc = bnxt_hwrm_rx_agg_ring_alloc(bp, rxr);
 	if (rc)
-		goto err_free_hwrm_cp_ring;
+		goto err_reset_rx;
 
 	bnxt_db_write(bp, &rxr->rx_db, rxr->rx_prod);
 	if (bp->flags & BNXT_FLAG_AGG_RINGS)
 		bnxt_db_write(bp, &rxr->rx_agg_db, rxr->rx_agg_prod);
 
-	cpr = &rxr->bnapi->cp_ring;
 	cpr->sw_stats->rx.rx_resets++;
 
+	if (bp->flags & BNXT_FLAG_SHARED_RINGS) {
+		cpr->sw_stats->tx.tx_resets++;
+		rc = bnxt_tx_queue_start(bp, idx);
+		if (rc) {
+			netdev_warn(bp->dev,
+				    "tx queue restart failed: rc=%d\n", rc);
+			bnapi->tx_fault = 1;
+			goto err_reset;
+		}
+	}
+
+	napi_enable(&bnapi->napi);
+	bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
+
 	for (i = 0; i <= BNXT_VNIC_NTUPLE; i++) {
 		vnic = &bp->vnic_info[i];
 
@@ -15692,11 +15793,12 @@  static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
 
 	return 0;
 
-err_free_hwrm_cp_ring:
-	if (bp->tph_mode)
-		bnxt_hwrm_cp_ring_free(bp, rxr->rx_cpr);
-err_free_hwrm_rx_ring:
-	bnxt_hwrm_rx_ring_free(bp, rxr, false);
+err_reset_rx:
+	rxr->bnapi->in_reset = true;
+err_reset:
+	napi_enable(&bnapi->napi);
+	bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
+	bnxt_reset_task(bp, true);
 	return rc;
 }
 
@@ -15704,7 +15806,9 @@  static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
 {
 	struct bnxt *bp = netdev_priv(dev);
 	struct bnxt_rx_ring_info *rxr;
+	struct bnxt_cp_ring_info *cpr;
 	struct bnxt_vnic_info *vnic;
+	struct bnxt_napi *bnapi;
 	int i;
 
 	for (i = 0; i <= BNXT_VNIC_NTUPLE; i++) {
@@ -15716,17 +15820,25 @@  static int bnxt_queue_stop(struct net_device *dev, void *qmem, int idx)
 	/* Make sure NAPI sees that the VNIC is disabled */
 	synchronize_net();
 	rxr = &bp->rx_ring[idx];
-	cancel_work_sync(&rxr->bnapi->cp_ring.dim.work);
+	bnapi = rxr->bnapi;
+	cpr = &bnapi->cp_ring;
+	cancel_work_sync(&cpr->dim.work);
 	bnxt_hwrm_rx_ring_free(bp, rxr, false);
 	bnxt_hwrm_rx_agg_ring_free(bp, rxr, false);
 	page_pool_disable_direct_recycling(rxr->page_pool);
 	if (bnxt_separate_head_pool())
 		page_pool_disable_direct_recycling(rxr->head_pool);
 
+	if (bp->flags & BNXT_FLAG_SHARED_RINGS)
+		bnxt_tx_queue_stop(bp, idx);
+
+	napi_disable(&bnapi->napi);
+
 	if (bp->tph_mode) {
 		bnxt_hwrm_cp_ring_free(bp, rxr->rx_cpr);
 		bnxt_clear_one_cp_ring(bp, rxr->rx_cpr);
 	}
+	bnxt_db_nq(bp, &cpr->cp_db, cpr->cp_raw_cons);
 
 	memcpy(qmem, rxr, sizeof(*rxr));
 	bnxt_init_rx_ring_struct(bp, qmem);