net: stmmac: protect statistics updates with a spinlock

Message ID	20240105091556.15516-1-petr@tesarici.cz (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> From: Petr Tesarik <petr@tesarici.cz> To: Alexandre Torgue <alexandre.torgue@foss.st.com>, Jose Abreu <joabreu@synopsys.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Maxime Coquelin <mcoquelin.stm32@gmail.com>, Chen-Yu Tsai <wens@csie.org>, Jernej Skrabec <jernej.skrabec@gmail.com>, Samuel Holland <samuel@sholland.org>, netdev@vger.kernel.org (open list:STMMAC ETHERNET DRIVER), linux-stm32@st-md-mailman.stormreply.com (moderated list:ARM/STM32 ARCHITECTURE), linux-arm-kernel@lists.infradead.org (moderated list:ARM/STM32 ARCHITECTURE), linux-kernel@vger.kernel.org (open list), linux-sunxi@lists.linux.dev (open list:ARM/Allwinner sunXi SoC support) Cc: Petr Tesarik <petr@tesarici.cz> Subject: [PATCH] net: stmmac: protect statistics updates with a spinlock Date: Fri, 5 Jan 2024 10:15:56 +0100 Message-ID: <20240105091556.15516-1-petr@tesarici.cz> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	net: stmmac: protect statistics updates with a spinlock \| expand net: stmmac: protect statistics updates with a spinlock

Petr Tesařík Jan. 5, 2024, 9:15 a.m. UTC

Add a spinlock to fix race conditions while updating Tx/Rx statistics.

As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
u64_stats_sync must ensure mutual exclusion, or one seqcount update could
be lost on 32-bit platforms, thus blocking readers forever.

Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
on one core raced with stmmac_napi_poll_tx() on another core.

Signed-off-by: Petr Tesarik <petr@tesarici.cz>
---
 drivers/net/ethernet/stmicro/stmmac/common.h  |  2 +
 .../net/ethernet/stmicro/stmmac/dwmac-sun8i.c |  4 +
 .../net/ethernet/stmicro/stmmac/dwmac4_lib.c  |  4 +
 .../net/ethernet/stmicro/stmmac/dwmac_lib.c   |  4 +
 .../ethernet/stmicro/stmmac/dwxgmac2_dma.c    |  4 +
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 80 +++++++++++++------
 6 files changed, 72 insertions(+), 26 deletions(-)

Jiri Pirko Jan. 5, 2024, 9:45 a.m. UTC | #1

Fri, Jan 05, 2024 at 10:15:56AM CET, petr@tesarici.cz wrote:
>Add a spinlock to fix race conditions while updating Tx/Rx statistics.
>
>As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
>u64_stats_sync must ensure mutual exclusion, or one seqcount update could
>be lost on 32-bit platforms, thus blocking readers forever.
>
>Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
>on one core raced with stmmac_napi_poll_tx() on another core.
>
>Signed-off-by: Petr Tesarik <petr@tesarici.cz>
>---
> drivers/net/ethernet/stmicro/stmmac/common.h  |  2 +
> .../net/ethernet/stmicro/stmmac/dwmac-sun8i.c |  4 +
> .../net/ethernet/stmicro/stmmac/dwmac4_lib.c  |  4 +
> .../net/ethernet/stmicro/stmmac/dwmac_lib.c   |  4 +
> .../ethernet/stmicro/stmmac/dwxgmac2_dma.c    |  4 +
> .../net/ethernet/stmicro/stmmac/stmmac_main.c | 80 +++++++++++++------
> 6 files changed, 72 insertions(+), 26 deletions(-)
>
>diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
>index e3f650e88f82..9a17dfc1055d 100644
>--- a/drivers/net/ethernet/stmicro/stmmac/common.h
>+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
>@@ -70,6 +70,7 @@ struct stmmac_txq_stats {
> 	u64 tx_tso_frames;
> 	u64 tx_tso_nfrags;
> 	struct u64_stats_sync syncp;
>+	spinlock_t lock;	/* mutual writer exclusion */
> } ____cacheline_aligned_in_smp;
> 
> struct stmmac_rxq_stats {
>@@ -79,6 +80,7 @@ struct stmmac_rxq_stats {
> 	u64 rx_normal_irq_n;
> 	u64 napi_poll;
> 	struct u64_stats_sync syncp;
>+	spinlock_t lock;	/* mutual writer exclusion */
> } ____cacheline_aligned_in_smp;
> 
> /* Extra statistic and debug information exposed by ethtool */
>diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>index 137741b94122..9c568996321d 100644
>--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>@@ -455,9 +455,11 @@ static int sun8i_dwmac_dma_interrupt(struct stmmac_priv *priv,
> 
> 	if (v & EMAC_TX_INT) {
> 		ret |= handle_tx;
>+		spin_lock(&txq_stats->lock);
> 		u64_stats_update_begin(&txq_stats->syncp);
> 		txq_stats->tx_normal_irq_n++;
> 		u64_stats_update_end(&txq_stats->syncp);
>+		spin_unlock(&txq_stats->lock);
> 	}
> 
> 	if (v & EMAC_TX_DMA_STOP_INT)
>@@ -479,9 +481,11 @@ static int sun8i_dwmac_dma_interrupt(struct stmmac_priv *priv,
> 
> 	if (v & EMAC_RX_INT) {
> 		ret |= handle_rx;
>+		spin_lock(&rxq_stats->lock);
> 		u64_stats_update_begin(&rxq_stats->syncp);
> 		rxq_stats->rx_normal_irq_n++;
> 		u64_stats_update_end(&rxq_stats->syncp);
>+		spin_unlock(&rxq_stats->lock);
> 	}
> 
> 	if (v & EMAC_RX_BUF_UA_INT)
>diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
>index 9470d3fd2ded..e50e8b07724b 100644
>--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
>+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
>@@ -201,15 +201,19 @@ int dwmac4_dma_interrupt(struct stmmac_priv *priv, void __iomem *ioaddr,
> 	}
> 	/* TX/RX NORMAL interrupts */
> 	if (likely(intr_status & DMA_CHAN_STATUS_RI)) {
>+		spin_lock(&rxq_stats->lock);
> 		u64_stats_update_begin(&rxq_stats->syncp);
> 		rxq_stats->rx_normal_irq_n++;
> 		u64_stats_update_end(&rxq_stats->syncp);
>+		spin_unlock(&rxq_stats->lock);
> 		ret |= handle_rx;
> 	}
> 	if (likely(intr_status & DMA_CHAN_STATUS_TI)) {
>+		spin_lock(&txq_stats->lock);
> 		u64_stats_update_begin(&txq_stats->syncp);
> 		txq_stats->tx_normal_irq_n++;
> 		u64_stats_update_end(&txq_stats->syncp);
>+		spin_unlock(&txq_stats->lock);
> 		ret |= handle_tx;
> 	}
> 
>diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
>index 7907d62d3437..a43396a7f852 100644
>--- a/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
>+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
>@@ -215,16 +215,20 @@ int dwmac_dma_interrupt(struct stmmac_priv *priv, void __iomem *ioaddr,
> 			u32 value = readl(ioaddr + DMA_INTR_ENA);
> 			/* to schedule NAPI on real RIE event. */
> 			if (likely(value & DMA_INTR_ENA_RIE)) {
>+				spin_lock(&rxq_stats->lock);
> 				u64_stats_update_begin(&rxq_stats->syncp);
> 				rxq_stats->rx_normal_irq_n++;
> 				u64_stats_update_end(&rxq_stats->syncp);
>+				spin_unlock(&rxq_stats->lock);
> 				ret |= handle_rx;
> 			}
> 		}
> 		if (likely(intr_status & DMA_STATUS_TI)) {
>+			spin_lock(&txq_stats->lock);
> 			u64_stats_update_begin(&txq_stats->syncp);
> 			txq_stats->tx_normal_irq_n++;
> 			u64_stats_update_end(&txq_stats->syncp);
>+			spin_unlock(&txq_stats->lock);
> 			ret |= handle_tx;
> 		}
> 		if (unlikely(intr_status & DMA_STATUS_ERI))
>diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
>index 3cde695fec91..f4e01436d4cc 100644
>--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
>+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
>@@ -367,15 +367,19 @@ static int dwxgmac2_dma_interrupt(struct stmmac_priv *priv,
> 	/* TX/RX NORMAL interrupts */
> 	if (likely(intr_status & XGMAC_NIS)) {
> 		if (likely(intr_status & XGMAC_RI)) {
>+			spin_lock(&rxq_stats->lock);
> 			u64_stats_update_begin(&rxq_stats->syncp);
> 			rxq_stats->rx_normal_irq_n++;
> 			u64_stats_update_end(&rxq_stats->syncp);
>+			spin_unlock(&rxq_stats->lock);
> 			ret |= handle_rx;
> 		}
> 		if (likely(intr_status & (XGMAC_TI | XGMAC_TBU))) {
>+			spin_lock(&txq_stats->lock);
> 			u64_stats_update_begin(&txq_stats->syncp);
> 			txq_stats->tx_normal_irq_n++;
> 			u64_stats_update_end(&txq_stats->syncp);
>+			spin_unlock(&txq_stats->lock);
> 			ret |= handle_tx;
> 		}
> 	}
>diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>index 37e64283f910..82d8db04d0d1 100644
>--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>@@ -2515,9 +2515,11 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
> 		tx_q->cur_tx = STMMAC_GET_ENTRY(tx_q->cur_tx, priv->dma_conf.dma_tx_size);
> 		entry = tx_q->cur_tx;
> 	}
>-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>+	spin_lock_irqsave(&txq_stats->lock, flags);
>+	u64_stats_update_begin(&txq_stats->syncp);
> 	txq_stats->tx_set_ic_bit += tx_set_ic_bit;
>-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>+	u64_stats_update_end(&txq_stats->syncp);
>+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> 
> 	if (tx_desc) {
> 		stmmac_flush_tx_descriptors(priv, queue);
>@@ -2721,11 +2723,13 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue,
> 	if (tx_q->dirty_tx != tx_q->cur_tx)
> 		*pending_packets = true;
> 
>-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>+	spin_lock_irqsave(&txq_stats->lock, flags);
>+	u64_stats_update_begin(&txq_stats->syncp);
> 	txq_stats->tx_packets += tx_packets;
> 	txq_stats->tx_pkt_n += tx_packets;
> 	txq_stats->tx_clean++;
>-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>+	u64_stats_update_end(&txq_stats->syncp);
>+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> 
> 	priv->xstats.tx_errors += tx_errors;
> 
>@@ -4311,13 +4315,15 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
> 		netif_tx_stop_queue(netdev_get_tx_queue(priv->dev, queue));
> 	}
> 
>-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>+	spin_lock_irqsave(&txq_stats->lock, flags);
>+	u64_stats_update_begin(&txq_stats->syncp);
> 	txq_stats->tx_bytes += skb->len;
> 	txq_stats->tx_tso_frames++;
> 	txq_stats->tx_tso_nfrags += nfrags;
> 	if (set_ic)
> 		txq_stats->tx_set_ic_bit++;
>-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>+	u64_stats_update_end(&txq_stats->syncp);
>+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> 
> 	if (priv->sarc_type)
> 		stmmac_set_desc_sarc(priv, first, priv->sarc_type);
>@@ -4560,11 +4566,13 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
> 		netif_tx_stop_queue(netdev_get_tx_queue(priv->dev, queue));
> 	}
> 
>-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>+	spin_lock_irqsave(&txq_stats->lock, flags);
>+	u64_stats_update_begin(&txq_stats->syncp);
> 	txq_stats->tx_bytes += skb->len;
> 	if (set_ic)
> 		txq_stats->tx_set_ic_bit++;
>-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>+	u64_stats_update_end(&txq_stats->syncp);
>+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> 
> 	if (priv->sarc_type)
> 		stmmac_set_desc_sarc(priv, first, priv->sarc_type);
>@@ -4831,9 +4839,11 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
> 		unsigned long flags;
> 		tx_q->tx_count_frames = 0;
> 		stmmac_set_tx_ic(priv, tx_desc);
>-		flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>+		spin_lock_irqsave(&txq_stats->lock, flags);
>+		u64_stats_update_begin(&txq_stats->syncp);
> 		txq_stats->tx_set_ic_bit++;
>-		u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>+		u64_stats_update_end(&txq_stats->syncp);
>+		spin_unlock_irqrestore(&txq_stats->lock, flags);
> 	}
> 
> 	stmmac_enable_dma_transmission(priv, priv->ioaddr);
>@@ -5008,10 +5018,12 @@ static void stmmac_dispatch_skb_zc(struct stmmac_priv *priv, u32 queue,
> 	skb_record_rx_queue(skb, queue);
> 	napi_gro_receive(&ch->rxtx_napi, skb);
> 
>-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>+	spin_lock_irqsave(&rxq_stats->lock, flags);
>+	u64_stats_update_begin(&rxq_stats->syncp);
> 	rxq_stats->rx_pkt_n++;
> 	rxq_stats->rx_bytes += len;
>-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>+	u64_stats_update_end(&rxq_stats->syncp);
>+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
> }
> 
> static bool stmmac_rx_refill_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
>@@ -5248,9 +5260,11 @@ static int stmmac_rx_zc(struct stmmac_priv *priv, int limit, u32 queue)
> 
> 	stmmac_finalize_xdp_rx(priv, xdp_status);
> 
>-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>+	spin_lock_irqsave(&rxq_stats->lock, flags);
>+	u64_stats_update_begin(&rxq_stats->syncp);
> 	rxq_stats->rx_pkt_n += count;
>-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>+	u64_stats_update_end(&rxq_stats->syncp);
>+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
> 
> 	priv->xstats.rx_dropped += rx_dropped;
> 	priv->xstats.rx_errors += rx_errors;
>@@ -5541,11 +5555,13 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> 
> 	stmmac_rx_refill(priv, queue);
> 
>-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>+	spin_lock_irqsave(&rxq_stats->lock, flags);
>+	u64_stats_update_begin(&rxq_stats->syncp);
> 	rxq_stats->rx_packets += rx_packets;
> 	rxq_stats->rx_bytes += rx_bytes;
> 	rxq_stats->rx_pkt_n += count;
>-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>+	u64_stats_update_end(&rxq_stats->syncp);
>+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
> 
> 	priv->xstats.rx_dropped += rx_dropped;
> 	priv->xstats.rx_errors += rx_errors;
>@@ -5564,9 +5580,11 @@ static int stmmac_napi_poll_rx(struct napi_struct *napi, int budget)
> 	int work_done;
> 
> 	rxq_stats = &priv->xstats.rxq_stats[chan];
>-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>+	spin_lock_irqsave(&rxq_stats->lock, flags);
>+	u64_stats_update_begin(&rxq_stats->syncp);
> 	rxq_stats->napi_poll++;
>-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>+	u64_stats_update_end(&rxq_stats->syncp);
>+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
> 
> 	work_done = stmmac_rx(priv, budget, chan);
> 	if (work_done < budget && napi_complete_done(napi, work_done)) {
>@@ -5592,9 +5610,11 @@ static int stmmac_napi_poll_tx(struct napi_struct *napi, int budget)
> 	int work_done;
> 
> 	txq_stats = &priv->xstats.txq_stats[chan];
>-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>+	spin_lock_irqsave(&txq_stats->lock, flags);
>+	u64_stats_update_begin(&txq_stats->syncp);
> 	txq_stats->napi_poll++;
>-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>+	u64_stats_update_end(&txq_stats->syncp);
>+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> 
> 	work_done = stmmac_tx_clean(priv, budget, chan, &pending_packets);
> 	work_done = min(work_done, budget);
>@@ -5627,14 +5647,18 @@ static int stmmac_napi_poll_rxtx(struct napi_struct *napi, int budget)
> 	unsigned long flags;
> 
> 	rxq_stats = &priv->xstats.rxq_stats[chan];
>-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>+	spin_lock_irqsave(&rxq_stats->lock, flags);
>+	u64_stats_update_begin(&rxq_stats->syncp);
> 	rxq_stats->napi_poll++;
>-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>+	u64_stats_update_end(&rxq_stats->syncp);
>+	spin_unlock(&rxq_stats->lock);

Nitpick:
I know that the original code does that, but any idea why
u64_stats_update_end_irqrestore() is called here when
u64_stats_update_begin_irqsave() is called 2 lines below?
IIUC, this could be one critical section. Could you perhaps merge these
while at it? Could be a follow-up patch.

Rest of the patch looks fine to me.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>


> 
> 	txq_stats = &priv->xstats.txq_stats[chan];
>-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>+	spin_lock(&txq_stats->lock);
>+	u64_stats_update_begin(&txq_stats->syncp);
> 	txq_stats->napi_poll++;
>-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>+	u64_stats_update_end(&txq_stats->syncp);
>+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> 
> 	tx_done = stmmac_tx_clean(priv, budget, chan, &tx_pending_packets);
> 	tx_done = min(tx_done, budget);
>@@ -7371,10 +7395,14 @@ int stmmac_dvr_probe(struct device *device,
> 	priv->device = device;
> 	priv->dev = ndev;
> 
>-	for (i = 0; i < MTL_MAX_RX_QUEUES; i++)
>+	for (i = 0; i < MTL_MAX_RX_QUEUES; i++) {
> 		u64_stats_init(&priv->xstats.rxq_stats[i].syncp);
>-	for (i = 0; i < MTL_MAX_TX_QUEUES; i++)
>+		spin_lock_init(&priv->xstats.rxq_stats[i].lock);
>+	}
>+	for (i = 0; i < MTL_MAX_TX_QUEUES; i++) {
> 		u64_stats_init(&priv->xstats.txq_stats[i].syncp);
>+		spin_lock_init(&priv->xstats.txq_stats[i].lock);
>+	}
> 
> 	stmmac_set_ethtool_ops(ndev);
> 	priv->pause = pause;
>-- 
>2.43.0
>
>

Eric Dumazet Jan. 5, 2024, 9:58 a.m. UTC | #2

On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@tesarici.cz> wrote:
>
> Add a spinlock to fix race conditions while updating Tx/Rx statistics.
>
> As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> be lost on 32-bit platforms, thus blocking readers forever.
>
> Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> on one core raced with stmmac_napi_poll_tx() on another core.
>
> Signed-off-by: Petr Tesarik <petr@tesarici.cz>

This is going to add more costs to 64bit platforms ?

It seems to me that the same syncp can be used from two different
threads : hard irq and napi poller...

At this point, I do not see why you keep linux/u64_stats_sync.h if you
decide to go for a spinlock...

Alternative would use atomic64_t fields for the ones where there is no
mutual exclusion.

RX : napi poll is definitely safe (protected by an atomic bit)
TX : each TX queue is also safe (protected by an atomic exclusion for
non LLTX drivers)

This leaves the fields updated from hardware interrupt context ?

Petr Tesařík Jan. 5, 2024, 10:25 a.m. UTC | #3

On Fri, 5 Jan 2024 10:45:55 +0100
Jiri Pirko <jiri@resnulli.us> wrote:

> Fri, Jan 05, 2024 at 10:15:56AM CET, petr@tesarici.cz wrote:
> >Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> >
> >As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> >u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> >be lost on 32-bit platforms, thus blocking readers forever.
> >
> >Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> >on one core raced with stmmac_napi_poll_tx() on another core.
> >
> >Signed-off-by: Petr Tesarik <petr@tesarici.cz>
> >---
> > drivers/net/ethernet/stmicro/stmmac/common.h  |  2 +
> > .../net/ethernet/stmicro/stmmac/dwmac-sun8i.c |  4 +
> > .../net/ethernet/stmicro/stmmac/dwmac4_lib.c  |  4 +
> > .../net/ethernet/stmicro/stmmac/dwmac_lib.c   |  4 +
> > .../ethernet/stmicro/stmmac/dwxgmac2_dma.c    |  4 +
> > .../net/ethernet/stmicro/stmmac/stmmac_main.c | 80 +++++++++++++------
> > 6 files changed, 72 insertions(+), 26 deletions(-)
> >
> >diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
> >index e3f650e88f82..9a17dfc1055d 100644
> >--- a/drivers/net/ethernet/stmicro/stmmac/common.h
> >+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> >@@ -70,6 +70,7 @@ struct stmmac_txq_stats {
> > 	u64 tx_tso_frames;
> > 	u64 tx_tso_nfrags;
> > 	struct u64_stats_sync syncp;
> >+	spinlock_t lock;	/* mutual writer exclusion */
> > } ____cacheline_aligned_in_smp;
> > 
> > struct stmmac_rxq_stats {
> >@@ -79,6 +80,7 @@ struct stmmac_rxq_stats {
> > 	u64 rx_normal_irq_n;
> > 	u64 napi_poll;
> > 	struct u64_stats_sync syncp;
> >+	spinlock_t lock;	/* mutual writer exclusion */
> > } ____cacheline_aligned_in_smp;
> > 
> > /* Extra statistic and debug information exposed by ethtool */
> >diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> >index 137741b94122..9c568996321d 100644
> >--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> >+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> >@@ -455,9 +455,11 @@ static int sun8i_dwmac_dma_interrupt(struct stmmac_priv *priv,
> > 
> > 	if (v & EMAC_TX_INT) {
> > 		ret |= handle_tx;
> >+		spin_lock(&txq_stats->lock);
> > 		u64_stats_update_begin(&txq_stats->syncp);
> > 		txq_stats->tx_normal_irq_n++;
> > 		u64_stats_update_end(&txq_stats->syncp);
> >+		spin_unlock(&txq_stats->lock);
> > 	}
> > 
> > 	if (v & EMAC_TX_DMA_STOP_INT)
> >@@ -479,9 +481,11 @@ static int sun8i_dwmac_dma_interrupt(struct stmmac_priv *priv,
> > 
> > 	if (v & EMAC_RX_INT) {
> > 		ret |= handle_rx;
> >+		spin_lock(&rxq_stats->lock);
> > 		u64_stats_update_begin(&rxq_stats->syncp);
> > 		rxq_stats->rx_normal_irq_n++;
> > 		u64_stats_update_end(&rxq_stats->syncp);
> >+		spin_unlock(&rxq_stats->lock);
> > 	}
> > 
> > 	if (v & EMAC_RX_BUF_UA_INT)
> >diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
> >index 9470d3fd2ded..e50e8b07724b 100644
> >--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
> >+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
> >@@ -201,15 +201,19 @@ int dwmac4_dma_interrupt(struct stmmac_priv *priv, void __iomem *ioaddr,
> > 	}
> > 	/* TX/RX NORMAL interrupts */
> > 	if (likely(intr_status & DMA_CHAN_STATUS_RI)) {
> >+		spin_lock(&rxq_stats->lock);
> > 		u64_stats_update_begin(&rxq_stats->syncp);
> > 		rxq_stats->rx_normal_irq_n++;
> > 		u64_stats_update_end(&rxq_stats->syncp);
> >+		spin_unlock(&rxq_stats->lock);
> > 		ret |= handle_rx;
> > 	}
> > 	if (likely(intr_status & DMA_CHAN_STATUS_TI)) {
> >+		spin_lock(&txq_stats->lock);
> > 		u64_stats_update_begin(&txq_stats->syncp);
> > 		txq_stats->tx_normal_irq_n++;
> > 		u64_stats_update_end(&txq_stats->syncp);
> >+		spin_unlock(&txq_stats->lock);
> > 		ret |= handle_tx;
> > 	}
> > 
> >diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
> >index 7907d62d3437..a43396a7f852 100644
> >--- a/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
> >+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
> >@@ -215,16 +215,20 @@ int dwmac_dma_interrupt(struct stmmac_priv *priv, void __iomem *ioaddr,
> > 			u32 value = readl(ioaddr + DMA_INTR_ENA);
> > 			/* to schedule NAPI on real RIE event. */
> > 			if (likely(value & DMA_INTR_ENA_RIE)) {
> >+				spin_lock(&rxq_stats->lock);
> > 				u64_stats_update_begin(&rxq_stats->syncp);
> > 				rxq_stats->rx_normal_irq_n++;
> > 				u64_stats_update_end(&rxq_stats->syncp);
> >+				spin_unlock(&rxq_stats->lock);
> > 				ret |= handle_rx;
> > 			}
> > 		}
> > 		if (likely(intr_status & DMA_STATUS_TI)) {
> >+			spin_lock(&txq_stats->lock);
> > 			u64_stats_update_begin(&txq_stats->syncp);
> > 			txq_stats->tx_normal_irq_n++;
> > 			u64_stats_update_end(&txq_stats->syncp);
> >+			spin_unlock(&txq_stats->lock);
> > 			ret |= handle_tx;
> > 		}
> > 		if (unlikely(intr_status & DMA_STATUS_ERI))
> >diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
> >index 3cde695fec91..f4e01436d4cc 100644
> >--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
> >+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
> >@@ -367,15 +367,19 @@ static int dwxgmac2_dma_interrupt(struct stmmac_priv *priv,
> > 	/* TX/RX NORMAL interrupts */
> > 	if (likely(intr_status & XGMAC_NIS)) {
> > 		if (likely(intr_status & XGMAC_RI)) {
> >+			spin_lock(&rxq_stats->lock);
> > 			u64_stats_update_begin(&rxq_stats->syncp);
> > 			rxq_stats->rx_normal_irq_n++;
> > 			u64_stats_update_end(&rxq_stats->syncp);
> >+			spin_unlock(&rxq_stats->lock);
> > 			ret |= handle_rx;
> > 		}
> > 		if (likely(intr_status & (XGMAC_TI | XGMAC_TBU))) {
> >+			spin_lock(&txq_stats->lock);
> > 			u64_stats_update_begin(&txq_stats->syncp);
> > 			txq_stats->tx_normal_irq_n++;
> > 			u64_stats_update_end(&txq_stats->syncp);
> >+			spin_unlock(&txq_stats->lock);
> > 			ret |= handle_tx;
> > 		}
> > 	}
> >diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> >index 37e64283f910..82d8db04d0d1 100644
> >--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> >+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> >@@ -2515,9 +2515,11 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
> > 		tx_q->cur_tx = STMMAC_GET_ENTRY(tx_q->cur_tx, priv->dma_conf.dma_tx_size);
> > 		entry = tx_q->cur_tx;
> > 	}
> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
> >+	spin_lock_irqsave(&txq_stats->lock, flags);
> >+	u64_stats_update_begin(&txq_stats->syncp);
> > 	txq_stats->tx_set_ic_bit += tx_set_ic_bit;
> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
> >+	u64_stats_update_end(&txq_stats->syncp);
> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> > 
> > 	if (tx_desc) {
> > 		stmmac_flush_tx_descriptors(priv, queue);
> >@@ -2721,11 +2723,13 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue,
> > 	if (tx_q->dirty_tx != tx_q->cur_tx)
> > 		*pending_packets = true;
> > 
> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
> >+	spin_lock_irqsave(&txq_stats->lock, flags);
> >+	u64_stats_update_begin(&txq_stats->syncp);
> > 	txq_stats->tx_packets += tx_packets;
> > 	txq_stats->tx_pkt_n += tx_packets;
> > 	txq_stats->tx_clean++;
> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
> >+	u64_stats_update_end(&txq_stats->syncp);
> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> > 
> > 	priv->xstats.tx_errors += tx_errors;
> > 
> >@@ -4311,13 +4315,15 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
> > 		netif_tx_stop_queue(netdev_get_tx_queue(priv->dev, queue));
> > 	}
> > 
> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
> >+	spin_lock_irqsave(&txq_stats->lock, flags);
> >+	u64_stats_update_begin(&txq_stats->syncp);
> > 	txq_stats->tx_bytes += skb->len;
> > 	txq_stats->tx_tso_frames++;
> > 	txq_stats->tx_tso_nfrags += nfrags;
> > 	if (set_ic)
> > 		txq_stats->tx_set_ic_bit++;
> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
> >+	u64_stats_update_end(&txq_stats->syncp);
> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> > 
> > 	if (priv->sarc_type)
> > 		stmmac_set_desc_sarc(priv, first, priv->sarc_type);
> >@@ -4560,11 +4566,13 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
> > 		netif_tx_stop_queue(netdev_get_tx_queue(priv->dev, queue));
> > 	}
> > 
> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
> >+	spin_lock_irqsave(&txq_stats->lock, flags);
> >+	u64_stats_update_begin(&txq_stats->syncp);
> > 	txq_stats->tx_bytes += skb->len;
> > 	if (set_ic)
> > 		txq_stats->tx_set_ic_bit++;
> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
> >+	u64_stats_update_end(&txq_stats->syncp);
> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> > 
> > 	if (priv->sarc_type)
> > 		stmmac_set_desc_sarc(priv, first, priv->sarc_type);
> >@@ -4831,9 +4839,11 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
> > 		unsigned long flags;
> > 		tx_q->tx_count_frames = 0;
> > 		stmmac_set_tx_ic(priv, tx_desc);
> >-		flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
> >+		spin_lock_irqsave(&txq_stats->lock, flags);
> >+		u64_stats_update_begin(&txq_stats->syncp);
> > 		txq_stats->tx_set_ic_bit++;
> >-		u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
> >+		u64_stats_update_end(&txq_stats->syncp);
> >+		spin_unlock_irqrestore(&txq_stats->lock, flags);
> > 	}
> > 
> > 	stmmac_enable_dma_transmission(priv, priv->ioaddr);
> >@@ -5008,10 +5018,12 @@ static void stmmac_dispatch_skb_zc(struct stmmac_priv *priv, u32 queue,
> > 	skb_record_rx_queue(skb, queue);
> > 	napi_gro_receive(&ch->rxtx_napi, skb);
> > 
> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
> >+	u64_stats_update_begin(&rxq_stats->syncp);
> > 	rxq_stats->rx_pkt_n++;
> > 	rxq_stats->rx_bytes += len;
> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
> >+	u64_stats_update_end(&rxq_stats->syncp);
> >+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
> > }
> > 
> > static bool stmmac_rx_refill_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
> >@@ -5248,9 +5260,11 @@ static int stmmac_rx_zc(struct stmmac_priv *priv, int limit, u32 queue)
> > 
> > 	stmmac_finalize_xdp_rx(priv, xdp_status);
> > 
> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
> >+	u64_stats_update_begin(&rxq_stats->syncp);
> > 	rxq_stats->rx_pkt_n += count;
> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
> >+	u64_stats_update_end(&rxq_stats->syncp);
> >+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
> > 
> > 	priv->xstats.rx_dropped += rx_dropped;
> > 	priv->xstats.rx_errors += rx_errors;
> >@@ -5541,11 +5555,13 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> > 
> > 	stmmac_rx_refill(priv, queue);
> > 
> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
> >+	u64_stats_update_begin(&rxq_stats->syncp);
> > 	rxq_stats->rx_packets += rx_packets;
> > 	rxq_stats->rx_bytes += rx_bytes;
> > 	rxq_stats->rx_pkt_n += count;
> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
> >+	u64_stats_update_end(&rxq_stats->syncp);
> >+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
> > 
> > 	priv->xstats.rx_dropped += rx_dropped;
> > 	priv->xstats.rx_errors += rx_errors;
> >@@ -5564,9 +5580,11 @@ static int stmmac_napi_poll_rx(struct napi_struct *napi, int budget)
> > 	int work_done;
> > 
> > 	rxq_stats = &priv->xstats.rxq_stats[chan];
> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
> >+	u64_stats_update_begin(&rxq_stats->syncp);
> > 	rxq_stats->napi_poll++;
> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
> >+	u64_stats_update_end(&rxq_stats->syncp);
> >+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
> > 
> > 	work_done = stmmac_rx(priv, budget, chan);
> > 	if (work_done < budget && napi_complete_done(napi, work_done)) {
> >@@ -5592,9 +5610,11 @@ static int stmmac_napi_poll_tx(struct napi_struct *napi, int budget)
> > 	int work_done;
> > 
> > 	txq_stats = &priv->xstats.txq_stats[chan];
> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
> >+	spin_lock_irqsave(&txq_stats->lock, flags);
> >+	u64_stats_update_begin(&txq_stats->syncp);
> > 	txq_stats->napi_poll++;
> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
> >+	u64_stats_update_end(&txq_stats->syncp);
> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> > 
> > 	work_done = stmmac_tx_clean(priv, budget, chan, &pending_packets);
> > 	work_done = min(work_done, budget);
> >@@ -5627,14 +5647,18 @@ static int stmmac_napi_poll_rxtx(struct napi_struct *napi, int budget)
> > 	unsigned long flags;
> > 
> > 	rxq_stats = &priv->xstats.rxq_stats[chan];
> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
> >+	u64_stats_update_begin(&rxq_stats->syncp);
> > 	rxq_stats->napi_poll++;
> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
> >+	u64_stats_update_end(&rxq_stats->syncp);
> >+	spin_unlock(&rxq_stats->lock);  
> 
> Nitpick:
> I know that the original code does that, but any idea why
> u64_stats_update_end_irqrestore() is called here when
> u64_stats_update_begin_irqsave() is called 2 lines below?
> IIUC, this could be one critical section. Could you perhaps merge these
> while at it? Could be a follow-up patch.

I have merged the interrupt disable/enable, but there are two separate
spinlocks for rxq_stats (added to struct stmmac_txq_stats) and for
txq_stats (added to struct stmmac_rxq_stats), so they cannot be merged.

Alternatively, I could use the channel lock to protect stats updates,
but that could increase contention of that lock. I believe more
granularity is better, especially if it does not cost anything: There
is plenty of unused space in struct stmmac_txq_stats and struct
stmmac_rxq_stats (they are both cache-aligned).

Petr T

> 
> Rest of the patch looks fine to me.
> 
> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
> 
> 
> > 
> > 	txq_stats = &priv->xstats.txq_stats[chan];
> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
> >+	spin_lock(&txq_stats->lock);
> >+	u64_stats_update_begin(&txq_stats->syncp);
> > 	txq_stats->napi_poll++;
> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
> >+	u64_stats_update_end(&txq_stats->syncp);
> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
> > 
> > 	tx_done = stmmac_tx_clean(priv, budget, chan, &tx_pending_packets);
> > 	tx_done = min(tx_done, budget);
> >@@ -7371,10 +7395,14 @@ int stmmac_dvr_probe(struct device *device,
> > 	priv->device = device;
> > 	priv->dev = ndev;
> > 
> >-	for (i = 0; i < MTL_MAX_RX_QUEUES; i++)
> >+	for (i = 0; i < MTL_MAX_RX_QUEUES; i++) {
> > 		u64_stats_init(&priv->xstats.rxq_stats[i].syncp);
> >-	for (i = 0; i < MTL_MAX_TX_QUEUES; i++)
> >+		spin_lock_init(&priv->xstats.rxq_stats[i].lock);
> >+	}
> >+	for (i = 0; i < MTL_MAX_TX_QUEUES; i++) {
> > 		u64_stats_init(&priv->xstats.txq_stats[i].syncp);
> >+		spin_lock_init(&priv->xstats.txq_stats[i].lock);
> >+	}
> > 
> > 	stmmac_set_ethtool_ops(ndev);
> > 	priv->pause = pause;
> >-- 
> >2.43.0
> >
> >

Petr Tesařík Jan. 5, 2024, 10:34 a.m. UTC | #4

On Fri, 5 Jan 2024 10:58:42 +0100
Eric Dumazet <edumazet@google.com> wrote:

> On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@tesarici.cz> wrote:
> >
> > Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> >
> > As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> > u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> > be lost on 32-bit platforms, thus blocking readers forever.
> >
> > Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> > on one core raced with stmmac_napi_poll_tx() on another core.
> >
> > Signed-off-by: Petr Tesarik <petr@tesarici.cz>  
> 
> This is going to add more costs to 64bit platforms ?

Yes, it adds a (hopefully not too contended) spinlock and in most
places an interrupt disable/enable pair.

FWIW the race condition is also present on 64-bit platforms, resulting
in inaccurate statistic counters. I can understand if you consider it a
mild annoyance, not worth fixing.

> It seems to me that the same syncp can be used from two different
> threads : hard irq and napi poller...

Yes, that's exactly the scenario that locks up my system.

> At this point, I do not see why you keep linux/u64_stats_sync.h if you
> decide to go for a spinlock...

The spinlock does not havce to be taken on the reader side, so the
seqcounter still adds some value.

> Alternative would use atomic64_t fields for the ones where there is no
> mutual exclusion.
> 
> RX : napi poll is definitely safe (protected by an atomic bit)
> TX : each TX queue is also safe (protected by an atomic exclusion for
> non LLTX drivers)
> 
> This leaves the fields updated from hardware interrupt context ?

I'm afraid I don't have enough network-stack-foo to follow here.

My issue on 32 bit is that stmmac_xmit() may be called directly from
process context while another core runs the TX napi on the same channel
(in interrupt context). I didn't observe any race on the RX path, but I
believe it's possible with NAPI busy polling.

In any case, I don't see the connection with LLTX. Maybe you want to
say that the TX queue is safe for stmmac (because it is a non-LLTX
driver), but might not be safe for LLTX drivers?

Petr T

Eric Dumazet Jan. 5, 2024, 10:48 a.m. UTC | #5

On Fri, Jan 5, 2024 at 11:34 AM Petr Tesařík <petr@tesarici.cz> wrote:
>
> On Fri, 5 Jan 2024 10:58:42 +0100
> Eric Dumazet <edumazet@google.com> wrote:
>
> > On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@tesarici.cz> wrote:
> > >
> > > Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> > >
> > > As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> > > u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> > > be lost on 32-bit platforms, thus blocking readers forever.
> > >
> > > Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> > > on one core raced with stmmac_napi_poll_tx() on another core.
> > >
> > > Signed-off-by: Petr Tesarik <petr@tesarici.cz>
> >
> > This is going to add more costs to 64bit platforms ?
>
> Yes, it adds a (hopefully not too contended) spinlock and in most
> places an interrupt disable/enable pair.
>
> FWIW the race condition is also present on 64-bit platforms, resulting
> in inaccurate statistic counters. I can understand if you consider it a
> mild annoyance, not worth fixing.
>
> > It seems to me that the same syncp can be used from two different
> > threads : hard irq and napi poller...
>
> Yes, that's exactly the scenario that locks up my system.
>
> > At this point, I do not see why you keep linux/u64_stats_sync.h if you
> > decide to go for a spinlock...
>
> The spinlock does not havce to be taken on the reader side, so the
> seqcounter still adds some value.
>
> > Alternative would use atomic64_t fields for the ones where there is no
> > mutual exclusion.
> >
> > RX : napi poll is definitely safe (protected by an atomic bit)
> > TX : each TX queue is also safe (protected by an atomic exclusion for
> > non LLTX drivers)
> >
> > This leaves the fields updated from hardware interrupt context ?
>
> I'm afraid I don't have enough network-stack-foo to follow here.
>
> My issue on 32 bit is that stmmac_xmit() may be called directly from
> process context while another core runs the TX napi on the same channel
> (in interrupt context). I didn't observe any race on the RX path, but I
> believe it's possible with NAPI busy polling.
>
> In any case, I don't see the connection with LLTX. Maybe you want to
> say that the TX queue is safe for stmmac (because it is a non-LLTX
> driver), but might not be safe for LLTX drivers?

LLTX drivers (mostly virtual drivers like tunnels...) can have multiple cpus
running ndo_start_xmit() concurrently. So any use of a 'shared syncp'
would be a bug.
These drivers usually use per-cpu stats, to avoid races and false
sharing anyway.

I think you should split the structures into two separate groups, each
guarded with its own syncp.

No extra spinlocks, no extra costs on 64bit arches...

If TX completion can run in parallel with ndo_start_xmit(), then
clearly we have to split stmmac_txq_stats in two halves:

Also please note the conversion from u64 to u64_stats_t

Very partial patch, only to show the split and new structure :

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
b/drivers/net/ethernet/stmicro/stmmac/common.h
index e3f650e88f82f927f0dcf95748fbd10c14c30cbe..702bceea5dc8c875a80f5e3a92b7bb058f373eda
100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -60,16 +60,22 @@
 /* #define FRAME_FILTER_DEBUG */

 struct stmmac_txq_stats {
-       u64 tx_bytes;
-       u64 tx_packets;
-       u64 tx_pkt_n;
-       u64 tx_normal_irq_n;
-       u64 napi_poll;
-       u64 tx_clean;
-       u64 tx_set_ic_bit;
-       u64 tx_tso_frames;
-       u64 tx_tso_nfrags;
-       struct u64_stats_sync syncp;
+/* First part, updated from ndo_start_xmit(), protected by tx queue lock */
+       struct u64_stats_sync syncp_tx;
+       u64_stats_t tx_bytes;
+       u64_stats_t tx_packets;
+       u64_stats_t tx_pkt_n;
+       u64_stats_t tx_tso_frames;
+       u64_stats_t tx_tso_nfrags;
+
+/* Second part, updated from TX completion (protected by NAPI poll logic) */
+       struct u64_stats_sync syncp_tx_completion;
+       u64_stats_t napi_poll;
+       u64_stats_t tx_clean;
+       u64_stats_t tx_set_ic_bit;
+
+/* Following feld is updated from hard irq context... */
+       atomic64_t tx_normal_irq_n;
 } ____cacheline_aligned_in_smp;

 struct stmmac_rxq_stats {

Petr Tesařík Jan. 5, 2024, 11:14 a.m. UTC | #6

On Fri, 5 Jan 2024 11:48:19 +0100
Eric Dumazet <edumazet@google.com> wrote:

> On Fri, Jan 5, 2024 at 11:34 AM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > On Fri, 5 Jan 2024 10:58:42 +0100
> > Eric Dumazet <edumazet@google.com> wrote:
> >  
> > > On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@tesarici.cz> wrote:  
> > > >
> > > > Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> > > >
> > > > As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> > > > u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> > > > be lost on 32-bit platforms, thus blocking readers forever.
> > > >
> > > > Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> > > > on one core raced with stmmac_napi_poll_tx() on another core.
> > > >
> > > > Signed-off-by: Petr Tesarik <petr@tesarici.cz>  
> > >
> > > This is going to add more costs to 64bit platforms ?  
> >
> > Yes, it adds a (hopefully not too contended) spinlock and in most
> > places an interrupt disable/enable pair.
> >
> > FWIW the race condition is also present on 64-bit platforms, resulting
> > in inaccurate statistic counters. I can understand if you consider it a
> > mild annoyance, not worth fixing.
> >  
> > > It seems to me that the same syncp can be used from two different
> > > threads : hard irq and napi poller...  
> >
> > Yes, that's exactly the scenario that locks up my system.
> >  
> > > At this point, I do not see why you keep linux/u64_stats_sync.h if you
> > > decide to go for a spinlock...  
> >
> > The spinlock does not havce to be taken on the reader side, so the
> > seqcounter still adds some value.
> >  
> > > Alternative would use atomic64_t fields for the ones where there is no
> > > mutual exclusion.
> > >
> > > RX : napi poll is definitely safe (protected by an atomic bit)
> > > TX : each TX queue is also safe (protected by an atomic exclusion for
> > > non LLTX drivers)
> > >
> > > This leaves the fields updated from hardware interrupt context ?  
> >
> > I'm afraid I don't have enough network-stack-foo to follow here.
> >
> > My issue on 32 bit is that stmmac_xmit() may be called directly from
> > process context while another core runs the TX napi on the same channel
> > (in interrupt context). I didn't observe any race on the RX path, but I
> > believe it's possible with NAPI busy polling.
> >
> > In any case, I don't see the connection with LLTX. Maybe you want to
> > say that the TX queue is safe for stmmac (because it is a non-LLTX
> > driver), but might not be safe for LLTX drivers?  
> 
> LLTX drivers (mostly virtual drivers like tunnels...) can have multiple cpus
> running ndo_start_xmit() concurrently. So any use of a 'shared syncp'
> would be a bug.
> These drivers usually use per-cpu stats, to avoid races and false
> sharing anyway.
> 
> I think you should split the structures into two separate groups, each
> guarded with its own syncp.
> 
> No extra spinlocks, no extra costs on 64bit arches...
> 
> If TX completion can run in parallel with ndo_start_xmit(), then
> clearly we have to split stmmac_txq_stats in two halves:

Oh, now I get it. Yes, that's much better, indeed.

I mean, the counters have never been consistent (due to the race on the
writer side), and nobody is concerned. So, there is no value in taking
a consistent snapshot in stmmac_get_ethtool_stats().

I'm going to rework and retest my patch. Thank you for pointing me in
the right direction!

Petr T

> Also please note the conversion from u64 to u64_stats_t

Noted. IIUC this will in turn close the update race on 64-bit by using
an atomic type and on 32-bit by using a seqlock. Clever.

Petr T

> Very partial patch, only to show the split and new structure :
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
> b/drivers/net/ethernet/stmicro/stmmac/common.h
> index e3f650e88f82f927f0dcf95748fbd10c14c30cbe..702bceea5dc8c875a80f5e3a92b7bb058f373eda
> 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> @@ -60,16 +60,22 @@
>  /* #define FRAME_FILTER_DEBUG */
> 
>  struct stmmac_txq_stats {
> -       u64 tx_bytes;
> -       u64 tx_packets;
> -       u64 tx_pkt_n;
> -       u64 tx_normal_irq_n;
> -       u64 napi_poll;
> -       u64 tx_clean;
> -       u64 tx_set_ic_bit;
> -       u64 tx_tso_frames;
> -       u64 tx_tso_nfrags;
> -       struct u64_stats_sync syncp;
> +/* First part, updated from ndo_start_xmit(), protected by tx queue lock */
> +       struct u64_stats_sync syncp_tx;
> +       u64_stats_t tx_bytes;
> +       u64_stats_t tx_packets;
> +       u64_stats_t tx_pkt_n;
> +       u64_stats_t tx_tso_frames;
> +       u64_stats_t tx_tso_nfrags;
> +
> +/* Second part, updated from TX completion (protected by NAPI poll logic) */
> +       struct u64_stats_sync syncp_tx_completion;
> +       u64_stats_t napi_poll;
> +       u64_stats_t tx_clean;
> +       u64_stats_t tx_set_ic_bit;
> +
> +/* Following feld is updated from hard irq context... */
> +       atomic64_t tx_normal_irq_n;
>  } ____cacheline_aligned_in_smp;
> 
>  struct stmmac_rxq_stats {

Jiri Pirko Jan. 5, 2024, 11:25 a.m. UTC | #7

Fri, Jan 05, 2024 at 11:25:38AM CET, petr@tesarici.cz wrote:
>On Fri, 5 Jan 2024 10:45:55 +0100
>Jiri Pirko <jiri@resnulli.us> wrote:
>
>> Fri, Jan 05, 2024 at 10:15:56AM CET, petr@tesarici.cz wrote:
>> >Add a spinlock to fix race conditions while updating Tx/Rx statistics.
>> >
>> >As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
>> >u64_stats_sync must ensure mutual exclusion, or one seqcount update could
>> >be lost on 32-bit platforms, thus blocking readers forever.
>> >
>> >Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
>> >on one core raced with stmmac_napi_poll_tx() on another core.
>> >
>> >Signed-off-by: Petr Tesarik <petr@tesarici.cz>
>> >---
>> > drivers/net/ethernet/stmicro/stmmac/common.h  |  2 +
>> > .../net/ethernet/stmicro/stmmac/dwmac-sun8i.c |  4 +
>> > .../net/ethernet/stmicro/stmmac/dwmac4_lib.c  |  4 +
>> > .../net/ethernet/stmicro/stmmac/dwmac_lib.c   |  4 +
>> > .../ethernet/stmicro/stmmac/dwxgmac2_dma.c    |  4 +
>> > .../net/ethernet/stmicro/stmmac/stmmac_main.c | 80 +++++++++++++------
>> > 6 files changed, 72 insertions(+), 26 deletions(-)
>> >
>> >diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
>> >index e3f650e88f82..9a17dfc1055d 100644
>> >--- a/drivers/net/ethernet/stmicro/stmmac/common.h
>> >+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
>> >@@ -70,6 +70,7 @@ struct stmmac_txq_stats {
>> > 	u64 tx_tso_frames;
>> > 	u64 tx_tso_nfrags;
>> > 	struct u64_stats_sync syncp;
>> >+	spinlock_t lock;	/* mutual writer exclusion */
>> > } ____cacheline_aligned_in_smp;
>> > 
>> > struct stmmac_rxq_stats {
>> >@@ -79,6 +80,7 @@ struct stmmac_rxq_stats {
>> > 	u64 rx_normal_irq_n;
>> > 	u64 napi_poll;
>> > 	struct u64_stats_sync syncp;
>> >+	spinlock_t lock;	/* mutual writer exclusion */
>> > } ____cacheline_aligned_in_smp;
>> > 
>> > /* Extra statistic and debug information exposed by ethtool */
>> >diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> >index 137741b94122..9c568996321d 100644
>> >--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> >+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> >@@ -455,9 +455,11 @@ static int sun8i_dwmac_dma_interrupt(struct stmmac_priv *priv,
>> > 
>> > 	if (v & EMAC_TX_INT) {
>> > 		ret |= handle_tx;
>> >+		spin_lock(&txq_stats->lock);
>> > 		u64_stats_update_begin(&txq_stats->syncp);
>> > 		txq_stats->tx_normal_irq_n++;
>> > 		u64_stats_update_end(&txq_stats->syncp);
>> >+		spin_unlock(&txq_stats->lock);
>> > 	}
>> > 
>> > 	if (v & EMAC_TX_DMA_STOP_INT)
>> >@@ -479,9 +481,11 @@ static int sun8i_dwmac_dma_interrupt(struct stmmac_priv *priv,
>> > 
>> > 	if (v & EMAC_RX_INT) {
>> > 		ret |= handle_rx;
>> >+		spin_lock(&rxq_stats->lock);
>> > 		u64_stats_update_begin(&rxq_stats->syncp);
>> > 		rxq_stats->rx_normal_irq_n++;
>> > 		u64_stats_update_end(&rxq_stats->syncp);
>> >+		spin_unlock(&rxq_stats->lock);
>> > 	}
>> > 
>> > 	if (v & EMAC_RX_BUF_UA_INT)
>> >diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
>> >index 9470d3fd2ded..e50e8b07724b 100644
>> >--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
>> >+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_lib.c
>> >@@ -201,15 +201,19 @@ int dwmac4_dma_interrupt(struct stmmac_priv *priv, void __iomem *ioaddr,
>> > 	}
>> > 	/* TX/RX NORMAL interrupts */
>> > 	if (likely(intr_status & DMA_CHAN_STATUS_RI)) {
>> >+		spin_lock(&rxq_stats->lock);
>> > 		u64_stats_update_begin(&rxq_stats->syncp);
>> > 		rxq_stats->rx_normal_irq_n++;
>> > 		u64_stats_update_end(&rxq_stats->syncp);
>> >+		spin_unlock(&rxq_stats->lock);
>> > 		ret |= handle_rx;
>> > 	}
>> > 	if (likely(intr_status & DMA_CHAN_STATUS_TI)) {
>> >+		spin_lock(&txq_stats->lock);
>> > 		u64_stats_update_begin(&txq_stats->syncp);
>> > 		txq_stats->tx_normal_irq_n++;
>> > 		u64_stats_update_end(&txq_stats->syncp);
>> >+		spin_unlock(&txq_stats->lock);
>> > 		ret |= handle_tx;
>> > 	}
>> > 
>> >diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
>> >index 7907d62d3437..a43396a7f852 100644
>> >--- a/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
>> >+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c
>> >@@ -215,16 +215,20 @@ int dwmac_dma_interrupt(struct stmmac_priv *priv, void __iomem *ioaddr,
>> > 			u32 value = readl(ioaddr + DMA_INTR_ENA);
>> > 			/* to schedule NAPI on real RIE event. */
>> > 			if (likely(value & DMA_INTR_ENA_RIE)) {
>> >+				spin_lock(&rxq_stats->lock);
>> > 				u64_stats_update_begin(&rxq_stats->syncp);
>> > 				rxq_stats->rx_normal_irq_n++;
>> > 				u64_stats_update_end(&rxq_stats->syncp);
>> >+				spin_unlock(&rxq_stats->lock);
>> > 				ret |= handle_rx;
>> > 			}
>> > 		}
>> > 		if (likely(intr_status & DMA_STATUS_TI)) {
>> >+			spin_lock(&txq_stats->lock);
>> > 			u64_stats_update_begin(&txq_stats->syncp);
>> > 			txq_stats->tx_normal_irq_n++;
>> > 			u64_stats_update_end(&txq_stats->syncp);
>> >+			spin_unlock(&txq_stats->lock);
>> > 			ret |= handle_tx;
>> > 		}
>> > 		if (unlikely(intr_status & DMA_STATUS_ERI))
>> >diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
>> >index 3cde695fec91..f4e01436d4cc 100644
>> >--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
>> >+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
>> >@@ -367,15 +367,19 @@ static int dwxgmac2_dma_interrupt(struct stmmac_priv *priv,
>> > 	/* TX/RX NORMAL interrupts */
>> > 	if (likely(intr_status & XGMAC_NIS)) {
>> > 		if (likely(intr_status & XGMAC_RI)) {
>> >+			spin_lock(&rxq_stats->lock);
>> > 			u64_stats_update_begin(&rxq_stats->syncp);
>> > 			rxq_stats->rx_normal_irq_n++;
>> > 			u64_stats_update_end(&rxq_stats->syncp);
>> >+			spin_unlock(&rxq_stats->lock);
>> > 			ret |= handle_rx;
>> > 		}
>> > 		if (likely(intr_status & (XGMAC_TI | XGMAC_TBU))) {
>> >+			spin_lock(&txq_stats->lock);
>> > 			u64_stats_update_begin(&txq_stats->syncp);
>> > 			txq_stats->tx_normal_irq_n++;
>> > 			u64_stats_update_end(&txq_stats->syncp);
>> >+			spin_unlock(&txq_stats->lock);
>> > 			ret |= handle_tx;
>> > 		}
>> > 	}
>> >diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> >index 37e64283f910..82d8db04d0d1 100644
>> >--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> >+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> >@@ -2515,9 +2515,11 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
>> > 		tx_q->cur_tx = STMMAC_GET_ENTRY(tx_q->cur_tx, priv->dma_conf.dma_tx_size);
>> > 		entry = tx_q->cur_tx;
>> > 	}
>> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>> >+	spin_lock_irqsave(&txq_stats->lock, flags);
>> >+	u64_stats_update_begin(&txq_stats->syncp);
>> > 	txq_stats->tx_set_ic_bit += tx_set_ic_bit;
>> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>> >+	u64_stats_update_end(&txq_stats->syncp);
>> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
>> > 
>> > 	if (tx_desc) {
>> > 		stmmac_flush_tx_descriptors(priv, queue);
>> >@@ -2721,11 +2723,13 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue,
>> > 	if (tx_q->dirty_tx != tx_q->cur_tx)
>> > 		*pending_packets = true;
>> > 
>> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>> >+	spin_lock_irqsave(&txq_stats->lock, flags);
>> >+	u64_stats_update_begin(&txq_stats->syncp);
>> > 	txq_stats->tx_packets += tx_packets;
>> > 	txq_stats->tx_pkt_n += tx_packets;
>> > 	txq_stats->tx_clean++;
>> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>> >+	u64_stats_update_end(&txq_stats->syncp);
>> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
>> > 
>> > 	priv->xstats.tx_errors += tx_errors;
>> > 
>> >@@ -4311,13 +4315,15 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
>> > 		netif_tx_stop_queue(netdev_get_tx_queue(priv->dev, queue));
>> > 	}
>> > 
>> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>> >+	spin_lock_irqsave(&txq_stats->lock, flags);
>> >+	u64_stats_update_begin(&txq_stats->syncp);
>> > 	txq_stats->tx_bytes += skb->len;
>> > 	txq_stats->tx_tso_frames++;
>> > 	txq_stats->tx_tso_nfrags += nfrags;
>> > 	if (set_ic)
>> > 		txq_stats->tx_set_ic_bit++;
>> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>> >+	u64_stats_update_end(&txq_stats->syncp);
>> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
>> > 
>> > 	if (priv->sarc_type)
>> > 		stmmac_set_desc_sarc(priv, first, priv->sarc_type);
>> >@@ -4560,11 +4566,13 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
>> > 		netif_tx_stop_queue(netdev_get_tx_queue(priv->dev, queue));
>> > 	}
>> > 
>> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>> >+	spin_lock_irqsave(&txq_stats->lock, flags);
>> >+	u64_stats_update_begin(&txq_stats->syncp);
>> > 	txq_stats->tx_bytes += skb->len;
>> > 	if (set_ic)
>> > 		txq_stats->tx_set_ic_bit++;
>> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>> >+	u64_stats_update_end(&txq_stats->syncp);
>> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
>> > 
>> > 	if (priv->sarc_type)
>> > 		stmmac_set_desc_sarc(priv, first, priv->sarc_type);
>> >@@ -4831,9 +4839,11 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
>> > 		unsigned long flags;
>> > 		tx_q->tx_count_frames = 0;
>> > 		stmmac_set_tx_ic(priv, tx_desc);
>> >-		flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>> >+		spin_lock_irqsave(&txq_stats->lock, flags);
>> >+		u64_stats_update_begin(&txq_stats->syncp);
>> > 		txq_stats->tx_set_ic_bit++;
>> >-		u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>> >+		u64_stats_update_end(&txq_stats->syncp);
>> >+		spin_unlock_irqrestore(&txq_stats->lock, flags);
>> > 	}
>> > 
>> > 	stmmac_enable_dma_transmission(priv, priv->ioaddr);
>> >@@ -5008,10 +5018,12 @@ static void stmmac_dispatch_skb_zc(struct stmmac_priv *priv, u32 queue,
>> > 	skb_record_rx_queue(skb, queue);
>> > 	napi_gro_receive(&ch->rxtx_napi, skb);
>> > 
>> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
>> >+	u64_stats_update_begin(&rxq_stats->syncp);
>> > 	rxq_stats->rx_pkt_n++;
>> > 	rxq_stats->rx_bytes += len;
>> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>> >+	u64_stats_update_end(&rxq_stats->syncp);
>> >+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
>> > }
>> > 
>> > static bool stmmac_rx_refill_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
>> >@@ -5248,9 +5260,11 @@ static int stmmac_rx_zc(struct stmmac_priv *priv, int limit, u32 queue)
>> > 
>> > 	stmmac_finalize_xdp_rx(priv, xdp_status);
>> > 
>> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
>> >+	u64_stats_update_begin(&rxq_stats->syncp);
>> > 	rxq_stats->rx_pkt_n += count;
>> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>> >+	u64_stats_update_end(&rxq_stats->syncp);
>> >+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
>> > 
>> > 	priv->xstats.rx_dropped += rx_dropped;
>> > 	priv->xstats.rx_errors += rx_errors;
>> >@@ -5541,11 +5555,13 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>> > 
>> > 	stmmac_rx_refill(priv, queue);
>> > 
>> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
>> >+	u64_stats_update_begin(&rxq_stats->syncp);
>> > 	rxq_stats->rx_packets += rx_packets;
>> > 	rxq_stats->rx_bytes += rx_bytes;
>> > 	rxq_stats->rx_pkt_n += count;
>> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>> >+	u64_stats_update_end(&rxq_stats->syncp);
>> >+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
>> > 
>> > 	priv->xstats.rx_dropped += rx_dropped;
>> > 	priv->xstats.rx_errors += rx_errors;
>> >@@ -5564,9 +5580,11 @@ static int stmmac_napi_poll_rx(struct napi_struct *napi, int budget)
>> > 	int work_done;
>> > 
>> > 	rxq_stats = &priv->xstats.rxq_stats[chan];
>> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
>> >+	u64_stats_update_begin(&rxq_stats->syncp);
>> > 	rxq_stats->napi_poll++;
>> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>> >+	u64_stats_update_end(&rxq_stats->syncp);
>> >+	spin_unlock_irqrestore(&rxq_stats->lock, flags);
>> > 
>> > 	work_done = stmmac_rx(priv, budget, chan);
>> > 	if (work_done < budget && napi_complete_done(napi, work_done)) {
>> >@@ -5592,9 +5610,11 @@ static int stmmac_napi_poll_tx(struct napi_struct *napi, int budget)
>> > 	int work_done;
>> > 
>> > 	txq_stats = &priv->xstats.txq_stats[chan];
>> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>> >+	spin_lock_irqsave(&txq_stats->lock, flags);
>> >+	u64_stats_update_begin(&txq_stats->syncp);
>> > 	txq_stats->napi_poll++;
>> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>> >+	u64_stats_update_end(&txq_stats->syncp);
>> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
>> > 
>> > 	work_done = stmmac_tx_clean(priv, budget, chan, &pending_packets);
>> > 	work_done = min(work_done, budget);
>> >@@ -5627,14 +5647,18 @@ static int stmmac_napi_poll_rxtx(struct napi_struct *napi, int budget)
>> > 	unsigned long flags;
>> > 
>> > 	rxq_stats = &priv->xstats.rxq_stats[chan];
>> >-	flags = u64_stats_update_begin_irqsave(&rxq_stats->syncp);
>> >+	spin_lock_irqsave(&rxq_stats->lock, flags);
>> >+	u64_stats_update_begin(&rxq_stats->syncp);
>> > 	rxq_stats->napi_poll++;
>> >-	u64_stats_update_end_irqrestore(&rxq_stats->syncp, flags);
>> >+	u64_stats_update_end(&rxq_stats->syncp);
>> >+	spin_unlock(&rxq_stats->lock);  
>> 
>> Nitpick:
>> I know that the original code does that, but any idea why
>> u64_stats_update_end_irqrestore() is called here when
>> u64_stats_update_begin_irqsave() is called 2 lines below?
>> IIUC, this could be one critical section. Could you perhaps merge these
>> while at it? Could be a follow-up patch.
>
>I have merged the interrupt disable/enable, but there are two separate
>spinlocks for rxq_stats (added to struct stmmac_txq_stats) and for
>txq_stats (added to struct stmmac_rxq_stats), so they cannot be merged.

Ah, that's what I overlooked. Thanks.

>
>Alternatively, I could use the channel lock to protect stats updates,
>but that could increase contention of that lock. I believe more
>granularity is better, especially if it does not cost anything: There
>is plenty of unused space in struct stmmac_txq_stats and struct
>stmmac_rxq_stats (they are both cache-aligned).
>
>Petr T
>
>> 
>> Rest of the patch looks fine to me.
>> 
>> Reviewed-by: Jiri Pirko <jiri@nvidia.com>
>> 
>> 
>> > 
>> > 	txq_stats = &priv->xstats.txq_stats[chan];
>> >-	flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
>> >+	spin_lock(&txq_stats->lock);
>> >+	u64_stats_update_begin(&txq_stats->syncp);
>> > 	txq_stats->napi_poll++;
>> >-	u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
>> >+	u64_stats_update_end(&txq_stats->syncp);
>> >+	spin_unlock_irqrestore(&txq_stats->lock, flags);
>> > 
>> > 	tx_done = stmmac_tx_clean(priv, budget, chan, &tx_pending_packets);
>> > 	tx_done = min(tx_done, budget);
>> >@@ -7371,10 +7395,14 @@ int stmmac_dvr_probe(struct device *device,
>> > 	priv->device = device;
>> > 	priv->dev = ndev;
>> > 
>> >-	for (i = 0; i < MTL_MAX_RX_QUEUES; i++)
>> >+	for (i = 0; i < MTL_MAX_RX_QUEUES; i++) {
>> > 		u64_stats_init(&priv->xstats.rxq_stats[i].syncp);
>> >-	for (i = 0; i < MTL_MAX_TX_QUEUES; i++)
>> >+		spin_lock_init(&priv->xstats.rxq_stats[i].lock);
>> >+	}
>> >+	for (i = 0; i < MTL_MAX_TX_QUEUES; i++) {
>> > 		u64_stats_init(&priv->xstats.txq_stats[i].syncp);
>> >+		spin_lock_init(&priv->xstats.txq_stats[i].lock);
>> >+	}
>> > 
>> > 	stmmac_set_ethtool_ops(ndev);
>> > 	priv->pause = pause;
>> >-- 
>> >2.43.0
>> >
>> >  
>

Petr Tesařík Jan. 5, 2024, 1:27 p.m. UTC | #8

Hi Eric,

yeah, it's me again...

On Fri, 5 Jan 2024 12:14:47 +0100
Petr Tesařík <petr@tesarici.cz> wrote:

> On Fri, 5 Jan 2024 11:48:19 +0100
> Eric Dumazet <edumazet@google.com> wrote:
> 
> > On Fri, Jan 5, 2024 at 11:34 AM Petr Tesařík <petr@tesarici.cz> wrote:  
> > >
> > > On Fri, 5 Jan 2024 10:58:42 +0100
> > > Eric Dumazet <edumazet@google.com> wrote:
> > >    
> > > > On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@tesarici.cz> wrote:    
> > > > >
> > > > > Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> > > > >
> > > > > As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> > > > > u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> > > > > be lost on 32-bit platforms, thus blocking readers forever.
> > > > >
> > > > > Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> > > > > on one core raced with stmmac_napi_poll_tx() on another core.
> > > > >
> > > > > Signed-off-by: Petr Tesarik <petr@tesarici.cz>    
> > > >
> > > > This is going to add more costs to 64bit platforms ?    
> > >
> > > Yes, it adds a (hopefully not too contended) spinlock and in most
> > > places an interrupt disable/enable pair.
> > >
> > > FWIW the race condition is also present on 64-bit platforms, resulting
> > > in inaccurate statistic counters. I can understand if you consider it a
> > > mild annoyance, not worth fixing.
> > >    
> > > > It seems to me that the same syncp can be used from two different
> > > > threads : hard irq and napi poller...    
> > >
> > > Yes, that's exactly the scenario that locks up my system.
> > >    
> > > > At this point, I do not see why you keep linux/u64_stats_sync.h if you
> > > > decide to go for a spinlock...    
> > >
> > > The spinlock does not havce to be taken on the reader side, so the
> > > seqcounter still adds some value.
> > >    
> > > > Alternative would use atomic64_t fields for the ones where there is no
> > > > mutual exclusion.
> > > >
> > > > RX : napi poll is definitely safe (protected by an atomic bit)
> > > > TX : each TX queue is also safe (protected by an atomic exclusion for
> > > > non LLTX drivers)
> > > >
> > > > This leaves the fields updated from hardware interrupt context ?    
> > >
> > > I'm afraid I don't have enough network-stack-foo to follow here.
> > >
> > > My issue on 32 bit is that stmmac_xmit() may be called directly from
> > > process context while another core runs the TX napi on the same channel
> > > (in interrupt context). I didn't observe any race on the RX path, but I
> > > believe it's possible with NAPI busy polling.
> > >
> > > In any case, I don't see the connection with LLTX. Maybe you want to
> > > say that the TX queue is safe for stmmac (because it is a non-LLTX
> > > driver), but might not be safe for LLTX drivers?    
> > 
> > LLTX drivers (mostly virtual drivers like tunnels...) can have multiple cpus
> > running ndo_start_xmit() concurrently. So any use of a 'shared syncp'
> > would be a bug.
> > These drivers usually use per-cpu stats, to avoid races and false
> > sharing anyway.
> > 
> > I think you should split the structures into two separate groups, each
> > guarded with its own syncp.
> > 
> > No extra spinlocks, no extra costs on 64bit arches...
> > 
> > If TX completion can run in parallel with ndo_start_xmit(), then
> > clearly we have to split stmmac_txq_stats in two halves:  
> 
> Oh, now I get it. Yes, that's much better, indeed.
> 
> I mean, the counters have never been consistent (due to the race on the
> writer side), and nobody is concerned. So, there is no value in taking
> a consistent snapshot in stmmac_get_ethtool_stats().
> 
> I'm going to rework and retest my patch. Thank you for pointing me in
> the right direction!
> 
> Petr T
> 
> > Also please note the conversion from u64 to u64_stats_t  
> 
> Noted. IIUC this will in turn close the update race on 64-bit by using
> an atomic type and on 32-bit by using a seqlock. Clever.
> 
> Petr T
> 
> > Very partial patch, only to show the split and new structure :
> > 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
> > b/drivers/net/ethernet/stmicro/stmmac/common.h
> > index e3f650e88f82f927f0dcf95748fbd10c14c30cbe..702bceea5dc8c875a80f5e3a92b7bb058f373eda
> > 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> > +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> > @@ -60,16 +60,22 @@
> >  /* #define FRAME_FILTER_DEBUG */
> > 
> >  struct stmmac_txq_stats {
> > -       u64 tx_bytes;
> > -       u64 tx_packets;
> > -       u64 tx_pkt_n;
> > -       u64 tx_normal_irq_n;
> > -       u64 napi_poll;
> > -       u64 tx_clean;
> > -       u64 tx_set_ic_bit;
> > -       u64 tx_tso_frames;
> > -       u64 tx_tso_nfrags;
> > -       struct u64_stats_sync syncp;
> > +/* First part, updated from ndo_start_xmit(), protected by tx queue lock */
> > +       struct u64_stats_sync syncp_tx;
> > +       u64_stats_t tx_bytes;
> > +       u64_stats_t tx_packets;
> > +       u64_stats_t tx_pkt_n;
> > +       u64_stats_t tx_tso_frames;
> > +       u64_stats_t tx_tso_nfrags;
> > +
> > +/* Second part, updated from TX completion (protected by NAPI poll logic) */
> > +       struct u64_stats_sync syncp_tx_completion;
> > +       u64_stats_t napi_poll;
> > +       u64_stats_t tx_clean;
> > +       u64_stats_t tx_set_ic_bit;

Unfortunately, this field is also updated from ndo_start_xmit():

4572)     if (set_ic)
4573)             txq_stats->tx_set_ic_bit++;

I feel it would be a shame to introduce a spinlock just for this one
update. But I think the field could be converted to an atomic64_t.

Which raises a question: Why aren't all stat counters simply atomic64_t? There
is no guarantee that the reader side takes a consistent snapshot
(except on 32-bit). So, why do we even bother with u64_stats_sync?

Is it merely because u64_stats_add() should be cheaper than
atomic64_add()? Or is there anything else I'm missing? If yes, does it
invalidate my proposal to convert tx_set_ic_bit to an atomic64_t?

Petr T

Eric Dumazet Jan. 5, 2024, 2:26 p.m. UTC | #9

On Fri, Jan 5, 2024 at 2:27 PM Petr Tesařík <petr@tesarici.cz> wrote:
>
> Hi Eric,
>
> yeah, it's me again...
>
> On Fri, 5 Jan 2024 12:14:47 +0100
> Petr Tesařík <petr@tesarici.cz> wrote:
>
> > On Fri, 5 Jan 2024 11:48:19 +0100
> > Eric Dumazet <edumazet@google.com> wrote:
> >
> > > On Fri, Jan 5, 2024 at 11:34 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > >
> > > > On Fri, 5 Jan 2024 10:58:42 +0100
> > > > Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > > On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@tesarici.cz> wrote:
> > > > > >
> > > > > > Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> > > > > >
> > > > > > As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> > > > > > u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> > > > > > be lost on 32-bit platforms, thus blocking readers forever.
> > > > > >
> > > > > > Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> > > > > > on one core raced with stmmac_napi_poll_tx() on another core.
> > > > > >
> > > > > > Signed-off-by: Petr Tesarik <petr@tesarici.cz>
> > > > >
> > > > > This is going to add more costs to 64bit platforms ?
> > > >
> > > > Yes, it adds a (hopefully not too contended) spinlock and in most
> > > > places an interrupt disable/enable pair.
> > > >
> > > > FWIW the race condition is also present on 64-bit platforms, resulting
> > > > in inaccurate statistic counters. I can understand if you consider it a
> > > > mild annoyance, not worth fixing.
> > > >
> > > > > It seems to me that the same syncp can be used from two different
> > > > > threads : hard irq and napi poller...
> > > >
> > > > Yes, that's exactly the scenario that locks up my system.
> > > >
> > > > > At this point, I do not see why you keep linux/u64_stats_sync.h if you
> > > > > decide to go for a spinlock...
> > > >
> > > > The spinlock does not havce to be taken on the reader side, so the
> > > > seqcounter still adds some value.
> > > >
> > > > > Alternative would use atomic64_t fields for the ones where there is no
> > > > > mutual exclusion.
> > > > >
> > > > > RX : napi poll is definitely safe (protected by an atomic bit)
> > > > > TX : each TX queue is also safe (protected by an atomic exclusion for
> > > > > non LLTX drivers)
> > > > >
> > > > > This leaves the fields updated from hardware interrupt context ?
> > > >
> > > > I'm afraid I don't have enough network-stack-foo to follow here.
> > > >
> > > > My issue on 32 bit is that stmmac_xmit() may be called directly from
> > > > process context while another core runs the TX napi on the same channel
> > > > (in interrupt context). I didn't observe any race on the RX path, but I
> > > > believe it's possible with NAPI busy polling.
> > > >
> > > > In any case, I don't see the connection with LLTX. Maybe you want to
> > > > say that the TX queue is safe for stmmac (because it is a non-LLTX
> > > > driver), but might not be safe for LLTX drivers?
> > >
> > > LLTX drivers (mostly virtual drivers like tunnels...) can have multiple cpus
> > > running ndo_start_xmit() concurrently. So any use of a 'shared syncp'
> > > would be a bug.
> > > These drivers usually use per-cpu stats, to avoid races and false
> > > sharing anyway.
> > >
> > > I think you should split the structures into two separate groups, each
> > > guarded with its own syncp.
> > >
> > > No extra spinlocks, no extra costs on 64bit arches...
> > >
> > > If TX completion can run in parallel with ndo_start_xmit(), then
> > > clearly we have to split stmmac_txq_stats in two halves:
> >
> > Oh, now I get it. Yes, that's much better, indeed.
> >
> > I mean, the counters have never been consistent (due to the race on the
> > writer side), and nobody is concerned. So, there is no value in taking
> > a consistent snapshot in stmmac_get_ethtool_stats().
> >
> > I'm going to rework and retest my patch. Thank you for pointing me in
> > the right direction!
> >
> > Petr T
> >
> > > Also please note the conversion from u64 to u64_stats_t
> >
> > Noted. IIUC this will in turn close the update race on 64-bit by using
> > an atomic type and on 32-bit by using a seqlock. Clever.
> >
> > Petr T
> >
> > > Very partial patch, only to show the split and new structure :
> > >
> > > diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
> > > b/drivers/net/ethernet/stmicro/stmmac/common.h
> > > index e3f650e88f82f927f0dcf95748fbd10c14c30cbe..702bceea5dc8c875a80f5e3a92b7bb058f373eda
> > > 100644
> > > --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> > > +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> > > @@ -60,16 +60,22 @@
> > >  /* #define FRAME_FILTER_DEBUG */
> > >
> > >  struct stmmac_txq_stats {
> > > -       u64 tx_bytes;
> > > -       u64 tx_packets;
> > > -       u64 tx_pkt_n;
> > > -       u64 tx_normal_irq_n;
> > > -       u64 napi_poll;
> > > -       u64 tx_clean;
> > > -       u64 tx_set_ic_bit;
> > > -       u64 tx_tso_frames;
> > > -       u64 tx_tso_nfrags;
> > > -       struct u64_stats_sync syncp;
> > > +/* First part, updated from ndo_start_xmit(), protected by tx queue lock */
> > > +       struct u64_stats_sync syncp_tx;
> > > +       u64_stats_t tx_bytes;
> > > +       u64_stats_t tx_packets;
> > > +       u64_stats_t tx_pkt_n;
> > > +       u64_stats_t tx_tso_frames;
> > > +       u64_stats_t tx_tso_nfrags;
> > > +
> > > +/* Second part, updated from TX completion (protected by NAPI poll logic) */
> > > +       struct u64_stats_sync syncp_tx_completion;
> > > +       u64_stats_t napi_poll;
> > > +       u64_stats_t tx_clean;
> > > +       u64_stats_t tx_set_ic_bit;
>
> Unfortunately, this field is also updated from ndo_start_xmit():
>
> 4572)     if (set_ic)
> 4573)             txq_stats->tx_set_ic_bit++;
>
> I feel it would be a shame to introduce a spinlock just for this one
> update. But I think the field could be converted to an atomic64_t.
>
> Which raises a question: Why aren't all stat counters simply atomic64_t? There
> is no guarantee that the reader side takes a consistent snapshot
> (except on 32-bit). So, why do we even bother with u64_stats_sync?

This infra was added to have no _lock_ overhead on 64bit arches.

If a counter must be updated from multiple cpus (regardless of 32/64
bit kernel), then use atomic64_t

>
> Is it merely because u64_stats_add() should be cheaper than
> atomic64_add()? Or is there anything else I'm missing? If yes, does it
> invalidate my proposal to convert tx_set_ic_bit to an atomic64_t?

atomic64_add() is expensive, especially in contexts where updates are
already guarded by a spinlock or something.

Eric Dumazet Jan. 5, 2024, 2:28 p.m. UTC | #10

On Fri, Jan 5, 2024 at 3:26 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jan 5, 2024 at 2:27 PM Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > Hi Eric,
> >
> > yeah, it's me again...
> >
> > On Fri, 5 Jan 2024 12:14:47 +0100
> > Petr Tesařík <petr@tesarici.cz> wrote:
> >
> > > On Fri, 5 Jan 2024 11:48:19 +0100
> > > Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > > On Fri, Jan 5, 2024 at 11:34 AM Petr Tesařík <petr@tesarici.cz> wrote:
> > > > >
> > > > > On Fri, 5 Jan 2024 10:58:42 +0100
> > > > > Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > > On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@tesarici.cz> wrote:
> > > > > > >
> > > > > > > Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> > > > > > >
> > > > > > > As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> > > > > > > u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> > > > > > > be lost on 32-bit platforms, thus blocking readers forever.
> > > > > > >
> > > > > > > Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> > > > > > > on one core raced with stmmac_napi_poll_tx() on another core.
> > > > > > >
> > > > > > > Signed-off-by: Petr Tesarik <petr@tesarici.cz>
> > > > > >
> > > > > > This is going to add more costs to 64bit platforms ?
> > > > >
> > > > > Yes, it adds a (hopefully not too contended) spinlock and in most
> > > > > places an interrupt disable/enable pair.
> > > > >
> > > > > FWIW the race condition is also present on 64-bit platforms, resulting
> > > > > in inaccurate statistic counters. I can understand if you consider it a
> > > > > mild annoyance, not worth fixing.
> > > > >
> > > > > > It seems to me that the same syncp can be used from two different
> > > > > > threads : hard irq and napi poller...
> > > > >
> > > > > Yes, that's exactly the scenario that locks up my system.
> > > > >
> > > > > > At this point, I do not see why you keep linux/u64_stats_sync.h if you
> > > > > > decide to go for a spinlock...
> > > > >
> > > > > The spinlock does not havce to be taken on the reader side, so the
> > > > > seqcounter still adds some value.
> > > > >
> > > > > > Alternative would use atomic64_t fields for the ones where there is no
> > > > > > mutual exclusion.
> > > > > >
> > > > > > RX : napi poll is definitely safe (protected by an atomic bit)
> > > > > > TX : each TX queue is also safe (protected by an atomic exclusion for
> > > > > > non LLTX drivers)
> > > > > >
> > > > > > This leaves the fields updated from hardware interrupt context ?
> > > > >
> > > > > I'm afraid I don't have enough network-stack-foo to follow here.
> > > > >
> > > > > My issue on 32 bit is that stmmac_xmit() may be called directly from
> > > > > process context while another core runs the TX napi on the same channel
> > > > > (in interrupt context). I didn't observe any race on the RX path, but I
> > > > > believe it's possible with NAPI busy polling.
> > > > >
> > > > > In any case, I don't see the connection with LLTX. Maybe you want to
> > > > > say that the TX queue is safe for stmmac (because it is a non-LLTX
> > > > > driver), but might not be safe for LLTX drivers?
> > > >
> > > > LLTX drivers (mostly virtual drivers like tunnels...) can have multiple cpus
> > > > running ndo_start_xmit() concurrently. So any use of a 'shared syncp'
> > > > would be a bug.
> > > > These drivers usually use per-cpu stats, to avoid races and false
> > > > sharing anyway.
> > > >
> > > > I think you should split the structures into two separate groups, each
> > > > guarded with its own syncp.
> > > >
> > > > No extra spinlocks, no extra costs on 64bit arches...
> > > >
> > > > If TX completion can run in parallel with ndo_start_xmit(), then
> > > > clearly we have to split stmmac_txq_stats in two halves:
> > >
> > > Oh, now I get it. Yes, that's much better, indeed.
> > >
> > > I mean, the counters have never been consistent (due to the race on the
> > > writer side), and nobody is concerned. So, there is no value in taking
> > > a consistent snapshot in stmmac_get_ethtool_stats().
> > >
> > > I'm going to rework and retest my patch. Thank you for pointing me in
> > > the right direction!
> > >
> > > Petr T
> > >
> > > > Also please note the conversion from u64 to u64_stats_t
> > >
> > > Noted. IIUC this will in turn close the update race on 64-bit by using
> > > an atomic type and on 32-bit by using a seqlock. Clever.
> > >
> > > Petr T
> > >
> > > > Very partial patch, only to show the split and new structure :
> > > >
> > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > b/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > index e3f650e88f82f927f0dcf95748fbd10c14c30cbe..702bceea5dc8c875a80f5e3a92b7bb058f373eda
> > > > 100644
> > > > --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > @@ -60,16 +60,22 @@
> > > >  /* #define FRAME_FILTER_DEBUG */
> > > >
> > > >  struct stmmac_txq_stats {
> > > > -       u64 tx_bytes;
> > > > -       u64 tx_packets;
> > > > -       u64 tx_pkt_n;
> > > > -       u64 tx_normal_irq_n;
> > > > -       u64 napi_poll;
> > > > -       u64 tx_clean;
> > > > -       u64 tx_set_ic_bit;
> > > > -       u64 tx_tso_frames;
> > > > -       u64 tx_tso_nfrags;
> > > > -       struct u64_stats_sync syncp;
> > > > +/* First part, updated from ndo_start_xmit(), protected by tx queue lock */
> > > > +       struct u64_stats_sync syncp_tx;
> > > > +       u64_stats_t tx_bytes;
> > > > +       u64_stats_t tx_packets;
> > > > +       u64_stats_t tx_pkt_n;
> > > > +       u64_stats_t tx_tso_frames;
> > > > +       u64_stats_t tx_tso_nfrags;
> > > > +
> > > > +/* Second part, updated from TX completion (protected by NAPI poll logic) */
> > > > +       struct u64_stats_sync syncp_tx_completion;
> > > > +       u64_stats_t napi_poll;
> > > > +       u64_stats_t tx_clean;
> > > > +       u64_stats_t tx_set_ic_bit;
> >
> > Unfortunately, this field is also updated from ndo_start_xmit():
> >
> > 4572)     if (set_ic)
> > 4573)             txq_stats->tx_set_ic_bit++;
> >
> > I feel it would be a shame to introduce a spinlock just for this one
> > update. But I think the field could be converted to an atomic64_t.
> >
> > Which raises a question: Why aren't all stat counters simply atomic64_t? There
> > is no guarantee that the reader side takes a consistent snapshot
> > (except on 32-bit). So, why do we even bother with u64_stats_sync?
>
> This infra was added to have no _lock_ overhead on 64bit arches.
>
> If a counter must be updated from multiple cpus (regardless of 32/64
> bit kernel), then use atomic64_t

Or use two different u64_stats_t (each guarded by a different syncp),
then fold them at stats gathering time.

>
> >
> > Is it merely because u64_stats_add() should be cheaper than
> > atomic64_add()? Or is there anything else I'm missing? If yes, does it
> > invalidate my proposal to convert tx_set_ic_bit to an atomic64_t?
>
> atomic64_add() is expensive, especially in contexts where updates are
> already guarded by a spinlock or something.

Petr Tesařík Jan. 5, 2024, 2:45 p.m. UTC | #11

On Fri, 5 Jan 2024 15:28:41 +0100
Eric Dumazet <edumazet@google.com> wrote:

> On Fri, Jan 5, 2024 at 3:26 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jan 5, 2024 at 2:27 PM Petr Tesařík <petr@tesarici.cz> wrote:  
> > >
> > > Hi Eric,
> > >
> > > yeah, it's me again...
> > >
> > > On Fri, 5 Jan 2024 12:14:47 +0100
> > > Petr Tesařík <petr@tesarici.cz> wrote:
> > >  
> > > > On Fri, 5 Jan 2024 11:48:19 +0100
> > > > Eric Dumazet <edumazet@google.com> wrote:
> > > >  
> > > > > On Fri, Jan 5, 2024 at 11:34 AM Petr Tesařík <petr@tesarici.cz> wrote:  
> > > > > >
> > > > > > On Fri, 5 Jan 2024 10:58:42 +0100
> > > > > > Eric Dumazet <edumazet@google.com> wrote:
> > > > > >  
> > > > > > > On Fri, Jan 5, 2024 at 10:16 AM Petr Tesarik <petr@tesarici.cz> wrote:  
> > > > > > > >
> > > > > > > > Add a spinlock to fix race conditions while updating Tx/Rx statistics.
> > > > > > > >
> > > > > > > > As explained by a comment in <linux/u64_stats_sync.h>, write side of struct
> > > > > > > > u64_stats_sync must ensure mutual exclusion, or one seqcount update could
> > > > > > > > be lost on 32-bit platforms, thus blocking readers forever.
> > > > > > > >
> > > > > > > > Such lockups have been actually observed on 32-bit Arm after stmmac_xmit()
> > > > > > > > on one core raced with stmmac_napi_poll_tx() on another core.
> > > > > > > >
> > > > > > > > Signed-off-by: Petr Tesarik <petr@tesarici.cz>  
> > > > > > >
> > > > > > > This is going to add more costs to 64bit platforms ?  
> > > > > >
> > > > > > Yes, it adds a (hopefully not too contended) spinlock and in most
> > > > > > places an interrupt disable/enable pair.
> > > > > >
> > > > > > FWIW the race condition is also present on 64-bit platforms, resulting
> > > > > > in inaccurate statistic counters. I can understand if you consider it a
> > > > > > mild annoyance, not worth fixing.
> > > > > >  
> > > > > > > It seems to me that the same syncp can be used from two different
> > > > > > > threads : hard irq and napi poller...  
> > > > > >
> > > > > > Yes, that's exactly the scenario that locks up my system.
> > > > > >  
> > > > > > > At this point, I do not see why you keep linux/u64_stats_sync.h if you
> > > > > > > decide to go for a spinlock...  
> > > > > >
> > > > > > The spinlock does not havce to be taken on the reader side, so the
> > > > > > seqcounter still adds some value.
> > > > > >  
> > > > > > > Alternative would use atomic64_t fields for the ones where there is no
> > > > > > > mutual exclusion.
> > > > > > >
> > > > > > > RX : napi poll is definitely safe (protected by an atomic bit)
> > > > > > > TX : each TX queue is also safe (protected by an atomic exclusion for
> > > > > > > non LLTX drivers)
> > > > > > >
> > > > > > > This leaves the fields updated from hardware interrupt context ?  
> > > > > >
> > > > > > I'm afraid I don't have enough network-stack-foo to follow here.
> > > > > >
> > > > > > My issue on 32 bit is that stmmac_xmit() may be called directly from
> > > > > > process context while another core runs the TX napi on the same channel
> > > > > > (in interrupt context). I didn't observe any race on the RX path, but I
> > > > > > believe it's possible with NAPI busy polling.
> > > > > >
> > > > > > In any case, I don't see the connection with LLTX. Maybe you want to
> > > > > > say that the TX queue is safe for stmmac (because it is a non-LLTX
> > > > > > driver), but might not be safe for LLTX drivers?  
> > > > >
> > > > > LLTX drivers (mostly virtual drivers like tunnels...) can have multiple cpus
> > > > > running ndo_start_xmit() concurrently. So any use of a 'shared syncp'
> > > > > would be a bug.
> > > > > These drivers usually use per-cpu stats, to avoid races and false
> > > > > sharing anyway.
> > > > >
> > > > > I think you should split the structures into two separate groups, each
> > > > > guarded with its own syncp.
> > > > >
> > > > > No extra spinlocks, no extra costs on 64bit arches...
> > > > >
> > > > > If TX completion can run in parallel with ndo_start_xmit(), then
> > > > > clearly we have to split stmmac_txq_stats in two halves:  
> > > >
> > > > Oh, now I get it. Yes, that's much better, indeed.
> > > >
> > > > I mean, the counters have never been consistent (due to the race on the
> > > > writer side), and nobody is concerned. So, there is no value in taking
> > > > a consistent snapshot in stmmac_get_ethtool_stats().
> > > >
> > > > I'm going to rework and retest my patch. Thank you for pointing me in
> > > > the right direction!
> > > >
> > > > Petr T
> > > >  
> > > > > Also please note the conversion from u64 to u64_stats_t  
> > > >
> > > > Noted. IIUC this will in turn close the update race on 64-bit by using
> > > > an atomic type and on 32-bit by using a seqlock. Clever.
> > > >
> > > > Petr T
> > > >  
> > > > > Very partial patch, only to show the split and new structure :
> > > > >
> > > > > diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > > b/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > > index e3f650e88f82f927f0dcf95748fbd10c14c30cbe..702bceea5dc8c875a80f5e3a92b7bb058f373eda
> > > > > 100644
> > > > > --- a/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > > +++ b/drivers/net/ethernet/stmicro/stmmac/common.h
> > > > > @@ -60,16 +60,22 @@
> > > > >  /* #define FRAME_FILTER_DEBUG */
> > > > >
> > > > >  struct stmmac_txq_stats {
> > > > > -       u64 tx_bytes;
> > > > > -       u64 tx_packets;
> > > > > -       u64 tx_pkt_n;
> > > > > -       u64 tx_normal_irq_n;
> > > > > -       u64 napi_poll;
> > > > > -       u64 tx_clean;
> > > > > -       u64 tx_set_ic_bit;
> > > > > -       u64 tx_tso_frames;
> > > > > -       u64 tx_tso_nfrags;
> > > > > -       struct u64_stats_sync syncp;
> > > > > +/* First part, updated from ndo_start_xmit(), protected by tx queue lock */
> > > > > +       struct u64_stats_sync syncp_tx;
> > > > > +       u64_stats_t tx_bytes;
> > > > > +       u64_stats_t tx_packets;
> > > > > +       u64_stats_t tx_pkt_n;
> > > > > +       u64_stats_t tx_tso_frames;
> > > > > +       u64_stats_t tx_tso_nfrags;
> > > > > +
> > > > > +/* Second part, updated from TX completion (protected by NAPI poll logic) */
> > > > > +       struct u64_stats_sync syncp_tx_completion;
> > > > > +       u64_stats_t napi_poll;
> > > > > +       u64_stats_t tx_clean;
> > > > > +       u64_stats_t tx_set_ic_bit;  
> > >
> > > Unfortunately, this field is also updated from ndo_start_xmit():
> > >
> > > 4572)     if (set_ic)
> > > 4573)             txq_stats->tx_set_ic_bit++;
> > >
> > > I feel it would be a shame to introduce a spinlock just for this one
> > > update. But I think the field could be converted to an atomic64_t.
> > >
> > > Which raises a question: Why aren't all stat counters simply atomic64_t? There
> > > is no guarantee that the reader side takes a consistent snapshot
> > > (except on 32-bit). So, why do we even bother with u64_stats_sync?  
> >
> > This infra was added to have no _lock_ overhead on 64bit arches.
> >
> > If a counter must be updated from multiple cpus (regardless of 32/64
> > bit kernel), then use atomic64_t  
> 
> Or use two different u64_stats_t (each guarded by a different syncp),
> then fold them at stats gathering time.

Oh, right, why didn't I think about it!

This only leaves an atomic_t in hard irq context. I have tried to find
something that could relax the requirement, but AFAICS at least some
setups use several interrupts that can be delivered to different CPUs
simultaneously, and all of them will walk over all channels. So we're
left with an atomic_t here.

> > > Is it merely because u64_stats_add() should be cheaper than
> > > atomic64_add()? Or is there anything else I'm missing? If yes,
> > > does it invalidate my proposal to convert tx_set_ic_bit to an
> > > atomic64_t?  
> >
> > atomic64_add() is expensive, especially in contexts where updates
> > are already guarded by a spinlock or something.  

Yeah, I know atomics can be expensive. I've heard of cases where an Arm
core was spinning on an atomic operation for several seconds...

Petr T

Andrew Lunn Jan. 5, 2024, 5:36 p.m. UTC | #12

> This only leaves an atomic_t in hard irq context. I have tried to find
> something that could relax the requirement, but AFAICS at least some
> setups use several interrupts that can be delivered to different CPUs
> simultaneously, and all of them will walk over all channels. So we're
> left with an atomic_t here.

You might want to consider per CPU statistics. Since each CPU has its
own structure of statistics, you don't need atomic.

The code actually using the statistics then needs to sum up the per
CPU statistics, and using syncp should be sufficient for that.

Maybe look at mvneta.c for inspiration.

      Andrew

Petr Tesařík Jan. 5, 2024, 6:21 p.m. UTC | #13

On Fri, 5 Jan 2024 18:36:45 +0100
Andrew Lunn <andrew@lunn.ch> wrote:

> > This only leaves an atomic_t in hard irq context. I have tried to find
> > something that could relax the requirement, but AFAICS at least some
> > setups use several interrupts that can be delivered to different CPUs
> > simultaneously, and all of them will walk over all channels. So we're
> > left with an atomic_t here.  
> 
> You might want to consider per CPU statistics. Since each CPU has its
> own structure of statistics, you don't need atomic.
> 
> The code actually using the statistics then needs to sum up the per
> CPU statistics, and using syncp should be sufficient for that.

Thanks for the tip.

Honestly, the more I'm looking into this, the less I am convinced that
it is worth the effort, but yeah, if we touch this code now, let's do
things properly. ;-)

> Maybe look at mvneta.c for inspiration.

I think I know what you mean, but I'll have a look.

Petr T

David Laight Jan. 8, 2024, 10:34 a.m. UTC | #14

From: Andrew Lunn
> Sent: 05 January 2024 17:37
> 
> > This only leaves an atomic_t in hard irq context. I have tried to find
> > something that could relax the requirement, but AFAICS at least some
> > setups use several interrupts that can be delivered to different CPUs
> > simultaneously, and all of them will walk over all channels. So we're
> > left with an atomic_t here.
> 
> You might want to consider per CPU statistics. Since each CPU has its
> own structure of statistics, you don't need atomic.
> 
> The code actually using the statistics then needs to sum up the per
> CPU statistics, and using syncp should be sufficient for that.

Doesn't that consume rather a lot of memory on systems with
'silly numbers' of cpu?

Updating an atomic_t is (pretty much) the same as taking a lock.
unlock() is also likely to also contain an atomic operation.
So if you update more than two atomic_t it is likely that a lock
will be faster.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Andrew Lunn Jan. 8, 2024, 1:41 p.m. UTC | #15

> > You might want to consider per CPU statistics. Since each CPU has its
> > own structure of statistics, you don't need atomic.
> > 
> > The code actually using the statistics then needs to sum up the per
> > CPU statistics, and using syncp should be sufficient for that.
> 
> Doesn't that consume rather a lot of memory on systems with
> 'silly numbers' of cpu?

Systems with silly number of CPUS tend to also have silly amounts of
memory. We are talking about maybe a dozen u64 here. So the memory
usage goes from 144 bytes, to 144K for a 1024CPU system.  Is 144K
excessive for such a system?

> Updating an atomic_t is (pretty much) the same as taking a lock.
> unlock() is also likely to also contain an atomic operation.
> So if you update more than two atomic_t it is likely that a lock
> will be faster.

True, but all those 1024 CPUs in your silly system get affected by a
lock or an atomic. They all need to do something with there L1 and L2
cache when using atomics. Spending an extra 144K of RAM means the
other 1023 CPUs don't notice anything at all during the increment
phase, which could be happening 1M times a second. They only get
involved when something in user space wants the statistics, so maybe
once per second from the SNMP agent.

Also, stmmac is not used on silly CPU systems. It used in embedded
systems. I doubt its integrated into anything with more than 8 CPUs.

       Andrew

Petr Tesařík Jan. 8, 2024, 3:12 p.m. UTC | #16

On Mon, 8 Jan 2024 14:41:10 +0100
Andrew Lunn <andrew@lunn.ch> wrote:

> > > You might want to consider per CPU statistics. Since each CPU has its
> > > own structure of statistics, you don't need atomic.
> > > 
> > > The code actually using the statistics then needs to sum up the per
> > > CPU statistics, and using syncp should be sufficient for that.  
> > 
> > Doesn't that consume rather a lot of memory on systems with
> > 'silly numbers' of cpu?  
> 
> Systems with silly number of CPUS tend to also have silly amounts of
> memory. We are talking about maybe a dozen u64 here. So the memory
> usage goes from 144 bytes, to 144K for a 1024CPU system.  Is 144K
> excessive for such a system?

I'm not even sure it's worth converting _all_ statistic counters to
per-CPU variables. Most of them are already guarded by a lock (either
the queue lock, or NAPI scheduling). Only the hard interrupt counter is
not protected by anything, so it's more like 8k on a 1024-CPU system....

> > Updating an atomic_t is (pretty much) the same as taking a lock.
> > unlock() is also likely to also contain an atomic operation.
> > So if you update more than two atomic_t it is likely that a lock
> > will be faster.  
> 
> True, but all those 1024 CPUs in your silly system get affected by a
> lock or an atomic. They all need to do something with there L1 and L2
> cache when using atomics. Spending an extra 144K of RAM means the
> other 1023 CPUs don't notice anything at all during the increment
> phase, which could be happening 1M times a second. They only get
> involved when something in user space wants the statistics, so maybe
> once per second from the SNMP agent.
> 
> Also, stmmac is not used on silly CPU systems. It used in embedded
> systems. I doubt its integrated into anything with more than 8 CPUs.

I also doubt it as of today, but hey, it seems that more CPU cores is
the future of embedded. Ten years ago, who would have imagined putting
an 8-core CPU into a smartphone? OTOH who would have imagined a
smartphone with 24G of RAM...

Petr T

net: stmmac: protect statistics updates with a spinlock

Commit Message

Comments

Patch