diff mbox series

[v3,2/3] mlxbf_gige: Fix intermittent no ip issue

Message ID 20230922173626.23790-3-asmaa@nvidia.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series mlxbf_gige: Fix several bugs | expand

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Asmaa Mnebhi Sept. 22, 2023, 5:36 p.m. UTC
Although the link is up, there is no ip assigned on a setup with high background
traffic. Nothing is transmitted nor received.
The RX error count keeps on increasing. After several minutes, the RX error count
stagnates and the GigE interface finally gets an ip.

The issue is in the mlxbf_gige_rx_init function. As soon as the RX DMA is enabled,
the RX CI reaches the max of 128, and it becomes equal to RX PI. RX CI doesn't decrease
since the code hasn't ran phy_start yet.

The solution is to move the rx init after phy_start.

Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
Reviewed-by: David Thompson <davthompson@nvidia.com>
---
v2->v3:
- No changes
v1->v2:
- No changes

 .../ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c | 14 +++++++-------
 .../ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c   |  6 +++---
 2 files changed, 10 insertions(+), 10 deletions(-)

Comments

Florian Fainelli Sept. 22, 2023, 5:47 p.m. UTC | #1
On 9/22/2023 10:36 AM, Asmaa Mnebhi wrote:
> Although the link is up, there is no ip assigned on a setup with high background
> traffic. Nothing is transmitted nor received.
> The RX error count keeps on increasing. After several minutes, the RX error count
> stagnates and the GigE interface finally gets an ip.
> 
> The issue is in the mlxbf_gige_rx_init function. As soon as the RX DMA is enabled,
> the RX CI reaches the max of 128, and it becomes equal to RX PI. RX CI doesn't decrease
> since the code hasn't ran phy_start yet.
> 
> The solution is to move the rx init after phy_start.
> 
> Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> Reviewed-by: David Thompson <davthompson@nvidia.com>

This seems fine, but your description of the problem still looks like 
there may be a more fundamental ordering issue when you enable your RX 
pipe here.

It seems to me like you should enable it from "inner" as in closest to 
the CPU/DMA subsystem towards "outer" which is the MAC and finally the PHY.

It should be fine to enable your RX DMA as long as you keep the MAC's RX 
disabled, and then you can enable your MAC's RX enable and later start 
the PHY.
Asmaa Mnebhi Oct. 13, 2023, 3 p.m. UTC | #2
> > Although the link is up, there is no ip assigned on a setup with high
> > background traffic. Nothing is transmitted nor received.
> > The RX error count keeps on increasing. After several minutes, the RX
> > error count stagnates and the GigE interface finally gets an ip.
> >
> > The issue is in the mlxbf_gige_rx_init function. As soon as the RX DMA
> > is enabled, the RX CI reaches the max of 128, and it becomes equal to
> > RX PI. RX CI doesn't decrease since the code hasn't ran phy_start yet.
> >
> > The solution is to move the rx init after phy_start.
> >
> > Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> > Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> > Reviewed-by: David Thompson <davthompson@nvidia.com>
> 
> This seems fine, but your description of the problem still looks like there may
> be a more fundamental ordering issue when you enable your RX pipe here.
> 
> It seems to me like you should enable it from "inner" as in closest to the
> CPU/DMA subsystem towards "outer" which is the MAC and finally the PHY.
> 
> It should be fine to enable your RX DMA as long as you keep the MAC's RX
> disabled, and then you can enable your MAC's RX enable and later start the
> PHY.

Thanks for your feedback Florian. I will take a look and address your comments shortly. Sorry for the delayed response, I was OOO.
Asmaa Mnebhi Oct. 16, 2023, 5:45 p.m. UTC | #3
> > > Although the link is up, there is no ip assigned on a setup with
> > > high background traffic. Nothing is transmitted nor received.
> > > The RX error count keeps on increasing. After several minutes, the
> > > RX error count stagnates and the GigE interface finally gets an ip.
> > >
> > > The issue is in the mlxbf_gige_rx_init function. As soon as the RX
> > > DMA is enabled, the RX CI reaches the max of 128, and it becomes
> > > equal to RX PI. RX CI doesn't decrease since the code hasn't ran phy_start
> yet.
> > >
> > > The solution is to move the rx init after phy_start.
> > >
> > > Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet
> > > driver")
> > > Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> > > Reviewed-by: David Thompson <davthompson@nvidia.com>
> >
> > This seems fine, but your description of the problem still looks like
> > there may be a more fundamental ordering issue when you enable your RX
> pipe here.
> >
> > It seems to me like you should enable it from "inner" as in closest to
> > the CPU/DMA subsystem towards "outer" which is the MAC and finally the
> PHY.
> >
> > It should be fine to enable your RX DMA as long as you keep the MAC's
> > RX disabled, and then you can enable your MAC's RX enable and later
> > start the PHY.
> 
> Thanks for your feedback Florian. I will take a look and address your
> comments shortly. Sorry for the delayed response, I was OOO.
Hi Florian,

We would like to maintain the code as is because we need to set the RX DMA after the MAC RX filters and the RX rings are setup (in mlxbf_gige_rx_init()).
The PHY start logic needs to be done  before that, otherwise, there is a chance we would encounter this bug where our MAC RX consumer index (CI) equals our MAC RX production index (PI) and that results in a MAC state that cannot be solved until we cleanup the MAC again. Note that this bug is difficult to reproduce. Our QA had to run the reboot test and have a setup with really high background traffic.

Thanks.
Asmaa
diff mbox series

Patch

diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
index 74185b02daa0..fd4fac1ca26c 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
@@ -147,14 +147,14 @@  static int mlxbf_gige_open(struct net_device *netdev)
 	 */
 	priv->valid_polarity = 0;
 
-	err = mlxbf_gige_rx_init(priv);
+	phy_start(phydev);
+
+	err = mlxbf_gige_tx_init(priv);
 	if (err)
 		goto free_irqs;
-	err = mlxbf_gige_tx_init(priv);
+	err = mlxbf_gige_rx_init(priv);
 	if (err)
-		goto rx_deinit;
-
-	phy_start(phydev);
+		goto tx_deinit;
 
 	netif_napi_add(netdev, &priv->napi, mlxbf_gige_poll);
 	napi_enable(&priv->napi);
@@ -176,8 +176,8 @@  static int mlxbf_gige_open(struct net_device *netdev)
 
 	return 0;
 
-rx_deinit:
-	mlxbf_gige_rx_deinit(priv);
+tx_deinit:
+	mlxbf_gige_tx_deinit(priv);
 
 free_irqs:
 	mlxbf_gige_free_irqs(priv);
diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
index cfb8fb957f0c..d82feeabb061 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
@@ -142,6 +142,9 @@  int mlxbf_gige_rx_init(struct mlxbf_gige *priv)
 	writeq(MLXBF_GIGE_RX_MAC_FILTER_COUNT_PASS_EN,
 	       priv->base + MLXBF_GIGE_RX_MAC_FILTER_COUNT_PASS);
 
+	writeq(ilog2(priv->rx_q_entries),
+	       priv->base + MLXBF_GIGE_RX_WQE_SIZE_LOG2);
+
 	/* Clear MLXBF_GIGE_INT_MASK 'receive pkt' bit to
 	 * indicate readiness to receive interrupts
 	 */
@@ -154,9 +157,6 @@  int mlxbf_gige_rx_init(struct mlxbf_gige *priv)
 	data |= MLXBF_GIGE_RX_DMA_EN;
 	writeq(data, priv->base + MLXBF_GIGE_RX_DMA);
 
-	writeq(ilog2(priv->rx_q_entries),
-	       priv->base + MLXBF_GIGE_RX_WQE_SIZE_LOG2);
-
 	return 0;
 
 free_wqe_and_skb: