diff mbox series

[net] net: phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not present

Message ID 20240913121230.2620122-1-vladimir.oltean@nxp.com (mailing list archive)
State Accepted
Commit 194ef9d0de9021df4a0ba8b112f91e56adaddd22
Delegated to: Netdev Maintainers
Series [net] net: phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not present

Checks

Context                         Check    Description
netdev/series_format            success  Single patches do not need cover letters
netdev/tree_selection           success  Clearly marked for net
netdev/ynl                      success  Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present            success  Fixes tag present in non-next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 16 this patch: 16
netdev/build_tools              success  No tools touched, skip
netdev/cc_maintainers           warning  1 maintainers not CCed: robimarko@gmail.com
netdev/build_clang              success  Errors and warnings before: 16 this patch: 16
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  Fixes tag looks correct
netdev/build_allmodconfig_warn  success  Errors and warnings before: 16 this patch: 16
netdev/checkpatch               success  total: 0 errors, 0 warnings, 0 checks, 82 lines checked
netdev/build_clang_rust         success  No Rust files in patch. Skipping build
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0
netdev/contest                  success  net-next-2024-09-14--09-00 (tests: 764)

Commit Message

Vladimir Oltean Sept. 13, 2024, 12:12 p.m. UTC

The author of the blamed commit apparently did not notice something
about aqr_wait_reset_complete(): it polls the exact same register -
MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID - as aqr_firmware_load().

Thus, the entire logic after the introduction of aqr_wait_reset_complete() is
now completely side-stepped, because if aqr_wait_reset_complete()
succeeds, MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID could have only been a
non-zero value. The handling of the case where the register reads as 0
is dead code, due to the previous -ETIMEDOUT having stopped execution
and returning a fatal error to the caller. We never attempt to load
new firmware if no firmware is present.

Based on static code analysis, I guess we should simply introduce a
switch/case statement based on the return code from aqr_wait_reset_complete(),
to determine whether to load firmware or not. I am not intending to
change the procedure through which the driver determines whether to load
firmware or not, as I am unaware of alternative possibilities.

At the same time, Russell King suggests that if aqr_wait_reset_complete()
is expected to return -ETIMEDOUT as part of normal operation and not
just catastrophic failure, the use of phy_read_mmd_poll_timeout() is
improper, since that has an embedded print inside. Just open-code a
call to read_poll_timeout() to avoid printing -ETIMEDOUT, but continue
printing actual read errors from the MDIO bus.

Fixes: ad649a1fac37 ("net: phy: aquantia: wait for FW reset before checking the vendor ID")
Reported-by: Clark Wang <xiaoning.wang@nxp.com>
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Closes: https://lore.kernel.org/netdev/8ac00a45-ac61-41b4-9f74-d18157b8b6bf@nvidia.com/
Reported-by: Hans-Frieder Vogt <hfdevel@gmx.net>
Closes: https://lore.kernel.org/netdev/c7c1a3ae-be97-4929-8d89-04c8aa870209@gmx.net/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
Only compile-tested. However, my timeout timer expired waiting for
reactions on the thread with Bartosz' original patch, and Hans-Frieder
Vogt wrote a message in his cover letter implying that the patch fixes
the issue for him. Any Tested-by: tags are welcome.

 drivers/net/phy/aquantia/aquantia_firmware.c | 42 +++++++++++---------
 drivers/net/phy/aquantia/aquantia_main.c     | 19 +++++++--
 2 files changed, 39 insertions(+), 22 deletions(-)
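
As a rough illustration of the control-flow change described in the commit message, the userspace sketch below maps the return code of the reset wait onto what the firmware loader does next, before and after the fix. The enum and both function names are hypothetical and exist only for this illustration; they are not part of the driver.

```c
#include <errno.h>

/* Hypothetical outcome labels, used only in this sketch. */
enum fw_action {
	FW_KEEP,	/* some firmware is already running, do nothing */
	FW_LOAD,	/* no firmware detected, try to load an image */
	FW_FAIL,	/* give up and return the error to the caller */
};

/* Before the fix: any non-zero return from the wait was fatal, so the
 * "FW ID reads 0" case could never reach the loading code.
 */
enum fw_action fw_action_before_fix(int wait_ret)
{
	if (wait_ret)
		return FW_FAIL;

	/* The wait only succeeds on a non-zero FW ID. */
	return FW_KEEP;
}

/* After the fix: -ETIMEDOUT means "FW ID still 0", which selects the
 * firmware loading path; other errors are still propagated.
 */
enum fw_action fw_action_after_fix(int wait_ret)
{
	switch (wait_ret) {
	case 0:
		return FW_KEEP;
	case -ETIMEDOUT:
		return FW_LOAD;
	default:
		return FW_FAIL;
	}
}
```

In the actual patch, the FW_LOAD branch corresponds to trying aqr_firmware_load_nvmem() first and falling back to aqr_firmware_load_fs().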

Comments

Bartosz Golaszewski Sept. 13, 2024, 1:18 p.m. UTC | #1
On Fri, 13 Sept 2024 at 14:12, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> The author of the blamed commit apparently did not notice something
> about aqr_wait_reset_complete(): it polls the exact same register -
> MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID - as aqr_firmware_load().
>
> Thus, the entire logic after the introduction of aqr_wait_reset_complete() is
> now completely side-stepped, because if aqr_wait_reset_complete()
> succeeds, MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID could have only been a
> non-zero value. The handling of the case where the register reads as 0
> is dead code, due to the previous -ETIMEDOUT having stopped execution
> and returning a fatal error to the caller. We never attempt to load
> new firmware if no firmware is present.
>
> Based on static code analysis, I guess we should simply introduce a
> switch/case statement based on the return code from aqr_wait_reset_complete(),
> to determine whether to load firmware or not. I am not intending to
> change the procedure through which the driver determines whether to load
> firmware or not, as I am unaware of alternative possibilities.
>
> At the same time, Russell King suggests that if aqr_wait_reset_complete()
> is expected to return -ETIMEDOUT as part of normal operation and not
> just catastrophic failure, the use of phy_read_mmd_poll_timeout() is
> improper, since that has an embedded print inside. Just open-code a
> call to read_poll_timeout() to avoid printing -ETIMEDOUT, but continue
> printing actual read errors from the MDIO bus.
>
> Fixes: ad649a1fac37 ("net: phy: aquantia: wait for FW reset before checking the vendor ID")
> Reported-by: Clark Wang <xiaoning.wang@nxp.com>
> Reported-by: Jon Hunter <jonathanh@nvidia.com>
> Closes: https://lore.kernel.org/netdev/8ac00a45-ac61-41b4-9f74-d18157b8b6bf@nvidia.com/
> Reported-by: Hans-Frieder Vogt <hfdevel@gmx.net>
> Closes: https://lore.kernel.org/netdev/c7c1a3ae-be97-4929-8d89-04c8aa870209@gmx.net/
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
> Only compile-tested. However, my timeout timer expired waiting for
> reactions on the thread with Bartosz' original patch, and Hans-Frieder
> Vogt wrote a message in his cover letter implying that the patch fixes
> the issue for him. Any Tested-by: tags are welcome.
>

Still works on sa8775p-ride v3

Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>

Vladimir Oltean Sept. 13, 2024, 1:21 p.m. UTC | #2
On Fri, Sep 13, 2024 at 03:18:42PM +0200, Bartosz Golaszewski wrote:
> Still works on sa8775p-ride v3
> 
> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>

Thanks for testing, I appreciate it.

Hans-Frieder Vogt Sept. 14, 2024, 1:16 p.m. UTC | #3
On 13.09.2024 14.12, Vladimir Oltean wrote:
> The author of the blamed commit apparently did not notice something
> about aqr_wait_reset_complete(): it polls the exact same register -
> MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID - as aqr_firmware_load().
>
> Thus, the entire logic after the introduction of aqr_wait_reset_complete() is
> now completely side-stepped, because if aqr_wait_reset_complete()
> succeeds, MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID could have only been a
> non-zero value. The handling of the case where the register reads as 0
> is dead code, due to the previous -ETIMEDOUT having stopped execution
> and returning a fatal error to the caller. We never attempt to load
> new firmware if no firmware is present.
>
> Based on static code analysis, I guess we should simply introduce a
> switch/case statement based on the return code from aqr_wait_reset_complete(),
> to determine whether to load firmware or not. I am not intending to
> change the procedure through which the driver determines whether to load
> firmware or not, as I am unaware of alternative possibilities.
>
> At the same time, Russell King suggests that if aqr_wait_reset_complete()
> is expected to return -ETIMEDOUT as part of normal operation and not
> just catastrophic failure, the use of phy_read_mmd_poll_timeout() is
> improper, since that has an embedded print inside. Just open-code a
> call to read_poll_timeout() to avoid printing -ETIMEDOUT, but continue
> printing actual read errors from the MDIO bus.
>
> Fixes: ad649a1fac37 ("net: phy: aquantia: wait for FW reset before checking the vendor ID")
> Reported-by: Clark Wang <xiaoning.wang@nxp.com>
> Reported-by: Jon Hunter <jonathanh@nvidia.com>
> Closes: https://lore.kernel.org/netdev/8ac00a45-ac61-41b4-9f74-d18157b8b6bf@nvidia.com/
> Reported-by: Hans-Frieder Vogt <hfdevel@gmx.net>
> Closes: https://lore.kernel.org/netdev/c7c1a3ae-be97-4929-8d89-04c8aa870209@gmx.net/
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
> Only compile-tested. However, my timeout timer expired waiting for
> reactions on the thread with Bartosz' original patch, and Hans-Frieder
> Vogt wrote a message in his cover letter implying that the patch fixes
> the issue for him. Any Tested-by: tags are welcome.
>
>   drivers/net/phy/aquantia/aquantia_firmware.c | 42 +++++++++++---------
>   drivers/net/phy/aquantia/aquantia_main.c     | 19 +++++++--
>   2 files changed, 39 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/phy/aquantia/aquantia_firmware.c b/drivers/net/phy/aquantia/aquantia_firmware.c
> index 524627a36c6f..dac6464b5fe2 100644
> --- a/drivers/net/phy/aquantia/aquantia_firmware.c
> +++ b/drivers/net/phy/aquantia/aquantia_firmware.c
> @@ -353,26 +353,32 @@ int aqr_firmware_load(struct phy_device *phydev)
>   {
>   	int ret;
>
> -	ret = aqr_wait_reset_complete(phydev);
> -	if (ret)
> -		return ret;
> -
> -	/* Check if the firmware is not already loaded by pooling
> -	 * the current version returned by the PHY. If 0 is returned,
> -	 * no firmware is loaded.
> +	/* Check if the firmware is not already loaded by polling
> +	 * the current version returned by the PHY.
>   	 */
> -	ret = phy_read_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_FW_ID);
> -	if (ret > 0)
> -		goto exit;
> -
> -	ret = aqr_firmware_load_nvmem(phydev);
> -	if (!ret)
> -		goto exit;
> -
> -	ret = aqr_firmware_load_fs(phydev);
> -	if (ret)
> +	ret = aqr_wait_reset_complete(phydev);
> +	switch (ret) {
> +	case 0:
> +		/* Some firmware is loaded => do nothing */
> +		return 0;
> +	case -ETIMEDOUT:
> +		/* VEND1_GLOBAL_FW_ID still reads 0 after 2 seconds of polling.
> +		 * We don't have full confidence that no firmware is loaded (in
> +		 * theory it might just not have loaded yet), but we will
> +		 * assume that, and load a new image.
> +		 */
> +		ret = aqr_firmware_load_nvmem(phydev);
> +		if (!ret)
> +			return ret;
> +
> +		ret = aqr_firmware_load_fs(phydev);
> +		if (ret)
> +			return ret;
> +		break;
> +	default:
> +		/* PHY read error, propagate it to the caller */
>   		return ret;
> +	}
>
> -exit:
>   	return 0;
>   }
> diff --git a/drivers/net/phy/aquantia/aquantia_main.c b/drivers/net/phy/aquantia/aquantia_main.c
> index e982e9ce44a5..57b8b8f400fd 100644
> --- a/drivers/net/phy/aquantia/aquantia_main.c
> +++ b/drivers/net/phy/aquantia/aquantia_main.c
> @@ -435,6 +435,9 @@ static int aqr107_set_tunable(struct phy_device *phydev,
>   	}
>   }
>
> +#define AQR_FW_WAIT_SLEEP_US	20000
> +#define AQR_FW_WAIT_TIMEOUT_US	2000000
> +
>   /* If we configure settings whilst firmware is still initializing the chip,
>    * then these settings may be overwritten. Therefore make sure chip
>    * initialization has completed. Use presence of the firmware ID as
> @@ -444,11 +447,19 @@ static int aqr107_set_tunable(struct phy_device *phydev,
>    */
>   int aqr_wait_reset_complete(struct phy_device *phydev)
>   {
> -	int val;
> +	int ret, val;
> +
> +	ret = read_poll_timeout(phy_read_mmd, val, val != 0,
> +				AQR_FW_WAIT_SLEEP_US, AQR_FW_WAIT_TIMEOUT_US,
> +				false, phydev, MDIO_MMD_VEND1,
> +				VEND1_GLOBAL_FW_ID);
> +	if (val < 0) {
> +		phydev_err(phydev, "Failed to read VEND1_GLOBAL_FW_ID: %pe\n",
> +			   ERR_PTR(val));
> +		return val;
> +	}
>
> -	return phy_read_mmd_poll_timeout(phydev, MDIO_MMD_VEND1,
> -					 VEND1_GLOBAL_FW_ID, val, val != 0,
> -					 20000, 2000000, false);
> +	return ret;
>   }
>
>   static void aqr107_chip_info(struct phy_device *phydev)

Tested-by: Hans-Frieder Vogt <hfdevel@gmx.net>

     Hans

patchwork-bot+netdevbpf@kernel.org Sept. 19, 2024, 11 a.m. UTC | #4
Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Fri, 13 Sep 2024 15:12:30 +0300 you wrote:
> The author of the blamed commit apparently did not notice something
> about aqr_wait_reset_complete(): it polls the exact same register -
> MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID - as aqr_firmware_load().
> 
> Thus, the entire logic after the introduction of aqr_wait_reset_complete() is
> now completely side-stepped, because if aqr_wait_reset_complete()
> succeeds, MDIO_MMD_VEND1:VEND1_GLOBAL_FW_ID could have only been a
> non-zero value. The handling of the case where the register reads as 0
> is dead code, due to the previous -ETIMEDOUT having stopped execution
> and returning a fatal error to the caller. We never attempt to load
> new firmware if no firmware is present.
> 
> [...]

Here is the summary with links:
  - [net] net: phy: aquantia: fix -ETIMEDOUT PHY probe failure when firmware not present
    https://git.kernel.org/netdev/net/c/194ef9d0de90

You are awesome, thank you!

Patch

diff --git a/drivers/net/phy/aquantia/aquantia_firmware.c b/drivers/net/phy/aquantia/aquantia_firmware.c
index 524627a36c6f..dac6464b5fe2 100644
--- a/drivers/net/phy/aquantia/aquantia_firmware.c
+++ b/drivers/net/phy/aquantia/aquantia_firmware.c
@@ -353,26 +353,32 @@  int aqr_firmware_load(struct phy_device *phydev)
 {
 	int ret;
 
-	ret = aqr_wait_reset_complete(phydev);
-	if (ret)
-		return ret;
-
-	/* Check if the firmware is not already loaded by pooling
-	 * the current version returned by the PHY. If 0 is returned,
-	 * no firmware is loaded.
+	/* Check if the firmware is not already loaded by polling
+	 * the current version returned by the PHY.
 	 */
-	ret = phy_read_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_FW_ID);
-	if (ret > 0)
-		goto exit;
-
-	ret = aqr_firmware_load_nvmem(phydev);
-	if (!ret)
-		goto exit;
-
-	ret = aqr_firmware_load_fs(phydev);
-	if (ret)
+	ret = aqr_wait_reset_complete(phydev);
+	switch (ret) {
+	case 0:
+		/* Some firmware is loaded => do nothing */
+		return 0;
+	case -ETIMEDOUT:
+		/* VEND1_GLOBAL_FW_ID still reads 0 after 2 seconds of polling.
+		 * We don't have full confidence that no firmware is loaded (in
+		 * theory it might just not have loaded yet), but we will
+		 * assume that, and load a new image.
+		 */
+		ret = aqr_firmware_load_nvmem(phydev);
+		if (!ret)
+			return ret;
+
+		ret = aqr_firmware_load_fs(phydev);
+		if (ret)
+			return ret;
+		break;
+	default:
+		/* PHY read error, propagate it to the caller */
 		return ret;
+	}
 
-exit:
 	return 0;
 }
diff --git a/drivers/net/phy/aquantia/aquantia_main.c b/drivers/net/phy/aquantia/aquantia_main.c
index e982e9ce44a5..57b8b8f400fd 100644
--- a/drivers/net/phy/aquantia/aquantia_main.c
+++ b/drivers/net/phy/aquantia/aquantia_main.c
@@ -435,6 +435,9 @@  static int aqr107_set_tunable(struct phy_device *phydev,
 	}
 }
 
+#define AQR_FW_WAIT_SLEEP_US	20000
+#define AQR_FW_WAIT_TIMEOUT_US	2000000
+
 /* If we configure settings whilst firmware is still initializing the chip,
  * then these settings may be overwritten. Therefore make sure chip
  * initialization has completed. Use presence of the firmware ID as
@@ -444,11 +447,19 @@  static int aqr107_set_tunable(struct phy_device *phydev,
  */
 int aqr_wait_reset_complete(struct phy_device *phydev)
 {
-	int val;
+	int ret, val;
+
+	ret = read_poll_timeout(phy_read_mmd, val, val != 0,
+				AQR_FW_WAIT_SLEEP_US, AQR_FW_WAIT_TIMEOUT_US,
+				false, phydev, MDIO_MMD_VEND1,
+				VEND1_GLOBAL_FW_ID);
+	if (val < 0) {
+		phydev_err(phydev, "Failed to read VEND1_GLOBAL_FW_ID: %pe\n",
+			   ERR_PTR(val));
+		return val;
+	}
 
-	return phy_read_mmd_poll_timeout(phydev, MDIO_MMD_VEND1,
-					 VEND1_GLOBAL_FW_ID, val, val != 0,
-					 20000, 2000000, false);
+	return ret;
 }
 
 static void aqr107_chip_info(struct phy_device *phydev)
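
The explicit `val < 0` check in aqr_wait_reset_complete() matters because the poll condition `val != 0` is also satisfied by a negative errno from the read, in which case the poll itself reports success. The userspace sketch below mimics that behaviour under stated assumptions: `struct fake_phy`, `fake_read()`, `poll_nonzero()` and `wait_reset_complete()` are hypothetical stand-ins, and a poll count replaces the kernel's time-based budget (2 s at 20 ms per sleep is roughly 100 polls).

```c
#include <errno.h>

/* Hypothetical fake PHY: replays a scripted sequence of register reads
 * and keeps returning the last entry once the script is exhausted.
 */
struct fake_phy {
	const int *reads;
	int n;
	int pos;
};

int fake_read(struct fake_phy *phy)
{
	int v = phy->reads[phy->pos];

	if (phy->pos < phy->n - 1)
		phy->pos++;
	return v;
}

/* Mimics a poll with the condition "val != 0": stop as soon as the
 * condition holds, otherwise report -ETIMEDOUT after max_polls reads.
 */
int poll_nonzero(struct fake_phy *phy, int max_polls, int *val)
{
	*val = 0;
	for (int i = 0; i < max_polls; i++) {
		*val = fake_read(phy);
		if (*val != 0)
			break;
	}
	return *val != 0 ? 0 : -ETIMEDOUT;
}

/* Mirrors the shape of the patched function: a read error takes
 * precedence over the poll result, a timeout is returned as-is.
 */
int wait_reset_complete(struct fake_phy *phy, int max_polls)
{
	int val;
	int ret = poll_nonzero(phy, max_polls, &val);

	if (val < 0)
		return val;	/* bus error, not a firmware ID */
	return ret;
}
```

With a script of `{0, -EIO}`, `poll_nonzero()` returns 0 while leaving `-EIO` in `val`, which is why the wrapper has to inspect `val` before trusting the poll result.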