[net-next,v1,1/1] mlxbf_gige: Fix kernel panic at shutdown

Message ID: 20230602182443.25514-1-asmaa@nvidia.com
State: Superseded
Delegated to: Netdev Maintainers

Checks

Context                         Check    Description
netdev/series_format            success  Single patches do not need cover letters
netdev/tree_selection           success  Clearly marked for net-next
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers           fail     1 blamed authors not CCed: limings@nvidia.com; 1 maintainers not CCed: limings@nvidia.com
netdev/build_clang              success  Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  Fixes tag looks correct
netdev/build_allmodconfig_warn  success  Errors and warnings before: 8 this patch: 8
netdev/checkpatch               success  total: 0 errors, 0 warnings, 0 checks, 11 lines checked
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0

Commit Message

Asmaa Mnebhi June 2, 2023, 6:24 p.m. UTC
There is a race condition happening during shutdown due to pending napi transactions.
Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as a
result causes a kernel panic.
To fix this during shutdown, invoke mlxbf_gige_remove to disable and dequeue napi.

Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)
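
The ordering problem described in the commit message can be illustrated with a small userspace model. This is only a sketch: `model_dev`, `model_poll`, `model_teardown` and `demo_poll_after_teardown` are hypothetical names, not kernel or driver APIs, and a boolean flag stands in for the NAPI enabled state. The real `napi_disable()` also waits for an in-flight poll to finish, which this single-threaded model does not capture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_dev {
	bool poll_enabled;	/* stands in for the NAPI enabled state */
	int *rx_ring;		/* resource the poll routine dereferences */
};

/* Models a poll callback: returns 1 if it touched the ring, 0 if it was skipped. */
static int model_poll(struct model_dev *dev)
{
	if (!dev->poll_enabled)
		return 0;

	dev->rx_ring[0]++;
	return 1;
}

/* Teardown order matters: stop polling first, release resources second. */
static void model_teardown(struct model_dev *dev)
{
	dev->poll_enabled = false;
	free(dev->rx_ring);
	dev->rx_ring = NULL;
}

/* Returns how many poll calls actually touched the ring, or -1 on allocation failure. */
int demo_poll_after_teardown(void)
{
	struct model_dev dev = { .poll_enabled = true, .rx_ring = NULL };
	int touched = 0;

	dev.rx_ring = calloc(4, sizeof(*dev.rx_ring));
	if (!dev.rx_ring)
		return -1;

	touched += model_poll(&dev);	/* runs: polling is still enabled */
	model_teardown(&dev);
	touched += model_poll(&dev);	/* skipped: never sees the NULL ring */

	assert(dev.rx_ring == NULL);
	return touched;
}
```

If `model_teardown()` freed the ring without clearing `poll_enabled` first, the second `model_poll()` call would dereference a NULL pointer, which is the class of failure the commit message reports.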

Comments

Jakub Kicinski June 5, 2023, 11:15 p.m. UTC | #1
On Fri, 2 Jun 2023 14:24:43 -0400 Asmaa Mnebhi wrote:
> There is a race condition happening during shutdown due to pending napi transactions.
> Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as a
> result causes a kernel panic.
> To fix this during shutdown, invoke mlxbf_gige_remove to disable and dequeue napi.
> 
> Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>

Judging by the Fixes tag the problem can happen on 6.4-rc5 already,
right? So the tree in the [PATCH ] tag should have been net rather
than net-next?

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#git-trees-and-patch-flow

No need to repost; confirmation is enough.
Paolo Abeni June 6, 2023, 10:47 a.m. UTC | #2
On Fri, 2023-06-02 at 14:24 -0400, Asmaa Mnebhi wrote:
> There is a race condition happening during shutdown due to pending napi transactions.
> Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as a
> result causes a kernel panic.
> To fix this during shutdown, invoke mlxbf_gige_remove to disable and dequeue napi.
> 
> Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> index 694de9513b9f..7017f14595db 100644
> --- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> +++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> @@ -485,10 +485,7 @@ static int mlxbf_gige_remove(struct platform_device *pdev)
>  
>  static void mlxbf_gige_shutdown(struct platform_device *pdev)
>  {
> -	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
> -
> -	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
> -	mlxbf_gige_clean_port(priv);
> +	mlxbf_gige_remove(pdev);
>  }
>  
>  static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {

If the device goes through both shutdown() and remove(), the netdevice
will go through unregister_netdevice() twice, which is wrong. Am I
missing something relevant?

Thanks!

Paolo
Asmaa Mnebhi June 6, 2023, 12:25 p.m. UTC | #3
Hi Jakub,

Yes indeed. Thank you!

Best,
Asmaa
> 
> Judging by the Fixes tag the problem can happen on 6.4-rc5 already, right? So
> the tree in the [PATCH ] tag should have been net rather than net-next?
> 
> https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#git-trees-and-patch-flow
> 
> No need to repost confirmation is enough.
Jakub Kicinski June 6, 2023, 5:29 p.m. UTC | #4
On Tue, 06 Jun 2023 12:47:09 +0200 Paolo Abeni wrote:
> >  static void mlxbf_gige_shutdown(struct platform_device *pdev)
> >  {
> > -	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
> > -
> > -	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
> > -	mlxbf_gige_clean_port(priv);
> > +	mlxbf_gige_remove(pdev);
> >  }
> >  
> >  static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {  
> 
> if the device goes through both shutdown() and remove(), the netdevice
> will go through unregister_netdevice() 2 times, which is wrong. Am I
> missing something relevant?

Good point, mlxbf_gige_remove() needs to check that the priv pointer
is not NULL.
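
The guard suggested here can be sketched in userspace as follows. This is a minimal model under stated assumptions, not the driver's code: `fake_priv`, `fake_pdev`, `fake_remove`, `fake_shutdown` and `demo_double_teardown` are hypothetical stand-ins, and the sketch assumes the remove path clears the driver data pointer once teardown completes, so that a second call can detect it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real driver structures. */
struct fake_priv {
	int unregister_calls;	/* counts simulated netdevice unregistrations */
};

struct fake_pdev {
	void *drvdata;		/* models the per-device driver data pointer */
};

/* Models a remove() that tolerates being called more than once. */
static int fake_remove(struct fake_pdev *pdev)
{
	struct fake_priv *priv = pdev->drvdata;

	if (!priv)
		return 0;	/* already torn down: nothing left to do */

	priv->unregister_calls++;	/* stands in for unregistering the netdevice */
	pdev->drvdata = NULL;		/* assumption: teardown clears the pointer */
	return 0;
}

/* Models the shutdown() from the patch, which delegates to remove(). */
static void fake_shutdown(struct fake_pdev *pdev)
{
	fake_remove(pdev);
}

/* Runs shutdown() followed by remove() and returns the unregistration count. */
int demo_double_teardown(void)
{
	struct fake_priv priv = { .unregister_calls = 0 };
	struct fake_pdev pdev = { .drvdata = &priv };

	fake_shutdown(&pdev);
	fake_remove(&pdev);	/* second pass returns early at the NULL check */

	assert(pdev.drvdata == NULL);
	return priv.unregister_calls;
}
```

Without the NULL check, the second call would read the unregistration counter through a NULL `priv`, which corresponds to the double unregister_netdevice() concern raised above.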
Asmaa Mnebhi June 7, 2023, 1:54 p.m. UTC | #5
> > >  static void mlxbf_gige_shutdown(struct platform_device *pdev)
> > >  {
> > > -	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
> > > -
> > > -	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
> > > -	mlxbf_gige_clean_port(priv);
> > > +	mlxbf_gige_remove(pdev);
> > >  }
> > >
> > >  static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {
> >
> > if the device goes through both shutdown() and remove(), the netdevice
> > will go through unregister_netdevice() 2 times, which is wrong. Am I
> > missing something relevant?
> 
> Good point, mlxbf_gige_remove() needs to check that the priv pointer is not
> NULL.

Thank you all for your feedback. I will fix it shortly, along with retargeting the patch from net-next to net.
Patch

diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
index 694de9513b9f..7017f14595db 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
@@ -485,10 +485,7 @@  static int mlxbf_gige_remove(struct platform_device *pdev)
 
 static void mlxbf_gige_shutdown(struct platform_device *pdev)
 {
-	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
-
-	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
-	mlxbf_gige_clean_port(priv);
+	mlxbf_gige_remove(pdev);
 }
 
 static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {