Message ID | 20230602182443.25514-1-asmaa@nvidia.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [net-next,v1,1/1] mlxbf_gige: Fix kernel panic at shutdown |
On Fri, 2 Jun 2023 14:24:43 -0400 Asmaa Mnebhi wrote:
> There is a race condition happening during shutdown due to pending napi transactions.
> Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as a
> result causes a kernel panic.
> To fix this during shutdown, invoke mlxbf_gige_remove to disable and dequeue napi.
>
> Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>

Judging by the Fixes tag the problem can happen on 6.4-rc5 already, right?
So the tree in the [PATCH ] tag should have been net rather than net-next?

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#git-trees-and-patch-flow

No need to repost confirmation is enough.
On Fri, 2023-06-02 at 14:24 -0400, Asmaa Mnebhi wrote:
> There is a race condition happening during shutdown due to pending napi transactions.
> Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as a
> result causes a kernel panic.
> To fix this during shutdown, invoke mlxbf_gige_remove to disable and dequeue napi.
>
> Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
> Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
> ---
>  drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> index 694de9513b9f..7017f14595db 100644
> --- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> +++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
> @@ -485,10 +485,7 @@ static int mlxbf_gige_remove(struct platform_device *pdev)
>
>  static void mlxbf_gige_shutdown(struct platform_device *pdev)
>  {
> -	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
> -
> -	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
> -	mlxbf_gige_clean_port(priv);
> +	mlxbf_gige_remove(pdev);
>  }
>
>  static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {

if the device goes through both shutdown() and remove(), the netdevice
will go through unregister_netdevice() 2 times, which is wrong. Am I
missing something relevant?

Thanks!

Paolo
Hi Jakub,

Yes indeed. Thank you!

Best,
Asmaa

> Judging by the Fixes tag the problem can happen on 6.4-rc5 already, right? So
> the tree in the [PATCH ] tag should have been net rather than net-next?
>
> https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#git-trees-and-patch-flow
>
> No need to repost confirmation is enough.
On Tue, 06 Jun 2023 12:47:09 +0200 Paolo Abeni wrote:
> >  static void mlxbf_gige_shutdown(struct platform_device *pdev)
> >  {
> > -	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
> > -
> > -	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
> > -	mlxbf_gige_clean_port(priv);
> > +	mlxbf_gige_remove(pdev);
> >  }
> >
> >  static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {
>
> if the device goes through both shutdown() and remove(), the netdevice
> will go through unregister_netdevice() 2 times, which is wrong. Am I
> missing something relevant?

Good point, mlxbf_gige_remove() needs to check that the priv pointer
is not NULL.
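The NULL check suggested above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not driver code: `fake_pdev`, `fake_priv`, `fake_probe`, `fake_remove` and `fake_shutdown` are hypothetical stand-ins, and the model assumes the teardown path also clears the stored private pointer, since without that a NULL check alone would not catch the second call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; none of these are kernel APIs. */
struct fake_priv {
	int registered;          /* models a registered netdevice */
};

struct fake_pdev {
	void *drvdata;           /* models the per-device private pointer */
};

static int unregister_calls;     /* counts how often teardown really ran */

static void fake_probe(struct fake_pdev *pdev)
{
	struct fake_priv *priv = calloc(1, sizeof(*priv));

	assert(priv != NULL);
	priv->registered = 1;
	pdev->drvdata = priv;
}

/* Teardown that tolerates being called twice. */
static int fake_remove(struct fake_pdev *pdev)
{
	struct fake_priv *priv = pdev->drvdata;

	if (!priv)               /* already torn down by an earlier call */
		return 0;

	priv->registered = 0;    /* models unregistering the netdevice */
	unregister_calls++;
	free(priv);
	pdev->drvdata = NULL;    /* assumption: the pointer is cleared here */
	return 0;
}

/* Shutdown reuses the remove path, as in the proposed patch. */
static void fake_shutdown(struct fake_pdev *pdev)
{
	fake_remove(pdev);
}
```

In this model, calling `fake_shutdown()` followed by `fake_remove()` on the same device runs the unregister step once; the second call returns early because the stored pointer is NULL.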
> > >  static void mlxbf_gige_shutdown(struct platform_device *pdev)
> > >  {
> > > -	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
> > > -
> > > -	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
> > > -	mlxbf_gige_clean_port(priv);
> > > +	mlxbf_gige_remove(pdev);
> > >  }
> > >
> > >  static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {
> >
> > if the device goes through both shutdown() and remove(), the netdevice
> > will go through unregister_netdevice() 2 times, which is wrong. Am I
> > missing something relevant?
>
> Good point, mlxbf_gige_remove() needs to check that the priv pointer is not
> NULL.

Thank you all for your feedback. I will fix it shortly along with net-next -> net.
diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
index 694de9513b9f..7017f14595db 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
@@ -485,10 +485,7 @@ static int mlxbf_gige_remove(struct platform_device *pdev)

 static void mlxbf_gige_shutdown(struct platform_device *pdev)
 {
-	struct mlxbf_gige *priv = platform_get_drvdata(pdev);
-
-	writeq(0, priv->base + MLXBF_GIGE_INT_EN);
-	mlxbf_gige_clean_port(priv);
+	mlxbf_gige_remove(pdev);
 }

 static const struct acpi_device_id __maybe_unused mlxbf_gige_acpi_match[] = {
There is a race condition happening during shutdown due to pending napi transactions.
Since mlxbf_gige_poll is still running, it tries to access a NULL pointer and as a
result causes a kernel panic.
To fix this during shutdown, invoke mlxbf_gige_remove to disable and dequeue napi.

Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver")
Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)
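The ordering problem described in the commit message can be modelled in userspace as a rough, single-threaded sketch. Everything here is hypothetical: `fake_dev`, `rx_ring`, `poll_pending` and the `fake_*` functions are invented for illustration, and the commit message does not say which pointer the poll handler finds NULL, so the freed ring is an assumption. The model only shows why a queued poll has to be cancelled before the resources it touches go away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; not the driver's real structures. */
struct fake_dev {
	int *rx_ring;      /* assumed resource the poll handler dereferences */
	int poll_pending;  /* models a poll that is still queued */
};

enum poll_result { POLL_IDLE, POLL_OK, POLL_WOULD_CRASH };

static void fake_open(struct fake_dev *dev)
{
	dev->rx_ring = calloc(4, sizeof(*dev->rx_ring));
	assert(dev->rx_ring != NULL);
	dev->poll_pending = 1;   /* traffic arrived, a poll is queued */
}

/* Runs the queued poll, if any, and reports what happened. */
static enum poll_result fake_run_pending_poll(struct fake_dev *dev)
{
	if (!dev->poll_pending)
		return POLL_IDLE;

	dev->poll_pending = 0;
	if (!dev->rx_ring)
		return POLL_WOULD_CRASH;  /* a real NULL dereference would panic */

	dev->rx_ring[0]++;
	return POLL_OK;
}

/* Frees resources but leaves the queued poll in place. */
static void fake_shutdown_racy(struct fake_dev *dev)
{
	free(dev->rx_ring);
	dev->rx_ring = NULL;
}

/* Cancels the queued poll first, then frees resources. */
static void fake_shutdown_ordered(struct fake_dev *dev)
{
	dev->poll_pending = 0;   /* models disabling and dequeuing the poll */
	free(dev->rx_ring);
	dev->rx_ring = NULL;
}
```

With `fake_shutdown_racy()` the leftover poll reports `POLL_WOULD_CRASH`; with `fake_shutdown_ordered()` nothing is left to run and the result is `POLL_IDLE`.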