Message ID | 20231221005721.186607-9-saeed@kernel.org (mailing list archive) |
---|---|
State | Accepted |
Commit | e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad |
Delegated to: | Netdev Maintainers |
Series | [net-next,01/15] net/mlx5e: Use the correct lag ports number when creating TISes |
On 21/12/2023 00:57, Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt@nvidia.com>
>
> Integrate the SD library calls into the auxiliary_driver ops in
> preparation for creating a single netdev for the multiple devices
> belonging to the same SD group.
>
> SD is still disabled at this stage. It is enabled by a downstream patch
> when all needed parts are implemented.
>
> The netdev is created only when the SD group, with all its participants,
> are ready. It is later destroyed if any of the participating devices
> drops.
>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> Reviewed-by: Gal Pressman <gal@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---

Hi Tariq,

Currently when booting the kernel against next-master(next-20240108)
with Arm64 on Marvell Thunder X2 (TX2), the kernel is failing to probe
the network card which is resulting in boot failures for our CI (with
rootfs over NFS). I can send the full logs if required. Most other
boards seem fine.

A bisect (full log below) identified this patch as introducing the
failure. Bisected it on the tag "mlx5-updates-2023-12-20" at repo
"https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/".

This works fine on Linux 6.7-rc5

Sample back trace from failure:
------
<3>[ 67.915121] mlx5_core 0000:0b:00.1: mlx5_cmd_out_err:808:(pid 1585): ACCESS_REG(0x805) op_mod(0x1) failed, status bad parameter(0x3), syndrome (0x6c4d48), err(-22)
<3>[ 67.915121] mlx5_core 0000:0b:00.1: mlx5_cmd_out_err:808:(pid 1585): ACCESS_REG(0x805) op_mod(0x1) failed, status bad parameter(0x3), syndrome (0x6c4d48), err(-22)
<4>[ 67.945022] mlx5_core.eth: probe of mlx5_core.eth.1 failed with error -22
<4>[ 67.945022] mlx5_core.eth: probe of mlx5_core.eth.1 failed with error -22
------

Here is the lspci o/p for the card:
------
0b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
	Subsystem: Hewlett Packard Enterprise MT27710 Family [ConnectX-4 Lx]
	Flags: bus master, fast devsel, latency 0, IRQ 30, NUMA node 0, IOMMU group 0
	Memory at 10000000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at 43000000 [disabled] [size=1M]
	Capabilities: [60] Express Endpoint, MSI 00
	Capabilities: [48] Vital Product Data
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [1c0] Secondary PCI Express
	Capabilities: [230] Access Control Services
------

Bisect log:
------
git bisect start
# good: [a39b6ac3781d46ba18193c9dbb2110f31e9bffe9] Linux 6.7-rc5
git bisect good a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
# bad: [22c4640698a1d47606b5a4264a584e8046641784] net/mlx5: Implement management PF Ethernet profile
git bisect bad 22c4640698a1d47606b5a4264a584e8046641784
# good: [f12f551b5b966ec58bfba9daa15f3cb99a92c1f9] bnxt_en: Prevent TX timeout with a very small TX ring
git bisect good f12f551b5b966ec58bfba9daa15f3cb99a92c1f9
# good: [509afc7452707e62fb7c4bb257f111617332ffad] Merge branch 'tools-net-ynl-add-sub-message-support-to-ynl'
git bisect good 509afc7452707e62fb7c4bb257f111617332ffad
# good: [0ee28c9ae042e77100fae2cd82a54750668aafce] Merge tag 'wireless-next-2023-12-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
git bisect good 0ee28c9ae042e77100fae2cd82a54750668aafce
# good: [29c302a2e265a356434b005155990a9e766db75d] libbpf: further decouple feature checking logic from bpf_object
git bisect good 29c302a2e265a356434b005155990a9e766db75d
# good: [852486b35f344887786d63250946dd921a05d7e8] x86/cfi,bpf: Fix bpf_exception_cb() signature
git bisect good 852486b35f344887786d63250946dd921a05d7e8
# good: [e37a11fca41864c9f652ff81296b82e6f65a4242] bridge: add MDB state mask uAPI attribute
git bisect good e37a11fca41864c9f652ff81296b82e6f65a4242
# good: [bee9705c679d0df8ee099e3c5312ac76f447848a] Merge branch 'net-sched-tc-drop-reason'
git bisect good bee9705c679d0df8ee099e3c5312ac76f447848a
# good: [c82d360325112ccc512fc11a3b68cdcdf04a1478] net/mlx5: SD, Add informative prints in kernel log
git bisect good c82d360325112ccc512fc11a3b68cdcdf04a1478
# bad: [c73a3ab8fa6e93a783bd563938d7cf00d62d5d34] net/mlx5e: Support cross-vhca RSS
git bisect bad c73a3ab8fa6e93a783bd563938d7cf00d62d5d34
# bad: [c4fb94aa822d6c9d05fc3c5aee35c7e339061dc1] net/mlx5e: Create EN core HW resources for all secondary devices
git bisect bad c4fb94aa822d6c9d05fc3c5aee35c7e339061dc1
# bad: [e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad] net/mlx5e: Create single netdev per SD group
git bisect bad e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad
# first bad commit: [e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad] net/mlx5e: Create single netdev per SD group
------

Thanks,
Aishwarya
On 08/01/2024 15:36, Aishwarya TCV wrote:
>
>
> On 21/12/2023 00:57, Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@nvidia.com>
>>
>> Integrate the SD library calls into the auxiliary_driver ops in
>> preparation for creating a single netdev for the multiple devices
>> belonging to the same SD group.
>>
>> SD is still disabled at this stage. It is enabled by a downstream patch
>> when all needed parts are implemented.
>>
>> The netdev is created only when the SD group, with all its participants,
>> are ready. It is later destroyed if any of the participating devices
>> drops.
>>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> Reviewed-by: Gal Pressman <gal@nvidia.com>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> ---
>
> Hi Tariq,
>
>
> Currently when booting the kernel against next-master(next-20240108)
> with Arm64 on Marvell Thunder X2 (TX2), the kernel is failing to probe
> the network card which is resulting in boot failures for our CI (with
> rootfs over NFS). I can send the full logs if required. Most other
> boards seem fine.
>
> A bisect (full log below) identified this patch as introducing the
> failure. Bisected it on the tag "mlx5-updates-2023-12-20" at repo
> "https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/".
>
> This works fine on Linux 6.7-rc5

Thanks Aishwarya!
We just stumbled upon this internally as well, I assume you are using a
(very) old firmware version?
If it's the same issue we should have a fix coming soon.
On Mon, Jan 08, 2024 at 03:50:09PM +0200, Gal Pressman wrote:

> We just stumbled upon this internally as well, I assume you are using a
> (very) old firmware version?
> If it's the same issue we should have a fix coming soon.

The firmware version announced on boot is 14.21.1000 - the rootfs the
tests are using is based on Debian Bullseye, the firmware will be
coming from either there or the UEFI image on the system.
On 08/01/2024 17:54, Mark Brown wrote:
> On Mon, Jan 08, 2024 at 03:50:09PM +0200, Gal Pressman wrote:
>
>> We just stumbled upon this internally as well, I assume you are using a
>> (very) old firmware version?
>> If it's the same issue we should have a fix coming soon.
>
> The firmware version announced on boot is 14.21.1000 - the rootfs the
> tests are using is based on Debian Bullseye, the firmware will be
> coming from either there or the UEFI image on the system.

Makes sense, you are using a fw version from 2017 :(.
Anyway, we should have a fix soon.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c8e8f512803e..2c47c9076aa6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -70,6 +70,7 @@
 #include "qos.h"
 #include "en/trap.h"
 #include "lib/devcom.h"
+#include "lib/sd.h"
 
 bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev, u8 page_shift,
 					    enum mlx5e_mpwrq_umr_mode umr_mode)
@@ -5980,7 +5981,7 @@ void mlx5e_destroy_netdev(struct mlx5e_priv *priv)
 	free_netdev(netdev);
 }
 
-static int mlx5e_resume(struct auxiliary_device *adev)
+static int _mlx5e_resume(struct auxiliary_device *adev)
 {
 	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
@@ -6005,6 +6006,23 @@ static int mlx5e_resume(struct auxiliary_device *adev)
 	return 0;
 }
 
+static int mlx5e_resume(struct auxiliary_device *adev)
+{
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err;
+
+	err = mlx5_sd_init(mdev);
+	if (err)
+		return err;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		return _mlx5e_resume(actual_adev);
+	return 0;
+}
+
 static int _mlx5e_suspend(struct auxiliary_device *adev)
 {
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
@@ -6025,7 +6043,17 @@ static int _mlx5e_suspend(struct auxiliary_device *adev)
 
 static int mlx5e_suspend(struct auxiliary_device *adev, pm_message_t state)
 {
-	return _mlx5e_suspend(adev);
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err = 0;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		err = _mlx5e_suspend(actual_adev);
+
+	mlx5_sd_cleanup(mdev);
+	return err;
 }
 
 static int _mlx5e_probe(struct auxiliary_device *adev)
@@ -6071,9 +6099,9 @@ static int _mlx5e_probe(struct auxiliary_device *adev)
 		goto err_destroy_netdev;
 	}
 
-	err = mlx5e_resume(adev);
+	err = _mlx5e_resume(adev);
 	if (err) {
-		mlx5_core_err(mdev, "mlx5e_resume failed, %d\n", err);
+		mlx5_core_err(mdev, "_mlx5e_resume failed, %d\n", err);
 		goto err_profile_cleanup;
 	}
 
@@ -6104,15 +6132,29 @@ static int _mlx5e_probe(struct auxiliary_device *adev)
 static int mlx5e_probe(struct auxiliary_device *adev,
 		       const struct auxiliary_device_id *id)
 {
-	return _mlx5e_probe(adev);
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err;
+
+	err = mlx5_sd_init(mdev);
+	if (err)
+		return err;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		return _mlx5e_probe(actual_adev);
+	return 0;
 }
 
-static void mlx5e_remove(struct auxiliary_device *adev)
+static void _mlx5e_remove(struct auxiliary_device *adev)
 {
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
 	struct mlx5e_priv *priv = mlx5e_dev->priv;
+	struct mlx5_core_dev *mdev = edev->mdev;
 
-	mlx5_core_uplink_netdev_set(priv->mdev, NULL);
+	mlx5_core_uplink_netdev_set(mdev, NULL);
 	mlx5e_dcbnl_delete_app(priv);
 	unregister_netdev(priv->netdev);
 	_mlx5e_suspend(adev);
@@ -6122,6 +6164,19 @@ static void mlx5e_remove(struct auxiliary_device *adev)
 	mlx5e_destroy_devlink(mlx5e_dev);
 }
 
+static void mlx5e_remove(struct auxiliary_device *adev)
+{
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		_mlx5e_remove(actual_adev);
+
+	mlx5_sd_cleanup(mdev);
+}
+
 static const struct auxiliary_device_id mlx5e_id_table[] = {
 	{ .name = MLX5_ADEV_NAME ".eth", },
 	{},
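
For readers following the thread, here is a rough userspace sketch of the
wrapper pattern the diff above adds to probe and remove. It is not the
driver code: every mock_* name is hypothetical, and the behaviour given to
mock_sd_get_actual() (return the device itself when there is no group, NULL
while the group is incomplete, the group's primary once it is complete) is
an assumption based on the commit message, not on the body of
mlx5_sd_get_adev(), which is not part of this patch. Only the call order
(init, look up the actual device, delegate, clean up) is taken from the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Userspace mock of the wrapper pattern in the diff above. Every name here
 * is hypothetical; only the call order mirrors the patch.
 */
struct mock_dev;

struct mock_group {
	int expected;             /* devices needed before the group is ready */
	int present;              /* devices that have joined so far */
	struct mock_dev *primary; /* device that owns the shared netdev */
};

struct mock_dev {
	struct mock_group *group; /* NULL when the device is standalone */
	int init_err;             /* error the mocked init step reports, or 0 */
	int netdevs;              /* netdevs currently alive on this device */
};

/* Stand-in for mlx5_sd_init(): fail, or join the group if there is one. */
int mock_sd_init(struct mock_dev *dev)
{
	if (dev->init_err)
		return dev->init_err;
	if (dev->group)
		dev->group->present++;
	return 0;
}

/* Stand-in for mlx5_sd_cleanup(): leave the group if there is one. */
void mock_sd_cleanup(struct mock_dev *dev)
{
	if (dev->group) {
		assert(dev->group->present > 0);
		dev->group->present--;
	}
}

/*
 * Stand-in for mlx5_sd_get_adev(). Assumed behaviour: a standalone device
 * maps to itself, an incomplete group maps to NULL, and a complete group
 * maps to its primary device.
 */
struct mock_dev *mock_sd_get_actual(struct mock_dev *dev)
{
	struct mock_group *g = dev->group;

	if (!g)
		return dev;
	if (g->present < g->expected)
		return NULL;
	return g->primary;
}

/* Same shape as the new mlx5e_probe() wrapper. */
int mock_probe(struct mock_dev *dev)
{
	struct mock_dev *actual;
	int err;

	err = mock_sd_init(dev);
	if (err)
		return err; /* an init failure fails the whole probe */

	actual = mock_sd_get_actual(dev);
	if (actual)
		actual->netdevs++; /* stands in for _mlx5e_probe() */
	return 0;
}

/* Same shape as the new mlx5e_remove() wrapper. */
void mock_remove(struct mock_dev *dev)
{
	struct mock_dev *actual;

	actual = mock_sd_get_actual(dev);
	if (actual)
		actual->netdevs--; /* stands in for _mlx5e_remove() */

	mock_sd_cleanup(dev);
}
```

In this mock, probing the first member of a two-device group creates
nothing, probing the second creates one netdev on the primary, and removing
either member tears it down again. A device whose init step reports -EINVAL
fails its probe with that value. If the failing ACCESS_REG in the report is
issued from the init step (an assumption; the thread does not say), that
would be consistent with the "failed with error -22" line in the log.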