[net-next,08/15] net/mlx5e: Create single netdev per SD group

Message ID 20231221005721.186607-9-saeed@kernel.org (mailing list archive)
State Accepted
Commit e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad
Delegated to: Netdev Maintainers
Series [net-next,01/15] net/mlx5e: Use the correct lag ports number when creating TISes

Checks

Context                         Check    Description
netdev/series_format            success  Pull request is its own cover letter
netdev/tree_selection           success  Clearly marked for net-next
netdev/ynl                      success  Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 1115 this patch: 1115
netdev/cc_maintainers           success  CCed 4 of 4 maintainers
netdev/build_clang              fail     Errors and warnings before: 12 this patch: 12
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 1142 this patch: 1142
netdev/checkpatch               success  total: 0 errors, 0 warnings, 0 checks, 118 lines checked
netdev/build_clang_rust         success  No Rust files in patch. Skipping build
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0

Commit Message

Saeed Mahameed Dec. 21, 2023, 12:57 a.m. UTC
From: Tariq Toukan <tariqt@nvidia.com>

Integrate the SD library calls into the auxiliary_driver ops in
preparation for creating a single netdev for the multiple devices
belonging to the same SD group.

SD is still disabled at this stage. It is enabled by a downstream patch
when all needed parts are implemented.

The netdev is created only when the SD group, with all its participants,
is ready. It is later destroyed if any of the participating devices
drops.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 69 +++++++++++++++++--
 1 file changed, 62 insertions(+), 7 deletions(-)
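
The lifecycle described in the commit message can be illustrated with a
small userspace model. This is only a sketch, not the driver code: every
model_* name, the fixed group size of two, and the choice of the first
joined device as the one that owns the netdev are assumptions made for
illustration. In particular, the body of mlx5_sd_get_adev() is not part
of this patch; the model assumes it returns NULL until the whole group is
ready and otherwise the adev that should carry the netdev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_GROUP_SIZE 2 /* hypothetical: two devices per SD group */

struct model_adev {
	int idx;
	bool netdev_up; /* stands in for "a netdev exists on this adev" */
};

struct model_sd_group {
	struct model_adev *members[MODEL_GROUP_SIZE];
	int joined;
};

/* Stand-in for mlx5_sd_init(): record that a device joined its group. */
static int model_sd_init(struct model_sd_group *g, struct model_adev *adev)
{
	assert(g && adev);
	if (g->joined >= MODEL_GROUP_SIZE)
		return -1;
	g->members[g->joined++] = adev;
	return 0;
}

/* Stand-in for mlx5_sd_cleanup(): drop a device from its group. */
static void model_sd_cleanup(struct model_sd_group *g, struct model_adev *adev)
{
	for (int i = 0; i < g->joined; i++) {
		if (g->members[i] != adev)
			continue;
		g->members[i] = g->members[g->joined - 1];
		g->members[--g->joined] = NULL;
		return;
	}
}

/*
 * Stand-in for mlx5_sd_get_adev() (assumed semantics): NULL while the
 * group is incomplete, otherwise the adev that owns the single netdev.
 */
static struct model_adev *model_sd_get_adev(struct model_sd_group *g)
{
	if (g->joined < MODEL_GROUP_SIZE)
		return NULL;
	return g->members[0];
}

/* Mirrors the shape of the new mlx5e_probe() wrapper. */
int model_probe(struct model_sd_group *g, struct model_adev *adev)
{
	struct model_adev *actual;
	int err;

	err = model_sd_init(g, adev);
	if (err)
		return err;

	actual = model_sd_get_adev(g);
	if (actual)
		actual->netdev_up = true; /* stands in for _mlx5e_probe() */
	return 0;
}

/* Mirrors the shape of the new mlx5e_remove() wrapper. */
void model_remove(struct model_sd_group *g, struct model_adev *adev)
{
	struct model_adev *actual = model_sd_get_adev(g);

	if (actual)
		actual->netdev_up = false; /* stands in for _mlx5e_remove() */

	model_sd_cleanup(g, adev);
}
```

In this model, probing the first device alone creates nothing, probing
the second brings the netdev up on the owning adev, and removing either
device tears it down again, which matches the behaviour the commit
message describes.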

Comments

Aishwarya TCV Jan. 8, 2024, 1:36 p.m. UTC | #1
On 21/12/2023 00:57, Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt@nvidia.com>
> 
> Integrate the SD library calls into the auxiliary_driver ops in
> preparation for creating a single netdev for the multiple devices
> belonging to the same SD group.
> 
> SD is still disabled at this stage. It is enabled by a downstream patch
> when all needed parts are implemented.
> 
> The netdev is created only when the SD group, with all its participants,
> are ready. It is later destroyed if any of the participating devices
> drops.
> 
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> Reviewed-by: Gal Pressman <gal@nvidia.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> ---

Hi Tariq,


Currently, when booting next-master (next-20240108) on Arm64 on a
Marvell Thunder X2 (TX2), the kernel fails to probe the network card,
which results in boot failures for our CI (with rootfs over NFS). I can
send the full logs if required. Most other boards seem fine.

A bisect (full log below) identified this patch as introducing the
failure. Bisected it on the tag "mlx5-updates-2023-12-20" at repo
"https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/".

This works fine on Linux 6.7-rc5.


Sample back trace from failure:
------
<3>[   67.915121] mlx5_core 0000:0b:00.1: mlx5_cmd_out_err:808:(pid 1585): ACCESS_REG(0x805) op_mod(0x1) failed, status bad parameter(0x3), syndrome (0x6c4d48), err(-22)
<3>[   67.915121] mlx5_core 0000:0b:00.1: mlx5_cmd_out_err:808:(pid 1585): ACCESS_REG(0x805) op_mod(0x1) failed, status bad parameter(0x3), syndrome (0x6c4d48), err(-22)
<4>[   67.945022] mlx5_core.eth: probe of mlx5_core.eth.1 failed with error -22
<4>[   67.945022] mlx5_core.eth: probe of mlx5_core.eth.1 failed with error -22
------


Here is the lspci output for the card:
------
0b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
    Subsystem: Hewlett Packard Enterprise MT27710 Family [ConnectX-4 Lx]
    Flags: bus master, fast devsel, latency 0, IRQ 30, NUMA node 0, IOMMU group 0
    Memory at 10000000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at 43000000 [disabled] [size=1M]
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [48] Vital Product Data
    Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
    Capabilities: [1c0] Secondary PCI Express
    Capabilities: [230] Access Control Services
------


Bisect log:
------
git bisect start
# good: [a39b6ac3781d46ba18193c9dbb2110f31e9bffe9] Linux 6.7-rc5
git bisect good a39b6ac3781d46ba18193c9dbb2110f31e9bffe9
# bad: [22c4640698a1d47606b5a4264a584e8046641784] net/mlx5: Implement management PF Ethernet profile
git bisect bad 22c4640698a1d47606b5a4264a584e8046641784
# good: [f12f551b5b966ec58bfba9daa15f3cb99a92c1f9] bnxt_en: Prevent TX timeout with a very small TX ring
git bisect good f12f551b5b966ec58bfba9daa15f3cb99a92c1f9
# good: [509afc7452707e62fb7c4bb257f111617332ffad] Merge branch 'tools-net-ynl-add-sub-message-support-to-ynl'
git bisect good 509afc7452707e62fb7c4bb257f111617332ffad
# good: [0ee28c9ae042e77100fae2cd82a54750668aafce] Merge tag 'wireless-next-2023-12-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
git bisect good 0ee28c9ae042e77100fae2cd82a54750668aafce
# good: [29c302a2e265a356434b005155990a9e766db75d] libbpf: further decouple feature checking logic from bpf_object
git bisect good 29c302a2e265a356434b005155990a9e766db75d
# good: [852486b35f344887786d63250946dd921a05d7e8] x86/cfi,bpf: Fix bpf_exception_cb() signature
git bisect good 852486b35f344887786d63250946dd921a05d7e8
# good: [e37a11fca41864c9f652ff81296b82e6f65a4242] bridge: add MDB state mask uAPI attribute
git bisect good e37a11fca41864c9f652ff81296b82e6f65a4242
# good: [bee9705c679d0df8ee099e3c5312ac76f447848a] Merge branch 'net-sched-tc-drop-reason'
git bisect good bee9705c679d0df8ee099e3c5312ac76f447848a
# good: [c82d360325112ccc512fc11a3b68cdcdf04a1478] net/mlx5: SD, Add informative prints in kernel log
git bisect good c82d360325112ccc512fc11a3b68cdcdf04a1478
# bad: [c73a3ab8fa6e93a783bd563938d7cf00d62d5d34] net/mlx5e: Support cross-vhca RSS
git bisect bad c73a3ab8fa6e93a783bd563938d7cf00d62d5d34
# bad: [c4fb94aa822d6c9d05fc3c5aee35c7e339061dc1] net/mlx5e: Create EN core HW resources for all secondary devices
git bisect bad c4fb94aa822d6c9d05fc3c5aee35c7e339061dc1
# bad: [e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad] net/mlx5e: Create single netdev per SD group
git bisect bad e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad
# first bad commit: [e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad] net/mlx5e: Create single netdev per SD group
------

Thanks,
Aishwarya

Gal Pressman Jan. 8, 2024, 1:50 p.m. UTC | #2
On 08/01/2024 15:36, Aishwarya TCV wrote:
> 
> 
> On 21/12/2023 00:57, Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@nvidia.com>
>>
>> Integrate the SD library calls into the auxiliary_driver ops in
>> preparation for creating a single netdev for the multiple devices
>> belonging to the same SD group.
>>
>> SD is still disabled at this stage. It is enabled by a downstream patch
>> when all needed parts are implemented.
>>
>> The netdev is created only when the SD group, with all its participants,
>> are ready. It is later destroyed if any of the participating devices
>> drops.
>>
>> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>> Reviewed-by: Gal Pressman <gal@nvidia.com>
>> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
>> ---
> 
> Hi Tariq,
> 
> 
> Currently when booting the kernel against next-master(next-20240108)
> with Arm64 on Marvell Thunder X2 (TX2), the kernel is failing to probe
> the network card which is resulting in boot failures for our CI (with
> rootfs over NFS). I can send the full logs if required. Most other
> boards seem fine.
> 
> A bisect (full log below) identified this patch as introducing the
> failure. Bisected it on the tag "mlx5-updates-2023-12-20" at repo
> "https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/".
> 
> This works fine on Linux 6.7-rc5

Thanks Aishwarya!

We just stumbled upon this internally as well, I assume you are using a
(very) old firmware version?
If it's the same issue we should have a fix coming soon.

Mark Brown Jan. 8, 2024, 3:54 p.m. UTC | #3
On Mon, Jan 08, 2024 at 03:50:09PM +0200, Gal Pressman wrote:

> We just stumbled upon this internally as well, I assume you are using a
> (very) old firmware version?
> If it's the same issue we should have a fix coming soon.

The firmware version announced on boot is 14.21.1000 - the rootfs the
tests are using is based on Debian Bullseye, the firmware will be
coming from either there or the UEFI image on the system.

Gal Pressman Jan. 8, 2024, 4 p.m. UTC | #4
On 08/01/2024 17:54, Mark Brown wrote:
> On Mon, Jan 08, 2024 at 03:50:09PM +0200, Gal Pressman wrote:
> 
>> We just stumbled upon this internally as well, I assume you are using a
>> (very) old firmware version?
>> If it's the same issue we should have a fix coming soon.
> 
> The firmware version announced on boot is 14.21.1000 - the rootfs the
> tests are using is based on Debian Bullseye, the firmware will be
> coming from either there or the UEFI image on the system.

Makes sense, you are using a fw version from 2017 :(.
Anyway, we should have a fix soon.

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c8e8f512803e..2c47c9076aa6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -70,6 +70,7 @@ 
 #include "qos.h"
 #include "en/trap.h"
 #include "lib/devcom.h"
+#include "lib/sd.h"
 
 bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev, u8 page_shift,
 					    enum mlx5e_mpwrq_umr_mode umr_mode)
@@ -5980,7 +5981,7 @@  void mlx5e_destroy_netdev(struct mlx5e_priv *priv)
 	free_netdev(netdev);
 }
 
-static int mlx5e_resume(struct auxiliary_device *adev)
+static int _mlx5e_resume(struct auxiliary_device *adev)
 {
 	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
@@ -6005,6 +6006,23 @@  static int mlx5e_resume(struct auxiliary_device *adev)
 	return 0;
 }
 
+static int mlx5e_resume(struct auxiliary_device *adev)
+{
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err;
+
+	err = mlx5_sd_init(mdev);
+	if (err)
+		return err;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		return _mlx5e_resume(actual_adev);
+	return 0;
+}
+
 static int _mlx5e_suspend(struct auxiliary_device *adev)
 {
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
@@ -6025,7 +6043,17 @@  static int _mlx5e_suspend(struct auxiliary_device *adev)
 
 static int mlx5e_suspend(struct auxiliary_device *adev, pm_message_t state)
 {
-	return _mlx5e_suspend(adev);
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err = 0;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		err = _mlx5e_suspend(actual_adev);
+
+	mlx5_sd_cleanup(mdev);
+	return err;
 }
 
 static int _mlx5e_probe(struct auxiliary_device *adev)
@@ -6071,9 +6099,9 @@  static int _mlx5e_probe(struct auxiliary_device *adev)
 		goto err_destroy_netdev;
 	}
 
-	err = mlx5e_resume(adev);
+	err = _mlx5e_resume(adev);
 	if (err) {
-		mlx5_core_err(mdev, "mlx5e_resume failed, %d\n", err);
+		mlx5_core_err(mdev, "_mlx5e_resume failed, %d\n", err);
 		goto err_profile_cleanup;
 	}
 
@@ -6104,15 +6132,29 @@  static int _mlx5e_probe(struct auxiliary_device *adev)
 static int mlx5e_probe(struct auxiliary_device *adev,
 		       const struct auxiliary_device_id *id)
 {
-	return _mlx5e_probe(adev);
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+	int err;
+
+	err = mlx5_sd_init(mdev);
+	if (err)
+		return err;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		return _mlx5e_probe(actual_adev);
+	return 0;
 }
 
-static void mlx5e_remove(struct auxiliary_device *adev)
+static void _mlx5e_remove(struct auxiliary_device *adev)
 {
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
 	struct mlx5e_dev *mlx5e_dev = auxiliary_get_drvdata(adev);
 	struct mlx5e_priv *priv = mlx5e_dev->priv;
+	struct mlx5_core_dev *mdev = edev->mdev;
 
-	mlx5_core_uplink_netdev_set(priv->mdev, NULL);
+	mlx5_core_uplink_netdev_set(mdev, NULL);
 	mlx5e_dcbnl_delete_app(priv);
 	unregister_netdev(priv->netdev);
 	_mlx5e_suspend(adev);
@@ -6122,6 +6164,19 @@  static void mlx5e_remove(struct auxiliary_device *adev)
 	mlx5e_destroy_devlink(mlx5e_dev);
 }
 
+static void mlx5e_remove(struct auxiliary_device *adev)
+{
+	struct mlx5_adev *edev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = edev->mdev;
+	struct auxiliary_device *actual_adev;
+
+	actual_adev = mlx5_sd_get_adev(mdev, adev, edev->idx);
+	if (actual_adev)
+		_mlx5e_remove(actual_adev);
+
+	mlx5_sd_cleanup(mdev);
+}
+
 static const struct auxiliary_device_id mlx5e_id_table[] = {
 	{ .name = MLX5_ADEV_NAME ".eth", },
 	{},