diff mbox series

Kernel crash after FLR reset of a ConnectX-5 PF in switchdev mode

Message ID 90e1efad457f40c1f9f7b8cb56852072d8ea00fd.camel@linux.ibm.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series Kernel crash after FLR reset of a ConnectX-5 PF in switchdev mode | expand

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Niklas Schnelle April 11, 2023, 3:11 p.m. UTC
Hi Saeed, Hi Leon,

While testing PCI recovery with a ConnectX-5 card (MT28800, fw
16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
crash (stacktrace attached) when the card is in switchdev mode. No
crash occurs and the recovery succeeds in legacy mode (with VFs). I
found that the same crash occurs also with a simple Function Level
Reset instead of the s390 specific PCI recovery, see instructions
below. From the looks of it I think this could affect non-s390 too but
I don't have a proper x86 test system with a ConnectX card to test
with.

Anyway, I tried to analyze further but got stuck after figuring out
that in mlx5e_remove() deep down from mlx5_fw_fatal_reporter_err_work()
(see trace) the mlx5e_dev->priv pointer is valid but the pointed to
struct only contains zeros as it was previously zeroed by
mlx5_mdev_uninit() which then leads to a NULL pointer access.

The crash itself can be prevented by the following debug patch though
clearly this is not a proper fix:

        unregister_netdev(priv->netdev);

With that I then tried to track down why mlx5_mdev_uninit() is called
and this might actually be s390 specific in that this happens during
the removal of the VF which on s390 causes extra hot unplug events for
the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
the following call trace:

...
zpci_bus_remove_device()
   zpci_iov_remove_virtfn()
      pci_iov_remove_virtfn()
         pci_stop_and_remove_bus_device()
            pci_stop_bus_device()
               device_release_driver_internal()
                  pci_device_remove()
                     remove_one()
                        mlx5_mdev_uninit()

Then again I would expect that on other architectures VFs become at
leastunresponsive during a FLR of the PF not sure if that also lead to
calls to remove_one() though.

As another data point I tried the same on the default Ubuntu 22.04
generic 5.15 kernel and there no crash occurs so this might be a newer
issue.

Also, I did test with and without the patch I sent recently for
skipping the wait in mlx5_health_wait_pci_up() but that made no
difference.

Any hints on how to debug this further and could you try to see if this
occurs on other architectures as well?

Thanks,
Niklas

Reproduced with (0004:00:00.0 being the first PF of a ConnectX-5):

$ devlink dev eswitch set pci/0004:00:00.0 mode switchdev
$ devlink dev eswitch set pci/0004:00:00.1 mode switchdev

The next 2 lines are needed on s390 due to an unrelated issue with smsf mode.

$ devlink dev param set pci/0004:00:00.0 name flow_steering_mode value dmfs cmode runtime
$ devlink dev param set pci/0004:00:00.1 name flow_steering_mode value dmfs cmode runtime
$ echo 1 > /sys/bus/pci/devices/0004:00:00.0/sriov_numvfs
# Check the reset method is FLR though others might also cause this
$ cat /sys/bus/pci/devices/0004:00:00.0/reset_method
flr

Then to trigger the crash (after a bit of recovery, it may be racy though it hits pretty
consistently for me)

$ echo 1 > /sys/bus/pci/devices/0004:00:00.0/reset
[ 1346.987414] mlx5_core 0003:00:00.0 ens8704f0np0: Link up
[ 1347.006845] IPv6: ADDRCONF(NETDEV_CHANGE): ens8704f0np0: link becomes ready
[ 1356.518669] mlx5_core 0004:00:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
[ 1359.192049] mlx5_core 0004:00:00.0 ens8832f0np0: Dropping C-tag vlan stripping offload due to S-tag vlan
[ 1359.192152] mlx5_core 0004:00:00.0 ens8832f0np0: Disabling HW_VLAN CTAG FILTERING, not supported in switchdev mode
[ 1359.365370] mlx5_core 0004:00:00.0: E-Switch: Enable: mode(OFFLOADS), nvfs(0), active vports(1)
[ 1359.365952] devlink (1586) used greatest stack depth: 7552 bytes left
[ 1359.448638] mlx5_core 0004:00:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
[ 1361.256671] debugfs: Directory 'mlx5_dr_chunks' with parent 'slab' already present!
[ 1361.257295] debugfs: Directory 'mlx5_dr_htbls' with parent 'slab' already present!
[ 1362.063433] mlx5_core 0004:00:00.1 ens8896f1np1: Dropping C-tag vlan stripping offload due to S-tag vlan
[ 1362.063560] mlx5_core 0004:00:00.1 ens8896f1np1: Disabling HW_VLAN CTAG FILTERING, not supported in switchdev mode
[ 1362.240899] mlx5_core 0004:00:00.1: E-Switch: Enable: mode(OFFLOADS), nvfs(0), active vports(1)
[ 1362.241527] devlink (1595) used greatest stack depth: 7520 bytes left
[ 1362.327301] mlx5_core 0004:00:00.0 ens8832f0np0: Link up
[ 1362.345353] IPv6: ADDRCONF(NETDEV_CHANGE): ens8832f0np0: link becomes ready
[ 1362.346169] ip (1603) used greatest stack depth: 7368 bytes left
[ 1362.437044] pci 0004:00:00.2: [15b3:101a] type 00 class 0x020000
[ 1362.437277] pci 0004:00:00.2: reg 0x10: [mem 0xffffc00002000000-0xffffc000021fffff 64bit pref]
[ 1362.437423] pci 0004:00:00.2: enabling Extended Tags
[ 1362.438619] pci 0004:00:00.2: Adding to iommu group 30
[ 1362.440640] mlx5_core 0004:00:00.2: enabling device (0000 -> 0002)
[ 1362.440810] mlx5_core 0004:00:00.2: firmware version: 16.35.1012
[ 1362.675795] mlx5_core 0004:00:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 1362.676711] debugfs: Directory 'mlx5_fs_fgs' with parent 'slab' already present!
[ 1362.677278] debugfs: Directory 'mlx5_fs_ftes' with parent 'slab' already present!
[ 1362.696847] mlx5_core 0004:00:00.2: Assigned random MAC address 02:0e:8a:c7:93:9a
[ 1362.830432] mlx5_core 0004:00:00.2: MLX5E: StrdRq(1) RqSz(16) StrdSz(4096) RxCqeCmprss(0 basic)
[ 1362.835203] mlx5_core 0004:00:00.2 ens8832f0v0: renamed from eth1
[ 1372.768645] mlx5_core 0004:00:00.2: poll_health:818:(pid 0): Fatal error 1 detected
[ 1372.768791] mlx5_core 0004:00:00.2: print_health_info:423:(pid 0): PCI slot is unavailable
[ 1373.168544] mlx5_core 0004:00:00.0: poll_health:818:(pid 0): Fatal error 3 detected
[ 1375.434387] mlx5_core 0004:00:00.0: mlx5_health_try_recover:335:(pid 1505): handling bad device here
[ 1375.434502] mlx5_core 0004:00:00.0: mlx5_handle_bad_state:292:(pid 1505): starting teardown
[ 1375.434545] mlx5_core 0004:00:00.0: mlx5_error_sw_reset:243:(pid 1505): start
[ 1375.434586] mlx5_core 0004:00:00.0: mlx5_error_sw_reset:276:(pid 1505): end
[ 1375.771395] mlx5_core 0004:00:00.0 eth0 (unregistering): vport 1 error -67 reading stats
[ 1376.151345] mlx5_core 0004:00:00.0: mlx5e_init_nic_tx:5376:(pid 1505): create tises failed, -67
[ 1376.238808] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: new profile init failed, -67
[ 1376.243746] mlx5_core 0004:00:00.0: mlx5e_init_rep_tx:1101:(pid 1505): create tises failed, -67
[ 1376.328623] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: failed to rollback to orig profile, -67
[ 1376.329159] ------------[ cut here ]------------
[ 1376.329161] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 1376.329171] WARNING: CPU: 6 PID: 1505 at kernel/locking/mutex.c:582 __mutex_lock+0xbc6/0xfa8
[ 1376.329182] Modules linked in: kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink uvdevice s390_trng sunrpc mlx5_ib ib_uverbs ib_core ism eadm_sch vfio_ccw mdev v
fio_iommu_type1 vfio sch_fq_codel loop configfs ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme mlx5_core nvme_core pkey zcrypt rng_core autofs4
[ 1376.329230] CPU: 6 PID: 1505 Comm: kworker/u800:3 Not tainted 6.3.0-rc4 #49
[ 1376.329233] Hardware name: IBM 3931 A01 704 (LPAR)
[ 1376.329235] Workqueue: mlx5_health0004:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[ 1376.329345] Krnl PSW : 0704d00180000000 00000000ee827992 (__mutex_lock+0xbca/0xfa8)
[ 1376.329351]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[ 1376.329354] Krnl GPRS: c0000000fffeffff 0000000080000001 0000000000000028 00000000eec90198
[ 1376.329357]            0000038004af36b8 0000038004af36b0 000003ff7fbcdfd8 0000000000000080
[ 1376.329359]            00000000f03839a8 0000000000000000 0000000000000000 0000000000059fe8
[ 1376.329361]            000000009fa58100 0000000000000000 00000000ee82798e 0000038004af3848
[ 1376.329368] Krnl Code: 00000000ee827982: c020001ef4a1        larl    %r2,00000000eec062c4
[ 1376.329368]            00000000ee827988: c0e5ff90c318        brasl   %r14,00000000eda3ffb8
[ 1376.329368]           #00000000ee82798e: af000000            mc      0,0
[ 1376.329368]           >00000000ee827992: a7f4fa4d            brc     15,00000000ee826e2c
[ 1376.329368]            00000000ee827996: af000000            mc      0,0
[ 1376.329368]            00000000ee82799a: a7f4faf4            brc     15,00000000ee826f82
[ 1376.329368]            00000000ee82799e: e31003400004        lg      %r1,832
[ 1376.329368]            00000000ee8279a4: e32010000004        lg      %r2,0(%r1)
[ 1376.329405] Call Trace:
[ 1376.329406]  [<00000000ee827992>] __mutex_lock+0xbca/0xfa8 
[ 1376.329409] ([<00000000ee82798e>] __mutex_lock+0xbc6/0xfa8)
[ 1376.329412]  [<00000000ee827da2>] mutex_lock_nested+0x32/0x40 
[ 1376.329416]  [<000003ff7fbcdfd8>] mlx5_core_uplink_netdev_set+0x38/0x60 [mlx5_core] 
[ 1376.329469]  [<000003ff7fc1100c>] mlx5e_remove+0x3c/0xa0 [mlx5_core] 
[ 1376.329521]  [<00000000ee42fa7c>] device_release_driver_internal+0x1c4/0x270 
[ 1376.329526]  [<00000000ee42d788>] bus_remove_device+0x100/0x188 
[ 1376.329528]  [<00000000ee42724e>] device_del+0x186/0x3b8 
[ 1376.329533]  [<000003ff7fbf6134>] mlx5_detach_device+0xac/0x110 [mlx5_core] 
[ 1376.329583]  [<000003ff7fbcf812>] mlx5_unload_one_devl_locked+0x52/0xc8 [mlx5_core] 
[ 1376.329635]  [<000003ff7fc777e8>] mlx5_health_try_recover+0x168/0x240 [mlx5_core] 
[ 1376.329687]  [<00000000ee72b9d2>] devlink_health_reporter_recover+0x3a/0x98 
[ 1376.329690]  [<00000000ee72bb3c>] devlink_health_report+0x10c/0x300 
[ 1376.329693]  [<000003ff7fbdd0ac>] mlx5_fw_fatal_reporter_err_work+0xc4/0x1d8 [mlx5_core] 
[ 1376.329744]  [<00000000eda6b934>] process_one_work+0x30c/0x688 
[ 1376.329749]  [<00000000eda6bd12>] worker_thread+0x62/0x438 
[ 1376.329752]  [<00000000eda77400>] kthread+0x138/0x150 
[ 1376.329756]  [<00000000ed9eb894>] __ret_from_fork+0x3c/0x58 
[ 1376.329759]  [<00000000ee8309da>] ret_from_fork+0xa/0x40 
[ 1376.329762] INFO: lockdep is turned off.
[ 1376.329764] Last Breaking-Event-Address:
[ 1376.329765]  [<00000000eda400e4>] __warn_printk+0x12c/0x138
[ 1376.329770] irq event stamp: 987413
[ 1376.329772] hardirqs last  enabled at (987413): [<00000000ee82fab6>] _raw_spin_unlock_irqrestore+0x76/0xc0
[ 1376.329774] hardirqs last disabled at (987412): [<00000000ee82f6b6>] _raw_spin_lock_irqsave+0x9e/0xd8
[ 1376.329779] softirqs last  enabled at (986754): [<00000000ee4ec40e>] netif_set_real_num_tx_queues+0x116/0x270
[ 1376.329784] softirqs last disabled at (986752): [<00000000ee4ec3fa>] netif_set_real_num_tx_queues+0x102/0x270
[ 1376.329787] ---[ end trace 0000000000000000 ]---
[ 1376.329792] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1376.329794] Failing address: a7190001eb112000 TEID: a7190001eb112803
[ 1376.329796] Fault in home space mode while using kernel ASCE.
[ 1376.329801] AS:00000000ef868007 R3:0000000000000024 
[ 1376.329831] Oops: 0038 ilc:2 [#1] PREEMPT SMP 
[ 1376.329834] Modules linked in: kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink uvdevice s390_trng sunrpc mlx5_ib ib_uverbs ib_core ism eadm_sch vfio_ccw mdev v
fio_iommu_type1 vfio sch_fq_codel loop configfs ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme mlx5_core nvme_core pkey zcrypt rng_core autofs4
[ 1376.329876] CPU: 6 PID: 1505 Comm: kworker/u800:3 Tainted: G        W          6.3.0-rc4 #49
[ 1376.329878] Hardware name: IBM 3931 A01 704 (LPAR)
[ 1376.329880] Workqueue: mlx5_health0004:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[ 1376.329933] Krnl PSW : 0704d00180000000 00000000ee8277c4 (__mutex_lock+0x9fc/0xfa8)
[ 1376.329938]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[ 1376.329941] Krnl GPRS: c0000000fffeffff 000000009fa58100 a7190001eb112000 a7190001eb112000
[ 1376.329944]            0000000000000000 0000000000000000 00000000eee2cd60 a7190001eb112000
[ 1376.329946]            00000000f03839a8 00000000eee2ca48 000000000005a058 0000000000059fe8
[ 1376.329949]            000000009fa58100 000000009fa58100 00000000ee827396 0000038004af3848
[ 1376.329953] Krnl Code: 00000000ee8277b6: c0e5ffffe97d        brasl   %r14,00000000ee824ab0
[ 1376.329953]            00000000ee8277bc: a7f4fc36            brc     15,00000000ee827028
[ 1376.329953]           #00000000ee8277c0: a527fff8            nill    %r2,65528
[ 1376.329953]           >00000000ee8277c4: 58302034            l       %r3,52(%r2)
[ 1376.329953]            00000000ee8277c8: ec38fc66007e        cij     %r3,0,8,00000000ee827094
[ 1376.329953]            00000000ee8277ce: 58202010            l       %r2,16(%r2)
[ 1376.329953]            00000000ee8277d2: b9140022            lgfr    %r2,%r2
[ 1376.329953]            00000000ee8277d6: c0e5ff8ecdfd        brasl   %r14,00000000eda013d0
[ 1376.329969] Call Trace:
[ 1376.329971]  [<00000000ee8277c4>] __mutex_lock+0x9fc/0xfa8 
[ 1376.329974] ([<00000000ee827396>] __mutex_lock+0x5ce/0xfa8)
[ 1376.329977]  [<00000000ee827da2>] mutex_lock_nested+0x32/0x40 
[ 1376.329980]  [<000003ff7fbcdfd8>] mlx5_core_uplink_netdev_set+0x38/0x60 [mlx5_core] 
[ 1376.330032]  [<000003ff7fc1100c>] mlx5e_remove+0x3c/0xa0 [mlx5_core] 
[ 1376.330085]  [<00000000ee42fa7c>] device_release_driver_internal+0x1c4/0x270 
[ 1376.330088]  [<00000000ee42d788>] bus_remove_device+0x100/0x188 
[ 1376.330090]  [<00000000ee42724e>] device_del+0x186/0x3b8 
[ 1376.330093]  [<000003ff7fbf6134>] mlx5_detach_device+0xac/0x110 [mlx5_core] 
[ 1376.330143]  [<000003ff7fbcf812>] mlx5_unload_one_devl_locked+0x52/0xc8 [mlx5_core] 
[ 1376.330195]  [<000003ff7fc777e8>] mlx5_health_try_recover+0x168/0x240 [mlx5_core] 
[ 1376.330246]  [<00000000ee72b9d2>] devlink_health_reporter_recover+0x3a/0x98 
[ 1376.330249]  [<00000000ee72bb3c>] devlink_health_report+0x10c/0x300 
[ 1376.330251]  [<000003ff7fbdd0ac>] mlx5_fw_fatal_reporter_err_work+0xc4/0x1d8 [mlx5_core] 
[ 1376.330303]  [<00000000eda6b934>] process_one_work+0x30c/0x688 
[ 1376.330305]  [<00000000eda6bd12>] worker_thread+0x62/0x438 
[ 1376.330308]  [<00000000eda77400>] kthread+0x138/0x150 
[ 1376.330311]  [<00000000ed9eb894>] __ret_from_fork+0x3c/0x58 
[ 1376.330314]  [<00000000ee8309da>] ret_from_fork+0xa/0x40 
[ 1376.330316] INFO: lockdep is turned off.
[ 1376.330317] Last Breaking-Event-Address:
[ 1376.330319]  [<00000000ee826f9c>] __mutex_lock+0x1d4/0xfa8
[ 1376.330323] Kernel panic - not syncing: Fatal exception: panic_on_oops

Comments

Leon Romanovsky April 13, 2023, 11:02 a.m. UTC | #1
On Tue, Apr 11, 2023 at 05:11:11PM +0200, Niklas Schnelle wrote:
> Hi Saeed, Hi Leon,
> 
> While testing PCI recovery with a ConnectX-5 card (MT28800, fw
> 16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
> crash (stacktrace attached) when the card is in switchdev mode. No
> crash occurs and the recovery succeeds in legacy mode (with VFs). I
> found that the same crash occurs also with a simple Function Level
> Reset instead of the s390 specific PCI recovery, see instructions
> below. From the looks of it I think this could affect non-s390 too but
> I don't have a proper x86 test system with a ConnectX card to test
> with.
> 
> Anyway, I tried to analyze further but got stuck after figuring out
> that in mlx5e_remove() deep down from mlx5_fw_fatal_reporter_err_work()
> (see trace) the mlx5e_dev->priv pointer is valid but the pointed to
> struct only contains zeros as it was previously zeroed by
> mlx5_mdev_uninit() which then leads to a NULL pointer access.
> 
> The crash itself can be prevented by the following debug patch though
> clearly this is not a proper fix:
> 
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -6012,6 +6012,10 @@ static void mlx5e_remove(struct auxiliary_device
> *adev)
>         struct mlx5e_priv *priv = mlx5e_dev->priv;
>         pm_message_t state = {};
> 
> +       if (!priv->mdev) {
> +               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
> +               return;
> +       }
>         mlx5_core_uplink_netdev_set(priv->mdev, NULL);
>         mlx5e_dcbnl_delete_app(priv);
>         unregister_netdev(priv->netdev);
> 
> With that I then tried to track down why mlx5_mdev_uninit() is called
> and this might actually be s390 specific in that this happens during
> the removal of the VF which on s390 causes extra hot unplug events for
> the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
> the following call trace:
> 
> ...
> zpci_bus_remove_device()
>    zpci_iov_remove_virtfn()
>       pci_iov_remove_virtfn()
>          pci_stop_and_remove_bus_device()
>             pci_stop_bus_device()
>                device_release_driver_internal()
>                   pci_device_remove()
>                      remove_one()
>                         mlx5_mdev_uninit()
> 
> Then again I would expect that on other architectures VFs become at
> leastunresponsive during a FLR of the PF not sure if that also lead to
> calls to remove_one() though.
> 
> As another data point I tried the same on the default Ubuntu 22.04
> generic 5.15 kernel and there no crash occurs so this might be a newer
> issue.
> 
> Also, I did test with and without the patch I sent recently for
> skipping the wait in mlx5_health_wait_pci_up() but that made no
> difference.
> 
> Any hints on how to debug this further and could you try to see if this
> occurs on other architectures as well?

My guess that the splash, which complains about missing mutex_init(), is an outcome of these failures:
[ 1375.771395] mlx5_core 0004:00:00.0 eth0 (unregistering): vport 1 error -67 reading stats
[ 1376.151345] mlx5_core 0004:00:00.0: mlx5e_init_nic_tx:5376:(pid 1505): create tises failed, -67
[ 1376.238808] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: new profile init failed, -67
[ 1376.243746] mlx5_core 0004:00:00.0: mlx5e_init_rep_tx:1101:(pid 1505): create tises failed, -67
[ 1376.328623] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: failed to rollback to orig profile,

-67 is -ENOLINK from mlx5_internal_err_ret_value().

Thanks
Saeed Mahameed April 13, 2023, 10:02 p.m. UTC | #2
On 13 Apr 14:02, Leon Romanovsky wrote:
>On Tue, Apr 11, 2023 at 05:11:11PM +0200, Niklas Schnelle wrote:
>> Hi Saeed, Hi Leon,
>>
>> While testing PCI recovery with a ConnectX-5 card (MT28800, fw
>> 16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
>> crash (stacktrace attached) when the card is in switchdev mode. No
>> crash occurs and the recovery succeeds in legacy mode (with VFs). I
>> found that the same crash occurs also with a simple Function Level
>> Reset instead of the s390 specific PCI recovery, see instructions
>> below. From the looks of it I think this could affect non-s390 too but
>> I don't have a proper x86 test system with a ConnectX card to test
>> with.
>>
>> Anyway, I tried to analyze further but got stuck after figuring out
>> that in mlx5e_remove() deep down from mlx5_fw_fatal_reporter_err_work()
>> (see trace) the mlx5e_dev->priv pointer is valid but the pointed to
>> struct only contains zeros as it was previously zeroed by
>> mlx5_mdev_uninit() which then leads to a NULL pointer access.
>>
>> The crash itself can be prevented by the following debug patch though
>> clearly this is not a proper fix:
>>
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -6012,6 +6012,10 @@ static void mlx5e_remove(struct auxiliary_device
>> *adev)
>>         struct mlx5e_priv *priv = mlx5e_dev->priv;
>>         pm_message_t state = {};
>>
>> +       if (!priv->mdev) {
>> +               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
>> +               return;
>> +       }
>>         mlx5_core_uplink_netdev_set(priv->mdev, NULL);
>>         mlx5e_dcbnl_delete_app(priv);
>>         unregister_netdev(priv->netdev);
>>
>> With that I then tried to track down why mlx5_mdev_uninit() is called
>> and this might actually be s390 specific in that this happens during
>> the removal of the VF which on s390 causes extra hot unplug events for
>> the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
>> the following call trace:
>>
>> ...
>> zpci_bus_remove_device()
>>    zpci_iov_remove_virtfn()
>>       pci_iov_remove_virtfn()
>>          pci_stop_and_remove_bus_device()
>>             pci_stop_bus_device()
>>                device_release_driver_internal()
>>                   pci_device_remove()
>>                      remove_one()
>>                         mlx5_mdev_uninit()
>>
>> Then again I would expect that on other architectures VFs become at
>> leastunresponsive during a FLR of the PF not sure if that also lead to
>> calls to remove_one() though.
>>
>> As another data point I tried the same on the default Ubuntu 22.04
>> generic 5.15 kernel and there no crash occurs so this might be a newer
>> issue.
>>
>> Also, I did test with and without the patch I sent recently for
>> skipping the wait in mlx5_health_wait_pci_up() but that made no
>> difference.
>>
>> Any hints on how to debug this further and could you try to see if this
>> occurs on other architectures as well?
>
>My guess that the splash, which complains about missing mutex_init(), is an outcome of these failures:
>[ 1375.771395] mlx5_core 0004:00:00.0 eth0 (unregistering): vport 1 error -67 reading stats
>[ 1376.151345] mlx5_core 0004:00:00.0: mlx5e_init_nic_tx:5376:(pid 1505): create tises failed, -67
>[ 1376.238808] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: new profile init failed, -67
>[ 1376.243746] mlx5_core 0004:00:00.0: mlx5e_init_rep_tx:1101:(pid 1505): create tises failed, -67
>[ 1376.328623] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: failed to rollback to orig profile,

Yes, I also agree with Leon, if rollback fails this could be fatal to mlx5e
aux device removal as we don't have a way to check the state of the mlx5e
priv, We always assume it is up as long as the aux is up, which is wrong
only in case of this un-expected error flow.

If we just add a flag and skip mlx5e_remove, then we will end up with
dangling netdev and some other resources as the cleanup wasn't complete..

I need to dive deeper to figure out a proper solution, I will create an internal
ticket to track this and help provide a solution soon, hopefully.

>
>-67 is -ENOLINK from mlx5_internal_err_ret_value().
>
>Thanks
Niklas Schnelle April 14, 2023, 7:12 a.m. UTC | #3
On Thu, 2023-04-13 at 15:02 -0700, Saeed Mahameed wrote:
> On 13 Apr 14:02, Leon Romanovsky wrote:
> > On Tue, Apr 11, 2023 at 05:11:11PM +0200, Niklas Schnelle wrote:
> > > Hi Saeed, Hi Leon,
> > > 
> > > While testing PCI recovery with a ConnectX-5 card (MT28800, fw
> > > 16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
> > > crash (stacktrace attached) when the card is in switchdev mode. No
> > > crash occurs and the recovery succeeds in legacy mode (with VFs). I
> > > found that the same crash occurs also with a simple Function Level
> > > Reset instead of the s390 specific PCI recovery, see instructions
> > > below. From the looks of it I think this could affect non-s390 too but
> > > I don't have a proper x86 test system with a ConnectX card to test
> > > with.
> > > 
> > > Anyway, I tried to analyze further but got stuck after figuring out
> > > that in mlx5e_remove() deep down from mlx5_fw_fatal_reporter_err_work()
> > > (see trace) the mlx5e_dev->priv pointer is valid but the pointed to
> > > struct only contains zeros as it was previously zeroed by
> > > mlx5_mdev_uninit() which then leads to a NULL pointer access.
> > > 
> > > The crash itself can be prevented by the following debug patch though
> > > clearly this is not a proper fix:
> > > 
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > @@ -6012,6 +6012,10 @@ static void mlx5e_remove(struct auxiliary_device
> > > *adev)
> > >         struct mlx5e_priv *priv = mlx5e_dev->priv;
> > >         pm_message_t state = {};
> > > 
> > > +       if (!priv->mdev) {
> > > +               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
> > > +               return;
> > > +       }
> > >         mlx5_core_uplink_netdev_set(priv->mdev, NULL);
> > >         mlx5e_dcbnl_delete_app(priv);
> > >         unregister_netdev(priv->netdev);
> > > 
> > > With that I then tried to track down why mlx5_mdev_uninit() is called
> > > and this might actually be s390 specific in that this happens during
> > > the removal of the VF which on s390 causes extra hot unplug events for
> > > the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
> > > the following call trace:
> > > 
> > > ...
> > > zpci_bus_remove_device()
> > >    zpci_iov_remove_virtfn()
> > >       pci_iov_remove_virtfn()
> > >          pci_stop_and_remove_bus_device()
> > >             pci_stop_bus_device()
> > >                device_release_driver_internal()
> > >                   pci_device_remove()
> > >                      remove_one()
> > >                         mlx5_mdev_uninit()
> > > 
> > > Then again I would expect that on other architectures VFs become at
> > > leastunresponsive during a FLR of the PF not sure if that also lead to
> > > calls to remove_one() though.
> > > 
> > > As another data point I tried the same on the default Ubuntu 22.04
> > > generic 5.15 kernel and there no crash occurs so this might be a newer
> > > issue.
> > > 
> > > Also, I did test with and without the patch I sent recently for
> > > skipping the wait in mlx5_health_wait_pci_up() but that made no
> > > difference.
> > > 
> > > Any hints on how to debug this further and could you try to see if this
> > > occurs on other architectures as well?
> > 
> > My guess that the splash, which complains about missing mutex_init(), is an outcome of these failures:
> > [ 1375.771395] mlx5_core 0004:00:00.0 eth0 (unregistering): vport 1 error -67 reading stats
> > [ 1376.151345] mlx5_core 0004:00:00.0: mlx5e_init_nic_tx:5376:(pid 1505): create tises failed, -67
> > [ 1376.238808] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: new profile init failed, -67
> > [ 1376.243746] mlx5_core 0004:00:00.0: mlx5e_init_rep_tx:1101:(pid 1505): create tises failed, -67
> > [ 1376.328623] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: failed to rollback to orig profile,
> 
> Yes, I also agree with Leon, if rollback fails this could be fatal to mlx5e
> aux device removal as we don't have a way to check the state of the mlx5e
> priv, We always assume it is up as long as the aux is up, which is wrong
> only in case of this un-expected error flow.
> 
> If we just add a flag and skip mlx5e_remove, then we will end up with
> dangling netdev and some other resources as the cleanup wasn't complete..
> 
> I need to dive deeper to figure out a proper solution, I will create an internal
> ticket to track this and help provide a solution soon, hopefully.
> 

Thank you for looking into this, do you have an idea what got us into
this unexpected error flow. This occurs very reliably for me but I'm
not sure if it is s390 specific or just caused by the switchdev setup.
It's also unexpected to me that the code reports -ENOLINK does that
refer to the PCIe link here or to the representor device being
disconnected?

Thanks,
Niklas
Saeed Mahameed April 14, 2023, 10:27 p.m. UTC | #4
On 14 Apr 09:12, Niklas Schnelle wrote:
>On Thu, 2023-04-13 at 15:02 -0700, Saeed Mahameed wrote:
>> On 13 Apr 14:02, Leon Romanovsky wrote:
>> > On Tue, Apr 11, 2023 at 05:11:11PM +0200, Niklas Schnelle wrote:
>> > > Hi Saeed, Hi Leon,
>> > >
>> > > While testing PCI recovery with a ConnectX-5 card (MT28800, fw
>> > > 16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
>> > > crash (stacktrace attached) when the card is in switchdev mode. No
>> > > crash occurs and the recovery succeeds in legacy mode (with VFs). I
>> > > found that the same crash occurs also with a simple Function Level
>> > > Reset instead of the s390 specific PCI recovery, see instructions
>> > > below. From the looks of it I think this could affect non-s390 too but
>> > > I don't have a proper x86 test system with a ConnectX card to test
>> > > with.
>> > >
>> > > Anyway, I tried to analyze further but got stuck after figuring out
>> > > that in mlx5e_remove() deep down from mlx5_fw_fatal_reporter_err_work()
>> > > (see trace) the mlx5e_dev->priv pointer is valid but the pointed to
>> > > struct only contains zeros as it was previously zeroed by
>> > > mlx5_mdev_uninit() which then leads to a NULL pointer access.
>> > >
>> > > The crash itself can be prevented by the following debug patch though
>> > > clearly this is not a proper fix:
>> > >
>> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> > > @@ -6012,6 +6012,10 @@ static void mlx5e_remove(struct auxiliary_device
>> > > *adev)
>> > >         struct mlx5e_priv *priv = mlx5e_dev->priv;
>> > >         pm_message_t state = {};
>> > >
>> > > +       if (!priv->mdev) {
>> > > +               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
>> > > +               return;
>> > > +       }
>> > >         mlx5_core_uplink_netdev_set(priv->mdev, NULL);
>> > >         mlx5e_dcbnl_delete_app(priv);
>> > >         unregister_netdev(priv->netdev);
>> > >
>> > > With that I then tried to track down why mlx5_mdev_uninit() is called
>> > > and this might actually be s390 specific in that this happens during
>> > > the removal of the VF which on s390 causes extra hot unplug events for
>> > > the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
>> > > the following call trace:
>> > >
>> > > ...
>> > > zpci_bus_remove_device()
>> > >    zpci_iov_remove_virtfn()
>> > >       pci_iov_remove_virtfn()
>> > >          pci_stop_and_remove_bus_device()
>> > >             pci_stop_bus_device()
>> > >                device_release_driver_internal()
>> > >                   pci_device_remove()
>> > >                      remove_one()
>> > >                         mlx5_mdev_uninit()
>> > >
>> > > Then again I would expect that on other architectures VFs become at
>> > > leastunresponsive during a FLR of the PF not sure if that also lead to
>> > > calls to remove_one() though.
>> > >
>> > > As another data point I tried the same on the default Ubuntu 22.04
>> > > generic 5.15 kernel and there no crash occurs so this might be a newer
>> > > issue.
>> > >
>> > > Also, I did test with and without the patch I sent recently for
>> > > skipping the wait in mlx5_health_wait_pci_up() but that made no
>> > > difference.
>> > >
>> > > Any hints on how to debug this further and could you try to see if this
>> > > occurs on other architectures as well?
>> >
>> > My guess that the splash, which complains about missing mutex_init(), is an outcome of these failures:
>> > [ 1375.771395] mlx5_core 0004:00:00.0 eth0 (unregistering): vport 1 error -67 reading stats
>> > [ 1376.151345] mlx5_core 0004:00:00.0: mlx5e_init_nic_tx:5376:(pid 1505): create tises failed, -67
>> > [ 1376.238808] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: new profile init failed, -67
>> > [ 1376.243746] mlx5_core 0004:00:00.0: mlx5e_init_rep_tx:1101:(pid 1505): create tises failed, -67
>> > [ 1376.328623] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: failed to rollback to orig profile,
>>
>> Yes, I also agree with Leon, if rollback fails this could be fatal to mlx5e
>> aux device removal as we don't have a way to check the state of the mlx5e
>> priv, We always assume it is up as long as the aux is up, which is wrong
>> only in case of this un-expected error flow.
>>
>> If we just add a flag and skip mlx5e_remove, then we will end up with
>> dangling netdev and some other resources as the cleanup wasn't complete..
>>
>> I need to dive deeper to figure out a proper solution, I will create an internal
>> ticket to track this and help provide a solution soon, hopefully.
>>
>
>Thank you for looking into this, do you have an idea what got us into
>this unexpected error flow. This occurs very reliably for me but I'm
>not sure if it is s390 specific or just caused by the switchdev setup.
>It's also unexpected to me that the code reports -ENOLINK does that
>refer to the PCIe link here or to the representor device being
>disconnected?
>

I believe this is not related to s390, and should happen on  x86 as well, 
I just learned yesterday that you already filed this issue through our
support and we already have an assignee working on this, let's work through
the support ticket and reduce clutter on the mailing list, I am sure we will
come up with a patch very soon and will all learn what went wrong :) ..
I have a clear idea what the issue is, but the solution may require a
bit of refactoring.. 

-Saeed.

>Thanks,
>Niklas
>
Niklas Schnelle April 19, 2023, 11:47 a.m. UTC | #5
On Fri, 2023-04-14 at 15:27 -0700, Saeed Mahameed wrote:
> On 14 Apr 09:12, Niklas Schnelle wrote:
> > On Thu, 2023-04-13 at 15:02 -0700, Saeed Mahameed wrote:
> > > On 13 Apr 14:02, Leon Romanovsky wrote:
> > > > On Tue, Apr 11, 2023 at 05:11:11PM +0200, Niklas Schnelle wrote:
> > > > > Hi Saeed, Hi Leon,
> > > > > 
> > > > > While testing PCI recovery with a ConnectX-5 card (MT28800, fw
> > > > > 16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
> > > > > crash (stacktrace attached) when the card is in switchdev mode. No
> > > > > crash occurs and the recovery succeeds in legacy mode (with VFs). I
> > > > > found that the same crash occurs also with a simple Function Level
> > > > > Reset instead of the s390 specific PCI recovery, see instructions
> > > > > below. From the looks of it I think this could affect non-s390 too but
> > > > > I don't have a proper x86 test system with a ConnectX card to test
> > > > > with.
> > > > > 
> > > > > Anyway, I tried to analyze further but got stuck after figuring out
> > > > > that in mlx5e_remove() deep down from mlx5_fw_fatal_reporter_err_work()
> > > > > (see trace) the mlx5e_dev->priv pointer is valid but the pointed to
> > > > > struct only contains zeros as it was previously zeroed by
> > > > > mlx5_mdev_uninit() which then leads to a NULL pointer access.
> > > > > 
> > > > > The crash itself can be prevented by the following debug patch though
> > > > > clearly this is not a proper fix:
> > > > > 
> > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > > > @@ -6012,6 +6012,10 @@ static void mlx5e_remove(struct auxiliary_device
> > > > > *adev)
> > > > >         struct mlx5e_priv *priv = mlx5e_dev->priv;
> > > > >         pm_message_t state = {};
> > > > > 
> > > > > +       if (!priv->mdev) {
> > > > > +               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
> > > > > +               return;
> > > > > +       }
> > > > >         mlx5_core_uplink_netdev_set(priv->mdev, NULL);
> > > > >         mlx5e_dcbnl_delete_app(priv);
> > > > >         unregister_netdev(priv->netdev);
> > > > > 
> > > > > With that I then tried to track down why mlx5_mdev_uninit() is called
> > > > > and this might actually be s390 specific in that this happens during
> > > > > the removal of the VF which on s390 causes extra hot unplug events for
> > > > > the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
> > > > > the following call trace:
> > > > > 
> > > > > ...
> > > > > zpci_bus_remove_device()
> > > > >    zpci_iov_remove_virtfn()
> > > > >       pci_iov_remove_virtfn()
> > > > >          pci_stop_and_remove_bus_device()
> > > > >             pci_stop_bus_device()
> > > > >                device_release_driver_internal()
> > > > >                   pci_device_remove()
> > > > >                      remove_one()
> > > > >                         mlx5_mdev_uninit()
> > > > > 
> > > > > Then again I would expect that on other architectures VFs become at
> > > > > leastunresponsive during a FLR of the PF not sure if that also lead to
> > > > > calls to remove_one() though.
> > > > > 
> > > > > As another data point I tried the same on the default Ubuntu 22.04
> > > > > generic 5.15 kernel and there no crash occurs so this might be a newer
> > > > > issue.
> > > > > 
> > > > > Also, I did test with and without the patch I sent recently for
> > > > > skipping the wait in mlx5_health_wait_pci_up() but that made no
> > > > > difference.
> > > > > 
> > > > > Any hints on how to debug this further and could you try to see if this
> > > > > occurs on other architectures as well?
> > > > 
> > > > My guess that the splash, which complains about missing mutex_init(), is an outcome of these failures:
> > > > [ 1375.771395] mlx5_core 0004:00:00.0 eth0 (unregistering): vport 1 error -67 reading stats
> > > > [ 1376.151345] mlx5_core 0004:00:00.0: mlx5e_init_nic_tx:5376:(pid 1505): create tises failed, -67
> > > > [ 1376.238808] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: new profile init failed, -67
> > > > [ 1376.243746] mlx5_core 0004:00:00.0: mlx5e_init_rep_tx:1101:(pid 1505): create tises failed, -67
> > > > [ 1376.328623] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: failed to rollback to orig profile,
> > > 
> > > Yes, I also agree with Leon, if rollback fails this could be fatal to mlx5e
> > > aux device removal as we don't have a way to check the state of the mlx5e
> > > priv, We always assume it is up as long as the aux is up, which is wrong
> > > only in case of this un-expected error flow.
> > > 
> > > If we just add a flag and skip mlx5e_remove, then we will end up with
> > > dangling netdev and some other resources as the cleanup wasn't complete..
> > > 
> > > I need to dive deeper to figure out a proper solution, I will create an internal
> > > ticket to track this and help provide a solution soon, hopefully.
> > > 
> > 
> > Thank you for looking into this, do you have an idea what got us into
> > this unexpected error flow. This occurs very reliably for me but I'm
> > not sure if it is s390 specific or just caused by the switchdev setup.
> > It's also unexpected to me that the code reports -ENOLINK does that
> > refer to the PCIe link here or to the representor device being
> > disconnected?
> > 
> 
> I believe this is not related to s390, and should happen on  x86 as well, 
> I just learned yesterday that you already filed this issue through our
> support and we already have an assignee working on this, let's work through
> the support ticket and reduce clutter on the mailing list, I am sure we will
> come up with a patch very soon and will all learn what went wrong :) ..
> I have a clear idea what the issue is, but the solution may require a
> bit of refactoring.. 
> 
> -Saeed.

Sounds good and thank you!

Niklas
diff mbox series

Patch

--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6012,6 +6012,10 @@  static void mlx5e_remove(struct auxiliary_device
*adev)
        struct mlx5e_priv *priv = mlx5e_dev->priv;
        pm_message_t state = {};

+       if (!priv->mdev) {
+               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
+               return;
+       }
        mlx5_core_uplink_netdev_set(priv->mdev, NULL);
        mlx5e_dcbnl_delete_app(priv);