nvmeof rdma regression issue on 4.14.0-rc1 (or maybe mlx4?)

Message ID: 47493aa0-4cad-721b-4ea2-c3b2293340aa@grimberg.me (mailing list archive)
State: Deferred

Commit Message

Sagi Grimberg Sept. 24, 2017, 7:38 a.m. UTC
> Adding linux-rdma, the dma mappings happen in the mlx4 driver

...

>> [  293.209662] DMAR: ERROR: DMA PTE for vPFN 0xe0f59 already set (to 10369a9001 not 10115ed001)
>> [  293.219117] ------------[ cut here ]------------
>> [  293.224284] WARNING: CPU: 14 PID: 751 at drivers/iommu/intel-iommu.c:2305 __domain_mapping+0x367/0x380
>> [  293.234698] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core intel_rapl ipmi_ssif sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore iTCO_wdt ipmi_si intel_rapl_perf iTCO_vendor_support ipmi_devintf dcdbas sg pcspkr ipmi_msghandler ioatdma mei_me mei dca shpchp lpc_ich acpi_pad acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c mlx4_en sd_mod
>> [  293.313884]  mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci ptp libata i2c_core crc32c_intel devlink pps_core dm_mirror dm_region_hash dm_log dm_mod
>> [  293.335583] CPU: 14 PID: 751 Comm: kworker/u369:7 Not tainted 4.14.0-rc1 #2
>> [  293.343374] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
>> [  293.351750] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
>> [  293.359249] task: ffff881032ecdd00 task.stack: ffffc900084d8000
>> [  293.365873] RIP: 0010:__domain_mapping+0x367/0x380
>> [  293.371230] RSP: 0018:ffffc900084dbc60 EFLAGS: 00010202
>> [  293.377075] RAX: 0000000000000004 RBX: 00000010115ed001 RCX: 0000000000000000
>> [  293.385056] RDX: 0000000000000000 RSI: ffff88103e7ce038 RDI: ffff88103e7ce038
>> [  293.393040] RBP: ffffc900084dbcc0 R08: 0000000000000000 R09: 0000000000000000
>> [  293.401024] R10: 00000000000002f7 R11: 00000000010115ed R12: ffff88103b9e1ac8
>> [  293.409744] R13: 0000000000000001 R14: 0000000000000001 R15: 00000000000e0f59
>> [  293.418456] FS:  0000000000000000(0000) GS:ffff88103e7c0000(0000) knlGS:0000000000000000
>> [  293.428229] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  293.435391] CR2: 0000154ecabc9140 CR3: 0000001005709001 CR4: 00000000001606e0
>> [  293.444112] Call Trace:
>> [  293.447594]  __intel_map_single+0xeb/0x180
>> [  293.452918]  intel_map_page+0x39/0x40
>> [  293.457765]  mlx4_ib_alloc_mr+0x141/0x220 [mlx4_ib]
>> [  293.463965]  ib_alloc_mr+0x26/0x50 [ib_core]
>> [  293.469471]  nvme_rdma_reinit_request+0x3a/0x70 [nvme_rdma]
>> [  293.476433]  ? nvme_rdma_free_ctrl+0xb0/0xb0 [nvme_rdma]
>> [  293.483100]  blk_mq_reinit_tagset+0x5c/0x90
>> [  293.488508]  nvme_rdma_configure_io_queues+0x211/0x290 [nvme_rdma]
>> [  293.496152]  nvme_rdma_reconnect_ctrl_work+0x5b/0xd0 [nvme_rdma]
>> [  293.503598]  process_one_work+0x149/0x360
>> [  293.508815]  worker_thread+0x4d/0x3c0
>> [  293.513638]  kthread+0x109/0x140
>> [  293.517973]  ? rescuer_thread+0x380/0x380
>> [  293.523176]  ? kthread_park+0x60/0x60
>> [  293.527993]  ret_from_fork+0x25/0x30

Is it possible that ib_dereg_mr failed?

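Can you please apply the following patch and report if you see a warning?
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 92a03ff5fb4d..ef50b58b0bb6 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -274,7 +274,7 @@ static int nvme_rdma_reinit_request(void *data, struct request *rq)
        struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
        int ret = 0;

-       ib_dereg_mr(req->mr);
+       WARN_ON_ONCE(ib_dereg_mr(req->mr));

        req->mr = ib_alloc_mr(dev->pd, IB_MR_TYPE_MEM_REG,
                         ctrl->max_fr_pages);
--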
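
The debugging idea here is to wrap ib_dereg_mr() in WARN_ON_ONCE() so that a non-zero return value is reported the first time it happens. The following userspace sketch models that behaviour only; it is not the kernel macro. The names warn_once_site and warn_on_once are hypothetical, and the per-call-site state that the kernel keeps in a hidden static variable is passed in explicitly.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the hidden per-call-site state of WARN_ON_ONCE. */
struct warn_once_site {
    bool warned; /* set after the first warning has been printed */
    int hits;    /* how many times the condition was true */
};

/*
 * Report a failing return code once per site.
 * Returns true whenever ret is non-zero, whether or not a message was
 * printed, mirroring how the kernel macro evaluates to its condition.
 */
static bool warn_on_once(struct warn_once_site *site, int ret, const char *what)
{
    if (ret == 0)
        return false;

    site->hits++;
    if (!site->warned) {
        site->warned = true;
        fprintf(stderr, "WARNING: %s returned %d\n", what, ret);
    }
    return true;
}
```

With a zero-initialised site, a call with ret == 0 prints nothing, the first non-zero ret prints one warning, and later non-zero values are counted silently. If no warning appears in the kernel log with the real patch applied, the deregistration call did not report an error on that path.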
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Yi Zhang Sept. 24, 2017, 9:28 a.m. UTC | #1
> Is it possible that ib_dereg_mr failed?
>
It seems not; no warning was reported, and the system eventually panicked. Here is the log:

[  104.373784] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  104.564001] nvme nvme0: creating 40 I/O queues.
[  105.070022] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  144.135070] nvme nvme0: rescanning
[  204.383678] nvme nvme0: Reconnecting in 10 seconds...
[  214.506489] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  214.513996] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  214.520426] nvme nvme0: Failed reconnect attempt 1
[  214.525788] nvme nvme0: Reconnecting in 10 seconds...
[  224.733962] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  224.741464] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  224.747898] nvme nvme0: Failed reconnect attempt 2
[  224.753301] nvme nvme0: Reconnecting in 10 seconds...
[  234.973834] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  234.981335] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  234.987768] nvme nvme0: Failed reconnect attempt 3
[  234.993150] nvme nvme0: Reconnecting in 10 seconds...
[  245.233395] nvme nvme0: creating 40 I/O queues.
[  245.238480] DMAR: ERROR: DMA PTE for vPFN 0xe109b already set (to 10098cc002 not 103b85e003)
[  245.247940] ------------[ cut here ]------------
[  245.253110] WARNING: CPU: 38 PID: 6 at drivers/iommu/intel-iommu.c:2305 __domain_mapping+0x367/0x380
[  245.263329] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert 
iscsi_target_mod ibd
[  245.342493]  mgag200 i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci ptp 
libata crc32c_intel i2c_core pps_core devlink dm_mirror dm_region_hash dmd
[  245.364191] CPU: 38 PID: 6 Comm: kworker/u368:0 Not tainted 
4.14.0-rc1+ #7
[  245.371880] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 
1.6.2 01/08/2016
[  245.380265] Workqueue: ib_addr process_one_req [ib_core]
[  245.386211] task: ffff88018cb245c0 task.stack: ffffc9000009c000
[  245.392836] RIP: 0010:__domain_mapping+0x367/0x380
[  245.398194] RSP: 0018:ffffc9000009fa98 EFLAGS: 00010202
[  245.404039] RAX: 0000000000000004 RBX: 000000103b85e003 RCX: 
0000000000000000
[  245.412018] RDX: 0000000000000000 RSI: ffff88103eace038 RDI: 
ffff88103eace038
[  245.420001] RBP: ffffc9000009faf8 R08: 0000000000000000 R09: 
0000000000000000
[  245.427983] R10: 00000000000002f7 R11: 000000000103b85e R12: 
ffff881009bc74d8
[  245.436711] R13: 0000000000000001 R14: 0000000000000001 R15: 
00000000000e109b
[  245.445419] FS:  0000000000000000(0000) GS:ffff88103eac0000(0000) 
knlGS:0000000000000000
[  245.455199] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  245.462357] CR2: 000014940a9b7140 CR3: 00000010119b5000 CR4: 
00000000001606e0
[  245.471074] Call Trace:
[  245.474549]  __intel_map_single+0xeb/0x180
[  245.479868]  intel_alloc_coherent+0xb5/0x130
[  245.485388]  mlx4_buf_alloc+0xe5/0x1c0 [mlx4_core]
[  245.491482]  mlx4_ib_alloc_cq_buf.isra.9+0x38/0xd0 [mlx4_ib]
[  245.498540]  mlx4_ib_create_cq+0x223/0x450 [mlx4_ib]
[  245.504822]  ib_alloc_cq+0x49/0x170 [ib_core]
[  245.510413]  nvme_rdma_cm_handler+0x3a2/0x7ab [nvme_rdma]
[  245.517179]  ? cma_acquire_dev+0x1e3/0x3b0 [rdma_cm]
[  245.523456]  addr_handler+0xa4/0x1c0 [rdma_cm]
[  245.529147]  process_one_req+0x8d/0x120 [ib_core]
[  245.535132]  process_one_work+0x149/0x360
[  245.540334]  worker_thread+0x4d/0x3c0
[  245.545145]  kthread+0x109/0x140
[  245.549462]  ? rescuer_thread+0x380/0x380
[  245.554654]  ? kthread_park+0x60/0x60
[  245.559456]  ret_from_fork+0x25/0x30
[  245.564153] Code: fe aa 81 4c 89 5d a0 4c 89 4d a8 e8 87 e1 c0 ff 8b 
05 fe 6e 87 00 4c 8b 4d a8 4c 8b 5d a0 85 c0 74 09 83 e8 01 89 05 e9 6e 
87 00 <0f> ff e9 b8 fd ff ff e8 8d c7 ba ff 0f 1f 00 66 2e 0f 1f 8
[  245.586712] ---[ end trace 56749c1831388ff8 ]---
[  245.592920] mlx4_core 0000:04:00.0: dma_pool_free mlx4_cmd, 
cccccccccccccccc/ccd80eccccccf203 (bad dma)
[  245.604179] mlx4_core 0000:04:00.0: dma_pool_free mlx4_cmd, 
cccccccccccccccc/cccccccccccccccc (bad dma)
[  245.615647] general protection fault: 0000 [#1] SMP
[  245.621836] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert 
iscsi_target_mod ibd
[  245.706171]  mgag200 i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci ptp 
libata crc32c_intel i2c_core pps_core devlink dm_mirror dm_region_hash dmd
[  245.729344] CPU: 38 PID: 6 Comm: kworker/u368:0 Tainted: G W       
4.14.0-rc1+ #7
[  245.739128] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 
1.6.2 01/08/2016
[  245.748234] Workqueue: ib_addr process_one_req [ib_core]
[  245.754905] task: ffff88018cb245c0 task.stack: ffffc9000009c000
[  245.762256] RIP: 0010:prefetch_freepointer.isra.65+0x11/0x20
[  245.769313] RSP: 0018:ffffc9000009fcc0 EFLAGS: 00010286
[  245.775881] RAX: 0000000000000000 RBX: cccccccccccccccc RCX: 
0000000000001793
[  245.784591] RDX: 0000000000001792 RSI: cccccccccccccccc RDI: 
ffff88018fc07aa0
[  245.793294] RBP: ffffc9000009fcc0 R08: 000000000001ed40 R09: 
ffff8810098cccc0
[  245.802002] R10: ffffffff818a99e0 R11: 00000000010098cd R12: 
00000000014080c0
[  245.810706] R13: ffffffffa07bd1e0 R14: ffff88018fc07a80 R15: 
ffff88018fc07a80
[  245.819409] FS:  0000000000000000(0000) GS:ffff88103eac0000(0000) 
knlGS:0000000000000000
[  245.829184] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  245.836342] CR2: 000014940a9b7140 CR3: 00000010119b5000 CR4: 
00000000001606e0
[  245.845056] Call Trace:
[  245.848524]  kmem_cache_alloc_trace+0xa0/0x1c0
[  245.854220]  nvme_rdma_cm_handler+0x4e0/0x7ab [nvme_rdma]
[  245.860990]  addr_handler+0xa4/0x1c0 [rdma_cm]
[  245.866694]  process_one_req+0x8d/0x120 [ib_core]
[  245.872687]  process_one_work+0x149/0x360
[  245.877899]  worker_thread+0x4d/0x3c0
[  245.882720]  kthread+0x109/0x140
[  245.887051]  ? rescuer_thread+0x380/0x380
[  245.892244]  ? kthread_park+0x60/0x60
[  245.897054]  ret_from_fork+0x25/0x30
[  245.901760] Code: 31 d2 e8 b3 ea ff ff 5b 41 5c 5d c3 0f 1f 40 00 66 
2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 85 f6 48 89 e5 74 0a 48 
63 07 <48> 8b 04 06 0f 18 08 5d c3 66 0f 1f 44 00 00 0f 1f 44 00 0
[  245.924349] RIP: prefetch_freepointer.isra.65+0x11/0x20 RSP: 
ffffc9000009fcc0
[  245.933145] ---[ end trace 56749c1831388ff9 ]---
[  245.942680] Kernel panic - not syncing: Fatal exception
[  245.950207] Kernel Offset: disabled
[  245.958566] ---[ end Kernel panic - not syncing: Fatal exception
[  245.966082] ------------[ cut here ]------------
[  245.972014] WARNING: CPU: 38 PID: 6 at kernel/sched/core.c:1179 
set_task_cpu+0x191/0x1a0
[  245.981822] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert 
iscsi_target_mod ibd
[  246.066533]  mgag200 i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci ptp 
libata crc32c_intel i2c_core pps_core devlink dm_mirror dm_region_hash dmd
[  246.089836] CPU: 38 PID: 6 Comm: kworker/u368:0 Tainted: G      D 
W       4.14.0-rc1+ #7
[  246.099683] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 
1.6.2 01/08/2016
[  246.108849] Workqueue: ib_addr process_one_req [ib_core]
[  246.115566] task: ffff88018cb245c0 task.stack: ffffc9000009c000
[  246.122948] RIP: 0010:set_task_cpu+0x191/0x1a0
[  246.128668] RSP: 0018:ffff88103eac3c38 EFLAGS: 00010046
[  246.135255] RAX: 0000000000000100 RBX: ffff88207bf445c0 RCX: 
0000000000000001
[  246.143978] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 
ffff88207bf445c0
[  246.152699] RBP: ffff88103eac3c58 R08: 0000000000000001 R09: 
0000000000000000
[  246.161418] R10: 0000000000000001 R11: 0000000003e236eb R12: 
ffff88207bf4516c
[  246.170137] R13: 0000000000000001 R14: 0000000000000001 R15: 
000000000001b900
[  246.178854] FS:  0000000000000000(0000) GS:ffff88103eac0000(0000) 
knlGS:0000000000000000
[  246.188644] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  246.195812] CR2: 000014940a9b7140 CR3: 00000010119b5000 CR4: 
00000000001606e0
[  246.204540] Call Trace:
[  246.208027]  <IRQ>
[  246.211016]  try_to_wake_up+0x166/0x470
[  246.216036]  default_wake_function+0x12/0x20
[  246.221537]  __wake_up_common+0x8a/0x160
[  246.226641]  __wake_up_locked+0x16/0x20
[  246.231643]  ep_poll_callback+0xd0/0x300
[  246.236727]  __wake_up_common+0x8a/0x160
[  246.241817]  __wake_up_common_lock+0x7e/0xc0
[  246.247291]  __wake_up+0x13/0x20
[  246.251596]  wake_up_klogd_work_func+0x40/0x60
[  246.257265]  irq_work_run_list+0x4d/0x70
[  246.262353]  ? tick_sched_do_timer+0x70/0x70
[  246.267830]  irq_work_tick+0x40/0x50
[  246.272530]  update_process_times+0x42/0x60
[  246.277912]  tick_sched_handle+0x2d/0x60
[  246.282987]  tick_sched_timer+0x39/0x70
[  246.287945]  __hrtimer_run_queues+0xe5/0x230
[  246.293371]  hrtimer_interrupt+0xa8/0x1a0
[  246.298509]  smp_apic_timer_interrupt+0x5f/0x130
[  246.304322]  apic_timer_interrupt+0x9d/0xb0
[  246.309640]  </IRQ>
[  246.312633] RIP: 0010:panic+0x1fd/0x245
[  246.317554] RSP: 0018:ffffc9000009fb18 EFLAGS: 00000246 ORIG_RAX: 
ffffffffffffff10
[  246.326659] RAX: 0000000000000034 RBX: 0000000000000200 RCX: 
0000000000000006
[  246.335268] RDX: 0000000000000000 RSI: 0000000000000086 RDI: 
ffff88103eace030
[  246.343856] RBP: ffffc9000009fb88 R08: 0000000000000000 R09: 
0000000000000877
[  246.352424] R10: 00000000000003ff R11: 0000000000000001 R12: 
ffffffff81a3e1d8
[  246.360975] R13: 0000000000000000 R14: 0000000000000000 R15: 
ffff88018fc07a80
[  246.369508]  ? panic+0x1f6/0x245
[  246.373657]  oops_end+0xb8/0xd0
[  246.377676]  die+0x42/0x50
[  246.381194]  do_general_protection+0xd2/0x160
[  246.386540]  ? nvme_rdma_cm_handler+0x4e0/0x7ab [nvme_rdma]
[  246.393238]  general_protection+0x22/0x30
[  246.398181] RIP: 0010:prefetch_freepointer.isra.65+0x11/0x20
[  246.404964] RSP: 0018:ffffc9000009fcc0 EFLAGS: 00010286
[  246.411258] RAX: 0000000000000000 RBX: cccccccccccccccc RCX: 
0000000000001793
[  246.419692] RDX: 0000000000001792 RSI: cccccccccccccccc RDI: 
ffff88018fc07aa0
[  246.428115] RBP: ffffc9000009fcc0 R08: 000000000001ed40 R09: 
ffff8810098cccc0
[  246.436543] R10: ffffffff818a99e0 R11: 00000000010098cd R12: 
00000000014080c0
[  246.444970] R13: ffffffffa07bd1e0 R14: ffff88018fc07a80 R15: 
ffff88018fc07a80
[  246.453402]  ? nvme_rdma_cm_handler+0x4e0/0x7ab [nvme_rdma]
[  246.460087]  kmem_cache_alloc_trace+0xa0/0x1c0
[  246.465511]  nvme_rdma_cm_handler+0x4e0/0x7ab [nvme_rdma]
[  246.472004]  addr_handler+0xa4/0x1c0 [rdma_cm]
[  246.477424]  process_one_req+0x8d/0x120 [ib_core]
[  246.483128]  process_one_work+0x149/0x360
[  246.488045]  worker_thread+0x4d/0x3c0
[  246.492577]  kthread+0x109/0x140
[  246.496620]  ? rescuer_thread+0x380/0x380
[  246.501540]  ? kthread_park+0x60/0x60
[  246.506070]  ret_from_fork+0x25/0x30
[  246.510496] Code: ff 80 8b ac 08 00 00 04 e9 23 ff ff ff 0f ff e9 bf 
fe ff ff f7 83 84 00 00 00 fd ff ff ff 0f 84 c9 fe ff ff 0f ff e9 c2 fe 
ff ff <0f> ff e9 d1 fe ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 0
[  246.532545] ---[ end trace 56749c1831388ffa ]---

> can you please apply the following patch and report if you see a warning?
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 92a03ff5fb4d..ef50b58b0bb6 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -274,7 +274,7 @@ static int nvme_rdma_reinit_request(void *data, 
> struct request *rq)
>         struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>         int ret = 0;
>
> -       ib_dereg_mr(req->mr);
> +       WARN_ON_ONCE(ib_dereg_mr(req->mr));
>
>         req->mr = ib_alloc_mr(dev->pd, IB_MR_TYPE_MEM_REG,
>                         ctrl->max_fr_pages);
> -- 
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

Leon Romanovsky Sept. 24, 2017, 10:34 a.m. UTC | #2
On Sun, Sep 24, 2017 at 05:28:30PM +0800, Yi Zhang wrote:
>
> > Is it possible that ib_dereg_mr failed?
> >
> It seems not, and finally the system get panic, here is the log:

I looked into the issue over the weekend and didn't see any suspicious
commits in the mlx4 alloc/mapping area.

Could you run a git bisect to find the problematic change?

Added Tariq to the thread.

Thanks

Yi Zhang Sept. 25, 2017, 5:20 a.m. UTC | #3
On 09/24/2017 06:34 PM, Leon Romanovsky wrote:
> On Sun, Sep 24, 2017 at 05:28:30PM +0800, Yi Zhang wrote:
>>> Is it possible that ib_dereg_mr failed?
>>>
>> It seems not, and finally the system get panic, here is the log:
> I looked on the issue during the weekend and didn't see any suspicious
> commit in the mlx4 alloc/mapping area.
>
> Can I ask you to perform git bisect to find the problematic change?
Hi Sagi,
I ran git bisect on this issue; it seems to have been introduced by your
patch set "Few more patches from the centralization set".
Here are the test results for those patches. Let me know if you need more info.

BAD  148b4e7 nvme-rdma: stop queues instead of simply flipping their state
BAD  a57bd54 nvme-rdma: introduce configure/destroy io queues
Log:
[  127.899255] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  128.074263] nvme nvme0: creating 40 I/O queues.
[  128.581822] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  177.486110] print_req_error: I/O error, dev nvme0n1, sector 0
[  191.256637] nvme nvme0: Reconnecting in 10 seconds...
[  201.855846] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  201.863353] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  201.869824] nvme nvme0: Failed reconnect attempt 1
[  201.875183] nvme nvme0: Reconnecting in 10 seconds...
[  212.087828] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  212.095330] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  212.101766] nvme nvme0: Failed reconnect attempt 2
[  212.107129] nvme nvme0: Reconnecting in 10 seconds...
[  222.328398] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  222.335900] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  222.342335] nvme nvme0: Failed reconnect attempt 3
[  222.347699] nvme nvme0: Reconnecting in 10 seconds...
[  232.567791] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  232.575292] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  232.581730] nvme nvme0: Failed reconnect attempt 4
[  232.587094] nvme nvme0: Reconnecting in 10 seconds...
[  242.827727] nvme nvme0: creating 40 I/O queues.
[  242.832810] DMAR: ERROR: DMA PTE for vPFN 0xe129b already set (to 103c692002 not 1000915003)
[  242.842265] ------------[ cut here ]------------
[  242.847437] WARNING: CPU: 0 PID: 783 at drivers/iommu/intel-iommu.c:2299 __domain_mapping+0x363/0x370
[  242.857755] Modules linked in: nvme_rdma nvme_fabrics nvme_core 
sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert 
iscsi_target_mod ibd
[  242.936919]  mgag200 i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci 
crc32c_intel ptp libata i2c_core pps_core devlink dm_mirror 
dm_region_hash dmd
[  242.958625] CPU: 0 PID: 783 Comm: kworker/u368:1 Not tainted 
4.13.0-rc7.a57bd54+ #15
[  242.967304] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 
1.6.2 01/08/2016
[  242.975687] Workqueue: ib_addr process_one_req [ib_core]
[  242.981631] task: ffff881019491740 task.stack: ffffc9000b534000
[  242.989011] RIP: 0010:__domain_mapping+0x363/0x370
[  242.995108] RSP: 0018:ffffc9000b537a50 EFLAGS: 00010202
[  243.001694] RAX: 0000000000000004 RBX: 0000001000915003 RCX: 0000000000000000
[  243.010433] RDX: 0000000000000000 RSI: ffff88103e60e038 RDI: ffff88103e60e038
[  243.019170] RBP: ffffc9000b537ab0 R08: 0000000000000000 R09: 0000000000000000
[  243.027893] R10: 00000000000002f7 R11: 0000000001000915 R12: ffff88201ea5c4d8
[  243.036632] R13: 0000000000000001 R14: 0000000000000001 R15: 00000000000e129b
[  243.045348] FS:  0000000000000000(0000) GS:ffff88103e600000(0000) knlGS:0000000000000000
[  243.055142] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  243.062337] CR2: 00007f9d3b0fab70 CR3: 00000020245b7000 CR4: 00000000001406f0
[  243.071076] Call Trace:
[  243.074564]  __intel_map_single+0xeb/0x180
[  243.079903]  intel_alloc_coherent+0xb5/0x130
[  243.085445]  mlx4_buf_alloc+0xe5/0x1c0 [mlx4_core]
[  243.091555]  mlx4_ib_alloc_cq_buf.isra.9+0x38/0xd0 [mlx4_ib]
[  243.098621]  mlx4_ib_create_cq+0x223/0x440 [mlx4_ib]
[  243.104901]  ? find_gid.isra.5+0x167/0x1f0 [ib_core]
[  243.111178]  ib_alloc_cq+0x49/0x170 [ib_core]
[  243.116791]  nvme_rdma_cm_handler+0x3e7/0x886 [nvme_rdma]
[  243.123557]  ? cma_attach_to_dev+0x17/0x50 [rdma_cm]
[  243.129838]  ? cma_acquire_dev+0x1e3/0x3b0 [rdma_cm]
[  243.136115]  ? account_entity_dequeue+0xaa/0xe0
[  243.141918]  addr_handler+0xa4/0x1c0 [rdma_cm]
[  243.147604]  process_one_req+0x8d/0x120 [ib_core]
[  243.153585]  process_one_work+0x149/0x360
[  243.158807]  worker_thread+0x4d/0x3c0
[  243.163618]  kthread+0x109/0x140
[  243.167936]  ? rescuer_thread+0x380/0x380
[  243.173131]  ? kthread_park+0x60/0x60
[  243.177930]  ret_from_fork+0x25/0x30
[  243.182641] Code: f1 a9 81 4c 89 5d a0 4c 89 4d a8 e8 0b 58 c1 ff 8b 05 f2 16 88 00 4c 8b 4d a8 4c 8b 5d a0 85 c0 74 09 83 e8 01 89 05 dd 16 88 00 <0f> ff e9 bc fd ff ff e8 21 3d bb ff 90 0f 1f 44 00 00 55 4
[  243.205193] ---[ end trace 725c2de52628c061 ]---
[  243.211723] mlx4_core 0000:04:00.0: dma_pool_free mlx4_cmd, cccccccccccccccc/cc242fcccccc2801 (bad dma)
[  243.211724] mlx4_core 0000:04:00.0: dma_pool_free mlx4_cmd, cccccccccccccccc/cccccccccccccccc (bad dma)
[  243.212312] general protection fault: 0000 [#1] SMP
[  243.212312] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ibd
[  243.212339]  mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci crc32c_intel ptp libata i2c_core pps_core devlink dm_mirror dm_region_hash dmd
[  243.212353] CPU: 36 PID: 783 Comm: kworker/u368:1 Tainted: G        W       4.13.0-rc7.a57bd54+ #15
[  243.212353] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
[  243.212360] Workqueue: ib_addr process_one_req [ib_core]
[  243.212361] task: ffff881019491740 task.stack: ffffc9000b534000
[  243.212364] RIP: 0010:kmem_cache_alloc_trace+0x7d/0x1b0
[  243.212364] RSP: 0018:ffffc9000b537c88 EFLAGS: 00010282
[  243.212365] RAX: 0000000000000000 RBX: 00000000014080c0 RCX: 0000000000006b83
[  243.212366] RDX: 0000000000006b82 RSI: 00000000014080c0 RDI: ffff88018fc07a80
[  243.212366] RBP: ffffc9000b537cc0 R08: 000000000001ed40 R09: 0000000000000000
[  243.212373] R10: ffff88018fc07a80 R11: 000000000103c693 R12: 00000000014080c0
[  243.212374] R13: ffffffffa08b317f R14: ffff88018fc07a80 R15: cccccccccccccccc
[  243.212375] FS:  0000000000000000(0000) GS:ffff88103ea80000(0000) knlGS:0000000000000000
[  243.212375] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  243.212376] CR2: 00007f736b47c000 CR3: 000000201b770000 CR4: 00000000001406e0
[  243.212377] Call Trace:
[  243.212379]  nvme_rdma_cm_handler+0x4ef/0x886 [nvme_rdma]
[  243.212382]  ? cma_attach_to_dev+0x17/0x50 [rdma_cm]
[  243.212383]  ? nvme_rdma_memreg_done+0x30/0x30 [nvme_rdma]
[  243.212385]  addr_handler+0xa4/0x1c0 [rdma_cm]
[  243.212390]  process_one_req+0x8d/0x120 [ib_core]
[  243.212398]  process_one_work+0x149/0x360
[  243.212399]  worker_thread+0x4d/0x3c0
[  243.212400]  kthread+0x109/0x140
[  243.212401]  ? rescuer_thread+0x380/0x380
[  243.212403]  ? kthread_park+0x60/0x60
[  243.212404]  ret_from_fork+0x25/0x30
[  243.212405] Code: 4c 03 05 6f 12 dd 7e 4d 8b 38 49 8b 40 10 4d 85 ff 0f 84 ec 00 00 00 48 85 c0 0f 84 e3 00 00 00 49 63 42 20 48 8d 4a 01 4d 8b 02 <49> 8b 1c 07 4c 89 f8 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 4
[  243.212420] RIP: kmem_cache_alloc_trace+0x7d/0x1b0 RSP: ffffc9000b537c88
[  243.212519] ---[ end trace 725c2de52628c062 ]---
[  243.216792] Kernel panic - not syncing: Fatal exception
[  243.216878] Kernel Offset: disabled
[  243.583898] ---[ end Kernel panic - not syncing: Fatal exception

The kernel panics after connecting with the commits below; the detailed log is here:
https://pastebin.com/7z0XSGSd
31fdf18 nvme-rdma: reuse configure/destroy_admin_queue
3f02fff nvme-rdma: don't free tagset on resets
18398af nvme-rdma: disable the controller on resets
b28a308 nvme-rdma: move tagset allocation to a dedicated routine

GOOD 34b6c23 nvme: Add admin_tagset pointer to nvme_ctrl


Thanks
Yi
> Added Tariq to the thread.
>
> Thanks
>
>>


Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 92a03ff5fb4d..ef50b58b0bb6 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -274,7 +274,7 @@ static int nvme_rdma_reinit_request(void *data, struct request *rq)
         struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
         int ret = 0;

-       ib_dereg_mr(req->mr);
+       WARN_ON_ONCE(ib_dereg_mr(req->mr));

         req->mr = ib_alloc_mr(dev->pd, IB_MR_TYPE_MEM_REG,