Message ID | 47493aa0-4cad-721b-4ea2-c3b2293340aa@grimberg.me (mailing list archive) |
---|---|
State | Deferred |
> Is it possible that ib_dereg_mr failed? > It seems not, and finally the system get panic, here is the log: [ 104.373784] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420 [ 104.564001] nvme nvme0: creating 40 I/O queues. [ 105.070022] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420 [ 144.135070] nvme nvme0: rescanning [ 204.383678] nvme nvme0: Reconnecting in 10 seconds... [ 214.506489] nvme nvme0: Connect rejected: status 8 (invalid service ID). [ 214.513996] nvme nvme0: rdma_resolve_addr wait failed (-104). [ 214.520426] nvme nvme0: Failed reconnect attempt 1 [ 214.525788] nvme nvme0: Reconnecting in 10 seconds... [ 224.733962] nvme nvme0: Connect rejected: status 8 (invalid service ID). [ 224.741464] nvme nvme0: rdma_resolve_addr wait failed (-104). [ 224.747898] nvme nvme0: Failed reconnect attempt 2 [ 224.753301] nvme nvme0: Reconnecting in 10 seconds... [ 234.973834] nvme nvme0: Connect rejected: status 8 (invalid service ID). [ 234.981335] nvme nvme0: rdma_resolve_addr wait failed (-104). [ 234.987768] nvme nvme0: Failed reconnect attempt 3 [ 234.993150] nvme nvme0: Reconnecting in 10 seconds... [ 245.233395] nvme nvme0: creating 40 I/O queues. 
[ 245.238480] DMAR: ERROR: DMA PTE for vPFN 0xe109b already set (to 10098cc002 not 103b85e003) [ 245.247940] ------------[ cut here ]------------ [ 245.253110] WARNING: CPU: 38 PID: 6 at drivers/iommu/intel-iommu.c:2305 __domain_mapping+0x367/0x380 [ 245.263329] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ibd [ 245.342493] mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci ptp libata crc32c_intel i2c_core pps_core devlink dm_mirror dm_region_hash dmd [ 245.364191] CPU: 38 PID: 6 Comm: kworker/u368:0 Not tainted 4.14.0-rc1+ #7 [ 245.371880] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016 [ 245.380265] Workqueue: ib_addr process_one_req [ib_core] [ 245.386211] task: ffff88018cb245c0 task.stack: ffffc9000009c000 [ 245.392836] RIP: 0010:__domain_mapping+0x367/0x380 [ 245.398194] RSP: 0018:ffffc9000009fa98 EFLAGS: 00010202 [ 245.404039] RAX: 0000000000000004 RBX: 000000103b85e003 RCX: 0000000000000000 [ 245.412018] RDX: 0000000000000000 RSI: ffff88103eace038 RDI: ffff88103eace038 [ 245.420001] RBP: ffffc9000009faf8 R08: 0000000000000000 R09: 0000000000000000 [ 245.427983] R10: 00000000000002f7 R11: 000000000103b85e R12: ffff881009bc74d8 [ 245.436711] R13: 0000000000000001 R14: 0000000000000001 R15: 00000000000e109b [ 245.445419] FS: 0000000000000000(0000) GS:ffff88103eac0000(0000) knlGS:0000000000000000 [ 245.455199] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 245.462357] CR2: 000014940a9b7140 CR3: 00000010119b5000 CR4: 00000000001606e0 [ 245.471074] Call Trace: [ 245.474549] __intel_map_single+0xeb/0x180 [ 245.479868] intel_alloc_coherent+0xb5/0x130 [ 245.485388] mlx4_buf_alloc+0xe5/0x1c0 [mlx4_core] [ 245.491482] mlx4_ib_alloc_cq_buf.isra.9+0x38/0xd0 [mlx4_ib] [ 245.498540] mlx4_ib_create_cq+0x223/0x450 [mlx4_ib] [ 245.504822] 
ib_alloc_cq+0x49/0x170 [ib_core] [ 245.510413] nvme_rdma_cm_handler+0x3a2/0x7ab [nvme_rdma] [ 245.517179] ? cma_acquire_dev+0x1e3/0x3b0 [rdma_cm] [ 245.523456] addr_handler+0xa4/0x1c0 [rdma_cm] [ 245.529147] process_one_req+0x8d/0x120 [ib_core] [ 245.535132] process_one_work+0x149/0x360 [ 245.540334] worker_thread+0x4d/0x3c0 [ 245.545145] kthread+0x109/0x140 [ 245.549462] ? rescuer_thread+0x380/0x380 [ 245.554654] ? kthread_park+0x60/0x60 [ 245.559456] ret_from_fork+0x25/0x30 [ 245.564153] Code: fe aa 81 4c 89 5d a0 4c 89 4d a8 e8 87 e1 c0 ff 8b 05 fe 6e 87 00 4c 8b 4d a8 4c 8b 5d a0 85 c0 74 09 83 e8 01 89 05 e9 6e 87 00 <0f> ff e9 b8 fd ff ff e8 8d c7 ba ff 0f 1f 00 66 2e 0f 1f 8 [ 245.586712] ---[ end trace 56749c1831388ff8 ]--- [ 245.592920] mlx4_core 0000:04:00.0: dma_pool_free mlx4_cmd, cccccccccccccccc/ccd80eccccccf203 (bad dma) [ 245.604179] mlx4_core 0000:04:00.0: dma_pool_free mlx4_cmd, cccccccccccccccc/cccccccccccccccc (bad dma) [ 245.615647] general protection fault: 0000 [#1] SMP [ 245.621836] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ibd [ 245.706171] mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci ptp libata crc32c_intel i2c_core pps_core devlink dm_mirror dm_region_hash dmd [ 245.729344] CPU: 38 PID: 6 Comm: kworker/u368:0 Tainted: G W 4.14.0-rc1+ #7 [ 245.739128] Hardware name: Dell Inc. 
PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016 [ 245.748234] Workqueue: ib_addr process_one_req [ib_core] [ 245.754905] task: ffff88018cb245c0 task.stack: ffffc9000009c000 [ 245.762256] RIP: 0010:prefetch_freepointer.isra.65+0x11/0x20 [ 245.769313] RSP: 0018:ffffc9000009fcc0 EFLAGS: 00010286 [ 245.775881] RAX: 0000000000000000 RBX: cccccccccccccccc RCX: 0000000000001793 [ 245.784591] RDX: 0000000000001792 RSI: cccccccccccccccc RDI: ffff88018fc07aa0 [ 245.793294] RBP: ffffc9000009fcc0 R08: 000000000001ed40 R09: ffff8810098cccc0 [ 245.802002] R10: ffffffff818a99e0 R11: 00000000010098cd R12: 00000000014080c0 [ 245.810706] R13: ffffffffa07bd1e0 R14: ffff88018fc07a80 R15: ffff88018fc07a80 [ 245.819409] FS: 0000000000000000(0000) GS:ffff88103eac0000(0000) knlGS:0000000000000000 [ 245.829184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 245.836342] CR2: 000014940a9b7140 CR3: 00000010119b5000 CR4: 00000000001606e0 [ 245.845056] Call Trace: [ 245.848524] kmem_cache_alloc_trace+0xa0/0x1c0 [ 245.854220] nvme_rdma_cm_handler+0x4e0/0x7ab [nvme_rdma] [ 245.860990] addr_handler+0xa4/0x1c0 [rdma_cm] [ 245.866694] process_one_req+0x8d/0x120 [ib_core] [ 245.872687] process_one_work+0x149/0x360 [ 245.877899] worker_thread+0x4d/0x3c0 [ 245.882720] kthread+0x109/0x140 [ 245.887051] ? rescuer_thread+0x380/0x380 [ 245.892244] ? 
kthread_park+0x60/0x60 [ 245.897054] ret_from_fork+0x25/0x30 [ 245.901760] Code: 31 d2 e8 b3 ea ff ff 5b 41 5c 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 85 f6 48 89 e5 74 0a 48 63 07 <48> 8b 04 06 0f 18 08 5d c3 66 0f 1f 44 00 00 0f 1f 44 00 0 [ 245.924349] RIP: prefetch_freepointer.isra.65+0x11/0x20 RSP: ffffc9000009fcc0 [ 245.933145] ---[ end trace 56749c1831388ff9 ]--- [ 245.942680] Kernel panic - not syncing: Fatal exception [ 245.950207] Kernel Offset: disabled [ 245.958566] ---[ end Kernel panic - not syncing: Fatal exception [ 245.966082] ------------[ cut here ]------------ [ 245.972014] WARNING: CPU: 38 PID: 6 at kernel/sched/core.c:1179 set_task_cpu+0x191/0x1a0 [ 245.981822] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ibd [ 246.066533] mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci ptp libata crc32c_intel i2c_core pps_core devlink dm_mirror dm_region_hash dmd [ 246.089836] CPU: 38 PID: 6 Comm: kworker/u368:0 Tainted: G D W 4.14.0-rc1+ #7 [ 246.099683] Hardware name: Dell Inc. 
PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016 [ 246.108849] Workqueue: ib_addr process_one_req [ib_core] [ 246.115566] task: ffff88018cb245c0 task.stack: ffffc9000009c000 [ 246.122948] RIP: 0010:set_task_cpu+0x191/0x1a0 [ 246.128668] RSP: 0018:ffff88103eac3c38 EFLAGS: 00010046 [ 246.135255] RAX: 0000000000000100 RBX: ffff88207bf445c0 RCX: 0000000000000001 [ 246.143978] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88207bf445c0 [ 246.152699] RBP: ffff88103eac3c58 R08: 0000000000000001 R09: 0000000000000000 [ 246.161418] R10: 0000000000000001 R11: 0000000003e236eb R12: ffff88207bf4516c [ 246.170137] R13: 0000000000000001 R14: 0000000000000001 R15: 000000000001b900 [ 246.178854] FS: 0000000000000000(0000) GS:ffff88103eac0000(0000) knlGS:0000000000000000 [ 246.188644] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 246.195812] CR2: 000014940a9b7140 CR3: 00000010119b5000 CR4: 00000000001606e0 [ 246.204540] Call Trace: [ 246.208027] <IRQ> [ 246.211016] try_to_wake_up+0x166/0x470 [ 246.216036] default_wake_function+0x12/0x20 [ 246.221537] __wake_up_common+0x8a/0x160 [ 246.226641] __wake_up_locked+0x16/0x20 [ 246.231643] ep_poll_callback+0xd0/0x300 [ 246.236727] __wake_up_common+0x8a/0x160 [ 246.241817] __wake_up_common_lock+0x7e/0xc0 [ 246.247291] __wake_up+0x13/0x20 [ 246.251596] wake_up_klogd_work_func+0x40/0x60 [ 246.257265] irq_work_run_list+0x4d/0x70 [ 246.262353] ? 
tick_sched_do_timer+0x70/0x70 [ 246.267830] irq_work_tick+0x40/0x50 [ 246.272530] update_process_times+0x42/0x60 [ 246.277912] tick_sched_handle+0x2d/0x60 [ 246.282987] tick_sched_timer+0x39/0x70 [ 246.287945] __hrtimer_run_queues+0xe5/0x230 [ 246.293371] hrtimer_interrupt+0xa8/0x1a0 [ 246.298509] smp_apic_timer_interrupt+0x5f/0x130 [ 246.304322] apic_timer_interrupt+0x9d/0xb0 [ 246.309640] </IRQ> [ 246.312633] RIP: 0010:panic+0x1fd/0x245 [ 246.317554] RSP: 0018:ffffc9000009fb18 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 [ 246.326659] RAX: 0000000000000034 RBX: 0000000000000200 RCX: 0000000000000006 [ 246.335268] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff88103eace030 [ 246.343856] RBP: ffffc9000009fb88 R08: 0000000000000000 R09: 0000000000000877 [ 246.352424] R10: 00000000000003ff R11: 0000000000000001 R12: ffffffff81a3e1d8 [ 246.360975] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88018fc07a80 [ 246.369508] ? panic+0x1f6/0x245 [ 246.373657] oops_end+0xb8/0xd0 [ 246.377676] die+0x42/0x50 [ 246.381194] do_general_protection+0xd2/0x160 [ 246.386540] ? nvme_rdma_cm_handler+0x4e0/0x7ab [nvme_rdma] [ 246.393238] general_protection+0x22/0x30 [ 246.398181] RIP: 0010:prefetch_freepointer.isra.65+0x11/0x20 [ 246.404964] RSP: 0018:ffffc9000009fcc0 EFLAGS: 00010286 [ 246.411258] RAX: 0000000000000000 RBX: cccccccccccccccc RCX: 0000000000001793 [ 246.419692] RDX: 0000000000001792 RSI: cccccccccccccccc RDI: ffff88018fc07aa0 [ 246.428115] RBP: ffffc9000009fcc0 R08: 000000000001ed40 R09: ffff8810098cccc0 [ 246.436543] R10: ffffffff818a99e0 R11: 00000000010098cd R12: 00000000014080c0 [ 246.444970] R13: ffffffffa07bd1e0 R14: ffff88018fc07a80 R15: ffff88018fc07a80 [ 246.453402] ? 
nvme_rdma_cm_handler+0x4e0/0x7ab [nvme_rdma]
[ 246.460087] kmem_cache_alloc_trace+0xa0/0x1c0
[ 246.465511] nvme_rdma_cm_handler+0x4e0/0x7ab [nvme_rdma]
[ 246.472004] addr_handler+0xa4/0x1c0 [rdma_cm]
[ 246.477424] process_one_req+0x8d/0x120 [ib_core]
[ 246.483128] process_one_work+0x149/0x360
[ 246.488045] worker_thread+0x4d/0x3c0
[ 246.492577] kthread+0x109/0x140
[ 246.496620] ? rescuer_thread+0x380/0x380
[ 246.501540] ? kthread_park+0x60/0x60
[ 246.506070] ret_from_fork+0x25/0x30
[ 246.510496] Code: ff 80 8b ac 08 00 00 04 e9 23 ff ff ff 0f ff e9 bf fe ff ff f7 83 84 00 00 00 fd ff ff ff 0f 84 c9 fe ff ff 0f ff e9 c2 fe ff ff <0f> ff e9 d1 fe ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 0
[ 246.532545] ---[ end trace 56749c1831388ffa ]---

> can you please apply the following patch and report if you see a warning?
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 92a03ff5fb4d..ef50b58b0bb6 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -274,7 +274,7 @@ static int nvme_rdma_reinit_request(void *data,
> struct request *rq)
>         struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
>         int ret = 0;
>
> -       ib_dereg_mr(req->mr);
> +       WARN_ON_ONCE(ib_dereg_mr(req->mr));
>
>         req->mr = ib_alloc_mr(dev->pd, IB_MR_TYPE_MEM_REG,
>                         ctrl->max_fr_pages);
> --
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Sep 24, 2017 at 05:28:30PM +0800, Yi Zhang wrote:
> > Is it possible that ib_dereg_mr failed?
> >
> It seems not, and finally the system get panic, here is the log:

I looked on the issue during the weekend and didn't see any suspicious
commit in the mlx4 alloc/mapping area.

Can I ask you to perform git bisect to find the problematic change?

Added Tariq to the thread.

Thanks
On 09/24/2017 06:34 PM, Leon Romanovsky wrote:
> On Sun, Sep 24, 2017 at 05:28:30PM +0800, Yi Zhang wrote:
>>> Is it possible that ib_dereg_mr failed?
>>>
>> It seems not, and finally the system get panic, here is the log:
> I looked on the issue during the weekend and didn't see any suspicious
> commit in the mlx4 alloc/mapping area.
>
> Can I ask you to perform git bisect to find the problematic change?

Hi Sagi,

I ran git bisect for this issue; it seems to have been introduced by your
patch series "Few more patches from the centralization set".
Here are the test results for the patches; let me know if you need more info.

BAD  148b4e7 nvme-rdma: stop queues instead of simply flipping their state
BAD  a57bd54 nvme-rdma: introduce configure/destroy io queues
Log:
[ 127.899255] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[ 128.074263] nvme nvme0: creating 40 I/O queues.
[ 128.581822] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[ 177.486110] print_req_error: I/O error, dev nvme0n1, sector 0
[ 191.256637] nvme nvme0: Reconnecting in 10 seconds...
[ 201.855846] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 201.863353] nvme nvme0: rdma_resolve_addr wait failed (-104).
[ 201.869824] nvme nvme0: Failed reconnect attempt 1
[ 201.875183] nvme nvme0: Reconnecting in 10 seconds...
[ 212.087828] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 212.095330] nvme nvme0: rdma_resolve_addr wait failed (-104).
[ 212.101766] nvme nvme0: Failed reconnect attempt 2
[ 212.107129] nvme nvme0: Reconnecting in 10 seconds...
[ 222.328398] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 222.335900] nvme nvme0: rdma_resolve_addr wait failed (-104).
[ 222.342335] nvme nvme0: Failed reconnect attempt 3
[ 222.347699] nvme nvme0: Reconnecting in 10 seconds...
[ 232.567791] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[ 232.575292] nvme nvme0: rdma_resolve_addr wait failed (-104).
[ 232.581730] nvme nvme0: Failed reconnect attempt 4 [ 232.587094] nvme nvme0: Reconnecting in 10 seconds... [ 242.827727] nvme nvme0: creating 40 I/O queues. [ 242.832810] DMAR: ERROR: DMA PTE for vPFN 0xe129b already set (to 103c692002 not 1000915003) [ 242.842265] ------------[ cut here ]------------ [ 242.847437] WARNING: CPU: 0 PID: 783 at drivers/iommu/intel-iommu.c:2299 __domain_mapping+0x363/0x370 [ 242.857755] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ibd [ 242.936919] mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci crc32c_intel ptp libata i2c_core pps_core devlink dm_mirror dm_region_hash dmd [ 242.958625] CPU: 0 PID: 783 Comm: kworker/u368:1 Not tainted 4.13.0-rc7.a57bd54+ #15 [ 242.967304] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016 [ 242.975687] Workqueue: ib_addr process_one_req [ib_core] [ 242.981631] task: ffff881019491740 task.stack: ffffc9000b534000 [ 242.989011] RIP: 0010:__domain_mapping+0x363/0x370 [ 242.995108] RSP: 0018:ffffc9000b537a50 EFLAGS: 00010202 [ 243.001694] RAX: 0000000000000004 RBX: 0000001000915003 RCX: 0000000000000000 [ 243.010433] RDX: 0000000000000000 RSI: ffff88103e60e038 RDI: ffff88103e60e038 [ 243.019170] RBP: ffffc9000b537ab0 R08: 0000000000000000 R09: 0000000000000000 [ 243.027893] R10: 00000000000002f7 R11: 0000000001000915 R12: ffff88201ea5c4d8 [ 243.036632] R13: 0000000000000001 R14: 0000000000000001 R15: 00000000000e129b [ 243.045348] FS: 0000000000000000(0000) GS:ffff88103e600000(0000) knlGS:0000000000000000 [ 243.055142] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 243.062337] CR2: 00007f9d3b0fab70 CR3: 00000020245b7000 CR4: 00000000001406f0 [ 243.071076] Call Trace: [ 243.074564] __intel_map_single+0xeb/0x180 [ 243.079903] intel_alloc_coherent+0xb5/0x130 [ 243.085445] 
mlx4_buf_alloc+0xe5/0x1c0 [mlx4_core] [ 243.091555] mlx4_ib_alloc_cq_buf.isra.9+0x38/0xd0 [mlx4_ib] [ 243.098621] mlx4_ib_create_cq+0x223/0x440 [mlx4_ib] [ 243.104901] ? find_gid.isra.5+0x167/0x1f0 [ib_core] [ 243.111178] ib_alloc_cq+0x49/0x170 [ib_core] [ 243.116791] nvme_rdma_cm_handler+0x3e7/0x886 [nvme_rdma] [ 243.123557] ? cma_attach_to_dev+0x17/0x50 [rdma_cm] [ 243.129838] ? cma_acquire_dev+0x1e3/0x3b0 [rdma_cm] [ 243.136115] ? account_entity_dequeue+0xaa/0xe0 [ 243.141918] addr_handler+0xa4/0x1c0 [rdma_cm] [ 243.147604] process_one_req+0x8d/0x120 [ib_core] [ 243.153585] process_one_work+0x149/0x360 [ 243.158807] worker_thread+0x4d/0x3c0 [ 243.163618] kthread+0x109/0x140 [ 243.167936] ? rescuer_thread+0x380/0x380 [ 243.173131] ? kthread_park+0x60/0x60 [ 243.177930] ret_from_fork+0x25/0x30 [ 243.182641] Code: f1 a9 81 4c 89 5d a0 4c 89 4d a8 e8 0b 58 c1 ff 8b 05 f2 16 88 00 4c 8b 4d a8 4c 8b 5d a0 85 c0 74 09 83 e8 01 89 05 dd 16 88 00 <0f> ff e9 bc fd ff ff e8 21 3d bb ff 90 0f 1f 44 00 00 55 4 [ 243.205193] ---[ end trace 725c2de52628c061 ]--- [ 243.211723] mlx4_core 0000:04:00.0: dma_pool_free mlx4_cmd, cccccccccccccccc/cc242fcccccc2801 (bad dma) [ 243.211724] mlx4_core 0000:04:00.0: dma_pool_free mlx4_cmd, cccccccccccccccc/cccccccccccccccc (bad dma) [ 243.212312] general protection fault: 0000 [#1] SMP [ 243.212312] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc rpcrdma ib_isert iscsi_target_mod ibd [ 243.212339] mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core tg3 ahci libahci crc32c_intel ptp libata i2c_core pps_core devlink dm_mirror dm_region_hash dmd [ 243.212353] CPU: 36 PID: 783 Comm: kworker/u368:1 Tainted: G W 4.13.0-rc7.a57bd54+ #15 [ 243.212353] Hardware name: Dell Inc. 
PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016 [ 243.212360] Workqueue: ib_addr process_one_req [ib_core] [ 243.212361] task: ffff881019491740 task.stack: ffffc9000b534000 [ 243.212364] RIP: 0010:kmem_cache_alloc_trace+0x7d/0x1b0 [ 243.212364] RSP: 0018:ffffc9000b537c88 EFLAGS: 00010282 [ 243.212365] RAX: 0000000000000000 RBX: 00000000014080c0 RCX: 0000000000006b83 [ 243.212366] RDX: 0000000000006b82 RSI: 00000000014080c0 RDI: ffff88018fc07a80 [ 243.212366] RBP: ffffc9000b537cc0 R08: 000000000001ed40 R09: 0000000000000000 [ 243.212373] R10: ffff88018fc07a80 R11: 000000000103c693 R12: 00000000014080c0 [ 243.212374] R13: ffffffffa08b317f R14: ffff88018fc07a80 R15: cccccccccccccccc [ 243.212375] FS: 0000000000000000(0000) GS:ffff88103ea80000(0000) knlGS:0000000000000000 [ 243.212375] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 243.212376] CR2: 00007f736b47c000 CR3: 000000201b770000 CR4: 00000000001406e0 [ 243.212377] Call Trace: [ 243.212379] nvme_rdma_cm_handler+0x4ef/0x886 [nvme_rdma] [ 243.212382] ? cma_attach_to_dev+0x17/0x50 [rdma_cm] [ 243.212383] ? nvme_rdma_memreg_done+0x30/0x30 [nvme_rdma] [ 243.212385] addr_handler+0xa4/0x1c0 [rdma_cm] [ 243.212390] process_one_req+0x8d/0x120 [ib_core] [ 243.212398] process_one_work+0x149/0x360 [ 243.212399] worker_thread+0x4d/0x3c0 [ 243.212400] kthread+0x109/0x140 [ 243.212401] ? rescuer_thread+0x380/0x380 [ 243.212403] ? 
kthread_park+0x60/0x60
[ 243.212404] ret_from_fork+0x25/0x30
[ 243.212405] Code: 4c 03 05 6f 12 dd 7e 4d 8b 38 49 8b 40 10 4d 85 ff 0f 84 ec 00 00 00 48 85 c0 0f 84 e3 00 00 00 49 63 42 20 48 8d 4a 01 4d 8b 02 <49> 8b 1c 07 4c 89 f8 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 4
[ 243.212420] RIP: kmem_cache_alloc_trace+0x7d/0x1b0 RSP: ffffc9000b537c88
[ 243.212519] ---[ end trace 725c2de52628c062 ]---
[ 243.216792] Kernel panic - not syncing: Fatal exception
[ 243.216878] Kernel Offset: disabled
[ 243.583898] ---[ end Kernel panic - not syncing: Fatal exception

Panic after connection with the commits below; detailed log here:
https://pastebin.com/7z0XSGSd
31fdf18 nvme-rdma: reuse configure/destroy_admin_queue
3f02fff nvme-rdma: don't free tagset on resets
18398af nvme-rdma: disable the controller on resets
b28a308 nvme-rdma: move tagset allocation to a dedicated routine

good 34b6c23 nvme: Add admin_tagset pointer to nvme_ctrl

Thanks
Yi

> Added Tariq to the thread.
>
> Thanks
>
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 92a03ff5fb4d..ef50b58b0bb6 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -274,7 +274,7 @@ static int nvme_rdma_reinit_request(void *data, struct request *rq)
        struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
        int ret = 0;

-       ib_dereg_mr(req->mr);
+       WARN_ON_ONCE(ib_dereg_mr(req->mr));

        req->mr = ib_alloc_mr(dev->pd, IB_MR_TYPE_MEM_REG,
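
The debug patch above wraps the deregistration call in WARN_ON_ONCE() so that a non-zero return value is reported instead of being silently dropped. As a rough userspace illustration of that pattern (not the kernel implementation), the sketch below uses a simplified stand-in macro and a hypothetical stub, fake_dereg_mr(), in place of ib_dereg_mr(). The macro body, the stub, and the counters are all assumptions made for this example, and the macro relies on GNU statement expressions (GCC/Clang).

```c
#include <assert.h>
#include <stdio.h>

/* Counts how many warnings the stand-in macro has printed. */
static int warn_count;

/*
 * Simplified userspace stand-in for WARN_ON_ONCE() (an assumption, not
 * the kernel's definition): the condition is evaluated on every call,
 * a message is printed only the first time it is true at this call
 * site, and the macro yields the truth value of the condition.
 */
#define WARN_ON_ONCE(cond) ({                                         \
	static int warned_;                                           \
	int ret_ = !!(cond);                                          \
	if (ret_ && !warned_) {                                       \
		warned_ = 1;                                          \
		warn_count++;                                         \
		fprintf(stderr, "WARNING: %s:%d\n", __FILE__, __LINE__); \
	}                                                             \
	ret_;                                                         \
})

/* Counts calls to the stub, to show the wrapped call always runs. */
static int dereg_calls;

/* Hypothetical stub for ib_dereg_mr(): returns 0 or a negative error. */
static int fake_dereg_mr(int fail)
{
	dereg_calls++;
	return fail ? -1 : 0;
}

/* Mimics reinitialising n requests; returns how many deregs failed. */
static int reinit_requests(int n, int fail)
{
	int failures = 0;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		/* The call still happens every time; only the report is once. */
		if (WARN_ON_ONCE(fake_dereg_mr(fail)))
			failures++;
	}
	return failures;
}
```

Calling reinit_requests(3, 1) from a main() would print a single warning while still counting three failures, which is the behaviour the debug patch relies on: a failing deregistration shows up once in the log without flooding it.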