Subject: Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!
From: "Nicholas A. Bellinger"
To: Pascal de Bruijn
Cc: "Madhani, Himanshu", target-devel, "Tran, Quinn", s.kleijkers@unilogic.nl
Date: Sun, 30 Jul 2017 16:34:32 -0700
List: target-devel@vger.kernel.org

Hi Pascal & Co,

Apologies for the delayed follow up.

On Fri, 2017-07-21 at 15:07 +0200, Pascal de Bruijn wrote:
> We still haven't been able to reliably reproduce, however we did come
> across this, that's very likely related.
>
> Essentially we're running 'echo check > /sys/block/md0/md/sync_action'
> throughout the night, with no practical sync_speed_max limit.
>
> Only in the morning presumably when I was doing stuff in VMware (I don't
> particularly recall what), VMware started reporting the volume as
> inaccessible.
>
> Here's the full dmesg, to provide maximum context (sorry for the long post):

No worries. The more verbosity the better when it comes to stuff like this. ;)

Below is the analysis for the last logs.
> [Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288 > [Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288 > [Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_dpo attribute > [Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_fua_read attribute > [Thu Jul 20 18:01:51 2017] qla2xxx [0000:81:00.1]-00af:13: Performing ISP error recovery - ha=ffff9ebcd33c0000. > [Thu Jul 20 18:01:53 2017] qla2xxx [0000:81:00.1]-500a:13: LOOP UP detected (4 Gbps). > [Thu Jul 20 18:01:53 2017] qla2xxx [0000:82:00.0]-00af:14: Performing ISP error recovery - ha=ffff9ebcd3180000. > [Thu Jul 20 18:01:55 2017] qla2xxx [0000:82:00.0]-500a:14: LOOP UP detected (4 Gbps). > [Thu Jul 20 18:01:56 2017] qla2xxx [0000:82:00.1]-00af:15: Performing ISP error recovery - ha=ffff9ebcd3400000. > [Thu Jul 20 18:01:57 2017] qla2xxx [0000:82:00.1]-500a:15: LOOP UP detected (4 Gbps). > [Thu Jul 20 18:01:58 2017] qla2xxx [0000:81:00.0]-00af:2: Performing ISP error recovery - ha=ffff9ebcd4190000. > [Thu Jul 20 18:01:59 2017] qla2xxx [0000:81:00.0]-500a:2: LOOP UP detected (4 Gbps). > [Thu Jul 20 18:04:42 2017] md: data-check of RAID array md0 > [Thu Jul 20 18:04:42 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 18:04:42 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 18:04:42 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 18:40:14 2017] md: md0: data-check done. > [Thu Jul 20 18:56:49 2017] md: data-check of RAID array md0 > [Thu Jul 20 18:56:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 18:56:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 18:56:49 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 19:34:16 2017] md: md0: data-check done. 
> [Thu Jul 20 20:03:49 2017] md: data-check of RAID array md0 > [Thu Jul 20 20:03:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 20:03:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 20:03:49 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 20:04:51 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1223620 > [Thu Jul 20 20:04:52 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1223620 > [Thu Jul 20 20:08:00 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1237348 > [Thu Jul 20 20:08:01 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1237348 Ok, so there are the first two occurrences of outstanding I/O to the backend MD device taking long enough to trigger the ESX host side ABORT_TASK timeouts. Notice the timestamps between 'Found referenced qla2xxx task_tag ...', and 'Sending TMR_FUNCTION_COMPLETE'. This tells us the backend I/O to the md device only took ~1 second to complete back to target-core, once the ABORT_TASK was received. > [Thu Jul 20 20:40:04 2017] md: md0: data-check done. > [Thu Jul 20 21:10:49 2017] md: data-check of RAID array md0 > [Thu Jul 20 21:10:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 21:10:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 21:10:49 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 21:51:29 2017] md: md0: data-check done. > [Thu Jul 20 22:17:49 2017] md: data-check of RAID array md0 > [Thu Jul 20 22:17:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 22:17:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 22:17:49 2017] md: using 128k window, over a total of 405798912k. > [Thu Jul 20 22:53:37 2017] md: md0: data-check done. 
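As an aside, the ~1 second abort latency noted above can be computed mechanically from the dmesg timestamps. A rough Python sketch (the sample lines are copied from the log above; the helper names are mine, not anything from the target code) pairs each 'Found referenced' entry with its 'TMR_FUNCTION_COMPLETE' and reports the delta:

```python
# Sketch: pair ABORT_TASK "Found referenced" entries with their
# "TMR_FUNCTION_COMPLETE" lines and report how long the backend I/O took.
import re
from datetime import datetime

found = {}   # tag -> timestamp of "Found referenced"
deltas = {}  # tag -> seconds until TMR_FUNCTION_COMPLETE

def record(line):
    m = re.match(r'\[(.+?)\] ABORT_TASK: '
                 r'(Found referenced qla2xxx task_tag|'
                 r'Sending TMR_FUNCTION_COMPLETE for ref_tag): (\d+)', line)
    if not m:
        return
    ts = datetime.strptime(m.group(1), '%a %b %d %H:%M:%S %Y')
    tag = int(m.group(3))
    if m.group(2).startswith('Found'):
        found[tag] = ts
    elif tag in found:
        deltas[tag] = (ts - found[tag]).total_seconds()

log = """\
[Thu Jul 20 20:04:51 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1223620
[Thu Jul 20 20:04:52 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1223620
[Thu Jul 20 20:08:00 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1237348
[Thu Jul 20 20:08:01 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1237348
"""
for line in log.splitlines():
    record(line)

for tag, secs in deltas.items():
    print(f'tag {tag}: completed {secs:.0f}s after ABORT_TASK was received')
```

Fed the full dmesg instead of the sample, the same pairing would also surface tags that never complete at all.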
> [Thu Jul 20 23:24:50 2017] md: data-check of RAID array md0 > [Thu Jul 20 23:24:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Thu Jul 20 23:24:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Thu Jul 20 23:24:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 00:04:39 2017] md: md0: data-check done. > [Fri Jul 21 00:31:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 00:31:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 00:31:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 00:31:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 01:09:07 2017] md: md0: data-check done. > [Fri Jul 21 01:38:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 01:38:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 01:38:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 01:38:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 01:39:11 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1161652 > [Fri Jul 21 01:39:12 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1161652 > [Fri Jul 21 02:05:54 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1213252 > [Fri Jul 21 02:05:55 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1213252 > [Fri Jul 21 02:20:30 2017] md: md0: data-check done. > [Fri Jul 21 02:45:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 02:45:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 02:45:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 02:45:50 2017] md: using 128k window, over a total of 405798912k. 
> [Fri Jul 21 03:08:14 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1243780 > [Fri Jul 21 03:08:15 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1243780 And the second occurrence. > [Fri Jul 21 03:28:21 2017] md: md0: data-check done. > [Fri Jul 21 03:52:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 03:52:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 03:52:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 03:52:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 03:58:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1162612 > [Fri Jul 21 03:58:34 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1162612 And the third. > [Fri Jul 21 04:33:10 2017] md: md0: data-check done. > [Fri Jul 21 04:59:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 04:59:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 04:59:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 04:59:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 05:40:34 2017] md: md0: data-check done. > [Fri Jul 21 06:06:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 06:06:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 06:06:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > [Fri Jul 21 06:06:50 2017] md: using 128k window, over a total of 405798912k. > [Fri Jul 21 06:46:58 2017] md: md0: data-check done. > [Fri Jul 21 07:13:50 2017] md: data-check of RAID array md0 > [Fri Jul 21 07:13:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk. > [Fri Jul 21 07:13:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. 
> [Fri Jul 21 07:13:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 07:59:38 2017] md: md0: data-check done.
> [Fri Jul 21 08:20:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 08:20:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
> [Fri Jul 21 08:20:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 08:20:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 09:04:12 2017] md: md0: data-check done.
> [Fri Jul 21 09:27:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 09:27:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
> [Fri Jul 21 09:27:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 09:27:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 10:09:28 2017] md: md0: data-check done.
> [Fri Jul 21 10:34:50 2017] md: data-check of RAID array md0
> [Fri Jul 21 10:34:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
> [Fri Jul 21 10:34:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> [Fri Jul 21 10:34:50 2017] md: using 128k window, over a total of 405798912k.
> [Fri Jul 21 10:46:22 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1247284
> [Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1247284
> [Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1247236

And the fourth, plus a fifth which doesn't locate a tag (e.g., the response has already been sent).

> [Fri Jul 21 10:46:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1248148
> [Fri Jul 21 10:49:04 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1175812

And finally we reach the point where target-core receives ABORT_TASKs for two tags (1248148 + 1175812), and waits for the backend md device to complete them.
> [Fri Jul 21 10:50:22 2017] INFO: task kworker/u34:2:4840 blocked for more than 180 seconds. > [Fri Jul 21 10:50:22 2017] Not tainted 4.9.38-ul5 #3 > [Fri Jul 21 10:50:22 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Fri Jul 21 10:50:22 2017] kworker/u34:2 D 0 4840 2 0x00000000 > [Fri Jul 21 10:50:22 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod] > [Fri Jul 21 10:50:22 2017] 0000000000000000 ffff9eb4d5331880 ffff9ebccaccd240 ffff9ebcffcd5080 > [Fri Jul 21 10:50:22 2017] ffff9ebcdc5ca340 ffffb11c07d0fbe8 ffffffffa8092196 0000000000130b94 > [Fri Jul 21 10:50:22 2017] ffff9ebccaccd240 7fffffffffffffff 0000000000000000 ffff9ebccaccd240 > [Fri Jul 21 10:50:22 2017] Call Trace: > [Fri Jul 21 10:50:22 2017] [] ? __schedule+0x166/0x530 > [Fri Jul 21 10:50:22 2017] [] schedule+0x31/0x80 > [Fri Jul 21 10:50:22 2017] [] schedule_timeout+0x147/0x190 > [Fri Jul 21 10:50:22 2017] [] ? up+0x2d/0x50 > [Fri Jul 21 10:50:22 2017] [] ? irq_work_queue+0x85/0x90 > [Fri Jul 21 10:50:22 2017] [] wait_for_common+0xb0/0x170 > [Fri Jul 21 10:50:22 2017] [] ? wake_up_q+0x70/0x70 > [Fri Jul 21 10:50:22 2017] [] wait_for_completion+0x18/0x20 > [Fri Jul 21 10:50:22 2017] [] __transport_wait_for_tasks+0xac/0x110 [target_core_mod] > [Fri Jul 21 10:50:22 2017] [] transport_wait_for_tasks+0x4e/0x80 [target_core_mod] > [Fri Jul 21 10:50:22 2017] [] core_tmr_abort_task+0x119/0x190 [target_core_mod] > [Fri Jul 21 10:50:22 2017] [] target_tmr_work+0x111/0x120 [target_core_mod] > [Fri Jul 21 10:50:22 2017] [] process_one_work+0x121/0x3c0 > [Fri Jul 21 10:50:22 2017] [] worker_thread+0x43/0x4d0 > [Fri Jul 21 10:50:22 2017] [] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:50:22 2017] [] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:50:22 2017] [] kthread+0xd4/0xf0 > [Fri Jul 21 10:50:22 2017] [] ? 
kthread_worker_fn+0x100/0x100 > [Fri Jul 21 10:50:22 2017] [] ret_from_fork+0x22/0x30 First hung task for the tmr-iblock process, that is waiting for one of the two tags to complete from md back to target-core. > [Fri Jul 21 10:53:23 2017] INFO: task kworker/13:1:3063 blocked for more than 180 seconds. > [Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3 > [Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Fri Jul 21 10:53:23 2017] kworker/13:1 D 0 3063 2 0x00000000 > [Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx] > [Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9ebccd9c2340 ffff9ebcffd55080 > [Fri Jul 21 10:53:23 2017] ffff9ebcdc5cde00 ffffb11c2736fb28 ffffffffa8092196 ffffffffa7ca4a37 > [Fri Jul 21 10:53:23 2017] ffff9ebccd9c2340 7fffffffffffffff 0000000000000000 ffff9ebccd9c2340 > [Fri Jul 21 10:53:23 2017] Call Trace: > [Fri Jul 21 10:53:23 2017] [] ? __schedule+0x166/0x530 > [Fri Jul 21 10:53:23 2017] [] ? __enqueue_entity+0x67/0x70 > [Fri Jul 21 10:53:23 2017] [] schedule+0x31/0x80 > [Fri Jul 21 10:53:23 2017] [] schedule_timeout+0x147/0x190 > [Fri Jul 21 10:53:23 2017] [] ? sched_clock_cpu+0x8f/0xa0 > [Fri Jul 21 10:53:23 2017] [] ? check_preempt_curr+0x4f/0x80 > [Fri Jul 21 10:53:23 2017] [] wait_for_common+0xb0/0x170 > [Fri Jul 21 10:53:23 2017] [] ? wake_up_q+0x70/0x70 > [Fri Jul 21 10:53:23 2017] [] wait_for_completion+0x18/0x20 > [Fri Jul 21 10:53:23 2017] [] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod] > [Fri Jul 21 10:53:23 2017] [] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx] > [Fri Jul 21 10:53:23 2017] [] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx] > [Fri Jul 21 10:53:23 2017] [] ? insert_work+0x53/0xc0 > [Fri Jul 21 10:53:23 2017] [] process_one_work+0x121/0x3c0 > [Fri Jul 21 10:53:23 2017] [] worker_thread+0x43/0x4d0 > [Fri Jul 21 10:53:23 2017] [] ? process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:53:23 2017] [] ? 
process_one_work+0x3c0/0x3c0 > [Fri Jul 21 10:53:23 2017] [] kthread+0xd4/0xf0 > [Fri Jul 21 10:53:23 2017] [] ? kthread_worker_fn+0x100/0x100 > [Fri Jul 21 10:53:23 2017] [] ret_from_fork+0x22/0x30 This is a hung task warning for the process responsible for doing qla2xxx FC session reinstatement. It's blocked (as expected) waiting for the backend I/O to complete for the ABORT_TASK in the tmr-iblock process. > [Fri Jul 21 10:53:23 2017] INFO: task kworker/u33:0:3806 blocked for more than 180 seconds. > [Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3 > [Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [Fri Jul 21 10:53:23 2017] kworker/u33:0 D 0 3806 2 0x00000000 > [Fri Jul 21 10:53:23 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod] > [Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9eb4d5331880 ffff9eb468909780 ffff9eb4dfd15080 > [Fri Jul 21 10:53:23 2017] ffff9eb4dc68e9c0 ffffb11c26d47be8 ffffffffa8092196 000000000011f104 > [Fri Jul 21 10:53:23 2017] ffff9eb468909780 7fffffffffffffff 0000000000000000 ffff9eb468909780 > [Fri Jul 21 10:53:23 2017] Call Trace: > [Fri Jul 21 10:53:23 2017] [] ? __schedule+0x166/0x530 > [Fri Jul 21 10:53:23 2017] [] schedule+0x31/0x80 > [Fri Jul 21 10:53:23 2017] [] schedule_timeout+0x147/0x190 > [Fri Jul 21 10:53:23 2017] [] ? up+0x2d/0x50 > [Fri Jul 21 10:53:23 2017] [] ? irq_work_queue+0x85/0x90 > [Fri Jul 21 10:53:23 2017] [] wait_for_common+0xb0/0x170 > [Fri Jul 21 10:53:23 2017] [] ? 
wake_up_q+0x70/0x70
> [Fri Jul 21 10:53:23 2017] [] wait_for_completion+0x18/0x20
> [Fri Jul 21 10:53:23 2017] [] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
> [Fri Jul 21 10:53:23 2017] [] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
> [Fri Jul 21 10:53:23 2017] [] core_tmr_abort_task+0x119/0x190 [target_core_mod]
> [Fri Jul 21 10:53:23 2017] [] target_tmr_work+0x111/0x120 [target_core_mod]
> [Fri Jul 21 10:53:23 2017] [] process_one_work+0x121/0x3c0
> [Fri Jul 21 10:53:23 2017] [] worker_thread+0x43/0x4d0
> [Fri Jul 21 10:53:23 2017] [] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:53:23 2017] [] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:53:23 2017] [] kthread+0xd4/0xf0
> [Fri Jul 21 10:53:23 2017] [] ? kthread_worker_fn+0x100/0x100
> [Fri Jul 21 10:53:23 2017] [] ret_from_fork+0x22/0x30

Second hung task warning for a tmr-iblock process, for the second outstanding I/O to the md backend device, waiting for completion.

> [Fri Jul 21 10:56:23 2017] INFO: task kworker/2:1:20395 blocked for more than 180 seconds.
> [Fri Jul 21 10:56:23 2017] Not tainted 4.9.38-ul5 #3
> [Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Fri Jul 21 10:56:23 2017] kworker/2:1 D 0 20395 2 0x00000000
> [Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
> [Fri Jul 21 10:56:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9eb468940bc0 ffff9eb4dfc95080
> [Fri Jul 21 10:56:23 2017] ffff9eb4dc689780 ffffb11c077e7b28 ffffffffa8092196 ffffffffa7ca4a37
> [Fri Jul 21 10:56:23 2017] ffff9eb468940bc0 7fffffffffffffff 0000000000000000 ffff9eb468940bc0
> [Fri Jul 21 10:56:23 2017] Call Trace:
> [Fri Jul 21 10:56:23 2017] [] ? __schedule+0x166/0x530
> [Fri Jul 21 10:56:23 2017] [] ? __enqueue_entity+0x67/0x70
> [Fri Jul 21 10:56:23 2017] [] schedule+0x31/0x80
> [Fri Jul 21 10:56:23 2017] [] schedule_timeout+0x147/0x190
> [Fri Jul 21 10:56:23 2017] [] ?
sched_clock_cpu+0x8f/0xa0
> [Fri Jul 21 10:56:23 2017] [] ? check_preempt_curr+0x4f/0x80
> [Fri Jul 21 10:56:23 2017] [] wait_for_common+0xb0/0x170
> [Fri Jul 21 10:56:23 2017] [] ? wake_up_q+0x70/0x70
> [Fri Jul 21 10:56:23 2017] [] wait_for_completion+0x18/0x20
> [Fri Jul 21 10:56:23 2017] [] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
> [Fri Jul 21 10:56:23 2017] [] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
> [Fri Jul 21 10:56:23 2017] [] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
> [Fri Jul 21 10:56:23 2017] [] ? insert_work+0x53/0xc0
> [Fri Jul 21 10:56:23 2017] [] process_one_work+0x121/0x3c0
> [Fri Jul 21 10:56:23 2017] [] worker_thread+0x43/0x4d0
> [Fri Jul 21 10:56:23 2017] [] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:56:23 2017] [] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:56:23 2017] [] kthread+0xd4/0xf0
> [Fri Jul 21 10:56:23 2017] [] ? kthread_worker_fn+0x100/0x100
> [Fri Jul 21 10:56:23 2017] [] ret_from_fork+0x22/0x30

This is a hung task warning for (another) process responsible for doing qla2xxx FC session reinstatement. As before, it's blocked (as expected) waiting for the backend I/O to complete for the ABORT_TASK in the tmr-iblock process.

> I hope this helps, let me know if you need anything from our end.
>

According to the hung task warnings for the two tmr-iblock processes, it would mean target-core is never getting I/O responses back from the backend driver. That is, there is never a 'Sending TMR_FUNCTION_COMPLETE for ref_tag' message for the two tags (1248148 + 1175812) undergoing ABORT_TASK, which would tend to indicate the md backend device is never completing I/O back to target-core for some reason.

To verify this is really the case, here is one more debug patch for you so we can see the se_cmd state once the hung task occurs. Please re-run this along with the original patches.
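For what it's worth, the "two tags never complete" reading can be checked mechanically against any dmesg capture. A hedged Python sketch (the function name is mine; the sample lines are lifted from the log above) reports tags that got a 'Found referenced' but never a matching 'TMR_FUNCTION_COMPLETE':

```python
# Sketch: find ABORT_TASK tags that were located ("Found referenced") but
# never acknowledged with TMR_FUNCTION_COMPLETE, i.e. backend I/O still
# outstanding when the capture ended.
import re

def outstanding_tags(dmesg_text):
    found, completed = set(), set()
    for line in dmesg_text.splitlines():
        m = re.search(r'Found referenced qla2xxx task_tag: (\d+)', line)
        if m:
            found.add(int(m.group(1)))
        m = re.search(r'TMR_FUNCTION_COMPLETE for ref_tag: (\d+)', line)
        if m:
            completed.add(int(m.group(1)))
    return sorted(found - completed)

sample = """\
[Fri Jul 21 10:46:22 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1247284
[Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1247284
[Fri Jul 21 10:46:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1248148
[Fri Jul 21 10:49:04 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1175812
"""
print(outstanding_tags(sample))  # the two tags stuck in the md backend
```

Run against the full log, this should print exactly the two tags the tmr-iblock processes are hung on.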
Also, after thinking more about the earlier report of the 'D' state processes, my guess is you're hitting the same type of problem, where md is presumably not completing I/Os back to target-core.

So all that said, I think it's safe to push part of the original qla2xxx patch to address the original BUG_ON() you reported, as AFAICT that is separate from whatever else is going on here with md. I'll send out a smaller version of that patch shortly.

diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 077344c..1263612 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2791,7 +2791,13 @@ void transport_clear_lun_ref(struct se_lun *lun)
 
 	spin_unlock_irqrestore(&cmd->t_state_lock, *flags);
 
-	wait_for_completion(&cmd->t_transport_stop_comp);
+	while (!wait_for_completion_timeout(&cmd->t_transport_stop_comp,
+					    180 * HZ)) {
+		pr_warn_ratelimited("se_cmd waiting CDB: 0x%02x tag: %llu t_state: %d "
+			"se_cmd_flags: 0x%08x transport_state: 0x%08x\n",
+			cmd->t_task_cdb[0], cmd->tag, cmd->t_state,
+			cmd->se_cmd_flags, cmd->transport_state);
+	}
 
 	spin_lock_irqsave(&cmd->t_state_lock, *flags);
 	cmd->transport_state &= ~(CMD_T_ACTIVE | CMD_T_STOP);
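To spell out what the debug patch changes: wait_for_completion() blocks indefinitely, while wait_for_completion_timeout() returns 0 (falsy) on timeout and non-zero once the completion fires, so looping on it turns a silent hang into a state dump every 180 seconds. The same shape can be mimicked in userspace with Python's threading.Event; this is only an illustrative analog of the pattern, not kernel code:

```python
# Userspace analog of the wait_for_completion_timeout() loop in the debug
# patch: poll a completion event with a timeout, logging while it's missing.
import threading

def wait_with_warnings(done, timeout, log):
    # Event.wait() returns False on timeout, True once set -- the same
    # shape as wait_for_completion_timeout() returning 0 vs. time left.
    while not done.wait(timeout):
        log.append('se_cmd still waiting')  # stands in for pr_warn_ratelimited()

done = threading.Event()
log = []
# Complete the "command" from another thread after ~3 timeout periods.
completer = threading.Timer(0.35, done.set)
completer.start()
wait_with_warnings(done, timeout=0.1, log=log)
completer.join()
print(f'warned {len(log)} times before the completion fired')
```

The point of the loop over a plain timeout is that it never gives up: the waiter still blocks until the I/O really completes, it just stops doing so silently.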