From patchwork Fri Nov 11 18:47:21 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 9423433
To: Bart Van Assche
Cc: "linux-block@vger.kernel.org", Christoph Hellwig, Keith Busch
From: Jens Axboe
Subject: Regression: nvme timeouts and oopses
Date: Fri, 11 Nov 2016 11:47:21 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
Sender: linux-block-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-block@vger.kernel.org

Hi,

I've been running into problems when stability testing my 4.10 branch,
and I finally got an easy reproducer today (on the laptop, no less) and
was able to bisect it.
Boils down to this:

2253efc850c4cf690516bbc07854eeb1077202ba is the first bad commit
commit 2253efc850c4cf690516bbc07854eeb1077202ba
Author: Bart Van Assche
Date:   Fri Oct 28 17:20:02 2016 -0700

    blk-mq: Move more code into blk_mq_direct_issue_request()

The symptoms are one of two things:

1) We get command timeouts:

nvme nvme0: I/O 567 QID 14 timeout, aborting
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O 567 QID 14 timeout, reset controller
nvme nvme0: completing aborted command with status:
blk_update_request: I/O error, dev nvme0n1, sector
EXT4-fs warning (device nvme0n1): ext4_end_bio:314: I/O g to inode 20185097 (offset 0 size 8388608 starting block
Buffer I/O error on device nvme0n1, logical block 247040
Buffer I/O error on device nvme0n1, logical block 247041
Buffer I/O error on device nvme0n1, logical block 247042
Buffer I/O error on device nvme0n1, logical block 247043
Buffer I/O error on device nvme0n1, logical block 247044
Buffer I/O error on device nvme0n1, logical block 247045
Buffer I/O error on device nvme0n1, logical block 247046
Buffer I/O error on device nvme0n1, logical block 247047
Buffer I/O error on device nvme0n1, logical block 247048
Buffer I/O error on device nvme0n1, logical block 247049

No corruption though, the data has been written.

2) We oops in __blk_mq_complete_request(), because __nvme_process_cq()
   -> blk_mq_complete_request() -> __blk_mq_complete_request() gets a
   request that has NULL ->q, ->bio, ->biotail, etc.

I did a manual revert of the patch, see below, and it seems to work fine
with this applied. I'll take a look at why this is, since it isn't
immediately obvious to me. Sending this to get it out there while I take
a deeper look.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d180c989a0e5..365ae17c3f2b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1291,11 +1291,11 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	return rq;
 }
 
-static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-				      struct request *rq, blk_qc_t *cookie)
+static int blk_mq_direct_issue_request(struct request *rq, blk_qc_t *cookie)
 {
 	int ret;
 	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
 		.list = NULL,
@@ -1303,9 +1303,6 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	};
 	blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
 
-	if (blk_mq_hctx_stopped(hctx))
-		goto insert;
-
 	/*
 	 * For OK queue, we are done. For error, kill it. Any other
 	 * error (busy), just add it to our list as we previously
@@ -1314,7 +1311,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	ret = q->mq_ops->queue_rq(hctx, &bd);
 	if (ret == BLK_MQ_RQ_QUEUE_OK) {
 		*cookie = new_cookie;
-		return;
+		return 0;
 	}
 
 	__blk_mq_requeue_request(rq);
@@ -1323,11 +1320,10 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = BLK_QC_T_NONE;
 		rq->errors = -EIO;
 		blk_mq_end_request(rq, rq->errors);
-		return;
+		return 0;
 	}
 
-insert:
-	blk_mq_insert_request(rq, false, true, true);
+	return -1;
 }
 
 /*
@@ -1414,11 +1410,15 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 
 		if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
 			rcu_read_lock();
-			blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
+			if (blk_mq_hctx_stopped(data.hctx) ||
+			    blk_mq_direct_issue_request(old_rq, &cookie) != 0)
+				blk_mq_insert_request(old_rq, false, true, true);
 			rcu_read_unlock();
 		} else {
 			srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
-			blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
+			if (blk_mq_hctx_stopped(data.hctx) ||
+			    blk_mq_direct_issue_request(old_rq, &cookie) != 0)
+				blk_mq_insert_request(old_rq, false, true, true);
 			srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
 		}
 		goto done;
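
For anyone reading without the tree handy, the control flow that the
revert restores can be modelled in userspace roughly as follows. This is
only a sketch under stated assumptions: struct hctx_model, direct_issue(),
issue_or_insert() and the QUEUE_* values are made-up stand-ins, not the
blk-mq API. It is meant to show one thing: the direct-issue helper only
reports 0 (request issued, or ended with an error) or -1 (busy), while the
stopped-queue check and the fallback insert are both done by the caller.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the BLK_MQ_RQ_QUEUE_* return codes. */
enum queue_ret { QUEUE_OK, QUEUE_BUSY, QUEUE_ERROR };

/* Hypothetical model of a hardware queue; not struct blk_mq_hw_ctx. */
struct hctx_model {
	bool stopped;			/* models blk_mq_hctx_stopped() */
	enum queue_ret next_ret;	/* what the driver's queue_rq would return */
	int issued;			/* requests accepted by the driver */
	int ended;			/* requests completed with an error */
	int inserted;			/* requests put on the software queue */
};

/* Models the helper after the revert: 0 = handled, -1 = caller must insert. */
static int direct_issue(struct hctx_model *h)
{
	switch (h->next_ret) {
	case QUEUE_OK:
		h->issued++;
		return 0;
	case QUEUE_ERROR:
		h->ended++;	/* request is ended with an error, nothing to insert */
		return 0;
	default:
		return -1;	/* busy: leave the request to the caller */
	}
}

/* Models the caller in the diff: stopped check first, insert as fallback. */
static void issue_or_insert(struct hctx_model *h)
{
	if (h->stopped || direct_issue(h) != 0)
		h->inserted++;
}
```

Note that a stopped queue short-circuits the || in the model, so the
driver is never called for it and the request goes straight to the insert
path.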