From patchwork Thu Nov 9 18:49:58 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Bart Van Assche X-Patchwork-Id: 10051677 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 0E35860632 for ; Thu, 9 Nov 2017 18:50:15 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 019512B06C for ; Thu, 9 Nov 2017 18:50:14 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E832F2B06B; Thu, 9 Nov 2017 18:50:14 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=2.0 tests=BAYES_00,DKIM_SIGNED, RCVD_IN_DNSWL_HI,T_DKIM_INVALID autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 14D0E2B066 for ; Thu, 9 Nov 2017 18:50:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752790AbdKISuM (ORCPT ); Thu, 9 Nov 2017 13:50:12 -0500 Received: from esa4.hgst.iphmx.com ([216.71.154.42]:43041 "EHLO esa4.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753761AbdKISuF (ORCPT ); Thu, 9 Nov 2017 13:50:05 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1510253406; x=1541789406; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=48Cx+Cjzy0uTiOYNNw7jjR0I3GqQd05F0ZrTbQz2y8U=; b=U8Y84tF0c0VBifgP6f4H/NzVzJmCgfNrn9GhbM2haaTuPCroa3NoiryJ yJp7LAM75ie96fNZlM7iROMKIt5n1t3R304FarZ0mT1y9p/LwuugtqWYL yX5oiHlAZv7rY5dvg0fSKryx+YAEFHGIxOkY1Ep6sEkoK4XQsNap2Y/xZ dC2H/SRY04EuPXrt/nwWAgiwU6Ly+Y6bVp2OaJgX6mWnp4YF9xf3niDzS j+Pt5bpC5JqWDIH2hEJVTX6WhxU+fD2/eS7t3+TJhjPSdiXN0xhc4gR1E qLYjU+7F22VHlpgxOJau8Q1TcufhxpHjm5AV6OrX0X1CDdQXSmfTilXPT A==; X-IronPort-AV: E=Sophos;i="5.44,370,1505750400"; d="scan'208";a="61801482" Received: from h199-255-45-15.hgst.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 10 Nov 2017 02:50:04 +0800 Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP; 09 Nov 2017 10:47:26 -0800 Received: from unknown (HELO MILHUBIP04.sdcorp.global.sandisk.com) ([10.177.9.97]) by uls-op-cesaip02.wdc.com with ESMTP; 09 Nov 2017 10:50:03 -0800 Received: from milsmgip11.sandisk.com (10.177.9.6) by MILHUBIP04.sdcorp.global.sandisk.com (10.177.9.97) with Microsoft SMTP Server id 14.3.319.2; Thu, 9 Nov 2017 10:50:02 -0800 X-AuditID: 0ab10959-41c5898000002c97-f6-5a04a359ecbe Received: from thinkpad-bart.int.fusionio.com ( [10.177.9.6]) by (Symantec Messaging Gateway) with SMTP id 45.AE.11415.953A40A5; Thu, 9 Nov 2017 10:50:01 -0800 (PST) From: Bart Van Assche To: Jens Axboe CC: , , "Christoph Hellwig" , "Martin K . Petersen" , Oleksandr Natalenko , Ming Lei , Martin Steigerwald , "Bart Van Assche" , Johannes Thumshirn Subject: [PATCH v12 6/7] block, scsi: Make SCSI quiesce and resume work reliably Date: Thu, 9 Nov 2017 10:49:58 -0800 Message-ID: <20171109184959.4617-7-bart.vanassche@wdc.com> X-Mailer: git-send-email 2.14.3 In-Reply-To: <20171109184959.4617-1-bart.vanassche@wdc.com> References: <20171109184959.4617-1-bart.vanassche@wdc.com> X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFjrLJMWRmVeSWpSXmKPExsXCtZGTTTdyMUuUwa+Jxhar7/azWVz684XR YuXqo0wWj7tmsFrsvaVt0X19B5vF8uP/mCyWzGpmsjg0GUgsfLGJ2YHL4/LZUo+J6/k8dt9s YPN4eYnD4+PTWywe7/ddZfPYfLra4/MmOY/2A91MAZxRXDYpqTmZZalF+nYJXBnLls9iLFgU XHHg63SmBsZzLl2MnBwSAiYSp6dPZuxi5OIQEljJKLF52mwmkASbgJ7EqXn7wGwRAQWJnt8r 2UBsZoE3TBIPv0mC2MICARI/e5aygNgsAqoSV9a+YgSxeQWsJHaePMQGsUBeYu3be2BxTgFr ib975gLFOYCWWUnsvs8zgZF7ASPDKkax3Myc4tz0zAJDQ73ixLyUzOJsveT83E2MkACL3MH4 dKL5IUYmDk6pBka+P1nC5vcX7c88d3+x/QH/ZundN51eXNy+aOnqk/W/XOv0NRuOHLzhkxuv VV6/c1Ff/R3bpcsUZ6q4l9lIuPwwyItgUHxvzhJ5vP1nt3m1G+senYoYZdbXH3nyQ91OrpSQ O2Wv7uFh0Rpx3KQ7feqc3AkPWU86Wi2IYki5wyHcnHtL+fnSSCWW4oxEQy3mouJEAN/hf6Hg AQAA MIME-Version: 1.0 Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The contexts from which a SCSI device can be quiesced or resumed are: * Writing into /sys/class/scsi_device/*/device/state. * SCSI parallel (SPI) domain validation. * The SCSI device power management methods. See also scsi_bus_pm_ops. It is essential during suspend and resume that neither the filesystem state nor the filesystem metadata in RAM changes. This is why while the hibernation image is being written or restored that SCSI devices are quiesced. The SCSI core quiesces devices through scsi_device_quiesce() and scsi_device_resume(). In the SDEV_QUIESCE state execution of non-preempt requests is deferred. This is realized by returning BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI devices. Avoid that a full queue prevents power management requests to be submitted by deferring allocation of non-preempt requests for devices in the quiesced state. This patch has been tested by running the following commands and by verifying that after each resume the fio job was still running: for ((i=0; i<10; i++)); do ( cd /sys/block/md0/md && while true; do [ "$( sync_action sleep 1 done ) & pids=($!) for d in /sys/class/block/sd*[a-z]; do bdev=${d#/sys/class/block/} hcil=$(readlink "$d/device") hcil=${hcil#../../../} echo 4 > "$d/queue/nr_requests" echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth" fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \ --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16 \ --iodepth_batch=1 --thread --loops=$((2**31)) & pids+=($!) done sleep 1 echo "$(date) Hibernating ..." >>hibernate-test-log.txt systemctl hibernate sleep 10 kill "${pids[@]}" echo idle > /sys/block/md0/md/sync_action wait echo "$(date) Done." >>hibernate-test-log.txt done Reported-by: Oleksandr Natalenko References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348). Signed-off-by: Bart Van Assche Reviewed-by: Hannes Reinecke Tested-by: Martin Steigerwald Tested-by: Oleksandr Natalenko Cc: Martin K. Petersen Cc: Ming Lei Cc: Christoph Hellwig Cc: Johannes Thumshirn --- block/blk-core.c | 42 ++++++++++++++++++++++++++++++++++-------- block/blk-mq.c | 4 ++-- drivers/scsi/scsi_lib.c | 42 ++++++++++++++++++++++++++++++------------ fs/block_dev.c | 4 ++-- include/linux/blkdev.h | 2 +- include/scsi/scsi_device.h | 1 + 6 files changed, 70 insertions(+), 25 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index edc276899116..29b08428ae45 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -374,6 +374,7 @@ void blk_clear_preempt_only(struct request_queue *q) spin_lock_irqsave(q->queue_lock, flags); queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q); + wake_up_all(&q->mq_freeze_wq); spin_unlock_irqrestore(q->queue_lock, flags); } EXPORT_SYMBOL_GPL(blk_clear_preempt_only); @@ -795,15 +796,38 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) } EXPORT_SYMBOL(blk_alloc_queue); -int blk_queue_enter(struct request_queue *q, bool nowait) +/** + * blk_queue_enter() - try to increase q->q_usage_counter + * @q: request queue pointer + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT + */ +int blk_queue_enter(struct request_queue *q, unsigned int flags) { + const bool preempt = flags & BLK_MQ_REQ_PREEMPT; + while (true) { + bool success = false; int ret; - if (percpu_ref_tryget_live(&q->q_usage_counter)) + rcu_read_lock_sched(); + if (percpu_ref_tryget_live(&q->q_usage_counter)) { + /* + * The code that sets the PREEMPT_ONLY flag is + * responsible for ensuring that that flag is globally + * visible before the queue is unfrozen. + */ + if (preempt || !blk_queue_preempt_only(q)) { + success = true; + } else { + percpu_ref_put(&q->q_usage_counter); + } + } + rcu_read_unlock_sched(); + + if (success) return 0; - if (nowait) + if (flags & BLK_MQ_REQ_NOWAIT) return -EBUSY; /* @@ -816,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait) smp_rmb(); ret = wait_event_interruptible(q->mq_freeze_wq, - !atomic_read(&q->mq_freeze_depth) || + (atomic_read(&q->mq_freeze_depth) == 0 && + (preempt || !blk_queue_preempt_only(q))) || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; @@ -1445,8 +1470,7 @@ static struct request *blk_old_get_request(struct request_queue *q, /* create ioc upfront */ create_io_context(gfp_mask, q->node); - ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) || - (op & REQ_NOWAIT)); + ret = blk_queue_enter(q, flags); if (ret) return ERR_PTR(ret); spin_lock_irq(q->queue_lock); @@ -2267,8 +2291,10 @@ blk_qc_t generic_make_request(struct bio *bio) current->bio_list = bio_list_on_stack; do { struct request_queue *q = bio->bi_disk->queue; + unsigned int flags = bio->bi_opf & REQ_NOWAIT ? + BLK_MQ_REQ_NOWAIT : 0; - if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) { + if (likely(blk_queue_enter(q, flags) == 0)) { struct bio_list lower, same; /* Create a fresh bio_list for all subordinate requests */ @@ -2327,7 +2353,7 @@ blk_qc_t direct_make_request(struct bio *bio) if (!generic_make_request_checks(bio)) return BLK_QC_T_NONE; - if (unlikely(blk_queue_enter(q, nowait))) { + if (unlikely(blk_queue_enter(q, nowait ? BLK_MQ_REQ_NOWAIT : 0))) { if (nowait && !blk_queue_dying(q)) bio->bi_status = BLK_STS_AGAIN; else diff --git a/block/blk-mq.c b/block/blk-mq.c index 2c7ceb34a016..4fb187570e96 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -389,7 +389,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, struct request *rq; int ret; - ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT); + ret = blk_queue_enter(q, flags); if (ret) return ERR_PTR(ret); @@ -428,7 +428,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, if (hctx_idx >= q->nr_hw_queues) return ERR_PTR(-EIO); - ret = blk_queue_enter(q, true); + ret = blk_queue_enter(q, flags); if (ret) return ERR_PTR(ret); diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 1e94081cc791..91016208d107 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -2941,21 +2941,37 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev) int scsi_device_quiesce(struct scsi_device *sdev) { + struct request_queue *q = sdev->request_queue; int err; + /* + * It is allowed to call scsi_device_quiesce() multiple times from + * the same context but concurrent scsi_device_quiesce() calls are + * not allowed. + */ + WARN_ON_ONCE(sdev->quiesced_by && sdev->quiesced_by != current); + + blk_set_preempt_only(q); + + blk_mq_freeze_queue(q); + /* + * Ensure that the effect of blk_set_preempt_only() will be visible + * for percpu_ref_tryget() callers that occur after the queue + * unfreeze even if the queue was already frozen before this function + * was called. See also https://lwn.net/Articles/573497/. + */ + synchronize_rcu(); + blk_mq_unfreeze_queue(q); + mutex_lock(&sdev->state_mutex); err = scsi_device_set_state(sdev, SDEV_QUIESCE); + if (err == 0) + sdev->quiesced_by = current; + else + blk_clear_preempt_only(q); mutex_unlock(&sdev->state_mutex); - if (err) - return err; - - scsi_run_queue(sdev->request_queue); - while (atomic_read(&sdev->device_busy)) { - msleep_interruptible(200); - scsi_run_queue(sdev->request_queue); - } - return 0; + return err; } EXPORT_SYMBOL(scsi_device_quiesce); @@ -2975,9 +2991,11 @@ void scsi_device_resume(struct scsi_device *sdev) * device deleted during suspend) */ mutex_lock(&sdev->state_mutex); - if (sdev->sdev_state == SDEV_QUIESCE && - scsi_device_set_state(sdev, SDEV_RUNNING) == 0) - scsi_run_queue(sdev->request_queue); + WARN_ON_ONCE(!sdev->quiesced_by); + sdev->quiesced_by = NULL; + blk_clear_preempt_only(sdev->request_queue); + if (sdev->sdev_state == SDEV_QUIESCE) + scsi_device_set_state(sdev, SDEV_RUNNING); mutex_unlock(&sdev->state_mutex); } EXPORT_SYMBOL(scsi_device_resume); diff --git a/fs/block_dev.c b/fs/block_dev.c index 222824ecd480..4a181fcb5175 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -662,7 +662,7 @@ int bdev_read_page(struct block_device *bdev, sector_t sector, if (!ops->rw_page || bdev_get_integrity(bdev)) return result; - result = blk_queue_enter(bdev->bd_queue, false); + result = blk_queue_enter(bdev->bd_queue, 0); if (result) return result; result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false); @@ -698,7 +698,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector, if (!ops->rw_page || bdev_get_integrity(bdev)) return -EOPNOTSUPP; - result = blk_queue_enter(bdev->bd_queue, false); + result = blk_queue_enter(bdev->bd_queue, 0); if (result) return result; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 2147e2381a22..402c9d536ae1 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -959,7 +959,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t, extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, struct scsi_ioctl_command __user *); -extern int blk_queue_enter(struct request_queue *q, bool nowait); +extern int blk_queue_enter(struct request_queue *q, unsigned int flags); extern void blk_queue_exit(struct request_queue *q); extern void blk_start_queue(struct request_queue *q); extern void blk_start_queue_async(struct request_queue *q); diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 67c5a9f223f7..59ea1e4a4a50 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -220,6 +220,7 @@ struct scsi_device { unsigned char access_state; struct mutex state_mutex; enum scsi_device_state sdev_state; + struct task_struct *quiesced_by; unsigned long sdev_data[0]; } __attribute__((aligned(sizeof(unsigned long))));