From patchwork Wed Aug 24 05:47:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952938 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E60A0C00140 for ; Wed, 24 Aug 2022 05:49:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232570AbiHXFtA (ORCPT ); Wed, 24 Aug 2022 01:49:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46482 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233212AbiHXFs7 (ORCPT ); Wed, 24 Aug 2022 01:48:59 -0400 Received: from out30-131.freemail.mail.aliyun.com (out30-131.freemail.mail.aliyun.com [115.124.30.131]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 36856861D0; Tue, 23 Aug 2022 22:48:54 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R621e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045168;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTYO_1661320119; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTYO_1661320119) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:52 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RFC PATCH 1/9] ublk_drv: check 'current' instead of 'ubq_daemon' Date: Wed, 24 Aug 2022 13:47:36 +0800 Message-Id: <20220824054744.77812-2-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org This check is not atomic. So with recovery feature, ubq_daemon may be updated simultaneously by recovery task. Instead, check 'current' is safe here because 'current' never changes. Also add comment explaining this check, which is really important for understanding recovery feature. Signed-off-by: ZiyangZhang Reviewed-by: Ming Lei --- drivers/block/ublk_drv.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 6a4a94b4cdf4..c39b67d7133d 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -645,14 +645,22 @@ static inline void __ublk_rq_task_work(struct request *req) struct ublk_device *ub = ubq->dev; int tag = req->tag; struct ublk_io *io = &ubq->ios[tag]; - bool task_exiting = current != ubq->ubq_daemon || ubq_daemon_is_dying(ubq); unsigned int mapped_bytes; pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, ublk_get_iod(ubq, req->tag)->addr); - if (unlikely(task_exiting)) { + /* + * Task is exiting if either: + * + * (1) current != ubq_daemon. + * io_uring_cmd_complete_in_task() tries to run task_work + * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING. + * + * (2) current->flags & PF_EXITING. 
+ */ + if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { blk_mq_end_request(req, BLK_STS_IOERR); mod_delayed_work(system_wq, &ub->monitor_work, 0); return; From patchwork Wed Aug 24 05:47:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952940 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B82DEC32793 for ; Wed, 24 Aug 2022 05:49:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233776AbiHXFtF (ORCPT ); Wed, 24 Aug 2022 01:49:05 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46570 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233451AbiHXFtE (ORCPT ); Wed, 24 Aug 2022 01:49:04 -0400 Received: from out30-54.freemail.mail.aliyun.com (out30-54.freemail.mail.aliyun.com [115.124.30.54]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C39408001A; Tue, 23 Aug 2022 22:48:56 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R891e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046051;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTdx_1661320132; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTdx_1661320132) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:53 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RFC PATCH 2/9] ublk_drv: refactor ublk_cancel_queue() Date: Wed, 24 Aug 2022 13:47:37 +0800 Message-Id: <20220824054744.77812-3-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Assume only a few FETCH_REQ ioucmds are sent to ublk_drv, then the ubq_daemon exits, We have to call io_uring_cmd_done() for all ioucmds received so that io_uring ctx will not leak. ublk_cancel_queue() may be called before START_DEV or after STOP_DEV, we decrease ubq->nr_io_ready and clear UBLK_IO_FLAG_ACTIVE so that we won't call io_uring_cmd_done() twice for one ioucmd to avoid UAF. Also clearing UBLK_IO_FLAG_ACTIVE makes the code more reasonable. 
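A minimal, self-contained sketch (a toy userspace model, not the driver code, with simplified stand-in struct and flag names) of the invariant the refactor keeps: every accepted FETCH_REQ bumped nr_io_ready and set ACTIVE, and the cancel path completes each such command exactly once, clearing ACTIVE and dropping nr_io_ready as it goes, so a repeated cancel pass becomes a no-op instead of a second io_uring_cmd_done() and a UAF.

#include <assert.h>
#include <stdbool.h>

#define TOY_Q_DEPTH	4
#define TOY_FLAG_ACTIVE	(1 << 0)

struct toy_io {
	unsigned int flags;
	bool done;			/* stands in for io_uring_cmd_done() */
};

struct toy_queue {
	int nr_io_ready;		/* how many FETCH_REQs were accepted */
	struct toy_io ios[TOY_Q_DEPTH];
};

static void toy_cancel_queue(struct toy_queue *q)
{
	int i;

	if (!q->nr_io_ready)		/* nothing fetched yet, nothing to do */
		return;

	for (i = 0; i < TOY_Q_DEPTH; i++) {
		struct toy_io *io = &q->ios[i];

		if (io->flags & TOY_FLAG_ACTIVE) {
			assert(!io->done);	/* never complete one cmd twice */
			io->done = true;	/* "io_uring_cmd_done()" */
			io->flags &= ~TOY_FLAG_ACTIVE;
			q->nr_io_ready--;
		}
	}
	assert(!q->nr_io_ready);	/* every accepted cmd was completed */
}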
Signed-off-by: ZiyangZhang --- drivers/block/ublk_drv.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index c39b67d7133d..e08f636b0b9d 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -967,18 +967,23 @@ static void ublk_cancel_queue(struct ublk_queue *ubq) { int i; - if (!ublk_queue_ready(ubq)) + if (!ubq->nr_io_ready) return; for (i = 0; i < ubq->q_depth; i++) { struct ublk_io *io = &ubq->ios[i]; - if (io->flags & UBLK_IO_FLAG_ACTIVE) + if (io->flags & UBLK_IO_FLAG_ACTIVE) { + pr_devel("%s: done old cmd: qid %d tag %d\n", + __func__, ubq->q_id, i); io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0); + io->flags &= ~UBLK_IO_FLAG_ACTIVE; + ubq->nr_io_ready--; + } } /* all io commands are canceled */ - ubq->nr_io_ready = 0; + WARN_ON_ONCE(ubq->nr_io_ready); } /* Cancel all pending commands, must be called after del_gendisk() returns */ From patchwork Wed Aug 24 05:47:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952939 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 24B00C32792 for ; Wed, 24 Aug 2022 05:49:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233648AbiHXFtE (ORCPT ); Wed, 24 Aug 2022 01:49:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46556 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233212AbiHXFtD (ORCPT ); Wed, 24 Aug 2022 01:49:03 -0400 Received: from out30-45.freemail.mail.aliyun.com (out30-45.freemail.mail.aliyun.com [115.124.30.45]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 58E8F8689D; Tue, 23 Aug 2022 22:48:56 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R361e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTeA_1661320133; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTeA_1661320133) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:53 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RFC PATCH 3/9] ublk_drv: add a helper to get ioucmd from pdu Date: Wed, 24 Aug 2022 13:47:38 +0800 Message-Id: <20220824054744.77812-4-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org We store pointer of task_work in pdu. And we should get ioucmd from pdu since we prepare to only pass ioucmd to task_work function. 
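For readers less familiar with the pattern, the sketch below (plain userspace C with simplified stand-in structs, not the real io_uring_cmd layout) illustrates the container_of() round trip the new helper relies on: the pdu lives inside the ioucmd's pdu[] area, so a pointer to the pdu, or to a callback_head embedded in it, is enough to recover the enclosing ioucmd.

#include <stdio.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct callback_head {
	void (*func)(struct callback_head *);
};

/* simplified stand-ins, not the real kernel structs */
struct fake_io_uring_cmd {
	unsigned int cmd_op;
	unsigned int flags;
	char pdu[32];			/* driver-private area */
};

struct fake_cmd_pdu {
	int q_id;
	int tag;
	struct callback_head work;	/* what task_work_add() queues */
};

static void toy_task_work_fn(struct callback_head *work)
{
	/* callback_head -> pdu -> enclosing ioucmd, two container_of() steps */
	struct fake_cmd_pdu *pdu =
		container_of(work, struct fake_cmd_pdu, work);
	struct fake_io_uring_cmd *cmd =
		container_of((char *)pdu, struct fake_io_uring_cmd, pdu[0]);

	printf("qid %d tag %d op %u\n", pdu->q_id, pdu->tag, cmd->cmd_op);
}

int main(void)
{
	struct fake_io_uring_cmd cmd = { .cmd_op = 0x21 };
	struct fake_cmd_pdu *pdu = (struct fake_cmd_pdu *)cmd.pdu;

	pdu->q_id = 0;
	pdu->tag = 7;
	pdu->work.func = toy_task_work_fn;
	pdu->work.func(&pdu->work);	/* pretend the task_work ran */
	return 0;
}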
Signed-off-by: ZiyangZhang --- drivers/block/ublk_drv.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index e08f636b0b9d..8add6e3ae15f 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -555,6 +555,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu( return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu; } +static inline struct io_uring_cmd *ublk_uring_cmd_from_pdu( + struct ublk_uring_cmd_pdu *pdu) +{ + return container_of((u8 *)pdu, struct io_uring_cmd, pdu[0]); +} + static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq) { return ubq->ubq_daemon->flags & PF_EXITING; From patchwork Wed Aug 24 05:47:39 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952943 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3FF54C32792 for ; Wed, 24 Aug 2022 05:49:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232817AbiHXFtK (ORCPT ); Wed, 24 Aug 2022 01:49:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46634 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234045AbiHXFtH (ORCPT ); Wed, 24 Aug 2022 01:49:07 -0400 Received: from out30-44.freemail.mail.aliyun.com (out30-44.freemail.mail.aliyun.com [115.124.30.44]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2135F80F46; Tue, 23 Aug 2022 22:48:59 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045168;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTeJ_1661320133; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTeJ_1661320133) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:54 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism Date: Wed, 24 Aug 2022 13:47:39 +0800 Message-Id: <20220824054744.77812-5-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org If one rq is handled by io_uring_cmd_complete_in_task(), after a crash this rq is actually handled by an io_uring fallback wq. We have to end(abort) this rq since this fallback wq is a task other than the crashed task. However, current code does not call io_uring_cmd_done() at the same time but do it in ublk_cancel_queue(). With current design, this does work because ublk_cancel_queue() is called AFTER del_gendisk(), which waits for the rq ended(aborted) in fallback wq. This implies that fallback wq on this rq is scheduled BEFORE calling io_uring_cmd_done() on the corresponding ioucmd in ublk_cancel_queue(). However, while considering recovery feature, we cannot rely on del_gendisk() or blk_mq_freeze_queue() to wait for completion of all rqs because we may not want any aborted rq. 
Besides, io_uring does not provide "flush fallback" machenism so we cannot trace this ioucmd. The recovery machenism needs to complete all ioucmds of a dying ubq to avoid leaking io_uring ctx. But as talked above, we are unsafe to call io_uring_cmd_done() in the recovery task if fallback wq happens to run simultaneously. This is a UAF case because io_uring ctx may be freed. Actually a similar case happens in (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a re-issued request aborted previously to ioucmd's task_work). Besides, in order to implement recovery machenism, in ublk_queue_rq() and __ublk_rq_task_work(), we should not end(abort) current rq while ubq_daemon is dying. Instead, we should wait for new ubq_daemon getting ready and requeue it. In summary, We refactor some code to avoid the UAF problem and prepare for recovery machenism: (1) Refactor monitor_work Monitor_work is only used without recovery feature which aborts rqs and stops the device. Now ublk_abort_queue() is the only function end(abort) inflight rqs with a dying ubq_daemon. While for a not inflight(idle) rq, its ioucmd is completed by io_uring_cmd_done() safely. We do not rely on UBLK_IO_FLAG_ACTIVE anymore. monitor_work also sets 'force_abort' for a dying ubq. (2) Refactor ublk_queue_rq() Check 'force_abort' before blk_mq_start_request(). If 'force_abort' is true, end(abort) current rq immediately. 'force_abort' is set by monitor_work for a dying ubq. Aborting rqs ASAP ensures that all rqs' status do not change while we traverse rqs in monitor_work. (3) Refactor __ublk_rq_task_work(). No matter we use task_work_add() or io_uring_cmd_complete_in_task(), now __ublk_rq_task_work() only accepts one arg: ioucmd, which is set in ublk_queue_rq() eariler. And no matter ubq_daemon is dying or not, we always call io_uring_cmd_done(ioucmd). Note that ioucmd might not be the same as io->cmd because a new ubq_daemon may set new ioucmds before old fallback wq or exit_task_work runs. In this way the old ioucmd can be safely freed eventually and io->cmd can be updated without race. (4) Refactor ublk_cancel_dev() ublk_cancel_dev() cannot complete ioucmds for a dying ubq because we have done this in monitor_work. Signed-off-by: ZiyangZhang --- drivers/block/ublk_drv.c | 216 +++++++++++++++++++++------------------ 1 file changed, 115 insertions(+), 101 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 8add6e3ae15f..1b1c0190bba4 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -54,12 +54,10 @@ /* All UBLK_PARAM_TYPE_* should be included here */ #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD) -struct ublk_rq_data { - struct callback_head work; -}; - struct ublk_uring_cmd_pdu { - struct request *req; + int q_id; + int tag; + struct callback_head work; }; /* @@ -122,6 +120,7 @@ struct ublk_queue { bool abort_work_pending; unsigned short nr_io_ready; /* how many ios setup */ struct ublk_device *dev; + bool force_abort; struct ublk_io ios[0]; }; @@ -610,24 +609,6 @@ static void ublk_complete_rq(struct request *req) __blk_mq_end_request(req, BLK_STS_OK); } -/* - * Since __ublk_rq_task_work always fails requests immediately during - * exiting, __ublk_fail_req() is only called from abort context during - * exiting. So lock is unnecessary. - * - * Also aborting may not be started yet, keep in mind that one failed - * request may be issued by block layer again. 
- */ -static void __ublk_fail_req(struct ublk_io *io, struct request *req) -{ - WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE); - - if (!(io->flags & UBLK_IO_FLAG_ABORTED)) { - io->flags |= UBLK_IO_FLAG_ABORTED; - blk_mq_end_request(req, BLK_STS_IOERR); - } -} - static void ubq_complete_io_cmd(struct ublk_io *io, int res) { /* mark this cmd owned by ublksrv */ @@ -645,18 +626,14 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res) #define UBLK_REQUEUE_DELAY_MS 3 -static inline void __ublk_rq_task_work(struct request *req) +static inline void __ublk_rq_task_work(struct io_uring_cmd *cmd) { - struct ublk_queue *ubq = req->mq_hctx->driver_data; - struct ublk_device *ub = ubq->dev; - int tag = req->tag; - struct ublk_io *io = &ubq->ios[tag]; + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); + struct ublk_device *ub = cmd->file->private_data; + struct ublk_queue *ubq = ublk_get_queue(ub, pdu->q_id); + struct ublk_io *io; + struct request *req; unsigned int mapped_bytes; - - pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", - __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags, - ublk_get_iod(ubq, req->tag)->addr); - /* * Task is exiting if either: * @@ -667,11 +644,22 @@ static inline void __ublk_rq_task_work(struct request *req) * (2) current->flags & PF_EXITING. */ if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { - blk_mq_end_request(req, BLK_STS_IOERR); - mod_delayed_work(system_wq, &ub->monitor_work, 0); + pr_devel("%s: (%s) done old cmd: qid %d tag %d\n", + __func__, + current != ubq->ubq_daemon ? "fallback wq" : "exit_task_work", + pdu->q_id, pdu->tag); + io_uring_cmd_done(cmd, UBLK_IO_RES_ABORT, 0); return; } + io = &ubq->ios[pdu->tag]; + req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], pdu->tag); + WARN_ON_ONCE(!req); + + pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n", + __func__, cmd->cmd_op, ubq->q_id, req->tag, io->flags, + ublk_get_iod(ubq, req->tag)->addr); + if (ublk_need_get_data(ubq) && (req_op(req) == REQ_OP_WRITE || req_op(req) == REQ_OP_FLUSH)) { @@ -728,18 +716,16 @@ static inline void __ublk_rq_task_work(struct request *req) static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd) { - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); - - __ublk_rq_task_work(pdu->req); + __ublk_rq_task_work(cmd); } static void ublk_rq_task_work_fn(struct callback_head *work) { - struct ublk_rq_data *data = container_of(work, - struct ublk_rq_data, work); - struct request *req = blk_mq_rq_from_pdu(data); + struct ublk_uring_cmd_pdu *pdu = container_of(work, + struct ublk_uring_cmd_pdu, work); + struct io_uring_cmd *cmd = ublk_uring_cmd_from_pdu(pdu); - __ublk_rq_task_work(req); + __ublk_rq_task_work(cmd); } static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, @@ -747,6 +733,8 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, { struct ublk_queue *ubq = hctx->driver_data; struct request *rq = bd->rq; + struct ublk_io *io = &ubq->ios[rq->tag]; + struct ublk_uring_cmd_pdu *pdu; blk_status_t res; /* fill iod to slot in io cmd buffer */ @@ -754,43 +742,43 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, if (unlikely(res != BLK_STS_OK)) return BLK_STS_IOERR; - blk_mq_start_request(bd->rq); - - if (unlikely(ubq_daemon_is_dying(ubq))) { - fail: - mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0); + /* + * We set force_abort because we want to abort this rq ASAP. + * + * NOTE: At this moment, rq is NOT inflight. 
So after the ubq is marked + * as force_abort, no new rq can be inflight and we are safe to check + * all rqs' status in monitor_work and choose whether: + * + * (1) if inflight, end(abort) this rq; + * (2) if not inflight, io_uring_cmd_done() the ioucmd. + */ + if (ubq->force_abort) { + pr_devel("%s: force abort: qid %d tag %d io_flag %d\n", + __func__, ubq->q_id, rq->tag, io->flags); return BLK_STS_IOERR; } + blk_mq_start_request(bd->rq); + pr_devel("%s: start req: q_id %d tag %d io_flags %d\n", + __func__, ubq->q_id, rq->tag, io->flags); + pdu = ublk_get_uring_cmd_pdu(io->cmd); + pdu->q_id = ubq->q_id; + pdu->tag = rq->tag; + if (ublk_can_use_task_work(ubq)) { - struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); enum task_work_notify_mode notify_mode = bd->last ? TWA_SIGNAL_NO_IPI : TWA_NONE; - if (task_work_add(ubq->ubq_daemon, &data->work, notify_mode)) - goto fail; - } else { - struct ublk_io *io = &ubq->ios[rq->tag]; - struct io_uring_cmd *cmd = io->cmd; - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); + init_task_work(&pdu->work, ublk_rq_task_work_fn); - /* - * If the check pass, we know that this is a re-issued request aborted - * previously in monitor_work because the ubq_daemon(cmd's task) is - * PF_EXITING. We cannot call io_uring_cmd_complete_in_task() anymore - * because this ioucmd's io_uring context may be freed now if no inflight - * ioucmd exists. Otherwise we may cause null-deref in ctx->fallback_work. - * - * Note: monitor_work sets UBLK_IO_FLAG_ABORTED and ends this request(releasing - * the tag). Then the request is re-started(allocating the tag) and we are here. - * Since releasing/allocating a tag implies smp_mb(), finding UBLK_IO_FLAG_ABORTED - * guarantees that here is a re-issued request aborted previously. - */ - if ((io->flags & UBLK_IO_FLAG_ABORTED)) - goto fail; + /* If task_work_add() fails, we know that ubq_daemon(cmd's task) is PF_EXITING. */ + if (task_work_add(ubq->ubq_daemon, &pdu->work, notify_mode)) + pr_devel("%s: done old cmd: qid %d tag %d\n", + __func__, ubq->q_id, rq->tag); + io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0); - pdu->req = rq; - io_uring_cmd_complete_in_task(cmd, ublk_rq_task_work_cb); + } else { + io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb); } return BLK_STS_OK; @@ -814,20 +802,10 @@ static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data, return 0; } -static int ublk_init_rq(struct blk_mq_tag_set *set, struct request *req, - unsigned int hctx_idx, unsigned int numa_node) -{ - struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); - - init_task_work(&data->work, ublk_rq_task_work_fn); - return 0; -} - static const struct blk_mq_ops ublk_mq_ops = { .queue_rq = ublk_queue_rq, .commit_rqs = ublk_commit_rqs, .init_hctx = ublk_init_hctx, - .init_request = ublk_init_rq, }; static int ublk_ch_open(struct inode *inode, struct file *filp) @@ -906,31 +884,46 @@ static void ublk_commit_completion(struct ublk_device *ub, ublk_complete_rq(req); } -/* - * When ->ubq_daemon is exiting, either new request is ended immediately, - * or any queued io command is drained, so it is safe to abort queue - * lockless +/* do cleanup work for a dying(PF_EXITING) ubq_daemon: + * (1) end(abort) all 'inflight' rqs here. + * (2) complete ioucmd of all not 'inflight' rqs here. + * + * Note: we have set ubq->force_abort before. So ublk_queue_rq() must fail + * a rq immediately before it calls blk_mq_start_request() which will + * set a rq's status as 'inflight'. 
Therefore, set of 'inflight' + * rqs must not change for a dying ubq. + * + * Note: If we fail one rq in ublk_queue_rq(), we cannot fail it here again. + * This is because checking 'force_abort' happens before blk_mq_start_request() + * so its status is always 'idle' and never changes to 'inflight'. + * + * Also aborting may not be started yet, keep in mind that one failed + * request may be issued by block layer again. */ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq) { int i; + WARN_ON_ONCE(!ubq->force_abort); + if (!ublk_get_device(ub)) return; for (i = 0; i < ubq->q_depth; i++) { - struct ublk_io *io = &ubq->ios[i]; - - if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) { - struct request *rq; + struct request *rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i); + if (rq && blk_mq_request_started(rq)) { /* - * Either we fail the request or ublk_rq_task_work_fn + * Either we fail the request or ublk_queue_rq * will do it */ - rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i); - if (rq) - __ublk_fail_req(io, rq); + pr_devel("%s: abort request: qid %d tag %d\n", + __func__, ubq->q_id, i); + blk_mq_end_request(rq, BLK_STS_IOERR); + } else { + pr_devel("%s: done old cmd: qid %d tag %d\n", + __func__, ubq->q_id, i); + io_uring_cmd_done(ubq->ios[i].cmd, UBLK_IO_RES_ABORT, 0); } } ublk_put_device(ub); @@ -945,7 +938,18 @@ static void ublk_daemon_monitor_work(struct work_struct *work) for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { struct ublk_queue *ubq = ublk_get_queue(ub, i); - if (ubq_daemon_is_dying(ubq)) { + /* check force_abort so we do not abort a queue two times! */ + if (ubq->ubq_daemon && ubq_daemon_is_dying(ubq) && !ubq->force_abort) { + struct request_queue *q = ub->ub_disk->queue; + + ubq->force_abort = true; + + /* ensure that all ublk_queue_rq() calls see force_abort */ + if (blk_queue_has_srcu(q)) + synchronize_srcu(q->srcu); + else + synchronize_rcu(); + schedule_work(&ub->stop_work); /* abort queue is for making forward progress */ @@ -997,8 +1001,18 @@ static void ublk_cancel_dev(struct ublk_device *ub) { int i; - for (i = 0; i < ub->dev_info.nr_hw_queues; i++) + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { + struct ublk_queue *ubq = ublk_get_queue(ub, i); + + /* + * for ubq with a dying daemon, we have io_uring_cmd_done() all cmd in + * ublk_abort_queue(). Do not io_uring_cmd_done() cmd two times! 
+ */ + if (ubq->ubq_daemon && ubq_daemon_is_dying(ubq)) + continue; + ublk_cancel_queue(ublk_get_queue(ub, i)); + } } static void ublk_stop_dev(struct ublk_device *ub) @@ -1037,17 +1051,17 @@ static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id, int tag, struct io_uring_cmd *cmd) { struct ublk_queue *ubq = ublk_get_queue(ub, q_id); - struct request *req = blk_mq_tag_to_rq(ub->tag_set.tags[q_id], tag); + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); + + pdu->q_id = q_id; + pdu->tag = tag; if (ublk_can_use_task_work(ubq)) { - struct ublk_rq_data *data = blk_mq_rq_to_pdu(req); + init_task_work(&pdu->work, ublk_rq_task_work_fn); /* should not fail since we call it just in ubq->ubq_daemon */ - task_work_add(ubq->ubq_daemon, &data->work, TWA_SIGNAL_NO_IPI); + task_work_add(ubq->ubq_daemon, &pdu->work, TWA_SIGNAL_NO_IPI); } else { - struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); - - pdu->req = req; io_uring_cmd_complete_in_task(cmd, ublk_rq_task_work_cb); } } @@ -1320,7 +1334,7 @@ static int ublk_add_tag_set(struct ublk_device *ub) ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues; ub->tag_set.queue_depth = ub->dev_info.queue_depth; ub->tag_set.numa_node = NUMA_NO_NODE; - ub->tag_set.cmd_size = sizeof(struct ublk_rq_data); + ub->tag_set.cmd_size = 0; ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE; ub->tag_set.driver_data = ub; return blk_mq_alloc_tag_set(&ub->tag_set); From patchwork Wed Aug 24 05:47:40 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952946 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 807EAC3F6B0 for ; Wed, 24 Aug 2022 05:50:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234187AbiHXFuw (ORCPT ); Wed, 24 Aug 2022 01:50:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47218 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234844AbiHXFuh (ORCPT ); Wed, 24 Aug 2022 01:50:37 -0400 Received: from out30-130.freemail.mail.aliyun.com (out30-130.freemail.mail.aliyun.com [115.124.30.130]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 08600915FA; Tue, 23 Aug 2022 22:50:15 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R121e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045170;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTec_1661320134; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTec_1661320134) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:54 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RFC PATCH 5/9] ublk_drv: refactor ublk_stop_dev() Date: Wed, 24 Aug 2022 13:47:40 +0800 Message-Id: <20220824054744.77812-6-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Since we rely on rq's status to end(abort) it in monitor_work, calling del_gendisk() too early 
can result in UAF. So add blk_mq_freeze_queue() to end all inflight rqs. del_gendisk() can be then safely called with a frozen queue. Note: blk_mq_freeze_queue() hangs until: (1) all rqs are aborted and ioucmds are copmleted(freed) on a dying ubq by monitor_work scheduled. (2) all rqs are ended on other ubqs. We cancel monitor_work before del_gendisk() because monitor_work also touches rq's status. Monitor_work is guaranteed to be canceled because we mark ub's state as UBLK_S_DEV_DEAD. Signed-off-by: ZiyangZhang --- drivers/block/ublk_drv.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 1b1c0190bba4..df8751ea3711 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -1018,18 +1018,23 @@ static void ublk_cancel_dev(struct ublk_device *ub) static void ublk_stop_dev(struct ublk_device *ub) { mutex_lock(&ub->mutex); - if (ub->dev_info.state != UBLK_S_DEV_LIVE) - goto unlock; + /* No gendisk is live. ubq may be ready or not */ + if (ub->dev_info.state == UBLK_S_DEV_DEAD) + goto out_cancel_dev; - del_gendisk(ub->ub_disk); + mod_delayed_work(system_wq, &ub->monitor_work, 0); + pr_devel("%s: Wait for all requests ended...\n", __func__); + blk_mq_freeze_queue(ub->ub_disk->queue); ub->dev_info.state = UBLK_S_DEV_DEAD; + cancel_delayed_work_sync(&ub->monitor_work); + pr_devel("%s: All requests are ended.\n", __func__); + del_gendisk(ub->ub_disk); ub->dev_info.ublksrv_pid = -1; put_disk(ub->ub_disk); ub->ub_disk = NULL; - unlock: + out_cancel_dev: ublk_cancel_dev(ub); mutex_unlock(&ub->mutex); - cancel_delayed_work_sync(&ub->monitor_work); } /* device can only be started after all IOs are ready */ From patchwork Wed Aug 24 05:47:41 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952941 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 36BDAC32793 for ; Wed, 24 Aug 2022 05:49:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233827AbiHXFtI (ORCPT ); Wed, 24 Aug 2022 01:49:08 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46606 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233796AbiHXFtF (ORCPT ); Wed, 24 Aug 2022 01:49:05 -0400 Received: from out30-57.freemail.mail.aliyun.com (out30-57.freemail.mail.aliyun.com [115.124.30.57]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EA79986FD3; Tue, 23 Aug 2022 22:48:57 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R631e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046059;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTew_1661320134; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTew_1661320134) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:55 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RFC PATCH 6/9] ublk_drv: add pr_devel() to prepare for recovery feature Date: Wed, 24 Aug 2022 13:47:41 +0800 Message-Id: <20220824054744.77812-7-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 
In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Recovery feature needs to correctly complete all old ioucmds. So print some message for one ioucmd to track its status. After io_uring ctx is freed, the /dev/ublkcX file can be released. Print while opening/releasing the file struct to make sure there is no resource leakage. Signed-off-by: ZiyangZhang --- drivers/block/ublk_drv.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index df8751ea3711..4bbd97ccaedf 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -572,6 +572,9 @@ static void ublk_complete_rq(struct request *req) struct ublk_io *io = &ubq->ios[req->tag]; unsigned int unmapped_bytes; + pr_devel("%s complete req: qid %d tag %d io_flags %d\n", + __func__, ubq->q_id, req->tag, io->flags); + /* failed read IO if nothing is read */ if (!io->res && req_op(req) == REQ_OP_READ) io->res = -EIO; @@ -620,6 +623,9 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res) */ io->flags &= ~UBLK_IO_FLAG_ACTIVE; + pr_devel("%s: complete: op %d, res %d io_flags %x\n", + __func__, io->cmd->cmd_op, res, io->flags); + /* tell ublksrv one io request is coming */ io_uring_cmd_done(io->cmd, res, 0); } @@ -816,6 +822,7 @@ static int ublk_ch_open(struct inode *inode, struct file *filp) if (test_and_set_bit(UB_STATE_OPEN, &ub->state)) return -EBUSY; filp->private_data = ub; + pr_devel("%s: /dev/ublkc%d opened.\n", __func__, ub->dev_info.dev_id); return 0; } @@ -824,6 +831,7 @@ static int ublk_ch_release(struct inode *inode, struct file *filp) struct ublk_device *ub = filp->private_data; clear_bit(UB_STATE_OPEN, &ub->state); + pr_devel("%s: /dev/ublkc%d released.\n", __func__, ub->dev_info.dev_id); return 0; } @@ -942,6 +950,8 @@ static void ublk_daemon_monitor_work(struct work_struct *work) if (ubq->ubq_daemon && ubq_daemon_is_dying(ubq) && !ubq->force_abort) { struct request_queue *q = ub->ub_disk->queue; + pr_devel("%s: find dying ubq_daemon: qid %d\n", + __func__, ubq->q_id); ubq->force_abort = true; /* ensure that all ublk_queue_rq() calls see force_abort */ @@ -1043,12 +1053,16 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq) mutex_lock(&ub->mutex); ubq->nr_io_ready++; if (ublk_queue_ready(ubq)) { + pr_devel("%s: ubq %d ready\n", __func__, ubq->q_id); ubq->ubq_daemon = current; get_task_struct(ubq->ubq_daemon); ub->nr_queues_ready++; } - if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues) + if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues) { + pr_devel("%s: all queues(%d) ready\n", + __func__, ub->dev_info.nr_hw_queues); complete_all(&ub->completion); + } mutex_unlock(&ub->mutex); } From patchwork Wed Aug 24 05:47:42 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952942 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1FF60C00140 for ; Wed, 24 Aug 2022 05:49:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233620AbiHXFtH (ORCPT ); Wed, 24 Aug 2022 01:49:07 -0400 Received: from lindbergh.monkeyblade.net 
([23.128.96.19]:46584 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233729AbiHXFtE (ORCPT ); Wed, 24 Aug 2022 01:49:04 -0400 Received: from out30-44.freemail.mail.aliyun.com (out30-44.freemail.mail.aliyun.com [115.124.30.44]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D16D1876BB; Tue, 23 Aug 2022 22:48:58 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R171e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045170;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTfC_1661320135; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTfC_1661320135) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:55 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RFC PATCH 7/9] ublk_drv: define macros for recovery feature and check them Date: Wed, 24 Aug 2022 13:47:42 +0800 Message-Id: <20220824054744.77812-8-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org Define some macros for recovery feature. Especially define a new state: UBLK_S_DEV_RECOVERING which implies the ublk_device is recovering. UBLK_F_USER_RECOVERY implies that: (1) ublk_drv enables recovery feature. It won't schedule monitor_work to automatically abort rqs and release the device. Instead, it waits for user's START_USER_RECOVERY ctrl-cmd. (2) while re-initing a ubq, ublk_drv ends(aborts) rqs issued to userspace(ublksrv) before crash. (3) while re-initing a ubq, ublk_drv requeues rqs dispatched after crash. UBLK_F_USER_RECOVERY_REISSUE implies that: (1) everything UBLK_F_USER_RECOVERY implies except (2) ublk_drv requeues rqs issued to userspace(ublksrv) before crash. UBLK_F_USER_RECOVERY_REISSUE is designed for backends which: (1) tolerate double-writes because we may issue the same rq twice. (2) cannot let frontend users get I/O error, such as a RDONLY system. For now, we do not allow STOP_DEV while we are in UBLK_S_DEV_RECOVERING. This means that user must assign a new ubq_daemon for each ubq after sending START_USER_RECOVERY ctrl-cmd. Also modify checks on state in START_DEV and SET_PARAMS because now we have three states. 
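Putting the pieces together, a ublk server might drive the recovery handshake (the control commands themselves are added later in this series) roughly as below. This is only a sketch: send_ctrl_cmd() and start_queue_thread() are hypothetical placeholders for whatever control-command and per-queue I/O machinery the server already has, not functions from any real library.

#include <unistd.h>
#include <linux/ublk_cmd.h>	/* UBLK_CMD_*_USER_RECOVERY from this series */

/* hypothetical helpers, stand-ins for the server's existing machinery */
extern int send_ctrl_cmd(int dev_id, unsigned int cmd_op, __u64 data0);
extern int start_queue_thread(int dev_id, int q_id);	/* FETCH_REQ per tag */

static int recover_after_crash(int dev_id, int nr_queues)
{
	int ret, i;

	/*
	 * 1. The old process has crashed, so every ubq_daemon is exiting.
	 *    START_USER_RECOVERY quiesces the queue, requeues/aborts the old
	 *    requests and moves the device to UBLK_S_DEV_RECOVERING.
	 */
	ret = send_ctrl_cmd(dev_id, UBLK_CMD_START_USER_RECOVERY, 0);
	if (ret)
		return ret;	/* e.g. device not LIVE or daemons not dying */

	/*
	 * 2. From the new process: mmap the io_cmd_buf again and issue a
	 *    FETCH_REQ on every tag of every queue.
	 */
	for (i = 0; i < nr_queues; i++) {
		ret = start_queue_thread(dev_id, i);
		if (ret)
			return ret;
	}

	/*
	 * 3. END_USER_RECOVERY waits until all queues are ready again,
	 *    records the new ublksrv_pid, unquiesces the queue and moves
	 *    the device back to UBLK_S_DEV_LIVE.
	 */
	return send_ctrl_cmd(dev_id, UBLK_CMD_END_USER_RECOVERY, getpid());
}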
Signed-off-by: ZiyangZhang --- drivers/block/ublk_drv.c | 47 ++++++++++++++++++++++++++++++----- include/uapi/linux/ublk_cmd.h | 7 ++++++ 2 files changed, 48 insertions(+), 6 deletions(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 4bbd97ccaedf..0ee871fa3f92 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -49,7 +49,9 @@ /* All UBLK_F_* have to be included into UBLK_F_ALL */ #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \ | UBLK_F_URING_CMD_COMP_IN_TASK \ - | UBLK_F_NEED_GET_DATA) + | UBLK_F_NEED_GET_DATA \ + | UBLK_F_USER_RECOVERY \ + | UBLK_F_USER_RECOVERY_REISSUE) /* All UBLK_PARAM_TYPE_* should be included here */ #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD) @@ -322,6 +324,33 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id) PAGE_SIZE); } +/* + * TODO: UBLK_F_USER_RECOVERY should be a flag for device, not for queue, + * since "some queues are aborted while others are recovered" is really weird. + */ +static inline bool ublk_can_use_recovery(struct ublk_device *ub) +{ + struct ublk_queue *ubq = ublk_get_queue(ub, 0); + + if (ubq->flags & UBLK_F_USER_RECOVERY) + return true; + return false; +} + +/* + * TODO: UBLK_F_USER_RECOVERY_REISSUE should be a flag for device, not for queue, + * since "some queues are aborted while others are recovered" is really weird. + */ +static inline bool ublk_can_use_recovery_reissue(struct ublk_device *ub) +{ + struct ublk_queue *ubq = ublk_get_queue(ub, 0); + + if ((ubq->flags & UBLK_F_USER_RECOVERY) && + (ubq->flags & UBLK_F_USER_RECOVERY_REISSUE)) + return true; + return false; +} + static void ublk_free_disk(struct gendisk *disk) { struct ublk_device *ub = disk->private_data; @@ -1029,10 +1058,15 @@ static void ublk_stop_dev(struct ublk_device *ub) { mutex_lock(&ub->mutex); /* No gendisk is live. 
ubq may be ready or not */ - if (ub->dev_info.state == UBLK_S_DEV_DEAD) + if (ub->dev_info.state == UBLK_S_DEV_DEAD) { goto out_cancel_dev; - - mod_delayed_work(system_wq, &ub->monitor_work, 0); + /* TODO: support stop_dev just after start_recovery */ + } else if (ub->dev_info.state == UBLK_S_DEV_RECOVERING) { + goto out_unlock; + /* schedule monitor_work to abort any dying queue */ + } else { + mod_delayed_work(system_wq, &ub->monitor_work, 0); + } pr_devel("%s: Wait for all requests ended...\n", __func__); blk_mq_freeze_queue(ub->ub_disk->queue); ub->dev_info.state = UBLK_S_DEV_DEAD; @@ -1044,6 +1078,7 @@ static void ublk_stop_dev(struct ublk_device *ub) ub->ub_disk = NULL; out_cancel_dev: ublk_cancel_dev(ub); + out_unlock: mutex_unlock(&ub->mutex); } @@ -1403,7 +1438,7 @@ static int ublk_ctrl_start_dev(struct io_uring_cmd *cmd) schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD); mutex_lock(&ub->mutex); - if (ub->dev_info.state == UBLK_S_DEV_LIVE || + if (ub->dev_info.state != UBLK_S_DEV_DEAD || test_bit(UB_STATE_USED, &ub->state)) { ret = -EEXIST; goto out_unlock; @@ -1746,7 +1781,7 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd) /* parameters can only be changed when device isn't live */ mutex_lock(&ub->mutex); - if (ub->dev_info.state == UBLK_S_DEV_LIVE) { + if (ub->dev_info.state != UBLK_S_DEV_DEAD) { ret = -EACCES; } else if (copy_from_user(&ub->params, argp, ph.len)) { ret = -EFAULT; diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h index 677edaab2b66..7f7e6f44cec5 100644 --- a/include/uapi/linux/ublk_cmd.h +++ b/include/uapi/linux/ublk_cmd.h @@ -17,6 +17,8 @@ #define UBLK_CMD_STOP_DEV 0x07 #define UBLK_CMD_SET_PARAMS 0x08 #define UBLK_CMD_GET_PARAMS 0x09 +#define UBLK_CMD_START_USER_RECOVERY 0x10 +#define UBLK_CMD_END_USER_RECOVERY 0x11 /* * IO commands, issued by ublk server, and handled by ublk driver. 
@@ -74,9 +76,14 @@ */ #define UBLK_F_NEED_GET_DATA (1UL << 2) +#define UBLK_F_USER_RECOVERY (1UL << 3) + +#define UBLK_F_USER_RECOVERY_REISSUE (1UL << 4) + /* device state */ #define UBLK_S_DEV_DEAD 0 #define UBLK_S_DEV_LIVE 1 +#define UBLK_S_DEV_RECOVERING 2 /* shipped via sqe->cmd of io_uring command */ struct ublksrv_ctrl_cmd { From patchwork Wed Aug 24 05:47:43 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952945 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5C944C00140 for ; Wed, 24 Aug 2022 05:49:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234189AbiHXFtV (ORCPT ); Wed, 24 Aug 2022 01:49:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46706 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234164AbiHXFtL (ORCPT ); Wed, 24 Aug 2022 01:49:11 -0400 Received: from out30-54.freemail.mail.aliyun.com (out30-54.freemail.mail.aliyun.com [115.124.30.54]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 414B5861CD; Tue, 23 Aug 2022 22:49:00 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046051;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTfX_1661320136; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTfX_1661320136) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:56 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, ZiyangZhang Subject: [RFC PATCH 8/9] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support Date: Wed, 24 Aug 2022 13:47:43 +0800 Message-Id: <20220824054744.77812-9-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org START_USER_RECOVERY and END_USER_RECOVERY are two new control commands to support user recovery feature. After a crash, user should send START_USER_RECOVERY, it will: (1) check current ublk_device state and make sure all ubq_daemons are dying. We always expect crash of the ublksrv 'process', not exit of a single ubq_daemon 'thread'. (2) quiesce the request queue so here is no incoming ublk_queue_rq(). (3) reinit all ubqs, including: (a) put the dying task(thread). (b) requeue/abort(depends on user's setting, some backends can tolerate double writes) rqs issued to ublksrv before crash. (c) requeue rqs dispatched after crash. (d) complete ioucmds on tags not used. (4) convert state to UBLK_S_DEV_RECOVERING and reset ub->mm to NULL. Then, user should start a new 'process' and send FETCH_REQ on each ubq_daemon. After a crash, user should send END_USER_RECOVERY, it will: (1) wait for all new ubq_daemons getting ready. 
(2) unquiesce the request queue and expect incoming ublk_queue_rq() (2) convert state to UBLK_S_DEV_LIVE Signed-off-by: ZiyangZhang --- drivers/block/ublk_drv.c | 197 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 197 insertions(+) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 0ee871fa3f92..ccaf3761de74 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -1796,6 +1796,197 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd) return ret; } +/* + * reinit a ubq with a dying ubq_daemon for recovery. It must be called after ub is quiesced. + */ +static bool ublk_recover_rq(struct request *rq, void *data) +{ + struct ublk_queue *ubq = rq->mq_hctx->driver_data; + struct ublk_device *ub = ubq->dev; + struct ublk_io *io = &ubq->ios[rq->tag]; + + /* The request has been sent to ublksrv, and it is inflight. */ + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) { + pr_devel("%s: req has sent to ublksrv, %s it: qid %d tag %d io_flags %x\n", + __func__, + ublk_can_use_recovery_reissue(ub) ? "requeue" : "abort", + ubq->q_id, rq->tag, io->flags); + /* requeue rq but never call ublk_queue_rq() until the queue is unquiesced.*/ + if (ublk_can_use_recovery_reissue(ub)) { + blk_mq_requeue_request(rq, false); + blk_mq_delay_kick_requeue_list(rq->q, + UBLK_REQUEUE_DELAY_MS); + /* abort this req */ + } else { + /* + * Laterly, rq may be issued again but ublk_queue_rq() + * cannot be called until we unquiesce the queue. + */ + blk_mq_end_request(rq, BLK_STS_IOERR); + } + /* + * The request has not been sent to ublksrv, but it is inflight. So: + * + * (1) corresponding ublk_queue_rq() call happens after ubq_daemon is dying. + * + * (2) exit_task_work() runs task_work after ubq_daemon is dying; or + * + * (3) io_uring fallback_work is running in workqueue after ubq_daemon is dying. + * + * Therefore, we simply requeue the request. + * + * Note: + * (1) We do not io_uring_cmd_done() the old ioucmd here because (2) or (3) + * described above can do this. + * + * (2) We are safe to requeue rq since (2) or (3) only touches ioucmd(passed as + * an argument) first. Then they may detect that ubq_daemon is dying, return + * in advance and never touch rq(struct request), so no race on rq. + * + * (3) We requeue rq but ublk_queue_rq() is not called until the queue is + * unquiesced. 
+ */ + } else { + pr_devel("%s: requeue req: qid %d tag %d io_flags %x\n", + __func__, ubq->q_id, rq->tag, io->flags); + blk_mq_requeue_request(rq, false); + blk_mq_delay_kick_requeue_list(rq->q, + UBLK_REQUEUE_DELAY_MS); + } + + /* forget everything now and be ready for new FETCH_REQ */ + io->flags = 0; + io->cmd = NULL; + io->addr = 0; + ubq->nr_io_ready--; + return true; +} + +static int ublk_ctrl_start_recovery(struct io_uring_cmd *cmd) +{ + struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd; + struct ublk_device *ub; + int ret = -EINVAL; + int i, j; + + ub = ublk_get_device_from_id(header->dev_id); + if (!ub) + return ret; + + mutex_lock(&ub->mutex); + + if (!ublk_can_use_recovery(ub)) + goto out_unlock; + + if (ub->dev_info.state != UBLK_S_DEV_LIVE) { + ret = -EBUSY; + goto out_unlock; + } + + /* Recovery feature is for 'process' crash, so all ubq_daemon should be PF_EXITING */ + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { + struct ublk_queue *ubq = ublk_get_queue(ub, i); + + if (!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq))) + goto out_unlock; + } + + blk_mq_quiesce_queue(ub->ub_disk->queue); + pr_devel("%s: queue quiesced.\n", __func__); + ub->dev_info.state = UBLK_S_DEV_RECOVERING; + /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */ + ub->mm = NULL; + init_completion(&ub->completion); + + blk_mq_tagset_busy_iter(&ub->tag_set, ublk_recover_rq, NULL); + + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) { + struct ublk_queue *ubq = ublk_get_queue(ub, i); + + WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq))); + + pr_devel("%s: dying ubq_daemon: qid %d\n", __func__, ubq->q_id); + + /* old daemon is PF_EXITING, put it now */ + put_task_struct(ubq->ubq_daemon); + /* otherwise ublk_ch_uring_cmd() won't accept new FETCH_REQ */ + ubq->ubq_daemon = NULL; + + for (j = 0; j < ubq->q_depth; j++) { + struct ublk_io *io = &ubq->ios[j]; + + /* This is an inflight rq, pass */ + if (!io->cmd) + continue; + /* + * No request has been issued(allocated) on this tag. We are safe to + * io_uring_cmd_done() the old ioucmd because no one else can do this. + * + * Note that ublk_cancel_queue() calling io_uring_cmd_done() must + * only be called with NOT dying ubq_daemon. 
+ */ + pr_devel("%s: done old cmd: qid %d tag %d\n", + __func__, ubq->q_id, j); + io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0); + + /* forget everything now and be ready for new FETCH_REQ */ + io->flags = 0; + io->cmd = NULL; + io->addr = 0; + ubq->nr_io_ready--; + } + + WARN_ON_ONCE(ubq->nr_io_ready); + ub->nr_queues_ready--; + } + + WARN_ON_ONCE(ub->nr_queues_ready); + ret = 0; + +out_unlock: + mutex_unlock(&ub->mutex); + ublk_put_device(ub); + return ret; +} + +static int ublk_ctrl_end_recovery(struct io_uring_cmd *cmd) +{ + struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd; + int ublksrv_pid = (int)header->data[0]; + struct ublk_device *ub; + int ret = -EINVAL; + + ub = ublk_get_device_from_id(header->dev_id); + if (!ub) + return ret; + + pr_devel("%s: Waiting for new ubq_daemon is ready...\n", __func__); + /* wait until new ubq_daemon sending all FETCH_REQ */ + wait_for_completion_interruptible(&ub->completion); + pr_devel("%s: All new ubq_daemon is ready.\n", __func__); + + mutex_lock(&ub->mutex); + + if (!ublk_can_use_recovery(ub)) + goto out_unlock; + + if (ub->dev_info.state != UBLK_S_DEV_RECOVERING) { + ret = -EBUSY; + goto out_unlock; + } + ub->dev_info.ublksrv_pid = ublksrv_pid; + pr_devel("%s: new ublksrv_pid %d\n", __func__, + ub->dev_info.ublksrv_pid); + blk_mq_unquiesce_queue(ub->ub_disk->queue); + pr_devel("%s: queue unquiesced.\n", __func__); + ub->dev_info.state = UBLK_S_DEV_LIVE; + ret = 0; +out_unlock: + mutex_unlock(&ub->mutex); + ublk_put_device(ub); + return ret; +} + static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags) { @@ -1837,6 +2028,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd, case UBLK_CMD_SET_PARAMS: ret = ublk_ctrl_set_params(cmd); break; + case UBLK_CMD_START_USER_RECOVERY: + ret = ublk_ctrl_start_recovery(cmd); + break; + case UBLK_CMD_END_USER_RECOVERY: + ret = ublk_ctrl_end_recovery(cmd); + break; default: break; } From patchwork Wed Aug 24 05:47:44 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ziyang Zhang X-Patchwork-Id: 12952944 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1E4EBC00140 for ; Wed, 24 Aug 2022 05:49:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234222AbiHXFtM (ORCPT ); Wed, 24 Aug 2022 01:49:12 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46690 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233977AbiHXFtL (ORCPT ); Wed, 24 Aug 2022 01:49:11 -0400 Received: from out30-54.freemail.mail.aliyun.com (out30-54.freemail.mail.aliyun.com [115.124.30.54]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 40C637D7AC; Tue, 23 Aug 2022 22:49:01 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R611e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046056;MF=ziyangzhang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0VN5zTfh_1661320136; Received: from localhost.localdomain(mailfrom:ZiyangZhang@linux.alibaba.com fp:SMTPD_---0VN5zTfh_1661320136) by smtp.aliyun-inc.com; Wed, 24 Aug 2022 13:48:56 +0800 From: ZiyangZhang To: ming.lei@redhat.com, axboe@kernel.dk Cc: xiaoguang.wang@linux.alibaba.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, joseph.qi@linux.alibaba.com, 
ZiyangZhang Subject: [RFC PATCH 9/9] ublk_drv: do not schedule monitor_work with recovery feature enabled Date: Wed, 24 Aug 2022 13:47:44 +0800 Message-Id: <20220824054744.77812-10-ZiyangZhang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> References: <20220824054744.77812-1-ZiyangZhang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org monitor_work is used to end(abort) all dying ubqs and cleanup device. With recovery feature enabled, we reinit dying ubqs in START_USER_RECOVERY command and we do not cleanup ublk device. So scheduling monitor_work in START_DEV is unnecessary. We manually schedule monitor_work in STOP_DEV regardless reocvery feature is enabled or not. So we can cleanup everything in the end. Signed-off-by: ZiyangZhang --- drivers/block/ublk_drv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index ccaf3761de74..f8bb2e818d33 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -1435,7 +1435,8 @@ static int ublk_ctrl_start_dev(struct io_uring_cmd *cmd) wait_for_completion_interruptible(&ub->completion); - schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD); + if (!ublk_can_use_recovery(ub)) + schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD); mutex_lock(&ub->mutex); if (ub->dev_info.state != UBLK_S_DEV_DEAD ||