
[RFC,V2,5/6] ublk_drv: consider recovery feature in aborting mechanism

Message ID 20220831155136.23434-6-ZiyangZhang@linux.alibaba.com (mailing list archive)
State New, archived
Series ublk_drv: add USER_RECOVERY support

Commit Message

Ziyang Zhang Aug. 31, 2022, 3:51 p.m. UTC
We change the default behavior of the aborting mechanism. Now monitor_work
is no longer manually scheduled by ublk_queue_rq() or task_work after a
ubq_daemon or process is dying(PF_EXITING). Instead, the monitor work must
find a dying ubq_daemon or a crashed process by itself; then it can
start the aborting mechanism. We make this modification because we want
to strictly separate the STOP_DEV procedure from monitor_work. More
specifically, we ensure that monitor_work cannot be scheduled after
we start deleting the gendisk and ending(aborting) all inflight rqs. In this
way it is easy to add the recovery feature and unify it with the existing
aborting mechanism. We really do not want too many "if can_use_recovery"
checks.

With recovery feature disabled and after a ubq_daemon crash:
(1) monitor_work notices the crash and schedules stop_work
(2) stop_work calls ublk_stop_dev()
(3) In ublk_stop_dev():
    (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();
	    If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
		in fallback wq.
	(b) Then it cancels monitor_work;
	(c) Then it schedules abort_work which ends(aborts) all inflight rqs.
	(d) At the same time del_gendisk() is called.
	(e) Finally, we complete all ioucmds.

Note: we do not change the existing behavior with recovery disabled. Note
that the STOP_DEV ctrl-cmd can be processed without regard to monitor_work.

With recovery feature enabled and after a process crash:
(1) monitor_work notices the crash and that all ubq_daemons are dying.
    We do not consider a "single" ubq_daemon(pthread) crash. Please send
    the STOP_DEV ctrl-cmd, which calls ublk_stop_dev(), for this case.
(2) The monitor_work quiesces request queue.
(3) The monitor_work checks if there is any inflight rq with
    UBLK_IO_FLAG_ACTIVE set. If so, we give up and schedule monitor_work
    later to retry. This is because we have to wait until these rqs are
    requeued(IDLE), after which we can safely complete their ioucmds.
    Otherwise we may cause a UAF on the ioucmd in the fallback wq.
(4) If the check in (3) passes, we should requeue/abort inflight rqs issued
    to the crashed ubq_daemon. If UBLK_F_USER_RECOVERY_REISSUE is set,
    the rq is requeued; otherwise it is aborted.
(5) All ioucmds are completed by calling io_uring_cmd_done().
(6) monitor_work sets ub's state to UBLK_S_DEV_RECOVERING. It does not
    schedule itself anymore. Now we are ready for START_USER_RECOVERY.

Note: If (3) fails, monitor_work schedules itself and retries (3). We allow
the user to manually start the STOP_DEV procedure without regard to monitor_work.
STOP_DEV can cancel monitor_work, unquiesce the request queue and drain all
requeued rqs. More importantly, STOP_DEV can safely complete all ioucmds
since monitor_work has been canceled at that moment.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c | 222 +++++++++++++++++++++++++++++++++++----
 1 file changed, 202 insertions(+), 20 deletions(-)

Comments

Ming Lei Sept. 3, 2022, 1:30 p.m. UTC | #1
On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang
<ZiyangZhang@linux.alibaba.com> wrote:
>
> We change the default behavior of aborting machenism. Now monitor_work
> will not be manually scheduled by ublk_queue_rq() or task_work after a
> ubq_daemon or process is dying(PF_EXITING). The monitor work should
> find a dying ubq_daemon or a crash process by itself. Then, it can

It looks like you don't consider one dying ubq_daemon as a crash candidate.
Most of the io implementation is done in the ubq pthread, so it should be
covered by the crash recovery.

> start the aborting machenism. We do such modification is because we want
> to strictly separate the STOP_DEV procedure and monitor_work. More
> specifically, we ensure that monitor_work must not be scheduled after
> we start deleting gendisk and ending(aborting) all inflight rqs. In this
> way we are easy to consider recovery feature and unify it into existing
> aborting mechanism. Really we do not want too many "if can_use_recovery"
> checks.

Frankly speaking, not sure we need to invent new wheel for the
'aborting' mechanism.

In theory, you needn't change the current monitor work and cancel
dev/queue. What you need is how to handle the dying ubq daemon:

1) without user recovery, delete the disk if any ubq daemon has died.

2) with user recovery:
    - quiesce request queue and wait until all inflight requests are
requeued(become IDLE);
    - call io_uring_cmd_done for any active io slot
    - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling
      the potential crash; if it is confirmed as crash by userspace,
      userspace will send command to handle it.
    (this way will simplify userspace too, since we can add one utility
    and provide it via udev script for handling recovery)

>
> With recovery feature disabled and after a ubq_daemon crash:
> (1) monitor_work notices the crash and schedules stop_work

the driver can't figure out whether it is a crash; it can only see whether
the ubq daemon has died. And crash detection logic should be done
in userspace, IMO.

> (2) stop_work calls ublk_stop_dev()
> (3) In ublk_stop_dev():
>     (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();

Please don't add new flag in fast path lockless, and the original check
is supposed to be reused for recovery feature.

>             If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
>                 in fallback wq.
>         (b) Then it cancels monitor_work;
>         (c) Then it schedules abort_work which ends(aborts) all inflight rqs.
>         (d) At the same time del_gendisk() is called.
>         (e) Finally, we complete all ioucmds.
>
> Note: we do not change the existing behavior with reocvery disabled. Note
> that STOP_DEV ctrl-cmd can be processed without reagrd to monitor_work.
>
> With recovery feature enabled and after a process crash:
> (1) monitor_work notices the crash and all ubq_daemon are dying.
>     We do not consider a "single" ubq_daemon(pthread) crash. Please send
>         STOP_DEV ctrl-cmd which calling ublk_stop_dev() for this case.

Can you explain why you don't consider it as a crash? IMO, most of the
userspace block logic runs in the ubq_daemon, so it is reasonable to
consider it one.

ublk_reinit_dev() is supposed to be run in a standalone context, just like
ublk_stop_dev(); we need monitor_work to provide forward progress,
so don't wait in the monitor work.

And please don't change this model for making forward progress.


> (2) The monitor_work quiesces request queue.
> (3) The monotor_work checks if there is any inflight rq with
>     UBLK_IO_FLAG_ACTIVE unset. If so, we give up and schedule monitor_work
>         later to retry. This is because we have to wait these rqs requeued(IDLE)
>         and we are safe to complete their ioucmds later. Otherwise we may cause
>         UAF on ioucmd in fallback wq.
> (4) If check in (3) passes, we should requeue/abort inflight rqs issued
>     to the crash ubq_daemon before. If UBLK_F_USER_RECOVERY_REISSUE is set,
>         rq is requeued. Otherwise it is aborted.
> (5) All ioucmds are completed by calling io_uring_cmd_done().
> (6) monitor_work set ub's state to UBLK_S_DEV_RECOVERING. It does not
>     scheduled itself anymore. Now we are ready for START_USER_RECOVERY.
>
> Note: If (3) fails, monitor_work schedules itself and retires (3). We allow
> user to manually start STOP_DEV procedure without reagrd to monitor_work.
> STOP_DEV can cancel monitor_work, unquiesce request queue and drain all
> requeued rqs. More importantly, STOP_DEV can safely complete all ioucmds
> since monitor_work has been canceled at that moment.
>
> Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
> ---
>  drivers/block/ublk_drv.c | 222 +++++++++++++++++++++++++++++++++++----
>  1 file changed, 202 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 296b9d80f003..0e185d1fa2c4 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -156,7 +156,7 @@ struct ublk_device {
>
>         struct completion       completion;
>         unsigned int            nr_queues_ready;
> -       atomic_t                nr_aborted_queues;
> +       bool    force_abort;
>
>         /*
>          * Our ubq->daemon may be killed without any notification, so
> @@ -164,6 +164,7 @@ struct ublk_device {
>          */
>         struct delayed_work     monitor_work;
>         struct work_struct      stop_work;
> +       struct work_struct      abort_work;
>  };
>
>  /* header of ublk_params */
> @@ -643,9 +644,13 @@ static void ublk_complete_rq(struct request *req)
>   */
>  static void __ublk_fail_req(struct ublk_io *io, struct request *req)
>  {
> +       struct ublk_queue *ubq = req->mq_hctx->driver_data;
> +
>         WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
>
>         if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
> +               pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
> +                               __func__, ubq->q_id, req->tag, io->flags);
>                 io->flags |= UBLK_IO_FLAG_ABORTED;
>                 blk_mq_end_request(req, BLK_STS_IOERR);
>         }
> @@ -664,6 +669,8 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
>
>         /* tell ublksrv one io request is coming */
>         io_uring_cmd_done(io->cmd, res, 0);
> +       pr_devel("%s: complete ioucmd: res %d io_flags %x\n",
> +                       __func__, res, io->flags);
>  }
>
>  #define UBLK_REQUEUE_DELAY_MS  3
> @@ -675,11 +682,6 @@ static inline void __ublk_rq_task_work(struct request *req)
>         int tag = req->tag;
>         struct ublk_io *io = &ubq->ios[tag];
>         unsigned int mapped_bytes;
> -
> -       pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
> -                       __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
> -                       ublk_get_iod(ubq, req->tag)->addr);
> -
>         /*
>          * Task is exiting if either:
>          *
> @@ -700,10 +702,13 @@ static inline void __ublk_rq_task_work(struct request *req)
>                 } else {
>                         blk_mq_end_request(req, BLK_STS_IOERR);
>                 }
> -               mod_delayed_work(system_wq, &ub->monitor_work, 0);
>                 return;
>         }
>
> +       pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
> +                       __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
> +                       ublk_get_iod(ubq, req->tag)->addr);
> +
>         if (ublk_need_get_data(ubq) &&
>                         (req_op(req) == REQ_OP_WRITE ||
>                         req_op(req) == REQ_OP_FLUSH)) {
> @@ -782,6 +787,11 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>         struct ublk_io *io = &ubq->ios[rq->tag];
>         blk_status_t res;
>
> +       if (unlikely(ubq->dev->force_abort)) {
> +               pr_devel("%s: abort q_id %d tag %d io_flags %x.\n",
> +                               __func__, ubq->q_id, rq->tag, io->flags);
> +               return BLK_STS_IOERR;
> +       }

Checking one flag lockless is usually not safe, also not sure why we
need such flag here, and the original check is supposed to work.

>         /* fill iod to slot in io cmd buffer */
>         res = ublk_setup_iod(ubq, rq);
>         if (unlikely(res != BLK_STS_OK))
> @@ -789,13 +799,15 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>
>         blk_mq_start_request(bd->rq);
>
> +       pr_devel("%s: start rq: q_id %d tag %d io_flags %x.\n",
> +                       __func__, ubq->q_id, rq->tag, io->flags);
> +
>         if (unlikely(ubq_daemon_is_dying(ubq))) {
>   fail:
>                 pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
>                                 (ublk_can_use_recovery(ubq->dev)) ? "requeue" : "abort",
>                                 ubq->q_id, rq->tag, io->flags);
>
> -               mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
>                 if (ublk_can_use_recovery(ubq->dev)) {
>                         /* We cannot process this rq so just requeue it. */
>                         blk_mq_requeue_request(rq, false);
> @@ -880,6 +892,7 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
>         if (test_and_set_bit(UB_STATE_OPEN, &ub->state))
>                 return -EBUSY;
>         filp->private_data = ub;
> +       pr_devel("%s: open /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
>         return 0;
>  }
>
> @@ -888,6 +901,7 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
>         struct ublk_device *ub = filp->private_data;
>
>         clear_bit(UB_STATE_OPEN, &ub->state);
> +       pr_devel("%s: release /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
>         return 0;
>  }
>
> @@ -971,37 +985,180 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>                          * will do it
>                          */
>                         rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
> -                       if (rq)
> +                       if (rq && blk_mq_request_started(rq))
>                                 __ublk_fail_req(io, rq);
>                 }
>         }
>         ublk_put_device(ub);
>  }
>
> -static void ublk_daemon_monitor_work(struct work_struct *work)
> +
> +
> +static void ublk_abort_work_fn(struct work_struct *work)
>  {
>         struct ublk_device *ub =
> -               container_of(work, struct ublk_device, monitor_work.work);
> +               container_of(work, struct ublk_device, abort_work);
> +
> +       int i;
> +
> +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +               if (ubq_daemon_is_dying(ubq))
> +                       ublk_abort_queue(ub, ubq);
> +       }
> +
> +       if (ub->force_abort)
> +               schedule_work(&ub->abort_work);
> +}
> +
> +static void ublk_reinit_queue(struct ublk_device *ub,
> +               struct ublk_queue *ubq)

ublk_reinit_queue() is a bad name; it is called during aborting, and what
we need there is to shut the queue down completely. Reinit or reset should
be done when you start to recover; at that time you can really reinit
or reset the queue.

> +{
> +       int i;
> +
> +       for (i = 0; i < ubq->q_depth; i++) {
> +               struct ublk_io *io = &ubq->ios[i];
> +
> +               if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
> +                       struct request *rq = blk_mq_tag_to_rq(
> +                                       ub->tag_set.tags[ubq->q_id], i);
> +
> +                       WARN_ON_ONCE(!rq);
> +                       pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n",
> +                                       __func__,
> +                                       ublk_can_use_recovery_reissue(ub) ? "requeue" : "abort",
> +                                       ubq->q_id, i, io->flags);
> +                       if (ublk_can_use_recovery_reissue(ub))
> +                               blk_mq_requeue_request(rq, false);
> +                       else
> +                               __ublk_fail_req(io, rq);
> +
> +               } else {
> +                       io_uring_cmd_done(io->cmd,
> +                                       UBLK_IO_RES_ABORT, 0);
> +                       io->flags &= ~UBLK_IO_FLAG_ACTIVE;
> +                       pr_devel("%s: send UBLK_IO_RES_ABORT: qid %d tag %d io_flags %x\n",
> +                               __func__, ubq->q_id, i, io->flags);
> +               }
> +               ubq->nr_io_ready--;
> +       }
> +       WARN_ON_ONCE(ubq->nr_io_ready);
> +}
> +
> +static bool ublk_check_inflight_rq(struct request *rq, void *data)
> +{
> +       struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> +       struct ublk_io *io = &ubq->ios[rq->tag];
> +       bool *busy = data;
> +
> +       if (io->flags & UBLK_IO_FLAG_ACTIVE) {
> +               *busy = true;
> +               return false;
> +       }
> +       return true;
> +}
> +
> +static void ublk_reinit_dev(struct ublk_device *ub)

bad name too.

> +{
> +       int i;
> +       bool busy = false;
> +
> +       if (!ublk_get_device(ub))
> +               return;
> +
> +       /* If we have quiesced q, all ubq_daemons are dying */
> +       if (blk_queue_quiesced(ub->ub_disk->queue))
> +               goto check_inflight;
> +
> +       /* Recovery feature is for 'process' crash. */
> +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +               if (!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)))
> +                       goto out;
> +       }
> +
> +       pr_devel("%s: all ubq_daemons(nr: %d) are dying.\n",
> +                       __func__, ub->dev_info.nr_hw_queues);
> +
> +       /* Now all ubq_daemons are PF_EXITING, let's quiesce q. */
> +       blk_mq_quiesce_queue(ub->ub_disk->queue);
> +       pr_devel("%s: queue quiesced.\n", __func__);
> + check_inflight:
> +       /* All rqs have to be requeued(and stay in queue now) */
> +       blk_mq_tagset_busy_iter(&ub->tag_set, ublk_check_inflight_rq, &busy);
> +       if (busy) {
> +               pr_devel("%s: still some inflight rqs, retry later...\n",
> +                               __func__);
> +               goto out;
> +       }
> +
> +       pr_devel("%s: all inflight rqs are requeued.\n", __func__);
> +
> +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +               ublk_reinit_queue(ub, ubq);
> +       }
> +       /* So monitor_work won't be scheduled anymore */
> +       ub->dev_info.state = UBLK_S_DEV_RECOVERING;
> +       pr_devel("%s: convert state to UBLK_S_DEV_RECOVERING\n",
> +                       __func__);
> + out:
> +       ublk_put_device(ub);
> +}
> +
> +static void ublk_kill_dev(struct ublk_device *ub)
> +{
>         int i;
>
>         for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
>                 struct ublk_queue *ubq = ublk_get_queue(ub, i);
>
>                 if (ubq_daemon_is_dying(ubq)) {
> +                       pr_devel("%s: ubq(%d) is dying, schedule stop_work now\n",
> +                                       __func__, i);
>                         schedule_work(&ub->stop_work);
> -
> -                       /* abort queue is for making forward progress */
> -                       ublk_abort_queue(ub, ubq);
>                 }
>         }
> +}
> +
> +static void ublk_daemon_monitor_work(struct work_struct *work)
> +{
> +       struct ublk_device *ub =
> +               container_of(work, struct ublk_device, monitor_work.work);
> +
> +       pr_devel("%s: mode(%s) running...\n",
> +                       __func__,
> +                       ublk_can_use_recovery(ub) ? "recovery" : "kill");
> +       /*
> +        * We can't schedule monitor work if:
> +        * (1) The state is DEAD.
> +        *     The gendisk is not alive, so either all rqs are ended
> +        *     or request queue is not created.
> +        *
> +        * (2) The state is RECOVERYING.
> +        *     The process crashed, all rqs were requeued and request queue
> +        *     was quiesced.
> +        */
> +       WARN_ON_ONCE(ub->dev_info.state != UBLK_S_DEV_LIVE);
>
> +       if (ublk_can_use_recovery(ub))
> +               ublk_reinit_dev(ub);

ublk_reinit_dev() is supposed to be run in standalone context, just like
ublk_stop_dev(), we need monitor_work to provide forward progress,
so don't run wait in monitor work.

And please don't change this model of making forward progress.

> +       else
> +               ublk_kill_dev(ub);
>         /*
> -        * We can't schedule monitor work after ublk_remove() is started.
> +        * No need ub->mutex, monitor work are canceled after ub is marked
> +        * as force_abort which is observed reliably.
> +        *
> +        * Note:
> +        * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
> +        * we will hang on del_gendisk() and wait for all inflight rqs'
> +        * completion.
>          *
> -        * No need ub->mutex, monitor work are canceled after state is marked
> -        * as DEAD, so DEAD state is observed reliably.
>          */
> -       if (ub->dev_info.state != UBLK_S_DEV_DEAD)
> +       if (ub->dev_info.state == UBLK_S_DEV_LIVE && !ub->force_abort)
>                 schedule_delayed_work(&ub->monitor_work,
>                                 UBLK_DAEMON_MONITOR_PERIOD);
>  }
> @@ -1058,10 +1215,35 @@ static void ublk_cancel_dev(struct ublk_device *ub)
>  static void ublk_stop_dev(struct ublk_device *ub)
>  {
>         mutex_lock(&ub->mutex);
> -       if (ub->dev_info.state != UBLK_S_DEV_LIVE)
> +       if (ub->dev_info.state == UBLK_S_DEV_DEAD)
>                 goto unlock;
>
> +       /*
> +        * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
> +        * we will hang on del_gendisk() and wait for all inflight rqs'
> +        * completion.
> +        */
> +       ub->force_abort = true;
> +       /* wait until monitor_work is not scheduled */
> +       cancel_delayed_work_sync(&ub->monitor_work);
> +       pr_devel("%s: monitor work is canceled.\n", __func__);
> +       /* unquiesce q and let all inflight rqs' be aborted */
> +       if (blk_queue_quiesced(ub->ub_disk->queue)) {
> +               blk_mq_unquiesce_queue(ub->ub_disk->queue);
> +               pr_devel("%s: queue unquiesced.\n", __func__);
> +       }
> +       /* requeued requests will be aborted ASAP because of 'force_abort' */
> +       blk_mq_kick_requeue_list(ub->ub_disk->queue);
> +       /* forward progress */
> +       schedule_work(&ub->abort_work);
> +       pr_devel("%s: abort work is scheduled, start delete gendisk...\n",
> +                       __func__);
> +       pr_devel("%s: gendisk is deleted.\n", __func__);
>         del_gendisk(ub->ub_disk);
> +       pr_devel("%s: gendisk is deleted.\n", __func__);
> +       ub->force_abort = false;
> +       cancel_work_sync(&ub->abort_work);
> +       pr_devel("%s: abort work is canceled.\n", __func__);
>         ub->dev_info.state = UBLK_S_DEV_DEAD;
>         ub->dev_info.ublksrv_pid = -1;
>         put_disk(ub->ub_disk);

ublk_stop_dev() isn't supposed to be changed so much, and it can be just
for handling non-recovery, but it can be renamed as ublk_delete_disk().

For recovery, you can add one function of ublk_quiesce_disk() for
preparing for recovering.

thanks
Ming
Ziyang Zhang Sept. 4, 2022, 11:23 a.m. UTC | #2
On 2022/9/3 21:30, Ming Lei wrote:
> On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang
> <ZiyangZhang@linux.alibaba.com> wrote:
>>
>> We change the default behavior of aborting machenism. Now monitor_work
>> will not be manually scheduled by ublk_queue_rq() or task_work after a
>> ubq_daemon or process is dying(PF_EXITING). The monitor work should
>> find a dying ubq_daemon or a crash process by itself. Then, it can
> 
> Looks you don't consider one dying ubq_daemon as one crash candidate.
> Most io implementation is done in the ubq pthread, so it should be
> covered by the crash recovery.
> 
>> start the aborting machenism. We do such modification is because we want
>> to strictly separate the STOP_DEV procedure and monitor_work. More
>> specifically, we ensure that monitor_work must not be scheduled after
>> we start deleting gendisk and ending(aborting) all inflight rqs. In this
>> way we are easy to consider recovery feature and unify it into existing
>> aborting mechanism. Really we do not want too many "if can_use_recovery"
>> checks.
> 
> Frankly speaking, not sure we need to invent new wheel for the
> 'aborting' mechanism.
> 
> In theory, you needn't change the current monitor work and cancel
> dev/queue. What you need is how to handle the dying ubq daemon:
> 
> 1) without user recovery, delete disk if any ubq daemon is died.
> 
> 2) with user recovery:
>     - quiesce request queue and wait until all inflight requests are
> requeued(become IDLE);
>     - call io_uring_cmd_done for any active io slot
>     - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling
>       the potential crash; if it is confirmed as crash by userspace,
>       userspace will send command to handle it.
>     (this way will simplify userspace too, since we can add one utility
>     and provide it via udev script for handling recovery)
> 
>>
>> With recovery feature disabled and after a ubq_daemon crash:
>> (1) monitor_work notices the crash and schedules stop_work
> 
> driver can't figure out if it is crash, and it can just see if the
> ubq deamon is died or not. And crash detection logic should be done
> in userspace, IMO.
> 
>> (2) stop_work calls ublk_stop_dev()
>> (3) In ublk_stop_dev():
>>     (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();
> 
> Please don't add new flag in fast path lockless, and the original check
> is supposed to be reused for recovery feature.
> 
>>             If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
>>                 in fallback wq.
>>         (b) Then it cancels monitor_work;
>>         (c) Then it schedules abort_work which ends(aborts) all inflight rqs.
>>         (d) At the same time del_gendisk() is called.
>>         (e) Finally, we complete all ioucmds.
>>
>> Note: we do not change the existing behavior with reocvery disabled. Note
>> that STOP_DEV ctrl-cmd can be processed without reagrd to monitor_work.
>>
>> With recovery feature enabled and after a process crash:
>> (1) monitor_work notices the crash and all ubq_daemon are dying.
>>     We do not consider a "single" ubq_daemon(pthread) crash. Please send
>>         STOP_DEV ctrl-cmd which calling ublk_stop_dev() for this case.
> 
> Can you consider why you don't consider it as one crash? IMO, most of
> userspace block logic is run in ubq_daemon, so it is reasonable to
> consider it.
> 
> ublk_reinit_dev() is supposed to be run in standalone context, just like
> ublk_stop_dev(), we need monitor_work to provide forward progress,
> so don't run wait in monitor work.
> 
> And please don't change this model for making forward progress.
> 
> 

Hi, Ming.

I will take your advice and provide V4 soon. Here is the new design:

(0) No modification in the fast path. We just requeue rqs with a dying ubq_daemon
    and schedule monitor_work immediately.
    
    BTW: I think here we should call 'blk_mq_delay_kick_requeue_list()' after
    requeuing a rq. Otherwise del_gendisk() in ublk_stop_dev() hangs.

(1) Add quiesce_work, which is scheduled by monitor_work after a ubq_daemon
    is dying with recovery enabled.

(2) quiesce_work runs ublk_quiesce_dev(). It acquires the ub lock, and
    quiesces the request queue(only once). On each dying ubq, it calls
    ublk_quiesce_queue(), which waits until all inflight rqs(ACTIVE) are
    requeued(IDLE). Finally it completes all ioucmds.
    Note: So we need to add a per-ubq flag 'quiesced', which means
    we have done this 'quiesce && clean' stuff on the ubq.

(3) After the request queue is quiesced, change ub's state to STATE_QUIESCED.
    This state can be checked by GET_DEV_INFO ctrl-cmd, just like STATE_LIVE. So
    user can detect a crash by sending GET_DEV_INFO and getting STATE_QUIESCED
    back.
    
    BTW, I'm unsure that sending one kobj_uevent(KOBJ_CHANGE) really helps. Users
    have many ways to detect a dying process/pthread. For example, they can 'ps'
    the ublksrv_pid or check ub's state with the GET_DEV_INFO ctrl-cmd. Anyway,
    this work can be done in the future; we can introduce a better way to detect
    a crash then. For this patchset, let's focus on how to deal with a dying
    ubq_daemon. Do you agree?

(4) Do not change ublk_stop_dev(). BTW, ublk_stop_dev() and ublk_quiesce_dev()
    exclude each other by acquiring the ub lock.

(5) ublk_stop_dev() has to consider a quiesced ubq. It should unquiesce request
    queue(only once) if it is quiesced. There is nothing else ublk_stop_dev()
    has to do. Inflight rqs requeued before will be aborted naturally by
    del_gendisk().

(6) ublk_quiesce_dev() cannot be run after gendisk is removed(STATE_DEAD).

(7) No need to run ublk_quiesce_queue() on a 'quiesced' ubq by checking the flag.
    Note: I think this check is safe here.

(8) START_USER_RECOVERY needs to consider both a dying process and pthread(ubq_daemon).

    For a dying process, it has to reset ub->dev_info.ublksrv_pid and ub->mm. This can
    be done by passing qid = -1 in the ctrl-cmd. We should make sure all ubq_daemons
    are dying in this case.

    Otherwise it is a dying pthread, and only this ubq is reinitialized. Users may send
    many START_USER_RECOVERY commands with different qids to recover many ubqs.

 
Thanks for reviewing patches.

Regards,
Zhang.
Ming Lei Sept. 6, 2022, 1:12 a.m. UTC | #3
On Sun, Sep 04, 2022 at 07:23:49PM +0800, Ziyang Zhang wrote:
> On 2022/9/3 21:30, Ming Lei wrote:
> > On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang
> > <ZiyangZhang@linux.alibaba.com> wrote:
> >>
> >> We change the default behavior of aborting machenism. Now monitor_work
> >> will not be manually scheduled by ublk_queue_rq() or task_work after a
> >> ubq_daemon or process is dying(PF_EXITING). The monitor work should
> >> find a dying ubq_daemon or a crash process by itself. Then, it can
> > 
> > Looks you don't consider one dying ubq_daemon as one crash candidate.
> > Most io implementation is done in the ubq pthread, so it should be
> > covered by the crash recovery.
> > 
> >> start the aborting machenism. We do such modification is because we want
> >> to strictly separate the STOP_DEV procedure and monitor_work. More
> >> specifically, we ensure that monitor_work must not be scheduled after
> >> we start deleting gendisk and ending(aborting) all inflight rqs. In this
> >> way we are easy to consider recovery feature and unify it into existing
> >> aborting mechanism. Really we do not want too many "if can_use_recovery"
> >> checks.
> > 
> > Frankly speaking, not sure we need to invent new wheel for the
> > 'aborting' mechanism.
> > 
> > In theory, you needn't change the current monitor work and cancel
> > dev/queue. What you need is how to handle the dying ubq daemon:
> > 
> > 1) without user recovery, delete disk if any ubq daemon is died.
> > 
> > 2) with user recovery:
> >     - quiesce request queue and wait until all inflight requests are
> > requeued(become IDLE);
> >     - call io_uring_cmd_done for any active io slot
> >     - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling
> >       the potential crash; if it is confirmed as crash by userspace,
> >       userspace will send command to handle it.
> >     (this way will simplify userspace too, since we can add one utility
> >     and provide it via udev script for handling recovery)
> > 
> >>
> >> With recovery feature disabled and after a ubq_daemon crash:
> >> (1) monitor_work notices the crash and schedules stop_work
> > 
> > driver can't figure out if it is crash, and it can just see if the
> > ubq deamon is died or not. And crash detection logic should be done
> > in userspace, IMO.
> > 
> >> (2) stop_work calls ublk_stop_dev()
> >> (3) In ublk_stop_dev():
> >>     (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();
> > 
> > Please don't add a new lockless flag in the fast path; the original
> > check is supposed to be reused for the recovery feature.
> > 
> >>             If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
> >>                 in fallback wq.
> >>         (b) Then it cancels monitor_work;
> >>         (c) Then it schedules abort_work which ends(aborts) all inflight rqs.
> >>         (d) At the same time del_gendisk() is called.
> >>         (e) Finally, we complete all ioucmds.
> >>
> >> Note: we do not change the existing behavior with recovery disabled. Note
> >> that the STOP_DEV ctrl-cmd can be processed without regard to monitor_work.
> >>
> >> With recovery feature enabled and after a process crash:
> >> (1) monitor_work notices the crash and that all ubq_daemons are dying.
> >>     We do not consider a "single" ubq_daemon(pthread) crash. Please send
> >>     the STOP_DEV ctrl-cmd, which calls ublk_stop_dev(), for this case.
> > 
> > Can you explain why you don't treat it as a crash? IMO, most of the
> > userspace block logic runs in the ubq_daemon, so it is reasonable to
> > consider it one.
> > 
> > ublk_reinit_dev() is supposed to run in a standalone context, just like
> > ublk_stop_dev(); we need monitor_work to provide forward progress,
> > so don't wait inside the monitor work.
> > 
> > And please don't change this model for making forward progress.
> > 
> > 
> 
> Hi, Ming.
> 
> I will take your advice and provide V4 soon. Here is the new design:
> 
> (0) No modification in the fast path. We just requeue rqs with a dying ubq_daemon
>     and schedule monitor_work immediately.
>     
>     BTW: I think here we should call 'blk_mq_delay_kick_requeue_list()' after
>     requeuing a rq. Otherwise del_gendisk() in ublk_stop_dev() hangs.

No.

IMO you can add one helper for either ending the request or adding it to the
requeue list; then the code is still clean.

And when you call del_gendisk() in case of being in recovery state,
blk_mq_unquiesce_queue() will handle/abort them and make del_gendisk()
move on.

> 
> (1) Add quiesce_work, which is scheduled by monitor_work after a ubq_daemon
>     is dying with recovery enabled.
> 
> (2) quiesce_work runs ublk_quiesce_dev(). It acquires the ub lock and
>     quiesces the request queue (only once). On each dying ubq, it calls
>     ublk_quiesce_queue(), which waits until all inflight rqs (ACTIVE) are
>     requeued (IDLE). Finally it completes all ioucmds.
>     Note: so we need to add a per-ubq flag 'quiesced', which means
>     we have done this 'quiesce && clean' stuff on the ubq.

ubq->nr_io_ready should be reused for checking whether the queue is quiesced.
It is actually the same as the current usage.

> 
> (3) After the request queue is quiesced, change ub's state to STATE_QUIESCED.
>     This state can be checked by GET_DEV_INFO ctrl-cmd, just like STATE_LIVE. So
>     user can detect a crash by sending GET_DEV_INFO and getting STATE_QUIESCED
>     back.
>     
>     BTW, I'm unsure that sending one kobj_uevent(KOBJ_CHANGE) really helps. Users
>     have many ways to detect a dying process/pthread. For example, they can 'ps'
>     ublksrv_pid or check ub's state with the GET_DEV_INFO ctrl-cmd. Anyway, this work
>     can be done in the future. We can introduce a better way to detect a crash.
>     For this patchset, let's focus on how to deal with a dying ubq_daemon.
>     Do you agree?

kobj_uevent(KOBJ_CHANGE) is useful because it can avoid userspace polling on
the device. It gives the userspace utility an accurate chance to confirm
whether it is really a crash.

And checking whether the ubq daemon/process has crashed is really userspace's
responsibility, but sending the uevent once the ubq daemon/process is dying
helps userspace run the check; at least polling can be avoided, or done in
time.

If you don't want to add it from beginning, that is fine, and we can do it
after your patchset is merged.

> 
> (4) Do not change ublk_stop_dev(). BTW, ublk_stop_dev() and ublk_quiesce_dev()
>     exclude each other by acquiring the ub lock.
> 
> (5) ublk_stop_dev() has to consider a quiesced ubq. It should unquiesce the
>     request queue (only once) if it is quiesced. There is nothing else
>     ublk_stop_dev() has to do. Inflight rqs requeued earlier will be aborted
>     naturally by del_gendisk().
> 
> (6) ublk_quiesce_dev() cannot be run after gendisk is removed(STATE_DEAD).
> 
> (7) No need to run ublk_quiesce_queue() on a 'quiesced' ubq by checking the flag.
>     Note: I think this check is safe here.
> 
> (8) START_USER_RECOVERY needs to consider both a dying process and a dying
>     pthread (ubq_daemon).
> 
>     For a dying process, it has to reset ub->dev_info.ublksrv_pid and ub->mm. This can
>     be done by passing qid = -1 in the ctrl-cmd. We should make sure all ubq_daemons
>     are dying in this case.
> 
>     Otherwise it is a dying pthread. Only that ubq is reinitialized. Users may send
>     multiple START_USER_RECOVERY commands with different qids to recover several ubqs.

The simplest handling might be to exit all ublk queues first, and re-create one
new process to recover them all, since the request queue is required to be
quiesced first and all ublk queues are actually quiesced too. So from the
user's viewpoint, there is nothing visible compared with recovering a
single ubq daemon/pthread.


Thanks,
Ming
Patch

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 296b9d80f003..0e185d1fa2c4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -156,7 +156,7 @@  struct ublk_device {
 
 	struct completion	completion;
 	unsigned int		nr_queues_ready;
-	atomic_t		nr_aborted_queues;
+	bool	force_abort;
 
 	/*
 	 * Our ubq->daemon may be killed without any notification, so
@@ -164,6 +164,7 @@  struct ublk_device {
 	 */
 	struct delayed_work	monitor_work;
 	struct work_struct	stop_work;
+	struct work_struct	abort_work;
 };
 
 /* header of ublk_params */
@@ -643,9 +644,13 @@  static void ublk_complete_rq(struct request *req)
  */
 static void __ublk_fail_req(struct ublk_io *io, struct request *req)
 {
+	struct ublk_queue *ubq = req->mq_hctx->driver_data;
+
 	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
 
 	if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
+		pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
+				__func__, ubq->q_id, req->tag, io->flags);
 		io->flags |= UBLK_IO_FLAG_ABORTED;
 		blk_mq_end_request(req, BLK_STS_IOERR);
 	}
@@ -664,6 +669,8 @@  static void ubq_complete_io_cmd(struct ublk_io *io, int res)
 
 	/* tell ublksrv one io request is coming */
 	io_uring_cmd_done(io->cmd, res, 0);
+	pr_devel("%s: complete ioucmd: res %d io_flags %x\n",
+			__func__, res, io->flags);
 }
 
 #define UBLK_REQUEUE_DELAY_MS	3
@@ -675,11 +682,6 @@  static inline void __ublk_rq_task_work(struct request *req)
 	int tag = req->tag;
 	struct ublk_io *io = &ubq->ios[tag];
 	unsigned int mapped_bytes;
-
-	pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
-			__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
-			ublk_get_iod(ubq, req->tag)->addr);
-
 	/*
 	 * Task is exiting if either:
 	 *
@@ -700,10 +702,13 @@  static inline void __ublk_rq_task_work(struct request *req)
 		} else {
 			blk_mq_end_request(req, BLK_STS_IOERR);
 		}
-		mod_delayed_work(system_wq, &ub->monitor_work, 0);
 		return;
 	}
 
+	pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
+			__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
+			ublk_get_iod(ubq, req->tag)->addr);
+
 	if (ublk_need_get_data(ubq) &&
 			(req_op(req) == REQ_OP_WRITE ||
 			req_op(req) == REQ_OP_FLUSH)) {
@@ -782,6 +787,11 @@  static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct ublk_io *io = &ubq->ios[rq->tag];
 	blk_status_t res;
 
+	if (unlikely(ubq->dev->force_abort)) {
+		pr_devel("%s: abort q_id %d tag %d io_flags %x.\n",
+				__func__, ubq->q_id, rq->tag, io->flags);
+		return BLK_STS_IOERR;
+	}
 	/* fill iod to slot in io cmd buffer */
 	res = ublk_setup_iod(ubq, rq);
 	if (unlikely(res != BLK_STS_OK))
@@ -789,13 +799,15 @@  static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(bd->rq);
 
+	pr_devel("%s: start rq: q_id %d tag %d io_flags %x.\n",
+			__func__, ubq->q_id, rq->tag, io->flags);
+
 	if (unlikely(ubq_daemon_is_dying(ubq))) {
  fail:
 		pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
 				(ublk_can_use_recovery(ubq->dev)) ? "requeue" : "abort",
 				ubq->q_id, rq->tag, io->flags);
 
-		mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
 		if (ublk_can_use_recovery(ubq->dev)) {
 			/* We cannot process this rq so just requeue it. */
 			blk_mq_requeue_request(rq, false);
@@ -880,6 +892,7 @@  static int ublk_ch_open(struct inode *inode, struct file *filp)
 	if (test_and_set_bit(UB_STATE_OPEN, &ub->state))
 		return -EBUSY;
 	filp->private_data = ub;
+	pr_devel("%s: open /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
 	return 0;
 }
 
@@ -888,6 +901,7 @@  static int ublk_ch_release(struct inode *inode, struct file *filp)
 	struct ublk_device *ub = filp->private_data;
 
 	clear_bit(UB_STATE_OPEN, &ub->state);
+	pr_devel("%s: release /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
 	return 0;
 }
 
@@ -971,37 +985,180 @@  static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
 			 * will do it
 			 */
 			rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
-			if (rq)
+			if (rq && blk_mq_request_started(rq))
 				__ublk_fail_req(io, rq);
 		}
 	}
 	ublk_put_device(ub);
 }
 
-static void ublk_daemon_monitor_work(struct work_struct *work)
+
+
+static void ublk_abort_work_fn(struct work_struct *work)
 {
 	struct ublk_device *ub =
-		container_of(work, struct ublk_device, monitor_work.work);
+		container_of(work, struct ublk_device, abort_work);
+
+	int i;
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		if (ubq_daemon_is_dying(ubq))
+			ublk_abort_queue(ub, ubq);
+	}
+
+	if (ub->force_abort)
+		schedule_work(&ub->abort_work);
+}
+
+static void ublk_reinit_queue(struct ublk_device *ub,
+		struct ublk_queue *ubq)
+{
+	int i;
+
+	for (i = 0; i < ubq->q_depth; i++) {
+		struct ublk_io *io = &ubq->ios[i];
+
+		if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
+			struct request *rq = blk_mq_tag_to_rq(
+					ub->tag_set.tags[ubq->q_id], i);
+
+			WARN_ON_ONCE(!rq);
+			pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n",
+					__func__,
+					ublk_can_use_recovery_reissue(ub) ? "requeue" : "abort",
+					ubq->q_id, i, io->flags);
+			if (ublk_can_use_recovery_reissue(ub))
+				blk_mq_requeue_request(rq, false);
+			else
+				__ublk_fail_req(io, rq);
+
+		} else {
+			io_uring_cmd_done(io->cmd,
+					UBLK_IO_RES_ABORT, 0);
+			io->flags &= ~UBLK_IO_FLAG_ACTIVE;
+			pr_devel("%s: send UBLK_IO_RES_ABORT: qid %d tag %d io_flags %x\n",
+				__func__, ubq->q_id, i, io->flags);
+		}
+		ubq->nr_io_ready--;
+	}
+	WARN_ON_ONCE(ubq->nr_io_ready);
+}
+
+static bool ublk_check_inflight_rq(struct request *rq, void *data)
+{
+	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+	struct ublk_io *io = &ubq->ios[rq->tag];
+	bool *busy = data;
+
+	if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+		*busy = true;
+		return false;
+	}
+	return true;
+}
+
+static void ublk_reinit_dev(struct ublk_device *ub)
+{
+	int i;
+	bool busy = false;
+
+	if (!ublk_get_device(ub))
+		return;
+
+	/* If we have quiesced q, all ubq_daemons are dying */
+	if (blk_queue_quiesced(ub->ub_disk->queue))
+		goto check_inflight;
+
+	/* Recovery feature is for 'process' crash. */
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		if (!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)))
+			goto out;
+	}
+
+	pr_devel("%s: all ubq_daemons(nr: %d) are dying.\n",
+			__func__, ub->dev_info.nr_hw_queues);
+
+	/* Now all ubq_daemons are PF_EXITING, let's quiesce q. */
+	blk_mq_quiesce_queue(ub->ub_disk->queue);
+	pr_devel("%s: queue quiesced.\n", __func__);
+ check_inflight:
+	/* All rqs have to be requeued(and stay in queue now) */
+	blk_mq_tagset_busy_iter(&ub->tag_set, ublk_check_inflight_rq, &busy);
+	if (busy) {
+		pr_devel("%s: still some inflight rqs, retry later...\n",
+				__func__);
+		goto out;
+	}
+
+	pr_devel("%s: all inflight rqs are requeued.\n", __func__);
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		ublk_reinit_queue(ub, ubq);
+	}
+	/* So monitor_work won't be scheduled anymore */
+	ub->dev_info.state = UBLK_S_DEV_RECOVERING;
+	pr_devel("%s: convert state to UBLK_S_DEV_RECOVERING\n",
+			__func__);
+ out:
+	ublk_put_device(ub);
+}
+
+static void ublk_kill_dev(struct ublk_device *ub)
+{
 	int i;
 
 	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
 		struct ublk_queue *ubq = ublk_get_queue(ub, i);
 
 		if (ubq_daemon_is_dying(ubq)) {
+			pr_devel("%s: ubq(%d) is dying, schedule stop_work now\n",
+					__func__, i);
 			schedule_work(&ub->stop_work);
-
-			/* abort queue is for making forward progress */
-			ublk_abort_queue(ub, ubq);
 		}
 	}
+}
+
+static void ublk_daemon_monitor_work(struct work_struct *work)
+{
+	struct ublk_device *ub =
+		container_of(work, struct ublk_device, monitor_work.work);
+
+	pr_devel("%s: mode(%s) running...\n",
+			__func__,
+			ublk_can_use_recovery(ub) ? "recovery" : "kill");
+	/*
+	 * We can't schedule monitor work if:
+	 * (1) The state is DEAD.
+	 *     The gendisk is not alive, so either all rqs are ended
+	 *     or request queue is not created.
+	 *
+	 * (2) The state is RECOVERING.
+	 *     The process crashed, all rqs were requeued and request queue
+	 *     was quiesced.
+	 */
+	WARN_ON_ONCE(ub->dev_info.state != UBLK_S_DEV_LIVE);
 
+	if (ublk_can_use_recovery(ub))
+		ublk_reinit_dev(ub);
+	else
+		ublk_kill_dev(ub);
 	/*
-	 * We can't schedule monitor work after ublk_remove() is started.
+	 * No need for ub->mutex: monitor work is canceled after ub is marked
+	 * with force_abort, which is observed reliably.
+	 *
+	 * Note:
+	 * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
+	 * we will hang on del_gendisk() and wait for all inflight rqs'
+	 * completion.
 	 *
-	 * No need ub->mutex, monitor work are canceled after state is marked
-	 * as DEAD, so DEAD state is observed reliably.
 	 */
-	if (ub->dev_info.state != UBLK_S_DEV_DEAD)
+	if (ub->dev_info.state == UBLK_S_DEV_LIVE && !ub->force_abort)
 		schedule_delayed_work(&ub->monitor_work,
 				UBLK_DAEMON_MONITOR_PERIOD);
 }
@@ -1058,10 +1215,35 @@  static void ublk_cancel_dev(struct ublk_device *ub)
 static void ublk_stop_dev(struct ublk_device *ub)
 {
 	mutex_lock(&ub->mutex);
-	if (ub->dev_info.state != UBLK_S_DEV_LIVE)
+	if (ub->dev_info.state == UBLK_S_DEV_DEAD)
 		goto unlock;
 
+	/*
+	 * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
+	 * we will hang on del_gendisk() and wait for all inflight rqs'
+	 * completion.
+	 */
+	ub->force_abort = true;
+	/* wait until monitor_work is not scheduled */
+	cancel_delayed_work_sync(&ub->monitor_work);
+	pr_devel("%s: monitor work is canceled.\n", __func__);
+	/* unquiesce q and let all inflight rqs' be aborted */
+	if (blk_queue_quiesced(ub->ub_disk->queue)) {
+		blk_mq_unquiesce_queue(ub->ub_disk->queue);
+		pr_devel("%s: queue unquiesced.\n", __func__);
+	}
+	/* requeued requests will be aborted ASAP because of 'force_abort' */
+	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	/* forward progress */
+	schedule_work(&ub->abort_work);
+	pr_devel("%s: abort work is scheduled, start deleting gendisk...\n",
+			__func__);
 	del_gendisk(ub->ub_disk);
+	pr_devel("%s: gendisk is deleted.\n", __func__);
+	ub->force_abort = false;
+	cancel_work_sync(&ub->abort_work);
+	pr_devel("%s: abort work is canceled.\n", __func__);
 	ub->dev_info.state = UBLK_S_DEV_DEAD;
 	ub->dev_info.ublksrv_pid = -1;
 	put_disk(ub->ub_disk);
@@ -1069,7 +1251,6 @@  static void ublk_stop_dev(struct ublk_device *ub)
  unlock:
 	ublk_cancel_dev(ub);
 	mutex_unlock(&ub->mutex);
-	cancel_delayed_work_sync(&ub->monitor_work);
 }
 
 /* device can only be started after all IOs are ready */
@@ -1560,6 +1741,7 @@  static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 		goto out_unlock;
 	mutex_init(&ub->mutex);
 	spin_lock_init(&ub->mm_lock);
+	INIT_WORK(&ub->abort_work, ublk_abort_work_fn);
 	INIT_WORK(&ub->stop_work, ublk_stop_work_fn);
 	INIT_DELAYED_WORK(&ub->monitor_work, ublk_daemon_monitor_work);