
[RFC,V2,5/6] ublk_drv: consider recovery feature in aborting mechanism

Message ID 20220831155136.23434-6-ZiyangZhang@linux.alibaba.com (mailing list archive)
State New, archived
Series ublk_drv: add USER_RECOVERY support

Commit Message

Ziyang Zhang Aug. 31, 2022, 3:51 p.m. UTC
We change the default behavior of the aborting mechanism. Now monitor_work
is no longer manually scheduled by ublk_queue_rq() or task_work after a
ubq_daemon or process is dying(PF_EXITING). Instead, the monitor work must
find a dying ubq_daemon or a crashed process by itself; then it can
start the aborting mechanism. We make this modification because we want
to strictly separate the STOP_DEV procedure from monitor_work. More
specifically, we ensure that monitor_work cannot be scheduled after
we start deleting the gendisk and ending(aborting) all inflight rqs. In this
way it is easy to add the recovery feature and unify it with the existing
aborting mechanism. We really do not want too many "if can_use_recovery"
checks.

With recovery feature disabled and after a ubq_daemon crash:
(1) monitor_work notices the crash and schedules stop_work
(2) stop_work calls ublk_stop_dev()
(3) In ublk_stop_dev():
    (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();
	    If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
		in fallback wq.
	(b) Then it cancels monitor_work;
	(c) Then it schedules abort_work which ends(aborts) all inflight rqs.
	(d) At the same time del_gendisk() is called.
	(e) Finally, we complete all ioucmds.

Note: we do not change the existing behavior with recovery disabled. Note
that the STOP_DEV ctrl-cmd can be processed without regard to monitor_work.

With recovery feature enabled and after a process crash:
(1) monitor_work notices the crash and that all ubq_daemons are dying.
    We do not consider a "single" ubq_daemon(pthread) crash. Please send
    the STOP_DEV ctrl-cmd, which calls ublk_stop_dev(), for this case.
(2) The monitor_work quiesces request queue.
(3) The monitor_work checks if there is any inflight rq with
    UBLK_IO_FLAG_ACTIVE set. If so, we give up and schedule monitor_work
    later to retry. This is because we have to wait until these rqs are
    requeued(IDLE), after which we can safely complete their ioucmds.
    Otherwise we may cause a UAF on the ioucmd in the fallback wq.
(4) If the check in (3) passes, we should requeue/abort inflight rqs issued
    to the crashed ubq_daemon. If UBLK_F_USER_RECOVERY_REISSUE is set,
    the rq is requeued; otherwise it is aborted.
(5) All ioucmds are completed by calling io_uring_cmd_done().
(6) monitor_work sets ub's state to UBLK_S_DEV_RECOVERING. It does not
    schedule itself anymore. Now we are ready for START_USER_RECOVERY.

Note: If (3) fails, monitor_work schedules itself and retries (3). We allow
the user to manually start the STOP_DEV procedure without regard to monitor_work.
STOP_DEV can cancel monitor_work, unquiesce the request queue and drain all
requeued rqs. More importantly, STOP_DEV can safely complete all ioucmds
since monitor_work has been canceled at that moment.

Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c | 222 +++++++++++++++++++++++++++++++++++----
 1 file changed, 202 insertions(+), 20 deletions(-)

Comments

Ming Lei Sept. 3, 2022, 1:30 p.m. UTC | #1
On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang
<ZiyangZhang@linux.alibaba.com> wrote:
>
> We change the default behavior of aborting machenism. Now monitor_work
> will not be manually scheduled by ublk_queue_rq() or task_work after a
> ubq_daemon or process is dying(PF_EXITING). The monitor work should
> find a dying ubq_daemon or a crash process by itself. Then, it can

It looks like you don't consider one dying ubq_daemon as a crash candidate.
Most of the io implementation is done in the ubq pthread, so it should be
covered by the crash recovery.

> start the aborting machenism. We do such modification is because we want
> to strictly separate the STOP_DEV procedure and monitor_work. More
> specifically, we ensure that monitor_work must not be scheduled after
> we start deleting gendisk and ending(aborting) all inflight rqs. In this
> way we are easy to consider recovery feature and unify it into existing
> aborting mechanism. Really we do not want too many "if can_use_recovery"
> checks.

Frankly speaking, not sure we need to invent new wheel for the
'aborting' mechanism.

In theory, you needn't change the current monitor work and cancel
dev/queue. What you need is how to handle the dying ubq daemon:

1) without user recovery, delete the disk if any ubq daemon has died.

2) with user recovery:
    - quiesce request queue and wait until all inflight requests are
requeued(become IDLE);
    - call io_uring_cmd_done for any active io slot
    - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling
      the potential crash; if it is confirmed as crash by userspace,
      userspace will send command to handle it.
    (this way will simplify userspace too, since we can add one utility
    and provide it via udev script for handling recovery)

>
> With recovery feature disabled and after a ubq_daemon crash:
> (1) monitor_work notices the crash and schedules stop_work

the driver can't figure out whether it is a crash; it can only see whether
the ubq daemon has died. And crash detection logic should be done
in userspace, IMO.

> (2) stop_work calls ublk_stop_dev()
> (3) In ublk_stop_dev():
>     (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();

Please don't add new flag in fast path lockless, and the original check
is supposed to be reused for recovery feature.

>             If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
>                 in fallback wq.
>         (b) Then it cancels monitor_work;
>         (c) Then it schedules abort_work which ends(aborts) all inflight rqs.
>         (d) At the same time del_gendisk() is called.
>         (e) Finally, we complete all ioucmds.
>
> Note: we do not change the existing behavior with reocvery disabled. Note
> that STOP_DEV ctrl-cmd can be processed without reagrd to monitor_work.
>
> With recovery feature enabled and after a process crash:
> (1) monitor_work notices the crash and all ubq_daemon are dying.
>     We do not consider a "single" ubq_daemon(pthread) crash. Please send
>         STOP_DEV ctrl-cmd which calling ublk_stop_dev() for this case.

Can you explain why you don't consider it as a crash? IMO, most of the
userspace block logic runs in the ubq_daemon, so it is reasonable to
consider it one.

ublk_reinit_dev() is supposed to be run in a standalone context, just like
ublk_stop_dev(); we need monitor_work to provide forward progress,
so don't wait in the monitor work.

And please don't change this model for making forward progress.


> (2) The monitor_work quiesces request queue.
> (3) The monotor_work checks if there is any inflight rq with
>     UBLK_IO_FLAG_ACTIVE unset. If so, we give up and schedule monitor_work
>         later to retry. This is because we have to wait these rqs requeued(IDLE)
>         and we are safe to complete their ioucmds later. Otherwise we may cause
>         UAF on ioucmd in fallback wq.
> (4) If check in (3) passes, we should requeue/abort inflight rqs issued
>     to the crash ubq_daemon before. If UBLK_F_USER_RECOVERY_REISSUE is set,
>         rq is requeued. Otherwise it is aborted.
> (5) All ioucmds are completed by calling io_uring_cmd_done().
> (6) monitor_work set ub's state to UBLK_S_DEV_RECOVERING. It does not
>     scheduled itself anymore. Now we are ready for START_USER_RECOVERY.
>
> Note: If (3) fails, monitor_work schedules itself and retires (3). We allow
> user to manually start STOP_DEV procedure without reagrd to monitor_work.
> STOP_DEV can cancel monitor_work, unquiesce request queue and drain all
> requeued rqs. More importantly, STOP_DEV can safely complete all ioucmds
> since monitor_work has been canceled at that moment.
>
> Signed-off-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
> ---
>  drivers/block/ublk_drv.c | 222 +++++++++++++++++++++++++++++++++++----
>  1 file changed, 202 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 296b9d80f003..0e185d1fa2c4 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -156,7 +156,7 @@ struct ublk_device {
>
>         struct completion       completion;
>         unsigned int            nr_queues_ready;
> -       atomic_t                nr_aborted_queues;
> +       bool    force_abort;
>
>         /*
>          * Our ubq->daemon may be killed without any notification, so
> @@ -164,6 +164,7 @@ struct ublk_device {
>          */
>         struct delayed_work     monitor_work;
>         struct work_struct      stop_work;
> +       struct work_struct      abort_work;
>  };
>
>  /* header of ublk_params */
> @@ -643,9 +644,13 @@ static void ublk_complete_rq(struct request *req)
>   */
>  static void __ublk_fail_req(struct ublk_io *io, struct request *req)
>  {
> +       struct ublk_queue *ubq = req->mq_hctx->driver_data;
> +
>         WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
>
>         if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
> +               pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
> +                               __func__, ubq->q_id, req->tag, io->flags);
>                 io->flags |= UBLK_IO_FLAG_ABORTED;
>                 blk_mq_end_request(req, BLK_STS_IOERR);
>         }
> @@ -664,6 +669,8 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
>
>         /* tell ublksrv one io request is coming */
>         io_uring_cmd_done(io->cmd, res, 0);
> +       pr_devel("%s: complete ioucmd: res %d io_flags %x\n",
> +                       __func__, res, io->flags);
>  }
>
>  #define UBLK_REQUEUE_DELAY_MS  3
> @@ -675,11 +682,6 @@ static inline void __ublk_rq_task_work(struct request *req)
>         int tag = req->tag;
>         struct ublk_io *io = &ubq->ios[tag];
>         unsigned int mapped_bytes;
> -
> -       pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
> -                       __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
> -                       ublk_get_iod(ubq, req->tag)->addr);
> -
>         /*
>          * Task is exiting if either:
>          *
> @@ -700,10 +702,13 @@ static inline void __ublk_rq_task_work(struct request *req)
>                 } else {
>                         blk_mq_end_request(req, BLK_STS_IOERR);
>                 }
> -               mod_delayed_work(system_wq, &ub->monitor_work, 0);
>                 return;
>         }
>
> +       pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
> +                       __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
> +                       ublk_get_iod(ubq, req->tag)->addr);
> +
>         if (ublk_need_get_data(ubq) &&
>                         (req_op(req) == REQ_OP_WRITE ||
>                         req_op(req) == REQ_OP_FLUSH)) {
> @@ -782,6 +787,11 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>         struct ublk_io *io = &ubq->ios[rq->tag];
>         blk_status_t res;
>
> +       if (unlikely(ubq->dev->force_abort)) {
> +               pr_devel("%s: abort q_id %d tag %d io_flags %x.\n",
> +                               __func__, ubq->q_id, rq->tag, io->flags);
> +               return BLK_STS_IOERR;
> +       }

Checking one flag lockless is usually not safe, also not sure why we
need such flag here, and the original check is supposed to work.

>         /* fill iod to slot in io cmd buffer */
>         res = ublk_setup_iod(ubq, rq);
>         if (unlikely(res != BLK_STS_OK))
> @@ -789,13 +799,15 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>
>         blk_mq_start_request(bd->rq);
>
> +       pr_devel("%s: start rq: q_id %d tag %d io_flags %x.\n",
> +                       __func__, ubq->q_id, rq->tag, io->flags);
> +
>         if (unlikely(ubq_daemon_is_dying(ubq))) {
>   fail:
>                 pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
>                                 (ublk_can_use_recovery(ubq->dev)) ? "requeue" : "abort",
>                                 ubq->q_id, rq->tag, io->flags);
>
> -               mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
>                 if (ublk_can_use_recovery(ubq->dev)) {
>                         /* We cannot process this rq so just requeue it. */
>                         blk_mq_requeue_request(rq, false);
> @@ -880,6 +892,7 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
>         if (test_and_set_bit(UB_STATE_OPEN, &ub->state))
>                 return -EBUSY;
>         filp->private_data = ub;
> +       pr_devel("%s: open /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
>         return 0;
>  }
>
> @@ -888,6 +901,7 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
>         struct ublk_device *ub = filp->private_data;
>
>         clear_bit(UB_STATE_OPEN, &ub->state);
> +       pr_devel("%s: release /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
>         return 0;
>  }
>
> @@ -971,37 +985,180 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>                          * will do it
>                          */
>                         rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
> -                       if (rq)
> +                       if (rq && blk_mq_request_started(rq))
>                                 __ublk_fail_req(io, rq);
>                 }
>         }
>         ublk_put_device(ub);
>  }
>
> -static void ublk_daemon_monitor_work(struct work_struct *work)
> +
> +
> +static void ublk_abort_work_fn(struct work_struct *work)
>  {
>         struct ublk_device *ub =
> -               container_of(work, struct ublk_device, monitor_work.work);
> +               container_of(work, struct ublk_device, abort_work);
> +
> +       int i;
> +
> +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +               if (ubq_daemon_is_dying(ubq))
> +                       ublk_abort_queue(ub, ubq);
> +       }
> +
> +       if (ub->force_abort)
> +               schedule_work(&ub->abort_work);
> +}
> +
> +static void ublk_reinit_queue(struct ublk_device *ub,
> +               struct ublk_queue *ubq)

ublk_reinit_queue() is a bad name; it is called during aborting, and what
we need there is to shut the queue down completely. Reinit or reset should
be done when you start to recover; at that time you can really reinit
or reset the queue.

> +{
> +       int i;
> +
> +       for (i = 0; i < ubq->q_depth; i++) {
> +               struct ublk_io *io = &ubq->ios[i];
> +
> +               if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
> +                       struct request *rq = blk_mq_tag_to_rq(
> +                                       ub->tag_set.tags[ubq->q_id], i);
> +
> +                       WARN_ON_ONCE(!rq);
> +                       pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n",
> +                                       __func__,
> +                                       ublk_can_use_recovery_reissue(ub) ? "requeue" : "abort",
> +                                       ubq->q_id, i, io->flags);
> +                       if (ublk_can_use_recovery_reissue(ub))
> +                               blk_mq_requeue_request(rq, false);
> +                       else
> +                               __ublk_fail_req(io, rq);
> +
> +               } else {
> +                       io_uring_cmd_done(io->cmd,
> +                                       UBLK_IO_RES_ABORT, 0);
> +                       io->flags &= ~UBLK_IO_FLAG_ACTIVE;
> +                       pr_devel("%s: send UBLK_IO_RES_ABORT: qid %d tag %d io_flags %x\n",
> +                               __func__, ubq->q_id, i, io->flags);
> +               }
> +               ubq->nr_io_ready--;
> +       }
> +       WARN_ON_ONCE(ubq->nr_io_ready);
> +}
> +
> +static bool ublk_check_inflight_rq(struct request *rq, void *data)
> +{
> +       struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> +       struct ublk_io *io = &ubq->ios[rq->tag];
> +       bool *busy = data;
> +
> +       if (io->flags & UBLK_IO_FLAG_ACTIVE) {
> +               *busy = true;
> +               return false;
> +       }
> +       return true;
> +}
> +
> +static void ublk_reinit_dev(struct ublk_device *ub)

bad name too.

> +{
> +       int i;
> +       bool busy = false;
> +
> +       if (!ublk_get_device(ub))
> +               return;
> +
> +       /* If we have quiesced q, all ubq_daemons are dying */
> +       if (blk_queue_quiesced(ub->ub_disk->queue))
> +               goto check_inflight;
> +
> +       /* Recovery feature is for 'process' crash. */
> +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +               if (!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)))
> +                       goto out;
> +       }
> +
> +       pr_devel("%s: all ubq_daemons(nr: %d) are dying.\n",
> +                       __func__, ub->dev_info.nr_hw_queues);
> +
> +       /* Now all ubq_daemons are PF_EXITING, let's quiesce q. */
> +       blk_mq_quiesce_queue(ub->ub_disk->queue);
> +       pr_devel("%s: queue quiesced.\n", __func__);
> + check_inflight:
> +       /* All rqs have to be requeued(and stay in queue now) */
> +       blk_mq_tagset_busy_iter(&ub->tag_set, ublk_check_inflight_rq, &busy);
> +       if (busy) {
> +               pr_devel("%s: still some inflight rqs, retry later...\n",
> +                               __func__);
> +               goto out;
> +       }
> +
> +       pr_devel("%s: all inflight rqs are requeued.\n", __func__);
> +
> +       for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +               struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +               ublk_reinit_queue(ub, ubq);
> +       }
> +       /* So monitor_work won't be scheduled anymore */
> +       ub->dev_info.state = UBLK_S_DEV_RECOVERING;
> +       pr_devel("%s: convert state to UBLK_S_DEV_RECOVERING\n",
> +                       __func__);
> + out:
> +       ublk_put_device(ub);
> +}
> +
> +static void ublk_kill_dev(struct ublk_device *ub)
> +{
>         int i;
>
>         for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
>                 struct ublk_queue *ubq = ublk_get_queue(ub, i);
>
>                 if (ubq_daemon_is_dying(ubq)) {
> +                       pr_devel("%s: ubq(%d) is dying, schedule stop_work now\n",
> +                                       __func__, i);
>                         schedule_work(&ub->stop_work);
> -
> -                       /* abort queue is for making forward progress */
> -                       ublk_abort_queue(ub, ubq);
>                 }
>         }
> +}
> +
> +static void ublk_daemon_monitor_work(struct work_struct *work)
> +{
> +       struct ublk_device *ub =
> +               container_of(work, struct ublk_device, monitor_work.work);
> +
> +       pr_devel("%s: mode(%s) running...\n",
> +                       __func__,
> +                       ublk_can_use_recovery(ub) ? "recovery" : "kill");
> +       /*
> +        * We can't schedule monitor work if:
> +        * (1) The state is DEAD.
> +        *     The gendisk is not alive, so either all rqs are ended
> +        *     or request queue is not created.
> +        *
> +        * (2) The state is RECOVERYING.
> +        *     The process crashed, all rqs were requeued and request queue
> +        *     was quiesced.
> +        */
> +       WARN_ON_ONCE(ub->dev_info.state != UBLK_S_DEV_LIVE);
>
> +       if (ublk_can_use_recovery(ub))
> +               ublk_reinit_dev(ub);

ublk_reinit_dev() is supposed to be run in standalone context, just like
ublk_stop_dev(), we need monitor_work to provide forward progress,
so don't run wait in monitor work.

And please don't change this model of making forward progress.

> +       else
> +               ublk_kill_dev(ub);
>         /*
> -        * We can't schedule monitor work after ublk_remove() is started.
> +        * No need ub->mutex, monitor work are canceled after ub is marked
> +        * as force_abort which is observed reliably.
> +        *
> +        * Note:
> +        * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
> +        * we will hang on del_gendisk() and wait for all inflight rqs'
> +        * completion.
>          *
> -        * No need ub->mutex, monitor work are canceled after state is marked
> -        * as DEAD, so DEAD state is observed reliably.
>          */
> -       if (ub->dev_info.state != UBLK_S_DEV_DEAD)
> +       if (ub->dev_info.state == UBLK_S_DEV_LIVE && !ub->force_abort)
>                 schedule_delayed_work(&ub->monitor_work,
>                                 UBLK_DAEMON_MONITOR_PERIOD);
>  }
> @@ -1058,10 +1215,35 @@ static void ublk_cancel_dev(struct ublk_device *ub)
>  static void ublk_stop_dev(struct ublk_device *ub)
>  {
>         mutex_lock(&ub->mutex);
> -       if (ub->dev_info.state != UBLK_S_DEV_LIVE)
> +       if (ub->dev_info.state == UBLK_S_DEV_DEAD)
>                 goto unlock;
>
> +       /*
> +        * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
> +        * we will hang on del_gendisk() and wait for all inflight rqs'
> +        * completion.
> +        */
> +       ub->force_abort = true;
> +       /* wait until monitor_work is not scheduled */
> +       cancel_delayed_work_sync(&ub->monitor_work);
> +       pr_devel("%s: monitor work is canceled.\n", __func__);
> +       /* unquiesce q and let all inflight rqs' be aborted */
> +       if (blk_queue_quiesced(ub->ub_disk->queue)) {
> +               blk_mq_unquiesce_queue(ub->ub_disk->queue);
> +               pr_devel("%s: queue unquiesced.\n", __func__);
> +       }
> +       /* requeued requests will be aborted ASAP because of 'force_abort' */
> +       blk_mq_kick_requeue_list(ub->ub_disk->queue);
> +       /* forward progress */
> +       schedule_work(&ub->abort_work);
> +       pr_devel("%s: abort work is scheduled, start delete gendisk...\n",
> +                       __func__);
> +       pr_devel("%s: gendisk is deleted.\n", __func__);
>         del_gendisk(ub->ub_disk);
> +       pr_devel("%s: gendisk is deleted.\n", __func__);
> +       ub->force_abort = false;
> +       cancel_work_sync(&ub->abort_work);
> +       pr_devel("%s: abort work is canceled.\n", __func__);
>         ub->dev_info.state = UBLK_S_DEV_DEAD;
>         ub->dev_info.ublksrv_pid = -1;
>         put_disk(ub->ub_disk);

ublk_stop_dev() isn't supposed to be changed so much, and it can be just
for handling non-recovery, but it can be renamed as ublk_delete_disk().

For recovery, you can add one function of ublk_quiesce_disk() for
preparing for recovering.

thanks
Ming
Ziyang Zhang Sept. 4, 2022, 11:23 a.m. UTC | #2
On 2022/9/3 21:30, Ming Lei wrote:
> On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang
> <ZiyangZhang@linux.alibaba.com> wrote:
>>
>> We change the default behavior of aborting machenism. Now monitor_work
>> will not be manually scheduled by ublk_queue_rq() or task_work after a
>> ubq_daemon or process is dying(PF_EXITING). The monitor work should
>> find a dying ubq_daemon or a crash process by itself. Then, it can
> 
> Looks you don't consider one dying ubq_daemon as one crash candidate.
> Most io implementation is done in the ubq pthread, so it should be
> covered by the crash recovery.
> 
>> start the aborting machenism. We do such modification is because we want
>> to strictly separate the STOP_DEV procedure and monitor_work. More
>> specifically, we ensure that monitor_work must not be scheduled after
>> we start deleting gendisk and ending(aborting) all inflight rqs. In this
>> way we are easy to consider recovery feature and unify it into existing
>> aborting mechanism. Really we do not want too many "if can_use_recovery"
>> checks.
> 
> Frankly speaking, not sure we need to invent new wheel for the
> 'aborting' mechanism.
> 
> In theory, you needn't change the current monitor work and cancel
> dev/queue. What you need is how to handle the dying ubq daemon:
> 
> 1) without user recovery, delete disk if any ubq daemon is died.
> 
> 2) with user recovery:
>     - quiesce request queue and wait until all inflight requests are
> requeued(become IDLE);
>     - call io_uring_cmd_done for any active io slot
>     - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling
>       the potential crash; if it is confirmed as crash by userspace,
>       userspace will send command to handle it.
>     (this way will simplify userspace too, since we can add one utility
>     and provide it via udev script for handling recovery)
> 
>>
>> With recovery feature disabled and after a ubq_daemon crash:
>> (1) monitor_work notices the crash and schedules stop_work
> 
> driver can't figure out if it is crash, and it can just see if the
> ubq deamon is died or not. And crash detection logic should be done
> in userspace, IMO.
> 
>> (2) stop_work calls ublk_stop_dev()
>> (3) In ublk_stop_dev():
>>     (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();
> 
> Please don't add new flag in fast path lockless, and the original check
> is supposed to be reused for recovery feature.
> 
>>             If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
>>                 in fallback wq.
>>         (b) Then it cancels monitor_work;
>>         (c) Then it schedules abort_work which ends(aborts) all inflight rqs.
>>         (d) At the same time del_gendisk() is called.
>>         (e) Finally, we complete all ioucmds.
>>
>> Note: we do not change the existing behavior with reocvery disabled. Note
>> that STOP_DEV ctrl-cmd can be processed without reagrd to monitor_work.
>>
>> With recovery feature enabled and after a process crash:
>> (1) monitor_work notices the crash and all ubq_daemon are dying.
>>     We do not consider a "single" ubq_daemon(pthread) crash. Please send
>>         STOP_DEV ctrl-cmd which calling ublk_stop_dev() for this case.
> 
> Can you consider why you don't consider it as one crash? IMO, most of
> userspace block logic is run in ubq_daemon, so it is reasonable to
> consider it.
> 
> ublk_reinit_dev() is supposed to be run in standalone context, just like
> ublk_stop_dev(), we need monitor_work to provide forward progress,
> so don't run wait in monitor work.
> 
> And please don't change this model for making forward progress.
> 
> 

Hi, Ming.

I will take your advice and provide V4 soon. Here is the new design:

(0) No modification in the fast path. We just requeue rqs with a dying ubq_daemon
    and schedule monitor_work immediately.
    
    BTW: I think here we should call 'blk_mq_delay_kick_requeue_list()' after
    requeuing a rq. Otherwise del_gendisk() in ublk_stop_dev() hangs.

(1) Add quiesce_work, which is scheduled by monitor_work after a ubq_daemon
    is dying with recovery enabled.

(2) quiesce_work runs ublk_quiesce_dev(). It acquires the ub lock, and
    quiesces the request queue(only once). On each dying ubq, it calls
    ublk_quiesce_queue(), which waits until all inflight rqs(ACTIVE) are
    requeued(IDLE). Finally it completes all ioucmds.
    Note: So we need to add a per-ubq flag 'quiesced', which means
    we have done this 'quiesce && clean' stuff on the ubq.

(3) After the request queue is quiesced, change ub's state to STATE_QUIESCED.
    This state can be checked by GET_DEV_INFO ctrl-cmd, just like STATE_LIVE. So
    user can detect a crash by sending GET_DEV_INFO and getting STATE_QUIESCED
    back.
    
    BTW, I'm unsure that sending one kobj_uevent(KOBJ_CHANGE) really helps. Users
    have many ways to detect a dying process/pthread. For example, they can 'ps'
    the ublksrv_pid or check ub's state with the GET_DEV_INFO ctrl-cmd. Anyway,
    this work can be done in the future; we can introduce a better way to detect
    a crash then. For this patchset, let's focus on how to deal with a dying
    ubq_daemon. Do you agree?

(4) Do not change ublk_stop_dev(). BTW, ublk_stop_dev() and ublk_quiesce_dev()
    exclude each other by acquiring the ub lock.

(5) ublk_stop_dev() has to consider a quiesced ubq. It should unquiesce request
    queue(only once) if it is quiesced. There is nothing else ublk_stop_dev()
    has to do. Inflight rqs requeued before will be aborted naturally by
    del_gendisk().

(6) ublk_quiesce_dev() cannot be run after gendisk is removed(STATE_DEAD).

(7) No need to run ublk_quiesce_queue() on a 'quiesced' ubq by checking the flag.
    Note: I think this check is safe here.

(8) START_USER_RECOVERY needs to consider both a dying process and pthread(ubq_daemon).

    For a dying process, it has to reset ub->dev_info.ublksrv_pid and ub->mm. This can
    be done by passing qid = -1 in the ctrl-cmd. We should make sure all ubq_daemons
    are dying in this case.

    Otherwise it is a dying pthread, and only this ubq is reinitialized. Users may send
    many START_USER_RECOVERY commands with different qids to recover many ubqs.

 
Thanks for reviewing patches.

Regards,
Zhang.
Ming Lei Sept. 6, 2022, 1:12 a.m. UTC | #3
On Sun, Sep 04, 2022 at 07:23:49PM +0800, Ziyang Zhang wrote:
> On 2022/9/3 21:30, Ming Lei wrote:
> > On Wed, Aug 31, 2022 at 11:54 PM ZiyangZhang
> > <ZiyangZhang@linux.alibaba.com> wrote:
> >>
> >> We change the default behavior of aborting machenism. Now monitor_work
> >> will not be manually scheduled by ublk_queue_rq() or task_work after a
> >> ubq_daemon or process is dying(PF_EXITING). The monitor work should
> >> find a dying ubq_daemon or a crash process by itself. Then, it can
> > 
> > Looks you don't consider one dying ubq_daemon as one crash candidate.
> > Most io implementation is done in the ubq pthread, so it should be
> > covered by the crash recovery.
> > 
> >> start the aborting machenism. We do such modification is because we want
> >> to strictly separate the STOP_DEV procedure and monitor_work. More
> >> specifically, we ensure that monitor_work must not be scheduled after
> >> we start deleting gendisk and ending(aborting) all inflight rqs. In this
> >> way we are easy to consider recovery feature and unify it into existing
> >> aborting mechanism. Really we do not want too many "if can_use_recovery"
> >> checks.
> > 
> > Frankly speaking, not sure we need to invent new wheel for the
> > 'aborting' mechanism.
> > 
> > In theory, you needn't change the current monitor work and cancel
> > dev/queue. What you need is how to handle the dying ubq daemon:
> > 
> > 1) without user recovery, delete disk if any ubq daemon is died.
> > 
> > 2) with user recovery:
> >     - quiesce request queue and wait until all inflight requests are
> > requeued(become IDLE);
> >     - call io_uring_cmd_done for any active io slot
> >     - send one kobj_uevent(KOBJ_CHANGE) to notify userspace for handling
> >       the potential crash; if it is confirmed as crash by userspace,
> >       userspace will send command to handle it.
> >     (this way will simplify userspace too, since we can add one utility
> >     and provide it via udev script for handling recovery)
> > 
> >>
> >> With recovery feature disabled and after a ubq_daemon crash:
> >> (1) monitor_work notices the crash and schedules stop_work
> > 
> > driver can't figure out if it is crash, and it can just see if the
> > ubq deamon is died or not. And crash detection logic should be done
> > in userspace, IMO.
> > 
> >> (2) stop_work calls ublk_stop_dev()
> >> (3) In ublk_stop_dev():
> >>     (a) It sets 'force_abort', which prevents new rqs in ublk_queue_rq();
> > 
> > Please don't add a new lockless flag in the fast path; the original
> > check is supposed to be reused for the recovery feature.
> > 
> >>             If ublk_queue_rq() does not see it, rqs can still be ended(aborted)
> >>                 in fallback wq.
> >>         (b) Then it cancels monitor_work;
> >>         (c) Then it schedules abort_work which ends(aborts) all inflight rqs.
> >>         (d) At the same time del_gendisk() is called.
> >>         (e) Finally, we complete all ioucmds.
> >>
> >> Note: we do not change the existing behavior with recovery disabled. Note
> >> that the STOP_DEV ctrl-cmd can be processed without regard to monitor_work.
> >>
> >> With recovery feature enabled and after a process crash:
> >> (1) monitor_work notices the crash and that all ubq_daemons are dying.
> >>     We do not consider a "single" ubq_daemon(pthread) crash. Please send
> >>     the STOP_DEV ctrl-cmd, which calls ublk_stop_dev(), for this case.
> > 
> > Can you explain why you don't treat it as a crash? IMO, most of the
> > userspace block logic runs in the ubq_daemon, so it is reasonable to
> > consider it one.
> > 
> > ublk_reinit_dev() is supposed to run in a standalone context, just like
> > ublk_stop_dev(); we need monitor_work to provide forward progress,
> > so don't wait inside the monitor work.
> > 
> > And please don't change this model for making forward progress.
> > 
> > 
> 
> Hi, Ming.
> 
> I will take your advice and provide V4 soon. Here is the new design:
> 
> (0) No modification in the fast path. We just requeue rqs with a dying ubq_daemon
>     and schedule monitor_work immediately.
>     
>     BTW: I think here we should call 'blk_mq_delay_kick_requeue_list()' after
>     requeuing a rq. Otherwise del_gendisk() in ublk_stop_dev() hangs.

No.

IMO you can add one helper for either ending the request or adding it to the
requeue list; then the code is still clean.

And when you call del_gendisk() in case of being in recovery state,
blk_mq_unquiesce_queue() will handle/abort them and make del_gendisk()
move on.

> 
> (1) Add quiesce_work, which is scheduled by monitor_work after a ubq_daemon
>     is dying with recovery enabled.
> 
> (2) quiesce_work runs ublk_quiesce_dev(). It acquires the ub lock and
>     quiesces the request queue (only once). On each dying ubq, it calls
>     ublk_quiesce_queue(), which waits until all inflight rqs (ACTIVE) are
>     requeued (IDLE). Finally it completes all ioucmds.
>     Note: so we need to add a per-ubq flag 'quiesced', which means
>     we have done this 'quiesce && clean' stuff on the ubq.

ubq->nr_io_ready should be reused for checking whether the queue is quiesced.
It is actually the same as the current usage.

> 
> (3) After the request queue is quiesced, change ub's state to STATE_QUIESCED.
>     This state can be checked by GET_DEV_INFO ctrl-cmd, just like STATE_LIVE. So
>     user can detect a crash by sending GET_DEV_INFO and getting STATE_QUIESCED
>     back.
>     
>     BTW, I'm unsure that sending one kobj_uevent(KOBJ_CHANGE) really helps. Users
>     have many ways to detect a dying process/pthread. For example, they can 'ps'
>     ublksrv_pid or check ub's state with the GET_DEV_INFO ctrl-cmd. Anyway, this work
>     can be done in the future. We can introduce a better way to detect a crash.
>     For this patchset, let's focus on how to deal with a dying ubq_daemon.
>     Do you agree?

kobj_uevent(KOBJ_CHANGE) is useful because it can avoid userspace polling on
the device. It gives the userspace utility an accurate chance to confirm
whether it is really a crash.

And checking whether the ubq daemon/process has crashed is really userspace's
responsibility, but sending the uevent once the ubq daemon/process is dying
helps userspace run the check; at least polling can be avoided, or done in
time.

If you don't want to add it from beginning, that is fine, and we can do it
after your patchset is merged.

> 
> (4) Do not change ublk_stop_dev(). BTW, ublk_stop_dev() and ublk_quiesce_dev()
>     exclude each other by acquiring the ub lock.
> 
> (5) ublk_stop_dev() has to consider a quiesced ubq. It should unquiesce the
>     request queue (only once) if it is quiesced. There is nothing else
>     ublk_stop_dev() has to do. Inflight rqs requeued earlier will be aborted
>     naturally by del_gendisk().
> 
> (6) ublk_quiesce_dev() cannot be run after gendisk is removed(STATE_DEAD).
> 
> (7) No need to run ublk_quiesce_queue() on a 'quiesced' ubq by checking the flag.
>     Note: I think this check is safe here.
> 
> (8) START_USER_RECOVERY needs to consider both a dying process and a dying
>     pthread (ubq_daemon).
> 
>     For a dying process, it has to reset ub->dev_info.ublksrv_pid and ub->mm. This can
>     be done by passing qid = -1 in the ctrl-cmd. We should make sure all ubq_daemons
>     are dying in this case.
> 
>     Otherwise it is a dying pthread. Only that ubq is reinitialized. Users may send
>     multiple START_USER_RECOVERY commands with different qids to recover several ubqs.

The simplest handling might be to exit all ublk queues first, and re-create one
new process to recover them all, since the request queue is required to be
quiesced first and all ublk queues are actually quiesced too. So from the
user's viewpoint, there is nothing visible compared with recovering a
single ubq daemon/pthread.


Thanks,
Ming
Patch

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 296b9d80f003..0e185d1fa2c4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -156,7 +156,7 @@  struct ublk_device {
 
 	struct completion	completion;
 	unsigned int		nr_queues_ready;
-	atomic_t		nr_aborted_queues;
+	bool	force_abort;
 
 	/*
 	 * Our ubq->daemon may be killed without any notification, so
@@ -164,6 +164,7 @@  struct ublk_device {
 	 */
 	struct delayed_work	monitor_work;
 	struct work_struct	stop_work;
+	struct work_struct	abort_work;
 };
 
 /* header of ublk_params */
@@ -643,9 +644,13 @@  static void ublk_complete_rq(struct request *req)
  */
 static void __ublk_fail_req(struct ublk_io *io, struct request *req)
 {
+	struct ublk_queue *ubq = req->mq_hctx->driver_data;
+
 	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
 
 	if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
+		pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
+				__func__, ubq->q_id, req->tag, io->flags);
 		io->flags |= UBLK_IO_FLAG_ABORTED;
 		blk_mq_end_request(req, BLK_STS_IOERR);
 	}
@@ -664,6 +669,8 @@  static void ubq_complete_io_cmd(struct ublk_io *io, int res)
 
 	/* tell ublksrv one io request is coming */
 	io_uring_cmd_done(io->cmd, res, 0);
+	pr_devel("%s: complete ioucmd: res %d io_flags %x\n",
+			__func__, res, io->flags);
 }
 
 #define UBLK_REQUEUE_DELAY_MS	3
@@ -675,11 +682,6 @@  static inline void __ublk_rq_task_work(struct request *req)
 	int tag = req->tag;
 	struct ublk_io *io = &ubq->ios[tag];
 	unsigned int mapped_bytes;
-
-	pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
-			__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
-			ublk_get_iod(ubq, req->tag)->addr);
-
 	/*
 	 * Task is exiting if either:
 	 *
@@ -700,10 +702,13 @@  static inline void __ublk_rq_task_work(struct request *req)
 		} else {
 			blk_mq_end_request(req, BLK_STS_IOERR);
 		}
-		mod_delayed_work(system_wq, &ub->monitor_work, 0);
 		return;
 	}
 
+	pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
+			__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
+			ublk_get_iod(ubq, req->tag)->addr);
+
 	if (ublk_need_get_data(ubq) &&
 			(req_op(req) == REQ_OP_WRITE ||
 			req_op(req) == REQ_OP_FLUSH)) {
@@ -782,6 +787,11 @@  static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct ublk_io *io = &ubq->ios[rq->tag];
 	blk_status_t res;
 
+	if (unlikely(ubq->dev->force_abort)) {
+		pr_devel("%s: abort q_id %d tag %d io_flags %x.\n",
+				__func__, ubq->q_id, rq->tag, io->flags);
+		return BLK_STS_IOERR;
+	}
 	/* fill iod to slot in io cmd buffer */
 	res = ublk_setup_iod(ubq, rq);
 	if (unlikely(res != BLK_STS_OK))
@@ -789,13 +799,15 @@  static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(bd->rq);
 
+	pr_devel("%s: start rq: q_id %d tag %d io_flags %x.\n",
+			__func__, ubq->q_id, rq->tag, io->flags);
+
 	if (unlikely(ubq_daemon_is_dying(ubq))) {
  fail:
 		pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
 				(ublk_can_use_recovery(ubq->dev)) ? "requeue" : "abort",
 				ubq->q_id, rq->tag, io->flags);
 
-		mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
 		if (ublk_can_use_recovery(ubq->dev)) {
 			/* We cannot process this rq so just requeue it. */
 			blk_mq_requeue_request(rq, false);
@@ -880,6 +892,7 @@  static int ublk_ch_open(struct inode *inode, struct file *filp)
 	if (test_and_set_bit(UB_STATE_OPEN, &ub->state))
 		return -EBUSY;
 	filp->private_data = ub;
+	pr_devel("%s: open /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
 	return 0;
 }
 
@@ -888,6 +901,7 @@  static int ublk_ch_release(struct inode *inode, struct file *filp)
 	struct ublk_device *ub = filp->private_data;
 
 	clear_bit(UB_STATE_OPEN, &ub->state);
+	pr_devel("%s: release /dev/ublkc%d\n", __func__, ub->dev_info.dev_id);
 	return 0;
 }
 
@@ -971,37 +985,180 @@  static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
 			 * will do it
 			 */
 			rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
-			if (rq)
+			if (rq && blk_mq_request_started(rq))
 				__ublk_fail_req(io, rq);
 		}
 	}
 	ublk_put_device(ub);
 }
 
-static void ublk_daemon_monitor_work(struct work_struct *work)
+
+
+static void ublk_abort_work_fn(struct work_struct *work)
 {
 	struct ublk_device *ub =
-		container_of(work, struct ublk_device, monitor_work.work);
+		container_of(work, struct ublk_device, abort_work);
+
+	int i;
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		if (ubq_daemon_is_dying(ubq))
+			ublk_abort_queue(ub, ubq);
+	}
+
+	if (ub->force_abort)
+		schedule_work(&ub->abort_work);
+}
+
+static void ublk_reinit_queue(struct ublk_device *ub,
+		struct ublk_queue *ubq)
+{
+	int i;
+
+	for (i = 0; i < ubq->q_depth; i++) {
+		struct ublk_io *io = &ubq->ios[i];
+
+		if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
+			struct request *rq = blk_mq_tag_to_rq(
+					ub->tag_set.tags[ubq->q_id], i);
+
+			WARN_ON_ONCE(!rq);
+			pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n",
+					__func__,
+					ublk_can_use_recovery_reissue(ub) ? "requeue" : "abort",
+					ubq->q_id, i, io->flags);
+			if (ublk_can_use_recovery_reissue(ub))
+				blk_mq_requeue_request(rq, false);
+			else
+				__ublk_fail_req(io, rq);
+
+		} else {
+			io_uring_cmd_done(io->cmd,
+					UBLK_IO_RES_ABORT, 0);
+			io->flags &= ~UBLK_IO_FLAG_ACTIVE;
+			pr_devel("%s: send UBLK_IO_RES_ABORT: qid %d tag %d io_flags %x\n",
+				__func__, ubq->q_id, i, io->flags);
+		}
+		ubq->nr_io_ready--;
+	}
+	WARN_ON_ONCE(ubq->nr_io_ready);
+}
+
+static bool ublk_check_inflight_rq(struct request *rq, void *data)
+{
+	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+	struct ublk_io *io = &ubq->ios[rq->tag];
+	bool *busy = data;
+
+	if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+		*busy = true;
+		return false;
+	}
+	return true;
+}
+
+static void ublk_reinit_dev(struct ublk_device *ub)
+{
+	int i;
+	bool busy = false;
+
+	if (!ublk_get_device(ub))
+		return;
+
+	/* If we have quiesced q, all ubq_daemons are dying */
+	if (blk_queue_quiesced(ub->ub_disk->queue))
+		goto check_inflight;
+
+	/* Recovery feature is for 'process' crash. */
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		if (!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)))
+			goto out;
+	}
+
+	pr_devel("%s: all ubq_daemons(nr: %d) are dying.\n",
+			__func__, ub->dev_info.nr_hw_queues);
+
+	/* Now all ubq_daemons are PF_EXITING, let's quiesce q. */
+	blk_mq_quiesce_queue(ub->ub_disk->queue);
+	pr_devel("%s: queue quiesced.\n", __func__);
+ check_inflight:
+	/* All rqs have to be requeued(and stay in queue now) */
+	blk_mq_tagset_busy_iter(&ub->tag_set, ublk_check_inflight_rq, &busy);
+	if (busy) {
+		pr_devel("%s: still some inflight rqs, retry later...\n",
+				__func__);
+		goto out;
+	}
+
+	pr_devel("%s: all inflight rqs are requeued.\n", __func__);
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		ublk_reinit_queue(ub, ubq);
+	}
+	/* So monitor_work won't be scheduled anymore */
+	ub->dev_info.state = UBLK_S_DEV_RECOVERING;
+	pr_devel("%s: convert state to UBLK_S_DEV_RECOVERING\n",
+			__func__);
+ out:
+	ublk_put_device(ub);
+}
+
+static void ublk_kill_dev(struct ublk_device *ub)
+{
 	int i;
 
 	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
 		struct ublk_queue *ubq = ublk_get_queue(ub, i);
 
 		if (ubq_daemon_is_dying(ubq)) {
+			pr_devel("%s: ubq(%d) is dying, schedule stop_work now\n",
+					__func__, i);
 			schedule_work(&ub->stop_work);
-
-			/* abort queue is for making forward progress */
-			ublk_abort_queue(ub, ubq);
 		}
 	}
+}
+
+static void ublk_daemon_monitor_work(struct work_struct *work)
+{
+	struct ublk_device *ub =
+		container_of(work, struct ublk_device, monitor_work.work);
+
+	pr_devel("%s: mode(%s) running...\n",
+			__func__,
+			ublk_can_use_recovery(ub) ? "recovery" : "kill");
+	/*
+	 * We can't schedule monitor work if:
+	 * (1) The state is DEAD.
+	 *     The gendisk is not alive, so either all rqs are ended
+	 *     or request queue is not created.
+	 *
+	 * (2) The state is RECOVERING.
+	 *     The process crashed, all rqs were requeued and request queue
+	 *     was quiesced.
+	 */
+	WARN_ON_ONCE(ub->dev_info.state != UBLK_S_DEV_LIVE);
 
+	if (ublk_can_use_recovery(ub))
+		ublk_reinit_dev(ub);
+	else
+		ublk_kill_dev(ub);
 	/*
-	 * We can't schedule monitor work after ublk_remove() is started.
+	 * No need for ub->mutex: monitor work is canceled after ub is marked
+	 * with force_abort, which is observed reliably.
+	 *
+	 * Note:
+	 * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
+	 * we will hang on del_gendisk() and wait for all inflight rqs'
+	 * completion.
 	 *
-	 * No need ub->mutex, monitor work are canceled after state is marked
-	 * as DEAD, so DEAD state is observed reliably.
 	 */
-	if (ub->dev_info.state != UBLK_S_DEV_DEAD)
+	if (ub->dev_info.state == UBLK_S_DEV_LIVE && !ub->force_abort)
 		schedule_delayed_work(&ub->monitor_work,
 				UBLK_DAEMON_MONITOR_PERIOD);
 }
@@ -1058,10 +1215,35 @@  static void ublk_cancel_dev(struct ublk_device *ub)
 static void ublk_stop_dev(struct ublk_device *ub)
 {
 	mutex_lock(&ub->mutex);
-	if (ub->dev_info.state != UBLK_S_DEV_LIVE)
+	if (ub->dev_info.state == UBLK_S_DEV_DEAD)
 		goto unlock;
 
+	/*
+	 * All incoming rqs are aborted in ublk_queue_rq ASAP. Then
+	 * we will hang on del_gendisk() and wait for all inflight rqs'
+	 * completion.
+	 */
+	ub->force_abort = true;
+	/* wait until monitor_work is not scheduled */
+	cancel_delayed_work_sync(&ub->monitor_work);
+	pr_devel("%s: monitor work is canceled.\n", __func__);
+	/* unquiesce q and let all inflight rqs' be aborted */
+	if (blk_queue_quiesced(ub->ub_disk->queue)) {
+		blk_mq_unquiesce_queue(ub->ub_disk->queue);
+		pr_devel("%s: queue unquiesced.\n", __func__);
+	}
+	/* requeued requests will be aborted ASAP because of 'force_abort' */
+	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	/* forward progress */
+	schedule_work(&ub->abort_work);
+	pr_devel("%s: abort work is scheduled, start deleting gendisk...\n",
+			__func__);
 	del_gendisk(ub->ub_disk);
+	pr_devel("%s: gendisk is deleted.\n", __func__);
+	ub->force_abort = false;
+	cancel_work_sync(&ub->abort_work);
+	pr_devel("%s: abort work is canceled.\n", __func__);
 	ub->dev_info.state = UBLK_S_DEV_DEAD;
 	ub->dev_info.ublksrv_pid = -1;
 	put_disk(ub->ub_disk);
@@ -1069,7 +1251,6 @@  static void ublk_stop_dev(struct ublk_device *ub)
  unlock:
 	ublk_cancel_dev(ub);
 	mutex_unlock(&ub->mutex);
-	cancel_delayed_work_sync(&ub->monitor_work);
 }
 
 /* device can only be started after all IOs are ready */
@@ -1560,6 +1741,7 @@  static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 		goto out_unlock;
 	mutex_init(&ub->mutex);
 	spin_lock_init(&ub->mm_lock);
+	INIT_WORK(&ub->abort_work, ublk_abort_work_fn);
 	INIT_WORK(&ub->stop_work, ublk_stop_work_fn);
 	INIT_DELAYED_WORK(&ub->monitor_work, ublk_daemon_monitor_work);