Message ID | 20220923153919.44078-1-ZiyangZhang@linux.alibaba.com (mailing list archive) |
---|---|
Headers | show |
Series | ublk_drv: add USER_RECOVERY support | expand |
On 9/23/22 9:39 AM, ZiyangZhang wrote: > ublk_drv is a driver simply passes all blk-mq rqs to userspace > target(such as ublksrv[1]). For each ublk queue, there is one > ubq_daemon(pthread). All ubq_daemons share the same process > which opens /dev/ublkcX. The ubq_daemon code infinitely loops on > io_uring_enter() to send/receive io_uring cmds which pass > information of blk-mq rqs. > > Since the real IO handler(the process/thread opening /dev/ublkcX) is > in userspace, it could crash if: > (1) the user kills -9 it because of IO hang on backend, system > reboot, etc... > (2) the process/thread catches a exception(segfault, divisor error, > oom...) Therefore, the kernel driver has to deal with a dying > ubq_daemon or the process. > > Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv > must abort the dying ubq, stop the device and delete everything. > This is not a good choice in practice because users do not expect > aborted requests, I/O errors and a deleted device. They may want > a recovery machenism so that no requests are aborted and no I/O > error occurs. Anyway, users just want everything works as usual. > > This patchset implements USER_RECOVERY support. If the process > or any ubq_daemon(pthread) crashes(exits accidentally), we allow > user to provide new process and ubq_daemons. > > Note: The responsibility of recovery belongs to the user who opens > /dev/ublkcX. After a crash, the kernel driver only switch the > device's state to be ready for recovery(START_USER_RECOVERY) or > termination(STOP_DEV). The state is defined as UBLK_S_DEV_QUIESCED. > This patchset does not provide how to detect such a crash in userspace. > The user has may ways to do so. For example, user may: > (1) send GET_DEV_INFO on specific dev_id and check if its state is > UBLK_S_DEV_QUIESCED. > (2) 'ps' on ublksrv_pid. > > Recovery feature is quite useful for real products. In detail, > we support this scenario: > (1) The /dev/ublkc0 is opened by process 0. > (2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all > rqs are handled by process 0. > (3) Process 0 suddenly crashes(e.g. segfault); > (4) Fio is still running and submit IOs(but these IOs cannot > be dispatched now) > (5) User starts process 1 and attach it to /dev/ublkc0 > (6) All rqs are handled by process 1 now and IOs can be > completed now. > > Note: The backend must tolerate double-write because we re-issue > a rq sent to the old process 0 before. > > We provide a sample script here to simulate the above steps: > > ***************************script*************************** > LOOPS=10 > > __ublk_get_pid() { > pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'` > echo $pid > } > > ublk_recover_kill() > { > for CNT in `seq $LOOPS`; do > dmesg -C > pid=`__ublk_get_pid` > echo -e "*** kill $pid now ***" > kill -9 $pid > sleep 6 > echo -e "*** recover now ***" > ./ublk recover -n 0 > sleep 6 > done > } > > ublk_test() > { > echo -e "*** add ublk device ***" > ./ublk add -t null -d 4 -i 1 > sleep 2 > echo -e "*** start fio ***" > fio --bs=4k \ > --filename=/dev/ublkb0 \ > --runtime=140s \ > --rw=read & > sleep 4 > ublk_recover_kill > wait > echo -e "*** delete ublk device ***" > ./ublk del -n 0 > } > > for CNT in `seq 4`; do > modprobe -rv ublk_drv > modprobe ublk_drv > echo -e "************ round $CNT ************" > ublk_test > sleep 5 > done > ***************************script*************************** > > You may run it with our modified ublksrv[2] which supports > recovery feature. No I/O error occurs and you can verify it > by typing > $ perf-tools/bin/tpoint block:block_rq_error > > The basic idea of USER_RECOVERY is quite straightfoward: > (1) quiesce ublk queues and requeue/abort rqs. > (2) release/free everything belongs to the dying process. > Note: Since ublk_drv does save information about user process, > this work is important because we don't expect any resource > lekage. Particularly, ioucmds from the dying ubq_daemons > need to be completed(freed). > (3) allow new ubq_daemons issue FETCH_REQ. > Note: ublk_ch_uring_cmd() checks some states and flags. We > have to set them to a correct value. > > Here is steps to reocver: > (0) requests dispatched after the corresponding ubq_daemon is dying > are requeued. > (1) monitor_work finds one dying ubq_daemon, and it should > schedule quiesce_work and requeue/abort requests issued to > userspace before the ubq_daemon is dying. > (2) quiesce_work must (a)quiesce request queue to ban any incoming > ublk_queue_rq(), (b)wait unitl all rqs are IDLE, (c)complete old > ioucmds. Then the ublk device is ready for recovery or stop. > (3) The user sends START_USER_RECOVERY ctrl-cmd to /dev/ublk-control > with a dev_id X (such as 3 for /dev/ublkc3). > (4) Then ublk_drv should perpare for a new process to attach /dev/ublkcX. > All ublk_io structures are cleared and ubq_daemons are reset. > (5) Then, user should start a new process and ubq_daemons(pthreads) and > send FETCH_REQ by io_uring_enter() to make all ubqs be ready. The > user must correctly setup queues, flags and so on(how to persist > user's information is not related to this patchset). > (6) The user sends END_USER_RECOVERY ctrl-cmd to /dev/ublk-control with a > dev_id X. > (7) After receiving END_USER_RECOVERY, ublk_drv waits for all ubq_daemons > getting ready. Then it unquiesces request queue and new rqs are > allowed. > > You should use ublksrv[2] and tests[3] provided by us. We add 3 additional > tests to verify that recovery feature works. Our code will be PR-ed to > Ming's repo soon. I'm going to apply 1-6 for 6.1, applying the doc patch is difficult as it only went into 6.0 past forking off the 6.1 block branch. Would you mind resending the 7/7 patch once the merge window opens and I've pushed the previous bits? I may forget otherwise...
On 2022/9/24 09:12, Jens Axboe wrote: > > I'm going to apply 1-6 for 6.1, applying the doc patch is difficult as > it only went into 6.0 past forking off the 6.1 block branch. Would you > mind resending the 7/7 patch once the merge window opens and I've pushed > the previous bits? I may forget otherwise... OK, I will resend the doc patch. Regards, Zhang