Message ID | 20230723234422.1629194-1-haowenchao2@huawei.com (mailing list archive) |
---|---|
Series | scsi: Support LUN/target based error handle |
On 2023/7/24 7:44, Wenchao Hao wrote:
> The original error handler sets the host to the recovery state and
> performs error recovery operations, which leaves all LUNs sharing that
> host unable to handle I/O. This is unacceptable for systems that deploy
> many LUNs on one HBA.
>
> This patchset introduces support for LUN/target based error handling;
> drivers can choose whether to implement it. They can implement LUN
> based, target based, or both LUN and target based error handling
> according to their own error handling strategy. The first patch defines
> the framework, which abstracts three key operations: add an error
> command, wake up the error handler, and block I/O while error commands
> are added and being recovered. Drivers should implement these three
> callbacks and register them with the SCSI middle level.
>
> Besides the basic framework, this patchset also adds a basic LUN/target
> based error handling strategy.
>
> LUN based EH tries check sense, start unit and LUN reset. If these
> steps cannot recover all error commands, it falls back to further
> recovery: target based (if implemented) or host based error handling.
>
> Target based EH works the same way: it tries check sense, start unit,
> LUN reset and target reset. If these steps cannot recover all error
> commands, it falls back to host based error handling.
>
> This patchset is tested with scsi_debug, which supports single LUN
> error injection. The scsi_debug patches are here:
>
> https://lore.kernel.org/linux-scsi/20230723234105.1628982-1-haowenchao2@huawei.com/T/#t
>

I tested this patch set with scsi_debug in the following scenarios; see
the attachments for my test script and result logs.

LUN based error handle only (lun_eh=Y target_eh=N):

+-----------+---------+-------------------------------------------------------+
| lun reset | TUR     | Desired result                                        |
+-----------+---------+-------------------------------------------------------+
| success   | success | retry or finish with EIO(may offline disk)            |
+-----------+---------+-------------------------------------------------------+
| success   | fail    | fallback to host recovery, retry or finish with       |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+
| fail      | NA      | fallback to host recovery, retry or finish with       |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+

Target based error handle only (lun_eh=N target_eh=Y):

+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | fallback to host recovery,   |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+

Both LUN and target based error handle (lun_eh=Y target_eh=Y):

+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | lun recovery fallback to     |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | lun recovery fallback to     |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | lun recovery fallback to     |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | lun recovery fallback to     |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+

> Wenchao Hao (13):
>   scsi: Define basic framework for driver LUN/target based error handle
>   scsi:scsi_error: Move complete variable eh_action from shost to sdevice
>   scsi:scsi_error: Check if to do reset in scsi_try_xxx_reset
>   scsi:scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
>   scsi:scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
>   scsi:scsi_error: Add flags to mark error handle steps has done
>   scsi:scsi_error: Define helper to perform LUN based error handle
>   scsi:scsi_error: Add LUN based error handler based previous helper
>   scsi:core: increase/decrease target_busy without check can_queue
>   scsi:scsi_error: Define helper to perform target based error handle
>   scsi:scsi_error: Add target based error handler based previous helper
>   scsi:scsi_debug: Add param to control if setup LUN based error handle
>   scsi:scsi_debug: Add param to control if setup target based error handle
>
>  drivers/scsi/scsi_debug.c  |  19 +
>  drivers/scsi/scsi_error.c  | 705 ++++++++++++++++++++++++++++++++++---
>  drivers/scsi/scsi_lib.c    |  23 +-
>  drivers/scsi/scsi_priv.h   |  20 ++
>  include/scsi/scsi_device.h |  97 +++++
>  include/scsi/scsi_eh.h     |   4 +
>  include/scsi/scsi_host.h   |   2 -
>  7 files changed, 813 insertions(+), 57 deletions(-)
>

#!/bin/bash

scsi_debug=/mnt/mainline/drivers/scsi/scsi_debug.ko

# Clear the injected errors listed in the error file and stop failing
# target reset.
function clear_error()
{
    error=$1
    tmpfile=$$_clear

    cat $error | grep -v Type | awk '{print $1,$3}' > $tmpfile
    while read -r line; do
        echo "- $line" > $error
    done < $tmpfile
    rm -rf $tmpfile

    echo 0 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset
}

function lun_test_sense1()
{
    echo "LUN reset success, TUR success"
    # inject timeout command for write command
    echo "0 -10 0x2a " > ${error}
    # inject abort command for write command
    echo "3 -1 0x2a " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)
    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function lun_test_sense2()
{
    echo "LUN reset success, TUR failed"
    # inject timeout command for write command
    echo "0 -10 0x2a " > ${error}
    # inject abort command for write command
    echo "3 -1 0x2a " > ${error}
    # inject timeout command for TUR command
    echo "0 -1 0x0 " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)
    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function lun_test_sense3()
{
    echo "LUN reset failed, fallback to target reset success"
    # inject timeout command for write command
    echo "0 -10 0x2a " > ${error}
    # inject abort command for write command
    echo "3 -1 0x2a " > ${error}
    # inject lunreset failed
    echo "4 -1 0xff" > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)
    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense1()
{
    echo "LUN reset success, TUR success"
    # inject timeout command for write command
    echo "0 -10 0x2a " > ${error}
    # inject abort command for write command
    echo "3 -1 0x2a " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)
    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense2()
{
    echo "LUN reset success, TUR failed, target reset success, TUR success"
    # inject timeout command for write command
    echo "0 -10 0x2a " > ${error}
    # inject abort command for write command
    echo "3 -1 0x2a " > ${error}
    # inject timeout command for TUR command
    echo "0 -1 0x0 " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)
    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense3()
{
    echo "LUN reset failed, target reset success, TUR success"
    # inject timeout command for write command
    echo "0 -10 0x2a " > ${error}
    # inject abort command for write command
    echo "3 -1 0x2a " > ${error}
    # inject lunreset failed
    echo "4 -1 0xff" > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)
    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense4()
{
    echo "LUN reset failed, target reset success TUR failed"
    # inject timeout command for write command
    echo "0 -10 0x2a " > ${error}
    # inject abort command for write command
    echo "3 -1 0x2a " > ${error}
    # inject lunreset failed
    echo "4 -1 0xff" > ${error}
    # inject timeout command for TUR command
    echo "0 -1 0x0 " > ${error}

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)
    clear_error $error
    echo running > /sys/block/$disk/device/state
}

function target_test_sense5()
{
    echo "LUN reset failed, target reset failed, fallback to host recovery"
    # inject timeout command for write command
    echo "0 -10 0x2a " > ${error}
    # inject abort command for write command
    echo "3 -1 0x2a " > ${error}
    # inject lunreset failed
    echo "4 -1 0xff" > ${error}
    # inject target reset failed
    echo 1 > /sys/kernel/debug/scsi_debug/target$target_id/fail_reset

    dd if=/dev/zero of=/dev/$disk bs=1K count=10 oflag=direct
    echo $(cat /sys/block/$disk/device/state)
    clear_error $error
    echo running > /sys/block/$disk/device/state
}

scsi_logging_level -s --error 4 > /dev/null 2>&1

# Round 1: LUN based error handle only
insmod $scsi_debug lun_eh=Y target_eh=N

str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
target_id=${scsi_id%\:*}

echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout

for((loop=1;loop<=3;loop++))
do
    time=$(date "+%Y-%m-%d-%H-%M-%S")
    since=$(date "+%Y-%m-%d %H:%M:%S")
    lun_test_sense$loop
    sleep 3
    until=$(date "+%Y-%m-%d %H:%M:%S")
    mkdir -p logs/lun_sense$loop
    journalctl --since="$since" --until="$until" > logs/lun_sense$loop/$time.log
done

rmmod scsi_debug

# Round 2: target based error handle only
insmod $scsi_debug lun_eh=N target_eh=Y

str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
# recompute in case the host number changed after reloading the module
target_id=${scsi_id%\:*}

echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout

for((loop=1;loop<=5;loop++))
do
    time=$(date "+%Y-%m-%d-%H-%M-%S")
    since=$(date "+%Y-%m-%d %H:%M:%S")
    target_test_sense$loop
    sleep 3
    until=$(date "+%Y-%m-%d %H:%M:%S")
    mkdir -p logs/target_sense$loop
    journalctl --since="$since" --until="$until" > logs/target_sense$loop/$time.log
done

rmmod scsi_debug

# Round 3: both LUN and target based error handle
insmod $scsi_debug lun_eh=Y target_eh=Y

str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $1}')
scsi_id=${str#*\[}
scsi_id=${scsi_id%\]*}
error=/sys/kernel/debug/scsi_debug/$scsi_id/error
str=$(lsscsi | grep scsi_debug | head -n 1 | awk '{print $6}')
disk=$(basename $str)
# recompute in case the host number changed after reloading the module
target_id=${scsi_id%\:*}

echo none > /sys/block/$disk/queue/scheduler
echo 1 > /sys/block/$disk/device/timeout
echo 1 > /sys/block/$disk/device/eh_timeout

for((loop=1;loop<=5;loop++))
do
    time=$(date "+%Y-%m-%d-%H-%M-%S")
    since=$(date "+%Y-%m-%d %H:%M:%S")
    target_test_sense$loop
    sleep 3
    until=$(date "+%Y-%m-%d %H:%M:%S")
    mkdir -p logs/lun_target_sense$loop
    journalctl --since="$since" --until="$until" > logs/lun_target_sense$loop/$time.log
done

rmmod scsi_debug
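
The expected results in the tables of this message all follow one
escalation ladder: LUN reset plus TUR, then target reset plus TUR (only
when target based error handle is enabled), then host recovery. A rough
sketch of that ladder in shell may help when reading the logs. The
function name expected_level and its argument order are made up for
illustration; they are not part of the patchset or of scsi_debug, and
the function only prints the level at which recovery is expected to end.

```shell
#!/bin/sh
# Hypothetical model of the escalation ladder; it touches no device.
# Usage: expected_level LUN_EH TARGET_EH LUN_RESET TUR TARGET_RESET TUR
#   LUN_EH, TARGET_EH:  Y or N (the scsi_debug module parameters)
#   the other four:     success, fail or NA (the injected results)
expected_level() {
    lun_eh=$1; target_eh=$2
    lun_reset=$3; lun_tur=$4; tgt_reset=$5; tgt_tur=$6

    # With neither handler enabled, host based recovery does everything.
    if [ "$lun_eh" != Y ] && [ "$target_eh" != Y ]; then
        echo host
        return 0
    fi

    # LUN reset rung: recovered only if the reset and the TUR after it
    # both succeed. Per the tables, the target based handler also runs
    # this rung when it is the only handler enabled.
    if [ "$lun_reset" = success ] && [ "$lun_tur" = success ]; then
        echo lun
        return 0
    fi

    # Target reset rung: only exists with target based error handle.
    if [ "$target_eh" = Y ] && [ "$tgt_reset" = success ] &&
       [ "$tgt_tur" = success ]; then
        echo target
        return 0
    fi

    # Everything else falls back to host recovery.
    echo host
}

expected_level Y N success fail NA NA         # prints "host"
expected_level N Y fail NA success success    # prints "target"
```

Each row of the tables maps to one call, so the sketch can double as a
checklist when comparing a journalctl log against the desired result.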
On 2023/7/24 7:44, Wenchao Hao wrote:
> The original error handler sets the host to the recovery state and
> performs error recovery operations, which leaves all LUNs sharing that
> host unable to handle I/O. This is unacceptable for systems that deploy
> many LUNs on one HBA.
>

Friendly PING...

With this patchset we can reduce the probability of blocking the whole
host when handling error commands, which is important for servers that
deploy a large number of disks. The new error handler is not enabled by
default, so it does not affect drivers that do not need it.
On 8/15/23 07:17, haowenchao (C) wrote:
> With this patchset we can reduce the probability of blocking the whole
> host when handling error commands, which is important for servers that
> deploy a large number of disks. The new error handler is not enabled by
> default, so it does not affect drivers that do not need it.

Which drivers need this new error handler? I don't see any changes for
SCSI drivers in this patch series other than scsi_debug. Has this patch
series perhaps been developed for a pass-through driver between virtual
machine guests and their host? If so, has it been considered to
configure pass-through such that there is one disk per SCSI host instead
of multiple?

Thanks,

Bart.
On 2023/8/15 23:48, Bart Van Assche wrote:
> On 8/15/23 07:17, haowenchao (C) wrote:
>> With this patchset we can reduce the probability of blocking the whole
>> host when handling error commands, which is important for servers that
>> deploy a large number of disks. The new error handler is not enabled by
>> default, so it does not affect drivers that do not need it.
>
> Which drivers need this new error handler? I don't see any changes for
> SCSI drivers in this patch series other than scsi_debug. Has this patch
> series perhaps been developed for a pass-through driver between virtual
> machine guests and their host? If so, has it been considered to
> configure pass-through such that there is one disk per SCSI host instead
> of multiple?
>

I tested the error handler with our private hardware (the driver code
has not been pushed to mainline). As discussed, megaraid_sas, mpt3sas,
smartpqi, hiraid and hisi_sas need this new error handler too, although
hisi_sas needs more steps to use it because it is tightly coupled with
libsas/libata.

I want the basic framework to be reviewed first, so I only modified
scsi_debug, which is accessible to everyone and makes it easy to
simulate various kinds of errors.

I do not know how a pass-through driver between virtual machine guests
and their host works. Do you mean virtio-scsi in the guest OS? Can you
describe it in more detail?

Thanks.

> Thanks,
>
> Bart.
>
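
For readers who want to simulate errors with scsi_debug as described
above, the test script earlier in this thread writes rules such as
"0 -10 0x2a" to the debugfs error file. A minimal sketch of a helper
that builds those rule strings follows. The helper name inject_rule and
the kind names are hypothetical; the numeric codes (0 for command
timeout, 3 for abort, 4 for LUN reset failure) are taken only from the
comments in that script, and the meaning of the count field is not
spelled out in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print a scsi_debug error injection rule.
# Usage: inject_rule KIND COUNT OPCODE
#   KIND    timeout, abort or lun_reset (names invented for this sketch)
#   COUNT   passed through unchanged, e.g. -10 or -1 as in the script
#   OPCODE  SCSI opcode, e.g. 0x2a (WRITE(10)) or 0x0 (TEST UNIT READY)
inject_rule() {
    kind=$1; count=$2; opcode=$3

    # Map the readable name to the code used in the test script.
    case "$kind" in
        timeout)   code=0 ;;
        abort)     code=3 ;;
        lun_reset) code=4 ;;
        *)
            echo "unknown error kind: $kind" >&2
            return 1
            ;;
    esac

    printf '%s %s %s\n' "$code" "$count" "$opcode"
}

inject_rule timeout -10 0x2a     # prints "0 -10 0x2a"
inject_rule lun_reset -1 0xff    # prints "4 -1 0xff"
```

To apply a rule, redirect the output to the error file the script
computes, for example inject_rule timeout -10 0x2a > "$error". That
step needs root and a scsi_debug module built with the error injection
patches linked in the cover letter.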
On 2023/7/24 7:44, Wenchao Hao wrote:
> The original error handler sets the host to the recovery state and
> performs error recovery operations, which leaves all LUNs sharing that
> host unable to handle I/O. This is unacceptable for systems that deploy
> many LUNs on one HBA.
>
> This patchset introduces support for LUN/target based error handling;
> drivers can choose whether to implement it. They can implement LUN
> based, target based, or both LUN and target based error handling
> according to their own error handling strategy. The first patch defines
> the framework, which abstracts three key operations: add an error
> command, wake up the error handler, and block I/O while error commands
> are added and being recovered. Drivers should implement these three
> callbacks and register them with the SCSI middle level.
>

Ping...

Is anyone reviewing these changes?
On 2023/7/24 7:44, Wenchao Hao wrote:

Ping again...

> The original error handler sets the host to the recovery state and
> performs error recovery operations, which leaves all LUNs sharing that
> host unable to handle I/O. This is unacceptable for systems that deploy
> many LUNs on one HBA.
>