Message ID | 20221114203412.383-1-jonathan.derrick@linux.dev (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | tests/nvme: Add admin-passthru+reset race test | expand |
Hi Jonathan,
I love your patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v6.1-rc5 next-20221114]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Jonathan-Derrick/tests-nvme-Add-admin-passthru-reset-race-test/20221115-043752
patch link: https://lore.kernel.org/r/20221114203412.383-1-jonathan.derrick%40linux.dev
patch subject: [PATCH] tests/nvme: Add admin-passthru+reset race test
reproduce:
scripts/spdxcheck.py
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
spdxcheck warnings: (new ones prefixed by >>)
drivers/cpufreq/amd-pstate-ut.c: 1:28 Invalid License ID: GPL-1.0-or-later
>> tests/nvme/046: 2:27 Invalid License ID: GPL-3.0+
On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote: > + echo "Running ${TEST_NAME}" > + > + local sysfs > + local attr > + local m > + > + sysfs="$TEST_DEV_SYSFS/device" That's not the correct directory when the device is using native nvme-multipath. > + timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2)) > + > + sleep 5 > + > + if [[ ! -d "$sysfs" ]]; then > + echo "$sysfs doesn't exist" > + fi > + > + # do reset controller/format loops > + # don't check status now because a timing race is desired > + i=0 > + start=0 > + timing_out=false > + while [[ $i -le 1000 ]]; do > + start=$SECONDS > + if [[ -f "$sysfs/reset_controller" ]]; then > + echo 1 > "$sysfs/reset_controller" 2>/dev/null & > + i=$((i+1)) > + fi > + nvme format -l 0 -f $TEST_DEV 2>/dev/null & > + > + #Assume the controller is hung and unrecoverable > + if [[ $(($SECONDS - $start)) -gt $timeout ]]; then > + echo "nvme controller timing out" > + timing_out=true > + break > + fi > + done If the controller is already undergoing a reset, then writing to the reset_controller file becomes a no-op. Unless your "reset_controller" completes near instantaneously, I find that this loop tears through 1000 iterations, forks 1000 formats, and only 1 reset_controller actually gets through. If I remove the upper limit, then I can also see the stalled task, but it is only temporary and gets itself out of it after the admin timeout (1 minute). Is that also your observation, or is it stuck forever?
On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote: > Adds a test which runs many formats and reset_controllers in parallel. > The intent is to expose timing holes in the controller state machine > which will lead to hung task timing and the controller becoming > unavailable. This passes just fine for me both on qemu and a thunderbolt attached m.2 SSD. So it seems to be hardware / timing dependent.
diff --git a/tests/nvme/046 b/tests/nvme/046 new file mode 100755 index 0000000..4b47783 --- /dev/null +++ b/tests/nvme/046 @@ -0,0 +1,85 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-3.0+ +# Copyright (C) 2022 Jonathan Derrick <jonathan.derrick@linux.dev> +# +# Test nvme reset controller during admin passthru +# +# Regression for issue reported by +# https://bugzilla.kernel.org/show_bug.cgi?id=216354 + +. tests/nvme/rc + +#restrict test to nvme-pci only +nvme_trtype=pci + +DESCRIPTION="test nvme reset controller during admin passthru" +QUICK=1 +CAN_BE_ZONED=1 + +requires() { + _nvme_requires +} + +device_requires() { + _require_test_dev_is_nvme +} + +test_device() { + echo "Running ${TEST_NAME}" + + local sysfs + local attr + local m + + sysfs="$TEST_DEV_SYSFS/device" + timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2)) + + sleep 5 + + if [[ ! -d "$sysfs" ]]; then + echo "$sysfs doesn't exist" + fi + + # do reset controller/format loops + # don't check status now because a timing race is desired + i=0 + start=0 + timing_out=false + while [[ $i -le 1000 ]]; do + start=$SECONDS + if [[ -f "$sysfs/reset_controller" ]]; then + echo 1 > "$sysfs/reset_controller" 2>/dev/null & + i=$((i+1)) + fi + nvme format -l 0 -f $TEST_DEV 2>/dev/null & + + #Assume the controller is hung and unrecoverable + if [[ $(($SECONDS - $start)) -gt $timeout ]]; then + echo "nvme controller timing out" + timing_out=true + break + fi + done + + { kill $!; wait; } &> /dev/null + + # at this point it may have waited hung_task_timeout / 2 already, so + # only wait 25% longer for a total of about 75% of allowed timeout + m=0 + while [[ $m -le $((timeout / 2)) ]]; do + if [[ $timing_out == true ]]; then + break + fi + if grep -q live "$sysfs/state"; then + break + fi + sleep 1 + m=$((m+1)) + done + if ! grep -q live "$sysfs/state"; then + echo "nvme still not live after $(($SECONDS - $start)) seconds!" + fi + udevadm settle + + echo "Test complete" +} diff --git a/tests/nvme/046.out b/tests/nvme/046.out new file mode 100644 index 0000000..2b5fa6a --- /dev/null +++ b/tests/nvme/046.out @@ -0,0 +1,2 @@ +Running nvme/046 +Test complete
Adds a test which runs many formats and reset_controllers in parallel. The intent is to expose timing holes in the controller state machine which will lead to hung task timing and the controller becoming unavailable. Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354 Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev> --- tests/nvme/046 | 85 ++++++++++++++++++++++++++++++++++++++++++++++ tests/nvme/046.out | 2 ++ 2 files changed, 87 insertions(+) create mode 100755 tests/nvme/046 create mode 100644 tests/nvme/046.out