| Message ID | 20221117212210.934-1-jonathan.derrick@linux.dev |
|---|---|
| State | New, archived |
| Series | [v2] tests/nvme: Add admin-passthru+reset race test |
On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
> reliably segfaults my QEMU instance (something else to look into) and I don't
> have any 'real' hardware to test this on at the moment. It looks like several
> passthru commands are able to enqueue prior/during/after resetting/connecting.

I'm not seeing any problem with the latest nvme-qemu after several dozen
iterations of this test case. In that environment, the formats and
resets complete practically synchronously with the call, so everything
proceeds quickly. Is there anything special I need to change?

> The issue seems to be very heavily timing related, so the loop in the header is
> a lot more forceful in this approach.
>
> As far as the loop goes, I've noticed it will typically repro immediately or
> pass the whole test.

I can only get a possible repro in scenarios that have multi-second long,
serialized format times. Even then, it still appears that everything
fixes itself after waiting. Are you observing the same, or is it stuck
forever in your observations?

> +remove_and_rescan() {
> +    local pdev=$1
> +    echo 1 > /sys/bus/pci/devices/"$pdev"/remove
> +    echo 1 > /sys/bus/pci/rescan
> +}

This function isn't called anywhere.
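For anyone wanting to repeat the "several dozen iterations" above, a minimal sketch of looping the blktests case; the ./check entry point is standard blktests usage, while the config contents and the /dev/nvme0n1 device name are assumptions rather than anything stated in this thread:

    # assumed blktests setup; point TEST_DEVS at the emulated controller's namespace
    echo 'TEST_DEVS=(/dev/nvme0n1)' > config
    for i in $(seq 1 30); do
        ./check nvme/047 || break   # stop on the first failing iteration
    done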
On 11/21/2022 1:55 PM, Keith Busch wrote:
> On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
>> [...]
>
> I'm not seeing any problem with the latest nvme-qemu after several dozen
> iterations of this test case. In that environment, the formats and
> resets complete practically synchronously with the call, so everything
> proceeds quickly. Is there anything special I need to change?
>
I can still repro this with the nvme-fixes tag, so I'll have to dig into it myself.
Does the tighter loop in the test comment header produce results?

>> [...]
>
> I can only get a possible repro in scenarios that have multi-second long,
> serialized format times. Even then, it still appears that everything
> fixes itself after waiting. Are you observing the same, or is it stuck
> forever in your observations?

In 5.19, it gets stuck forever with lots of formats outstanding and the
controller stuck in resetting. I'll keep digging. Thanks Keith
On Mon, Nov 21, 2022 at 03:34:44PM -0700, Jonathan Derrick wrote:
> On 11/21/2022 1:55 PM, Keith Busch wrote:
>> [...]
>>
>> I'm not seeing any problem with the latest nvme-qemu after several dozen
>> iterations of this test case. [...] Is there anything special I need to change?
>>
> I can still repro this with the nvme-fixes tag, so I'll have to dig into it myself.
> Does the tighter loop in the test comment header produce results?

My qemu's backing storage is a nullblk, which makes format a no-op, but I
can try something slower if you think that will have different results.
These kinds of tests are definitely more pleasant to run under emulation,
so having the recipe to recreate there is a boon.

>> I can only get a possible repro in scenarios that have multi-second long,
>> serialized format times. Even then, it still appears that everything
>> fixes itself after waiting. Are you observing the same, or is it stuck
>> forever in your observations?
>
> In 5.19, it gets stuck forever with lots of formats outstanding and the
> controller stuck in resetting. I'll keep digging. Thanks Keith

At the moment I'm interested in upstream, so either Linus' latest 6.1-rc or
the nvme-6.2 branch. If you can confirm these are okay (which appears to be
the case on my side), then I can definitely shift focus to stable back-ports.
But if they're not okay, then I want to straighten that side out first.
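A sketch of how the emulated namespace could be given slower, file-backed storage instead of null_blk so formats take long enough to overlap a reset; the image path and size are assumptions, while the -drive/-device flags are the usual QEMU NVMe invocation:

    # back the emulated namespace with a raw file so Format NVM has real work to do
    qemu-img create -f raw /var/tmp/nvme.img 8G

    # only the NVMe-related options are shown; add the usual machine/boot options
    qemu-system-x86_64 \
        -drive file=/var/tmp/nvme.img,if=none,id=nvm,format=raw,cache=none \
        -device nvme,serial=deadbeef,drive=nvm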
On 11/21/2022 3:34 PM, Jonathan Derrick wrote:
> On 11/21/2022 1:55 PM, Keith Busch wrote:
>> [...]
>>
>> I'm not seeing any problem with the latest nvme-qemu after several dozen
>> iterations of this test case. [...] Is there anything special I need to change?
>>
> I can still repro this with the nvme-fixes tag, so I'll have to dig into it myself.

Here's a backtrace:

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7554400 (LWP 531154)]
0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
539         return sq->ctrl;
(gdb) backtrace
#0  0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
#1  0x0000555555994360 in nvme_format_bh (opaque=0x5555579dd000) at ../hw/nvme/ctrl.c:5852
#2  0x0000555555f4db15 in aio_bh_call (bh=0x7fffec279910) at ../util/async.c:150
#3  0x0000555555f4dc24 in aio_bh_poll (ctx=0x55555688fa00) at ../util/async.c:178
#4  0x0000555555f34df0 in aio_dispatch (ctx=0x55555688fa00) at ../util/aio-posix.c:421
#5  0x0000555555f4e083 in aio_ctx_dispatch (source=0x55555688fa00, callback=0x0, user_data=0x0) at ../util/async.c:320
#6  0x00007ffff7bd717d in g_main_context_dispatch () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x0000555555f600c2 in glib_pollfds_poll () at ../util/main-loop.c:297
#8  0x0000555555f60140 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:320
#9  0x0000555555f60251 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:596
#10 0x0000555555a8f27c in qemu_main_loop () at ../softmmu/runstate.c:739
#11 0x000055555582b77a in qemu_default_main () at ../softmmu/main.c:37
#12 0x000055555582b7b4 in main (argc=53, argv=0x7fffffffdf88) at ../softmmu/main.c:48
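For context, a backtrace like the one above can be captured by running the emulator under gdb; this is a generic sketch, not the exact invocation used here:

    gdb -q --args qemu-system-x86_64 <usual VM options>
    (gdb) handle SIGUSR1 noprint nostop   # qemu uses SIGUSR1 internally for vcpu kicks
    (gdb) run
    ... reproduce the crash in the guest ...
    (gdb) backtrace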
[cc'ing Klaus]

On Mon, Nov 21, 2022 at 03:49:45PM -0700, Jonathan Derrick wrote:
> On 11/21/2022 3:34 PM, Jonathan Derrick wrote:
>> [...]
>> I can still repro this with the nvme-fixes tag, so I'll have to dig into it myself.
>
> Here's a backtrace:
>
> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff7554400 (LWP 531154)]
> 0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
> 539         return sq->ctrl;
> (gdb) backtrace
> #0  0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
> #1  0x0000555555994360 in nvme_format_bh (opaque=0x5555579dd000) at ../hw/nvme/ctrl.c:5852

Thanks, looks like a race between the admin queue format's bottom half
and the controller reset tearing down that queue. I'll work with Klaus
on the qemu side (looks like a well placed qemu_bh_cancel() should do
it).
On Nov 21 16:04, Keith Busch wrote:
> [cc'ing Klaus]
>
> On Mon, Nov 21, 2022 at 03:49:45PM -0700, Jonathan Derrick wrote:
>> [...]
>> Here's a backtrace:
>> [...]
>> #0  0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
>> #1  0x0000555555994360 in nvme_format_bh (opaque=0x5555579dd000) at ../hw/nvme/ctrl.c:5852
>
> Thanks, looks like a race between the admin queue format's bottom half
> and the controller reset tearing down that queue. I'll work with Klaus
> on the qemu side (looks like a well placed qemu_bh_cancel() should do
> it).

Yuck. Bug located and quelched, I think.

Jonathan, please try

  https://lore.kernel.org/qemu-devel/20221122081348.49963-2-its@irrelevant.dk/

This fixes the qemu crash, but I still see a "nvme still not live after
42 seconds!" resulting from the test. I'm seeing A LOT of invalid
submission queue doorbell writes:

  pci_nvme_ub_db_wr_invalid_sq in nvme_process_db: submission queue doorbell write for nonexistent queue, sqid=0, ignoring

Tested on a 6.1-rc4.
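A sketch of how those pci_nvme trace points can be turned on from the QEMU command line; the event glob and log path are assumptions, and this presumes QEMU's default "log" trace backend:

    # enable the pci_nvme_ub_* "undefined behaviour" events and send them to a file
    qemu-system-x86_64 \
        -trace 'pci_nvme_ub*' \
        -D /var/tmp/qemu-nvme.log \
        <usual VM options>

A broader glob such as 'pci_nvme*' logs every NVMe device trace point, which is noisier but shows the doorbell writes in context.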
On 11/22/2022 1:26 AM, Klaus Jensen wrote:
> On Nov 21 16:04, Keith Busch wrote:
>> [...]
>> Thanks, looks like a race between the admin queue format's bottom half
>> and the controller reset tearing down that queue. I'll work with Klaus
>> on the qemu side (looks like a well placed qemu_bh_cancel() should do
>> it).
>
> Yuck. Bug located and quelched, I think.
>
> Jonathan, please try
>
>   https://lore.kernel.org/qemu-devel/20221122081348.49963-2-its@irrelevant.dk/
>
> This fixes the qemu crash, but I still see a "nvme still not live after
> 42 seconds!" resulting from the test. I'm seeing A LOT of invalid
> submission queue doorbell writes:
>
>   pci_nvme_ub_db_wr_invalid_sq in nvme_process_db: submission queue doorbell write for nonexistent queue, sqid=0, ignoring
>
> Tested on a 6.1-rc4.

Good change, just defers it a bit for me:

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7554400 (LWP 559269)]
0x000055555598922e in nvme_enqueue_req_completion (cq=0x0, req=0x7fffec141310) at ../hw/nvme/ctrl.c:1390
1390        assert(cq->cqid == req->sq->cqid);
(gdb) backtrace
#0  0x000055555598922e in nvme_enqueue_req_completion (cq=0x0, req=0x7fffec141310) at ../hw/nvme/ctrl.c:1390
#1  0x000055555598a7a7 in nvme_misc_cb (opaque=0x7fffec141310, ret=0) at ../hw/nvme/ctrl.c:2002
#2  0x000055555599448a in nvme_do_format (iocb=0x55555770ccd0) at ../hw/nvme/ctrl.c:5891
#3  0x00005555559942a9 in nvme_format_ns_cb (opaque=0x55555770ccd0, ret=0) at ../hw/nvme/ctrl.c:5828
#4  0x0000555555dda018 in blk_aio_complete (acb=0x7fffec1fccd0) at ../block/block-backend.c:1501
#5  0x0000555555dda2fc in blk_aio_write_entry (opaque=0x7fffec1fccd0) at ../block/block-backend.c:1568
#6  0x0000555555f506b9 in coroutine_trampoline (i0=-331119632, i1=32767) at ../util/coroutine-ucontext.c:177
#7  0x00007ffff77c84e0 in __start_context () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
#8  0x00007ffff4ff2bd0 in ()
#9  0x0000000000000000 in ()
On Nov 22 13:30, Jonathan Derrick wrote:
> On 11/22/2022 1:26 AM, Klaus Jensen wrote:
>> [...]
>> Jonathan, please try
>>
>>   https://lore.kernel.org/qemu-devel/20221122081348.49963-2-its@irrelevant.dk/
>>
>> This fixes the qemu crash, but I still see a "nvme still not live after
>> 42 seconds!" resulting from the test. [...]
>
> Good change, just defers it a bit for me:
>
> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff7554400 (LWP 559269)]
> 0x000055555598922e in nvme_enqueue_req_completion (cq=0x0, req=0x7fffec141310) at ../hw/nvme/ctrl.c:1390
> 1390        assert(cq->cqid == req->sq->cqid);
> (gdb) backtrace
> #0  0x000055555598922e in nvme_enqueue_req_completion (cq=0x0, req=0x7fffec141310) at ../hw/nvme/ctrl.c:1390
> #1  0x000055555598a7a7 in nvme_misc_cb (opaque=0x7fffec141310, ret=0) at ../hw/nvme/ctrl.c:2002
> #2  0x000055555599448a in nvme_do_format (iocb=0x55555770ccd0) at ../hw/nvme/ctrl.c:5891
> [...]

Bummer. I'll keep digging.
diff --git a/tests/nvme/047 b/tests/nvme/047
new file mode 100755
index 0000000..fb8609c
--- /dev/null
+++ b/tests/nvme/047
@@ -0,0 +1,121 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2022 Jonathan Derrick <jonathan.derrick@linux.dev>
+#
+# Test nvme reset controller during admin passthru
+#
+# Regression for issue reported by
+# https://bugzilla.kernel.org/show_bug.cgi?id=216354
+#
+# Simpler form:
+# for i in {1..50}; do
+#     nvme format -f /dev/nvme0n1 &
+#     echo 1 > /sys/block/nvme0n1/device/reset_controller &
+# done
+
+. tests/nvme/rc
+
+#restrict test to nvme-pci only
+nvme_trtype=pci
+
+DESCRIPTION="test nvme reset controller during admin passthru"
+QUICK=1
+CAN_BE_ZONED=1
+
+RUN_TIME=300
+RESET_PCIE=true
+
+requires() {
+    _nvme_requires
+}
+
+device_requires() {
+    _require_test_dev_is_nvme
+}
+
+remove_and_rescan() {
+    local pdev=$1
+    echo 1 > /sys/bus/pci/devices/"$pdev"/remove
+    echo 1 > /sys/bus/pci/rescan
+}
+
+test_device() {
+    echo "Running ${TEST_NAME}"
+
+    local pdev
+    local blkdev
+    local ctrldev
+    local sysfs
+    local max_timeout
+    local timeout
+    local timeleft
+    local start
+    local last_live
+    local i
+
+    pdev="$(_get_pci_dev_from_blkdev)"
+    blkdev="${TEST_DEV_SYSFS##*/}"
+    ctrldev="$(echo "$blkdev" | grep -Eo 'nvme[0-9]+')"
+    sysfs="/sys/block/$blkdev/device"
+    max_timeout=$(cat /proc/sys/kernel/hung_task_timeout_secs)
+    timeout=$((max_timeout * 3 / 4))
+
+    sleep 5
+
+    start=$SECONDS
+    while [[ $((SECONDS - start)) -le $RUN_TIME ]]; do
+        if [[ $(cat "$sysfs/state") == "live" ]]; then
+            last_live=$SECONDS
+        fi
+
+        # Failure case appears to stack up formats while controller is resetting/connecting
+        if [[ $(pgrep -cf "nvme format") -lt 100 ]]; then
+            for ((i=0; i<100; i++)); do
+                nvme format -f "$TEST_DEV" &
+                echo 1 > "$sysfs/reset_controller" &
+            done &> /dev/null
+        fi
+
+        # Might have failed probe, so reset and continue test
+        if [[ $((SECONDS - last_live)) -gt 10 && \
+              ! -c "/dev/$ctrldev" && "$RESET_PCIE" == true ]]; then
+            {
+                echo 1 > /sys/bus/pci/devices/"$pdev"/remove
+                echo 1 > /sys/bus/pci/rescan
+            } &
+
+            timeleft=$((max_timeout - timeout))
+            sleep $((timeleft < 30 ? timeleft : 30))
+            if [[ ! -c "/dev/$ctrldev" ]]; then
+                echo "/dev/$ctrldev missing"
+                echo "failed to reset $ctrldev's pcie device $pdev"
+                break
+            fi
+            sleep 5
+            continue
+        fi
+
+        if [[ $((SECONDS - last_live)) -gt $timeout ]]; then
+            if [[ ! -c "/dev/$ctrldev" ]]; then
+                echo "/dev/$ctrldev missing"
+                break
+            fi
+
+            # Assume the controller is hung and unrecoverable
+            if [[ -f "$sysfs/state" ]]; then
+                echo "nvme controller hung ($(cat "$sysfs/state"))"
+                break
+            else
+                echo "nvme controller hung"
+                break
+            fi
+        fi
+    done
+
+    if [[ ! -c "/dev/$ctrldev" || $(cat "$sysfs/state") != "live" ]]; then
+        echo "nvme still not live after $((SECONDS - last_live)) seconds!"
+    fi
+
+    udevadm settle
+
+    echo "Test complete"
+}
diff --git a/tests/nvme/047.out b/tests/nvme/047.out
new file mode 100644
index 0000000..915d0a2
--- /dev/null
+++ b/tests/nvme/047.out
@@ -0,0 +1,2 @@
+Running nvme/047
+Test complete
Adds a test which runs many formats and reset_controllers in parallel.
The intent is to expose timing holes in the controller state machine
which will lead to hung task timeouts and the controller becoming
unavailable.

Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354

Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
---
I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
reliably segfaults my QEMU instance (something else to look into) and I don't
have any 'real' hardware to test this on at the moment. It looks like several
passthru commands are able to enqueue prior/during/after resetting/connecting.

The issue seems to be very heavily timing related, so the loop in the header is
a lot more forceful in this approach.

As far as the loop goes, I've noticed it will typically repro immediately or
pass the whole test.

 tests/nvme/047     | 121 +++++++++++++++++++++++++++++++++++++++++++++
 tests/nvme/047.out |   2 +
 2 files changed, 123 insertions(+)
 create mode 100755 tests/nvme/047
 create mode 100644 tests/nvme/047.out
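Since the failure mode described above is the controller wedging rather than an immediate error, here is a small sketch of how one might watch for it while the test runs; the controller name nvme0 is an assumption:

    # controller state as seen by the driver (live, resetting, connecting, dead, ...)
    watch -n1 cat /sys/class/nvme/nvme0/state

    # in a second terminal, watch for hung-task warnings and nvme reset/timeout messages
    dmesg -w | grep -iE 'nvme|hung task'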