Message ID | 5vnpdeocos6k4nmh6ewh7ltqz7b6wuemzcmqflfkybejssewkh@edtqm3t4w3zv (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | blktests nvme/039 failure | expand |
On 4/10/23 4:49 AM, Shin'ichiro Kawasaki wrote: > Hello Alan, > > I noticed that recently nvme/039 fails on my system occasionally (around 40%). > The failure messages are as follows: > > nvme/039 => nvme0n1 (test error logging) [failed] > runtime 0.176s ... 0.167s > --- tests/nvme/039.out 2023-04-06 10:11:07.925670528 +0900 > +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad 2023-04-10 20:15:07.679538017 +0900 > @@ -1,5 +1,2 @@ > Running nvme/039 > - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR > - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR > - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR > Test complete > > nvme/039 => nvme0n1 (test error logging) [failed] > runtime 0.167s ... 0.199s > --- tests/nvme/039.out 2023-04-06 10:11:07.925670528 +0900 > +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad 2023-04-10 20:15:09.114539650 +0900 > @@ -1,5 +1,4 @@ > Running nvme/039 > - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR > Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR > Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR > Test complete > > It looks that expected error messages were not reported. > > I suspect that the time duration is too short between error injection enable > and I/O to trigger the error. With the one line change below to add wait after > the error injection enable, the failures disappear. Do you think such wait is > the valid fix? I saw the same thing Friday evening. Let me take a look. Alan > > tests/nvme/rc | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/tests/nvme/rc b/tests/nvme/rc > index 210a82a..7043c23 100644 > --- a/tests/nvme/rc > +++ b/tests/nvme/rc > @@ -652,6 +652,7 @@ _nvme_enable_err_inject() > echo "$4" > /sys/kernel/debug/"$1"/fault_inject/dont_retry > echo "$5" > /sys/kernel/debug/"$1"/fault_inject/status > echo "$6" > /sys/kernel/debug/"$1"/fault_inject/times > + sleep 0.1 > } > > _nvme_disable_err_inject()
On 4/10/23 4:49 AM, Shin'ichiro Kawasaki wrote: > Hello Alan, > > I noticed that recently nvme/039 fails on my system occasionally (around 40%). > The failure messages are as follows: > > nvme/039 => nvme0n1 (test error logging) [failed] > runtime 0.176s ... 0.167s > --- tests/nvme/039.out 2023-04-06 10:11:07.925670528 +0900 > +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad 2023-04-10 20:15:07.679538017 +0900 > @@ -1,5 +1,2 @@ > Running nvme/039 > - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR > - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR > - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR > Test complete > > nvme/039 => nvme0n1 (test error logging) [failed] > runtime 0.167s ... 0.199s > --- tests/nvme/039.out 2023-04-06 10:11:07.925670528 +0900 > +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad 2023-04-10 20:15:09.114539650 +0900 > @@ -1,5 +1,4 @@ > Running nvme/039 > - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR > Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR > Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR > Test complete > > It looks that expected error messages were not reported. > > I suspect that the time duration is too short between error injection enable > and I/O to trigger the error. With the one line change below to add wait after > the error injection enable, the failures disappear. Do you think such wait is > the valid fix? > > tests/nvme/rc | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/tests/nvme/rc b/tests/nvme/rc > index 210a82a..7043c23 100644 > --- a/tests/nvme/rc > +++ b/tests/nvme/rc > @@ -652,6 +652,7 @@ _nvme_enable_err_inject() > echo "$4" > /sys/kernel/debug/"$1"/fault_inject/dont_retry > echo "$5" > /sys/kernel/debug/"$1"/fault_inject/status > echo "$6" > /sys/kernel/debug/"$1"/fault_inject/times > + sleep 0.1 > } > > _nvme_disable_err_inject() I've been able to reproduce it. The sleep .1 helps but doesn't eliminate the issue. I did notice whenever there was a failure, there was also a "blk_print_req_error: 2 callbacks suppressed" in the log which would break the parsing the test needs to do. Alan
On Apr 10, 2023 / 16:06, alan.adamson@oracle.com wrote: [...] > I've been able to reproduce it. The sleep .1 helps but doesn't eliminate > the issue. I did notice whenever there was a failure, there was also a > "blk_print_req_error: 2 callbacks suppressed" in the log which would break > the parsing the test needs to do. Ah, I see. The error messages were no printed because pr_err_ratelimited() suppressed them. AFAIK, there is no way to disable or relax the rate limit from userland. I think sleep for 5 = DEFAULT_RATE_LIMIT seconds at the beginning of the test will ensure the error message printed. It will also avoid the "x callbacks suppressed" messages. With the change below, I observe no failure on my system. diff --git a/tests/nvme/039 b/tests/nvme/039 index 11d6d24..5d76297 100755 --- a/tests/nvme/039 +++ b/tests/nvme/039 @@ -139,6 +139,9 @@ test_device() { _nvme_err_inject_setup "${ns_dev}" "${ctrl_dev}" + # wait DEFAULT_RATELIMIT_INTERVAL=5 seconds to ensure errors are printed + sleep 5 + inject_unrec_read_on_read "${ns_dev}" inject_invalid_status_on_read "${ns_dev}" inject_write_fault_on_write "${ns_dev}"
On 4/10/23 6:27 PM, Shin'ichiro Kawasaki wrote: > On Apr 10, 2023 / 16:06, alan.adamson@oracle.com wrote: > [...] >> I've been able to reproduce it. The sleep .1 helps but doesn't eliminate >> the issue. I did notice whenever there was a failure, there was also a >> "blk_print_req_error: 2 callbacks suppressed" in the log which would break >> the parsing the test needs to do. > Ah, I see. The error messages were no printed because pr_err_ratelimited() > suppressed them. AFAIK, there is no way to disable or relax the rate limit from > userland. I think sleep for 5 = DEFAULT_RATE_LIMIT seconds at the beginning of > the test will ensure the error message printed. It will also avoid the "x > callbacks suppressed" messages. > > With the change below, I observe no failure on my system. > > diff --git a/tests/nvme/039 b/tests/nvme/039 > index 11d6d24..5d76297 100755 > --- a/tests/nvme/039 > +++ b/tests/nvme/039 > @@ -139,6 +139,9 @@ test_device() { > > _nvme_err_inject_setup "${ns_dev}" "${ctrl_dev}" > > + # wait DEFAULT_RATELIMIT_INTERVAL=5 seconds to ensure errors are printed > + sleep 5 > + > inject_unrec_read_on_read "${ns_dev}" > inject_invalid_status_on_read "${ns_dev}" > inject_write_fault_on_write "${ns_dev}" > Looks good. I tested it with my environment. When we get the passthru logging in, I'll be adding more tests here, but I'll have to do another sleep 5. Alan
On Apr 11, 2023 / 13:55, alan.adamson@oracle.com wrote: > > On 4/10/23 6:27 PM, Shin'ichiro Kawasaki wrote: > > On Apr 10, 2023 / 16:06, alan.adamson@oracle.com wrote: > > [...] > > > I've been able to reproduce it. The sleep .1 helps but doesn't eliminate > > > the issue. I did notice whenever there was a failure, there was also a > > > "blk_print_req_error: 2 callbacks suppressed" in the log which would break > > > the parsing the test needs to do. > > Ah, I see. The error messages were no printed because pr_err_ratelimited() > > suppressed them. AFAIK, there is no way to disable or relax the rate limit from > > userland. I think sleep for 5 = DEFAULT_RATE_LIMIT seconds at the beginning of > > the test will ensure the error message printed. It will also avoid the "x > > callbacks suppressed" messages. > > > > With the change below, I observe no failure on my system. > > > > diff --git a/tests/nvme/039 b/tests/nvme/039 > > index 11d6d24..5d76297 100755 > > --- a/tests/nvme/039 > > +++ b/tests/nvme/039 > > @@ -139,6 +139,9 @@ test_device() { > > _nvme_err_inject_setup "${ns_dev}" "${ctrl_dev}" > > + # wait DEFAULT_RATELIMIT_INTERVAL=5 seconds to ensure errors are printed > > + sleep 5 > > + > > inject_unrec_read_on_read "${ns_dev}" > > inject_invalid_status_on_read "${ns_dev}" > > inject_write_fault_on_write "${ns_dev}" > > > Looks good. I tested it with my environment. I did further testing, and found the test case still fails... The 5 seconds wait ensures the expected errors are printed, but still we have the chance of the "callbacks suppressed" message when previous test cases have printed many messages. Here's the kernel output I observed with the fix above. run blktests nvme/039 at 2023-04-12 13:52:19 nvme0n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x2 / sc 0x81) DNR blk_print_req_error: 5804 callbacks suppressed critical medium error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2 nvme0n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x3 / sc 0x75) DNR I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2 nvme0n1: I/O Cmd(0x1) @ LBA 0, 1 blocks, I/O Error (sct 0x2 / sc 0x80) DNR critical medium error, dev nvme0n1, sector 0 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 2 I think further modification is required to grep -v the "callbacks suppressed". Will post a patch with this modification to ask your comment. > When we get the passthru > logging in, I'll be adding more tests here, but I'll have to do another > sleep 5. Agreed.
diff --git a/tests/nvme/rc b/tests/nvme/rc index 210a82a..7043c23 100644 --- a/tests/nvme/rc +++ b/tests/nvme/rc @@ -652,6 +652,7 @@ _nvme_enable_err_inject() echo "$4" > /sys/kernel/debug/"$1"/fault_inject/dont_retry echo "$5" > /sys/kernel/debug/"$1"/fault_inject/status echo "$6" > /sys/kernel/debug/"$1"/fault_inject/times + sleep 0.1 } _nvme_disable_err_inject()