diff mbox series

blktests nvme/039 failure

Message ID 5vnpdeocos6k4nmh6ewh7ltqz7b6wuemzcmqflfkybejssewkh@edtqm3t4w3zv (mailing list archive)
State New, archived
Headers show
Series blktests nvme/039 failure | expand

Commit Message

Shin'ichiro Kawasaki April 10, 2023, 11:49 a.m. UTC
Hello Alan,

I noticed that recently nvme/039 fails on my system occasionally (around 40%).
The failure messages are as follows:

nvme/039 => nvme0n1 (test error logging)                     [failed]
    runtime  0.176s  ...  0.167s
    --- tests/nvme/039.out      2023-04-06 10:11:07.925670528 +0900
    +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad   2023-04-10 20:15:07.679538017 +0900
    @@ -1,5 +1,2 @@
     Running nvme/039
    - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR
    - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
    - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
     Test complete

nvme/039 => nvme0n1 (test error logging)                     [failed]
    runtime  0.167s  ...  0.199s
    --- tests/nvme/039.out      2023-04-06 10:11:07.925670528 +0900
    +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad   2023-04-10 20:15:09.114539650 +0900
    @@ -1,5 +1,4 @@
     Running nvme/039
    - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR
      Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
      Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
     Test complete

It looks that expected error messages were not reported.

I suspect that the time duration is too short between error injection enable
and I/O to trigger the error. With the one line change below to add wait after
the error injection enable, the failures disappear. Do you think such wait is
the valid fix?

 tests/nvme/rc | 1 +
 1 file changed, 1 insertion(+)

Comments

Alan Adamson April 10, 2023, 4:17 p.m. UTC | #1
On 4/10/23 4:49 AM, Shin'ichiro Kawasaki wrote:

> Hello Alan,
>
> I noticed that recently nvme/039 fails on my system occasionally (around 40%).
> The failure messages are as follows:
>
> nvme/039 => nvme0n1 (test error logging)                     [failed]
>      runtime  0.176s  ...  0.167s
>      --- tests/nvme/039.out      2023-04-06 10:11:07.925670528 +0900
>      +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad   2023-04-10 20:15:07.679538017 +0900
>      @@ -1,5 +1,2 @@
>       Running nvme/039
>      - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR
>      - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
>      - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
>       Test complete
>
> nvme/039 => nvme0n1 (test error logging)                     [failed]
>      runtime  0.167s  ...  0.199s
>      --- tests/nvme/039.out      2023-04-06 10:11:07.925670528 +0900
>      +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad   2023-04-10 20:15:09.114539650 +0900
>      @@ -1,5 +1,4 @@
>       Running nvme/039
>      - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR
>        Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
>        Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
>       Test complete
>
> It looks that expected error messages were not reported.
>
> I suspect that the time duration is too short between error injection enable
> and I/O to trigger the error. With the one line change below to add wait after
> the error injection enable, the failures disappear. Do you think such wait is
> the valid fix?

I saw the same thing Friday evening. Let me take a look.


Alan



>
>   tests/nvme/rc | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/tests/nvme/rc b/tests/nvme/rc
> index 210a82a..7043c23 100644
> --- a/tests/nvme/rc
> +++ b/tests/nvme/rc
> @@ -652,6 +652,7 @@ _nvme_enable_err_inject()
>           echo "$4" > /sys/kernel/debug/"$1"/fault_inject/dont_retry
>           echo "$5" > /sys/kernel/debug/"$1"/fault_inject/status
>           echo "$6" > /sys/kernel/debug/"$1"/fault_inject/times
> +	sleep 0.1
>   }
>   
>   _nvme_disable_err_inject()
Alan Adamson April 10, 2023, 11:06 p.m. UTC | #2
On 4/10/23 4:49 AM, Shin'ichiro Kawasaki wrote:
> Hello Alan,
>
> I noticed that recently nvme/039 fails on my system occasionally (around 40%).
> The failure messages are as follows:
>
> nvme/039 => nvme0n1 (test error logging)                     [failed]
>      runtime  0.176s  ...  0.167s
>      --- tests/nvme/039.out      2023-04-06 10:11:07.925670528 +0900
>      +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad   2023-04-10 20:15:07.679538017 +0900
>      @@ -1,5 +1,2 @@
>       Running nvme/039
>      - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR
>      - Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
>      - Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
>       Test complete
>
> nvme/039 => nvme0n1 (test error logging)                     [failed]
>      runtime  0.167s  ...  0.199s
>      --- tests/nvme/039.out      2023-04-06 10:11:07.925670528 +0900
>      +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/039.out.bad   2023-04-10 20:15:09.114539650 +0900
>      @@ -1,5 +1,4 @@
>       Running nvme/039
>      - Read(0x2) @ LBA 0, 1 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) DNR
>        Read(0x2) @ LBA 0, 1 blocks, Unknown (sct 0x3 / sc 0x75) DNR
>        Write(0x1) @ LBA 0, 1 blocks, Write Fault (sct 0x2 / sc 0x80) DNR
>       Test complete
>
> It looks that expected error messages were not reported.
>
> I suspect that the time duration is too short between error injection enable
> and I/O to trigger the error. With the one line change below to add wait after
> the error injection enable, the failures disappear. Do you think such wait is
> the valid fix?
>
>   tests/nvme/rc | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/tests/nvme/rc b/tests/nvme/rc
> index 210a82a..7043c23 100644
> --- a/tests/nvme/rc
> +++ b/tests/nvme/rc
> @@ -652,6 +652,7 @@ _nvme_enable_err_inject()
>           echo "$4" > /sys/kernel/debug/"$1"/fault_inject/dont_retry
>           echo "$5" > /sys/kernel/debug/"$1"/fault_inject/status
>           echo "$6" > /sys/kernel/debug/"$1"/fault_inject/times
> +	sleep 0.1
>   }
>   
>   _nvme_disable_err_inject()

I've been able to reproduce it.  The sleep .1 helps but doesn't 
eliminate the issue.  I did notice whenever there was a failure, there 
was also a "blk_print_req_error: 2 callbacks suppressed" in the log 
which would break the parsing the test needs to do.


Alan
Shin'ichiro Kawasaki April 11, 2023, 1:27 a.m. UTC | #3
On Apr 10, 2023 / 16:06, alan.adamson@oracle.com wrote:
[...]
> I've been able to reproduce it.  The sleep .1 helps but doesn't eliminate
> the issue.  I did notice whenever there was a failure, there was also a
> "blk_print_req_error: 2 callbacks suppressed" in the log which would break
> the parsing the test needs to do.

Ah, I see. The error messages were no printed because pr_err_ratelimited()
suppressed them. AFAIK, there is no way to disable or relax the rate limit from
userland. I think sleep for 5 = DEFAULT_RATE_LIMIT seconds at the beginning of
the test will ensure the error message printed. It will also avoid the "x
callbacks suppressed" messages.

With the change below, I observe no failure on my system.

diff --git a/tests/nvme/039 b/tests/nvme/039
index 11d6d24..5d76297 100755
--- a/tests/nvme/039
+++ b/tests/nvme/039
@@ -139,6 +139,9 @@ test_device() {
 
 	_nvme_err_inject_setup "${ns_dev}" "${ctrl_dev}"
 
+	# wait DEFAULT_RATELIMIT_INTERVAL=5 seconds to ensure errors are printed
+	sleep 5
+
 	inject_unrec_read_on_read "${ns_dev}"
 	inject_invalid_status_on_read "${ns_dev}"
 	inject_write_fault_on_write "${ns_dev}"
Alan Adamson April 11, 2023, 8:55 p.m. UTC | #4
On 4/10/23 6:27 PM, Shin'ichiro Kawasaki wrote:
> On Apr 10, 2023 / 16:06, alan.adamson@oracle.com wrote:
> [...]
>> I've been able to reproduce it.  The sleep .1 helps but doesn't eliminate
>> the issue.  I did notice whenever there was a failure, there was also a
>> "blk_print_req_error: 2 callbacks suppressed" in the log which would break
>> the parsing the test needs to do.
> Ah, I see. The error messages were no printed because pr_err_ratelimited()
> suppressed them. AFAIK, there is no way to disable or relax the rate limit from
> userland. I think sleep for 5 = DEFAULT_RATE_LIMIT seconds at the beginning of
> the test will ensure the error message printed. It will also avoid the "x
> callbacks suppressed" messages.
>
> With the change below, I observe no failure on my system.
>
> diff --git a/tests/nvme/039 b/tests/nvme/039
> index 11d6d24..5d76297 100755
> --- a/tests/nvme/039
> +++ b/tests/nvme/039
> @@ -139,6 +139,9 @@ test_device() {
>   
>   	_nvme_err_inject_setup "${ns_dev}" "${ctrl_dev}"
>   
> +	# wait DEFAULT_RATELIMIT_INTERVAL=5 seconds to ensure errors are printed
> +	sleep 5
> +
>   	inject_unrec_read_on_read "${ns_dev}"
>   	inject_invalid_status_on_read "${ns_dev}"
>   	inject_write_fault_on_write "${ns_dev}"
>
Looks good.  I tested it with my environment.  When we get the passthru 
logging in,  I'll be adding more tests here, but I'll have to do another 
sleep 5.

Alan
Shin'ichiro Kawasaki April 12, 2023, 8:49 a.m. UTC | #5
On Apr 11, 2023 / 13:55, alan.adamson@oracle.com wrote:
> 
> On 4/10/23 6:27 PM, Shin'ichiro Kawasaki wrote:
> > On Apr 10, 2023 / 16:06, alan.adamson@oracle.com wrote:
> > [...]
> > > I've been able to reproduce it.  The sleep .1 helps but doesn't eliminate
> > > the issue.  I did notice whenever there was a failure, there was also a
> > > "blk_print_req_error: 2 callbacks suppressed" in the log which would break
> > > the parsing the test needs to do.
> > Ah, I see. The error messages were no printed because pr_err_ratelimited()
> > suppressed them. AFAIK, there is no way to disable or relax the rate limit from
> > userland. I think sleep for 5 = DEFAULT_RATE_LIMIT seconds at the beginning of
> > the test will ensure the error message printed. It will also avoid the "x
> > callbacks suppressed" messages.
> > 
> > With the change below, I observe no failure on my system.
> > 
> > diff --git a/tests/nvme/039 b/tests/nvme/039
> > index 11d6d24..5d76297 100755
> > --- a/tests/nvme/039
> > +++ b/tests/nvme/039
> > @@ -139,6 +139,9 @@ test_device() {
> >   	_nvme_err_inject_setup "${ns_dev}" "${ctrl_dev}"
> > +	# wait DEFAULT_RATELIMIT_INTERVAL=5 seconds to ensure errors are printed
> > +	sleep 5
> > +
> >   	inject_unrec_read_on_read "${ns_dev}"
> >   	inject_invalid_status_on_read "${ns_dev}"
> >   	inject_write_fault_on_write "${ns_dev}"
> > 
> Looks good.  I tested it with my environment. 

I did further testing, and found the test case still fails... The 5 seconds wait
ensures the expected errors are printed, but still we have the chance of the
"callbacks suppressed" message when previous test cases have printed many
messages. Here's the kernel output I observed with the fix above.

  run blktests nvme/039 at 2023-04-12 13:52:19
  nvme0n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x2 / sc 0x81) DNR
  blk_print_req_error: 5804 callbacks suppressed
  critical medium error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
  nvme0n1: I/O Cmd(0x2) @ LBA 0, 1 blocks, I/O Error (sct 0x3 / sc 0x75) DNR
  I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 2
  nvme0n1: I/O Cmd(0x1) @ LBA 0, 1 blocks, I/O Error (sct 0x2 / sc 0x80) DNR
  critical medium error, dev nvme0n1, sector 0 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 2

I think further modification is required to grep -v the "callbacks suppressed".
Will post a patch with this modification to ask your comment.

> When we get the passthru
> logging in,  I'll be adding more tests here, but I'll have to do another
> sleep 5.

Agreed.
diff mbox series

Patch

diff --git a/tests/nvme/rc b/tests/nvme/rc
index 210a82a..7043c23 100644
--- a/tests/nvme/rc
+++ b/tests/nvme/rc
@@ -652,6 +652,7 @@  _nvme_enable_err_inject()
         echo "$4" > /sys/kernel/debug/"$1"/fault_inject/dont_retry
         echo "$5" > /sys/kernel/debug/"$1"/fault_inject/status
         echo "$6" > /sys/kernel/debug/"$1"/fault_inject/times
+	sleep 0.1
 }
 
 _nvme_disable_err_inject()