Message ID | 20230412085923.1616977-1-shinichiro@fastmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [blktests] nvme/039: avoid failure by error message rate limit | expand |
On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote: > From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> > > The test case nvme/039 tests that expected error messages are printed > for errors injected to the nvme driver. However, the test case fails by > chance when previous test cases generate many error messages. In this > case, the kernel function pr_err_ratelimited() may suppress the error > messages that the test case expects. Also, it may print messages that > the test case does not expect, such as "blk_print_req_error: xxxx > callbacks suppressed". > > To avoid the failure, make two improvements for the test case. Firstly, > wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure > the expected error messages are not suppressed. Secondly, exclude the > unexpected message for the error message check. Introduce a helper > function last_dmesg() for the second improvement. Why are we seeing the callback messages? By the time the test starts generating errors (after a 5 sec delay) we should be able to log 10 messages without any being suppressed. Alan
On Apr 12, 2023 / 10:49, alan.adamson@oracle.com wrote: > > On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote: > > From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> > > > > The test case nvme/039 tests that expected error messages are printed > > for errors injected to the nvme driver. However, the test case fails by > > chance when previous test cases generate many error messages. In this > > case, the kernel function pr_err_ratelimited() may suppress the error > > messages that the test case expects. Also, it may print messages that > > the test case does not expect, such as "blk_print_req_error: xxxx > > callbacks suppressed". > > > > To avoid the failure, make two improvements for the test case. Firstly, > > wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure > > the expected error messages are not suppressed. Secondly, exclude the > > unexpected message for the error message check. Introduce a helper > > function last_dmesg() for the second improvement. > > Why are we seeing the callback messages? By the time the test starts > generating errors (after a 5 sec delay) we should be able to log 10 messages > without any being suppressed. That is because other test cases before nvme/039 can generate errors. For instance, block/014 generates many errors. When I ran block/014 and nvme/039 in sequence, I always observe nvme/039 failure even with the 5 seconds wait in nvme/039. I suggest to excldue the "callbacks message" to avoid the nvme/038 failure regardless of the errors generated before nvme/039.
On Apr 13, 2023 / 09:20, Shin'ichiro Kawasaki wrote: > On Apr 12, 2023 / 10:49, alan.adamson@oracle.com wrote: > > > > On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote: > > > From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> > > > > > > The test case nvme/039 tests that expected error messages are printed > > > for errors injected to the nvme driver. However, the test case fails by > > > chance when previous test cases generate many error messages. In this > > > case, the kernel function pr_err_ratelimited() may suppress the error > > > messages that the test case expects. Also, it may print messages that > > > the test case does not expect, such as "blk_print_req_error: xxxx > > > callbacks suppressed". > > > > > > To avoid the failure, make two improvements for the test case. Firstly, > > > wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure > > > the expected error messages are not suppressed. Secondly, exclude the > > > unexpected message for the error message check. Introduce a helper > > > function last_dmesg() for the second improvement. > > > > Why are we seeing the callback messages? By the time the test starts > > generating errors (after a 5 sec delay) we should be able to log 10 messages > > without any being suppressed. > > That is because other test cases before nvme/039 can generate errors. For > instance, block/014 generates many errors. When I ran block/014 and nvme/039 in > sequence, I always observe nvme/039 failure even with the 5 seconds wait in > nvme/039. I suggest to excldue the "callbacks message" to avoid the nvme/038 > failure regardless of the errors generated before nvme/039. Reading back my explanation above, I found it may not be clear enough. Let me ammend it. My understanding is as follows. The test case block/014 generates many error messages by blk_print_req_error(). The messages are rate limited and some of them are suppressed. At that time, __ratelimit() calls printk_deferred() to print the "callbacks suppressed" message, which is deferred and not printed during block/014. When nvme/039 triggers an error message (after the 5 sec delay), the "callbacks suppressed" message for blk_print_req_error() and block/014 is printed. It makes the nvme/039 fail, since nvme/039 expects the error message it triggered, but it finds the "callbacks suppressed" message instead. In theory, not only block/014 but also other workload with rate limited error messages can cause this nvme/039 failure. The 5 sec delay in nvme/039 ensures that the error messages for nvme/039 are not suppressed, but it can not avoid the "callbacks suppressed" messages that deferred before nvme/039.
On 4/12/23 8:56 PM, Shin'ichiro Kawasaki wrote: > On Apr 13, 2023 / 09:20, Shin'ichiro Kawasaki wrote: >> On Apr 12, 2023 / 10:49, alan.adamson@oracle.com wrote: >>> On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote: >>>> From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> >>>> >>>> The test case nvme/039 tests that expected error messages are printed >>>> for errors injected to the nvme driver. However, the test case fails by >>>> chance when previous test cases generate many error messages. In this >>>> case, the kernel function pr_err_ratelimited() may suppress the error >>>> messages that the test case expects. Also, it may print messages that >>>> the test case does not expect, such as "blk_print_req_error: xxxx >>>> callbacks suppressed". >>>> >>>> To avoid the failure, make two improvements for the test case. Firstly, >>>> wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure >>>> the expected error messages are not suppressed. Secondly, exclude the >>>> unexpected message for the error message check. Introduce a helper >>>> function last_dmesg() for the second improvement. >>> Why are we seeing the callback messages? By the time the test starts >>> generating errors (after a 5 sec delay) we should be able to log 10 messages >>> without any being suppressed. >> That is because other test cases before nvme/039 can generate errors. For >> instance, block/014 generates many errors. When I ran block/014 and nvme/039 in >> sequence, I always observe nvme/039 failure even with the 5 seconds wait in >> nvme/039. I suggest to excldue the "callbacks message" to avoid the nvme/038 >> failure regardless of the errors generated before nvme/039. > Reading back my explanation above, I found it may not be clear enough. Let me > ammend it. My understanding is as follows. > > The test case block/014 generates many error messages by blk_print_req_error(). > The messages are rate limited and some of them are suppressed. At that time, > __ratelimit() calls printk_deferred() to print the "callbacks suppressed" > message, which is deferred and not printed during block/014. When nvme/039 > triggers an error message (after the 5 sec delay), the "callbacks suppressed" > message for blk_print_req_error() and block/014 is printed. It makes the > nvme/039 fail, since nvme/039 expects the error message it triggered, but it > finds the "callbacks suppressed" message instead. In theory, not only block/014 > but also other workload with rate limited error messages can cause this nvme/039 > failure. > > The 5 sec delay in nvme/039 ensures that the error messages for nvme/039 are not > suppressed, but it can not avoid the "callbacks suppressed" messages that > deferred before nvme/039. That's right. I went through the rate limit code and that's how it works, you will get a "callback suppressed" message along with the first message nvme/039 logs. I ran the following for a while: while true do ./check block/014 ./check nvme/039 done Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Tested-by: Alan Adamson <alan.adamson@oracle.com> Alan
On Apr 13, 2023 / 10:19, alan.adamson@oracle.com wrote: [...] > Reviewed-by: Alan Adamson <alan.adamson@oracle.com> > > Tested-by: Alan Adamson <alan.adamson@oracle.com> Thanks for the review and the test. I've applied the fix.
diff --git a/tests/nvme/039 b/tests/nvme/039 index 11d6d24..f327b54 100755 --- a/tests/nvme/039 +++ b/tests/nvme/039 @@ -18,6 +18,15 @@ requires() { _have_kernel_option FAULT_INJECTION_DEBUG_FS } +# Get the last dmesg lines as many as specified. Exclude the lines to indicate +# suppression by rate limit. +last_dmesg() +{ + local nr_lines=$1 + + dmesg -t | grep -v "callbacks suppressed" | tail "-$nr_lines" +} + inject_unrec_read_on_read() { # Inject a 'Unrecovered Read Error' (0x281) status error on a READ @@ -29,10 +38,10 @@ inject_unrec_read_on_read() _nvme_disable_err_inject "$1" if ${nvme_verbose_errors}; then - dmesg -t | tail -2 | grep "Unrecovered Read Error (" | \ + last_dmesg 2 | grep "Unrecovered Read Error (" | \ sed 's/nvme.*://g' else - dmesg -t | tail -2 | grep "Cmd(" | sed 's/I\/O Cmd/Read/g' | \ + last_dmesg 2 | grep "Cmd(" | sed 's/I\/O Cmd/Read/g' | \ sed 's/I\/O Error/Unrecovered Read Error/g' | \ sed 's/nvme.*://g' fi @@ -49,10 +58,10 @@ inject_invalid_status_on_read() _nvme_disable_err_inject "$1" if ${nvme_verbose_errors}; then - dmesg -t | tail -2 | grep "Unknown (" | \ + last_dmesg 2 | grep "Unknown (" | \ sed 's/nvme.*://g' else - dmesg -t | tail -2 | grep "Cmd(" | sed 's/I\/O Cmd/Read/g' | \ + last_dmesg 2 | grep "Cmd(" | sed 's/I\/O Cmd/Read/g' | \ sed 's/I\/O Error/Unknown/g' | \ sed 's/nvme.*://g' fi @@ -69,10 +78,10 @@ inject_write_fault_on_write() _nvme_disable_err_inject "$1" if ${nvme_verbose_errors}; then - dmesg -t | tail -2 | grep "Write Fault (" | \ + last_dmesg 2 | grep "Write Fault (" | \ sed 's/nvme.*://g' else - dmesg -t | tail -2 | grep "Cmd(" | sed 's/I\/O Cmd/Write/g' | \ + last_dmesg 2 | grep "Cmd(" | sed 's/I\/O Cmd/Write/g' | \ sed 's/I\/O Error/Write Fault/g' | \ sed 's/nvme.*://g' fi @@ -90,10 +99,10 @@ inject_access_denied_on_identify() _nvme_disable_err_inject "$1" if ${nvme_verbose_errors}; then - dmesg -t | tail -1 | grep "Access Denied (" | \ + last_dmesg 1 | grep "Access Denied (" | \ sed 's/nvme.*://g' else - dmesg -t | tail -1 | grep "Admin Cmd(" | \ + last_dmesg 1 | grep "Admin Cmd(" | \ sed 's/Admin Cmd/Identify/g' | \ sed 's/I\/O Error/Access Denied/g' | \ sed 's/nvme.*://g' @@ -139,6 +148,9 @@ test_device() { _nvme_err_inject_setup "${ns_dev}" "${ctrl_dev}" + # wait DEFAULT_RATELIMIT_INTERVAL=5 seconds to ensure errors are printed + sleep 5 + inject_unrec_read_on_read "${ns_dev}" inject_invalid_status_on_read "${ns_dev}" inject_write_fault_on_write "${ns_dev}"