diff mbox series

[blktests] nvme/039: avoid failure by error message rate limit

Message ID 20230412085923.1616977-1-shinichiro@fastmail.com (mailing list archive)
State New, archived
Headers show
Series [blktests] nvme/039: avoid failure by error message rate limit | expand

Commit Message

Shin'ichiro Kawasaki April 12, 2023, 8:59 a.m. UTC
From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

The test case nvme/039 tests that expected error messages are printed
for errors injected to the nvme driver. However, the test case fails by
chance when previous test cases generate many error messages. In this
case, the kernel function pr_err_ratelimited() may suppress the error
messages that the test case expects. Also, it may print messages that
the test case does not expect, such as "blk_print_req_error: xxxx
callbacks suppressed".

To avoid the failure, make two improvements for the test case. Firstly,
wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure
the expected error messages are not suppressed. Secondly, exclude the
unexpected message for the error message check. Introduce a helper
function last_dmesg() for the second improvement.

Fixes: 9accb5f86670 ("tests/nvme: add tests for error logging")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/linux-block/5vnpdeocos6k4nmh6ewh7ltqz7b6wuemzcmqflfkybejssewkh@edtqm3t4w3zv/
---
 tests/nvme/039 | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

Comments

Alan Adamson April 12, 2023, 5:49 p.m. UTC | #1
On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote:
> From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>
> The test case nvme/039 tests that expected error messages are printed
> for errors injected to the nvme driver. However, the test case fails by
> chance when previous test cases generate many error messages. In this
> case, the kernel function pr_err_ratelimited() may suppress the error
> messages that the test case expects. Also, it may print messages that
> the test case does not expect, such as "blk_print_req_error: xxxx
> callbacks suppressed".
>
> To avoid the failure, make two improvements for the test case. Firstly,
> wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure
> the expected error messages are not suppressed. Secondly, exclude the
> unexpected message for the error message check. Introduce a helper
> function last_dmesg() for the second improvement.

Why are we seeing the callback messages?  By the time the test starts 
generating errors (after a 5 sec delay) we should be able to log 10 
messages without any being suppressed.

Alan
Shin'ichiro Kawasaki April 13, 2023, 12:20 a.m. UTC | #2
On Apr 12, 2023 / 10:49, alan.adamson@oracle.com wrote:
> 
> On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote:
> > From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > 
> > The test case nvme/039 tests that expected error messages are printed
> > for errors injected to the nvme driver. However, the test case fails by
> > chance when previous test cases generate many error messages. In this
> > case, the kernel function pr_err_ratelimited() may suppress the error
> > messages that the test case expects. Also, it may print messages that
> > the test case does not expect, such as "blk_print_req_error: xxxx
> > callbacks suppressed".
> > 
> > To avoid the failure, make two improvements for the test case. Firstly,
> > wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure
> > the expected error messages are not suppressed. Secondly, exclude the
> > unexpected message for the error message check. Introduce a helper
> > function last_dmesg() for the second improvement.
> 
> Why are we seeing the callback messages?  By the time the test starts
> generating errors (after a 5 sec delay) we should be able to log 10 messages
> without any being suppressed.

That is because other test cases before nvme/039 can generate errors. For
instance, block/014 generates many errors. When I ran block/014 and nvme/039 in
sequence, I always observe nvme/039 failure even with the 5 seconds wait in
nvme/039. I suggest to excldue the "callbacks message" to avoid the nvme/038
failure regardless of the errors generated before nvme/039.
Shin'ichiro Kawasaki April 13, 2023, 3:56 a.m. UTC | #3
On Apr 13, 2023 / 09:20, Shin'ichiro Kawasaki wrote:
> On Apr 12, 2023 / 10:49, alan.adamson@oracle.com wrote:
> > 
> > On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote:
> > > From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > > 
> > > The test case nvme/039 tests that expected error messages are printed
> > > for errors injected to the nvme driver. However, the test case fails by
> > > chance when previous test cases generate many error messages. In this
> > > case, the kernel function pr_err_ratelimited() may suppress the error
> > > messages that the test case expects. Also, it may print messages that
> > > the test case does not expect, such as "blk_print_req_error: xxxx
> > > callbacks suppressed".
> > > 
> > > To avoid the failure, make two improvements for the test case. Firstly,
> > > wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure
> > > the expected error messages are not suppressed. Secondly, exclude the
> > > unexpected message for the error message check. Introduce a helper
> > > function last_dmesg() for the second improvement.
> > 
> > Why are we seeing the callback messages?  By the time the test starts
> > generating errors (after a 5 sec delay) we should be able to log 10 messages
> > without any being suppressed.
> 
> That is because other test cases before nvme/039 can generate errors. For
> instance, block/014 generates many errors. When I ran block/014 and nvme/039 in
> sequence, I always observe nvme/039 failure even with the 5 seconds wait in
> nvme/039. I suggest to excldue the "callbacks message" to avoid the nvme/038
> failure regardless of the errors generated before nvme/039.

Reading back my explanation above, I found it may not be clear enough. Let me
ammend it. My understanding is as follows.

The test case block/014 generates many error messages by blk_print_req_error().
The messages are rate limited and some of them are suppressed. At that time,
__ratelimit() calls printk_deferred() to print the "callbacks suppressed"
message, which is deferred and not printed during block/014. When nvme/039
triggers an error message (after the 5 sec delay), the "callbacks suppressed"
message for blk_print_req_error() and block/014 is printed. It makes the
nvme/039 fail, since nvme/039 expects the error message it triggered, but it
finds the "callbacks suppressed" message instead. In theory, not only block/014
but also other workload with rate limited error messages can cause this nvme/039
failure.

The 5 sec delay in nvme/039 ensures that the error messages for nvme/039 are not
suppressed, but it can not avoid the "callbacks suppressed" messages that
deferred before nvme/039.
Alan Adamson April 13, 2023, 5:19 p.m. UTC | #4
On 4/12/23 8:56 PM, Shin'ichiro Kawasaki wrote:
> On Apr 13, 2023 / 09:20, Shin'ichiro Kawasaki wrote:
>> On Apr 12, 2023 / 10:49, alan.adamson@oracle.com wrote:
>>> On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote:
>>>> From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>>>>
>>>> The test case nvme/039 tests that expected error messages are printed
>>>> for errors injected to the nvme driver. However, the test case fails by
>>>> chance when previous test cases generate many error messages. In this
>>>> case, the kernel function pr_err_ratelimited() may suppress the error
>>>> messages that the test case expects. Also, it may print messages that
>>>> the test case does not expect, such as "blk_print_req_error: xxxx
>>>> callbacks suppressed".
>>>>
>>>> To avoid the failure, make two improvements for the test case. Firstly,
>>>> wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure
>>>> the expected error messages are not suppressed. Secondly, exclude the
>>>> unexpected message for the error message check. Introduce a helper
>>>> function last_dmesg() for the second improvement.
>>> Why are we seeing the callback messages?  By the time the test starts
>>> generating errors (after a 5 sec delay) we should be able to log 10 messages
>>> without any being suppressed.
>> That is because other test cases before nvme/039 can generate errors. For
>> instance, block/014 generates many errors. When I ran block/014 and nvme/039 in
>> sequence, I always observe nvme/039 failure even with the 5 seconds wait in
>> nvme/039. I suggest to excldue the "callbacks message" to avoid the nvme/038
>> failure regardless of the errors generated before nvme/039.
> Reading back my explanation above, I found it may not be clear enough. Let me
> ammend it. My understanding is as follows.
>
> The test case block/014 generates many error messages by blk_print_req_error().
> The messages are rate limited and some of them are suppressed. At that time,
> __ratelimit() calls printk_deferred() to print the "callbacks suppressed"
> message, which is deferred and not printed during block/014. When nvme/039
> triggers an error message (after the 5 sec delay), the "callbacks suppressed"
> message for blk_print_req_error() and block/014 is printed. It makes the
> nvme/039 fail, since nvme/039 expects the error message it triggered, but it
> finds the "callbacks suppressed" message instead. In theory, not only block/014
> but also other workload with rate limited error messages can cause this nvme/039
> failure.
>
> The 5 sec delay in nvme/039 ensures that the error messages for nvme/039 are not
> suppressed, but it can not avoid the "callbacks suppressed" messages that
> deferred before nvme/039.

That's right.  I went through the rate limit code and that's how it 
works, you will get a "callback suppressed" message along with the first 
message nvme/039 logs.

I ran the following for a while:

while true

do

./check block/014

./check nvme/039

done


Reviewed-by: Alan Adamson <alan.adamson@oracle.com>

Tested-by: Alan Adamson <alan.adamson@oracle.com>


Alan
Shin'ichiro Kawasaki April 18, 2023, 2:17 a.m. UTC | #5
On Apr 13, 2023 / 10:19, alan.adamson@oracle.com wrote:
[...]
> Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
> 
> Tested-by: Alan Adamson <alan.adamson@oracle.com>

Thanks for the review and the test. I've applied the fix.
diff mbox series

Patch

diff --git a/tests/nvme/039 b/tests/nvme/039
index 11d6d24..f327b54 100755
--- a/tests/nvme/039
+++ b/tests/nvme/039
@@ -18,6 +18,15 @@  requires() {
 	    _have_kernel_option FAULT_INJECTION_DEBUG_FS
 }
 
+# Get the last dmesg lines as many as specified. Exclude the lines to indicate
+# suppression by rate limit.
+last_dmesg()
+{
+	local nr_lines=$1
+
+	dmesg -t | grep -v "callbacks suppressed" | tail "-$nr_lines"
+}
+
 inject_unrec_read_on_read()
 {
 	# Inject a 'Unrecovered Read Error' (0x281) status error on a READ
@@ -29,10 +38,10 @@  inject_unrec_read_on_read()
 	_nvme_disable_err_inject "$1"
 
 	if ${nvme_verbose_errors}; then
-		dmesg -t | tail -2 | grep "Unrecovered Read Error (" | \
+		last_dmesg 2 | grep "Unrecovered Read Error (" | \
 		    sed 's/nvme.*://g'
 	else
-		dmesg -t | tail -2 | grep "Cmd(" | sed 's/I\/O Cmd/Read/g' | \
+		last_dmesg 2 | grep "Cmd(" | sed 's/I\/O Cmd/Read/g' | \
 		    sed 's/I\/O Error/Unrecovered Read Error/g' | \
 		    sed 's/nvme.*://g'
 	fi
@@ -49,10 +58,10 @@  inject_invalid_status_on_read()
 	_nvme_disable_err_inject "$1"
 
 	if ${nvme_verbose_errors}; then
-		dmesg -t | tail -2 | grep "Unknown (" | \
+		last_dmesg 2 | grep "Unknown (" | \
 		    sed 's/nvme.*://g'
 	else
-		dmesg -t | tail -2 | grep "Cmd(" | sed 's/I\/O Cmd/Read/g' | \
+		last_dmesg 2 | grep "Cmd(" | sed 's/I\/O Cmd/Read/g' | \
 		    sed 's/I\/O Error/Unknown/g' | \
 		    sed 's/nvme.*://g'
 	fi
@@ -69,10 +78,10 @@  inject_write_fault_on_write()
 	_nvme_disable_err_inject "$1"
 
 	if ${nvme_verbose_errors}; then
-		dmesg -t | tail -2 | grep "Write Fault (" | \
+		last_dmesg 2 | grep "Write Fault (" | \
 		    sed 's/nvme.*://g'
 	else
-		dmesg -t | tail -2 | grep "Cmd(" | sed 's/I\/O Cmd/Write/g' | \
+		last_dmesg 2 | grep "Cmd(" | sed 's/I\/O Cmd/Write/g' | \
 		    sed 's/I\/O Error/Write Fault/g' | \
 		    sed 's/nvme.*://g'
 	fi
@@ -90,10 +99,10 @@  inject_access_denied_on_identify()
 	_nvme_disable_err_inject "$1"
 
 	if ${nvme_verbose_errors}; then
-		dmesg -t | tail -1 | grep "Access Denied (" | \
+		last_dmesg 1 | grep "Access Denied (" | \
 		    sed 's/nvme.*://g'
 	else
-		dmesg -t | tail -1 | grep "Admin Cmd(" | \
+		last_dmesg 1 | grep "Admin Cmd(" | \
 		    sed 's/Admin Cmd/Identify/g' | \
 		    sed 's/I\/O Error/Access Denied/g' | \
 		    sed 's/nvme.*://g'
@@ -139,6 +148,9 @@  test_device() {
 
 	_nvme_err_inject_setup "${ns_dev}" "${ctrl_dev}"
 
+	# wait DEFAULT_RATELIMIT_INTERVAL=5 seconds to ensure errors are printed
+	sleep 5
+
 	inject_unrec_read_on_read "${ns_dev}"
 	inject_invalid_status_on_read "${ns_dev}"
 	inject_write_fault_on_write "${ns_dev}"