[v6] block: introduce block_rq_error tracepoint

Message ID	20220203201207.1075933-1-shy828301@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Yang Shi <shy828301@gmail.com> To: axboe@kernel.dk, hch@infradead.org, chaitanyak@nvidia.com, rostedt@goodmis.org, xiyou.wangcong@gmail.com Cc: shy828301@gmail.com, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [v6 PATCH] block: introduce block_rq_error tracepoint Date: Thu, 3 Feb 2022 12:12:07 -0800 Message-Id: <20220203201207.1075933-1-shy828301@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v6] block: introduce block_rq_error tracepoint \| expand [v6] block: introduce block_rq_error tracepoint

Yang Shi Feb. 3, 2022, 8:12 p.m. UTC

Currently, rasdaemon uses the existing tracepoint block_rq_complete
and filters out non-error cases in order to capture block disk errors.

But there are a few problems with this approach:

1. Even kernel trace filter could do the filtering work, there is
   still some overhead after we enable this tracepoint.

2. The filter is merely based on errno, which does not align with kernel
   logic to check the errors for print_req_error().

3. block_rq_complete only provides dev major and minor to identify
   the block device, it is not convenient to use in user-space.

So introduce a new tracepoint block_rq_error just for the error case.
With this patch, rasdaemon could switch to block_rq_error.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
The v3 patch was submitted in Feb 2020, and Steven reviewed the patch, but
it was not merged to upstream. See
https://lore.kernel.org/lkml/20200203053650.8923-1-xiyou.wangcong@gmail.com/.

The problems fixed by that patch still exist and we do need it to make
disk error handling in rasdaemon easier. So this resurrected it and
continued the version number.

v5 --> v6:
 * Removed disk name per Christoph and Chaitanya
 * Kept errno since I didn't find any other block tracepoints print blk
   status code and userspace (i.e. rasdaemon) does expect errno.
v4 --> v5:
 * Report the actual block layer status code instead of the errno per
   Christoph Hellwig.
v3 --> v4:
 * Rebased to v5.17-rc1.
 * Collected reviewed-by tag from Steven.

 block/blk-mq.c               |  4 +++-
 include/trace/events/block.h | 39 ++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+), 1 deletion(-)

Chaitanya Kulkarni Feb. 4, 2022, 2:46 a.m. UTC | #1

Yang,

On 2/3/22 12:12, Yang Shi wrote:
> Currently, rasdaemon uses the existing tracepoint block_rq_complete
> and filters out non-error cases in order to capture block disk errors.
> 
> But there are a few problems with this approach:
> 
> 1. Even kernel trace filter could do the filtering work, there is
>     still some overhead after we enable this tracepoint.
> 
> 2. The filter is merely based on errno, which does not align with kernel
>     logic to check the errors for print_req_error().
> 
> 3. block_rq_complete only provides dev major and minor to identify
>     the block device, it is not convenient to use in user-space.
> 
> So introduce a new tracepoint block_rq_error just for the error case.
> With this patch, rasdaemon could switch to block_rq_error.
> 

This patch looks good, but I've a question for you.

We already have a tracepoint for the request completion
block_rq_complete(). We are adding a new tracepoint blk_rq_error()
that is also similar to what blk_rq_complete() reports.
Similar call sites  :-
trace_block_rq_complete(req, error, nr_bytes);
trace_block_rq_error(req, error, nr_bytes);

The only delta between blk_rq_complete() and blk_rq_error() is
cmd field for blk_rq_complete() in the TP_STRUCT_ENTRY() and
__get_str(cmd) field in TP_printk() which I don't think will
have any issue if we use that for blk_rq_error().

Question 1 :- What prevents us from using the same format for
both blk_rq_complete() and blk_rq_error() ?

Question 2 :- assuming that blk_rq_complete() and blk_rq_error()
are using same format why can't we :-

declare DECLARE_EVENT_CLASS(blk_rq_completion....)
and use that class for blk_rq_complete() and blk_rq_error() ?

since if I remember correctly we need to define a event class
instead of duplicating a tracepoint with similar reporting.

-ck

Chaitanya Kulkarni Feb. 4, 2022, 8:35 a.m. UTC | #2

Yang,

> Question 1 :- What prevents us from using the same format for
> both blk_rq_complete() and blk_rq_error() ?
> 
> Question 2 :- assuming that blk_rq_complete() and blk_rq_error()
> are using same format why can't we :-
> 
> declare DECLARE_EVENT_CLASS(blk_rq_completion....)
> and use that class for blk_rq_complete() and blk_rq_error() ?
> 
> since if I remember correctly we need to define a event class
> instead of duplicating a tracepoint with similar reporting.

What I meant is following compile tested patch (although it will
need to split into a prep patch in order to post it) :-

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 27170e40e8c9..7f4dfbdf12a6 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -100,19 +100,7 @@ TRACE_EVENT(block_rq_requeue,
                   __entry->nr_sector, 0)
  );

-/**
- * block_rq_complete - block IO operation completed by device driver
- * @rq: block operations request
- * @error: status code
- * @nr_bytes: number of completed bytes
- *
- * The block_rq_complete tracepoint event indicates that some portion
- * of operation request has been completed by the device driver.  If
- * the @rq->bio is %NULL, then there is absolutely no additional work to
- * do for the request. If @rq->bio is non-NULL then there is
- * additional work required to complete the request.
- */
-TRACE_EVENT(block_rq_complete,
+DECLARE_EVENT_CLASS(block_rq_completion,

         TP_PROTO(struct request *rq, blk_status_t error, unsigned int 
nr_bytes),

@@ -144,6 +132,41 @@ TRACE_EVENT(block_rq_complete,
                   __entry->nr_sector, __entry->error)
  );

+/**
+ * block_rq_complete - block IO operation completed by device driver
+ * @rq: block operations request
+ * @error: status code
+ * @nr_bytes: number of completed bytes
+ *
+ * The block_rq_complete tracepoint event indicates that some portion
+ * of operation request has been completed by the device driver.  If
+ * the @rq->bio is %NULL, then there is absolutely no additional work to
+ * do for the request. If @rq->bio is non-NULL then there is
+ * additional work required to complete the request.
+ */
+DEFINE_EVENT(block_rq_completion, block_rq_complete,
+
+       TP_PROTO(struct request *rq, blk_status_t error, unsigned int 
nr_bytes),
+
+       TP_ARGS(rq, error, nr_bytes)
+);
+
+/**
+ * block_rq_error - block IO operation error reported by device driver
+ * @rq: block operations request
+ * @error: status code
+ * @nr_bytes: number of completed bytes
+ *
+ * The block_rq_error tracepoint event indicates that some portion
+ * of operation request has failed as reported by the device driver.
+ */
+DEFINE_EVENT(block_rq_completion, block_rq_error,
+
+       TP_PROTO(struct request *rq, blk_status_t error, unsigned int 
nr_bytes),
+
+       TP_ARGS(rq, error, nr_bytes)
+);
+
  DECLARE_EVENT_CLASS(block_rq,

         TP_PROTO(struct request *rq),
> 
> -ck
> 
>

Yang Shi Feb. 4, 2022, 6:08 p.m. UTC | #3

On Thu, Feb 3, 2022 at 6:46 PM Chaitanya Kulkarni <chaitanyak@nvidia.com> wrote:
>
> Yang,
>
> On 2/3/22 12:12, Yang Shi wrote:
> > Currently, rasdaemon uses the existing tracepoint block_rq_complete
> > and filters out non-error cases in order to capture block disk errors.
> >
> > But there are a few problems with this approach:
> >
> > 1. Even kernel trace filter could do the filtering work, there is
> >     still some overhead after we enable this tracepoint.
> >
> > 2. The filter is merely based on errno, which does not align with kernel
> >     logic to check the errors for print_req_error().
> >
> > 3. block_rq_complete only provides dev major and minor to identify
> >     the block device, it is not convenient to use in user-space.
> >
> > So introduce a new tracepoint block_rq_error just for the error case.
> > With this patch, rasdaemon could switch to block_rq_error.
> >
>
> This patch looks good, but I've a question for you.
>
> We already have a tracepoint for the request completion
> block_rq_complete(). We are adding a new tracepoint blk_rq_error()
> that is also similar to what blk_rq_complete() reports.
> Similar call sites  :-
> trace_block_rq_complete(req, error, nr_bytes);
> trace_block_rq_error(req, error, nr_bytes);
>
> The only delta between blk_rq_complete() and blk_rq_error() is
> cmd field for blk_rq_complete() in the TP_STRUCT_ENTRY() and
> __get_str(cmd) field in TP_printk() which I don't think will
> have any issue if we use that for blk_rq_error().

Yes, I agree. Just no user needs it for our usecase.

>
> Question 1 :- What prevents us from using the same format for
> both blk_rq_complete() and blk_rq_error() ?

Actually nothing if we ignore cmd.

>
> Question 2 :- assuming that blk_rq_complete() and blk_rq_error()
> are using same format why can't we :-
>
> declare DECLARE_EVENT_CLASS(blk_rq_completion....)
> and use that class for blk_rq_complete() and blk_rq_error() ?
>
> since if I remember correctly we need to define a event class
> instead of duplicating a tracepoint with similar reporting.

Very good point. I did overlook it. The original post did have disk
name and didn't have cmd, now the two tracepoints look much more
similar than the original post, so I agree the duplicate could be
combined into an event class.

>
> -ck
>
>

Yang Shi Feb. 4, 2022, 6:10 p.m. UTC | #4

On Fri, Feb 4, 2022 at 12:35 AM Chaitanya Kulkarni
<chaitanyak@nvidia.com> wrote:
>
> Yang,
>
> > Question 1 :- What prevents us from using the same format for
> > both blk_rq_complete() and blk_rq_error() ?
> >
> > Question 2 :- assuming that blk_rq_complete() and blk_rq_error()
> > are using same format why can't we :-
> >
> > declare DECLARE_EVENT_CLASS(blk_rq_completion....)
> > and use that class for blk_rq_complete() and blk_rq_error() ?
> >
> > since if I remember correctly we need to define a event class
> > instead of duplicating a tracepoint with similar reporting.
>
> What I meant is following compile tested patch (although it will
> need to split into a prep patch in order to post it) :-

Thank you so much. I will prepare the new patches. I assume you prefer
a prep patch which converts trace event to event class, then add
block_rq_err event.

>
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 27170e40e8c9..7f4dfbdf12a6 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -100,19 +100,7 @@ TRACE_EVENT(block_rq_requeue,
>                    __entry->nr_sector, 0)
>   );
>
> -/**
> - * block_rq_complete - block IO operation completed by device driver
> - * @rq: block operations request
> - * @error: status code
> - * @nr_bytes: number of completed bytes
> - *
> - * The block_rq_complete tracepoint event indicates that some portion
> - * of operation request has been completed by the device driver.  If
> - * the @rq->bio is %NULL, then there is absolutely no additional work to
> - * do for the request. If @rq->bio is non-NULL then there is
> - * additional work required to complete the request.
> - */
> -TRACE_EVENT(block_rq_complete,
> +DECLARE_EVENT_CLASS(block_rq_completion,
>
>          TP_PROTO(struct request *rq, blk_status_t error, unsigned int
> nr_bytes),
>
> @@ -144,6 +132,41 @@ TRACE_EVENT(block_rq_complete,
>                    __entry->nr_sector, __entry->error)
>   );
>
> +/**
> + * block_rq_complete - block IO operation completed by device driver
> + * @rq: block operations request
> + * @error: status code
> + * @nr_bytes: number of completed bytes
> + *
> + * The block_rq_complete tracepoint event indicates that some portion
> + * of operation request has been completed by the device driver.  If
> + * the @rq->bio is %NULL, then there is absolutely no additional work to
> + * do for the request. If @rq->bio is non-NULL then there is
> + * additional work required to complete the request.
> + */
> +DEFINE_EVENT(block_rq_completion, block_rq_complete,
> +
> +       TP_PROTO(struct request *rq, blk_status_t error, unsigned int
> nr_bytes),
> +
> +       TP_ARGS(rq, error, nr_bytes)
> +);
> +
> +/**
> + * block_rq_error - block IO operation error reported by device driver
> + * @rq: block operations request
> + * @error: status code
> + * @nr_bytes: number of completed bytes
> + *
> + * The block_rq_error tracepoint event indicates that some portion
> + * of operation request has failed as reported by the device driver.
> + */
> +DEFINE_EVENT(block_rq_completion, block_rq_error,
> +
> +       TP_PROTO(struct request *rq, blk_status_t error, unsigned int
> nr_bytes),
> +
> +       TP_ARGS(rq, error, nr_bytes)
> +);
> +
>   DECLARE_EVENT_CLASS(block_rq,
>
>          TP_PROTO(struct request *rq),
> >
> > -ck
> >
> >
>

Chaitanya Kulkarni Feb. 5, 2022, 9:12 a.m. UTC | #5

>> since if I remember correctly we need to define a event class
>> instead of duplicating a tracepoint with similar reporting.
> 
> Very good point. I did overlook it. The original post did have disk
> name and didn't have cmd, now the two tracepoints look much more
> similar than the original post, so I agree the duplicate could be
> combined into an event class.
> 

I sent out a V7 with a prep patch and testlog keeping your
original authorship from this patch.

If you find anything wrong with that please send out V8 on the
top of V7.

>>
>> -ck
>>
>>

[v6] block: introduce block_rq_error tracepoint

Commit Message

Comments

Patch