diff mbox series

[v2,RESEND] io_uring/fdinfo: add timeout_list to fdinfo

Message ID 20240925085800.1729-1-ruyi.zhang@samsung.com (mailing list archive)
State New
Series [v2,RESEND] io_uring/fdinfo: add timeout_list to fdinfo

Commit Message

Ruyi Zhang Sept. 25, 2024, 8:58 a.m. UTC
io_uring fdinfo contains most of the runtime information, which is
helpful for debugging io_uring applications. However, there is
currently a lack of timeout-related information, so this patch adds
timeout_list information.

--
changes since v1:
- use _irq version spin_lock.
- Fixed formatting issues and delete redundant code.
- v1 :https://lore.kernel.org/io-uring/20240812020052.8763-1-ruyi.zhang@samsung.com/
--

Signed-off-by: Ruyi Zhang <ruyi.zhang@samsung.com>
---
 io_uring/fdinfo.c  | 14 ++++++++++++++
 io_uring/timeout.c | 12 ------------
 io_uring/timeout.h | 12 ++++++++++++
 3 files changed, 26 insertions(+), 12 deletions(-)

Comments

Pavel Begunkov Sept. 25, 2024, 11:58 a.m. UTC | #1
On 9/25/24 09:58, Ruyi Zhang wrote:
> io_uring fdinfo contains most of the runtime information,which is
> helpful for debugging io_uring applications; However, there is
> currently a lack of timeout-related information, and this patch adds
> timeout_list information.

Please refer to unaddressed comments from v1. We can't have irqs
disabled for that long. And it's too verbose (i.e. depends on
the number of timeouts).


> --
> changes since v1:
> - use _irq version spin_lock.
> - Fixed formatting issues and delete redundant code.
> - v1 :https://lore.kernel.org/io-uring/20240812020052.8763-1-ruyi.zhang@samsung.com/
> --
> 
> Signed-off-by: Ruyi Zhang <ruyi.zhang@samsung.com>
Ruyi Zhang Oct. 10, 2024, 9:20 a.m. UTC | #2
---
On 25 Sep 2024 12:58 Pavel Begunkov wrote
> On 9/25/24 09:58, Ruyi Zhang wrote:
>> io_uring fdinfo contains most of the runtime information,which is
>> helpful for debugging io_uring applications; However, there is
>> currently a lack of timeout-related information, and this patch adds
>> timeout_list information.

> Please refer to unaddressed comments from v1. We can't have irqs
> disabled for that long. And it's too verbose (i.e. depends on
> the number of timeouts).

Two questions:

1. I agree with you, we shouldn't walk a potentially very long list
under a spinlock, but I can't find any other way to get all the timeout
information than to walk the timeout_list. Do you have any good ideas?

2. I also agree that seq_printf is heavier. If we use seq_put_decimal_ull
and seq_puts to concatenate strings, I haven't tested whether it's more
efficient or not, but the code is certainly not as readable as the
former. It's also possible that I don't fully understand what you mean,
so I would like to hear your opinion.

---
Ruyi Zhang
Pavel Begunkov Oct. 10, 2024, 3:35 p.m. UTC | #3
On 10/10/24 10:20, Ruyi Zhang wrote:
> ---
> On 25 Sep 2024 12:58 Pavel Begunkov wrote
>> On 9/25/24 09:58, Ruyi Zhang wrote:
>>> io_uring fdinfo contains most of the runtime information,which is
>>> helpful for debugging io_uring applications; However, there is
>>> currently a lack of timeout-related information, and this patch adds
>>> timeout_list information.
> 
>> Please refer to unaddressed comments from v1. We can't have irqs
>> disabled for that long. And it's too verbose (i.e. depends on
>> the number of timeouts).
> 
> Two questions:
> 
> 1. I agree with you, we shouldn't walk a potentially very long list
> under spinlock. but i can't find any other way to get all the timeout

If only it's just under the spin, but with disabled irqs...

> information than to walk the timeout_list. Do you have any good ideas?

In the long run it'd be great to replace the spinlock
with a mutex, i.e. just ->uring_lock, but that might be
a bit involved as we'd need to move handling to the task context.

> 2. I also agree seq_printf heavier, if we use seq_put_decimal_ull and
> seq_puts to concatenate strings, I haven't tested whether it's more
> efficient or not, but the code is certainly not as readable as the
> former. It's also possible that I don't fully understand what you mean
> and want to hear your opinion.

I don't think there is any difference, it'd be a matter of
doubling the number of in flight timeouts to achieve same
timings. Tell me, do you really have a good case where you
need that (pretty verbose)? Why not drgn / bpftrace it out
of the kernel instead?
Ruyi Zhang Oct. 12, 2024, 9:10 a.m. UTC | #4
---
On 2024-10-10 15:35 Pavel Begunkov wrote:
>> Two questions:
>> 
>> 1. I agree with you, we shouldn't walk a potentially very
>> long list under spinlock. but i can't find any other way
>> to get all the timeout

> If only it's just under the spin, but with disabled irqs...

>> information than to walk the timeout_list. Do you have any
>> good ideas?

> In the long run it'd be great to replace the spinlock
> with a mutex, i.e. just ->uring_lock, but that would might be
> a bit involving as need to move handling to the task context.
 
 Yes, it makes more sense to replace spin_lock, but that would
 require other related logic to be modified, and I don't think
 it's wise to do that for the sake of a piece of debugging
 information.

>> 2. I also agree seq_printf heavier, if we use
>> seq_put_decimal_ull and seq_puts to concatenate strings,
>> I haven't tested whether it's more efficient or not, but
>> the code is certainly not as readable as the former. It's
>> also possible that I don't fully understand what you mean
>> and want to hear your opinion.

> I don't think there is any difference, it'd be a matter of
> doubling the number of in flight timeouts to achieve same
> timings. Tell me, do you really have a good case where you
> need that (pretty verbose)? Why not drgn / bpftrace it out
> of the kernel instead?

 Of course, this information is available through existing tools.
 But I think that most of the io_uring metadata is already exported
 through the fdinfo file, and the purpose of adding the timeout
 information is the same as before: ease of use. This way,
 I don't have to write additional scripts to get all kinds of data.

 And as far as I know, the io_uring_show_fdinfo function is
 only called when the user reads the /proc/xxx/fdinfo/x file.
 I don't think we normally need to look at this file often; we only
 look at it when the program misbehaves, and the timeout_list is very
 long only in the extreme case, so I think the performance impact of
 adding this code is limited.

---
Ruyi Zhang
Jens Axboe Oct. 24, 2024, 5:31 p.m. UTC | #5
On Sat, Oct 12, 2024 at 3:30 AM Ruyi Zhang <ruyi.zhang@samsung.com> wrote:
>
> ---
> On 2024-10-10 15:35 Pavel Begunkov wrote:
> >> Two questions:
> >>
> >> 1. I agree with you, we shouldn't walk a potentially very
> >> long list under spinlock. but i can't find any other way
> >> to get all the timeout
>
> > If only it's just under the spin, but with disabled irqs...
>
> >> information than to walk the timeout_list. Do you have any
> >> good ideas?
>
> > In the long run it'd be great to replace the spinlock
> > with a mutex, i.e. just ->uring_lock, but that would might be
> > a bit involving as need to move handling to the task context.
>
>  Yes, it makes more sense to replace spin_lock, but that would
>  require other related logic to be modified, and I don't think
>  it's wise to do that for the sake of a piece of debugging
>  information.
>
> >> 2. I also agree seq_printf heavier, if we use
> >> seq_put_decimal_ull and seq_puts to concatenate strings,
> >> I haven't tested whether it's more efficient or not, but
> >> the code is certainly not as readable as the former. It's
> >> also possible that I don't fully understand what you mean
> >> and want to hear your opinion.
>
> > I don't think there is any difference, it'd be a matter of
> > doubling the number of in flight timeouts to achieve same
> > timings. Tell me, do you really have a good case where you
> > need that (pretty verbose)? Why not drgn / bpftrace it out
> > of the kernel instead?
>
>  Of course, this information is available through existing tools.
>  But I think that most of the io_uring metadata has been exported
>  from the fdinfo file, and the purpose of adding the timeout
>  information is the same as before, easier to use. This way,
>  I don't have to write additional scripts to get all kinds of data.
>
>  And as far as I know, the io_uring_show_fdinfo function is
>  only called once when the user is viewing the
>  /proc/xxx/fdinfo/x file once. I don't think we normally need to
>  look at this file as often, and only look at it when the program
>  is abnormal, and the timeout_list is very long in the extreme case,
>  so I think the performance impact of adding this code is limited.

I do think it's useful, sometimes the only thing you have to poke at
after-the-fact is the fdinfo information. At the same time, would it be
more useful to dump _some_ of the info, even if we can't get all of it?
Would not be too hard to just stop dumping if need_resched() is set, and
even note that - you can always retry, as this info is generally grabbed
from the console anyway, not programmatically. That avoids the worst
possible scenario, which is a malicious setup with a shit ton of pending
timers, while still allowing it to be useful for a normal setup. And
this patch could just do that, rather than attempt to re-architect how
the timers are tracked and which locking it uses.
Pavel Begunkov Oct. 24, 2024, 6:10 p.m. UTC | #6
On 10/24/24 18:31, Jens Axboe wrote:
> On Sat, Oct 12, 2024 at 3:30 AM Ruyi Zhang <ruyi.zhang@samsung.com> wrote:
...
>>> I don't think there is any difference, it'd be a matter of
>>> doubling the number of in flight timeouts to achieve same
>>> timings. Tell me, do you really have a good case where you
>>> need that (pretty verbose)? Why not drgn / bpftrace it out
>>> of the kernel instead?
>>
>>   Of course, this information is available through existing tools.
>>   But I think that most of the io_uring metadata has been exported
>>   from the fdinfo file, and the purpose of adding the timeout
>>   information is the same as before, easier to use. This way,
>>   I don't have to write additional scripts to get all kinds of data.
>>
>>   And as far as I know, the io_uring_show_fdinfo function is
>>   only called once when the user is viewing the
>>   /proc/xxx/fdinfo/x file once. I don't think we normally need to
>>   look at this file as often, and only look at it when the program
>>   is abnormal, and the timeout_list is very long in the extreme case,
>>   so I think the performance impact of adding this code is limited.
> 
> I do think it's useful, sometimes the only thing you have to poke at
> after-the-fact is the fdinfo information. At the same time, would it be

If you have an fd to print fdinfo, you can just as well run drgn
or any other debugging tool. We keep pushing more debugging code
that can be extracted with bpf and other tools, and not only
does it bloat the code, it potentially cripples the entire kernel.

> more useful to dump _some_ of the info, even if we can't get all of it?
> Would not be too hard to just stop dumping if need_resched() is set, and

need_resched() takes eternity in the eyes of hard irqs, that is
surely one way to make the system unusable. Will we even get the
request for rescheduling considering that irqs are off => timers
can't run?

> even note that - you can always retry, as this info is generally grabbed
> from the console anyway, not programmatically. That avoids the worst
> possible scenario, which is a malicious setup with a shit ton of pending
> timers, while still allowing it to be useful for a normal setup. And
> this patch could just do that, rather than attempt to re-architect how
> the timers are tracked and which locking it uses.

Or it can be done with one of the tools that already exist
specifically for that purpose, which don't need any additional
custom handling in the kernel. Users won't need to wait until
the patch lands in their kernel and can run them right away.
Jens Axboe Oct. 24, 2024, 11:25 p.m. UTC | #7
On 10/24/24 12:10 PM, Pavel Begunkov wrote:
> On 10/24/24 18:31, Jens Axboe wrote:
>> On Sat, Oct 12, 2024 at 3:30 AM Ruyi Zhang <ruyi.zhang@samsung.com> wrote:
> ...
>>>> I don't think there is any difference, it'd be a matter of
>>>> doubling the number of in flight timeouts to achieve same
>>>> timings. Tell me, do you really have a good case where you
>>>> need that (pretty verbose)? Why not drgn / bpftrace it out
>>>> of the kernel instead?
>>>
>>>   Of course, this information is available through existing tools.
>>>   But I think that most of the io_uring metadata has been exported
>>>   from the fdinfo file, and the purpose of adding the timeout
>>>   information is the same as before, easier to use. This way,
>>>   I don't have to write additional scripts to get all kinds of data.
>>>
>>>   And as far as I know, the io_uring_show_fdinfo function is
>>>   only called once when the user is viewing the
>>>   /proc/xxx/fdinfo/x file once. I don't think we normally need to
>>>   look at this file as often, and only look at it when the program
>>>   is abnormal, and the timeout_list is very long in the extreme case,
>>>   so I think the performance impact of adding this code is limited.
>>
>> I do think it's useful, sometimes the only thing you have to poke at
>> after-the-fact is the fdinfo information. At the same time, would it be
> 
> If you have an fd to print fdinfo, you can just well run drgn
> or any other debugging tool. We keep pushing more debugging code
> that can be extracted with bpf and other tools, and not only
> it bloats the code, but potentially cripples the entire kernel.

While that is certainly true, it's also a much harder barrier to entry.
If you're already setup with eg drgn, then yeah fdinfo is useless as you
can grab much more info out by just using drgn.

I'm fine punting this to "needs more advanced debugging than fdinfo".
It's just important we get closure on these patches, so they don't
linger forever in no man's land.
Pavel Begunkov Oct. 30, 2024, 1:29 a.m. UTC | #8
On 10/25/24 00:25, Jens Axboe wrote:
> On 10/24/24 12:10 PM, Pavel Begunkov wrote:
>> On 10/24/24 18:31, Jens Axboe wrote:
>>> On Sat, Oct 12, 2024 at 3:30 AM Ruyi Zhang <ruyi.zhang@samsung.com> wrote:
>> ...
>>>>> I don't think there is any difference, it'd be a matter of
>>>>> doubling the number of in flight timeouts to achieve same
>>>>> timings. Tell me, do you really have a good case where you
>>>>> need that (pretty verbose)? Why not drgn / bpftrace it out
>>>>> of the kernel instead?
>>>>
>>>>    Of course, this information is available through existing tools.
>>>>    But I think that most of the io_uring metadata has been exported
>>>>    from the fdinfo file, and the purpose of adding the timeout
>>>>    information is the same as before, easier to use. This way,
>>>>    I don't have to write additional scripts to get all kinds of data.
>>>>
>>>>    And as far as I know, the io_uring_show_fdinfo function is
>>>>    only called once when the user is viewing the
>>>>    /proc/xxx/fdinfo/x file once. I don't think we normally need to
>>>>    look at this file as often, and only look at it when the program
>>>>    is abnormal, and the timeout_list is very long in the extreme case,
>>>>    so I think the performance impact of adding this code is limited.
>>>
>>> I do think it's useful, sometimes the only thing you have to poke at
>>> after-the-fact is the fdinfo information. At the same time, would it be
>>
>> If you have an fd to print fdinfo, you can just well run drgn
>> or any other debugging tool. We keep pushing more debugging code
>> that can be extracted with bpf and other tools, and not only
>> it bloats the code, but potentially cripples the entire kernel.
> 
> While that is certainly true, it's also a much harder barrier to entry.
> If you're already setup with eg drgn, then yeah fdinfo is useless as you
> can grab much more info out by just using drgn.

drgn is simple, not much harder than patching fdinfo. We can add
a script to liburing/scripts so that it doesn't need rewriting
each time.

> I'm fine punting this to "needs more advanced debugging than fdinfo".
> It's just important we get closure on these patches, so they don't
> linger forever in no man's land.

The only option I see is to dump first ~5 and stop there, but
I still think the tooling option is better.
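
For illustration only, the "dump the first ~5 and stop" idea can be
sketched in plain userspace C. This is not kernel code and not part of
the patch: struct timeout_entry, MAX_DUMPED_TIMEOUTS and
dump_timeouts_bounded() are hypothetical names, and the list here is a
simple singly linked stand-in for ctx->timeout_list. The point is only
that the walk is bounded by a fixed cap rather than by the number of
pending timeouts, and that truncation is reported to the reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for one pending timeout.
 * This is NOT the kernel's struct io_timeout. */
struct timeout_entry {
	unsigned int off;
	unsigned int repeats;
	long long sec;
	long nsec;
	struct timeout_entry *next;
};

/* Assumed cap, taken from the "~5" suggested in the thread. */
#define MAX_DUMPED_TIMEOUTS 5

/*
 * Print at most 'limit' entries of the list to 'out'.
 * Returns the number of entries printed and sets *truncated to 1
 * if entries were left unprinted, 0 otherwise.
 */
static size_t dump_timeouts_bounded(FILE *out,
				    const struct timeout_entry *head,
				    size_t limit, int *truncated)
{
	const struct timeout_entry *t;
	size_t printed = 0;

	assert(out && truncated);

	*truncated = 0;
	fputs("TimeoutList:\n", out);
	for (t = head; t; t = t->next) {
		if (printed == limit) {
			/* Work stays bounded by 'limit', whatever the list length. */
			*truncated = 1;
			fputs("  ...(truncated)\n", out);
			break;
		}
		fprintf(out, "  off=%u, repeats=%u, sec=%lld, nsec=%ld\n",
			t->off, t->repeats, t->sec, t->nsec);
		printed++;
	}
	return printed;
}
```

A caller would pass the list head, MAX_DUMPED_TIMEOUTS and a pointer to
an int, then check the flag to see whether the output is partial. In
the kernel the equivalent loop would still run with timeout_lock held
and irqs disabled, which is the cost the cap is meant to bound.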
Jens Axboe Oct. 30, 2024, 1:26 p.m. UTC | #9
On 10/29/24 7:29 PM, Pavel Begunkov wrote:
> On 10/25/24 00:25, Jens Axboe wrote:
>> On 10/24/24 12:10 PM, Pavel Begunkov wrote:
>>> On 10/24/24 18:31, Jens Axboe wrote:
>>>> On Sat, Oct 12, 2024 at 3:30 AM Ruyi Zhang <ruyi.zhang@samsung.com> wrote:
>>> ...
>>>>>> I don't think there is any difference, it'd be a matter of
>>>>>> doubling the number of in flight timeouts to achieve same
>>>>>> timings. Tell me, do you really have a good case where you
>>>>>> need that (pretty verbose)? Why not drgn / bpftrace it out
>>>>>> of the kernel instead?
>>>>>
>>>>>    Of course, this information is available through existing tools.
>>>>>    But I think that most of the io_uring metadata has been exported
>>>>>    from the fdinfo file, and the purpose of adding the timeout
>>>>>    information is the same as before, easier to use. This way,
>>>>>    I don't have to write additional scripts to get all kinds of data.
>>>>>
>>>>>    And as far as I know, the io_uring_show_fdinfo function is
>>>>>    only called once when the user is viewing the
>>>>>    /proc/xxx/fdinfo/x file once. I don't think we normally need to
>>>>>    look at this file as often, and only look at it when the program
>>>>>    is abnormal, and the timeout_list is very long in the extreme case,
>>>>>    so I think the performance impact of adding this code is limited.
>>>>
>>>> I do think it's useful, sometimes the only thing you have to poke at
>>>> after-the-fact is the fdinfo information. At the same time, would it be
>>>
>>> If you have an fd to print fdinfo, you can just well run drgn
>>> or any other debugging tool. We keep pushing more debugging code
>>> that can be extracted with bpf and other tools, and not only
>>> it bloats the code, but potentially cripples the entire kernel.
>>
>> While that is certainly true, it's also a much harder barrier to entry.
>> If you're already setup with eg drgn, then yeah fdinfo is useless as you
>> can grab much more info out by just using drgn.
> 
> drgn is simple, not that harder than patching fdinfo, we can add
> liburing/scripts, and push it there so that don't need rewriting
> it each time.

It's not that drgn is hard to use, it's not, but that people aren't
necessarily aware of it. Once you've used it, yeah it's trivial. But for
the cases where you are stuck in prod and you haven't used anything like
that, it's a bit of a stretch to get there. Once it's part of your usual
arsenal of tools, not an issue at all.

Adding something to liburing/scripts/ would indeed be awesome.

>> I'm fine punting this to "needs more advanced debugging than fdinfo".
>> It's just important we get closure on these patches, so they don't
>> linger forever in no man's land.
> 
> The only option I see is to dump first ~5 and stop there, but
> I still think the tooling option is better.

Let's just not do it at all, I think a partial dump is likely to be
potentially useless. And you can't cat it again and expect something
different if things are stuck.

Patch

diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index d43e1b5fcb36..f524c3cd6f57 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -14,6 +14,7 @@ 
 #include "fdinfo.h"
 #include "cancel.h"
 #include "rsrc.h"
+#include "timeout.h"
 
 #ifdef CONFIG_PROC_FS
 static __cold int io_uring_show_cred(struct seq_file *m, unsigned int id,
@@ -55,6 +56,7 @@  __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
 	struct io_ring_ctx *ctx = file->private_data;
 	struct io_overflow_cqe *ocqe;
 	struct io_rings *r = ctx->rings;
+	struct io_timeout *timeout;
 	struct rusage sq_usage;
 	unsigned int sq_mask = ctx->sq_entries - 1, cq_mask = ctx->cq_entries - 1;
 	unsigned int sq_head = READ_ONCE(r->sq.head);
@@ -235,5 +237,17 @@  __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
 		seq_puts(m, "NAPI:\tdisabled\n");
 	}
 #endif
+
+	seq_puts(m, "TimeoutList:\n");
+	spin_lock_irq(&ctx->timeout_lock);
+	list_for_each_entry(timeout, &ctx->timeout_list, list) {
+		struct io_timeout_data *data;
+
+		data = cmd_to_io_kiocb(timeout)->async_data;
+		seq_printf(m, "  off=%u, repeats=%u, sec=%lld, nsec=%ld\n",
+			   timeout->off, timeout->repeats, data->ts.tv_sec,
+			   data->ts.tv_nsec);
+	}
+	spin_unlock_irq(&ctx->timeout_lock);
 }
 #endif
diff --git a/io_uring/timeout.c b/io_uring/timeout.c
index 9973876d91b0..4449e139e371 100644
--- a/io_uring/timeout.c
+++ b/io_uring/timeout.c
@@ -13,18 +13,6 @@ 
 #include "cancel.h"
 #include "timeout.h"
 
-struct io_timeout {
-	struct file			*file;
-	u32				off;
-	u32				target_seq;
-	u32				repeats;
-	struct list_head		list;
-	/* head of the link, used by linked timeouts only */
-	struct io_kiocb			*head;
-	/* for linked completions */
-	struct io_kiocb			*prev;
-};
-
 struct io_timeout_rem {
 	struct file			*file;
 	u64				addr;
diff --git a/io_uring/timeout.h b/io_uring/timeout.h
index a6939f18313e..befd489a6286 100644
--- a/io_uring/timeout.h
+++ b/io_uring/timeout.h
@@ -1,5 +1,17 @@ 
 // SPDX-License-Identifier: GPL-2.0
 
+struct io_timeout {
+	struct file			*file;
+	u32				off;
+	u32				target_seq;
+	u32				repeats;
+	struct list_head		list;
+	/* head of the link, used by linked timeouts only */
+	struct io_kiocb			*head;
+	/* for linked completions */
+	struct io_kiocb			*prev;
+};
+
 struct io_timeout_data {
 	struct io_kiocb			*req;
 	struct hrtimer			timer;