
[0/6] tracing/ring-buffer: Fix wakeup of ring buffer waiters

Message ID 20240308183816.676883229@goodmis.org (mailing list archive)

Message

Steven Rostedt March 8, 2024, 6:38 p.m. UTC
A patch was sent to "fix" the wait_index variable that is used to help with
waking of waiters on the ring buffer. The patch was rejected, but I started
looking at associated code. Discussing it on IRC with Mathieu Desnoyers
we discovered a design flaw.

The waiter reads "wait_index" then enters a "wait loop". After adding
itself to the wait queue, if the buffer is filled or a signal is pending
it will exit out. If wait_index is different, it will also exit out.
(Note, I noticed that the check for wait_index was after the schedule()
call when it should have been before it, but that's beside the point.)

The race is this: the waker updates wait_index, then a new waiter
comes in and reads the updated wait_index before it adds itself to
the queue. That waiter never sees wait_index change again, so it
misses the wakeup!

This is a series of changes to fix the design, and other bugs found
along the way.

1) Remove the wait loop in the ring_buffer_wait() code. It doesn't
   have enough context to know if it should continue to wait.

2) Fix "shortest_full" accounting. When zero, it gets set to the
   smallest percentage of the buffer that a waiter has requested be
   filled before writers need to start waking up waiters. But after
   the waiters are all woken up, it never gets reset, so it stays
   pinned at that smallest value until the buffer itself is reset.

3) The wake up of readers on close was incorrectly added to the
   .release() callback and not the .flush(). That is, waiters
   in the .read() call will never be woken up, because .release()
   isn't called until all .read()s have finished. Move the wakeup
   to .flush(), which is called immediately by close().

4) Add a "waking" field to the trace iterator that gets set when a
   waker wants to wake up the readers. For the .flush() callback,
   it is set and never cleared to make sure new readers do not block.

5/6) Break up ring_buffer_wait() into three functions:

  ring_buffer_prepare_to_wait()
  ring_buffer_wait()
  ring_buffer_finish_wait()

    This allows the caller to add itself to the wait queue, check
    if its own condition has been set (in this case: iter->waking),
    and then sleep. This follows the same semantics as any other
    wait loop.
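A hedged userspace sketch of the ordering that the split enforces (the structures here are invented stand-ins, not the actual kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-ins for the real kernel structures. */
struct waiter { bool queued; };
struct iter   { bool waking; };	/* set by a third-party waker */

/* prepare: add ourselves to the wait queue *before* any condition
 * check, so a concurrent wakeup cannot slip in between. */
static void prepare_to_wait(struct waiter *w) { w->queued = true; }
static void finish_wait(struct waiter *w)     { w->queued = false; }

/* After prepare, the caller checks its own condition; only if
 * nothing is pending does it actually go to sleep. */
static bool should_sleep(const struct waiter *w, const struct iter *it)
{
	return w->queued && !it->waking;
}
```

The point of the split is exactly this ordering: queue first, check the caller's own condition (here iter->waking), then sleep; a wakeup that races with the check is not lost because the waiter is already on the queue.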

Steven Rostedt (Google) (6):
      ring-buffer: Fix waking up ring buffer readers
      ring-buffer: Fix resetting of shortest_full
      tracing: Use .flush() call to wake up readers
      tracing: Fix waking up tracing readers
      ring-buffer: Restructure ring_buffer_wait() to prepare for updates
      tracing/ring-buffer: Fix wait_on_pipe() race

----
 include/linux/ring_buffer.h  |   4 +
 include/linux/trace_events.h |   3 +-
 kernel/trace/ring_buffer.c   | 251 +++++++++++++++++++++++++++----------------
 kernel/trace/trace.c         | 128 ++++++++++++++++++----
 4 files changed, 269 insertions(+), 117 deletions(-)

Comments

Linus Torvalds March 8, 2024, 8:39 p.m. UTC | #1
On Fri, 8 Mar 2024 at 10:38, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> A patch was sent to "fix" the wait_index variable that is used to help with
> waking of waiters on the ring buffer. The patch was rejected, but I started
> looking at associated code. Discussing it on IRC with Mathieu Desnoyers
> we discovered a design flaw.

Honestly, all of this seems excessively complicated.

And your new locking shouldn't be necessary if you just do things much
more simply.

Here's what I *think* you should do:

  struct xyz {
        ...
        atomic_t seq;
        struct wait_queue_head seq_wait;
        ...
  };

with the consumer doing something very simple like this:

        int seq = atomic_read_acquire(&my->seq);
        for (;;) {
                .. consume outstanding events ..
                seq = wait_for_seq_change(seq, my);
        }

and the producer being similarly trivial, just having a
"add_seq_event()" at the end:

        ... add whatever event ..
        add_seq_event(my);

And the helper functions for this are really darn simple:

  static inline int wait_for_seq_change(int old, struct xyz *my)
  {
        int new;
        wait_event(my->seq_wait,
                (new = atomic_read_acquire(&my->seq)) != old);
        return new;
  }

  static inline void add_seq_event(struct xyz *my)
  {
        atomic_fetch_inc_release(&my->seq);
        wake_up(&my->seq_wait);
  }

Note how you don't need any new locks, and note how "wait_event()"
will do all the required optimistic stuff for you (ie it will check
that "has seq changed" before even bothering to add itself to the wait
queue etc).

So the above is not only short and sweet, it generates fairly good
code too, and doesn't it look really simple and fairly understandable?

And - AS ALWAYS - the above isn't actually tested in any way, shape or form.

                 Linus
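Linus's helpers above map directly onto a userspace pthread sketch (a hedged analogue with a condvar standing in for wait_event(), not kernel code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Userspace analogue of the proposed pattern: the predicate is
 * re-checked under the mutex, so a wakeup cannot be lost between
 * reading seq and going to sleep. */
struct xyz {
	atomic_int seq;
	pthread_mutex_t lock;
	pthread_cond_t seq_wait;
};

static int wait_for_seq_change(int old, struct xyz *my)
{
	int new;

	pthread_mutex_lock(&my->lock);
	while ((new = atomic_load(&my->seq)) == old)
		pthread_cond_wait(&my->seq_wait, &my->lock);
	pthread_mutex_unlock(&my->lock);
	return new;
}

static void add_seq_event(struct xyz *my)
{
	pthread_mutex_lock(&my->lock);
	atomic_fetch_add(&my->seq, 1);
	pthread_cond_broadcast(&my->seq_wait);
	pthread_mutex_unlock(&my->lock);
}
```

Like wait_event(), the condvar loop checks "has seq changed" before sleeping, so a consumer that samples seq after the producer's increment returns immediately instead of blocking.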
Steven Rostedt March 8, 2024, 9:35 p.m. UTC | #2
On Fri, 8 Mar 2024 12:39:10 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 8 Mar 2024 at 10:38, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > A patch was sent to "fix" the wait_index variable that is used to help with
> > waking of waiters on the ring buffer. The patch was rejected, but I started
> > looking at associated code. Discussing it on IRC with Mathieu Desnoyers
> > we discovered a design flaw.  
> 
> Honestly, all of this seems excessively complicated.
> 
> And your new locking shouldn't be necessary if you just do things much
> more simply.

You mean to replace the wait_woken_*() code (that has the new locking)?

> 
> Here's what I *think* you should do:
> 
>   struct xyz {
>         ...
>         atomic_t seq;
>         struct wait_queue_head seq_wait;
>         ...
>   };
> 
> with the consumer doing something very simple like this:
> 
>         int seq = atomic_read_acquire(&my->seq);
>         for (;;) {
>                 .. consume outstanding events ..
>                 seq = wait_for_seq_change(seq, my);
>         }
> 
> and the producer being similarly trivial, just having a
> "add_seq_event()" at the end:
> 
>         ... add whatever event ..
>         add_seq_event(my);
> 
> And the helper functions for this are really darn simple:
> 
>   static inline int wait_for_seq_change(int old, struct xyz *my)
>   {
>         int new;
>         wait_event(my->seq_wait,
>                 (new = atomic_read_acquire(&my->seq)) != old);

But the index isn't the only condition it needs to wake up on. If the
file is closing, it wants to know that too. Or it may just be getting
kicked out to consume whatever is there and ignore the watermark.

>         return new;
>   }
> 
>   static inline void add_seq_event(struct xyz *my)
>   {
>         atomic_fetch_inc_release(&my->seq);
>         wake_up(&my->seq_wait);
>   }

But it's not only the producer that does the wakeup. That part wasn't
broken.

The broken part is a third party that comes along and wants to wake up the
consumer and tell them to just consume what's there and exit.

There's two layers:

1) the ring buffer has the above simple producer / consumer.
   Where the wake ups can happen at the point of where the buffer has
   the amount filled that the consumer wants to start consuming with.

2) The tracing layer; Here on close of a file, the consumers need to be
   woken up and not wait again. And just take whatever was there to finish
   reading.

   There's also another case that the ioctl() just kicks the current
   readers out, but doesn't care about new readers.

I'm not sure how the seq can handle both there being enough data to wake up
the consumer and the case that another task just wants the consumer to wake
up and ignore the watermark.

The wait_woken_*() code was only for the second part (to wake up consumers
and tell them to no longer wait for the producer), and had nothing to do
with the producer/consumer part.

-- Steve
Linus Torvalds March 8, 2024, 9:39 p.m. UTC | #3
On Fri, 8 Mar 2024 at 13:33, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> There's two layers:
>
> 1) the ring buffer has the above simple producer / consumer.
>    Where the wake ups can happen at the point of where the buffer has
>    the amount filled that the consumer wants to start consuming with.
>
> 2) The tracing layer; Here on close of a file, the consumers need to be
>    woken up and not wait again. And just take whatever was there to finish
>    reading.
>
>    There's also another case that the ioctl() just kicks the current
>    readers out, but doesn't care about new readers.

But that's the beauty of just using the wait_event() model.

Just add that "exit" condition to the condition.

So the above "complexity" is *literally* just changing the

                  (new = atomic_read_acquire(&my->seq)) != old

condition to

                  should_exit ||
                  (new = atomic_read_acquire(&my->seq)) != old

(replace "should_exit" with whatever that condition is, of course) and
the wait_event() logic will take care of the rest.

             Linus
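A hedged sketch of folding that third-party "kick the readers out" request into the same wait predicate, as suggested above (the names are invented for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int seq;
static atomic_bool should_exit;

/* This is the expression wait_event() would re-evaluate on every
 * wakeup: leave the wait on an exit request OR on a seq change. */
static bool wake_condition(int old, int *new_seq)
{
	return atomic_load(&should_exit) ||
	       (*new_seq = atomic_load(&seq)) != old;
}
```

The "consume events" loop would then check should_exit as well, so a woken consumer drains what is there and returns instead of waiting again.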
Linus Torvalds March 8, 2024, 9:41 p.m. UTC | #4
On Fri, 8 Mar 2024 at 13:39, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the above "complexity" is *literally* just changing the
>
>                   (new = atomic_read_acquire(&my->seq)) != old
>
> condition to
>
>                   should_exit ||
>                   (new = atomic_read_acquire(&my->seq)) != old

.. and obviously you'll need to add the exit condition to the actual
"deal with events" loop too.

                Linus
Steven Rostedt March 10, 2024, 4:19 p.m. UTC | #5
On Fri, 8 Mar 2024 13:41:59 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 8 Mar 2024 at 13:39, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So the above "complexity" is *literally* just changing the
> >
> >                   (new = atomic_read_acquire(&my->seq)) != old
> >
> > condition to
> >
> >                   should_exit ||
> >                   (new = atomic_read_acquire(&my->seq)) != old  
> 
> .. and obviously you'll need to add the exit condition to the actual
> "deal with events" loop too.

I haven't had a chance to rework this part of the patches, but I have
some other fixes to push to you from earlier this week, and I think the
first three patches of this series are also fine: the loop in
ring_buffer_wait() isn't needed, and patches 2 and 3 fix trivial bugs.

I'll send you a pull request for that work and I'll work on this code
later.

-- Steve