Message ID: 20241229135737.GA3293@redhat.com (mailing list archive)
State: New
Series: PATCH? avoid the unnecessary wakeups in pipe_read()
On Sun, 29 Dec 2024 at 05:58, Oleg Nesterov <oleg@redhat.com> wrote:
>
> If I read this code correctly, in this case the child will wakeup the parent
> 4095 times for no reason, pipe_writable() == !pipe_full() will still be true
> until the last read(fd[0], &c, 1) does

Ack, that patch looks sane to me.

Only wake the writer if we actually released a pipe slot, and it was full before we did so. Makes sense.

           Linus
I was going to send a one-liner patch which adds mb() into pipe_poll(), but then I decided to make even more spam and ask some questions first.

	static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
	{
		smp_mb();
		if (waitqueue_active(&pipe->rd_wait))
			wake_up_interruptible(&pipe->rd_wait);
		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
	}

I think that wq_has_sleeper() + wake_up_interruptible_poll(POLLIN) would make more sense, but this is minor.

Either way the waitqueue_active() check is only correct if the waiter has a barrier between __add_wait_queue() and "check the condition". wait_event() is fine, but pipe_poll() does:

	// poll_wait()
	__pollwait() -> add_wait_queue(pipe->rd_wait) -> list_add()

	READ_ONCE(pipe->head);
	READ_ONCE(pipe->tail);

In theory these LOADs can leak into the critical section in add_wait_queue() and happen before list_add(entry, rd_wait.head). So I think we need the trivial

	--- a/fs/pipe.c
	+++ b/fs/pipe.c
	@@ -680,6 +680,7 @@ pipe_poll(struct file *filp, poll_table *wait)
		 * if something changes and you got it wrong, the poll
		 * table entry will wake you up and fix it.
		 */
	+	smp_mb();
		head = READ_ONCE(pipe->head);
		tail = READ_ONCE(pipe->tail);

and after that pipe_read/pipe_write can use the wq_has_sleeper() check too (this is what the patch from WangYuli did).

-------------------------------------------------------------------------------
But perhaps this mb() should go into __pollwait()? We can have more waitqueue_active() users which do not take .poll() into account... There are more init_poll_funcptr()'s, but at least epoll looks fine: epi_fget() in ep_item_poll() provides a full barrier before vfs_poll().

-------------------------------------------------------------------------------
Or really add mb() into __add_wait_queue/__add_wait_queue_entry_tail as Manfred suggests? Somehow I am not sure about this change.

Oleg.
Hi Oleg,

On 1/2/25 5:33 PM, Oleg Nesterov wrote:
> I was going to send a one-liner patch which adds mb() into pipe_poll()
> but then I decided to make even more spam and ask some questions first.
[...]
> and after that pipe_read/pipe_write can use the wq_has_sleeper() check too
> (this is what the patch from WangYuli did).

Would it be possible to create a perf probe to get some statistics?

I see at least 4 options:
- do nothing
- add the smp_mb() into pipe_poll(), and convert pipe to wq_has_sleeper()
- add the smp_mb() into poll_wait(), convert pipe and potentially further poll users to wq_has_sleeper()
- add the smp_mb() into __add_wait_queue(), and merge wq_has_sleeper() into wake_up(). The tricky part is probably to differentiate wake_up on empty wait queues vs. wake_up on wait queues with entries.

--
    Manfred
On Thu, 2 Jan 2025 at 08:33, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I was going to send a one-liner patch which adds mb() into pipe_poll()
> but then I decided to make even more spam and ask some questions first.

poll functions are not *supposed* to need memory barriers. They are supposed to do "poll_wait()" and then not need any more serialization after that, because we either

 (a) have a NULL wait-address, in which case we're not going to sleep and this is just a "check state"

 (b) the waiting function is supposed to do add_wait_queue() (usually by way of __pollwait) and that should be a sufficient barrier to anybody who does a wakeup

Note that add_wait_queue() ends up doing a spinlock sequence, and while that is not a full memory barrier (well, it is on x86, but not necessarily in general), it *should* be sufficient against an actual waker. That's kind of how add_wait_queue() vs wake_up() is supposed to work.

Of course, what we're now discussing is the pipe code *not* doing a full wake sequence (but just a "is the wait queue empty" check), and that is what then messes with the generic rules.

And this makes me think that the whole comment above waitqueue_active() is just fundamentally wrong. The smp_mb() is *not* sufficient in the sequence

	smp_mb();
	if (waitqueue_active(wq_head))
		wake_up(wq_head);

because while it happens to work wrt prepare_to_wait() sequences, it is *not* sufficient against other users of add_wait_queue(). In those other sequences the smp_mb() in set_current_state() might have happened long long before.

Those other users aren't just 'poll()', btw. There are quite a lot of add_wait_queue() users in the kernel. It's a traditional model even if it's not something people generally add to any more.

Now, hopefully many of those add_wait_queue() users end up using set_current_state() and getting the memory barrier that way. Or they use wait_woken() or any of the other proper helpers we have.
But I think this poll() thing is very much an example of this *not* being valid, and I don't think it's in any way pipe-specific.

So maybe we really do need to add the memory barrier to __add_wait_queue(). That's going to be painful, particularly with lots of users not needing it because they have the barrier in all the other places.

End result: maybe adding it just to __pollwait() is the thing to do, in the hopes that non-poll users all use the proper sequences.

But no, this is most definitely not a pipe-only thing.

            Linus
On 01/04, Linus Torvalds wrote:
>
> On Thu, 2 Jan 2025 at 08:33, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I was going to send a one-liner patch which adds mb() into pipe_poll()
> > but then I decided to make even more spam and ask some questions first.
>
> poll functions are not *supposed* to need memory barriers.
...
> But no, this is most definitely not a pipe-only thing.

Agreed, that is why I didn't send the patch which adds mb() into pipe_poll().

> They are supposed to do "poll_wait()" and then not need any more
> serialization after that, because we either
>
>  (a) have a NULL wait-address, in which case we're not going to sleep
>      and this is just a "check state"

To be honest, I don't understand the wait_address check in poll_wait(), it seems that wait_address is never NULL. But ->_qproc can be NULL if do_poll/etc does another iteration after poll_schedule_timeout(), in this case we can sleep again. But this case is fine.

>  (b) the waiting function is supposed to do add_wait_queue() (usually
>      by way of __pollwait) and that should be a sufficient barrier to
>      anybody who does a wakeup

Yes.

> And this makes me think that the whole comment above
> waitqueue_active() is just fundamentally wrong. The smp_mb() is *not*
> sufficient in the sequence
>
>	smp_mb();
>	if (waitqueue_active(wq_head))
>		wake_up(wq_head);
>
> because while it happens to work wrt prepare_to_wait() sequences, it
> is *not* sufficient against other users of add_wait_queue().

Well, this comment doesn't look wrong to me, but perhaps it can be more clear. It should probably explain that this pseudo-code is only correct because the waiter has a barrier before "if (@cond)" which pairs with the smp_mb() above waitqueue_active(). It even says

	prepare_to_wait(&wq_head, &wait, state);  // smp_mb() from set_current_state()

but perhaps this is not 100% clear.

> But I think this poll() thing is very much an example of this *not*
> being valid, and I don't think it's in any way pipe-specific.

Yes.
> So maybe we really do need to add the memory barrier to
> __add_wait_queue(). That's going to be painful, particularly with lots
> of users not needing it because they have the barrier in all the other
> places.

Yes, that is why I don't really like the idea of adding mb() into __add_wait_queue().

> End result: maybe adding it just to __pollwait() is the thing to do,
> in the hopes that non-poll users all use the proper sequences.

That is what I tried to propose. Will you agree with this change? We can even use smp_store_mb(), say

	@@ -224,11 +224,12 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
	 	if (!entry)
	 		return;
	 	entry->filp = get_file(filp);
	-	entry->wait_address = wait_address;
	 	entry->key = p->_key;
	 	init_waitqueue_func_entry(&entry->wait, pollwake);
	 	entry->wait.private = pwq;
	 	add_wait_queue(wait_address, &entry->wait);
	+	// COMMENT
	+	smp_store_mb(entry->wait_address, wait_address);

Oleg.
On 01/06, Oleg Nesterov wrote:
>
> To be honest, I don't understand the wait_address check in poll_wait(),
> it seems that wait_address is never NULL.

Yes,

	$ git grep -E '\bpoll_wait\(' | perl -wne '/\(.*?,\s*\&/ or print'

doesn't find anything "interesting". I think the wait_address check should die, it only adds unnecessary confusion.

Oleg.
On Mon, 6 Jan 2025 at 08:31, Oleg Nesterov <oleg@redhat.com> wrote:
>
> To be honest, I don't understand the wait_address check in poll_wait(),
> it seems that wait_address is never NULL.

Oh, it seems to be historical.

That *used* to be how select worked - once select() or poll() had seen that somebody returned "I have data", they set wait_address to NULL because there's no point in adding any wait-queues any more at that point.

But these days they do that

	wait->_qproc = NULL;

thing instead. It seems to go back to 626cf2366085 ("poll: add poll_requested_events() and poll_does_not_wait() functions").

So yeah, I guess these days the wait_table pointer is never NULL (where "these days" is the last decade+).

> That is what I tried to propose. Will you agree with this change?
> We can even use smp_store_mb(), say

I think it's clearer to just use smp_mb().

The whole smp_store_mb() thing is a pretty random thing, I think we could / should probably just remove it. It's basically a combination of "atomic store" and "smp_mb()", and at one point we thought that doing it with an "xchg" instruction would be better on x86.

And I don't think the one- or two-byte shorter instruction sequence is worth it, definitely not for something like updating entry->wait_address where there's no actual point to making the store itself atomic.

           Linus
On 01/06, Linus Torvalds wrote:
>
> On Mon, 6 Jan 2025 at 08:31, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > To be honest, I don't understand the wait_address check in poll_wait(),
> > it seems that wait_address is never NULL.
>
> Oh, it seems to be historical.

Yeah, please see my previous email. I think it must die, but that needs another one-liner.

> > That is what I tried to propose. Will you agree with this change?
> > We can even use smp_store_mb(), say
>
> I think it's clearer to just use smp_mb().

OK, agreed, I'll send the trivial patch once I write the changelog.

Oleg.
Damn ;) It is amazing how much unnecessary spam I added to these discussions. But let me ask 2 more questions, hopefully no more.

1. pipe_read() says

	 * But when we do wake up writers, we do so using a sync wakeup
	 * (WF_SYNC), because we want them to get going and generate more
	 * data for us.

OK, WF_SYNC makes sense if pipe_read() or pipe_write() is going to do wait_event() after wake_up(). But wake_up_interruptible_sync_poll() looks a bit misleading if we are going to wake up the writer or next_reader before return.

2. I can't understand this code in pipe_write()

	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
		int err = file_update_time(filp);
		if (err)
			ret = err;
		sb_end_write(file_inode(filp)->i_sb);
	}

- it only makes sense in the "fifo" case, right? When i_sb->s_magic != PIPEFS_MAGIC...

- why should we propagate the error code if "ret > 0" but file_update_time() fails?

Oleg.
On Mon, 6 Jan 2025 at 11:34, Oleg Nesterov <oleg@redhat.com> wrote:
>
> 1. pipe_read() says
>
>	 * But when we do wake up writers, we do so using a sync wakeup
>	 * (WF_SYNC), because we want them to get going and generate more
>	 * data for us.
>
> OK, WF_SYNC makes sense if pipe_read() or pipe_write() is going to do wait_event()
> after wake_up(). But wake_up_interruptible_sync_poll() looks a bit misleading if
> we are going to wakeup the writer or next_reader before return.

This heuristic has always been a bit iffy. And honestly, I think it's been driven by benchmarks that aren't necessarily always realistic (ie for ping-pong benchmarks, the best behavior is often to stay on the same CPU and just schedule between the reader/writer).

Those are exactly the kinds that will *not* do wait_event(), because they just read or wrote a single byte, but doing a "sync" wakeup then gets you the optimal pattern of "run the other end on the same CPU".

And do those cases exist outside of benchmarks? Probably not.

Ergo: I don't think there's a "right solution".

> 2. I can't understand this code in pipe_write()
>
>	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
>		int err = file_update_time(filp);
>		if (err)
>			ret = err;
>		sb_end_write(file_inode(filp)->i_sb);
>	}
>
> - it only makes sense in the "fifo" case, right? When
>   i_sb->s_magic != PIPEFS_MAGIC...

I think we've done it for regular pipes too. You can see it with 'fstat()', after all. And we've done it forever. It goes way back to even before the BK days, although back then it was done differently with an open-coded

	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
	mark_inode_dirty(inode);

instead.

I actually went back and looked at some historical trees. The inode time updates were originally added in linux-2.0.1 (July 1996) and didn't actually have the "mark_inode_dirty()" part, so they didn't cause the inode to be written back to disk, but people would see the time updates while they were happening.
The "write it back to disk" part started happening in 2.1.37pre1 (so May 1997), although back then it was just a simple

	inode->i_dirt = 1;

instead.

My point being that we've been doing it for a *loong* time. And yes, we've always done it for regular pipes too, not just on-disk FIFOs.

> - why should we propagate the error code if "ret > 0" but
>   file_update_time() fails?

Normally file_update_time() wouldn't fail. This was changed in c3b2da314834 ("fs: introduce inode operation ->update_time") and honestly, I think it was just a global search-and-replace kind of change.

And I think the answer is: nobody cares, because it's FIFOs that nobody uses in the first place, with an operation that never fails anyway. Could we remove the error propagation? Almost certainly. Does it matter? Almost certainly not.

Notice how the read side also does the time update, except it's obviously just atime, and it's done with

	if (ret > 0)
		file_accessed(filp);

instead (which honors O_NOATIME and then does touch_atime, and which does the inode_update_time() *without* any error checking).

IOW, that is different, but it all boils down to the same thing: nobody really cares in practice.

So for both of your issues, I think you can make changes as long as you have good reasoning for them and document them in the commit (both code and commit message).

             Linus
On 01/06, Linus Torvalds wrote:
>
> On Mon, 6 Jan 2025 at 11:34, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > 1. pipe_read() says
> >
> >	 * But when we do wake up writers, we do so using a sync wakeup
> >	 * (WF_SYNC), because we want them to get going and generate more
> >	 * data for us.
> >
> > OK, WF_SYNC makes sense if pipe_read() or pipe_write() is going to do wait_event()
> > after wake_up(). But wake_up_interruptible_sync_poll() looks a bit misleading if
> > we are going to wakeup the writer or next_reader before return.
>
> This heuristic has always been a bit iffy. And honestly, I think it's
> been driven by benchmarks that aren't necessarily always realistic (ie
> for ping-pong benchmarks, the best behavior is often to stay on the
> same CPU and just schedule between the reader/writer).

Agreed. But my question was not about performance, I just tried to understand this logic.

So in the case of

	wake_up_interruptible_sync_poll(wr_wait);
	wait_event_interruptible_exclusive(wr_read);

WF_SYNC is understandable, "stay on the same CPU" looks like the right thing, and "_sync_" matches the comment above. But if we are going to return, wake_up_interruptible_sync_poll() looks a bit misleading to me.

> > 2. I can't understand this code in pipe_write()
> >
> >	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
> >		int err = file_update_time(filp);
> >		if (err)
> >			ret = err;
> >		sb_end_write(file_inode(filp)->i_sb);
> >	}
> >
> > - it only makes sense in the "fifo" case, right? When
> >   i_sb->s_magic != PIPEFS_MAGIC...
>
> I think we've done it for regular pipes too. You can see it with
> 'fstat()', after all.

Ah, indeed, thanks for correcting me... And thanks for your other explanations. Again, it is not that I thought this needs changes, just I was a bit confused. In particular by

	err = file_update_time();
	if (err)
		ret = err;

which doesn't match the usage of file_accessed() in pipe_read().

Oleg.
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..27ffb650f131 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -253,7 +253,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	size_t total_len = iov_iter_count(to);
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
-	bool was_full, wake_next_reader = false;
+	bool wake_writer = false, wake_next_reader = false;
 	ssize_t ret;
 
 	/* Null read succeeds. */
@@ -271,7 +271,6 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	 * (WF_SYNC), because we want them to get going and generate more
 	 * data for us.
 	 */
-	was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
 	for (;;) {
 		/* Read ->head with a barrier vs post_one_notification() */
 		unsigned int head = smp_load_acquire(&pipe->head);
@@ -340,8 +339,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 				buf->len = 0;
 			}
 
-			if (!buf->len)
+			if (!buf->len) {
+				wake_writer |= pipe_full(head, tail, pipe->max_usage);
 				tail = pipe_update_tail(pipe, buf, tail);
+			}
 			total_len -= chars;
 			if (!total_len)
 				break;	/* common path: read succeeded */
@@ -377,7 +378,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		 * _very_ unlikely case that the pipe was full, but we got
 		 * no data.
 		 */
-		if (unlikely(was_full))
+		if (unlikely(wake_writer))
 			wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 
@@ -391,14 +392,14 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			return -ERESTARTSYS;
 
 		mutex_lock(&pipe->mutex);
-		was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
 		wake_next_reader = true;
+		wake_writer = false;
 	}
 	if (pipe_empty(pipe->head, pipe->tail))
 		wake_next_reader = false;
 	mutex_unlock(&pipe->mutex);
 
-	if (was_full)
+	if (wake_writer)
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	if (wake_next_reader)
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);