Message ID: 20241229135737.GA3293@redhat.com (mailing list archive)
State: New
Series: PATCH? avoid the unnecessary wakeups in pipe_read()
On Sun, 29 Dec 2024 at 05:58, Oleg Nesterov <oleg@redhat.com> wrote:
>
> If I read this code correctly, in this case the child will wakeup the parent
> 4095 times for no reason, pipe_writable() == !pipe_full() will still be true
> until the last read(fd[0], &c, 1) does

Ack, that patch looks sane to me.

Only wake the writer if we actually released a pipe slot, and it was full before we did so. Makes sense.

           Linus
I was going to send a one-liner patch which adds mb() into pipe_poll(), but then I decided to make even more spam and ask some questions first.

	static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
	{
		smp_mb();
		if (waitqueue_active(&pipe->rd_wait))
			wake_up_interruptible(&pipe->rd_wait);
		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
	}

I think that wq_has_sleeper() + wake_up_interruptible_poll(POLLIN) would make more sense, but this is minor.

Either way the waitqueue_active() check is only correct if the waiter has a barrier between __add_wait_queue() and "check the condition". wait_event() is fine, but pipe_poll() does:

	// poll_wait()
	__pollwait() -> add_wait_queue(pipe->rd_wait) -> list_add()

	READ_ONCE(pipe->head);
	READ_ONCE(pipe->tail);

In theory these LOADs can leak into the critical section in add_wait_queue() and happen before list_add(entry, rd_wait.head). So I think we need the trivial

	--- a/fs/pipe.c
	+++ b/fs/pipe.c
	@@ -680,6 +680,7 @@ pipe_poll(struct file *filp, poll_table *wait)
		 * if something changes and you got it wrong, the poll
		 * table entry will wake you up and fix it.
		 */
	+	smp_mb();
		head = READ_ONCE(pipe->head);
		tail = READ_ONCE(pipe->tail);

and after that pipe_read/pipe_write can use the wq_has_sleeper() check too (this is what the patch from WangYuli did).

-------------------------------------------------------------------------------
But perhaps this mb() should go into __pollwait()? We can have more waitqueue_active() users which do not take .poll() into account... There are more init_poll_funcptr()'s, but at least epoll looks fine: epi_fget() in ep_item_poll() provides a full barrier before vfs_poll().

-------------------------------------------------------------------------------
Or really add mb() into __add_wait_queue/__add_wait_queue_entry_tail as Manfred suggests? Somehow I am not sure about this change.

Oleg.
Hi Oleg,

On 1/2/25 5:33 PM, Oleg Nesterov wrote:
> I was going to send a one-liner patch which adds mb() into pipe_poll()
> but then I decided to make even more spam and ask some questions first.
[...]
> and after that pipe_read/pipe_write can use the wq_has_sleeper() check too
> (this is what the patch from WangYuli did).

Would it be possible to create a perf probe to get some statistics?

I see at least 4 options:
- do nothing
- add the smp_mb() into pipe_poll(), and convert pipe to wq_has_sleeper()
- add the smp_mb() into poll_wait(), convert pipe and potentially further poll users to wq_has_sleeper()
- add the smp_mb() into __add_wait_queue(), and merge wq_has_sleeper() into wake_up(). The tricky part is probably to differentiate wake_up on empty wait queues vs. wake_up on wait queues with entries.

--
    Manfred
On Thu, 2 Jan 2025 at 08:33, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I was going to send a one-liner patch which adds mb() into pipe_poll()
> but then I decided to make even more spam and ask some questions first.

poll functions are not *supposed* to need memory barriers. They are supposed to do "poll_wait()" and then not need any more serialization after that, because we either

 (a) have a NULL wait-address, in which case we're not going to sleep and this is just a "check state"

 (b) the waiting function is supposed to do add_wait_queue() (usually by way of __pollwait) and that should be a sufficient barrier to anybody who does a wakeup

Note that add_wait_queue() ends up doing a spinlock sequence, and while that is not a full memory barrier (well, it is on x86, but not necessarily in general), it *should* be sufficient against an actual waker. That's kind of how add_wait_queue() vs wake_up() is supposed to work.

Of course, what we're now discussing is the pipe code *not* doing a full wake sequence (but just a "is the wait queue empty" check), and that is what then messes with the generic rules.

And this makes me think that the whole comment above waitqueue_active() is just fundamentally wrong. The smp_mb() is *not* sufficient in the sequence

	smp_mb();
	if (waitqueue_active(wq_head))
		wake_up(wq_head);

because while it happens to work wrt prepare_to_wait() sequences, it is *not* sufficient against other users of add_wait_queue(). In those other sequences the smp_mb() in set_current_state() might have happened long long before.

Those other users aren't just 'poll()', btw. There are quite a lot of add_wait_queue() users in the kernel. It's a traditional model even if it's not something people generally add to any more.

Now, hopefully many of those add_wait_queue() users end up using set_current_state() and getting the memory barrier that way. Or they use wait_woken() or any of the other proper helpers we have.
But I think this poll() thing is very much an example of this *not* being valid, and I don't think it's in any way pipe-specific.

So maybe we really do need to add the memory barrier to __add_wait_queue(). That's going to be painful, particularly with lots of users not needing it because they have the barrier in all the other places.

End result: maybe adding it just to __pollwait() is the thing to do, in the hopes that non-poll users all use the proper sequences.

But no, this is most definitely not a pipe-only thing.

            Linus
On 01/04, Linus Torvalds wrote:
>
> On Thu, 2 Jan 2025 at 08:33, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I was going to send a one-liner patch which adds mb() into pipe_poll()
> > but then I decided to make even more spam and ask some questions first.
>
> poll functions are not *supposed* to need memory barriers.
...
> But no, this is most definitely not a pipe-only thing.

Agreed, that is why I didn't send the patch which adds mb() into pipe_poll().

> They are supposed to do "poll_wait()" and then not need any more
> serialization after that, because we either
>
>  (a) have a NULL wait-address, in which case we're not going to sleep
>      and this is just a "check state"

To be honest, I don't understand the wait_address check in poll_wait(), it seems that wait_address is never NULL. But ->_qproc can be NULL if do_poll/etc does another iteration after poll_schedule_timeout(), in this case we can sleep again. But this case is fine.

>  (b) the waiting function is supposed to do add_wait_queue() (usually
>      by way of __pollwait) and that should be a sufficient barrier to
>      anybody who does a wakeup

Yes.

> And this makes me think that the whole comment above
> waitqueue_active() is just fundamentally wrong. The smp_mb() is *not*
> sufficient in the sequence
>
>	smp_mb();
>	if (waitqueue_active(wq_head))
>		wake_up(wq_head);
>
> because while it happens to work wrt prepare_to_wait() sequences, it
> is *not* sufficient against other users of add_wait_queue().

Well, this comment doesn't look wrong to me, but perhaps it can be more clear. It should probably explain that this pseudo-code is only correct because the waiter has a barrier before "if (@cond)" which pairs with the smp_mb() above waitqueue_active(). It even says

	prepare_to_wait(&wq_head, &wait, state);  // smp_mb() from set_current_state()

but perhaps this is not 100% clear.

> But I think this poll() thing is very much an example of this *not*
> being valid, and I don't think it's in any way pipe-specific.

Yes.
> So maybe we really do need to add the memory barrier to
> __add_wait_queue(). That's going to be painful, particularly with lots
> of users not needing it because they have the barrier in all the other
> places.

Yes, that is why I don't really like the idea of adding mb() into __add_wait_queue().

> End result: maybe adding it just to __pollwait() is the thing to do,
> in the hopes that non-poll users all use the proper sequences.

That is what I tried to propose. Will you agree with this change? We can even use smp_store_mb(), say

	@@ -224,11 +224,12 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
	 	if (!entry)
	 		return;
	 	entry->filp = get_file(filp);
	-	entry->wait_address = wait_address;
	 	entry->key = p->_key;
	 	init_waitqueue_func_entry(&entry->wait, pollwake);
	 	entry->wait.private = pwq;
	 	add_wait_queue(wait_address, &entry->wait);
	+	// COMMENT
	+	smp_store_mb(entry->wait_address, wait_address);

Oleg.
On 01/06, Oleg Nesterov wrote:
>
> To be honest, I don't understand the wait_address check in poll_wait(),
> it seems that wait_address is never NULL.

Yes,

	$ git grep -E '\bpoll_wait\(' | perl -wne '/\(.*?,\s*\&/ or print'

doesn't find anything "interesting". I think the wait_address check should die, it only adds unnecessary confusion.

Oleg.
On Mon, 6 Jan 2025 at 08:31, Oleg Nesterov <oleg@redhat.com> wrote:
>
> To be honest, I don't understand the wait_address check in poll_wait(),
> it seems that wait_address is never NULL.

Oh, it seems to be historical.

That *used* to be how select worked - once select() or poll() had seen that somebody returned "I have data", they set wait_address to NULL because there's no point in adding any wait-queues any more at that point.

But these days they do that

	wait->_qproc = NULL;

thing instead. It seems to go back to 626cf2366085 ("poll: add poll_requested_events() and poll_does_not_wait() functions").

So yeah, I guess these days the wait_table pointer is never NULL (where "these days" is the last decade+).

> That is what I tried to propose. Will you agree with this change?
> We can even use smp_store_mb(), say

I think it's clearer to just use smp_mb().

The whole smp_store_mb() thing is a pretty random thing, I think we could / should probably just remove it. It's basically a combination of "atomic store" and "smp_mb()", and at one point we thought that doing it with an "xchg" instruction would be better on x86.

And I don't think the one- or two-byte shorter instruction sequence is worth it, definitely not for something like updating entry->wait_address where there's no actual point to making the store itself atomic.

           Linus
On 01/06, Linus Torvalds wrote:
>
> On Mon, 6 Jan 2025 at 08:31, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > To be honest, I don't understand the wait_address check in poll_wait(),
> > it seems that wait_address is never NULL.
>
> Oh, it seems to be historical.

Yeah, please see my previous email. I think it must die, but that needs another one-liner.

> > That is what I tried to propose. Will you agree with this change?
> > We can even use smp_store_mb(), say
>
> I think it's clearer to just use smp_mb().

OK, agreed, I'll send the trivial patch once I write the changelog.

Oleg.
Damn ;) It is amazing how much unnecessary spam I added to these discussions. But let me ask 2 more questions, hopefully no more.

1. pipe_read() says

	 * But when we do wake up writers, we do so using a sync wakeup
	 * (WF_SYNC), because we want them to get going and generate more
	 * data for us.

OK, WF_SYNC makes sense if pipe_read() or pipe_write() is going to do wait_event() after wake_up(). But wake_up_interruptible_sync_poll() looks a bit misleading if we are going to wake up the writer or next_reader before return.

2. I can't understand this code in pipe_write()

	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
		int err = file_update_time(filp);
		if (err)
			ret = err;
		sb_end_write(file_inode(filp)->i_sb);
	}

- it only makes sense in the "fifo" case, right? When i_sb->s_magic != PIPEFS_MAGIC...

- why should we propagate the error code if "ret > 0" but file_update_time() fails?

Oleg.
On Mon, 6 Jan 2025 at 11:34, Oleg Nesterov <oleg@redhat.com> wrote:
>
> 1. pipe_read() says
>
>	 * But when we do wake up writers, we do so using a sync wakeup
>	 * (WF_SYNC), because we want them to get going and generate more
>	 * data for us.
>
> OK, WF_SYNC makes sense if pipe_read() or pipe_write() is going to do wait_event()
> after wake_up(). But wake_up_interruptible_sync_poll() looks a bit misleading if
> we are going to wakeup the writer or next_reader before return.

This heuristic has always been a bit iffy. And honestly, I think it's been driven by benchmarks that aren't necessarily always realistic (ie for ping-pong benchmarks, the best behavior is often to stay on the same CPU and just schedule between the reader/writer).

Those are exactly the kinds that will *not* do wait_event(), because they just read or wrote a single byte, but doing a "sync" wakeup then gets you the optimal pattern of "run the other end on the same CPU".

And do those cases exist outside of benchmarks? Probably not.

Ergo: I don't think there's a "right solution".

> 2. I can't understand this code in pipe_write()
>
>	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
>		int err = file_update_time(filp);
>		if (err)
>			ret = err;
>		sb_end_write(file_inode(filp)->i_sb);
>	}
>
> - it only makes sense in the "fifo" case, right? When
>   i_sb->s_magic != PIPEFS_MAGIC...

I think we've done it for regular pipes too. You can see it with 'fstat()', after all. And we've done it forever. It goes way back to even before the BK days, although back then it was done differently with an open-coded

	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
	mark_inode_dirty(inode);

instead.

I actually went back and looked at some historical trees. The inode time updates were originally added in linux-2.0.1 (July 1996) and didn't actually have the "mark_inode_dirty()" part, so they didn't cause the inode to be written back to disk, but people would see the time updates while they were happening.
The "write it back to disk" part started happening in 2.1.37pre1 (so May 1997), although back then it was just a simple

	inode->i_dirt = 1;

instead.

My point being that we've been doing it for a *loong* time. And yes, we've always done it for regular pipes too, not just on-disk FIFOs.

> - why should we propagate the error code if "ret > 0" but
>   file_update_time() fails?

Normally file_update_time() wouldn't fail. This was changed in c3b2da314834 ("fs: introduce inode operation ->update_time") and honestly, I think it was just a global search-and-replace kind of change.

And I think the answer is: nobody cares, because it's FIFOs that nobody uses in the first place, with an operation that never fails anyway. Could we remove the error propagation? Almost certainly. Does it matter? Almost certainly not.

Notice how the read side also does the time update, except it's obviously just atime, and it's done with

	if (ret > 0)
		file_accessed(filp);

instead (which honors O_NOATIME and then does touch_atime, and which does the inode_update_time() *without* any error checking).

IOW, that is different, but it all boils down to the same thing: nobody really cares in practice.

So for both of your issues, I think you can make changes as long as you have good reasoning for them and document them in the commit (both code and commit message).

             Linus
On 01/06, Linus Torvalds wrote:
>
> On Mon, 6 Jan 2025 at 11:34, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > 1. pipe_read() says
> >
> >	 * But when we do wake up writers, we do so using a sync wakeup
> >	 * (WF_SYNC), because we want them to get going and generate more
> >	 * data for us.
> >
> > OK, WF_SYNC makes sense if pipe_read() or pipe_write() is going to do wait_event()
> > after wake_up(). But wake_up_interruptible_sync_poll() looks a bit misleading if
> > we are going to wakeup the writer or next_reader before return.
>
> This heuristic has always been a bit iffy. And honestly, I think it's
> been driven by benchmarks that aren't necessarily always realistic (ie
> for ping-pong benchmarks, the best behavior is often to stay on the
> same CPU and just schedule between the reader/writer).

Agreed. But my question was not about performance, I just tried to understand this logic.

So in the case of

	wake_up_interruptible_sync_poll(wr_wait);
	wait_event_interruptible_exclusive(wr_read);

WF_SYNC is understandable, "stay on the same CPU" looks like the right thing, and "_sync_" matches the comment above. But if we are going to return, wake_up_interruptible_sync_poll() looks a bit misleading to me.

> > 2. I can't understand this code in pipe_write()
> >
> >	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
> >		int err = file_update_time(filp);
> >		if (err)
> >			ret = err;
> >		sb_end_write(file_inode(filp)->i_sb);
> >	}
> >
> > - it only makes sense in the "fifo" case, right? When
> >   i_sb->s_magic != PIPEFS_MAGIC...
>
> I think we've done it for regular pipes too. You can see it with
> 'fstat()', after all.

Ah, indeed, thanks for correcting me... And thanks for your other explanations. Again, it is not that I thought this needs changes, just I was a bit confused. In particular by

	err = file_update_time();
	if (err)
		ret = err;

which doesn't match the usage of file_accessed() in pipe_read().

Oleg.
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..27ffb650f131 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -253,7 +253,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	size_t total_len = iov_iter_count(to);
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
-	bool was_full, wake_next_reader = false;
+	bool wake_writer = false, wake_next_reader = false;
 	ssize_t ret;
 
 	/* Null read succeeds. */
@@ -271,7 +271,6 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	 * (WF_SYNC), because we want them to get going and generate more
 	 * data for us.
 	 */
-	was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
 	for (;;) {
 		/* Read ->head with a barrier vs post_one_notification() */
 		unsigned int head = smp_load_acquire(&pipe->head);
@@ -340,8 +339,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 				buf->len = 0;
 			}
 
-			if (!buf->len)
+			if (!buf->len) {
+				wake_writer |= pipe_full(head, tail, pipe->max_usage);
 				tail = pipe_update_tail(pipe, buf, tail);
+			}
 			total_len -= chars;
 			if (!total_len)
 				break;	/* common path: read succeeded */
@@ -377,7 +378,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		 * _very_ unlikely case that the pipe was full, but we got
 		 * no data.
 		 */
-		if (unlikely(was_full))
+		if (unlikely(wake_writer))
 			wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 
@@ -391,14 +392,14 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			return -ERESTARTSYS;
 
 		mutex_lock(&pipe->mutex);
-		was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
 		wake_next_reader = true;
+		wake_writer = false;
 	}
 	if (pipe_empty(pipe->head, pipe->tail))
 		wake_next_reader = false;
 	mutex_unlock(&pipe->mutex);
 
-	if (was_full)
+	if (wake_writer)
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	if (wake_next_reader)
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);