Message ID: 75B06EE0B67747ED+20241225094202.597305-1-wangyuli@uniontech.com (mailing list archive)
State: New
Series: [RESEND] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write
Don't you think the Cc list is a bit overloaded?

On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> When a user calls the read/write system call and passes a pipe
> descriptor, the pipe_read/pipe_write functions are invoked:
>
> 1. pipe_read():
>    1). Checks if the pipe is valid and if there is any data in the
>        pipe buffer.
>    2). Waits for data:
>        * If there is no data in the pipe and the write end is still open,
>          the current process enters a sleep state (wait_event()) until data
>          is written.
>        * If the write end is closed, return 0.
>    3). Reads data:
>        * Wakes up the process and copies data from the pipe's memory
>          buffer to user space.
>        * When the buffer is full, the writing process will go to sleep,
>          waiting for the pipe state to change to be awakened (using the
>          wake_up_interruptible_sync_poll() mechanism). Once data is read
>          from the buffer, the writing process can continue writing, and the
>          reading process can continue reading new data.
>    4). Returns the number of bytes read upon successful read.
>
> 2. pipe_write():
>    1). Checks if the pipe is valid and if there is any available
>        space in the pipe buffer.
>    2). Waits for buffer space:
>        * If the pipe buffer is full and the reading process has not
>          read any data, pipe_write() may put the current process to sleep
>          until there is space in the buffer.
>        * If the read end of the pipe is closed (no process is waiting
>          to read), an error code -EPIPE is returned, and a SIGPIPE signal may
>          be sent to the process.
>    3). Writes data:
>        * If there is enough space in the pipe buffer, pipe_write() copies
>          data from the user space buffer to the kernel buffer of the pipe
>          (using copy_from_user()).
>        * If the amount of data the user requests to write is larger than
>          the available space in the buffer, multiple writes may be required,
>          or the process may wait for new space to be freed.
>    4). Wakes up waiting reading processes:
>        * After the data is successfully written, pipe_write() wakes up
>          any processes that may be waiting to read data (using the
>          wake_up_interruptible_sync_poll() mechanism).
>    5). Returns the number of bytes successfully written.
>
> Check if there are any waiting processes in the process wait queue
> by introducing wq_has_sleeper() when waking up processes for pipe
> read/write operations.
>
> If no processes are waiting, there's no need to execute
> wake_up_interruptible_sync_poll(), thus avoiding unnecessary wake-ups.
>
> Unnecessary wake-ups can lead to context switches, where a process
> is woken up to handle I/O events even when there is no immediate
> need.
>
> Only wake up processes when there are actually waiting processes to
> reduce context switches and system overhead by checking
> with wq_has_sleeper().
>
> Additionally, by reducing unnecessary synchronization and wake-up
> operations, wq_has_sleeper() can decrease system resource waste and
> lock contention, improving overall system performance.
>
> For pipe read/write operations, this eliminates ineffective scheduling
> and enhances concurrency.
>
> It's important to note that enabling this option means invoking
> wq_has_sleeper() to check for sleeping processes in the wait queue
> for every read or write operation.
>
> While this is a lightweight operation, it still incurs some overhead.
>
> In low-load or single-task scenarios, this overhead may not yield
> significant benefits and could even introduce minor performance
> degradation.
>
> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>
> With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
> With the option enabled:  Single-core: 877.8, Multi-core (8): 4854.7
>
> Single-core performance improved by 4.1%, multi-core performance
> improved by 4.8%.

...

> +config PIPE_SKIP_SLEEPER
> +	bool "Skip sleeping processes during pipe read/write"
> +	default n

'n' is the default 'default', no need to have this line.

> +	help
> +	  This option introduces a check whether the sleep queue will
> +	  be awakened during pipe read/write.
> +
> +	  It often leads to a performance improvement. However, in
> +	  low-load or single-task scenarios, it may introduce minor
> +	  performance overhead.
> +	  If unsure, say N.

Illogical, it's already N as you stated by putting a redundant line, but after
removing that line it will make sense.

...

> +static inline bool

Have you built this with Clang and `make W=1 ...`?

> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> +	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
> +		return wq_has_sleeper(wq_head);
> +	else

Redundant.

> +		return true;

	if (!foo)
		return true;

	return bar(...);

> +}
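I.e., something like this (untested sketch, assuming the helper survives
at all):

static inline bool
pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
{
	/* IS_ENABLED() is a compile-time constant, the branch folds away. */
	if (!IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
		return true;

	return wq_has_sleeper(wq_head);
}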
On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> Don't you think the Cc list is a bit overloaded?

Indeed, my mail server doesn't let me reply-all.

> On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > +config PIPE_SKIP_SLEEPER
> > +	bool "Skip sleeping processes during pipe read/write"
> > +	default n
>
> 'n' is the default 'default', no need to have this line.

Actually, I'd say to skip the kconfig option for this. Kconfig options
that affect the behaviour of core code increase our testing burden, and
are another variable to account for when chasing down bugs, and the
potential overhead looks negligible.

Also, did you look at adding this optimization to wake_up()? No-op
wakeups are very common, I think this has wider applicability.
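Roughly this shape, say (untested sketch of the wake_up() side, in
kernel/sched/wait.c terms):

int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
	      int nr_exclusive, void *key)
{
	/* Skip the lock + irq dance entirely when no one is waiting. */
	if (!wq_has_sleeper(wq_head))
		return 0;

	return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
}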
On 2024/12/25 21:30, Andy Shevchenko wrote:
> Don't you think the Cc list is a bit overloaded?

Hi,

I apologize for any inconvenience this may cause.

I understand that under normal circumstances, one would simply pass the
modified code path as an argument to the kernel's scripts/get_maintainer.pl
script to determine the appropriate recipients.

However, given the vast and complex nature of the Linux kernel community,
with tens of thousands of developers worldwide, and considering the varying
"customs" of different subsystems, as well as time zone differences and
individual work habits, it's not uncommon for patches to be sent to mailing
lists and subsequently ignored or left pending.

This patch, for example, has been submitted multiple times without receiving
any response, unfortunately.

My intention is simply to seek your review, and that of other technical
experts like yourself, but I cannot be certain, prior to your response,
which specific experts on which lists would be willing to provide feedback.

I would appreciate any other suggestions you may have.

>> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>>
>> With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
>> With the option enabled:  Single-core: 877.8, Multi-core (8): 4854.7
>>
>> Single-core performance improved by 4.1%, multi-core performance
>> improved by 4.8%.
>
> ...

As you know, the kernel is extremely sensitive to performance. Even a 1%
performance improvement can lead to significant efficiency gains and
reduced carbon emissions in production environments, as long as there is
sufficient testing and theoretical analysis to prove that the improvement
is real and not due to measurement error or jitter.

>> +config PIPE_SKIP_SLEEPER
>> +	bool "Skip sleeping processes during pipe read/write"
>> +	default n
> 'n' is the default 'default', no need to have this line.

OK, I'll drop it. Thanks.

>> +	help
>> +	  This option introduces a check whether the sleep queue will
>> +	  be awakened during pipe read/write.
>> +
>> +	  It often leads to a performance improvement. However, in
>> +	  low-load or single-task scenarios, it may introduce minor
>> +	  performance overhead.
>> +	  If unsure, say N.
> Illogical, it's already N as you stated by putting a redundant line, but after
> removing that line it will make sense.
>
> ...

As noted, I'll remove "default n" as it serves no purpose.

>> +static inline bool
> Have you built this with Clang and `make W=1 ...`?

Hmm... I've noticed a discrepancy in kernel compilation results with and
without "make W=1".

When I use x86_64_defconfig and clang-19.1.1 (Ubuntu 24.10) and run
"make", there are no warnings. However, when I run "make W=1", the kernel
generates a massive number of errors, causing the compilation to fail
prematurely.

e.g.

In file included from arch/x86/kernel/asm-offsets.c:14:
In file included from ./include/linux/suspend.h:5:
In file included from ./include/linux/swap.h:9:
In file included from ./include/linux/memcontrol.h:21:
In file included from ./include/linux/mm.h:2224:
./include/linux/vmstat.h:504:43: error: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Werror,-Wenum-enum-conversion]
  504 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~ ^
  505 |                            item];
      |                            ~~~~
./include/linux/vmstat.h:511:43: error: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Werror,-Wenum-enum-conversion]
  511 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~ ^
  512 |                            NR_VM_NUMA_EVENT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~~
./include/linux/vmstat.h:524:43: error: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Werror,-Wenum-enum-conversion]
  524 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~ ^
  525 |                            NR_VM_NUMA_EVENT_ITEMS +
      |                            ~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.

And I've observed similar behavior with gcc-14.2.0.

While I'm keen on addressing as many potential compile errors and warnings
in the kernel as possible, it seems like a long-term endeavor.

Regarding this specific code, I'd appreciate your insights on how to
improve it.

>> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
>> +{
>> +	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
>> +		return wq_has_sleeper(wq_head);
>> +	else
> Redundant.
>
>> +		return true;
> 	if (!foo)
> 		return true;
>
> 	return bar(...);
>
>> +}

Yes. I'll rework the code structure here. Thanks.

Thanks,
On Wed, Dec 25, 2024 at 11:42:29PM +0800, WangYuli wrote:
> On 2024/12/25 21:30, Andy Shevchenko wrote:
>
> > Don't you think the Cc list is a bit overloaded?
>
> Hi,
>
> I apologize for any inconvenience this may cause.
>
> I understand that under normal circumstances, one would simply pass the
> modified code path as an argument to the kernel's scripts/get_maintainer.pl
> script to determine the appropriate recipients.
>
> However, given the vast and complex nature of the Linux kernel community,
> with tens of thousands of developers worldwide, and considering the varying
> "customs" of different subsystems, as well as time zone differences and
> individual work habits, it's not uncommon for patches to be sent to mailing
> lists and subsequently ignored or left pending.

(...)

Sorry, but by CCing 191 random addresses like this, that's the best way
to be added to .procmailrc and be completely ignored by everyone. At some
point one should wonder whether that's a common practice or if such
behaviors will be considered offensive by the majority. get_maintainer.pl
only lists the 2 lists and 3 addresses I left in CC (after Kent and Andy
whom I left since they replied to you).

> This patch, for example, has been submitted multiple times without receiving
> any response, unfortunately.

It can happen, but sending to the right people and possibly resending if
it gets lost is usually sufficient to attract attention. Sending to that
many people makes you look like someone feeling so important they need to
shout in a loudspeaker to send orders to everyone. Please refrain from
doing this in the future.

Thanks,
Willy
On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > Don't you think the Cc list is a bit overloaded?
>
> Indeed, my mail server doesn't let me reply-all.
>
> > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > +config PIPE_SKIP_SLEEPER
> > > +	bool "Skip sleeping processes during pipe read/write"
> > > +	default n
> >
> > 'n' is the default 'default', no need to have this line.
>
> Actually, I'd say to skip the kconfig option for this. Kconfig options
> that affect the behaviour of core code increase our testing burden, and
> are another variable to account for when chasing down bugs, and the
> potential overhead looks negligible.

I agree the behavior should not be guarded by an option. However,
because of how wq_has_sleeper is implemented (see below) I would argue
this needs to show how often locking can be avoided in real workloads.

The commit message does state this comes with a slowdown for cases which
can't avoid wakeups, but as is I thought the submitter just meant an
extra branch.

> Also, did you look at adding this optimization to wake_up()? No-op
> wakeups are very common, I think this has wider applicability.

I was going to suggest it myself, but then:

static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
{
	/*
	 * We need to be sure we are in sync with the
	 * add_wait_queue modifications to the wait queue.
	 *
	 * This memory barrier should be paired with one on the
	 * waiting side.
	 */
	smp_mb();
	return waitqueue_active(wq_head);
}

Which means this is in fact quite expensive.

Since wakeup is a lock + an interrupt trip, it would still be
cheaper single-threaded to "merely" suffer a full fence and for cases
where the queue is empty often enough this is definitely the right thing
to do.

On the other hand this executing when the queue is mostly *not* empty
would combat the point.

So unfortunately embedding this in wake_up is a no-go.
On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
(...)
> > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > that affect the behaviour of core code increase our testing burden, and
> > are another variable to account for when chasing down bugs, and the
> > potential overhead looks negligible.
>
> I agree the behavior should not be guarded by an option. However,
> because of how wq_has_sleeper is implemented (see below) I would argue
> this needs to show how often locking can be avoided in real workloads.
>
> The commit message does state this comes with a slowdown for cases which
> can't avoid wakeups, but as is I thought the submitter just meant an
> extra branch.
>
> > Also, did you look at adding this optimization to wake_up()? No-op
> > wakeups are very common, I think this has wider applicability.
>
> I was going to suggest it myself, but then:
>
> static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> {
> 	/*
> 	 * We need to be sure we are in sync with the
> 	 * add_wait_queue modifications to the wait queue.
> 	 *
> 	 * This memory barrier should be paired with one on the
> 	 * waiting side.
> 	 */
> 	smp_mb();
> 	return waitqueue_active(wq_head);
> }
>
> Which means this is in fact quite expensive.
>
> Since wakeup is a lock + an interrupt trip, it would still be
> cheaper single-threaded to "merely" suffer a full fence and for cases
> where the queue is empty often enough this is definitely the right thing
> to do.

We're comparing against no-op wakeup. A real wakeup does an IPI, which
completely dwarfs the cost of a barrier.

And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
assume it's gotten better, but back when I was looking at waitqueues
nested pushf/popf was horrifically expensive.

But perhaps can we do this with just a release barrier? Similar to how
list_empty_careful() works.

> On the other hand this executing when the queue is mostly *not* empty
> would combat the point.
>
> So unfortunately embedding this in wake_up is a no-go.

You definitely can't say that without knowing how often no-op
wake_up()s occur. It wouldn't be hard to gather that (write a patch to
add a pair of percpu counters, throw it on a few machines running random
workloads) and I think the results might surprise you.
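E.g. something like this (hypothetical debug patch, counter and helper
names made up), then dump the two sums after a day of uptime:

#include <linux/percpu.h>
#include <linux/wait.h>

DEFINE_PER_CPU(unsigned long, wake_up_total);
DEFINE_PER_CPU(unsigned long, wake_up_noop);

/* Hypothetical hook, called from __wake_up_common() with the lock held. */
static inline void count_wake_up(struct wait_queue_head *wq_head)
{
	this_cpu_inc(wake_up_total);
	if (!waitqueue_active(wq_head))
		this_cpu_inc(wake_up_noop);
}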
Hi,

I've reviewed the Contributor Covenant and the Linux Kernel Contributor
Covenant Code of Conduct Interpretation, and I couldn't find anything
suggesting that CCing a large number of people is "unfriendly".

And while I don't believe my actions were malicious, I understand your
concern.

Going forward, I'll be more considerate of the recipients when sending
emails and will avoid CCing more than a hundred people at once in similar
situations.

On 2024/12/26 00:00, Willy Tarreau wrote:
> (...)
>
> Sorry, but by CCing 191 random addresses like this,

I think there may be a misunderstanding. All these recipients can be
found in the git history of fs/pipe.c.

> that's the best way
> to be added to .procmailrc and be completely ignored by everyone. At some
> point one should wonder whether that's a common practice or if such
> behaviors will be considered offensive by the majority. get_maintainer.pl
> only lists the 2 lists and 3 addresses I left in CC (after Kent and Andy
> whom I left since they replied to you).
>
>> This patch, for example, has been submitted multiple times without receiving
>> any response, unfortunately.
> It can happen, but sending to the right people and possibly resending if
> it gets lost is usually sufficient to attract attention. Sending to that
> many people makes you look like someone feeling so important they need to
> shout in a loudspeaker to send orders to everyone.

Sincerely hope that's the case.

> Please refrain from
> doing this in the future.

As above.

Thanks,
On Thu, Dec 26, 2024 at 12:32:35AM +0800, WangYuli wrote:
> Hi,
>
> I've reviewed the Contributor Covenant and the Linux Kernel Contributor
> Covenant Code of Conduct Interpretation, and I couldn't find anything
> suggesting that CCing a large number of people is "unfriendly".

This is unrelated, it's a matter of basic social interactions and respect
of others' time.

> And while I don't believe my actions were malicious, I understand your
> concern.
>
> Going forward, I'll be more considerate of the recipients when sending
> emails and will avoid CCing more than a hundred people at once in similar
> situations.

"More than a hundred"? Are you serious? For what purpose?

Let me explain something about how people consume e-mails: the first
thing they do if they hesitate to process one is to check if someone else
in the To or Cc is more knowledgeable or legitimate than them. If so,
they often prefer to let others deal with it. With such a large list,
it's impossible to check all other addresses, and virtually *everyone*
will consider that surely one of the hundreds of others is more
legitimate. So the more people you add, the less likely anyone will
handle your request.

The common rules that apply here are not made out of nowhere but based on
what works best. Just send to the most relevant outputs of
get_maintainer, wait for one week, and if you get no response it's just
that these people were busy and forgot about you, so just kindly ping
again to make sure your message was received. With such a large community
it's normal to lose some messages, and pinging again is normally not
considered an offence, so that's fine. If you get no response after
multiple attempts, it might mean there's something annoying in your
message (like sending to tens or hundreds of people at once).

And really, when a script tells you "send your message to these 3 people"
and you send it to 191, how can you imagine the recipients will think
anything different from "this person feels super important to spam
everyone like this"? Why would they even respond if they feel like you
consider you have the right to spam everyone in hope to get your orders
processed immediately? I'm pretty sure that the majority will just let it
rot by principle.

Good luck,
Willy
On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
(...)
> > Since wakeup is a lock + an interrupt trip, it would still be
> > cheaper single-threaded to "merely" suffer a full fence and for cases
> > where the queue is empty often enough this is definitely the right thing
> > to do.
>
> We're comparing against no-op wakeup. A real wakeup does an IPI, which
> completely dwarfs the cost of a barrier.
>
> And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> assume it's gotten better, but back when I was looking at waitqueues
> nested pushf/popf was horrifically expensive.
>
> But perhaps can we do this with just a release barrier? Similar to how
> list_empty_careful() works.
>
> > On the other hand this executing when the queue is mostly *not* empty
> > would combat the point.
> >
> > So unfortunately embedding this in wake_up is a no-go.
>
> You definitely can't say that without knowing how often no-op
> wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> add a pair of percpu counters, throw it on a few machines running random
> workloads) and I think the results might surprise you.

There is some talking past each other here.

I explicitly noted one needs to check what happens in real workloads.

I very much expect there will be consumers where there are no waiters
almost every time and consumers which almost always do have them.

My claim is that this should be handled on a case-by-case basis.

So I whipped out a bpftrace one-liner to take a look at the kernel
build, details at the end.

In terms of the total (0 == no waiters, 1 == waiters):
[0, 1)   600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, ...) 457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |

There are some funzies in the vfs layer which I'm going to sort out myself.

The kernel is tags/next-20241220.

As far as pipes go:

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up_sync_key+59
    pipe_read+385
]:
[0, 1)    10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

So this guy literally never had any waiters when wakeup was issued.
faddr2line claims line 405, which I presume is off by one:

 401	if (was_full)
 402		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 403	if (wake_next_reader)
 404		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 405	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);

I'm guessing the real empty queue is rd_wait. Definitely a candidate
depending on other workloads; personally I would just patch it as is.

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up+54
    pipe_release+92
]:
[0, 1)    12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, ...)   5330 |@@@@@@@@@@@@@@@@@@@@@@                              |

a wash, would not touch that no matter what

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up+54
    pipe_release+110
]:
[0, 1)    17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

again no waiters, line claimed is 737, again off by one:

 733	/* Was that the last reader or writer, but not the other side? */
 734	if (!pipe->readers != !pipe->writers) {
 735		wake_up_interruptible_all(&pipe->rd_wait);
 736		wake_up_interruptible_all(&pipe->wr_wait);
 737		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 738		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);

so I presume wr_wait? same comment as the entry at claimed line 405

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up_sync_key+59
    pipe_write+773
]:
[0, 1)    22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

again no waiters, claimed line 606:

 604	if (wake_next_writer)
 605		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 606	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {

again would be inclined to patch as is

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up_sync_key+59
    pipe_read+943
]:
[0, 1)     9488 |@@@@@@@@@@@@@                                        |
[1, ...)  35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

majority of the time there were waiters, would not touch regardless of
other workloads, line 403:

 401	if (was_full)
 402		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 403	if (wake_next_reader)
 404		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);

the wr_wait thing

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up_sync_key+59
    pipe_write+729
]:
[0, 1)   199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[1, ...) 376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

ditto concerning not touching, resolved to line 603:

 601	if (was_empty || pipe->poll_usage)
 602		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 603	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 604	if (wake_next_writer)
 605		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);

That is to say, as far as this workload goes the submitted patch does
avoid some of the lock + irq trips by covering cases where there are no
waiters seen in this workload, but also adds the smp_mb thing when it
does not help -- I would remove those spots from the submission.

While I agree a full service treatment would require a bunch of
different workloads, personally I would be inclined to justify the change
merely based on the kernel build + leaving bpftrace running on a
real-world box running random crap.

As for obtaining such info, I failed to convince bpftrace to check if
the waiter list is empty. Instead I resorted to adding a dedicated func
which grabs the bit and probing on that. The func is in a different file
because gcc decided to omit the call otherwise.

one-liner:

bpftrace -e 'kprobe:wakeprobe { @[kstack(4)] = lhist(arg0, 0, 1, 1); }'

hack:

diff --git a/fs/file.c b/fs/file.c
index d868cdb95d1e..00d6a34eb174 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1442,3 +1442,11 @@ int iterate_fd(struct files_struct *files, unsigned n,
 	return res;
 }
 EXPORT_SYMBOL(iterate_fd);
+
+
+void wakeprobe(int count);
+
+void wakeprobe(int count)
+{
+}
+EXPORT_SYMBOL(wakeprobe);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..8db7b0daf04b 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -57,6 +57,8 @@ void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry
 }
 EXPORT_SYMBOL(remove_wait_queue);
 
+void wakeprobe(int count);
+
 /*
  * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
@@ -77,6 +79,8 @@ static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
 
 	lockdep_assert_held(&wq_head->lock);
 
+	wakeprobe(wq_has_sleeper(wq_head));
+
 	curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
 
 	if (&curr->entry == &wq_head->head)
On Wed, Dec 25, 2024 at 06:22:49PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
(...)
> There is some talking past each other here.
>
> I explicitly noted one needs to check what happens in real workloads.
>
> I very much expect there will be consumers where there are no waiters
> almost every time and consumers which almost always do have them.
>
> My claim is that this should be handled on a case-by-case basis.
>
> So I whipped out a bpftrace one-liner to take a look at the kernel
> build, details at the end.
>
> In terms of the total (0 == no waiters, 1 == waiters):
> [0, 1)   600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...) 457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
>
(...)
>
> That is to say, as far as this workload goes the submitted patch does
> avoid some of the lock + irq trips by covering cases where there are no
> waiters seen in this workload, but also adds the smp_mb thing when it
> does not help -- I would remove those spots from the submission.

Neat use of bpf, although if you had to patch the kernel anyways I
would've just gone the percpu counter route... :)

Based on those numbers, even in the cases where wake-up dominates it
doesn't dominate by enough that I'd expect the patch to cause a
regression, at least if we can do it with a proper release barrier.

Why is smp_load_release() not a thing? Unusual I suppose, but it is what
we want here...
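I.e. roughly this shape, if the waiter side could be made to pair with it
(hypothetical sketch, name made up; add_wait_queue() would need a
store-release when publishing the entry, which it doesn't do today):

static inline bool waitqueue_active_careful(struct wait_queue_head *wq_head)
{
	/*
	 * Acquire load instead of the full smp_mb() in wq_has_sleeper().
	 * Only valid against a waiter side that enqueues with a matching
	 * release -- the current add_wait_queue() does not.
	 */
	struct list_head *next = smp_load_acquire(&wq_head->head.next);

	return next != &wq_head->head;
}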
diff --git a/fs/Kconfig b/fs/Kconfig
index 64d420e3c475..0dacc46a73fe 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -429,4 +429,17 @@ source "fs/unicode/Kconfig"
 config IO_WQ
 	bool
 
+config PIPE_SKIP_SLEEPER
+	bool "Skip sleeping processes during pipe read/write"
+	default n
+	help
+	  This option introduces a check whether the sleep queue will
+	  be awakened during pipe read/write.
+
+	  It often leads to a performance improvement. However, in
+	  low-load or single-task scenarios, it may introduce minor
+	  performance overhead.
+
+	  If unsure, say N.
+
 endmenu
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..c085333ae72c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -247,6 +247,15 @@ static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
 	return tail;
 }
 
+static inline bool
+pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
+{
+	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
+		return wq_has_sleeper(wq_head);
+	else
+		return true;
+}
+
 static ssize_t
 pipe_read(struct kiocb *iocb, struct iov_iter *to)
 {
@@ -377,7 +386,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	 * _very_ unlikely case that the pipe was full, but we got
 	 * no data.
 	 */
-	if (unlikely(was_full))
+	if (unlikely(was_full) && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 
@@ -398,9 +407,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		wake_next_reader = false;
 	mutex_unlock(&pipe->mutex);
 
-	if (was_full)
+	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
-	if (wake_next_reader)
+	if (wake_next_reader && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	if (ret > 0)
@@ -573,7 +582,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 * become empty while we dropped the lock.
 		 */
 		mutex_unlock(&pipe->mutex);
-		if (was_empty)
+		if (was_empty && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
@@ -598,10 +607,10 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * Epoll nonsensically wants a wakeup whether the pipe
 	 * was already empty or not.
 	 */
-	if (was_empty || pipe->poll_usage)
+	if ((was_empty || pipe->poll_usage) && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
-	if (wake_next_writer)
+	if (wake_next_writer && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
 		int err = file_update_time(filp);
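For reference, the hot loop in UnixBench's pipe throughput test is
essentially the following single-process ping-pong (a minimal sketch, not
the actual benchmark source): with one process on both ends of the pipe,
neither waitqueue ever has sleepers, which is exactly the case the new
check short-circuits.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	char buf[512] = {0};

	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	/* Write 512 bytes and immediately read them back; no one ever
	 * sleeps on the pipe, so every wakeup on this path is a no-op. */
	for (long i = 0; i < 1000000; i++) {
		if (write(fds[1], buf, sizeof(buf)) != (ssize_t)sizeof(buf) ||
		    read(fds[0], buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("pipe i/o");
			return 1;
		}
	}
	return 0;
}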