
[RESEND] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write

Message ID 75B06EE0B67747ED+20241225094202.597305-1-wangyuli@uniontech.com (mailing list archive)
State New
Series [RESEND] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write

Commit Message

WangYuli Dec. 25, 2024, 9:42 a.m. UTC
When a user calls the read() or write() system call on a pipe file
descriptor, the pipe_read()/pipe_write() functions are invoked:

1. pipe_read():
  1). Checks that the pipe is valid and whether there is any data in
the pipe buffer.
  2). Waits for data:
    *If there is no data in the pipe and the write end is still open,
the current process sleeps (wait_event()) until data is written.
    *If the write end is closed, returns 0.
  3). Reads data:
    *The process wakes up and copies data from the pipe's kernel
buffer to user space.
    *When the buffer is full, the writing process goes to sleep,
waiting to be woken when the pipe state changes (via the
wake_up_interruptible_sync_poll() mechanism). Once data has been
read from the buffer, the writing process can continue writing and
the reading process can continue reading new data.
  4). Returns the number of bytes read on success.

2. pipe_write():
  1). Checks that the pipe is valid and whether there is any free
space in the pipe buffer.
  2). Waits for buffer space:
    *If the pipe buffer is full and the reading process has not yet
read any data, pipe_write() may put the current process to sleep
until space becomes available.
    *If the read end of the pipe is closed (no process is left to
read), -EPIPE is returned and a SIGPIPE signal may be sent to the
process (see the short demo after this list).
  3). Writes data:
    *If there is enough space in the pipe buffer, pipe_write() copies
data from the user space buffer into the pipe's kernel buffer
(using copy_from_user()).
    *If the amount of data the user requests to write is larger than
the available space, multiple writes may be required, or the process
may wait for new space to be freed.
  4). Wakes up waiting readers:
    *After the data has been written, pipe_write() wakes up any
processes that may be waiting to read it (via the
wake_up_interruptible_sync_poll() mechanism).
  5). Returns the number of bytes successfully written.
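
As a quick illustration of the -EPIPE/SIGPIPE behavior described
above, here is a minimal userspace sketch (illustrative only, not
part of the patch):

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            int fds[2];
            char c = 'x';

            signal(SIGPIPE, SIG_IGN);   /* report EPIPE instead of dying on SIGPIPE */
            if (pipe(fds) != 0) {
                    perror("pipe");
                    return 1;
            }

            close(fds[0]);                   /* close the read end: no readers left... */
            if (write(fds[1], &c, 1) < 0)    /* ...so the kernel's pipe_write() fails */
                    printf("write: %s\n", strerror(errno));  /* "Broken pipe" */
            return 0;
    }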

Introduce wq_has_sleeper() to check whether any process is actually
sleeping on the wait queue before waking up processes for pipe
read/write operations.

If no process is waiting, there is no need to call
wake_up_interruptible_sync_poll() at all, which avoids unnecessary
wake-ups.
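
Concretely, every wake-up call site is guarded like this (taken from
the patch below; pipe_check_wq_has_sleeper() is the new helper that
performs the wq_has_sleeper() check when the option is enabled):

    if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
            wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);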

Unnecessary wake-ups can lead to context switches, where a process
is woken up to handle I/O events even when there is no immediate
need.

By checking with wq_has_sleeper(), processes are only woken up when
there are actual waiters, which reduces context switches and system
overhead.

Additionally, skipping the needless synchronization and wake-up
operations reduces wasted work and lock contention, improving
overall system performance.

For pipe read/write operations, this eliminates ineffective scheduling
and enhances concurrency.

It's important to note that enabling this option means invoking
wq_has_sleeper() to check for sleeping processes in the wait queue
for every read or write operation.

While this is a lightweight operation, it still incurs some overhead.

In low-load or single-task scenarios, this overhead may not yield
significant benefits and could even introduce minor performance
degradation.

UnixBench Pipe benchmark results on a Zhaoxin KX-U6780A processor:

With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
With the option enabled:  Single-core: 877.8, Multi-core (8): 4854.7

Single-core performance improved by about 4.3%, multi-core
performance by about 5.0%.
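
For context, UnixBench's pipe throughput test is essentially a single
process writing 512 bytes into a pipe and immediately reading them
back in a loop. A rough standalone approximation (not the actual
UnixBench harness) looks like this:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            int fds[2];
            char buf[512] = { 0 };
            long i, iters = 1000000;
            struct timespec t0, t1;

            if (pipe(fds) != 0) {
                    perror("pipe");
                    return 1;
            }

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < iters; i++) {
                    /* one 512-byte round trip per iteration */
                    if (write(fds[1], buf, sizeof(buf)) != (ssize_t)sizeof(buf) ||
                        read(fds[0], buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                            perror("pipe I/O");
                            return 1;
                    }
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            printf("%.0f round trips/sec\n",
                   iters / ((t1.tv_sec - t0.tv_sec) +
                            (t1.tv_nsec - t0.tv_nsec) / 1e9));
            return 0;
    }

In such a loop the reader never sleeps (the data is always already
there) and the 512-byte writes never fill the buffer, so the wake-ups
issued on every iteration find an empty wait queue -- exactly the
no-op wake-ups the wq_has_sleeper() check short-circuits.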

Co-developed-by: Shengjin Yu <yushengjin@uniontech.com>
Signed-off-by: Shengjin Yu <yushengjin@uniontech.com>
Co-developed-by: Dandan Zhang <zhangdandan@uniontech.com>
Signed-off-by: Dandan Zhang <zhangdandan@uniontech.com>
Tested-by: Dandan Zhang <zhangdandan@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
---
 fs/Kconfig | 13 +++++++++++++
 fs/pipe.c  | 21 +++++++++++++++------
 2 files changed, 28 insertions(+), 6 deletions(-)

Comments

Andy Shevchenko Dec. 25, 2024, 1:30 p.m. UTC | #1
Don't you think the Cc list is a bit overloaded?

On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> When a user calls the read/write system call and passes a pipe
> descriptor, the pipe_read/pipe_write functions are invoked:
> 
> 1. pipe_read():
>   1). Checks if the pipe is valid and if there is any data in the
> pipe buffer.
>   2). Waits for data:
>     *If there is no data in the pipe and the write end is still open,
> the current process enters a sleep state (wait_event()) until data
> is written.
>     *If the write end is closed, return 0.
>   3). Reads data:
>     *Wakes up the process and copies data from the pipe's memory
> buffer to user space.
>     *When the buffer is full, the writing process will go to sleep,
> waiting for the pipe state to change to be awakened (using the
> wake_up_interruptible_sync_poll() mechanism). Once data is read
> from the buffer, the writing process can continue writing, and the
> reading process can continue reading new data.
>   4). Returns the number of bytes read upon successful read.
> 
> 2. pipe_write():
>   1). Checks if the pipe is valid and if there is any available
> space in the pipe buffer.
>   2). Waits for buffer space:
>     *If the pipe buffer is full and the reading process has not
> read any data, pipe_write() may put the current process to sleep
> until there is space in the buffer.
>     *If the read end of the pipe is closed (no process is waiting
> to read), an error code -EPIPE is returned, and a SIGPIPE signal may
> be sent to the process.
>   3). Writes data:
>     *If there is enough space in the pipe buffer, pipe_write() copies
> data from the user space buffer to the kernel buffer of the pipe
> (using copy_from_user()).
>     *If the amount of data the user requests to write is larger than
> the available space in the buffer, multiple writes may be required,
> or the process may wait for new space to be freed.
>   4). Wakes up waiting reading processes:
>     *After the data is successfully written, pipe_write() wakes up
> any processes that may be waiting to read data (using the
> wake_up_interruptible_sync_poll() mechanism).
>   5). Returns the number of bytes successfully written.
> 
> Check if there are any waiting processes in the process wait queue
> by introducing wq_has_sleeper() when waking up processes for pipe
> read/write operations.
> 
> If no processes are waiting, there's no need to execute
> wake_up_interruptible_sync_poll(), thus avoiding unnecessary wake-ups.
> 
> Unnecessary wake-ups can lead to context switches, where a process
> is woken up to handle I/O events even when there is no immediate
> need.
> 
> Only wake up processes when there are actually waiting processes to
> reduce context switches and system overhead by checking
> with wq_has_sleeper().
> 
> Additionally, by reducing unnecessary synchronization and wake-up
> operations, wq_has_sleeper() can decrease system resource waste and
> lock contention, improving overall system performance.
> 
> For pipe read/write operations, this eliminates ineffective scheduling
> and enhances concurrency.
> 
> It's important to note that enabling this option means invoking
> wq_has_sleeper() to check for sleeping processes in the wait queue
> for every read or write operation.
> 
> While this is a lightweight operation, it still incurs some overhead.
> 
> In low-load or single-task scenarios, this overhead may not yield
> significant benefits and could even introduce minor performance
> degradation.
> 
> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
> 
> With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
> With the option enabled:  Single-core: 877.8, Multi-core (8): 4854.7
> 
> Single-core performance improved by 4.1%, multi-core performance
> improved by 4.8%.

...

> +config PIPE_SKIP_SLEEPER
> +	bool "Skip sleeping processes during pipe read/write"
> +	default n

'n' is the default 'default', no need to have this line.

> +	help
> +	  This option introduces a check whether the sleep queue will
> +	  be awakened during pipe read/write.
> +
> +	  It often leads to a performance improvement. However, in
> +	  low-load or single-task scenarios, it may introduce minor
> +	  performance overhead.

> +	  If unsure, say N.

Illogical, it's already N as you stated by putting a redundant line, but after
removing that line it will make sense.
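
I.e., with the redundant line dropped, the stanza would read:

    config PIPE_SKIP_SLEEPER
    	bool "Skip sleeping processes during pipe read/write"
    	help
    	  This option introduces a check whether the sleep queue will
    	  be awakened during pipe read/write.

    	  It often leads to a performance improvement. However, in
    	  low-load or single-task scenarios, it may introduce minor
    	  performance overhead.

    	  If unsure, say N.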

...

> +static inline bool

Have you built this with Clang and `make W=1 ...`?

> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
> +{
> +	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
> +		return wq_has_sleeper(wq_head);
> +	else

Redundant.

> +		return true;

	if (!foo)
		return true;

	return bar(...);

> +}
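
I.e., restructured along those lines, the helper would become:

    static inline bool
    pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
    {
    	if (!IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
    		return true;

    	return wq_has_sleeper(wq_head);
    }
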
Kent Overstreet Dec. 25, 2024, 1:53 p.m. UTC | #2
On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> Don't you think the Cc list is a bit overloaded?

Indeed, my mail server doesn't let me reply-all.

> On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > +config PIPE_SKIP_SLEEPER
> > +	bool "Skip sleeping processes during pipe read/write"
> > +	default n
> 
> 'n' is the default 'default', no need to have this line.

Actually, I'd say to skip the kconfig option for this. Kconfig options
that affect the behaviour of core code increase our testing burden, and
are another variable to account for when chasing down bugs, and the
potential overhead looks negligible.

Also, did you look at adding this optimization to wake_up()? No-op
wakeups are very common, I think this has wider applicability.
WangYuli Dec. 25, 2024, 3:42 p.m. UTC | #3
On 2024/12/25 21:30, Andy Shevchenko wrote:

> Don't you think the Cc list is a bit overloaded?

Hi,

I apologize for any inconvenience this may cause.

I understand that under normal circumstances, one would simply pass the 
modified code path as an argument to the kernel's 
scripts/get_maintainer.pl script to determine the appropriate recipients.

However, given the vast and complex nature of the Linux kernel 
community, with tens of thousands of developers worldwide, and 
considering the varying "customs" of different subsystems, as well as 
time zone differences and individual work habits, it's not uncommon for 
patches to be sent to mailing lists and subsequently ignored or left 
pending.

This patch, for example, has been submitted multiple times without 
receiving any response, unfortunately.

My intention is simply to seek your review, and that of other technical 
experts like yourself, but I cannot be certain, prior to your response, 
which specific experts on which lists would be willing to provide feedback.

I would appreciate any other suggestions you may have.

>> UnixBench Pipe benchmark results on Zhaoxin KX-U6780A processor:
>>
>> With the option disabled: Single-core: 841.8, Multi-core (8): 4621.6
>> With the option enabled:  Single-core: 877.8, Multi-core (8): 4854.7
>>
>> Single-core performance improved by 4.1%, multi-core performance
>> improved by 4.8%.
> ...

As you know, the kernel is extremely sensitive to performance.

Even a 1% performance improvement can lead to significant efficiency 
gains and reduced carbon emissions in production environments, as long 
as there is sufficient testing and theoretical analysis to prove that 
the improvement is real and not due to measurement error or jitter.

>> +config PIPE_SKIP_SLEEPER
>> +	bool "Skip sleeping processes during pipe read/write"
>> +	default n
> 'n' is the default 'default', no need to have this line.
OK, I'll drop it. Thanks.
>
>> +	help
>> +	  This option introduces a check whether the sleep queue will
>> +	  be awakened during pipe read/write.
>> +
>> +	  It often leads to a performance improvement. However, in
>> +	  low-load or single-task scenarios, it may introduce minor
>> +	  performance overhead.
>> +	  If unsure, say N.
> Illogical, it's already N as you stated by putting a redundant line, but after
> removing that line it will make sense.
>
> ...
As noted, I'll remove "default n" as it serves no purpose.
>
>> +static inline bool
> Have you build this with Clang and `make W=1 ...`?

Hmm...I've noticed a discrepancy in kernel compilation results with and 
without "make W=1".

When I use x86_64_defconfig and clang-19.1.1 (Ubuntu 24.10) and run 
"make", there are no warnings.

However, when I run "make W=1", the kernel generates a massive number of 
errors, causing the compilation to fail prematurely.

e.g.

In file included from arch/x86/kernel/asm-offsets.c:14:
In file included from ./include/linux/suspend.h:5:
In file included from ./include/linux/swap.h:9:
In file included from ./include/linux/memcontrol.h:21:
In file included from ./include/linux/mm.h:2224:
./include/linux/vmstat.h:504:43: error: arithmetic between different 
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
[-Werror,-Wenum-enum-conversion]
   504 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
       |                            ~~~~~~~~~~~~~~~~~~~~~ ^
   505 |                            item];
       |                            ~~~~
./include/linux/vmstat.h:511:43: error: arithmetic between different 
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
[-Werror,-Wenum-enum-conversion]
   511 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
       |                            ~~~~~~~~~~~~~~~~~~~~~ ^
   512 |                            NR_VM_NUMA_EVENT_ITEMS +
       |                            ~~~~~~~~~~~~~~~~~~~~~~
./include/linux/vmstat.h:524:43: error: arithmetic between different 
enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') 
[-Werror,-Wenum-enum-conversion]
   524 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
       |                            ~~~~~~~~~~~~~~~~~~~~~ ^
   525 |                            NR_VM_NUMA_EVENT_ITEMS +
       |                            ~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.

And I've observed similar behavior with gcc-14.2.0.

While I'm keen on addressing as many potential compile errors and 
warnings in the kernel as possible, it seems like a long-term endeavor.

Regarding this specific code, I'd appreciate your insights on how to 
improve it.
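
(For the record, W=1 can also be applied to a single object file,
which limits the noise to the code in question:

    make W=1 fs/pipe.o
)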

>
>> +pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
>> +{
>> +	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
>> +		return wq_has_sleeper(wq_head);
>> +	else
> Redundant.
>
>> +		return true;
> 	if (!foo)
> 		return true;
>
> 	return bar(...);
>
>> +}

Yes. I'll rework the code structure here. Thanks.

Thanks,
Willy Tarreau Dec. 25, 2024, 4 p.m. UTC | #4
On Wed, Dec 25, 2024 at 11:42:29PM +0800, WangYuli wrote:
> On 2024/12/25 21:30, Andy Shevchenko wrote:
> 
> > Don't you think the Cc list is a bit overloaded?
> 
> Hi,
> 
> I apologize for any inconvenience this may cause.
> 
> I understand that under normal circumstances, one would simply pass the
> modified code path as an argument to the kernel's scripts/get_maintainer.pl
> script to determine the appropriate recipients.
> 
> However, given the vast and complex nature of the Linux kernel community,
> with tens of thousands of developers worldwide, and considering the varying
> "customs" of different subsystems, as well as time zone differences and
> individual work habits, it's not uncommon for patches to be sent to mailing
> lists and subsequently ignored or left pending.
(...)

Sorry, but CCing 191 random addresses like this is the best way
to be added to .procmailrc and be completely ignored by everyone. At some
point one should wonder whether that's a common practice or if such
behaviors will be considered offensive by the majority. get_maintainer.pl
only lists the 2 lists and 3 addresses I left in CC (after Kent and Andy
whom I left since they replied to you).

> This patch, for example, has been submitted multiple times without receiving
> any response, unfortunately.

It can happen, but sending to the right people and possibly resending if
it gets lost is usually sufficient to attract attention. Sending to that
many people make you look like someone feeling so important they need to
shout in a loudspeaker to send orders to everyone. Please refrain from
doing this in the future.

Thanks,
Willy
Mateusz Guzik Dec. 25, 2024, 4:04 p.m. UTC | #5
On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > Don't you think the Cc list is a bit overloaded?
> 
> Indeed, my mail server doesn't let me reply-all.
> 
> > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > +config PIPE_SKIP_SLEEPER
> > > +	bool "Skip sleeping processes during pipe read/write"
> > > +	default n
> > 
> > 'n' is the default 'default', no need to have this line.
> 
> Actually, I'd say to skip the kconfig option for this. Kconfig options
> that affect the behaviour of core code increase our testing burden, and
> are another variable to account for when chasing down bugs, and the
> potential overhead looks negligable.
> 

I agree the behavior should not be guarded by an option. However,
because of how wq_has_sleeper is implemented (see below) I would argue
this needs to show how often locking can be avoided in real workloads.

The commit message does state this comes with a slowdown for cases which
can't avoid wakeups, but as is I thought the submitter just meant an
extra branch.

> Also, did you look at adding this optimization to wake_up()? No-op
> wakeups are very common, I think this has wider applicability.

I was going to suggest it myself, but then:

static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
{
        /*
         * We need to be sure we are in sync with the
         * add_wait_queue modifications to the wait queue.
         *
         * This memory barrier should be paired with one on the
         * waiting side.
         */
        smp_mb();
        return waitqueue_active(wq_head);
}

Which means this is in fact quite expensive.

Since wakeup is a lock + an interrupt trip, it would still be
cheaper single-threaded to "merely" suffer a full fence and for cases
where the queue is empty often enough this is definitely the right thing
to do.

On the other hand, executing this when the queue is mostly *not* empty
would defeat the point.

So unfortunately embedding this in wake_up is a no-go.
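
For reference, the pairing that comment refers to looks roughly like
this (cf. the waitqueue_active() documentation in
include/linux/wait.h); the sleep side gets its full barrier from
set_current_state() inside prepare_to_wait():

    /* waiter */
    DEFINE_WAIT(wait);

    for (;;) {
    	prepare_to_wait(&wq_head, &wait, TASK_INTERRUPTIBLE);
    	if (cond)		/* checked after the implied smp_mb() */
    		break;
    	schedule();
    }
    finish_wait(&wq_head, &wait);

    /* waker */
    cond = true;
    if (wq_has_sleeper(&wq_head))	/* smp_mb(), then waitqueue_active() */
    	wake_up(&wq_head);
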
Kent Overstreet Dec. 25, 2024, 4:32 p.m. UTC | #6
On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > Don't you think the Cc list is a bit overloaded?
> > 
> > Indeed, my mail server doesn't let me reply-all.
> > 
> > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > +config PIPE_SKIP_SLEEPER
> > > > +	bool "Skip sleeping processes during pipe read/write"
> > > > +	default n
> > > 
> > > 'n' is the default 'default', no need to have this line.
> > 
> > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > that affect the behaviour of core code increase our testing burden, and
> > are another variable to account for when chasing down bugs, and the
> > potential overhead looks negligable.
> > 
> 
> I agree the behavior should not be guarded by an option. However,
> because of how wq_has_sleeper is implemented (see below) I would argue
> this needs to show how often locking can be avoided in real workloads.
> 
> The commit message does state this comes with a slowdown for cases which
> can't avoid wakeups, but as is I thought the submitter just meant an
> extra branch.
> 
> > Also, did you look at adding this optimization to wake_up()? No-op
> > wakeups are very common, I think this has wider applicability.
> 
> I was going to suggest it myself, but then:
> 
> static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> {
>         /*
>          * We need to be sure we are in sync with the
>          * add_wait_queue modifications to the wait queue.
>          *
>          * This memory barrier should be paired with one on the
>          * waiting side.
>          */
>         smp_mb();
>         return waitqueue_active(wq_head);
> }
> 
> Which means this is in fact quite expensive.
>
> Since wakeup is a lock + an interrupt trip, it would still be
> cheaper single-threaded to "merely" suffer a full fence and for cases
> where the queue is empty often enough this is definitely the right thing
> to do.

We're comparing against no-op wakeup. A real wakeup does an IPI, which
completely dwarfs the cost of a barrier.

And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
assume it's gotten better, but back when I was looking at waitqueues
nested pushf/popf was horrifically expensive.

But perhaps we can do this with just a release barrier? Similar to how
list_empty_careful() works.

> On the other hand this executing when the queue is mostly *not* empty
> would combat the point.
> 
> So unfortunately embedding this in wake_up is a no-go.

You definitely can't say that without knowing how often no-op
wake_up()s occur. It wouldn't be hard to gather that (write a patch to
add a pair of percpu counters, throw it on a few machines running random
workloads) and I think the results might surprise you.
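
A minimal sketch of that instrumentation (hypothetical counter names,
placed in kernel/sched/wait.c) might look like:

    DEFINE_PER_CPU(unsigned long, wake_up_total);	/* all wakeup attempts */
    DEFINE_PER_CPU(unsigned long, wake_up_noop);	/* ...that found no sleepers */

    static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
    			    int nr_exclusive, int wake_flags, void *key)
    {
    	lockdep_assert_held(&wq_head->lock);

    	this_cpu_inc(wake_up_total);
    	if (!waitqueue_active(wq_head))
    		this_cpu_inc(wake_up_noop);

    	/* ... existing wakeup loop unchanged ... */
    }

The counters could then be summed with for_each_possible_cpu() and
dumped via debugfs or a sysctl on demand.
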
WangYuli Dec. 25, 2024, 4:32 p.m. UTC | #7
Hi,

I've reviewed the Contributor Covenant and the Linux Kernel Contributor 
Covenant Code of Conduct Interpretation, and I couldn't find anything 
suggesting that CCing a large number of people is "unfriendly".

And while I don't believe my actions were malicious, I understand your 
concern.

Going forward, I'll be more considerate of the recipients when sending 
emails and will avoid CCing more than a hundred people at once in 
similar situations.


On 2024/12/26 00:00, Willy Tarreau wrote:

> (...)
>
> Sorry, but by CCing 191 random addresses like this,

I think there may be a misunderstanding.

All of these recipients can be found in the git history of fs/pipe.c.

> that's the best way
> to be added to .procmailrc and be completely ignored by everyone. At some
> point one should wonder whether that's a common practice or if such
> behaviors will be considered offensive by the majority. get_maintainer.pl
> only lists the 2 lists and 3 addresses I left in CC (after Kent and Andy
> whom I left since they replied to you).
>
>> This patch, for example, has been submitted multiple times without receiving
>> any response, unfortunately.
> It can happen, but sending to the right people and possibly resending if
> it gets lost is usually sufficient to attract attention. Sending to that
> many people make you look like someone feeling so important they need to
> shout in a loudspeaker to send orders to everyone.
Sincerely hope that's the case.
>   Please refrain from
> doing this in the future.
As above.

Thanks,
Willy Tarreau Dec. 25, 2024, 4:56 p.m. UTC | #8
On Thu, Dec 26, 2024 at 12:32:35AM +0800, WangYuli wrote:
> Hi,
> 
> I've reviewed the Contributor Covenant and the Linux Kernel Contributor
> Covenant Code of Conduct Interpretation, and I couldn't find anything
> suggesting that CCing a large number of people is "unfriendly".

This is unrelated, it's a matter of basic social interactions and respect
of others' time.

> And while I don't believe my actions were malicious, I understand your
> concern.
> 
> Going forward, I'll be more considerate of the recipients when sending
> emails and will avoid CCing more than a hundred people at once in similar
> situations.

"More than a hundred" ? Are you serious ? For what purpose ? I'll explain
something about how people consume e-mails: the first thing they
do if they hesitate to process it is to check if someone else in the To
or Cc is more knowledgeable or legitimate than them. If so they often
prefer to let others deal with it. With such a large list, it's impossible
to check all other addresses and virtually *everyone* will consider that
surely one of the hundreds of others is more legitimate. So the more people
you add, the less likely anyone will handle your request.

The common rules that apply here are not made out of nowhere but based on
what works best. Just send to the most relevant outputs of get_maintainer,
wait for one week and if you get no response it's just that these people
were busy and forgot about you, so just kindly ping again to make sure
your message was received. With such a large community it's normal to lose
some messages, and pinging again is normally not considered as an offence
so that's fine. If you get no response after multiple attempts, it might
mean there's something annoying in your message (like sending to tens or
hundreds of people at once).

And really, when a script tells you "send your message to these 3 people"
and you send it to 191, how can you imagine the recipients will think
anything different from "this person feels super important to spam
everyone like this". Why would they even respond if they feel like you
consider you have the right to spam everyone in the hope of getting your
orders processed immediately? I'm pretty sure that the majority will just
let it rot on principle.

Good luck,
Willy
Mateusz Guzik Dec. 25, 2024, 5:22 p.m. UTC | #9
On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> > On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > > Don't you think the Cc list is a bit overloaded?
> > >
> > > Indeed, my mail server doesn't let me reply-all.
> > >
> > > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > > +config PIPE_SKIP_SLEEPER
> > > > > +       bool "Skip sleeping processes during pipe read/write"
> > > > > +       default n
> > > >
> > > > 'n' is the default 'default', no need to have this line.
> > >
> > > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > > that affect the behaviour of core code increase our testing burden, and
> > > are another variable to account for when chasing down bugs, and the
> > > potential overhead looks negligable.
> > >
> >
> > I agree the behavior should not be guarded by an option. However,
> > because of how wq_has_sleeper is implemented (see below) I would argue
> > this needs to show how often locking can be avoided in real workloads.
> >
> > The commit message does state this comes with a slowdown for cases which
> > can't avoid wakeups, but as is I thought the submitter just meant an
> > extra branch.
> >
> > > Also, did you look at adding this optimization to wake_up()? No-op
> > > wakeups are very common, I think this has wider applicability.
> >
> > I was going to suggest it myself, but then:
> >
> > static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> > {
> >         /*
> >          * We need to be sure we are in sync with the
> >          * add_wait_queue modifications to the wait queue.
> >          *
> >          * This memory barrier should be paired with one on the
> >          * waiting side.
> >          */
> >         smp_mb();
> >         return waitqueue_active(wq_head);
> > }
> >
> > Which means this is in fact quite expensive.
> >
> > Since wakeup is a lock + an interrupt trip, it would still be
> > cheaper single-threaded to "merely" suffer a full fence and for cases
> > where the queue is empty often enough this is definitely the right thing
> > to do.
>
> We're comparing against no-op wakeup. A real wakeup does an IPI, which
> completely dwarfs the cost of a barrier.
>
> And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> assume it's gotten better, but back when I was looking at waitqueues
> nested pushf/popf was horrifically expensive.
>
> But perhaps can we do this with just a release barrier? Similar to how
> list_empty_careful() works.
>
> > On the other hand this executing when the queue is mostly *not* empty
> > would combat the point.
> >
> > So unfortunately embedding this in wake_up is a no-go.
>
> You definitely can't say that without knowing how often no-op
> wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> add a pair of percpu counters, throw it on a few machines running random
> workloads) and I think the results might surprise you.

There is some talking past each other here.

I explicitly noted one needs to check what happens in real workloads.

I very much expect there will be consumers where there are no waiters
almost every time and consumers which almost always do have them.

My claim is that this should be handled on a case-by-case basis.

So I whipped out a bpftrace one-liner to take a look at the kernel
build; details at the end.

In terms of the total (0 == no waiters, 1 == waiters):
[0, 1)            600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, ...)          457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |

There is some funzies in the vfs layer which I'm going to sort out myself.

The kernel is tags/next-20241220

As far as pipes go:

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up_sync_key+59
    pipe_read+385
]:
[0, 1)             10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

So this guy literally never had any waiters when wakeup was issued.
faddr2line claims line 405, which I presume is off by one:

   401         if (was_full)
   402                 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
   403         if (wake_next_reader)
   404         │       wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
   405         kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);

I'm guessing the real empty queue is rd_wait. Definitely a candidate
depending on other workloads, personally I would just patch it as is.

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up+54
    pipe_release+92
]:
[0, 1)             12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, ...)            5330 |@@@@@@@@@@@@@@@@@@@@@@                              |

a wash, would not touch that no matter what

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up+54
    pipe_release+110
]:
[0, 1)             17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

again no waiters, line claimed is 737, again off by one:
   733         /* Was that the last reader or writer, but not the other side? */
   734         if (!pipe->readers != !pipe->writers) {
   735         │       wake_up_interruptible_all(&pipe->rd_wait);
   736         │       wake_up_interruptible_all(&pipe->wr_wait);
   737         │       kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
   738         │       kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);

so I presume wr_wait? same comment as the entry at claimed line 405

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up_sync_key+59
    pipe_write+773
]:
[0, 1)             22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

again no waiters, claimed line 606
   604         if (wake_next_writer)
   605         │       wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
   606         if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {

again would be inclined to patch as is

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up_sync_key+59
    pipe_read+943
]:
[0, 1)              9488 |@@@@@@@@@@@@@                                       |
[1, ...)           35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

majority of the time there were waiters, would not touch regardless of
other workloads, line 403

   401         if (was_full)
   402                 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
   403         if (wake_next_reader)
   404         │       wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);

the wr_wait thing

@[
    wakeprobe+5
    __wake_up_common+63
    __wake_up_sync_key+59
    pipe_write+729
]:
[0, 1)            199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[1, ...)          376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

ditto concerning not touching, resolved to line 603

   601         if (was_empty || pipe->poll_usage)
   602         │       wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
   603         kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
   604         if (wake_next_writer)
   605                 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);

That is to say, as far as this workload goes, the submitted patch does
avoid some of the lock + irq trips by covering cases where no waiters
were seen, but it also adds the smp_mb thing where it does not help --
I would remove those spots from the submission.

While I agree a full-service treatment would require a bunch of
different workloads, personally I would be inclined to justify the
change merely based on the kernel build + leaving bpftrace running on
some real-world box running random crap.

As for obtaining such info, I failed to convince bpftrace to check if
the waiter list is empty. Instead I resorted to adding a dedicated
func which grabs the bit and probing on that. The func is in a
different file because gcc decided to omit the call otherwise.

one-liner:
bpftrace -e 'kprobe:wakeprobe { @[kstack(4)] = lhist(arg0, 0, 1, 1); }'

hack:
diff --git a/fs/file.c b/fs/file.c
index d868cdb95d1e..00d6a34eb174 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1442,3 +1442,11 @@ int iterate_fd(struct files_struct *files, unsigned n,
        return res;
 }
 EXPORT_SYMBOL(iterate_fd);
+
+
+void wakeprobe(int count);
+
+void wakeprobe(int count)
+{
+}
+EXPORT_SYMBOL(wakeprobe);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..8db7b0daf04b 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -57,6 +57,8 @@ void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry
 }
 EXPORT_SYMBOL(remove_wait_queue);

+void wakeprobe(int count);
+
 /*
  * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
@@ -77,6 +79,8 @@ static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,

        lockdep_assert_held(&wq_head->lock);

+       wakeprobe(wq_has_sleeper(wq_head));
+
        curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);

        if (&curr->entry == &wq_head->head)
Kent Overstreet Dec. 25, 2024, 5:41 p.m. UTC | #10
On Wed, Dec 25, 2024 at 06:22:49PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> > > On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > > > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > > > Don't you think the Cc list is a bit overloaded?
> > > >
> > > > Indeed, my mail server doesn't let me reply-all.
> > > >
> > > > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > > > +config PIPE_SKIP_SLEEPER
> > > > > > +       bool "Skip sleeping processes during pipe read/write"
> > > > > > +       default n
> > > > >
> > > > > 'n' is the default 'default', no need to have this line.
> > > >
> > > > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > > > that affect the behaviour of core code increase our testing burden, and
> > > > are another variable to account for when chasing down bugs, and the
> > > > potential overhead looks negligable.
> > > >
> > >
> > > I agree the behavior should not be guarded by an option. However,
> > > because of how wq_has_sleeper is implemented (see below) I would argue
> > > this needs to show how often locking can be avoided in real workloads.
> > >
> > > The commit message does state this comes with a slowdown for cases which
> > > can't avoid wakeups, but as is I thought the submitter just meant an
> > > extra branch.
> > >
> > > > Also, did you look at adding this optimization to wake_up()? No-op
> > > > wakeups are very common, I think this has wider applicability.
> > >
> > > I was going to suggest it myself, but then:
> > >
> > > static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> > > {
> > >         /*
> > >          * We need to be sure we are in sync with the
> > >          * add_wait_queue modifications to the wait queue.
> > >          *
> > >          * This memory barrier should be paired with one on the
> > >          * waiting side.
> > >          */
> > >         smp_mb();
> > >         return waitqueue_active(wq_head);
> > > }
> > >
> > > Which means this is in fact quite expensive.
> > >
> > > Since wakeup is a lock + an interrupt trip, it would still be
> > > cheaper single-threaded to "merely" suffer a full fence and for cases
> > > where the queue is empty often enough this is definitely the right thing
> > > to do.
> >
> > We're comparing against no-op wakeup. A real wakeup does an IPI, which
> > completely dwarfs the cost of a barrier.
> >
> > And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> > assume it's gotten better, but back when I was looking at waitqueues
> > nested pushf/popf was horrifically expensive.
> >
> > But perhaps can we do this with just a release barrier? Similar to how
> > list_empty_careful() works.
> >
> > > On the other hand this executing when the queue is mostly *not* empty
> > > would combat the point.
> > >
> > > So unfortunately embedding this in wake_up is a no-go.
> >
> > You definitely can't say that without knowing how often no-op
> > wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> > add a pair of percpu counters, throw it on a few machines running random
> > workloads) and I think the results might surprise you.
> 
> There is some talking past each other here.
> 
> I explicitly noted one needs to check what happens in real workloads.
> 
> I very much expect there will be consumers where there are no waiters
> almost every time and consumers which almost always do have them.
> 
> My claim is that this should be handled on a case-by-case basis.
> 
> So i whipped out a bpftrace one liner do take a look at the kernel
> build, details at the end.
> 
> In terms of the total (0 == no waiters, 1 == waiters):
> [0, 1)            600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...)          457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
> 
> There is some funzies in the vfs layer which I'm going to sort out myself.
> 
> The kernel is tags/next-20241220
> 
> As far as pipes go:
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_read+385
> ]:
> [0, 1)             10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> So this guy literally never had any waiters when wakeup was issued.
> faddr2line claims line 405, which I presume is off by one:
> 
>    401         if (was_full)
>    402                 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
>    403         if (wake_next_reader)
>    404         │       wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
>    405         kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> 
> I'm guessing the real empty queue is rd_wait. Definitely a candidate
> depending on other workloads, personally I would just patch it as is.
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up+54
>     pipe_release+92
> ]:
> [0, 1)             12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...)            5330 |@@@@@@@@@@@@@@@@@@@@@@                              |
> 
> a wash, would not touch that no matter what
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up+54
>     pipe_release+110
> ]:
> [0, 1)             17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> again no waiters, line claimed is 737, again off by one:
>    733         /* Was that the last reader or writer, but not the other side? */
>    734         if (!pipe->readers != !pipe->writers) {
>    735         │       wake_up_interruptible_all(&pipe->rd_wait);
>    736         │       wake_up_interruptible_all(&pipe->wr_wait);
>    737         │       kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>    738         │       kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
> 
> so I presume wr_wait? same comment as the entry at claimed line 405
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_write+773
> ]:
> [0, 1)             22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> again no waiters, claimed line 606
>    604         if (wake_next_writer)
>    605         │       wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
>    606         if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
> 
> again would be inclined to patch as is
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_read+943
> ]:
> [0, 1)              9488 |@@@@@@@@@@@@@                                       |
> [1, ...)           35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> majority of the time there were waiters, would not touch regardless of
> other workloads, line 403
> 
>    401         if (was_full)
>    402                 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
>    403         if (wake_next_reader)
>    404         │       wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
> 
> the wr_wait thing
> 
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_write+729
> ]:
> [0, 1)            199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
> [1, ...)          376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> 
> ditto concerning not touching, resolved to line 603
> 
>    601         if (was_empty || pipe->poll_usage)
>    602         │       wake_up_interruptible_sync_poll(&pipe->rd_wait,
> EPOLLIN | EPOLLRDNORM);
>    603         kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>    604         if (wake_next_writer)
>    605                 wake_up_interruptible_sync_poll(&pipe->wr_wait,
> EPOLLOUT | EPOLLWRNORM);
> 
> That is to say as far as this workload goes the submitted patch does
> avoid some of the lock + irq trips by covering cases where there no
> waiters seen in this workload, but also adds the smp_mb thing when it
> does not help -- I would remove those spots from the submission.

Neat use of bpf, although if you had to patch the kernel anyways I
would've just gone the percpu counter route... :)

Based on those numbers, even in the cases where wake-up dominates it
doesn't dominate by enough that I'd expect the patch to cause a
regression, at least if we can do it with a proper release barrier.

Why is smp_load_release() not a thing? Unusual I suppose, but it is what
we want here...

Patch

diff --git a/fs/Kconfig b/fs/Kconfig
index 64d420e3c475..0dacc46a73fe 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -429,4 +429,17 @@  source "fs/unicode/Kconfig"
 config IO_WQ
 	bool
 
+config PIPE_SKIP_SLEEPER
+	bool "Skip sleeping processes during pipe read/write"
+	default n
+	help
+	  This option introduces a check whether the sleep queue will
+	  be awakened during pipe read/write.
+
+	  It often leads to a performance improvement. However, in
+	  low-load or single-task scenarios, it may introduce minor
+	  performance overhead.
+
+	  If unsure, say N.
+
 endmenu
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..c085333ae72c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -247,6 +247,15 @@  static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
 	return tail;
 }
 
+static inline bool
+pipe_check_wq_has_sleeper(struct wait_queue_head *wq_head)
+{
+	if (IS_ENABLED(CONFIG_PIPE_SKIP_SLEEPER))
+		return wq_has_sleeper(wq_head);
+	else
+		return true;
+}
+
 static ssize_t
 pipe_read(struct kiocb *iocb, struct iov_iter *to)
 {
@@ -377,7 +386,7 @@  pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		 * _very_ unlikely case that the pipe was full, but we got
 		 * no data.
 		 */
-		if (unlikely(was_full))
+		if (unlikely(was_full) && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 			wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 
@@ -398,9 +407,9 @@  pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		wake_next_reader = false;
 	mutex_unlock(&pipe->mutex);
 
-	if (was_full)
+	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
-	if (wake_next_reader)
+	if (wake_next_reader && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	if (ret > 0)
@@ -573,7 +582,7 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 * become empty while we dropped the lock.
 		 */
 		mutex_unlock(&pipe->mutex);
-		if (was_empty)
+		if (was_empty && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
@@ -598,10 +607,10 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * Epoll nonsensically wants a wakeup whether the pipe
 	 * was already empty or not.
 	 */
-	if (was_empty || pipe->poll_usage)
+	if ((was_empty || pipe->poll_usage) && pipe_check_wq_has_sleeper(&pipe->rd_wait))
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
-	if (wake_next_writer)
+	if (wake_next_writer && pipe_check_wq_has_sleeper(&pipe->wr_wait))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
 		int err = file_update_time(filp);