Message ID: 20230821202829.2163744-1-mjguzik@gmail.com (mailing list archive)
Series: execve scalability issues, part 1
On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> > While with the patch these allocations remain a significant problem,
> > the primary bottleneck shifts to:
> >
> > __pv_queued_spin_lock_slowpath+1
> > _raw_spin_lock_irqsave+57
> > folio_lruvec_lock_irqsave+91
> > release_pages+590
> > tlb_batch_pages_flush+61
> > tlb_finish_mmu+101
> > exit_mmap+327
> > __mmput+61
> > begin_new_exec+1245
> > load_elf_binary+712
> > bprm_execve+644
> > do_execveat_common.isra.0+429
> > __x64_sys_execve+50
> > do_syscall_64+46
> > entry_SYSCALL_64_after_hwframe+110
>
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)

Funny you should ask ... these patches are from ~3 weeks ago.  They may
or may not apply to a current tree, but I'm working on this area.
Hello,

On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> To start I figured I'm going to bench about as friendly a case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
>
> I borrowed the bench found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> It prints a result every second (warning: first line is garbage).
>
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
>
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
>

Unfortunately it's taken me longer to get back to this, and I'm actually
not super happy with the results. I wrote a batch percpu allocation api.
It's better than the starting place, but I'm falling short on the free
path. I am/was also wrestling with the lifetime ideas (should these
lifetimes be percpu's problem or call-site bound like you've done).

What I like about this approach is that you group the percpu_counter lock
acquisition too, which batching percpu allocations wouldn't be able to
solve no matter how well I do. I'll review this more closely today.

> Even if this code could be patched to dodge these counters, I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
>
> With the fix I get an ok win, to quote from the commit:
> > Even at a very modest scale of 26 cores (ops/s):
> > before:  133543.63
> > after:   186061.81 (+39%)
>
> > While with the patch these allocations remain a significant problem,
> > the primary bottleneck shifts to:
> >
> > __pv_queued_spin_lock_slowpath+1
> > _raw_spin_lock_irqsave+57
> > folio_lruvec_lock_irqsave+91
> > release_pages+590
> > tlb_batch_pages_flush+61
> > tlb_finish_mmu+101
> > exit_mmap+327
> > __mmput+61
> > begin_new_exec+1245
> > load_elf_binary+712
> > bprm_execve+644
> > do_execveat_common.isra.0+429
> > __x64_sys_execve+50
> > do_syscall_64+46
> > entry_SYSCALL_64_after_hwframe+110
>
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)
>
> With this out of the way I'll be looking at some form of caching to
> eliminate these allocs as a problem.
>

I'm not against caching, this is just my first thought. Caching will
have an impact on the backing pages of percpu. All it takes is 1
allocation on a page for the current allocator to pin n pages of memory.
A few years ago percpu depopulation was implemented, so that limits the
amount of resident backing pages.

Maybe the right thing to do is preallocate pools of common sized
allocations so that way they can be recycled such that we don't have to
think too hard about fragmentation that can occur if we populate these
pools over time?

Also as you've pointed out, it wasn't just the percpu allocation being
the bottleneck, but percpu_counter's global lock too for hotplug
support. I'm hazarding a guess most use cases of percpu might have
additional locking requirements too such as percpu_counter.

Thanks,
Dennis

> Thoughts?
>
> Mateusz Guzik (2):
>   pcpcntr: add group allocation/free
>   fork: group allocation of per-cpu counters for mm struct
>
>  include/linux/percpu_counter.h | 19 ++++++++---
>  kernel/fork.c                  | 13 ++------
>  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>  3 files changed, 60 insertions(+), 33 deletions(-)
>
> --
> 2.39.2
>
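For readers who do not want to fetch doexec.c, here is a rough, self-contained
sketch of this kind of benchmark. It is not the referenced file; the helper
path, option handling and output format are made up for illustration -- the
workload is simply N workers each fork()+execve()-ing a tiny statically linked
program in a loop and reporting per-second totals.

/* exec-bench.c -- illustrative sketch only, NOT the doexec.c linked above.
 *
 *   cc -O2 -o exec-bench exec-bench.c
 *   ./exec-bench $(nproc) ./static-true
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static void worker(const char *helper)
{
	char *const args[] = { (char *)helper, NULL };
	unsigned long execs = 0;
	time_t last = time(NULL);

	for (;;) {
		pid_t pid = fork();

		if (pid == 0) {
			execv(helper, args);	/* helper should just exit(0) */
			_exit(1);		/* exec failed */
		}
		if (pid > 0 && waitpid(pid, NULL, 0) > 0)
			execs++;
		if (time(NULL) != last) {	/* report once a second */
			printf("%d: %lu execs/s\n", getpid(), execs);
			execs = 0;
			last = time(NULL);
		}
	}
}

int main(int argc, char **argv)
{
	int n = argc > 1 ? atoi(argv[1]) : 1;
	const char *helper = argc > 2 ? argv[2] : "./static-true";

	for (int i = 0; i < n; i++)
		if (fork() == 0)
			worker(helper);
	for (int i = 0; i < n; i++)
		wait(NULL);
	return 0;
}

Each exec of the (statically linked) helper goes through mm_alloc()/mm_init()
and __mmput(), which is what puts percpu_counter_init/_destroy on the profile
discussed in this thread.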
On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> > With this out of the way I'll be looking at some form of caching to
> > eliminate these allocs as a problem.
> >
>
> I'm not against caching, this is just my first thought. Caching will
> have an impact on the backing pages of percpu. All it takes is 1
> allocation on a page for the current allocator to pin n pages of memory.
> A few years ago percpu depopulation was implemented, so that limits the
> amount of resident backing pages.
>

I'm painfully aware.

> Maybe the right thing to do is preallocate pools of common sized
> allocations so that way they can be recycled such that we don't have to
> think too hard about fragmentation that can occur if we populate these
> pools over time?
>

This is what I was going to suggest :)

FreeBSD has a per-cpu allocator which pretends to be the same as the
slab allocator, except it hands out per-cpu bufs. So far it has sizes 4,
8, 16, 32 and 64 and you can act as if you are mallocing in that size.

Scales perfectly fine of course since it caches objs per-CPU, but there
is some waste and I have 0 idea how it compares to what Linux is doing
on that front.

I stress though that even if you were to carve out certain sizes, a
global lock to handle ops will still kill scalability.

Perhaps granularity better than global, but less than per-CPU, would be a
sweet spot for scalability vs memory waste.

That said...

> Also as you've pointed out, it wasn't just the percpu allocation being
> the bottleneck, but percpu_counter's global lock too for hotplug
> support. I'm hazarding a guess most use cases of percpu might have
> additional locking requirements too such as percpu_counter.
>

True Fix(tm) is a longer story.

Maybe let's sort out this patchset first, whichever way. :)

> Thanks,
> Dennis
>
> > Thoughts?
> >
> > Mateusz Guzik (2):
> >   pcpcntr: add group allocation/free
> >   fork: group allocation of per-cpu counters for mm struct
> >
> >  include/linux/percpu_counter.h | 19 ++++++++---
> >  kernel/fork.c                  | 13 ++------
> >  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
> >  3 files changed, 60 insertions(+), 33 deletions(-)
> >
> > --
> > 2.39.2
> >
On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
>> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
>> > With this out of the way I'll be looking at some form of caching to
>> > eliminate these allocs as a problem.
>> >
>>
>> I'm not against caching, this is just my first thought. Caching will
>> have an impact on the backing pages of percpu. All it takes is 1
>> allocation on a page for the current allocator to pin n pages of memory.
>> A few years ago percpu depopulation was implemented, so that limits the
>> amount of resident backing pages.
>>
>
> I'm painfully aware.
>
>> Maybe the right thing to do is preallocate pools of common sized
>> allocations so that way they can be recycled such that we don't have to
>> think too hard about fragmentation that can occur if we populate these
>> pools over time?
>>
>
> This is what I was going to suggest :)
>
> FreeBSD has a per-cpu allocator which pretends to be the same as the
> slab allocator, except it hands out per-cpu bufs. So far it has sizes 4,
> 8, 16, 32 and 64 and you can act as if you are mallocing in that size.
>
> Scales perfectly fine of course since it caches objs per-CPU, but there
> is some waste and I have 0 idea how it compares to what Linux is doing
> on that front.
>
> I stress though that even if you were to carve out certain sizes, a
> global lock to handle ops will still kill scalability.
>
> Perhaps granularity better than global, but less than per-CPU, would be a
> sweet spot for scalability vs memory waste.
>
> That said...
>
>> Also as you've pointed out, it wasn't just the percpu allocation being
>> the bottleneck, but percpu_counter's global lock too for hotplug
>> support. I'm hazarding a guess most use cases of percpu might have
>> additional locking requirements too such as percpu_counter.
>>
>
> True Fix(tm) is a longer story.
>
> Maybe let's sort out this patchset first, whichever way. :)
>

So I found the discussion around the original patch with a perf
regression report.

https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/

The reporter suggests dodging the problem by only allocating per-cpu
counters when the process is going multithreaded. Given that there is
still plenty of forever single-threaded procs out there I think that
does sound like a great plan regardless of what happens with this
patchset.

Almost all access is already done using dedicated routines, so this
should be an afternoon churn to sort out, unless I missed a
showstopper. (maybe there is no good place to stuff a flag/whatever
other indicator about the state of counters?)

That said I'll look into it some time this or next week.

>> Thanks,
>> Dennis
>>
>> > Thoughts?
>> >
>> > Mateusz Guzik (2):
>> >   pcpcntr: add group allocation/free
>> >   fork: group allocation of per-cpu counters for mm struct
>> >
>> >  include/linux/percpu_counter.h | 19 ++++++++---
>> >  kernel/fork.c                  | 13 ++------
>> >  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>> >  3 files changed, 60 insertions(+), 33 deletions(-)
>> >
>> > --
>> > 2.39.2
>> >
On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > True Fix(tm) is a longer story.
> >
> > Maybe let's sort out this patchset first, whichever way. :)
> >
>
> So I found the discussion around the original patch with a perf
> regression report.
>
> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>
> The reporter suggests dodging the problem by only allocating per-cpu
> counters when the process is going multithreaded. Given that there is
> still plenty of forever single-threaded procs out there I think that
> does sound like a great plan regardless of what happens with this
> patchset.
>
> Almost all access is already done using dedicated routines, so this
> should be an afternoon churn to sort out, unless I missed a
> showstopper. (maybe there is no good place to stuff a flag/whatever
> other indicator about the state of counters?)
>
> That said I'll look into it some time this or next week.

Good, just let me know how it went, I also wanted to start looking into
this to come up with some concrete patches :). What I had in mind was that
we could use 'counters == NULL' as an indication that the counter is still
in 'single counter mode'.

Honza
On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> > True Fix(tm) is a longer story.
>> >
>> > Maybe let's sort out this patchset first, whichever way. :)
>> >
>>
>> So I found the discussion around the original patch with a perf
>> regression report.
>>
>> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>>
>> The reporter suggests dodging the problem by only allocating per-cpu
>> counters when the process is going multithreaded. Given that there is
>> still plenty of forever single-threaded procs out there I think that
>> does sound like a great plan regardless of what happens with this
>> patchset.
>>
>> Almost all access is already done using dedicated routines, so this
>> should be an afternoon churn to sort out, unless I missed a
>> showstopper. (maybe there is no good place to stuff a flag/whatever
>> other indicator about the state of counters?)
>>
>> That said I'll look into it some time this or next week.
>
> Good, just let me know how it went, I also wanted to start looking into
> this to come up with some concrete patches :). What I had in mind was that
> we could use 'counters == NULL' as an indication that the counter is still
> in 'single counter mode'.
>

In the current state there are only pointers to counters in mm_struct
and there is no storage for them in task_struct. So I don't think
merely null-checking the per-cpu stuff is going to cut it -- where
should the single-threaded counters land?

Bonus problem, non-current can modify these counters and this needs to
be safe against current playing with them at the same time. (and it
would be a shame to require current to use atomics on them)

That said, my initial proposal adds a union:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e74ce4a28cd..ea70f0c08286 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -737,7 +737,11 @@ struct mm_struct {

 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

-	struct percpu_counter rss_stat[NR_MM_COUNTERS];
+	union {
+		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		u64 *rss_stat_single;
+	};
+	bool magic_flag_stuffed_elsewhere;

 	struct linux_binfmt *binfmt;

Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
counters * 2 -- the first set updated without any synchro by the current
thread. The second set is only to be modified by others and protected with
mm->arg_lock. The lock protects remote access to the union to begin
with.

Transition to per-CPU operation sets the magic flag (there is plenty
of spare space in mm_struct, I'll find a good home for it without
growing the struct). It would be a one-way street -- a process which
gets a bunch of threads and goes back to one stays with per-CPU.

Then you get the true value of something by adding both counters.

arg_lock is sparingly used, so remote ops are not expected to contend
with anything. In fact their cost is going to go down since percpu
summation takes a spinlock which also disables interrupts.

Local ops should be about the same in cost as they are right now.

I might have missed some detail in the above description, but I think
the approach is decent.
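To make the two-set idea above concrete, here is a rough sketch of how an
update could be routed. It is not from any posted patch: the helper name,
the flag name and the layout of rss_stat_single (2 * NR_MM_COUNTERS u64s,
first half owned by the sole thread, second half owned by remote updaters
under mm->arg_lock) are assumptions made purely for illustration; a read
would sum both halves, or the per-CPU counters once upgraded.

/*
 * Illustrative sketch only -- NOT from the patchset or any posted series.
 */
static void add_mm_counter_sketch(struct mm_struct *mm, int member, long value)
{
	if (!READ_ONCE(mm->magic_flag_stuffed_elsewhere)) {
		if (current->mm == mm) {
			/* Sole thread: plain, unsynchronized update. */
			mm->rss_stat_single[member] += value;
			return;
		}
		/* Remote updater: use the second set, under the lock. */
		spin_lock(&mm->arg_lock);
		if (!mm->magic_flag_stuffed_elsewhere) {
			mm->rss_stat_single[NR_MM_COUNTERS + member] += value;
			spin_unlock(&mm->arg_lock);
			return;
		}
		spin_unlock(&mm->arg_lock);
		/* Raced with the upgrade to per-CPU mode; fall through. */
	}
	percpu_counter_add(&mm->rss_stat[member], value);
}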
On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> > True Fix(tm) is a longer story.
> >> >
> >> > Maybe let's sort out this patchset first, whichever way. :)
> >> >
> >>
> >> So I found the discussion around the original patch with a perf
> >> regression report.
> >>
> >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> >>
> >> The reporter suggests dodging the problem by only allocating per-cpu
> >> counters when the process is going multithreaded. Given that there is
> >> still plenty of forever single-threaded procs out there I think that
> >> does sound like a great plan regardless of what happens with this
> >> patchset.
> >>
> >> Almost all access is already done using dedicated routines, so this
> >> should be an afternoon churn to sort out, unless I missed a
> >> showstopper. (maybe there is no good place to stuff a flag/whatever
> >> other indicator about the state of counters?)
> >>
> >> That said I'll look into it some time this or next week.
> >
> > Good, just let me know how it went, I also wanted to start looking into
> > this to come up with some concrete patches :). What I had in mind was
> > that we could use 'counters == NULL' as an indication that the counter
> > is still in 'single counter mode'.
> >
>
> In the current state there are only pointers to counters in mm_struct
> and there is no storage for them in task_struct. So I don't think
> merely null-checking the per-cpu stuff is going to cut it -- where
> should the single-threaded counters land?

I think you misunderstood. What I wanted to do is to provide a new flavor
of percpu_counter (sharing most of code and definitions) which would have
an option to start as a simple counter (indicated by pcc->counters == NULL
and using pcc->count for counting) and then be upgraded by a call to the
real percpu thing. Because I think such counters would be useful also on
other occasions than as rss counters.

> Bonus problem, non-current can modify these counters and this needs to
> be safe against current playing with them at the same time. (and it
> would be a shame to require current to use atomics on them)

Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
modifying the counters for other processes. Thanks for pointing this out.

> That said, my initial proposal adds a union:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5e74ce4a28cd..ea70f0c08286 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -737,7 +737,11 @@ struct mm_struct {
>
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>
> -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> +	union {
> +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> +		u64 *rss_stat_single;
> +	};
> +	bool magic_flag_stuffed_elsewhere;
>
>  	struct linux_binfmt *binfmt;
>
> Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> counters * 2 -- the first set updated without any synchro by the current
> thread. The second set is only to be modified by others and protected with
> mm->arg_lock. The lock protects remote access to the union to begin
> with.

arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
with two counters is clever but I'm not 100% convinced the complexity is
really worth it. I'm not sure the overhead of always using an atomic
counter would really be measurable as atomic counter ops in local CPU cache
tend to be cheap. Did you try to measure the difference?

If the second counter proves to be worth it, we could make just that one
atomic to avoid the need for abusing some spinlock.

> Transition to per-CPU operation sets the magic flag (there is plenty
> of spare space in mm_struct, I'll find a good home for it without
> growing the struct). It would be a one-way street -- a process which
> gets a bunch of threads and goes back to one stays with per-CPU.

Agreed with switching being a one-way street.

> Then you get the true value of something by adding both counters.
>
> arg_lock is sparingly used, so remote ops are not expected to contend
> with anything. In fact their cost is going to go down since percpu
> summation takes a spinlock which also disables interrupts.
>
> Local ops should be about the same in cost as they are right now.
>
> I might have missed some detail in the above description, but I think
> the approach is decent.

Honza
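For contrast with the previous sketch, here is what the "start simple,
upgrade to percpu later" counter flavor described above could look like.
This is not an existing API: the function names are made up, the field
names merely mirror struct percpu_counter (count, counters, lock), and
hotplug-list registration plus updaters racing with the upgrade are
deliberately ignored.

/* Illustrative sketch only -- not an existing interface. */
static inline void maybe_percpu_counter_add(struct percpu_counter *pcc,
					    s64 amount)
{
	if (!pcc->counters) {
		/* Single-counter mode: account directly in pcc->count. */
		unsigned long flags;

		raw_spin_lock_irqsave(&pcc->lock, flags);
		pcc->count += amount;
		raw_spin_unlock_irqrestore(&pcc->lock, flags);
		return;
	}
	/* Upgraded: behaves like a regular percpu_counter. */
	percpu_counter_add(pcc, amount);
}

/* One-way upgrade, e.g. when the process goes multithreaded. */
static inline int maybe_percpu_counter_upgrade(struct percpu_counter *pcc)
{
	s32 __percpu *counters = alloc_percpu_gfp(s32, GFP_KERNEL);

	if (!counters)
		return -ENOMEM;
	/* pcc->count keeps the sum so far; publish the percpu part last. */
	smp_store_release(&pcc->counters, counters);
	return 0;
}

The single-counter path above uses the existing lock for simplicity; making
pcc->count atomic instead is the alternative being debated in the rest of
the thread.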
From: Jan Kara
> Sent: Wednesday, August 23, 2023 10:49 AM
....
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> >
> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> >
> > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +	union {
> > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +		u64 *rss_stat_single;
> > +	};
> > +	bool magic_flag_stuffed_elsewhere;

I wouldn't use a union to save a pointer - it is asking for trouble.

> >
> >  	struct linux_binfmt *binfmt;
> >
> >
> > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> > counters * 2 -- the first set updated without any synchro by the current
> > thread. The second set is only to be modified by others and protected with
> > mm->arg_lock. The lock protects remote access to the union to begin
> > with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?

A separate lock is worse than atomics.
(Although some 32bit arch may have issues with 64bit atomics.)

I think you'll be surprised just how slow atomic ops are.
Even when present in the local cache.
(Probably because any other copies have to be invalidated.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 8/23/23, David Laight <David.Laight@aculab.com> wrote:
> From: Jan Kara
>> Sent: Wednesday, August 23, 2023 10:49 AM
> ....
>> > --- a/include/linux/mm_types.h
>> > +++ b/include/linux/mm_types.h
>> > @@ -737,7 +737,11 @@ struct mm_struct {
>> >
>> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>> >
>> > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
>> > +	union {
>> > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
>> > +		u64 *rss_stat_single;
>> > +	};
>> > +	bool magic_flag_stuffed_elsewhere;
>
> I wouldn't use a union to save a pointer - it is asking for trouble.
>

I may need to abandon this bit anyway -- counter init adds counters to a
global list and I can't easily call it like that.

>> >
>> >  	struct linux_binfmt *binfmt;
>> >
>> >
>> > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
>> > counters * 2 -- the first set updated without any synchro by the current
>> > thread. The second set is only to be modified by others and protected with
>> > mm->arg_lock. The lock protects remote access to the union to begin
>> > with.
>>
>> arg_lock seems a bit like a hack. How is it related to rss_stat? The
>> scheme with two counters is clever but I'm not 100% convinced the
>> complexity is really worth it. I'm not sure the overhead of always using
>> an atomic counter would really be measurable as atomic counter ops in
>> local CPU cache tend to be cheap. Did you try to measure the difference?
>
> A separate lock is worse than atomics.
> (Although some 32bit arch may have issues with 64bit atomics.)
>

But in my proposal the separate lock is used to facilitate *NOT* using
atomics by the most common consumer -- the only thread. The lock is only
used for the transition to the multithreaded state and for updates by
remote parties (both rare compared to updates by current).

> I think you'll be surprised just how slow atomic ops are.
> Even when present in the local cache.
> (Probably because any other copies have to be invalidated.)
>

Agreed. They have always been super expensive on x86-64 (and continue to
be). I keep running into claims they are not, I don't know where that's
coming from.
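For anyone who would rather measure than take either side's word for it, a
rough userspace sketch follows. It is not from the thread: a single-threaded,
cache-hot comparison of a plain volatile increment against a lock-prefixed
(atomic) one; numbers will vary by microarchitecture.

/* atomic-vs-plain.c -- illustrative microbenchmark sketch.
 *   cc -O2 -o atomic-vs-plain atomic-vs-plain.c
 */
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

static double since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) +
	       (now.tv_nsec - start->tv_nsec) / 1e9;
}

int main(void)
{
	static volatile unsigned long plain_ctr, atomic_ctr;
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < ITERS; i++)
		plain_ctr = plain_ctr + 1;	/* plain load + add + store */
	printf("plain : %.2f ns/op\n", since(&start) * 1e9 / ITERS);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < ITERS; i++)
		/* compiles to a lock-prefixed RMW on x86-64 */
		__atomic_fetch_add(&atomic_ctr, 1, __ATOMIC_RELAXED);
	printf("atomic: %.2f ns/op\n", since(&start) * 1e9 / ITERS);
	return 0;
}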
On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
>> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >> > True Fix(tm) is a longer story.
>> >> >
>> >> > Maybe let's sort out this patchset first, whichever way. :)
>> >> >
>> >>
>> >> So I found the discussion around the original patch with a perf
>> >> regression report.
>> >>
>> >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>> >>
>> >> The reporter suggests dodging the problem by only allocating per-cpu
>> >> counters when the process is going multithreaded. Given that there is
>> >> still plenty of forever single-threaded procs out there I think that
>> >> does sound like a great plan regardless of what happens with this
>> >> patchset.
>> >>
>> >> Almost all access is already done using dedicated routines, so this
>> >> should be an afternoon churn to sort out, unless I missed a
>> >> showstopper. (maybe there is no good place to stuff a flag/whatever
>> >> other indicator about the state of counters?)
>> >>
>> >> That said I'll look into it some time this or next week.
>> >
>> > Good, just let me know how it went, I also wanted to start looking into
>> > this to come up with some concrete patches :). What I had in mind was
>> > that we could use 'counters == NULL' as an indication that the counter
>> > is still in 'single counter mode'.
>> >
>>
>> In the current state there are only pointers to counters in mm_struct
>> and there is no storage for them in task_struct. So I don't think
>> merely null-checking the per-cpu stuff is going to cut it -- where
>> should the single-threaded counters land?
>
> I think you misunderstood. What I wanted to do is to provide a new flavor
> of percpu_counter (sharing most of code and definitions) which would have
> an option to start as a simple counter (indicated by pcc->counters == NULL
> and using pcc->count for counting) and then be upgraded by a call to the
> real percpu thing. Because I think such counters would be useful also on
> other occasions than as rss counters.
>

Indeed I did -- I had tunnel vision on dodging atomics for current given
remote modifications, which won't be done in your proposal.

I concede your idea solves the problem at hand; I question whether it is
the right thing to do though. Not my call to make.

>> Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
>> counters * 2 -- the first set updated without any synchro by the current
>> thread. The second set is only to be modified by others and protected with
>> mm->arg_lock. The lock protects remote access to the union to begin
>> with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?
>

arg_lock would not stay as is -- it would have to be renamed to something
more generic.

Atomics on x86-64 are very expensive to this very day. Here is a
sample measurement of 2 atomics showing up, done by someone else:
https://lore.kernel.org/oe-lkp/202308141149.d38fdf91-oliver.sang@intel.com/T/#u

tl;dr it is *really* bad.

> If the second counter proves to be worth it, we could make just that one
> atomic to avoid the need for abusing some spinlock.
>

The spinlock would be there to synchronize against the transition to
per-cpu -- any trickery is avoided and we trivially know for a fact
the remote party either sees the per-cpu state if transitioned, or the
local one if not. Then one easily knows no updates have been lost and
the buf for the 2 sets of counters can be safely freed.

While writing down the idea previously I did not realize the per-cpu
counter ops disable interrupts around the op. That's already very slow
and the trip should be comparable to paying for an atomic (as in the
patch which introduced percpu counters here slowed things down for
single-threaded processes).

With your proposal the atomic would be there, but the interrupt trip
could be avoided. This would roughly maintain the current cost of doing
the op (as in it would not get /worse/). My patch would make it lower.

All that said, I'm going to refrain from writing a patch for the time
being. If the powers that be decide on your approach, I'm not going to
argue -- I don't think either is a clear winner over the other.
On Wed 23-08-23 14:13:20, Mateusz Guzik wrote:
> On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> >> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> >> Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> >> counters * 2 -- the first set updated without any synchro by the current
> >> thread. The second set is only to be modified by others and protected with
> >> mm->arg_lock. The lock protects remote access to the union to begin
> >> with.
> >
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> >
>
> arg_lock would not stay as is -- it would have to be renamed to something
> more generic.

Ah, OK.

> Atomics on x86-64 are very expensive to this very day. Here is a
> sample measurement of 2 atomics showing up, done by someone else:
> https://lore.kernel.org/oe-lkp/202308141149.d38fdf91-oliver.sang@intel.com/T/#u
>
> tl;dr it is *really* bad.

I didn't express myself well. Sure atomics are expensive compared to plain
arithmetic operations. But I wanted to say - we had atomics for RSS
counters before commit f1a7941243 ("mm: convert mm's rss stats into
percpu_counter") and people seemed happy with it until there were many CPUs
contending on the updates. So maybe RSS counters aren't used heavily enough
for the difference to practically matter? Probably an operation like
faulting in (or unmapping) a tmpfs file has the highest chance of showing
the cost of rss accounting compared to the cost of the remainder of the
operation...

> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
>
> The spinlock would be there to synchronize against the transition to
> per-cpu -- any trickery is avoided and we trivially know for a fact
> the remote party either sees the per-cpu state if transitioned, or the
> local one if not. Then one easily knows no updates have been lost and
> the buf for the 2 sets of counters can be safely freed.

Yeah, the spinlock makes the transition simpler, I agree.

> While writing down the idea previously I did not realize the per-cpu
> counter ops disable interrupts around the op. That's already very slow
> and the trip should be comparable to paying for an atomic (as in the
> patch which introduced percpu counters here slowed things down for
> single-threaded processes).
>
> With your proposal the atomic would be there, but the interrupt trip
> could be avoided. This would roughly maintain the current cost of doing
> the op (as in it would not get /worse/). My patch would make it lower.
>
> All that said, I'm going to refrain from writing a patch for the time
> being. If the powers that be decide on your approach, I'm not going to
> argue -- I don't think either is a clear winner over the other.

I guess we'll need to code it and compare the results :)

Honza
On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> I didn't express myself well. Sure atomics are expensive compared to plain
> arithmetic operations. But I wanted to say - we had atomics for RSS
> counters before commit f1a7941243 ("mm: convert mm's rss stats into
> percpu_counter") and people seemed happy with it until there were many CPUs
> contending on the updates. So maybe RSS counters aren't used heavily enough
> for the difference to practically matter? Probably an operation like
> faulting in (or unmapping) a tmpfs file has the highest chance of showing
> the cost of rss accounting compared to the cost of the remainder of the
> operation...
>

These stats used to be decentralized by storing them in task_struct;
the commit complains about values deviating too much.

The value would get synced every 64 uses, from the diff:

-/* sync counter once per 64 page faults */
-#define TASK_RSS_EVENTS_THRESH	(64)
-static void check_sync_rss_stat(struct task_struct *task)
-{
-	if (unlikely(task != current))
-		return;
-	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
-		sync_mm_rss(task->mm);
-}

other than that it was a non-atomic update in struct thread:

-static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
-{
-	struct task_struct *task = current;
-
-	if (likely(task->mm == mm))
-		task->rss_stat.count[member] += val;
-	else
-		add_mm_counter(mm, member, val);
-}

So the question is how much does this matter. My personal approach is
that avoidable slowdowns (like atomics here) only facilitate further
avoidable slowdowns, as people can claim there is a minuscule change in
% relative to the baseline. But if the baseline is already slow....

Anyhow, I just found that the patch failed to completely remove
SPLIT_RSS_COUNTING. I'm going to submit something about that later.
On Wed 23-08-23 18:10:29, Mateusz Guzik wrote:
> On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> > I didn't express myself well. Sure atomics are expensive compared to plain
> > arithmetic operations. But I wanted to say - we had atomics for RSS
> > counters before commit f1a7941243 ("mm: convert mm's rss stats into
> > percpu_counter") and people seemed happy with it until there were many CPUs
> > contending on the updates. So maybe RSS counters aren't used heavily enough
> > for the difference to practically matter? Probably an operation like
> > faulting in (or unmapping) a tmpfs file has the highest chance of showing
> > the cost of rss accounting compared to the cost of the remainder of the
> > operation...
> >
>
> These stats used to be decentralized by storing them in task_struct;
> the commit complains about values deviating too much.
>
> The value would get synced every 64 uses, from the diff:
>
> -/* sync counter once per 64 page faults */
> -#define TASK_RSS_EVENTS_THRESH	(64)
> -static void check_sync_rss_stat(struct task_struct *task)
> -{
> -	if (unlikely(task != current))
> -		return;
> -	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
> -		sync_mm_rss(task->mm);
> -}
>
> other than that it was a non-atomic update in struct thread:
>
> -static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> -{
> -	struct task_struct *task = current;
> -
> -	if (likely(task->mm == mm))
> -		task->rss_stat.count[member] += val;
> -	else
> -		add_mm_counter(mm, member, val);
> -}

Ah, I see. I already forgot these details since I was checking the
regression back in spring. Now I've just seen the atomic_long_t counters
in task_struct and forgot there used to be also these per-thread ones.
Thanks for refreshing my memory!

> So the question is how much does this matter. My personal approach is
> that avoidable slowdowns (like atomics here) only facilitate further
> avoidable slowdowns, as people can claim there is a minuscule change in
> % relative to the baseline. But if the baseline is already slow....

I get your point but over the years I've also learned that premature
optimization isn't good either as we will be dragging the maintenance
burden for a long time ;) It's a balance.

Honza
On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> On Wed 23-08-23 18:10:29, Mateusz Guzik wrote:
>> So the question is how much does this matter. My personal approach is
>> that avoidable slowdowns (like atomics here) only facilitate further
>> avoidable slowdowns, as people can claim there is a minuscule change in
>> % relative to the baseline. But if the baseline is already slow....
>
> I get your point but over the years I've also learned that premature
> optimization isn't good either as we will be dragging the maintenance
> burden for a long time ;) It's a balance.
>

Mate, your proposal is not maintenance burden-free either. ;)

I claim mine is simpler and faster for the single-threaded case, but it is
not generic, so should another consumer show up with the same single- vs
multithreaded need, there may or may not be more work to accommodate it.
On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> > True Fix(tm) is a longer story.
> > >> >
> > >> > Maybe let's sort out this patchset first, whichever way. :)
> > >> >
> > >>
> > >> So I found the discussion around the original patch with a perf
> > >> regression report.
> > >>
> > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > >>
> > >> The reporter suggests dodging the problem by only allocating per-cpu
> > >> counters when the process is going multithreaded. Given that there is
> > >> still plenty of forever single-threaded procs out there I think that
> > >> does sound like a great plan regardless of what happens with this
> > >> patchset.
> > >>
> > >> Almost all access is already done using dedicated routines, so this
> > >> should be an afternoon churn to sort out, unless I missed a
> > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > >> other indicator about the state of counters?)
> > >>
> > >> That said I'll look into it some time this or next week.
> > >
> > > Good, just let me know how it went, I also wanted to start looking into
> > > this to come up with some concrete patches :). What I had in mind was
> > > that we could use 'counters == NULL' as an indication that the counter
> > > is still in 'single counter mode'.
> > >
> >
> > In the current state there are only pointers to counters in mm_struct
> > and there is no storage for them in task_struct. So I don't think
> > merely null-checking the per-cpu stuff is going to cut it -- where
> > should the single-threaded counters land?
>
> I think you misunderstood. What I wanted to do is to provide a new flavor
> of percpu_counter (sharing most of code and definitions) which would have
> an option to start as a simple counter (indicated by pcc->counters == NULL
> and using pcc->count for counting) and then be upgraded by a call to the
> real percpu thing. Because I think such counters would be useful also on
> other occasions than as rss counters.
>

Kent wrote something similar and sent it out last year [1]. However, the
case slightly differs from what we'd want here: instead of going percpu on
the 1 -> 2 thread transition, it upgrades based on the update rate, which
a single thread might be able to trigger?

[1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@google.com/

Thanks,
Dennis

> > Bonus problem, non-current can modify these counters and this needs to
> > be safe against current playing with them at the same time. (and it
> > would be a shame to require current to use atomics on them)
>
> Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> modifying the counters for other processes. Thanks for pointing this out.
>
> > That said, my initial proposal adds a union:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5e74ce4a28cd..ea70f0c08286 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> >
> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> >
> > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +	union {
> > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +		u64 *rss_stat_single;
> > +	};
> > +	bool magic_flag_stuffed_elsewhere;
> >
> >  	struct linux_binfmt *binfmt;
> >
> > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> > counters * 2 -- the first set updated without any synchro by the current
> > thread. The second set is only to be modified by others and protected with
> > mm->arg_lock. The lock protects remote access to the union to begin
> > with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?
>
> If the second counter proves to be worth it, we could make just that one
> atomic to avoid the need for abusing some spinlock.
>
> > Transition to per-CPU operation sets the magic flag (there is plenty
> > of spare space in mm_struct, I'll find a good home for it without
> > growing the struct). It would be a one-way street -- a process which
> > gets a bunch of threads and goes back to one stays with per-CPU.
>
> Agreed with switching being a one-way street.
>
> > Then you get the true value of something by adding both counters.
> >
> > arg_lock is sparingly used, so remote ops are not expected to contend
> > with anything. In fact their cost is going to go down since percpu
> > summation takes a spinlock which also disables interrupts.
> >
> > Local ops should be about the same in cost as they are right now.
> >
> > I might have missed some detail in the above description, but I think
> > the approach is decent.
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Wed 23-08-23 13:27:56, Dennis Zhou wrote:
> On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > > On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > > >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >> > True Fix(tm) is a longer story.
> > > >> >
> > > >> > Maybe let's sort out this patchset first, whichever way. :)
> > > >> >
> > > >>
> > > >> So I found the discussion around the original patch with a perf
> > > >> regression report.
> > > >>
> > > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > > >>
> > > >> The reporter suggests dodging the problem by only allocating per-cpu
> > > >> counters when the process is going multithreaded. Given that there is
> > > >> still plenty of forever single-threaded procs out there I think that
> > > >> does sound like a great plan regardless of what happens with this
> > > >> patchset.
> > > >>
> > > >> Almost all access is already done using dedicated routines, so this
> > > >> should be an afternoon churn to sort out, unless I missed a
> > > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > > >> other indicator about the state of counters?)
> > > >>
> > > >> That said I'll look into it some time this or next week.
> > > >
> > > > Good, just let me know how it went, I also wanted to start looking into
> > > > this to come up with some concrete patches :). What I had in mind was
> > > > that we could use 'counters == NULL' as an indication that the counter
> > > > is still in 'single counter mode'.
> > > >
> > >
> > > In the current state there are only pointers to counters in mm_struct
> > > and there is no storage for them in task_struct. So I don't think
> > > merely null-checking the per-cpu stuff is going to cut it -- where
> > > should the single-threaded counters land?
> >
> > I think you misunderstood. What I wanted to do is to provide a new flavor
> > of percpu_counter (sharing most of code and definitions) which would have
> > an option to start as a simple counter (indicated by pcc->counters == NULL
> > and using pcc->count for counting) and then be upgraded by a call to the
> > real percpu thing. Because I think such counters would be useful also on
> > other occasions than as rss counters.
> >
>
> Kent wrote something similar and sent it out last year [1]. However, the
> case slightly differs from what we'd want here: instead of going percpu on
> the 1 -> 2 thread transition, it upgrades based on the update rate, which
> a single thread might be able to trigger?

Thanks for the pointer but that version of counters is not really suitable
here as is (but we could factor out some common bits if that work is
happening). 1 thread can easily do 10000 RSS updates per second.

> [1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@google.com/

Honza

> > > Bonus problem, non-current can modify these counters and this needs to
> > > be safe against current playing with them at the same time. (and it
> > > would be a shame to require current to use atomics on them)
> >
> > Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> > modifying the counters for other processes. Thanks for pointing this out.
> >
> > > That said, my initial proposal adds a union:
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 5e74ce4a28cd..ea70f0c08286 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -737,7 +737,11 @@ struct mm_struct {
> > >
> > >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> > >
> > > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > +	union {
> > > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > +		u64 *rss_stat_single;
> > > +	};
> > > +	bool magic_flag_stuffed_elsewhere;
> > >
> > >  	struct linux_binfmt *binfmt;
> > >
> > > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> > > counters * 2 -- the first set updated without any synchro by the current
> > > thread. The second set is only to be modified by others and protected with
> > > mm->arg_lock. The lock protects remote access to the union to begin
> > > with.
> >
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> >
> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
> >
> > > Transition to per-CPU operation sets the magic flag (there is plenty
> > > of spare space in mm_struct, I'll find a good home for it without
> > > growing the struct). It would be a one-way street -- a process which
> > > gets a bunch of threads and goes back to one stays with per-CPU.
> >
> > Agreed with switching being a one-way street.
> >
> > > Then you get the true value of something by adding both counters.
> > >
> > > arg_lock is sparingly used, so remote ops are not expected to contend
> > > with anything. In fact their cost is going to go down since percpu
> > > summation takes a spinlock which also disables interrupts.
> > >
> > > Local ops should be about the same in cost as they are right now.
> > >
> > > I might have missed some detail in the above description, but I think
> > > the approach is decent.
> >
> > Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> To start I figured I'm going to bench about as friendly a case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
>
> I borrowed the bench found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> It prints a result every second (warning: first line is garbage).
>
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
>
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
>
> Even if this code could be patched to dodge these counters, I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
>
> With the fix I get an ok win, to quote from the commit:
>> Even at a very modest scale of 26 cores (ops/s):
>> before:  133543.63
>> after:   186061.81 (+39%)
>

So to sum up, a v3 of the patchset is queued up here:
https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-next

For the interested, I temporarily got my hands on something exceeding the
hand-watch scale benched above -- a 192-way AMD EPYC 7R13 box (2 sockets x
48 cores x 2 threads).

A 6.5 kernel + the patchset only gets south of 140k execs/s when running:
./static-doexec 192

According to perf top:

  51.04%  [kernel]  [k] osq_lock
   6.82%  [kernel]  [k] __raw_callee_save___kvm_vcpu_is_preempted
   2.98%  [kernel]  [k] _atomic_dec_and_lock_irqsave
   1.62%  [kernel]  [k] rcu_cblist_dequeue
   1.54%  [kernel]  [k] refcount_dec_not_one
   1.51%  [kernel]  [k] __mod_lruvec_page_state
   1.46%  [kernel]  [k] put_cred_rcu
   1.34%  [kernel]  [k] native_queued_spin_lock_slowpath
   0.94%  [kernel]  [k] srso_alias_safe_ret
   0.81%  [kernel]  [k] memset_orig
   0.77%  [kernel]  [k] unmap_page_range
   0.73%  [kernel]  [k] _compound_head
   0.72%  [kernel]  [k] kmem_cache_free

Then bpftrace -e 'kprobe:osq_lock { @[kstack()] = count(); }' shows:

@[
    osq_lock+1
    __mutex_lock_killable_slowpath+19
    mutex_lock_killable+62
    pcpu_alloc+1219
    __alloc_percpu_gfp+18
    __percpu_counter_init_many+43
    mm_init+727
    mm_alloc+78
    alloc_bprm+138
    do_execveat_common.isra.0+103
    __x64_sys_execve+55
    do_syscall_64+54
    entry_SYSCALL_64_after_hwframe+110
]: 637370
@[
    osq_lock+1
    __mutex_lock_killable_slowpath+19
    mutex_lock_killable+62
    pcpu_alloc+1219
    __alloc_percpu+21
    mm_init+577
    mm_alloc+78
    alloc_bprm+138
    do_execveat_common.isra.0+103
    __x64_sys_execve+55
    do_syscall_64+54
    entry_SYSCALL_64_after_hwframe+110
]: 638036

That is, per-cpu allocation is still on top at this scale. But more
importantly there are *TWO* unrelated back-to-back per-cpu allocs -- one
by the rss counters and one by mm_alloc_cid.

That is to say, per-cpu alloc scalability definitely needs to get fixed.
I'll ponder about it.
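As a closing illustration of what the "grouping" buys at the call site, here
is a sketch of the fork-side usage. The *_many() signatures are inferred from
the series title and from the __percpu_counter_init_many frame visible in the
trace above; treat them, and the wrapper names, as assumptions rather than the
final interface -- consult the queued patches for the real one.

/* Hypothetical wrappers, for illustration only. */
static int mm_init_rss_stat(struct mm_struct *mm)
{
	/*
	 * Before: NR_MM_COUNTERS separate percpu allocations, each taking
	 * the percpu allocator's mutex and the percpu_counter hotplug lock:
	 *
	 *	for (i = 0; i < NR_MM_COUNTERS; i++)
	 *		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL))
	 *			goto undo;
	 *
	 * After: one grouped allocation and one lock round trip.
	 */
	return percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL,
					NR_MM_COUNTERS);
}

static void mm_destroy_rss_stat(struct mm_struct *mm)
{
	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
}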