Message ID: 20230821202829.2163744-1-mjguzik@gmail.com (mailing list archive)
Series: execve scalability issues, part 1
On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> > While with the patch these allocations remain a significant problem,
> > the primary bottleneck shifts to:
> >
> > __pv_queued_spin_lock_slowpath+1
> > _raw_spin_lock_irqsave+57
> > folio_lruvec_lock_irqsave+91
> > release_pages+590
> > tlb_batch_pages_flush+61
> > tlb_finish_mmu+101
> > exit_mmap+327
> > __mmput+61
> > begin_new_exec+1245
> > load_elf_binary+712
> > bprm_execve+644
> > do_execveat_common.isra.0+429
> > __x64_sys_execve+50
> > do_syscall_64+46
> > entry_SYSCALL_64_after_hwframe+110
>
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)

Funny you should ask ... these patches are from ~3 weeks ago.  They may
or may not apply to a current tree, but I'm working on this area.
Hello,

On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> To start I figured I'm going to bench about as friendly a case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
>
> I borrowed the bench found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> It prints a result every second (warning: first line is garbage).
>
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
>
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
>

Unfortunately it's taken me longer to get back to this, and I'm actually
not super happy with the results. I wrote a batch percpu allocation api.
It's better than the starting place, but I'm falling short on the free
path. I am/was also wrestling with the lifetime ideas (should these
lifetimes be percpu's problem or call-site bound like you've done).

What I like about this approach is that you group the percpu_counter lock
acquisition too, which batching percpu allocations wouldn't be able to
solve no matter how well I do. I'll review this more closely today.

> Even if this code could be patched to dodge these counters, I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
>
> With the fix I get an ok win, to quote from the commit:
> > Even at a very modest scale of 26 cores (ops/s):
> > before:  133543.63
> > after:   186061.81 (+39%)
>
> > While with the patch these allocations remain a significant problem,
> > the primary bottleneck shifts to:
> >
> > __pv_queued_spin_lock_slowpath+1
> > _raw_spin_lock_irqsave+57
> > folio_lruvec_lock_irqsave+91
> > release_pages+590
> > tlb_batch_pages_flush+61
> > tlb_finish_mmu+101
> > exit_mmap+327
> > __mmput+61
> > begin_new_exec+1245
> > load_elf_binary+712
> > bprm_execve+644
> > do_execveat_common.isra.0+429
> > __x64_sys_execve+50
> > do_syscall_64+46
> > entry_SYSCALL_64_after_hwframe+110
>
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)
>
> With this out of the way I'll be looking at some form of caching to
> eliminate these allocs as a problem.
>

I'm not against caching, this is just my first thought. Caching will
have an impact on the backing pages of percpu. All it takes is 1
allocation on a page for the current allocator to pin n pages of memory.
A few years ago percpu depopulation was implemented, so that limits the
amount of resident backing pages.

Maybe the right thing to do is preallocate pools of common sized
allocations so that way they can be recycled such that we don't have to
think too hard about fragmentation that can occur if we populate these
pools over time?

Also as you've pointed out, it wasn't just the percpu allocation being
the bottleneck, but percpu_counter's global lock too for hotplug
support. I'm hazarding a guess most use cases of percpu might have
additional locking requirements too such as percpu_counter.

Thanks,
Dennis

> Thoughts?
>
> Mateusz Guzik (2):
>   pcpcntr: add group allocation/free
>   fork: group allocation of per-cpu counters for mm struct
>
>  include/linux/percpu_counter.h | 19 ++++++++---
>  kernel/fork.c                  | 13 ++------
>  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>  3 files changed, 60 insertions(+), 33 deletions(-)
>
> --
> 2.39.2
>
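For readers who do not want to fetch doexec.c, here is a rough, self-contained
sketch of this kind of benchmark. It is not the referenced file; the helper
path, option handling and output format are made up for illustration -- the
workload is simply N workers each fork()+execve()-ing a tiny statically linked
program in a loop and reporting per-second totals.

/* exec-bench.c -- illustrative sketch only, NOT the doexec.c linked above.
 *
 *   cc -O2 -o exec-bench exec-bench.c
 *   ./exec-bench $(nproc) ./static-true
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static void worker(const char *helper)
{
	char *const args[] = { (char *)helper, NULL };
	unsigned long execs = 0;
	time_t last = time(NULL);

	for (;;) {
		pid_t pid = fork();

		if (pid == 0) {
			execv(helper, args);	/* helper should just exit(0) */
			_exit(1);		/* exec failed */
		}
		if (pid > 0 && waitpid(pid, NULL, 0) > 0)
			execs++;
		if (time(NULL) != last) {	/* report once a second */
			printf("%d: %lu execs/s\n", getpid(), execs);
			execs = 0;
			last = time(NULL);
		}
	}
}

int main(int argc, char **argv)
{
	int n = argc > 1 ? atoi(argv[1]) : 1;
	const char *helper = argc > 2 ? argv[2] : "./static-true";

	for (int i = 0; i < n; i++)
		if (fork() == 0)
			worker(helper);
	for (int i = 0; i < n; i++)
		wait(NULL);
	return 0;
}

Each exec of the (statically linked) helper goes through mm_alloc()/mm_init()
and __mmput(), which is what puts percpu_counter_init/_destroy on the profile
discussed in this thread.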
On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> > With this out of the way I'll be looking at some form of caching to
> > eliminate these allocs as a problem.
> >
>
> I'm not against caching, this is just my first thought. Caching will
> have an impact on the backing pages of percpu. All it takes is 1
> allocation on a page for the current allocator to pin n pages of memory.
> A few years ago percpu depopulation was implemented, so that limits the
> amount of resident backing pages.
>

I'm painfully aware.

> Maybe the right thing to do is preallocate pools of common sized
> allocations so that way they can be recycled such that we don't have to
> think too hard about fragmentation that can occur if we populate these
> pools over time?
>

This is what I was going to suggest :)

FreeBSD has a per-cpu allocator which pretends to be the same as the
slab allocator, except it hands out per-cpu bufs. So far it has sizes 4,
8, 16, 32 and 64 and you can act as if you are mallocing in that size.

Scales perfectly fine of course since it caches objs per-CPU, but there
is some waste and I have 0 idea how it compares to what Linux is doing
on that front.

I stress though that even if you were to carve out certain sizes, a
global lock to handle ops will still kill scalability.

Perhaps granularity better than global, but less than per-CPU, would be a
sweet spot for scalability vs memory waste.

That said...

> Also as you've pointed out, it wasn't just the percpu allocation being
> the bottleneck, but percpu_counter's global lock too for hotplug
> support. I'm hazarding a guess most use cases of percpu might have
> additional locking requirements too such as percpu_counter.
>

True Fix(tm) is a longer story.

Maybe let's sort out this patchset first, whichever way. :)

> Thanks,
> Dennis
>
> > Thoughts?
> >
> > Mateusz Guzik (2):
> >   pcpcntr: add group allocation/free
> >   fork: group allocation of per-cpu counters for mm struct
> >
> >  include/linux/percpu_counter.h | 19 ++++++++---
> >  kernel/fork.c                  | 13 ++------
> >  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
> >  3 files changed, 60 insertions(+), 33 deletions(-)
> >
> > --
> > 2.39.2
> >
On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On Mon, Aug 21, 2023 at 02:07:28PM -0700, Dennis Zhou wrote:
>> On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
>> > With this out of the way I'll be looking at some form of caching to
>> > eliminate these allocs as a problem.
>> >
>>
>> I'm not against caching, this is just my first thought. Caching will
>> have an impact on the backing pages of percpu. All it takes is 1
>> allocation on a page for the current allocator to pin n pages of memory.
>> A few years ago percpu depopulation was implemented, so that limits the
>> amount of resident backing pages.
>>
>
> I'm painfully aware.
>
>> Maybe the right thing to do is preallocate pools of common sized
>> allocations so that way they can be recycled such that we don't have to
>> think too hard about fragmentation that can occur if we populate these
>> pools over time?
>>
>
> This is what I was going to suggest :)
>
> FreeBSD has a per-cpu allocator which pretends to be the same as the
> slab allocator, except it hands out per-cpu bufs. So far it has sizes 4,
> 8, 16, 32 and 64 and you can act as if you are mallocing in that size.
>
> Scales perfectly fine of course since it caches objs per-CPU, but there
> is some waste and I have 0 idea how it compares to what Linux is doing
> on that front.
>
> I stress though that even if you were to carve out certain sizes, a
> global lock to handle ops will still kill scalability.
>
> Perhaps granularity better than global, but less than per-CPU, would be a
> sweet spot for scalability vs memory waste.
>
> That said...
>
>> Also as you've pointed out, it wasn't just the percpu allocation being
>> the bottleneck, but percpu_counter's global lock too for hotplug
>> support. I'm hazarding a guess most use cases of percpu might have
>> additional locking requirements too such as percpu_counter.
>>
>
> True Fix(tm) is a longer story.
>
> Maybe let's sort out this patchset first, whichever way. :)
>

So I found the discussion around the original patch with a perf
regression report.

https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/

The reporter suggests dodging the problem by only allocating per-cpu
counters when the process is going multithreaded. Given that there is
still plenty of forever single-threaded procs out there I think that
does sound like a great plan regardless of what happens with this
patchset.

Almost all access is already done using dedicated routines, so this
should be an afternoon churn to sort out, unless I missed a
showstopper. (maybe there is no good place to stuff a flag/whatever
other indicator about the state of counters?)

That said I'll look into it some time this or next week.

>> Thanks,
>> Dennis
>>
>> > Thoughts?
>> >
>> > Mateusz Guzik (2):
>> >   pcpcntr: add group allocation/free
>> >   fork: group allocation of per-cpu counters for mm struct
>> >
>> >  include/linux/percpu_counter.h | 19 ++++++++---
>> >  kernel/fork.c                  | 13 ++------
>> >  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>> >  3 files changed, 60 insertions(+), 33 deletions(-)
>> >
>> > --
>> > 2.39.2
>> >
On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > True Fix(tm) is a longer story.
> >
> > Maybe let's sort out this patchset first, whichever way. :)
> >
>
> So I found the discussion around the original patch with a perf
> regression report.
>
> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>
> The reporter suggests dodging the problem by only allocating per-cpu
> counters when the process is going multithreaded. Given that there is
> still plenty of forever single-threaded procs out there I think that
> does sound like a great plan regardless of what happens with this
> patchset.
>
> Almost all access is already done using dedicated routines, so this
> should be an afternoon churn to sort out, unless I missed a
> showstopper. (maybe there is no good place to stuff a flag/whatever
> other indicator about the state of counters?)
>
> That said I'll look into it some time this or next week.

Good, just let me know how it went, I also wanted to start looking into
this to come up with some concrete patches :). What I had in mind was that
we could use 'counters == NULL' as an indication that the counter is still
in 'single counter mode'.

Honza
On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> > True Fix(tm) is a longer story.
>> >
>> > Maybe let's sort out this patchset first, whichever way. :)
>> >
>>
>> So I found the discussion around the original patch with a perf
>> regression report.
>>
>> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>>
>> The reporter suggests dodging the problem by only allocating per-cpu
>> counters when the process is going multithreaded. Given that there is
>> still plenty of forever single-threaded procs out there I think that
>> does sound like a great plan regardless of what happens with this
>> patchset.
>>
>> Almost all access is already done using dedicated routines, so this
>> should be an afternoon churn to sort out, unless I missed a
>> showstopper. (maybe there is no good place to stuff a flag/whatever
>> other indicator about the state of counters?)
>>
>> That said I'll look into it some time this or next week.
>
> Good, just let me know how it went, I also wanted to start looking into
> this to come up with some concrete patches :). What I had in mind was that
> we could use 'counters == NULL' as an indication that the counter is still
> in 'single counter mode'.
>

In the current state there are only pointers to counters in mm_struct
and there is no storage for them in task_struct. So I don't think
merely null-checking the per-cpu stuff is going to cut it -- where
should the single-threaded counters land?

Bonus problem, non-current can modify these counters and this needs to
be safe against current playing with them at the same time. (and it
would be a shame to require current to use atomics on them)

That said, my initial proposal adds a union:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e74ce4a28cd..ea70f0c08286 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -737,7 +737,11 @@ struct mm_struct {

 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

-	struct percpu_counter rss_stat[NR_MM_COUNTERS];
+	union {
+		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		u64 *rss_stat_single;
+	};
+	bool magic_flag_stuffed_elsewhere;

 	struct linux_binfmt *binfmt;

Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
counters * 2 -- the first set updated without any synchro by the current
thread. The second set is only to be modified by others and protected with
mm->arg_lock. The lock protects remote access to the union to begin
with.

Transition to per-CPU operation sets the magic flag (there is plenty
of spare space in mm_struct, I'll find a good home for it without
growing the struct). It would be a one-way street -- a process which
gets a bunch of threads and goes back to one stays with per-CPU.

Then you get the true value of something by adding both counters.

arg_lock is sparingly used, so remote ops are not expected to contend
with anything. In fact their cost is going to go down since percpu
summation takes a spinlock which also disables interrupts.

Local ops should be about the same in cost as they are right now.

I might have missed some detail in the above description, but I think
the approach is decent.
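To make the two-set idea above concrete, here is a rough sketch of how an
update could be routed. It is not from any posted patch: the helper name,
the flag name and the layout of rss_stat_single (2 * NR_MM_COUNTERS u64s,
first half owned by the sole thread, second half owned by remote updaters
under mm->arg_lock) are assumptions made purely for illustration; a read
would sum both halves, or the per-CPU counters once upgraded.

/*
 * Illustrative sketch only -- NOT from the patchset or any posted series.
 */
static void add_mm_counter_sketch(struct mm_struct *mm, int member, long value)
{
	if (!READ_ONCE(mm->magic_flag_stuffed_elsewhere)) {
		if (current->mm == mm) {
			/* Sole thread: plain, unsynchronized update. */
			mm->rss_stat_single[member] += value;
			return;
		}
		/* Remote updater: use the second set, under the lock. */
		spin_lock(&mm->arg_lock);
		if (!mm->magic_flag_stuffed_elsewhere) {
			mm->rss_stat_single[NR_MM_COUNTERS + member] += value;
			spin_unlock(&mm->arg_lock);
			return;
		}
		spin_unlock(&mm->arg_lock);
		/* Raced with the upgrade to per-CPU mode; fall through. */
	}
	percpu_counter_add(&mm->rss_stat[member], value);
}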
On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >> > True Fix(tm) is a longer story.
> >> >
> >> > Maybe let's sort out this patchset first, whichever way. :)
> >> >
> >>
> >> So I found the discussion around the original patch with a perf
> >> regression report.
> >>
> >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> >>
> >> The reporter suggests dodging the problem by only allocating per-cpu
> >> counters when the process is going multithreaded. Given that there is
> >> still plenty of forever single-threaded procs out there I think that
> >> does sound like a great plan regardless of what happens with this
> >> patchset.
> >>
> >> Almost all access is already done using dedicated routines, so this
> >> should be an afternoon churn to sort out, unless I missed a
> >> showstopper. (maybe there is no good place to stuff a flag/whatever
> >> other indicator about the state of counters?)
> >>
> >> That said I'll look into it some time this or next week.
> >
> > Good, just let me know how it went, I also wanted to start looking into
> > this to come up with some concrete patches :). What I had in mind was
> > that we could use 'counters == NULL' as an indication that the counter
> > is still in 'single counter mode'.
> >
>
> In the current state there are only pointers to counters in mm_struct
> and there is no storage for them in task_struct. So I don't think
> merely null-checking the per-cpu stuff is going to cut it -- where
> should the single-threaded counters land?

I think you misunderstood. What I wanted to do is to provide a new flavor
of percpu_counter (sharing most of code and definitions) which would have
an option to start as a simple counter (indicated by pcc->counters == NULL
and using pcc->count for counting) and then be upgraded by a call to the
real percpu thing. Because I think such counters would be useful also on
other occasions than as rss counters.

> Bonus problem, non-current can modify these counters and this needs to
> be safe against current playing with them at the same time. (and it
> would be a shame to require current to use atomics on them)

Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
modifying the counters for other processes. Thanks for pointing this out.

> That said, my initial proposal adds a union:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5e74ce4a28cd..ea70f0c08286 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -737,7 +737,11 @@ struct mm_struct {
>
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>
> -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> +	union {
> +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> +		u64 *rss_stat_single;
> +	};
> +	bool magic_flag_stuffed_elsewhere;
>
>  	struct linux_binfmt *binfmt;
>
> Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> counters * 2 -- the first set updated without any synchro by the current
> thread. The second set is only to be modified by others and protected with
> mm->arg_lock. The lock protects remote access to the union to begin
> with.

arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
with two counters is clever but I'm not 100% convinced the complexity is
really worth it. I'm not sure the overhead of always using an atomic
counter would really be measurable as atomic counter ops in local CPU cache
tend to be cheap. Did you try to measure the difference?

If the second counter proves to be worth it, we could make just that one
atomic to avoid the need for abusing some spinlock.

> Transition to per-CPU operation sets the magic flag (there is plenty
> of spare space in mm_struct, I'll find a good home for it without
> growing the struct). It would be a one-way street -- a process which
> gets a bunch of threads and goes back to one stays with per-CPU.

Agreed with switching being a one-way street.

> Then you get the true value of something by adding both counters.
>
> arg_lock is sparingly used, so remote ops are not expected to contend
> with anything. In fact their cost is going to go down since percpu
> summation takes a spinlock which also disables interrupts.
>
> Local ops should be about the same in cost as they are right now.
>
> I might have missed some detail in the above description, but I think
> the approach is decent.

Honza
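For contrast with the previous sketch, here is what the "start simple,
upgrade to percpu later" counter flavor described above could look like.
This is not an existing API: the function names are made up, the field
names merely mirror struct percpu_counter (count, counters, lock), and
hotplug-list registration plus updaters racing with the upgrade are
deliberately ignored.

/* Illustrative sketch only -- not an existing interface. */
static inline void maybe_percpu_counter_add(struct percpu_counter *pcc,
					    s64 amount)
{
	if (!pcc->counters) {
		/* Single-counter mode: account directly in pcc->count. */
		unsigned long flags;

		raw_spin_lock_irqsave(&pcc->lock, flags);
		pcc->count += amount;
		raw_spin_unlock_irqrestore(&pcc->lock, flags);
		return;
	}
	/* Upgraded: behaves like a regular percpu_counter. */
	percpu_counter_add(pcc, amount);
}

/* One-way upgrade, e.g. when the process goes multithreaded. */
static inline int maybe_percpu_counter_upgrade(struct percpu_counter *pcc)
{
	s32 __percpu *counters = alloc_percpu_gfp(s32, GFP_KERNEL);

	if (!counters)
		return -ENOMEM;
	/* pcc->count keeps the sum so far; publish the percpu part last. */
	smp_store_release(&pcc->counters, counters);
	return 0;
}

The single-counter path above uses the existing lock for simplicity; making
pcc->count atomic instead is the alternative being debated in the rest of
the thread.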
From: Jan Kara
> Sent: Wednesday, August 23, 2023 10:49 AM
....
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> >
> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> >
> > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +	union {
> > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +		u64 *rss_stat_single;
> > +	};
> > +	bool magic_flag_stuffed_elsewhere;

I wouldn't use a union to save a pointer - it is asking for trouble.

> >
> >  	struct linux_binfmt *binfmt;
> >
> >
> > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> > counters * 2 -- the first set updated without any synchro by the current
> > thread. The second set is only to be modified by others and protected with
> > mm->arg_lock. The lock protects remote access to the union to begin
> > with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?

A separate lock is worse than atomics.
(Although some 32bit arch may have issues with 64bit atomics.)

I think you'll be surprised just how slow atomic ops are.
Even when present in the local cache.
(Probably because any other copies have to be invalidated.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 8/23/23, David Laight <David.Laight@aculab.com> wrote:
> From: Jan Kara
>> Sent: Wednesday, August 23, 2023 10:49 AM
> ....
>> > --- a/include/linux/mm_types.h
>> > +++ b/include/linux/mm_types.h
>> > @@ -737,7 +737,11 @@ struct mm_struct {
>> >
>> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>> >
>> > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
>> > +	union {
>> > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
>> > +		u64 *rss_stat_single;
>> > +	};
>> > +	bool magic_flag_stuffed_elsewhere;
>
> I wouldn't use a union to save a pointer - it is asking for trouble.
>

I may need to abandon this bit anyway -- counter init adds counters to a
global list and I can't easily call it like that.

>> >
>> >  	struct linux_binfmt *binfmt;
>> >
>> >
>> > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
>> > counters * 2 -- the first set updated without any synchro by the current
>> > thread. The second set is only to be modified by others and protected with
>> > mm->arg_lock. The lock protects remote access to the union to begin
>> > with.
>>
>> arg_lock seems a bit like a hack. How is it related to rss_stat? The
>> scheme with two counters is clever but I'm not 100% convinced the
>> complexity is really worth it. I'm not sure the overhead of always using
>> an atomic counter would really be measurable as atomic counter ops in
>> local CPU cache tend to be cheap. Did you try to measure the difference?
>
> A separate lock is worse than atomics.
> (Although some 32bit arch may have issues with 64bit atomics.)
>

But in my proposal the separate lock is used to facilitate *NOT* using
atomics by the most common consumer -- the only thread. The lock is only
used for the transition to the multithreaded state and for updates by
remote parties (both rare compared to updates by current).

> I think you'll be surprised just how slow atomic ops are.
> Even when present in the local cache.
> (Probably because any other copies have to be invalidated.)
>

Agreed. They have always been super expensive on x86-64 (and continue to
be). I keep running into claims they are not, I don't know where that's
coming from.
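For anyone who would rather measure than take either side's word for it, a
rough userspace sketch follows. It is not from the thread: a single-threaded,
cache-hot comparison of a plain volatile increment against a lock-prefixed
(atomic) one; numbers will vary by microarchitecture.

/* atomic-vs-plain.c -- illustrative microbenchmark sketch.
 *   cc -O2 -o atomic-vs-plain atomic-vs-plain.c
 */
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

static double since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) +
	       (now.tv_nsec - start->tv_nsec) / 1e9;
}

int main(void)
{
	static volatile unsigned long plain_ctr, atomic_ctr;
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < ITERS; i++)
		plain_ctr = plain_ctr + 1;	/* plain load + add + store */
	printf("plain : %.2f ns/op\n", since(&start) * 1e9 / ITERS);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (unsigned long i = 0; i < ITERS; i++)
		/* compiles to a lock-prefixed RMW on x86-64 */
		__atomic_fetch_add(&atomic_ctr, 1, __ATOMIC_RELAXED);
	printf("atomic: %.2f ns/op\n", since(&start) * 1e9 / ITERS);
	return 0;
}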
On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
>> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> >> > True Fix(tm) is a longer story.
>> >> >
>> >> > Maybe let's sort out this patchset first, whichever way. :)
>> >> >
>> >>
>> >> So I found the discussion around the original patch with a perf
>> >> regression report.
>> >>
>> >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>> >>
>> >> The reporter suggests dodging the problem by only allocating per-cpu
>> >> counters when the process is going multithreaded. Given that there is
>> >> still plenty of forever single-threaded procs out there I think that
>> >> does sound like a great plan regardless of what happens with this
>> >> patchset.
>> >>
>> >> Almost all access is already done using dedicated routines, so this
>> >> should be an afternoon churn to sort out, unless I missed a
>> >> showstopper. (maybe there is no good place to stuff a flag/whatever
>> >> other indicator about the state of counters?)
>> >>
>> >> That said I'll look into it some time this or next week.
>> >
>> > Good, just let me know how it went, I also wanted to start looking into
>> > this to come up with some concrete patches :). What I had in mind was
>> > that we could use 'counters == NULL' as an indication that the counter
>> > is still in 'single counter mode'.
>> >
>>
>> In the current state there are only pointers to counters in mm_struct
>> and there is no storage for them in task_struct. So I don't think
>> merely null-checking the per-cpu stuff is going to cut it -- where
>> should the single-threaded counters land?
>
> I think you misunderstood. What I wanted to do is to provide a new flavor
> of percpu_counter (sharing most of code and definitions) which would have
> an option to start as a simple counter (indicated by pcc->counters == NULL
> and using pcc->count for counting) and then be upgraded by a call to the
> real percpu thing. Because I think such counters would be useful also on
> other occasions than as rss counters.
>

Indeed I did -- I had tunnel vision on dodging atomics for current given
remote modifications, which won't be done in your proposal.

I concede your idea solves the problem at hand; I question whether it is
the right thing to do though. Not my call to make.

>> Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
>> counters * 2 -- the first set updated without any synchro by the current
>> thread. The second set is only to be modified by others and protected with
>> mm->arg_lock. The lock protects remote access to the union to begin
>> with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?
>

arg_lock would not stay as is -- it would have to be renamed to something
more generic.

Atomics on x86-64 are very expensive to this very day. Here is a
sample measurement of 2 atomics showing up, done by someone else:
https://lore.kernel.org/oe-lkp/202308141149.d38fdf91-oliver.sang@intel.com/T/#u

tl;dr it is *really* bad.

> If the second counter proves to be worth it, we could make just that one
> atomic to avoid the need for abusing some spinlock.
>

The spinlock would be there to synchronize against the transition to
per-cpu -- any trickery is avoided and we trivially know for a fact
the remote party either sees the per-cpu state if transitioned, or the
local one if not. Then one easily knows no updates have been lost and
the buf for the 2 sets of counters can be safely freed.

While writing down the idea previously I did not realize the per-cpu
counter ops disable interrupts around the op. That's already very slow
and the trip should be comparable to paying for an atomic (as in the
patch which introduced percpu counters here slowed things down for
single-threaded processes).

With your proposal the atomic would be there, but the interrupt trip
could be avoided. This would roughly maintain the current cost of doing
the op (as in it would not get /worse/). My patch would make it lower.

All that said, I'm going to refrain from writing a patch for the time
being. If the powers that be decide on your approach, I'm not going to
argue -- I don't think either is a clear winner over the other.
On Wed 23-08-23 14:13:20, Mateusz Guzik wrote:
> On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> >> On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> >> Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> >> counters * 2 -- the first set updated without any synchro by the current
> >> thread. The second set is only to be modified by others and protected with
> >> mm->arg_lock. The lock protects remote access to the union to begin
> >> with.
> >
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> >
>
> arg_lock would not stay as is -- it would have to be renamed to something
> more generic.

Ah, OK.

> Atomics on x86-64 are very expensive to this very day. Here is a
> sample measurement of 2 atomics showing up, done by someone else:
> https://lore.kernel.org/oe-lkp/202308141149.d38fdf91-oliver.sang@intel.com/T/#u
>
> tl;dr it is *really* bad.

I didn't express myself well. Sure atomics are expensive compared to plain
arithmetic operations. But I wanted to say - we had atomics for RSS
counters before commit f1a7941243 ("mm: convert mm's rss stats into
percpu_counter") and people seemed happy with it until there were many CPUs
contending on the updates. So maybe RSS counters aren't used heavily enough
for the difference to practically matter? Probably an operation like
faulting in (or unmapping) a tmpfs file has the highest chance of showing
the cost of rss accounting compared to the cost of the remainder of the
operation...

> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
>
> The spinlock would be there to synchronize against the transition to
> per-cpu -- any trickery is avoided and we trivially know for a fact
> the remote party either sees the per-cpu state if transitioned, or the
> local one if not. Then one easily knows no updates have been lost and
> the buf for the 2 sets of counters can be safely freed.

Yeah, the spinlock makes the transition simpler, I agree.

> While writing down the idea previously I did not realize the per-cpu
> counter ops disable interrupts around the op. That's already very slow
> and the trip should be comparable to paying for an atomic (as in the
> patch which introduced percpu counters here slowed things down for
> single-threaded processes).
>
> With your proposal the atomic would be there, but the interrupt trip
> could be avoided. This would roughly maintain the current cost of doing
> the op (as in it would not get /worse/). My patch would make it lower.
>
> All that said, I'm going to refrain from writing a patch for the time
> being. If the powers that be decide on your approach, I'm not going to
> argue -- I don't think either is a clear winner over the other.

I guess we'll need to code it and compare the results :)

Honza
On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> I didn't express myself well. Sure atomics are expensive compared to plain
> arithmetic operations. But I wanted to say - we had atomics for RSS
> counters before commit f1a7941243 ("mm: convert mm's rss stats into
> percpu_counter") and people seemed happy with it until there were many CPUs
> contending on the updates. So maybe RSS counters aren't used heavily enough
> for the difference to practically matter? Probably an operation like
> faulting in (or unmapping) a tmpfs file has the highest chance of showing
> the cost of rss accounting compared to the cost of the remainder of the
> operation...
>

These stats used to be decentralized by storing them in task_struct;
the commit complains about values deviating too much.

The value would get synced every 64 uses, from the diff:

-/* sync counter once per 64 page faults */
-#define TASK_RSS_EVENTS_THRESH	(64)
-static void check_sync_rss_stat(struct task_struct *task)
-{
-	if (unlikely(task != current))
-		return;
-	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
-		sync_mm_rss(task->mm);
-}

other than that it was a non-atomic update in struct thread:

-static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
-{
-	struct task_struct *task = current;
-
-	if (likely(task->mm == mm))
-		task->rss_stat.count[member] += val;
-	else
-		add_mm_counter(mm, member, val);
-}

So the question is how much does this matter. My personal approach is
that avoidable slowdowns (like atomics here) only facilitate further
avoidable slowdowns, as people can claim there is a minuscule change in
% relative to the baseline. But if the baseline is already slow....

Anyhow, I just found that the patch failed to completely remove
SPLIT_RSS_COUNTING. I'm going to submit something about that later.
On Wed 23-08-23 18:10:29, Mateusz Guzik wrote:
> On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> > I didn't express myself well. Sure atomics are expensive compared to plain
> > arithmetic operations. But I wanted to say - we had atomics for RSS
> > counters before commit f1a7941243 ("mm: convert mm's rss stats into
> > percpu_counter") and people seemed happy with it until there were many CPUs
> > contending on the updates. So maybe RSS counters aren't used heavily enough
> > for the difference to practically matter? Probably an operation like
> > faulting in (or unmapping) a tmpfs file has the highest chance of showing
> > the cost of rss accounting compared to the cost of the remainder of the
> > operation...
> >
>
> These stats used to be decentralized by storing them in task_struct;
> the commit complains about values deviating too much.
>
> The value would get synced every 64 uses, from the diff:
>
> -/* sync counter once per 64 page faults */
> -#define TASK_RSS_EVENTS_THRESH	(64)
> -static void check_sync_rss_stat(struct task_struct *task)
> -{
> -	if (unlikely(task != current))
> -		return;
> -	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
> -		sync_mm_rss(task->mm);
> -}
>
> other than that it was a non-atomic update in struct thread:
>
> -static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> -{
> -	struct task_struct *task = current;
> -
> -	if (likely(task->mm == mm))
> -		task->rss_stat.count[member] += val;
> -	else
> -		add_mm_counter(mm, member, val);
> -}

Ah, I see. I already forgot these details since I was checking the
regression back in spring. Now I've just seen the atomic_long_t counters
in task_struct and forgot there used to be also these per-thread ones.
Thanks for refreshing my memory!

> So the question is how much does this matter. My personal approach is
> that avoidable slowdowns (like atomics here) only facilitate further
> avoidable slowdowns, as people can claim there is a minuscule change in
> % relative to the baseline. But if the baseline is already slow....

I get your point but over the years I've also learned that premature
optimization isn't good either as we will be dragging the maintenance
burden for a long time ;) It's a balance.

Honza
On 8/23/23, Jan Kara <jack@suse.cz> wrote:
> On Wed 23-08-23 18:10:29, Mateusz Guzik wrote:
>> So the question is how much does this matter. My personal approach is
>> that avoidable slowdowns (like atomics here) only facilitate further
>> avoidable slowdowns, as people can claim there is a minuscule change in
>> % relative to the baseline. But if the baseline is already slow....
>
> I get your point but over the years I've also learned that premature
> optimization isn't good either as we will be dragging the maintenance
> burden for a long time ;) It's a balance.
>

Mate, your proposal is not maintenance burden-free either. ;)

I claim mine is simpler and faster for the single-threaded case, but it is
not generic, so should another consumer show up with the same single- vs
multithreaded need, there may or may not be more work to accommodate it.
On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >> > True Fix(tm) is a longer story.
> > >> >
> > >> > Maybe let's sort out this patchset first, whichever way. :)
> > >> >
> > >>
> > >> So I found the discussion around the original patch with a perf
> > >> regression report.
> > >>
> > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > >>
> > >> The reporter suggests dodging the problem by only allocating per-cpu
> > >> counters when the process is going multithreaded. Given that there is
> > >> still plenty of forever single-threaded procs out there I think that
> > >> does sound like a great plan regardless of what happens with this
> > >> patchset.
> > >>
> > >> Almost all access is already done using dedicated routines, so this
> > >> should be an afternoon churn to sort out, unless I missed a
> > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > >> other indicator about the state of counters?)
> > >>
> > >> That said I'll look into it some time this or next week.
> > >
> > > Good, just let me know how it went, I also wanted to start looking into
> > > this to come up with some concrete patches :). What I had in mind was
> > > that we could use 'counters == NULL' as an indication that the counter
> > > is still in 'single counter mode'.
> > >
> >
> > In the current state there are only pointers to counters in mm_struct
> > and there is no storage for them in task_struct. So I don't think
> > merely null-checking the per-cpu stuff is going to cut it -- where
> > should the single-threaded counters land?
>
> I think you misunderstood. What I wanted to do is to provide a new flavor
> of percpu_counter (sharing most of code and definitions) which would have
> an option to start as a simple counter (indicated by pcc->counters == NULL
> and using pcc->count for counting) and then be upgraded by a call to the
> real percpu thing. Because I think such counters would be useful also on
> other occasions than as rss counters.
>

Kent wrote something similar and sent it out last year [1]. However, the
case slightly differs from what we'd want here: instead of going percpu on
the 1 -> 2 thread transition, it upgrades based on the update rate, which
a single thread might be able to trigger?

[1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@google.com/

Thanks,
Dennis

> > Bonus problem, non-current can modify these counters and this needs to
> > be safe against current playing with them at the same time. (and it
> > would be a shame to require current to use atomics on them)
>
> Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> modifying the counters for other processes. Thanks for pointing this out.
>
> > That said, my initial proposal adds a union:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5e74ce4a28cd..ea70f0c08286 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> >
> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> >
> > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +	union {
> > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +		u64 *rss_stat_single;
> > +	};
> > +	bool magic_flag_stuffed_elsewhere;
> >
> >  	struct linux_binfmt *binfmt;
> >
> > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> > counters * 2 -- the first set updated without any synchro by the current
> > thread. The second set is only to be modified by others and protected with
> > mm->arg_lock. The lock protects remote access to the union to begin
> > with.
>
> arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> with two counters is clever but I'm not 100% convinced the complexity is
> really worth it. I'm not sure the overhead of always using an atomic
> counter would really be measurable as atomic counter ops in local CPU cache
> tend to be cheap. Did you try to measure the difference?
>
> If the second counter proves to be worth it, we could make just that one
> atomic to avoid the need for abusing some spinlock.
>
> > Transition to per-CPU operation sets the magic flag (there is plenty
> > of spare space in mm_struct, I'll find a good home for it without
> > growing the struct). It would be a one-way street -- a process which
> > gets a bunch of threads and goes back to one stays with per-CPU.
>
> Agreed with switching being a one-way street.
>
> > Then you get the true value of something by adding both counters.
> >
> > arg_lock is sparingly used, so remote ops are not expected to contend
> > with anything. In fact their cost is going to go down since percpu
> > summation takes a spinlock which also disables interrupts.
> >
> > Local ops should be about the same in cost as they are right now.
> >
> > I might have missed some detail in the above description, but I think
> > the approach is decent.
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Wed 23-08-23 13:27:56, Dennis Zhou wrote:
> On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > > On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > > >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >> > True Fix(tm) is a longer story.
> > > >> >
> > > >> > Maybe let's sort out this patchset first, whichever way. :)
> > > >> >
> > > >>
> > > >> So I found the discussion around the original patch with a perf
> > > >> regression report.
> > > >>
> > > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > > >>
> > > >> The reporter suggests dodging the problem by only allocating per-cpu
> > > >> counters when the process is going multithreaded. Given that there is
> > > >> still plenty of forever single-threaded procs out there I think that
> > > >> does sound like a great plan regardless of what happens with this
> > > >> patchset.
> > > >>
> > > >> Almost all access is already done using dedicated routines, so this
> > > >> should be an afternoon churn to sort out, unless I missed a
> > > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > > >> other indicator about the state of counters?)
> > > >>
> > > >> That said I'll look into it some time this or next week.
> > > >
> > > > Good, just let me know how it went, I also wanted to start looking into
> > > > this to come up with some concrete patches :). What I had in mind was
> > > > that we could use 'counters == NULL' as an indication that the counter
> > > > is still in 'single counter mode'.
> > > >
> > >
> > > In the current state there are only pointers to counters in mm_struct
> > > and there is no storage for them in task_struct. So I don't think
> > > merely null-checking the per-cpu stuff is going to cut it -- where
> > > should the single-threaded counters land?
> >
> > I think you misunderstood. What I wanted to do is to provide a new flavor
> > of percpu_counter (sharing most of code and definitions) which would have
> > an option to start as a simple counter (indicated by pcc->counters == NULL
> > and using pcc->count for counting) and then be upgraded by a call to the
> > real percpu thing. Because I think such counters would be useful also on
> > other occasions than as rss counters.
> >
>
> Kent wrote something similar and sent it out last year [1]. However, the
> case slightly differs from what we'd want here: instead of going percpu on
> the 1 -> 2 thread transition, it upgrades based on the update rate, which
> a single thread might be able to trigger?

Thanks for the pointer but that version of counters is not really suitable
here as is (but we could factor out some common bits if that work is
happening). 1 thread can easily do 10000 RSS updates per second.

> [1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@google.com/

Honza

> > > Bonus problem, non-current can modify these counters and this needs to
> > > be safe against current playing with them at the same time. (and it
> > > would be a shame to require current to use atomics on them)
> >
> > Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> > modifying the counters for other processes. Thanks for pointing this out.
> >
> > > That said, my initial proposal adds a union:
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 5e74ce4a28cd..ea70f0c08286 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -737,7 +737,11 @@ struct mm_struct {
> > >
> > >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> > >
> > > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > +	union {
> > > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > +		u64 *rss_stat_single;
> > > +	};
> > > +	bool magic_flag_stuffed_elsewhere;
> > >
> > >  	struct linux_binfmt *binfmt;
> > >
> > > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> > > counters * 2 -- the first set updated without any synchro by the current
> > > thread. The second set is only to be modified by others and protected with
> > > mm->arg_lock. The lock protects remote access to the union to begin
> > > with.
> >
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> >
> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
> >
> > > Transition to per-CPU operation sets the magic flag (there is plenty
> > > of spare space in mm_struct, I'll find a good home for it without
> > > growing the struct). It would be a one-way street -- a process which
> > > gets a bunch of threads and goes back to one stays with per-CPU.
> >
> > Agreed with switching being a one-way street.
> >
> > > Then you get the true value of something by adding both counters.
> > >
> > > arg_lock is sparingly used, so remote ops are not expected to contend
> > > with anything. In fact their cost is going to go down since percpu
> > > summation takes a spinlock which also disables interrupts.
> > >
> > > Local ops should be about the same in cost as they are right now.
> > >
> > > I might have missed some detail in the above description, but I think
> > > the approach is decent.
> >
> > Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> To start I figured I'm going to bench about as friendly a case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
>
> I borrowed the bench found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> It prints a result every second (warning: first line is garbage).
>
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
>
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
>
> Even if this code could be patched to dodge these counters, I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
>
> With the fix I get an ok win, to quote from the commit:
>> Even at a very modest scale of 26 cores (ops/s):
>> before:  133543.63
>> after:   186061.81 (+39%)
>

So to sum up, a v3 of the patchset is queued up here:
https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-next

For the interested, I temporarily got my hands on something exceeding the
hand-watch scale benched above -- a 192-way AMD EPYC 7R13 box (2 sockets x
48 cores x 2 threads).

A 6.5 kernel + the patchset only gets south of 140k execs/s when running:
./static-doexec 192

According to perf top:

  51.04%  [kernel]  [k] osq_lock
   6.82%  [kernel]  [k] __raw_callee_save___kvm_vcpu_is_preempted
   2.98%  [kernel]  [k] _atomic_dec_and_lock_irqsave
   1.62%  [kernel]  [k] rcu_cblist_dequeue
   1.54%  [kernel]  [k] refcount_dec_not_one
   1.51%  [kernel]  [k] __mod_lruvec_page_state
   1.46%  [kernel]  [k] put_cred_rcu
   1.34%  [kernel]  [k] native_queued_spin_lock_slowpath
   0.94%  [kernel]  [k] srso_alias_safe_ret
   0.81%  [kernel]  [k] memset_orig
   0.77%  [kernel]  [k] unmap_page_range
   0.73%  [kernel]  [k] _compound_head
   0.72%  [kernel]  [k] kmem_cache_free

Then bpftrace -e 'kprobe:osq_lock { @[kstack()] = count(); }' shows:

@[
    osq_lock+1
    __mutex_lock_killable_slowpath+19
    mutex_lock_killable+62
    pcpu_alloc+1219
    __alloc_percpu_gfp+18
    __percpu_counter_init_many+43
    mm_init+727
    mm_alloc+78
    alloc_bprm+138
    do_execveat_common.isra.0+103
    __x64_sys_execve+55
    do_syscall_64+54
    entry_SYSCALL_64_after_hwframe+110
]: 637370
@[
    osq_lock+1
    __mutex_lock_killable_slowpath+19
    mutex_lock_killable+62
    pcpu_alloc+1219
    __alloc_percpu+21
    mm_init+577
    mm_alloc+78
    alloc_bprm+138
    do_execveat_common.isra.0+103
    __x64_sys_execve+55
    do_syscall_64+54
    entry_SYSCALL_64_after_hwframe+110
]: 638036

That is, per-cpu allocation is still on top at this scale. But more
importantly there are *TWO* unrelated back-to-back per-cpu allocs -- one
by the rss counters and one by mm_alloc_cid.

That is to say, per-cpu alloc scalability definitely needs to get fixed.
I'll ponder about it.
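As a closing illustration of what the "grouping" buys at the call site, here
is a sketch of the fork-side usage. The *_many() signatures are inferred from
the series title and from the __percpu_counter_init_many frame visible in the
trace above; treat them, and the wrapper names, as assumptions rather than the
final interface -- consult the queued patches for the real one.

/* Hypothetical wrappers, for illustration only. */
static int mm_init_rss_stat(struct mm_struct *mm)
{
	/*
	 * Before: NR_MM_COUNTERS separate percpu allocations, each taking
	 * the percpu allocator's mutex and the percpu_counter hotplug lock:
	 *
	 *	for (i = 0; i < NR_MM_COUNTERS; i++)
	 *		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL))
	 *			goto undo;
	 *
	 * After: one grouped allocation and one lock round trip.
	 */
	return percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL,
					NR_MM_COUNTERS);
}

static void mm_destroy_rss_stat(struct mm_struct *mm)
{
	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
}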