[RFC,0/3] mm: convert mm's rss stats into lazy_percpu_counter

Message ID 20240412092441.3112481-1-zhangpeng362@huawei.com

Message

Peng Zhang April 12, 2024, 9:24 a.m. UTC
From: ZhangPeng <zhangpeng362@huawei.com>

Since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter"), the rss_stats have been converted into percpu_counter,
which changed the error margin from (nr_threads * 64) to approximately
(nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
performance regression on fork/exec/shell. Even after commit
14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
for mm struct"), the performance of fork/exec/shell is still worse than
in previous kernel versions.

To mitigate this performance regression, we use lazy_percpu_counter[1]
to delay the allocation of percpu memory for rss_stats. In lmbench
tests, the conversion gives a 3% ~ 6% performance improvement for
fork_proc/exec_proc/shell_proc.
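
For reference, a minimal sketch of the idea (illustrative only, not the
actual lib/lazy-percpu-counter.c code from [1]; the names below are made
up): the counter starts in a cheap atomic mode and only allocates percpu
storage once it is upgraded, so short-lived single-threaded processes never
pay the percpu allocation cost in mm_init().

struct lazy_counter_sketch {
	atomic64_t	fallback;	/* atomic mode, used until upgraded */
	s64 __percpu	*pcpu;		/* NULL until switched to pcpu mode */
};

static inline void lazy_counter_add(struct lazy_counter_sketch *c, s64 amount)
{
	s64 __percpu *pcpu = READ_ONCE(c->pcpu);

	if (pcpu)
		this_cpu_add(*pcpu, amount);		/* pcpu mode */
	else
		atomic64_add(amount, &c->fallback);	/* atomic mode */
}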

The test results are as follows:

             base           base+revert        base+lazy_percpu_counter

fork_proc    427.4ms        394.1ms  (7.8%)    413.9ms  (3.2%)
exec_proc    2205.1ms       2042.2ms (7.4%)    2072.0ms (6.0%)
shell_proc   3180.9ms       2963.7ms (6.8%)    3010.7ms (5.4%)

This solution has not been fully evaluated and tested. The main idea of
this RFC patch series is to get the community's opinion on this approach.

[1] https://lore.kernel.org/linux-iommu/20230501165450.15352-8-surenb@google.com/

Kent Overstreet (1):
  Lazy percpu counters

ZhangPeng (2):
  lazy_percpu_counter: include struct percpu_counter in struct
    lazy_percpu_counter
  mm: convert mm's rss stats into lazy_percpu_counter

 include/linux/lazy-percpu-counter.h |  88 +++++++++++++++++++
 include/linux/mm.h                  |   8 +-
 include/linux/mm_types.h            |   4 +-
 include/trace/events/kmem.h         |   4 +-
 kernel/fork.c                       |  12 +--
 lib/Makefile                        |   2 +-
 lib/lazy-percpu-counter.c           | 131 ++++++++++++++++++++++++++++
 7 files changed, 232 insertions(+), 17 deletions(-)
 create mode 100644 include/linux/lazy-percpu-counter.h
 create mode 100644 lib/lazy-percpu-counter.c

Comments

Jan Kara April 12, 2024, 1:53 p.m. UTC | #1
On Fri 12-04-24 17:24:38, Peng Zhang wrote:
> From: ZhangPeng <zhangpeng362@huawei.com>
> 
> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter"), the rss_stats have been converted into percpu_counter,
> which changed the error margin from (nr_threads * 64) to approximately
> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
> performance regression on fork/exec/shell. Even after commit
> 14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
> for mm struct"), the performance of fork/exec/shell is still worse than
> in previous kernel versions.
> 
> To mitigate this performance regression, we use lazy_percpu_counter[1]
> to delay the allocation of percpu memory for rss_stats. In lmbench
> tests, the conversion gives a 3% ~ 6% performance improvement for
> fork_proc/exec_proc/shell_proc.
> 
> The test results are as follows:
> 
>              base           base+revert        base+lazy_percpu_counter
> 
> fork_proc    427.4ms        394.1ms  (7.8%)    413.9ms  (3.2%)
> exec_proc    2205.1ms       2042.2ms (7.4%)    2072.0ms (6.0%)
> shell_proc   3180.9ms       2963.7ms (6.8%)    3010.7ms (5.4%)
> 
> This solution has not been fully evaluated and tested. The main idea of
> this RFC patch series is to get the community's opinion on this approach.

Thanks! I like the idea and in fact I wanted to do something similar (just
never got to it). Thread [2] has a couple of good observations regarding
this problem. A couple of thoughts regarding your approach:

1) I think switching to a pcpu counter when the update rate exceeds 256
updates/s is not a great fit for RSS, because the updates are going to be
frequent in some cases but will usually all happen from one thread. So I
think it would make more sense to move the decision to switch to pcpu mode
from the counter itself into the callers and just switch on clone() when
the second thread gets created.
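
For illustration, such a switch in the clone path might look roughly like
this (a sketch only; mm_rss_switch_to_pcpu() is a hypothetical helper, not
something that exists in this series or in mainline):

/*
 * Hypothetical hook in the clone path: the mm starts with RSS counters in
 * atomic mode and is upgraded to pcpu mode the first time another task
 * attaches to the same mm via CLONE_VM.
 */
static int mm_maybe_upgrade_rss(unsigned long clone_flags, struct mm_struct *oldmm)
{
	if (!(clone_flags & CLONE_VM))
		return 0;	/* new mm, stays in atomic mode */

	return mm_rss_switch_to_pcpu(oldmm);	/* hypothetical helper */
}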

2) I thought that for RSS lazy percpu counters we could directly use
struct percpu_counter and just make it so that if 'counters' is NULL, the
counter is in atomic mode (count is used as an atomic_long_t), and if
counters != NULL, we are in pcpu mode.
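
A minimal sketch of what that add path could look like (assuming 64-bit,
where sizeof(long) == sizeof(s64), and assuming the NULL/non-NULL
transition of 'counters' is made safe against concurrent updaters;
rss_counter_add() is a made-up name, not existing percpu_counter API):

static inline void rss_counter_add(struct percpu_counter *fbc, s64 amount)
{
	if (!READ_ONCE(fbc->counters)) {
		/* atomic mode: reuse fbc->count as an atomic_long_t */
		atomic_long_add(amount, (atomic_long_t *)&fbc->count);
		return;
	}
	percpu_counter_add(fbc, amount);	/* pcpu mode */
}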

3) In [2] Mateusz had a good observation that the old RSS counters actually
used atomic operations only in rare cases, so even lazy pcpu counters are
going to have worse performance for singlethreaded processes than the old
code. We could *almost* get away with non-atomic updates to counter->count
if it were not for occasional RSS updates from unrelated tasks. So it might
be worth it to further optimize the counters as:

struct rss_counter_single {
	void *state;			/* To detect switching to pcpu mode */
	atomic_long_t counter_atomic;	/* Used for foreign updates */
	long counter;			/* Used by local updates */
};

struct rss_counter {
	union {
		struct rss_counter_single single;
		/* struct percpu_counter needs to be modified to have
		 * 'counters' first to avoid issues for different
		 * architectures or with CONFIG_HOTPLUG_CPU enabled */
		struct percpu_counter pcpu;
	};
};
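
For illustration, the single-threaded fast path described above might look
roughly like the following (hypothetical helpers, assuming the owning task
can be identified by comparing mm pointers):

/* Local updates avoid atomics, foreign updaters use the atomic field,
 * and a reader sums both parts. Hypothetical sketch only. */
static void rss_single_add(struct mm_struct *mm,
			   struct rss_counter_single *c, long amount)
{
	if (current->mm == mm)
		c->counter += amount;				/* local */
	else
		atomic_long_add(amount, &c->counter_atomic);	/* foreign */
}

static long rss_single_read(struct rss_counter_single *c)
{
	return c->counter + atomic_long_read(&c->counter_atomic);
}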

But I'm not sure this complexity is worth it so I'd do it as a separate
patch with separate benchmarking if at all.

								Honza

[2] https://lore.kernel.org/all/ZOPSEJTzrow8YFix@snowbird/

> 
> [1] https://lore.kernel.org/linux-iommu/20230501165450.15352-8-surenb@google.com/
> 
> Kent Overstreet (1):
>   Lazy percpu counters
> 
> ZhangPeng (2):
>   lazy_percpu_counter: include struct percpu_counter in struct
>     lazy_percpu_counter
>   mm: convert mm's rss stats into lazy_percpu_counter
> 
>  include/linux/lazy-percpu-counter.h |  88 +++++++++++++++++++
>  include/linux/mm.h                  |   8 +-
>  include/linux/mm_types.h            |   4 +-
>  include/trace/events/kmem.h         |   4 +-
>  kernel/fork.c                       |  12 +--
>  lib/Makefile                        |   2 +-
>  lib/lazy-percpu-counter.c           | 131 ++++++++++++++++++++++++++++
>  7 files changed, 232 insertions(+), 17 deletions(-)
>  create mode 100644 include/linux/lazy-percpu-counter.h
>  create mode 100644 lib/lazy-percpu-counter.c
> 
> -- 
> 2.25.1
>
Peng Zhang April 15, 2024, 12:33 p.m. UTC | #2
On 2024/4/12 21:53, Jan Kara wrote:

> On Fri 12-04-24 17:24:38, Peng Zhang wrote:
>> From: ZhangPeng <zhangpeng362@huawei.com>
>>
>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>> percpu_counter"), the rss_stats have been converted into percpu_counter,
>> which changed the error margin from (nr_threads * 64) to approximately
>> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
>> performance regression on fork/exec/shell. Even after commit
>> 14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
>> for mm struct"), the performance of fork/exec/shell is still worse than
>> in previous kernel versions.
>>
>> To mitigate this performance regression, we use lazy_percpu_counter[1]
>> to delay the allocation of percpu memory for rss_stats. In lmbench
>> tests, the conversion gives a 3% ~ 6% performance improvement for
>> fork_proc/exec_proc/shell_proc.
>>
>> The test results are as follows:
>>
>>               base           base+revert        base+lazy_percpu_counter
>>
>> fork_proc    427.4ms        394.1ms  (7.8%)    413.9ms  (3.2%)
>> exec_proc    2205.1ms       2042.2ms (7.4%)    2072.0ms (6.0%)
>> shell_proc   3180.9ms       2963.7ms (6.8%)    3010.7ms (5.4%)
>>
>> This solution has not been fully evaluated and tested. The main idea of
>> this RFC patch series is to get the community's opinion on this approach.
> Thanks! I like the idea and in fact I wanted to do something similar (just
> never got to it). Thread [2] has a couple of good observations regarding
> this problem. A couple of thoughts regarding your approach:
>
> 1) I think switching to a pcpu counter when the update rate exceeds 256
> updates/s is not a great fit for RSS, because the updates are going to be
> frequent in some cases but will usually all happen from one thread. So I
> think it would make more sense to move the decision to switch to pcpu mode
> from the counter itself into the callers and just switch on clone() when
> the second thread gets created.
>
> 2) I thought that for RSS lazy percpu counters we could directly use
> struct percpu_counter and just make it so that if 'counters' is NULL, the
> counter is in atomic mode (count is used as an atomic_long_t), and if
> counters != NULL, we are in pcpu mode.

Thanks for your reply!
I agree with your thoughts; I'll implement it in the next version.

> 3) In [2] Mateusz had a good observation that the old RSS counters actually
> used atomic operations only in rare cases, so even lazy pcpu counters are
> going to have worse performance for singlethreaded processes than the old
> code. We could *almost* get away with non-atomic updates to counter->count
> if it were not for occasional RSS updates from unrelated tasks. So it might
> be worth it to further optimize the counters as:
>
> struct rss_counter_single {
> 	void *state;			/* To detect switching to pcpu mode */
> 	atomic_long_t counter_atomic;	/* Used for foreign updates */
> 	long counter;			/* Used by local updates */
> };
>
> struct rss_counter {
> 	union {
> 		struct rss_counter_single single;
> 		/* struct percpu_counter needs to be modified to have
> 		 * 'counters' first to avoid issues for different
> 		 * architectures or with CONFIG_HOTPLUG_CPU enabled */
> 		struct percpu_counter pcpu;
> 	};
> };
>
> But I'm not sure this complexity is worth it so I'd do it as a separate
> patch with separate benchmarking if at all.
>
> 								Honza

Agreed. Single-threaded processes don't need atomic operations, and
this scenario needs to be thoroughly tested.
I'll try to implement it in another patch series after I finish the
basic approach.

> [2] https://lore.kernel.org/all/ZOPSEJTzrow8YFix@snowbird/
>
>> [1] https://lore.kernel.org/linux-iommu/20230501165450.15352-8-surenb@google.com/
>>
>> Kent Overstreet (1):
>>    Lazy percpu counters
>>
>> ZhangPeng (2):
>>    lazy_percpu_counter: include struct percpu_counter in struct
>>      lazy_percpu_counter
>>    mm: convert mm's rss stats into lazy_percpu_counter
>>
>>   include/linux/lazy-percpu-counter.h |  88 +++++++++++++++++++
>>   include/linux/mm.h                  |   8 +-
>>   include/linux/mm_types.h            |   4 +-
>>   include/trace/events/kmem.h         |   4 +-
>>   kernel/fork.c                       |  12 +--
>>   lib/Makefile                        |   2 +-
>>   lib/lazy-percpu-counter.c           | 131 ++++++++++++++++++++++++++++
>>   7 files changed, 232 insertions(+), 17 deletions(-)
>>   create mode 100644 include/linux/lazy-percpu-counter.h
>>   create mode 100644 lib/lazy-percpu-counter.c
>>
>> -- 
>> 2.25.1
>>