[02/10] cacheinfo: calculate per-CPU data cache size

Message ID	20230920061856.257597-3-ying.huang@intel.com (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven <arjan@linux.intel.com>, Huang Ying <ying.huang@intel.com>, Sudeep Holla <sudeep.holla@arm.com>, Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mgorman@techsingularity.net>, Vlastimil Babka <vbabka@suse.cz>, David Hildenbrand <david@redhat.com>, Johannes Weiner <jweiner@redhat.com>, Dave Hansen <dave.hansen@linux.intel.com>, Michal Hocko <mhocko@suse.com>, Pavel Tatashin <pasha.tatashin@soleen.com>, Matthew Wilcox <willy@infradead.org>, Christoph Lameter <cl@linux.com>
Subject: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size
Date: Wed, 20 Sep 2023 14:18:48 +0800
Message-Id: <20230920061856.257597-3-ying.huang@intel.com>
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
Series	mm: PCP high auto-tuning
[00/10] mm: PCP high auto-tuning
[01/10] mm, pcp: avoid to drain PCP when process exit
[02/10] cacheinfo: calculate per-CPU data cache size
[03/10] mm, pcp: reduce lock contention for draining high-order pages
[04/10] mm: restrict the pcp batch scale factor to avoid too long latency
[05/10] mm, page_alloc: scale the number of pages that are batch allocated
[06/10] mm: add framework for PCP high auto-tuning
[07/10] mm: tune PCP high automatically
[08/10] mm, pcp: decrease PCP high if free pages < high watermark
[09/10] mm, pcp: avoid to reduce PCP high unnecessarily
[10/10] mm, pcp: reduce detecting time of consecutive high order page freeing

Huang, Ying Sept. 20, 2023, 6:18 a.m. UTC

Per-CPU data cache size is useful information.  For example, it can be
used to determine per-CPU cache size.  So, in this patch, the data
cache size for each CPU is calculated via data_cache_size /
shared_cpu_weight.

A brute-force algorithm that iterates over all online CPUs is used to
avoid allocating an extra cpumask, especially in the offline callback.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
 include/linux/cacheinfo.h |  1 +
 2 files changed, 42 insertions(+), 1 deletion(-)
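
The calculation described in the commit message (cache size divided by
the number of CPUs sharing it, accumulated over data and unified caches)
can be modelled in user space roughly as follows. This is only a sketch:
struct cache_leaf, enum cache_kind and per_cpu_data_share() are
hypothetical stand-ins, not the kernel's struct cacheinfo or its helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the cache types the patch distinguishes. */
enum cache_kind { KIND_INST, KIND_DATA, KIND_UNIFIED };

/* Hypothetical, simplified description of one cache leaf. */
struct cache_leaf {
	enum cache_kind kind;
	unsigned int size;      /* total size of the cache in bytes */
	unsigned int nr_shared; /* number of CPUs sharing the cache */
};

/*
 * Sum the per-CPU share (size / nr_shared) of every data or unified
 * leaf.  Instruction caches and leaves with an empty sharing set are
 * skipped, mirroring the two "continue" branches in the patch.
 */
static unsigned int per_cpu_data_share(const struct cache_leaf *leaves,
				       size_t n)
{
	unsigned int total = 0;

	assert(leaves != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (leaves[i].kind != KIND_DATA &&
		    leaves[i].kind != KIND_UNIFIED)
			continue;
		if (!leaves[i].nr_shared)
			continue;
		total += leaves[i].size / leaves[i].nr_shared;
	}
	return total;
}
```

For an assumed topology with a 48 KiB L1D and a 2 MiB L2 shared by two
hardware threads each, plus a 32 MiB L3 shared by 16 CPUs, the result is
24 KiB + 1 MiB + 2 MiB per CPU.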

Sudeep Holla Sept. 20, 2023, 9:24 a.m. UTC | #1

On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> Per-CPU data cache size is useful information.  For example, it can be
> used to determine per-CPU cache size.  So, in this patch, the data
> cache size for each CPU is calculated via data_cache_size /
> shared_cpu_weight.
>
> A brute-force algorithm to iterate all online CPUs is used to avoid
> to allocate an extra cpumask, especially in offline callback.
>

You have not mentioned who will use this information. Looking at the
change, it is not exposed to user space. Also, I see this is actually
part of the series [1]. Is this info used in any of those patches? Can
you point me to them?

Not all architectures use cacheinfo yet. How will the mm changes affect
those platforms?

--
Regards,
Sudeep

[1] https://lore.kernel.org/all/20230920061856.257597-1-ying.huang@intel.com/

Huang, Ying Sept. 22, 2023, 7:56 a.m. UTC | #2

Sudeep Holla <sudeep.holla@arm.com> writes:

> On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> Per-CPU data cache size is useful information.  For example, it can be
>> used to determine per-CPU cache size.  So, in this patch, the data
>> cache size for each CPU is calculated via data_cache_size /
>> shared_cpu_weight.
>>
>> A brute-force algorithm to iterate all online CPUs is used to avoid
>> to allocate an extra cpumask, especially in offline callback.
>>
>
> You have not mentioned who will use this information ? Looking at the
> change, it is not exposed to the user-space. Also I see this is actually
> part of the series [1]. Is this info used in any of those patches ? Can you
> point me to the same ?

Yes.  It is used by [PATCH 03/10] of the series.  If the per-CPU data
cache size is large enough, we will cache more pages in the per-CPU
pageset to reduce the zone lock contention.

> Not all architecture use cacheinfo yet. How will the mm changes affect those
> platforms ?

If cacheinfo isn't available, we will fall back to the original
behavior.  That is, we will drain the per-CPU pageset more often (that
is, cache fewer pages to improve sharing of cache-hot pages between CPUs).
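
The decision described above can be sketched as a small predicate. This
is an illustration only and not the code from [PATCH 03/10]: the function
name keep_batch_in_pcp(), the KEEP_BATCH_FACTOR threshold and the
convention that a size of 0 means "cacheinfo unavailable" are all
assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical threshold: keep extra pages in the per-CPU pageset only
 * when the per-CPU cache share covers more than this many batches.
 */
#define KEEP_BATCH_FACTOR 4

/*
 * Return true if more pages should be kept in the per-CPU pageset.
 * A per_cpu_cache_bytes of 0 models "cacheinfo is not available" and
 * selects the original behavior (drain more often).
 */
static bool keep_batch_in_pcp(size_t per_cpu_cache_bytes,
			      size_t batch_pages, size_t page_size)
{
	assert(page_size > 0);
	if (per_cpu_cache_bytes == 0)
		return false;
	return per_cpu_cache_bytes / page_size >
	       KEEP_BATCH_FACTOR * batch_pages;
}
```

With an assumed batch of 63 pages and 4 KiB pages, a 2 MiB per-CPU share
(512 pages) passes the check, while a 256 KiB share (64 pages) or a
missing cacheinfo does not.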

> --
> Regards,
> Sudeep
>
> [1] https://lore.kernel.org/all/20230920061856.257597-1-ying.huang@intel.com/

--
Best Regards,
Huang, Ying

Mel Gorman Oct. 11, 2023, 12:20 p.m. UTC | #3

On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> Per-CPU data cache size is useful information.  For example, it can be
> used to determine per-CPU cache size.  So, in this patch, the data
> cache size for each CPU is calculated via data_cache_size /
> shared_cpu_weight.
> 
> A brute-force algorithm to iterate all online CPUs is used to avoid
> to allocate an extra cpumask, especially in offline callback.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

It's not necessarily relevant to the patch, but at least the scheduler
also stores some per-cpu topology information such as sd_llc_size -- the
number of CPUs sharing the same last-level-cache as this CPU. It may be
worth unifying this at some point if it's common that per-cpu
information is too fine and per-zone or per-node information is too
coarse. This would be particularly true when considering locking
granularity.

> Cc: Sudeep Holla <sudeep.holla@arm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Johannes Weiner <jweiner@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christoph Lameter <cl@linux.com>
> ---
>  drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
>  include/linux/cacheinfo.h |  1 +
>  2 files changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> index cbae8be1fe52..3e8951a3fbab 100644
> --- a/drivers/base/cacheinfo.c
> +++ b/drivers/base/cacheinfo.c
> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>  	return rc;
>  }
>  
> +static void update_data_cache_size_cpu(unsigned int cpu)
> +{
> +	struct cpu_cacheinfo *ci;
> +	struct cacheinfo *leaf;
> +	unsigned int i, nr_shared;
> +	unsigned int size_data = 0;
> +
> +	if (!per_cpu_cacheinfo(cpu))
> +		return;
> +
> +	ci = ci_cacheinfo(cpu);
> +	for (i = 0; i < cache_leaves(cpu); i++) {
> +		leaf = per_cpu_cacheinfo_idx(cpu, i);
> +		if (leaf->type != CACHE_TYPE_DATA &&
> +		    leaf->type != CACHE_TYPE_UNIFIED)
> +			continue;
> +		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
> +		if (!nr_shared)
> +			continue;
> +		size_data += leaf->size / nr_shared;
> +	}
> +	ci->size_data = size_data;
> +}

This needs comments.

It would be nice to add a comment on top describing the limitation of
CACHE_TYPE_UNIFIED here in the context of update_data_cache_size_cpu().
The L2 cache could be unified but much smaller than an L3 or other
last-level cache. It's not clear from the code which level of cache is
being used, due to a lack of familiarity with the cpu_cacheinfo code, but
size_data is not the size of a cache; it appears to be the share of a
cache a CPU would have under ideal circumstances.  However, as the loop
also appears to iterate over the hierarchy, this may not be accurate.
Caches may or may not allow data to be duplicated between levels, so the
summed value may be inaccurate.

A respin of the patch is not necessary but a follow-on patch adding
clarifying comments would be very welcome, covering:

o What levels of cache are being used
o Describe what size_data actually is and preferably rename the field
  to be more explicit as "size" could be the total cache capacity, the
  cache slice under ideal circumstances or even the number of CPUs sharing
  that cache.

The cache details *may* need a follow-on patch if the size_data value is
misleading. If it is a hierarchy and the value does not always represent
the slice of cache a CPU could have under ideal circumstances then the
value should be based on the LLC only so that it is predictable across
architectures.
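
The LLC-only variant suggested here could look roughly like the sketch
below. It is a user-space illustration, not kernel code: struct
cache_leaf, enum cache_kind and llc_share() are hypothetical names, and
"last-level cache" is taken to mean the data or unified leaf with the
highest level number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the cache types the patch distinguishes. */
enum cache_kind { KIND_INST, KIND_DATA, KIND_UNIFIED };

/* Hypothetical, simplified description of one cache leaf. */
struct cache_leaf {
	enum cache_kind kind;
	unsigned int level;     /* 1 for L1, 2 for L2, ... */
	unsigned int size;      /* total size of the cache in bytes */
	unsigned int nr_shared; /* number of CPUs sharing the cache */
};

/*
 * Return the share of the last-level cache one CPU would have under
 * ideal circumstances, or 0 if no usable leaf is found.
 */
static unsigned int llc_share(const struct cache_leaf *leaves, size_t n)
{
	const struct cache_leaf *llc = NULL;

	assert(leaves != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (leaves[i].kind != KIND_DATA &&
		    leaves[i].kind != KIND_UNIFIED)
			continue;
		/* Remember the data/unified leaf with the highest level. */
		if (!llc || leaves[i].level > llc->level)
			llc = &leaves[i];
	}
	if (!llc || !llc->nr_shared)
		return 0;
	return llc->size / llc->nr_shared;
}
```

Because only one level contributes, the result does not depend on
whether lower levels duplicate data held in the LLC.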

Huang, Ying Oct. 12, 2023, 12:08 p.m. UTC | #4

Mel Gorman <mgorman@techsingularity.net> writes:

> On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> Per-CPU data cache size is useful information.  For example, it can be
>> used to determine per-CPU cache size.  So, in this patch, the data
>> cache size for each CPU is calculated via data_cache_size /
>> shared_cpu_weight.
>> 
>> A brute-force algorithm to iterate all online CPUs is used to avoid
>> to allocate an extra cpumask, especially in offline callback.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>
> It's not necessarily relevant to the patch, but at least the scheduler
> also stores some per-cpu topology information such as sd_llc_size -- the
> number of CPUs sharing the same last-level-cache as this CPU. It may be
> worth unifying this at some point if it's common that per-cpu
> information is too fine and per-zone or per-node information is too
> coarse. This would be particularly true when considering locking
> granularity,
>
>> Cc: Sudeep Holla <sudeep.holla@arm.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Johannes Weiner <jweiner@redhat.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Christoph Lameter <cl@linux.com>
>> ---
>>  drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
>>  include/linux/cacheinfo.h |  1 +
>>  2 files changed, 42 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> index cbae8be1fe52..3e8951a3fbab 100644
>> --- a/drivers/base/cacheinfo.c
>> +++ b/drivers/base/cacheinfo.c
>> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>>  	return rc;
>>  }
>>  
>> +static void update_data_cache_size_cpu(unsigned int cpu)
>> +{
>> +	struct cpu_cacheinfo *ci;
>> +	struct cacheinfo *leaf;
>> +	unsigned int i, nr_shared;
>> +	unsigned int size_data = 0;
>> +
>> +	if (!per_cpu_cacheinfo(cpu))
>> +		return;
>> +
>> +	ci = ci_cacheinfo(cpu);
>> +	for (i = 0; i < cache_leaves(cpu); i++) {
>> +		leaf = per_cpu_cacheinfo_idx(cpu, i);
>> +		if (leaf->type != CACHE_TYPE_DATA &&
>> +		    leaf->type != CACHE_TYPE_UNIFIED)
>> +			continue;
>> +		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
>> +		if (!nr_shared)
>> +			continue;
>> +		size_data += leaf->size / nr_shared;
>> +	}
>> +	ci->size_data = size_data;
>> +}
>
> This needs comments.
>
> It would be nice to add a comment on top describing the limitation of
> CACHE_TYPE_UNIFIED here in the context of
> update_data_cache_size_cpu().

Sure.  Will do that.

> The L2 cache could be unified but much smaller than a L3 or other
> last-level-cache. It's not clear from the code what level of cache is being
> used due to a lack of familiarity of the cpu_cacheinfo code but size_data
> is not the size of a cache, it appears to be the share of a cache a CPU
> would have under ideal circumstances.

Yes.  And it isn't for one specific level of cache.  It's the sum of the
per-CPU shares of all levels of cache.  But the calculation is
inaccurate.  More details are in the reply below.

> However, as it appears to also be
> iterating hierarchy then this may not be accurate. Caches may or may not
> allow data to be duplicated between levels so the value may be inaccurate.

Thank you very much for pointing this out!  The cache can be inclusive
or not.  So, we cannot calculate the per-CPU slice of all cache levels
by blindly adding them together.  I will change this in a follow-on
patch.

> A respin of the patch is not necessary but a follow-on patch adding
> clarifing comments would be very welcome covering
>
> o What levels of cache are being used
> o Describe what size_data actually is and preferably rename the field
>   to be more explicit as "size" could be the total cache capacity, the
>   cache slice under ideal circumstances or even the number of CPUs sharing
>   that cache.

Sure.

> The cache details *may* need a follow-on patch if the size_data value is
> misleading. If it is a hierarchy and the value does not always represent
> the slice of cache a CPU could have under ideal circumstances then the
> value should be based on the LLC only so that it is predictable across
> architectures.

Sure.

--
Best Regards,
Huang, Ying

Mel Gorman Oct. 12, 2023, 12:52 p.m. UTC | #5

On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
> 
> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> >> Per-CPU data cache size is useful information.  For example, it can be
> >> used to determine per-CPU cache size.  So, in this patch, the data
> >> cache size for each CPU is calculated via data_cache_size /
> >> shared_cpu_weight.
> >> 
> >> A brute-force algorithm to iterate all online CPUs is used to avoid
> >> to allocate an extra cpumask, especially in offline callback.
> >> 
> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >
> > It's not necessarily relevant to the patch, but at least the scheduler
> > also stores some per-cpu topology information such as sd_llc_size -- the
> > number of CPUs sharing the same last-level-cache as this CPU. It may be
> > worth unifying this at some point if it's common that per-cpu
> > information is too fine and per-zone or per-node information is too
> > coarse. This would be particularly true when considering locking
> > granularity,
> >
> >> Cc: Sudeep Holla <sudeep.holla@arm.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Mel Gorman <mgorman@techsingularity.net>
> >> Cc: Vlastimil Babka <vbabka@suse.cz>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Johannes Weiner <jweiner@redhat.com>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: Michal Hocko <mhocko@suse.com>
> >> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> >> Cc: Matthew Wilcox <willy@infradead.org>
> >> Cc: Christoph Lameter <cl@linux.com>
> >> ---
> >>  drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
> >>  include/linux/cacheinfo.h |  1 +
> >>  2 files changed, 42 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> >> index cbae8be1fe52..3e8951a3fbab 100644
> >> --- a/drivers/base/cacheinfo.c
> >> +++ b/drivers/base/cacheinfo.c
> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
> >>  	return rc;
> >>  }
> >>  
> >> +static void update_data_cache_size_cpu(unsigned int cpu)
> >> +{
> >> +	struct cpu_cacheinfo *ci;
> >> +	struct cacheinfo *leaf;
> >> +	unsigned int i, nr_shared;
> >> +	unsigned int size_data = 0;
> >> +
> >> +	if (!per_cpu_cacheinfo(cpu))
> >> +		return;
> >> +
> >> +	ci = ci_cacheinfo(cpu);
> >> +	for (i = 0; i < cache_leaves(cpu); i++) {
> >> +		leaf = per_cpu_cacheinfo_idx(cpu, i);
> >> +		if (leaf->type != CACHE_TYPE_DATA &&
> >> +		    leaf->type != CACHE_TYPE_UNIFIED)
> >> +			continue;
> >> +		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
> >> +		if (!nr_shared)
> >> +			continue;
> >> +		size_data += leaf->size / nr_shared;
> >> +	}
> >> +	ci->size_data = size_data;
> >> +}
> >
> > This needs comments.
> >
> > It would be nice to add a comment on top describing the limitation of
> > CACHE_TYPE_UNIFIED here in the context of
> > update_data_cache_size_cpu().
> 
> Sure.  Will do that.
> 

Thanks.

> > The L2 cache could be unified but much smaller than a L3 or other
> > last-level-cache. It's not clear from the code what level of cache is being
> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
> > is not the size of a cache, it appears to be the share of a cache a CPU
> > would have under ideal circumstances.
> 
> Yes.  And it isn't for one specific level of cache.  It's sum of per-CPU
> shares of all levels of cache.  But the calculation is inaccurate.  More
> details are in the below reply.
> 
> > However, as it appears to also be
> > iterating hierarchy then this may not be accurate. Caches may or may not
> > allow data to be duplicated between levels so the value may be inaccurate.
> 
> Thank you very much for pointing this out!  The cache can be inclusive
> or not.  So, we cannot calculate the per-CPU slice of all-level caches
> via adding them together blindly.  I will change this in a follow-on
> patch.
> 

Please do. I would strongly suggest basing this on the LLC only because
it's the only value you can be sure of. This is the only change that may
warrant a respin of the series, as the history will be somewhat confusing
otherwise.

Huang, Ying Oct. 12, 2023, 1:12 p.m. UTC | #6

Mel Gorman <mgorman@techsingularity.net> writes:

> On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@techsingularity.net> writes:
>> 
>> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> >> Per-CPU data cache size is useful information.  For example, it can be
>> >> used to determine per-CPU cache size.  So, in this patch, the data
>> >> cache size for each CPU is calculated via data_cache_size /
>> >> shared_cpu_weight.
>> >> 
>> >> A brute-force algorithm to iterate all online CPUs is used to avoid
>> >> to allocate an extra cpumask, especially in offline callback.
>> >> 
>> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> >
>> > It's not necessarily relevant to the patch, but at least the scheduler
>> > also stores some per-cpu topology information such as sd_llc_size -- the
>> > number of CPUs sharing the same last-level-cache as this CPU. It may be
>> > worth unifying this at some point if it's common that per-cpu
>> > information is too fine and per-zone or per-node information is too
>> > coarse. This would be particularly true when considering locking
>> > granularity,
>> >
>> >> Cc: Sudeep Holla <sudeep.holla@arm.com>
>> >> Cc: Andrew Morton <akpm@linux-foundation.org>
>> >> Cc: Mel Gorman <mgorman@techsingularity.net>
>> >> Cc: Vlastimil Babka <vbabka@suse.cz>
>> >> Cc: David Hildenbrand <david@redhat.com>
>> >> Cc: Johannes Weiner <jweiner@redhat.com>
>> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> >> Cc: Michal Hocko <mhocko@suse.com>
>> >> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
>> >> Cc: Matthew Wilcox <willy@infradead.org>
>> >> Cc: Christoph Lameter <cl@linux.com>
>> >> ---
>> >>  drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
>> >>  include/linux/cacheinfo.h |  1 +
>> >>  2 files changed, 42 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> >> index cbae8be1fe52..3e8951a3fbab 100644
>> >> --- a/drivers/base/cacheinfo.c
>> >> +++ b/drivers/base/cacheinfo.c
>> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>> >>  	return rc;
>> >>  }
>> >>  
>> >> +static void update_data_cache_size_cpu(unsigned int cpu)
>> >> +{
>> >> +	struct cpu_cacheinfo *ci;
>> >> +	struct cacheinfo *leaf;
>> >> +	unsigned int i, nr_shared;
>> >> +	unsigned int size_data = 0;
>> >> +
>> >> +	if (!per_cpu_cacheinfo(cpu))
>> >> +		return;
>> >> +
>> >> +	ci = ci_cacheinfo(cpu);
>> >> +	for (i = 0; i < cache_leaves(cpu); i++) {
>> >> +		leaf = per_cpu_cacheinfo_idx(cpu, i);
>> >> +		if (leaf->type != CACHE_TYPE_DATA &&
>> >> +		    leaf->type != CACHE_TYPE_UNIFIED)
>> >> +			continue;
>> >> +		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
>> >> +		if (!nr_shared)
>> >> +			continue;
>> >> +		size_data += leaf->size / nr_shared;
>> >> +	}
>> >> +	ci->size_data = size_data;
>> >> +}
>> >
>> > This needs comments.
>> >
>> > It would be nice to add a comment on top describing the limitation of
>> > CACHE_TYPE_UNIFIED here in the context of
>> > update_data_cache_size_cpu().
>> 
>> Sure.  Will do that.
>> 
>
> Thanks.
>
>> > The L2 cache could be unified but much smaller than a L3 or other
>> > last-level-cache. It's not clear from the code what level of cache is being
>> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
>> > is not the size of a cache, it appears to be the share of a cache a CPU
>> > would have under ideal circumstances.
>> 
>> Yes.  And it isn't for one specific level of cache.  It's sum of per-CPU
>> shares of all levels of cache.  But the calculation is inaccurate.  More
>> details are in the below reply.
>> 
>> > However, as it appears to also be
>> > iterating hierarchy then this may not be accurate. Caches may or may not
>> > allow data to be duplicated between levels so the value may be inaccurate.
>> 
>> Thank you very much for pointing this out!  The cache can be inclusive
>> or not.  So, we cannot calculate the per-CPU slice of all-level caches
>> via adding them together blindly.  I will change this in a follow-on
>> patch.
>> 
>
> Please do, I would strongly suggest basing this on LLC only because it's
> the only value you can be sure of. This change is the only change that may
> warrant a respin of the series as the history will be somewhat confusing
> otherwise.

I am still checking whether it's possible to get cache inclusiveness
information via cpuid.

If there's no reliable way to do that, we can use the max value of the
per-CPU share of each level of cache.  For an inclusive cache, that will
be the value of the LLC.  For a non-inclusive cache, the value will be
more accurate.  For example, on Intel Sapphire Rapids, the L2 cache is
2 MB per core, while the LLC is 1.875 MB per core according to [1].

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html
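
The "max of per-CPU shares" fallback can be sketched in user space as
below. The names struct cache_leaf, enum cache_kind and
max_per_cpu_share() are hypothetical, and the topology in the note that
follows (a 56-core part with one CPU per core, a 48 KiB L1D and a
105 MiB LLC) is an assumption chosen only to reproduce the 2 MB and
1.875 MB per-core figures quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the cache types the patch distinguishes. */
enum cache_kind { KIND_INST, KIND_DATA, KIND_UNIFIED };

/* Hypothetical, simplified description of one cache leaf. */
struct cache_leaf {
	enum cache_kind kind;
	unsigned int size;      /* total size of the cache in bytes */
	unsigned int nr_shared; /* number of CPUs sharing the cache */
};

/*
 * Return the largest per-CPU share (size / nr_shared) found among the
 * data and unified leaves, instead of the sum over all levels.
 */
static unsigned int max_per_cpu_share(const struct cache_leaf *leaves,
				      size_t n)
{
	unsigned int best = 0;

	assert(leaves != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		unsigned int share;

		if (leaves[i].kind != KIND_DATA &&
		    leaves[i].kind != KIND_UNIFIED)
			continue;
		if (!leaves[i].nr_shared)
			continue;
		share = leaves[i].size / leaves[i].nr_shared;
		if (share > best)
			best = share;
	}
	return best;
}
```

With a private 2 MiB L2 and a 105 MiB LLC shared by 56 CPUs
(105 MiB / 56 = 1.875 MiB), the function returns the L2 share. With a
1 MiB L2 and a 32 MiB LLC shared by 8 CPUs, it returns the 4 MiB LLC
share.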

I will respin the series.

Thanks a lot for the review!

--
Best Regards,
Huang, Ying

Mel Gorman Oct. 12, 2023, 3:22 p.m. UTC | #7

On Thu, Oct 12, 2023 at 09:12:00PM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
> 
> > On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
> >> Mel Gorman <mgorman@techsingularity.net> writes:
> >> 
> >> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> >> >> Per-CPU data cache size is useful information.  For example, it can be
> >> >> used to determine per-CPU cache size.  So, in this patch, the data
> >> >> cache size for each CPU is calculated via data_cache_size /
> >> >> shared_cpu_weight.
> >> >> 
> >> >> A brute-force algorithm to iterate all online CPUs is used to avoid
> >> >> to allocate an extra cpumask, especially in offline callback.
> >> >> 
> >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >> >
> >> > It's not necessarily relevant to the patch, but at least the scheduler
> >> > also stores some per-cpu topology information such as sd_llc_size -- the
> >> > number of CPUs sharing the same last-level-cache as this CPU. It may be
> >> > worth unifying this at some point if it's common that per-cpu
> >> > information is too fine and per-zone or per-node information is too
> >> > coarse. This would be particularly true when considering locking
> >> > granularity,
> >> >
> >> >> Cc: Sudeep Holla <sudeep.holla@arm.com>
> >> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> >> Cc: Mel Gorman <mgorman@techsingularity.net>
> >> >> Cc: Vlastimil Babka <vbabka@suse.cz>
> >> >> Cc: David Hildenbrand <david@redhat.com>
> >> >> Cc: Johannes Weiner <jweiner@redhat.com>
> >> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> >> Cc: Michal Hocko <mhocko@suse.com>
> >> >> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> >> >> Cc: Matthew Wilcox <willy@infradead.org>
> >> >> Cc: Christoph Lameter <cl@linux.com>
> >> >> ---
> >> >>  drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
> >> >>  include/linux/cacheinfo.h |  1 +
> >> >>  2 files changed, 42 insertions(+), 1 deletion(-)
> >> >> 
> >> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> >> >> index cbae8be1fe52..3e8951a3fbab 100644
> >> >> --- a/drivers/base/cacheinfo.c
> >> >> +++ b/drivers/base/cacheinfo.c
> >> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
> >> >>  	return rc;
> >> >>  }
> >> >>  
> >> >> +static void update_data_cache_size_cpu(unsigned int cpu)
> >> >> +{
> >> >> +	struct cpu_cacheinfo *ci;
> >> >> +	struct cacheinfo *leaf;
> >> >> +	unsigned int i, nr_shared;
> >> >> +	unsigned int size_data = 0;
> >> >> +
> >> >> +	if (!per_cpu_cacheinfo(cpu))
> >> >> +		return;
> >> >> +
> >> >> +	ci = ci_cacheinfo(cpu);
> >> >> +	for (i = 0; i < cache_leaves(cpu); i++) {
> >> >> +		leaf = per_cpu_cacheinfo_idx(cpu, i);
> >> >> +		if (leaf->type != CACHE_TYPE_DATA &&
> >> >> +		    leaf->type != CACHE_TYPE_UNIFIED)
> >> >> +			continue;
> >> >> +		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
> >> >> +		if (!nr_shared)
> >> >> +			continue;
> >> >> +		size_data += leaf->size / nr_shared;
> >> >> +	}
> >> >> +	ci->size_data = size_data;
> >> >> +}
> >> >
> >> > This needs comments.
> >> >
> >> > It would be nice to add a comment on top describing the limitation of
> >> > CACHE_TYPE_UNIFIED here in the context of
> >> > update_data_cache_size_cpu().
> >> 
> >> Sure.  Will do that.
> >> 
> >
> > Thanks.
> >
> >> > The L2 cache could be unified but much smaller than a L3 or other
> >> > last-level-cache. It's not clear from the code what level of cache is being
> >> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
> >> > is not the size of a cache, it appears to be the share of a cache a CPU
> >> > would have under ideal circumstances.
> >> 
> >> Yes.  And it isn't for one specific level of cache.  It's sum of per-CPU
> >> shares of all levels of cache.  But the calculation is inaccurate.  More
> >> details are in the below reply.
> >> 
> >> > However, as it appears to also be
> >> > iterating hierarchy then this may not be accurate. Caches may or may not
> >> > allow data to be duplicated between levels so the value may be inaccurate.
> >> 
> >> Thank you very much for pointing this out!  The cache can be inclusive
> >> or not.  So, we cannot calculate the per-CPU slice of all-level caches
> >> via adding them together blindly.  I will change this in a follow-on
> >> patch.
> >> 
> >
> > Please do, I would strongly suggest basing this on LLC only because it's
> > the only value you can be sure of. This change is the only change that may
> > warrant a respin of the series as the history will be somewhat confusing
> > otherwise.
> 
> I am still checking whether it's possible to get cache inclusive
> information via cpuid.
> 

cpuid may be x86-specific so that potentially leads to different behaviours
on different architectures.

> If there's no reliable way to do that.  We can use the max value of
> per-CPU share of each level of cache.  For inclusive cache, that will be
> the value of LLC.  For non-inclusive cache, the value will be more
> accurate.  For example, on Intel Sapphire Rapids, the L2 cache is 2 MB
> per core, while LLC is 1.875 MB per core according to [1].
> 

Be that as it may, it still opens the possibility of significantly different
behaviour depending on the CPU family. I would strongly recommend that you
start with LLC only because LLC is also the topology level of interest used
by the scheduler and it's information that is generally available. Trying
to get accurate information on every level and the complexity of dealing
with inclusive vs exclusive cache or write-back vs write-through should
be a separate patch, with separate justification and notes on how it can
lead to behaviour specific to the CPU family or architecture.

Huang, Ying Oct. 13, 2023, 3:06 a.m. UTC | #8

Mel Gorman <mgorman@techsingularity.net> writes:

> On Thu, Oct 12, 2023 at 09:12:00PM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@techsingularity.net> writes:
>> 
>> > On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
>> >> Mel Gorman <mgorman@techsingularity.net> writes:
>> >> 
>> >> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> >> >> Per-CPU data cache size is useful information.  For example, it can be
>> >> >> used to determine per-CPU cache size.  So, in this patch, the data
>> >> >> cache size for each CPU is calculated via data_cache_size /
>> >> >> shared_cpu_weight.
>> >> >> 
>> >> >> A brute-force algorithm to iterate all online CPUs is used to avoid
>> >> >> to allocate an extra cpumask, especially in offline callback.
>> >> >> 
>> >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> >> >
>> >> > It's not necessarily relevant to the patch, but at least the scheduler
>> >> > also stores some per-cpu topology information such as sd_llc_size -- the
>> >> > number of CPUs sharing the same last-level-cache as this CPU. It may be
>> >> > worth unifying this at some point if it's common that per-cpu
>> >> > information is too fine and per-zone or per-node information is too
>> >> > coarse. This would be particularly true when considering locking
>> >> > granularity,
>> >> >
>> >> >> Cc: Sudeep Holla <sudeep.holla@arm.com>
>> >> >> Cc: Andrew Morton <akpm@linux-foundation.org>
>> >> >> Cc: Mel Gorman <mgorman@techsingularity.net>
>> >> >> Cc: Vlastimil Babka <vbabka@suse.cz>
>> >> >> Cc: David Hildenbrand <david@redhat.com>
>> >> >> Cc: Johannes Weiner <jweiner@redhat.com>
>> >> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> >> >> Cc: Michal Hocko <mhocko@suse.com>
>> >> >> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
>> >> >> Cc: Matthew Wilcox <willy@infradead.org>
>> >> >> Cc: Christoph Lameter <cl@linux.com>
>> >> >> ---
>> >> >>  drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
>> >> >>  include/linux/cacheinfo.h |  1 +
>> >> >>  2 files changed, 42 insertions(+), 1 deletion(-)
>> >> >> 
>> >> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> >> >> index cbae8be1fe52..3e8951a3fbab 100644
>> >> >> --- a/drivers/base/cacheinfo.c
>> >> >> +++ b/drivers/base/cacheinfo.c
>> >> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>> >> >>  	return rc;
>> >> >>  }
>> >> >>  
>> >> >> +static void update_data_cache_size_cpu(unsigned int cpu)
>> >> >> +{
>> >> >> +	struct cpu_cacheinfo *ci;
>> >> >> +	struct cacheinfo *leaf;
>> >> >> +	unsigned int i, nr_shared;
>> >> >> +	unsigned int size_data = 0;
>> >> >> +
>> >> >> +	if (!per_cpu_cacheinfo(cpu))
>> >> >> +		return;
>> >> >> +
>> >> >> +	ci = ci_cacheinfo(cpu);
>> >> >> +	for (i = 0; i < cache_leaves(cpu); i++) {
>> >> >> +		leaf = per_cpu_cacheinfo_idx(cpu, i);
>> >> >> +		if (leaf->type != CACHE_TYPE_DATA &&
>> >> >> +		    leaf->type != CACHE_TYPE_UNIFIED)
>> >> >> +			continue;
>> >> >> +		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
>> >> >> +		if (!nr_shared)
>> >> >> +			continue;
>> >> >> +		size_data += leaf->size / nr_shared;
>> >> >> +	}
>> >> >> +	ci->size_data = size_data;
>> >> >> +}
>> >> >
>> >> > This needs comments.
>> >> >
>> >> > It would be nice to add a comment on top describing the limitation of
>> >> > CACHE_TYPE_UNIFIED here in the context of
>> >> > update_data_cache_size_cpu().
>> >> 
>> >> Sure.  Will do that.
>> >> 
>> >
>> > Thanks.
>> >
>> >> > The L2 cache could be unified but much smaller than a L3 or other
>> >> > last-level-cache. It's not clear from the code what level of cache is being
>> >> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
>> >> > is not the size of a cache, it appears to be the share of a cache a CPU
>> >> > would have under ideal circumstances.
>> >> 
>> >> Yes.  And it isn't for one specific level of cache.  It's sum of per-CPU
>> >> shares of all levels of cache.  But the calculation is inaccurate.  More
>> >> details are in the below reply.
>> >> 
>> >> > However, as it appears to also be
>> >> > iterating hierarchy then this may not be accurate. Caches may or may not
>> >> > allow data to be duplicated between levels so the value may be inaccurate.
>> >> 
>> >> Thank you very much for pointing this out!  The cache can be inclusive
>> >> or not.  So, we cannot calculate the per-CPU slice of all-level caches
>> >> via adding them together blindly.  I will change this in a follow-on
>> >> patch.
>> >> 
>> >
>> > Please do, I would strongly suggest basing this on LLC only because it's
>> > the only value you can be sure of. This change is the only change that may
>> > warrant a respin of the series as the history will be somewhat confusing
>> > otherwise.
>> 
>> I am still checking whether it's possible to get cache inclusive
>> information via cpuid.
>> 
>
> cpuid may be x86-specific so that potentially leads to different behaviours
> on different architectures.
>
>> If there's no reliable way to do that, we can use the max value of
>> per-CPU share of each level of cache.  For inclusive cache, that will be
>> the value of LLC.  For non-inclusive cache, the value will be more
>> accurate.  For example, on Intel Sapphire Rapids, the L2 cache is 2 MB
>> per core, while LLC is 1.875 MB per core according to [1].
>> 
>
> Be that as it may, it still opens the possibility of significantly different
> behaviour depending on the CPU family. I would strongly recommend that you
> start with LLC only because LLC is also the topology level of interest used
> by the scheduler and it's information that is generally available. Trying
> to get accurate information on every level and the complexity of dealing
> with inclusive vs exclusive cache or write-back vs write-through should
> be a separate patch, with separate justification and notes on how it can
> lead to behaviour specific to the CPU family or architecture.

IMHO, we should try to optimize for as many CPUs as possible.  The size
of the per-CPU (per HW thread with SMT) slice of the LLC on the latest
Intel server CPUs is as follows:

Icelake: 0.75 MB
Sapphire Rapids: 0.9375 MB

Meanwhile, pcp->batch is 63 pages * 4 KB / 1024 = 0.2461 MB.

In [03/10], we cache pcp->batch pages before draining the PCP only if
"per_cpu_cache_slice > 4 * pcp->batch".  Since 4 * 0.2461 MB = 0.9844 MB
exceeds both slices above, this makes the optimization unavailable for a
significant portion of the server CPUs.

In theory, if "per_cpu_cache_slice > 2 * pcp->batch", we can reuse
cache-hot pages between CPUs.  So, if we change the condition to
"per_cpu_cache_slice > 3 * pcp->batch" (3 * 0.2461 MB = 0.7383 MB, which
is below both slices above), I think that we are still safe.

As for other CPUs, according to [2], AMD CPUs have a larger per-CPU LLC
slice, so it's OK for them.  ARM CPUs have a much smaller per-CPU LLC
slice, so some further optimization is needed.

[2] https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/2

So, I suggest using "per_cpu_cache_slice > 3 * pcp->batch" in [03/10],
and using LLC only in this patch [02/10].  Then, we can optimize the
per-CPU cache slice calculation in follow-up patches.
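
The threshold arithmetic above can be checked with a small userspace
sketch.  This is only an illustration, not code from the series: the
helper names pcp_batch_kb() and slice_exceeds_batch() are made up here,
and it assumes 4 KB pages and pcp->batch = 63 as in the figures above.
Sizes are in KB to stay in integer arithmetic.

```c
#include <stdbool.h>

/* Assumed values, taken from the figures quoted in this mail. */
#define PCP_BATCH_PAGES	63u	/* pcp->batch, in pages */
#define PAGE_SIZE_KB	4u	/* assumes 4 KB base pages */

/* Size of one pcp->batch worth of pages, in KB (63 * 4 = 252). */
static unsigned int pcp_batch_kb(void)
{
	return PCP_BATCH_PAGES * PAGE_SIZE_KB;
}

/*
 * Hypothetical helper: is the per-CPU LLC slice strictly larger than
 * factor * pcp->batch?  This mirrors the
 * "per_cpu_cache_slice > N * pcp->batch" condition discussed above.
 */
static bool slice_exceeds_batch(unsigned int slice_kb, unsigned int factor)
{
	return slice_kb > factor * pcp_batch_kb();
}
```

With the slices above (0.75 MB = 768 KB, 0.9375 MB = 960 KB), a factor
of 4 gives a threshold of 1008 KB, which neither slice exceeds, while a
factor of 3 gives 756 KB, which both exceed.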

--
Best Regards,
Huang, Ying

Mel Gorman Oct. 16, 2023, 3:43 p.m. UTC | #9

On Fri, Oct 13, 2023 at 11:06:51AM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
> 
> > On Thu, Oct 12, 2023 at 09:12:00PM +0800, Huang, Ying wrote:
> >> Mel Gorman <mgorman@techsingularity.net> writes:
> >> 
> >> > On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
> >> >> Mel Gorman <mgorman@techsingularity.net> writes:
> >> >> 
> >> >> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> >> >> >> Per-CPU data cache size is useful information.  For example, it can be
> >> >> >> used to determine per-CPU cache size.  So, in this patch, the data
> >> >> >> cache size for each CPU is calculated via data_cache_size /
> >> >> >> shared_cpu_weight.
> >> >> >> 
> >> >> >> A brute-force algorithm to iterate all online CPUs is used to avoid
> >> >> >> to allocate an extra cpumask, especially in offline callback.
> >> >> >> 
> >> >> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >> >> >
> >> >> > It's not necessarily relevant to the patch, but at least the scheduler
> >> >> > also stores some per-cpu topology information such as sd_llc_size -- the
> >> >> > number of CPUs sharing the same last-level-cache as this CPU. It may be
> >> >> > worth unifying this at some point if it's common that per-cpu
> >> >> > information is too fine and per-zone or per-node information is too
> >> >> > coarse. This would be particularly true when considering locking
> >> >> > granularity,
> >> >> >
> >> >> >> Cc: Sudeep Holla <sudeep.holla@arm.com>
> >> >> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> >> >> Cc: Mel Gorman <mgorman@techsingularity.net>
> >> >> >> Cc: Vlastimil Babka <vbabka@suse.cz>
> >> >> >> Cc: David Hildenbrand <david@redhat.com>
> >> >> >> Cc: Johannes Weiner <jweiner@redhat.com>
> >> >> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> >> >> Cc: Michal Hocko <mhocko@suse.com>
> >> >> >> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> >> >> >> Cc: Matthew Wilcox <willy@infradead.org>
> >> >> >> Cc: Christoph Lameter <cl@linux.com>
> >> >> >> ---
> >> >> >>  drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
> >> >> >>  include/linux/cacheinfo.h |  1 +
> >> >> >>  2 files changed, 42 insertions(+), 1 deletion(-)
> >> >> >> 
> >> >> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> >> >> >> index cbae8be1fe52..3e8951a3fbab 100644
> >> >> >> --- a/drivers/base/cacheinfo.c
> >> >> >> +++ b/drivers/base/cacheinfo.c
> >> >> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
> >> >> >>  	return rc;
> >> >> >>  }
> >> >> >>  
> >> >> >> +static void update_data_cache_size_cpu(unsigned int cpu)
> >> >> >> +{
> >> >> >> +	struct cpu_cacheinfo *ci;
> >> >> >> +	struct cacheinfo *leaf;
> >> >> >> +	unsigned int i, nr_shared;
> >> >> >> +	unsigned int size_data = 0;
> >> >> >> +
> >> >> >> +	if (!per_cpu_cacheinfo(cpu))
> >> >> >> +		return;
> >> >> >> +
> >> >> >> +	ci = ci_cacheinfo(cpu);
> >> >> >> +	for (i = 0; i < cache_leaves(cpu); i++) {
> >> >> >> +		leaf = per_cpu_cacheinfo_idx(cpu, i);
> >> >> >> +		if (leaf->type != CACHE_TYPE_DATA &&
> >> >> >> +		    leaf->type != CACHE_TYPE_UNIFIED)
> >> >> >> +			continue;
> >> >> >> +		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
> >> >> >> +		if (!nr_shared)
> >> >> >> +			continue;
> >> >> >> +		size_data += leaf->size / nr_shared;
> >> >> >> +	}
> >> >> >> +	ci->size_data = size_data;
> >> >> >> +}
> >> >> >
> >> >> > This needs comments.
> >> >> >
> >> >> > It would be nice to add a comment on top describing the limitation of
> >> >> > CACHE_TYPE_UNIFIED here in the context of
> >> >> > update_data_cache_size_cpu().
> >> >> 
> >> >> Sure.  Will do that.
> >> >> 
> >> >
> >> > Thanks.
> >> >
> >> >> > The L2 cache could be unified but much smaller than a L3 or other
> >> >> > last-level-cache. It's not clear from the code what level of cache is being
> >> >> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
> >> >> > is not the size of a cache, it appears to be the share of a cache a CPU
> >> >> > would have under ideal circumstances.
> >> >> 
> >> >> Yes.  And it isn't for one specific level of cache.  It's sum of per-CPU
> >> >> shares of all levels of cache.  But the calculation is inaccurate.  More
> >> >> details are in the below reply.
> >> >> 
> >> >> > However, as it appears to also be
> >> >> > iterating hierarchy then this may not be accurate. Caches may or may not
> >> >> > allow data to be duplicated between levels so the value may be inaccurate.
> >> >> 
> >> >> Thank you very much for pointing this out!  The cache can be inclusive
> >> >> or not.  So, we cannot calculate the per-CPU slice of all-level caches
> >> >> via adding them together blindly.  I will change this in a follow-on
> >> >> patch.
> >> >> 
> >> >
> >> > Please do, I would strongly suggest basing this on LLC only because it's
> >> > the only value you can be sure of. This change is the only change that may
> >> > warrant a respin of the series as the history will be somewhat confusing
> >> > otherwise.
> >> 
> >> I am still checking whether it's possible to get cache inclusive
> >> information via cpuid.
> >> 
> >
> > cpuid may be x86-specific so that potentially leads to different behaviours
> > on different architectures.
> >
> >> If there's no reliable way to do that, we can use the max value of
> >> per-CPU share of each level of cache.  For inclusive cache, that will be
> >> the value of LLC.  For non-inclusive cache, the value will be more
> >> accurate.  For example, on Intel Sapphire Rapids, the L2 cache is 2 MB
> >> per core, while LLC is 1.875 MB per core according to [1].
> >> 
> >
> > Be that as it may, it still opens the possibility of significantly different
> > behaviour depending on the CPU family. I would strongly recommend that you
> > start with LLC only because LLC is also the topology level of interest used
> > by the scheduler and it's information that is generally available. Trying
> > to get accurate information on every level and the complexity of dealing
> > with inclusive vs exclusive cache or write-back vs write-through should
> > be a separate patch, with separate justification and notes on how it can
> > lead to behaviour specific to the CPU family or architecture.
> 
> IMHO, we should try to optimize for as many CPUs as possible.  The size
> of the per-CPU (HW thread for SMT) slice of LLC of latest Intel server
> CPUs is as follows,
> 
> Icelake: 0.75 MB
> Sapphire Rapids: 0.9375 MB
> 
> While pcp->batch is 63 * 4 / 1024 = 0.2461 MB.
> 
> In [03/10], only if "per_cpu_cache_slice > 4 * pcp->batch", we will cache
> pcp->batch before draining the PCP.  This makes the optimization
> unavailable for a significant portion of the server CPUs.
> 
> In theory, if "per_cpu_cache_slice > 2 * pcp->batch", we can reuse
> cache-hot pages between CPUs.  So, if we change the condition to
> "per_cpu_cache_slice > 3 * pcp->batch", I think that we are still safe.
> 
> As for other CPUs, according to [2], AMD CPUs have larger per-CPU LLC.
> So, it's OK for them.  ARM CPUs have much smaller per-CPU LLC, so some
> further optimization is needed.
> 
> [2] https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/2
> 
> So, I suggest using "per_cpu_cache_slice > 3 * pcp->batch" in [03/10],
> and using LLC in this patch [02/10].  Then, we can optimize the per-CPU
> slice of cache calculation in the follow-up patches.
> 

I'm ok with adjusting the thresholds to adapt to using LLC only because at
least it'll be consistent across CPU architectures and families.  Dealing
with the potentially different cache characteristics at each level, or even
being able to discover them, is just unnecessarily complicated. It gets
even worse if the mapping changes between levels. For example, if L1 was
direct mapped, L2 index mapped and L3 fully associative then it's not even
meaningful to say that a CPU has a well-defined slice size, as cache
coloring side-effects mess everything up.
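
For illustration, an LLC-only calculation could look roughly like the
userspace sketch below.  It is not the follow-up patch: struct leaf and
llc_slice_size() are hypothetical, simplified stand-ins for the kernel's
struct cacheinfo and update_data_cache_size_cpu(), and it assumes the
data or unified leaf with the highest level is the LLC.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's cacheinfo types (hypothetical). */
enum leaf_type { LEAF_INST, LEAF_DATA, LEAF_UNIFIED };

struct leaf {
	enum leaf_type type;
	unsigned int level;	/* 1 = L1, 2 = L2, ... */
	unsigned int size;	/* cache size in bytes */
	unsigned int nr_shared;	/* number of CPUs sharing this cache */
};

/*
 * Return the per-CPU share of the last-level data/unified cache only,
 * instead of summing the shares of every level.  Returns 0 if no usable
 * leaf is found.
 */
static unsigned int llc_slice_size(const struct leaf *leaves, size_t n)
{
	const struct leaf *llc = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Instruction caches do not hold data pages; skip them. */
		if (leaves[i].type != LEAF_DATA &&
		    leaves[i].type != LEAF_UNIFIED)
			continue;
		/* Avoid dividing by zero on an empty sharing mask. */
		if (!leaves[i].nr_shared)
			continue;
		/* Remember the highest cache level seen so far. */
		if (!llc || leaves[i].level > llc->level)
			llc = &leaves[i];
	}

	return llc ? llc->size / llc->nr_shared : 0;
}
```

Using a single level avoids having to know whether the levels are
inclusive, which is the ambiguity raised earlier in the thread.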
