Message ID: 20210525080119.5455-1-mgorman@techsingularity.net (mailing list archive)
Series: Calculate pcp->high based on zone sizes and active CPUs
Hi Mel,

Feng Tang tossed these on a "Cascade Lake" system with 96 threads and ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile use" mode and being managed via the buddy just like the normal RAM.

The PMEM zones are big ones:

        present 65011712 = 248 G
        high      134595 = 525 M

The PMEM nodes, of course, don't have any CPUs in them.

With your series, the pcp->high value per-cpu is 69584 pages or about 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of worst-case memory in the pcps per zone, or roughly 10% of the size of the zone.

I did see quite a few pcp->counts above 60,000, so it's definitely possible in practice to see the pcps filled up. This was not observed to cause any actual problems in practice. But, it's still a bit worrisome.

Maybe instead of pretending that cpu-less nodes have one cpu:

        nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));

we should just consider them to have *all* the CPUs in the system. Perhaps:

        nr_local_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone)));
        if (!nr_local_cpus)
                nr_local_cpus = num_online_cpus();

Even if a system has a silly number of CPUs, the 'high' sizes will still hit a floor at 4*batch:

        high = max(high, batch << 2);

Which doesn't seem too bad, especially considering that CPU-less nodes naturally have less contention.
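[Editor's note: the worst-case arithmetic in the report above can be checked with a quick back-of-the-envelope model. This is only a sketch assuming 4 KiB pages; the function name is illustrative, not kernel code, and the input numbers are those quoted by Dave.]

```python
# Sketch of the worst-case pcp memory arithmetic from the report above.
# Assumes 4 KiB pages; 69584 pages/CPU and 96 CPUs are the reported figures.

PAGE_KIB = 4

def pcp_worst_case_mib(high_pages: int, nr_cpus: int) -> float:
    """Worst-case MiB held on the per-cpu lists of one zone."""
    return high_pages * PAGE_KIB * nr_cpus / 1024

per_cpu_mib = 69584 * PAGE_KIB / 1024               # ~272 MiB per CPU
total_gib = pcp_worst_case_mib(69584, 96) / 1024    # ~25.5 GiB across 96 CPUs
print(f"{per_cpu_mib:.0f} MiB per CPU, {total_gib:.1f} GiB worst case")
```

This reproduces the "about 270MB per CPU" and "~26GB" figures quoted in the report.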
On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
> Hi Mel,
>
> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile use" mode and being managed via the buddy just like the normal RAM.
>
> The PMEM zones are big ones:
>
>         present 65011712 = 248 G
>         high      134595 = 525 M
>
> The PMEM nodes, of course, don't have any CPUs in them.
>
> With your series, the pcp->high value per-cpu is 69584 pages or about 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of worst-case memory in the pcps per zone, or roughly 10% of the size of the zone.
>
> I did see quite a few pcp->counts above 60,000, so it's definitely possible in practice to see the pcps filled up. This was not observed to cause any actual problems in practice. But, it's still a bit worrisome.
>

Ok, it does have the potential to trigger early reclaim as pages are stored on remote PCP lists. The problem would be transient because vmstat would drain those pages over time but still, how about this patch on top of the series?

--8<--
mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes

Dave Hansen reported the following about Feng Tang's tests on a machine with persistent memory onlined as a DRAM-like device.

	Feng Tang tossed these on a "Cascade Lake" system with 96 threads
	and ~512G of persistent memory and 128G of DRAM. The PMEM is in
	"volatile use" mode and being managed via the buddy just like the
	normal RAM.

	The PMEM zones are big ones:

	        present 65011712 = 248 G
	        high      134595 = 525 M

	The PMEM nodes, of course, don't have any CPUs in them.

	With your series, the pcp->high value per-cpu is 69584 pages or
	about 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB
	of worst-case memory in the pcps per zone, or roughly 10% of the
	size of the zone.

This should not cause a problem as such although it could trigger reclaim due to pages being stored on per-cpu lists for CPUs remote to a node. It is not possible to treat cpuless nodes exactly the same as normal nodes but the worst-case scenario can be mitigated by splitting pcp->high across all online CPUs for cpuless memory nodes.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d708aa14f4ef..af566e97a0f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6687,7 +6687,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 {
 #ifdef CONFIG_MMU
 	int high;
-	int nr_local_cpus;
+	int nr_split_cpus;
 	unsigned long total_pages;
 
 	if (!percpu_pagelist_high_fraction) {
@@ -6710,10 +6710,14 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 	 * Split the high value across all online CPUs local to the zone. Note
 	 * that early in boot that CPUs may not be online yet and that during
 	 * CPU hotplug that the cpumask is not yet updated when a CPU is being
-	 * onlined.
-	 */
-	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
-	high = total_pages / nr_local_cpus;
+	 * onlined. For memory nodes that have no CPUs, split pcp->high across
+	 * all online CPUs to mitigate the risk that reclaim is triggered
+	 * prematurely due to pages stored on pcp lists.
+	 */
+	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
+	if (!nr_split_cpus)
+		nr_split_cpus = num_online_cpus();
+	high = total_pages / nr_split_cpus;
 
 	/*
 	 * Ensure high is at least batch*4. The multiple is based on the
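[Editor's note: the before/after behaviour of the patch can be modelled with a few lines of Python. This is a sketch, not kernel code: `zone_high_old`/`zone_high_new` are made-up names, and it assumes the zone's `total_pages` input works out to the 69584 pages per CPU reported above for the cpuless PMEM node.]

```python
# Toy model (not kernel code) of zone_highsize() before and after the patch.
# Inputs: pages to split, CPUs local to the zone's node, online CPUs, batch.

def zone_high_old(total_pages, nr_node_cpus, batch, cpu_online=0):
    # Pre-patch: a cpu-less node is pretended to have one local CPU.
    nr_local_cpus = max(1, nr_node_cpus) + cpu_online
    return max(total_pages // nr_local_cpus, batch * 4)

def zone_high_new(total_pages, nr_node_cpus, nr_online_cpus, batch, cpu_online=0):
    # Post-patch: a cpu-less node splits pcp->high across all online CPUs.
    nr_split_cpus = nr_node_cpus + cpu_online
    if not nr_split_cpus:
        nr_split_cpus = nr_online_cpus
    return max(total_pages // nr_split_cpus, batch * 4)

# PMEM zone from the report: no local CPUs, 96 online CPUs, batch 63.
print(zone_high_old(69584, 0, 63))        # 69584 pages, the reported ~270MB/CPU
print(zone_high_new(69584, 0, 96, 63))    # 724 pages with 96 online CPUs
```

The `max(high, batch * 4)` floor mirrors the `high = max(high, batch << 2)` guard Dave quotes earlier in the thread, so a huge CPU count cannot drive `high` below 4*batch.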
On 28.05.21 10:55, Mel Gorman wrote:
> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>> Hi Mel,
>>
>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile use" mode and being managed via the buddy just like the normal RAM.
>>
>> The PMEM zones are big ones:
>>
>>         present 65011712 = 248 G
>>         high      134595 = 525 M
>>
>> The PMEM nodes, of course, don't have any CPUs in them.
>>
>> With your series, the pcp->high value per-cpu is 69584 pages or about 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of worst-case memory in the pcps per zone, or roughly 10% of the size of the zone.

When I read about having such big amounts of free memory theoretically stuck in PCP lists, I guess we really want to start draining the PCP in alloc_contig_range(), just as we do with memory hotunplug when offlining.
On 28.05.21 11:03, David Hildenbrand wrote:
> On 28.05.21 10:55, Mel Gorman wrote:
>> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>>> Hi Mel,
>>>
>>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and ~512G of persistent memory and 128G of DRAM. [...]
>>>
>>> With your series, the pcp->high value per-cpu is 69584 pages or about 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of worst-case memory in the pcps per zone, or roughly 10% of the size of the zone.
>
> When I read about having such big amounts of free memory theoretically stuck in PCP lists, I guess we really want to start draining the PCP in alloc_contig_range(), just as we do with memory hotunplug when offlining.

Correction: we already drain the pcp, we just don't temporarily disable it, so a race as described in offline_pages() could apply:

	"Disable pcplists so that page isolation cannot race with freeing
	in a way that pages from isolated pageblock are left on pcplists."

Guess we'd then want to move the draining before start_isolate_page_range() in alloc_contig_range().
On Fri, May 28, 2021 at 11:08:01AM +0200, David Hildenbrand wrote:
> On 28.05.21 11:03, David Hildenbrand wrote:
> > On 28.05.21 10:55, Mel Gorman wrote:
> > > On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
> > > > Feng Tang tossed these on a "Cascade Lake" system with 96 threads and ~512G of persistent memory and 128G of DRAM. [...]
> > > > With your series, the pcp->high value per-cpu is 69584 pages or about 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of worst-case memory in the pcps per zone, or roughly 10% of the size of the zone.
> >
> > When I read about having such big amounts of free memory theoretically stuck in PCP lists, I guess we really want to start draining the PCP in alloc_contig_range(), just as we do with memory hotunplug when offlining.
>
> Correction: we already drain the pcp, we just don't temporarily disable it, so a race as described in offline_pages() could apply:
>
> "Disable pcplists so that page isolation cannot race with freeing
> in a way that pages from isolated pageblock are left on pcplists."
>
> Guess we'd then want to move the draining before start_isolate_page_range() in alloc_contig_range().

Or instead of draining, validate the PFN range in alloc_contig_range is within the same zone and if so, call zone_pcp_disable() before start_isolate_page_range and enable after __alloc_contig_migrate_range.
On 28.05.21 11:49, Mel Gorman wrote:
> On Fri, May 28, 2021 at 11:08:01AM +0200, David Hildenbrand wrote:
>> On 28.05.21 11:03, David Hildenbrand wrote:
>>> On 28.05.21 10:55, Mel Gorman wrote:
>>>> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>>>>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and ~512G of persistent memory and 128G of DRAM. [...]
>>>
>>> When I read about having such big amounts of free memory theoretically stuck in PCP lists, I guess we really want to start draining the PCP in alloc_contig_range(), just as we do with memory hotunplug when offlining.
>>
>> Correction: we already drain the pcp, we just don't temporarily disable it, so a race as described in offline_pages() could apply:
>>
>> "Disable pcplists so that page isolation cannot race with freeing
>> in a way that pages from isolated pageblock are left on pcplists."
>>
>> Guess we'd then want to move the draining before start_isolate_page_range() in alloc_contig_range().
>
> Or instead of draining, validate the PFN range in alloc_contig_range is within the same zone and if so, call zone_pcp_disable() before start_isolate_page_range and enable after __alloc_contig_migrate_range.

We require the caller to only pass a range within a single zone, so that should be fine.

The only ugly thing about zone_pcp_disable() is mutex_lock(&pcp_batch_high_lock), which would serialize all alloc_contig_range() calls, and even serialize them with offline_pages().
On Fri, May 28, 2021 at 11:52:53AM +0200, David Hildenbrand wrote:
> > > "Disable pcplists so that page isolation cannot race with freeing
> > > in a way that pages from isolated pageblock are left on pcplists."
> > >
> > > Guess we'd then want to move the draining before start_isolate_page_range() in alloc_contig_range().
> >
> > Or instead of draining, validate the PFN range in alloc_contig_range is within the same zone and if so, call zone_pcp_disable() before start_isolate_page_range and enable after __alloc_contig_migrate_range.
>
> We require the caller to only pass a range within a single zone, so that should be fine.
>
> The only ugly thing about zone_pcp_disable() is mutex_lock(&pcp_batch_high_lock) which would serialize all alloc_contig_range() and even with offline_pages().

True, so it would have to be assessed whether that is bad or not. If racing against offline_pages, memory is potentially being offlined in the target zone, which may cause allocation failure. If racing with other alloc_contig_range calls, the two callers are potentially racing to isolate and allocate the same range. The argument could be made that alloc_contig_range calls should be serialised within one zone to improve the allocation success rate, at the potential cost of allocation latency.
On 28.05.21 12:09, Mel Gorman wrote:
> On Fri, May 28, 2021 at 11:52:53AM +0200, David Hildenbrand wrote:
>>>> "Disable pcplists so that page isolation cannot race with freeing
>>>> in a way that pages from isolated pageblock are left on pcplists."
>>>>
>>>> Guess we'd then want to move the draining before start_isolate_page_range() in alloc_contig_range().
>>>
>>> Or instead of draining, validate the PFN range in alloc_contig_range is within the same zone and if so, call zone_pcp_disable() before start_isolate_page_range and enable after __alloc_contig_migrate_range.
>>
>> We require the caller to only pass a range within a single zone, so that should be fine.
>>
>> The only ugly thing about zone_pcp_disable() is mutex_lock(&pcp_batch_high_lock) which would serialize all alloc_contig_range() and even with offline_pages().
>
> True, so it would have to be assessed whether that is bad or not. If racing against offline_pages, memory is potentially being offlined in the target zone which may cause allocation failure. If racing with other alloc_contig_range calls, the two callers are potentially racing to isolate and allocate the same range. The argument could be made that alloc_contig_range calls should be serialised within one zone to improve the allocation success rate at the potential cost of allocation latency.

We have 3 main users of alloc_contig_range():

1. CMA

   CMA synchronizes allocation within a CMA area via the allocation bitmap, so parallel CMA is perfectly possible and avoids races by design.

2. alloc_contig_pages() / gigantic pages

   Gigantic page allocation could race with virtio-mem; CMA does not apply. Possible, but unlikely to matter in practice -- virtio-mem will retry later again.

3. virtio-mem

   A virtio-mem device only operates on its assigned memory region, so we cannot have alloc_contig_range() calls from different devices racing, even within a single zone. It could only race with gigantic pages, as CMA does not apply.

So serializing would mostly harm parallel CMA (possible and likely) and parallel virtio-mem operation (e.g., unplugging memory of two virtio-mem devices -- unlikely but possible).

Memory offlining racing with CMA is not an issue (impossible). virtio-mem synchronizes with memory offlining via memory notifiers; there is only a tiny race window that usually doesn't matter, as virtio-mem is expected to usually trigger offlining itself, rather than user space doing it randomly. Memory offlining can race with dynamic gigantic page allocation, which is highly unreliable already.

I wonder if we could optimize locking in zone_pcp_disable() instead.
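[Editor's note: the serialization concern can be illustrated with a toy Python threading model. This is purely illustrative; the zone names and the per-zone lock layout are invented, not kernel structures. A single global mutex, like pcp_batch_high_lock today, blocks concurrent callers even when they operate on unrelated zones, while hypothetical per-zone locks would let them proceed in parallel.]

```python
# Toy illustration (not kernel code) of global vs per-zone serialization.
import threading

pcp_batch_high_lock = threading.Lock()   # single global lock, as today
zone_locks = {"dram": threading.Lock(),  # hypothetical per-zone locks
              "pmem": threading.Lock()}

# Caller A is mid-operation in the "dram" zone and holds both kinds of lock:
pcp_batch_high_lock.acquire()
zone_locks["dram"].acquire()

# Caller B wants to work on the unrelated "pmem" zone:
blocked_globally = not pcp_batch_high_lock.acquire(blocking=False)
can_proceed_per_zone = zone_locks["pmem"].acquire(blocking=False)

print(blocked_globally, can_proceed_per_zone)   # True True
```

With the global lock, caller B is blocked even though the two operations touch disjoint zones; a per-zone scheme would only serialize callers racing on the same zone, which is the trade-off discussed above.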
On 5/28/21 10:55 AM, Mel Gorman wrote:
> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>> Hi Mel,
>>
>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and ~512G of persistent memory and 128G of DRAM. [...]
>>
>> With your series, the pcp->high value per-cpu is 69584 pages or about 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of worst-case memory in the pcps per zone, or roughly 10% of the size of the zone.
>>
>> I did see quite a few pcp->counts above 60,000, so it's definitely possible in practice to see the pcps filled up. This was not observed to cause any actual problems in practice. But, it's still a bit worrisome.
>
> Ok, it does have the potential to trigger early reclaim as pages are stored on remote PCP lists. The problem would be transient because vmstat would drain those pages over time but still, how about this patch on top of the series?
>
> --8<--
> mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes
>
> Dave Hansen reported the following about Feng Tang's tests on a machine with persistent memory onlined as a DRAM-like device.
>
> [...]
>
> This should not cause a problem as such although it could trigger reclaim due to pages being stored on per-cpu lists for CPUs remote to a node. It is not possible to treat cpuless nodes exactly the same as normal nodes but the worst-case scenario can be mitigated by splitting pcp->high across all online CPUs for cpuless memory nodes.
>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Maybe we should even consider distinguishing high limits for local-to-cpu zones vs remote ones: for local-to-cpu zones we would divide by the number of local cpus, while for remote-to-cpu zones we would divide by all cpus. Because we can expect cpus to allocate mostly from local zones, leaving more pages on the percpu lists for those zones can be beneficial.

But as the motivation here was to reduce lock contention on freeing, that's less clear. We probably can't expect a cpu to be freeing mostly local pages (in the case of e.g. a large process exiting), because no mechanism works towards that, or does it? In the case of a cpu freeing to a remote zone, the lower high limit could hurt, so whether that works would have to be evaluated in practice. Out of scope here, just an idea to discuss.
On Fri, May 28, 2021 at 02:12:09PM +0200, Vlastimil Babka wrote:
> > mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes
> >
> > Dave Hansen reported the following about Feng Tang's tests on a machine with persistent memory onlined as a DRAM-like device.
> >
> > [...]
> >
> > This should not cause a problem as such although it could trigger reclaim due to pages being stored on per-cpu lists for CPUs remote to a node. It is not possible to treat cpuless nodes exactly the same as normal nodes but the worst-case scenario can be mitigated by splitting pcp->high across all online CPUs for cpuless memory nodes.
> >
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.

> Maybe we should even consider distinguishing high limits for local-to-cpu zones vs remote, for example for the local-to-cpu zones we would divide by the number of local cpus, for remote-to-cpu zones we would divide by all cpus.
>
> Because we can expect cpus to allocate mostly from local zones, so leaving more pages on percpu for those zones can be beneficial.

I did think about whether the ratios should be different but failed to conclude that it was necessary or useful, so I kept it simple.

> But as the motivation here was to reduce lock contention on freeing, that's less clear. We probably can't expect the cpu to be freeing mostly local pages (in case of e.g. a large process exiting), because no mechanism works towards that, or does it? In case of cpu freeing to remote zone, the lower high limit could hurt.

This is the major issue. Even if an application is NUMA-aware and heavily threaded, a process exiting is potentially freeing remote memory and there is nothing wrong with that. The remote memory will be partially drained when pcp->high is reached and the remaining memory will be cleaned up by vmstat. It is a similar problem if a process truncates a large file with page cache allocated on a remote node. Hence I decided to do nothing fancy with the ratios until a practical problem is identified that could be alleviated by adjusting pcp->high based on whether the CPU is remote or local to the memory.
On 5/28/21 1:55 AM, Mel Gorman wrote:
> -	 * onlined.
> -	 */
> -	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> -	high = total_pages / nr_local_cpus;
> +	 * onlined. For memory nodes that have no CPUs, split pcp->high across
> +	 * all online CPUs to mitigate the risk that reclaim is triggered
> +	 * prematurely due to pages stored on pcp lists.
> +	 */
> +	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
> +	if (!nr_split_cpus)
> +		nr_split_cpus = num_online_cpus();
> +	high = total_pages / nr_split_cpus;

Updated version looks fine to me, thanks!

BTW, to do some of this testing, Feng was doing a plain old kernel build. On the one system where this got run, he noted a ~2% regression in build times. Nothing major, but you might want to be on the lookout in case 0day or the other test harnesses find something similar once this series gets to them.

Acked-by: Dave Hansen <dave.hansen@intel.com>
On Fri, May 28, 2021 at 07:39:29AM -0700, Dave Hansen wrote:
> On 5/28/21 1:55 AM, Mel Gorman wrote:
> > -	 * onlined.
> > -	 */
> > -	nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> > -	high = total_pages / nr_local_cpus;
> > +	 * onlined. For memory nodes that have no CPUs, split pcp->high across
> > +	 * all online CPUs to mitigate the risk that reclaim is triggered
> > +	 * prematurely due to pages stored on pcp lists.
> > +	 */
> > +	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
> > +	if (!nr_split_cpus)
> > +		nr_split_cpus = num_online_cpus();
> > +	high = total_pages / nr_split_cpus;
>
> Updated version looks fine to me, thanks!
>
> BTW, to do some of this testing, Feng was doing a plain old kernel build. On the one system where this got run, he noted a ~2% regression in build times. Nothing major, but you might want to be on the lookout in case 0day or the other test harnesses find something similar once this series gets to them.

What type of system was it? I noticed minor differences for some thread counts on kernel compilations but for CascadeLake at least, it was mostly neutral. Below is an old test result based on a previous revision.

kernbench
                               5.13.0-rc2             5.13.0-rc2
                                  vanilla       mm-pcpburst-v2r3
Amean     elsp-2       469.22 (   0.00%)      470.03 *  -0.17%*
Amean     elsp-4       251.03 (   0.00%)      250.83 (   0.08%)
Amean     elsp-8       131.39 (   0.00%)      130.89 (   0.38%)
Amean     elsp-16       74.37 (   0.00%)       75.11 (  -0.99%)
Amean     elsp-32       42.10 (   0.00%)       42.20 (  -0.24%)
Amean     elsp-64       32.21 (   0.00%)       32.14 (   0.23%)
Amean     elsp-128      31.59 (   0.00%)       31.68 (  -0.27%)
Amean     elsp-160      31.76 (   0.00%)       31.69 (   0.21%)

A Haswell machine showed the worst results for kernbench:

Amean     elsp-2       459.99 (   0.00%)      465.27 *  -1.15%*
Amean     elsp-4       250.76 (   0.00%)      253.17 *  -0.96%*
Amean     elsp-8       141.28 (   0.00%)      141.78 (  -0.36%)
Amean     elsp-16       77.71 (   0.00%)       77.88 (  -0.22%)
Amean     elsp-32       44.09 (   0.00%)       44.40 (  -0.69%)
Amean     elsp-64       33.79 (   0.00%)       33.46 (   0.96%)
Amean     elsp-128      33.14 (   0.00%)       33.26 (  -0.37%)
Amean     elsp-160      33.26 (   0.00%)       33.36 *  -0.30%*

The series with review feedback and dealing with cpuless nodes is queued and should complete over the weekend.

> Acked-by: Dave Hansen <dave.hansen@intel.com>

Thanks!
On 5/28/21 8:18 AM, Mel Gorman wrote:
>> BTW, to do some of this testing, Feng was doing a plain old kernel build. On the one system where this got run, he noted a ~2% regression in build times. Nothing major, but you might want to be on the lookout in case 0day or the other test harnesses find something similar once this series gets to them.
>
> What type of system was it?
>
> I noticed minor differences for some thread counts on kernel compilations but for CascadeLake at least, it was mostly neutral. Below is an old test result based on a previous revision.

It's a Cascade Lake as well. But, I never trust hardware at a hardware company. These could be preproduction CPUs or BIOS or both, or have some bonkers configuration knob flipped.

It's also got a bunch of PMEM plugged and onlined, including the _possibility_ of kernel data structures ending up on PMEM. They *mostly* don't end up there, but it does happen on occasion.

Anyway, I'll see if we can do some more runs with your latest version. It looks like it's been picked up for -mm, so 0day should be pounding on it soon enough.
On Fri, May 28, 2021 at 09:17:41AM -0700, Dave Hansen wrote:
> On 5/28/21 8:18 AM, Mel Gorman wrote:
> >> BTW, to do some of this testing, Feng was doing a plain old kernel build. On the one system where this got run, he noted a ~2% regression in build times. Nothing major, but you might want to be on the lookout in case 0day or the other test harnesses find something similar once this series gets to them.
> >
> > What type of system was it?
> >
> > I noticed minor differences for some thread counts on kernel compilations but for CascadeLake at least, it was mostly neutral. Below is an old test result based on a previous revision.
>
> It's a Cascade Lake as well. But, I never trust hardware at a hardware company. These could be preproduction CPUs or BIOS or both, or have some bonkers configuration knob flipped.
>
> It's also got a bunch of PMEM plugged and onlined, including the _possibility_ of kernel data structures ending up on PMEM. They *mostly* don't end up there, but it does happen on occasion.
>
> Anyway, I'll see if we can do some more runs with your latest version. It looks like it's been picked up for -mm so 0day should be pounding on it soon enough.

Yes, usually 0day has more benchmark tests covering the -mm tree.

As for the kbuild test run for v2, after more runs the previous 2% longer kbuild time turned into a 1% shorter time, which seems to be within the normal deviation range.

Also, I checked Mel's v3 branch which has the fix for cpuless nodes, and the pcp 'high' looks normal on the PMEM node:

  pagesets
    cpu: 0
              count: 67
              high:  724
              batch: 63
  vm stats threshold: 125

Thanks,
Feng