
Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"

Message ID 20250226032258.234099-1-krisman@suse.de (mailing list archive)
State New
Series Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"

Commit Message

Gabriel Krisman Bertazi Feb. 26, 2025, 3:22 a.m. UTC
Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
->lowmem_reserve[] for empty zone") removes the protection of lower
zones from allocations targeting memory-less high zones.  This had an
unintended impact on the pattern of reclaims because it makes the
high-zone-targeted allocation more likely to succeed in lower zones,
which adds pressure to said zones.  I.e., the following corresponding
checks in zone_watermark_ok/zone_watermark_fast are less likely to
trigger:

        if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
                return false;

As a result, we are observing an increase in reclaim and kswapd scans,
due to the increased pressure.  This was initially observed as increased
latency in filesystem operations when benchmarking with fio on a machine
with some memory-less zones, but it has since been associated with
increased contention in locks related to memory reclaim.  By reverting
this patch, the original performance was recovered on that machine.

The original commit was introduced as a clarification of the
/proc/zoneinfo output, so it doesn't seem there are usecases depending
on it, making the revert a simple solution.

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 mm/page_alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
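
As a purely illustrative aside (not part of the patch): the following
userspace sketch models the arithmetic described above. The zone sizes,
the ratio and the watermark numbers are invented, and the two functions
only mimic the shape of setup_per_zone_lowmem_reserve() and the quoted
watermark check; they are not kernel code.

#include <stdbool.h>
#include <stdio.h>

enum { DMA32, NORMAL, MOVABLE, NR_ZONES };

/* made-up managed page counts; MOVABLE is memory-less */
static long managed[NR_ZONES] = { 400000, 5500000, 0 };
static long reserve[NR_ZONES][NR_ZONES];

/* rough shape of setup_per_zone_lowmem_reserve(); ratio stands in for
 * lowmem_reserve_ratio */
static void setup_reserve(bool zero_for_empty_upper)
{
	const long ratio = 256;

	for (int i = 0; i < NR_ZONES; i++) {
		long upper_managed = 0;

		for (int j = i + 1; j < NR_ZONES; j++) {
			upper_managed += managed[j];
			if (zero_for_empty_upper && !managed[j])
				reserve[i][j] = 0;	/* what 96a5c186efff did */
			else
				reserve[i][j] = upper_managed / ratio;
		}
	}
}

/* mirrors the check quoted in the changelog */
static bool watermark_ok(int zone, long free_pages, long min, int highest_zoneidx)
{
	return free_pages > min + reserve[zone][highest_zoneidx];
}

int main(void)
{
	/* a DMA32 zone that is already fairly depleted */
	const long free_pages = 2000, min = 500;

	setup_reserve(false);	/* before 96a5c186efff */
	printf("with reserve:    MOVABLE alloc may take DMA32? %d\n",
	       watermark_ok(DMA32, free_pages, min, MOVABLE));

	setup_reserve(true);	/* after 96a5c186efff */
	printf("without reserve: MOVABLE alloc may take DMA32? %d\n",
	       watermark_ok(DMA32, free_pages, min, MOVABLE));
	return 0;
}

With the reserve in place the check fails and DMA32 keeps its protection;
with the empty-zone entry zeroed, the same allocation passes and can dip
into DMA32, which is the extra pressure described above.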

Comments

Michal Hocko Feb. 26, 2025, 6:54 a.m. UTC | #1
On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
> ->lowmem_reserve[] for empty zone") removes the protection of lower
> zones from allocations targeting memory-less high zones.  This had an
> unintended impact on the pattern of reclaims because it makes the
> high-zone-targeted allocation more likely to succeed in lower zones,
> which adds pressure to said zones.  I.e, the following corresponding
> checks in zone_watermark_ok/zone_watermark_fast are less likely to
> trigger:
> 
>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>                 return false;
> 
> As a result, we are observing an increase in reclaim and kswapd scans,
> due to the increased pressure.  This was initially observed as increased
> latency in filesystem operations when benchmarking with fio on a machine
> with some memory-less zones, but it has since been associated with
> increased contention in locks related to memory reclaim.  By reverting
> this patch, the original performance was recovered on that machine.

I think it would be nice to show the memory layout on that machine (is
there any movable or device zone)?

Exact reclaim patterns are really hard to predict and it is a little bit
surprising that the said patch has caused increased kswapd activity,
because I would expect there to be more reclaim with the lowmem
reserves in place. But it is quite possible that the higher zone memory
pressure is just tipping over and increasing the lowmem pressure enough
that it shows up.

In any case 96a5c186efff seems incorrect because it assumes that the
protection has anything to do with how the higher zone is populated, while
the protection fundamentally protects the lower zone from higher-zone
allocations. Those allocations are independent of the actual memory in
that zone.
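
To put it concretely (a sketch of the relevant path, assuming the usual
gfp_zone() mapping rather than the exact kernel code): a userspace
allocation ends up with

        highest_zoneidx = gfp_zone(GFP_HIGHUSER_MOVABLE);	/* ZONE_MOVABLE */
        /* for each lower candidate zone z (NORMAL, DMA32, DMA ...): */
        if (free_pages <= min + z->lowmem_reserve[ZONE_MOVABLE])
                return false;	/* z keeps its protection */

so zeroing lowmem_reserve[ZONE_MOVABLE] in the lower zones, as
96a5c186efff did when ZONE_MOVABLE is empty, drops exactly that
protection even though the request never needed ZONE_MOVABLE to be
populated.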

> The original commit was introduced as a clarification of the
> /proc/zoneinfo output, so it doesn't seem there are usecases depending
> on it, making the revert a simple solution.
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Baoquan He <bhe@redhat.com>
> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!

> ---
>  mm/page_alloc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 579789600a3c..fe986e6de7a0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
>  
>  			for (j = i + 1; j < MAX_NR_ZONES; j++) {
>  				struct zone *upper_zone = &pgdat->node_zones[j];
> -				bool empty = !zone_managed_pages(upper_zone);
>  
>  				managed_pages += zone_managed_pages(upper_zone);
>  
> -				if (clear || empty)
> +				if (clear)
>  					zone->lowmem_reserve[j] = 0;
>  				else
>  					zone->lowmem_reserve[j] = managed_pages / ratio;
> -- 
> 2.47.0
Baoquan He Feb. 26, 2025, 10 a.m. UTC | #2
On 02/26/25 at 07:54am, Michal Hocko wrote:
> On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
> > Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
> > ->lowmem_reserve[] for empty zone") removes the protection of lower
> > zones from allocations targeting memory-less high zones.  This had an
> > unintended impact on the pattern of reclaims because it makes the
> > high-zone-targeted allocation more likely to succeed in lower zones,
> > which adds pressure to said zones.  I.e, the following corresponding
> > checks in zone_watermark_ok/zone_watermark_fast are less likely to
> > trigger:
> > 
> >         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
> >                 return false;
> > 
> > As a result, we are observing an increase in reclaim and kswapd scans,
> > due to the increased pressure.  This was initially observed as increased
> > latency in filesystem operations when benchmarking with fio on a machine
> > with some memory-less zones, but it has since been associated with
> > increased contention in locks related to memory reclaim.  By reverting
> > this patch, the original performance was recovered on that machine.
> 
> I think it would be nice to show the memory layout on that machine (is
> there any movable or device zone)?

Yeah, printing /proc/zoneinfo and pasting it here would be helpful.

> 
> Exact reclaim patterns are really hard to predict and it is little bit
> surprising the said patch has caused an increased kswapd activity
> because I would expect that there will be more reclaim with the lowmem
> reserves in place. But it is quite possible that the higher zone memory
> pressure is just tipping over and increase the lowmem pressure enough
> that it shows up.
> 
> In any case 96a5c186efff seems incorrect because it assumes that the
> protection has anything to do with how higher zone is populated while
> the protection fundamentaly protects lower zone from higher zones
> allocation. Those allocations are independent on the actual memory in
> that zone.

The protection value was introduced before NUMA support, and later adapted
to NUMA systems. Still, it only relates each zone to the other zones
within one specific node. We may need to take this opportunity to
reconsider it; e.g. in the FALLBACK zonelists case it may need to take
crossing nodes into account.
Michal Hocko Feb. 26, 2025, 10:52 a.m. UTC | #3
On Wed 26-02-25 18:00:26, Baoquan He wrote:
> On 02/26/25 at 07:54am, Michal Hocko wrote:
[...]
> > In any case 96a5c186efff seems incorrect because it assumes that the
> > protection has anything to do with how higher zone is populated while
> > the protection fundamentaly protects lower zone from higher zones
> > allocation. Those allocations are independent on the actual memory in
> > that zone.
> 
> The protection value was introduced in non-NUMA time, and later adapted
> to NUMA system. While it still only reflects each zone with other zones
> within one specific node. We may need take this opportunity to
> reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> nodes into account.

Are you suggesting that the zone fallback list should interleave nodes? I.e.
the numa_zonelist_order we used to have in the past and that has been
removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
Michal Hocko Feb. 26, 2025, 11 a.m. UTC | #4
On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > On 02/26/25 at 07:54am, Michal Hocko wrote:
> [...]
> > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > protection has anything to do with how higher zone is populated while
> > > the protection fundamentaly protects lower zone from higher zones
> > > allocation. Those allocations are independent on the actual memory in
> > > that zone.
> > 
> > The protection value was introduced in non-NUMA time, and later adapted
> > to NUMA system. While it still only reflects each zone with other zones
> > within one specific node. We may need take this opportunity to
> > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > nodes into account.
> 
> Are you suggesting zone fallback list to interleave nodes? I.e.
> numa_zonelist_order we used to have in the past and that has been
> removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").

Btw. has 96a5c186efff tried to fix any actual runtime problem? The
changelog doesn't say much about that.
Baoquan He Feb. 26, 2025, 11:51 a.m. UTC | #5
On 02/26/25 at 12:00pm, Michal Hocko wrote:
> On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > [...]
> > > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > > protection has anything to do with how higher zone is populated while
> > > > the protection fundamentaly protects lower zone from higher zones
> > > > allocation. Those allocations are independent on the actual memory in
> > > > that zone.
> > > 
> > > The protection value was introduced in non-NUMA time, and later adapted
> > > to NUMA system. While it still only reflects each zone with other zones
> > > within one specific node. We may need take this opportunity to
> > > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > > nodes into account.
> > 
> > Are you suggesting zone fallback list to interleave nodes? I.e.
> > numa_zonelist_order we used to have in the past and that has been
> > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").

Hmm, if Gabriel can provide detailed node/zone information about the
system, we can check whether there's anything we can do to adjust
zone->lowmem_reserve[] to reflect its real usage and semantics. I
wasn't thinking of making the whole zone fallback list interleave nodes,
which involves a lot of change.

> 
> Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> changelog doesn't say much about that. 

No, no actual problem was observed on that. I was just trying to make
the semantics clear because I was confused by the obscure value printing
of zone->lowmem_reserve[] in /proc/zoneinfo.

I think we can merge this revert first, then investigate how to
clarify it better.
Michal Hocko Feb. 26, 2025, 12:01 p.m. UTC | #6
On Wed 26-02-25 19:51:12, Baoquan He wrote:
> On 02/26/25 at 12:00pm, Michal Hocko wrote:
> > On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > > [...]
> > > > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > > > protection has anything to do with how higher zone is populated while
> > > > > the protection fundamentaly protects lower zone from higher zones
> > > > > allocation. Those allocations are independent on the actual memory in
> > > > > that zone.
> > > > 
> > > > The protection value was introduced in non-NUMA time, and later adapted
> > > > to NUMA system. While it still only reflects each zone with other zones
> > > > within one specific node. We may need take this opportunity to
> > > > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > > > nodes into account.
> > > 
> > > Are you suggesting zone fallback list to interleave nodes? I.e.
> > > numa_zonelist_order we used to have in the past and that has been
> > > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
> 
> Hmm, if Gabriel can provide detailed node/zone information of the
> system, we can check if there's anything we can do to adjust
> zone->lowmem_reserve[] to reflect its real usage and semantics.

Why do you think anything needs to be adjusted?

> I haven't thought of the whole zone fallback list to interleave nodes
> which invovles a lot of change.
> 
> > 
> > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> > changelog doesn't say much about that. 
> 
> No, no actual problem was observed on tht.

OK

> I was just trying to make
> clear the semantics because I was confused by its obscure value printing
> of zone->lowmem_reserve[] in /proc/zoneinfo.
> 
> I think we can merge this reverting firstly, then to investigate how to
> better clarify it.

What do you think needs to be clarified? How exactly is the original
output confusing?
Vlastimil Babka Feb. 26, 2025, 1 p.m. UTC | #7
On 2/26/25 4:22 AM, Gabriel Krisman Bertazi wrote:
> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
> ->lowmem_reserve[] for empty zone") removes the protection of lower
> zones from allocations targeting memory-less high zones.  This had an
> unintended impact on the pattern of reclaims because it makes the
> high-zone-targeted allocation more likely to succeed in lower zones,
> which adds pressure to said zones.  I.e, the following corresponding
> checks in zone_watermark_ok/zone_watermark_fast are less likely to
> trigger:
> 
>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>                 return false;
> 
> As a result, we are observing an increase in reclaim and kswapd scans,
> due to the increased pressure.  This was initially observed as increased
> latency in filesystem operations when benchmarking with fio on a machine
> with some memory-less zones, but it has since been associated with
> increased contention in locks related to memory reclaim.  By reverting
> this patch, the original performance was recovered on that machine.
> 
> The original commit was introduced as a clarification of the
> /proc/zoneinfo output, so it doesn't seem there are usecases depending
> on it, making the revert a simple solution.
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Baoquan He <bhe@redhat.com>
> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 579789600a3c..fe986e6de7a0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
>  
>  			for (j = i + 1; j < MAX_NR_ZONES; j++) {
>  				struct zone *upper_zone = &pgdat->node_zones[j];
> -				bool empty = !zone_managed_pages(upper_zone);
>  
>  				managed_pages += zone_managed_pages(upper_zone);
>  
> -				if (clear || empty)
> +				if (clear)
>  					zone->lowmem_reserve[j] = 0;
>  				else
>  					zone->lowmem_reserve[j] = managed_pages / ratio;
Vlastimil Babka Feb. 26, 2025, 1:07 p.m. UTC | #8
On 2/26/25 7:54 AM, Michal Hocko wrote:
> On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
>> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
>> ->lowmem_reserve[] for empty zone") removes the protection of lower
>> zones from allocations targeting memory-less high zones.  This had an
>> unintended impact on the pattern of reclaims because it makes the
>> high-zone-targeted allocation more likely to succeed in lower zones,
>> which adds pressure to said zones.  I.e, the following corresponding
>> checks in zone_watermark_ok/zone_watermark_fast are less likely to
>> trigger:
>>
>>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>>                 return false;
>>
>> As a result, we are observing an increase in reclaim and kswapd scans,
>> due to the increased pressure.  This was initially observed as increased
>> latency in filesystem operations when benchmarking with fio on a machine
>> with some memory-less zones, but it has since been associated with
>> increased contention in locks related to memory reclaim.  By reverting
>> this patch, the original performance was recovered on that machine.
> 
> I think it would be nice to show the memory layout on that machine (is
> there any movable or device zone)?
> 
> Exact reclaim patterns are really hard to predict and it is little bit
> surprising the said patch has caused an increased kswapd activity
> because I would expect that there will be more reclaim with the lowmem
> reserves in place. But it is quite possible that the higher zone memory
> pressure is just tipping over and increase the lowmem pressure enough
> that it shows up.

My theory is that the commit caused a difference between kernel and
userspace allocations, with bad consequences. Kernel allocation will
have highest_zoneidx = NORMAL and thus observe the lowmem_reserve for
for ZONE_DMA32 unchanged. Userspace allocation will have highest_zoneidx
= MOVABLE and thus will see zero lowmem_reserve and will allocate from
ZONE_DMA32 (or even ZONE_DMA) when previously it wouldn't.

Then a kernel allocation might happen to wake up kswapd, which will see
the DMA/DMA32 below watermark (with NORMAL highest_zoneidx) and try to
reclaim them back above the watermarks. Since the LRU lists are per-node
and not per-zone anymore, it will spend a lot of effort to find pages
from DMA/DMA32 to reclaim.
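
A sketch of that asymmetry, with hypothetical reserve entries as they
would look after 96a5c186efff on a node whose ZONE_MOVABLE is empty:

        /* ZONE_DMA32 on such a node, after 96a5c186efff */
        zone->lowmem_reserve[ZONE_NORMAL]  = managed_normal / ratio;	/* still nonzero */
        zone->lowmem_reserve[ZONE_MOVABLE] = 0;				/* zeroed: upper zone empty */

        /* kernel allocation, highest_zoneidx == ZONE_NORMAL: DMA32 stays protected */
        ok = free_pages > min + zone->lowmem_reserve[ZONE_NORMAL];

        /* userspace allocation, highest_zoneidx == ZONE_MOVABLE: reserve is 0, no protection */
        ok = free_pages > min + zone->lowmem_reserve[ZONE_MOVABLE];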

> In any case 96a5c186efff seems incorrect because it assumes that the
> protection has anything to do with how higher zone is populated while
> the protection fundamentaly protects lower zone from higher zones
> allocation. Those allocations are independent on the actual memory in
> that zone.
> 
>> The original commit was introduced as a clarification of the
>> /proc/zoneinfo output, so it doesn't seem there are usecases depending
>> on it, making the revert a simple solution.
>>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Baoquan He <bhe@redhat.com>
>> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Thanks!
> 
>> ---
>>  mm/page_alloc.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 579789600a3c..fe986e6de7a0 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
>>  
>>  			for (j = i + 1; j < MAX_NR_ZONES; j++) {
>>  				struct zone *upper_zone = &pgdat->node_zones[j];
>> -				bool empty = !zone_managed_pages(upper_zone);
>>  
>>  				managed_pages += zone_managed_pages(upper_zone);
>>  
>> -				if (clear || empty)
>> +				if (clear)
>>  					zone->lowmem_reserve[j] = 0;
>>  				else
>>  					zone->lowmem_reserve[j] = managed_pages / ratio;
>> -- 
>> 2.47.0
>
Baoquan He Feb. 26, 2025, 3:57 p.m. UTC | #9
On 02/26/25 at 01:01pm, Michal Hocko wrote:
> On Wed 26-02-25 19:51:12, Baoquan He wrote:
> > On 02/26/25 at 12:00pm, Michal Hocko wrote:
> > > On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > > > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > > > [...]
> > > > > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > > > > protection has anything to do with how higher zone is populated while
> > > > > > the protection fundamentaly protects lower zone from higher zones
> > > > > > allocation. Those allocations are independent on the actual memory in
> > > > > > that zone.
> > > > > 
> > > > > The protection value was introduced in non-NUMA time, and later adapted
> > > > > to NUMA system. While it still only reflects each zone with other zones
> > > > > within one specific node. We may need take this opportunity to
> > > > > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > > > > nodes into account.
> > > > 
> > > > Are you suggesting zone fallback list to interleave nodes? I.e.
> > > > numa_zonelist_order we used to have in the past and that has been
> > > > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
> > 
> > Hmm, if Gabriel can provide detailed node/zone information of the
> > system, we can check if there's anything we can do to adjust
> > zone->lowmem_reserve[] to reflect its real usage and semantics.
> 
> Why do you think anything needs to be adjusted?

No, I don't think like that. But I am wondering what makes you get
the conclusion.

> 
> > I haven't thought of the whole zone fallback list to interleave nodes
> > which invovles a lot of change.
> > 
> > > 
> > > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> > > changelog doesn't say much about that. 
> > 
> > No, no actual problem was observed on tht.
> 
> OK
> 
> > I was just trying to make
> > clear the semantics because I was confused by its obscure value printing
> > of zone->lowmem_reserve[] in /proc/zoneinfo.
> > 
> > I think we can merge this reverting firstly, then to investigate how to
> > better clarify it.
> 
> What do you think needs to be clarify? How exactly is the original
> output confusing?

When I did the change, I wrote the reason in the commit log. From your
comments, I don't think you cared to read it. Let me explain again: in
setup_per_zone_lowmem_reserve(), each zone's protection value is
calculated based on its own node's zones. E.g. below, on node 0, its
Movable zone and Device zone are empty but still show an influence on
the Normal/DMA32/DMA zones; this is unreasonable given the protection
value calculation code and what it shows.

If, as your colleague Gabriel said, the protection value of the DMA zone
on node 0 really impacts allocations whose target zone is the Movable zone,
we may need to consider calculating the protection value across nodes,
because the impact happens among different nodes. I only said we could
investigate; I didn't say we need to change or have to change anything.

Node 0, zone      DMA
  ......
  pages free     2816
        ......
        protection: (0, 1582, 23716, 23716, 23716)
Node 0, zone    DMA32
  pages free     403269
        ......
        protection: (0, 0, 22134, 22134, 22134)
Node 0, zone   Normal
  pages free     5423879
        ......
        protection: (0, 0, 0, 0, 0)
Node 0, zone  Movable
  pages free     0
        ......
        protection: (0, 0, 0, 0, 0)
Node 0, zone   Device
  pages free     0
        ......
        protection: (0, 0, 0, 0, 0)
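
For reference, assuming the default lowmem_reserve_ratio of 256 for
ZONE_DMA, those DMA-zone values fall straight out of the running sum in
setup_per_zone_lowmem_reserve() (a worked illustration, not output from
that machine):

        protection[DMA][DMA32]   = managed(DMA32) / 256                      ~= 1582
        protection[DMA][Normal]  = (managed(DMA32) + managed(Normal)) / 256  ~= 23716
        protection[DMA][Movable] = protection[DMA][Device]                   ~= 23716

The empty Movable and Device zones add nothing to the running sum, so
their entries simply repeat the Normal one.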

> 
> -- 
> Michal Hocko
> SUSE Labs
>
Gabriel Krisman Bertazi Feb. 26, 2025, 4:05 p.m. UTC | #10
Michal Hocko <mhocko@suse.com> writes:

> On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
>> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
>> ->lowmem_reserve[] for empty zone") removes the protection of lower
>> zones from allocations targeting memory-less high zones.  This had an
>> unintended impact on the pattern of reclaims because it makes the
>> high-zone-targeted allocation more likely to succeed in lower zones,
>> which adds pressure to said zones.  I.e, the following corresponding
>> checks in zone_watermark_ok/zone_watermark_fast are less likely to
>> trigger:
>> 
>>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>>                 return false;
>> 
>> As a result, we are observing an increase in reclaim and kswapd scans,
>> due to the increased pressure.  This was initially observed as increased
>> latency in filesystem operations when benchmarking with fio on a machine
>> with some memory-less zones, but it has since been associated with
>> increased contention in locks related to memory reclaim.  By reverting
>> this patch, the original performance was recovered on that machine.
>
> I think it would be nice to show the memory layout on that machine (is
> there any movable or device zone)?
>
> Exact reclaim patterns are really hard to predict and it is little bit
> surprising the said patch has caused an increased kswapd activity
> because I would expect that there will be more reclaim with the lowmem
> reserves in place. But it is quite possible that the higher zone memory
> pressure is just tipping over and increase the lowmem pressure enough
> that it shows up.

For reference, I collected vmstat with and without this patch on a
freshly booted system running intensive randread io from an nvme for 5
minutes. I got:

rpm-6.12.0-slfo.1.2 ->  pgscan_kswapd 5629543865
Patched             ->  pgscan_kswapd 33580844

33M scans is similar to what we had in kernels predating this patch.
These numbers are fairly representative of the workload on this machine, as
measured in several runs.  So we are talking about a two-order-of-magnitude
increase.

Attached is the zoneinfo with my revert patch applied.
Michal Hocko Feb. 26, 2025, 5:46 p.m. UTC | #11
On Wed 26-02-25 23:57:48, Baoquan He wrote:
> On 02/26/25 at 01:01pm, Michal Hocko wrote:
> > On Wed 26-02-25 19:51:12, Baoquan He wrote:
> > > On 02/26/25 at 12:00pm, Michal Hocko wrote:
> > > > On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > > > > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > > > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > > > > [...]
> > > > > > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > > > > > protection has anything to do with how higher zone is populated while
> > > > > > > the protection fundamentaly protects lower zone from higher zones
> > > > > > > allocation. Those allocations are independent on the actual memory in
> > > > > > > that zone.
> > > > > > 
> > > > > > The protection value was introduced in non-NUMA time, and later adapted
> > > > > > to NUMA system. While it still only reflects each zone with other zones
> > > > > > within one specific node. We may need take this opportunity to
> > > > > > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > > > > > nodes into account.
> > > > > 
> > > > > Are you suggesting zone fallback list to interleave nodes? I.e.
> > > > > numa_zonelist_order we used to have in the past and that has been
> > > > > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
> > > 
> > > Hmm, if Gabriel can provide detailed node/zone information of the
> > > system, we can check if there's anything we can do to adjust
> > > zone->lowmem_reserve[] to reflect its real usage and semantics.
> > 
> > Why do you think anything needs to be adjusted?
> 
> No, I don't think like that. But I am wondering what makes you get
> the conclusion.
> 
> > 
> > > I haven't thought of the whole zone fallback list to interleave nodes
> > > which invovles a lot of change.
> > > 
> > > > 
> > > > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> > > > changelog doesn't say much about that. 
> > > 
> > > No, no actual problem was observed on tht.
> > 
> > OK
> > 
> > > I was just trying to make
> > > clear the semantics because I was confused by its obscure value printing
> > > of zone->lowmem_reserve[] in /proc/zoneinfo.
> > > 
> > > I think we can merge this reverting firstly, then to investigate how to
> > > better clarify it.
> > 
> > What do you think needs to be clarify? How exactly is the original
> > output confusing?
> 
> When I did the change, I wrote the reason in commit log. I don't think
> you care to read it from your talking. Let me explain again, in
> setup_per_zone_lowmem_reserve(), each zone's protection value is
> calculated based on its own node's zones. E.g below on node 0, its
> Movable zone and Device zone are empty but still show influence on
> Normal/DMA32/DMA zone, this is unreasonable from the protection value
> calculating code and its showing.

You said that in the commit message without explaining why. Also, I
claim this is just wrong, because the zone's protection is independent of
the size of the zone that it is protected from. I have explained why I
believe that, but let me reiterate. ZONE_DMA32 should be protected from
GFP_MOVABLE even if the movable zone is empty (same as if it had a
single page or many pages). Why? Because the lowmem reserve protects low
memory from allocation requests.

See my point? Is that reasoning clear?

P.S.
I think we can have a more productive discussion without accusations.
Andrew Morton Feb. 26, 2025, 11 p.m. UTC | #12
On Wed, 26 Feb 2025 11:05:10 -0500 Gabriel Krisman Bertazi <krisman@suse.de> wrote:

> For reference, I collected vmstat with and without this patch on a
> freshly booted system running intensive randread io from an nvme for 5
> minutes. I got:
> 
> rpm-6.12.0-slfo.1.2 ->  pgscan_kswapd 5629543865
> Patched             ->  pgscan_kswapd 33580844
> 
> 33M scans is similar to what we had in kernels predating this patch.
> These numbers is fairly representative of the workload on this machine, as
> measured in several runs.  So we are talking about a 2-order of
> magnitude increase.

Thanks for adding this detail - I pasted it into the changelog.

I'll queue the patch with a cc:stable.  You may wish to send along a
new changelog to fill in additional details resulting from the review
discussion.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 579789600a3c..fe986e6de7a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5849,11 +5849,10 @@  static void setup_per_zone_lowmem_reserve(void)
 
 			for (j = i + 1; j < MAX_NR_ZONES; j++) {
 				struct zone *upper_zone = &pgdat->node_zones[j];
-				bool empty = !zone_managed_pages(upper_zone);
 
 				managed_pages += zone_managed_pages(upper_zone);
 
-				if (clear || empty)
+				if (clear)
 					zone->lowmem_reserve[j] = 0;
 				else
 					zone->lowmem_reserve[j] = managed_pages / ratio;