mm/page_alloc: add zone to zonelist if populated

Message ID	20220203020022.3044-1-richard.weiyang@gmail.com (mailing list archive)
State	New
Headers	Return-Path: <owner-linux-mm@kvack.org>
	From: Wei Yang <richard.weiyang@gmail.com>
	To: akpm@linux-foundation.org, mhocko@suse.com, mgorman@techsingularity.net
	Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Wei Yang <richard.weiyang@gmail.com>, David Hildenbrand <david@redhat.com>
	Subject: [PATCH] mm/page_alloc: add zone to zonelist if populated
	Date: Thu, 3 Feb 2022 02:00:22 +0000
	Message-Id: <20220203020022.3044-1-richard.weiyang@gmail.com>
	Sender: owner-linux-mm@kvack.org
	Precedence: bulk
Series	mm/page_alloc: add zone to zonelist if populated

Wei Yang Feb. 3, 2022, 2 a.m. UTC

During memory hotplug, when a zone is onlined or offlined, we need to
rebuild the zonelist for all nodes. The current behavior can drop a valid
zone from the zonelist, since only zones that pass managed_zone() are
picked up.

There are two cases in which a zone has memory but is still !managed:

  * all pages were allocated via memblock
  * all pages were taken by ballooning / virtio-mem

This state may be temporary, since both of them may release some memory.
We then end up with a managed zone that is not in the zonelist.

This was introduced by commit 6aa303defb74 ("mm, vmscan: only allocate
and reclaim from zones with pages managed by the buddy allocator").
This patch restores the previous behavior.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
CC: Mel Gorman <mgorman@techsingularity.net>
CC: David Hildenbrand <david@redhat.com>
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
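
The difference between the two predicates can be illustrated with a minimal
userspace sketch. This is not kernel code: the struct zone below is a
simplified stand-in with only two counters, and build_zonerefs() with its
`keep` callback is a hypothetical analogue of build_zonerefs_node(), written
only to show which zones each filter would pick up.

```c
#include <stdbool.h>

/* Simplified userspace stand-in for the kernel's struct zone (assumption). */
struct zone {
	const char *name;
	unsigned long present_pages;	/* pages physically present in the zone */
	unsigned long managed_pages;	/* pages handed to the buddy allocator */
};

/* Models the intent of populated_zone(): the zone has memory at all. */
bool populated_zone(const struct zone *zone)
{
	return zone->present_pages != 0;
}

/* Models the intent of managed_zone(): the buddy allocator owns some pages. */
bool managed_zone(const struct zone *zone)
{
	return zone->managed_pages != 0;
}

/*
 * Hypothetical analogue of build_zonerefs_node(): walk the zones from the
 * highest type down to the lowest and record every zone accepted by `keep`.
 * Requires nr_types >= 1. Returns the number of entries written.
 */
int build_zonerefs(const struct zone *zones, int nr_types,
		   const struct zone **zonerefs,
		   bool (*keep)(const struct zone *))
{
	int zone_type = nr_types;
	int nr_zones = 0;

	do {
		zone_type--;
		if (keep(&zones[zone_type]))
			zonerefs[nr_zones++] = &zones[zone_type];
	} while (zone_type);

	return nr_zones;
}
```

With a zone whose pages are all held by a balloon (present_pages > 0,
managed_pages == 0), passing managed_zone as `keep` leaves that zone out,
while passing populated_zone includes it. If the balloon later returns pages
and nothing rebuilds the list, the list built with managed_zone still lacks
the zone even though it is now managed.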

David Hildenbrand Feb. 3, 2022, 9:25 a.m. UTC | #1

On 03.02.22 03:00, Wei Yang wrote:
> During memory hotplug, when a zone is onlined or offlined, we need to
> rebuild the zonelist for all nodes. The current behavior can drop a valid
> zone from the zonelist, since only zones that pass managed_zone() are
> picked up.
> 
> There are two cases in which a zone has memory but is still !managed:
> 
>   * all pages were allocated via memblock
>   * all pages were taken by ballooning / virtio-mem
> 
> This state may be temporary, since both of them may release some memory.
> We then end up with a managed zone that is not in the zonelist.
> 
> This was introduced by commit 6aa303defb74 ("mm, vmscan: only allocate
> and reclaim from zones with pages managed by the buddy allocator").
> This patch restores the previous behavior.
> 
> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> CC: Mel Gorman <mgorman@techsingularity.net>
> CC: David Hildenbrand <david@redhat.com>
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")

That commit mentions that there used to be some ppc64 cases with fadump
where it might have been a real problem. Unfortunately, that commit
doesn't really tell what the performance implications are.

We'd have to know how many "permanent memblock" allocations we have,
that can never get freed.

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index de15021a2887..b433a57ee76f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6092,7 +6092,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
> -		if (managed_zone(zone)) {
> +		if (populated_zone(zone)) {
>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>  			check_highest_zone(zone_type);
>  		}

The comment above the function also says "Add all populated zones of a
node to the zonelist.", so one way or the other, the comment and the code
should be made consistent.

Michal Hocko Feb. 3, 2022, 9:27 a.m. UTC | #2

On Thu 03-02-22 02:00:22, Wei Yang wrote:
> During memory hotplug, when a zone is onlined or offlined, we need to
> rebuild the zonelist for all nodes. The current behavior can drop a valid
> zone from the zonelist, since only zones that pass managed_zone() are
> picked up.
> 
> There are two cases in which a zone has memory but is still !managed:
> 
>   * all pages were allocated via memblock
>   * all pages were taken by ballooning / virtio-mem
> 
> This state may be temporary, since both of them may release some memory.
> We then end up with a managed zone that is not in the zonelist.
> 
> This was introduced by commit 6aa303defb74 ("mm, vmscan: only allocate
> and reclaim from zones with pages managed by the buddy allocator").
> This patch restores the previous behavior.

It was introduced to fix a problem described in the changelog
(a FADUMP configuration making kswapd hog a CPU). You are not
explaining why the original issue cannot occur after this change.

I also think that this is more of a theoretical issue than a real-life
concern. It would be good to state that in the changelog as well.

That being said I am not against the change but the changelog needs more
explanation before I can ack it.

> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> CC: Mel Gorman <mgorman@techsingularity.net>
> CC: David Hildenbrand <david@redhat.com>
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")

The Fixes tag should really be used only if the referenced commit breaks
something. I do not really see that being the case here.

Thanks!

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index de15021a2887..b433a57ee76f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6092,7 +6092,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>  	do {
>  		zone_type--;
>  		zone = pgdat->node_zones + zone_type;
> -		if (managed_zone(zone)) {
> +		if (populated_zone(zone)) {
>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>  			check_highest_zone(zone_type);
>  		}
> -- 
> 2.33.1

Wei Yang Feb. 6, 2022, 2:11 a.m. UTC | #3

On Thu, Feb 03, 2022 at 10:25:51AM +0100, David Hildenbrand wrote:
>On 03.02.22 03:00, Wei Yang wrote:
>> During memory hotplug, when a zone is onlined or offlined, we need to
>> rebuild the zonelist for all nodes. The current behavior can drop a valid
>> zone from the zonelist, since only zones that pass managed_zone() are
>> picked up.
>> 
>> There are two cases in which a zone has memory but is still !managed:
>> 
>>   * all pages were allocated via memblock
>>   * all pages were taken by ballooning / virtio-mem
>> 
>> This state may be temporary, since both of them may release some memory.
>> We then end up with a managed zone that is not in the zonelist.
>> 
>> This was introduced by commit 6aa303defb74 ("mm, vmscan: only allocate
>> and reclaim from zones with pages managed by the buddy allocator").
>> This patch restores the previous behavior.
>> 
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> CC: Mel Gorman <mgorman@techsingularity.net>
>> CC: David Hildenbrand <david@redhat.com>
>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
>
>That commit mentions that there used to be some ppc64 cases with fadump
>where it might have been a real problem. Unfortunately, that commit
>doesn't really tell what the performance implications are.
>

It mentions 100% CPU usage attributed to commit 1d82de618ddd. So far I have
not found which part introduced this or how it was fixed.

>We'd have to know how many "permanent memblock" allocations we have,
>that can never get freed.
>

For the case in that commit, the memory is reserved for the crash kernel. I am
afraid it never gets freed.

For the general case, I am not sure.

Wei Yang Feb. 6, 2022, 2:17 a.m. UTC | #4

On Thu, Feb 03, 2022 at 10:27:11AM +0100, Michal Hocko wrote:
>On Thu 03-02-22 02:00:22, Wei Yang wrote:
>> During memory hotplug, when a zone is onlined or offlined, we need to
>> rebuild the zonelist for all nodes. The current behavior can drop a valid
>> zone from the zonelist, since only zones that pass managed_zone() are
>> picked up.
>> 
>> There are two cases in which a zone has memory but is still !managed:
>> 
>>   * all pages were allocated via memblock
>>   * all pages were taken by ballooning / virtio-mem
>> 
>> This state may be temporary, since both of them may release some memory.
>> We then end up with a managed zone that is not in the zonelist.
>> 
>> This was introduced by commit 6aa303defb74 ("mm, vmscan: only allocate
>> and reclaim from zones with pages managed by the buddy allocator").
>> This patch restores the previous behavior.
>
>It was introduced to fix a problem described in the changelog
>(a FADUMP configuration making kswapd hog a CPU). You are not
>explaining why the original issue cannot occur after this change.
>

At first sight, kswapd deals with pgdat->node_zones, which is not affected
by pgdat->node_zonelists.

I have not figured out the exact details yet and will need some time to look
into it. For that commit, I only found this link:
http://lkml.kernel.org/r/20160831195104.GB8119@techsingularity.net
Pointers to any other discussions would be helpful.

>I also think that this is more of a theoretical issue than a real-life
>concern. It would be good to state that in the changelog as well.
>
>That being said I am not against the change but the changelog needs more
>explanation before I can ack it.
>
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> CC: Mel Gorman <mgorman@techsingularity.net>
>> CC: David Hildenbrand <david@redhat.com>
>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
>
>The Fixes tag should really be used only if the referenced commit breaks
>something. I do not really see that being the case here.
>

Got it.

>Thanks!
>
>> ---
>>  mm/page_alloc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index de15021a2887..b433a57ee76f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6092,7 +6092,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
>>  	do {
>>  		zone_type--;
>>  		zone = pgdat->node_zones + zone_type;
>> -		if (managed_zone(zone)) {
>> +		if (populated_zone(zone)) {
>>  			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
>>  			check_highest_zone(zone_type);
>>  		}
>> -- 
>> 2.33.1
>
>-- 
>Michal Hocko
>SUSE Labs

Wei Yang March 16, 2022, 12:40 a.m. UTC | #5

On Thu, Feb 03, 2022 at 10:27:11AM +0100, Michal Hocko wrote:
>On Thu 03-02-22 02:00:22, Wei Yang wrote:
>> During memory hotplug, when a zone is onlined or offlined, we need to
>> rebuild the zonelist for all nodes. The current behavior can drop a valid
>> zone from the zonelist, since only zones that pass managed_zone() are
>> picked up.
>> 
>> There are two cases in which a zone has memory but is still !managed:
>> 
>>   * all pages were allocated via memblock
>>   * all pages were taken by ballooning / virtio-mem
>> 
>> This state may be temporary, since both of them may release some memory.
>> We then end up with a managed zone that is not in the zonelist.
>> 
>> This was introduced by commit 6aa303defb74 ("mm, vmscan: only allocate
>> and reclaim from zones with pages managed by the buddy allocator").
>> This patch restores the previous behavior.
>
>It was introduced to fix a problem described in the changelog
>(a FADUMP configuration making kswapd hog a CPU). You are not
>explaining why the original issue cannot occur after this change.
>

After some reading, here is what I found.

To keep this problem from coming back, we need to make sure reclaim only
applies to managed zones. After going through the code, there are only two
places where we don't guarantee this when iterating over zones:

  1. skip_throttle_noprogress()
  2. throttle_direct_reclaim()

Once we make sure vmscan only reclaims from managed zones, the original
problem is no longer possible with this change applied.

BTW, there are two other places that use for_each_zone_zonelist_nodemask().
It is OK for them not to check managed_zone(), since they are actually doing
a node-based iteration.

If this looks good to you, I will adjust the changelog and send two patches
to fix the two places above.
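
The kind of guard described above can be sketched in a small userspace
model. This is an illustration under stated assumptions, not the kernel's
reclaim code: the free_pages and watermark fields, the function
zones_below_watermark(), and its check_managed switch are all hypothetical
names chosen to show why a populated but unmanaged zone in a zonelist needs
to be skipped by reclaim-side walks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-in for struct zone (assumption). */
struct zone {
	unsigned long present_pages;	/* pages physically present */
	unsigned long managed_pages;	/* pages handed to the buddy allocator */
	unsigned long free_pages;	/* hypothetical: currently free pages */
	unsigned long watermark;	/* hypothetical: free-page target */
};

/* Models the intent of managed_zone(): the buddy allocator owns some pages. */
bool managed_zone(const struct zone *zone)
{
	return zone->managed_pages != 0;
}

/*
 * Hypothetical reclaim-side walk over a zonelist. Counts the zones that
 * look like they need reclaim. A zone with no managed pages can never
 * reach its watermark, so without the managed_zone() guard it would be
 * reported as needing reclaim on every pass.
 */
size_t zones_below_watermark(const struct zone *const *zonelist, size_t nr,
			     bool check_managed)
{
	size_t below = 0;

	for (size_t i = 0; i < nr; i++) {
		const struct zone *zone = zonelist[i];

		/* The guard: reclaim has nothing to work with here. */
		if (check_managed && !managed_zone(zone))
			continue;
		if (zone->free_pages < zone->watermark)
			below++;
	}

	return below;
}
```

In this model, a zonelist built with populated_zone() may contain a zone
with managed_pages == 0. With check_managed set, that zone is ignored and
only zones the buddy allocator actually owns can be counted as needing
reclaim.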

>I also think that this is more of a theoretical issue than a real-life
>concern. It would be good to state that in the changelog as well.
>
>That being said I am not against the change but the changelog needs more
>explanation before I can ack it.
>
>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> CC: Mel Gorman <mgorman@techsingularity.net>
>> CC: David Hildenbrand <david@redhat.com>
>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
>
>The Fixes tag should really be used only if the referenced commit breaks
>something. I do not really see that being the case here.
>
>Thanks!
>
