diff mbox series

[-next,v3,1/2] mm: Cap zone movable's min wmark to small value

Message ID 20220905032858.1462927-2-mawupeng1@huawei.com (mailing list archive)
State New
Headers show
Series watermark related improvement on zone movable | expand

Commit Message

mawupeng Sept. 5, 2022, 3:28 a.m. UTC
From: Ma Wupeng <mawupeng1@huawei.com>

Since min_free_kbytes is based on gfp_zone(GFP_USER) which does not include
zone movable. However zone movable will get its min share in
__setup_per_zone_wmarks() which does not make any sense.

And like highmem pages, __GFP_HIGH and PF_MEMALLOC allocations usually
don't need movable pages, so there is no need to assign min pages for zone
movable.

Let's cap pages_min for zone movable to a small value here just link
highmem pages.

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 mm/page_alloc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Mel Gorman Sept. 5, 2022, 9:26 a.m. UTC | #1
On Mon, Sep 05, 2022 at 11:28:57AM +0800, Wupeng Ma wrote:
> From: Ma Wupeng <mawupeng1@huawei.com>
> 
> Since min_free_kbytes is based on gfp_zone(GFP_USER) which does not include
> zone movable. However zone movable will get its min share in
> __setup_per_zone_wmarks() which does not make any sense.
> 
> And like highmem pages, __GFP_HIGH and PF_MEMALLOC allocations usually
> don't need movable pages, so there is no need to assign min pages for zone
> movable.
> 
> Let's cap pages_min for zone movable to a small value here just link
> highmem pages.
> 

I think there is a misunderstanding why the higher zones have a watermark
and why it might be large.

It's not about a __GFP_HIGH or PF_MEMALLOC allocations because it's known
that few of those allocations may be movable. It's because high memory
allocations indirectly pin pages in lower zones. User-mapped memory allocated
from ZONE_MOVABLE still needs page table pages allocated from a lower zone
so there is a ratio between the size of ZONE_MOVABLE and lower zones
that limits the total amount of memory that can be allocated. Similarly,
file backed pages that may be allocated from ZONE_MOVABLE still requires
pages from lower memory for the inode and other associated kernel
objects that are allocated from lower zones.

The intent behind the higher zones having a large min watermark is so
that kswapd reclaims pages from there first to *potentially* release
pages from lower memory. By capping pages_min for zone_movable, there is
the potential for lower memory pressure to be higher and to reach a point
where a ZONE_MOVABLE page cannot be allocated simply because there isn't
enough low memory available. Once the lower zones are all unreclaimable
(e.g. page table pages or the movable pages are not been reclaimed to free
the associated kernel structures), the system goes OOM.

It's possible that there are safe adjustments that could be made that
would detect when there is no choice except to reclaim zone reclaimable
but it would be tricky and it's not this patch. This patch changelog states

	However zone movable will get its min share in
	__setup_per_zone_wmarks() which does not make any sense.

It makes sense, higher zones allocations indirectly pin pages in lower
zones and there is a bias in reclaim to free the higher zone pages first
on the *possibility* that lower zone pages get indirectly released later.
mawupeng Sept. 6, 2022, 10:12 a.m. UTC | #2
On 2022/9/5 17:26, Mel Gorman wrote:
> On Mon, Sep 05, 2022 at 11:28:57AM +0800, Wupeng Ma wrote:
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> Since min_free_kbytes is based on gfp_zone(GFP_USER) which does not include
>> zone movable. However zone movable will get its min share in
>> __setup_per_zone_wmarks() which does not make any sense.
>>
>> And like highmem pages, __GFP_HIGH and PF_MEMALLOC allocations usually
>> don't need movable pages, so there is no need to assign min pages for zone
>> movable.
>>
>> Let's cap pages_min for zone movable to a small value here just link
>> highmem pages.
>>
> 
> I think there is a misunderstanding why the higher zones have a watermark
> and why it might be large.
> 
> It's not about a __GFP_HIGH or PF_MEMALLOC allocations because it's known
> that few of those allocations may be movable. It's because high memory
> allocations indirectly pin pages in lower zones. User-mapped memory allocated
> from ZONE_MOVABLE still needs page table pages allocated from a lower zone
> so there is a ratio between the size of ZONE_MOVABLE and lower zones
> that limits the total amount of memory that can be allocated. Similarly,
> file backed pages that may be allocated from ZONE_MOVABLE still requires
> pages from lower memory for the inode and other associated kernel
> objects that are allocated from lower zones.
> 
> The intent behind the higher zones having a large min watermark is so
> that kswapd reclaims pages from there first to *potentially* release
> pages from lower memory. By capping pages_min for zone_movable, there is
> the potential for lower memory pressure to be higher and to reach a point
> where a ZONE_MOVABLE page cannot be allocated simply because there isn't
> enough low memory available. Once the lower zones are all unreclaimable
> (e.g. page table pages or the movable pages are not been reclaimed to free
> the associated kernel structures), the system goes OOM.

This i do agree with you, lower zone is actually "more important" than the
higher one.

But higher min watermark for zone movable will not work since no memory
allocation can use this reserve memory below min. Memory allocation
with specify watermark modifier(__GFP_ATOMIC ,__GFP_HIGH ...) can use this
in slowpath, however the standard movable memory allocation
(gfp flag: GFP_HIGHUSER_MOVABLE) does not contain this.

Second, lowmem_reserve_ratio is used to "reserve" memory for lower zone.
And the second patch introduce per zone watermark_scale_factor to boost
normal/movable zone's watermark which can trigger early kswapd for zone
movable.

> 
> It's possible that there are safe adjustments that could be made that
> would detect when there is no choice except to reclaim zone reclaimable
> but it would be tricky and it's not this patch. This patch changelog states
> 
> 	However zone movable will get its min share in
> 	__setup_per_zone_wmarks() which does not make any sense.
> 
> It makes sense, higher zones allocations indirectly pin pages in lower
> zones and there is a bias in reclaim to free the higher zone pages first
> on the *possibility* that lower zone pages get indirectly released later.
> 

In our Test vm with 16G of mirrored memory(normal zone) and 256 of normal
momory(Movable zone), the min share for normal zone is too few since the
size of min watermark is calc by zone dma/normal while this will be shared
by zones(include zone movable) based on managed pages.

Node 0, zone      DMA
        min      39
        low      743
        high     1447
Node 0, zone   Normal
        min      180
        low      3372
        high     6564
Node 1, zone  Movable
        min      3728
        low      69788
        high     135848
Mel Gorman Sept. 6, 2022, 12:22 p.m. UTC | #3
On Tue, Sep 06, 2022 at 06:12:23PM +0800, mawupeng wrote:
> > I think there is a misunderstanding why the higher zones have a watermark
> > and why it might be large.
> > 
> > It's not about a __GFP_HIGH or PF_MEMALLOC allocations because it's known
> > that few of those allocations may be movable. It's because high memory
> > allocations indirectly pin pages in lower zones. User-mapped memory allocated
> > from ZONE_MOVABLE still needs page table pages allocated from a lower zone
> > so there is a ratio between the size of ZONE_MOVABLE and lower zones
> > that limits the total amount of memory that can be allocated. Similarly,
> > file backed pages that may be allocated from ZONE_MOVABLE still requires
> > pages from lower memory for the inode and other associated kernel
> > objects that are allocated from lower zones.
> > 
> > The intent behind the higher zones having a large min watermark is so
> > that kswapd reclaims pages from there first to *potentially* release
> > pages from lower memory. By capping pages_min for zone_movable, there is
> > the potential for lower memory pressure to be higher and to reach a point
> > where a ZONE_MOVABLE page cannot be allocated simply because there isn't
> > enough low memory available. Once the lower zones are all unreclaimable
> > (e.g. page table pages or the movable pages are not been reclaimed to free
> > the associated kernel structures), the system goes OOM.
> 
> This i do agree with you, lower zone is actually "more important" than the
> higher one.
> 

Very often yes.

> But higher min watermark for zone movable will not work since no memory
> allocation can use this reserve memory below min. Memory allocation
> with specify watermark modifier(__GFP_ATOMIC ,__GFP_HIGH ...) can use this
> in slowpath, however the standard movable memory allocation
> (gfp flag: GFP_HIGHUSER_MOVABLE) does not contain this.
> 

Then a more appropriate solution may be to alter how the gap between min
and low is calculated. That gap determines when kswapd is active but
allocations are still allowed.

> Second, lowmem_reserve_ratio is used to "reserve" memory for lower zone.
> And the second patch introduce per zone watermark_scale_factor to boost
> normal/movable zone's watermark which can trigger early kswapd for zone
> movable.
> 

The problem with the tunable is that this patch introduces a potentially
seriously problem that must then be corrected by a system administrator and
it'll be non-obvious what the root of the problem is or the solution. For
some users, they will only be able to determine is that OOM triggers
when there is plenty of free memory or kswapd is consuming a lot more
CPU than expected. They will not necessarily be able to determine that
watermark_scale_factor is the solution.

> > 
> > It's possible that there are safe adjustments that could be made that
> > would detect when there is no choice except to reclaim zone reclaimable
> > but it would be tricky and it's not this patch. This patch changelog states
> > 
> > 	However zone movable will get its min share in
> > 	__setup_per_zone_wmarks() which does not make any sense.
> > 
> > It makes sense, higher zones allocations indirectly pin pages in lower
> > zones and there is a bias in reclaim to free the higher zone pages first
> > on the *possibility* that lower zone pages get indirectly released later.
> > 
> 
> In our Test vm with 16G of mirrored memory(normal zone) and 256 of normal
> momory(Movable zone), the min share for normal zone is too few since the
> size of min watermark is calc by zone dma/normal while this will be shared
> by zones(include zone movable) based on managed pages.
> 
> Node 0, zone      DMA
>         min      39
>         low      743
>         high     1447
> Node 0, zone   Normal
>         min      180
>         low      3372
>         high     6564
> Node 1, zone  Movable
>         min      3728
>         low      69788
>         high     135848

The gap between min and low is massive so either adjust how that gap is
calculated or to avoid side-effects for other users, consider special
casing the gap for ZONE_MOVABLE with a comment explaining why it is
treated differently. To mitigate the risk further, it could be further
special cased to only apply when there is a massive ratio between
ALL_ZONES_EXCEPT_MOVABLE:ZONE_MOVABLE. Document in the changelog the
potential downside of more lowmem potentially getting pinned by MOVABLE
allocations leading to excessive kswapd activity or premature OOM.
mawupeng Sept. 7, 2022, 8:42 a.m. UTC | #4
On 2022/9/6 20:22, Mel Gorman wrote:
> On Tue, Sep 06, 2022 at 06:12:23PM +0800, mawupeng wrote:
>>> I think there is a misunderstanding why the higher zones have a watermark
>>> and why it might be large.
>>>
>>> It's not about a __GFP_HIGH or PF_MEMALLOC allocations because it's known
>>> that few of those allocations may be movable. It's because high memory
>>> allocations indirectly pin pages in lower zones. User-mapped memory allocated
>>> from ZONE_MOVABLE still needs page table pages allocated from a lower zone
>>> so there is a ratio between the size of ZONE_MOVABLE and lower zones
>>> that limits the total amount of memory that can be allocated. Similarly,
>>> file backed pages that may be allocated from ZONE_MOVABLE still requires
>>> pages from lower memory for the inode and other associated kernel
>>> objects that are allocated from lower zones.
>>>
>>> The intent behind the higher zones having a large min watermark is so
>>> that kswapd reclaims pages from there first to *potentially* release
>>> pages from lower memory. By capping pages_min for zone_movable, there is
>>> the potential for lower memory pressure to be higher and to reach a point
>>> where a ZONE_MOVABLE page cannot be allocated simply because there isn't
>>> enough low memory available. Once the lower zones are all unreclaimable
>>> (e.g. page table pages or the movable pages are not been reclaimed to free
>>> the associated kernel structures), the system goes OOM.
>>
>> This i do agree with you, lower zone is actually "more important" than the
>> higher one.
>>
> 
> Very often yes.
> 
>> But higher min watermark for zone movable will not work since no memory
>> allocation can use this reserve memory below min. Memory allocation
>> with specify watermark modifier(__GFP_ATOMIC ,__GFP_HIGH ...) can use this
>> in slowpath, however the standard movable memory allocation
>> (gfp flag: GFP_HIGHUSER_MOVABLE) does not contain this.
>>
> 
> Then a more appropriate solution may be to alter how the gap between min
> and low is calculated. That gap determines when kswapd is active but
> allocations are still allowed.
> 
>> Second, lowmem_reserve_ratio is used to "reserve" memory for lower zone.
>> And the second patch introduce per zone watermark_scale_factor to boost
>> normal/movable zone's watermark which can trigger early kswapd for zone
>> movable.
>>
> 
> The problem with the tunable is that this patch introduces a potentially
> seriously problem that must then be corrected by a system administrator and
> it'll be non-obvious what the root of the problem is or the solution. For
> some users, they will only be able to determine is that OOM triggers
> when there is plenty of free memory or kswapd is consuming a lot more
> CPU than expected. They will not necessarily be able to determine that
> watermark_scale_factor is the solution.
> 
>>>
>>> It's possible that there are safe adjustments that could be made that
>>> would detect when there is no choice except to reclaim zone reclaimable
>>> but it would be tricky and it's not this patch. This patch changelog states
>>>
>>> 	However zone movable will get its min share in
>>> 	__setup_per_zone_wmarks() which does not make any sense.
>>>
>>> It makes sense, higher zones allocations indirectly pin pages in lower
>>> zones and there is a bias in reclaim to free the higher zone pages first
>>> on the *possibility* that lower zone pages get indirectly released later.
>>>
>>
>> In our Test vm with 16G of mirrored memory(normal zone) and 256 of normal
>> momory(Movable zone), the min share for normal zone is too few since the
>> size of min watermark is calc by zone dma/normal while this will be shared
>> by zones(include zone movable) based on managed pages.
>>
>> Node 0, zone      DMA
>>         min      39
>>         low      743
>>         high     1447
>> Node 0, zone   Normal
>>         min      180
>>         low      3372
>>         high     6564
>> Node 1, zone  Movable
>>         min      3728
>>         low      69788
>>         high     135848
> 
> The gap between min and low is massive so either adjust how that gap is
> calculated or to avoid side-effects for other users, consider special
> casing the gap for ZONE_MOVABLE with a comment explaining why it is
> treated differently. To mitigate the risk further, it could be further
> special cased to only apply when there is a massive ratio between
> ALL_ZONES_EXCEPT_MOVABLE:ZONE_MOVABLE. Document in the changelog the
> potential downside of more lowmem potentially getting pinned by MOVABLE
> allocations leading to excessive kswapd activity or premature OOM.
What I'm trying to say is that the min watermark is too low for zone normal
since it is shared by other zone based on manager pages.

        Vanilla          |         Modified         
Node 0, zone      DMA    | Node 0, zone      DMA    
        min      39      |         min      713     
        low      743     |         low      1417    
        high     1447    |         high     2121    
Node 0, zone   Normal    | Node 0, zone   Normal    
        min    **180**   |         min    **3234**    
        low      3372    |         low      6426    
        high     6564    |         high     9618    
Node 1, zone  Movable    | Node 1, zone  Movable    
        min    **3728**  |         min    **128**     
        low      69788   |         low      66188   
        high     135848  |         high     132248 

You can see, after this patch, the min watermark is set to small value(128) while zone
dma/normal's min watermark increase a lot which be useful if the system is low on memory.

The gap between min and low is about 1/1000 of the zone's memory which will not be
effected by this patch.
  
This patch, I am to do something for the min watermark for zone movable, In the next
patch I want to do something to reserve memory for zone normal or just make 
watermark_scale_factor more flexible for little normal zone and huge movable zone.

What is you idea on "Cap zone movable to small value"?


>
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..f1e4474879f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8636,9 +8636,9 @@  static void __setup_per_zone_wmarks(void)
 	struct zone *zone;
 	unsigned long flags;
 
-	/* Calculate total number of !ZONE_HIGHMEM pages */
+	/* Calculate total number of none highmem/movable pages */
 	for_each_zone(zone) {
-		if (!is_highmem(zone))
+		if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
 			lowmem_pages += zone_managed_pages(zone);
 	}
 
@@ -8648,15 +8648,15 @@  static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone_managed_pages(zone);
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone)) {
+		if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
-			 * need highmem pages, so cap pages_min to a small
-			 * value here.
+			 * need highmem/movable pages, so cap pages_min to a
+			 * small value here.
 			 *
 			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas control async page reclaim, and so should
-			 * not be capped for highmem.
+			 * not be capped for highmem/movable zone.
 			 */
 			unsigned long min_pages;