
mm: include CMA pages in lowmem_reserve at boot

Message ID 1597290698-24266-1-git-send-email-opendmb@gmail.com (mailing list archive)
State New, archived
Series mm: include CMA pages in lowmem_reserve at boot

Commit Message

Doug Berger Aug. 13, 2020, 3:51 a.m. UTC
The lowmem_reserve arrays provide a means of applying pressure
against allocations from lower zones that were targeted at
higher zones. Its values are a function of the number of pages
managed by higher zones and are assigned by a call to the
setup_per_zone_lowmem_reserve() function.

The function is initially called at boot time by the function
init_per_zone_wmark_min() and may be called later by accesses
of the /proc/sys/vm/lowmem_reserve_ratio sysctl file.

The function init_per_zone_wmark_min() was moved up from a
module_init to a core_initcall to resolve a sequencing issue
with khugepaged. Unfortunately this created a sequencing issue
with CMA page accounting.

The CMA pages are added to the managed page count of a zone
when cma_init_reserved_areas() is called at boot also as a
core_initcall. This makes it uncertain whether the CMA pages
will be added to the managed page counts of their zones before
or after the call to init_per_zone_wmark_min() as it becomes
dependent on link order. With the current link order the pages
are added to the managed count after the lowmem_reserve arrays
are initialized at boot.

This means the lowmem_reserve values at boot may be lower than
the values used later if /proc/sys/vm/lowmem_reserve_ratio is
accessed even if the ratio values are unchanged.

In many cases the difference is not significant, but in others
it may have an effect.

This commit breaks the link order dependency by invoking
init_per_zone_wmark_min() as a postcore_initcall so that the
CMA pages have a chance to be properly accounted in their
zone(s), allowing the lowmem_reserve arrays to receive
consistent values.

Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
Signed-off-by: Doug Berger <opendmb@gmail.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Michal Hocko Aug. 13, 2020, 11:17 a.m. UTC | #1
On Wed 12-08-20 20:51:38, Doug Berger wrote:
> The lowmem_reserve arrays provide a means of applying pressure
> against allocations from lower zones that were targeted at
> higher zones. Its values are a function of the number of pages
> managed by higher zones and are assigned by a call to the
> setup_per_zone_lowmem_reserve() function.
> 
> The function is initially called at boot time by the function
> init_per_zone_wmark_min() and may be called later by accesses
> of the /proc/sys/vm/lowmem_reserve_ratio sysctl file.
> 
> The function init_per_zone_wmark_min() was moved up from a
> module_init to a core_initcall to resolve a sequencing issue
> with khugepaged. Unfortunately this created a sequencing issue
> with CMA page accounting.
> 
> The CMA pages are added to the managed page count of a zone
> when cma_init_reserved_areas() is called at boot also as a
> core_initcall. This makes it uncertain whether the CMA pages
> will be added to the managed page counts of their zones before
> or after the call to init_per_zone_wmark_min() as it becomes
> dependent on link order. With the current link order the pages
> are added to the managed count after the lowmem_reserve arrays
> are initialized at boot.
> 
> This means the lowmem_reserve values at boot may be lower than
> the values used later if /proc/sys/vm/lowmem_reserve_ratio is
> accessed even if the ratio values are unchanged.
> 
> In many cases the difference is not significant, but in others
> it may have an effect.

Could you be more specific please?

> This commit breaks the link order dependency by invoking
> init_per_zone_wmark_min() as a postcore_initcall so that the
> CMA pages have a chance to be properly accounted in their
> zone(s), allowing the lowmem_reserve arrays to receive
> consistent values.
> 
> Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
> Signed-off-by: Doug Berger <opendmb@gmail.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8b7d0ecf30b1..f3e340ec2b6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7887,7 +7887,7 @@ int __meminit init_per_zone_wmark_min(void)
>  
>  	return 0;
>  }
> -core_initcall(init_per_zone_wmark_min)
> +postcore_initcall(init_per_zone_wmark_min)
>  
>  /*
>   * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so
> -- 
> 2.7.4
>
Doug Berger Aug. 13, 2020, 5:55 p.m. UTC | #2
On 8/13/2020 4:17 AM, Michal Hocko wrote:
> On Wed 12-08-20 20:51:38, Doug Berger wrote:
>> The lowmem_reserve arrays provide a means of applying pressure
>> against allocations from lower zones that were targeted at
>> higher zones. Its values are a function of the number of pages
>> managed by higher zones and are assigned by a call to the
>> setup_per_zone_lowmem_reserve() function.
>>
>> The function is initially called at boot time by the function
>> init_per_zone_wmark_min() and may be called later by accesses
>> of the /proc/sys/vm/lowmem_reserve_ratio sysctl file.
>>
>> The function init_per_zone_wmark_min() was moved up from a
>> module_init to a core_initcall to resolve a sequencing issue
>> with khugepaged. Unfortunately this created a sequencing issue
>> with CMA page accounting.
>>
>> The CMA pages are added to the managed page count of a zone
>> when cma_init_reserved_areas() is called at boot also as a
>> core_initcall. This makes it uncertain whether the CMA pages
>> will be added to the managed page counts of their zones before
>> or after the call to init_per_zone_wmark_min() as it becomes
>> dependent on link order. With the current link order the pages
>> are added to the managed count after the lowmem_reserve arrays
>> are initialized at boot.
>>
>> This means the lowmem_reserve values at boot may be lower than
>> the values used later if /proc/sys/vm/lowmem_reserve_ratio is
>> accessed even if the ratio values are unchanged.
>>
>> In many cases the difference is not significant, but in others
>> it may have an effect.
> 
> Could you be more specific please?

One example might be a 1GB arm platform that defines a 256MB default CMA
region. The default zones might map as follows:
[    0.000000] cma: Reserved 256 MiB at 0x0000000030000000
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000000000-0x000000002fffffff]
[    0.000000]   Normal   empty
[    0.000000]   HighMem  [mem 0x0000000030000000-0x000000003fffffff]

At boot the memory info would be:
# echo m > /proc/sysrq-trigger
[   21.559673] sysrq: Show Memory
[   21.562758] Mem-Info:
[   21.565053] active_anon:9783 inactive_anon:770 isolated_anon:0
[   21.565053]  active_file:0 inactive_file:0 isolated_file:0
[   21.565053]  unevictable:0 dirty:0 writeback:0
[   21.565053]  slab_reclaimable:1827 slab_unreclaimable:1867
[   21.565053]  mapped:716 shmem:10363 pagetables:26 bounce:0
[   21.565053]  free:221995 free_pcp:444 free_cma:54917
[   21.596744] Node 0 active_anon:39132kB inactive_anon:3080kB
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB mapped:2864kB dirty:0kB writeback:0kB shmem:41452kB
writeback_tmp:0kB kernel_stack:472kB all_unreclaimable? no
[   21.619650] DMA free:668312kB min:3288kB low:4108kB high:4928kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
present:786432kB managed:690364kB mlocked:0kB pagetables:104kB
bounce:0kB free_pcp:1776kB local_pcp:324kB free_cma:0kB
[   21.646810] lowmem_reserve[]: 0 0 0 0
[   21.650498] HighMem free:219668kB min:128kB low:128kB high:128kB
reserved_highatomic:0KB active_anon:39132kB inactive_anon:3080kB
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
present:262144kB managed:262144kB mlocked:0kB pagetables:0kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:219668kB
[   21.678184] lowmem_reserve[]: 0 0 0 0
[   21.681866] DMA: 20*4kB (UME) 9*8kB (UME) 7*16kB (UME) 6*32kB (M)
7*64kB (ME) 4*128kB (M) 7*256kB (UM) 7*512kB (ME) 6*1024kB (M) 8*2048kB
(UM) 156*4096kB (M) = 668296kB
[   21.696970] HighMem: 1*4kB (C) 3*8kB (C) 3*16kB (C) 4*32kB (C) 1*64kB
(C) 0*128kB 1*256kB (C) 0*512kB 0*1024kB 1*2048kB (C) 53*4096kB (C) =
219660kB
[   21.710328] 10363 total pagecache pages
[   21.714188] 0 pages in swap cache
[   21.717518] Swap cache stats: add 0, delete 0, find 0/0
[   21.722761] Free swap  = 0kB
[   21.725655] Total swap = 0kB
[   21.728549] 262144 pages RAM
[   21.731443] 65536 pages HighMem/MovableOnly
[   21.735641] 24017 pages reserved
[   21.738882] 65536 pages cma reserved

Here you can see that the lowmem_reserve array for the DMA zone is all
0's. This is because the HighMem zone is consumed by the CMA region
whose pages haven't been activated to increase the zone managed count
when init_per_zone_wmark_min() is invoked at boot.

If we access the /proc/sys/vm/lowmem_reserve_ratio sysctl with:
# cat /proc/sys/vm/lowmem_reserve_ratio
256     32      0       0

That is sufficient to recalculate the lowmem_reserve arrays which now show:
# echo m > /proc/sysrq-trigger
[   38.848640] sysrq: Show Memory
[   38.851712] Mem-Info:
[   38.854004] active_anon:9783 inactive_anon:773 isolated_anon:0
[   38.854004]  active_file:0 inactive_file:0 isolated_file:0
[   38.854004]  unevictable:0 dirty:0 writeback:0
[   38.854004]  slab_reclaimable:1835 slab_unreclaimable:1867
[   38.854004]  mapped:716 shmem:10363 pagetables:26 bounce:0
[   38.854004]  free:221984 free_pcp:444 free_cma:54914
[   38.885698] Node 0 active_anon:39132kB inactive_anon:3092kB
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB mapped:2864kB dirty:0kB writeback:0kB shmem:41452kB
writeback_tmp:0kB kernel_stack:472kB all_unreclaimable? no
[   38.908605] DMA free:668280kB min:3288kB low:4108kB high:4928kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
present:786432kB managed:690364kB mlocked:0kB pagetables:104kB
bounce:0kB free_pcp:1776kB local_pcp:132kB free_cma:0kB
[   38.935765] lowmem_reserve[]: 0 0 256 0
[   38.939628] HighMem free:219656kB min:128kB low:128kB high:128kB
reserved_highatomic:0KB active_anon:39132kB inactive_anon:3092kB
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
present:262144kB managed:262144kB mlocked:0kB pagetables:0kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:219656kB
[   38.967310] lowmem_reserve[]: 0 0 0 0
[   38.970992] DMA: 20*4kB (UME) 9*8kB (UME) 6*16kB (UM) 6*32kB (M)
7*64kB (ME) 4*128kB (M) 7*256kB (UM) 7*512kB (ME) 6*1024kB (M) 8*2048kB
(UM) 156*4096kB (M) = 668280kB
[   38.986007] HighMem: 8*4kB (C) 5*8kB (C) 0*16kB 4*32kB (C) 1*64kB (C)
0*128kB 1*256kB (C) 0*512kB 0*1024kB 1*2048kB (C) 53*4096kB (C) = 219656kB
[   38.999016] 10363 total pagecache pages
[   39.002868] 0 pages in swap cache
[   39.006196] Swap cache stats: add 0, delete 0, find 0/0
[   39.011437] Free swap  = 0kB
[   39.014330] Total swap = 0kB
[   39.017223] 262144 pages RAM
[   39.020116] 65536 pages HighMem/MovableOnly
[   39.024314] 24017 pages reserved
[   39.027554] 65536 pages cma reserved

Here the lowmem_reserve back pressure applied by the DMA zone to
allocations that target the HighMem zone is now 256 pages. 1MB is still
not a lot of additional back pressure, but the watermarks on the HighMem
zone aren't very large either, so userspace allocations can easily start
consuming the DMA zone while kswapd tries to reclaim space in HighMem.
This excess pressure on DMA zone memory can potentially lead to earlier
triggering of the OOM killer and/or kernel fallback allocations into CMA
Movable pages, which can interfere with CMA's ability to obtain larger
contiguous allocations.

All of that said, my main concern is that I don't like the inconsistency
between the boot time and run time results.

Thank you for taking the time to review and consider this patch,
    Doug
Michal Hocko Aug. 14, 2020, 6:59 a.m. UTC | #3
On Thu 13-08-20 10:55:17, Doug Berger wrote:
[...]
> One example might be a 1GB arm platform that defines a 256MB default CMA
> region. The default zones might map as follows:
> [    0.000000] cma: Reserved 256 MiB at 0x0000000030000000
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x0000000000000000-0x000000002fffffff]
> [    0.000000]   Normal   empty
> [    0.000000]   HighMem  [mem 0x0000000030000000-0x000000003fffffff]
[...]
> 
> Here you can see that the lowmem_reserve array for the DMA zone is all
> 0's. This is because the HighMem zone is consumed by the CMA region
> whose pages haven't been activated to increase the zone managed count
> when init_per_zone_wmark_min() is invoked at boot.
> 
> If we access the /proc/sys/vm/lowmem_reserve_ratio sysctl with:
> # cat /proc/sys/vm/lowmem_reserve_ratio
> 256     32      0       0

Yes, this is really an unexpected behavior.
[...]
 
> Here the lowmem_reserve back pressure applied by the DMA zone to
> allocations that target the HighMem zone is now 256 pages. 1MB is still
> not a lot of additional back pressure, but the watermarks on the HighMem
> zone aren't very large either, so userspace allocations can easily start
> consuming the DMA zone while kswapd tries to reclaim space in HighMem.
> This excess pressure on DMA zone memory can potentially lead to earlier
> triggering of the OOM killer and/or kernel fallback allocations into CMA
> Movable pages, which can interfere with CMA's ability to obtain larger
> contiguous allocations.
> 
> All of that said, my main concern is that I don't like the inconsistency
> between the boot time and run time results.

Thanks for the clarification. I would suggest extending your changelog
with the following.

"
In many cases the difference is not significant, but for example an ARM
platform with 1GB of memory and the following memory layout
[    0.000000] cma: Reserved 256 MiB at 0x0000000030000000
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000000000-0x000000002fffffff]
[    0.000000]   Normal   empty
[    0.000000]   HighMem  [mem 0x0000000030000000-0x000000003fffffff]

would result in 0 lowmem_reserve for the DMA zone. This would allow
userspace to deplete the DMA zone easily. Funnily enough
$ cat /proc/sys/vm/lowmem_reserve_ratio
would fix up the situation because it forces setup_per_zone_lowmem_reserve
as a side effect.
"

With that feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b7d0ecf30b1..f3e340ec2b6b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7887,7 +7887,7 @@  int __meminit init_per_zone_wmark_min(void)
 
 	return 0;
 }
-core_initcall(init_per_zone_wmark_min)
+postcore_initcall(init_per_zone_wmark_min)
 
 /*
  * min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so