diff mbox series

[v4] page_alloc: consider highatomic reserve in watermark fast

Message ID 20200619235958.11283-1-jaewon31.kim@samsung.com (mailing list archive)
State New, archived
Headers show
Series [v4] page_alloc: consider highatomic reserve in watermark fast | expand

Commit Message

Jaewon Kim June 19, 2020, 11:59 p.m. UTC
zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
page_alloc: shortcut watermark checks for order-0 pages"). The commit
simply checks if free pages is bigger than watermark without additional
calculation such like reducing watermark.

It considered free cma pages but it did not consider highatomic
reserved. This may incur exhaustion of free pages except high order
atomic free pages.

Assume that reserved_highatomic pageblock is bigger than watermark min,
and there are only few free pages except high order atomic free. Because
zone_watermark_fast passes the allocation without considering high order
atomic free, normal reclaimable allocation like GFP_HIGHUSER will
consume all the free pages. Then finally order-0 atomic allocation may
fail on allocation.

This means watermark min is not protected against non-atomic allocation.
The order-0 atomic allocation with ALLOC_HARDER unwantedly can be
failed. Additionally the __GFP_MEMALLOC allocation with
ALLOC_NO_WATERMARKS also can be failed.

To avoid the problem, zone_watermark_fast should consider highatomic
reserve. If the actual size of high atomic free is counted accurately
like cma free, we may use it. On this patch just use
nr_reserved_highatomic. Additionally introduce
__zone_watermark_unusable_free to factor out common parts between
zone_watermark_fast and __zone_watermark_ok.

This is an example of ALLOC_HARDER allocation failure using v4.19 based
kernel.

 Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
 Call trace:
 [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
 [<ffffff8008223320>] warn_alloc+0xd8/0x12c
 [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
 [<ffffff800827f6e8>] new_slab+0x128/0x604
 [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
 [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
 [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
 [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
 [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
 [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
 [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
 [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
 Mem-Info:
 active_anon:102061 inactive_anon:81551 isolated_anon:0
  active_file:59102 inactive_file:68924 isolated_file:64
  unevictable:611 dirty:63 writeback:0 unstable:0
  slab_reclaimable:13324 slab_unreclaimable:44354
  mapped:83015 shmem:4858 pagetables:26316 bounce:0
  free:2727 free_pcp:1035 free_cma:178
 Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
 Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
 lowmem_reserve[]: 0 0
 Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
 138826 total pagecache pages
 5460 pages in swap cache
 Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142

This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14
based kernel.

 kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
 kswapd0 cpuset=/ mems_allowed=0
 CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
 Call trace:
 [<0000000000000000>] dump_backtrace+0x0/0x248
 [<0000000000000000>] show_stack+0x18/0x20
 [<0000000000000000>] __dump_stack+0x20/0x28
 [<0000000000000000>] dump_stack+0x68/0x90
 [<0000000000000000>] warn_alloc+0x104/0x198
 [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
 [<0000000000000000>] zs_malloc+0x148/0x3d0
 [<0000000000000000>] zram_bvec_rw+0x410/0x798
 [<0000000000000000>] zram_rw_page+0x88/0xdc
 [<0000000000000000>] bdev_write_page+0x70/0xbc
 [<0000000000000000>] __swap_writepage+0x58/0x37c
 [<0000000000000000>] swap_writepage+0x40/0x4c
 [<0000000000000000>] shrink_page_list+0xc30/0xf48
 [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
 [<0000000000000000>] shrink_node_memcg+0x23c/0x618
 [<0000000000000000>] shrink_node+0x1c8/0x304
 [<0000000000000000>] kswapd+0x680/0x7c4
 [<0000000000000000>] kthread+0x110/0x120
 [<0000000000000000>] ret_from_fork+0x10/0x18
 Mem-Info:
 active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a            slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0
 Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
 Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k           B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB        [ 4738.329607] lowmem_reserve[]: 0 0
 Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB

This is trace log which shows GFP_HIGHUSER consumes free pages right
before ALLOC_NO_WATERMARKS.

  <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE

Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
v4: change description only; typo and log
v3: change log in description to one having reserved_highatomic
    change comment in code
v2: factor out common part
v1: consider highatomic reserve
---
 mm/page_alloc.c | 66 +++++++++++++++++++++++++++----------------------
 1 file changed, 36 insertions(+), 30 deletions(-)

Comments

Baoquan He June 19, 2020, 12:42 p.m. UTC | #1
On 06/20/20 at 08:59am, Jaewon Kim wrote:
...

> kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
> 
> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Baoquan He <bhe@redhat.com>

> ---
> v4: change description only; typo and log
> v3: change log in description to one having reserved_highatomic
>     change comment in code
> v2: factor out common part
> v1: consider highatomic reserve
> ---
>  mm/page_alloc.c | 66 +++++++++++++++++++++++++++----------------------
>  1 file changed, 36 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..fe83f88ce188 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3487,6 +3487,29 @@ static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>  }
>  ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
>  
> +static inline long __zone_watermark_unusable_free(struct zone *z,
> +				unsigned int order, unsigned int alloc_flags)
> +{
> +	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
> +	long unusable_free = (1 << order) - 1;
> +
> +	/*
> +	 * If the caller does not have rights to ALLOC_HARDER then subtract
> +	 * the high-atomic reserves. This will over-estimate the size of the
> +	 * atomic reserve but it avoids a search.
> +	 */
> +	if (likely(!alloc_harder))
> +		unusable_free += z->nr_reserved_highatomic;
> +
> +#ifdef CONFIG_CMA
> +	/* If allocation can't use CMA areas don't use free CMA pages */
> +	if (!(alloc_flags & ALLOC_CMA))
> +		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
> +#endif
> +
> +	return unusable_free;
> +}
> +
>  /*
>   * Return true if free base pages are above 'mark'. For high-order checks it
>   * will return true of the order-0 watermark is reached and there is at least
> @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
>  
>  	/* free_pages may go negative - that's OK */
> -	free_pages -= (1 << order) - 1;
> +	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>  
>  	if (alloc_flags & ALLOC_HIGH)
>  		min -= min / 2;
>  
> -	/*
> -	 * If the caller does not have rights to ALLOC_HARDER then subtract
> -	 * the high-atomic reserves. This will over-estimate the size of the
> -	 * atomic reserve but it avoids a search.
> -	 */
> -	if (likely(!alloc_harder)) {
> -		free_pages -= z->nr_reserved_highatomic;
> -	} else {
> +	if (unlikely(alloc_harder)) {
>  		/*
>  		 * OOM victims can try even harder than normal ALLOC_HARDER
>  		 * users on the grounds that it's definitely going to be in
> @@ -3527,13 +3543,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  			min -= min / 4;
>  	}
>  
> -
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
> -
>  	/*
>  	 * Check watermarks for an order-0 allocation request. If these
>  	 * are not met, then a high-order request also cannot go ahead
> @@ -3582,25 +3591,22 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  				unsigned long mark, int highest_zoneidx,
>  				unsigned int alloc_flags)
>  {
> -	long free_pages = zone_page_state(z, NR_FREE_PAGES);
> -	long cma_pages = 0;
> +	long free_pages;
> +	long unusable_free;
>  
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
> +	free_pages = zone_page_state(z, NR_FREE_PAGES);
> +	unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags);
>  
>  	/*
>  	 * Fast check for order-0 only. If this fails then the reserves
> -	 * need to be calculated. There is a corner case where the check
> -	 * passes but only the high-order atomic reserve are free. If
> -	 * the caller is !atomic then it'll uselessly search the free
> -	 * list. That corner case is then slower but it is harmless.
> +	 * need to be calculated.
>  	 */
> -	if (!order && (free_pages - cma_pages) >
> -				mark + z->lowmem_reserve[highest_zoneidx])
> -		return true;
> +	if (!order) {
> +		long fast_free = free_pages - unusable_free;
> +
> +		if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
> +			return true;
> +	}
>  
>  	return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
>  					free_pages);
> -- 
> 2.17.1
> 
>
Mel Gorman June 22, 2020, 8:55 a.m. UTC | #2
On Sat, Jun 20, 2020 at 08:59:58AM +0900, Jaewon Kim wrote:
> zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
> page_alloc: shortcut watermark checks for order-0 pages"). The commit
> simply checks if free pages is bigger than watermark without additional
> calculation such like reducing watermark.
> 
> It considered free cma pages but it did not consider highatomic
> reserved. This may incur exhaustion of free pages except high order
> atomic free pages.
> 
> Assume that reserved_highatomic pageblock is bigger than watermark min,
> and there are only few free pages except high order atomic free. Because
> zone_watermark_fast passes the allocation without considering high order
> atomic free, normal reclaimable allocation like GFP_HIGHUSER will
> consume all the free pages. Then finally order-0 atomic allocation may
> fail on allocation.
> 
> This means watermark min is not protected against non-atomic allocation.
> The order-0 atomic allocation with ALLOC_HARDER unwantedly can be
> failed. Additionally the __GFP_MEMALLOC allocation with
> ALLOC_NO_WATERMARKS also can be failed.
> 
> To avoid the problem, zone_watermark_fast should consider highatomic
> reserve. If the actual size of high atomic free is counted accurately
> like cma free, we may use it. On this patch just use
> nr_reserved_highatomic. Additionally introduce
> __zone_watermark_unusable_free to factor out common parts between
> zone_watermark_fast and __zone_watermark_ok.
> 
> This is an example of ALLOC_HARDER allocation failure using v4.19 based
> kernel.
> 
>  Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
>  Call trace:
>  [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
>  [<ffffff8008223320>] warn_alloc+0xd8/0x12c
>  [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
>  [<ffffff800827f6e8>] new_slab+0x128/0x604
>  [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
>  [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
>  [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
>  [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
>  [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
>  [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
>  [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
>  [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
>  Mem-Info:
>  active_anon:102061 inactive_anon:81551 isolated_anon:0
>   active_file:59102 inactive_file:68924 isolated_file:64
>   unevictable:611 dirty:63 writeback:0 unstable:0
>   slab_reclaimable:13324 slab_unreclaimable:44354
>   mapped:83015 shmem:4858 pagetables:26316 bounce:0
>   free:2727 free_pcp:1035 free_cma:178
>  Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>  Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
>  lowmem_reserve[]: 0 0
>  Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
>  138826 total pagecache pages
>  5460 pages in swap cache
>  Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142
> 
> This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14
> based kernel.
> 
>  kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
>  kswapd0 cpuset=/ mems_allowed=0
>  CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
>  Call trace:
>  [<0000000000000000>] dump_backtrace+0x0/0x248
>  [<0000000000000000>] show_stack+0x18/0x20
>  [<0000000000000000>] __dump_stack+0x20/0x28
>  [<0000000000000000>] dump_stack+0x68/0x90
>  [<0000000000000000>] warn_alloc+0x104/0x198
>  [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
>  [<0000000000000000>] zs_malloc+0x148/0x3d0
>  [<0000000000000000>] zram_bvec_rw+0x410/0x798
>  [<0000000000000000>] zram_rw_page+0x88/0xdc
>  [<0000000000000000>] bdev_write_page+0x70/0xbc
>  [<0000000000000000>] __swap_writepage+0x58/0x37c
>  [<0000000000000000>] swap_writepage+0x40/0x4c
>  [<0000000000000000>] shrink_page_list+0xc30/0xf48
>  [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
>  [<0000000000000000>] shrink_node_memcg+0x23c/0x618
>  [<0000000000000000>] shrink_node+0x1c8/0x304
>  [<0000000000000000>] kswapd+0x680/0x7c4
>  [<0000000000000000>] kthread+0x110/0x120
>  [<0000000000000000>] ret_from_fork+0x10/0x18
>  Mem-Info:
>  active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a            slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0
>  Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>  Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k           B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB        [ 4738.329607] lowmem_reserve[]: 0 0
>  Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB
> 
> This is trace log which shows GFP_HIGHUSER consumes free pages right
> before ALLOC_NO_WATERMARKS.
> 
>   <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
> 
> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Michal Hocko June 22, 2020, 9:11 a.m. UTC | #3
On Sat 20-06-20 08:59:58, Jaewon Kim wrote:
[...]
> @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
>  
>  	/* free_pages may go negative - that's OK */
> -	free_pages -= (1 << order) - 1;
> +	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>  
>  	if (alloc_flags & ALLOC_HIGH)
>  		min -= min / 2;
>  
> -	/*
> -	 * If the caller does not have rights to ALLOC_HARDER then subtract
> -	 * the high-atomic reserves. This will over-estimate the size of the
> -	 * atomic reserve but it avoids a search.
> -	 */
> -	if (likely(!alloc_harder)) {
> -		free_pages -= z->nr_reserved_highatomic;
> -	} else {
> +	if (unlikely(alloc_harder)) {
>  		/*
>  		 * OOM victims can try even harder than normal ALLOC_HARDER
>  		 * users on the grounds that it's definitely going to be in
[...]
> @@ -3582,25 +3591,22 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  				unsigned long mark, int highest_zoneidx,
>  				unsigned int alloc_flags)
>  {
> -	long free_pages = zone_page_state(z, NR_FREE_PAGES);
> -	long cma_pages = 0;
> +	long free_pages;
> +	long unusable_free;
>  
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
> +	free_pages = zone_page_state(z, NR_FREE_PAGES);
> +	unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags);
>  
>  	/*
>  	 * Fast check for order-0 only. If this fails then the reserves
> -	 * need to be calculated. There is a corner case where the check
> -	 * passes but only the high-order atomic reserve are free. If
> -	 * the caller is !atomic then it'll uselessly search the free
> -	 * list. That corner case is then slower but it is harmless.
> +	 * need to be calculated.
>  	 */
> -	if (!order && (free_pages - cma_pages) >
> -				mark + z->lowmem_reserve[highest_zoneidx])
> -		return true;
> +	if (!order) {
> +		long fast_free = free_pages - unusable_free;
> +
> +		if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
> +			return true;
> +	}

There is no user of unusable_free for order > 0. With you current code
__zone_watermark_unusable_free would be called twice for high order
allocations unless compiler tries to be clever..

But more importantly, I have hard time to follow why we need both
zone_watermark_fast and zone_watermark_ok now. They should be
essentially the same for anything but order == 0. For order 0 the
only difference between the two is that zone_watermark_ok checks for
ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about
the former and why do we need it these days?

> 
>  	return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
>  					free_pages);
> -- 
> 2.17.1
>
Jaewon Kim June 22, 2020, 9:40 a.m. UTC | #4
>On Sat 20-06-20 08:59:58, Jaewon Kim wrote:
>[...]
>> @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>>          const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
>>  
>>          /* free_pages may go negative - that's OK */
>> -        free_pages -= (1 << order) - 1;
>> +        free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>>  
>>          if (alloc_flags & ALLOC_HIGH)
>>                  min -= min / 2;
>>  
>> -        /*
>> -         * If the caller does not have rights to ALLOC_HARDER then subtract
>> -         * the high-atomic reserves. This will over-estimate the size of the
>> -         * atomic reserve but it avoids a search.
>> -         */
>> -        if (likely(!alloc_harder)) {
>> -                free_pages -= z->nr_reserved_highatomic;
>> -        } else {
>> +        if (unlikely(alloc_harder)) {
>>                  /*
>>                   * OOM victims can try even harder than normal ALLOC_HARDER
>>                   * users on the grounds that it's definitely going to be in
>[...]
>> @@ -3582,25 +3591,22 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>>                                  unsigned long mark, int highest_zoneidx,
>>                                  unsigned int alloc_flags)
>>  {
>> -        long free_pages = zone_page_state(z, NR_FREE_PAGES);
>> -        long cma_pages = 0;
>> +        long free_pages;
>> +        long unusable_free;
>>  
>> -#ifdef CONFIG_CMA
>> -        /* If allocation can't use CMA areas don't use free CMA pages */
>> -        if (!(alloc_flags & ALLOC_CMA))
>> -                cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
>> -#endif
>> +        free_pages = zone_page_state(z, NR_FREE_PAGES);
>> +        unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags);
>>  
>>          /*
>>           * Fast check for order-0 only. If this fails then the reserves
>> -         * need to be calculated. There is a corner case where the check
>> -         * passes but only the high-order atomic reserve are free. If
>> -         * the caller is !atomic then it'll uselessly search the free
>> -         * list. That corner case is then slower but it is harmless.
>> +         * need to be calculated.
>>           */
>> -        if (!order && (free_pages - cma_pages) >
>> -                                mark + z->lowmem_reserve[highest_zoneidx])
>> -                return true;
>> +        if (!order) {
>> +                long fast_free = free_pages - unusable_free;
>> +
>> +                if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
>> +                        return true;
>> +        }
> 
>There is no user of unusable_free for order > 0. With you current code
>__zone_watermark_unusable_free would be called twice for high order
>allocations unless compiler tries to be clever..

Yes you're right.
Following code could be moved only for order-0.
 unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags);
Let me fix it at v5.

> 
>But more importantly, I have hard time to follow why we need both
>zone_watermark_fast and zone_watermark_ok now. They should be
>essentially the same for anything but order == 0. For order 0 the
>only difference between the two is that zone_watermark_ok checks for
>ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about
>the former and why do we need it these days?
> 

I think the author, Mel, may ansewr. But I think the wmark_fast may
fast by 1) not checking more condition about wmark and 2) using inline
rather than function. According to description on commit 48ee5f3696f6,
it seems to bring about 4% improvement.

>> 
>>          return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
>>                                          free_pages);
>> -- 
>> 2.17.1
>> 
> 
>-- 
>Michal Hocko
>SUSE Labs
>
Mel Gorman June 22, 2020, 10:04 a.m. UTC | #5
On Mon, Jun 22, 2020 at 06:40:20PM +0900, ????????? wrote:
> >But more importantly, I have hard time to follow why we need both
> >zone_watermark_fast and zone_watermark_ok now. They should be
> >essentially the same for anything but order == 0. For order 0 the
> >only difference between the two is that zone_watermark_ok checks for
> >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about
> >the former and why do we need it these days?
> > 
> 
> I think the author, Mel, may ansewr. But I think the wmark_fast may
> fast by 1) not checking more condition about wmark and 2) using inline
> rather than function. According to description on commit 48ee5f3696f6,
> it seems to bring about 4% improvement.
> 

The original intent was that watermark checks were expensive as some of the
calculations are only necessary when a zone is relatively low on memory
and the check does not always have to be 100% accurate. This is probably
still true given that __zone_watermark_ok() makes a number of calculations
depending on alloc flags even if a zone is almost completely free.
Michal Hocko June 22, 2020, 2:23 p.m. UTC | #6
On Mon 22-06-20 11:04:39, Mel Gorman wrote:
> On Mon, Jun 22, 2020 at 06:40:20PM +0900, ????????? wrote:
> > >But more importantly, I have hard time to follow why we need both
> > >zone_watermark_fast and zone_watermark_ok now. They should be
> > >essentially the same for anything but order == 0. For order 0 the
> > >only difference between the two is that zone_watermark_ok checks for
> > >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about
> > >the former and why do we need it these days?
> > > 
> > 
> > I think the author, Mel, may ansewr. But I think the wmark_fast may
> > fast by 1) not checking more condition about wmark and 2) using inline
> > rather than function. According to description on commit 48ee5f3696f6,
> > it seems to bring about 4% improvement.
> > 
> 
> The original intent was that watermark checks were expensive as some of the
> calculations are only necessary when a zone is relatively low on memory
> and the check does not always have to be 100% accurate. This is probably
> still true given that __zone_watermark_ok() makes a number of calculations
> depending on alloc flags even if a zone is almost completely free.

OK, so we are talking about 
	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;

	if (unlikely((alloc_flags & (ALLOC_HARDER|ALLOC_OOM))) {
		/*
		 * OOM victims can try even harder than normal ALLOC_HARDER
		 * users on the grounds that it's definitely going to be in
		 * the exit path shortly and free memory. Any allocation it
		 * makes during the free path will be small and short-lived.
		 */
		if (alloc_flags & ALLOC_OOM)
			min -= min / 2;
		else
			min -= min / 4;
	}

Is this something even measurable and something that would justify a
complex code? If we really want to keep it even after these changes
which are making the two closer in the cost then can we have it
documented at least?
Mel Gorman June 22, 2020, 4:25 p.m. UTC | #7
On Mon, Jun 22, 2020 at 04:23:04PM +0200, Michal Hocko wrote:
> On Mon 22-06-20 11:04:39, Mel Gorman wrote:
> > On Mon, Jun 22, 2020 at 06:40:20PM +0900, ????????? wrote:
> > > >But more importantly, I have hard time to follow why we need both
> > > >zone_watermark_fast and zone_watermark_ok now. They should be
> > > >essentially the same for anything but order == 0. For order 0 the
> > > >only difference between the two is that zone_watermark_ok checks for
> > > >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about
> > > >the former and why do we need it these days?
> > > > 
> > > 
> > > I think the author, Mel, may ansewr. But I think the wmark_fast may
> > > fast by 1) not checking more condition about wmark and 2) using inline
> > > rather than function. According to description on commit 48ee5f3696f6,
> > > it seems to bring about 4% improvement.
> > > 
> > 
> > The original intent was that watermark checks were expensive as some of the
> > calculations are only necessary when a zone is relatively low on memory
> > and the check does not always have to be 100% accurate. This is probably
> > still true given that __zone_watermark_ok() makes a number of calculations
> > depending on alloc flags even if a zone is almost completely free.
> 
> OK, so we are talking about 
> 	if (alloc_flags & ALLOC_HIGH)
> 		min -= min / 2;
> 
> 	if (unlikely((alloc_flags & (ALLOC_HARDER|ALLOC_OOM))) {
> 		/*
> 		 * OOM victims can try even harder than normal ALLOC_HARDER
> 		 * users on the grounds that it's definitely going to be in
> 		 * the exit path shortly and free memory. Any allocation it
> 		 * makes during the free path will be small and short-lived.
> 		 */
> 		if (alloc_flags & ALLOC_OOM)
> 			min -= min / 2;
> 		else
> 			min -= min / 4;
> 	}
> 
> Is this something even measurable and something that would justify a
> complex code? If we really want to keep it even after these changes
> which are making the two closer in the cost then can we have it
> documented at least?

It was originally documented as being roughly 4% for a page allocator
micro-benchmark but that was 4 years ago and I do not even remember what
type of machine that was on. Chances are the relative cost is different
now but I haven't measured it as the microbenchmark in question doesn't
even compile with recent kernels. For many allocations, the bulk of the
allocation cost is zeroing the page so I have no particular objection
to zone_watermark_fast being removed if it makes the code easier to
read. While I have not looked recently, the cost of allocation in general
and the increasing scope of the zone->lock with larger NUMA nodes for
high-order allocations like THP are more of a concern than two branches
and potentially two minor calculations.
Michal Hocko June 23, 2020, 7:11 a.m. UTC | #8
On Mon 22-06-20 17:25:01, Mel Gorman wrote:
> On Mon, Jun 22, 2020 at 04:23:04PM +0200, Michal Hocko wrote:
> > On Mon 22-06-20 11:04:39, Mel Gorman wrote:
> > > On Mon, Jun 22, 2020 at 06:40:20PM +0900, ????????? wrote:
> > > > >But more importantly, I have hard time to follow why we need both
> > > > >zone_watermark_fast and zone_watermark_ok now. They should be
> > > > >essentially the same for anything but order == 0. For order 0 the
> > > > >only difference between the two is that zone_watermark_ok checks for
> > > > >ALLOC_HIGH resp ALLOC_HARDER, ALLOC_OOM. So what is exactly fast about
> > > > >the former and why do we need it these days?
> > > > > 
> > > > 
> > > > I think the author, Mel, may ansewr. But I think the wmark_fast may
> > > > fast by 1) not checking more condition about wmark and 2) using inline
> > > > rather than function. According to description on commit 48ee5f3696f6,
> > > > it seems to bring about 4% improvement.
> > > > 
> > > 
> > > The original intent was that watermark checks were expensive as some of the
> > > calculations are only necessary when a zone is relatively low on memory
> > > and the check does not always have to be 100% accurate. This is probably
> > > still true given that __zone_watermark_ok() makes a number of calculations
> > > depending on alloc flags even if a zone is almost completely free.
> > 
> > OK, so we are talking about 
> > 	if (alloc_flags & ALLOC_HIGH)
> > 		min -= min / 2;
> > 
> > 	if (unlikely((alloc_flags & (ALLOC_HARDER|ALLOC_OOM))) {
> > 		/*
> > 		 * OOM victims can try even harder than normal ALLOC_HARDER
> > 		 * users on the grounds that it's definitely going to be in
> > 		 * the exit path shortly and free memory. Any allocation it
> > 		 * makes during the free path will be small and short-lived.
> > 		 */
> > 		if (alloc_flags & ALLOC_OOM)
> > 			min -= min / 2;
> > 		else
> > 			min -= min / 4;
> > 	}
> > 
> > Is this something even measurable and something that would justify a
> > complex code? If we really want to keep it even after these changes
> > which are making the two closer in the cost then can we have it
> > documented at least?
> 
> It was originally documented as being roughly 4% for a page allocator
> micro-benchmark but that was 4 years ago and I do not even remember what
> type of machine that was on. Chances are the relative cost is different
> now but I haven't measured it as the microbenchmark in question doesn't
> even compile with recent kernels.

Thanks for the clarification.

> For many allocations, the bulk of the
> allocation cost is zeroing the page so I have no particular objection
> to zone_watermark_fast being removed if it makes the code easier to
> read. While I have not looked recently, the cost of allocation in general
> and the increasing scope of the zone->lock with larger NUMA nodes for
> high-order allocations like THP are more of a concern than two branches
> and potentially two minor calculations.

OK, then I would rather go with the code simplification for the future
maintainability. If somebody can test this and provide good numbers then
we can reintroduce a fast check.

Thanks!
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..fe83f88ce188 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3487,6 +3487,29 @@  static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 }
 ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
 
+static inline long __zone_watermark_unusable_free(struct zone *z,
+				unsigned int order, unsigned int alloc_flags)
+{
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+	long unusable_free = (1 << order) - 1;
+
+	/*
+	 * If the caller does not have rights to ALLOC_HARDER then subtract
+	 * the high-atomic reserves. This will over-estimate the size of the
+	 * atomic reserve but it avoids a search.
+	 */
+	if (likely(!alloc_harder))
+		unusable_free += z->nr_reserved_highatomic;
+
+#ifdef CONFIG_CMA
+	/* If allocation can't use CMA areas don't use free CMA pages */
+	if (!(alloc_flags & ALLOC_CMA))
+		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
+#endif
+
+	return unusable_free;
+}
+
 /*
  * Return true if free base pages are above 'mark'. For high-order checks it
  * will return true of the order-0 watermark is reached and there is at least
@@ -3502,19 +3525,12 @@  bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
-	free_pages -= (1 << order) - 1;
+	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 
-	/*
-	 * If the caller does not have rights to ALLOC_HARDER then subtract
-	 * the high-atomic reserves. This will over-estimate the size of the
-	 * atomic reserve but it avoids a search.
-	 */
-	if (likely(!alloc_harder)) {
-		free_pages -= z->nr_reserved_highatomic;
-	} else {
+	if (unlikely(alloc_harder)) {
 		/*
 		 * OOM victims can try even harder than normal ALLOC_HARDER
 		 * users on the grounds that it's definitely going to be in
@@ -3527,13 +3543,6 @@  bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			min -= min / 4;
 	}
 
-
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
 	/*
 	 * Check watermarks for an order-0 allocation request. If these
 	 * are not met, then a high-order request also cannot go ahead
@@ -3582,25 +3591,22 @@  static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 				unsigned long mark, int highest_zoneidx,
 				unsigned int alloc_flags)
 {
-	long free_pages = zone_page_state(z, NR_FREE_PAGES);
-	long cma_pages = 0;
+	long free_pages;
+	long unusable_free;
 
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
+	free_pages = zone_page_state(z, NR_FREE_PAGES);
+	unusable_free = __zone_watermark_unusable_free(z, order, alloc_flags);
 
 	/*
 	 * Fast check for order-0 only. If this fails then the reserves
-	 * need to be calculated. There is a corner case where the check
-	 * passes but only the high-order atomic reserve are free. If
-	 * the caller is !atomic then it'll uselessly search the free
-	 * list. That corner case is then slower but it is harmless.
+	 * need to be calculated.
 	 */
-	if (!order && (free_pages - cma_pages) >
-				mark + z->lowmem_reserve[highest_zoneidx])
-		return true;
+	if (!order) {
+		long fast_free = free_pages - unusable_free;
+
+		if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
+			return true;
+	}
 
 	return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
 					free_pages);