Message ID | 20230418191313.268131-4-hannes@cmpxchg.org (mailing list archive)
---|---
State | New
Series | mm: reliable huge page allocator
On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1.

Note that MAX_ORDER got redefined in -mm tree recently.

> Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.

This seems way too x86-specific. Other arches have larger THP sizes. I
believe 16M is common.

Maybe define it as min(MAX_ORDER, PMD_ORDER)?

> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/pageblock-flags.h | 4 ++--
>  mm/page_alloc.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 5f1ae07d724b..05b6811f8cee 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
>
>  #else /* CONFIG_HUGETLB_PAGE */
>
> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> -#define pageblock_order (MAX_ORDER-1)
> +/* Manage fragmentation at the 2M level */
> +#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))
>
>  #endif /* CONFIG_HUGETLB_PAGE */
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac03571e0532..5e04a69f6a26 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
>  /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
>  void __init set_pageblock_order(void)
>  {
> -	unsigned int order = MAX_ORDER - 1;
> +	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
>
>  	/* Check that pageblock_nr_pages has not already been setup */
>  	if (pageblock_order)
> --
> 2.39.2
On Wed, Apr 19, 2023 at 03:01:05AM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> > pageblock_order can be of various sizes, depending on configuration,
> > but the default is MAX_ORDER-1.
>
> Note that MAX_ORDER got redefined in -mm tree recently.
>
> > Given 4k pages, that comes out to
> > 4M. This is a large chunk for the allocator/reclaim/compaction to try
> > to keep grouped per migratetype. It's also unnecessary as the majority
> > of higher order allocations - THP and slab - are smaller than that.
>
> This seems way too x86-specific.

Hey, that's the machines I have access to ;)

> Other arches have larger THP sizes. I believe 16M is common.
>
> Maybe define it as min(MAX_ORDER, PMD_ORDER)?

Hm, let me play around with larger pageblocks.

The thing that gives me pause is that this seems quite aggressive as a
default block size for the allocator and reclaim/compaction - if you
consider the implications for internal fragmentation and the amount of
ongoing defragmentation work it would require.

IOW, it's not just a function of physical page size supported by the
CPU. It's also a function of overall memory capacity. Independent of
architecture, 2MB seems like a more reasonable step up than 16M.

16M is great for TLB coverage, and in our DCs we're getting a lot of
use out of 1G hugetlb pages as well. The question is if those archs
are willing to pay the cost of serving such page sizes quickly and
reliably during runtime; or if that's something better left to setups
with explicit preallocations and stuff like hugetlb_cma reservations.
On Tue, Apr 18, 2023 at 10:55:53PM -0400, Johannes Weiner wrote:
> On Wed, Apr 19, 2023 at 03:01:05AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> > > pageblock_order can be of various sizes, depending on configuration,
> > > but the default is MAX_ORDER-1.
> >
> > Note that MAX_ORDER got redefined in -mm tree recently.
> >
> > > Given 4k pages, that comes out to
> > > 4M. This is a large chunk for the allocator/reclaim/compaction to try
> > > to keep grouped per migratetype. It's also unnecessary as the majority
> > > of higher order allocations - THP and slab - are smaller than that.
> >
> > This seems way too x86-specific.
>
> > Other arches have larger THP sizes. I believe 16M is common.
> >
> > Maybe define it as min(MAX_ORDER, PMD_ORDER)?
>
> Hm, let me play around with larger pageblocks.
>
> The thing that gives me pause is that this seems quite aggressive as a
> default block size for the allocator and reclaim/compaction - if you
> consider the implications for internal fragmentation and the amount of
> ongoing defragmentation work it would require.
>
> IOW, it's not just a function of physical page size supported by the
> CPU. It's also a function of overall memory capacity. Independent of
> architecture, 2MB seems like a more reasonable step up than 16M.

[ Quick addition: on those other archs, these patches would still help
  with other, non-THP sources of compound allocations, such as slub,
  variable-order cache folios, and really any orders up to 2M. So it's
  not like we *have* to raise it to PMD_ORDER for them to benefit. ]
On 4/18/23 21:12, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.

Well in my experience the kernel usually has hugetlbfs config-enabled
so it uses 2MB pageblocks (on x86) even if hugetlbfs is unused at
runtime and THP is used instead. But sure, we can set a better default
that's not tied to hugetlbfs.

> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/pageblock-flags.h | 4 ++--
>  mm/page_alloc.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 5f1ae07d724b..05b6811f8cee 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
>
>  #else /* CONFIG_HUGETLB_PAGE */
>
> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> -#define pageblock_order (MAX_ORDER-1)
> +/* Manage fragmentation at the 2M level */
> +#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))
>
>  #endif /* CONFIG_HUGETLB_PAGE */
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac03571e0532..5e04a69f6a26 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
>  /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
>  void __init set_pageblock_order(void)
>  {
> -	unsigned int order = MAX_ORDER - 1;
> +	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
>
>  	/* Check that pageblock_nr_pages has not already been setup */
>  	if (pageblock_order)
On 19.04.23 12:36, Vlastimil Babka wrote:
> On 4/18/23 21:12, Johannes Weiner wrote:
>> pageblock_order can be of various sizes, depending on configuration,
>> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
>> 4M. This is a large chunk for the allocator/reclaim/compaction to try
>> to keep grouped per migratetype. It's also unnecessary as the majority
>> of higher order allocations - THP and slab - are smaller than that.
>
> Well in my experience the kernel usually has hugetlbfs config-enabled so it
> uses 2MB pageblocks (on x86) even if hugetlbfs is unused at runtime and THP
> is used instead. But sure, we can set a better default that's not tied to
> hugetlbfs.

As virtio-mem really wants small pageblocks (hot(un)plug granularity),
I've seen reports from users without HUGETLB configured complaining
about this (on x86, we'd get 4M instead of 2M).

So having a better default (PMD_SIZE) sounds like a good idea to me
(and I even recall suggesting to change the !hugetlb default).
On 19.04.23 02:01, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
>> pageblock_order can be of various sizes, depending on configuration,
>> but the default is MAX_ORDER-1.
>
> Note that MAX_ORDER got redefined in -mm tree recently.
>
>> Given 4k pages, that comes out to
>> 4M. This is a large chunk for the allocator/reclaim/compaction to try
>> to keep grouped per migratetype. It's also unnecessary as the majority
>> of higher order allocations - THP and slab - are smaller than that.
>
> This seems way too x86-specific. Other arches have larger THP sizes. I
> believe 16M is common.

arm64 with 64k pages has ... 512 MiB IIRC :/ It's the weird one.

> Maybe define it as min(MAX_ORDER, PMD_ORDER)?

Sounds good to me.
On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.
>
> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

This patch may be a distraction in the context of this series. I don't
feel particularly strongly about it but it has strong bikeshed
potential. For configurations that support huge pages of any sort, it
should be PMD_ORDER; for anything else the choice is arbitrary. 2M is
as good a guess as any, because even if it was tied to
PAGE_ALLOC_COSTLY_ORDER, the pageblock bitmap overhead might be
annoying.
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 5f1ae07d724b..05b6811f8cee 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE */
 
-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order (MAX_ORDER-1)
+/* Manage fragmentation at the 2M level */
+#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac03571e0532..5e04a69f6a26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
 void __init set_pageblock_order(void)
 {
-	unsigned int order = MAX_ORDER - 1;
+	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
 
 	/* Check that pageblock_nr_pages has not already been setup */
 	if (pageblock_order)
pageblock_order can be of various sizes, depending on configuration,
but the default is MAX_ORDER-1. Given 4k pages, that comes out to
4M. This is a large chunk for the allocator/reclaim/compaction to try
to keep grouped per migratetype. It's also unnecessary as the majority
of higher order allocations - THP and slab - are smaller than that.

Before subsequent patches increase the effort that goes into
maintaining migratetype isolation, it's important to first set the
defrag block size to what's likely to have common consumers.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/pageblock-flags.h | 4 ++--
 mm/page_alloc.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)