Message ID | 20230418191313.268131-4-hannes@cmpxchg.org (mailing list archive)
---|---
State | New
Series | mm: reliable huge page allocator
On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1.

Note that MAX_ORDER got redefined in -mm tree recently.

> Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.

This seems way too x86-specific. Other arches have larger THP sizes. I
believe 16M is common.

Maybe define it as min(MAX_ORDER, PMD_ORDER)?

> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/pageblock-flags.h | 4 ++--
>  mm/page_alloc.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 5f1ae07d724b..05b6811f8cee 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
>
>  #else /* CONFIG_HUGETLB_PAGE */
>
> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> -#define pageblock_order (MAX_ORDER-1)
> +/* Manage fragmentation at the 2M level */
> +#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))
>
>  #endif /* CONFIG_HUGETLB_PAGE */
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac03571e0532..5e04a69f6a26 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
>  /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
>  void __init set_pageblock_order(void)
>  {
> -	unsigned int order = MAX_ORDER - 1;
> +	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
>
>  	/* Check that pageblock_nr_pages has not already been setup */
>  	if (pageblock_order)
> --
> 2.39.2
On Wed, Apr 19, 2023 at 03:01:05AM +0300, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> > pageblock_order can be of various sizes, depending on configuration,
> > but the default is MAX_ORDER-1.
>
> Note that MAX_ORDER got redefined in -mm tree recently.
>
> > Given 4k pages, that comes out to
> > 4M. This is a large chunk for the allocator/reclaim/compaction to try
> > to keep grouped per migratetype. It's also unnecessary as the majority
> > of higher order allocations - THP and slab - are smaller than that.
>
> This seems way too x86-specific.

Hey, that's the machines I have access to ;)

> Other arches have larger THP sizes. I believe 16M is common.
>
> Maybe define it as min(MAX_ORDER, PMD_ORDER)?

Hm, let me play around with larger pageblocks.

The thing that gives me pause is that this seems quite aggressive as a
default block size for the allocator and reclaim/compaction - if you
consider the implications for internal fragmentation and the amount of
ongoing defragmentation work it would require.

IOW, it's not just a function of physical page size supported by the
CPU. It's also a function of overall memory capacity. Independent of
architecture, 2MB seems like a more reasonable step up than 16M.

16M is great for TLB coverage, and in our DCs we're getting a lot of
use out of 1G hugetlb pages as well. The question is if those archs
are willing to pay the cost of serving such page sizes quickly and
reliably during runtime; or if that's something better left to setups
with explicit preallocations and stuff like hugetlb_cma reservations.
On Tue, Apr 18, 2023 at 10:55:53PM -0400, Johannes Weiner wrote:
> On Wed, Apr 19, 2023 at 03:01:05AM +0300, Kirill A. Shutemov wrote:
> > On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> > > pageblock_order can be of various sizes, depending on configuration,
> > > but the default is MAX_ORDER-1.
> >
> > Note that MAX_ORDER got redefined in -mm tree recently.
> >
> > > Given 4k pages, that comes out to
> > > 4M. This is a large chunk for the allocator/reclaim/compaction to try
> > > to keep grouped per migratetype. It's also unnecessary as the majority
> > > of higher order allocations - THP and slab - are smaller than that.
> >
> > This seems way too x86-specific.
>
> > Other arches have larger THP sizes. I believe 16M is common.
> >
> > Maybe define it as min(MAX_ORDER, PMD_ORDER)?
>
> Hm, let me play around with larger pageblocks.
>
> The thing that gives me pause is that this seems quite aggressive as a
> default block size for the allocator and reclaim/compaction - if you
> consider the implications for internal fragmentation and the amount of
> ongoing defragmentation work it would require.
>
> IOW, it's not just a function of physical page size supported by the
> CPU. It's also a function of overall memory capacity. Independent of
> architecture, 2MB seems like a more reasonable step up than 16M.

[ Quick addition: on those other archs, these patches would still help
  with other, non-THP sources of compound allocations, such as slub,
  variable-order cache folios, and really any orders up to 2M. So it's
  not like we *have* to raise it to PMD_ORDER for them to benefit. ]
On 4/18/23 21:12, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.

Well in my experience the kernel usually has hugetlbfs config-enabled
so it uses 2MB pageblocks (on x86) even if hugetlbfs is unused at
runtime and THP is used instead. But sure, we can set a better default
that's not tied to hugetlbfs.

> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/pageblock-flags.h | 4 ++--
>  mm/page_alloc.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 5f1ae07d724b..05b6811f8cee 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
>
>  #else /* CONFIG_HUGETLB_PAGE */
>
> -/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
> -#define pageblock_order (MAX_ORDER-1)
> +/* Manage fragmentation at the 2M level */
> +#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))
>
>  #endif /* CONFIG_HUGETLB_PAGE */
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac03571e0532..5e04a69f6a26 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
>  /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
>  void __init set_pageblock_order(void)
>  {
> -	unsigned int order = MAX_ORDER - 1;
> +	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
>
>  	/* Check that pageblock_nr_pages has not already been setup */
>  	if (pageblock_order)
On 19.04.23 12:36, Vlastimil Babka wrote:
> On 4/18/23 21:12, Johannes Weiner wrote:
>> pageblock_order can be of various sizes, depending on configuration,
>> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
>> 4M. This is a large chunk for the allocator/reclaim/compaction to try
>> to keep grouped per migratetype. It's also unnecessary as the majority
>> of higher order allocations - THP and slab - are smaller than that.
>
> Well in my experience the kernel usually has hugetlbfs config-enabled so it
> uses 2MB pageblocks (on x86) even if hugetlbfs is unused at runtime and THP
> is used instead. But sure, we can set a better default that's not tied to
> hugetlbfs.

As virtio-mem really wants small pageblocks (hot(un)plug granularity),
I've seen reports from users without HUGETLB configured complaining
about this (on x86, we'd get 4M instead of 2M).

So having a better default (PMD_SIZE) sounds like a good idea to me
(and I even recall suggesting to change the !hugetlb default).
On 19.04.23 02:01, Kirill A. Shutemov wrote:
> On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
>> pageblock_order can be of various sizes, depending on configuration,
>> but the default is MAX_ORDER-1.
>
> Note that MAX_ORDER got redefined in -mm tree recently.
>
>> Given 4k pages, that comes out to
>> 4M. This is a large chunk for the allocator/reclaim/compaction to try
>> to keep grouped per migratetype. It's also unnecessary as the majority
>> of higher order allocations - THP and slab - are smaller than that.
>
> This seems way too x86-specific. Other arches have larger THP sizes. I
> believe 16M is common.

arm64 with 64k pages has ... 512 MiB IIRC :/ It's the weird one.

> Maybe define it as min(MAX_ORDER, PMD_ORDER)?

Sounds good to me.
On Tue, Apr 18, 2023 at 03:12:50PM -0400, Johannes Weiner wrote:
> pageblock_order can be of various sizes, depending on configuration,
> but the default is MAX_ORDER-1. Given 4k pages, that comes out to
> 4M. This is a large chunk for the allocator/reclaim/compaction to try
> to keep grouped per migratetype. It's also unnecessary as the majority
> of higher order allocations - THP and slab - are smaller than that.
>
> Before subsequent patches increase the effort that goes into
> maintaining migratetype isolation, it's important to first set the
> defrag block size to what's likely to have common consumers.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

This patch may be a distraction in the context of this series. I don't
feel particularly strongly about it but it has strong bikeshed
potential. For configurations that support huge pages of any sort, it
should be PMD_ORDER; for anything else the choice is arbitrary. 2M is
as good a guess as any, because even if it was tied to
PAGE_ALLOC_COSTLY_ORDER, the pageblock bitmap overhead might be
annoying.
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 5f1ae07d724b..05b6811f8cee 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -47,8 +47,8 @@ extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE */
 
-/* If huge pages are not used, group by MAX_ORDER_NR_PAGES */
-#define pageblock_order (MAX_ORDER-1)
+/* Manage fragmentation at the 2M level */
+#define pageblock_order ilog2(2U << (20 - PAGE_SHIFT))
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac03571e0532..5e04a69f6a26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7634,7 +7634,7 @@ static inline void setup_usemap(struct zone *zone) {}
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
 void __init set_pageblock_order(void)
 {
-	unsigned int order = MAX_ORDER - 1;
+	unsigned int order = ilog2(2U << (20 - PAGE_SHIFT));
 
 	/* Check that pageblock_nr_pages has not already been setup */
 	if (pageblock_order)
pageblock_order can be of various sizes, depending on configuration,
but the default is MAX_ORDER-1. Given 4k pages, that comes out to
4M. This is a large chunk for the allocator/reclaim/compaction to try
to keep grouped per migratetype. It's also unnecessary as the majority
of higher order allocations - THP and slab - are smaller than that.

Before subsequent patches increase the effort that goes into
maintaining migratetype isolation, it's important to first set the
defrag block size to what's likely to have common consumers.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/pageblock-flags.h | 4 ++--
 mm/page_alloc.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)