Message ID | 20240229183436.4110845-2-yuzhao@google.com (mailing list archive)
---|---
State | New
Series | [Chapter One] THP zones: the use cases of policy zones
On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
> higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
> unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> contiguous PTEs on arm64 [1], which are more suitable for client
> workloads.

Can this mechanism be used to fully replace the hugetlb pool approach?
That would be a major selling point. It kind of feels like it should,
but I am insufficiently expert to be certain.

I'll read over the patches sometime soon. There's a lot to go through.
Something I didn't see in the cover letter or commit messages was any
discussion of page->flags and how many bits we use for ZONE
(particularly on 32-bit). Perhaps I'll discover the answer to that as
I read.
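For reference, the number of page->flags bits spent on the zone index is
derived from MAX_NR_ZONES at build time, roughly as below (a paraphrase of
the cascade in include/linux/page-flags-layout.h; treat the exact
thresholds as an assumption to double-check against the tree):

    #if MAX_NR_ZONES < 2
    #define ZONES_SHIFT 0
    #elif MAX_NR_ZONES <= 2
    #define ZONES_SHIFT 1
    #elif MAX_NR_ZONES <= 4
    #define ZONES_SHIFT 2
    #elif MAX_NR_ZONES <= 8
    #define ZONES_SHIFT 3
    #else
    #error ZONES_SHIFT "Too many zones configured"
    #endif

If that still holds, a 64-bit config with ZONE_DEVICE already needs 3 bits
and stays at 3 with the two new zones (at most 8 in total), while a 32-bit
config that previously fit into 4 zones (e.g. DMA + Normal + HighMem +
Movable) would grow from 2 to 3 bits once ZONE_NOSPLIT and ZONE_NOMERGE are
compiled in.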
On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
>
> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
> memory.
> 2. The device zone provides interoperability between CPU and device
> memory.
> 3. The movable zone commonly represents a memory allocation policy.
>
> Though originally designed for memory hot removal, the movable zone is
> instead widely used for other purposes, e.g., CMA and kdump kernel, on
> platforms that do not support hot removal, e.g., Android and ChromeOS.
> Nowadays, it is legitimately a zone independent of any physical
> characteristics. In spite of being somewhat regarded as a hack,
> largely due to the lack of a generic design concept for its true major
> use cases (on billions of client devices), the movable zone naturally
> resembles a policy (virtual) zone overlayed on the first four
> (physical) zones.
>
> This proposal formally generalizes this concept as policy zones so
> that additional policies can be implemented and enforced by subsequent
> zones after the movable zone. An inherited requirement of policy zones
> (and the first four zones) is that subsequent zones must be able to
> fall back to previous zones and therefore must add new properties to
> the previous zones rather than remove existing ones from them. Also,
> all properties must be known at the allocation time, rather than the
> runtime, e.g., memory object size and mobility are valid properties
> but hotness and lifetime are not.
>
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> ZONE_MOVABLE) and restricted to a minimum order to be
> anti-fragmentation. The latter means that they cannot be split down
> below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> to an exact order. The latter means that not only is split
> prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> reason in Chapter Three), while they are free or in use.
>
> Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> compaction is not needed for these two zones.
>
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
> higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
> unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> contiguous PTEs on arm64 [1], which are more suitable for client
> workloads.

I think the allocation fallback policy needs to be elaborated. IIUC,
when allocating large folios, if the order > min order of the policy
zones, the fallback policy should be ZONE_NOSPLIT/NOMERGE ->
ZONE_MOVABLE -> ZONE_NORMAL, right?

If all other zones are depleted, an allocation whose order is < the min
order won't fall back to the policy zones and will fail, just like a
non-movable allocation can't fall back to ZONE_MOVABLE even though there
is enough memory in that zone, right?

> Policy zones can be dynamically resized by offlining pages in one of
> them and onlining those pages in another of them. Note that this is
> only done among policy zones, not between a policy zone and a physical
> zone, since resizing is a (software) policy, not a physical
> characteristic.
> > Implementing the same idea in the pageblock granularity has also been > explored but rejected at Google. Pageblocks have a finer granularity > and therefore can be more flexible than zones. The tradeoff is that > this alternative implementation was more complex and failed to bring a > better ROI. However, the rejection was mainly due to its inability to > be smoothly extended to 1GB THPs [2], which is a planned use case of > TAO. > > [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/ > [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ > > Signed-off-by: Yu Zhao <yuzhao@google.com> > --- > .../admin-guide/kernel-parameters.txt | 10 + > drivers/virtio/virtio_mem.c | 2 +- > include/linux/gfp.h | 24 +- > include/linux/huge_mm.h | 6 - > include/linux/mempolicy.h | 2 +- > include/linux/mmzone.h | 52 +- > include/linux/nodemask.h | 2 +- > include/linux/vm_event_item.h | 2 +- > include/trace/events/mmflags.h | 4 +- > mm/compaction.c | 12 + > mm/huge_memory.c | 5 +- > mm/mempolicy.c | 14 +- > mm/migrate.c | 7 +- > mm/mm_init.c | 452 ++++++++++-------- > mm/page_alloc.c | 44 +- > mm/page_isolation.c | 2 +- > mm/swap_slots.c | 3 +- > mm/vmscan.c | 32 +- > mm/vmstat.c | 7 +- > 19 files changed, 431 insertions(+), 251 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 31b3a25680d0..a6c181f6efde 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -3529,6 +3529,16 @@ > allocations which rules out almost all kernel > allocations. Use with caution! > > + nosplit=X,Y [MM] Set the minimum order of the nosplit zone. Pages in > + this zone can't be split down below order Y, while free > + or in use. > + Like movablecore, X should be either nn[KMGTPE] or n%. > + > + nomerge=X,Y [MM] Set the exact orders of the nomerge zone. Pages in > + this zone are always order Y, meaning they can't be > + split or merged while free or in use. > + Like movablecore, X should be either nn[KMGTPE] or n%. > + > MTD_Partition= [MTD] > Format: <name>,<region-number>,<size>,<offset> > > diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c > index 8e3223294442..37ecf5ee4afd 100644 > --- a/drivers/virtio/virtio_mem.c > +++ b/drivers/virtio/virtio_mem.c > @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct virtio_mem *vm, > page = pfn_to_online_page(pfn); > if (!page) > continue; > - if (page_zonenum(page) != ZONE_MOVABLE) > + if (!is_zone_movable_page(page)) > return false; > } > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index de292a007138..c0f9d21b4d18 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) > * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. 
> */ > > -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 > -/* ZONE_DEVICE is not a valid GFP zone specifier */ > +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4 > +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ > #define GFP_ZONES_SHIFT 2 > #else > #define GFP_ZONES_SHIFT ZONES_SHIFT > @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) > z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & > ((1 << GFP_ZONES_SHIFT) - 1); > VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); > + > + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP)) > + return LAST_VIRT_ZONE; > + > return z; > } > > +extern int zone_nomerge_order __read_mostly; > +extern int zone_nosplit_order __read_mostly; > + > +static inline enum zone_type gfp_order_zone(gfp_t flags, int order) > +{ > + enum zone_type zid = gfp_zone(flags); > + > + if (zid >= ZONE_NOMERGE && order != zone_nomerge_order) > + zid = ZONE_NOMERGE - 1; > + > + if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order) > + zid = ZONE_NOSPLIT - 1; > + > + return zid; > +} > + > /* > * There is only one page-allocator function, and two main namespaces to > * it. The alloc_page*() variants return 'struct page *' and as such > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 5adb86af35fc..9960ad7c3b10 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, > unsigned long len, unsigned long pgoff, unsigned long flags); > > void folio_prep_large_rmappable(struct folio *folio); > -bool can_split_folio(struct folio *folio, int *pextra_pins); > int split_huge_page_to_list(struct page *page, struct list_head *list); > static inline int split_huge_page(struct page *page) > { > @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct folio *folio) {} > > #define thp_get_unmapped_area NULL > > -static inline bool > -can_split_folio(struct folio *folio, int *pextra_pins) > -{ > - return false; > -} > static inline int > split_huge_page_to_list(struct page *page, struct list_head *list) > { > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 931b118336f4..a92bcf47cf8c 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -150,7 +150,7 @@ extern enum zone_type policy_zone; > > static inline void check_highest_zone(enum zone_type k) > { > - if (k > policy_zone && k != ZONE_MOVABLE) > + if (k > policy_zone && !zid_is_virt(k)) > policy_zone = k; > } > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index a497f189d988..532218167bba 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -805,11 +805,15 @@ enum zone_type { > * there can be false negatives). 
> */ > ZONE_MOVABLE, > + ZONE_NOSPLIT, > + ZONE_NOMERGE, > #ifdef CONFIG_ZONE_DEVICE > ZONE_DEVICE, > #endif > - __MAX_NR_ZONES > + __MAX_NR_ZONES, > > + LAST_PHYS_ZONE = ZONE_MOVABLE - 1, > + LAST_VIRT_ZONE = ZONE_NOMERGE, > }; > > #ifndef __GENERATING_BOUNDS_H > @@ -929,6 +933,8 @@ struct zone { > seqlock_t span_seqlock; > #endif > > + int order; > + > int initialized; > > /* Write-intensive fields used from the page allocator */ > @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const struct folio *folio) > > static inline bool is_zone_movable_page(const struct page *page) > { > - return page_zonenum(page) == ZONE_MOVABLE; > + return page_zonenum(page) >= ZONE_MOVABLE; > } > > static inline bool folio_is_zone_movable(const struct folio *folio) > { > - return folio_zonenum(folio) == ZONE_MOVABLE; > + return folio_zonenum(folio) >= ZONE_MOVABLE; > +} > + > +static inline bool page_can_split(struct page *page) > +{ > + return page_zonenum(page) < ZONE_NOSPLIT; > +} > + > +static inline bool folio_can_split(struct folio *folio) > +{ > + return folio_zonenum(folio) < ZONE_NOSPLIT; > } > #endif > > @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) { return node_id; }; > */ > #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > > +static inline bool zid_is_virt(enum zone_type zid) > +{ > + return zid > LAST_PHYS_ZONE && zid <= LAST_VIRT_ZONE; > +} > + > +static inline bool zone_can_frag(struct zone *zone) > +{ > + VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT); > + > + return zone_idx(zone) < ZONE_NOSPLIT; > +} > + > +static inline bool zone_is_suitable(struct zone *zone, int order) > +{ > + int zid = zone_idx(zone); > + > + if (zid < ZONE_NOSPLIT) > + return true; > + > + if (!zone->order) > + return false; > + > + return (zid == ZONE_NOSPLIT && order >= zone->order) || > + (zid == ZONE_NOMERGE && order == zone->order); > +} > + > #ifdef CONFIG_ZONE_DEVICE > static inline bool zone_is_zone_device(struct zone *zone) > { > @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone) > static inline void zone_set_nid(struct zone *zone, int nid) {} > #endif > > -extern int movable_zone; > +extern int virt_zone; > > static inline int is_highmem_idx(enum zone_type idx) > { > #ifdef CONFIG_HIGHMEM > return (idx == ZONE_HIGHMEM || > - (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM)); > + (zid_is_virt(idx) && virt_zone == ZONE_HIGHMEM)); > #else > return 0; > #endif > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h > index b61438313a73..34fbe910576d 100644 > --- a/include/linux/nodemask.h > +++ b/include/linux/nodemask.h > @@ -404,7 +404,7 @@ enum node_states { > #else > N_HIGH_MEMORY = N_NORMAL_MEMORY, > #endif > - N_MEMORY, /* The node has memory(regular, high, movable) */ > + N_MEMORY, /* The node has memory in any of the zones */ > N_CPU, /* The node has one or more cpus */ > N_GENERIC_INITIATOR, /* The node has one or more Generic Initiators */ > NR_NODE_STATES > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 747943bc8cc2..9a54d15d5ec3 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -27,7 +27,7 @@ > #endif > > #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \ > - HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx) > + HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE_ZONE(xx) > > enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > FOR_ALL_ZONES(PGALLOC) > diff --git 
a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h > index d801409b33cf..2b5fdafaadea 100644 > --- a/include/trace/events/mmflags.h > +++ b/include/trace/events/mmflags.h > @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \ > IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \ > EM (ZONE_NORMAL, "Normal") \ > IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \ > - EMe(ZONE_MOVABLE,"Movable") > + EM (ZONE_MOVABLE,"Movable") \ > + EM (ZONE_NOSPLIT,"NoSplit") \ > + EMe(ZONE_NOMERGE,"NoMerge") > > #define LRU_NAMES \ > EM (LRU_INACTIVE_ANON, "inactive_anon") \ > diff --git a/mm/compaction.c b/mm/compaction.c > index 4add68d40e8d..8a64c805f411 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, > ac->highest_zoneidx, ac->nodemask) { > enum compact_result status; > > + if (!zone_can_frag(zone)) > + continue; > + > if (prio > MIN_COMPACT_PRIORITY > && compaction_deferred(zone, order)) { > rc = max_t(enum compact_result, COMPACT_DEFERRED, rc); > @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > cc.zone = zone; > > compact_zone(&cc, NULL); > @@ -2846,6 +2852,9 @@ static void compact_node(int nid) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > cc.zone = zone; > > compact_zone(&cc, NULL); > @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > ret = compaction_suit_allocation_order(zone, > pgdat->kcompactd_max_order, > highest_zoneidx, ALLOC_WMARK_MIN); > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 94c958f7ebb5..b57faa0a1e83 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, > } > > /* Racy check whether the huge page can be split */ > -bool can_split_folio(struct folio *folio, int *pextra_pins) > +static bool can_split_folio(struct folio *folio, int *pextra_pins) > { > int extra_pins; > > + if (!folio_can_split(folio)) > + return false; > + > /* Additional pins from page cache */ > if (folio_test_anon(folio)) > extra_pins = folio_test_swapcache(folio) ? > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 10a590ee1c89..1f84dd759086 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma) > > bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > { > - enum zone_type dynamic_policy_zone = policy_zone; > - > - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); > + WARN_ON_ONCE(zid_is_virt(policy_zone)); > > /* > - * if policy->nodes has movable memory only, > - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. > + * If policy->nodes has memory in virtual zones only, we apply policy > + * only if gfp_zone(gfp) can allocate from those zones. > * > * policy->nodes is intersect with node_states[N_MEMORY]. > * so if the following test fails, it implies > - * policy->nodes has movable memory only. > + * policy->nodes has memory in virtual zones only. 
> */ > if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) > - dynamic_policy_zone = ZONE_MOVABLE; > + return zone > LAST_PHYS_ZONE; > > - return zone >= dynamic_policy_zone; > + return zone >= policy_zone; > } > > /* Do dynamic interleaving for a process */ > diff --git a/mm/migrate.c b/mm/migrate.c > index cc9f2bcd73b4..f615c0c22046 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f > { > int rc; > > + if (!folio_can_split(folio)) > + return -EBUSY; > + > folio_lock(folio); > rc = split_folio_to_list(folio, split_folios); > folio_unlock(folio); > @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private) > order = folio_order(src); > } > zidx = zone_idx(folio_zone(src)); > - if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE) > + if (zidx > ZONE_NORMAL) > gfp_mask |= __GFP_HIGHMEM; > > return __folio_alloc(gfp_mask, order, nid, mtc->nmask); > @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio) > break; > } > wakeup_kswapd(pgdat->node_zones + z, 0, > - folio_order(folio), ZONE_MOVABLE); > + folio_order(folio), z); > return 0; > } > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 2c19f5515e36..7769c21e6d54 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init); > > static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata; > static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata; > -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata; > > -static unsigned long required_kernelcore __initdata; > -static unsigned long required_kernelcore_percent __initdata; > -static unsigned long required_movablecore __initdata; > -static unsigned long required_movablecore_percent __initdata; > +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUMNODES] __initdata; > +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid]) > + > +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata; > +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE]) > + > +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata; > +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE]) > + > +int zone_nosplit_order __read_mostly; > +int zone_nomerge_order __read_mostly; > > static unsigned long nr_kernel_pages __initdata; > static unsigned long nr_all_pages __initdata; > @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p) > return 0; > } > > - return cmdline_parse_core(p, &required_kernelcore, > - &required_kernelcore_percent); > + return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE), > + &percentage_of(LAST_PHYS_ZONE)); > } > early_param("kernelcore", cmdline_parse_kernelcore); > > @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore); > */ > static int __init cmdline_parse_movablecore(char *p) > { > - return cmdline_parse_core(p, &required_movablecore, > - &required_movablecore_percent); > + return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE), > + &percentage_of(ZONE_MOVABLE)); > } > early_param("movablecore", cmdline_parse_movablecore); > > +static int __init parse_zone_order(char *p, unsigned long *nr_pages, > + unsigned long *percent, int *order) > +{ > + int err; > + unsigned long n; > + char *s = strchr(p, ','); > + > + if (!s) > + return -EINVAL; > + > + *s++ = 
'\0'; > + > + err = kstrtoul(s, 0, &n); > + if (err) > + return err; > + > + if (n < 2 || n > MAX_PAGE_ORDER) > + return -EINVAL; > + > + err = cmdline_parse_core(p, nr_pages, percent); > + if (err) > + return err; > + > + *order = n; > + > + return 0; > +} > + > +static int __init parse_zone_nosplit(char *p) > +{ > + return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT), > + &percentage_of(ZONE_NOSPLIT), &zone_nosplit_order); > +} > +early_param("nosplit", parse_zone_nosplit); > + > +static int __init parse_zone_nomerge(char *p) > +{ > + return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE), > + &percentage_of(ZONE_NOMERGE), &zone_nomerge_order); > +} > +early_param("nomerge", parse_zone_nomerge); > + > /* > * early_calculate_totalpages() > - * Sum pages in active regions for movable zone. > + * Sum pages in active regions for virtual zones. > * Populate N_MEMORY for calculating usable_nodes. > */ > static unsigned long __init early_calculate_totalpages(void) > @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalpages(void) > } > > /* > - * This finds a zone that can be used for ZONE_MOVABLE pages. The > + * This finds a physical zone that can be used for virtual zones. The > * assumption is made that zones within a node are ordered in monotonic > * increasing memory addresses so that the "highest" populated zone is used > */ > -static void __init find_usable_zone_for_movable(void) > +static void __init find_usable_zone(void) > { > int zone_index; > - for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) { > - if (zone_index == ZONE_MOVABLE) > - continue; > - > + for (zone_index = LAST_PHYS_ZONE; zone_index >= 0; zone_index--) { > if (arch_zone_highest_possible_pfn[zone_index] > > arch_zone_lowest_possible_pfn[zone_index]) > break; > } > > VM_BUG_ON(zone_index == -1); > - movable_zone = zone_index; > + virt_zone = zone_index; > +} > + > +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn) > +{ > + int i, nid; > + unsigned long node_avg, remaining; > + int usable_nodes = nodes_weight(node_states[N_MEMORY]); > + /* usable_startpfn is the lowest possible pfn virtual zones can be at */ > + unsigned long usable_startpfn = arch_zone_lowest_possible_pfn[virt_zone]; > + > +restart: > + /* Carve out memory as evenly as possible throughout nodes */ > + node_avg = occupied / usable_nodes; > + for_each_node_state(nid, N_MEMORY) { > + unsigned long start_pfn, end_pfn; > + > + /* > + * Recalculate node_avg if the division per node now exceeds > + * what is necessary to satisfy the amount of memory to carve > + * out. > + */ > + if (occupied < node_avg) > + node_avg = occupied / usable_nodes; > + > + /* > + * As the map is walked, we track how much memory is usable > + * using remaining. When it is 0, the rest of the node is > + * usable. 
> + */ > + remaining = node_avg; > + > + /* Go through each range of PFNs within this node */ > + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { > + unsigned long size_pages; > + > + start_pfn = max(start_pfn, zone_pfn[nid]); > + if (start_pfn >= end_pfn) > + continue; > + > + /* Account for what is only usable when carving out */ > + if (start_pfn < usable_startpfn) { > + unsigned long nr_pages = min(end_pfn, usable_startpfn) - start_pfn; > + > + remaining -= min(nr_pages, remaining); > + occupied -= min(nr_pages, occupied); > + > + /* Continue if range is now fully accounted */ > + if (end_pfn <= usable_startpfn) { > + > + /* > + * Push zone_pfn to the end so that if > + * we have to carve out more across > + * nodes, we will not double account > + * here. > + */ > + zone_pfn[nid] = end_pfn; > + continue; > + } > + start_pfn = usable_startpfn; > + } > + > + /* > + * The usable PFN range is from start_pfn->end_pfn. > + * Calculate size_pages as the number of pages used. > + */ > + size_pages = end_pfn - start_pfn; > + if (size_pages > remaining) > + size_pages = remaining; > + zone_pfn[nid] = start_pfn + size_pages; > + > + /* > + * Some memory was carved out, update counts and break > + * if the request for this node has been satisfied. > + */ > + occupied -= min(occupied, size_pages); > + remaining -= size_pages; > + if (!remaining) > + break; > + } > + } > + > + /* > + * If there is still more to carve out, we do another pass with one less > + * node in the count. This will push zone_pfn[nid] further along on the > + * nodes that still have memory until the request is fully satisfied. > + */ > + usable_nodes--; > + if (usable_nodes && occupied > usable_nodes) > + goto restart; > } > > /* > @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(void) > * memory. When they don't, some nodes will have more kernelcore than > * others > */ > -static void __init find_zone_movable_pfns_for_nodes(void) > +static void __init find_virt_zones(void) > { > - int i, nid; > + int i; > + int nid; > unsigned long usable_startpfn; > - unsigned long kernelcore_node, kernelcore_remaining; > /* save the state before borrow the nodemask */ > nodemask_t saved_node_state = node_states[N_MEMORY]; > unsigned long totalpages = early_calculate_totalpages(); > - int usable_nodes = nodes_weight(node_states[N_MEMORY]); > struct memblock_region *r; > + unsigned long occupied = 0; > > - /* Need to find movable_zone earlier when movable_node is specified. */ > - find_usable_zone_for_movable(); > + /* Need to find virt_zone earlier when movable_node is specified. */ > + find_usable_zone(); > > /* > * If movable_node is specified, ignore kernelcore and movablecore > @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(void) > nid = memblock_get_region_node(r); > > usable_startpfn = PFN_DOWN(r->base); > - zone_movable_pfn[nid] = zone_movable_pfn[nid] ? > - min(usable_startpfn, zone_movable_pfn[nid]) : > + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ? > + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) : > usable_startpfn; > } > > @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(void) > continue; > } > > - zone_movable_pfn[nid] = zone_movable_pfn[nid] ? > - min(usable_startpfn, zone_movable_pfn[nid]) : > + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ? 
> + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) : > usable_startpfn; > } > > @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_nodes(void) > goto out2; > } > > + if (zone_nomerge_order && zone_nomerge_order <= zone_nosplit_order) { > + nr_pages_of(ZONE_NOSPLIT) = nr_pages_of(ZONE_NOMERGE) = 0; > + percentage_of(ZONE_NOSPLIT) = percentage_of(ZONE_NOMERGE) = 0; > + zone_nosplit_order = zone_nomerge_order = 0; > + pr_warn("zone %s order %d must be higher zone %s order %d\n", > + zone_names[ZONE_NOMERGE], zone_nomerge_order, > + zone_names[ZONE_NOSPLIT], zone_nosplit_order); > + } > + > /* > * If kernelcore=nn% or movablecore=nn% was specified, calculate the > * amount of necessary memory. > */ > - if (required_kernelcore_percent) > - required_kernelcore = (totalpages * 100 * required_kernelcore_percent) / > - 10000UL; > - if (required_movablecore_percent) > - required_movablecore = (totalpages * 100 * required_movablecore_percent) / > - 10000UL; > + for (i = LAST_PHYS_ZONE; i <= LAST_VIRT_ZONE; i++) { > + if (percentage_of(i)) > + nr_pages_of(i) = totalpages * percentage_of(i) / 100; > + > + nr_pages_of(i) = roundup(nr_pages_of(i), MAX_ORDER_NR_PAGES); > + occupied += nr_pages_of(i); > + } > > /* > * If movablecore= was specified, calculate what size of > * kernelcore that corresponds so that memory usable for > * any allocation type is evenly spread. If both kernelcore > * and movablecore are specified, then the value of kernelcore > - * will be used for required_kernelcore if it's greater than > - * what movablecore would have allowed. > + * will be used if it's greater than what movablecore would have > + * allowed. > */ > - if (required_movablecore) { > - unsigned long corepages; > + if (occupied < totalpages) { > + enum zone_type zid; > > - /* > - * Round-up so that ZONE_MOVABLE is at least as large as what > - * was requested by the user > - */ > - required_movablecore = > - roundup(required_movablecore, MAX_ORDER_NR_PAGES); > - required_movablecore = min(totalpages, required_movablecore); > - corepages = totalpages - required_movablecore; > - > - required_kernelcore = max(required_kernelcore, corepages); > + zid = !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_MOVABLE) ? > + LAST_PHYS_ZONE : ZONE_MOVABLE; > + nr_pages_of(zid) += totalpages - occupied; > } > > /* > * If kernelcore was not specified or kernelcore size is larger > - * than totalpages, there is no ZONE_MOVABLE. > + * than totalpages, there are not virtual zones. > */ > - if (!required_kernelcore || required_kernelcore >= totalpages) > + occupied = nr_pages_of(LAST_PHYS_ZONE); > + if (!occupied || occupied >= totalpages) > goto out; > > - /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */ > - usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone]; > + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) { > + if (!nr_pages_of(i)) > + continue; > > -restart: > - /* Spread kernelcore memory as evenly as possible throughout nodes */ > - kernelcore_node = required_kernelcore / usable_nodes; > - for_each_node_state(nid, N_MEMORY) { > - unsigned long start_pfn, end_pfn; > - > - /* > - * Recalculate kernelcore_node if the division per node > - * now exceeds what is necessary to satisfy the requested > - * amount of memory for the kernel > - */ > - if (required_kernelcore < kernelcore_node) > - kernelcore_node = required_kernelcore / usable_nodes; > - > - /* > - * As the map is walked, we track how much memory is usable > - * by the kernel using kernelcore_remaining. 
When it is > - * 0, the rest of the node is usable by ZONE_MOVABLE > - */ > - kernelcore_remaining = kernelcore_node; > - > - /* Go through each range of PFNs within this node */ > - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { > - unsigned long size_pages; > - > - start_pfn = max(start_pfn, zone_movable_pfn[nid]); > - if (start_pfn >= end_pfn) > - continue; > - > - /* Account for what is only usable for kernelcore */ > - if (start_pfn < usable_startpfn) { > - unsigned long kernel_pages; > - kernel_pages = min(end_pfn, usable_startpfn) > - - start_pfn; > - > - kernelcore_remaining -= min(kernel_pages, > - kernelcore_remaining); > - required_kernelcore -= min(kernel_pages, > - required_kernelcore); > - > - /* Continue if range is now fully accounted */ > - if (end_pfn <= usable_startpfn) { > - > - /* > - * Push zone_movable_pfn to the end so > - * that if we have to rebalance > - * kernelcore across nodes, we will > - * not double account here > - */ > - zone_movable_pfn[nid] = end_pfn; > - continue; > - } > - start_pfn = usable_startpfn; > - } > - > - /* > - * The usable PFN range for ZONE_MOVABLE is from > - * start_pfn->end_pfn. Calculate size_pages as the > - * number of pages used as kernelcore > - */ > - size_pages = end_pfn - start_pfn; > - if (size_pages > kernelcore_remaining) > - size_pages = kernelcore_remaining; > - zone_movable_pfn[nid] = start_pfn + size_pages; > - > - /* > - * Some kernelcore has been met, update counts and > - * break if the kernelcore for this node has been > - * satisfied > - */ > - required_kernelcore -= min(required_kernelcore, > - size_pages); > - kernelcore_remaining -= size_pages; > - if (!kernelcore_remaining) > - break; > - } > + find_virt_zone(occupied, &pfn_of(i, 0)); > + occupied += nr_pages_of(i); > } > - > - /* > - * If there is still required_kernelcore, we do another pass with one > - * less node in the count. This will push zone_movable_pfn[nid] further > - * along on the nodes that still have memory until kernelcore is > - * satisfied > - */ > - usable_nodes--; > - if (usable_nodes && required_kernelcore > usable_nodes) > - goto restart; > - > out2: > - /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */ > + /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGES */ > for (nid = 0; nid < MAX_NUMNODES; nid++) { > unsigned long start_pfn, end_pfn; > - > - zone_movable_pfn[nid] = > - roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); > + unsigned long prev_virt_zone_pfn = 0; > > get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); > - if (zone_movable_pfn[nid] >= end_pfn) > - zone_movable_pfn[nid] = 0; > + > + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) { > + pfn_of(i, nid) = roundup(pfn_of(i, nid), MAX_ORDER_NR_PAGES); > + > + if (pfn_of(i, nid) <= prev_virt_zone_pfn || pfn_of(i, nid) >= end_pfn) > + pfn_of(i, nid) = 0; > + > + if (pfn_of(i, nid)) > + prev_virt_zone_pfn = pfn_of(i, nid); > + } > } > - > out: > /* restore the node_state */ > node_states[N_MEMORY] = saved_node_state; > @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *zone, > #endif > > /* > - * The zone ranges provided by the architecture do not include ZONE_MOVABLE > - * because it is sized independent of architecture. Unlike the other zones, > - * the starting point for ZONE_MOVABLE is not fixed. It may be different > - * in each node depending on the size of each node and how evenly kernelcore > - * is distributed. 
This helper function adjusts the zone ranges > + * The zone ranges provided by the architecture do not include virtual zones > + * because they are sized independent of architecture. Unlike physical zones, > + * the starting point for the first populated virtual zone is not fixed. It may > + * be different in each node depending on the size of each node and how evenly > + * kernelcore is distributed. This helper function adjusts the zone ranges > * provided by the architecture for a given node by using the end of the > - * highest usable zone for ZONE_MOVABLE. This preserves the assumption that > - * zones within a node are in order of monotonic increases memory addresses > + * highest usable zone for the first populated virtual zone. This preserves the > + * assumption that zones within a node are in order of monotonic increases > + * memory addresses. > */ > -static void __init adjust_zone_range_for_zone_movable(int nid, > +static void __init adjust_zone_range(int nid, > unsigned long zone_type, > unsigned long node_end_pfn, > unsigned long *zone_start_pfn, > unsigned long *zone_end_pfn) > { > - /* Only adjust if ZONE_MOVABLE is on this node */ > - if (zone_movable_pfn[nid]) { > - /* Size ZONE_MOVABLE */ > - if (zone_type == ZONE_MOVABLE) { > - *zone_start_pfn = zone_movable_pfn[nid]; > - *zone_end_pfn = min(node_end_pfn, > - arch_zone_highest_possible_pfn[movable_zone]); > + int i = max_t(int, zone_type, LAST_PHYS_ZONE); > + unsigned long next_virt_zone_pfn = 0; > > - /* Adjust for ZONE_MOVABLE starting within this range */ > - } else if (!mirrored_kernelcore && > - *zone_start_pfn < zone_movable_pfn[nid] && > - *zone_end_pfn > zone_movable_pfn[nid]) { > - *zone_end_pfn = zone_movable_pfn[nid]; > + while (i++ < LAST_VIRT_ZONE) { > + if (pfn_of(i, nid)) { > + next_virt_zone_pfn = pfn_of(i, nid); > + break; > + } > + } > > - /* Check if this whole range is within ZONE_MOVABLE */ > - } else if (*zone_start_pfn >= zone_movable_pfn[nid]) > + if (zone_type <= LAST_PHYS_ZONE) { > + if (!next_virt_zone_pfn) > + return; > + > + if (!mirrored_kernelcore && > + *zone_start_pfn < next_virt_zone_pfn && > + *zone_end_pfn > next_virt_zone_pfn) > + *zone_end_pfn = next_virt_zone_pfn; > + else if (*zone_start_pfn >= next_virt_zone_pfn) > *zone_start_pfn = *zone_end_pfn; > + } else if (zone_type <= LAST_VIRT_ZONE) { > + if (!pfn_of(zone_type, nid)) > + return; > + > + if (next_virt_zone_pfn) > + *zone_end_pfn = min3(next_virt_zone_pfn, > + node_end_pfn, > + arch_zone_highest_possible_pfn[virt_zone]); > + else > + *zone_end_pfn = min(node_end_pfn, > + arch_zone_highest_possible_pfn[virt_zone]); > + *zone_start_pfn = min(*zone_end_pfn, pfn_of(zone_type, nid)); > } > } > > @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid, > * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages > * and vice versa. 
> */ > - if (mirrored_kernelcore && zone_movable_pfn[nid]) { > + if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) { > unsigned long start_pfn, end_pfn; > struct memblock_region *r; > > @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_node(int nid, > /* Get the start and end of the zone */ > *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high); > *zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high); > - adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn, > - zone_start_pfn, zone_end_pfn); > + adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, zone_end_pfn); > > /* Check that this node has pages within the zone's required range */ > if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn) > @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat, > #if defined(CONFIG_MEMORY_HOTPLUG) > zone->present_early_pages = real_size; > #endif > + if (i == ZONE_NOSPLIT) > + zone->order = zone_nosplit_order; > + if (i == ZONE_NOMERGE) > + zone->order = zone_nomerge_order; > > totalpages += spanned; > realtotalpages += real_size; > @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgdat) > { > enum zone_type zone_type; > > - for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) { > + for (zone_type = 0; zone_type <= LAST_PHYS_ZONE; zone_type++) { > struct zone *zone = &pgdat->node_zones[zone_type]; > if (populated_zone(zone)) { > if (IS_ENABLED(CONFIG_HIGHMEM)) > @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void) > void __init free_area_init(unsigned long *max_zone_pfn) > { > unsigned long start_pfn, end_pfn; > - int i, nid, zone; > + int i, j, nid, zone; > bool descending; > > /* Record where the zone boundaries are */ > @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zone_pfn) > start_pfn = PHYS_PFN(memblock_start_of_DRAM()); > descending = arch_has_descending_max_zone_pfns(); > > - for (i = 0; i < MAX_NR_ZONES; i++) { > + for (i = 0; i <= LAST_PHYS_ZONE; i++) { > if (descending) > - zone = MAX_NR_ZONES - i - 1; > + zone = LAST_PHYS_ZONE - i; > else > zone = i; > > - if (zone == ZONE_MOVABLE) > - continue; > - > end_pfn = max(max_zone_pfn[zone], start_pfn); > arch_zone_lowest_possible_pfn[zone] = start_pfn; > arch_zone_highest_possible_pfn[zone] = end_pfn; > @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zone_pfn) > start_pfn = end_pfn; > } > > - /* Find the PFNs that ZONE_MOVABLE begins at in each node */ > - memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); > - find_zone_movable_pfns_for_nodes(); > + /* Find the PFNs that virtual zones begin at in each node */ > + find_virt_zones(); > > /* Print out the zone ranges */ > pr_info("Zone ranges:\n"); > - for (i = 0; i < MAX_NR_ZONES; i++) { > - if (i == ZONE_MOVABLE) > - continue; > + for (i = 0; i <= LAST_PHYS_ZONE; i++) { > pr_info(" %-8s ", zone_names[i]); > if (arch_zone_lowest_possible_pfn[i] == > arch_zone_highest_possible_pfn[i]) > @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zone_pfn) > << PAGE_SHIFT) - 1); > } > > - /* Print out the PFNs ZONE_MOVABLE begins at in each node */ > - pr_info("Movable zone start for each node\n"); > - for (i = 0; i < MAX_NUMNODES; i++) { > - if (zone_movable_pfn[i]) > - pr_info(" Node %d: %#018Lx\n", i, > - (u64)zone_movable_pfn[i] << PAGE_SHIFT); > + /* Print out the PFNs virtual zones begin at in each node */ > + for (; i <= LAST_VIRT_ZONE; i++) { > + pr_info("%s zone start 
for each node\n", zone_names[i]); > + for (j = 0; j < MAX_NUMNODES; j++) { > + if (pfn_of(i, j)) > + pr_info(" Node %d: %#018Lx\n", > + j, (u64)pfn_of(i, j) << PAGE_SHIFT); > + } > } > > /* > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 150d4f23b010..6a4da8f8691c 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] = { > "HighMem", > #endif > "Movable", > + "NoSplit", > + "NoMerge", > #ifdef CONFIG_ZONE_DEVICE > "Device", > #endif > @@ -290,9 +292,9 @@ int user_min_free_kbytes = -1; > static int watermark_boost_factor __read_mostly = 15000; > static int watermark_scale_factor = 10; > > -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ > -int movable_zone; > -EXPORT_SYMBOL(movable_zone); > +/* virt_zone is the "real" zone pages in virtual zones are taken from */ > +int virt_zone; > +EXPORT_SYMBOL(virt_zone); > > #if MAX_NUMNODES > 1 > unsigned int nr_node_ids __read_mostly = MAX_NUMNODES; > @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, > unsigned long higher_page_pfn; > struct page *higher_page; > > - if (order >= MAX_PAGE_ORDER - 1) > - return false; > - > higher_page_pfn = buddy_pfn & pfn; > higher_page = page + (higher_page_pfn - pfn); > > @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, > NULL) != NULL; > } > > +static int zone_max_order(struct zone *zone) > +{ > + return zone->order && zone_idx(zone) == ZONE_NOMERGE ? zone->order : MAX_PAGE_ORDER; > +} > + > /* > * Freeing function for a buddy system allocator. > * > @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page, > unsigned long combined_pfn; > struct page *buddy; > bool to_tail; > + int max_order = zone_max_order(zone); > > VM_BUG_ON(!zone_is_initialized(zone)); > VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); > @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page, > VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); > VM_BUG_ON_PAGE(bad_range(zone, page), page); > > - while (order < MAX_PAGE_ORDER) { > + while (order < max_order) { > if (compaction_capture(capc, page, order, migratetype)) { > __mod_zone_freepage_state(zone, -(1 << order), > migratetype); > @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page, > to_tail = true; > else if (is_shuffle_order(order)) > to_tail = shuffle_pick_tail(); > + else if (order + 1 >= max_order) > + to_tail = false; > else > to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order); > > @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page, > int mt; > int ret = 0; > > + VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page); > + > if (split_pfn_offset == 0) > return ret; > > @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, > struct free_area *area; > struct page *page; > > + VM_WARN_ON_ONCE(!zone_is_suitable(zone, order)); > + > /* Find a page of the appropriate size in the preferred list */ > for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) { > area = &(zone->free_area[current_order]); > @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, > long min = mark; > int o; > > + if (!zone_is_suitable(z, order)) > + return false; > + > /* free_pages may go negative - that's OK */ > free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); > > @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone 
*z, unsigned int order, > { > long free_pages; > > + if (!zone_is_suitable(z, order)) > + return false; > + > free_pages = zone_page_state(z, NR_FREE_PAGES); > > /* > @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, > struct page *page; > unsigned long mark; > > + if (!zone_is_suitable(zone, order)) > + continue; > + > if (cpusets_enabled() && > (alloc_flags & ALLOC_CPUSET) && > !__cpuset_zone_allowed(zone, gfp_mask)) > @@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void) > struct zone *zone; > unsigned long flags; > > - /* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */ > + /* Calculate total number of pages below ZONE_HIGHMEM */ > for_each_zone(zone) { > - if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE) > + if (zone_idx(zone) <= ZONE_NORMAL) > lowmem_pages += zone_managed_pages(zone); > } > > @@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void) > spin_lock_irqsave(&zone->lock, flags); > tmp = (u64)pages_min * zone_managed_pages(zone); > do_div(tmp, lowmem_pages); > - if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) { > + if (zone_idx(zone) > ZONE_NORMAL) { > /* > * __GFP_HIGH and PF_MEMALLOC allocations usually don't > - * need highmem and movable zones pages, so cap pages_min > - * to a small value here. > + * need pages from zones above ZONE_NORMAL, so cap > + * pages_min to a small value here. > * > * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN) > * deltas control async page reclaim, and so should > diff --git a/mm/page_isolation.c b/mm/page_isolation.c > index cd0ea3668253..8a6473543427 100644 > --- a/mm/page_isolation.c > +++ b/mm/page_isolation.c > @@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e > * pages then it should be reasonably safe to assume the rest > * is movable. > */ > - if (zone_idx(zone) == ZONE_MOVABLE) > + if (zid_is_virt(zone_idx(zone))) > continue; > > /* > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > index 0bec1f705f8e..ad0db0373b05 100644 > --- a/mm/swap_slots.c > +++ b/mm/swap_slots.c > @@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio) > entry.val = 0; > > if (folio_test_large(folio)) { > - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported() && > + folio_test_pmd_mappable(folio)) > get_swap_pages(1, &entry, folio_nr_pages(folio)); > goto out; > } > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 4f9c854ce6cc..ae061ec4866a 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > goto keep_locked; > if (folio_maybe_dma_pinned(folio)) > goto keep_locked; > - if (folio_test_large(folio)) { > - /* cannot split folio, skip it */ > - if (!can_split_folio(folio, NULL)) > - goto activate_locked; > - /* > - * Split folios without a PMD map right > - * away. Chances are some or all of the > - * tail pages can be freed without IO. > - */ > - if (!folio_entire_mapcount(folio) && > - split_folio_to_list(folio, > - folio_list)) > - goto activate_locked; > - } > + /* > + * Split folios that are not fully map right > + * away. Chances are some of the tail pages can > + * be freed without IO. 
> + */ > + if (folio_test_large(folio) && > + atomic_read(&folio->_nr_pages_mapped) < nr_pages) > + split_folio_to_list(folio, folio_list); > if (!add_to_swap(folio)) { > if (!folio_test_large(folio)) > goto activate_locked_split; > @@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > orig_mask = sc->gfp_mask; > if (buffer_heads_over_limit) { > sc->gfp_mask |= __GFP_HIGHMEM; > - sc->reclaim_idx = gfp_zone(sc->gfp_mask); > + sc->reclaim_idx = gfp_order_zone(sc->gfp_mask, sc->order); > } > > for_each_zone_zonelist_nodemask(zone, z, zonelist, > @@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > struct scan_control sc = { > .nr_to_reclaim = SWAP_CLUSTER_MAX, > .gfp_mask = current_gfp_context(gfp_mask), > - .reclaim_idx = gfp_zone(gfp_mask), > + .reclaim_idx = gfp_order_zone(gfp_mask, order), > .order = order, > .nodemask = nodemask, > .priority = DEF_PRIORITY, > @@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, > if (!cpuset_zone_allowed(zone, gfp_flags)) > return; > > + curr_idx = gfp_order_zone(gfp_flags, order); > + if (highest_zoneidx > curr_idx) > + highest_zoneidx = curr_idx; > + > pgdat = zone->zone_pgdat; > curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx); > > @@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in > .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE), > .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), > .may_swap = 1, > - .reclaim_idx = gfp_zone(gfp_mask), > + .reclaim_idx = gfp_order_zone(gfp_mask, order), > }; > unsigned long pflags; > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index db79935e4a54..adbd032e6a0f 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) > > #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \ > TEXT_FOR_HIGHMEM(xx) xx "_movable", \ > + xx "_nosplit", xx "_nomerge", \ > TEXT_FOR_DEVICE(xx) > > const char * const vmstat_text[] = { > @@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > "\n spanned %lu" > "\n present %lu" > "\n managed %lu" > - "\n cma %lu", > + "\n cma %lu" > + "\n order %u", > zone_page_state(zone, NR_FREE_PAGES), > zone->watermark_boost, > min_wmark_pages(zone), > @@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > zone->spanned_pages, > zone->present_pages, > zone_managed_pages(zone), > - zone_cma_pages(zone)); > + zone_cma_pages(zone), > + zone->order); > > seq_printf(m, > "\n protection: (%ld", > -- > 2.44.0.rc1.240.g4c46232300-goog > >
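For illustration only (the sizes and orders below are made-up values, not
taken from the patch), the two new parameters follow the movablecore
syntax and could be combined on the kernel command line as:

    nosplit=8G,4 nomerge=16G,9

i.e. an 8GB nosplit zone whose pages never go below order 4 (64KB with 4KB
base pages) and a 16GB nomerge zone fixed at order 9 (2MB). Per
parse_zone_order() above, the order must be between 2 and MAX_PAGE_ORDER,
and find_virt_zones() disables both zones with a warning if the nomerge
order is not higher than the nosplit order.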
On Thu, Feb 29, 2024 at 6:31 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > [...]
>
> I think the allocation fallback policy needs to be elaborated. IIUC,
> when allocating large folios, if the order > min order of the policy
> zones, the fallback policy should be ZONE_NOSPLIT/NOMERGE ->
> ZONE_MOVABLE -> ZONE_NORMAL, right?

Correct.

> If all other zones are depleted, an allocation whose order is < the min
> order won't fall back to the policy zones and will fail, just like a
> non-movable allocation can't fall back to ZONE_MOVABLE even though there
> is enough memory in that zone, right?

Correct. In this case, the userspace can consider dynamic resizing.
(The resizing patches are not included since, as I said in the other
thread, we need to focus on the first few steps at the current stage.)

Naturally, the next question would be why we created this whole new
process rather than trying to improve compaction. We did try the latter:
on servers, we tuned compaction and had some good improvements but soon
hit a new wall; on clients, no luck at all, because 1) they are usually
under much higher pressure than servers and 2) they are more sensitive
to latency. So we needed a *more deterministic* approach when dealing
with fragmentation. Unlike compaction, which I'd call a heuristic,
resizing is more of a policy that the userspace can have full control
over. Obviously, leaving the task to the userspace can be a good or a
bad thing, depending on the point of view.

The bottom line is:
1. Resizing would also help the *existing* problem of ZONE_MOVABLE being
unbalanced against the other zones, for the non-hot-removal case.
2. Enlarging the THP zones is more likely to succeed than compaction,
because it targets the blocks it "donated" to ZONE_MOVABLE with
everything it has (both migration and reclaim) and keeps at it until it
succeeds, whereas compaction lacks such laser focus and is more of a
best-effort approach. (Needless to say, shrinking the THP zones can
always succeed.)

> > Policy zones can be dynamically resized by offlining pages in one of
> > them and onlining those pages in another of them. Note that this is
> > only done among policy zones, not between a policy zone and a physical
> > zone, since resizing is a (software) policy, not a physical
> > characteristic.
> >
> > [...]
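To make the ceiling of that fallback concrete, below is a small
stand-alone model of the selection rules (it paraphrases the
gfp_zone()/gfp_order_zone() changes in this patch; the four-zone enum and
the orders 4 and 9 are simplifications for the example, not the kernel
definitions):

    #include <stdio.h>

    enum zone_type { ZONE_NORMAL, ZONE_MOVABLE, ZONE_NOSPLIT, ZONE_NOMERGE };

    static const char *names[] = { "Normal", "Movable", "NoSplit", "NoMerge" };
    static const int nosplit_order = 4, nomerge_order = 9;

    /* Mirrors gfp_order_zone(): movable+comp requests start at the last THP zone. */
    static enum zone_type highest_zone(int movable, int comp, int order)
    {
        enum zone_type zid = (movable && comp) ? ZONE_NOMERGE :
                             movable ? ZONE_MOVABLE : ZONE_NORMAL;

        if (zid >= ZONE_NOMERGE && order != nomerge_order)
            zid = ZONE_NOMERGE - 1;
        if (zid >= ZONE_NOSPLIT && order < nosplit_order)
            zid = ZONE_NOSPLIT - 1;
        return zid;
    }

    int main(void)
    {
        /* THP orders walk NOMERGE/NOSPLIT -> MOVABLE -> NORMAL -> ... */
        printf("order-9 THP : up to %s\n", names[highest_zone(1, 1, 9)]);
        printf("order-4 THP : up to %s\n", names[highest_zone(1, 1, 4)]);
        printf("order-2 THP : up to %s\n", names[highest_zone(1, 1, 2)]);
        /* ... while small or unmovable allocations never enter the THP zones. */
        printf("order-0 user: up to %s\n", names[highest_zone(1, 0, 0)]);
        printf("order-0 slab: up to %s\n", names[highest_zone(0, 0, 0)]);
        return 0;
    }

It prints NoMerge, NoSplit, Movable, Movable and Normal respectively: the
starting zone is capped by the order, the zonelist walk then falls back
toward ZONE_NORMAL and below, and nothing smaller than the configured
orders ever lands in the two THP zones.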
On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> ZONE_MOVABLE) and restricted to a minimum order to be
> anti-fragmentation. The latter means that they cannot be split down
> below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> to an exact order. The latter means that not only is split
> prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> reason in Chapter Three), while they are free or in use.

These two zones end up solving a problem for memdescs. So I'm in favour!
I added Option 5 to https://kernelnewbies.org/MatthewWilcox/BuddyAllocator

I think this patch needs to be split into more digestible chunks, but a
quick skim of it didn't reveal anything egregiously wrong. I do still
have that question about the number of bits used for Zone in
page->flags. Probably this all needs to be dependent on CONFIG_64BIT?
> There are three types of zones: > 1. The first four zones partition the physical address space of CPU > memory. > 2. The device zone provides interoperability between CPU and device > memory. > 3. The movable zone commonly represents a memory allocation policy. > > Though originally designed for memory hot removal, the movable zone is > instead widely used for other purposes, e.g., CMA and kdump kernel, on > platforms that do not support hot removal, e.g., Android and ChromeOS. > Nowadays, it is legitimately a zone independent of any physical > characteristics. In spite of being somewhat regarded as a hack, > largely due to the lack of a generic design concept for its true major > use cases (on billions of client devices), the movable zone naturally > resembles a policy (virtual) zone overlayed on the first four > (physical) zones. > > This proposal formally generalizes this concept as policy zones so > that additional policies can be implemented and enforced by subsequent > zones after the movable zone. An inherited requirement of policy zones > (and the first four zones) is that subsequent zones must be able to > fall back to previous zones and therefore must add new properties to > the previous zones rather than remove existing ones from them. Also, > all properties must be known at the allocation time, rather than the > runtime, e.g., memory object size and mobility are valid properties > but hotness and lifetime are not. > > ZONE_MOVABLE becomes the first policy zone, followed by two new policy > zones: > 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from > ZONE_MOVABLE) and restricted to a minimum order to be > anti-fragmentation. The latter means that they cannot be split down > below that order, while they are free or in use. > 2. ZONE_NOMERGE, which contains pages that are movable and restricted > to an exact order. The latter means that not only is split > prohibited (inherited from ZONE_NOSPLIT) but also merge (see the > reason in Chapter Three), while they are free or in use. > > Since these two zones only can serve THP allocations (__GFP_MOVABLE | > __GFP_COMP), they are called THP zones. Reclaim works seamlessly and > compaction is not needed for these two zones. > > Compared with the hugeTLB pool approach, THP zones tap into core MM > features including: > 1. THP allocations can fall back to the lower zones, which can have > higher latency but still succeed. > 2. THPs can be either shattered (see Chapter Two) if partially > unmapped or reclaimed if becoming cold. > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB > contiguous PTEs on arm64 [1], which are more suitable for client > workloads. > > Policy zones can be dynamically resized by offlining pages in one of > them and onlining those pages in another of them. Note that this is > only done among policy zones, not between a policy zone and a physical > zone, since resizing is a (software) policy, not a physical > characteristic. > > Implementing the same idea in the pageblock granularity has also been > explored but rejected at Google. Pageblocks have a finer granularity > and therefore can be more flexible than zones. The tradeoff is that > this alternative implementation was more complex and failed to bring a > better ROI. However, the rejection was mainly due to its inability to > be smoothly extended to 1GB THPs [2], which is a planned use case of > TAO. 
We did implement a similar idea at the pageblock granularity on OPPO's phones by extending two special migratetypes[1]: * QUAD_TO_TRIP - this is mainly for order-4 mTHP allocations which can use ARM64's CONT-PTE; but can rarely be split into order 3 to dull the pain of order-3 allocations, if and only if an order-3 allocation has failed in both the normal buddy and the TRIP_TO_QUAD type below. * TRIP_TO_QUAD - this is mainly for order-4 mTHP allocations which can use ARM64's CONT-PTE; but can sometimes be split into order 3 to dull the pain of order-3 allocations, if and only if an order-3 allocation has failed in the normal buddy. Neither of the above will be merged into order 5 or above; neither of the above will be split into order 2 or lower. In compaction, we skip both of the above. One disadvantage I am seeing with this approach is that I have to add a separate LRU list in each zone to place those mTHP folios. If mTHP and small folios are put in the same LRU list, the reclamation efficiency is extremely bad. A separate zone, on the other hand, can avoid a separate LRU list for mTHP as the new zone has its own LRU list. [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c > > [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/ > [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ > > Signed-off-by: Yu Zhao <yuzhao@google.com> Thanks Barry
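To make the ordering above easier to follow, here is an illustrative sketch of the policy Barry describes; it is not code from the linked OPPO kernel. The migratetype names are taken from the description, rmqueue_from() is a made-up placeholder for taking (and, at order 3, splitting) a block from one of the two pools, and the relative order in which the two pools are tried for order 4 is a guess. Per the description, the order-3 path runs only after the normal buddy allocator has already failed.

struct zone;
struct page;

enum { MIGRATE_TRIP_TO_QUAD, MIGRATE_QUAD_TO_TRIP };    /* placeholders */

/* placeholder: take a block of @order from @mt, splitting an order-4 block if needed */
struct page *rmqueue_from(struct zone *zone, int order, int mt);

/* called only after the normal buddy path has failed for @order */
static struct page *alloc_cont_pte_mthp(struct zone *zone, int order)
{
        struct page *page;

        if (order == 4) {               /* the common case: CONT-PTE mTHP */
                page = rmqueue_from(zone, 4, MIGRATE_TRIP_TO_QUAD);
                if (!page)
                        page = rmqueue_from(zone, 4, MIGRATE_QUAD_TO_TRIP);
                return page;
        }

        if (order == 3) {               /* dull the pain of order-3 allocations */
                page = rmqueue_from(zone, 3, MIGRATE_TRIP_TO_QUAD);
                if (!page)              /* last resort: the stricter pool */
                        page = rmqueue_from(zone, 3, MIGRATE_QUAD_TO_TRIP);
                return page;
        }

        return NULL;                    /* order <= 2 never touches these pools */
}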
On 3/5/24 09:41, Barry Song wrote: > We did implement similar idea in the pageblock granularity on OPPO's > phones by extending two special migratetypes[1]: > > * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use > ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain > of 3-order allocation if and only if 3-order allocation has failed in both > normal buddy and the below TRIP_TO_QUAD. > > * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use > ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the > pain of 3-order allocation if and only if 3-order allocation has failed in > normal buddy. > > neither of above will be merged into 5 order or above; neither of above > will be splitted into 2 order or lower. > > in compaction, we will skip both of above. I am seeing one disadvantage > of this approach is that I have to add a separate LRU list in each > zone to place those mTHP folios. if mTHP and small folios are put > in the same LRU list, the reclamation efficiency is extremely bad. > > A separate zone, on the other hand, can avoid a separate LRU list > for mTHP as the new zone has its own LRU list. But we switched from per-zone to per-node LRU lists years ago? Is that actually a complication for the policy zones? Or does this work silently assume multigen lru which (IIRC) works differently? > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c > >> >> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/ >> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ >> >> Signed-off-by: Yu Zhao <yuzhao@google.com> > > Thanks > Barry > >
On Mon, Mar 04, 2024 at 03:19:42PM +0000, Matthew Wilcox wrote: > On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote: > > ZONE_MOVABLE becomes the first policy zone, followed by two new policy > > zones: > > 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from > > ZONE_MOVABLE) and restricted to a minimum order to be > > anti-fragmentation. The latter means that they cannot be split down > > below that order, while they are free or in use. > > 2. ZONE_NOMERGE, which contains pages that are movable and restricted > > to an exact order. The latter means that not only is split > > prohibited (inherited from ZONE_NOSPLIT) but also merge (see the > > reason in Chapter Three), while they are free or in use. > > These two zones end up solving a problem for memdescs. So I'm in favour! > I added Option 5 to https://kernelnewbies.org/MatthewWilcox/BuddyAllocator I realised that we don't even need a doubly linked list for ZONE_NOMERGE (would ZONE_FIXEDSIZE be a better name?). We only need a doubly linked list to make removal from the middle of the list an O(1) operation, and we only remove from the middle of a list when merging. So we can simply keep a stack of free "pages", and we have 60 bits to point to the next memdesc, so we can easily cover all memory that can exist in a 64-bit machine in ZONE_NOMERGE. ZONE_NOSPLIT would be limited to the first 1PB of memory (assuming it has a minimum size of 2MB -- with 29 bits to refer to each of next & prev, 29 + 21 = 50 bits of address space).
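To spell out the data-structure point: in a zone whose blocks all have one fixed order and are never merged, the free list is only ever pushed and popped, so a singly linked stack with an index-sized link suffices. The sketch below is hypothetical and not taken from the patches or the wiki page; it uses a plain array of next-indices as a stand-in for the 60-bit link packed into a memdesc.

struct nomerge_zone {
        unsigned long free_head;        /* index of the first free block; 0 means empty */
        unsigned long *next;            /* per-block next index, stand-in for a memdesc field */
};

static void nomerge_free(struct nomerge_zone *z, unsigned long idx)
{
        z->next[idx] = z->free_head;    /* push: no neighbors to unlink, ever */
        z->free_head = idx;
}

static unsigned long nomerge_alloc(struct nomerge_zone *z)
{
        unsigned long idx = z->free_head;

        if (idx)                        /* pop from the head; the middle is never touched */
                z->free_head = z->next[idx];
        return idx;                     /* 0 means this zone is out of blocks */
}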
On Tue, Mar 5, 2024 at 11:07 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 3/5/24 09:41, Barry Song wrote: > > We did implement similar idea in the pageblock granularity on OPPO's > > phones by extending two special migratetypes[1]: > > > > * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use > > ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain > > of 3-order allocation if and only if 3-order allocation has failed in both > > normal buddy and the below TRIP_TO_QUAD. > > > > * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use > > ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the > > pain of 3-order allocation if and only if 3-order allocation has failed in > > normal buddy. > > > > neither of above will be merged into 5 order or above; neither of above > > will be splitted into 2 order or lower. > > > > in compaction, we will skip both of above. I am seeing one disadvantage > > of this approach is that I have to add a separate LRU list in each > > zone to place those mTHP folios. if mTHP and small folios are put > > in the same LRU list, the reclamation efficiency is extremely bad. > > > > A separate zone, on the other hand, can avoid a separate LRU list > > for mTHP as the new zone has its own LRU list. > > But we switched from per-zone to per-node LRU lists years ago? > Is that actually a complication for the policy zones? Or does this work > silently assume multigen lru which (IIRC) works differently? The latter. Based on the code below, I believe MGLRU is different from active/inactive: void lru_gen_init_lruvec(struct lruvec *lruvec) { int i; int gen, type, zone; struct lru_gen_folio *lrugen = &lruvec->lrugen; lrugen->max_seq = MIN_NR_GENS + 1; lrugen->enabled = lru_gen_enabled(); for (i = 0; i <= MIN_NR_GENS + 1; i++) lrugen->timestamps[i] = jiffies; for_each_gen_type_zone(gen, type, zone) INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]); lruvec->mm_state.seq = MIN_NR_GENS; } A fundamental difference is that MGLRU has a different aging and eviction mechanism. This can synchronize the LRUs of each zone to move forward at the same pace, while active/inactive might be unable to compare the ages of folios across zones. > > > > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c > > > >> > >> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/ > >> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ > >> > >> Signed-off-by: Yu Zhao <yuzhao@google.com> > > Thanks Barry
On Tue, Mar 5, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 11:07 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > On 3/5/24 09:41, Barry Song wrote: > > > We did implement similar idea in the pageblock granularity on OPPO's > > > phones by extending two special migratetypes[1]: > > > > > > * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use > > > ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain > > > of 3-order allocation if and only if 3-order allocation has failed in both > > > normal buddy and the below TRIP_TO_QUAD. > > > > > > * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use > > > ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the > > > pain of 3-order allocation if and only if 3-order allocation has failed in > > > normal buddy. > > > > > > neither of above will be merged into 5 order or above; neither of above > > > will be splitted into 2 order or lower. > > > > > > in compaction, we will skip both of above. I am seeing one disadvantage > > > of this approach is that I have to add a separate LRU list in each > > > zone to place those mTHP folios. if mTHP and small folios are put > > > in the same LRU list, the reclamation efficiency is extremely bad. > > > > > > A separate zone, on the other hand, can avoid a separate LRU list > > > for mTHP as the new zone has its own LRU list. > > > > But we switched from per-zone to per-node LRU lists years ago? > > Is that actually a complication for the policy zones? Or does this work > > silently assume multigen lru which (IIRC) works differently? > > the latter. based on the below code, i believe mglru is different > with active/inactive, > > void lru_gen_init_lruvec(struct lruvec *lruvec) > { > int i; > int gen, type, zone; > struct lru_gen_folio *lrugen = &lruvec->lrugen; > > lrugen->max_seq = MIN_NR_GENS + 1; > lrugen->enabled = lru_gen_enabled(); > > for (i = 0; i <= MIN_NR_GENS + 1; i++) > lrugen->timestamps[i] = jiffies; > > for_each_gen_type_zone(gen, type, zone) > INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]); > > lruvec->mm_state.seq = MIN_NR_GENS; > } > > A fundamental difference is that mglru has a different aging and > eviction mechanism, > This can synchronize the LRUs of each zone to move > forward at the same pace while > the active/inactive might be unable to compare the ages of folios across zones. That's correct. The active/inactive should also work with the extra zones, just like it does for ZONE_MOVABLE. But it's not as optimized as MGLRU, e.g., targeting eligible zones without searching the entire LRU list containing folios from all zones.
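The optimization Yu refers to follows from the indexing visible in the quoted lru_gen_init_lruvec(): MGLRU keeps per-zone list heads inside each generation, so eviction can walk only the zone bins that are eligible for the current reclaim target. The sketch below is a simplification written for this discussion, not a copy of the kernel's scan path; a real scan inspects folios one by one instead of splicing whole bins.

static void scan_eligible_zones(struct lru_gen_folio *lrugen, int gen, int type,
                                int reclaim_idx, struct list_head *evict)
{
        int zone;

        /* only the bins for zones the allocation can actually use */
        for (zone = reclaim_idx; zone >= 0; zone--)
                list_splice_init(&lrugen->folios[gen][type][zone], evict);
}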
On Thu, Feb 29, 2024 at 3:28 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote: > > Compared with the hugeTLB pool approach, THP zones tap into core MM > > features including: > > 1. THP allocations can fall back to the lower zones, which can have > > higher latency but still succeed. > > 2. THPs can be either shattered (see Chapter Two) if partially > > unmapped or reclaimed if becoming cold. > > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB > > contiguous PTEs on arm64 [1], which are more suitable for client > > workloads. > > Can this mechanism be used to fully replace the hugetlb pool approach? > That would be a major selling point. It kind of feels like it should, > but I am insufficiently expert to be certain. This depends on the return value from htlb_alloc_mask(): if it's GFP_HIGHUSER_MOVABLE, then yes (i.e., 2MB hugeTLB folios on x86). Hypothetically, if users can have THPs as reliable as hugeTLB can offer, wouldn't most users want to go with the former since it's more flexible? E.g., core MM features like split (shattering) and reclaim in addition to HVO. > I'll read over the patches sometime soon. There's a lot to go through. > Something I didn't see in the cover letter or commit messages was any > discussion of page->flags and how many bits we use for ZONE (particularly > on 32-bit). Perhaps I'll discover the answer to that as I read. There may be corner cases because of how different architectures use page->flags, but in general, this shouldn't be a big problem because we can have 6 zones (at most) before this series, and after this series, we can have 8 (at most). IOW, we need 3 bits regardless, in order to encode all existing zones.
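For readers who want to see where that condition comes from: htlb_alloc_mask() chooses the gfp mask per hugeTLB page size, roughly as below (a paraphrase of include/linux/hugetlb.h; the exact form varies across kernel versions). Page sizes for which hugepage_movable_supported() returns true get GFP_HIGHUSER_MOVABLE and could therefore be served from the THP zones; gigantic pages may not qualify, depending on the configuration.

/* Paraphrased, not copied verbatim; details vary by kernel version. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
        if (hugepage_movable_supported(h))
                return GFP_HIGHUSER_MOVABLE;

        return GFP_HIGHUSER;
}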
On Tue, Mar 05, 2024 at 10:51:20PM -0500, Yu Zhao wrote: > On Thu, Feb 29, 2024 at 3:28 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote: > > > Compared with the hugeTLB pool approach, THP zones tap into core MM > > > features including: > > > 1. THP allocations can fall back to the lower zones, which can have > > > higher latency but still succeed. > > > 2. THPs can be either shattered (see Chapter Two) if partially > > > unmapped or reclaimed if becoming cold. > > > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB > > > contiguous PTEs on arm64 [1], which are more suitable for client > > > workloads. > > > > Can this mechanism be used to fully replace the hugetlb pool approach? > > That would be a major selling point. It kind of feels like it should, > > but I am insufficiently expert to be certain. > > This depends on the return value from htlb_alloc_mask(): if it's > GFP_HIGHUSER_MOVABLE, then yes (i.e., 2MB hugeTLB folios on x86). > Hypothetically, if users can have THPs as reliable as hugeTLB can > offer, wouldn't most users want to go with the former since it's more > flexible? E.g., core MM features like split (shattering) and reclaim > in addition to HVO. Right; the real question is what can we do to unify hugetlbfs and THPs. The reservation ability is one feature that hugetlbfs has over THP and removing that advantage gets us one step closer. > > I'll read over the patches sometime soon. There's a lot to go through. > > Something I didn't see in the cover letter or commit messages was any > > discussion of page->flags and how many bits we use for ZONE (particularly > > on 32-bit). Perhaps I'll discover the answer to that as I read. > > There may be corner cases because of how different architectures use > page->flags, but in general, this shouldn't be a big problem because > we can have 6 zones (at most) before this series, and after this > series, we can have 8 (at most). IOW, we need 3 bits regardless, in > order to all existing zones. On a 32-bit system, we'll typically only have four; DMA, NORMAL, HIGHMEM and MOVABLE. DMA32 will be skipped since it would match NORMAL, and DEVICE is just not supported on 32-bit.
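To close the loop on the page->flags question: the number of bits reserved for the zone index is derived from MAX_NR_ZONES by a small ladder along these lines (paraphrased from include/linux/page-flags-layout.h; it is not part of this series). Going from at most 6 zones to at most 8 therefore stays at 3 bits, but a 32-bit configuration that today has only 4 zones and 2 bits would be pushed to 3 bits once the two new zones are added unconditionally, which is presumably why the CONFIG_64BIT question comes up.

/* Paraphrased from include/linux/page-flags-layout.h, not from this series. */
#if MAX_NR_ZONES < 2
#define ZONES_SHIFT 0
#elif MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#elif MAX_NR_ZONES <= 8
#define ZONES_SHIFT 3
#else
#error ZONES_SHIFT "Too many zones configured"
#endif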
> There are three types of zones: > 1. The first four zones partition the physical address space of CPU > memory. > 2. The device zone provides interoperability between CPU and device > memory. > 3. The movable zone commonly represents a memory allocation policy. > > + > +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn) > +{ > + int i, nid; > + unsigned long node_avg, remaining; Hi Yu, I discovered that CMA can be part of virtual zones. For example: Node 0, zone NoMerge pages free 35945 nr_free_pages 35945 ... nr_free_cma 8128 pagesets CMA used to be available for order-0 anonymous allocations, and the Android kernel even prioritized it with commit [1] "ANDROID: cma: redirect page allocation to CMA" /* * Used during anonymous page fault handling. */ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, unsigned long vaddr) { gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO | __GFP_CMA; /* * If the page is mapped with PROT_MTE, initialise the tags at the * point of allocation and page zeroing as this is usually faster than * separate DC ZVA and STGM. */ if (vma->vm_flags & VM_MTE) flags |= __GFP_ZEROTAGS; return vma_alloc_folio(flags, 0, vma, vaddr, false); } I wonder if CMA is still available to order-0 allocations when it is located in the nomerge/nosplit zone. And when dma_alloc_coherent() or similar APIs want to get contiguous memory from CMA, is that still as easy as before if CMA is part of the virt zones? [1] https://android.googlesource.com/kernel/common/+/1c8aebe4c072bf18409cc78fc84407e24a437302 Thanks Barry
Hi Yu, On 3/1/2024 12:04 AM, Yu Zhao wrote: > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index de292a007138..c0f9d21b4d18 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) > * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. > */ > > -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 > -/* ZONE_DEVICE is not a valid GFP zone specifier */ > +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4 > +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ > #define GFP_ZONES_SHIFT 2 > #else > #define GFP_ZONES_SHIFT ZONES_SHIFT > @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) > z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & > ((1 << GFP_ZONES_SHIFT) - 1); > VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); > + > + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP)) > + return LAST_VIRT_ZONE; > + Not sure if someone has already reported this: With this patch, we allow pages to be allocated from the movable zone (through fallback from LAST_VIRT_ZONE) even without __GFP_HIGHMEM. Commit cc09cb134124a ("mm/page_alloc: Add folio allocation functions") sets __GFP_COMP by default, so the user just has to pass __GFP_MOVABLE. Please CMIW. Thanks, Charan
On Thu, Oct 31, 2024 at 8:35 PM Charan Teja Kalla <quic_charante@quicinc.com> wrote: > > Hi Yu, > > On 3/1/2024 12:04 AM, Yu Zhao wrote: > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > index de292a007138..c0f9d21b4d18 100644 > > --- a/include/linux/gfp.h > > +++ b/include/linux/gfp.h > > @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) > > * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. > > */ > > > > -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 > > -/* ZONE_DEVICE is not a valid GFP zone specifier */ > > +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4 > > +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ > > #define GFP_ZONES_SHIFT 2 > > #else > > #define GFP_ZONES_SHIFT ZONES_SHIFT > > @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) > > z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & > > ((1 << GFP_ZONES_SHIFT) - 1); > > VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); > > + > > + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP)) > > + return LAST_VIRT_ZONE; > > + > Not sure If someone had already reported this: With this patch, we allow > pages to allocate from movable zone(through fallback from > LAST_VIRT_ZONE) even with out __GFP_HIGHMEM. The commit cc09cb134124a > ("mm/page_alloc: Add folio allocation functions") sets the __GFP_COMP by > default and user has just to pass the __GFP_MOVABLE. Please CMIW. Hi Charan, I don't remember whether we have this fixed in the Android kernel off the top of my head -- I'll ask Kalesh to take a closer look and follow up with you. Thanks!
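A condensed model of the interaction Charan reports, for readers skimming the hunk above: the new check fires on the (__GFP_MOVABLE | __GFP_COMP) combination alone, and since the folio allocation helpers pass __GFP_COMP on the caller's behalf, a movable folio allocation without __GFP_HIGHMEM is also steered toward LAST_VIRT_ZONE and can then fall back into the movable zone. The snippet below is a standalone restatement with made-up flag values, not kernel code.

#include <stdio.h>

#define SK_GFP_MOVABLE  0x1u    /* stand-ins for the real gfp bits */
#define SK_GFP_COMP     0x2u
#define SK_GFP_HIGHMEM  0x4u

enum sk_zone { SK_ZONE_NORMAL, SK_ZONE_HIGHMEM, SK_LAST_VIRT_ZONE };

static enum sk_zone sk_gfp_zone(unsigned int flags)
{
        /* mirrors the quoted hunk: __GFP_HIGHMEM is not consulted at all */
        if ((flags & (SK_GFP_MOVABLE | SK_GFP_COMP)) == (SK_GFP_MOVABLE | SK_GFP_COMP))
                return SK_LAST_VIRT_ZONE;
        return (flags & SK_GFP_HIGHMEM) ? SK_ZONE_HIGHMEM : SK_ZONE_NORMAL;
}

int main(void)
{
        /* folio allocation adds the COMP bit, so MOVABLE alone lands here too */
        printf("%d\n", sk_gfp_zone(SK_GFP_MOVABLE | SK_GFP_COMP));      /* 2 = SK_LAST_VIRT_ZONE */
        printf("%d\n", sk_gfp_zone(SK_GFP_MOVABLE | SK_GFP_HIGHMEM));   /* 1 = SK_ZONE_HIGHMEM */
        return 0;
}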
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 31b3a25680d0..a6c181f6efde 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3529,6 +3529,16 @@ allocations which rules out almost all kernel allocations. Use with caution! + nosplit=X,Y [MM] Set the minimum order of the nosplit zone. Pages in + this zone can't be split down below order Y, while free + or in use. + Like movablecore, X should be either nn[KMGTPE] or n%. + + nomerge=X,Y [MM] Set the exact orders of the nomerge zone. Pages in + this zone are always order Y, meaning they can't be + split or merged while free or in use. + Like movablecore, X should be either nn[KMGTPE] or n%. + MTD_Partition= [MTD] Format: <name>,<region-number>,<size>,<offset> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 8e3223294442..37ecf5ee4afd 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct virtio_mem *vm, page = pfn_to_online_page(pfn); if (!page) continue; - if (page_zonenum(page) != ZONE_MOVABLE) + if (!is_zone_movable_page(page)) return false; } diff --git a/include/linux/gfp.h b/include/linux/gfp.h index de292a007138..c0f9d21b4d18 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. */ -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 -/* ZONE_DEVICE is not a valid GFP zone specifier */ +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4 +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ #define GFP_ZONES_SHIFT 2 #else #define GFP_ZONES_SHIFT ZONES_SHIFT @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & ((1 << GFP_ZONES_SHIFT) - 1); VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); + + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP)) + return LAST_VIRT_ZONE; + return z; } +extern int zone_nomerge_order __read_mostly; +extern int zone_nosplit_order __read_mostly; + +static inline enum zone_type gfp_order_zone(gfp_t flags, int order) +{ + enum zone_type zid = gfp_zone(flags); + + if (zid >= ZONE_NOMERGE && order != zone_nomerge_order) + zid = ZONE_NOMERGE - 1; + + if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order) + zid = ZONE_NOSPLIT - 1; + + return zid; +} + /* * There is only one page-allocator function, and two main namespaces to * it. 
The alloc_page*() variants return 'struct page *' and as such diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 5adb86af35fc..9960ad7c3b10 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); void folio_prep_large_rmappable(struct folio *folio); -bool can_split_folio(struct folio *folio, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); static inline int split_huge_page(struct page *page) { @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct folio *folio) {} #define thp_get_unmapped_area NULL -static inline bool -can_split_folio(struct folio *folio, int *pextra_pins) -{ - return false; -} static inline int split_huge_page_to_list(struct page *page, struct list_head *list) { diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 931b118336f4..a92bcf47cf8c 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -150,7 +150,7 @@ extern enum zone_type policy_zone; static inline void check_highest_zone(enum zone_type k) { - if (k > policy_zone && k != ZONE_MOVABLE) + if (k > policy_zone && !zid_is_virt(k)) policy_zone = k; } diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a497f189d988..532218167bba 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -805,11 +805,15 @@ enum zone_type { * there can be false negatives). */ ZONE_MOVABLE, + ZONE_NOSPLIT, + ZONE_NOMERGE, #ifdef CONFIG_ZONE_DEVICE ZONE_DEVICE, #endif - __MAX_NR_ZONES + __MAX_NR_ZONES, + LAST_PHYS_ZONE = ZONE_MOVABLE - 1, + LAST_VIRT_ZONE = ZONE_NOMERGE, }; #ifndef __GENERATING_BOUNDS_H @@ -929,6 +933,8 @@ struct zone { seqlock_t span_seqlock; #endif + int order; + int initialized; /* Write-intensive fields used from the page allocator */ @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const struct folio *folio) static inline bool is_zone_movable_page(const struct page *page) { - return page_zonenum(page) == ZONE_MOVABLE; + return page_zonenum(page) >= ZONE_MOVABLE; } static inline bool folio_is_zone_movable(const struct folio *folio) { - return folio_zonenum(folio) == ZONE_MOVABLE; + return folio_zonenum(folio) >= ZONE_MOVABLE; +} + +static inline bool page_can_split(struct page *page) +{ + return page_zonenum(page) < ZONE_NOSPLIT; +} + +static inline bool folio_can_split(struct folio *folio) +{ + return folio_zonenum(folio) < ZONE_NOSPLIT; } #endif @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) { return node_id; }; */ #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline bool zid_is_virt(enum zone_type zid) +{ + return zid > LAST_PHYS_ZONE && zid <= LAST_VIRT_ZONE; +} + +static inline bool zone_can_frag(struct zone *zone) +{ + VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT); + + return zone_idx(zone) < ZONE_NOSPLIT; +} + +static inline bool zone_is_suitable(struct zone *zone, int order) +{ + int zid = zone_idx(zone); + + if (zid < ZONE_NOSPLIT) + return true; + + if (!zone->order) + return false; + + return (zid == ZONE_NOSPLIT && order >= zone->order) || + (zid == ZONE_NOMERGE && order == zone->order); +} + #ifdef CONFIG_ZONE_DEVICE static inline bool zone_is_zone_device(struct zone *zone) { @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone) static inline void zone_set_nid(struct zone *zone, int nid) {} #endif 
-extern int movable_zone; +extern int virt_zone; static inline int is_highmem_idx(enum zone_type idx) { #ifdef CONFIG_HIGHMEM return (idx == ZONE_HIGHMEM || - (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM)); + (zid_is_virt(idx) && virt_zone == ZONE_HIGHMEM)); #else return 0; #endif diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index b61438313a73..34fbe910576d 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -404,7 +404,7 @@ enum node_states { #else N_HIGH_MEMORY = N_NORMAL_MEMORY, #endif - N_MEMORY, /* The node has memory(regular, high, movable) */ + N_MEMORY, /* The node has memory in any of the zones */ N_CPU, /* The node has one or more cpus */ N_GENERIC_INITIATOR, /* The node has one or more Generic Initiators */ NR_NODE_STATES diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 747943bc8cc2..9a54d15d5ec3 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -27,7 +27,7 @@ #endif #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \ - HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx) + HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE_ZONE(xx) enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, FOR_ALL_ZONES(PGALLOC) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index d801409b33cf..2b5fdafaadea 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \ IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \ EM (ZONE_NORMAL, "Normal") \ IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \ - EMe(ZONE_MOVABLE,"Movable") + EM (ZONE_MOVABLE,"Movable") \ + EM (ZONE_NOSPLIT,"NoSplit") \ + EMe(ZONE_NOMERGE,"NoMerge") #define LRU_NAMES \ EM (LRU_INACTIVE_ANON, "inactive_anon") \ diff --git a/mm/compaction.c b/mm/compaction.c index 4add68d40e8d..8a64c805f411 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, ac->highest_zoneidx, ac->nodemask) { enum compact_result status; + if (!zone_can_frag(zone)) + continue; + if (prio > MIN_COMPACT_PRIORITY && compaction_deferred(zone, order)) { rc = max_t(enum compact_result, COMPACT_DEFERRED, rc); @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat) if (!populated_zone(zone)) continue; + if (!zone_can_frag(zone)) + continue; + cc.zone = zone; compact_zone(&cc, NULL); @@ -2846,6 +2852,9 @@ static void compact_node(int nid) if (!populated_zone(zone)) continue; + if (!zone_can_frag(zone)) + continue; + cc.zone = zone; compact_zone(&cc, NULL); @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat) if (!populated_zone(zone)) continue; + if (!zone_can_frag(zone)) + continue; + ret = compaction_suit_allocation_order(zone, pgdat->kcompactd_max_order, highest_zoneidx, ALLOC_WMARK_MIN); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 94c958f7ebb5..b57faa0a1e83 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, } /* Racy check whether the huge page can be split */ -bool can_split_folio(struct folio *folio, int *pextra_pins) +static bool can_split_folio(struct folio *folio, int *pextra_pins) { int extra_pins; + if (!folio_can_split(folio)) + return false; + /* Additional pins from page cache */ if (folio_test_anon(folio)) extra_pins = folio_test_swapcache(folio) ? 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 10a590ee1c89..1f84dd759086 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma) bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) { - enum zone_type dynamic_policy_zone = policy_zone; - - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); + WARN_ON_ONCE(zid_is_virt(policy_zone)); /* - * if policy->nodes has movable memory only, - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. + * If policy->nodes has memory in virtual zones only, we apply policy + * only if gfp_zone(gfp) can allocate from those zones. * * policy->nodes is intersect with node_states[N_MEMORY]. * so if the following test fails, it implies - * policy->nodes has movable memory only. + * policy->nodes has memory in virtual zones only. */ if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) - dynamic_policy_zone = ZONE_MOVABLE; + return zone > LAST_PHYS_ZONE; - return zone >= dynamic_policy_zone; + return zone >= policy_zone; } /* Do dynamic interleaving for a process */ diff --git a/mm/migrate.c b/mm/migrate.c index cc9f2bcd73b4..f615c0c22046 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f { int rc; + if (!folio_can_split(folio)) + return -EBUSY; + folio_lock(folio); rc = split_folio_to_list(folio, split_folios); folio_unlock(folio); @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private) order = folio_order(src); } zidx = zone_idx(folio_zone(src)); - if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE) + if (zidx > ZONE_NORMAL) gfp_mask |= __GFP_HIGHMEM; return __folio_alloc(gfp_mask, order, nid, mtc->nmask); @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio) break; } wakeup_kswapd(pgdat->node_zones + z, 0, - folio_order(folio), ZONE_MOVABLE); + folio_order(folio), z); return 0; } diff --git a/mm/mm_init.c b/mm/mm_init.c index 2c19f5515e36..7769c21e6d54 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init); static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata; static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata; -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata; -static unsigned long required_kernelcore __initdata; -static unsigned long required_kernelcore_percent __initdata; -static unsigned long required_movablecore __initdata; -static unsigned long required_movablecore_percent __initdata; +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUMNODES] __initdata; +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid]) + +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata; +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE]) + +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata; +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE]) + +int zone_nosplit_order __read_mostly; +int zone_nomerge_order __read_mostly; static unsigned long nr_kernel_pages __initdata; static unsigned long nr_all_pages __initdata; @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p) return 0; } - return cmdline_parse_core(p, &required_kernelcore, - &required_kernelcore_percent); + return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE), + 
&percentage_of(LAST_PHYS_ZONE)); } early_param("kernelcore", cmdline_parse_kernelcore); @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore); */ static int __init cmdline_parse_movablecore(char *p) { - return cmdline_parse_core(p, &required_movablecore, - &required_movablecore_percent); + return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE), + &percentage_of(ZONE_MOVABLE)); } early_param("movablecore", cmdline_parse_movablecore); +static int __init parse_zone_order(char *p, unsigned long *nr_pages, + unsigned long *percent, int *order) +{ + int err; + unsigned long n; + char *s = strchr(p, ','); + + if (!s) + return -EINVAL; + + *s++ = '\0'; + + err = kstrtoul(s, 0, &n); + if (err) + return err; + + if (n < 2 || n > MAX_PAGE_ORDER) + return -EINVAL; + + err = cmdline_parse_core(p, nr_pages, percent); + if (err) + return err; + + *order = n; + + return 0; +} + +static int __init parse_zone_nosplit(char *p) +{ + return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT), + &percentage_of(ZONE_NOSPLIT), &zone_nosplit_order); +} +early_param("nosplit", parse_zone_nosplit); + +static int __init parse_zone_nomerge(char *p) +{ + return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE), + &percentage_of(ZONE_NOMERGE), &zone_nomerge_order); +} +early_param("nomerge", parse_zone_nomerge); + /* * early_calculate_totalpages() - * Sum pages in active regions for movable zone. + * Sum pages in active regions for virtual zones. * Populate N_MEMORY for calculating usable_nodes. */ static unsigned long __init early_calculate_totalpages(void) @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalpages(void) } /* - * This finds a zone that can be used for ZONE_MOVABLE pages. The + * This finds a physical zone that can be used for virtual zones. The * assumption is made that zones within a node are ordered in monotonic * increasing memory addresses so that the "highest" populated zone is used */ -static void __init find_usable_zone_for_movable(void) +static void __init find_usable_zone(void) { int zone_index; - for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) { - if (zone_index == ZONE_MOVABLE) - continue; - + for (zone_index = LAST_PHYS_ZONE; zone_index >= 0; zone_index--) { if (arch_zone_highest_possible_pfn[zone_index] > arch_zone_lowest_possible_pfn[zone_index]) break; } VM_BUG_ON(zone_index == -1); - movable_zone = zone_index; + virt_zone = zone_index; +} + +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn) +{ + int i, nid; + unsigned long node_avg, remaining; + int usable_nodes = nodes_weight(node_states[N_MEMORY]); + /* usable_startpfn is the lowest possible pfn virtual zones can be at */ + unsigned long usable_startpfn = arch_zone_lowest_possible_pfn[virt_zone]; + +restart: + /* Carve out memory as evenly as possible throughout nodes */ + node_avg = occupied / usable_nodes; + for_each_node_state(nid, N_MEMORY) { + unsigned long start_pfn, end_pfn; + + /* + * Recalculate node_avg if the division per node now exceeds + * what is necessary to satisfy the amount of memory to carve + * out. + */ + if (occupied < node_avg) + node_avg = occupied / usable_nodes; + + /* + * As the map is walked, we track how much memory is usable + * using remaining. When it is 0, the rest of the node is + * usable. 
+ */ + remaining = node_avg; + + /* Go through each range of PFNs within this node */ + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { + unsigned long size_pages; + + start_pfn = max(start_pfn, zone_pfn[nid]); + if (start_pfn >= end_pfn) + continue; + + /* Account for what is only usable when carving out */ + if (start_pfn < usable_startpfn) { + unsigned long nr_pages = min(end_pfn, usable_startpfn) - start_pfn; + + remaining -= min(nr_pages, remaining); + occupied -= min(nr_pages, occupied); + + /* Continue if range is now fully accounted */ + if (end_pfn <= usable_startpfn) { + + /* + * Push zone_pfn to the end so that if + * we have to carve out more across + * nodes, we will not double account + * here. + */ + zone_pfn[nid] = end_pfn; + continue; + } + start_pfn = usable_startpfn; + } + + /* + * The usable PFN range is from start_pfn->end_pfn. + * Calculate size_pages as the number of pages used. + */ + size_pages = end_pfn - start_pfn; + if (size_pages > remaining) + size_pages = remaining; + zone_pfn[nid] = start_pfn + size_pages; + + /* + * Some memory was carved out, update counts and break + * if the request for this node has been satisfied. + */ + occupied -= min(occupied, size_pages); + remaining -= size_pages; + if (!remaining) + break; + } + } + + /* + * If there is still more to carve out, we do another pass with one less + * node in the count. This will push zone_pfn[nid] further along on the + * nodes that still have memory until the request is fully satisfied. + */ + usable_nodes--; + if (usable_nodes && occupied > usable_nodes) + goto restart; } /* @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(void) * memory. When they don't, some nodes will have more kernelcore than * others */ -static void __init find_zone_movable_pfns_for_nodes(void) +static void __init find_virt_zones(void) { - int i, nid; + int i; + int nid; unsigned long usable_startpfn; - unsigned long kernelcore_node, kernelcore_remaining; /* save the state before borrow the nodemask */ nodemask_t saved_node_state = node_states[N_MEMORY]; unsigned long totalpages = early_calculate_totalpages(); - int usable_nodes = nodes_weight(node_states[N_MEMORY]); struct memblock_region *r; + unsigned long occupied = 0; - /* Need to find movable_zone earlier when movable_node is specified. */ - find_usable_zone_for_movable(); + /* Need to find virt_zone earlier when movable_node is specified. */ + find_usable_zone(); /* * If movable_node is specified, ignore kernelcore and movablecore @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(void) nid = memblock_get_region_node(r); usable_startpfn = PFN_DOWN(r->base); - zone_movable_pfn[nid] = zone_movable_pfn[nid] ? - min(usable_startpfn, zone_movable_pfn[nid]) : + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ? + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) : usable_startpfn; } @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(void) continue; } - zone_movable_pfn[nid] = zone_movable_pfn[nid] ? - min(usable_startpfn, zone_movable_pfn[nid]) : + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ? 
+ min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) : usable_startpfn; } @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_nodes(void) goto out2; } + if (zone_nomerge_order && zone_nomerge_order <= zone_nosplit_order) { + nr_pages_of(ZONE_NOSPLIT) = nr_pages_of(ZONE_NOMERGE) = 0; + percentage_of(ZONE_NOSPLIT) = percentage_of(ZONE_NOMERGE) = 0; + zone_nosplit_order = zone_nomerge_order = 0; + pr_warn("zone %s order %d must be higher zone %s order %d\n", + zone_names[ZONE_NOMERGE], zone_nomerge_order, + zone_names[ZONE_NOSPLIT], zone_nosplit_order); + } + /* * If kernelcore=nn% or movablecore=nn% was specified, calculate the * amount of necessary memory. */ - if (required_kernelcore_percent) - required_kernelcore = (totalpages * 100 * required_kernelcore_percent) / - 10000UL; - if (required_movablecore_percent) - required_movablecore = (totalpages * 100 * required_movablecore_percent) / - 10000UL; + for (i = LAST_PHYS_ZONE; i <= LAST_VIRT_ZONE; i++) { + if (percentage_of(i)) + nr_pages_of(i) = totalpages * percentage_of(i) / 100; + + nr_pages_of(i) = roundup(nr_pages_of(i), MAX_ORDER_NR_PAGES); + occupied += nr_pages_of(i); + } /* * If movablecore= was specified, calculate what size of * kernelcore that corresponds so that memory usable for * any allocation type is evenly spread. If both kernelcore * and movablecore are specified, then the value of kernelcore - * will be used for required_kernelcore if it's greater than - * what movablecore would have allowed. + * will be used if it's greater than what movablecore would have + * allowed. */ - if (required_movablecore) { - unsigned long corepages; + if (occupied < totalpages) { + enum zone_type zid; - /* - * Round-up so that ZONE_MOVABLE is at least as large as what - * was requested by the user - */ - required_movablecore = - roundup(required_movablecore, MAX_ORDER_NR_PAGES); - required_movablecore = min(totalpages, required_movablecore); - corepages = totalpages - required_movablecore; - - required_kernelcore = max(required_kernelcore, corepages); + zid = !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_MOVABLE) ? + LAST_PHYS_ZONE : ZONE_MOVABLE; + nr_pages_of(zid) += totalpages - occupied; } /* * If kernelcore was not specified or kernelcore size is larger - * than totalpages, there is no ZONE_MOVABLE. + * than totalpages, there are not virtual zones. */ - if (!required_kernelcore || required_kernelcore >= totalpages) + occupied = nr_pages_of(LAST_PHYS_ZONE); + if (!occupied || occupied >= totalpages) goto out; - /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */ - usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone]; + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) { + if (!nr_pages_of(i)) + continue; -restart: - /* Spread kernelcore memory as evenly as possible throughout nodes */ - kernelcore_node = required_kernelcore / usable_nodes; - for_each_node_state(nid, N_MEMORY) { - unsigned long start_pfn, end_pfn; - - /* - * Recalculate kernelcore_node if the division per node - * now exceeds what is necessary to satisfy the requested - * amount of memory for the kernel - */ - if (required_kernelcore < kernelcore_node) - kernelcore_node = required_kernelcore / usable_nodes; - - /* - * As the map is walked, we track how much memory is usable - * by the kernel using kernelcore_remaining. 
When it is - * 0, the rest of the node is usable by ZONE_MOVABLE - */ - kernelcore_remaining = kernelcore_node; - - /* Go through each range of PFNs within this node */ - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { - unsigned long size_pages; - - start_pfn = max(start_pfn, zone_movable_pfn[nid]); - if (start_pfn >= end_pfn) - continue; - - /* Account for what is only usable for kernelcore */ - if (start_pfn < usable_startpfn) { - unsigned long kernel_pages; - kernel_pages = min(end_pfn, usable_startpfn) - - start_pfn; - - kernelcore_remaining -= min(kernel_pages, - kernelcore_remaining); - required_kernelcore -= min(kernel_pages, - required_kernelcore); - - /* Continue if range is now fully accounted */ - if (end_pfn <= usable_startpfn) { - - /* - * Push zone_movable_pfn to the end so - * that if we have to rebalance - * kernelcore across nodes, we will - * not double account here - */ - zone_movable_pfn[nid] = end_pfn; - continue; - } - start_pfn = usable_startpfn; - } - - /* - * The usable PFN range for ZONE_MOVABLE is from - * start_pfn->end_pfn. Calculate size_pages as the - * number of pages used as kernelcore - */ - size_pages = end_pfn - start_pfn; - if (size_pages > kernelcore_remaining) - size_pages = kernelcore_remaining; - zone_movable_pfn[nid] = start_pfn + size_pages; - - /* - * Some kernelcore has been met, update counts and - * break if the kernelcore for this node has been - * satisfied - */ - required_kernelcore -= min(required_kernelcore, - size_pages); - kernelcore_remaining -= size_pages; - if (!kernelcore_remaining) - break; - } + find_virt_zone(occupied, &pfn_of(i, 0)); + occupied += nr_pages_of(i); } - - /* - * If there is still required_kernelcore, we do another pass with one - * less node in the count. This will push zone_movable_pfn[nid] further - * along on the nodes that still have memory until kernelcore is - * satisfied - */ - usable_nodes--; - if (usable_nodes && required_kernelcore > usable_nodes) - goto restart; - out2: - /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */ + /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGES */ for (nid = 0; nid < MAX_NUMNODES; nid++) { unsigned long start_pfn, end_pfn; - - zone_movable_pfn[nid] = - roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); + unsigned long prev_virt_zone_pfn = 0; get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); - if (zone_movable_pfn[nid] >= end_pfn) - zone_movable_pfn[nid] = 0; + + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) { + pfn_of(i, nid) = roundup(pfn_of(i, nid), MAX_ORDER_NR_PAGES); + + if (pfn_of(i, nid) <= prev_virt_zone_pfn || pfn_of(i, nid) >= end_pfn) + pfn_of(i, nid) = 0; + + if (pfn_of(i, nid)) + prev_virt_zone_pfn = pfn_of(i, nid); + } } - out: /* restore the node_state */ node_states[N_MEMORY] = saved_node_state; @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *zone, #endif /* - * The zone ranges provided by the architecture do not include ZONE_MOVABLE - * because it is sized independent of architecture. Unlike the other zones, - * the starting point for ZONE_MOVABLE is not fixed. It may be different - * in each node depending on the size of each node and how evenly kernelcore - * is distributed. This helper function adjusts the zone ranges + * The zone ranges provided by the architecture do not include virtual zones + * because they are sized independent of architecture. Unlike physical zones, + * the starting point for the first populated virtual zone is not fixed. 
It may + * be different in each node depending on the size of each node and how evenly + * kernelcore is distributed. This helper function adjusts the zone ranges * provided by the architecture for a given node by using the end of the - * highest usable zone for ZONE_MOVABLE. This preserves the assumption that - * zones within a node are in order of monotonic increases memory addresses + * highest usable zone for the first populated virtual zone. This preserves the + * assumption that zones within a node are in order of monotonic increases + * memory addresses. */ -static void __init adjust_zone_range_for_zone_movable(int nid, +static void __init adjust_zone_range(int nid, unsigned long zone_type, unsigned long node_end_pfn, unsigned long *zone_start_pfn, unsigned long *zone_end_pfn) { - /* Only adjust if ZONE_MOVABLE is on this node */ - if (zone_movable_pfn[nid]) { - /* Size ZONE_MOVABLE */ - if (zone_type == ZONE_MOVABLE) { - *zone_start_pfn = zone_movable_pfn[nid]; - *zone_end_pfn = min(node_end_pfn, - arch_zone_highest_possible_pfn[movable_zone]); + int i = max_t(int, zone_type, LAST_PHYS_ZONE); + unsigned long next_virt_zone_pfn = 0; - /* Adjust for ZONE_MOVABLE starting within this range */ - } else if (!mirrored_kernelcore && - *zone_start_pfn < zone_movable_pfn[nid] && - *zone_end_pfn > zone_movable_pfn[nid]) { - *zone_end_pfn = zone_movable_pfn[nid]; + while (i++ < LAST_VIRT_ZONE) { + if (pfn_of(i, nid)) { + next_virt_zone_pfn = pfn_of(i, nid); + break; + } + } - /* Check if this whole range is within ZONE_MOVABLE */ - } else if (*zone_start_pfn >= zone_movable_pfn[nid]) + if (zone_type <= LAST_PHYS_ZONE) { + if (!next_virt_zone_pfn) + return; + + if (!mirrored_kernelcore && + *zone_start_pfn < next_virt_zone_pfn && + *zone_end_pfn > next_virt_zone_pfn) + *zone_end_pfn = next_virt_zone_pfn; + else if (*zone_start_pfn >= next_virt_zone_pfn) *zone_start_pfn = *zone_end_pfn; + } else if (zone_type <= LAST_VIRT_ZONE) { + if (!pfn_of(zone_type, nid)) + return; + + if (next_virt_zone_pfn) + *zone_end_pfn = min3(next_virt_zone_pfn, + node_end_pfn, + arch_zone_highest_possible_pfn[virt_zone]); + else + *zone_end_pfn = min(node_end_pfn, + arch_zone_highest_possible_pfn[virt_zone]); + *zone_start_pfn = min(*zone_end_pfn, pfn_of(zone_type, nid)); } } @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid, * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages * and vice versa. 
*/ - if (mirrored_kernelcore && zone_movable_pfn[nid]) { + if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) { unsigned long start_pfn, end_pfn; struct memblock_region *r; @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_node(int nid, /* Get the start and end of the zone */ *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high); *zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high); - adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn, - zone_start_pfn, zone_end_pfn); + adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, zone_end_pfn); /* Check that this node has pages within the zone's required range */ if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn) @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat, #if defined(CONFIG_MEMORY_HOTPLUG) zone->present_early_pages = real_size; #endif + if (i == ZONE_NOSPLIT) + zone->order = zone_nosplit_order; + if (i == ZONE_NOMERGE) + zone->order = zone_nomerge_order; totalpages += spanned; realtotalpages += real_size; @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgdat) { enum zone_type zone_type; - for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) { + for (zone_type = 0; zone_type <= LAST_PHYS_ZONE; zone_type++) { struct zone *zone = &pgdat->node_zones[zone_type]; if (populated_zone(zone)) { if (IS_ENABLED(CONFIG_HIGHMEM)) @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void) void __init free_area_init(unsigned long *max_zone_pfn) { unsigned long start_pfn, end_pfn; - int i, nid, zone; + int i, j, nid, zone; bool descending; /* Record where the zone boundaries are */ @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zone_pfn) start_pfn = PHYS_PFN(memblock_start_of_DRAM()); descending = arch_has_descending_max_zone_pfns(); - for (i = 0; i < MAX_NR_ZONES; i++) { + for (i = 0; i <= LAST_PHYS_ZONE; i++) { if (descending) - zone = MAX_NR_ZONES - i - 1; + zone = LAST_PHYS_ZONE - i; else zone = i; - if (zone == ZONE_MOVABLE) - continue; - end_pfn = max(max_zone_pfn[zone], start_pfn); arch_zone_lowest_possible_pfn[zone] = start_pfn; arch_zone_highest_possible_pfn[zone] = end_pfn; @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zone_pfn) start_pfn = end_pfn; } - /* Find the PFNs that ZONE_MOVABLE begins at in each node */ - memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); - find_zone_movable_pfns_for_nodes(); + /* Find the PFNs that virtual zones begin at in each node */ + find_virt_zones(); /* Print out the zone ranges */ pr_info("Zone ranges:\n"); - for (i = 0; i < MAX_NR_ZONES; i++) { - if (i == ZONE_MOVABLE) - continue; + for (i = 0; i <= LAST_PHYS_ZONE; i++) { pr_info(" %-8s ", zone_names[i]); if (arch_zone_lowest_possible_pfn[i] == arch_zone_highest_possible_pfn[i]) @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zone_pfn) << PAGE_SHIFT) - 1); } - /* Print out the PFNs ZONE_MOVABLE begins at in each node */ - pr_info("Movable zone start for each node\n"); - for (i = 0; i < MAX_NUMNODES; i++) { - if (zone_movable_pfn[i]) - pr_info(" Node %d: %#018Lx\n", i, - (u64)zone_movable_pfn[i] << PAGE_SHIFT); + /* Print out the PFNs virtual zones begin at in each node */ + for (; i <= LAST_VIRT_ZONE; i++) { + pr_info("%s zone start for each node\n", zone_names[i]); + for (j = 0; j < MAX_NUMNODES; j++) { + if (pfn_of(i, j)) + pr_info(" Node %d: %#018Lx\n", + j, (u64)pfn_of(i, j) << PAGE_SHIFT); + } } /* diff --git 
a/mm/page_alloc.c b/mm/page_alloc.c index 150d4f23b010..6a4da8f8691c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] = { "HighMem", #endif "Movable", + "NoSplit", + "NoMerge", #ifdef CONFIG_ZONE_DEVICE "Device", #endif @@ -290,9 +292,9 @@ int user_min_free_kbytes = -1; static int watermark_boost_factor __read_mostly = 15000; static int watermark_scale_factor = 10; -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ -int movable_zone; -EXPORT_SYMBOL(movable_zone); +/* virt_zone is the "real" zone pages in virtual zones are taken from */ +int virt_zone; +EXPORT_SYMBOL(virt_zone); #if MAX_NUMNODES > 1 unsigned int nr_node_ids __read_mostly = MAX_NUMNODES; @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, unsigned long higher_page_pfn; struct page *higher_page; - if (order >= MAX_PAGE_ORDER - 1) - return false; - higher_page_pfn = buddy_pfn & pfn; higher_page = page + (higher_page_pfn - pfn); @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, NULL) != NULL; } +static int zone_max_order(struct zone *zone) +{ + return zone->order && zone_idx(zone) == ZONE_NOMERGE ? zone->order : MAX_PAGE_ORDER; +} + /* * Freeing function for a buddy system allocator. * @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page, unsigned long combined_pfn; struct page *buddy; bool to_tail; + int max_order = zone_max_order(zone); VM_BUG_ON(!zone_is_initialized(zone)); VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page, VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); VM_BUG_ON_PAGE(bad_range(zone, page), page); - while (order < MAX_PAGE_ORDER) { + while (order < max_order) { if (compaction_capture(capc, page, order, migratetype)) { __mod_zone_freepage_state(zone, -(1 << order), migratetype); @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page, to_tail = true; else if (is_shuffle_order(order)) to_tail = shuffle_pick_tail(); + else if (order + 1 >= max_order) + to_tail = false; else to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order); @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page, int mt; int ret = 0; + VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page); + if (split_pfn_offset == 0) return ret; @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, struct free_area *area; struct page *page; + VM_WARN_ON_ONCE(!zone_is_suitable(zone, order)); + /* Find a page of the appropriate size in the preferred list */ for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) { area = &(zone->free_area[current_order]); @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, long min = mark; int o; + if (!zone_is_suitable(z, order)) + return false; + /* free_pages may go negative - that's OK */ free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order, { long free_pages; + if (!zone_is_suitable(z, order)) + return false; + free_pages = zone_page_state(z, NR_FREE_PAGES); /* @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, struct page *page; unsigned long mark; + if (!zone_is_suitable(zone, order)) + continue; + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) && 
			!__cpuset_zone_allowed(zone, gfp_mask))
@@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void)
 	struct zone *zone;
 	unsigned long flags;

-	/* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */
+	/* Calculate total number of pages below ZONE_HIGHMEM */
 	for_each_zone(zone) {
-		if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
+		if (zone_idx(zone) <= ZONE_NORMAL)
 			lowmem_pages += zone_managed_pages(zone);
 	}

@@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone_managed_pages(zone);
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
+		if (zone_idx(zone) > ZONE_NORMAL) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
-			 * need highmem and movable zones pages, so cap pages_min
-			 * to a small value here.
+			 * need pages from zones above ZONE_NORMAL, so cap
+			 * pages_min to a small value here.
 			 *
 			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas control async page reclaim, and so should
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index cd0ea3668253..8a6473543427 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 		 * pages then it should be reasonably safe to assume the rest
 		 * is movable.
 		 */
-		if (zone_idx(zone) == ZONE_MOVABLE)
+		if (zid_is_virt(zone_idx(zone)))
 			continue;

 		/*
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..ad0db0373b05 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 	entry.val = 0;

 	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported() &&
+		    folio_test_pmd_mappable(folio))
 			get_swap_pages(1, &entry, folio_nr_pages(folio));
 		goto out;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..ae061ec4866a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					goto keep_locked;
 				if (folio_maybe_dma_pinned(folio))
 					goto keep_locked;
-				if (folio_test_large(folio)) {
-					/* cannot split folio, skip it */
-					if (!can_split_folio(folio, NULL))
-						goto activate_locked;
-					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
-					 */
-					if (!folio_entire_mapcount(folio) &&
-					    split_folio_to_list(folio,
-								folio_list))
-						goto activate_locked;
-				}
+				/*
+				 * Split folios that are not fully mapped right
+				 * away. Chances are some of the tail pages can
+				 * be freed without IO.
+				 */
+				if (folio_test_large(folio) &&
+				    atomic_read(&folio->_nr_pages_mapped) < nr_pages)
+					split_folio_to_list(folio, folio_list);

 				if (!add_to_swap(folio)) {
 					if (!folio_test_large(folio))
 						goto activate_locked_split;
@@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	orig_mask = sc->gfp_mask;
 	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
-		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
+		sc->reclaim_idx = gfp_order_zone(sc->gfp_mask, sc->order);
 	}

 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.gfp_mask = current_gfp_context(gfp_mask),
-		.reclaim_idx = gfp_zone(gfp_mask),
+		.reclaim_idx = gfp_order_zone(gfp_mask, order),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 	if (!cpuset_zone_allowed(zone, gfp_flags))
 		return;

+	curr_idx = gfp_order_zone(gfp_flags, order);
+	if (highest_zoneidx > curr_idx)
+		highest_zoneidx = curr_idx;
+
 	pgdat = zone->zone_pgdat;
 	curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx);

@@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
-		.reclaim_idx = gfp_zone(gfp_mask),
+		.reclaim_idx = gfp_order_zone(gfp_mask, order),
 	};
 	unsigned long pflags;

diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..adbd032e6a0f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)

 #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
 					TEXT_FOR_HIGHMEM(xx) xx "_movable", \
+					xx "_nosplit", xx "_nomerge", \
 					TEXT_FOR_DEVICE(xx)

 const char * const vmstat_text[] = {
@@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu"
-		   "\n        cma      %lu",
+		   "\n        cma      %lu"
+		   "\n        order    %u",
 		   zone_page_state(zone, NR_FREE_PAGES),
 		   zone->watermark_boost,
 		   min_wmark_pages(zone),
@@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone_managed_pages(zone),
-		   zone_cma_pages(zone));
+		   zone_cma_pages(zone),
+		   zone->order);

 	seq_printf(m,
 		   "\n        protection: (%ld",
There are three types of zones:
1. The first four zones partition the physical address space of CPU
   memory.
2. The device zone provides interoperability between CPU and device
   memory.
3. The movable zone commonly represents a memory allocation policy.

Though originally designed for memory hot removal, the movable zone is
instead widely used for other purposes, e.g., CMA and kdump kernel, on
platforms that do not support hot removal, e.g., Android and ChromeOS.
Nowadays, it is legitimately a zone independent of any physical
characteristics. In spite of being somewhat regarded as a hack,
largely due to the lack of a generic design concept for its true major
use cases (on billions of client devices), the movable zone naturally
resembles a policy (virtual) zone overlayed on the first four
(physical) zones.

This proposal formally generalizes this concept as policy zones so
that additional policies can be implemented and enforced by subsequent
zones after the movable zone. An inherited requirement of policy zones
(and the first four zones) is that subsequent zones must be able to
fall back to previous zones and therefore must add new properties to
the previous zones rather than remove existing ones from them. Also,
all properties must be known at the allocation time, rather than the
runtime, e.g., memory object size and mobility are valid properties
but hotness and lifetime are not.

ZONE_MOVABLE becomes the first policy zone, followed by two new policy
zones:
1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
   ZONE_MOVABLE) and restricted to a minimum order to be
   anti-fragmentation. The latter means that they cannot be split down
   below that order, while they are free or in use.
2. ZONE_NOMERGE, which contains pages that are movable and restricted
   to an exact order. The latter means that not only is split
   prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
   reason in Chapter Three), while they are free or in use.

Since these two zones only can serve THP allocations (__GFP_MOVABLE |
__GFP_COMP), they are called THP zones. Reclaim works seamlessly and
compaction is not needed for these two zones.

Compared with the hugeTLB pool approach, THP zones tap into core MM
features including:
1. THP allocations can fall back to the lower zones, which can have
   higher latency but still succeed.
2. THPs can be either shattered (see Chapter Two) if partially
   unmapped or reclaimed if becoming cold.
3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
   contiguous PTEs on arm64 [1], which are more suitable for client
   workloads.

Policy zones can be dynamically resized by offlining pages in one of
them and onlining those pages in another of them. Note that this is
only done among policy zones, not between a policy zone and a physical
zone, since resizing is a (software) policy, not a physical
characteristic.

Implementing the same idea at the pageblock granularity has also been
explored but rejected at Google. Pageblocks have a finer granularity
and therefore can be more flexible than zones. The tradeoff is that
this alternative implementation was more complex and failed to bring a
better ROI. However, the rejection was mainly due to its inability to
be smoothly extended to 1GB THPs [2], which is a planned use case of
TAO.
[1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 .../admin-guide/kernel-parameters.txt |  10 +
 drivers/virtio/virtio_mem.c           |   2 +-
 include/linux/gfp.h                   |  24 +-
 include/linux/huge_mm.h               |   6 -
 include/linux/mempolicy.h             |   2 +-
 include/linux/mmzone.h                |  52 +-
 include/linux/nodemask.h              |   2 +-
 include/linux/vm_event_item.h         |   2 +-
 include/trace/events/mmflags.h        |   4 +-
 mm/compaction.c                       |  12 +
 mm/huge_memory.c                      |   5 +-
 mm/mempolicy.c                        |  14 +-
 mm/migrate.c                          |   7 +-
 mm/mm_init.c                          | 452 ++++++++++--------
 mm/page_alloc.c                       |  44 +-
 mm/page_isolation.c                   |   2 +-
 mm/swap_slots.c                       |   3 +-
 mm/vmscan.c                           |  32 +-
 mm/vmstat.c                           |   7 +-
 19 files changed, 431 insertions(+), 251 deletions(-)
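[Editor's note] For readers skimming the allocation-side changes: the commit
message defines ZONE_NOSPLIT by a minimum order and ZONE_NOMERGE by an exact
order, and the mm/page_alloc.c hunks gate __rmqueue_smallest(), the watermark
checks and get_page_from_freelist() on a zone_is_suitable() helper whose
definition lives in the include/linux/gfp.h part of the patch (not shown in
this excerpt). The user-space toy below only models that suitability rule;
enum policy, struct toy_zone and can_serve() are invented names, gfp flags are
ignored, and it is a sketch of the stated semantics, not the patch's
implementation.

/*
 * Toy user-space model of the zone suitability rule described above.
 * This is NOT kernel code: enum policy, struct toy_zone and can_serve()
 * are invented for illustration, and gfp flags are ignored.
 */
#include <stdbool.h>
#include <stdio.h>

enum policy { PHYS, MOVABLE, NOSPLIT, NOMERGE };

struct toy_zone {
	enum policy policy;
	int order;	/* minimum order for NOSPLIT, exact order for NOMERGE */
};

/* Can this zone serve an allocation of the given order? */
static bool can_serve(const struct toy_zone *z, int order)
{
	switch (z->policy) {
	case NOSPLIT:
		return order >= z->order;	/* pages never split below z->order */
	case NOMERGE:
		return order == z->order;	/* pages never split or merge */
	default:
		return true;			/* physical and movable zones take any order */
	}
}

int main(void)
{
	const struct toy_zone zones[] = {
		{ PHYS, 0 }, { MOVABLE, 0 }, { NOSPLIT, 4 }, { NOMERGE, 9 },
	};
	const char *names[] = { "Normal", "Movable", "NoSplit", "NoMerge" };
	int orders[] = { 0, 4, 9 };

	for (unsigned int i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		for (unsigned int j = 0; j < sizeof(zones) / sizeof(zones[0]); j++)
			printf("order %d from %-7s: %s\n", orders[i], names[j],
			       can_serve(&zones[j], orders[i]) ? "ok" : "no");
	}
	return 0;
}

Under this model an order-0 request can only fall back to the movable and
physical zones, which is what keeps the THP zones free of fragmenting
allocations while THP-sized requests can still fall back downward.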
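[Editor's note] The freeing-side change can be read the same way: in the
mm/page_alloc.c hunk, zone_max_order() returns the zone's fixed order for
ZONE_NOMERGE, so the merge loop in __free_one_page() stops coalescing there
instead of at MAX_PAGE_ORDER. The sketch below is a simplified, user-space
stand-in for that loop; merge_to() and its free-buddy bitmask are made up
solely to show where the cap kicks in.

/*
 * Simplified stand-in for the merge cap added to __free_one_page():
 * in a NoMerge zone, zone_max_order() returns the zone's fixed order,
 * so freed pages stop coalescing there rather than at MAX_PAGE_ORDER.
 * merge_to() and its bitmask argument are made up for illustration.
 */
#include <stdio.h>

#define MAX_PAGE_ORDER 10

static int merge_to(int order, int zone_cap, unsigned int buddy_free_mask)
{
	/* Keep merging while the buddy at this order is free and the cap allows it. */
	while (order < zone_cap && (buddy_free_mask & (1u << order)))
		order++;
	return order;
}

int main(void)
{
	unsigned int all_free = ~0u;	/* pretend every buddy is free */

	/* A regular zone coalesces all the way up to MAX_PAGE_ORDER... */
	printf("normal zone:  order 0 ends up at order %d\n",
	       merge_to(0, MAX_PAGE_ORDER, all_free));
	/* ...while an order-9 NoMerge zone never goes past order 9. */
	printf("nomerge zone: order 9 ends up at order %d\n",
	       merge_to(9, 9, all_free));
	return 0;
}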