Message ID | 20240229183436.4110845-2-yuzhao@google.com (mailing list archive)
---|---
State | New
Series | [Chapter One] THP zones: the use cases of policy zones
On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
> higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
> unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> contiguous PTEs on arm64 [1], which are more suitable for client
> workloads.

Can this mechanism be used to fully replace the hugetlb pool approach?
That would be a major selling point. It kind of feels like it should,
but I am insufficiently expert to be certain.

I'll read over the patches sometime soon. There's a lot to go through.
Something I didn't see in the cover letter or commit messages was any
discussion of page->flags and how many bits we use for ZONE
(particularly on 32-bit). Perhaps I'll discover the answer to that as
I read.
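For reference, the number of page->flags bits spent on the zone index is
derived from MAX_NR_ZONES at build time, roughly as below (a paraphrase of
the cascade in include/linux/page-flags-layout.h; treat the exact
thresholds as an assumption to double-check against the tree):

    #if MAX_NR_ZONES < 2
    #define ZONES_SHIFT 0
    #elif MAX_NR_ZONES <= 2
    #define ZONES_SHIFT 1
    #elif MAX_NR_ZONES <= 4
    #define ZONES_SHIFT 2
    #elif MAX_NR_ZONES <= 8
    #define ZONES_SHIFT 3
    #else
    #error ZONES_SHIFT "Too many zones configured"
    #endif

If that still holds, a 64-bit config with ZONE_DEVICE already needs 3 bits
and stays at 3 with the two new zones (at most 8 in total), while a 32-bit
config that previously fit into 4 zones (e.g. DMA + Normal + HighMem +
Movable) would grow from 2 to 3 bits once ZONE_NOSPLIT and ZONE_NOMERGE are
compiled in.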
On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
>
> There are three types of zones:
> 1. The first four zones partition the physical address space of CPU
> memory.
> 2. The device zone provides interoperability between CPU and device
> memory.
> 3. The movable zone commonly represents a memory allocation policy.
>
> Though originally designed for memory hot removal, the movable zone is
> instead widely used for other purposes, e.g., CMA and kdump kernel, on
> platforms that do not support hot removal, e.g., Android and ChromeOS.
> Nowadays, it is legitimately a zone independent of any physical
> characteristics. In spite of being somewhat regarded as a hack,
> largely due to the lack of a generic design concept for its true major
> use cases (on billions of client devices), the movable zone naturally
> resembles a policy (virtual) zone overlayed on the first four
> (physical) zones.
>
> This proposal formally generalizes this concept as policy zones so
> that additional policies can be implemented and enforced by subsequent
> zones after the movable zone. An inherited requirement of policy zones
> (and the first four zones) is that subsequent zones must be able to
> fall back to previous zones and therefore must add new properties to
> the previous zones rather than remove existing ones from them. Also,
> all properties must be known at the allocation time, rather than the
> runtime, e.g., memory object size and mobility are valid properties
> but hotness and lifetime are not.
>
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> ZONE_MOVABLE) and restricted to a minimum order to be
> anti-fragmentation. The latter means that they cannot be split down
> below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> to an exact order. The latter means that not only is split
> prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> reason in Chapter Three), while they are free or in use.
>
> Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> compaction is not needed for these two zones.
>
> Compared with the hugeTLB pool approach, THP zones tap into core MM
> features including:
> 1. THP allocations can fall back to the lower zones, which can have
> higher latency but still succeed.
> 2. THPs can be either shattered (see Chapter Two) if partially
> unmapped or reclaimed if becoming cold.
> 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> contiguous PTEs on arm64 [1], which are more suitable for client
> workloads.

I think the allocation fallback policy needs to be elaborated. IIUC,
when allocating large folios, if the order > min order of the policy
zones, the fallback policy should be ZONE_NOSPLIT/NOMERGE ->
ZONE_MOVABLE -> ZONE_NORMAL, right?

If all other zones are depleted, an allocation whose order is < the min
order won't fall back to the policy zones and will fail, just like a
non-movable allocation can't fall back to ZONE_MOVABLE even though there
is enough memory in that zone, right?

> Policy zones can be dynamically resized by offlining pages in one of
> them and onlining those pages in another of them. Note that this is
> only done among policy zones, not between a policy zone and a physical
> zone, since resizing is a (software) policy, not a physical
> characteristic.
> > Implementing the same idea in the pageblock granularity has also been > explored but rejected at Google. Pageblocks have a finer granularity > and therefore can be more flexible than zones. The tradeoff is that > this alternative implementation was more complex and failed to bring a > better ROI. However, the rejection was mainly due to its inability to > be smoothly extended to 1GB THPs [2], which is a planned use case of > TAO. > > [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/ > [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ > > Signed-off-by: Yu Zhao <yuzhao@google.com> > --- > .../admin-guide/kernel-parameters.txt | 10 + > drivers/virtio/virtio_mem.c | 2 +- > include/linux/gfp.h | 24 +- > include/linux/huge_mm.h | 6 - > include/linux/mempolicy.h | 2 +- > include/linux/mmzone.h | 52 +- > include/linux/nodemask.h | 2 +- > include/linux/vm_event_item.h | 2 +- > include/trace/events/mmflags.h | 4 +- > mm/compaction.c | 12 + > mm/huge_memory.c | 5 +- > mm/mempolicy.c | 14 +- > mm/migrate.c | 7 +- > mm/mm_init.c | 452 ++++++++++-------- > mm/page_alloc.c | 44 +- > mm/page_isolation.c | 2 +- > mm/swap_slots.c | 3 +- > mm/vmscan.c | 32 +- > mm/vmstat.c | 7 +- > 19 files changed, 431 insertions(+), 251 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 31b3a25680d0..a6c181f6efde 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -3529,6 +3529,16 @@ > allocations which rules out almost all kernel > allocations. Use with caution! > > + nosplit=X,Y [MM] Set the minimum order of the nosplit zone. Pages in > + this zone can't be split down below order Y, while free > + or in use. > + Like movablecore, X should be either nn[KMGTPE] or n%. > + > + nomerge=X,Y [MM] Set the exact orders of the nomerge zone. Pages in > + this zone are always order Y, meaning they can't be > + split or merged while free or in use. > + Like movablecore, X should be either nn[KMGTPE] or n%. > + > MTD_Partition= [MTD] > Format: <name>,<region-number>,<size>,<offset> > > diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c > index 8e3223294442..37ecf5ee4afd 100644 > --- a/drivers/virtio/virtio_mem.c > +++ b/drivers/virtio/virtio_mem.c > @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct virtio_mem *vm, > page = pfn_to_online_page(pfn); > if (!page) > continue; > - if (page_zonenum(page) != ZONE_MOVABLE) > + if (!is_zone_movable_page(page)) > return false; > } > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index de292a007138..c0f9d21b4d18 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) > * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. 
> */ > > -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 > -/* ZONE_DEVICE is not a valid GFP zone specifier */ > +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4 > +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ > #define GFP_ZONES_SHIFT 2 > #else > #define GFP_ZONES_SHIFT ZONES_SHIFT > @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) > z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & > ((1 << GFP_ZONES_SHIFT) - 1); > VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); > + > + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP)) > + return LAST_VIRT_ZONE; > + > return z; > } > > +extern int zone_nomerge_order __read_mostly; > +extern int zone_nosplit_order __read_mostly; > + > +static inline enum zone_type gfp_order_zone(gfp_t flags, int order) > +{ > + enum zone_type zid = gfp_zone(flags); > + > + if (zid >= ZONE_NOMERGE && order != zone_nomerge_order) > + zid = ZONE_NOMERGE - 1; > + > + if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order) > + zid = ZONE_NOSPLIT - 1; > + > + return zid; > +} > + > /* > * There is only one page-allocator function, and two main namespaces to > * it. The alloc_page*() variants return 'struct page *' and as such > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 5adb86af35fc..9960ad7c3b10 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, > unsigned long len, unsigned long pgoff, unsigned long flags); > > void folio_prep_large_rmappable(struct folio *folio); > -bool can_split_folio(struct folio *folio, int *pextra_pins); > int split_huge_page_to_list(struct page *page, struct list_head *list); > static inline int split_huge_page(struct page *page) > { > @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct folio *folio) {} > > #define thp_get_unmapped_area NULL > > -static inline bool > -can_split_folio(struct folio *folio, int *pextra_pins) > -{ > - return false; > -} > static inline int > split_huge_page_to_list(struct page *page, struct list_head *list) > { > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 931b118336f4..a92bcf47cf8c 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -150,7 +150,7 @@ extern enum zone_type policy_zone; > > static inline void check_highest_zone(enum zone_type k) > { > - if (k > policy_zone && k != ZONE_MOVABLE) > + if (k > policy_zone && !zid_is_virt(k)) > policy_zone = k; > } > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index a497f189d988..532218167bba 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -805,11 +805,15 @@ enum zone_type { > * there can be false negatives). 
> */ > ZONE_MOVABLE, > + ZONE_NOSPLIT, > + ZONE_NOMERGE, > #ifdef CONFIG_ZONE_DEVICE > ZONE_DEVICE, > #endif > - __MAX_NR_ZONES > + __MAX_NR_ZONES, > > + LAST_PHYS_ZONE = ZONE_MOVABLE - 1, > + LAST_VIRT_ZONE = ZONE_NOMERGE, > }; > > #ifndef __GENERATING_BOUNDS_H > @@ -929,6 +933,8 @@ struct zone { > seqlock_t span_seqlock; > #endif > > + int order; > + > int initialized; > > /* Write-intensive fields used from the page allocator */ > @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const struct folio *folio) > > static inline bool is_zone_movable_page(const struct page *page) > { > - return page_zonenum(page) == ZONE_MOVABLE; > + return page_zonenum(page) >= ZONE_MOVABLE; > } > > static inline bool folio_is_zone_movable(const struct folio *folio) > { > - return folio_zonenum(folio) == ZONE_MOVABLE; > + return folio_zonenum(folio) >= ZONE_MOVABLE; > +} > + > +static inline bool page_can_split(struct page *page) > +{ > + return page_zonenum(page) < ZONE_NOSPLIT; > +} > + > +static inline bool folio_can_split(struct folio *folio) > +{ > + return folio_zonenum(folio) < ZONE_NOSPLIT; > } > #endif > > @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) { return node_id; }; > */ > #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > > +static inline bool zid_is_virt(enum zone_type zid) > +{ > + return zid > LAST_PHYS_ZONE && zid <= LAST_VIRT_ZONE; > +} > + > +static inline bool zone_can_frag(struct zone *zone) > +{ > + VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT); > + > + return zone_idx(zone) < ZONE_NOSPLIT; > +} > + > +static inline bool zone_is_suitable(struct zone *zone, int order) > +{ > + int zid = zone_idx(zone); > + > + if (zid < ZONE_NOSPLIT) > + return true; > + > + if (!zone->order) > + return false; > + > + return (zid == ZONE_NOSPLIT && order >= zone->order) || > + (zid == ZONE_NOMERGE && order == zone->order); > +} > + > #ifdef CONFIG_ZONE_DEVICE > static inline bool zone_is_zone_device(struct zone *zone) > { > @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone) > static inline void zone_set_nid(struct zone *zone, int nid) {} > #endif > > -extern int movable_zone; > +extern int virt_zone; > > static inline int is_highmem_idx(enum zone_type idx) > { > #ifdef CONFIG_HIGHMEM > return (idx == ZONE_HIGHMEM || > - (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM)); > + (zid_is_virt(idx) && virt_zone == ZONE_HIGHMEM)); > #else > return 0; > #endif > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h > index b61438313a73..34fbe910576d 100644 > --- a/include/linux/nodemask.h > +++ b/include/linux/nodemask.h > @@ -404,7 +404,7 @@ enum node_states { > #else > N_HIGH_MEMORY = N_NORMAL_MEMORY, > #endif > - N_MEMORY, /* The node has memory(regular, high, movable) */ > + N_MEMORY, /* The node has memory in any of the zones */ > N_CPU, /* The node has one or more cpus */ > N_GENERIC_INITIATOR, /* The node has one or more Generic Initiators */ > NR_NODE_STATES > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 747943bc8cc2..9a54d15d5ec3 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -27,7 +27,7 @@ > #endif > > #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \ > - HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx) > + HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE_ZONE(xx) > > enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > FOR_ALL_ZONES(PGALLOC) > diff --git 
a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h > index d801409b33cf..2b5fdafaadea 100644 > --- a/include/trace/events/mmflags.h > +++ b/include/trace/events/mmflags.h > @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \ > IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \ > EM (ZONE_NORMAL, "Normal") \ > IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \ > - EMe(ZONE_MOVABLE,"Movable") > + EM (ZONE_MOVABLE,"Movable") \ > + EM (ZONE_NOSPLIT,"NoSplit") \ > + EMe(ZONE_NOMERGE,"NoMerge") > > #define LRU_NAMES \ > EM (LRU_INACTIVE_ANON, "inactive_anon") \ > diff --git a/mm/compaction.c b/mm/compaction.c > index 4add68d40e8d..8a64c805f411 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, > ac->highest_zoneidx, ac->nodemask) { > enum compact_result status; > > + if (!zone_can_frag(zone)) > + continue; > + > if (prio > MIN_COMPACT_PRIORITY > && compaction_deferred(zone, order)) { > rc = max_t(enum compact_result, COMPACT_DEFERRED, rc); > @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > cc.zone = zone; > > compact_zone(&cc, NULL); > @@ -2846,6 +2852,9 @@ static void compact_node(int nid) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > cc.zone = zone; > > compact_zone(&cc, NULL); > @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat) > if (!populated_zone(zone)) > continue; > > + if (!zone_can_frag(zone)) > + continue; > + > ret = compaction_suit_allocation_order(zone, > pgdat->kcompactd_max_order, > highest_zoneidx, ALLOC_WMARK_MIN); > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 94c958f7ebb5..b57faa0a1e83 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, > } > > /* Racy check whether the huge page can be split */ > -bool can_split_folio(struct folio *folio, int *pextra_pins) > +static bool can_split_folio(struct folio *folio, int *pextra_pins) > { > int extra_pins; > > + if (!folio_can_split(folio)) > + return false; > + > /* Additional pins from page cache */ > if (folio_test_anon(folio)) > extra_pins = folio_test_swapcache(folio) ? > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 10a590ee1c89..1f84dd759086 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma) > > bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > { > - enum zone_type dynamic_policy_zone = policy_zone; > - > - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); > + WARN_ON_ONCE(zid_is_virt(policy_zone)); > > /* > - * if policy->nodes has movable memory only, > - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. > + * If policy->nodes has memory in virtual zones only, we apply policy > + * only if gfp_zone(gfp) can allocate from those zones. > * > * policy->nodes is intersect with node_states[N_MEMORY]. > * so if the following test fails, it implies > - * policy->nodes has movable memory only. > + * policy->nodes has memory in virtual zones only. 
> */ > if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) > - dynamic_policy_zone = ZONE_MOVABLE; > + return zone > LAST_PHYS_ZONE; > > - return zone >= dynamic_policy_zone; > + return zone >= policy_zone; > } > > /* Do dynamic interleaving for a process */ > diff --git a/mm/migrate.c b/mm/migrate.c > index cc9f2bcd73b4..f615c0c22046 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f > { > int rc; > > + if (!folio_can_split(folio)) > + return -EBUSY; > + > folio_lock(folio); > rc = split_folio_to_list(folio, split_folios); > folio_unlock(folio); > @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private) > order = folio_order(src); > } > zidx = zone_idx(folio_zone(src)); > - if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE) > + if (zidx > ZONE_NORMAL) > gfp_mask |= __GFP_HIGHMEM; > > return __folio_alloc(gfp_mask, order, nid, mtc->nmask); > @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio) > break; > } > wakeup_kswapd(pgdat->node_zones + z, 0, > - folio_order(folio), ZONE_MOVABLE); > + folio_order(folio), z); > return 0; > } > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 2c19f5515e36..7769c21e6d54 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init); > > static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata; > static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata; > -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata; > > -static unsigned long required_kernelcore __initdata; > -static unsigned long required_kernelcore_percent __initdata; > -static unsigned long required_movablecore __initdata; > -static unsigned long required_movablecore_percent __initdata; > +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUMNODES] __initdata; > +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid]) > + > +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata; > +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE]) > + > +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata; > +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE]) > + > +int zone_nosplit_order __read_mostly; > +int zone_nomerge_order __read_mostly; > > static unsigned long nr_kernel_pages __initdata; > static unsigned long nr_all_pages __initdata; > @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p) > return 0; > } > > - return cmdline_parse_core(p, &required_kernelcore, > - &required_kernelcore_percent); > + return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE), > + &percentage_of(LAST_PHYS_ZONE)); > } > early_param("kernelcore", cmdline_parse_kernelcore); > > @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore); > */ > static int __init cmdline_parse_movablecore(char *p) > { > - return cmdline_parse_core(p, &required_movablecore, > - &required_movablecore_percent); > + return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE), > + &percentage_of(ZONE_MOVABLE)); > } > early_param("movablecore", cmdline_parse_movablecore); > > +static int __init parse_zone_order(char *p, unsigned long *nr_pages, > + unsigned long *percent, int *order) > +{ > + int err; > + unsigned long n; > + char *s = strchr(p, ','); > + > + if (!s) > + return -EINVAL; > + > + *s++ = 
'\0'; > + > + err = kstrtoul(s, 0, &n); > + if (err) > + return err; > + > + if (n < 2 || n > MAX_PAGE_ORDER) > + return -EINVAL; > + > + err = cmdline_parse_core(p, nr_pages, percent); > + if (err) > + return err; > + > + *order = n; > + > + return 0; > +} > + > +static int __init parse_zone_nosplit(char *p) > +{ > + return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT), > + &percentage_of(ZONE_NOSPLIT), &zone_nosplit_order); > +} > +early_param("nosplit", parse_zone_nosplit); > + > +static int __init parse_zone_nomerge(char *p) > +{ > + return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE), > + &percentage_of(ZONE_NOMERGE), &zone_nomerge_order); > +} > +early_param("nomerge", parse_zone_nomerge); > + > /* > * early_calculate_totalpages() > - * Sum pages in active regions for movable zone. > + * Sum pages in active regions for virtual zones. > * Populate N_MEMORY for calculating usable_nodes. > */ > static unsigned long __init early_calculate_totalpages(void) > @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalpages(void) > } > > /* > - * This finds a zone that can be used for ZONE_MOVABLE pages. The > + * This finds a physical zone that can be used for virtual zones. The > * assumption is made that zones within a node are ordered in monotonic > * increasing memory addresses so that the "highest" populated zone is used > */ > -static void __init find_usable_zone_for_movable(void) > +static void __init find_usable_zone(void) > { > int zone_index; > - for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) { > - if (zone_index == ZONE_MOVABLE) > - continue; > - > + for (zone_index = LAST_PHYS_ZONE; zone_index >= 0; zone_index--) { > if (arch_zone_highest_possible_pfn[zone_index] > > arch_zone_lowest_possible_pfn[zone_index]) > break; > } > > VM_BUG_ON(zone_index == -1); > - movable_zone = zone_index; > + virt_zone = zone_index; > +} > + > +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn) > +{ > + int i, nid; > + unsigned long node_avg, remaining; > + int usable_nodes = nodes_weight(node_states[N_MEMORY]); > + /* usable_startpfn is the lowest possible pfn virtual zones can be at */ > + unsigned long usable_startpfn = arch_zone_lowest_possible_pfn[virt_zone]; > + > +restart: > + /* Carve out memory as evenly as possible throughout nodes */ > + node_avg = occupied / usable_nodes; > + for_each_node_state(nid, N_MEMORY) { > + unsigned long start_pfn, end_pfn; > + > + /* > + * Recalculate node_avg if the division per node now exceeds > + * what is necessary to satisfy the amount of memory to carve > + * out. > + */ > + if (occupied < node_avg) > + node_avg = occupied / usable_nodes; > + > + /* > + * As the map is walked, we track how much memory is usable > + * using remaining. When it is 0, the rest of the node is > + * usable. 
> + */ > + remaining = node_avg; > + > + /* Go through each range of PFNs within this node */ > + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { > + unsigned long size_pages; > + > + start_pfn = max(start_pfn, zone_pfn[nid]); > + if (start_pfn >= end_pfn) > + continue; > + > + /* Account for what is only usable when carving out */ > + if (start_pfn < usable_startpfn) { > + unsigned long nr_pages = min(end_pfn, usable_startpfn) - start_pfn; > + > + remaining -= min(nr_pages, remaining); > + occupied -= min(nr_pages, occupied); > + > + /* Continue if range is now fully accounted */ > + if (end_pfn <= usable_startpfn) { > + > + /* > + * Push zone_pfn to the end so that if > + * we have to carve out more across > + * nodes, we will not double account > + * here. > + */ > + zone_pfn[nid] = end_pfn; > + continue; > + } > + start_pfn = usable_startpfn; > + } > + > + /* > + * The usable PFN range is from start_pfn->end_pfn. > + * Calculate size_pages as the number of pages used. > + */ > + size_pages = end_pfn - start_pfn; > + if (size_pages > remaining) > + size_pages = remaining; > + zone_pfn[nid] = start_pfn + size_pages; > + > + /* > + * Some memory was carved out, update counts and break > + * if the request for this node has been satisfied. > + */ > + occupied -= min(occupied, size_pages); > + remaining -= size_pages; > + if (!remaining) > + break; > + } > + } > + > + /* > + * If there is still more to carve out, we do another pass with one less > + * node in the count. This will push zone_pfn[nid] further along on the > + * nodes that still have memory until the request is fully satisfied. > + */ > + usable_nodes--; > + if (usable_nodes && occupied > usable_nodes) > + goto restart; > } > > /* > @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(void) > * memory. When they don't, some nodes will have more kernelcore than > * others > */ > -static void __init find_zone_movable_pfns_for_nodes(void) > +static void __init find_virt_zones(void) > { > - int i, nid; > + int i; > + int nid; > unsigned long usable_startpfn; > - unsigned long kernelcore_node, kernelcore_remaining; > /* save the state before borrow the nodemask */ > nodemask_t saved_node_state = node_states[N_MEMORY]; > unsigned long totalpages = early_calculate_totalpages(); > - int usable_nodes = nodes_weight(node_states[N_MEMORY]); > struct memblock_region *r; > + unsigned long occupied = 0; > > - /* Need to find movable_zone earlier when movable_node is specified. */ > - find_usable_zone_for_movable(); > + /* Need to find virt_zone earlier when movable_node is specified. */ > + find_usable_zone(); > > /* > * If movable_node is specified, ignore kernelcore and movablecore > @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(void) > nid = memblock_get_region_node(r); > > usable_startpfn = PFN_DOWN(r->base); > - zone_movable_pfn[nid] = zone_movable_pfn[nid] ? > - min(usable_startpfn, zone_movable_pfn[nid]) : > + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ? > + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) : > usable_startpfn; > } > > @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(void) > continue; > } > > - zone_movable_pfn[nid] = zone_movable_pfn[nid] ? > - min(usable_startpfn, zone_movable_pfn[nid]) : > + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ? 
> + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) : > usable_startpfn; > } > > @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_nodes(void) > goto out2; > } > > + if (zone_nomerge_order && zone_nomerge_order <= zone_nosplit_order) { > + nr_pages_of(ZONE_NOSPLIT) = nr_pages_of(ZONE_NOMERGE) = 0; > + percentage_of(ZONE_NOSPLIT) = percentage_of(ZONE_NOMERGE) = 0; > + zone_nosplit_order = zone_nomerge_order = 0; > + pr_warn("zone %s order %d must be higher zone %s order %d\n", > + zone_names[ZONE_NOMERGE], zone_nomerge_order, > + zone_names[ZONE_NOSPLIT], zone_nosplit_order); > + } > + > /* > * If kernelcore=nn% or movablecore=nn% was specified, calculate the > * amount of necessary memory. > */ > - if (required_kernelcore_percent) > - required_kernelcore = (totalpages * 100 * required_kernelcore_percent) / > - 10000UL; > - if (required_movablecore_percent) > - required_movablecore = (totalpages * 100 * required_movablecore_percent) / > - 10000UL; > + for (i = LAST_PHYS_ZONE; i <= LAST_VIRT_ZONE; i++) { > + if (percentage_of(i)) > + nr_pages_of(i) = totalpages * percentage_of(i) / 100; > + > + nr_pages_of(i) = roundup(nr_pages_of(i), MAX_ORDER_NR_PAGES); > + occupied += nr_pages_of(i); > + } > > /* > * If movablecore= was specified, calculate what size of > * kernelcore that corresponds so that memory usable for > * any allocation type is evenly spread. If both kernelcore > * and movablecore are specified, then the value of kernelcore > - * will be used for required_kernelcore if it's greater than > - * what movablecore would have allowed. > + * will be used if it's greater than what movablecore would have > + * allowed. > */ > - if (required_movablecore) { > - unsigned long corepages; > + if (occupied < totalpages) { > + enum zone_type zid; > > - /* > - * Round-up so that ZONE_MOVABLE is at least as large as what > - * was requested by the user > - */ > - required_movablecore = > - roundup(required_movablecore, MAX_ORDER_NR_PAGES); > - required_movablecore = min(totalpages, required_movablecore); > - corepages = totalpages - required_movablecore; > - > - required_kernelcore = max(required_kernelcore, corepages); > + zid = !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_MOVABLE) ? > + LAST_PHYS_ZONE : ZONE_MOVABLE; > + nr_pages_of(zid) += totalpages - occupied; > } > > /* > * If kernelcore was not specified or kernelcore size is larger > - * than totalpages, there is no ZONE_MOVABLE. > + * than totalpages, there are not virtual zones. > */ > - if (!required_kernelcore || required_kernelcore >= totalpages) > + occupied = nr_pages_of(LAST_PHYS_ZONE); > + if (!occupied || occupied >= totalpages) > goto out; > > - /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */ > - usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone]; > + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) { > + if (!nr_pages_of(i)) > + continue; > > -restart: > - /* Spread kernelcore memory as evenly as possible throughout nodes */ > - kernelcore_node = required_kernelcore / usable_nodes; > - for_each_node_state(nid, N_MEMORY) { > - unsigned long start_pfn, end_pfn; > - > - /* > - * Recalculate kernelcore_node if the division per node > - * now exceeds what is necessary to satisfy the requested > - * amount of memory for the kernel > - */ > - if (required_kernelcore < kernelcore_node) > - kernelcore_node = required_kernelcore / usable_nodes; > - > - /* > - * As the map is walked, we track how much memory is usable > - * by the kernel using kernelcore_remaining. 
When it is > - * 0, the rest of the node is usable by ZONE_MOVABLE > - */ > - kernelcore_remaining = kernelcore_node; > - > - /* Go through each range of PFNs within this node */ > - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { > - unsigned long size_pages; > - > - start_pfn = max(start_pfn, zone_movable_pfn[nid]); > - if (start_pfn >= end_pfn) > - continue; > - > - /* Account for what is only usable for kernelcore */ > - if (start_pfn < usable_startpfn) { > - unsigned long kernel_pages; > - kernel_pages = min(end_pfn, usable_startpfn) > - - start_pfn; > - > - kernelcore_remaining -= min(kernel_pages, > - kernelcore_remaining); > - required_kernelcore -= min(kernel_pages, > - required_kernelcore); > - > - /* Continue if range is now fully accounted */ > - if (end_pfn <= usable_startpfn) { > - > - /* > - * Push zone_movable_pfn to the end so > - * that if we have to rebalance > - * kernelcore across nodes, we will > - * not double account here > - */ > - zone_movable_pfn[nid] = end_pfn; > - continue; > - } > - start_pfn = usable_startpfn; > - } > - > - /* > - * The usable PFN range for ZONE_MOVABLE is from > - * start_pfn->end_pfn. Calculate size_pages as the > - * number of pages used as kernelcore > - */ > - size_pages = end_pfn - start_pfn; > - if (size_pages > kernelcore_remaining) > - size_pages = kernelcore_remaining; > - zone_movable_pfn[nid] = start_pfn + size_pages; > - > - /* > - * Some kernelcore has been met, update counts and > - * break if the kernelcore for this node has been > - * satisfied > - */ > - required_kernelcore -= min(required_kernelcore, > - size_pages); > - kernelcore_remaining -= size_pages; > - if (!kernelcore_remaining) > - break; > - } > + find_virt_zone(occupied, &pfn_of(i, 0)); > + occupied += nr_pages_of(i); > } > - > - /* > - * If there is still required_kernelcore, we do another pass with one > - * less node in the count. This will push zone_movable_pfn[nid] further > - * along on the nodes that still have memory until kernelcore is > - * satisfied > - */ > - usable_nodes--; > - if (usable_nodes && required_kernelcore > usable_nodes) > - goto restart; > - > out2: > - /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */ > + /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGES */ > for (nid = 0; nid < MAX_NUMNODES; nid++) { > unsigned long start_pfn, end_pfn; > - > - zone_movable_pfn[nid] = > - roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); > + unsigned long prev_virt_zone_pfn = 0; > > get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); > - if (zone_movable_pfn[nid] >= end_pfn) > - zone_movable_pfn[nid] = 0; > + > + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) { > + pfn_of(i, nid) = roundup(pfn_of(i, nid), MAX_ORDER_NR_PAGES); > + > + if (pfn_of(i, nid) <= prev_virt_zone_pfn || pfn_of(i, nid) >= end_pfn) > + pfn_of(i, nid) = 0; > + > + if (pfn_of(i, nid)) > + prev_virt_zone_pfn = pfn_of(i, nid); > + } > } > - > out: > /* restore the node_state */ > node_states[N_MEMORY] = saved_node_state; > @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *zone, > #endif > > /* > - * The zone ranges provided by the architecture do not include ZONE_MOVABLE > - * because it is sized independent of architecture. Unlike the other zones, > - * the starting point for ZONE_MOVABLE is not fixed. It may be different > - * in each node depending on the size of each node and how evenly kernelcore > - * is distributed. 
This helper function adjusts the zone ranges > + * The zone ranges provided by the architecture do not include virtual zones > + * because they are sized independent of architecture. Unlike physical zones, > + * the starting point for the first populated virtual zone is not fixed. It may > + * be different in each node depending on the size of each node and how evenly > + * kernelcore is distributed. This helper function adjusts the zone ranges > * provided by the architecture for a given node by using the end of the > - * highest usable zone for ZONE_MOVABLE. This preserves the assumption that > - * zones within a node are in order of monotonic increases memory addresses > + * highest usable zone for the first populated virtual zone. This preserves the > + * assumption that zones within a node are in order of monotonic increases > + * memory addresses. > */ > -static void __init adjust_zone_range_for_zone_movable(int nid, > +static void __init adjust_zone_range(int nid, > unsigned long zone_type, > unsigned long node_end_pfn, > unsigned long *zone_start_pfn, > unsigned long *zone_end_pfn) > { > - /* Only adjust if ZONE_MOVABLE is on this node */ > - if (zone_movable_pfn[nid]) { > - /* Size ZONE_MOVABLE */ > - if (zone_type == ZONE_MOVABLE) { > - *zone_start_pfn = zone_movable_pfn[nid]; > - *zone_end_pfn = min(node_end_pfn, > - arch_zone_highest_possible_pfn[movable_zone]); > + int i = max_t(int, zone_type, LAST_PHYS_ZONE); > + unsigned long next_virt_zone_pfn = 0; > > - /* Adjust for ZONE_MOVABLE starting within this range */ > - } else if (!mirrored_kernelcore && > - *zone_start_pfn < zone_movable_pfn[nid] && > - *zone_end_pfn > zone_movable_pfn[nid]) { > - *zone_end_pfn = zone_movable_pfn[nid]; > + while (i++ < LAST_VIRT_ZONE) { > + if (pfn_of(i, nid)) { > + next_virt_zone_pfn = pfn_of(i, nid); > + break; > + } > + } > > - /* Check if this whole range is within ZONE_MOVABLE */ > - } else if (*zone_start_pfn >= zone_movable_pfn[nid]) > + if (zone_type <= LAST_PHYS_ZONE) { > + if (!next_virt_zone_pfn) > + return; > + > + if (!mirrored_kernelcore && > + *zone_start_pfn < next_virt_zone_pfn && > + *zone_end_pfn > next_virt_zone_pfn) > + *zone_end_pfn = next_virt_zone_pfn; > + else if (*zone_start_pfn >= next_virt_zone_pfn) > *zone_start_pfn = *zone_end_pfn; > + } else if (zone_type <= LAST_VIRT_ZONE) { > + if (!pfn_of(zone_type, nid)) > + return; > + > + if (next_virt_zone_pfn) > + *zone_end_pfn = min3(next_virt_zone_pfn, > + node_end_pfn, > + arch_zone_highest_possible_pfn[virt_zone]); > + else > + *zone_end_pfn = min(node_end_pfn, > + arch_zone_highest_possible_pfn[virt_zone]); > + *zone_start_pfn = min(*zone_end_pfn, pfn_of(zone_type, nid)); > } > } > > @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid, > * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages > * and vice versa. 
> */ > - if (mirrored_kernelcore && zone_movable_pfn[nid]) { > + if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) { > unsigned long start_pfn, end_pfn; > struct memblock_region *r; > > @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_node(int nid, > /* Get the start and end of the zone */ > *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high); > *zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high); > - adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn, > - zone_start_pfn, zone_end_pfn); > + adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, zone_end_pfn); > > /* Check that this node has pages within the zone's required range */ > if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn) > @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat, > #if defined(CONFIG_MEMORY_HOTPLUG) > zone->present_early_pages = real_size; > #endif > + if (i == ZONE_NOSPLIT) > + zone->order = zone_nosplit_order; > + if (i == ZONE_NOMERGE) > + zone->order = zone_nomerge_order; > > totalpages += spanned; > realtotalpages += real_size; > @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgdat) > { > enum zone_type zone_type; > > - for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) { > + for (zone_type = 0; zone_type <= LAST_PHYS_ZONE; zone_type++) { > struct zone *zone = &pgdat->node_zones[zone_type]; > if (populated_zone(zone)) { > if (IS_ENABLED(CONFIG_HIGHMEM)) > @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void) > void __init free_area_init(unsigned long *max_zone_pfn) > { > unsigned long start_pfn, end_pfn; > - int i, nid, zone; > + int i, j, nid, zone; > bool descending; > > /* Record where the zone boundaries are */ > @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zone_pfn) > start_pfn = PHYS_PFN(memblock_start_of_DRAM()); > descending = arch_has_descending_max_zone_pfns(); > > - for (i = 0; i < MAX_NR_ZONES; i++) { > + for (i = 0; i <= LAST_PHYS_ZONE; i++) { > if (descending) > - zone = MAX_NR_ZONES - i - 1; > + zone = LAST_PHYS_ZONE - i; > else > zone = i; > > - if (zone == ZONE_MOVABLE) > - continue; > - > end_pfn = max(max_zone_pfn[zone], start_pfn); > arch_zone_lowest_possible_pfn[zone] = start_pfn; > arch_zone_highest_possible_pfn[zone] = end_pfn; > @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zone_pfn) > start_pfn = end_pfn; > } > > - /* Find the PFNs that ZONE_MOVABLE begins at in each node */ > - memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); > - find_zone_movable_pfns_for_nodes(); > + /* Find the PFNs that virtual zones begin at in each node */ > + find_virt_zones(); > > /* Print out the zone ranges */ > pr_info("Zone ranges:\n"); > - for (i = 0; i < MAX_NR_ZONES; i++) { > - if (i == ZONE_MOVABLE) > - continue; > + for (i = 0; i <= LAST_PHYS_ZONE; i++) { > pr_info(" %-8s ", zone_names[i]); > if (arch_zone_lowest_possible_pfn[i] == > arch_zone_highest_possible_pfn[i]) > @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zone_pfn) > << PAGE_SHIFT) - 1); > } > > - /* Print out the PFNs ZONE_MOVABLE begins at in each node */ > - pr_info("Movable zone start for each node\n"); > - for (i = 0; i < MAX_NUMNODES; i++) { > - if (zone_movable_pfn[i]) > - pr_info(" Node %d: %#018Lx\n", i, > - (u64)zone_movable_pfn[i] << PAGE_SHIFT); > + /* Print out the PFNs virtual zones begin at in each node */ > + for (; i <= LAST_VIRT_ZONE; i++) { > + pr_info("%s zone start 
for each node\n", zone_names[i]); > + for (j = 0; j < MAX_NUMNODES; j++) { > + if (pfn_of(i, j)) > + pr_info(" Node %d: %#018Lx\n", > + j, (u64)pfn_of(i, j) << PAGE_SHIFT); > + } > } > > /* > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 150d4f23b010..6a4da8f8691c 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] = { > "HighMem", > #endif > "Movable", > + "NoSplit", > + "NoMerge", > #ifdef CONFIG_ZONE_DEVICE > "Device", > #endif > @@ -290,9 +292,9 @@ int user_min_free_kbytes = -1; > static int watermark_boost_factor __read_mostly = 15000; > static int watermark_scale_factor = 10; > > -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ > -int movable_zone; > -EXPORT_SYMBOL(movable_zone); > +/* virt_zone is the "real" zone pages in virtual zones are taken from */ > +int virt_zone; > +EXPORT_SYMBOL(virt_zone); > > #if MAX_NUMNODES > 1 > unsigned int nr_node_ids __read_mostly = MAX_NUMNODES; > @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, > unsigned long higher_page_pfn; > struct page *higher_page; > > - if (order >= MAX_PAGE_ORDER - 1) > - return false; > - > higher_page_pfn = buddy_pfn & pfn; > higher_page = page + (higher_page_pfn - pfn); > > @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, > NULL) != NULL; > } > > +static int zone_max_order(struct zone *zone) > +{ > + return zone->order && zone_idx(zone) == ZONE_NOMERGE ? zone->order : MAX_PAGE_ORDER; > +} > + > /* > * Freeing function for a buddy system allocator. > * > @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page, > unsigned long combined_pfn; > struct page *buddy; > bool to_tail; > + int max_order = zone_max_order(zone); > > VM_BUG_ON(!zone_is_initialized(zone)); > VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); > @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page, > VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); > VM_BUG_ON_PAGE(bad_range(zone, page), page); > > - while (order < MAX_PAGE_ORDER) { > + while (order < max_order) { > if (compaction_capture(capc, page, order, migratetype)) { > __mod_zone_freepage_state(zone, -(1 << order), > migratetype); > @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page, > to_tail = true; > else if (is_shuffle_order(order)) > to_tail = shuffle_pick_tail(); > + else if (order + 1 >= max_order) > + to_tail = false; > else > to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order); > > @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page, > int mt; > int ret = 0; > > + VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page); > + > if (split_pfn_offset == 0) > return ret; > > @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, > struct free_area *area; > struct page *page; > > + VM_WARN_ON_ONCE(!zone_is_suitable(zone, order)); > + > /* Find a page of the appropriate size in the preferred list */ > for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) { > area = &(zone->free_area[current_order]); > @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, > long min = mark; > int o; > > + if (!zone_is_suitable(z, order)) > + return false; > + > /* free_pages may go negative - that's OK */ > free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); > > @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone 
*z, unsigned int order, > { > long free_pages; > > + if (!zone_is_suitable(z, order)) > + return false; > + > free_pages = zone_page_state(z, NR_FREE_PAGES); > > /* > @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, > struct page *page; > unsigned long mark; > > + if (!zone_is_suitable(zone, order)) > + continue; > + > if (cpusets_enabled() && > (alloc_flags & ALLOC_CPUSET) && > !__cpuset_zone_allowed(zone, gfp_mask)) > @@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void) > struct zone *zone; > unsigned long flags; > > - /* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */ > + /* Calculate total number of pages below ZONE_HIGHMEM */ > for_each_zone(zone) { > - if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE) > + if (zone_idx(zone) <= ZONE_NORMAL) > lowmem_pages += zone_managed_pages(zone); > } > > @@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void) > spin_lock_irqsave(&zone->lock, flags); > tmp = (u64)pages_min * zone_managed_pages(zone); > do_div(tmp, lowmem_pages); > - if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) { > + if (zone_idx(zone) > ZONE_NORMAL) { > /* > * __GFP_HIGH and PF_MEMALLOC allocations usually don't > - * need highmem and movable zones pages, so cap pages_min > - * to a small value here. > + * need pages from zones above ZONE_NORMAL, so cap > + * pages_min to a small value here. > * > * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN) > * deltas control async page reclaim, and so should > diff --git a/mm/page_isolation.c b/mm/page_isolation.c > index cd0ea3668253..8a6473543427 100644 > --- a/mm/page_isolation.c > +++ b/mm/page_isolation.c > @@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e > * pages then it should be reasonably safe to assume the rest > * is movable. > */ > - if (zone_idx(zone) == ZONE_MOVABLE) > + if (zid_is_virt(zone_idx(zone))) > continue; > > /* > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > index 0bec1f705f8e..ad0db0373b05 100644 > --- a/mm/swap_slots.c > +++ b/mm/swap_slots.c > @@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio) > entry.val = 0; > > if (folio_test_large(folio)) { > - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported()) > + if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported() && > + folio_test_pmd_mappable(folio)) > get_swap_pages(1, &entry, folio_nr_pages(folio)); > goto out; > } > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 4f9c854ce6cc..ae061ec4866a 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, > goto keep_locked; > if (folio_maybe_dma_pinned(folio)) > goto keep_locked; > - if (folio_test_large(folio)) { > - /* cannot split folio, skip it */ > - if (!can_split_folio(folio, NULL)) > - goto activate_locked; > - /* > - * Split folios without a PMD map right > - * away. Chances are some or all of the > - * tail pages can be freed without IO. > - */ > - if (!folio_entire_mapcount(folio) && > - split_folio_to_list(folio, > - folio_list)) > - goto activate_locked; > - } > + /* > + * Split folios that are not fully map right > + * away. Chances are some of the tail pages can > + * be freed without IO. 
> + */ > + if (folio_test_large(folio) && > + atomic_read(&folio->_nr_pages_mapped) < nr_pages) > + split_folio_to_list(folio, folio_list); > if (!add_to_swap(folio)) { > if (!folio_test_large(folio)) > goto activate_locked_split; > @@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > orig_mask = sc->gfp_mask; > if (buffer_heads_over_limit) { > sc->gfp_mask |= __GFP_HIGHMEM; > - sc->reclaim_idx = gfp_zone(sc->gfp_mask); > + sc->reclaim_idx = gfp_order_zone(sc->gfp_mask, sc->order); > } > > for_each_zone_zonelist_nodemask(zone, z, zonelist, > @@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > struct scan_control sc = { > .nr_to_reclaim = SWAP_CLUSTER_MAX, > .gfp_mask = current_gfp_context(gfp_mask), > - .reclaim_idx = gfp_zone(gfp_mask), > + .reclaim_idx = gfp_order_zone(gfp_mask, order), > .order = order, > .nodemask = nodemask, > .priority = DEF_PRIORITY, > @@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, > if (!cpuset_zone_allowed(zone, gfp_flags)) > return; > > + curr_idx = gfp_order_zone(gfp_flags, order); > + if (highest_zoneidx > curr_idx) > + highest_zoneidx = curr_idx; > + > pgdat = zone->zone_pgdat; > curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx); > > @@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in > .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE), > .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), > .may_swap = 1, > - .reclaim_idx = gfp_zone(gfp_mask), > + .reclaim_idx = gfp_order_zone(gfp_mask, order), > }; > unsigned long pflags; > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index db79935e4a54..adbd032e6a0f 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned int order) > > #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \ > TEXT_FOR_HIGHMEM(xx) xx "_movable", \ > + xx "_nosplit", xx "_nomerge", \ > TEXT_FOR_DEVICE(xx) > > const char * const vmstat_text[] = { > @@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > "\n spanned %lu" > "\n present %lu" > "\n managed %lu" > - "\n cma %lu", > + "\n cma %lu" > + "\n order %u", > zone_page_state(zone, NR_FREE_PAGES), > zone->watermark_boost, > min_wmark_pages(zone), > @@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, > zone->spanned_pages, > zone->present_pages, > zone_managed_pages(zone), > - zone_cma_pages(zone)); > + zone_cma_pages(zone), > + zone->order); > > seq_printf(m, > "\n protection: (%ld", > -- > 2.44.0.rc1.240.g4c46232300-goog > >
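For illustration only (the sizes and orders below are made-up values, not
taken from the patch), the two new parameters follow the movablecore
syntax and could be combined on the kernel command line as:

    nosplit=8G,4 nomerge=16G,9

i.e. an 8GB nosplit zone whose pages never go below order 4 (64KB with 4KB
base pages) and a 16GB nomerge zone fixed at order 9 (2MB). Per
parse_zone_order() above, the order must be between 2 and MAX_PAGE_ORDER,
and find_virt_zones() disables both zones with a warning if the nomerge
order is not higher than the nosplit order.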
On Thu, Feb 29, 2024 at 6:31 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > [...]
>
> I think the allocation fallback policy needs to be elaborated. IIUC,
> when allocating large folios, if the order > min order of the policy
> zones, the fallback policy should be ZONE_NOSPLIT/NOMERGE ->
> ZONE_MOVABLE -> ZONE_NORMAL, right?

Correct.

> If all other zones are depleted, an allocation whose order is < the min
> order won't fall back to the policy zones and will fail, just like a
> non-movable allocation can't fall back to ZONE_MOVABLE even though there
> is enough memory in that zone, right?

Correct. In this case, the userspace can consider dynamic resizing.
(The resizing patches are not included since, as I said in the other
thread, we need to focus on the first few steps at the current stage.)

Naturally, the next question would be why we created this whole new
process rather than trying to improve compaction. We did try the latter:
on servers, we tuned compaction and had some good improvements but soon
hit a new wall; on clients, no luck at all, because 1) they are usually
under much higher pressure than servers and 2) they are more sensitive
to latency. So we needed a *more deterministic* approach when dealing
with fragmentation. Unlike compaction, which I'd call a heuristic,
resizing is more of a policy that the userspace can have full control
over. Obviously, leaving the task to the userspace can be a good or a
bad thing, depending on the point of view.

The bottom line is:
1. Resizing would also help the *existing* problem of ZONE_MOVABLE being
unbalanced against the other zones, for the non-hot-removal case.
2. Enlarging the THP zones is more likely to succeed than compaction,
because it targets the blocks it "donated" to ZONE_MOVABLE with
everything it has (both migration and reclaim) and keeps at it until it
succeeds, whereas compaction lacks such laser focus and is more of a
best-effort approach. (Needless to say, shrinking the THP zones can
always succeed.)

> > Policy zones can be dynamically resized by offlining pages in one of
> > them and onlining those pages in another of them. Note that this is
> > only done among policy zones, not between a policy zone and a physical
> > zone, since resizing is a (software) policy, not a physical
> > characteristic.
> >
> > [...]
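To make the ceiling of that fallback concrete, below is a small
stand-alone model of the selection rules (it paraphrases the
gfp_zone()/gfp_order_zone() changes in this patch; the four-zone enum and
the orders 4 and 9 are simplifications for the example, not the kernel
definitions):

    #include <stdio.h>

    enum zone_type { ZONE_NORMAL, ZONE_MOVABLE, ZONE_NOSPLIT, ZONE_NOMERGE };

    static const char *names[] = { "Normal", "Movable", "NoSplit", "NoMerge" };
    static const int nosplit_order = 4, nomerge_order = 9;

    /* Mirrors gfp_order_zone(): movable+comp requests start at the last THP zone. */
    static enum zone_type highest_zone(int movable, int comp, int order)
    {
        enum zone_type zid = (movable && comp) ? ZONE_NOMERGE :
                             movable ? ZONE_MOVABLE : ZONE_NORMAL;

        if (zid >= ZONE_NOMERGE && order != nomerge_order)
            zid = ZONE_NOMERGE - 1;
        if (zid >= ZONE_NOSPLIT && order < nosplit_order)
            zid = ZONE_NOSPLIT - 1;
        return zid;
    }

    int main(void)
    {
        /* THP orders walk NOMERGE/NOSPLIT -> MOVABLE -> NORMAL -> ... */
        printf("order-9 THP : up to %s\n", names[highest_zone(1, 1, 9)]);
        printf("order-4 THP : up to %s\n", names[highest_zone(1, 1, 4)]);
        printf("order-2 THP : up to %s\n", names[highest_zone(1, 1, 2)]);
        /* ... while small or unmovable allocations never enter the THP zones. */
        printf("order-0 user: up to %s\n", names[highest_zone(1, 0, 0)]);
        printf("order-0 slab: up to %s\n", names[highest_zone(0, 0, 0)]);
        return 0;
    }

It prints NoMerge, NoSplit, Movable, Movable and Normal respectively: the
starting zone is capped by the order, the zonelist walk then falls back
toward ZONE_NORMAL and below, and nothing smaller than the configured
orders ever lands in the two THP zones.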
On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote:
> ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> zones:
> 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> ZONE_MOVABLE) and restricted to a minimum order to be
> anti-fragmentation. The latter means that they cannot be split down
> below that order, while they are free or in use.
> 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> to an exact order. The latter means that not only is split
> prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> reason in Chapter Three), while they are free or in use.

These two zones end up solving a problem for memdescs. So I'm in favour!
I added Option 5 to https://kernelnewbies.org/MatthewWilcox/BuddyAllocator

I think this patch needs to be split into more digestible chunks, but a
quick skim of it didn't reveal anything egregiously wrong. I do still
have that question about the number of bits used for Zone in
page->flags. Probably this all needs to be dependent on CONFIG_64BIT?
> There are three types of zones: > 1. The first four zones partition the physical address space of CPU > memory. > 2. The device zone provides interoperability between CPU and device > memory. > 3. The movable zone commonly represents a memory allocation policy. > > Though originally designed for memory hot removal, the movable zone is > instead widely used for other purposes, e.g., CMA and kdump kernel, on > platforms that do not support hot removal, e.g., Android and ChromeOS. > Nowadays, it is legitimately a zone independent of any physical > characteristics. In spite of being somewhat regarded as a hack, > largely due to the lack of a generic design concept for its true major > use cases (on billions of client devices), the movable zone naturally > resembles a policy (virtual) zone overlayed on the first four > (physical) zones. > > This proposal formally generalizes this concept as policy zones so > that additional policies can be implemented and enforced by subsequent > zones after the movable zone. An inherited requirement of policy zones > (and the first four zones) is that subsequent zones must be able to > fall back to previous zones and therefore must add new properties to > the previous zones rather than remove existing ones from them. Also, > all properties must be known at the allocation time, rather than the > runtime, e.g., memory object size and mobility are valid properties > but hotness and lifetime are not. > > ZONE_MOVABLE becomes the first policy zone, followed by two new policy > zones: > 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from > ZONE_MOVABLE) and restricted to a minimum order to be > anti-fragmentation. The latter means that they cannot be split down > below that order, while they are free or in use. > 2. ZONE_NOMERGE, which contains pages that are movable and restricted > to an exact order. The latter means that not only is split > prohibited (inherited from ZONE_NOSPLIT) but also merge (see the > reason in Chapter Three), while they are free or in use. > > Since these two zones only can serve THP allocations (__GFP_MOVABLE | > __GFP_COMP), they are called THP zones. Reclaim works seamlessly and > compaction is not needed for these two zones. > > Compared with the hugeTLB pool approach, THP zones tap into core MM > features including: > 1. THP allocations can fall back to the lower zones, which can have > higher latency but still succeed. > 2. THPs can be either shattered (see Chapter Two) if partially > unmapped or reclaimed if becoming cold. > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB > contiguous PTEs on arm64 [1], which are more suitable for client > workloads. > > Policy zones can be dynamically resized by offlining pages in one of > them and onlining those pages in another of them. Note that this is > only done among policy zones, not between a policy zone and a physical > zone, since resizing is a (software) policy, not a physical > characteristic. > > Implementing the same idea in the pageblock granularity has also been > explored but rejected at Google. Pageblocks have a finer granularity > and therefore can be more flexible than zones. The tradeoff is that > this alternative implementation was more complex and failed to bring a > better ROI. However, the rejection was mainly due to its inability to > be smoothly extended to 1GB THPs [2], which is a planned use case of > TAO. 
We did implement a similar idea at the pageblock granularity on OPPO's phones by extending two special migratetypes[1]: * QUAD_TO_TRIP - this is mainly for order-4 mTHP allocations which can use ARM64's CONT-PTE; but can rarely be split into order 3 to dull the pain of order-3 allocations, if and only if an order-3 allocation has failed in both the normal buddy and the TRIP_TO_QUAD type below. * TRIP_TO_QUAD - this is mainly for order-4 mTHP allocations which can use ARM64's CONT-PTE; but can sometimes be split into order 3 to dull the pain of order-3 allocations, if and only if an order-3 allocation has failed in the normal buddy. Neither of the above will be merged into order 5 or above; neither of the above will be split into order 2 or lower. In compaction, we skip both of the above. One disadvantage I am seeing with this approach is that I have to add a separate LRU list in each zone to place those mTHP folios. If mTHP and small folios are put in the same LRU list, the reclamation efficiency is extremely bad. A separate zone, on the other hand, can avoid a separate LRU list for mTHP as the new zone has its own LRU list. [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c > > [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/ > [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ > > Signed-off-by: Yu Zhao <yuzhao@google.com> Thanks Barry
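To make the ordering above easier to follow, here is an illustrative sketch of the policy Barry describes; it is not code from the linked OPPO kernel. The migratetype names are taken from the description, rmqueue_from() is a made-up placeholder for taking (and, at order 3, splitting) a block from one of the two pools, and the relative order in which the two pools are tried for order 4 is a guess. Per the description, the order-3 path runs only after the normal buddy allocator has already failed.

struct zone;
struct page;

enum { MIGRATE_TRIP_TO_QUAD, MIGRATE_QUAD_TO_TRIP };    /* placeholders */

/* placeholder: take a block of @order from @mt, splitting an order-4 block if needed */
struct page *rmqueue_from(struct zone *zone, int order, int mt);

/* called only after the normal buddy path has failed for @order */
static struct page *alloc_cont_pte_mthp(struct zone *zone, int order)
{
        struct page *page;

        if (order == 4) {               /* the common case: CONT-PTE mTHP */
                page = rmqueue_from(zone, 4, MIGRATE_TRIP_TO_QUAD);
                if (!page)
                        page = rmqueue_from(zone, 4, MIGRATE_QUAD_TO_TRIP);
                return page;
        }

        if (order == 3) {               /* dull the pain of order-3 allocations */
                page = rmqueue_from(zone, 3, MIGRATE_TRIP_TO_QUAD);
                if (!page)              /* last resort: the stricter pool */
                        page = rmqueue_from(zone, 3, MIGRATE_QUAD_TO_TRIP);
                return page;
        }

        return NULL;                    /* order <= 2 never touches these pools */
}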
On 3/5/24 09:41, Barry Song wrote: > We did implement similar idea in the pageblock granularity on OPPO's > phones by extending two special migratetypes[1]: > > * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use > ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain > of 3-order allocation if and only if 3-order allocation has failed in both > normal buddy and the below TRIP_TO_QUAD. > > * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use > ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the > pain of 3-order allocation if and only if 3-order allocation has failed in > normal buddy. > > neither of above will be merged into 5 order or above; neither of above > will be splitted into 2 order or lower. > > in compaction, we will skip both of above. I am seeing one disadvantage > of this approach is that I have to add a separate LRU list in each > zone to place those mTHP folios. if mTHP and small folios are put > in the same LRU list, the reclamation efficiency is extremely bad. > > A separate zone, on the other hand, can avoid a separate LRU list > for mTHP as the new zone has its own LRU list. But we switched from per-zone to per-node LRU lists years ago? Is that actually a complication for the policy zones? Or does this work silently assume multigen lru which (IIRC) works differently? > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c > >> >> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/ >> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ >> >> Signed-off-by: Yu Zhao <yuzhao@google.com> > > Thanks > Barry > >
On Mon, Mar 04, 2024 at 03:19:42PM +0000, Matthew Wilcox wrote: > On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote: > > ZONE_MOVABLE becomes the first policy zone, followed by two new policy > > zones: > > 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from > > ZONE_MOVABLE) and restricted to a minimum order to be > > anti-fragmentation. The latter means that they cannot be split down > > below that order, while they are free or in use. > > 2. ZONE_NOMERGE, which contains pages that are movable and restricted > > to an exact order. The latter means that not only is split > > prohibited (inherited from ZONE_NOSPLIT) but also merge (see the > > reason in Chapter Three), while they are free or in use. > > These two zones end up solving a problem for memdescs. So I'm in favour! > I added Option 5 to https://kernelnewbies.org/MatthewWilcox/BuddyAllocator I realised that we don't even need a doubly linked list for ZONE_NOMERGE (would ZONE_FIXEDSIZE be a better name?). We only need a doubly linked list to make removal from the middle of the list an O(1) operation, and we only remove from the middle of a list when merging. So we can simply keep a stack of free "pages", and we have 60 bits to point to the next memdesc, so we can easily cover all memory that can exist in a 64-bit machine in ZONE_NOMERGE. ZONE_NOSPLIT would be limited to the first 1PB of memory (assuming it has a minimum size of 2MB -- with 29 bits to refer to each of next & prev, 29 + 21 = 50 bits of address space).
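To spell out the data-structure point: in a zone whose blocks all have one fixed order and are never merged, the free list is only ever pushed and popped, so a singly linked stack with an index-sized link suffices. The sketch below is hypothetical and not taken from the patches or the wiki page; it uses a plain array of next-indices as a stand-in for the 60-bit link packed into a memdesc.

struct nomerge_zone {
        unsigned long free_head;        /* index of the first free block; 0 means empty */
        unsigned long *next;            /* per-block next index, stand-in for a memdesc field */
};

static void nomerge_free(struct nomerge_zone *z, unsigned long idx)
{
        z->next[idx] = z->free_head;    /* push: no neighbors to unlink, ever */
        z->free_head = idx;
}

static unsigned long nomerge_alloc(struct nomerge_zone *z)
{
        unsigned long idx = z->free_head;

        if (idx)                        /* pop from the head; the middle is never touched */
                z->free_head = z->next[idx];
        return idx;                     /* 0 means this zone is out of blocks */
}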
On Tue, Mar 5, 2024 at 11:07 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 3/5/24 09:41, Barry Song wrote: > > We did implement similar idea in the pageblock granularity on OPPO's > > phones by extending two special migratetypes[1]: > > > > * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use > > ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain > > of 3-order allocation if and only if 3-order allocation has failed in both > > normal buddy and the below TRIP_TO_QUAD. > > > > * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use > > ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the > > pain of 3-order allocation if and only if 3-order allocation has failed in > > normal buddy. > > > > neither of above will be merged into 5 order or above; neither of above > > will be splitted into 2 order or lower. > > > > in compaction, we will skip both of above. I am seeing one disadvantage > > of this approach is that I have to add a separate LRU list in each > > zone to place those mTHP folios. if mTHP and small folios are put > > in the same LRU list, the reclamation efficiency is extremely bad. > > > > A separate zone, on the other hand, can avoid a separate LRU list > > for mTHP as the new zone has its own LRU list. > > But we switched from per-zone to per-node LRU lists years ago? > Is that actually a complication for the policy zones? Or does this work > silently assume multigen lru which (IIRC) works differently? The latter. Based on the code below, I believe MGLRU is different from active/inactive: void lru_gen_init_lruvec(struct lruvec *lruvec) { int i; int gen, type, zone; struct lru_gen_folio *lrugen = &lruvec->lrugen; lrugen->max_seq = MIN_NR_GENS + 1; lrugen->enabled = lru_gen_enabled(); for (i = 0; i <= MIN_NR_GENS + 1; i++) lrugen->timestamps[i] = jiffies; for_each_gen_type_zone(gen, type, zone) INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]); lruvec->mm_state.seq = MIN_NR_GENS; } A fundamental difference is that MGLRU has a different aging and eviction mechanism. This can synchronize the LRUs of each zone to move forward at the same pace, while active/inactive might be unable to compare the ages of folios across zones. > > > > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplus/sm8650_u_14.0.0_oneplus12/mm/page_alloc.c > > > >> > >> [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/ > >> [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/ > >> > >> Signed-off-by: Yu Zhao <yuzhao@google.com> > > Thanks Barry
On Tue, Mar 5, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 11:07 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > On 3/5/24 09:41, Barry Song wrote: > > > We did implement similar idea in the pageblock granularity on OPPO's > > > phones by extending two special migratetypes[1]: > > > > > > * QUAD_TO_TRIP - this is mainly for 4-order mTHP allocation which can use > > > ARM64's CONT-PTE; but can rarely be splitted into 3 order to dull the pain > > > of 3-order allocation if and only if 3-order allocation has failed in both > > > normal buddy and the below TRIP_TO_QUAD. > > > > > > * TRIP_TO_QUAD - this is mainly for 4-order mTHP allocation which can use > > > ARM64's CONT-PTE; but can sometimes be splitted into 3 order to dull the > > > pain of 3-order allocation if and only if 3-order allocation has failed in > > > normal buddy. > > > > > > neither of above will be merged into 5 order or above; neither of above > > > will be splitted into 2 order or lower. > > > > > > in compaction, we will skip both of above. I am seeing one disadvantage > > > of this approach is that I have to add a separate LRU list in each > > > zone to place those mTHP folios. if mTHP and small folios are put > > > in the same LRU list, the reclamation efficiency is extremely bad. > > > > > > A separate zone, on the other hand, can avoid a separate LRU list > > > for mTHP as the new zone has its own LRU list. > > > > But we switched from per-zone to per-node LRU lists years ago? > > Is that actually a complication for the policy zones? Or does this work > > silently assume multigen lru which (IIRC) works differently? > > the latter. based on the below code, i believe mglru is different > with active/inactive, > > void lru_gen_init_lruvec(struct lruvec *lruvec) > { > int i; > int gen, type, zone; > struct lru_gen_folio *lrugen = &lruvec->lrugen; > > lrugen->max_seq = MIN_NR_GENS + 1; > lrugen->enabled = lru_gen_enabled(); > > for (i = 0; i <= MIN_NR_GENS + 1; i++) > lrugen->timestamps[i] = jiffies; > > for_each_gen_type_zone(gen, type, zone) > INIT_LIST_HEAD(&lrugen->folios[gen][type][zone]); > > lruvec->mm_state.seq = MIN_NR_GENS; > } > > A fundamental difference is that mglru has a different aging and > eviction mechanism, > This can synchronize the LRUs of each zone to move > forward at the same pace while > the active/inactive might be unable to compare the ages of folios across zones. That's correct. The active/inactive should also work with the extra zones, just like it does for ZONE_MOVABLE. But it's not as optimized as MGLRU, e.g., targeting eligible zones without searching the entire LRU list containing folios from all zones.
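The optimization Yu refers to follows from the indexing visible in the quoted lru_gen_init_lruvec(): MGLRU keeps per-zone list heads inside each generation, so eviction can walk only the zone bins that are eligible for the current reclaim target. The sketch below is a simplification written for this discussion, not a copy of the kernel's scan path; a real scan inspects folios one by one instead of splicing whole bins.

static void scan_eligible_zones(struct lru_gen_folio *lrugen, int gen, int type,
                                int reclaim_idx, struct list_head *evict)
{
        int zone;

        /* only the bins for zones the allocation can actually use */
        for (zone = reclaim_idx; zone >= 0; zone--)
                list_splice_init(&lrugen->folios[gen][type][zone], evict);
}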
On Thu, Feb 29, 2024 at 3:28 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote: > > Compared with the hugeTLB pool approach, THP zones tap into core MM > > features including: > > 1. THP allocations can fall back to the lower zones, which can have > > higher latency but still succeed. > > 2. THPs can be either shattered (see Chapter Two) if partially > > unmapped or reclaimed if becoming cold. > > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB > > contiguous PTEs on arm64 [1], which are more suitable for client > > workloads. > > Can this mechanism be used to fully replace the hugetlb pool approach? > That would be a major selling point. It kind of feels like it should, > but I am insufficiently expert to be certain. This depends on the return value from htlb_alloc_mask(): if it's GFP_HIGHUSER_MOVABLE, then yes (i.e., 2MB hugeTLB folios on x86). Hypothetically, if users can have THPs as reliable as hugeTLB can offer, wouldn't most users want to go with the former since it's more flexible? E.g., core MM features like split (shattering) and reclaim in addition to HVO. > I'll read over the patches sometime soon. There's a lot to go through. > Something I didn't see in the cover letter or commit messages was any > discussion of page->flags and how many bits we use for ZONE (particularly > on 32-bit). Perhaps I'll discover the answer to that as I read. There may be corner cases because of how different architectures use page->flags, but in general, this shouldn't be a big problem because we can have 6 zones (at most) before this series, and after this series, we can have 8 (at most). IOW, we need 3 bits regardless, in order to encode all existing zones.
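For readers who want to see where that condition comes from: htlb_alloc_mask() chooses the gfp mask per hugeTLB page size, roughly as below (a paraphrase of include/linux/hugetlb.h; the exact form varies across kernel versions). Page sizes for which hugepage_movable_supported() returns true get GFP_HIGHUSER_MOVABLE and could therefore be served from the THP zones; gigantic pages may not qualify, depending on the configuration.

/* Paraphrased, not copied verbatim; details vary by kernel version. */
static inline gfp_t htlb_alloc_mask(struct hstate *h)
{
        if (hugepage_movable_supported(h))
                return GFP_HIGHUSER_MOVABLE;

        return GFP_HIGHUSER;
}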
On Tue, Mar 05, 2024 at 10:51:20PM -0500, Yu Zhao wrote: > On Thu, Feb 29, 2024 at 3:28 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 29, 2024 at 11:34:33AM -0700, Yu Zhao wrote: > > > Compared with the hugeTLB pool approach, THP zones tap into core MM > > > features including: > > > 1. THP allocations can fall back to the lower zones, which can have > > > higher latency but still succeed. > > > 2. THPs can be either shattered (see Chapter Two) if partially > > > unmapped or reclaimed if becoming cold. > > > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB > > > contiguous PTEs on arm64 [1], which are more suitable for client > > > workloads. > > > > Can this mechanism be used to fully replace the hugetlb pool approach? > > That would be a major selling point. It kind of feels like it should, > > but I am insufficiently expert to be certain. > > This depends on the return value from htlb_alloc_mask(): if it's > GFP_HIGHUSER_MOVABLE, then yes (i.e., 2MB hugeTLB folios on x86). > Hypothetically, if users can have THPs as reliable as hugeTLB can > offer, wouldn't most users want to go with the former since it's more > flexible? E.g., core MM features like split (shattering) and reclaim > in addition to HVO. Right; the real question is what can we do to unify hugetlbfs and THPs. The reservation ability is one feature that hugetlbfs has over THP and removing that advantage gets us one step closer. > > I'll read over the patches sometime soon. There's a lot to go through. > > Something I didn't see in the cover letter or commit messages was any > > discussion of page->flags and how many bits we use for ZONE (particularly > > on 32-bit). Perhaps I'll discover the answer to that as I read. > > There may be corner cases because of how different architectures use > page->flags, but in general, this shouldn't be a big problem because > we can have 6 zones (at most) before this series, and after this > series, we can have 8 (at most). IOW, we need 3 bits regardless, in > order to all existing zones. On a 32-bit system, we'll typically only have four; DMA, NORMAL, HIGHMEM and MOVABLE. DMA32 will be skipped since it would match NORMAL, and DEVICE is just not supported on 32-bit.
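To close the loop on the page->flags question: the number of bits reserved for the zone index is derived from MAX_NR_ZONES by a small ladder along these lines (paraphrased from include/linux/page-flags-layout.h; it is not part of this series). Going from at most 6 zones to at most 8 therefore stays at 3 bits, but a 32-bit configuration that today has only 4 zones and 2 bits would be pushed to 3 bits once the two new zones are added unconditionally, which is presumably why the CONFIG_64BIT question comes up.

/* Paraphrased from include/linux/page-flags-layout.h, not from this series. */
#if MAX_NR_ZONES < 2
#define ZONES_SHIFT 0
#elif MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#elif MAX_NR_ZONES <= 8
#define ZONES_SHIFT 3
#else
#error ZONES_SHIFT "Too many zones configured"
#endif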
> There are three types of zones: > 1. The first four zones partition the physical address space of CPU > memory. > 2. The device zone provides interoperability between CPU and device > memory. > 3. The movable zone commonly represents a memory allocation policy. > > + > +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn) > +{ > + int i, nid; > + unsigned long node_avg, remaining; Hi Yu, I discovered that CMA can be part of virtual zones. For example: Node 0, zone NoMerge pages free 35945 nr_free_pages 35945 ... nr_free_cma 8128 pagesets CMA used to be available for order-0 anonymous allocations, and the Android kernel even prioritized it with commit [1] "ANDROID: cma: redirect page allocation to CMA" /* * Used during anonymous page fault handling. */ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, unsigned long vaddr) { gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO | __GFP_CMA; /* * If the page is mapped with PROT_MTE, initialise the tags at the * point of allocation and page zeroing as this is usually faster than * separate DC ZVA and STGM. */ if (vma->vm_flags & VM_MTE) flags |= __GFP_ZEROTAGS; return vma_alloc_folio(flags, 0, vma, vaddr, false); } I wonder if CMA is still available to order-0 allocations when it is located in the nomerge/nosplit zone. And when dma_alloc_coherent() or similar APIs want to get contiguous memory from CMA, is that still as easy as before if CMA is part of the virt zones? [1] https://android.googlesource.com/kernel/common/+/1c8aebe4c072bf18409cc78fc84407e24a437302 Thanks Barry
Hi Yu, On 3/1/2024 12:04 AM, Yu Zhao wrote: > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index de292a007138..c0f9d21b4d18 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) > * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. > */ > > -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 > -/* ZONE_DEVICE is not a valid GFP zone specifier */ > +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4 > +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ > #define GFP_ZONES_SHIFT 2 > #else > #define GFP_ZONES_SHIFT ZONES_SHIFT > @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) > z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & > ((1 << GFP_ZONES_SHIFT) - 1); > VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); > + > + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP)) > + return LAST_VIRT_ZONE; > + Not sure if someone has already reported this: With this patch, we allow pages to be allocated from the movable zone (through fallback from LAST_VIRT_ZONE) even without __GFP_HIGHMEM. Commit cc09cb134124a ("mm/page_alloc: Add folio allocation functions") sets __GFP_COMP by default, so the user just has to pass __GFP_MOVABLE. Please CMIW. Thanks, Charan
On Thu, Oct 31, 2024 at 8:35 PM Charan Teja Kalla <quic_charante@quicinc.com> wrote: > > Hi Yu, > > On 3/1/2024 12:04 AM, Yu Zhao wrote: > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > > index de292a007138..c0f9d21b4d18 100644 > > --- a/include/linux/gfp.h > > +++ b/include/linux/gfp.h > > @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) > > * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. > > */ > > > > -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 > > -/* ZONE_DEVICE is not a valid GFP zone specifier */ > > +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4 > > +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ > > #define GFP_ZONES_SHIFT 2 > > #else > > #define GFP_ZONES_SHIFT ZONES_SHIFT > > @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) > > z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & > > ((1 << GFP_ZONES_SHIFT) - 1); > > VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); > > + > > + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP)) > > + return LAST_VIRT_ZONE; > > + > Not sure If someone had already reported this: With this patch, we allow > pages to allocate from movable zone(through fallback from > LAST_VIRT_ZONE) even with out __GFP_HIGHMEM. The commit cc09cb134124a > ("mm/page_alloc: Add folio allocation functions") sets the __GFP_COMP by > default and user has just to pass the __GFP_MOVABLE. Please CMIW. Hi Charan, I don't remember whether we have this fixed in the Android kernel off the top of my head -- I'll ask Kalesh to take a closer look and follow up with you. Thanks!
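A condensed model of the interaction Charan reports, for readers skimming the hunk above: the new check fires on the (__GFP_MOVABLE | __GFP_COMP) combination alone, and since the folio allocation helpers pass __GFP_COMP on the caller's behalf, a movable folio allocation without __GFP_HIGHMEM is also steered toward LAST_VIRT_ZONE and can then fall back into the movable zone. The snippet below is a standalone restatement with made-up flag values, not kernel code.

#include <stdio.h>

#define SK_GFP_MOVABLE  0x1u    /* stand-ins for the real gfp bits */
#define SK_GFP_COMP     0x2u
#define SK_GFP_HIGHMEM  0x4u

enum sk_zone { SK_ZONE_NORMAL, SK_ZONE_HIGHMEM, SK_LAST_VIRT_ZONE };

static enum sk_zone sk_gfp_zone(unsigned int flags)
{
        /* mirrors the quoted hunk: __GFP_HIGHMEM is not consulted at all */
        if ((flags & (SK_GFP_MOVABLE | SK_GFP_COMP)) == (SK_GFP_MOVABLE | SK_GFP_COMP))
                return SK_LAST_VIRT_ZONE;
        return (flags & SK_GFP_HIGHMEM) ? SK_ZONE_HIGHMEM : SK_ZONE_NORMAL;
}

int main(void)
{
        /* folio allocation adds the COMP bit, so MOVABLE alone lands here too */
        printf("%d\n", sk_gfp_zone(SK_GFP_MOVABLE | SK_GFP_COMP));      /* 2 = SK_LAST_VIRT_ZONE */
        printf("%d\n", sk_gfp_zone(SK_GFP_MOVABLE | SK_GFP_HIGHMEM));   /* 1 = SK_ZONE_HIGHMEM */
        return 0;
}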
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 31b3a25680d0..a6c181f6efde 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3529,6 +3529,16 @@ allocations which rules out almost all kernel allocations. Use with caution! + nosplit=X,Y [MM] Set the minimum order of the nosplit zone. Pages in + this zone can't be split down below order Y, while free + or in use. + Like movablecore, X should be either nn[KMGTPE] or n%. + + nomerge=X,Y [MM] Set the exact orders of the nomerge zone. Pages in + this zone are always order Y, meaning they can't be + split or merged while free or in use. + Like movablecore, X should be either nn[KMGTPE] or n%. + MTD_Partition= [MTD] Format: <name>,<region-number>,<size>,<offset> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 8e3223294442..37ecf5ee4afd 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -2228,7 +2228,7 @@ static bool virtio_mem_bbm_bb_is_movable(struct virtio_mem *vm, page = pfn_to_online_page(pfn); if (!page) continue; - if (page_zonenum(page) != ZONE_MOVABLE) + if (!is_zone_movable_page(page)) return false; } diff --git a/include/linux/gfp.h b/include/linux/gfp.h index de292a007138..c0f9d21b4d18 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -88,8 +88,8 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags) * GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms. */ -#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4 -/* ZONE_DEVICE is not a valid GFP zone specifier */ +#if MAX_NR_ZONES - 2 - IS_ENABLED(CONFIG_ZONE_DEVICE) <= 4 +/* zones beyond ZONE_MOVABLE are not valid GFP zone specifiers */ #define GFP_ZONES_SHIFT 2 #else #define GFP_ZONES_SHIFT ZONES_SHIFT @@ -135,9 +135,29 @@ static inline enum zone_type gfp_zone(gfp_t flags) z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) & ((1 << GFP_ZONES_SHIFT) - 1); VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1); + + if ((flags & (__GFP_MOVABLE | __GFP_COMP)) == (__GFP_MOVABLE | __GFP_COMP)) + return LAST_VIRT_ZONE; + return z; } +extern int zone_nomerge_order __read_mostly; +extern int zone_nosplit_order __read_mostly; + +static inline enum zone_type gfp_order_zone(gfp_t flags, int order) +{ + enum zone_type zid = gfp_zone(flags); + + if (zid >= ZONE_NOMERGE && order != zone_nomerge_order) + zid = ZONE_NOMERGE - 1; + + if (zid >= ZONE_NOSPLIT && order < zone_nosplit_order) + zid = ZONE_NOSPLIT - 1; + + return zid; +} + /* * There is only one page-allocator function, and two main namespaces to * it. 
The alloc_page*() variants return 'struct page *' and as such diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 5adb86af35fc..9960ad7c3b10 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -264,7 +264,6 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); void folio_prep_large_rmappable(struct folio *folio); -bool can_split_folio(struct folio *folio, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); static inline int split_huge_page(struct page *page) { @@ -416,11 +415,6 @@ static inline void folio_prep_large_rmappable(struct folio *folio) {} #define thp_get_unmapped_area NULL -static inline bool -can_split_folio(struct folio *folio, int *pextra_pins) -{ - return false; -} static inline int split_huge_page_to_list(struct page *page, struct list_head *list) { diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 931b118336f4..a92bcf47cf8c 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -150,7 +150,7 @@ extern enum zone_type policy_zone; static inline void check_highest_zone(enum zone_type k) { - if (k > policy_zone && k != ZONE_MOVABLE) + if (k > policy_zone && !zid_is_virt(k)) policy_zone = k; } diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a497f189d988..532218167bba 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -805,11 +805,15 @@ enum zone_type { * there can be false negatives). */ ZONE_MOVABLE, + ZONE_NOSPLIT, + ZONE_NOMERGE, #ifdef CONFIG_ZONE_DEVICE ZONE_DEVICE, #endif - __MAX_NR_ZONES + __MAX_NR_ZONES, + LAST_PHYS_ZONE = ZONE_MOVABLE - 1, + LAST_VIRT_ZONE = ZONE_NOMERGE, }; #ifndef __GENERATING_BOUNDS_H @@ -929,6 +933,8 @@ struct zone { seqlock_t span_seqlock; #endif + int order; + int initialized; /* Write-intensive fields used from the page allocator */ @@ -1147,12 +1153,22 @@ static inline bool folio_is_zone_device(const struct folio *folio) static inline bool is_zone_movable_page(const struct page *page) { - return page_zonenum(page) == ZONE_MOVABLE; + return page_zonenum(page) >= ZONE_MOVABLE; } static inline bool folio_is_zone_movable(const struct folio *folio) { - return folio_zonenum(folio) == ZONE_MOVABLE; + return folio_zonenum(folio) >= ZONE_MOVABLE; +} + +static inline bool page_can_split(struct page *page) +{ + return page_zonenum(page) < ZONE_NOSPLIT; +} + +static inline bool folio_can_split(struct folio *folio) +{ + return folio_zonenum(folio) < ZONE_NOSPLIT; } #endif @@ -1469,6 +1485,32 @@ static inline int local_memory_node(int node_id) { return node_id; }; */ #define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline bool zid_is_virt(enum zone_type zid) +{ + return zid > LAST_PHYS_ZONE && zid <= LAST_VIRT_ZONE; +} + +static inline bool zone_can_frag(struct zone *zone) +{ + VM_WARN_ON_ONCE(zone->order && zone_idx(zone) < ZONE_NOSPLIT); + + return zone_idx(zone) < ZONE_NOSPLIT; +} + +static inline bool zone_is_suitable(struct zone *zone, int order) +{ + int zid = zone_idx(zone); + + if (zid < ZONE_NOSPLIT) + return true; + + if (!zone->order) + return false; + + return (zid == ZONE_NOSPLIT && order >= zone->order) || + (zid == ZONE_NOMERGE && order == zone->order); +} + #ifdef CONFIG_ZONE_DEVICE static inline bool zone_is_zone_device(struct zone *zone) { @@ -1517,13 +1559,13 @@ static inline int zone_to_nid(struct zone *zone) static inline void zone_set_nid(struct zone *zone, int nid) {} #endif 
-extern int movable_zone; +extern int virt_zone; static inline int is_highmem_idx(enum zone_type idx) { #ifdef CONFIG_HIGHMEM return (idx == ZONE_HIGHMEM || - (idx == ZONE_MOVABLE && movable_zone == ZONE_HIGHMEM)); + (zid_is_virt(idx) && virt_zone == ZONE_HIGHMEM)); #else return 0; #endif diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index b61438313a73..34fbe910576d 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -404,7 +404,7 @@ enum node_states { #else N_HIGH_MEMORY = N_NORMAL_MEMORY, #endif - N_MEMORY, /* The node has memory(regular, high, movable) */ + N_MEMORY, /* The node has memory in any of the zones */ N_CPU, /* The node has one or more cpus */ N_GENERIC_INITIATOR, /* The node has one or more Generic Initiators */ NR_NODE_STATES diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 747943bc8cc2..9a54d15d5ec3 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -27,7 +27,7 @@ #endif #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \ - HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx) + HIGHMEM_ZONE(xx) xx##_MOVABLE, xx##_NOSPLIT, xx##_NOMERGE, DEVICE_ZONE(xx) enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, FOR_ALL_ZONES(PGALLOC) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index d801409b33cf..2b5fdafaadea 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -265,7 +265,9 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \ IFDEF_ZONE_DMA32( EM (ZONE_DMA32, "DMA32")) \ EM (ZONE_NORMAL, "Normal") \ IFDEF_ZONE_HIGHMEM( EM (ZONE_HIGHMEM,"HighMem")) \ - EMe(ZONE_MOVABLE,"Movable") + EM (ZONE_MOVABLE,"Movable") \ + EM (ZONE_NOSPLIT,"NoSplit") \ + EMe(ZONE_NOMERGE,"NoMerge") #define LRU_NAMES \ EM (LRU_INACTIVE_ANON, "inactive_anon") \ diff --git a/mm/compaction.c b/mm/compaction.c index 4add68d40e8d..8a64c805f411 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2742,6 +2742,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, ac->highest_zoneidx, ac->nodemask) { enum compact_result status; + if (!zone_can_frag(zone)) + continue; + if (prio > MIN_COMPACT_PRIORITY && compaction_deferred(zone, order)) { rc = max_t(enum compact_result, COMPACT_DEFERRED, rc); @@ -2814,6 +2817,9 @@ static void proactive_compact_node(pg_data_t *pgdat) if (!populated_zone(zone)) continue; + if (!zone_can_frag(zone)) + continue; + cc.zone = zone; compact_zone(&cc, NULL); @@ -2846,6 +2852,9 @@ static void compact_node(int nid) if (!populated_zone(zone)) continue; + if (!zone_can_frag(zone)) + continue; + cc.zone = zone; compact_zone(&cc, NULL); @@ -2960,6 +2969,9 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat) if (!populated_zone(zone)) continue; + if (!zone_can_frag(zone)) + continue; + ret = compaction_suit_allocation_order(zone, pgdat->kcompactd_max_order, highest_zoneidx, ALLOC_WMARK_MIN); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 94c958f7ebb5..b57faa0a1e83 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2941,10 +2941,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, } /* Racy check whether the huge page can be split */ -bool can_split_folio(struct folio *folio, int *pextra_pins) +static bool can_split_folio(struct folio *folio, int *pextra_pins) { int extra_pins; + if (!folio_can_split(folio)) + return false; + /* Additional pins from page cache */ if (folio_test_anon(folio)) extra_pins = folio_test_swapcache(folio) ? 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 10a590ee1c89..1f84dd759086 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1807,22 +1807,20 @@ bool vma_policy_mof(struct vm_area_struct *vma) bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) { - enum zone_type dynamic_policy_zone = policy_zone; - - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); + WARN_ON_ONCE(zid_is_virt(policy_zone)); /* - * if policy->nodes has movable memory only, - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. + * If policy->nodes has memory in virtual zones only, we apply policy + * only if gfp_zone(gfp) can allocate from those zones. * * policy->nodes is intersect with node_states[N_MEMORY]. * so if the following test fails, it implies - * policy->nodes has movable memory only. + * policy->nodes has memory in virtual zones only. */ if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) - dynamic_policy_zone = ZONE_MOVABLE; + return zone > LAST_PHYS_ZONE; - return zone >= dynamic_policy_zone; + return zone >= policy_zone; } /* Do dynamic interleaving for a process */ diff --git a/mm/migrate.c b/mm/migrate.c index cc9f2bcd73b4..f615c0c22046 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1480,6 +1480,9 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f { int rc; + if (!folio_can_split(folio)) + return -EBUSY; + folio_lock(folio); rc = split_folio_to_list(folio, split_folios); folio_unlock(folio); @@ -2032,7 +2035,7 @@ struct folio *alloc_migration_target(struct folio *src, unsigned long private) order = folio_order(src); } zidx = zone_idx(folio_zone(src)); - if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE) + if (zidx > ZONE_NORMAL) gfp_mask |= __GFP_HIGHMEM; return __folio_alloc(gfp_mask, order, nid, mtc->nmask); @@ -2520,7 +2523,7 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio) break; } wakeup_kswapd(pgdat->node_zones + z, 0, - folio_order(folio), ZONE_MOVABLE); + folio_order(folio), z); return 0; } diff --git a/mm/mm_init.c b/mm/mm_init.c index 2c19f5515e36..7769c21e6d54 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -217,12 +217,18 @@ postcore_initcall(mm_sysfs_init); static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __initdata; static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __initdata; -static unsigned long zone_movable_pfn[MAX_NUMNODES] __initdata; -static unsigned long required_kernelcore __initdata; -static unsigned long required_kernelcore_percent __initdata; -static unsigned long required_movablecore __initdata; -static unsigned long required_movablecore_percent __initdata; +static unsigned long virt_zones[LAST_VIRT_ZONE - LAST_PHYS_ZONE][MAX_NUMNODES] __initdata; +#define pfn_of(zid, nid) (virt_zones[(zid) - LAST_PHYS_ZONE - 1][nid]) + +static unsigned long zone_nr_pages[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata; +#define nr_pages_of(zid) (zone_nr_pages[(zid) - LAST_PHYS_ZONE]) + +static unsigned long zone_percentage[LAST_VIRT_ZONE - LAST_PHYS_ZONE + 1] __initdata; +#define percentage_of(zid) (zone_percentage[(zid) - LAST_PHYS_ZONE]) + +int zone_nosplit_order __read_mostly; +int zone_nomerge_order __read_mostly; static unsigned long nr_kernel_pages __initdata; static unsigned long nr_all_pages __initdata; @@ -273,8 +279,8 @@ static int __init cmdline_parse_kernelcore(char *p) return 0; } - return cmdline_parse_core(p, &required_kernelcore, - &required_kernelcore_percent); + return cmdline_parse_core(p, &nr_pages_of(LAST_PHYS_ZONE), + 
&percentage_of(LAST_PHYS_ZONE)); } early_param("kernelcore", cmdline_parse_kernelcore); @@ -284,14 +290,56 @@ early_param("kernelcore", cmdline_parse_kernelcore); */ static int __init cmdline_parse_movablecore(char *p) { - return cmdline_parse_core(p, &required_movablecore, - &required_movablecore_percent); + return cmdline_parse_core(p, &nr_pages_of(ZONE_MOVABLE), + &percentage_of(ZONE_MOVABLE)); } early_param("movablecore", cmdline_parse_movablecore); +static int __init parse_zone_order(char *p, unsigned long *nr_pages, + unsigned long *percent, int *order) +{ + int err; + unsigned long n; + char *s = strchr(p, ','); + + if (!s) + return -EINVAL; + + *s++ = '\0'; + + err = kstrtoul(s, 0, &n); + if (err) + return err; + + if (n < 2 || n > MAX_PAGE_ORDER) + return -EINVAL; + + err = cmdline_parse_core(p, nr_pages, percent); + if (err) + return err; + + *order = n; + + return 0; +} + +static int __init parse_zone_nosplit(char *p) +{ + return parse_zone_order(p, &nr_pages_of(ZONE_NOSPLIT), + &percentage_of(ZONE_NOSPLIT), &zone_nosplit_order); +} +early_param("nosplit", parse_zone_nosplit); + +static int __init parse_zone_nomerge(char *p) +{ + return parse_zone_order(p, &nr_pages_of(ZONE_NOMERGE), + &percentage_of(ZONE_NOMERGE), &zone_nomerge_order); +} +early_param("nomerge", parse_zone_nomerge); + /* * early_calculate_totalpages() - * Sum pages in active regions for movable zone. + * Sum pages in active regions for virtual zones. * Populate N_MEMORY for calculating usable_nodes. */ static unsigned long __init early_calculate_totalpages(void) @@ -311,24 +359,110 @@ static unsigned long __init early_calculate_totalpages(void) } /* - * This finds a zone that can be used for ZONE_MOVABLE pages. The + * This finds a physical zone that can be used for virtual zones. The * assumption is made that zones within a node are ordered in monotonic * increasing memory addresses so that the "highest" populated zone is used */ -static void __init find_usable_zone_for_movable(void) +static void __init find_usable_zone(void) { int zone_index; - for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) { - if (zone_index == ZONE_MOVABLE) - continue; - + for (zone_index = LAST_PHYS_ZONE; zone_index >= 0; zone_index--) { if (arch_zone_highest_possible_pfn[zone_index] > arch_zone_lowest_possible_pfn[zone_index]) break; } VM_BUG_ON(zone_index == -1); - movable_zone = zone_index; + virt_zone = zone_index; +} + +static void __init find_virt_zone(unsigned long occupied, unsigned long *zone_pfn) +{ + int i, nid; + unsigned long node_avg, remaining; + int usable_nodes = nodes_weight(node_states[N_MEMORY]); + /* usable_startpfn is the lowest possible pfn virtual zones can be at */ + unsigned long usable_startpfn = arch_zone_lowest_possible_pfn[virt_zone]; + +restart: + /* Carve out memory as evenly as possible throughout nodes */ + node_avg = occupied / usable_nodes; + for_each_node_state(nid, N_MEMORY) { + unsigned long start_pfn, end_pfn; + + /* + * Recalculate node_avg if the division per node now exceeds + * what is necessary to satisfy the amount of memory to carve + * out. + */ + if (occupied < node_avg) + node_avg = occupied / usable_nodes; + + /* + * As the map is walked, we track how much memory is usable + * using remaining. When it is 0, the rest of the node is + * usable. 
+ */ + remaining = node_avg; + + /* Go through each range of PFNs within this node */ + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { + unsigned long size_pages; + + start_pfn = max(start_pfn, zone_pfn[nid]); + if (start_pfn >= end_pfn) + continue; + + /* Account for what is only usable when carving out */ + if (start_pfn < usable_startpfn) { + unsigned long nr_pages = min(end_pfn, usable_startpfn) - start_pfn; + + remaining -= min(nr_pages, remaining); + occupied -= min(nr_pages, occupied); + + /* Continue if range is now fully accounted */ + if (end_pfn <= usable_startpfn) { + + /* + * Push zone_pfn to the end so that if + * we have to carve out more across + * nodes, we will not double account + * here. + */ + zone_pfn[nid] = end_pfn; + continue; + } + start_pfn = usable_startpfn; + } + + /* + * The usable PFN range is from start_pfn->end_pfn. + * Calculate size_pages as the number of pages used. + */ + size_pages = end_pfn - start_pfn; + if (size_pages > remaining) + size_pages = remaining; + zone_pfn[nid] = start_pfn + size_pages; + + /* + * Some memory was carved out, update counts and break + * if the request for this node has been satisfied. + */ + occupied -= min(occupied, size_pages); + remaining -= size_pages; + if (!remaining) + break; + } + } + + /* + * If there is still more to carve out, we do another pass with one less + * node in the count. This will push zone_pfn[nid] further along on the + * nodes that still have memory until the request is fully satisfied. + */ + usable_nodes--; + if (usable_nodes && occupied > usable_nodes) + goto restart; } /* @@ -337,19 +471,19 @@ static void __init find_usable_zone_for_movable(void) * memory. When they don't, some nodes will have more kernelcore than * others */ -static void __init find_zone_movable_pfns_for_nodes(void) +static void __init find_virt_zones(void) { - int i, nid; + int i; + int nid; unsigned long usable_startpfn; - unsigned long kernelcore_node, kernelcore_remaining; /* save the state before borrow the nodemask */ nodemask_t saved_node_state = node_states[N_MEMORY]; unsigned long totalpages = early_calculate_totalpages(); - int usable_nodes = nodes_weight(node_states[N_MEMORY]); struct memblock_region *r; + unsigned long occupied = 0; - /* Need to find movable_zone earlier when movable_node is specified. */ - find_usable_zone_for_movable(); + /* Need to find virt_zone earlier when movable_node is specified. */ + find_usable_zone(); /* * If movable_node is specified, ignore kernelcore and movablecore @@ -363,8 +497,8 @@ static void __init find_zone_movable_pfns_for_nodes(void) nid = memblock_get_region_node(r); usable_startpfn = PFN_DOWN(r->base); - zone_movable_pfn[nid] = zone_movable_pfn[nid] ? - min(usable_startpfn, zone_movable_pfn[nid]) : + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ? + min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) : usable_startpfn; } @@ -400,8 +534,8 @@ static void __init find_zone_movable_pfns_for_nodes(void) continue; } - zone_movable_pfn[nid] = zone_movable_pfn[nid] ? - min(usable_startpfn, zone_movable_pfn[nid]) : + pfn_of(ZONE_MOVABLE, nid) = pfn_of(ZONE_MOVABLE, nid) ? 
+ min(usable_startpfn, pfn_of(ZONE_MOVABLE, nid)) : usable_startpfn; } @@ -411,151 +545,76 @@ static void __init find_zone_movable_pfns_for_nodes(void) goto out2; } + if (zone_nomerge_order && zone_nomerge_order <= zone_nosplit_order) { + nr_pages_of(ZONE_NOSPLIT) = nr_pages_of(ZONE_NOMERGE) = 0; + percentage_of(ZONE_NOSPLIT) = percentage_of(ZONE_NOMERGE) = 0; + zone_nosplit_order = zone_nomerge_order = 0; + pr_warn("zone %s order %d must be higher zone %s order %d\n", + zone_names[ZONE_NOMERGE], zone_nomerge_order, + zone_names[ZONE_NOSPLIT], zone_nosplit_order); + } + /* * If kernelcore=nn% or movablecore=nn% was specified, calculate the * amount of necessary memory. */ - if (required_kernelcore_percent) - required_kernelcore = (totalpages * 100 * required_kernelcore_percent) / - 10000UL; - if (required_movablecore_percent) - required_movablecore = (totalpages * 100 * required_movablecore_percent) / - 10000UL; + for (i = LAST_PHYS_ZONE; i <= LAST_VIRT_ZONE; i++) { + if (percentage_of(i)) + nr_pages_of(i) = totalpages * percentage_of(i) / 100; + + nr_pages_of(i) = roundup(nr_pages_of(i), MAX_ORDER_NR_PAGES); + occupied += nr_pages_of(i); + } /* * If movablecore= was specified, calculate what size of * kernelcore that corresponds so that memory usable for * any allocation type is evenly spread. If both kernelcore * and movablecore are specified, then the value of kernelcore - * will be used for required_kernelcore if it's greater than - * what movablecore would have allowed. + * will be used if it's greater than what movablecore would have + * allowed. */ - if (required_movablecore) { - unsigned long corepages; + if (occupied < totalpages) { + enum zone_type zid; - /* - * Round-up so that ZONE_MOVABLE is at least as large as what - * was requested by the user - */ - required_movablecore = - roundup(required_movablecore, MAX_ORDER_NR_PAGES); - required_movablecore = min(totalpages, required_movablecore); - corepages = totalpages - required_movablecore; - - required_kernelcore = max(required_kernelcore, corepages); + zid = !nr_pages_of(LAST_PHYS_ZONE) || nr_pages_of(ZONE_MOVABLE) ? + LAST_PHYS_ZONE : ZONE_MOVABLE; + nr_pages_of(zid) += totalpages - occupied; } /* * If kernelcore was not specified or kernelcore size is larger - * than totalpages, there is no ZONE_MOVABLE. + * than totalpages, there are not virtual zones. */ - if (!required_kernelcore || required_kernelcore >= totalpages) + occupied = nr_pages_of(LAST_PHYS_ZONE); + if (!occupied || occupied >= totalpages) goto out; - /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */ - usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone]; + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) { + if (!nr_pages_of(i)) + continue; -restart: - /* Spread kernelcore memory as evenly as possible throughout nodes */ - kernelcore_node = required_kernelcore / usable_nodes; - for_each_node_state(nid, N_MEMORY) { - unsigned long start_pfn, end_pfn; - - /* - * Recalculate kernelcore_node if the division per node - * now exceeds what is necessary to satisfy the requested - * amount of memory for the kernel - */ - if (required_kernelcore < kernelcore_node) - kernelcore_node = required_kernelcore / usable_nodes; - - /* - * As the map is walked, we track how much memory is usable - * by the kernel using kernelcore_remaining. 
When it is - * 0, the rest of the node is usable by ZONE_MOVABLE - */ - kernelcore_remaining = kernelcore_node; - - /* Go through each range of PFNs within this node */ - for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) { - unsigned long size_pages; - - start_pfn = max(start_pfn, zone_movable_pfn[nid]); - if (start_pfn >= end_pfn) - continue; - - /* Account for what is only usable for kernelcore */ - if (start_pfn < usable_startpfn) { - unsigned long kernel_pages; - kernel_pages = min(end_pfn, usable_startpfn) - - start_pfn; - - kernelcore_remaining -= min(kernel_pages, - kernelcore_remaining); - required_kernelcore -= min(kernel_pages, - required_kernelcore); - - /* Continue if range is now fully accounted */ - if (end_pfn <= usable_startpfn) { - - /* - * Push zone_movable_pfn to the end so - * that if we have to rebalance - * kernelcore across nodes, we will - * not double account here - */ - zone_movable_pfn[nid] = end_pfn; - continue; - } - start_pfn = usable_startpfn; - } - - /* - * The usable PFN range for ZONE_MOVABLE is from - * start_pfn->end_pfn. Calculate size_pages as the - * number of pages used as kernelcore - */ - size_pages = end_pfn - start_pfn; - if (size_pages > kernelcore_remaining) - size_pages = kernelcore_remaining; - zone_movable_pfn[nid] = start_pfn + size_pages; - - /* - * Some kernelcore has been met, update counts and - * break if the kernelcore for this node has been - * satisfied - */ - required_kernelcore -= min(required_kernelcore, - size_pages); - kernelcore_remaining -= size_pages; - if (!kernelcore_remaining) - break; - } + find_virt_zone(occupied, &pfn_of(i, 0)); + occupied += nr_pages_of(i); } - - /* - * If there is still required_kernelcore, we do another pass with one - * less node in the count. This will push zone_movable_pfn[nid] further - * along on the nodes that still have memory until kernelcore is - * satisfied - */ - usable_nodes--; - if (usable_nodes && required_kernelcore > usable_nodes) - goto restart; - out2: - /* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */ + /* Align starts of virtual zones on all nids to MAX_ORDER_NR_PAGES */ for (nid = 0; nid < MAX_NUMNODES; nid++) { unsigned long start_pfn, end_pfn; - - zone_movable_pfn[nid] = - roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES); + unsigned long prev_virt_zone_pfn = 0; get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); - if (zone_movable_pfn[nid] >= end_pfn) - zone_movable_pfn[nid] = 0; + + for (i = LAST_PHYS_ZONE + 1; i <= LAST_VIRT_ZONE; i++) { + pfn_of(i, nid) = roundup(pfn_of(i, nid), MAX_ORDER_NR_PAGES); + + if (pfn_of(i, nid) <= prev_virt_zone_pfn || pfn_of(i, nid) >= end_pfn) + pfn_of(i, nid) = 0; + + if (pfn_of(i, nid)) + prev_virt_zone_pfn = pfn_of(i, nid); + } } - out: /* restore the node_state */ node_states[N_MEMORY] = saved_node_state; @@ -1105,38 +1164,54 @@ void __ref memmap_init_zone_device(struct zone *zone, #endif /* - * The zone ranges provided by the architecture do not include ZONE_MOVABLE - * because it is sized independent of architecture. Unlike the other zones, - * the starting point for ZONE_MOVABLE is not fixed. It may be different - * in each node depending on the size of each node and how evenly kernelcore - * is distributed. This helper function adjusts the zone ranges + * The zone ranges provided by the architecture do not include virtual zones + * because they are sized independent of architecture. Unlike physical zones, + * the starting point for the first populated virtual zone is not fixed. 
It may + * be different in each node depending on the size of each node and how evenly + * kernelcore is distributed. This helper function adjusts the zone ranges * provided by the architecture for a given node by using the end of the - * highest usable zone for ZONE_MOVABLE. This preserves the assumption that - * zones within a node are in order of monotonic increases memory addresses + * highest usable zone for the first populated virtual zone. This preserves the + * assumption that zones within a node are in order of monotonic increases + * memory addresses. */ -static void __init adjust_zone_range_for_zone_movable(int nid, +static void __init adjust_zone_range(int nid, unsigned long zone_type, unsigned long node_end_pfn, unsigned long *zone_start_pfn, unsigned long *zone_end_pfn) { - /* Only adjust if ZONE_MOVABLE is on this node */ - if (zone_movable_pfn[nid]) { - /* Size ZONE_MOVABLE */ - if (zone_type == ZONE_MOVABLE) { - *zone_start_pfn = zone_movable_pfn[nid]; - *zone_end_pfn = min(node_end_pfn, - arch_zone_highest_possible_pfn[movable_zone]); + int i = max_t(int, zone_type, LAST_PHYS_ZONE); + unsigned long next_virt_zone_pfn = 0; - /* Adjust for ZONE_MOVABLE starting within this range */ - } else if (!mirrored_kernelcore && - *zone_start_pfn < zone_movable_pfn[nid] && - *zone_end_pfn > zone_movable_pfn[nid]) { - *zone_end_pfn = zone_movable_pfn[nid]; + while (i++ < LAST_VIRT_ZONE) { + if (pfn_of(i, nid)) { + next_virt_zone_pfn = pfn_of(i, nid); + break; + } + } - /* Check if this whole range is within ZONE_MOVABLE */ - } else if (*zone_start_pfn >= zone_movable_pfn[nid]) + if (zone_type <= LAST_PHYS_ZONE) { + if (!next_virt_zone_pfn) + return; + + if (!mirrored_kernelcore && + *zone_start_pfn < next_virt_zone_pfn && + *zone_end_pfn > next_virt_zone_pfn) + *zone_end_pfn = next_virt_zone_pfn; + else if (*zone_start_pfn >= next_virt_zone_pfn) *zone_start_pfn = *zone_end_pfn; + } else if (zone_type <= LAST_VIRT_ZONE) { + if (!pfn_of(zone_type, nid)) + return; + + if (next_virt_zone_pfn) + *zone_end_pfn = min3(next_virt_zone_pfn, + node_end_pfn, + arch_zone_highest_possible_pfn[virt_zone]); + else + *zone_end_pfn = min(node_end_pfn, + arch_zone_highest_possible_pfn[virt_zone]); + *zone_start_pfn = min(*zone_end_pfn, pfn_of(zone_type, nid)); } } @@ -1192,7 +1267,7 @@ static unsigned long __init zone_absent_pages_in_node(int nid, * Treat pages to be ZONE_MOVABLE in ZONE_NORMAL as absent pages * and vice versa. 
*/ - if (mirrored_kernelcore && zone_movable_pfn[nid]) { + if (mirrored_kernelcore && pfn_of(ZONE_MOVABLE, nid)) { unsigned long start_pfn, end_pfn; struct memblock_region *r; @@ -1232,8 +1307,7 @@ static unsigned long __init zone_spanned_pages_in_node(int nid, /* Get the start and end of the zone */ *zone_start_pfn = clamp(node_start_pfn, zone_low, zone_high); *zone_end_pfn = clamp(node_end_pfn, zone_low, zone_high); - adjust_zone_range_for_zone_movable(nid, zone_type, node_end_pfn, - zone_start_pfn, zone_end_pfn); + adjust_zone_range(nid, zone_type, node_end_pfn, zone_start_pfn, zone_end_pfn); /* Check that this node has pages within the zone's required range */ if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn) @@ -1298,6 +1372,10 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat, #if defined(CONFIG_MEMORY_HOTPLUG) zone->present_early_pages = real_size; #endif + if (i == ZONE_NOSPLIT) + zone->order = zone_nosplit_order; + if (i == ZONE_NOMERGE) + zone->order = zone_nomerge_order; totalpages += spanned; realtotalpages += real_size; @@ -1739,7 +1817,7 @@ static void __init check_for_memory(pg_data_t *pgdat) { enum zone_type zone_type; - for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) { + for (zone_type = 0; zone_type <= LAST_PHYS_ZONE; zone_type++) { struct zone *zone = &pgdat->node_zones[zone_type]; if (populated_zone(zone)) { if (IS_ENABLED(CONFIG_HIGHMEM)) @@ -1789,7 +1867,7 @@ static bool arch_has_descending_max_zone_pfns(void) void __init free_area_init(unsigned long *max_zone_pfn) { unsigned long start_pfn, end_pfn; - int i, nid, zone; + int i, j, nid, zone; bool descending; /* Record where the zone boundaries are */ @@ -1801,15 +1879,12 @@ void __init free_area_init(unsigned long *max_zone_pfn) start_pfn = PHYS_PFN(memblock_start_of_DRAM()); descending = arch_has_descending_max_zone_pfns(); - for (i = 0; i < MAX_NR_ZONES; i++) { + for (i = 0; i <= LAST_PHYS_ZONE; i++) { if (descending) - zone = MAX_NR_ZONES - i - 1; + zone = LAST_PHYS_ZONE - i; else zone = i; - if (zone == ZONE_MOVABLE) - continue; - end_pfn = max(max_zone_pfn[zone], start_pfn); arch_zone_lowest_possible_pfn[zone] = start_pfn; arch_zone_highest_possible_pfn[zone] = end_pfn; @@ -1817,15 +1892,12 @@ void __init free_area_init(unsigned long *max_zone_pfn) start_pfn = end_pfn; } - /* Find the PFNs that ZONE_MOVABLE begins at in each node */ - memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn)); - find_zone_movable_pfns_for_nodes(); + /* Find the PFNs that virtual zones begin at in each node */ + find_virt_zones(); /* Print out the zone ranges */ pr_info("Zone ranges:\n"); - for (i = 0; i < MAX_NR_ZONES; i++) { - if (i == ZONE_MOVABLE) - continue; + for (i = 0; i <= LAST_PHYS_ZONE; i++) { pr_info(" %-8s ", zone_names[i]); if (arch_zone_lowest_possible_pfn[i] == arch_zone_highest_possible_pfn[i]) @@ -1838,12 +1910,14 @@ void __init free_area_init(unsigned long *max_zone_pfn) << PAGE_SHIFT) - 1); } - /* Print out the PFNs ZONE_MOVABLE begins at in each node */ - pr_info("Movable zone start for each node\n"); - for (i = 0; i < MAX_NUMNODES; i++) { - if (zone_movable_pfn[i]) - pr_info(" Node %d: %#018Lx\n", i, - (u64)zone_movable_pfn[i] << PAGE_SHIFT); + /* Print out the PFNs virtual zones begin at in each node */ + for (; i <= LAST_VIRT_ZONE; i++) { + pr_info("%s zone start for each node\n", zone_names[i]); + for (j = 0; j < MAX_NUMNODES; j++) { + if (pfn_of(i, j)) + pr_info(" Node %d: %#018Lx\n", + j, (u64)pfn_of(i, j) << PAGE_SHIFT); + } } /* diff --git 
a/mm/page_alloc.c b/mm/page_alloc.c index 150d4f23b010..6a4da8f8691c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -267,6 +267,8 @@ char * const zone_names[MAX_NR_ZONES] = { "HighMem", #endif "Movable", + "NoSplit", + "NoMerge", #ifdef CONFIG_ZONE_DEVICE "Device", #endif @@ -290,9 +292,9 @@ int user_min_free_kbytes = -1; static int watermark_boost_factor __read_mostly = 15000; static int watermark_scale_factor = 10; -/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ -int movable_zone; -EXPORT_SYMBOL(movable_zone); +/* virt_zone is the "real" zone pages in virtual zones are taken from */ +int virt_zone; +EXPORT_SYMBOL(virt_zone); #if MAX_NUMNODES > 1 unsigned int nr_node_ids __read_mostly = MAX_NUMNODES; @@ -727,9 +729,6 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, unsigned long higher_page_pfn; struct page *higher_page; - if (order >= MAX_PAGE_ORDER - 1) - return false; - higher_page_pfn = buddy_pfn & pfn; higher_page = page + (higher_page_pfn - pfn); @@ -737,6 +736,11 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, NULL) != NULL; } +static int zone_max_order(struct zone *zone) +{ + return zone->order && zone_idx(zone) == ZONE_NOMERGE ? zone->order : MAX_PAGE_ORDER; +} + /* * Freeing function for a buddy system allocator. * @@ -771,6 +775,7 @@ static inline void __free_one_page(struct page *page, unsigned long combined_pfn; struct page *buddy; bool to_tail; + int max_order = zone_max_order(zone); VM_BUG_ON(!zone_is_initialized(zone)); VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); @@ -782,7 +787,7 @@ static inline void __free_one_page(struct page *page, VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); VM_BUG_ON_PAGE(bad_range(zone, page), page); - while (order < MAX_PAGE_ORDER) { + while (order < max_order) { if (compaction_capture(capc, page, order, migratetype)) { __mod_zone_freepage_state(zone, -(1 << order), migratetype); @@ -829,6 +834,8 @@ static inline void __free_one_page(struct page *page, to_tail = true; else if (is_shuffle_order(order)) to_tail = shuffle_pick_tail(); + else if (order + 1 >= max_order) + to_tail = false; else to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order); @@ -866,6 +873,8 @@ int split_free_page(struct page *free_page, int mt; int ret = 0; + VM_WARN_ON_ONCE_PAGE(!page_can_split(free_page), free_page); + if (split_pfn_offset == 0) return ret; @@ -1566,6 +1575,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, struct free_area *area; struct page *page; + VM_WARN_ON_ONCE(!zone_is_suitable(zone, order)); + /* Find a page of the appropriate size in the preferred list */ for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) { area = &(zone->free_area[current_order]); @@ -2961,6 +2972,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, long min = mark; int o; + if (!zone_is_suitable(z, order)) + return false; + /* free_pages may go negative - that's OK */ free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); @@ -3045,6 +3059,9 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order, { long free_pages; + if (!zone_is_suitable(z, order)) + return false; + free_pages = zone_page_state(z, NR_FREE_PAGES); /* @@ -3188,6 +3205,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, struct page *page; unsigned long mark; + if (!zone_is_suitable(zone, order)) + continue; + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) && 
			!__cpuset_zone_allowed(zone, gfp_mask))
@@ -5834,9 +5854,9 @@ static void __setup_per_zone_wmarks(void)
 	struct zone *zone;
 	unsigned long flags;

-	/* Calculate total number of !ZONE_HIGHMEM and !ZONE_MOVABLE pages */
+	/* Calculate total number of pages below ZONE_HIGHMEM */
 	for_each_zone(zone) {
-		if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
+		if (zone_idx(zone) <= ZONE_NORMAL)
 			lowmem_pages += zone_managed_pages(zone);
 	}

@@ -5846,11 +5866,11 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone_managed_pages(zone);
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
+		if (zone_idx(zone) > ZONE_NORMAL) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
-			 * need highmem and movable zones pages, so cap pages_min
-			 * to a small value here.
+			 * need pages from zones above ZONE_NORMAL, so cap
+			 * pages_min to a small value here.
 			 *
 			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas control async page reclaim, and so should
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index cd0ea3668253..8a6473543427 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -69,7 +69,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 		 * pages then it should be reasonably safe to assume the rest
 		 * is movable.
 		 */
-		if (zone_idx(zone) == ZONE_MOVABLE)
+		if (zid_is_virt(zone_idx(zone)))
 			continue;

 		/*
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..ad0db0373b05 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 	entry.val = 0;

 	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported() &&
+		    folio_test_pmd_mappable(folio))
 			get_swap_pages(1, &entry, folio_nr_pages(folio));
 		goto out;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..ae061ec4866a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1193,20 +1193,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					goto keep_locked;
 				if (folio_maybe_dma_pinned(folio))
 					goto keep_locked;
-				if (folio_test_large(folio)) {
-					/* cannot split folio, skip it */
-					if (!can_split_folio(folio, NULL))
-						goto activate_locked;
-					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
-					 */
-					if (!folio_entire_mapcount(folio) &&
-					    split_folio_to_list(folio,
-								folio_list))
-						goto activate_locked;
-				}
+				/*
+				 * Split folios that are not fully mapped right
+				 * away. Chances are some of the tail pages can
+				 * be freed without IO.
+				 */
+				if (folio_test_large(folio) &&
+				    atomic_read(&folio->_nr_pages_mapped) < nr_pages)
+					split_folio_to_list(folio, folio_list);

 				if (!add_to_swap(folio)) {
 					if (!folio_test_large(folio))
 						goto activate_locked_split;
@@ -6077,7 +6071,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	orig_mask = sc->gfp_mask;
 	if (buffer_heads_over_limit) {
 		sc->gfp_mask |= __GFP_HIGHMEM;
-		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
+		sc->reclaim_idx = gfp_order_zone(sc->gfp_mask, sc->order);
 	}

 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -6407,7 +6401,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.gfp_mask = current_gfp_context(gfp_mask),
-		.reclaim_idx = gfp_zone(gfp_mask),
+		.reclaim_idx = gfp_order_zone(gfp_mask, order),
 		.order = order,
 		.nodemask = nodemask,
 		.priority = DEF_PRIORITY,
@@ -7170,6 +7164,10 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 	if (!cpuset_zone_allowed(zone, gfp_flags))
 		return;

+	curr_idx = gfp_order_zone(gfp_flags, order);
+	if (highest_zoneidx > curr_idx)
+		highest_zoneidx = curr_idx;
+
 	pgdat = zone->zone_pgdat;
 	curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx);

@@ -7380,7 +7378,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
-		.reclaim_idx = gfp_zone(gfp_mask),
+		.reclaim_idx = gfp_order_zone(gfp_mask, order),
 	};
 	unsigned long pflags;

diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..adbd032e6a0f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1167,6 +1167,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)

 #define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
 					TEXT_FOR_HIGHMEM(xx) xx "_movable", \
+					xx "_nosplit", xx "_nomerge", \
 					TEXT_FOR_DEVICE(xx)

 const char * const vmstat_text[] = {
@@ -1699,7 +1700,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu"
-		   "\n        cma      %lu",
+		   "\n        cma      %lu"
+		   "\n        order    %u",
 		   zone_page_state(zone, NR_FREE_PAGES),
 		   zone->watermark_boost,
 		   min_wmark_pages(zone),
@@ -1708,7 +1710,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone_managed_pages(zone),
-		   zone_cma_pages(zone));
+		   zone_cma_pages(zone),
+		   zone->order);

 	seq_printf(m,
 		   "\n        protection: (%ld",
There are three types of zones:
1. The first four zones partition the physical address space of CPU
   memory.
2. The device zone provides interoperability between CPU and device
   memory.
3. The movable zone commonly represents a memory allocation policy.

Though originally designed for memory hot removal, the movable zone is
instead widely used for other purposes, e.g., CMA and kdump kernel, on
platforms that do not support hot removal, e.g., Android and ChromeOS.
Nowadays, it is legitimately a zone independent of any physical
characteristics. In spite of being somewhat regarded as a hack,
largely due to the lack of a generic design concept for its true major
use cases (on billions of client devices), the movable zone naturally
resembles a policy (virtual) zone overlayed on the first four
(physical) zones.

This proposal formally generalizes this concept as policy zones so
that additional policies can be implemented and enforced by subsequent
zones after the movable zone. An inherited requirement of policy zones
(and the first four zones) is that subsequent zones must be able to
fall back to previous zones and therefore must add new properties to
the previous zones rather than remove existing ones from them. Also,
all properties must be known at the allocation time, rather than the
runtime, e.g., memory object size and mobility are valid properties
but hotness and lifetime are not.

ZONE_MOVABLE becomes the first policy zone, followed by two new policy
zones:
1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
   ZONE_MOVABLE) and restricted to a minimum order to be
   anti-fragmentation. The latter means that they cannot be split down
   below that order, while they are free or in use.
2. ZONE_NOMERGE, which contains pages that are movable and restricted
   to an exact order. The latter means that not only is split
   prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
   reason in Chapter Three), while they are free or in use.

Since these two zones only can serve THP allocations (__GFP_MOVABLE |
__GFP_COMP), they are called THP zones. Reclaim works seamlessly and
compaction is not needed for these two zones.

Compared with the hugeTLB pool approach, THP zones tap into core MM
features including:
1. THP allocations can fall back to the lower zones, which can have
   higher latency but still succeed.
2. THPs can be either shattered (see Chapter Two) if partially
   unmapped or reclaimed if becoming cold.
3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
   contiguous PTEs on arm64 [1], which are more suitable for client
   workloads.

Policy zones can be dynamically resized by offlining pages in one of
them and onlining those pages in another of them. Note that this is
only done among policy zones, not between a policy zone and a physical
zone, since resizing is a (software) policy, not a physical
characteristic.

Implementing the same idea at the pageblock granularity has also been
explored but rejected at Google. Pageblocks have a finer granularity
and therefore can be more flexible than zones. The tradeoff is that
this alternative implementation was more complex and failed to bring a
better ROI. However, the rejection was mainly due to its inability to
be smoothly extended to 1GB THPs [2], which is a planned use case of
TAO.
[1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@sent.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 .../admin-guide/kernel-parameters.txt |  10 +
 drivers/virtio/virtio_mem.c           |   2 +-
 include/linux/gfp.h                   |  24 +-
 include/linux/huge_mm.h               |   6 -
 include/linux/mempolicy.h             |   2 +-
 include/linux/mmzone.h                |  52 +-
 include/linux/nodemask.h              |   2 +-
 include/linux/vm_event_item.h         |   2 +-
 include/trace/events/mmflags.h        |   4 +-
 mm/compaction.c                       |  12 +
 mm/huge_memory.c                      |   5 +-
 mm/mempolicy.c                        |  14 +-
 mm/migrate.c                          |   7 +-
 mm/mm_init.c                          | 452 ++++++++++--------
 mm/page_alloc.c                       |  44 +-
 mm/page_isolation.c                   |   2 +-
 mm/swap_slots.c                       |   3 +-
 mm/vmscan.c                           |  32 +-
 mm/vmstat.c                           |   7 +-
 19 files changed, 431 insertions(+), 251 deletions(-)
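[Editor's note] For readers skimming the allocation-side changes: the commit
message defines ZONE_NOSPLIT by a minimum order and ZONE_NOMERGE by an exact
order, and the mm/page_alloc.c hunks gate __rmqueue_smallest(), the watermark
checks and get_page_from_freelist() on a zone_is_suitable() helper whose
definition lives in the include/linux/gfp.h part of the patch (not shown in
this excerpt). The user-space toy below only models that suitability rule;
enum policy, struct toy_zone and can_serve() are invented names, gfp flags are
ignored, and it is a sketch of the stated semantics, not the patch's
implementation.

/*
 * Toy user-space model of the zone suitability rule described above.
 * This is NOT kernel code: enum policy, struct toy_zone and can_serve()
 * are invented for illustration, and gfp flags are ignored.
 */
#include <stdbool.h>
#include <stdio.h>

enum policy { PHYS, MOVABLE, NOSPLIT, NOMERGE };

struct toy_zone {
	enum policy policy;
	int order;	/* minimum order for NOSPLIT, exact order for NOMERGE */
};

/* Can this zone serve an allocation of the given order? */
static bool can_serve(const struct toy_zone *z, int order)
{
	switch (z->policy) {
	case NOSPLIT:
		return order >= z->order;	/* pages never split below z->order */
	case NOMERGE:
		return order == z->order;	/* pages never split or merge */
	default:
		return true;			/* physical and movable zones take any order */
	}
}

int main(void)
{
	const struct toy_zone zones[] = {
		{ PHYS, 0 }, { MOVABLE, 0 }, { NOSPLIT, 4 }, { NOMERGE, 9 },
	};
	const char *names[] = { "Normal", "Movable", "NoSplit", "NoMerge" };
	int orders[] = { 0, 4, 9 };

	for (unsigned int i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		for (unsigned int j = 0; j < sizeof(zones) / sizeof(zones[0]); j++)
			printf("order %d from %-7s: %s\n", orders[i], names[j],
			       can_serve(&zones[j], orders[i]) ? "ok" : "no");
	}
	return 0;
}

Under this model an order-0 request can only fall back to the movable and
physical zones, which is what keeps the THP zones free of fragmenting
allocations while THP-sized requests can still fall back downward.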
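[Editor's note] The freeing-side change can be read the same way: in the
mm/page_alloc.c hunk, zone_max_order() returns the zone's fixed order for
ZONE_NOMERGE, so the merge loop in __free_one_page() stops coalescing there
instead of at MAX_PAGE_ORDER. The sketch below is a simplified, user-space
stand-in for that loop; merge_to() and its free-buddy bitmask are made up
solely to show where the cap kicks in.

/*
 * Simplified stand-in for the merge cap added to __free_one_page():
 * in a NoMerge zone, zone_max_order() returns the zone's fixed order,
 * so freed pages stop coalescing there rather than at MAX_PAGE_ORDER.
 * merge_to() and its bitmask argument are made up for illustration.
 */
#include <stdio.h>

#define MAX_PAGE_ORDER 10

static int merge_to(int order, int zone_cap, unsigned int buddy_free_mask)
{
	/* Keep merging while the buddy at this order is free and the cap allows it. */
	while (order < zone_cap && (buddy_free_mask & (1u << order)))
		order++;
	return order;
}

int main(void)
{
	unsigned int all_free = ~0u;	/* pretend every buddy is free */

	/* A regular zone coalesces all the way up to MAX_PAGE_ORDER... */
	printf("normal zone:  order 0 ends up at order %d\n",
	       merge_to(0, MAX_PAGE_ORDER, all_free));
	/* ...while an order-9 NoMerge zone never goes past order 9. */
	printf("nomerge zone: order 9 ends up at order %d\n",
	       merge_to(9, 9, all_free));
	return 0;
}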