From patchwork Fri May 21 10:28:21 2021
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 12272877
From: Mel Gorman
To: Linux-MM
Cc: Dave Hansen , Matthew Wilcox , Vlastimil Babka , Michal Hocko , Nicholas Piggin , LKML , Mel Gorman
Subject: [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction
Date: Fri, 21 May 2021 11:28:21 +0100
Message-Id: <20210521102826.28552-2-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.26.2
In-Reply-To: <20210521102826.28552-1-mgorman@techsingularity.net>
References: <20210521102826.28552-1-mgorman@techsingularity.net>
Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: 
owner-majordomo@kvack.org List-ID: The vm.percpu_pagelist_fraction is used to increase the batch and high limits for the per-cpu page allocator (PCP). The intent behind the sysctl is to reduce zone lock acquisition when allocating/freeing pages but it has a problem. While it can decrease contention, it can also increase latency on the allocation side due to unreasonably large batch sizes. This leads to games where an administrator adjusts percpu_pagelist_fraction on the fly to work around contention and allocation latency problems. This series aims to alleviate the problems with zone lock contention while avoiding the allocation-side latency problems. For the purposes of review, it's easier to remove this sysctl now and reintroduce a similar sysctl later in the series that deals only with pcp->high. Signed-off-by: Mel Gorman Acked-by: Dave Hansen --- Documentation/admin-guide/sysctl/vm.rst | 19 --------- include/linux/mmzone.h | 3 -- kernel/sysctl.c | 8 ---- mm/page_alloc.c | 55 ++----------------------- 4 files changed, 4 insertions(+), 81 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 586cd4b86428..2fcafccb53a8 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -64,7 +64,6 @@ files can be found in mm/swap.c. - overcommit_ratio - page-cluster - panic_on_oom -- percpu_pagelist_fraction - stat_interval - stat_refresh - numa_stat @@ -790,24 +789,6 @@ panic_on_oom=2+kdump gives you very strong tool to investigate why oom happens. You can get snapshot. -percpu_pagelist_fraction -======================== - -This is the fraction of pages at most (high mark pcp->high) in each zone that -are allocated for each per cpu page list. The min value for this is 8. It -means that we don't allow more than 1/8th of pages in each zone to be -allocated in any single per_cpu_pagelist. This entry only changes the value -of hot per cpu pagelists. User can specify a number like 100 to allocate -1/100th of each zone to each per cpu page list. - -The batch value of each per cpu pagelist is also updated as a result. It is -set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) - -The initial value is zero. Kernel does not use this value at boot time to set -the high water marks for each per cpu page list. If the user writes '0' to this -sysctl, it will revert to this default behavior. 
- - stat_interval ============= diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index d7740c97b87e..b449151745d7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1038,15 +1038,12 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *, extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *); -int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int, - void *, size_t *, loff_t *); int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *); int numa_zonelist_order_handler(struct ctl_table *, int, void *, size_t *, loff_t *); -extern int percpu_pagelist_fraction; extern char numa_zonelist_order[]; #define NUMA_ZONELIST_ORDER_LEN 16 diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 14edf84cc571..4e5ac50a1af0 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2889,14 +2889,6 @@ static struct ctl_table vm_table[] = { .extra1 = SYSCTL_ONE, .extra2 = &one_thousand, }, - { - .procname = "percpu_pagelist_fraction", - .data = &percpu_pagelist_fraction, - .maxlen = sizeof(percpu_pagelist_fraction), - .mode = 0644, - .proc_handler = percpu_pagelist_fraction_sysctl_handler, - .extra1 = SYSCTL_ZERO, - }, { .procname = "page_lock_unfairness", .data = &sysctl_page_lock_unfairness, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ff8f706839ea..a48f305f0381 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -120,7 +120,6 @@ typedef int __bitwise fpi_t; /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); -#define MIN_PERCPU_PAGELIST_FRACTION (8) struct pagesets { local_lock_t lock; @@ -182,7 +181,6 @@ EXPORT_SYMBOL(_totalram_pages); unsigned long totalreserve_pages __read_mostly; unsigned long totalcma_pages __read_mostly; -int percpu_pagelist_fraction; gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc); EXPORT_SYMBOL(init_on_alloc); @@ -6696,22 +6694,15 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h /* * Calculate and set new high and batch values for all per-cpu pagesets of a - * zone, based on the zone's size and the percpu_pagelist_fraction sysctl. + * zone based on the zone's size. */ static void zone_set_pageset_high_and_batch(struct zone *zone) { unsigned long new_high, new_batch; - if (percpu_pagelist_fraction) { - new_high = zone_managed_pages(zone) / percpu_pagelist_fraction; - new_batch = max(1UL, new_high / 4); - if ((new_high / 4) > (PAGE_SHIFT * 8)) - new_batch = PAGE_SHIFT * 8; - } else { - new_batch = zone_batchsize(zone); - new_high = 6 * new_batch; - new_batch = max(1UL, 1 * new_batch); - } + new_batch = zone_batchsize(zone); + new_high = 6 * new_batch; + new_batch = max(1UL, 1 * new_batch); if (zone->pageset_high == new_high && zone->pageset_batch == new_batch) @@ -8377,44 +8368,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write, return 0; } -/* - * percpu_pagelist_fraction - changes the pcp->high for each zone on each - * cpu. It is the fraction of total pages in each zone that a hot per cpu - * pagelist can have before it gets flushed back to buddy allocator. 
- */ -int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write, - void *buffer, size_t *length, loff_t *ppos) -{ - struct zone *zone; - int old_percpu_pagelist_fraction; - int ret; - - mutex_lock(&pcp_batch_high_lock); - old_percpu_pagelist_fraction = percpu_pagelist_fraction; - - ret = proc_dointvec_minmax(table, write, buffer, length, ppos); - if (!write || ret < 0) - goto out; - - /* Sanity checking to avoid pcp imbalance */ - if (percpu_pagelist_fraction && - percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) { - percpu_pagelist_fraction = old_percpu_pagelist_fraction; - ret = -EINVAL; - goto out; - } - - /* No change? */ - if (percpu_pagelist_fraction == old_percpu_pagelist_fraction) - goto out; - - for_each_populated_zone(zone) - zone_set_pageset_high_and_batch(zone); -out: - mutex_unlock(&pcp_batch_high_lock); - return ret; -} - #ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES /* * Returns the number of pages that arch has reserved but From patchwork Fri May 21 10:28:22 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mel Gorman X-Patchwork-Id: 12272879 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 279DBC433B4 for ; Fri, 21 May 2021 10:29:02 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id ACB23613CE for ; Fri, 21 May 2021 10:29:01 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org ACB23613CE Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=techsingularity.net Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 4CF5B8E002B; Fri, 21 May 2021 06:29:01 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 45E688E0022; Fri, 21 May 2021 06:29:01 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2D3508E002B; Fri, 21 May 2021 06:29:01 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0151.hostedemail.com [216.40.44.151]) by kanga.kvack.org (Postfix) with ESMTP id EA2358E0022 for ; Fri, 21 May 2021 06:29:00 -0400 (EDT) Received: from smtpin01.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 06118180AD815 for ; Fri, 21 May 2021 10:28:59 +0000 (UTC) X-FDA: 78164865198.01.D122680 Received: from outbound-smtp26.blacknight.com (outbound-smtp26.blacknight.com [81.17.249.194]) by imf16.hostedemail.com (Postfix) with ESMTP id 8BF3C801910F for ; Fri, 21 May 2021 10:28:56 +0000 (UTC) Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp26.blacknight.com (Postfix) with ESMTPS id 0D865CAC4D for ; Fri, 21 May 2021 11:28:57 +0100 (IST) Received: (qmail 22629 invoked from network); 21 May 2021 10:28:56 -0000 Received: from unknown (HELO stampy.112glenside.lan) (mgorman@techsingularity.net@[84.203.23.168]) by 81.17.254.9 with ESMTPA; 21 May 2021 10:28:56 -0000 
From: Mel Gorman To: Linux-MM Cc: Dave Hansen , Matthew Wilcox , Vlastimil Babka , Michal Hocko , Nicholas Piggin , LKML , Mel Gorman Subject: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch Date: Fri, 21 May 2021 11:28:22 +0100 Message-Id: <20210521102826.28552-3-mgorman@techsingularity.net> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210521102826.28552-1-mgorman@techsingularity.net> References: <20210521102826.28552-1-mgorman@techsingularity.net> MIME-Version: 1.0 Authentication-Results: imf16.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf16.hostedemail.com: domain of mgorman@techsingularity.net designates 81.17.249.194 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 8BF3C801910F X-Stat-Signature: hbukkyyeebh5ndt55rcfjt1yqiigaq5f X-HE-Tag: 1621592936-660579 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The pcp high watermark is based on the batch size but there is no relationship between them other than it is convenient to use early in boot. This patch takes the first step and bases pcp->high on the zone low watermark split across the number of CPUs local to a zone while the batch size remains the same to avoid increasing allocation latencies. The intent behind the default pcp->high is "set the number of PCP pages such that if they are all full that background reclaim is not started prematurely". Note that in this patch the pcp->high values are adjusted after memory hotplug events, min_free_kbytes adjustments and watermark scale factor adjustments but not CPU hotplug events. On a test KVM instance; Before grep -E "high:|batch" /proc/zoneinfo | tail -2 high: 378 batch: 63 After grep -E "high:|batch" /proc/zoneinfo | tail -2 high: 649 batch: 63 Signed-off-by: Mel Gorman --- mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++----------------- 1 file changed, 35 insertions(+), 18 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a48f305f0381..bf5cdc466e6c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void) /* Block until all are initialised */ wait_for_completion(&pgdat_init_all_done_comp); - /* - * The number of managed pages has changed due to the initialisation - * so the pcpu batch and high limits needs to be updated or the limits - * will be artificially small. - */ - for_each_populated_zone(zone) - zone_pcp_update(zone); - /* * We initialized the rest of the deferred pages. Permanently disable * on-demand struct page initialization. @@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone) int batch; /* - * The per-cpu-pages pools are set to around 1000th of the - * size of the zone. + * The number of pages to batch allocate is either 0.1% + * of the zone or 1MB, whichever is smaller. The batch + * size is striking a balance between allocation latency + * and zone lock contention. */ - batch = zone_managed_pages(zone) / 1024; - /* But no more than a meg. 
*/ - if (batch * PAGE_SIZE > 1024 * 1024) - batch = (1024 * 1024) / PAGE_SIZE; + batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE); batch /= 4; /* We effectively *= 4 below */ if (batch < 1) batch = 1; @@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone) #endif } +static int zone_highsize(struct zone *zone) +{ +#ifdef CONFIG_MMU + int high; + int nr_local_cpus; + + /* + * The high value of the pcp is based on the zone low watermark + * when reclaim is potentially active spread across the online + * CPUs local to a zone. Note that early in boot that CPUs may + * not be online yet. + */ + nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))); + high = low_wmark_pages(zone) / nr_local_cpus; + + return high; +#else + return 0; +#endif +} + /* * pcp->high and pcp->batch values are related and generally batch is lower * than high. They are also related to pcp->count such that count is lower @@ -6698,11 +6710,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h */ static void zone_set_pageset_high_and_batch(struct zone *zone) { - unsigned long new_high, new_batch; + int new_high, new_batch; - new_batch = zone_batchsize(zone); - new_high = 6 * new_batch; - new_batch = max(1UL, 1 * new_batch); + new_batch = max(1, zone_batchsize(zone)); + new_high = zone_highsize(zone); if (zone->pageset_high == new_high && zone->pageset_batch == new_batch) @@ -8170,6 +8181,12 @@ static void __setup_per_zone_wmarks(void) zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; + /* + * The watermark size have changed so update the pcpu batch + * and high limits or the limits may be inappropriate. + */ + zone_set_pageset_high_and_batch(zone); + spin_unlock_irqrestore(&zone->lock, flags); } From patchwork Fri May 21 10:28:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mel Gorman X-Patchwork-Id: 12272881 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 49A0BC433B4 for ; Fri, 21 May 2021 10:29:10 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id E9FFE613DF for ; Fri, 21 May 2021 10:29:09 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E9FFE613DF Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=techsingularity.net Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 879C08E002C; Fri, 21 May 2021 06:29:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 829348E0022; Fri, 21 May 2021 06:29:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 67C948E002C; Fri, 21 May 2021 06:29:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0022.hostedemail.com [216.40.44.22]) by kanga.kvack.org (Postfix) with ESMTP id 2E7688E0022 for ; Fri, 21 
May 2021 06:29:09 -0400 (EDT) Received: from smtpin39.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id BE7B28248068 for ; Fri, 21 May 2021 10:29:08 +0000 (UTC) X-FDA: 78164865576.39.459B6E4 Received: from outbound-smtp19.blacknight.com (outbound-smtp19.blacknight.com [46.22.139.246]) by imf13.hostedemail.com (Postfix) with ESMTP id 743DAE000801 for ; Fri, 21 May 2021 10:29:05 +0000 (UTC) Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp19.blacknight.com (Postfix) with ESMTPS id 2EB081C3BFE for ; Fri, 21 May 2021 11:29:07 +0100 (IST) Received: (qmail 23211 invoked from network); 21 May 2021 10:29:07 -0000 Received: from unknown (HELO stampy.112glenside.lan) (mgorman@techsingularity.net@[84.203.23.168]) by 81.17.254.9 with ESMTPA; 21 May 2021 10:29:07 -0000 From: Mel Gorman To: Linux-MM Cc: Dave Hansen , Matthew Wilcox , Vlastimil Babka , Michal Hocko , Nicholas Piggin , LKML , Mel Gorman Subject: [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events Date: Fri, 21 May 2021 11:28:23 +0100 Message-Id: <20210521102826.28552-4-mgorman@techsingularity.net> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210521102826.28552-1-mgorman@techsingularity.net> References: <20210521102826.28552-1-mgorman@techsingularity.net> MIME-Version: 1.0 Authentication-Results: imf13.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf13.hostedemail.com: domain of mgorman@techsingularity.net designates 46.22.139.246 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 743DAE000801 X-Stat-Signature: e68f7nqgdyjphok6gj4fgf3hz8bf73zm X-HE-Tag: 1621592945-594203 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The PCP high watermark is based on the number of online CPUs so the watermarks must be adjusted during CPU hotplug. At the time of hot-remove, the number of online CPUs is already adjusted but during hot-add, a delta needs to be applied to update PCP to the correct value. After this patch is applied, the high watermarks are adjusted correctly. 
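As a reading aid, the sketch below abridges the mechanism from the diff that follows (the names and calls are taken from the patch itself; the pre-existing per-cpu drain and vmstat folding in the dead callback is elided). The old CPUHP_PAGE_ALLOC_DEAD state becomes CPUHP_PAGE_ALLOC with both an online and a dead callback, each of which recomputes the PCP limits for every populated zone; the online callback passes a +1 delta because the CPU being brought up is not yet visible in the node's cpumask when the callback runs.

	/* Abridged sketch of the hotplug hooks added by this patch. */
	static int page_alloc_cpu_online(unsigned int cpu)
	{
		struct zone *zone;

		/* +1: the onlining CPU is not yet counted by cpumask_of_node() */
		for_each_populated_zone(zone)
			zone_pcp_update(zone, 1);
		return 0;
	}

	static int page_alloc_cpu_dead(unsigned int cpu)
	{
		struct zone *zone;

		/* existing per-cpu drains and vmstat folding elided here */
		for_each_populated_zone(zone)
			zone_pcp_update(zone, 0);
		return 0;
	}

	/* Registered from page_alloc_init(), replacing the dead-only state: */
	ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC, "mm/page_alloc:pcp",
					page_alloc_cpu_online,
					page_alloc_cpu_dead);

With those callbacks in place, the change in pcp->high can be seen by offlining and onlining a CPU: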
# grep high: /proc/zoneinfo | tail -1 high: 649 # echo 0 > /sys/devices/system/cpu/cpu4/online # grep high: /proc/zoneinfo | tail -1 high: 664 # echo 1 > /sys/devices/system/cpu/cpu4/online # grep high: /proc/zoneinfo | tail -1 high: 649 Signed-off-by: Mel Gorman --- include/linux/cpuhotplug.h | 2 +- mm/internal.h | 2 +- mm/memory_hotplug.c | 4 ++-- mm/page_alloc.c | 35 +++++++++++++++++++++++++---------- 4 files changed, 29 insertions(+), 14 deletions(-) diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index 4a62b3980642..47e13582d9fc 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -54,7 +54,7 @@ enum cpuhp_state { CPUHP_MM_MEMCQ_DEAD, CPUHP_PERCPU_CNT_DEAD, CPUHP_RADIX_DEAD, - CPUHP_PAGE_ALLOC_DEAD, + CPUHP_PAGE_ALLOC, CPUHP_NET_DEV_DEAD, CPUHP_PCI_XGENE_DEAD, CPUHP_IOMMU_IOVA_DEAD, diff --git a/mm/internal.h b/mm/internal.h index 54bd0dc2c23c..651250e59ef5 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -221,7 +221,7 @@ extern int user_min_free_kbytes; extern void free_unref_page(struct page *page); extern void free_unref_page_list(struct list_head *list); -extern void zone_pcp_update(struct zone *zone); +extern void zone_pcp_update(struct zone *zone, int cpu_online); extern void zone_pcp_reset(struct zone *zone); extern void zone_pcp_disable(struct zone *zone); extern void zone_pcp_enable(struct zone *zone); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 70620d0dd923..bebb3cead810 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -961,7 +961,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *z node_states_set_node(nid, &arg); if (need_zonelists_rebuild) build_all_zonelists(NULL); - zone_pcp_update(zone); + zone_pcp_update(zone, 0); /* Basic onlining is complete, allow allocation of onlined pages. */ undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE); @@ -1835,7 +1835,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) zone_pcp_reset(zone); build_all_zonelists(NULL); } else - zone_pcp_update(zone); + zone_pcp_update(zone, 0); node_states_clear_node(node, &arg); if (arg.status_change_nid >= 0) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index bf5cdc466e6c..2761b03b3a44 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6628,7 +6628,7 @@ static int zone_batchsize(struct zone *zone) #endif } -static int zone_highsize(struct zone *zone) +static int zone_highsize(struct zone *zone, int cpu_online) { #ifdef CONFIG_MMU int high; @@ -6640,7 +6640,7 @@ static int zone_highsize(struct zone *zone) * CPUs local to a zone. Note that early in boot that CPUs may * not be online yet. */ - nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))); + nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online; high = low_wmark_pages(zone) / nr_local_cpus; return high; @@ -6708,12 +6708,12 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h * Calculate and set new high and batch values for all per-cpu pagesets of a * zone based on the zone's size. 
*/ -static void zone_set_pageset_high_and_batch(struct zone *zone) +static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online) { int new_high, new_batch; new_batch = max(1, zone_batchsize(zone)); - new_high = zone_highsize(zone); + new_high = zone_highsize(zone, cpu_online); if (zone->pageset_high == new_high && zone->pageset_batch == new_batch) @@ -6743,7 +6743,7 @@ void __meminit setup_zone_pageset(struct zone *zone) per_cpu_pages_init(pcp, pzstats); } - zone_set_pageset_high_and_batch(zone); + zone_set_pageset_high_and_batch(zone, 0); } /* @@ -8001,6 +8001,7 @@ void __init set_dma_reserve(unsigned long new_dma_reserve) static int page_alloc_cpu_dead(unsigned int cpu) { + struct zone *zone; lru_add_drain_cpu(cpu); drain_pages(cpu); @@ -8021,6 +8022,19 @@ static int page_alloc_cpu_dead(unsigned int cpu) * race with what we are doing. */ cpu_vm_stats_fold(cpu); + + for_each_populated_zone(zone) + zone_pcp_update(zone, 0); + + return 0; +} + +static int page_alloc_cpu_online(unsigned int cpu) +{ + struct zone *zone; + + for_each_populated_zone(zone) + zone_pcp_update(zone, 1); return 0; } @@ -8046,8 +8060,9 @@ void __init page_alloc_init(void) hashdist = 0; #endif - ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC_DEAD, - "mm/page_alloc:dead", NULL, + ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC, + "mm/page_alloc:pcp", + page_alloc_cpu_online, page_alloc_cpu_dead); WARN_ON(ret < 0); } @@ -8185,7 +8200,7 @@ static void __setup_per_zone_wmarks(void) * The watermark size have changed so update the pcpu batch * and high limits or the limits may be inappropriate. */ - zone_set_pageset_high_and_batch(zone); + zone_set_pageset_high_and_batch(zone, 0); spin_unlock_irqrestore(&zone->lock, flags); } @@ -9007,10 +9022,10 @@ EXPORT_SYMBOL(free_contig_range); * The zone indicated has a new number of managed_pages; batch sizes and percpu * page high values need to be recalculated. 
*/ -void __meminit zone_pcp_update(struct zone *zone) +void zone_pcp_update(struct zone *zone, int cpu_online) { mutex_lock(&pcp_batch_high_lock); - zone_set_pageset_high_and_batch(zone); + zone_set_pageset_high_and_batch(zone, cpu_online); mutex_unlock(&pcp_batch_high_lock); } From patchwork Fri May 21 10:28:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mel Gorman X-Patchwork-Id: 12272883 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3DBACC433B4 for ; Fri, 21 May 2021 10:29:20 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id E019B613D0 for ; Fri, 21 May 2021 10:29:19 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E019B613D0 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=techsingularity.net Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 803BA8E002D; Fri, 21 May 2021 06:29:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7B4138E0022; Fri, 21 May 2021 06:29:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 62DD88E002D; Fri, 21 May 2021 06:29:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0249.hostedemail.com [216.40.44.249]) by kanga.kvack.org (Postfix) with ESMTP id 2C1BC8E0022 for ; Fri, 21 May 2021 06:29:19 -0400 (EDT) Received: from smtpin05.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id C9A15DB55 for ; Fri, 21 May 2021 10:29:18 +0000 (UTC) X-FDA: 78164865996.05.2ADCA84 Received: from outbound-smtp17.blacknight.com (outbound-smtp17.blacknight.com [46.22.139.234]) by imf17.hostedemail.com (Postfix) with ESMTP id 2AF6B40B8CF7 for ; Fri, 21 May 2021 10:29:17 +0000 (UTC) Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp17.blacknight.com (Postfix) with ESMTPS id 58FAE1C3C00 for ; Fri, 21 May 2021 11:29:17 +0100 (IST) Received: (qmail 23575 invoked from network); 21 May 2021 10:29:17 -0000 Received: from unknown (HELO stampy.112glenside.lan) (mgorman@techsingularity.net@[84.203.23.168]) by 81.17.254.9 with ESMTPA; 21 May 2021 10:29:17 -0000 From: Mel Gorman To: Linux-MM Cc: Dave Hansen , Matthew Wilcox , Vlastimil Babka , Michal Hocko , Nicholas Piggin , LKML , Mel Gorman Subject: [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed Date: Fri, 21 May 2021 11:28:24 +0100 Message-Id: <20210521102826.28552-5-mgorman@techsingularity.net> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210521102826.28552-1-mgorman@techsingularity.net> References: <20210521102826.28552-1-mgorman@techsingularity.net> MIME-Version: 1.0 X-Rspamd-Queue-Id: 2AF6B40B8CF7 Authentication-Results: imf17.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf17.hostedemail.com: domain of mgorman@techsingularity.net designates 46.22.139.234 as 
permitted sender) smtp.mailfrom=mgorman@techsingularity.net X-Rspamd-Server: rspam04 X-Stat-Signature: 4jn5hztcsdwib48crghf9bp4o1yyxsen X-HE-Tag: 1621592957-931769 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: When a task is freeing a large number of order-0 pages, it may acquire the zone->lock multiple times freeing pages in batches. This may unnecessarily contend on the zone lock when freeing a very large number of pages. This patch adapts the size of the batch based on the recent pattern to scale the batch size for subsequent frees. As the machines I used to test this are not large enough to illustrate a problem, a debugging patch shows patterns like the following (slightly edited for clarity) Baseline vanilla kernel time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378 With patches time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814 time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814 time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814 time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814 time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814 Signed-off-by: Mel Gorman Acked-by: Dave Hansen --- include/linux/mmzone.h | 3 ++- mm/page_alloc.c | 30 ++++++++++++++++++++++++++++-- 2 files changed, 30 insertions(+), 3 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b449151745d7..92182e0299b2 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -343,8 +343,9 @@ struct per_cpu_pages { int count; /* number of pages in the list */ int high; /* high watermark, emptying needed */ int batch; /* chunk size for buddy add/remove */ + short free_factor; /* batch scaling factor during free */ #ifdef CONFIG_NUMA - int expire; /* When 0, remote pagesets are drained */ + short expire; /* When 0, remote pagesets are drained */ #endif /* Lists of pages, one per migrate type stored on the pcp-lists */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2761b03b3a44..c3da6401f138 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3267,18 +3267,42 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn) return true; } +static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch) +{ + int min_nr_free, max_nr_free; + + /* Check for PCP disabled or boot pageset */ + if (unlikely(high < batch)) + return 1; + + min_nr_free = batch; + max_nr_free = high - batch; + + batch <<= pcp->free_factor; + if (batch < max_nr_free) + pcp->free_factor++; + batch = clamp(batch, min_nr_free, max_nr_free); + + return batch; +} + static void free_unref_page_commit(struct page *page, unsigned long pfn, int migratetype) { struct zone *zone = page_zone(page); struct per_cpu_pages *pcp; + int high; __count_vm_event(PGFREE); pcp = this_cpu_ptr(zone->per_cpu_pageset); list_add(&page->lru, &pcp->lists[migratetype]); pcp->count++; - if (pcp->count >= READ_ONCE(pcp->high)) - free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp); + high = READ_ONCE(pcp->high); + if (pcp->count >= high) { + int batch = READ_ONCE(pcp->batch); + + free_pcppages_bulk(zone, 
nr_pcp_free(pcp, high, batch), pcp); + } } /* @@ -3531,6 +3555,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, local_lock_irqsave(&pagesets.lock, flags); pcp = this_cpu_ptr(zone->per_cpu_pageset); + pcp->free_factor >>= 1; list = &pcp->lists[migratetype]; page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list); local_unlock_irqrestore(&pagesets.lock, flags); @@ -6690,6 +6715,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta */ pcp->high = BOOT_PAGESET_HIGH; pcp->batch = BOOT_PAGESET_BATCH; + pcp->free_factor = 0; } static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high, From patchwork Fri May 21 10:28:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mel Gorman X-Patchwork-Id: 12272885 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4BA73C433B4 for ; Fri, 21 May 2021 10:29:30 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id EE84E613D0 for ; Fri, 21 May 2021 10:29:29 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org EE84E613D0 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=techsingularity.net Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 915908E002E; Fri, 21 May 2021 06:29:29 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8C5158E0022; Fri, 21 May 2021 06:29:29 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7664D8E002E; Fri, 21 May 2021 06:29:29 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0202.hostedemail.com [216.40.44.202]) by kanga.kvack.org (Postfix) with ESMTP id 416F38E0022 for ; Fri, 21 May 2021 06:29:29 -0400 (EDT) Received: from smtpin30.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id D4921D23C for ; Fri, 21 May 2021 10:29:28 +0000 (UTC) X-FDA: 78164866416.30.38D1DD6 Received: from outbound-smtp46.blacknight.com (outbound-smtp46.blacknight.com [46.22.136.58]) by imf27.hostedemail.com (Postfix) with ESMTP id E03BE801911F for ; Fri, 21 May 2021 10:29:25 +0000 (UTC) Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp46.blacknight.com (Postfix) with ESMTPS id 76A32FADC2 for ; Fri, 21 May 2021 11:29:27 +0100 (IST) Received: (qmail 23960 invoked from network); 21 May 2021 10:29:27 -0000 Received: from unknown (HELO stampy.112glenside.lan) (mgorman@techsingularity.net@[84.203.23.168]) by 81.17.254.9 with ESMTPA; 21 May 2021 10:29:27 -0000 From: Mel Gorman To: Linux-MM Cc: Dave Hansen , Matthew Wilcox , Vlastimil Babka , Michal Hocko , Nicholas Piggin , LKML , Mel Gorman Subject: [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active Date: Fri, 21 May 2021 11:28:25 +0100 Message-Id: 
<20210521102826.28552-6-mgorman@techsingularity.net> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210521102826.28552-1-mgorman@techsingularity.net> References: <20210521102826.28552-1-mgorman@techsingularity.net> MIME-Version: 1.0 X-Rspamd-Queue-Id: E03BE801911F Authentication-Results: imf27.hostedemail.com; dkim=none; spf=pass (imf27.hostedemail.com: domain of mgorman@techsingularity.net designates 46.22.136.58 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net; dmarc=none X-Rspamd-Server: rspam03 X-Stat-Signature: snnuoj9qo7oyz884xwez41ju3rikunoo X-HE-Tag: 1621592965-347972 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: When kswapd is active then direct reclaim is potentially active. In either case, it is possible that a zone would be balanced if pages were not trapped on PCP lists. Instead of draining remote pages, simply limit the size of the PCP lists while kswapd is active. Signed-off-by: Mel Gorman --- include/linux/mmzone.h | 1 + mm/page_alloc.c | 19 ++++++++++++++++++- mm/vmscan.c | 35 +++++++++++++++++++++++++++++++++++ 3 files changed, 54 insertions(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 92182e0299b2..a0606239a167 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -647,6 +647,7 @@ enum zone_flags { ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks. * Cleared when kswapd is woken. */ + ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */ }; static inline unsigned long zone_managed_pages(struct zone *zone) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c3da6401f138..d8f8044781c4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3286,6 +3286,23 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch) return batch; } +static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone) +{ + int high = READ_ONCE(pcp->high); + + if (unlikely(!high)) + return 0; + + if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) + return high; + + /* + * If reclaim is active, limit the number of pages that can be + * stored on pcp lists + */ + return READ_ONCE(pcp->batch) << 2; +} + static void free_unref_page_commit(struct page *page, unsigned long pfn, int migratetype) { @@ -3297,7 +3314,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn, pcp = this_cpu_ptr(zone->per_cpu_pageset); list_add(&page->lru, &pcp->lists[migratetype]); pcp->count++; - high = READ_ONCE(pcp->high); + high = nr_pcp_high(pcp, zone); if (pcp->count >= high) { int batch = READ_ONCE(pcp->batch); diff --git a/mm/vmscan.c b/mm/vmscan.c index 5199b9696bab..c3c2100a80b8 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3722,6 +3722,38 @@ static bool kswapd_shrink_node(pg_data_t *pgdat, return sc->nr_scanned >= sc->nr_to_reclaim; } +/* Page allocator PCP high watermark is lowered if reclaim is active. 
*/ +static inline void +update_reclaim_active(pg_data_t *pgdat, int highest_zoneidx, bool active) +{ + int i; + struct zone *zone; + + for (i = 0; i <= highest_zoneidx; i++) { + zone = pgdat->node_zones + i; + + if (!managed_zone(zone)) + continue; + + if (active) + set_bit(ZONE_RECLAIM_ACTIVE, &zone->flags); + else + clear_bit(ZONE_RECLAIM_ACTIVE, &zone->flags); + } +} + +static inline void +set_reclaim_active(pg_data_t *pgdat, int highest_zoneidx) +{ + update_reclaim_active(pgdat, highest_zoneidx, true); +} + +static inline void +clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx) +{ + update_reclaim_active(pgdat, highest_zoneidx, false); +} + /* * For kswapd, balance_pgdat() will reclaim pages across a node from zones * that are eligible for use by the caller until at least one zone is @@ -3774,6 +3806,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) boosted = nr_boost_reclaim; restart: + set_reclaim_active(pgdat, highest_zoneidx); sc.priority = DEF_PRIORITY; do { unsigned long nr_reclaimed = sc.nr_reclaimed; @@ -3907,6 +3940,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) pgdat->kswapd_failures++; out: + clear_reclaim_active(pgdat, highest_zoneidx); + /* If reclaim was boosted, account for the reclaim done in this pass */ if (boosted) { unsigned long flags; From patchwork Fri May 21 10:28:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mel Gorman X-Patchwork-Id: 12272887 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CDACAC433B4 for ; Fri, 21 May 2021 10:29:40 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 75A8D613AE for ; Fri, 21 May 2021 10:29:40 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 75A8D613AE Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=techsingularity.net Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 174298E002F; Fri, 21 May 2021 06:29:40 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 124AF8E0022; Fri, 21 May 2021 06:29:40 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EB8C08E002F; Fri, 21 May 2021 06:29:39 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0241.hostedemail.com [216.40.44.241]) by kanga.kvack.org (Postfix) with ESMTP id B472B8E0022 for ; Fri, 21 May 2021 06:29:39 -0400 (EDT) Received: from smtpin40.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 35F70181AF5C2 for ; Fri, 21 May 2021 10:29:39 +0000 (UTC) X-FDA: 78164866878.40.C65717F Received: from outbound-smtp11.blacknight.com (outbound-smtp11.blacknight.com [46.22.139.106]) by imf03.hostedemail.com (Postfix) with ESMTP id 52050C0042E3 for ; Fri, 21 May 2021 10:29:36 +0000 (UTC) Received: from mail.blacknight.com 
(pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp11.blacknight.com (Postfix) with ESMTPS id 94D981C3C13 for ; Fri, 21 May 2021 11:29:37 +0100 (IST) Received: (qmail 24401 invoked from network); 21 May 2021 10:29:37 -0000 Received: from unknown (HELO stampy.112glenside.lan) (mgorman@techsingularity.net@[84.203.23.168]) by 81.17.254.9 with ESMTPA; 21 May 2021 10:29:37 -0000 From: Mel Gorman To: Linux-MM Cc: Dave Hansen , Matthew Wilcox , Vlastimil Babka , Michal Hocko , Nicholas Piggin , LKML , Mel Gorman Subject: [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction Date: Fri, 21 May 2021 11:28:26 +0100 Message-Id: <20210521102826.28552-7-mgorman@techsingularity.net> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210521102826.28552-1-mgorman@techsingularity.net> References: <20210521102826.28552-1-mgorman@techsingularity.net> MIME-Version: 1.0 Authentication-Results: imf03.hostedemail.com; dkim=none; dmarc=none; spf=pass (imf03.hostedemail.com: domain of mgorman@techsingularity.net designates 46.22.139.106 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 52050C0042E3 X-Stat-Signature: accposeorajefmas363z8ayjustrwr5x X-HE-Tag: 1621592976-595244 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is similar to the old vm.percpu_pagelist_fraction except it only adjusts pcp->high to potentially reduce zone->lock contention while preserving allocation latency when PCP lists have to be refilled. # grep -E "high:|batch" /proc/zoneinfo | tail -2 high: 649 batch: 63 # sysctl vm.percpu_pagelist_high_fraction=8 # grep -E "high:|batch" /proc/zoneinfo | tail -2 high: 35071 batch: 63 # sysctl vm.percpu_pagelist_high_fraction=64 high: 4383 batch: 63 # sysctl vm.percpu_pagelist_high_fraction=0 high: 649 batch: 63 Signed-off-by: Mel Gorman Acked-by: Dave Hansen --- Documentation/admin-guide/sysctl/vm.rst | 20 +++++++++ include/linux/mmzone.h | 3 ++ kernel/sysctl.c | 8 ++++ mm/page_alloc.c | 56 +++++++++++++++++++++++-- 4 files changed, 83 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 2fcafccb53a8..415f2aebf59b 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -64,6 +64,7 @@ files can be found in mm/swap.c. - overcommit_ratio - page-cluster - panic_on_oom +- percpu_pagelist_high_fraction - stat_interval - stat_refresh - numa_stat @@ -789,6 +790,25 @@ panic_on_oom=2+kdump gives you very strong tool to investigate why oom happens. You can get snapshot. +percpu_pagelist_high_fraction +============================= + +This is the fraction of pages at most (high mark pcp->high) in each zone that +are allocated for each per cpu page list. The min value for this is 8. It +means that we do not allow more than 1/8th of pages in each zone to be +allocated in any single per_cpu_pagelist. This entry only changes the value +of hot per cpu pagelists. User can specify a number like 100 to allocate +1/100th of each zone to each per cpu page list. + +The batch value of each per cpu pagelist remains the same regardless of the +value of the high fraction so allocation latencies are unaffected. + +The initial value is zero. 
Kernel uses this value to set the high pcp->high +mark based on the low watermark for the zone and the number of local +online CPUs. If the user writes '0' to this sysctl, it will revert to +this default behavior. + + stat_interval ============= diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a0606239a167..e20d98c62beb 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1040,12 +1040,15 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *, extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *); +int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *, int, + void *, size_t *, loff_t *); int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *); int numa_zonelist_order_handler(struct ctl_table *, int, void *, size_t *, loff_t *); +extern int percpu_pagelist_high_fraction; extern char numa_zonelist_order[]; #define NUMA_ZONELIST_ORDER_LEN 16 diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 4e5ac50a1af0..9eb9d1f987d9 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2889,6 +2889,14 @@ static struct ctl_table vm_table[] = { .extra1 = SYSCTL_ONE, .extra2 = &one_thousand, }, + { + .procname = "percpu_pagelist_high_fraction", + .data = &percpu_pagelist_high_fraction, + .maxlen = sizeof(percpu_pagelist_high_fraction), + .mode = 0644, + .proc_handler = percpu_pagelist_high_fraction_sysctl_handler, + .extra1 = SYSCTL_ZERO, + }, { .procname = "page_lock_unfairness", .data = &sysctl_page_lock_unfairness, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d8f8044781c4..08f9e5027ed4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -120,6 +120,7 @@ typedef int __bitwise fpi_t; /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); +#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) struct pagesets { local_lock_t lock; @@ -181,6 +182,7 @@ EXPORT_SYMBOL(_totalram_pages); unsigned long totalreserve_pages __read_mostly; unsigned long totalcma_pages __read_mostly; +int percpu_pagelist_high_fraction; gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc); EXPORT_SYMBOL(init_on_alloc); @@ -6670,7 +6672,8 @@ static int zone_batchsize(struct zone *zone) #endif } -static int zone_highsize(struct zone *zone, int cpu_online) +static int +zone_highsize(struct zone *zone, unsigned long total_pages, int cpu_online) { #ifdef CONFIG_MMU int high; @@ -6683,7 +6686,7 @@ static int zone_highsize(struct zone *zone, int cpu_online) * not be online yet. */ nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online; - high = low_wmark_pages(zone) / nr_local_cpus; + high = total_pages / nr_local_cpus; return high; #else @@ -6749,14 +6752,21 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h /* * Calculate and set new high and batch values for all per-cpu pagesets of a - * zone based on the zone's size. + * zone based on the zone's size and the percpu_pagelist_high_fraction sysctl. 
*/ static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online) { int new_high, new_batch; + if (!percpu_pagelist_high_fraction) { + new_high = zone_highsize(zone, low_wmark_pages(zone), cpu_online); + } else { + new_high = zone_highsize(zone, + zone_managed_pages(zone) / percpu_pagelist_high_fraction, + cpu_online); + } + new_batch = max(1, zone_batchsize(zone)); - new_high = zone_highsize(zone, cpu_online); if (zone->pageset_high == new_high && zone->pageset_batch == new_batch) @@ -8443,6 +8453,44 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write, return 0; } +/* + * percpu_pagelist_high_fraction - changes the pcp->high for each zone on each + * cpu. It is the fraction of total pages in each zone that a hot per cpu + * pagelist can have before it gets flushed back to buddy allocator. + */ +int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table, + int write, void *buffer, size_t *length, loff_t *ppos) +{ + struct zone *zone; + int old_percpu_pagelist_high_fraction; + int ret; + + mutex_lock(&pcp_batch_high_lock); + old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction; + + ret = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (!write || ret < 0) + goto out; + + /* Sanity checking to avoid pcp imbalance */ + if (percpu_pagelist_high_fraction && + percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION) { + percpu_pagelist_high_fraction = old_percpu_pagelist_high_fraction; + ret = -EINVAL; + goto out; + } + + /* No change? */ + if (percpu_pagelist_high_fraction == old_percpu_pagelist_high_fraction) + goto out; + + for_each_populated_zone(zone) + zone_set_pageset_high_and_batch(zone, 0); +out: + mutex_unlock(&pcp_batch_high_lock); + return ret; +} + #ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES /* * Returns the number of pages that arch has reserved but