From patchwork Wed Jul 13 08:39:52 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Ying" X-Patchwork-Id: 12916280 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 40D00C43334 for ; Wed, 13 Jul 2022 08:40:38 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D3CE6940114; Wed, 13 Jul 2022 04:40:37 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id CEE8E9400E5; Wed, 13 Jul 2022 04:40:37 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B9000940114; Wed, 13 Jul 2022 04:40:37 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id A91459400E5 for ; Wed, 13 Jul 2022 04:40:37 -0400 (EDT) Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 779C734C62 for ; Wed, 13 Jul 2022 08:40:37 +0000 (UTC) X-FDA: 79681430514.10.4AD7584 Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by imf30.hostedemail.com (Postfix) with ESMTP id DD89380079 for ; Wed, 13 Jul 2022 08:40:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657701636; x=1689237636; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=vUd05dvyBiM84Ezjr6E/K54uJBFgUb38qWnD0JwjrJs=; b=gEC9Eysj8UoFV0fz3SrW7yZEyDH/d4NRUr6jWJh9yqKwliHi2lO7W3qD 4AoOgzLxWefppfc46QKdlqp04+u4S+Yh98pAfIHbxJBvRPVVnc0/PYsHQ CLp7Cwwxu8tRik+bKVcD9umNzOH5hLdfOOymJn3xVxM1/94/AqqA8zANM ht3KQEkXNXZB/9/ckJuB2yjKihKtSW+CNmKuK81WmnvsQuxTOKFkPhDyy 5pF4ungQX1g6KX/O3lmexN3anTDEtqYlnYkNUK2vx4euairosVJQIB7rk T2W5qsYkWlY0vwOzb4rh6MRx3817VtHi59lzHjhr9XrW7W2BJHJZQI5wz w==; X-IronPort-AV: E=McAfee;i="6400,9594,10406"; a="282704736" X-IronPort-AV: E=Sophos;i="5.92,267,1650956400"; d="scan'208";a="282704736" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Jul 2022 01:40:36 -0700 X-IronPort-AV: E=Sophos;i="5.92,267,1650956400"; d="scan'208";a="653284485" Received: from yijunwa1-mobl.ccr.corp.intel.com (HELO yhuang6-mobl1.ccr.corp.intel.com) ([10.254.215.54]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Jul 2022 01:40:32 -0700 From: Huang Ying To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying , Baolin Wang , Johannes Weiner , Michal Hocko , Rik van Riel , Mel Gorman , Peter Zijlstra , Dave Hansen , Yang Shi , Zi Yan , Wei Xu , osalvador , Shakeel Butt , Zhong Jiang Subject: [PATCH -V4 RESEND 2/3] memory tiering: rate limit NUMA migration throughput Date: Wed, 13 Jul 2022 16:39:52 +0800 Message-Id: <20220713083954.34196-3-ying.huang@intel.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220713083954.34196-1-ying.huang@intel.com> References: <20220713083954.34196-1-ying.huang@intel.com> MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1657701637; a=rsa-sha256; cv=none; b=ex03TMBQW6Z2f9AFWnW+qG7+7vr+MPODzycAbAdamOo49Z0INFZrgdzbMrB7fUoM6SdA7P zRWkxaqcGS/8jh5NJuiyY9QpZvcMZeeqsNf+FfhLcMQiws6r8S91sCoipK1u6UUF7NkU1s maczbTOCtXvdU454s/iVzz71MvZTyWM= ARC-Authentication-Results: i=1; imf30.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=gEC9Eysj; dmarc=pass (policy=none) header.from=intel.com; spf=none (imf30.hostedemail.com: domain of ying.huang@intel.com has no SPF policy when checking 192.55.52.93) smtp.mailfrom=ying.huang@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1657701637; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=4UpM1ymW5skAjQfGRKyP6GGfoBSJ74v+tEeUxWA/a4c=; b=wW/Xdu6J/1kC09z0GpQnsa95ghjgUzTYdNX2zECzeVR0FMR6MoUmB+8OWxhTa7j+IBTKmZ dirqEtozVWZ9j6ojv5RuTwIYyQpAXmN+V5CD6AyKjB6AzHGgrrKWxOOFnYs4Tn3tGnx3zp PzwG1yQJ94khJjdG9A4V5SREQ0ttF9k= X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: DD89380079 Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=gEC9Eysj; dmarc=pass (policy=none) header.from=intel.com; spf=none (imf30.hostedemail.com: domain of ying.huang@intel.com has no SPF policy when checking 192.55.52.93) smtp.mailfrom=ying.huang@intel.com X-Stat-Signature: monfjg1ixaih1qh7kwimog4yrenk6bxm X-Rspam-User: X-HE-Tag: 1657701636-267593 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patch as the page promotion rate limit mechanism. The number of the candidate pages to be promoted to the fast memory node via NUMA balancing is counted, if the count exceeds the limit specified by the users, the NUMA balancing promotion will be stopped until the next second. A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added for the users to specify the limit. Signed-off-by: "Huang, Ying" Reviewed-by: Baolin Wang Tested-by: Baolin Wang Cc: Andrew Morton Cc: Johannes Weiner Cc: Michal Hocko Cc: Rik van Riel Cc: Mel Gorman Cc: Peter Zijlstra Cc: Dave Hansen Cc: Yang Shi Cc: Zi Yan Cc: Wei Xu Cc: osalvador Cc: Shakeel Butt Cc: Zhong Jiang Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org --- Documentation/admin-guide/sysctl/kernel.rst | 11 +++++++ include/linux/mmzone.h | 7 +++++ include/linux/sched/sysctl.h | 1 + kernel/sched/fair.c | 33 +++++++++++++++++++-- kernel/sysctl.c | 8 +++++ mm/vmstat.c | 1 + 6 files changed, 59 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index ddccd1077462..c99bceafd162 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -623,6 +623,17 @@ different types of memory (represented as different NUMA nodes) to place the hot pages in the fast memory. This is implemented based on unmapping and page fault too. +numa_balancing_promote_rate_limit_MBps +====================================== + +Too high promotion/demotion throughput between different memory types +may hurt application latency. This can be used to rate limit the +promotion throughput. The per-node max promotion throughput in MB/s +will be limited to be no more than the set value. + +A rule of thumb is to set this to less than 1/10 of the PMEM node +write bandwidth. + oops_all_cpu_backtrace ====================== diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index aab70355d64f..994a0cd39595 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -221,6 +221,7 @@ enum node_stat_item { #endif #ifdef CONFIG_NUMA_BALANCING PGPROMOTE_SUCCESS, /* promote successfully */ + PGPROMOTE_CANDIDATE, /* candidate pages to promote */ #endif NR_VM_NODE_STAT_ITEMS }; @@ -912,6 +913,12 @@ typedef struct pglist_data { struct deferred_split deferred_split_queue; #endif +#ifdef CONFIG_NUMA_BALANCING + /* start time in ms of current promote rate limit period */ + unsigned int nbp_rl_start; + /* number of promote candidate pages at start time of current rate limit period */ + unsigned long nbp_rl_nr_cand; +#endif /* Fields commonly accessed by the page reclaim scanner */ /* diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index e650946816d0..303ee7dd0c7e 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -27,6 +27,7 @@ enum sched_tunable_scaling { #ifdef CONFIG_NUMA_BALANCING extern int sysctl_numa_balancing_mode; +extern unsigned int sysctl_numa_balancing_promote_rate_limit; #else #define sysctl_numa_balancing_mode 0 #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index edc3d741ef84..d779a91a8ca0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1073,6 +1073,9 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000; /* The page with hint page fault latency < threshold in ms is considered hot */ unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC; +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */ +unsigned int sysctl_numa_balancing_promote_rate_limit = 65536; + struct numa_group { refcount_t refcount; @@ -1477,6 +1480,29 @@ static int numa_hint_fault_latency(struct page *page) return (time - last_time) & PAGE_ACCESS_TIME_MASK; } +/* + * For memory tiering mode, too high promotion/demotion throughput may + * hurt application latency. So we provide a mechanism to rate limit + * the number of pages that are tried to be promoted. + */ +static bool numa_promotion_rate_limit(struct pglist_data *pgdat, + unsigned long rate_limit, int nr) +{ + unsigned long nr_cand; + unsigned int now, start; + + now = jiffies_to_msecs(jiffies); + mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr); + nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE); + start = pgdat->nbp_rl_start; + if (now - start > MSEC_PER_SEC && + cmpxchg(&pgdat->nbp_rl_start, start, now) == start) + pgdat->nbp_rl_nr_cand = nr_cand; + if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit) + return true; + return false; +} + bool should_numa_migrate_memory(struct task_struct *p, struct page * page, int src_nid, int dst_cpu) { @@ -1491,7 +1517,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && !node_is_toptier(src_nid)) { struct pglist_data *pgdat; - unsigned long latency, th; + unsigned long rate_limit, latency, th; pgdat = NODE_DATA(dst_nid); if (pgdat_free_space_enough(pgdat)) @@ -1502,7 +1528,10 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, if (latency >= th) return false; - return true; + rate_limit = sysctl_numa_balancing_promote_rate_limit << \ + (20 - PAGE_SHIFT); + return !numa_promotion_rate_limit(pgdat, rate_limit, + thp_nr_pages(page)); } this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e52b6e372c60..3188698e2c8e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1597,6 +1597,14 @@ static struct ctl_table kern_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_FOUR, }, + { + .procname = "numa_balancing_promote_rate_limit_MBps", + .data = &sysctl_numa_balancing_promote_rate_limit, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + }, #endif /* CONFIG_NUMA_BALANCING */ { .procname = "panic", diff --git a/mm/vmstat.c b/mm/vmstat.c index 373d2730fcf2..068ca7d150ab 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1245,6 +1245,7 @@ const char * const vmstat_text[] = { #endif #ifdef CONFIG_NUMA_BALANCING "pgpromote_success", + "pgpromote_candidate", #endif /* enum writeback_stat_item counters */