From patchwork Wed Oct 7 16:17:38 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11820981
Subject: [RFC][PATCH 1/9] mm/numa: node demotion data structure and lookup
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com, rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com, david@redhat.com
From: Dave Hansen
Date: Wed, 07 Oct 2020 09:17:38 -0700
Message-Id: <20201007161738.403BFCD7@viggo.jf.intel.com>
In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com>

From: Dave Hansen

Prepare for the kernel to auto-migrate pages to other memory nodes with a
user-defined node migration table.  This allows creating a single migration
target for each NUMA node so that the kernel can do NUMA page migrations
instead of simply reclaiming colder pages.  A node with no target is a
"terminal node", and reclaim acts normally there.

The migration target does not fundamentally _need_ to be a single node, but
this implementation starts there to limit complexity.  If you consider the
migration path as a graph, cycles (loops) in the graph are disallowed.  This
avoids wasting resources by constantly migrating (A->B, B->A, A->B, ...).
The expectation is that cycles will never be allowed.

Signed-off-by: Dave Hansen
Cc: Yang Shi
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
Cc: David Hildenbrand
---

Changes in July 2020:
 - Remove loop from next_demotion_node() and get_online_mems().  This means
   that the node returned by next_demotion_node() might now be offline, but
   the worst case is that the allocation fails.  That's fine since it is
   transient.
---

 b/mm/migrate.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
--- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path	2020-10-07 09:15:25.978642454 -0700
+++ b/mm/migrate.c	2020-10-07 09:15:25.989642454 -0700
@@ -1161,6 +1161,22 @@ out:
 	return rc;
 }
 
+static int node_demotion[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	return node_demotion[node];
+}
+
 /*
  * Obtain the lock on page, remove all ptes and migrate the page
  * to the newly allocated page in newpage.
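For illustration only, a caller that wants to walk a whole demotion chain with this interface might do something like the sketch below.  This is not part of the patch; demotion_path_length() is a made-up helper, and the nr_node_ids bound is purely a defensive guard since cycles are disallowed by construction.

	/*
	 * Sketch: count the hops from 'node' to its terminal node.
	 * A terminal node (or a node with demotion disabled) yields 0.
	 */
	static int demotion_path_length(int node)
	{
		int hops = 0;

		while (node != NUMA_NO_NODE && hops < nr_node_ids) {
			node = next_demotion_node(node);
			hops++;
		}

		/* The final NUMA_NO_NODE step is not a real hop. */
		return hops ? hops - 1 : 0;
	}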
From patchwork Wed Oct 7 16:17:40 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11820979
Subject: [RFC][PATCH 2/9] mm/numa: automatically generate node migration order
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com, rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com, david@redhat.com
From: Dave Hansen
Date: Wed, 07 Oct 2020 09:17:40 -0700
Message-Id: <20201007161740.244FF532@viggo.jf.intel.com>
In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com>

From: Dave Hansen

When memory fills up on a node, memory contents can be automatically
migrated to another node.  The biggest problems are knowing when to migrate
and where the migration should be targeted.

The most straightforward way to generate the "to where" list would be to
follow the page allocator fallback lists.  Those lists already tell us, if
memory is full, where to look next.  It would also be logical to move memory
in that order.

But, the allocator fallback lists have a fatal flaw: most nodes appear in
all the lists.  This would potentially lead to migration cycles (A->B, B->A,
A->B, ...).

Instead of using the allocator fallback lists directly, keep a separate node
migration ordering.  But, reuse the same data used to generate the page
allocator fallback lists in the first place: find_next_best_node().  This
means that the firmware data used to populate node distances essentially
dictates the ordering for now.  It should also be architecture-neutral since
all NUMA architectures have a working find_next_best_node().

The protocol for node_demotion[] access and writing is not standard.  It has
no specific locking and is intended to be read locklessly.  Readers must
take care to avoid observing changes that appear incoherent.  This was done
so that node_demotion[] locking has no chance of becoming a bottleneck on
large systems with lots of CPUs in direct reclaim.

This code is unused for now.  It will be called later in the series.

Signed-off-by: Dave Hansen
Cc: Yang Shi
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
Cc: David Hildenbrand
---

 b/mm/internal.h   |   1
 b/mm/migrate.c    | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 b/mm/page_alloc.c |   2
 3 files changed, 138 insertions(+), 2 deletions(-)

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware	2020-10-07 09:15:27.027642452 -0700
+++ b/mm/internal.h	2020-10-07 09:15:27.039642452 -0700
@@ -203,6 +203,7 @@ extern int user_min_free_kbytes;
 
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware	2020-10-07 09:15:27.031642452 -0700
+++ b/mm/migrate.c	2020-10-07 09:15:27.041642452 -0700
@@ -1161,6 +1161,10 @@ out:
 	return rc;
 }
 
+/*
+ * Writes to this array occur without locking.  READ_ONCE()
+ * is recommended for readers to ensure consistent reads.
+ */
 static int node_demotion[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
 /**
@@ -1174,7 +1178,13 @@ static int node_demotion[MAX_NUMNODES] =
  */
 int next_demotion_node(int node)
 {
-	return node_demotion[node];
+	/*
+	 * node_demotion[] is updated without excluding
+	 * this function from running.
+	 * READ_ONCE() avoids reading multiple, inconsistent
+	 * 'node' values during an update.
+	 */
+	return READ_ONCE(node_demotion[node]);
 }
 
 /*
@@ -3112,3 +3122,128 @@ void migrate_vma_finalize(struct migrate
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_online_node(node)
+		node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+	int migration_target;
+
+	/*
+	 * Can not set a migration target on a
+	 * node which already has one.
+	 *
+	 * No need for READ_ONCE() here since this
+	 * is the write path for node_demotion[].
+	 * This should be the only thread writing.
+	 */
+	if (node_demotion[node] != NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	migration_target = find_next_best_node(node, used);
+	if (migration_target == NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	node_demotion[node] = migration_target;
+
+	return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided.  If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[].  However, it can not run simultaneously
+ * with itself.  Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+void __set_migration_target_nodes(void)
+{
+	nodemask_t next_pass	= NODE_MASK_NONE;
+	nodemask_t this_pass	= NODE_MASK_NONE;
+	nodemask_t used_targets = NODE_MASK_NONE;
+	int node;
+
+	/*
+	 * Avoid any oddities like cycles that could occur
+	 * from changes in the topology.  This will leave
+	 * a momentary gap when migration is disabled.
+	 */
+	disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	smp_wmb();
+
+	/*
+	 * Allocations go close to CPUs, first.  Assume that
+	 * the migration path starts at the nodes with CPUs.
+	 */
+	next_pass = node_states[N_CPU];
+again:
+	this_pass = next_pass;
+	next_pass = NODE_MASK_NONE;
+	/*
+	 * To avoid cycles in the migration "graph", ensure
+	 * that migration sources are not future targets by
+	 * setting them in 'used_targets'.  Do this only
+	 * once per pass so that multiple source nodes can
+	 * share a target node.
+	 *
+	 * 'used_targets' will become unavailable in future
+	 * passes.  This limits some opportunities for
+	 * multiple source nodes to share a destination.
+	 */
+	nodes_or(used_targets, used_targets, this_pass);
+
+	for_each_node_mask(node, this_pass) {
+		int target_node = establish_migrate_target(node, &used_targets);
+
+		if (target_node == NUMA_NO_NODE)
+			continue;
+
+		/* Visit targets from this pass in the next pass: */
+		node_set(target_node, next_pass);
+	}
+	/* Is another pass necessary? */
+	if (!nodes_empty(next_pass))
+		goto again;
+}
+
+void set_migration_target_nodes(void)
+{
+	get_online_mems();
+	__set_migration_target_nodes();
+	put_online_mems();
+}
diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
--- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware	2020-10-07 09:15:27.035642452 -0700
+++ b/mm/page_alloc.c	2020-10-07 09:15:27.043642452 -0700
@@ -5632,7 +5632,7 @@ static int node_load[MAX_NUMNODES];
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;
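To make the pass structure concrete, here is a worked example for a hypothetical topology that is not taken from the patch: two DRAM nodes 0 and 1 with CPUs, two CPU-less PMEM nodes 2 and 3, and firmware distances that make node 2 closest to node 0 and node 3 closest to node 1.  A sketch of the resulting table:

	/*
	 * Assumed example topology (not from the patch):
	 *   nodes 0, 1: DRAM with CPUs;  nodes 2, 3: CPU-less PMEM
	 *
	 * Pass 1: this_pass = {0,1}, used_targets = {0,1}
	 *         node_demotion[0] = 2, node_demotion[1] = 3
	 * Pass 2: this_pass = {2,3}, used_targets = {0,1,2,3}
	 *         find_next_best_node() finds no unused node, so 2 and 3
	 *         remain terminal and the walk stops.
	 */
	static const int example_node_demotion[4] = {
		[0] = 2,		/* DRAM 0 demotes to PMEM 2 */
		[1] = 3,		/* DRAM 1 demotes to PMEM 3 */
		[2] = NUMA_NO_NODE,	/* terminal */
		[3] = NUMA_NO_NODE,	/* terminal */
	};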
From patchwork Wed Oct 7 16:17:41 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11820965
Subject: [RFC][PATCH 3/9] mm/migrate: update migration order on hotplug events
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com, rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com, david@redhat.com
From: Dave Hansen
Date: Wed, 07 Oct 2020 09:17:41 -0700
Message-Id: <20201007161741.DDC85648@viggo.jf.intel.com>
In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com>

From: Dave Hansen

Reclaim-based migration is attempting to optimize data placement in memory
based on the system topology.  If the system changes, so must the migration
ordering.

The implementation here is pretty simple and entirely unoptimized.  On any
memory or CPU hotplug event, assume that a node was added or removed and
recalculate all migration targets.  This ensures that the node_demotion[]
array is always ready to be used in case the new reclaim mode is enabled.

This recalculation is far from optimal, most glaringly in that it does not
even attempt to figure out whether nodes are actually coming or going.  But,
given the expected paucity of hotplug events, this should be fine.

Signed-off-by: Dave Hansen
Cc: Yang Shi
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
Cc: David Hildenbrand
---

 b/mm/migrate.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion	2020-10-07 09:15:28.260642449 -0700
+++ b/mm/migrate.c	2020-10-07 09:15:28.266642449 -0700
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 #include
 
@@ -3241,9 +3242,101 @@ again:
 	goto again;
 }
 
+/*
+ * For callers that do not hold get_online_mems() already.
+ */
 void set_migration_target_nodes(void)
 {
 	get_online_mems();
 	__set_migration_target_nodes();
 	put_online_mems();
 }
+
+/*
+ * React to hotplug events that might affect the migration targets
+ * like events that online or offline NUMA nodes.
+ *
+ * The ordering is also currently dependent on which nodes have
+ * CPUs.  That means we need CPU on/offline notification too.
+ */
+static int migration_online_cpu(unsigned int cpu)
+{
+	set_migration_target_nodes();
+	return 0;
+}
+
+static int migration_offline_cpu(unsigned int cpu)
+{
+	set_migration_target_nodes();
+	return 0;
+}
+
+/*
+ * This leaves migrate-on-reclaim transiently disabled
+ * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
+ * This runs whether reclaim-based migration is enabled
+ * or not.  This ensures that the user can turn reclaim-based
+ * migration on at any time without needing to recalculate
+ * migration targets.
+ *
+ * These callbacks already hold get_online_mems().  That
+ * is why __set_migration_target_nodes() can be used as
+ * opposed to set_migration_target_nodes().
+ */
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		/*
+		 * Make sure there are not transient states where
+		 * an offline node is a migration target.  This
+		 * will leave migration disabled until the offline
+		 * completes and the MEM_OFFLINE case below runs.
+		 */
+		disable_all_migrate_targets();
+		break;
+	case MEM_OFFLINE:
+	case MEM_ONLINE:
+		/*
+		 * Recalculate the target nodes once the node
+		 * reaches its final state (online or offline).
+		 */
+		__set_migration_target_nodes();
+		break;
+	case MEM_CANCEL_OFFLINE:
+		/*
+		 * MEM_GOING_OFFLINE disabled all the migration
+		 * targets.  Reenable them.
+		 */
+		__set_migration_target_nodes();
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
+static int __init migrate_on_reclaim_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "migrate on reclaim",
+				migration_online_cpu,
+				migration_offline_cpu);
+	/*
+	 * In the unlikely case that this fails, the automatic
+	 * migration targets may become suboptimal for nodes
+	 * where N_CPU changes.  With such a small impact in a
+	 * rare case, do not bother trying to do anything special.
+	 */
+	WARN_ON(ret < 0);
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+	return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
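On the reader side, the effect of this notifier scheme can be summed up with a small sketch (illustrative only, not part of the patch): between MEM_GOING_OFFLINE and MEM_OFFLINE, next_demotion_node() simply reports NUMA_NO_NODE and callers fall back to normal reclaim, and since the returned node may go offline at any moment, a failed allocation on it is treated as "no demotion" rather than an error.

	/*
	 * Sketch of a reclaim-side consumer.  No locking against the
	 * writer is needed; a stale or disabled entry only means the
	 * caller skips demotion for this pass.
	 */
	static int pick_demotion_target(int nid)
	{
		int target = next_demotion_node(nid);

		if (target == NUMA_NO_NODE || !node_online(target))
			return NUMA_NO_NODE;

		return target;
	}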
From patchwork Wed Oct 7 16:17:43 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11820969
Subject: [RFC][PATCH 4/9] mm/migrate: make migrate_pages() return nr_succeeded
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com, rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com, david@redhat.com
From: Dave Hansen
Date: Wed, 07 Oct 2020 09:17:43 -0700
Message-Id: <20201007161743.CF8F79F6@viggo.jf.intel.com>
In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com>

From: Yang Shi

migrate_pages() returns the number of pages that were not migrated, or an
error code.  When returning an error code, there is no way to know how many
pages were migrated or not migrated.

In the following patch, migrate_pages() is used to demote pages to a PMEM
node, and we need to account for how many pages are reclaimed (demoted)
since page reclaim behavior depends on this.  Add a *nr_succeeded parameter
to make migrate_pages() return how many pages were migrated successfully in
all cases.

Signed-off-by: Yang Shi
Signed-off-by: Dave Hansen
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
Cc: David Hildenbrand
---

 b/include/linux/migrate.h |  5 +++--
 b/mm/compaction.c         |  3 ++-
 b/mm/gup.c                |  4 +++-
 b/mm/memory-failure.c     |  7 +++++--
 b/mm/memory_hotplug.c     |  4 +++-
 b/mm/mempolicy.c          |  7 +++++--
 b/mm/migrate.c            | 16 +++++++++-------
 b/mm/page_alloc.c         |  9 ++++++---
 8 files changed, 36 insertions(+), 19 deletions(-)

diff -puN include/linux/migrate.h~migrate_pages-add-success-return include/linux/migrate.h
--- a/include/linux/migrate.h~migrate_pages-add-success-return	2020-10-07 09:15:29.333642446 -0700
+++ b/include/linux/migrate.h	2020-10-07 09:15:29.375642446 -0700
@@ -40,7 +40,8 @@ extern int migrate_page(struct address_s
 		struct page *newpage, struct page *page,
 		enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
-		unsigned long private, enum migrate_mode mode, int reason);
+		unsigned long private, enum migrate_mode mode, int reason,
+		unsigned int *nr_succeeded);
 extern struct page *alloc_migration_target(struct page *page, unsigned long private);
 extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
@@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(str
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
 		free_page_t free, unsigned long private, enum migrate_mode mode,
-		int reason)
+		int reason, unsigned int *nr_succeeded)
 	{ return -ENOSYS; }
 static inline struct page *alloc_migration_target(struct page *page,
 		unsigned long private)
diff -puN mm/compaction.c~migrate_pages-add-success-return mm/compaction.c
--- a/mm/compaction.c~migrate_pages-add-success-return	2020-10-07 09:15:29.337642446 -0700
+++ b/mm/compaction.c	2020-10-07 09:15:29.380642446 -0700
@@ -2196,6 +2196,7 @@ compact_zone(struct compact_control *cc,
 	unsigned long last_migrated_pfn;
 	const bool sync = cc->mode != MIGRATE_ASYNC;
 	bool update_cached;
+	unsigned int nr_succeeded = 0;
 
 	/*
	 * These counters track activities during zone compaction.
Initialize @@ -2314,7 +2315,7 @@ compact_zone(struct compact_control *cc, err = migrate_pages(&cc->migratepages, compaction_alloc, compaction_free, (unsigned long)cc, cc->mode, - MR_COMPACTION); + MR_COMPACTION, &nr_succeeded); trace_mm_compaction_migratepages(cc->nr_migratepages, err, &cc->migratepages); diff -puN mm/gup.c~migrate_pages-add-success-return mm/gup.c --- a/mm/gup.c~migrate_pages-add-success-return 2020-10-07 09:15:29.345642446 -0700 +++ b/mm/gup.c 2020-10-07 09:15:29.384642446 -0700 @@ -1586,6 +1586,7 @@ static long check_and_migrate_cma_pages( unsigned long step; bool drain_allow = true; bool migrate_allow = true; + unsigned int nr_succeeded = 0; LIST_HEAD(cma_page_list); long ret = nr_pages; struct migration_target_control mtc = { @@ -1638,7 +1639,8 @@ check_again: put_page(pages[i]); if (migrate_pages(&cma_page_list, alloc_migration_target, NULL, - (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) { + (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE, + &nr_succeeded)) { /* * some of the pages failed migration. Do get_user_pages * without migration. diff -puN mm/memory-failure.c~migrate_pages-add-success-return mm/memory-failure.c --- a/mm/memory-failure.c~migrate_pages-add-success-return 2020-10-07 09:15:29.347642446 -0700 +++ b/mm/memory-failure.c 2020-10-07 09:15:29.388642446 -0700 @@ -1724,6 +1724,7 @@ static int soft_offline_huge_page(struct int ret; unsigned long pfn = page_to_pfn(page); struct page *hpage = compound_head(page); + unsigned int nr_succeeded = 0; LIST_HEAD(pagelist); /* @@ -1751,7 +1752,7 @@ static int soft_offline_huge_page(struct } ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, - MIGRATE_SYNC, MR_MEMORY_FAILURE); + MIGRATE_SYNC, MR_MEMORY_FAILURE, &nr_succeeded); if (ret) { pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n", pfn, ret, page->flags, &page->flags); @@ -1782,6 +1783,7 @@ static int __soft_offline_page(struct pa { int ret; unsigned long pfn = page_to_pfn(page); + unsigned int nr_succeeded = 0; /* * Check PageHWPoison again inside page lock because PageHWPoison @@ -1841,7 +1843,8 @@ static int __soft_offline_page(struct pa page_is_file_lru(page)); list_add(&page->lru, &pagelist); ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, - MIGRATE_SYNC, MR_MEMORY_FAILURE); + MIGRATE_SYNC, MR_MEMORY_FAILURE, + &nr_succeeded); if (ret) { if (!list_empty(&pagelist)) putback_movable_pages(&pagelist); diff -puN mm/memory_hotplug.c~migrate_pages-add-success-return mm/memory_hotplug.c --- a/mm/memory_hotplug.c~migrate_pages-add-success-return 2020-10-07 09:15:29.353642446 -0700 +++ b/mm/memory_hotplug.c 2020-10-07 09:15:29.391642446 -0700 @@ -1301,6 +1301,7 @@ do_migrate_range(unsigned long start_pfn unsigned long pfn; struct page *page, *head; int ret = 0; + unsigned int nr_succeeded = 0; LIST_HEAD(source); for (pfn = start_pfn; pfn < end_pfn; pfn++) { @@ -1356,7 +1357,8 @@ do_migrate_range(unsigned long start_pfn if (!list_empty(&source)) { /* Allocate a new page from the nearest neighbor node */ ret = migrate_pages(&source, new_node_page, NULL, 0, - MIGRATE_SYNC, MR_MEMORY_HOTPLUG); + MIGRATE_SYNC, MR_MEMORY_HOTPLUG, + &nr_succeeded); if (ret) { list_for_each_entry(page, &source, lru) { pr_warn("migrating pfn %lx failed ret:%d ", diff -puN mm/mempolicy.c~migrate_pages-add-success-return mm/mempolicy.c --- a/mm/mempolicy.c~migrate_pages-add-success-return 2020-10-07 09:15:29.356642446 -0700 +++ b/mm/mempolicy.c 2020-10-07 09:15:29.396642446 -0700 @@ -1072,6 +1072,7 @@ static int 
migrate_page_add(struct page static int migrate_to_node(struct mm_struct *mm, int source, int dest, int flags) { + unsigned int nr_succeeded = 0; nodemask_t nmask; LIST_HEAD(pagelist); int err = 0; @@ -1094,7 +1095,7 @@ static int migrate_to_node(struct mm_str if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, alloc_migration_target, NULL, - (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL); + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, &nr_succeeded); if (err) putback_movable_pages(&pagelist); } @@ -1271,6 +1272,7 @@ static long do_mbind(unsigned long start nodemask_t *nmask, unsigned long flags) { struct mm_struct *mm = current->mm; + unsigned int nr_succeeded = 0; struct mempolicy *new; unsigned long end; int err; @@ -1352,7 +1354,8 @@ static long do_mbind(unsigned long start if (!list_empty(&pagelist)) { WARN_ON_ONCE(flags & MPOL_MF_LAZY); nr_failed = migrate_pages(&pagelist, new_page, NULL, - start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND); + start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, + &nr_succeeded); if (nr_failed) putback_movable_pages(&pagelist); } diff -puN mm/migrate.c~migrate_pages-add-success-return mm/migrate.c --- a/mm/migrate.c~migrate_pages-add-success-return 2020-10-07 09:15:29.362642446 -0700 +++ b/mm/migrate.c 2020-10-07 09:15:29.401642446 -0700 @@ -1433,6 +1433,7 @@ out: * @mode: The migration mode that specifies the constraints for * page migration, if any. * @reason: The reason for page migration. + * @nr_succeeded: The number of pages migrated successfully. * * The function returns after 10 attempts or if no pages are movable any more * because the list has become empty or no retryable pages exist any more. @@ -1443,12 +1444,11 @@ out: */ int migrate_pages(struct list_head *from, new_page_t get_new_page, free_page_t put_new_page, unsigned long private, - enum migrate_mode mode, int reason) + enum migrate_mode mode, int reason, unsigned int *nr_succeeded) { int retry = 1; int thp_retry = 1; int nr_failed = 0; - int nr_succeeded = 0; int nr_thp_succeeded = 0; int nr_thp_failed = 0; int nr_thp_split = 0; @@ -1529,7 +1529,7 @@ retry: nr_succeeded += nr_subpages; break; } - nr_succeeded++; + (*nr_succeeded)++; break; default: /* @@ -1552,12 +1552,12 @@ retry: nr_thp_failed += thp_retry; rc = nr_failed; out: - count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded); + count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded); count_vm_events(PGMIGRATE_FAIL, nr_failed); count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded); count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed); count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split); - trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded, + trace_mm_migrate_pages(*nr_succeeded, nr_failed, nr_thp_succeeded, nr_thp_failed, nr_thp_split, mode, reason); if (!swapwrite) @@ -1625,6 +1625,7 @@ static int store_status(int __user *stat static int do_move_pages_to_node(struct mm_struct *mm, struct list_head *pagelist, int node) { + unsigned int nr_succeeded = 0; int err; struct migration_target_control mtc = { .nid = node, @@ -1632,7 +1633,7 @@ static int do_move_pages_to_node(struct }; err = migrate_pages(pagelist, alloc_migration_target, NULL, - (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL); + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL, &nr_succeeded); if (err) putback_movable_pages(pagelist); return err; @@ -2090,6 +2091,7 @@ int migrate_misplaced_page(struct page * pg_data_t *pgdat = NODE_DATA(node); int isolated; int nr_remaining; + unsigned int nr_succeeded = 0; LIST_HEAD(migratepages); /* @@ -2114,7 +2116,7 @@ int 
migrate_misplaced_page(struct page *
 	list_add(&page->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
 				     NULL, node, MIGRATE_ASYNC,
-				     MR_NUMA_MISPLACED);
+				     MR_NUMA_MISPLACED, &nr_succeeded);
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
diff -puN mm/page_alloc.c~migrate_pages-add-success-return mm/page_alloc.c
--- a/mm/page_alloc.c~migrate_pages-add-success-return	2020-10-07 09:15:29.371642446 -0700
+++ b/mm/page_alloc.c	2020-10-07 09:15:29.409642446 -0700
@@ -8346,7 +8346,8 @@ static unsigned long pfn_max_align_up(un
 
 /* [start, end) must belong to a single zone. */
 static int __alloc_contig_migrate_range(struct compact_control *cc,
-					unsigned long start, unsigned long end)
+					unsigned long start, unsigned long end,
+					unsigned int *nr_succeeded)
 {
 	/* This function is based on compact_zone() from compaction.c. */
 	unsigned int nr_reclaimed;
@@ -8384,7 +8385,8 @@ static int __alloc_contig_migrate_range(
 		cc->nr_migratepages -= nr_reclaimed;
 
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
-				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
+				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE,
+				nr_succeeded);
 	}
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
@@ -8420,6 +8422,7 @@ int alloc_contig_range(unsigned long sta
 	unsigned long outer_start, outer_end;
 	unsigned int order;
 	int ret = 0;
+	unsigned int nr_succeeded = 0;
 
 	struct compact_control cc = {
 		.nr_migratepages = 0,
@@ -8472,7 +8475,7 @@ int alloc_contig_range(unsigned long sta
 	 * allocated.  So, if we fall through be sure to clear ret so that
 	 * -EBUSY is not accidentally used or returned to caller.
 	 */
-	ret = __alloc_contig_migrate_range(&cc, start, end);
+	ret = __alloc_contig_migrate_range(&cc, start, end, &nr_succeeded);
 	if (ret && ret != -EBUSY)
 		goto done;
 	ret =0;
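As a usage sketch (not taken from the patch set), a caller of the new signature that cares about partial success might look like the following; the page list and migration_target_control setup are assumed to exist already:

	/*
	 * Sketch: nr_succeeded is filled in even when migrate_pages()
	 * returns a failure count or an error code, so partial progress
	 * can still be accounted.
	 */
	unsigned int nr_succeeded = 0;
	int err;

	err = migrate_pages(&pagelist, alloc_migration_target, NULL,
			    (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL,
			    &nr_succeeded);
	if (err)
		putback_movable_pages(&pagelist);
	/* nr_succeeded is valid here even if err != 0 (partial success). */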
From patchwork Wed Oct 7 16:17:45 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11820967
Subject: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com, rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com
From: Dave Hansen
Date: Wed, 07 Oct 2020 09:17:45 -0700
Message-Id: <20201007161745.26B1D789@viggo.jf.intel.com>
In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com>

From: Dave Hansen

This is mostly derived from a patch from Yang Shi:

	https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/

Add code to the reclaim path (shrink_page_list()) to "demote" data to
another NUMA node instead of discarding the data.  This always avoids the
cost of I/O needed to read the page back in and sometimes avoids the
writeout cost when the page is dirty.

A second pass through shrink_page_list() will be made if any demotions
fail.  This essentially falls back to normal reclaim behavior in the case
that demotions fail.  Previous versions of this patch may have simply
failed to reclaim pages which were eligible for demotion but were unable
to be demoted in practice.

Note: This just adds the start of infrastructure for migration.  It is
actually disabled next to the FIXME in migrate_demote_page_ok().

Signed-off-by: Dave Hansen
Cc: Yang Shi
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
---

Changes from 20200730:
 * Add another pass through shrink_page_list() when demotion fails.
---

 b/include/linux/migrate.h |  2
 b/mm/vmscan.c             | 97 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff -puN include/linux/migrate.h~demote-with-migrate_pages include/linux/migrate.h
--- a/include/linux/migrate.h~demote-with-migrate_pages	2020-10-07 09:15:31.028642442 -0700
+++ b/include/linux/migrate.h	2020-10-07 09:15:31.034642442 -0700
@@ -27,6 +27,7 @@ enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTION,
 	MR_TYPES
 };
 
@@ -196,6 +197,7 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int next_demotion_node(int node);
 
 #endif /* CONFIG_MIGRATION */
 
diff -puN mm/vmscan.c~demote-with-migrate_pages mm/vmscan.c
--- a/mm/vmscan.c~demote-with-migrate_pages	2020-10-07 09:15:31.030642442 -0700
+++ b/mm/vmscan.c	2020-10-07 09:15:31.037642442 -0700
@@ -43,6 +43,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 
@@ -1034,6 +1035,24 @@ static enum page_references page_check_r
 	return PAGEREF_RECLAIM;
 }
 
+bool migrate_demote_page_ok(struct page *page, struct scan_control *sc)
+{
+	int next_nid = next_demotion_node(page_to_nid(page));
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (next_nid == NUMA_NO_NODE)
+		return false;
+	if (PageTransHuge(page) && !thp_migration_supported())
+		return false;
+
+	// FIXME: actually enable this later in the series
+	return false;
+}
+
+
 /* Check if a page is dirty or under writeback */
 static void page_check_dirty_writeback(struct page *page,
 				       bool *dirty, bool *writeback)
@@ -1064,6 +1083,60 @@ static void page_check_dirty_writeback(s
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+static struct page *alloc_demote_page(struct page *page, unsigned long node)
+{
+	/*
+	 * Try to fail quickly if memory on the target node is not
+	 * available.  Leaving out __GFP_IO and __GFP_FS helps with
+	 * this.  If the destination node is full, we want kswapd to
+	 * run there so that its pages will get reclaimed and future
+	 * migration attempts may succeed.
+	 */
+	gfp_t flags = (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY |
+		       __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE |
+		       __GFP_KSWAPD_RECLAIM);
+	/* HugeTLB pages should not be on the LRU */
+	WARN_ON_ONCE(PageHuge(page));
+
+	if (PageTransHuge(page)) {
+		struct page *thp;
+
+		flags |= __GFP_COMP;
+
+		thp = alloc_pages_node(node, flags, HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	}
+
+	return __alloc_pages_node(node, flags, 0);
+}
+
+/*
+ * Take pages on @demote_list and attempt to demote them to
+ * another node.  Pages which are not demoted are left on
+ * @demote_pages.
+ */
+static unsigned int demote_page_list(struct list_head *demote_pages,
+				     struct pglist_data *pgdat,
+				     struct scan_control *sc)
+{
+	int target_nid = next_demotion_node(pgdat->node_id);
+	unsigned int nr_succeeded = 0;
+	int err;
+
+	if (list_empty(demote_pages))
+		return 0;
+
+	/* Demotion ignores all cpuset and mempolicy settings */
+	err = migrate_pages(demote_pages, alloc_demote_page, NULL,
+			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
+			    &nr_succeeded);
+
+	return nr_succeeded;
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1076,12 +1149,15 @@ static unsigned int shrink_page_list(str
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(demote_pages);
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
+	bool do_demote_pass = true;
 
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
 
+retry:
 	while (!list_empty(page_list)) {
 		struct address_space *mapping;
 		struct page *page;
@@ -1231,6 +1307,16 @@ static unsigned int shrink_page_list(str
 		}
 
 		/*
+		 * Before reclaiming the page, try to relocate
+		 * its contents to another node.
+		 */
+		if (do_demote_pass && migrate_demote_page_ok(page, sc)) {
+			list_add(&page->lru, &demote_pages);
+			unlock_page(page);
+			continue;
+		}
+
+		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
@@ -1477,6 +1563,17 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	/* 'page_list' is always empty here */
+
+	/* Migrate pages selected for demotion */
+	nr_reclaimed += demote_page_list(&demote_pages, pgdat, sc);
+	/* Pages that could not be demoted are still in @demote_pages */
+	if (!list_empty(&demote_pages)) {
+		/* Pages which failed to be demoted go back on @page_list for retry: */
+		list_splice_init(&demote_pages, page_list);
+		do_demote_pass = false;
+		goto retry;
+	}
 
 	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
From patchwork Wed Oct 7 16:17:47 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11820971
Subject: [RFC][PATCH 6/9] mm/vmscan: add page demotion counter
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com, rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com, david@redhat.com
From: Dave Hansen
Date: Wed, 07 Oct 2020 09:17:47 -0700
Message-Id: <20201007161747.FE7288F0@viggo.jf.intel.com>
In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com>

From: Yang Shi

Account the number of demoted pages into reclaim_state->nr_demoted.

Add pgdemote_kswapd and pgdemote_direct VM counters shown in /proc/vmstat.
[ daveh: - __count_vm_events() a bit, and made them look at the THP size
  directly rather than getting data from migrate_pages() ]

Signed-off-by: Yang Shi
Signed-off-by: Dave Hansen
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
Cc: David Hildenbrand
---

 b/include/linux/vm_event_item.h | 2 ++
 b/mm/vmscan.c                   | 6 ++++++
 b/mm/vmstat.c                   | 2 ++
 3 files changed, 10 insertions(+)

diff -puN include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter include/linux/vm_event_item.h
--- a/include/linux/vm_event_item.h~mm-vmscan-add-page-demotion-counter	2020-10-07 09:15:32.171642439 -0700
+++ b/include/linux/vm_event_item.h	2020-10-07 09:15:32.179642439 -0700
@@ -33,6 +33,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		PGREUSE,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
+		PGDEMOTE_KSWAPD,
+		PGDEMOTE_DIRECT,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
 		PGSCAN_DIRECT_THROTTLE,
diff -puN mm/vmscan.c~mm-vmscan-add-page-demotion-counter mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-add-page-demotion-counter	2020-10-07 09:15:32.173642439 -0700
+++ b/mm/vmscan.c	2020-10-07 09:15:32.180642439 -0700
@@ -147,6 +147,7 @@ struct scan_control {
 		unsigned int immediate;
 		unsigned int file_taken;
 		unsigned int taken;
+		unsigned int demoted;
 	} nr;
 
 	/* for recording the reclaimed slab by now */
@@ -1134,6 +1135,11 @@ static unsigned int demote_page_list(str
 			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
 			    &nr_succeeded);
 
+	if (current_is_kswapd())
+		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
+	else
+		__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
+
 	return nr_succeeded;
 }
 
diff -puN mm/vmstat.c~mm-vmscan-add-page-demotion-counter mm/vmstat.c
--- a/mm/vmstat.c~mm-vmscan-add-page-demotion-counter	2020-10-07 09:15:32.175642439 -0700
+++ b/mm/vmstat.c	2020-10-07 09:15:32.181642439 -0700
@@ -1244,6 +1244,8 @@ const char * const vmstat_text[] = {
 	"pgreuse",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
+	"pgdemote_kswapd",
+	"pgdemote_direct",
 	"pgscan_kswapd",
 	"pgscan_direct",
 	"pgscan_direct_throttle",
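With the counters in place, demotion activity can be watched from userspace by reading /proc/vmstat.  A minimal sketch in plain C, assuming only the counter names added above:

	/* Print the pgdemote_* counters exported via /proc/vmstat. */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "pgdemote_", 9))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}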
[10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 34B881EE6 for ; Wed, 7 Oct 2020 16:17:58 +0000 (UTC) X-FDA: 77345635836.21.pump50_6116a7b271d0 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin21.hostedemail.com (Postfix) with ESMTP id 118EF180442C0 for ; Wed, 7 Oct 2020 16:17:58 +0000 (UTC) X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,,RULES_HIT:30004:30026:30054:30055:30064,0,RBL:134.134.136.100:@linux.intel.com:.lbl8.mailshell.net-64.95.201.95 62.18.0.100;04yfnk1b55gtkb3mgj6wcncui9645ypggxr7d17brxqgmzc38u7edj1jt81z36b.313d9wr7pad43xcj3co8dt4gsfk43r8h8yyiujg4kgcscba3txtg4xdfdbizyb7.6-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: pump50_6116a7b271d0 X-Filterd-Recvd-Size: 6492 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf24.hostedemail.com (Postfix) with ESMTP for ; Wed, 7 Oct 2020 16:17:57 +0000 (UTC) IronPort-SDR: jyO9R2xBGTbTc0tVy6gDiNk/2SjNdGBlfPOkT3J7A0f6Aa22/PoH5I7rMdTUuA+1dtbDCAU4yh Df4I7kXbmE/g== X-IronPort-AV: E=McAfee;i="6000,8403,9767"; a="229142587" X-IronPort-AV: E=Sophos;i="5.77,347,1596524400"; d="scan'208";a="229142587" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga007.jf.intel.com ([10.7.209.58]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Oct 2020 09:17:56 -0700 IronPort-SDR: CUMpXdIA3UTGS4+DgUrmaWY07FyOyD+7ZXW4OfnJvFYE08kXdRNjzIrwy7nMY3hmNn5xw+eL1D WS8XCL3cHEug== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,347,1596524400"; d="scan'208";a="354961403" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by orsmga007.jf.intel.com with ESMTP; 07 Oct 2020 09:17:56 -0700 Subject: [RFC][PATCH 7/9] mm/vmscan: Consider anonymous pages without swap To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,Dave Hansen ,kbusch@kernel.org,vishal.l.verma@intel.com,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com,david@redhat.com From: Dave Hansen Date: Wed, 07 Oct 2020 09:17:49 -0700 References: <20201007161736.ACC6E387@viggo.jf.intel.com> In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com> Message-Id: <20201007161749.4C56D1F1@viggo.jf.intel.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Keith Busch Age and reclaim anonymous pages if a migration path is available. 
The node has other recourse for inactive anonymous pages beyond swap. #Signed-off-by: Keith Busch Cc: Keith Busch [vishal: fixup the migration->demotion rename] Signed-off-by: Vishal Verma Signed-off-by: Dave Hansen Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams Cc: David Hildenbrand --- Changes from Dave 06/2020: * rename reclaim_anon_pages()->can_reclaim_anon_pages() Note: Keith's Intel SoB is commented out because he is no longer at Intel and his @intel.com mail will bounce --- b/include/linux/node.h | 9 +++++++++ b/mm/vmscan.c | 33 ++++++++++++++++++++++++++++----- 2 files changed, 37 insertions(+), 5 deletions(-) diff -puN include/linux/node.h~0009-mm-vmscan-Consider-anonymous-pages-without-swap include/linux/node.h --- a/include/linux/node.h~0009-mm-vmscan-Consider-anonymous-pages-without-swap 2020-10-07 09:15:33.390642436 -0700 +++ b/include/linux/node.h 2020-10-07 09:15:33.399642436 -0700 @@ -180,4 +180,13 @@ static inline void register_hugetlbfs_wi #define to_node(device) container_of(device, struct node, dev) +#ifdef CONFIG_MIGRATION +extern int next_demotion_node(int node); +#else +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} +#endif + #endif /* _LINUX_NODE_H_ */ diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c --- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap 2020-10-07 09:15:33.392642436 -0700 +++ b/mm/vmscan.c 2020-10-07 09:15:33.400642436 -0700 @@ -290,6 +290,26 @@ static bool writeback_throttling_sane(st } #endif +static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, + int node_id) +{ + /* Always age anon pages when we have swap */ + if (memcg == NULL) { + if (get_nr_swap_pages() > 0) + return true; + } else { + if (mem_cgroup_get_nr_swap_pages(memcg) > 0) + return true; + } + + /* Also age anon pages if we can auto-migrate them */ + if (next_demotion_node(node_id) >= 0) + return true; + + /* No way to reclaim anon pages */ + return false; +} + /* * This misses isolated pages which are not accounted for to save counters. * As the data only determines if reclaim or compaction continues, it is @@ -301,7 +321,7 @@ unsigned long zone_reclaimable_pages(str nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE); - if (get_nr_swap_pages() > 0) + if (can_reclaim_anon_pages(NULL, zone_to_nid(zone))) nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON); @@ -2337,6 +2357,7 @@ enum scan_balance { static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, unsigned long *nr) { + struct pglist_data *pgdat = lruvec_pgdat(lruvec); struct mem_cgroup *memcg = lruvec_memcg(lruvec); unsigned long anon_cost, file_cost, total_cost; int swappiness = mem_cgroup_swappiness(memcg); @@ -2347,7 +2368,7 @@ static void get_scan_count(struct lruvec enum lru_list lru; /* If we have no swap space, do not bother scanning anon pages. */ - if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) { + if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) { scan_balance = SCAN_FILE; goto out; } @@ -2631,7 +2652,9 @@ static void shrink_lruvec(struct lruvec * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio.
*/ - if (total_swap_pages && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) + if (can_reclaim_anon_pages(lruvec_memcg(lruvec), + lruvec_pgdat(lruvec)->node_id) && + inactive_is_low(lruvec, LRU_INACTIVE_ANON)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); } @@ -2701,7 +2724,7 @@ static inline bool should_continue_recla */ pages_for_compaction = compact_gap(sc->order); inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE); - if (get_nr_swap_pages() > 0) + if (can_reclaim_anon_pages(NULL, pgdat->node_id)) inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON); return inactive_lru_pages > pages_for_compaction; @@ -3460,7 +3483,7 @@ static void age_active_anon(struct pglis struct mem_cgroup *memcg; struct lruvec *lruvec; - if (!total_swap_pages) + if (!can_reclaim_anon_pages(NULL, pgdat->node_id)) return; lruvec = mem_cgroup_lruvec(NULL, pgdat); From patchwork Wed Oct 7 16:17:50 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11820975 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 80FFF1580 for ; Wed, 7 Oct 2020 16:18:07 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 3A9BD217BA for ; Wed, 7 Oct 2020 16:18:07 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3A9BD217BA Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id BFC2F6B0071; Wed, 7 Oct 2020 12:18:00 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id BD3976B0072; Wed, 7 Oct 2020 12:18:00 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A9F856B0073; Wed, 7 Oct 2020 12:18:00 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0075.hostedemail.com [216.40.44.75]) by kanga.kvack.org (Postfix) with ESMTP id 750E46B0071 for ; Wed, 7 Oct 2020 12:18:00 -0400 (EDT) Received: from smtpin04.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 0AD86181AE86F for ; Wed, 7 Oct 2020 16:18:00 +0000 (UTC) X-FDA: 77345635920.04.store31_110f00e271d0 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin04.hostedemail.com (Postfix) with ESMTP id D54B8800004D for ; Wed, 7 Oct 2020 16:17:59 +0000 (UTC) X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,,RULES_HIT:30004:30005:30054:30064,0,RBL:134.134.136.31:@linux.intel.com:.lbl8.mailshell.net-62.18.0.100 64.95.201.95;04yres3himbzmgpds8iob9zhc8gcooptrnq5wnhgagzbec3wjzjttia3zo6cq81.aajaz17bikepj7one7yk8acdk7okktbx5u14h5tg3cyidiw6fhjinn8g68ut9t5.1-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:25,LUA_SUMMARY:none X-HE-Tag: store31_110f00e271d0 X-Filterd-Recvd-Size: 5909 Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by imf19.hostedemail.com (Postfix) with ESMTP for ; Wed, 7 Oct 2020 16:17:58 +0000 (UTC) IronPort-SDR: 
aLcX8m6PyWpR9am6YUztU4nf+V4nDtZ4gI9czJDjqOkufMrg+jJbIfQBNfYBa2qt1NuAtngdeb QE25RhAJspvA== X-IronPort-AV: E=McAfee;i="6000,8403,9767"; a="226592680" X-IronPort-AV: E=Sophos;i="5.77,347,1596524400"; d="scan'208";a="226592680" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Oct 2020 09:17:57 -0700 IronPort-SDR: LZw/E6HmWncdT9wAseE8RuL19wYmYRoj0c/GHSXxCYNkh/Dv2rE5g2cD989oLIT8YookNH91wC 3eijvHzMCfxA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,347,1596524400"; d="scan'208";a="328125603" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by orsmga002.jf.intel.com with ESMTP; 07 Oct 2020 09:17:57 -0700 Subject: [RFC][PATCH 8/9] mm/vmscan: never demote for memcg reclaim To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,Dave Hansen ,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com,david@redhat.com From: Dave Hansen Date: Wed, 07 Oct 2020 09:17:50 -0700 References: <20201007161736.ACC6E387@viggo.jf.intel.com> In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com> Message-Id: <20201007161750.74CE9FA2@viggo.jf.intel.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Hansen Global reclaim aims to reduce the amount of memory used on a given node or set of nodes. Migrating pages to another node serves this purpose. memcg reclaim is different. Its goal is to reduce the total memory consumption of the entire memcg, across all nodes. Migration does not assist memcg reclaim because it just moves page contents between nodes rather than actually reducing memory consumption. Signed-off-by: Dave Hansen Suggested-by: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams Cc: David Hildenbrand --- b/mm/vmscan.c | 33 +++++++++++++++++++++++++-------- 1 file changed, 25 insertions(+), 8 deletions(-) diff -puN mm/vmscan.c~never-demote-for-memcg-reclaim mm/vmscan.c --- a/mm/vmscan.c~never-demote-for-memcg-reclaim 2020-10-07 09:15:34.546642433 -0700 +++ b/mm/vmscan.c 2020-10-07 09:15:34.554642433 -0700 @@ -291,8 +291,11 @@ static bool writeback_throttling_sane(st #endif static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, - int node_id) + int node_id, + struct scan_control *sc) { + bool in_cgroup_reclaim = false; + /* Always age anon pages when we have swap */ if (memcg == NULL) { if (get_nr_swap_pages() > 0) @@ -302,8 +305,18 @@ static inline bool can_reclaim_anon_page return true; } - /* Also age anon pages if we can auto-migrate them */ - if (next_demotion_node(node_id) >= 0) + /* Can only be in memcg reclaim in paths with valid 'sc': */ + if (sc && cgroup_reclaim(sc)) + in_cgroup_reclaim = true; + + /* + * Also age anon pages if we can auto-migrate them. + * + * Migrating a page does not reduce consumption of a + * memcg so should not be performed when in memcg + * reclaim.
+ */ + if (!in_cgroup_reclaim && (next_demotion_node(node_id) >= 0)) return true; /* No way to reclaim anon pages */ @@ -321,7 +334,7 @@ unsigned long zone_reclaimable_pages(str nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE); - if (can_reclaim_anon_pages(NULL, zone_to_nid(zone))) + if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL)) nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON); @@ -1064,6 +1077,10 @@ bool migrate_demote_page_ok(struct page VM_BUG_ON_PAGE(PageHuge(page), page); VM_BUG_ON_PAGE(PageLRU(page), page); + /* It is pointless to do demotion in memcg reclaim */ + if (cgroup_reclaim(sc)) + return false; + if (next_nid == NUMA_NO_NODE) return false; if (PageTransHuge(page) && !thp_migration_supported()) @@ -2368,7 +2385,7 @@ static void get_scan_count(struct lruvec enum lru_list lru; /* If we have no swap space, do not bother scanning anon pages. */ - if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) { + if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id, sc)) { scan_balance = SCAN_FILE; goto out; } @@ -2653,7 +2670,7 @@ static void shrink_lruvec(struct lruvec * rebalance the anon lru active/inactive ratio. */ if (can_reclaim_anon_pages(lruvec_memcg(lruvec), - lruvec_pgdat(lruvec)->node_id) && + lruvec_pgdat(lruvec)->node_id, sc) && inactive_is_low(lruvec, LRU_INACTIVE_ANON)) shrink_active_list(SWAP_CLUSTER_MAX, lruvec, sc, LRU_ACTIVE_ANON); @@ -2724,7 +2741,7 @@ static inline bool should_continue_recla */ pages_for_compaction = compact_gap(sc->order); inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE); - if (can_reclaim_anon_pages(NULL, pgdat->node_id)) + if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON); return inactive_lru_pages > pages_for_compaction; @@ -3483,7 +3500,7 @@ static void age_active_anon(struct pglis struct mem_cgroup *memcg; struct lruvec *lruvec; - if (!can_reclaim_anon_pages(NULL, pgdat->node_id)) + if (!can_reclaim_anon_pages(NULL, pgdat->node_id, sc)) return; lruvec = mem_cgroup_lruvec(NULL, pgdat); From patchwork Wed Oct 7 16:17:52 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dave Hansen X-Patchwork-Id: 11820977 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1D37A1580 for ; Wed, 7 Oct 2020 16:18:10 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id B89FD20789 for ; Wed, 7 Oct 2020 16:18:09 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org B89FD20789 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 768F86B0072; Wed, 7 Oct 2020 12:18:02 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 6F5446B0073; Wed, 7 Oct 2020 12:18:02 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5433F6B0074; Wed, 7 Oct 2020 12:18:02 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org 
Received: from forelay.hostedemail.com (smtprelay0177.hostedemail.com [216.40.44.177]) by kanga.kvack.org (Postfix) with ESMTP id 2518C6B0072 for ; Wed, 7 Oct 2020 12:18:02 -0400 (EDT) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id ACC96824999B for ; Wed, 7 Oct 2020 16:18:01 +0000 (UTC) X-FDA: 77345635962.29.deer39_1308c9e271d0 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin29.hostedemail.com (Postfix) with ESMTP id 65FDD180868C9 for ; Wed, 7 Oct 2020 16:18:01 +0000 (UTC) X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,dave.hansen@linux.intel.com,,RULES_HIT:30054:30064,0,RBL:134.134.136.20:@linux.intel.com:.lbl8.mailshell.net-62.18.0.100 64.95.201.95;04yrae5h5u76nhyzufsmxzfn4sjeiocmrja4918d4j8nktqxxxssx5r4ct685q6.axbr1usxfqxgxtfx8in37scmjwqtw99uidagmhcs655tbzpjjb3skm7sz43fkeq.g-lbl8.mailshell.net-223.238.255.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:none,Custom_rules:0:0:0,LFtime:25,LUA_SUMMARY:none X-HE-Tag: deer39_1308c9e271d0 X-Filterd-Recvd-Size: 6387 Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by imf20.hostedemail.com (Postfix) with ESMTP for ; Wed, 7 Oct 2020 16:18:00 +0000 (UTC) IronPort-SDR: Y4UyG+xrwViovNCFVM0hH+ZsKfxqjVeLWCNpGtirTrS7OX9nx8ISPpujkqgD0X/Slc5l7/mp1Q Jr1bBpfXjMUQ== X-IronPort-AV: E=McAfee;i="6000,8403,9767"; a="151940366" X-IronPort-AV: E=Sophos;i="5.77,347,1596524400"; d="scan'208";a="151940366" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Oct 2020 09:17:59 -0700 IronPort-SDR: r3PwdlDNxIUz73BRtbmu5/pVS2rWzzHDnfEMrs/BS9mM4kYe9ii7mpS5MDGKE/jSibjqouBUta /aQiwnpNLCug== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,347,1596524400"; d="scan'208";a="388424802" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by orsmga001.jf.intel.com with ESMTP; 07 Oct 2020 09:17:59 -0700 Subject: [RFC][PATCH 9/9] mm/migrate: new zone_reclaim_mode to enable reclaim migration To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org,Dave Hansen ,yang.shi@linux.alibaba.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com,david@redhat.com From: Dave Hansen Date: Wed, 07 Oct 2020 09:17:52 -0700 References: <20201007161736.ACC6E387@viggo.jf.intel.com> In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com> Message-Id: <20201007161752.11E81B0E@viggo.jf.intel.com> X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Dave Hansen Some method is obviously needed to enable reclaim-based migration. Just like traditional autonuma, there will be some workloads that will benefit like workloads with more "static" configurations where hot pages stay hot and cold pages stay cold. If pages come and go from the hot and cold sets, the benefits of this approach will be more limited. The benefits are truly workload-based and *not* hardware-based. We do not believe that there is a viable threshold where certain hardware configurations should have this mechanism enabled while others do not. To be conservative, earlier work defaulted to disable reclaim- based migration and did not include a mechanism to enable it. 
This proposes extending the existing "zone_reclaim_mode" (now really node_reclaim_mode) as a method to enable it. We are open to any alternative that allows end users to enable this mechanism or disable it if workload harm is detected (just like traditional autonuma). Signed-off-by: Dave Hansen Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams Cc: David Hildenbrand --- b/Documentation/admin-guide/sysctl/vm.rst | 9 +++++++++ b/include/linux/swap.h | 3 ++- b/include/uapi/linux/mempolicy.h | 1 + b/mm/vmscan.c | 6 ++++-- 4 files changed, 16 insertions(+), 3 deletions(-) diff -puN Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE Documentation/admin-guide/sysctl/vm.rst --- a/Documentation/admin-guide/sysctl/vm.rst~RECLAIM_MIGRATE 2020-10-07 09:15:35.622642430 -0700 +++ b/Documentation/admin-guide/sysctl/vm.rst 2020-10-07 09:15:35.640642430 -0700 @@ -969,6 +969,7 @@ This is value OR'ed together of 1 Zone reclaim on 2 Zone reclaim writes dirty pages out 4 Zone reclaim swaps pages +8 Zone reclaim migrates pages = =================================== zone_reclaim_mode is disabled by default. For file servers or workloads @@ -993,3 +994,11 @@ of other processes running on other node Allowing regular swap effectively restricts allocations to the local node unless explicitly overridden by memory policies or cpuset configurations. + +Page migration during reclaim is intended for systems with tiered memory +configurations. These systems have multiple types of memory with varied +performance characteristics instead of plain NUMA systems where the same +kind of memory is found at varied distances. Allowing page migration +during reclaim enables these systems to migrate pages from fast tiers to +slow tiers when the fast tier is under pressure. This migration is +performed before swap. diff -puN include/linux/swap.h~RECLAIM_MIGRATE include/linux/swap.h --- a/include/linux/swap.h~RECLAIM_MIGRATE 2020-10-07 09:15:35.624642430 -0700 +++ b/include/linux/swap.h 2020-10-07 09:15:35.640642430 -0700 @@ -385,7 +385,8 @@ extern int sysctl_min_slab_ratio; static inline bool node_reclaim_enabled(void) { /* Is any node_reclaim_mode bit set?
*/ - return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP); + return node_reclaim_mode & (RECLAIM_ZONE |RECLAIM_WRITE| + RECLAIM_UNMAP|RECLAIM_MIGRATE); } extern void check_move_unevictable_pages(struct pagevec *pvec); diff -puN include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE include/uapi/linux/mempolicy.h --- a/include/uapi/linux/mempolicy.h~RECLAIM_MIGRATE 2020-10-07 09:15:35.628642430 -0700 +++ b/include/uapi/linux/mempolicy.h 2020-10-07 09:15:35.640642430 -0700 @@ -69,5 +69,6 @@ enum { #define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */ #define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */ #define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */ +#define RECLAIM_MIGRATE (1<<3) /* Migrate to other nodes during reclaim */ #endif /* _UAPI_LINUX_MEMPOLICY_H */ diff -puN mm/vmscan.c~RECLAIM_MIGRATE mm/vmscan.c --- a/mm/vmscan.c~RECLAIM_MIGRATE 2020-10-07 09:15:35.630642430 -0700 +++ b/mm/vmscan.c 2020-10-07 09:15:35.641642430 -0700 @@ -1077,6 +1077,9 @@ bool migrate_demote_page_ok(struct page VM_BUG_ON_PAGE(PageHuge(page), page); VM_BUG_ON_PAGE(PageLRU(page), page); + if (!(node_reclaim_mode & RECLAIM_MIGRATE)) + return false; + /* It is pointless to do demotion in memcg reclaim */ if (cgroup_reclaim(sc)) return false; @@ -1086,8 +1089,7 @@ bool migrate_demote_page_ok(struct page if (PageTransHuge(page) && !thp_migration_supported()) return false; - // FIXME: actually enable this later in the series - return false; + return true; }
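For completeness, here is one way an administrator could flip the new mode bit from user space once the series is applied. This is only a sketch, not part of the series; it assumes the sysctl is exposed at the usual /proc/sys/vm/zone_reclaim_mode path and that no other reclaim mode bits need to be preserved (the value is OR'ed together, so read-modify-write if bits 1, 2 or 4 are already in use):

	#include <stdio.h>

	int main(void)
	{
		const char *path = "/proc/sys/vm/zone_reclaim_mode";
		FILE *f = fopen(path, "w");	/* needs root */

		if (!f) {
			perror(path);
			return 1;
		}
		/* RECLAIM_MIGRATE is bit 3 (value 8) in this series */
		fprintf(f, "%d\n", 1 << 3);
		return fclose(f) ? 1 : 0;
	}

The same effect can be had from a shell with "echo 8 > /proc/sys/vm/zone_reclaim_mode"; either way, migrate_demote_page_ok() then stops returning false unconditionally and reclaim-based demotion can proceed.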