From patchwork Tue Feb 18 08:26:29 2020
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 11387983
From: "Huang, Ying"
To: Peter Zijlstra, Ingo Molnar
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang,
    Huang Ying, Andrew Morton, Michal Hocko, Rik van Riel, Mel Gorman,
    Dave Hansen, Dan Williams
Subject: [RFC -V2 3/8] autonuma, memory tiering: Use kswapd to demote cold
    pages to PMEM
Date: Tue, 18 Feb 2020 16:26:29 +0800
Message-Id: <20200218082634.1596727-4-ying.huang@intel.com>
X-Mailer: git-send-email 2.24.1
In-Reply-To: <20200218082634.1596727-1-ying.huang@intel.com>
References: <20200218082634.1596727-1-ying.huang@intel.com>
From: Huang Ying

In a memory tiering system, if the memory size of the workloads is
smaller than that of the faster memory (e.g. DRAM) nodes, all pages of
the workloads can be put in the faster memory nodes, and there is no
need to use the slower memory (e.g. PMEM) at all.  So in the common
case, the memory size of the workload is larger than that of the
faster memory nodes.  To optimize performance, the hot pages should be
promoted to the faster memory nodes while the cold pages should be
demoted to the slower memory nodes.  To achieve that, we have two
choices,

a. Promote the hot pages from the slower memory node to the faster
   memory node.  This creates memory pressure in the faster memory
   node and thus triggers memory reclaim, during which the cold pages
   are demoted to the slower memory node.

b. Demote the cold pages from the faster memory node to the slower
   memory node.  This frees memory in the faster memory node, so that
   the hot pages in the slower memory node can be promoted to the
   faster memory node.

Choice "a" creates memory pressure in the faster memory node.  If the
memory pressure of the workload itself is high too, the combined
pressure may become so high that the memory allocation latency of the
workload is affected, e.g. direct reclaim may be triggered.  Choice
"b" works much better in this respect.  If the memory pressure of the
workload is high, the workload consumes the free memory and hot page
promotion stops earlier, because its allocation watermark is higher
than that of normal memory allocation.

This patch implements choice "b".  If the memory tiering NUMA
balancing mode is enabled, the node isn't the slowest node, and the
free memory size of the node is below the high watermark, the kswapd
of the node is woken up to free memory until the free memory size is
above the high watermark + the autonuma promotion rate limit.  If the
free memory size is below the high watermark, autonuma promotion stops
working.  This avoids creating too much memory pressure in the system.
(A standalone sketch of the threshold arithmetic follows the diff
below.)

Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Michal Hocko
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Dave Hansen
Cc: Dan Williams
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 mm/migrate.c | 26 +++++++++++++++++---------
 mm/vmscan.c  |  7 +++++++
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0b046759f99a..bbf16764d105 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -48,6 +48,7 @@
 #include
 #include
 #include
+#include

 #include

@@ -1946,8 +1947,7 @@ COMPAT_SYSCALL_DEFINE6(move_pages, pid_t, pid, compat_ulong_t, nr_pages,
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which crude
  */
-static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
-                                   unsigned long nr_migrate_pages)
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat, int order)
 {
         int z;

@@ -1958,12 +1958,9 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
                         continue;

                 /* Avoid waking kswapd by allocating pages_to_migrate pages.
                  */
-                if (!zone_watermark_ok(zone, 0,
-                                       high_wmark_pages(zone) +
-                                       nr_migrate_pages,
-                                       ZONE_MOVABLE, 0))
-                        continue;
-                return true;
+                if (zone_watermark_ok(zone, order, high_wmark_pages(zone),
+                                      ZONE_MOVABLE, 0))
+                        return true;
         }
         return false;
 }

@@ -1990,8 +1987,19 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
         VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);

         /* Avoid migrating to a node that is nearly full */
-        if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
+        if (!migrate_balanced_pgdat(pgdat, compound_order(page))) {
+                if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) {
+                        int z;
+
+                        for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+                                if (populated_zone(pgdat->node_zones + z))
+                                        break;
+                        }
+                        wakeup_kswapd(pgdat->node_zones + z,
+                                      0, compound_order(page), ZONE_MOVABLE);
+                }
                 return 0;
+        }

         if (isolate_lru_page(page))
                 return 0;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe90236045d5..b265868d62ef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 #include
 #include
+#include

 #include "internal.h"

@@ -3462,8 +3463,11 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
         int i;
         unsigned long mark = -1;
+        unsigned long promote_ratelimit;
         struct zone *zone;

+        promote_ratelimit = sysctl_numa_balancing_rate_limit <<
+                (20 - PAGE_SHIFT);
         /*
          * Check watermarks bottom-up as lower zones are more likely to
          * meet watermarks.
@@ -3475,6 +3479,9 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
                         continue;

                 mark = high_wmark_pages(zone);
+                if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+                    next_migration_node(pgdat->node_id) != -1)
+                        mark += promote_ratelimit;
                 if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
                         return true;
         }
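
Not part of the patch itself: below is a minimal userspace C sketch of
the threshold arithmetic used by the pgdat_balanced() change above,
assuming 4KB pages and treating the rate limit as a value in MB (the
"<< (20 - PAGE_SHIFT)" shift converts MB to pages).  The names
rate_limit_mb, free_pages and node_balanced_for_promotion are
illustrative stand-ins, not kernel identifiers.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* assumption: 4KB pages for this example */

/* Convert a rate limit in MB into pages, mirroring the
 * "sysctl_numa_balancing_rate_limit << (20 - PAGE_SHIFT)" shift. */
static unsigned long promote_ratelimit_pages(unsigned long rate_limit_mb)
{
        return rate_limit_mb << (20 - PAGE_SHIFT);
}

/* Models the raised watermark: the node counts as balanced only when
 * free memory exceeds the high watermark plus the promotion rate
 * limit, so kswapd keeps demoting cold pages until the next round of
 * promotions has headroom without triggering direct reclaim. */
static bool node_balanced_for_promotion(unsigned long free_pages,
                                        unsigned long high_wmark,
                                        unsigned long rate_limit_mb)
{
        return free_pages >= high_wmark +
                             promote_ratelimit_pages(rate_limit_mb);
}

int main(void)
{
        /* Example: 64MB high watermark, 100MB promotion rate limit. */
        unsigned long high_wmark = 64UL << (20 - PAGE_SHIFT);
        unsigned long rate_limit_mb = 100;

        printf("200MB free -> balanced: %d\n",
               node_balanced_for_promotion(200UL << (20 - PAGE_SHIFT),
                                           high_wmark, rate_limit_mb));
        printf("100MB free -> balanced: %d\n",
               node_balanced_for_promotion(100UL << (20 - PAGE_SHIFT),
                                           high_wmark, rate_limit_mb));
        return 0;
}

Compiled and run, this prints 1 for the 200MB case and 0 for the 100MB
case: with 100MB free the node is below high watermark + rate limit
(164MB here), so kswapd would continue demoting cold pages.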