From patchwork Wed Dec 26 13:14:47 2018
From: Fengguang Wu
To: Andrew Morton
Cc: Linux Memory Management List, LKML, kvm@vger.kernel.org
Date: Wed, 26 Dec 2018 21:14:47 +0800
Subject: [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM
Message-Id: <20181226133351.106676005@intel.com>
X-Patchwork-Id: 10743147

From: Fan Du

This is a hack to enumerate PMEM as NUMA nodes. It's necessary for
current BIOSes that don't yet fill the ACPI HMAT table.

WARNING: take care to back up your data. This hack is mutually
exclusive with the libnvdimm subsystem and can destroy ndctl-managed
namespaces.

Signed-off-by: Fan Du
Signed-off-by: Fengguang Wu
---
 arch/x86/kernel/e820.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux.orig/arch/x86/kernel/e820.c	2018-12-23 19:20:34.587078783 +0800
+++ linux/arch/x86/kernel/e820.c	2018-12-23 19:20:34.587078783 +0800
@@ -403,7 +403,8 @@ static int __init __append_e820_table(st
 		/* Ignore the entry on 64-bit overflow: */
 		if (start > end && likely(size))
 			return -1;
-
+		if (type == E820_TYPE_PMEM)
+			type = E820_TYPE_RAM;
 		e820__range_add(start, size, type);
 		entry++;
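
For readers without the tree at hand, here is a minimal user-space
model of the override above. It is only a sketch: the type values
mirror the kernel's enum e820_type (RAM = 1, PMEM = 7), but the entry
layout is simplified and the range values are invented.

#include <stdio.h>

enum e820_type { E820_TYPE_RAM = 1, E820_TYPE_PMEM = 7 };

struct e820_entry {
	unsigned long long start, size;
	enum e820_type type;
};

int main(void)
{
	/* a 16GB PMEM range starting at 4GB, as firmware might report it */
	struct e820_entry e = { 0x100000000ULL, 0x400000000ULL, E820_TYPE_PMEM };

	if (e.type == E820_TYPE_PMEM)	/* the hack from the hunk above */
		e.type = E820_TYPE_RAM;

	printf("range %#llx-%#llx now treated as type %d (RAM)\n",
	       e.start, e.start + e.size - 1, e.type);
	return 0;
}
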
From patchwork Wed Dec 26 13:14:48 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:48 +0800
Subject: [RFC][PATCH v2 02/21] acpi/numa: memorize NUMA node type from SRAT table
Message-Id: <20181226133351.164047705@intel.com>
X-Patchwork-Id: 10743065

From: Fan Du

Mark each NUMA node as either DRAM or PMEM. This can happen at boot
time (see the e820 PMEM type override patch), or on the fly when a
devdax device is bound to the kmem driver.
It depends on the BIOS supplying the PMEM NUMA proximity in the SRAT
table, which current production BIOSes do.

Signed-off-by: Fan Du
Signed-off-by: Fengguang Wu
---
 arch/x86/include/asm/numa.h | 2 ++
 arch/x86/mm/numa.c          | 2 ++
 drivers/acpi/numa.c         | 5 +++++
 3 files changed, 9 insertions(+)

--- linux.orig/arch/x86/include/asm/numa.h	2018-12-23 19:20:39.890947888 +0800
+++ linux/arch/x86/include/asm/numa.h	2018-12-23 19:20:39.890947888 +0800
@@ -30,6 +30,8 @@ extern int numa_off;
  */
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_nodes_pmem;
+extern nodemask_t numa_nodes_dram;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);

--- linux.orig/arch/x86/mm/numa.c	2018-12-23 19:20:39.890947888 +0800
+++ linux/arch/x86/mm/numa.c	2018-12-23 19:20:39.890947888 +0800
@@ -20,6 +20,8 @@
 int numa_off;
 
 nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_nodes_pmem;
+nodemask_t numa_nodes_dram;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);

--- linux.orig/drivers/acpi/numa.c	2018-12-23 19:20:39.890947888 +0800
+++ linux/drivers/acpi/numa.c	2018-12-23 19:20:39.890947888 +0800
@@ -297,6 +297,11 @@ acpi_numa_memory_affinity_init(struct ac
 	node_set(node, numa_nodes_parsed);
 
+	if (ma->flags & ACPI_SRAT_MEM_NON_VOLATILE)
+		node_set(node, numa_nodes_pmem);
+	else
+		node_set(node, numa_nodes_dram);
+
 	pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
 		node, pxm,
 		(unsigned long long) start, (unsigned long long) end - 1,
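
The classification rule is small enough to model stand-alone. This
sketch uses ACPICA's ACPI_SRAT_MEM_NON_VOLATILE value (bit 2) and
plain bitmasks in place of nodemask_t; the node/flag pairs are an
invented example of what SRAT parsing might deliver.

#include <stdio.h>

#define ACPI_SRAT_MEM_NON_VOLATILE	(1 << 2)

int main(void)
{
	struct { int node; unsigned int flags; } ma[] = {
		{ 0, 0 },
		{ 1, 0 },
		{ 2, ACPI_SRAT_MEM_NON_VOLATILE },
		{ 3, ACPI_SRAT_MEM_NON_VOLATILE },
	};
	unsigned long numa_nodes_pmem = 0, numa_nodes_dram = 0;

	for (int i = 0; i < 4; i++) {
		if (ma[i].flags & ACPI_SRAT_MEM_NON_VOLATILE)
			numa_nodes_pmem |= 1UL << ma[i].node;	/* node_set() */
		else
			numa_nodes_dram |= 1UL << ma[i].node;
	}
	printf("pmem mask %#lx, dram mask %#lx\n",
	       numa_nodes_pmem, numa_nodes_dram);
	return 0;
}
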
From patchwork Wed Dec 26 13:14:49 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:49 +0800
Subject: [RFC][PATCH v2 03/21] x86/numa_emulation: fix fake NUMA in uniform case
Message-Id: <20181226133351.229014333@intel.com>
X-Patchwork-Id: 10743149

From: Fan Du

The index into numa_meminfo.blk[] is expected to be the same as the
entry's node id, but numa_remove_memblk_from() breaks that
expectation.

A 2S system does not break, because:

  before numa_remove_memblk_from()
    index nid
        0   0
        1   1

  after numa_remove_memblk_from()
    index nid
        0   1
        1   1

But if you try to configure uniform fake nodes on a 4S system:

    index nid
        0   0
        1   1
        2   2
        3   3

node 3 will be removed by numa_remove_memblk_from() when iterating
index 2, so fake nodes are created for only 3 of the 4 physical
nodes, and a portion of memory is wasted, showing up as lost pages in
the numa_meminfo_cover_memory() check.

Signed-off-by: Fan Du
---
 arch/x86/mm/numa_emulation.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- linux.orig/arch/x86/mm/numa_emulation.c	2018-12-23 19:20:51.570664269 +0800
+++ linux/arch/x86/mm/numa_emulation.c	2018-12-23 19:20:51.566664364 +0800
@@ -381,7 +381,21 @@ void __init numa_emulation(struct numa_m
 		goto no_emu;
 
 	memset(&ei, 0, sizeof(ei));
-	pi = *numa_meminfo;
+
+	{
+		/* Make sure the index is identical with nid */
+		struct numa_meminfo *mi = numa_meminfo;
+		int nid;
+
+		for (i = 0; i < mi->nr_blks; i++) {
+			nid = mi->blk[i].nid;
+			pi.blk[nid].nid = nid;
+			pi.blk[nid].start = mi->blk[i].start;
+			pi.blk[nid].end = mi->blk[i].end;
+		}
+		pi.nr_blks = mi->nr_blks;
+	}
 
 	for (i = 0; i < MAX_NUMNODES; i++)
 		emu_nid_to_phys[i] = NUMA_NO_NODE;
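
The re-indexing loop can be exercised in isolation. In this sketch
the diverged index/nid layout is an invented instance of the broken
state; copying each block into the slot keyed by its nid restores the
index == nid invariant the rest of numa_emulation() relies on.

#include <stdio.h>
#include <string.h>

struct numa_memblk { int nid; unsigned long start, end; };
struct numa_meminfo { int nr_blks; struct numa_memblk blk[8]; };

int main(void)
{
	/* broken state: index 0 holds nid 1, index 1 holds nid 2, ... */
	struct numa_meminfo mi = { 3, {
		{ 1, 0, 1 }, { 2, 1, 2 }, { 0, 2, 3 },
	} };
	struct numa_meminfo pi;

	memset(&pi, 0, sizeof(pi));
	for (int i = 0; i < mi.nr_blks; i++) {
		int nid = mi.blk[i].nid;

		pi.blk[nid] = mi.blk[i];	/* slot keyed by nid */
	}
	pi.nr_blks = mi.nr_blks;

	for (int i = 0; i < pi.nr_blks; i++)
		printf("index %d -> nid %d\n", i, pi.blk[i].nid);
	return 0;
}
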
From patchwork Wed Dec 26 13:14:50 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:50 +0800
Subject: [RFC][PATCH v2 04/21] x86/numa_emulation: pass numa node type to fake nodes
Message-Id: <20181226133351.287359389@intel.com>
X-Patchwork-Id: 10743135

From: Fan Du

Carry the DRAM/PMEM node type over from the physical nodes to the
fake nodes created by NUMA emulation.

Signed-off-by: Fan Du
---
 arch/x86/mm/numa_emulation.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

--- linux.orig/arch/x86/mm/numa_emulation.c	2018-12-23 19:21:11.002206144 +0800
+++ linux/arch/x86/mm/numa_emulation.c	2018-12-23 19:21:10.998206236 +0800
@@ -12,6 +12,8 @@
 static int emu_nid_to_phys[MAX_NUMNODES];
 static char *emu_cmdline __initdata;
+static nodemask_t emu_numa_nodes_pmem;
+static nodemask_t emu_numa_nodes_dram;
 
 void __init numa_emu_cmdline(char *str)
 {
@@ -311,6 +313,12 @@ static int __init split_nodes_size_inter
 					       min(end, limit) - start);
 			if (ret < 0)
 				return ret;
+
+			/* Update numa node type for fake numa node */
+			if (node_isset(i, emu_numa_nodes_pmem))
+				node_set(nid - 1, numa_nodes_pmem);
+			else
+				node_set(nid - 1, numa_nodes_dram);
 		}
 	}
 	return nid;
@@ -410,6 +418,12 @@ void __init numa_emulation(struct numa_m
 		unsigned long n;
 		int nid = 0;
 
+		emu_numa_nodes_pmem = numa_nodes_pmem;
+		emu_numa_nodes_dram = numa_nodes_dram;
+
+		nodes_clear(numa_nodes_pmem);
+		nodes_clear(numa_nodes_dram);
+
 		n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
 		ret = -1;
 		for_each_node_mask(i, physnode_mask) {
From patchwork Wed Dec 26 13:14:51 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:51 +0800
Subject: [RFC][PATCH v2 05/21] mmzone: new pgdat flags for DRAM and PMEM
Message-Id: <20181226133351.348801665@intel.com>
X-Patchwork-Id: 10743143

From: Fan Du

On a system with both DRAM and PMEM, we need a new flag to tag
whether a pgdat is made of DRAM or persistent memory.

This patch serves as preparation for the follow-up patches.
Signed-off-by: Fan Du
Signed-off-by: Fengguang Wu
---
 include/linux/mmzone.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- linux.orig/include/linux/mmzone.h	2018-12-23 19:29:42.430602202 +0800
+++ linux/include/linux/mmzone.h	2018-12-23 19:29:42.430602202 +0800
@@ -522,6 +522,8 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_DRAM,			/* Volatile DRAM memory node */
+	PGDAT_PMEM,			/* Persistent memory node */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -919,6 +921,30 @@ extern struct pglist_data contig_page_da
 
 #endif /* !CONFIG_NEED_MULTIPLE_NODES */
 
+static inline int is_node_pmem(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	return test_bit(PGDAT_PMEM, &pgdat->flags);
+}
+
+static inline int is_node_dram(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	return test_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
+static inline void set_node_type(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	if (node_isset(nid, numa_nodes_pmem))
+		set_bit(PGDAT_PMEM, &pgdat->flags);
+	else
+		set_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
 extern struct pglist_data *first_online_pgdat(void);
 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
 extern struct zone *next_zone(struct zone *zone);
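
As a quick illustration of the two new bits, here is a stand-alone
model of set_node_type() and is_node_pmem(). The bit positions and
the pmem nodemask are made up for the example, and NODE_DATA() is
reduced to a plain array.

#include <stdio.h>

enum { PGDAT_DRAM = 4, PGDAT_PMEM = 5 };	/* illustrative positions */

struct pglist_data { unsigned long flags; };

static struct pglist_data node_data[4];			/* NODE_DATA() stand-in */
static const unsigned long numa_nodes_pmem = 0xc;	/* nodes 2,3 are PMEM */

static void set_node_type(int nid)
{
	if (numa_nodes_pmem & (1UL << nid))
		node_data[nid].flags |= 1UL << PGDAT_PMEM;
	else
		node_data[nid].flags |= 1UL << PGDAT_DRAM;
}

static int is_node_pmem(int nid)
{
	return !!(node_data[nid].flags & (1UL << PGDAT_PMEM));
}

int main(void)
{
	for (int nid = 0; nid < 4; nid++) {
		set_node_type(nid);
		printf("node%d: %s\n", nid, is_node_pmem(nid) ? "pmem" : "dram");
	}
	return 0;
}
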
From patchwork Wed Dec 26 13:14:52 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:52 +0800
Subject: [RFC][PATCH v2 06/21] x86,numa: update numa node type
Message-Id: <20181226133351.410639437@intel.com>
X-Patchwork-Id: 10743125

From: Fan Du

Tag each node's pgdat as DRAM or PMEM when its node data is allocated
during NUMA init.

Signed-off-by: Fan Du
Signed-off-by: Fengguang Wu
---
 arch/x86/mm/numa.c | 1 +
 1 file changed, 1 insertion(+)

--- linux.orig/arch/x86/mm/numa.c	2018-12-23 19:38:17.363582512 +0800
+++ linux/arch/x86/mm/numa.c	2018-12-23 19:38:17.363582512 +0800
@@ -594,6 +594,7 @@ static int __init numa_register_memblks(
 			continue;
 
 		alloc_node_data(nid);
+		set_node_type(nid);
 	}
 
 	/* Dump memblock with node info and return. */

From patchwork Wed Dec 26 13:14:53 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:53 +0800
Subject: [RFC][PATCH v2 07/21] mm: export node type {pmem|dram} under /sys/bus/node
Message-Id: <20181226133351.463947436@intel.com>
X-Patchwork-Id: 10743151
From: Fan Du

A user-space migration daemon can check
/sys/bus/node/devices/nodeX/type for the node type. Software can
combine node type and node distance to pick a desirable target node
for migration.

  grep -r . /sys/devices/system/node/*/type
  /sys/devices/system/node/node0/type:dram
  /sys/devices/system/node/node1/type:dram
  /sys/devices/system/node/node2/type:pmem
  /sys/devices/system/node/node3/type:pmem

Together with the next patch, which exports `peer_node`, a migration
daemon can easily find the memory type of the current node and the
target node in case of migration:

  grep -r . /sys/devices/system/node/*/peer_node
  /sys/devices/system/node/node0/peer_node:2
  /sys/devices/system/node/node1/peer_node:3
  /sys/devices/system/node/node2/peer_node:0
  /sys/devices/system/node/node3/peer_node:1

Signed-off-by: Fan Du
Signed-off-by: Fengguang Wu
---
 drivers/base/node.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

--- linux.orig/drivers/base/node.c	2018-12-23 19:39:04.763414931 +0800
+++ linux/drivers/base/node.c	2018-12-23 19:39:04.763414931 +0800
@@ -233,6 +233,15 @@ static ssize_t node_read_distance(struct
 }
 static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
+static ssize_t type_show(struct device *dev,
+			 struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+
+	return sprintf(buf, is_node_pmem(nid) ? "pmem\n" : "dram\n");
+}
+static DEVICE_ATTR(type, S_IRUGO, type_show, NULL);
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_cpumap.attr,
 	&dev_attr_cpulist.attr,
@@ -240,6 +249,7 @@ static struct attribute *node_dev_attrs[
 	&dev_attr_numastat.attr,
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
+	&dev_attr_type.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(node_dev);
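
A user-space consumer might read the new attribute as below. This is
a hypothetical helper, not part of the patch: the sysfs path comes
from the commit message, error handling is minimal, and nodes without
the attribute (e.g. on older kernels) report "unknown".

#include <stdio.h>
#include <string.h>

static int node_is_pmem(int nid)
{
	char path[64], type[8] = "";
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/type", nid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(type, sizeof(type), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return !strncmp(type, "pmem", 4);
}

int main(void)
{
	for (int nid = 0; nid < 4; nid++) {
		int r = node_is_pmem(nid);

		printf("node%d: %s\n", nid,
		       r < 0 ? "unknown" : r ? "pmem" : "dram");
	}
	return 0;
}
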
From patchwork Wed Dec 26 13:14:54 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:54 +0800
Subject: [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node
Message-Id: <20181226133351.521151384@intel.com>
X-Patchwork-Id: 10743137

From: Fan Du

Each CPU socket can have one DRAM and one PMEM node; we call them
"peer nodes". Migration between DRAM and PMEM will by default happen
between peer nodes.

This is a temporary solution: with multiple memory layers, a node can
have both promotion and demotion targets instead of a single peer
node. User space may also be able to infer promotion/demotion targets
based on future HMAT info.

Signed-off-by: Fan Du
Signed-off-by: Fengguang Wu
---
 drivers/base/node.c    | 11 +++++++++++
 include/linux/mmzone.h | 12 ++++++++++++
 mm/page_alloc.c        | 29 +++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

--- linux.orig/drivers/base/node.c	2018-12-23 19:39:51.647261099 +0800
+++ linux/drivers/base/node.c	2018-12-23 19:39:51.643261112 +0800
@@ -242,6 +242,16 @@ static ssize_t type_show(struct device *
 }
 static DEVICE_ATTR(type, S_IRUGO, type_show, NULL);
 
+static ssize_t peer_node_show(struct device *dev,
+			      struct device_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+	return sprintf(buf, "%d\n", pgdat->peer_node);
+}
+static DEVICE_ATTR(peer_node, S_IRUGO, peer_node_show, NULL);
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_cpumap.attr,
 	&dev_attr_cpulist.attr,
@@ -250,6 +260,7 @@ static struct attribute *node_dev_attrs[
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
 	&dev_attr_type.attr,
+	&dev_attr_peer_node.attr,
 	NULL
 };
 ATTRIBUTE_GROUPS(node_dev);

--- linux.orig/include/linux/mmzone.h	2018-12-23 19:39:51.647261099 +0800
+++ linux/include/linux/mmzone.h	2018-12-23 19:39:51.643261112 +0800
@@ -713,6 +713,18 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+
+	/*
+	 * Points to the nearest node in terms of latency
+	 * E.g. peer of node 0 is node 2 per SLIT
+	 * node distances:
+	 * node   0   1   2   3
+	 *    0:  10  21  17  28
+	 *    1:  21  10  28  17
+	 *    2:  17  28  10  28
+	 *    3:  28  17  28  10
+	 */
+	int peer_node;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
--- linux.orig/mm/page_alloc.c	2018-12-23 19:39:51.647261099 +0800
+++ linux/mm/page_alloc.c	2018-12-23 19:39:51.643261112 +0800
@@ -6926,6 +6926,34 @@ static void check_for_memory(pg_data_t *
 	}
 }
 
+/*
+ * Return the nearest peer node in terms of *locality*
+ * E.g. peer of node 0 is node 2 per SLIT
+ * node distances:
+ * node   0   1   2   3
+ *    0:  10  21  17  28
+ *    1:  21  10  28  17
+ *    2:  17  28  10  28
+ *    3:  28  17  28  10
+ */
+static int find_best_peer_node(int nid)
+{
+	int n, val;
+	int min_val = INT_MAX;
+	int peer = NUMA_NO_NODE;
+
+	for_each_online_node(n) {
+		if (n == nid)
+			continue;
+		val = node_distance(nid, n);
+		if (val < min_val) {
+			min_val = val;
+			peer = n;
+		}
+	}
+	return peer;
+}
+
 /**
  * free_area_init_nodes - Initialise all pg_data_t and zone data
  * @max_zone_pfn: an array of max PFNs for each zone
@@ -7012,6 +7040,7 @@ void __init free_area_init_nodes(unsigne
 		if (pgdat->node_present_pages)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
+		pgdat->peer_node = find_best_peer_node(nid);
 	}
 }
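
The selection logic is easy to check stand-alone. With the SLIT
matrix from the comment above, this sketch (nodemask iteration
replaced by a plain loop) prints exactly the peer_node values shown
in patch 07: node 0 -> 2, 1 -> 3, 2 -> 0, 3 -> 1.

#include <limits.h>
#include <stdio.h>

#define NR_NODES 4

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

static int find_best_peer_node(int nid)
{
	int min_val = INT_MAX, peer = -1;

	for (int n = 0; n < NR_NODES; n++) {
		if (n == nid)
			continue;
		if (dist[nid][n] < min_val) {
			min_val = dist[nid][n];
			peer = n;
		}
	}
	return peer;
}

int main(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d -> peer %d\n", nid, find_best_peer_node(nid));
	return 0;
}
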
From patchwork Wed Dec 26 13:14:55 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:55 +0800
Subject: [RFC][PATCH v2 09/21] mm: avoid duplicate peer target node
Message-Id: <20181226133351.579378360@intel.com>
X-Patchwork-Id: 10743129

To ensure a 1:1 peer node mapping on broken BIOS node distances:

  node   0   1   2   3
    0:  10  21  20  20
    1:  21  10  20  20
    2:  20  20  10  20
    3:  20  20  20  10

or with numa=fake=4U node distances:

  node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    0:  10  10  10  10  21  21  21  21  17  17  17  17  28  28  28  28
    1:  10  10  10  10  21  21  21  21  17  17  17  17  28  28  28  28
    2:  10  10  10  10  21  21  21  21  17  17  17  17  28  28  28  28
    3:  10  10  10  10  21  21  21  21  17  17  17  17  28  28  28  28
    4:  21  21  21  21  10  10  10  10  28  28  28  28  17  17  17  17
    5:  21  21  21  21  10  10  10  10  28  28  28  28  17  17  17  17
    6:  21  21  21  21  10  10  10  10  28  28  28  28  17  17  17  17
    7:  21  21  21  21  10  10  10  10  28  28  28  28  17  17  17  17
    8:  17  17  17  17  28  28  28  28  10  10  10  10  28  28  28  28
    9:  17  17  17  17  28  28  28  28  10  10  10  10  28  28  28  28
   10:  17  17  17  17  28  28  28  28  10  10  10  10  28  28  28  28
   11:  17  17  17  17  28  28  28  28  10  10  10  10  28  28  28  28
   12:  28  28  28  28  17  17  17  17  28  28  28  28  10  10  10  10
   13:  28  28  28  28  17  17  17  17  28  28  28  28  10  10  10  10
   14:  28  28  28  28  17  17  17  17  28  28  28  28  10  10  10  10
   15:  28  28  28  28  17  17  17  17  28  28  28  28  10  10  10  10

never let two nodes pick the same peer target node.

Signed-off-by: Fengguang Wu
---
 mm/page_alloc.c | 6 ++++++
 1 file changed, 6 insertions(+)

--- linux.orig/mm/page_alloc.c	2018-12-23 19:48:27.366110325 +0800
+++ linux/mm/page_alloc.c	2018-12-23 19:48:27.362110332 +0800
@@ -6941,16 +6941,22 @@ static int find_best_peer_node(int nid)
 	int n, val;
 	int min_val = INT_MAX;
 	int peer = NUMA_NO_NODE;
+	static nodemask_t target_nodes = NODE_MASK_NONE;
 
 	for_each_online_node(n) {
 		if (n == nid)
 			continue;
 		val = node_distance(nid, n);
+		if (val == LOCAL_DISTANCE)
+			continue;
+		if (node_isset(n, target_nodes))
+			continue;
 		if (val < min_val) {
 			min_val = val;
 			peer = n;
 		}
 	}
+	node_set(peer, target_nodes);
 	return peer;
 }
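
Extending the previous sketch with this patch's target_nodes mask
shows the effect on the broken 4-node matrix above. Without the mask,
nodes 0 and 1 would both pick node 2, and nodes 2 and 3 would both
pick node 0; with it, the mapping becomes 1:1 (0<->2, 1<->3). The
static mask persists across calls, mirroring the boot-time use; the
peer >= 0 guard is this sketch's own safety check.

#include <limits.h>
#include <stdio.h>

#define NR_NODES	4
#define LOCAL_DISTANCE	10

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 21, 20, 20 },
	{ 21, 10, 20, 20 },
	{ 20, 20, 10, 20 },
	{ 20, 20, 20, 10 },
};

static int find_best_peer_node(int nid)
{
	static unsigned long target_nodes;	/* persists across calls */
	int min_val = INT_MAX, peer = -1;

	for (int n = 0; n < NR_NODES; n++) {
		if (n == nid || dist[nid][n] == LOCAL_DISTANCE)
			continue;
		if (target_nodes & (1UL << n))
			continue;		/* already someone's peer */
		if (dist[nid][n] < min_val) {
			min_val = dist[nid][n];
			peer = n;
		}
	}
	if (peer >= 0)
		target_nodes |= 1UL << peer;
	return peer;
}

int main(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d -> peer %d\n", nid, find_best_peer_node(nid));
	return 0;
}
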
From patchwork Wed Dec 26 13:14:56 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:56 +0800
Subject: [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node
Message-Id: <20181226133351.644607371@intel.com>
X-Patchwork-Id: 10743093

From: Fan Du

When allocating pages, DRAM and PMEM nodes had better not fall back
to each other. This allows the migration code to explicitly control
which type of node to allocate pages from.

With this patch, a PMEM NUMA node can only be used in 2 ways:
- migrate in and out
- numactl

That guarantees a PMEM NUMA node will only hold anon pages. We don't
detect hotness for other types of pages for now, so we need to
prevent PMEM pages from going hot while we are unable to detect/move
them to DRAM.

Another implication is that new page allocations will by default go
to DRAM nodes. That is normally a good choice -- since DRAM writes
are cheaper than PMEM writes, it's often beneficial to watch new
pages in DRAM for some time and only move the likely-cold pages to
PMEM.

However there can be exceptions. For example, if the PMEM:DRAM ratio
is very high, some page allocations may better go to PMEM nodes
directly. In the long term, we may create more kinds of fallback
zonelists and make them configurable by NUMA policy.

Signed-off-by: Fan Du
Signed-off-by: Fengguang Wu
---
 mm/mempolicy.c  | 14 ++++++++++++++
 mm/page_alloc.c | 42 +++++++++++++++++++++++++++++-------------
 2 files changed, 43 insertions(+), 13 deletions(-)

--- linux.orig/mm/mempolicy.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/mempolicy.c	2018-12-26 20:29:24.597884301 +0800
@@ -1745,6 +1745,20 @@ static int policy_node(gfp_t gfp, struct
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	if (policy->mode == MPOL_BIND) {
+		nodemask_t nodes = policy->v.nodes;
+
+		/*
+		 * The rule is if we run on a DRAM node and mbind to a PMEM
+		 * node, the preferred node id is the peer node, vice versa.
+		 * If we run on a DRAM node and mbind to a DRAM node, the #PF
+		 * node is the preferred node, vice versa, so just fall back.
+		 */
+		if ((is_node_dram(nd) && nodes_subset(nodes, numa_nodes_pmem)) ||
+		    (is_node_pmem(nd) && nodes_subset(nodes, numa_nodes_dram)))
+			nd = NODE_DATA(nd)->peer_node;
+	}
+
 	return nd;
 }
--- linux.orig/mm/page_alloc.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/page_alloc.c	2018-12-26 20:03:49.817417321 +0800
@@ -5153,6 +5153,10 @@ static int find_next_best_node(int node,
 		if (node_isset(n, *used_node_mask))
 			continue;
 
+		/* DRAM node doesn't fallback to pmem node */
+		if (is_node_pmem(n))
+			continue;
+
 		/* Use the distance array to find the distance */
 		val = node_distance(node, n);
 
@@ -5242,19 +5246,31 @@ static void build_zonelists(pg_data_t *p
 	nodes_clear(used_mask);
 	memset(node_order, 0, sizeof(node_order));
 
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-		if (node_distance(local_node, node) !=
-		    node_distance(local_node, prev_node))
-			node_load[node] = load;
-
-		node_order[nr_nodes++] = node;
-		prev_node = node;
-		load--;
+	/* Pmem node doesn't fallback to DRAM node */
+	if (is_node_pmem(local_node)) {
+		int n;
+
+		/* Pmem nodes should fallback to each other */
+		node_order[nr_nodes++] = local_node;
+		for_each_node_state(n, N_MEMORY) {
+			if ((n != local_node) && is_node_pmem(n))
+				node_order[nr_nodes++] = n;
+		}
+	} else {
+		while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+			/*
+			 * We don't want to pressure a particular node.
+			 * So adding penalty to the first node in same
+			 * distance group to make it round-robin.
+			 */
+			if (node_distance(local_node, node) !=
+			    node_distance(local_node, prev_node))
+				node_load[node] = load;
+
+			node_order[nr_nodes++] = node;
+			prev_node = node;
+			load--;
+		}
 	}
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
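
To see what the new fallback orders look like, here is a small model,
assuming nodes 0-1 are DRAM and 2-3 are PMEM. One simplification: the
real DRAM branch orders nodes by distance through
find_next_best_node() (which now skips PMEM nodes), while this sketch
just walks same-type nodes in ascending id order.

#include <stdio.h>

#define NR_NODES 4

static const int node_is_pmem[NR_NODES] = { 0, 0, 1, 1 };

static int build_order(int local, int order[])
{
	int nr = 0;

	order[nr++] = local;		/* local node always first */
	for (int n = 0; n < NR_NODES; n++)
		if (n != local && node_is_pmem[n] == node_is_pmem[local])
			order[nr++] = n;	/* same-type nodes only */
	return nr;
}

int main(void)
{
	for (int local = 0; local < NR_NODES; local++) {
		int order[NR_NODES];
		int nr = build_order(local, order);

		printf("node %d zonelist order:", local);
		for (int i = 0; i < nr; i++)
			printf(" %d", order[i]);
		printf("\n");
	}
	return 0;
}
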
From patchwork Wed Dec 26 13:14:57 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:57 +0800
Subject: [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM
Message-Id: <20181226133351.703380444@intel.com>
X-Patchwork-Id: 10743141

From: Yao Yuan

Allocate KVM MMU page table pages on the local node; with the
preceding zonelist patch this keeps them in DRAM.

Signed-off-by: Yao Yuan
Signed-off-by: Fengguang Wu
---
 arch/x86/kvm/mmu.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
+++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
@@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
 		kmem_cache_free(cache, mc->objects[--mc->nobjs]);
 }
 
+static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
+{
+	struct page *page;
+
+	page = __alloc_pages(gfp_mask, 0, numa_node_id());
+	if (!page)
+		return 0;
+	return (unsigned long) page_address(page);
+}
+
 static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
 				       int min)
 {
@@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
+		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
 		if (!page)
 			return cache->nobjs >= min ? 0 : -ENOMEM;
 		cache->objects[cache->nobjs++] = page;
From patchwork Wed Dec 26 13:14:58 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:58 +0800
Subject: [RFC][PATCH v2 12/21] x86/pgtable: allocate page table pages from DRAM
Message-Id: <20181226133351.770245668@intel.com>
X-Patchwork-Id: 10743145

On random reads/writes over large data sets, we find that nearly half
of the memory accesses are caused by TLB misses and hence hit the
page table pages. So it's better to keep page table pages in the
faster DRAM nodes.
Signed-off-by: Fengguang Wu
---
 arch/x86/include/asm/pgalloc.h | 10 +++++++---
 arch/x86/mm/pgtable.c          | 22 ++++++++++++++++++----
 2 files changed, 25 insertions(+), 7 deletions(-)

--- linux.orig/arch/x86/mm/pgtable.c	2018-12-26 19:41:57.494900885 +0800
+++ linux/arch/x86/mm/pgtable.c	2018-12-26 19:42:35.531621035 +0800
@@ -22,17 +22,30 @@ EXPORT_SYMBOL(physical_mask);
 #endif
 
 gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;
+nodemask_t all_node_mask = NODE_MASK_ALL;
+
+unsigned long __get_free_pgtable_pages(gfp_t gfp_mask,
+				       unsigned int order)
+{
+	struct page *page;
+
+	page = __alloc_pages_nodemask(gfp_mask, order, numa_node_id(),
+				      &all_node_mask);
+	if (!page)
+		return 0;
+	return (unsigned long) page_address(page);
+}
+EXPORT_SYMBOL(__get_free_pgtable_pages);
 
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-	return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);
+	return (pte_t *)__get_free_pgtable_pages(PGALLOC_GFP & ~__GFP_ACCOUNT, 0);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	struct page *pte;
 
-	pte = alloc_pages(__userpte_alloc_gfp, 0);
+	pte = __alloc_pages_nodemask(__userpte_alloc_gfp, 0, numa_node_id(),
+				     &all_node_mask);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
@@ -241,7 +254,7 @@ static int preallocate_pmds(struct mm_st
 		gfp &= ~__GFP_ACCOUNT;
 
 	for (i = 0; i < count; i++) {
-		pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
+		pmd_t *pmd = (pmd_t *)__get_free_pgtable_pages(gfp, 0);
 		if (!pmd)
 			failed = true;
 		if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
@@ -422,7 +435,8 @@ static inline void _pgd_free(pgd_t *pgd)
 
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
+	return (pgd_t *)__get_free_pgtable_pages(PGALLOC_GFP,
+						 PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)

--- linux.orig/arch/x86/include/asm/pgalloc.h	2018-12-26 19:40:12.992251270 +0800
+++ linux/arch/x86/include/asm/pgalloc.h	2018-12-26 19:42:35.531621035 +0800
@@ -96,10 +96,11 @@ static inline pmd_t *pmd_alloc_one(struc
 {
 	struct page *page;
 	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
+	nodemask_t all_node_mask = NODE_MASK_ALL;
 
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
-	page = alloc_pages(gfp, 0);
+	page = __alloc_pages_nodemask(gfp, 0, numa_node_id(), &all_node_mask);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -141,13 +142,16 @@ static inline void p4d_populate(struct m
 	set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
+extern unsigned long __get_free_pgtable_pages(gfp_t gfp_mask,
+					      unsigned int order);
+
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
 	gfp_t gfp = GFP_KERNEL_ACCOUNT;
 
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
-	return (pud_t *)get_zeroed_page(gfp);
+	return (pud_t *)__get_free_pgtable_pages(gfp | __GFP_ZERO, 0);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -179,7 +183,7 @@ static inline p4d_t *p4d_alloc_one(struc
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
 
-	return (p4d_t *)get_zeroed_page(gfp);
+	return (p4d_t *)__get_free_pgtable_pages(gfp | __GFP_ZERO, 0);
 }
 
 static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
From patchwork Wed Dec 26 13:14:59 2018
From: Fengguang Wu
Date: Wed, 26 Dec 2018 21:14:59 +0800
Subject: [RFC][PATCH v2 13/21] x86/pgtable: don't check PMD accessed bit
Message-Id: <20181226133351.828074959@intel.com>
X-Patchwork-Id: 10743121

From: Jingqi Liu

ept-idle will clear the PMD accessed bit to speed up the PTE scan --
if the bit remains unset in the next scan, all 512 PTEs under that
PMD can be skipped.

So don't complain about a missing _PAGE_ACCESSED in pmd_bad().
Note that clearing the PMD accessed bit has its own cost, so the
optimization is likely only worthwhile for
- large idle areas
- sparsely populated areas

Signed-off-by: Jingqi Liu
Signed-off-by: Fengguang Wu
---
 arch/x86/include/asm/pgtable.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux.orig/arch/x86/include/asm/pgtable.h	2018-12-23 19:50:50.917902600 +0800
+++ linux/arch/x86/include/asm/pgtable.h	2018-12-23 19:50:50.913902605 +0800
@@ -821,7 +821,8 @@ static inline pte_t *pte_offset_kernel(p
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	return (pmd_flags(pmd) & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
+		(_KERNPG_TABLE & ~_PAGE_ACCESSED);
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
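To sanity check the relaxed predicate, here is a small user-space program.
This is a sketch only: the flag values are assumptions copied from
arch/x86/include/asm/pgtable_types.h of this era, and pmd_bad_old()/
pmd_bad_new() are illustrative names, not kernel code.

#include <assert.h>
#include <stdint.h>

/* assumed x86 flag values, for illustration only */
#define _PAGE_PRESENT	0x001
#define _PAGE_RW	0x002
#define _PAGE_USER	0x004
#define _PAGE_ACCESSED	0x020
#define _PAGE_DIRTY	0x040
#define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)

static int pmd_bad_old(uint64_t flags)
{
	return (flags & ~_PAGE_USER) != _KERNPG_TABLE;
}

static int pmd_bad_new(uint64_t flags)
{
	return (flags & ~(_PAGE_USER | _PAGE_ACCESSED)) !=
		(_KERNPG_TABLE & ~_PAGE_ACCESSED);
}

int main(void)
{
	/* a normal page-table PMD whose accessed bit ept-idle just cleared */
	uint64_t cleared = _KERNPG_TABLE & ~_PAGE_ACCESSED;

	assert(pmd_bad_old(cleared));		/* old check: false positive */
	assert(!pmd_bad_new(cleared));		/* new check: still a good PMD */
	assert(!pmd_bad_new(_KERNPG_TABLE));	/* untouched PMDs stay good */
	return 0;
}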
From patchwork Wed Dec 26 13:15:00 2018
Message-Id: <20181226133351.894160986@intel.com>
User-Agent: quilt/0.65
Date: Wed, 26 Dec 2018 21:15:00 +0800
From: Fengguang Wu
To: Andrew Morton
Cc: Linux Memory Management List, Nikita Leshenko, Christian Borntraeger, kvm@vger.kernel.org, LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi, Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams
Subject: [RFC][PATCH v2 14/21] kvm: register in mm_struct
References: <20181226131446.330864849@intel.com>
Content-Disposition: inline; filename=0009-kvm-register-in-mm_struct.patch

A VM is associated with an address space, not with a specific thread.
From Documentation/virtual/kvm/api.txt:

  Only run VM ioctls from the same process (address space) that was used
  to create the VM.

CC: Nikita Leshenko
CC: Christian Borntraeger
Signed-off-by: Fengguang Wu
---
 include/linux/mm_types.h | 11 +++++++++++
 virt/kvm/kvm_main.c      |  3 +++
 2 files changed, 14 insertions(+)

--- linux.orig/include/linux/mm_types.h	2018-12-23 19:58:06.993417137 +0800
+++ linux/include/linux/mm_types.h	2018-12-23 19:58:06.993417137 +0800
@@ -27,6 +27,7 @@ typedef int vm_fault_t;
 
 struct address_space;
 struct mem_cgroup;
 struct hmm;
+struct kvm;
 
 /*
  * Each physical page in the system has a struct page associated with
@@ -496,6 +497,10 @@ struct mm_struct {
 		/* HMM needs to track a few things per mm */
 		struct hmm *hmm;
 #endif
+
+#if IS_ENABLED(CONFIG_KVM)
+		struct kvm *kvm;
+#endif
 	} __randomize_layout;
 
 /*
@@ -507,6 +512,12 @@ struct mm_struct {
 
 extern struct mm_struct init_mm;
 
+#if IS_ENABLED(CONFIG_KVM)
+static inline struct kvm *mm_kvm(struct mm_struct *mm) { return mm->kvm; }
+#else
+static inline struct kvm *mm_kvm(struct mm_struct *mm) { return NULL; }
+#endif
+
 /* Pointer magic because the dynamic array size confuses some compilers. */
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
--- linux.orig/virt/kvm/kvm_main.c	2018-12-23 19:58:06.993417137 +0800
+++ linux/virt/kvm/kvm_main.c	2018-12-23 19:58:06.993417137 +0800
@@ -727,6 +727,7 @@ static void kvm_destroy_vm(struct kvm *k
 	struct mm_struct *mm = kvm->mm;
 
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
+	mm->kvm = NULL;
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
 	spin_lock(&kvm_lock);
@@ -3224,6 +3225,8 @@ static int kvm_dev_ioctl_create_vm(unsig
 		fput(file);
 		return -ENOMEM;
 	}
+
+	kvm->mm->kvm = kvm;
 	kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
 
 	fd_install(r, file);

From patchwork Wed Dec 26 13:15:01 2018
Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga104.fm.intel.com with
ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 26 Dec 2018 05:37:05 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,400,1539673200"; d="scan'208";a="121185476" Received: from wangdan1-mobl1.ccr.corp.intel.com (HELO wfg-t570.sh.intel.com) ([10.254.210.154]) by FMSMGA003.fm.intel.com with ESMTP; 26 Dec 2018 05:37:02 -0800 Received: from wfg by wfg-t570.sh.intel.com with local (Exim 4.89) (envelope-from ) id 1gc9Mr-0005P3-JE; Wed, 26 Dec 2018 21:37:01 +0800 Message-Id: <20181226133351.956098465@intel.com> User-Agent: quilt/0.65 Date: Wed, 26 Dec 2018 21:15:01 +0800 From: Fengguang Wu To: Andrew Morton cc: Linux Memory Management List , Dave Hansen , Peng Dong , Liu Jingqi , Fengguang Wu cc: kvm@vger.kernel.org Cc: LKML cc: Fan Du cc: Yao Yuan cc: Huang Ying cc: Dong Eddie cc: Zhang Yi cc: Dan Williams Subject: [RFC][PATCH v2 15/21] ept-idle: EPT walk for virtual machine References: <20181226131446.330864849@intel.com> MIME-Version: 1.0 Content-Disposition: inline; filename=0014-kvm-ept-idle-EPT-page-table-walk-for-A-bits.patch Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP For virtual machines, "accessed" bits will be set in guest page tables and EPT/NPT. So for qemu-kvm process, convert HVA to GFN to GPA, then do EPT/NPT walks. This borrows host page table walk macros/functions to do EPT/NPT walk. So it depends on them using the same level. As proposed by Dave Hansen, invalidate TLB when finished one round of scan, in order to ensure HW will set accessed bit for super-hot pages. V2: convert idle_bitmap to idle_pages to be more efficient on - huge pages - sparse page table - ranges of similar pages The new idle_pages file contains a series of records of different size reporting ranges of different page size to user space. That interface has a major downside: it breaks read() assumption about range_to_read == read_buffer_size. Now we workaround this problem by deducing range_to_read from read_buffer_size, and let read() return when either read_buffer_size is filled, or range_to_read is fully scanned. To make a more precise interface, we may need further switch to ioctl(). CC: Dave Hansen Signed-off-by: Peng Dong Signed-off-by: Liu Jingqi Signed-off-by: Fengguang Wu --- arch/x86/kvm/ept_idle.c | 637 ++++++++++++++++++++++++++++++++++++++ arch/x86/kvm/ept_idle.h | 116 ++++++ 2 files changed, 753 insertions(+) create mode 100644 arch/x86/kvm/ept_idle.c create mode 100644 arch/x86/kvm/ept_idle.h --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/arch/x86/kvm/ept_idle.c 2018-12-26 20:38:07.298994533 +0800 @@ -0,0 +1,637 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ept_idle.h" + +/* #define DEBUG 1 */ + +#ifdef DEBUG + +#define debug_printk trace_printk + +#define set_restart_gpa(val, note) ({ \ + unsigned long old_val = eic->restart_gpa; \ + eic->restart_gpa = (val); \ + trace_printk("restart_gpa=%lx %luK %s %s %d\n", \ + (val), (eic->restart_gpa - old_val) >> 10, \ + note, __func__, __LINE__); \ +}) + +#define set_next_hva(val, note) ({ \ + unsigned long old_val = eic->next_hva; \ + eic->next_hva = (val); \ + trace_printk(" next_hva=%lx %luK %s %s %d\n", \ + (val), (eic->next_hva - old_val) >> 10, \ + note, __func__, __LINE__); \ +}) + +#else + +#define debug_printk(...) 
+ +#define set_restart_gpa(val, note) ({ \ + eic->restart_gpa = (val); \ +}) + +#define set_next_hva(val, note) ({ \ + eic->next_hva = (val); \ +}) + +#endif + +static unsigned long pagetype_size[16] = { + [PTE_ACCESSED] = PAGE_SIZE, /* 4k page */ + [PMD_ACCESSED] = PMD_SIZE, /* 2M page */ + [PUD_PRESENT] = PUD_SIZE, /* 1G page */ + + [PTE_DIRTY] = PAGE_SIZE, + [PMD_DIRTY] = PMD_SIZE, + + [PTE_IDLE] = PAGE_SIZE, + [PMD_IDLE] = PMD_SIZE, + [PMD_IDLE_PTES] = PMD_SIZE, + + [PTE_HOLE] = PAGE_SIZE, + [PMD_HOLE] = PMD_SIZE, +}; + +static void u64_to_u8(uint64_t n, uint8_t *p) +{ + p += sizeof(uint64_t) - 1; + + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p-- = n; n >>= 8; + *p = n; +} + +static void dump_eic(struct ept_idle_ctrl *eic) +{ + debug_printk("ept_idle_ctrl: pie_read=%d pie_read_max=%d buf_size=%d " + "bytes_copied=%d next_hva=%lx restart_gpa=%lx " + "gpa_to_hva=%lx\n", + eic->pie_read, + eic->pie_read_max, + eic->buf_size, + eic->bytes_copied, + eic->next_hva, + eic->restart_gpa, + eic->gpa_to_hva); +} + +static void eic_report_addr(struct ept_idle_ctrl *eic, unsigned long addr) +{ + unsigned long hva; + eic->kpie[eic->pie_read++] = PIP_CMD_SET_HVA; + hva = addr; + u64_to_u8(hva, &eic->kpie[eic->pie_read]); + eic->pie_read += sizeof(uint64_t); + debug_printk("eic_report_addr %lx\n", addr); + dump_eic(eic); +} + +static int eic_add_page(struct ept_idle_ctrl *eic, + unsigned long addr, + unsigned long next, + enum ProcIdlePageType page_type) +{ + int page_size = pagetype_size[page_type]; + + debug_printk("eic_add_page addr=%lx next=%lx " + "page_type=%d pagesize=%dK\n", + addr, next, (int)page_type, (int)page_size >> 10); + dump_eic(eic); + + /* align kernel/user vision of cursor position */ + next = round_up(next, page_size); + + if (!eic->pie_read || + addr + eic->gpa_to_hva != eic->next_hva) { + /* merge hole */ + if (page_type == PTE_HOLE || + page_type == PMD_HOLE) { + set_restart_gpa(next, "PTE_HOLE|PMD_HOLE"); + return 0; + } + + if (addr + eic->gpa_to_hva < eic->next_hva) { + debug_printk("ept_idle: addr moves backwards\n"); + WARN_ONCE(1, "ept_idle: addr moves backwards"); + } + + if (eic->pie_read + sizeof(uint64_t) + 2 >= eic->pie_read_max) { + set_restart_gpa(addr, "EPT_IDLE_KBUF_FULL"); + return EPT_IDLE_KBUF_FULL; + } + + eic_report_addr(eic, round_down(addr, page_size) + + eic->gpa_to_hva); + } else { + if (PIP_TYPE(eic->kpie[eic->pie_read - 1]) == page_type && + PIP_SIZE(eic->kpie[eic->pie_read - 1]) < 0xF) { + set_next_hva(next + eic->gpa_to_hva, "IN-PLACE INC"); + set_restart_gpa(next, "IN-PLACE INC"); + eic->kpie[eic->pie_read - 1]++; + WARN_ONCE(page_size < next-addr, "next-addr too large"); + return 0; + } + if (eic->pie_read >= eic->pie_read_max) { + set_restart_gpa(addr, "EPT_IDLE_KBUF_FULL"); + return EPT_IDLE_KBUF_FULL; + } + } + + set_next_hva(next + eic->gpa_to_hva, "NEW-ITEM"); + set_restart_gpa(next, "NEW-ITEM"); + eic->kpie[eic->pie_read] = PIP_COMPOSE(page_type, 1); + eic->pie_read++; + + return 0; +} + +static int ept_pte_range(struct ept_idle_ctrl *eic, + pmd_t *pmd, unsigned long addr, unsigned long end) +{ + pte_t *pte; + enum ProcIdlePageType page_type; + int err = 0; + + pte = pte_offset_kernel(pmd, addr); + do { + if (!ept_pte_present(*pte)) + page_type = PTE_HOLE; + else if (!test_and_clear_bit(_PAGE_BIT_EPT_ACCESSED, + (unsigned long *) &pte->pte)) + page_type = PTE_IDLE; + else { + page_type = PTE_ACCESSED; + } + + err = eic_add_page(eic, addr, addr + PAGE_SIZE, 
page_type); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != end); + + return err; +} + +static int ept_pmd_range(struct ept_idle_ctrl *eic, + pud_t *pud, unsigned long addr, unsigned long end) +{ + pmd_t *pmd; + unsigned long next; + enum ProcIdlePageType page_type; + enum ProcIdlePageType pte_page_type; + int err = 0; + + if (eic->flags & SCAN_HUGE_PAGE) + pte_page_type = PMD_IDLE_PTES; + else + pte_page_type = IDLE_PAGE_TYPE_MAX; + + pmd = pmd_offset(pud, addr); + do { + next = pmd_addr_end(addr, end); + + if (!ept_pmd_present(*pmd)) + page_type = PMD_HOLE; /* likely won't hit here */ + else if (!test_and_clear_bit(_PAGE_BIT_EPT_ACCESSED, + (unsigned long *)pmd)) { + if (pmd_large(*pmd)) + page_type = PMD_IDLE; + else if (eic->flags & SCAN_SKIM_IDLE) + page_type = PMD_IDLE_PTES; + else + page_type = pte_page_type; + } else if (pmd_large(*pmd)) { + page_type = PMD_ACCESSED; + } else + page_type = pte_page_type; + + if (page_type != IDLE_PAGE_TYPE_MAX) + err = eic_add_page(eic, addr, next, page_type); + else + err = ept_pte_range(eic, pmd, addr, next); + if (err) + break; + } while (pmd++, addr = next, addr != end); + + return err; +} + +static int ept_pud_range(struct ept_idle_ctrl *eic, + p4d_t *p4d, unsigned long addr, unsigned long end) +{ + pud_t *pud; + unsigned long next; + int err = 0; + + pud = pud_offset(p4d, addr); + do { + next = pud_addr_end(addr, end); + + if (!ept_pud_present(*pud)) { + set_restart_gpa(next, "PUD_HOLE"); + continue; + } + + if (pud_large(*pud)) + err = eic_add_page(eic, addr, next, PUD_PRESENT); + else + err = ept_pmd_range(eic, pud, addr, next); + + if (err) + break; + } while (pud++, addr = next, addr != end); + + return err; +} + +static int ept_p4d_range(struct ept_idle_ctrl *eic, + pgd_t *pgd, unsigned long addr, unsigned long end) +{ + p4d_t *p4d; + unsigned long next; + int err = 0; + + p4d = p4d_offset(pgd, addr); + do { + next = p4d_addr_end(addr, end); + if (!ept_p4d_present(*p4d)) { + set_restart_gpa(next, "P4D_HOLE"); + continue; + } + + err = ept_pud_range(eic, p4d, addr, next); + if (err) + break; + } while (p4d++, addr = next, addr != end); + + return err; +} + +static int ept_page_range(struct ept_idle_ctrl *eic, + unsigned long addr, + unsigned long end) +{ + struct kvm_vcpu *vcpu; + struct kvm_mmu *mmu; + pgd_t *ept_root; + pgd_t *pgd; + unsigned long next; + int err = 0; + + BUG_ON(addr >= end); + + spin_lock(&eic->kvm->mmu_lock); + + vcpu = kvm_get_vcpu(eic->kvm, 0); + if (!vcpu) { + err = -EINVAL; + goto out_unlock; + } + + mmu = vcpu->arch.mmu; + if (!VALID_PAGE(mmu->root_hpa)) { + err = -EINVAL; + goto out_unlock; + } + + ept_root = __va(mmu->root_hpa); + + local_irq_disable(); + pgd = pgd_offset_pgd(ept_root, addr); + do { + next = pgd_addr_end(addr, end); + if (!ept_pgd_present(*pgd)) { + set_restart_gpa(next, "PGD_HOLE"); + continue; + } + + err = ept_p4d_range(eic, pgd, addr, next); + if (err) + break; + } while (pgd++, addr = next, addr != end); + local_irq_enable(); +out_unlock: + spin_unlock(&eic->kvm->mmu_lock); + return err; +} + +static void init_ept_idle_ctrl_buffer(struct ept_idle_ctrl *eic) +{ + eic->pie_read = 0; + eic->pie_read_max = min(EPT_IDLE_KBUF_SIZE, + eic->buf_size - eic->bytes_copied); + /* reserve space for PIP_CMD_SET_HVA in the end */ + eic->pie_read_max -= sizeof(uint64_t) + 1; + memset(eic->kpie, 0, sizeof(eic->kpie)); +} + +static int ept_idle_copy_user(struct ept_idle_ctrl *eic, + unsigned long start, unsigned long end) +{ + int bytes_read; + int lc = 0; /* last copy? 
*/ + int ret; + + debug_printk("ept_idle_copy_user %lx %lx\n", start, end); + dump_eic(eic); + + /* Break out of loop on no more progress. */ + if (!eic->pie_read) { + lc = 1; + if (start < end) + start = end; + } + + if (start >= end && start > eic->next_hva) { + set_next_hva(start, "TAIL-HOLE"); + eic_report_addr(eic, start); + } + + bytes_read = eic->pie_read; + if (!bytes_read) + return 1; + + ret = copy_to_user(eic->buf, eic->kpie, bytes_read); + if (ret) + return -EFAULT; + + eic->buf += bytes_read; + eic->bytes_copied += bytes_read; + if (eic->bytes_copied >= eic->buf_size) + return EPT_IDLE_BUF_FULL; + if (lc) + return lc; + + init_ept_idle_ctrl_buffer(eic); + cond_resched(); + return 0; +} + +/* + * Depending on whether hva falls in a memslot: + * + * 1) found => return gpa and remaining memslot size in *addr_range + * + * |<----- addr_range --------->| + * [ mem slot ] + * ^hva + * + * 2) not found => return hole size in *addr_range + * + * |<----- addr_range --------->| + * [ first mem slot above hva ] + * ^hva + * + * If hva is above all mem slots, *addr_range will be ~0UL. We can finish read(2). + */ +static unsigned long ept_idle_find_gpa(struct ept_idle_ctrl *eic, + unsigned long hva, + unsigned long *addr_range) +{ + struct kvm *kvm = eic->kvm; + struct kvm_memslots *slots; + struct kvm_memory_slot *memslot; + unsigned long hva_end; + gfn_t gfn; + + *addr_range = ~0UL; + mutex_lock(&kvm->slots_lock); + slots = kvm_memslots(eic->kvm); + kvm_for_each_memslot(memslot, slots) { + hva_end = memslot->userspace_addr + + (memslot->npages << PAGE_SHIFT); + + if (hva >= memslot->userspace_addr && hva < hva_end) { + gpa_t gpa; + gfn = hva_to_gfn_memslot(hva, memslot); + *addr_range = hva_end - hva; + gpa = gfn_to_gpa(gfn); + debug_printk("ept_idle_find_gpa slot %lx=>%llx %lx=>%llx " + "delta %llx size %lx\n", + memslot->userspace_addr, + gfn_to_gpa(memslot->base_gfn), + hva, gpa, + hva - gpa, + memslot->npages << PAGE_SHIFT); + mutex_unlock(&kvm->slots_lock); + return gpa; + } + + if (memslot->userspace_addr > hva) + *addr_range = min(*addr_range, + memslot->userspace_addr - hva); + } + mutex_unlock(&kvm->slots_lock); + return INVALID_PAGE; +} + +static int ept_idle_supports_cpu(struct kvm *kvm) +{ + struct kvm_vcpu *vcpu; + struct kvm_mmu *mmu; + int ret; + + vcpu = kvm_get_vcpu(kvm, 0); + if (!vcpu) + return -EINVAL; + + spin_lock(&kvm->mmu_lock); + mmu = vcpu->arch.mmu; + if (mmu->mmu_role.base.ad_disabled) { + printk(KERN_NOTICE + "CPU does not support EPT A/D bits tracking\n"); + ret = -EINVAL; + } else if (mmu->shadow_root_level != 4 + (! 
!pgtable_l5_enabled())) { + printk(KERN_NOTICE "Unsupported EPT level %d\n", + mmu->shadow_root_level); + ret = -EINVAL; + } else + ret = 0; + spin_unlock(&kvm->mmu_lock); + + return ret; +} + +static int ept_idle_walk_hva_range(struct ept_idle_ctrl *eic, + unsigned long start, unsigned long end) +{ + unsigned long gpa_addr; + unsigned long addr_range; + int ret; + + ret = ept_idle_supports_cpu(eic->kvm); + if (ret) + return ret; + + init_ept_idle_ctrl_buffer(eic); + + for (; start < end;) { + gpa_addr = ept_idle_find_gpa(eic, start, &addr_range); + + if (gpa_addr == INVALID_PAGE) { + eic->gpa_to_hva = 0; + if (addr_range == ~0UL) /* beyond max virtual address */ + set_restart_gpa(TASK_SIZE, "EOF"); + else { + start += addr_range; + set_restart_gpa(start, "OUT-OF-SLOT"); + } + } else { + eic->gpa_to_hva = start - gpa_addr; + ept_page_range(eic, gpa_addr, gpa_addr + addr_range); + } + + start = eic->restart_gpa + eic->gpa_to_hva; + ret = ept_idle_copy_user(eic, start, end); + if (ret) + break; + } + + if (eic->bytes_copied) + ret = 0; + return ret; +} + +static ssize_t ept_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + struct ept_idle_ctrl *eic; + unsigned long hva_start = *ppos; + unsigned long hva_end = hva_start + (count << (3 + PAGE_SHIFT)); + int ret; + + if (hva_start >= TASK_SIZE) { + debug_printk("ept_idle_read past TASK_SIZE: %lx %lx\n", + hva_start, TASK_SIZE); + return 0; + } + + if (!mm_kvm(mm)) + return mm_idle_read(file, buf, count, ppos); + + if (hva_end <= hva_start) { + debug_printk("ept_idle_read past EOF: %lx %lx\n", + hva_start, hva_end); + return 0; + } + if (*ppos & (PAGE_SIZE - 1)) { + debug_printk("ept_idle_read unaligned ppos: %lx\n", + hva_start); + return -EINVAL; + } + if (count < EPT_IDLE_BUF_MIN) { + debug_printk("ept_idle_read small count: %lx\n", + (unsigned long)count); + return -EINVAL; + } + + eic = kzalloc(sizeof(*eic), GFP_KERNEL); + if (!eic) + return -ENOMEM; + + if (!mm || !mmget_not_zero(mm)) { + ret = -ESRCH; + goto out_free_eic; + } + + eic->buf = buf; + eic->buf_size = count; + eic->mm = mm; + eic->kvm = mm_kvm(mm); + if (!eic->kvm) { + ret = -EINVAL; + goto out_mm; + } + + kvm_get_kvm(eic->kvm); + + ret = ept_idle_walk_hva_range(eic, hva_start, hva_end); + if (ret) + goto out_kvm; + + ret = eic->bytes_copied; + *ppos = eic->next_hva; + debug_printk("ppos=%lx bytes_copied=%d\n", + eic->next_hva, ret); +out_kvm: + kvm_put_kvm(eic->kvm); +out_mm: + mmput(mm); +out_free_eic: + kfree(eic); + return ret; +} + +static int ept_idle_open(struct inode *inode, struct file *file) +{ + if (!try_module_get(THIS_MODULE)) + return -EBUSY; + + return 0; +} + +static int ept_idle_release(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = file->private_data; + struct kvm *kvm; + int ret = 0; + + if (!mm) { + ret = -EBADF; + goto out; + } + + kvm = mm_kvm(mm); + if (!kvm) { + ret = -EINVAL; + goto out; + } + + spin_lock(&kvm->mmu_lock); + kvm_flush_remote_tlbs(kvm); + spin_unlock(&kvm->mmu_lock); + +out: + module_put(THIS_MODULE); + return ret; +} + +extern struct file_operations proc_ept_idle_operations; + +static int ept_idle_entry(void) +{ + proc_ept_idle_operations.owner = THIS_MODULE; + proc_ept_idle_operations.read = ept_idle_read; + proc_ept_idle_operations.open = ept_idle_open; + proc_ept_idle_operations.release = ept_idle_release; + + return 0; +} + +static void ept_idle_exit(void) +{ + memset(&proc_ept_idle_operations, 0, sizeof(proc_ept_idle_operations)); +} + 
+MODULE_LICENSE("GPL"); +module_init(ept_idle_entry); +module_exit(ept_idle_exit); --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux/arch/x86/kvm/ept_idle.h 2018-12-26 20:32:09.775444685 +0800 @@ -0,0 +1,116 @@ +#ifndef _EPT_IDLE_H +#define _EPT_IDLE_H + +#define SCAN_HUGE_PAGE O_NONBLOCK /* only huge page */ +#define SCAN_SKIM_IDLE O_NOFOLLOW /* stop on PMD_IDLE_PTES */ + +enum ProcIdlePageType { + PTE_ACCESSED, /* 4k page */ + PMD_ACCESSED, /* 2M page */ + PUD_PRESENT, /* 1G page */ + + PTE_DIRTY, + PMD_DIRTY, + + PTE_IDLE, + PMD_IDLE, + PMD_IDLE_PTES, /* all PTE idle */ + + PTE_HOLE, + PMD_HOLE, + + PIP_CMD, + + IDLE_PAGE_TYPE_MAX +}; + +#define PIP_TYPE(a) (0xf & (a >> 4)) +#define PIP_SIZE(a) (0xf & a) +#define PIP_COMPOSE(type, nr) ((type << 4) | nr) + +#define PIP_CMD_SET_HVA PIP_COMPOSE(PIP_CMD, 0) + +#define _PAGE_BIT_EPT_ACCESSED 8 +#define _PAGE_EPT_ACCESSED (_AT(pteval_t, 1) << _PAGE_BIT_EPT_ACCESSED) + +#define _PAGE_EPT_PRESENT (_AT(pteval_t, 7)) + +static inline int ept_pte_present(pte_t a) +{ + return pte_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pmd_present(pmd_t a) +{ + return pmd_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pud_present(pud_t a) +{ + return pud_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_p4d_present(p4d_t a) +{ + return p4d_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pgd_present(pgd_t a) +{ + return pgd_flags(a) & _PAGE_EPT_PRESENT; +} + +static inline int ept_pte_accessed(pte_t a) +{ + return pte_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_pmd_accessed(pmd_t a) +{ + return pmd_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_pud_accessed(pud_t a) +{ + return pud_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_p4d_accessed(p4d_t a) +{ + return p4d_flags(a) & _PAGE_EPT_ACCESSED; +} + +static inline int ept_pgd_accessed(pgd_t a) +{ + return pgd_flags(a) & _PAGE_EPT_ACCESSED; +} + +extern struct file_operations proc_ept_idle_operations; + +#define EPT_IDLE_KBUF_FULL 1 +#define EPT_IDLE_BUF_FULL 2 +#define EPT_IDLE_BUF_MIN (sizeof(uint64_t) * 2 + 3) + +#define EPT_IDLE_KBUF_SIZE 8000 + +struct ept_idle_ctrl { + struct mm_struct *mm; + struct kvm *kvm; + + uint8_t kpie[EPT_IDLE_KBUF_SIZE]; + int pie_read; + int pie_read_max; + + void __user *buf; + int buf_size; + int bytes_copied; + + unsigned long next_hva; /* GPA for EPT; VA for PT */ + unsigned long gpa_to_hva; + unsigned long restart_gpa; + unsigned long last_va; + + unsigned int flags; +}; + +#endif From patchwork Wed Dec 26 13:15:02 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Fengguang Wu X-Patchwork-Id: 10743115 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8578591E for ; Wed, 26 Dec 2018 13:37:58 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7375D28495 for ; Wed, 26 Dec 2018 13:37:58 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6754D28938; Wed, 26 Dec 2018 13:37:58 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by 
mail.wl.linuxfoundation.org (Postfix) with ESMTP id DC70728900 for ; Wed, 26 Dec 2018 13:37:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727001AbeLZNhH (ORCPT ); Wed, 26 Dec 2018 08:37:07 -0500 Received: from mga04.intel.com ([192.55.52.120]:33944 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726969AbeLZNhH (ORCPT ); Wed, 26 Dec 2018 08:37:07 -0500 X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 26 Dec 2018 05:37:05 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,400,1539673200"; d="scan'208";a="121185469" Received: from wangdan1-mobl1.ccr.corp.intel.com (HELO wfg-t570.sh.intel.com) ([10.254.210.154]) by FMSMGA003.fm.intel.com with ESMTP; 26 Dec 2018 05:37:02 -0800 Received: from wfg by wfg-t570.sh.intel.com with local (Exim 4.89) (envelope-from ) id 1gc9Mr-0005P8-Jt; Wed, 26 Dec 2018 21:37:01 +0800 Message-Id: <20181226133352.012352050@intel.com> User-Agent: quilt/0.65 Date: Wed, 26 Dec 2018 21:15:02 +0800 From: Fengguang Wu To: Andrew Morton cc: Linux Memory Management List , Zhang Yi , Fengguang Wu cc: kvm@vger.kernel.org Cc: LKML cc: Fan Du cc: Yao Yuan cc: Peng Dong cc: Huang Ying CC: Liu Jingqi cc: Dong Eddie cc: Dave Hansen cc: Dan Williams Subject: [RFC][PATCH v2 16/21] mm-idle: mm_walk for normal task References: <20181226131446.330864849@intel.com> MIME-Version: 1.0 Content-Disposition: inline; filename=0015-page-idle-Added-mmu-idle-page-walk.patch Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Zhang Yi File pages are skipped for now. They are in general not guaranteed to be mapped. It means when become hot, there is no guarantee to find and move them to DRAM nodes. Signed-off-by: Zhang Yi Signed-off-by: Fengguang Wu --- arch/x86/kvm/ept_idle.c | 204 ++++++++++++++++++++++++++++++++++++++ mm/pagewalk.c | 1 2 files changed, 205 insertions(+) --- linux.orig/arch/x86/kvm/ept_idle.c 2018-12-26 19:58:30.576894801 +0800 +++ linux/arch/x86/kvm/ept_idle.c 2018-12-26 19:58:39.840936072 +0800 @@ -510,6 +510,9 @@ static int ept_idle_walk_hva_range(struc return ret; } +static ssize_t mm_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos); + static ssize_t ept_idle_read(struct file *file, char *buf, size_t count, loff_t *ppos) { @@ -615,6 +618,207 @@ out: return ret; } +static int mm_idle_pte_range(struct ept_idle_ctrl *eic, pmd_t *pmd, + unsigned long addr, unsigned long next) +{ + enum ProcIdlePageType page_type; + pte_t *pte; + int err = 0; + + pte = pte_offset_kernel(pmd, addr); + do { + if (!pte_present(*pte)) + page_type = PTE_HOLE; + else if (!test_and_clear_bit(_PAGE_BIT_ACCESSED, + (unsigned long *) &pte->pte)) + page_type = PTE_IDLE; + else { + page_type = PTE_ACCESSED; + } + + err = eic_add_page(eic, addr, addr + PAGE_SIZE, page_type); + if (err) + break; + } while (pte++, addr += PAGE_SIZE, addr != next); + + return err; +} + +static int mm_idle_pmd_entry(pmd_t *pmd, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct ept_idle_ctrl *eic = walk->private; + enum ProcIdlePageType page_type; + enum ProcIdlePageType pte_page_type; + int err; + + /* + * Skip duplicate PMD_IDLE_PTES: when the PMD crosses VMA boundary, + * walk_page_range() can call on the same PMD twice. 
+ */ + if ((addr & PMD_MASK) == (eic->last_va & PMD_MASK)) { + debug_printk("ignore duplicate addr %lx %lx\n", + addr, eic->last_va); + return 0; + } + eic->last_va = addr; + + if (eic->flags & SCAN_HUGE_PAGE) + pte_page_type = PMD_IDLE_PTES; + else + pte_page_type = IDLE_PAGE_TYPE_MAX; + + if (!pmd_present(*pmd)) + page_type = PMD_HOLE; + else if (!test_and_clear_bit(_PAGE_BIT_ACCESSED, (unsigned long *)pmd)) { + if (pmd_large(*pmd)) + page_type = PMD_IDLE; + else if (eic->flags & SCAN_SKIM_IDLE) + page_type = PMD_IDLE_PTES; + else + page_type = pte_page_type; + } else if (pmd_large(*pmd)) { + page_type = PMD_ACCESSED; + } else + page_type = pte_page_type; + + if (page_type != IDLE_PAGE_TYPE_MAX) + err = eic_add_page(eic, addr, next, page_type); + else + err = mm_idle_pte_range(eic, pmd, addr, next); + + return err; +} + +static int mm_idle_pud_entry(pud_t *pud, unsigned long addr, + unsigned long next, struct mm_walk *walk) +{ + struct ept_idle_ctrl *eic = walk->private; + + if ((addr & PUD_MASK) != (eic->last_va & PUD_MASK)) { + eic_add_page(eic, addr, next, PUD_PRESENT); + eic->last_va = addr; + } + return 1; +} + +static int mm_idle_test_walk(unsigned long start, unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma = walk->vma; + + if (vma->vm_file) { + if ((vma->vm_flags & (VM_WRITE|VM_MAYSHARE)) == VM_WRITE) + return 0; + return 1; + } + + return 0; +} + +static int mm_idle_walk_range(struct ept_idle_ctrl *eic, + unsigned long start, + unsigned long end, + struct mm_walk *walk) +{ + struct vm_area_struct *vma; + int ret; + + init_ept_idle_ctrl_buffer(eic); + + for (; start < end;) + { + down_read(&walk->mm->mmap_sem); + vma = find_vma(walk->mm, start); + if (vma) { + if (end > vma->vm_start) { + local_irq_disable(); + ret = walk_page_range(start, end, walk); + local_irq_enable(); + } else + set_restart_gpa(vma->vm_start, "VMA-HOLE"); + } else + set_restart_gpa(TASK_SIZE, "EOF"); + up_read(&walk->mm->mmap_sem); + + WARN_ONCE(eic->gpa_to_hva, "non-zero gpa_to_hva"); + start = eic->restart_gpa; + ret = ept_idle_copy_user(eic, start, end); + if (ret) + break; + } + + if (eic->bytes_copied) { + if (ret != EPT_IDLE_BUF_FULL && eic->next_hva < end) + debug_printk("partial scan: next_hva=%lx end=%lx\n", + eic->next_hva, end); + ret = 0; + } else + WARN_ONCE(1, "nothing read"); + return ret; +} + +static ssize_t mm_idle_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + struct mm_struct *mm = file->private_data; + struct mm_walk mm_walk = {}; + struct ept_idle_ctrl *eic; + unsigned long va_start = *ppos; + unsigned long va_end = va_start + (count << (3 + PAGE_SHIFT)); + int ret; + + if (va_end <= va_start) { + debug_printk("mm_idle_read past EOF: %lx %lx\n", + va_start, va_end); + return 0; + } + if (*ppos & (PAGE_SIZE - 1)) { + debug_printk("mm_idle_read unaligned ppos: %lx\n", + va_start); + return -EINVAL; + } + if (count < EPT_IDLE_BUF_MIN) { + debug_printk("mm_idle_read small count: %lx\n", + (unsigned long)count); + return -EINVAL; + } + + eic = kzalloc(sizeof(*eic), GFP_KERNEL); + if (!eic) + return -ENOMEM; + + if (!mm || !mmget_not_zero(mm)) { + ret = -ESRCH; + goto out_free; + } + + eic->buf = buf; + eic->buf_size = count; + eic->mm = mm; + eic->flags = file->f_flags; + + mm_walk.mm = mm; + mm_walk.pmd_entry = mm_idle_pmd_entry; + mm_walk.pud_entry = mm_idle_pud_entry; + mm_walk.test_walk = mm_idle_test_walk; + mm_walk.private = eic; + + ret = mm_idle_walk_range(eic, va_start, va_end, &mm_walk); + if (ret) + goto out_mm; + + ret = 
eic->bytes_copied; + *ppos = eic->next_hva; + debug_printk("ppos=%lx bytes_copied=%d\n", + eic->next_hva, ret); +out_mm: + mmput(mm); +out_free: + kfree(eic); + return ret; +} + extern struct file_operations proc_ept_idle_operations; static int ept_idle_entry(void) --- linux.orig/mm/pagewalk.c 2018-12-26 19:58:30.576894801 +0800 +++ linux/mm/pagewalk.c 2018-12-26 19:58:30.576894801 +0800 @@ -338,6 +338,7 @@ int walk_page_range(unsigned long start, } while (start = next, start < end); return err; } +EXPORT_SYMBOL(walk_page_range); int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk) { From patchwork Wed Dec 26 13:15:03 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Fengguang Wu X-Patchwork-Id: 10743069 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8B82491E for ; Wed, 26 Dec 2018 13:37:10 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 779FA28495 for ; Wed, 26 Dec 2018 13:37:10 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6BB6828938; Wed, 26 Dec 2018 13:37:10 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E4BB628495 for ; Wed, 26 Dec 2018 13:37:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727023AbeLZNhJ (ORCPT ); Wed, 26 Dec 2018 08:37:09 -0500 Received: from mga06.intel.com ([134.134.136.31]:21292 "EHLO mga06.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726724AbeLZNhH (ORCPT ); Wed, 26 Dec 2018 08:37:07 -0500 X-Amp-Result: UNSCANNABLE X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 26 Dec 2018 05:37:06 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,400,1539673200"; d="scan'208";a="113358947" Received: from wangdan1-mobl1.ccr.corp.intel.com (HELO wfg-t570.sh.intel.com) ([10.254.210.154]) by orsmga003.jf.intel.com with ESMTP; 26 Dec 2018 05:37:02 -0800 Received: from wfg by wfg-t570.sh.intel.com with local (Exim 4.89) (envelope-from ) id 1gc9Mr-0005PD-Kb; Wed, 26 Dec 2018 21:37:01 +0800 Message-Id: <20181226133352.076749877@intel.com> User-Agent: quilt/0.65 Date: Wed, 26 Dec 2018 21:15:03 +0800 From: Fengguang Wu To: Andrew Morton cc: Linux Memory Management List , Huang Ying , Brendan Gregg , Fengguang Wu cc: kvm@vger.kernel.org Cc: LKML cc: Fan Du cc: Yao Yuan cc: Peng Dong CC: Liu Jingqi cc: Dong Eddie cc: Dave Hansen cc: Zhang Yi cc: Dan Williams Subject: [RFC][PATCH v2 17/21] proc: introduce /proc/PID/idle_pages References: <20181226131446.330864849@intel.com> MIME-Version: 1.0 Content-Disposition: inline; filename=0008-proc-introduce-proc-PID-idle_pages.patch Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This will be similar to /sys/kernel/mm/page_idle/bitmap documented in Documentation/admin-guide/mm/idle_page_tracking.rst, however indexed by process virtual address. 
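Concretely, a read(2) on this file returns the byte-oriented record stream
whose PIP_* layout was defined by ept_idle.h in patch 15. The following
user-space decoder is a sketch only, not part of the patch: the record
layout is copied from ept_idle.h, while the function name and output format
are made up here.

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>

enum ProcIdlePageType {
	PTE_ACCESSED, PMD_ACCESSED, PUD_PRESENT,
	PTE_DIRTY, PMD_DIRTY,
	PTE_IDLE, PMD_IDLE, PMD_IDLE_PTES,
	PTE_HOLE, PMD_HOLE,
	PIP_CMD,
};

#define PIP_TYPE(a)		(0xf & ((a) >> 4))
#define PIP_SIZE(a)		(0xf & (a))
#define PIP_CMD_SET_HVA		(PIP_CMD << 4)

static const unsigned long pagetype_size[16] = {
	[PTE_ACCESSED]	= 4096,		[PMD_ACCESSED]	= 2UL << 20,
	[PUD_PRESENT]	= 1UL << 30,
	[PTE_DIRTY]	= 4096,		[PMD_DIRTY]	= 2UL << 20,
	[PTE_IDLE]	= 4096,		[PMD_IDLE]	= 2UL << 20,
	[PMD_IDLE_PTES]	= 2UL << 20,
	[PTE_HOLE]	= 4096,		[PMD_HOLE]	= 2UL << 20,
};

static void decode_idle_pages(const uint8_t *buf, ssize_t len)
{
	unsigned long va = 0;
	ssize_t i = 0;

	while (i < len) {
		uint8_t rec = buf[i++];

		if (rec == PIP_CMD_SET_HVA) {
			/* followed by a big-endian u64 address (u64_to_u8) */
			if (i + 8 > len)
				break;
			va = 0;
			for (int b = 0; b < 8; b++)
				va = (va << 8) | buf[i++];
			continue;
		}
		/* ordinary record: PIP_SIZE() pages of PIP_TYPE() */
		printf("va %#lx: type %d x %d\n",
		       va, PIP_TYPE(rec), PIP_SIZE(rec));
		va += PIP_SIZE(rec) * pagetype_size[PIP_TYPE(rec)];
	}
}

The 4-bit count plus the in-place increment in eic_add_page() is what lets
one byte describe up to 15 contiguous same-type pages.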
When using the global PFN-indexed idle bitmap, we found two kinds of
overhead:

- To track a task's working set, Brendan Gregg ended up writing wss-v1
  for small tasks and wss-v2 for large tasks
  (https://github.com/brendangregg/wss), because a task's VAs may point
  to random PAs scattered across the whole physical address space. So one
  either queries /proc/pid/pagemap first and then touches lots of random
  PFNs in the bitmap (with lots of syscalls), or writes+reads the whole
  system idle bitmap beforehand.

- Page table walking by PFN has much higher overhead than walking a page
  table in its natural order:
  - rmap queries
  - more locking
  - random memory reads/writes

This interface provides a cheap path for the common case of non-shared
mapped pages. To walk 1TB of memory in active 4k pages, scanning the
per-task idle bitmap costs 2s of system time versus 15s for the global
one, a ~7x speedup. The gap widens further if we consider
- the extra /proc/pid/pagemap walk
- that a natural page table walk can skip all 512 PTEs when the PMD is idle

OTOH, the per-task idle bitmap is not suitable in some situations:
- not accurate for shared pages
- does not work for non-mapped file pages
- does not perform well for sparse page tables (pointed out by Huang Ying)

So it complements, rather than replaces, the existing global idle bitmap.

CC: Huang Ying
CC: Brendan Gregg
Signed-off-by: Fengguang Wu
---
 fs/proc/base.c     |  2 +
 fs/proc/internal.h |  1
 fs/proc/task_mmu.c | 54 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+)

--- linux.orig/fs/proc/base.c	2018-12-23 20:08:14.228919325 +0800
+++ linux/fs/proc/base.c	2018-12-23 20:08:14.224919327 +0800
@@ -2969,6 +2969,7 @@ static const struct pid_entry tgid_base_
 	REG("smaps", S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+	REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -3357,6 +3358,7 @@ static const struct pid_entry tid_base_s
 	REG("smaps", S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+	REG("idle_pages", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
--- linux.orig/fs/proc/internal.h	2018-12-23 20:08:14.228919325 +0800
+++ linux/fs/proc/internal.h	2018-12-23 20:08:14.224919327 +0800
@@ -298,6 +298,7 @@ extern const struct file_operations proc
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_mm_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
--- linux.orig/fs/proc/task_mmu.c	2018-12-23 20:08:14.228919325 +0800
+++ linux/fs/proc/task_mmu.c	2018-12-23 20:08:14.224919327 +0800
@@ -1559,6 +1559,60 @@ const struct file_operations proc_pagema
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+/* will be filled when kvm_ept_idle module loads */
+struct file_operations proc_ept_idle_operations = {
+};
+EXPORT_SYMBOL_GPL(proc_ept_idle_operations);
+
+static ssize_t mm_idle_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	if (proc_ept_idle_operations.read)
+		return
proc_ept_idle_operations.read(file, buf, count, ppos); + + return 0; +} + + +static int mm_idle_open(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = proc_mem_open(inode, PTRACE_MODE_READ); + + if (IS_ERR(mm)) + return PTR_ERR(mm); + + file->private_data = mm; + + if (proc_ept_idle_operations.open) + return proc_ept_idle_operations.open(inode, file); + + return 0; +} + +static int mm_idle_release(struct inode *inode, struct file *file) +{ + struct mm_struct *mm = file->private_data; + + if (mm) { + if (!mm_kvm(mm)) + flush_tlb_mm(mm); + mmdrop(mm); + } + + if (proc_ept_idle_operations.release) + return proc_ept_idle_operations.release(inode, file); + + return 0; +} + +const struct file_operations proc_mm_idle_operations = { + .llseek = mem_lseek, /* borrow this */ + .read = mm_idle_read, + .open = mm_idle_open, + .release = mm_idle_release, +}; + #endif /* CONFIG_PROC_PAGE_MONITOR */ #ifdef CONFIG_NUMA From patchwork Wed Dec 26 13:15:04 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Fengguang Wu X-Patchwork-Id: 10743127 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4A30391E for ; Wed, 26 Dec 2018 13:38:26 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 389EC28495 for ; Wed, 26 Dec 2018 13:38:26 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 2C31628938; Wed, 26 Dec 2018 13:38:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D2AD328495 for ; Wed, 26 Dec 2018 13:38:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726867AbeLZNiO (ORCPT ); Wed, 26 Dec 2018 08:38:14 -0500 Received: from mga04.intel.com ([192.55.52.120]:33941 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726957AbeLZNhH (ORCPT ); Wed, 26 Dec 2018 08:37:07 -0500 X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 26 Dec 2018 05:37:04 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,400,1539673200"; d="scan'208";a="121185467" Received: from wangdan1-mobl1.ccr.corp.intel.com (HELO wfg-t570.sh.intel.com) ([10.254.210.154]) by FMSMGA003.fm.intel.com with ESMTP; 26 Dec 2018 05:37:02 -0800 Received: from wfg by wfg-t570.sh.intel.com with local (Exim 4.89) (envelope-from ) id 1gc9Mr-0005PI-LE; Wed, 26 Dec 2018 21:37:01 +0800 Message-Id: <20181226133352.133164898@intel.com> User-Agent: quilt/0.65 Date: Wed, 26 Dec 2018 21:15:04 +0800 From: Fengguang Wu To: Andrew Morton cc: Linux Memory Management List , Fengguang Wu cc: kvm@vger.kernel.org Cc: LKML cc: Fan Du cc: Yao Yuan cc: Peng Dong cc: Huang Ying CC: Liu Jingqi cc: Dong Eddie cc: Dave Hansen cc: Zhang Yi cc: Dan Williams Subject: [RFC][PATCH v2 18/21] kvm-ept-idle: enable module References: <20181226131446.330864849@intel.com> MIME-Version: 1.0 Content-Disposition: inline; 
filename=0007-kvm-ept-idle-enable-module.patch Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Signed-off-by: Fengguang Wu --- arch/x86/kvm/Kconfig | 11 +++++++++++ arch/x86/kvm/Makefile | 4 ++++ 2 files changed, 15 insertions(+) --- linux.orig/arch/x86/kvm/Kconfig 2018-12-23 20:09:04.628882396 +0800 +++ linux/arch/x86/kvm/Kconfig 2018-12-23 20:09:04.628882396 +0800 @@ -96,6 +96,17 @@ config KVM_MMU_AUDIT This option adds a R/W kVM module parameter 'mmu_audit', which allows auditing of KVM MMU events at runtime. +config KVM_EPT_IDLE + tristate "KVM EPT idle page tracking" + depends on KVM_INTEL + depends on PROC_PAGE_MONITOR + ---help--- + Provides support for walking EPT to get the A bits on Intel + processors equipped with the VT extensions. + + To compile this as a module, choose M here: the module + will be called kvm-ept-idle. + # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. source drivers/vhost/Kconfig --- linux.orig/arch/x86/kvm/Makefile 2018-12-23 20:09:04.628882396 +0800 +++ linux/arch/x86/kvm/Makefile 2018-12-23 20:09:04.628882396 +0800 @@ -19,6 +19,10 @@ kvm-y += x86.o mmu.o emulate.o i8259.o kvm-intel-y += vmx.o pmu_intel.o kvm-amd-y += svm.o pmu_amd.o +kvm-ept-idle-y += ept_idle.o + obj-$(CONFIG_KVM) += kvm.o obj-$(CONFIG_KVM_INTEL) += kvm-intel.o obj-$(CONFIG_KVM_AMD) += kvm-amd.o + +obj-$(CONFIG_KVM_EPT_IDLE) += kvm-ept-idle.o From patchwork Wed Dec 26 13:15:05 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Fengguang Wu X-Patchwork-Id: 10743109 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 464F391E for ; Wed, 26 Dec 2018 13:37:53 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2D07A28495 for ; Wed, 26 Dec 2018 13:37:53 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 2055028938; Wed, 26 Dec 2018 13:37:53 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B4FD928900 for ; Wed, 26 Dec 2018 13:37:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727115AbeLZNhu (ORCPT ); Wed, 26 Dec 2018 08:37:50 -0500 Received: from mga06.intel.com ([134.134.136.31]:21292 "EHLO mga06.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727003AbeLZNhI (ORCPT ); Wed, 26 Dec 2018 08:37:08 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 26 Dec 2018 05:37:06 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,400,1539673200"; d="scan'208";a="113358949" Received: from wangdan1-mobl1.ccr.corp.intel.com (HELO wfg-t570.sh.intel.com) ([10.254.210.154]) by orsmga003.jf.intel.com with ESMTP; 26 Dec 2018 05:37:02 -0800 Received: from wfg by wfg-t570.sh.intel.com with local (Exim 4.89) (envelope-from ) id 1gc9Mr-0005PN-M1; Wed, 26 Dec 
2018 21:37:01 +0800
Message-Id: <20181226133352.189896494@intel.com>
User-Agent: quilt/0.65
Date: Wed, 26 Dec 2018 21:15:05 +0800
From: Fengguang Wu
To: Andrew Morton
Cc: Linux Memory Management List, Liu Jingqi, kvm@vger.kernel.org, LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams
Subject: [RFC][PATCH v2 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag
References: <20181226131446.330864849@intel.com>
Content-Disposition: inline; filename=0010-migrate-check-if-the-page-is-software-young-when-mov.patch

From: Liu Jingqi

Introduce an MPOL_MF_SW_YOUNG flag for move_pages(). When set, pages that
are already in DRAM will be marked PG_referenced.

Background: the user space migration daemon will frequently scan page
tables and read-clear accessed bits to detect hot/cold pages, then migrate
hot pages from PMEM to DRAM nodes. While doing so, it also tells the
kernel which pages form the hot set. This maintains a consistent view of
hot/cold pages between the kernel and the user space daemon.

The concrete steps are (a user-space sketch follows this patch):

1) scan the page tables multiple times, counting accessed bits
2) pages with the highest accessed counts => hot pages
3) call move_pages(hot pages, DRAM nodes, MPOL_MF_SW_YOUNG)

Step (1) regularly clears PTE young, which deprives the kernel of PTE
young information. Step (2) lets the user space daemon define which
anonymous pages are hot and which are cold. Step (3) conveys the user
space view of the hot/cold sets to the kernel through PG_referenced.

In the long run, most hot pages will already be in DRAM.
move_pages(MPOL_MF_SW_YOUNG) sets PG_referenced on those already-in-DRAM
hot pages, but not on newly migrated hot pages: these are expected to be
put at the end of the LRU, so they have long enough time on the LRU to
gather accessed/PG_referenced bits and prove to the kernel that they are
really hot.

The daemon may select only DRAM/2 pages as hot, for two reasons:
- avoid thrashing, e.g. warm pages getting promoted and then demoted soon
- keep enough DRAM LRU pages looking "cold" to the kernel, so that vmscan
  does not get stuck busily scanning the LRU lists

Signed-off-by: Liu Jingqi
Signed-off-by: Fengguang Wu
---
 mm/migrate.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--- linux.orig/mm/migrate.c	2018-12-23 20:37:12.604621319 +0800
+++ linux/mm/migrate.c	2018-12-23 20:37:12.604621319 +0800
@@ -55,6 +55,8 @@
 
 #include "internal.h"
 
+#define MPOL_MF_SW_YOUNG (1<<7)
+
 /*
  * migrate_prep() needs to be called before we start compiling a list of pages
  * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
@@ -1484,12 +1486,13 @@ static int do_move_pages_to_node(struct
  * the target node
  */
 static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
-		int node, struct list_head *pagelist, bool migrate_all)
+		int node, struct list_head *pagelist, int flags)
 {
 	struct vm_area_struct *vma;
 	struct page *page;
 	unsigned int follflags;
 	int err;
+	bool migrate_all = flags & MPOL_MF_MOVE_ALL;
 
 	down_read(&mm->mmap_sem);
 	err = -EFAULT;
@@ -1519,6 +1522,8 @@ static int add_page_for_migration(struct
 
 	if (PageHuge(page)) {
 		if (PageHead(page)) {
+			if (flags & MPOL_MF_SW_YOUNG)
+				SetPageReferenced(page);
 			isolate_huge_page(page, pagelist);
 			err = 0;
 		}
@@ -1531,6 +1536,8 @@ static int add_page_for_migration(struct
 			goto out_putpage;
 
 		err = 0;
+		if (flags & MPOL_MF_SW_YOUNG)
+			SetPageReferenced(head);
 		list_add_tail(&head->lru, pagelist);
 		mod_node_page_state(page_pgdat(head),
 			NR_ISOLATED_ANON + page_is_file_cache(head),
@@ -1606,7 +1613,7 @@ static int do_pages_move(struct mm_struc
 		 * report them via status
 		 */
 		err = add_page_for_migration(mm, addr, current_node,
-				&pagelist, flags & MPOL_MF_MOVE_ALL);
+				&pagelist, flags);
 		if (!err)
 			continue;
 
@@ -1725,7 +1732,7 @@ static int kernel_move_pages(pid_t pid,
 	nodemask_t task_nodes;
 
 	/* Check flags */
-	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
+	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL|MPOL_MF_SW_YOUNG))
 		return -EINVAL;
 
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
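As referenced in the steps above, the daemon side might use the new flag
as in the sketch below. This assumes a kernel with this patch applied;
MPOL_MF_SW_YOUNG mirrors the value added to mm/migrate.c and is not in
any released uapi/libnuma header. Link with -lnuma.

#include <stdio.h>
#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE */

#ifndef MPOL_MF_SW_YOUNG
#define MPOL_MF_SW_YOUNG (1 << 7)	/* from this patch */
#endif

static long promote_hot_set(int pid, unsigned long count,
			    void **hot_pages, int *dram_nodes, int *status)
{
	/* Pages already resident on a DRAM node are only marked
	 * PG_referenced; the rest migrate to the requested nodes. */
	long rc = move_pages(pid, count, hot_pages, dram_nodes, status,
			     MPOL_MF_MOVE | MPOL_MF_SW_YOUNG);
	if (rc < 0)
		perror("move_pages");
	return rc;
}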
From patchwork Wed Dec 26 13:15:06 2018
Message-Id: <20181226133352.246320288@intel.com>
User-Agent: quilt/0.65
Date: Wed, 26 Dec 2018 21:15:06 +0800
From: Fengguang Wu
To: Andrew Morton
Cc: Linux Memory Management List, Fan Du, Jingqi Liu, kvm@vger.kernel.org, LKML, Yao Yuan, Peng Dong, Huang Ying, Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams
Subject: [RFC][PATCH v2 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node
References: <20181226131446.330864849@intel.com>
Content-Disposition: inline; filename=0012-vmscan-migrate-anonymous-pages-to-pmem-node-before-s.patch

From: Jingqi Liu

With PMEM nodes, the demotion path could be

1) DRAM pages: migrate to a PMEM node
2) PMEM pages: swap out

This patch does (1), and for anonymous pages only, since we cannot detect
the hotness of (unmapped) page cache pages for now.

The user space daemon can migrate in both directions:
- PMEM=>DRAM hot page migration
- DRAM=>PMEM cold page migration
However it's more natural for user space to do hot page migration and the
kernel to do cold page migration. In particular, only the kernel can
guarantee on-demand migration under memory pressure. So the big picture
looks like this: the user space daemon does regular hot page migration to
DRAM, creating memory pressure on the DRAM nodes, which in turn triggers
kernel cold page migration to the PMEM nodes.

Du Fan:
- Support multiple NUMA nodes.
- Don't migrate clean MADV_FREE pages to the PMEM node. After the
  madvise(MADV_FREE) syscall, the VMA and its page table entries stay
  alive, but the pages become MADV_FREE: anonymous yet WITHOUT
  PageSwapBacked. On reclaim, clean MADV_FREE pages are simply freed back
  to the buddy system, while dirty ones turn back into canonical anonymous
  pages with PageSwapBacked(page) set and are put on the LRU_INACTIVE_FILE
  list, falling into the standard aging routine. The point is that clean
  MADV_FREE pages should not be migrated: their user data became stale
  (useless) the moment madvise(MADV_FREE) was called, and this change
  guards against that scenario.

  P.S. MADV_FREE is heavily used by the jemalloc engine and workloads like
  redis; refer to [1] for detailed background, use case, and benchmark
  results.

  [1] https://lore.kernel.org/patchwork/patch/622179/

Fengguang:
- detect migration of THP and hugetlb pages
- avoid moving pages to a non-existent node

Signed-off-by: Fan Du
Signed-off-by: Jingqi Liu
Signed-off-by: Fengguang Wu
---
 mm/vmscan.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

--- linux.orig/mm/vmscan.c	2018-12-23 20:37:58.305551976 +0800
+++ linux/mm/vmscan.c	2018-12-23 20:37:58.305551976 +0800
@@ -1112,6 +1112,7 @@ static unsigned long shrink_page_list(st
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(move_pages);
 	int pgactivate = 0;
 	unsigned nr_unqueued_dirty = 0;
 	unsigned nr_dirty = 0;
@@ -1121,6 +1122,7 @@ static unsigned long shrink_page_list(st
 	unsigned nr_immediate = 0;
 	unsigned nr_ref_keep = 0;
 	unsigned nr_unmap_fail = 0;
+	int page_on_dram = is_node_dram(pgdat->node_id);
 
 	cond_resched();
 
@@ -1275,6 +1277,21 @@ static unsigned long shrink_page_list(st
 		}
 
 		/*
		 * Check if the page is in a DRAM numa node.
		 * Skip MADV_FREE pages, as they might be freed
		 * immediately to the buddy system if clean.
		 */
+		if (node_online(pgdat->peer_node) &&
+		    PageAnon(page) && (PageSwapBacked(page) || PageTransHuge(page))) {
+			if (page_on_dram) {
+				/* Add to the page list which will be moved to pmem numa node.
--- linux.orig/mm/vmscan.c	2018-12-23 20:37:58.305551976 +0800
+++ linux/mm/vmscan.c	2018-12-23 20:37:58.305551976 +0800
@@ -1112,6 +1112,7 @@ static unsigned long shrink_page_list(st
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(move_pages);
 	int pgactivate = 0;
 	unsigned nr_unqueued_dirty = 0;
 	unsigned nr_dirty = 0;
@@ -1121,6 +1122,7 @@ static unsigned long shrink_page_list(st
 	unsigned nr_immediate = 0;
 	unsigned nr_ref_keep = 0;
 	unsigned nr_unmap_fail = 0;
+	int page_on_dram = is_node_dram(pgdat->node_id);

 	cond_resched();
@@ -1275,6 +1277,21 @@ static unsigned long shrink_page_list(st
 		}

 		/*
+		 * Check if the page is in a DRAM numa node.
+		 * Skip MADV_FREE pages as they might be freed
+		 * immediately to the buddy system if clean.
+		 */
+		if (node_online(pgdat->peer_node) &&
+		    PageAnon(page) && (PageSwapBacked(page) || PageTransHuge(page))) {
+			if (page_on_dram) {
+				/* Queue for migration to the pmem numa node. */
+				list_add(&page->lru, &move_pages);
+				unlock_page(page);
+				continue;
+			}
+		}
+
+		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
@@ -1496,6 +1513,22 @@ keep:
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}

+	/* Move the anonymous pages to the PMEM numa node. */
+	if (!list_empty(&move_pages)) {
+		int err;
+
+		/* Must not block here. */
+		err = migrate_pages(&move_pages, alloc_new_node_page, NULL,
+				    pgdat->peer_node,
+				    MIGRATE_ASYNC, MR_NUMA_MISPLACED);
+		if (err) {
+			putback_movable_pages(&move_pages);
+
+			/* Join the pages which were not migrated. */
+			list_splice(&ret_pages, &move_pages);
+		}
+	}
+
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_unref_page_list(&free_pages);
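The hunks above rely on is_node_dram(), is_node_pmem() and
pgdat->peer_node, which are introduced by earlier patches in this
series and are not quoted in this excerpt. The standalone sketch below
shows one plausible shape for those helpers; the node_types table, its
size, and the field comment are illustrative assumptions, not the
series' actual code:

/*
 * Sketch of the helpers the hunks depend on. Everything here is an
 * illustrative assumption; the real definitions live in earlier
 * patches of this series.
 */
#include <stdbool.h>

enum node_type {
	NODE_TYPE_DRAM,
	NODE_TYPE_PMEM,
};

/* Recorded per node at boot, while parsing the firmware memory map. */
static enum node_type node_types[64];

static inline bool is_node_dram(int nid)
{
	return node_types[nid] == NODE_TYPE_DRAM;
}

static inline bool is_node_pmem(int nid)
{
	return node_types[nid] == NODE_TYPE_PMEM;
}

/*
 * pg_data_t additionally gains a peer_node field: for a DRAM node, the
 * PMEM node to demote to; for a PMEM node, the DRAM node to promote to.
 */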
From patchwork Wed Dec 26 13:15:07 2018
Message-Id: <20181226133352.303666865@intel.com>
Date: Wed, 26 Dec 2018 21:15:07 +0800
From: Fengguang Wu
To: Andrew Morton
Cc: Linux Memory Management List, Fengguang Wu, kvm@vger.kernel.org,
    LKML, Fan Du, Yao Yuan, Peng Dong, Huang Ying, Liu Jingqi,
    Dong Eddie, Dave Hansen, Zhang Yi, Dan Williams
Subject: [RFC][PATCH v2 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM
References: <20181226131446.330864849@intel.com>

Fix OOM by making the in-kernel DRAM=>PMEM migration path reachable.

Here we assume these two possible demotion paths:

- DRAM: migrate to PMEM
- PMEM: swap out

Signed-off-by: Fengguang Wu
---
 mm/vmscan.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- linux.orig/mm/vmscan.c	2018-12-23 20:38:44.310446223 +0800
+++ linux/mm/vmscan.c	2018-12-23 20:38:44.306446146 +0800
@@ -2259,7 +2259,7 @@ static bool inactive_list_is_low(struct
 	 * If we don't have swap space, anonymous page deactivation
 	 * is pointless.
 	 */
-	if (!file && !total_swap_pages)
+	if (!file && (is_node_pmem(pgdat->node_id) && !total_swap_pages))
 		return false;

 	inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
@@ -2340,7 +2340,8 @@ static void get_scan_count(struct lruvec
 	enum lru_list lru;

 	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+	if (is_node_pmem(pgdat->node_id) &&
+	    (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
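Condensing the two hunks: on a DRAM node, anonymous pages are now
always worth scanning, because demotion to the peer PMEM node replaces
swap as the exit path; PMEM nodes keep the old "no swap space, skip
anon" optimization. A standalone restatement of that predicate
(illustrative only; the function and parameter names are assumptions,
not the series' code):

/*
 * Restatement of the patch's decision as one predicate: should
 * reclaim bother with anonymous pages on this node at all?
 */
#include <stdbool.h>

static bool anon_scan_worthwhile(bool node_is_pmem, long nr_swap_pages)
{
	if (!node_is_pmem)
		return true;		/* DRAM: demote to the peer PMEM node */
	return nr_swap_pages > 0;	/* PMEM: swap is the only way out */
}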