From patchwork Mon Jan 9 07:22:29 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093093
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page
Date: Mon, 9 Jan 2023 15:22:29 +0800
Message-Id: <20230109072232.2398464-2-fengwei.yin@intel.com>
In-Reply-To: <20230109072232.2398464-1-fengwei.yin@intel.com>
References: <20230109072232.2398464-1-fengwei.yin@intel.com>

Huge pages in the current kernel can bring obvious performance improvements
for some workloads, through fewer TLB misses and fewer page faults. But the
limited choice of huge page sizes (2M/1G for x86_64) also brings extra costs
such as larger memory consumption and more CPU cycles spent zeroing pages.

The idea of the multiple consecutive page (abbreviated "mcpage") is to use a
collection of physically contiguous 4K pages, rather than a huge page, for
anonymous mappings. The goal is to have more choices when trading off the
pros and cons of huge pages: compared to a huge page, an mcpage gains less
from reduced TLB misses and page faults, but it also pays less in extra
memory consumption and in the larger latency introduced by page compaction,
page zeroing, etc.

The size of an mcpage is configurable. The default of 16K was picked
arbitrarily; users should choose a value by tuning their workload with
different mcpage sizes.
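For concreteness, here is a minimal sketch of the arithmetic behind the 16K
default (plain userspace C; PAGE_SHIFT of 12 and order 2 are assumptions, and
the MCPAGE_* names mirror the macros this patch adds below):

  #include <assert.h>
  #include <stdio.h>

  #define PAGE_SHIFT   12                          /* assumed: 4K base pages */
  #define MCPAGE_ORDER 2                           /* assumed: the default order */
  #define MCPAGE_SHIFT (MCPAGE_ORDER + PAGE_SHIFT) /* 14 */
  #define MCPAGE_SIZE  (1UL << MCPAGE_SHIFT)       /* 16384 bytes = 16K */
  #define MCPAGE_MASK  (~(MCPAGE_SIZE - 1))        /* clears the low 14 bits */
  #define MCPAGE_NR    (1UL << MCPAGE_ORDER)       /* 4 sub-pages per mcpage */

  int main(void)
  {
          unsigned long addr = 0x7f1234567a10UL;   /* an arbitrary fault address */

          /* an mcpage is MCPAGE_NR normal pages back to back */
          assert(MCPAGE_SIZE == MCPAGE_NR * (1UL << PAGE_SHIFT));
          /* the mcpage-aligned base that a fault at addr belongs to */
          printf("mcpage base: %#lx\n", addr & MCPAGE_MASK);
          return 0;
  }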
To get physically contiguous pages, a high-order page is allocated (the
order is calculated from the mcpage size) and then split. After the split,
each sub-page of the mcpage is just a normal 4K page, so the current kernel
page management infrastructure applies to mcpages without any change.

To reduce the number of page faults, multiple page table entries are
populated in one page fault with the PFNs of the mcpage's sub-pages. This
also brings a small extra cost in memory consumption.

Update Kconfig to let the user define the mcpage order, and define macros
for the mcpage mask/shift/nr/size. In this RFC patch, only Kconfig is used
to set the mcpage order, to show the idea; a runtime parameter will be
chosen if this is made an official patch in the future.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/mm_types.h | 11 +++++++++++
 mm/Kconfig               | 19 +++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..fa561c7b6290 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -71,6 +71,17 @@ struct mem_cgroup;
 #define _struct_page_alignment	__aligned(sizeof(unsigned long))
 #endif
 
+#ifdef CONFIG_MCPAGE_ORDER
+#define MCPAGE_ORDER	CONFIG_MCPAGE_ORDER
+#else
+#define MCPAGE_ORDER	0
+#endif
+
+#define MCPAGE_SIZE	(1 << (MCPAGE_ORDER + PAGE_SHIFT))
+#define MCPAGE_MASK	(~(MCPAGE_SIZE - 1))
+#define MCPAGE_SHIFT	(MCPAGE_ORDER + PAGE_SHIFT)
+#define MCPAGE_NR	(1 << (MCPAGE_ORDER))
+
 struct page {
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..c202dc99ab6d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -650,6 +650,25 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 	  Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
 	  clamped down to MAX_ORDER - 1.
 
+config MCPAGE
+	bool "multiple consecutive page"
+	default n
+	help
+	  Enable multiple consecutive pages: an mcpage is a collection of
+	  sub-pages which are physically contiguous. When mapping to user
+	  space, all the sub-pages are mapped in one page fault handler.
+	  This trades off the pros and cons of huge pages: less unnecessary
+	  extra memory zeroing and less memory consumption, but without
+	  the TLB benefit.
+
+config MCPAGE_ORDER
+	int "multiple consecutive page order"
+	default 2
+	depends on X86_64 && MCPAGE
+	help
+	  The order of an mcpage. Should be chosen carefully by tuning your
+	  workload.
+
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
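With the new symbols in place, a 16K mcpage build would be selected by a
.config fragment like the following (a hypothetical example; order 2 is the
default added above):

  CONFIG_MCPAGE=y
  CONFIG_MCPAGE_ORDER=2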
From patchwork Mon Jan 9 07:22:30 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093095
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 2/4] mcpage: anon page: Use mcpage for anonymous mapping
Date: Mon, 9 Jan 2023 15:22:30 +0800
Message-Id: <20230109072232.2398464-3-fengwei.yin@intel.com>
In-Reply-To: <20230109072232.2398464-1-fengwei.yin@intel.com>
References: <20230109072232.2398464-1-fengwei.yin@intel.com>

If an mcpage fits in the range of the VMA, try to allocate an mcpage and set
it up for the anonymous mapping, populating as many of the surrounding page
table entries as possible. The benefit is a reduced page fault count.

Split the mcpage so that each sub-page can be managed as a normal 4K page.
The split is done before setting up the page table entries, to avoid
complicated page lock, mapcount and refcount handling.

The change is expected to directly impact memory consumption, the page fault
count, and zone lock and LRU lock contention. The memory consumption and
system performance impact are evaluated as follows.
Some system performance data were collected with a 16K mcpage size:

===============================================================================
                                            v6.1-rc4-no-thp  v6.1-rc4-thp  mcpage
will-it-scale/malloc1     (higher is better)     100%             2%         17%
will-it-scale/page_fault1 (higher is better)     100%           238%        115%
redis.set_avg_throughput  (higher is better)     100%            99%        102%
redis.get_avg_throughput  (higher is better)     100%            99%        100%
kernel build              (lower is better)      100%            98%         97%

  * v6.1-rc4-no-thp: 6.1-rc4 with THP disabled in Kconfig
  * v6.1-rc4-thp:    6.1-rc4 with THP enabled as "always" in Kconfig
  * mcpage:          6.1-rc4 + 16KB mcpage

The test results are normalized to the "v6.1-rc4-no-thp" config.

perf data comparing v6.1-rc4-no-thp and mcpage were also collected.

For the kernel build, perf showed a 56% minor page fault drop and a 1.3%
clear_page increase:

  v6.1-rc4-no-thp          mcpage
        5.939e+08  -56.0%  2.61e+08  kbuild.time.minor_page_faults
             0.00    +2.2      2.20  perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
             0.72    -0.7      0.00  perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.vma_alloc_folio.do_anonymous_page

For redis, perf showed a 74.6% minor page fault drop and a 0.11% zone lock
drop:

  v6.1-rc4-no-thp          mcpage
           401414  -74.6%  102134    redis.time.minor_page_faults
             0.00    +0.1      0.11  perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
             0.22    -0.2      0.00  perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.vma_alloc_folio

For will-it-scale/page_fault1, perf showed a 12.8% minor page fault drop, a
15.97% zone lock drop and a 27% lru lock increase:

  v6.1-rc4-no-thp          mcpage
             7239  -12.8%  6312      will-it-scale.time.minor_page_faults
            52.15   -34.4     17.75  perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
             3.29   +27.0     30.29  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush
             4.14    -4.1      0.00  perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.vma_alloc_folio.do_anonymous_page
             0.00   +13.2     13.20  perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
             0.00   +18.4     18.43  perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.alloc_mcpages

For will-it-scale/malloc1, the test result is surprising: the regression is
much bigger than expected. perf showed a 12.3% minor page fault drop and a
43.6% zone lock increase:

  v6.1-rc4-no-thp          mcpage
          2978027  -82.2%  530847    will-it-scale.128.processes
             7249  -12.3%  6360      will-it-scale.time.minor_page_faults
             0.00   +43.6     43.62  perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.pte_alloc_one.__pte_alloc
             0.00   +45.4     45.39  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush

It turned out the mcpage allocation/free pattern hit a corner case (high
zone lock contention was triggered, which impacted pte_alloc) that the
current pcp list bulk free can't handle very well. The pcp list bulk free
issue will be addressed separately. After fixing the pcp list bulk free
corner case, the will-it-scale/malloc1 result is restored to 56% of
v6.1-rc4-no-thp.
===============================================================================
For the tail latency of page allocation, the following test setup is used:
- alloc_page() with order 0, 2 and 9 is called 2097152, 2097152 and 32768
  times, respectively, in the kernel
- with both non-fragmented and fully fragmented memory
- with and without the __GFP_ZERO flag, to separate pure compaction latency
  from the user-visible latency

And the result is as follows:

no page zeroing:

  4K page:
                       non-fragmented         fragmented
    Number of test:    2097152                2097152
    max latency:       26us                   27us
    90% tail latency:  1us (1887436th)        1us (1887436th)
    95% tail latency:  1us (1992294th)        1us (1992294th)
    99% tail latency:  2us (2076180th)        3us (2076180th)

  16K mcpage:
                       non-fragmented         fragmented
    Number of test:    2097152                2097152
    max latency:       26us                   9862us
    90% tail latency:  1us (1887436th)        1us (1887436th)
    95% tail latency:  1us (1992294th)        1us (1992294th)
    99% tail latency:  1us (2076180th)        3us (2076180th)

  2M THP:
                       non-fragmented         fragmented
    Number of test:    32768                  32768
    max latency:       40us                   12149us
    90% tail latency:  8us (29491th)          864us (29491th)
    95% tail latency:  10us (31129th)         943us (31129th)
    99% tail latency:  13us (32440th)         1067us (32440th)

page zeroing:

  4K page:
                       non-fragmented         fragmented
    Number of test:    2097152                2097152
    max latency:       18us                   46us
    90% tail latency:  1us (1887436th)        1us (1887436th)
    95% tail latency:  1us (1992294th)        1us (1992294th)
    99% tail latency:  2us (2076180th)        4us (2076180th)

  16K mcpage:
                       non-fragmented         fragmented
    Number of test:    2097152                2097152
    max latency:       31us                   5740us
    90% tail latency:  3us (1887436th)        3us (1887436th)
    95% tail latency:  3us (1992294th)        4us (1992294th)
    99% tail latency:  4us (2076180th)        5us (2076180th)

  2M THP:
                       non-fragmented         fragmented
    Number of test:    32768                  32768
    max latency:       530us                  10494us
    90% tail latency:  366us (29491th)        1114us (29491th)
    95% tail latency:  373us (31129th)        1263us (31129th)
    99% tail latency:  391us (32440th)        1808us (32440th)

With 16K mcpage, the tail latency of page allocation is good, while 2M THP
has a much worse result in the fragmented memory case.

===============================================================================
For the performance of NUMA interleaving on base pages, mcpages and THP, the
memory latency tool from https://github.com/torvalds/test-tlb is used.

On a Cascade Lake box with 96 cores + 258G memory across two NUMA nodes:

  node distances:
  node   0   1
    0:  10  20
    1:  20  10

With the memory policy set to MPOL_INTERLEAVE and a 1G memory mapping
accessed with a 128 byte (2x cache line) stride, the memory access latency
(lower is better) is:

  random access with 4K page:        142.32 ns
  random access with 16K mcpage:     141.21 ns (+0.8%)
  random access with 2M THP:         116.56 ns (+18.2%)

  sequential access with 4K page:    21.28 ns
  sequential access with 16K mcpage: 20.52 ns (+3.6%)
  sequential access with 2M THP:     20.36 ns (+4.3%)

mcpage brings a minor memory access latency improvement over the 4K page,
though smaller than the improvement 2M THP brings.
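The interleaving setup above is external to this series; a minimal sketch of
how such a measurement buffer might be set up (hypothetical userspace C,
assuming two nodes and libnuma's mbind() wrapper) is:

  /* 1G anonymous mapping interleaved across nodes 0 and 1, then touched
   * with a 128 byte stride as in the measurement above. */
  #include <numaif.h>             /* mbind(); link with -lnuma */
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1UL << 30;                 /* 1G */
          unsigned long nodemask = 0x3;           /* nodes 0 and 1 */
          volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED ||
              mbind((void *)p, len, MPOL_INTERLEAVE, &nodemask,
                    sizeof(nodemask) * 8, 0))
                  return 1;

          for (size_t off = 0; off < len; off += 128)  /* 2x cache line stride */
                  p[off];                              /* forced read access */
          return 0;
  }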
===============================================================================
The memory consumption was checked by using firefox to visit "www.lwn.net"
and collecting firefox's RSS, with a 16K mcpage size:

  6.1-rc7:               RSS of firefox is 285300 KB
  6.1-rc7 + 16K mcpage:  RSS of firefox is 295536 KB

That is 3.59% more memory consumption with 16K mcpage.

===============================================================================
In this RFC patch, a non-batched update of the page table entries is used to
show the idea. Batched mode will be chosen if this is made an official patch
in the future.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/gfp.h       |   5 ++
 include/linux/mcpage_mm.h |  35 ++++++++++
 mm/Makefile               |   1 +
 mm/mcpage_memory.c        | 134 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c               |  11 ++++
 mm/mempolicy.c            |  51 +++++++++++++++
 6 files changed, 237 insertions(+)
 create mode 100644 include/linux/mcpage_mm.h
 create mode 100644 mm/mcpage_memory.c

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65a78773dcca..035c5fadd9d4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -265,6 +265,8 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order);
 struct folio *folio_alloc(gfp_t gfp, unsigned order);
 struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr, bool hugepage);
+struct page *alloc_mcpages(gfp_t gfp, int order, struct vm_area_struct *vma,
+		unsigned long addr);
 #else
 static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
@@ -276,7 +278,10 @@ static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
 }
 #define vma_alloc_folio(gfp, order, vma, addr, hugepage)		\
 	folio_alloc(gfp, order)
+#define alloc_mcpages(gfp, order, vma, addr)		\
+	alloc_pages(gfp, order)
 #endif
+
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 static inline struct page *alloc_page_vma(gfp_t gfp,
 		struct vm_area_struct *vma, unsigned long addr)
diff --git a/include/linux/mcpage_mm.h b/include/linux/mcpage_mm.h
new file mode 100644
index 000000000000..4b2fb7319233
--- /dev/null
+++ b/include/linux/mcpage_mm.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MCPAGE_MM_H
+#define _LINUX_MCPAGE_MM_H
+
+#include <linux/mm.h>
+
+#ifdef CONFIG_MCPAGE_ORDER
+
+static inline bool allow_mcpage(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int order)
+{
+	unsigned int mcpage_size = 1 << (order + PAGE_SHIFT);
+	unsigned long haddr = ALIGN_DOWN(addr, mcpage_size);
+
+	return range_in_vma(vma, haddr, haddr + mcpage_size);
+}
+
+extern vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf,
+		unsigned int order);
+
+#else
+static inline bool allow_mcpage(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int order)
+{
+	return false;
+}
+
+static inline vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf,
+		unsigned int order)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif /* CONFIG_MCPAGE_ORDER */
+
+#endif /* _LINUX_MCPAGE_MM_H */
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..efeaa8358953 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_MCPAGE) += mcpage_memory.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
diff --git a/mm/mcpage_memory.c b/mm/mcpage_memory.c
new file mode 100644
index 000000000000..ea4be2e25bce
--- /dev/null
+++ b/mm/mcpage_memory.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2022 Intel Corporation. All rights reserved.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
+static inline struct page *
+alloc_zeroed_mcpages(int order, struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct page *page = alloc_mcpages(GFP_HIGHUSER_MOVABLE, order,
+			vma, addr);
+
+	if (page) {
+		int i;
+		struct page *it = page;
+
+		for (i = 0; i < (1 << order); i++, it++) {
+			clear_user_highpage(it, addr);
+			cond_resched();
+		}
+	}
+
+	return page;
+}
+#else
+static inline struct page *
+alloc_zeroed_mcpages(int order, struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	return alloc_mcpages(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
+			order, vma, addr);
+}
+#endif
+
+static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
+		struct page *page, unsigned long addr)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	vm_fault_t ret = 0;
+	pte_t entry;
+
+	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) {
+		ret = VM_FAULT_OOM;
+		goto oom;
+	}
+
+	cgroup_throttle_swaprate(page, GFP_KERNEL);
+	__SetPageUptodate(page);
+
+	entry = mk_pte(page, vma->vm_page_prot);
+	entry = pte_sw_mkyoung(entry);
+	if (vma->vm_flags & VM_WRITE)
+		entry = pte_mkwrite(pte_mkdirty(entry));
+
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+	if (!pte_none(*vmf->pte)) {
+		ret = VM_FAULT_FALLBACK;
+		update_mmu_cache(vma, addr, vmf->pte);
+		goto release;
+	}
+
+	ret = check_stable_address_space(vma->vm_mm);
+	if (ret) {
+		ret = VM_FAULT_FALLBACK;
+		goto release;
+	}
+
+	if (userfaultfd_missing(vma)) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return handle_userfault(vmf, VM_UFFD_MISSING);
+	}
+
+	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, vma, addr);
+	lru_cache_add_inactive_or_unevictable(page, vma);
+	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+	update_mmu_cache(vma, addr, vmf->pte);
+release:
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+oom:
+	return ret;
+}
+
+vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf, unsigned int order)
+{
+	int i, nr = 1 << order;
+	unsigned int mcpage_size = nr * PAGE_SIZE;
+	vm_fault_t ret = 0, real_ret = 0;
+	bool handled = false;
+	struct page *page;
+	unsigned long haddr = ALIGN_DOWN(vmf->address, mcpage_size);
+
+	page = alloc_zeroed_mcpages(order, vmf->vma, haddr);
+	if (!page)
+		return VM_FAULT_FALLBACK;
+
+	split_page(page, order);
+	for (i = 0; i < nr; i++, haddr += PAGE_SIZE) {
+		ret = do_anonymous_mcpage(vmf, &page[i], haddr);
+		if (haddr == PAGE_ALIGN_DOWN(vmf->address)) {
+			real_ret = ret;
+			handled = true;
+		}
+		if (ret)
+			break;
+	}
+
+	while (i < nr)
+		put_page(&page[i++]);
+
+	/*
+	 * If the fault address was not handled, fall back to handling
+	 * the fault address with a normal page.
+	 */
+	if (!handled)
+		return VM_FAULT_FALLBACK;
+	else
+		return real_ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..fb7f370f6c67 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -77,6 +77,7 @@
 #include
 #include
 #include
+#include <linux/mcpage_mm.h>
 
 #include
 
@@ -4071,6 +4072,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page.
 	 */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
+
+	if (allow_mcpage(vma, vmf->address, MCPAGE_ORDER)) {
+		ret = do_anonymous_mcpages(vmf, MCPAGE_ORDER);
+
+		if (!(ret & VM_FAULT_FALLBACK))
+			return ret;
+
+		ret = 0;
+	}
+
 	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
 	if (!page)
 		goto oom;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..87ecbdb74fbe 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2251,6 +2251,57 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(vma_alloc_folio);
 
+/**
+ * alloc_mcpages - Allocate a mcpage for a VMA.
+ * @gfp: GFP flags.
+ * @order: Order of the mcpage.
+ * @vma: Pointer to VMA or NULL if not available.
+ * @addr: Virtual address of the allocation. Must be inside @vma.
+ *
+ * Allocate a mcpage for a specific address in @vma, using the
+ * appropriate NUMA policy. When @vma is not NULL the caller must hold the
+ * mmap_lock of the mm_struct of the VMA to prevent it from going away.
+ * Should be used for all allocations for pages that will be mapped into
+ * user space.
+ *
+ * Return: The page on success or NULL if allocation fails.
+ */
+struct page *alloc_mcpages(gfp_t gfp, int order, struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct mempolicy *pol;
+	int node = numa_node_id();
+	struct page *page;
+	int preferred_nid;
+	nodemask_t *nmask;
+
+	pol = get_vma_policy(vma, addr);
+
+	if (pol->mode == MPOL_INTERLEAVE) {
+		unsigned int nid;
+
+		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
+		mpol_cond_put(pol);
+		page = alloc_page_interleave(gfp, order, nid);
+		goto out;
+	}
+
+	if (pol->mode == MPOL_PREFERRED_MANY) {
+		node = policy_node(gfp, pol, node);
+		page = alloc_pages_preferred_many(gfp, order, node, pol);
+		mpol_cond_put(pol);
+		goto out;
+	}
+
+	nmask = policy_nodemask(gfp, pol);
+	preferred_nid = policy_node(gfp, pol, node);
+	page = __alloc_pages(gfp, order, preferred_nid, nmask);
+	mpol_cond_put(pol);
+out:
+	return page;
+}
+EXPORT_SYMBOL(alloc_mcpages);
+
 /**
  * alloc_pages - Allocate pages.
  * @gfp: GFP flags.
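Not part of the series, but the populate-around behavior is easy to observe
from user space. A minimal sketch (hypothetical test program, assuming a
kernel built with 16K mcpages) that counts minor faults while touching a 16K
region page by page:

  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/resource.h>

  static long minor_faults(void)
  {
          struct rusage ru;

          getrusage(RUSAGE_SELF, &ru);
          return ru.ru_minflt;
  }

  int main(void)
  {
          size_t len = 16 * 1024;         /* one 16K mcpage, assuming order 2 */
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          long before, after;
          size_t off;

          if (p == MAP_FAILED)
                  return 1;

          before = minor_faults();
          for (off = 0; off < len; off += 4096)
                  p[off] = 1;             /* touch each 4K sub-page */
          after = minor_faults();

          printf("minor faults for 16K: %ld\n", after - before);
          return 0;
  }

With mcpage populating the surrounding entries, the printed delta should stay
near 1 instead of 4 (assuming the region lands fully inside the VMA so that
allow_mcpage() accepts it).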
From patchwork Mon Jan 9 07:22:31 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093094
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 3/4] mcpage: add vmstat counters for mcpages
Date: Mon, 9 Jan 2023 15:22:31 +0800
Message-Id: <20230109072232.2398464-4-fengwei.yin@intel.com>
In-Reply-To: <20230109072232.2398464-1-fengwei.yin@intel.com>
References: <20230109072232.2398464-1-fengwei.yin@intel.com>

Add vmstat counters for mcpage anonymous faults:

MCPAGE_ANON_FAULT_ALLOC:
    how many times an mcpage was used for an anonymous mapping.

MCPAGE_ANON_FAULT_FALLBACK:
    how many times we fell back to normal pages for an anonymous mapping.

MCPAGE_ANON_FAULT_CHARGE_FAILED:
    how many times we fell back because the memcg charge failed.

MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED:
    how many times we fell back because a page table entry was already
    populated.

MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE:
    how many times we fell back because of an unstable address space.
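After this patch the counters appear in /proc/vmstat; a small sketch (plain
C; the names match the vmstat_text[] strings added below, the values are of
course workload dependent) for pulling them out:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char line[128];
          FILE *f = fopen("/proc/vmstat", "r");

          if (!f)
                  return 1;
          while (fgets(line, sizeof(line), f))
                  if (!strncmp(line, "mcpage_", 7))
                          fputs(line, stdout);  /* e.g. "mcpage_anon_fault_alloc <n>" */
          fclose(f);
          return 0;
  }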
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/vm_event_item.h | 10 ++++++++++
 mm/mcpage_memory.c            |  6 ++++++
 mm/memory.c                   |  1 +
 mm/vmstat.c                   |  7 +++++++
 4 files changed, 24 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..9c36bfc4c904 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -119,6 +119,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
 #endif
+#ifdef CONFIG_MCPAGE
+		MCPAGE_ANON_FAULT_ALLOC,
+		MCPAGE_ANON_FAULT_FALLBACK,
+		MCPAGE_ANON_FAULT_CHARGE_FAILED,
+		MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED,
+		MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE,
+#endif
 #ifdef CONFIG_MEMORY_BALLOON
 		BALLOON_INFLATE,
 		BALLOON_DEFLATE,
@@ -159,5 +166,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #define THP_FILE_FALLBACK_CHARGE ({ BUILD_BUG(); 0; })
 #define THP_FILE_MAPPED ({ BUILD_BUG(); 0; })
 #endif
+#ifndef CONFIG_MCPAGE
+#define MCPAGE_ANON_FAULT_FALLBACK ({ BUILD_BUG(); 0; })
+#endif
 
 #endif		/* VM_EVENT_ITEM_H_INCLUDED */
diff --git a/mm/mcpage_memory.c b/mm/mcpage_memory.c
index ea4be2e25bce..e208cf818ebf 100644
--- a/mm/mcpage_memory.c
+++ b/mm/mcpage_memory.c
@@ -55,6 +55,7 @@ static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
 
 	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) {
 		ret = VM_FAULT_OOM;
+		count_vm_event(MCPAGE_ANON_FAULT_CHARGE_FAILED);
 		goto oom;
 	}
 
@@ -71,12 +72,14 @@ static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
 	if (!pte_none(*vmf->pte)) {
 		ret = VM_FAULT_FALLBACK;
 		update_mmu_cache(vma, addr, vmf->pte);
+		count_vm_event(MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED);
 		goto release;
 	}
 
 	ret = check_stable_address_space(vma->vm_mm);
 	if (ret) {
 		ret = VM_FAULT_FALLBACK;
+		count_vm_event(MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE);
 		goto release;
 	}
 
@@ -120,6 +123,9 @@ vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf, unsigned int order)
 		if (ret)
 			break;
 	}
 
+	if (i == nr)
+		count_vm_event(MCPAGE_ANON_FAULT_ALLOC);
+
 	while (i < nr)
 		put_page(&page[i++]);
 
diff --git a/mm/memory.c b/mm/memory.c
index fb7f370f6c67..b3655be849ae 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4079,6 +4079,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 
+		count_vm_event(MCPAGE_ANON_FAULT_FALLBACK);
 		ret = 0;
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..c40e33dee1b1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1367,6 +1367,13 @@ const char * const vmstat_text[] = {
 	"thp_swpout",
 	"thp_swpout_fallback",
 #endif
+#ifdef CONFIG_MCPAGE
+	"mcpage_anon_fault_alloc",
+	"mcpage_anon_fault_fallback",
+	"mcpage_anon_fault_charge_failed",
+	"mcpage_anon_fault_page_table_populated",
+	"mcpage_anon_fault_instable_address_space",
+#endif
 #ifdef CONFIG_MEMORY_BALLOON
 	"balloon_inflate",
 	"balloon_deflate",

From patchwork Mon Jan 9 07:22:32 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093096
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 4/4] mcpage: get_unmapped_area return mcpage size aligned addr
Date: Mon, 9 Jan 2023 15:22:32 +0800
Message-Id: <20230109072232.2398464-5-fengwei.yin@intel.com>
In-Reply-To: <20230109072232.2398464-1-fengwei.yin@intel.com>
References: <20230109072232.2398464-1-fengwei.yin@intel.com>

For x86_64, let mmap() start from an mcpage size aligned address.

Using firefox with one tab to visit the entry page of "www.lwn.net" as the
workload, with the mcpage order set to 2, the number of times an mcpage
could not be used because it was out of the VMA range was collected:

                       run1   run2   run3   avg    stddev
  With this patch:     1453   1434   1428   1438   13.0
  Without this patch:  1536   1467   1493   1498   34.8

This shows the chance of using mcpage for anonymous mappings is increased by
4.2% with this patch.

To check for possible general impact from the sparser virtual address space,
will-it-scale/malloc1, will-it-scale/page_fault1 and a kernel build were run
with and without the change on top of v6.1-rc7. The results show no
performance change introduced by this patch:

  malloc1:
       v6.1-rc7       v6.1-rc7 + this patch
  ----------------  -------------------------
      23338   -0.5%      23210                will-it-scale.per_process_ops

  page_fault1:
       v6.1-rc7       v6.1-rc7 + this patch
  ----------------  -------------------------
      96322   -0.1%      96222                will-it-scale.per_process_ops

  kernel build:
       v6.1-rc7       v6.1-rc7 + this patch
  ----------------  -------------------------
      28.45   +0.2%      28.52                kbuild.buildtime_per_iteration

One drawback of the change is that the effective ASLR bits are reduced by
MCPAGE_ORDER bits.
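For reference, a minimal sketch of the align_mask arithmetic used below
(plain C, assuming MCPAGE_ORDER=2 and 4K pages): ~MCPAGE_MASK is the
sub-mcpage offset mask, so raising align_mask to at least that value makes
vm_unmapped_area() return 16K aligned addresses, which is also why
MCPAGE_ORDER bits of ASLR are lost.

  #include <stdio.h>

  #define PAGE_SHIFT   12
  #define MCPAGE_ORDER 2          /* assumed default */
  #define MCPAGE_SIZE  (1UL << (MCPAGE_ORDER + PAGE_SHIFT))
  #define MCPAGE_MASK  (~(MCPAGE_SIZE - 1))

  int main(void)
  {
          /* ~MCPAGE_MASK == 0x3fff: the 14 offset bits inside a 16K mcpage */
          printf("~MCPAGE_MASK = %#lx\n", ~MCPAGE_MASK);
          /* a candidate mmap address rounded down to the mcpage boundary */
          printf("aligned: %#lx\n", 0x7ffffffa5123UL & MCPAGE_MASK);
          return 0;
  }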
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 arch/x86/kernel/sys_x86_64.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 8cc653ffdccd..9b5617973e81 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -154,6 +154,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
+
+	if (info.align_mask < ~MCPAGE_MASK)
+		info.align_mask = ~MCPAGE_MASK;
+
 	return vm_unmapped_area(&info);
 }
 
@@ -212,6 +216,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
+
+	if (info.align_mask < ~MCPAGE_MASK)
+		info.align_mask = ~MCPAGE_MASK;
+
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
 		return addr;
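As a closing sanity check, a minimal sketch (hypothetical userspace test,
assuming a kernel with this series and MCPAGE_ORDER=2) that the addresses
mmap() hands back are mcpage aligned:

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          void *p = mmap(NULL, 64 * 1024, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;
          /* the low 14 bits should be zero with 16K alignment */
          printf("addr %p, 16K aligned: %s\n", p,
                 ((uintptr_t)p & 0x3fff) ? "no" : "yes");
          return 0;
  }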