From patchwork Mon Jan 9 07:22:29 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093093
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 1/4] mcpage: add size/mask/shift definition for multiple consecutive page
Date: Mon, 9 Jan 2023 15:22:29 +0800
Message-Id: <20230109072232.2398464-2-fengwei.yin@intel.com>
In-Reply-To: <20230109072232.2398464-1-fengwei.yin@intel.com>
References: <20230109072232.2398464-1-fengwei.yin@intel.com>

Huge pages in the current kernel can bring obvious performance improvements
for some workloads, through fewer TLB misses and fewer page faults. But the
limited choice of huge page sizes (2M/1G for x86_64) also brings extra costs
such as larger memory consumption and more CPU cycles spent zeroing pages.

The idea of the multiple consecutive page (abbreviated "mcpage") is to use a
collection of physically contiguous 4K pages, rather than a huge page, for
anonymous mappings. The goal is to have more choices when trading off the
pros and cons of huge pages: compared to a huge page, an mcpage gains less
from reduced TLB misses and page faults, but it also pays less in extra
memory consumption and in the larger latency introduced by page compaction,
page zeroing, etc.

The size of an mcpage is configurable. The default of 16K was picked
arbitrarily; users should choose a value by tuning their workload with
different mcpage sizes.
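For concreteness, here is a minimal sketch of the arithmetic behind the 16K
default (plain userspace C; PAGE_SHIFT of 12 and order 2 are assumptions, and
the MCPAGE_* names mirror the macros this patch adds below):

  #include <assert.h>
  #include <stdio.h>

  #define PAGE_SHIFT   12                          /* assumed: 4K base pages */
  #define MCPAGE_ORDER 2                           /* assumed: the default order */
  #define MCPAGE_SHIFT (MCPAGE_ORDER + PAGE_SHIFT) /* 14 */
  #define MCPAGE_SIZE  (1UL << MCPAGE_SHIFT)       /* 16384 bytes = 16K */
  #define MCPAGE_MASK  (~(MCPAGE_SIZE - 1))        /* clears the low 14 bits */
  #define MCPAGE_NR    (1UL << MCPAGE_ORDER)       /* 4 sub-pages per mcpage */

  int main(void)
  {
          unsigned long addr = 0x7f1234567a10UL;   /* an arbitrary fault address */

          /* an mcpage is MCPAGE_NR normal pages back to back */
          assert(MCPAGE_SIZE == MCPAGE_NR * (1UL << PAGE_SHIFT));
          /* the mcpage-aligned base that a fault at addr belongs to */
          printf("mcpage base: %#lx\n", addr & MCPAGE_MASK);
          return 0;
  }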
To get physically contiguous pages, a high-order page is allocated (the
order is calculated from the mcpage size) and then split. After the split,
each sub-page of the mcpage is just a normal 4K page, so the current kernel
page management infrastructure applies to mcpages without any change.

To reduce the number of page faults, multiple page table entries are
populated in one page fault with the PFNs of the mcpage's sub-pages. This
also brings a small extra cost in memory consumption.

Update Kconfig to let the user define the mcpage order, and define macros
for the mcpage mask/shift/nr/size. In this RFC patch, only Kconfig is used
to set the mcpage order, to show the idea; a runtime parameter will be
chosen if this is made an official patch in the future.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/mm_types.h | 11 +++++++++++
 mm/Kconfig               | 19 +++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..fa561c7b6290 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -71,6 +71,17 @@ struct mem_cgroup;
 #define _struct_page_alignment	__aligned(sizeof(unsigned long))
 #endif
 
+#ifdef CONFIG_MCPAGE_ORDER
+#define MCPAGE_ORDER	CONFIG_MCPAGE_ORDER
+#else
+#define MCPAGE_ORDER	0
+#endif
+
+#define MCPAGE_SIZE	(1 << (MCPAGE_ORDER + PAGE_SHIFT))
+#define MCPAGE_MASK	(~(MCPAGE_SIZE - 1))
+#define MCPAGE_SHIFT	(MCPAGE_ORDER + PAGE_SHIFT)
+#define MCPAGE_NR	(1 << (MCPAGE_ORDER))
+
 struct page {
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..c202dc99ab6d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -650,6 +650,25 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 	  Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
 	  clamped down to MAX_ORDER - 1.
 
+config MCPAGE
+	bool "multiple consecutive page"
+	default n
+	help
+	  Enable multiple consecutive pages: an mcpage is a collection of
+	  sub-pages which are physically contiguous. When mapping to user
+	  space, all the sub-pages are mapped in one page fault handler.
+	  This trades off the pros and cons of huge pages: less unnecessary
+	  extra memory zeroing and less memory consumption, but without
+	  the TLB benefit.
+
+config MCPAGE_ORDER
+	int "multiple consecutive page order"
+	default 2
+	depends on X86_64 && MCPAGE
+	help
+	  The order of an mcpage. Should be chosen carefully by tuning your
+	  workload.
+
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
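With the new symbols in place, a 16K mcpage build would be selected by a
.config fragment like the following (a hypothetical example; order 2 is the
default added above):

  CONFIG_MCPAGE=y
  CONFIG_MCPAGE_ORDER=2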
From patchwork Mon Jan 9 07:22:30 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093095
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 2/4] mcpage: anon page: Use mcpage for anonymous mapping
Date: Mon, 9 Jan 2023 15:22:30 +0800
Message-Id: <20230109072232.2398464-3-fengwei.yin@intel.com>
In-Reply-To: <20230109072232.2398464-1-fengwei.yin@intel.com>
References: <20230109072232.2398464-1-fengwei.yin@intel.com>

If an mcpage fits in the range of the VMA, try to allocate an mcpage and set
it up for the anonymous mapping, populating as many of the surrounding page
table entries as possible. The benefit is a reduced page fault count.

Split the mcpage so that each sub-page can be managed as a normal 4K page.
The split is done before setting up the page table entries, to avoid
complicated page lock, mapcount and refcount handling.

The change is expected to directly impact memory consumption, the page fault
count, and zone lock and LRU lock contention. The memory consumption and
system performance impact are evaluated as follows.
Some system performance data were collected with a 16K mcpage size:

===============================================================================
                                            v6.1-rc4-no-thp  v6.1-rc4-thp  mcpage
will-it-scale/malloc1     (higher is better)     100%             2%         17%
will-it-scale/page_fault1 (higher is better)     100%           238%        115%
redis.set_avg_throughput  (higher is better)     100%            99%        102%
redis.get_avg_throughput  (higher is better)     100%            99%        100%
kernel build              (lower is better)      100%            98%         97%

  * v6.1-rc4-no-thp: 6.1-rc4 with THP disabled in Kconfig
  * v6.1-rc4-thp:    6.1-rc4 with THP enabled as "always" in Kconfig
  * mcpage:          6.1-rc4 + 16KB mcpage

The test results are normalized to the "v6.1-rc4-no-thp" config.

perf data comparing v6.1-rc4-no-thp and mcpage were also collected.

For the kernel build, perf showed a 56% minor page fault drop and a 1.3%
clear_page increase:

  v6.1-rc4-no-thp          mcpage
        5.939e+08  -56.0%  2.61e+08  kbuild.time.minor_page_faults
             0.00    +2.2      2.20  perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
             0.72    -0.7      0.00  perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.vma_alloc_folio.do_anonymous_page

For redis, perf showed a 74.6% minor page fault drop and a 0.11% zone lock
drop:

  v6.1-rc4-no-thp          mcpage
           401414  -74.6%  102134    redis.time.minor_page_faults
             0.00    +0.1      0.11  perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
             0.22    -0.2      0.00  perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.vma_alloc_folio

For will-it-scale/page_fault1, perf showed a 12.8% minor page fault drop, a
15.97% zone lock drop and a 27% lru lock increase:

  v6.1-rc4-no-thp          mcpage
             7239  -12.8%  6312      will-it-scale.time.minor_page_faults
            52.15   -34.4     17.75  perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages
             3.29   +27.0     30.29  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.tlb_batch_pages_flush
             4.14    -4.1      0.00  perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.vma_alloc_folio.do_anonymous_page
             0.00   +13.2     13.20  perf-profile.calltrace.cycles-pp.clear_page_erms.get_page_from_freelist.__alloc_pages.alloc_mcpages.do_anonymous_mcpages
             0.00   +18.4     18.43  perf-profile.calltrace.cycles-pp.rmqueue_bulk.rmqueue.get_page_from_freelist.__alloc_pages.alloc_mcpages

For will-it-scale/malloc1, the test result is surprising: the regression is
much bigger than expected. perf showed a 12.3% minor page fault drop and a
43.6% zone lock increase:

  v6.1-rc4-no-thp          mcpage
          2978027  -82.2%  530847    will-it-scale.128.processes
             7249  -12.3%  6360      will-it-scale.time.minor_page_faults
             0.00   +43.6     43.62  perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.pte_alloc_one.__pte_alloc
             0.00   +45.4     45.39  perf-profile.calltrace.cycles-pp._raw_spin_lock.free_pcppages_bulk.free_unref_page_list.release_pages.tlb_batch_pages_flush

It turned out the mcpage allocation/free pattern hit a corner case (high
zone lock contention was triggered, which impacted pte_alloc) that the
current pcp list bulk free can't handle very well. The pcp list bulk free
issue will be addressed separately. After fixing the pcp list bulk free
corner case, the will-it-scale/malloc1 result is restored to 56% of
v6.1-rc4-no-thp.
===============================================================================
For the tail latency of page allocation, the following test setup is used:
- alloc_page() with order 0, 2 and 9 is called 2097152, 2097152 and 32768
  times, respectively, in the kernel
- with both non-fragmented and fully fragmented memory
- with and without the __GFP_ZERO flag, to separate pure compaction latency
  from the user-visible latency

And the result is as follows:

no page zeroing:

  4K page:
                       non-fragmented         fragmented
    Number of test:    2097152                2097152
    max latency:       26us                   27us
    90% tail latency:  1us (1887436th)        1us (1887436th)
    95% tail latency:  1us (1992294th)        1us (1992294th)
    99% tail latency:  2us (2076180th)        3us (2076180th)

  16K mcpage:
                       non-fragmented         fragmented
    Number of test:    2097152                2097152
    max latency:       26us                   9862us
    90% tail latency:  1us (1887436th)        1us (1887436th)
    95% tail latency:  1us (1992294th)        1us (1992294th)
    99% tail latency:  1us (2076180th)        3us (2076180th)

  2M THP:
                       non-fragmented         fragmented
    Number of test:    32768                  32768
    max latency:       40us                   12149us
    90% tail latency:  8us (29491th)          864us (29491th)
    95% tail latency:  10us (31129th)         943us (31129th)
    99% tail latency:  13us (32440th)         1067us (32440th)

page zeroing:

  4K page:
                       non-fragmented         fragmented
    Number of test:    2097152                2097152
    max latency:       18us                   46us
    90% tail latency:  1us (1887436th)        1us (1887436th)
    95% tail latency:  1us (1992294th)        1us (1992294th)
    99% tail latency:  2us (2076180th)        4us (2076180th)

  16K mcpage:
                       non-fragmented         fragmented
    Number of test:    2097152                2097152
    max latency:       31us                   5740us
    90% tail latency:  3us (1887436th)        3us (1887436th)
    95% tail latency:  3us (1992294th)        4us (1992294th)
    99% tail latency:  4us (2076180th)        5us (2076180th)

  2M THP:
                       non-fragmented         fragmented
    Number of test:    32768                  32768
    max latency:       530us                  10494us
    90% tail latency:  366us (29491th)        1114us (29491th)
    95% tail latency:  373us (31129th)        1263us (31129th)
    99% tail latency:  391us (32440th)        1808us (32440th)

With 16K mcpage, the tail latency of page allocation is good, while 2M THP
has a much worse result in the fragmented memory case.

===============================================================================
For the performance of NUMA interleaving on base pages, mcpages and THP, the
memory latency tool from https://github.com/torvalds/test-tlb is used.

On a Cascade Lake box with 96 cores + 258G memory across two NUMA nodes:

  node distances:
  node   0   1
    0:  10  20
    1:  20  10

With the memory policy set to MPOL_INTERLEAVE and a 1G memory mapping
accessed with a 128 byte (2x cache line) stride, the memory access latency
(lower is better) is:

  random access with 4K page:        142.32 ns
  random access with 16K mcpage:     141.21 ns (+0.8%)
  random access with 2M THP:         116.56 ns (+18.2%)

  sequential access with 4K page:    21.28 ns
  sequential access with 16K mcpage: 20.52 ns (+3.6%)
  sequential access with 2M THP:     20.36 ns (+4.3%)

mcpage brings a minor memory access latency improvement over the 4K page,
though smaller than the improvement 2M THP brings.
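The interleaving setup above is external to this series; a minimal sketch of
how such a measurement buffer might be set up (hypothetical userspace C,
assuming two nodes and libnuma's mbind() wrapper) is:

  /* 1G anonymous mapping interleaved across nodes 0 and 1, then touched
   * with a 128 byte stride as in the measurement above. */
  #include <numaif.h>             /* mbind(); link with -lnuma */
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1UL << 30;                 /* 1G */
          unsigned long nodemask = 0x3;           /* nodes 0 and 1 */
          volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED ||
              mbind((void *)p, len, MPOL_INTERLEAVE, &nodemask,
                    sizeof(nodemask) * 8, 0))
                  return 1;

          for (size_t off = 0; off < len; off += 128)  /* 2x cache line stride */
                  p[off];                              /* forced read access */
          return 0;
  }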
===============================================================================
The memory consumption was checked by using firefox to visit "www.lwn.net"
and collecting firefox's RSS, with a 16K mcpage size:

  6.1-rc7:               RSS of firefox is 285300 KB
  6.1-rc7 + 16K mcpage:  RSS of firefox is 295536 KB

That is 3.59% more memory consumption with 16K mcpage.

===============================================================================
In this RFC patch, a non-batched update of the page table entries is used to
show the idea. Batched mode will be chosen if this is made an official patch
in the future.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/gfp.h       |   5 ++
 include/linux/mcpage_mm.h |  35 ++++++++++
 mm/Makefile               |   1 +
 mm/mcpage_memory.c        | 134 ++++++++++++++++++++++++++++++++++++++
 mm/memory.c               |  11 ++++
 mm/mempolicy.c            |  51 +++++++++++++++
 6 files changed, 237 insertions(+)
 create mode 100644 include/linux/mcpage_mm.h
 create mode 100644 mm/mcpage_memory.c

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65a78773dcca..035c5fadd9d4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -265,6 +265,8 @@ struct page *alloc_pages(gfp_t gfp, unsigned int order);
 struct folio *folio_alloc(gfp_t gfp, unsigned order);
 struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr, bool hugepage);
+struct page *alloc_mcpages(gfp_t gfp, int order, struct vm_area_struct *vma,
+		unsigned long addr);
 #else
 static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
 {
@@ -276,7 +278,10 @@ static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
 }
 #define vma_alloc_folio(gfp, order, vma, addr, hugepage)		\
 	folio_alloc(gfp, order)
+#define alloc_mcpages(gfp, order, vma, addr)		\
+	alloc_pages(gfp, order)
 #endif
+
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 static inline struct page *alloc_page_vma(gfp_t gfp,
 		struct vm_area_struct *vma, unsigned long addr)
diff --git a/include/linux/mcpage_mm.h b/include/linux/mcpage_mm.h
new file mode 100644
index 000000000000..4b2fb7319233
--- /dev/null
+++ b/include/linux/mcpage_mm.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MCPAGE_MM_H
+#define _LINUX_MCPAGE_MM_H
+
+#include <linux/mm.h>
+
+#ifdef CONFIG_MCPAGE_ORDER
+
+static inline bool allow_mcpage(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int order)
+{
+	unsigned int mcpage_size = 1 << (order + PAGE_SHIFT);
+	unsigned long haddr = ALIGN_DOWN(addr, mcpage_size);
+
+	return range_in_vma(vma, haddr, haddr + mcpage_size);
+}
+
+extern vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf,
+		unsigned int order);
+
+#else
+static inline bool allow_mcpage(struct vm_area_struct *vma,
+		unsigned long addr, unsigned int order)
+{
+	return false;
+}
+
+static inline vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf,
+		unsigned int order)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif /* CONFIG_MCPAGE_ORDER */
+
+#endif /* _LINUX_MCPAGE_MM_H */
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..efeaa8358953 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_MCPAGE) += mcpage_memory.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
diff --git a/mm/mcpage_memory.c b/mm/mcpage_memory.c
new file mode 100644
index 000000000000..ea4be2e25bce
--- /dev/null
+++ b/mm/mcpage_memory.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2022 Intel Corporation. All rights reserved.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
+static inline struct page *
+alloc_zeroed_mcpages(int order, struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct page *page = alloc_mcpages(GFP_HIGHUSER_MOVABLE, order,
+			vma, addr);
+
+	if (page) {
+		int i;
+		struct page *it = page;
+
+		for (i = 0; i < (1 << order); i++, it++) {
+			clear_user_highpage(it, addr);
+			cond_resched();
+		}
+	}
+
+	return page;
+}
+#else
+static inline struct page *
+alloc_zeroed_mcpages(int order, struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	return alloc_mcpages(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
+			order, vma, addr);
+}
+#endif
+
+static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
+		struct page *page, unsigned long addr)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	vm_fault_t ret = 0;
+	pte_t entry;
+
+	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) {
+		ret = VM_FAULT_OOM;
+		goto oom;
+	}
+
+	cgroup_throttle_swaprate(page, GFP_KERNEL);
+	__SetPageUptodate(page);
+
+	entry = mk_pte(page, vma->vm_page_prot);
+	entry = pte_sw_mkyoung(entry);
+	if (vma->vm_flags & VM_WRITE)
+		entry = pte_mkwrite(pte_mkdirty(entry));
+
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+
+	if (!pte_none(*vmf->pte)) {
+		ret = VM_FAULT_FALLBACK;
+		update_mmu_cache(vma, addr, vmf->pte);
+		goto release;
+	}
+
+	ret = check_stable_address_space(vma->vm_mm);
+	if (ret) {
+		ret = VM_FAULT_FALLBACK;
+		goto release;
+	}
+
+	if (userfaultfd_missing(vma)) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return handle_userfault(vmf, VM_UFFD_MISSING);
+	}
+
+	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, vma, addr);
+	lru_cache_add_inactive_or_unevictable(page, vma);
+	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+	update_mmu_cache(vma, addr, vmf->pte);
+release:
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+oom:
+	return ret;
+}
+
+vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf, unsigned int order)
+{
+	int i, nr = 1 << order;
+	unsigned int mcpage_size = nr * PAGE_SIZE;
+	vm_fault_t ret = 0, real_ret = 0;
+	bool handled = false;
+	struct page *page;
+	unsigned long haddr = ALIGN_DOWN(vmf->address, mcpage_size);
+
+	page = alloc_zeroed_mcpages(order, vmf->vma, haddr);
+	if (!page)
+		return VM_FAULT_FALLBACK;
+
+	split_page(page, order);
+	for (i = 0; i < nr; i++, haddr += PAGE_SIZE) {
+		ret = do_anonymous_mcpage(vmf, &page[i], haddr);
+		if (haddr == PAGE_ALIGN_DOWN(vmf->address)) {
+			real_ret = ret;
+			handled = true;
+		}
+		if (ret)
+			break;
+	}
+
+	while (i < nr)
+		put_page(&page[i++]);
+
+	/*
+	 * If the fault address was not handled, fall back to handling
+	 * the fault address with a normal page.
+	 */
+	if (!handled)
+		return VM_FAULT_FALLBACK;
+	else
+		return real_ret;
+}
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..fb7f370f6c67 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -77,6 +77,7 @@
 #include
 #include
 #include
+#include <linux/mcpage_mm.h>
 
 #include
 
@@ -4071,6 +4072,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page.
 	 */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
+
+	if (allow_mcpage(vma, vmf->address, MCPAGE_ORDER)) {
+		ret = do_anonymous_mcpages(vmf, MCPAGE_ORDER);
+
+		if (!(ret & VM_FAULT_FALLBACK))
+			return ret;
+
+		ret = 0;
+	}
+
 	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
 	if (!page)
 		goto oom;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..87ecbdb74fbe 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2251,6 +2251,57 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(vma_alloc_folio);
 
+/**
+ * alloc_mcpages - Allocate a mcpage for a VMA.
+ * @gfp: GFP flags.
+ * @order: Order of the mcpage.
+ * @vma: Pointer to VMA or NULL if not available.
+ * @addr: Virtual address of the allocation. Must be inside @vma.
+ *
+ * Allocate a mcpage for a specific address in @vma, using the
+ * appropriate NUMA policy. When @vma is not NULL the caller must hold the
+ * mmap_lock of the mm_struct of the VMA to prevent it from going away.
+ * Should be used for all allocations for pages that will be mapped into
+ * user space.
+ *
+ * Return: The page on success or NULL if allocation fails.
+ */
+struct page *alloc_mcpages(gfp_t gfp, int order, struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct mempolicy *pol;
+	int node = numa_node_id();
+	struct page *page;
+	int preferred_nid;
+	nodemask_t *nmask;
+
+	pol = get_vma_policy(vma, addr);
+
+	if (pol->mode == MPOL_INTERLEAVE) {
+		unsigned int nid;
+
+		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
+		mpol_cond_put(pol);
+		page = alloc_page_interleave(gfp, order, nid);
+		goto out;
+	}
+
+	if (pol->mode == MPOL_PREFERRED_MANY) {
+		node = policy_node(gfp, pol, node);
+		page = alloc_pages_preferred_many(gfp, order, node, pol);
+		mpol_cond_put(pol);
+		goto out;
+	}
+
+	nmask = policy_nodemask(gfp, pol);
+	preferred_nid = policy_node(gfp, pol, node);
+	page = __alloc_pages(gfp, order, preferred_nid, nmask);
+	mpol_cond_put(pol);
+out:
+	return page;
+}
+EXPORT_SYMBOL(alloc_mcpages);
+
 /**
  * alloc_pages - Allocate pages.
  * @gfp: GFP flags.
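Not part of the series, but the populate-around behavior is easy to observe
from user space. A minimal sketch (hypothetical test program, assuming a
kernel built with 16K mcpages) that counts minor faults while touching a 16K
region page by page:

  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/resource.h>

  static long minor_faults(void)
  {
          struct rusage ru;

          getrusage(RUSAGE_SELF, &ru);
          return ru.ru_minflt;
  }

  int main(void)
  {
          size_t len = 16 * 1024;         /* one 16K mcpage, assuming order 2 */
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          long before, after;
          size_t off;

          if (p == MAP_FAILED)
                  return 1;

          before = minor_faults();
          for (off = 0; off < len; off += 4096)
                  p[off] = 1;             /* touch each 4K sub-page */
          after = minor_faults();

          printf("minor faults for 16K: %ld\n", after - before);
          return 0;
  }

With mcpage populating the surrounding entries, the printed delta should stay
near 1 instead of 4 (assuming the region lands fully inside the VMA so that
allow_mcpage() accepts it).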
From patchwork Mon Jan 9 07:22:31 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093094
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 3/4] mcpage: add vmstat counters for mcpages
Date: Mon, 9 Jan 2023 15:22:31 +0800
Message-Id: <20230109072232.2398464-4-fengwei.yin@intel.com>
In-Reply-To: <20230109072232.2398464-1-fengwei.yin@intel.com>
References: <20230109072232.2398464-1-fengwei.yin@intel.com>

Add vmstat counters for mcpage anonymous faults:

MCPAGE_ANON_FAULT_ALLOC:
    how many times an mcpage was used for an anonymous mapping.

MCPAGE_ANON_FAULT_FALLBACK:
    how many times we fell back to normal pages for an anonymous mapping.

MCPAGE_ANON_FAULT_CHARGE_FAILED:
    how many times we fell back because the memcg charge failed.

MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED:
    how many times we fell back because a page table entry was already
    populated.

MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE:
    how many times we fell back because of an unstable address space.
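After this patch the counters appear in /proc/vmstat; a small sketch (plain
C; the names match the vmstat_text[] strings added below, the values are of
course workload dependent) for pulling them out:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char line[128];
          FILE *f = fopen("/proc/vmstat", "r");

          if (!f)
                  return 1;
          while (fgets(line, sizeof(line), f))
                  if (!strncmp(line, "mcpage_", 7))
                          fputs(line, stdout);  /* e.g. "mcpage_anon_fault_alloc <n>" */
          fclose(f);
          return 0;
  }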
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/vm_event_item.h | 10 ++++++++++
 mm/mcpage_memory.c            |  6 ++++++
 mm/memory.c                   |  1 +
 mm/vmstat.c                   |  7 +++++++
 4 files changed, 24 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..9c36bfc4c904 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -119,6 +119,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
 #endif
+#ifdef CONFIG_MCPAGE
+		MCPAGE_ANON_FAULT_ALLOC,
+		MCPAGE_ANON_FAULT_FALLBACK,
+		MCPAGE_ANON_FAULT_CHARGE_FAILED,
+		MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED,
+		MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE,
+#endif
 #ifdef CONFIG_MEMORY_BALLOON
 		BALLOON_INFLATE,
 		BALLOON_DEFLATE,
@@ -159,5 +166,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #define THP_FILE_FALLBACK_CHARGE ({ BUILD_BUG(); 0; })
 #define THP_FILE_MAPPED ({ BUILD_BUG(); 0; })
 #endif
+#ifndef CONFIG_MCPAGE
+#define MCPAGE_ANON_FAULT_FALLBACK ({ BUILD_BUG(); 0; })
+#endif
 
 #endif		/* VM_EVENT_ITEM_H_INCLUDED */
diff --git a/mm/mcpage_memory.c b/mm/mcpage_memory.c
index ea4be2e25bce..e208cf818ebf 100644
--- a/mm/mcpage_memory.c
+++ b/mm/mcpage_memory.c
@@ -55,6 +55,7 @@ static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
 
 	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) {
 		ret = VM_FAULT_OOM;
+		count_vm_event(MCPAGE_ANON_FAULT_CHARGE_FAILED);
 		goto oom;
 	}
 
@@ -71,12 +72,14 @@ static vm_fault_t do_anonymous_mcpage(struct vm_fault *vmf,
 	if (!pte_none(*vmf->pte)) {
 		ret = VM_FAULT_FALLBACK;
 		update_mmu_cache(vma, addr, vmf->pte);
+		count_vm_event(MCPAGE_ANON_FAULT_PAGE_TABLE_POPULATED);
 		goto release;
 	}
 
 	ret = check_stable_address_space(vma->vm_mm);
 	if (ret) {
 		ret = VM_FAULT_FALLBACK;
+		count_vm_event(MCPAGE_ANON_FAULT_INSTABLE_ADDRESS_SPACE);
 		goto release;
 	}
 
@@ -120,6 +123,9 @@ vm_fault_t do_anonymous_mcpages(struct vm_fault *vmf, unsigned int order)
 		if (ret)
 			break;
 	}
 
+	if (i == nr)
+		count_vm_event(MCPAGE_ANON_FAULT_ALLOC);
+
 	while (i < nr)
 		put_page(&page[i++]);
 
diff --git a/mm/memory.c b/mm/memory.c
index fb7f370f6c67..b3655be849ae 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4079,6 +4079,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 
+		count_vm_event(MCPAGE_ANON_FAULT_FALLBACK);
 		ret = 0;
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..c40e33dee1b1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1367,6 +1367,13 @@ const char * const vmstat_text[] = {
 	"thp_swpout",
 	"thp_swpout_fallback",
 #endif
+#ifdef CONFIG_MCPAGE
+	"mcpage_anon_fault_alloc",
+	"mcpage_anon_fault_fallback",
+	"mcpage_anon_fault_charge_failed",
+	"mcpage_anon_fault_page_table_populated",
+	"mcpage_anon_fault_instable_address_space",
+#endif
 #ifdef CONFIG_MEMORY_BALLOON
 	"balloon_inflate",
 	"balloon_deflate",

From patchwork Mon Jan 9 07:22:32 2023
X-Patchwork-Submitter: Yin Fengwei
X-Patchwork-Id: 13093096
From: Yin Fengwei <fengwei.yin@intel.com>
To: linux-mm@kvack.org, akpm@linux-foundation.org, jack@suse.cz,
 hughd@google.com, kirill.shutemov@linux.intel.com, mhocko@suse.com,
 ak@linux.intel.com, aarcange@redhat.com, npiggin@gmail.com,
 mgorman@techsingularity.net, willy@infradead.org, rppt@kernel.org,
 dave.hansen@intel.com, ying.huang@intel.com, tim.c.chen@intel.com
Cc: fengwei.yin@intel.com
Subject: [RFC PATCH 4/4] mcpage: get_unmapped_area return mcpage size aligned addr
Date: Mon, 9 Jan 2023 15:22:32 +0800
Message-Id: <20230109072232.2398464-5-fengwei.yin@intel.com>
In-Reply-To: <20230109072232.2398464-1-fengwei.yin@intel.com>
References: <20230109072232.2398464-1-fengwei.yin@intel.com>

For x86_64, let mmap() start from an mcpage size aligned address.

Using firefox with one tab to visit the entry page of "www.lwn.net" as the
workload, with the mcpage order set to 2, the number of times an mcpage
could not be used because it was out of the VMA range was collected:

                       run1   run2   run3   avg    stddev
  With this patch:     1453   1434   1428   1438   13.0
  Without this patch:  1536   1467   1493   1498   34.8

This shows the chance of using mcpage for anonymous mappings is increased by
4.2% with this patch.

To check for possible general impact from the sparser virtual address space,
will-it-scale/malloc1, will-it-scale/page_fault1 and a kernel build were run
with and without the change on top of v6.1-rc7. The results show no
performance change introduced by this patch:

  malloc1:
       v6.1-rc7       v6.1-rc7 + this patch
  ----------------  -------------------------
      23338   -0.5%      23210                will-it-scale.per_process_ops

  page_fault1:
       v6.1-rc7       v6.1-rc7 + this patch
  ----------------  -------------------------
      96322   -0.1%      96222                will-it-scale.per_process_ops

  kernel build:
       v6.1-rc7       v6.1-rc7 + this patch
  ----------------  -------------------------
      28.45   +0.2%      28.52                kbuild.buildtime_per_iteration

One drawback of the change is that the effective ASLR bits are reduced by
MCPAGE_ORDER bits.
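For reference, a minimal sketch of the align_mask arithmetic used below
(plain C, assuming MCPAGE_ORDER=2 and 4K pages): ~MCPAGE_MASK is the
sub-mcpage offset mask, so raising align_mask to at least that value makes
vm_unmapped_area() return 16K aligned addresses, which is also why
MCPAGE_ORDER bits of ASLR are lost.

  #include <stdio.h>

  #define PAGE_SHIFT   12
  #define MCPAGE_ORDER 2          /* assumed default */
  #define MCPAGE_SIZE  (1UL << (MCPAGE_ORDER + PAGE_SHIFT))
  #define MCPAGE_MASK  (~(MCPAGE_SIZE - 1))

  int main(void)
  {
          /* ~MCPAGE_MASK == 0x3fff: the 14 offset bits inside a 16K mcpage */
          printf("~MCPAGE_MASK = %#lx\n", ~MCPAGE_MASK);
          /* a candidate mmap address rounded down to the mcpage boundary */
          printf("aligned: %#lx\n", 0x7ffffffa5123UL & MCPAGE_MASK);
          return 0;
  }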
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 arch/x86/kernel/sys_x86_64.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 8cc653ffdccd..9b5617973e81 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -154,6 +154,10 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
+
+	if (info.align_mask < ~MCPAGE_MASK)
+		info.align_mask = ~MCPAGE_MASK;
+
 	return vm_unmapped_area(&info);
 }
 
@@ -212,6 +216,10 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
+
+	if (info.align_mask < ~MCPAGE_MASK)
+		info.align_mask = ~MCPAGE_MASK;
+
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
 		return addr;
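As a closing sanity check, a minimal sketch (hypothetical userspace test,
assuming a kernel with this series and MCPAGE_ORDER=2) that the addresses
mmap() hands back are mcpage aligned:

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          void *p = mmap(NULL, 64 * 1024, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;
          /* the low 14 bits should be zero with 16K alignment */
          printf("addr %p, 16K aligned: %s\n", p,
                 ((uintptr_t)p & 0x3fff) ? "no" : "yes");
          return 0;
  }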