From patchwork Mon May 6 08:46:30 2024
X-Patchwork-Submitter: Baolin Wang
X-Patchwork-Id: 13655148
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, david@redhat.com, ioworker0@gmail.com,
    wangkefeng.wang@huawei.com, ying.huang@intel.com, 21cnbao@gmail.com,
    ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com,
    baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH 6/8] mm: shmem: add mTHP support for anonymous shmem
Date: Mon, 6 May 2024 16:46:30 +0800
X-Mailer: git-send-email 2.39.3

Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages,
allowing THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, anonymous shared pages ignore the anonymous mTHP rules
configured through that sysfs interface and can only use PMD-mapped
THP, which is not reasonable. Users expect the mTHP rules to apply to
all anonymous pages, including anonymous shared pages, so that they can
enjoy the benefits of mTHP: lower latency than PMD-mapped THP, less
memory bloat than PMD-mapped THP, contiguous PTEs on ARM architectures
to reduce TLB misses, and so on.

The primary strategy is similar to the one used to support anonymous
mTHP. Introduce a new interface
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which can take all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus an additional
"inherit" option. By default all sizes are set to "never", except for
the PMD size, which is set to "inherit".

This ensures backward compatibility with the top-level shmem_enabled
setting, while also allowing independent control of shmem_enabled for
each mTHP size.
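For illustration only (not part of this patch): a minimal userspace
sketch of how the new per-size knob could be exercised once this lands.
The directory name "hugepages-64kB" and the value "always" are
assumptions here; per the description above, the accepted values mirror
the top-level shmem_enabled plus "inherit".

/*
 * Illustrative sketch, not part of this patch. Assumes a 64K mTHP size
 * exposed as "hugepages-64kB" and that the per-size knob accepts the
 * same values as the top-level shmem_enabled plus "inherit".
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	const char *knob = "/sys/kernel/mm/transparent_hugepage/"
			   "hugepages-64kB/shmem_enabled";
	const size_t len = 64 * 1024;	/* one 64K mTHP-sized region */
	char *buf;
	FILE *f;

	/* Enable 64K mTHP for anonymous shmem (needs root). */
	f = fopen(knob, "w");
	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("always\n", f);
	fclose(f);

	/*
	 * Anonymous shared mappings are backed by shmem, so faulting this
	 * region may now be served with a 64K large folio rather than
	 * sixteen 4K pages.
	 */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0x5a, len);	/* touch it to trigger the shmem fault path */
	munmap(buf, len);
	return 0;
}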
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 177 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 150 insertions(+), 27 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 59cc26d44344..08ccea5170a1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1611,6 +1611,106 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
 	return result;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+				struct vm_area_struct *vma, pgoff_t index,
+				bool global_huge)
+{
+	unsigned long mask = READ_ONCE(huge_anon_shmem_orders_always);
+	unsigned long within_size_orders = READ_ONCE(huge_anon_shmem_orders_within_size);
+	unsigned long vm_flags = vma->vm_flags;
+	/*
+	 * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
+	 * are enabled for this vma.
+	 */
+	unsigned long orders = BIT(PMD_ORDER + 1) - 1;
+	loff_t i_size;
+	int order;
+
+	if ((vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return 0;
+
+	/* If the hardware/firmware marked hugepage support disabled. */
+	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+		return 0;
+
+	/*
+	 * Following the 'deny' semantics of the top level, force the huge
+	 * option off from all mounts.
+	 */
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return 0;
+	/*
+	 * Only allow inherit orders if the top-level value is 'force', which
+	 * means non-PMD sized THP can not override 'huge' mount option now.
+	 */
+	if (shmem_huge == SHMEM_HUGE_FORCE)
+		return READ_ONCE(huge_anon_shmem_orders_inherit);
+
+	/* Allow mTHP that will be fully within i_size. */
+	order = highest_order(within_size_orders);
+	while (within_size_orders) {
+		index = round_up(index + 1, order);
+		i_size = round_up(i_size_read(inode), PAGE_SIZE);
+		if (i_size >> PAGE_SHIFT >= index) {
+			mask |= within_size_orders;
+			break;
+		}
+
+		order = next_order(&within_size_orders, order);
+	}
+
+	if (vm_flags & VM_HUGEPAGE)
+		mask |= READ_ONCE(huge_anon_shmem_orders_madvise);
+
+	if (global_huge)
+		mask |= READ_ONCE(huge_anon_shmem_orders_inherit);
+
+	return orders & mask;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+					   struct address_space *mapping, pgoff_t index,
+					   unsigned long orders)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long pages;
+	int order;
+
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	if (!orders)
+		return 0;
+
+	/* Find the highest order that can add into the page cache */
+	order = highest_order(orders);
+	while (orders) {
+		pages = 1UL << order;
+		index = round_down(index, pages);
+		if (!xa_find(&mapping->i_pages, &index,
+			     index + pages - 1, XA_PRESENT))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+#else
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+				struct vm_area_struct *vma, pgoff_t index,
+				bool global_huge)
+{
+	return 0;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+					   struct address_space *mapping, pgoff_t index,
+					   unsigned long orders)
+{
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static struct folio *shmem_alloc_hugefolio(gfp_t gfp,
 		struct shmem_inode_info *info, pgoff_t index, int order)
 {
@@ -1639,38 +1739,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp,
 	return (struct folio *)page;
 }
 
-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
-		struct inode *inode, pgoff_t index,
-		struct mm_struct *fault_mm, bool huge)
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+		gfp_t gfp, struct inode *inode, pgoff_t index,
+		struct mm_struct *fault_mm, bool huge, unsigned long orders)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+	unsigned long suitable_orders;
 	struct folio *folio;
 	long pages;
-	int error;
+	int error, order;
 
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		huge = false;
 
-	if (huge) {
-		pages = HPAGE_PMD_NR;
-		index = round_down(index, HPAGE_PMD_NR);
+	if (huge || orders > 0) {
+		if (vma && vma_is_anon_shmem(vma) && orders) {
+			suitable_orders = anon_shmem_suitable_orders(inode, vmf,
+							mapping, index, orders);
+		} else {
+			pages = HPAGE_PMD_NR;
+			suitable_orders = BIT(HPAGE_PMD_ORDER);
+			index = round_down(index, HPAGE_PMD_NR);
 
-		/*
-		 * Check for conflict before waiting on a huge allocation.
-		 * Conflict might be that a huge page has just been allocated
-		 * and added to page cache by a racing thread, or that there
-		 * is already at least one small page in the huge extent.
-		 * Be careful to retry when appropriate, but not forever!
-		 * Elsewhere -EEXIST would be the right code, but not here.
-		 */
-		if (xa_find(&mapping->i_pages, &index,
+			/*
+			 * Check for conflict before waiting on a huge allocation.
+			 * Conflict might be that a huge page has just been allocated
+			 * and added to page cache by a racing thread, or that there
+			 * is already at least one small page in the huge extent.
+			 * Be careful to retry when appropriate, but not forever!
+			 * Elsewhere -EEXIST would be the right code, but not here.
+			 */
+			if (xa_find(&mapping->i_pages, &index,
 					index + HPAGE_PMD_NR - 1, XA_PRESENT))
-			return ERR_PTR(-E2BIG);
+				return ERR_PTR(-E2BIG);
+		}
 
-		folio = shmem_alloc_hugefolio(gfp, info, index, HPAGE_PMD_ORDER);
-		if (!folio && pages == HPAGE_PMD_NR)
-			count_vm_event(THP_FILE_FALLBACK);
+		order = highest_order(suitable_orders);
+		while (suitable_orders) {
+			pages = 1 << order;
+			index = round_down(index, pages);
+			folio = shmem_alloc_hugefolio(gfp, info, index, order);
+			if (folio)
+				goto allocated;
+
+			if (pages == HPAGE_PMD_NR)
+				count_vm_event(THP_FILE_FALLBACK);
+			order = next_order(&suitable_orders, order);
+		}
 	} else {
 		pages = 1;
 		folio = shmem_alloc_folio(gfp, info, index);
@@ -1678,6 +1795,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 	if (!folio)
 		return ERR_PTR(-ENOMEM);
 
+allocated:
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
 
@@ -1972,7 +2090,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	struct mm_struct *fault_mm;
 	struct folio *folio;
 	int error;
-	bool alloced;
+	bool alloced, huge;
+	unsigned long orders = 0;
 
 	if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
 		return -EINVAL;
@@ -2044,14 +2163,18 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		return 0;
 	}
 
-	if (shmem_is_huge(inode, index, false, fault_mm,
-			  vma ? vma->vm_flags : 0)) {
+	huge = shmem_is_huge(inode, index, false, fault_mm,
+			     vma ? vma->vm_flags : 0);
+	/* Find hugepage orders that are allowed for anonymous shmem. */
+	if (vma && vma_is_anon_shmem(vma))
+		orders = anon_shmem_allowable_huge_orders(inode, vma, index, huge);
+	if (huge || orders > 0) {
 		gfp_t huge_gfp;
 
 		huge_gfp = vma_thp_gfp_mask(vma);
 		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
-		folio = shmem_alloc_and_add_folio(huge_gfp,
-				inode, index, fault_mm, true);
+		folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
+				inode, index, fault_mm, true, orders);
 		if (!IS_ERR(folio)) {
 			if (folio_test_pmd_mappable(folio))
 				count_vm_event(THP_FILE_ALLOC);
@@ -2061,7 +2184,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		goto repeat;
 	}
 
-	folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+	folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, false, 0);
 	if (IS_ERR(folio)) {
 		error = PTR_ERR(folio);
 		if (error == -EEXIST)
@@ -2072,7 +2195,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 alloced:
 	alloced = true;
 
-	if (folio_test_pmd_mappable(folio) &&
+	if (folio_test_large(folio) &&
 	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
 					folio_next_index(folio) - 1) {
 		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);