From patchwork Wed Dec 11 06:26:54 2024
X-Patchwork-Submitter: Baolin Wang <baolin.wang@linux.alibaba.com>
X-Patchwork-Id: 13902962
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, david@redhat.com, wangkefeng.wang@huawei.com,
    kasong@tencent.com, ying.huang@linux.alibaba.com, 21cnbao@gmail.com,
    ryan.roberts@arm.com, baolin.wang@linux.alibaba.com,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH] mm: shmem: skip swapcache for swapin of synchronous swap device
Date: Wed, 11 Dec 2024 14:26:54 +0800
Message-Id: <8c40d045dc1ea46cc0983c0188b566615d9eb490.1733897892.git.baolin.wang@linux.alibaba.com>
X-Mailer: git-send-email 2.39.3
MIME-Version: 1.0

With fast swap devices (such as zram), swapin latency is crucial for
applications. For shmem swapin, similar to anonymous memory swapin, we can
skip the swapcache operation to improve swapin latency.

Testing 1G shmem sequential swapin without THP enabled, I observed
approximately a 6% performance improvement (each test was repeated 5 times
and the mean is reported):

w/o patch	w/ patch	changes
534.8ms		501ms		+6.3%

In addition, we currently always split the large swap entry stored in the
shmem mapping during shmem large folio swapin, which is not ideal,
especially with a fast swap device. If the swap device is a synchronous
device, we can instead swap in the whole large folio rather than splitting
the precious large folios, taking advantage of large folios and improving
swapin latency, similar to anonymous memory mTHP swapin.

Testing 1G shmem sequential swapin with 64K mTHP and 2M mTHP, I observed an
obvious performance improvement:

mTHP=64K
w/o patch	w/ patch	changes
550.4ms		169.6ms		+69%

mTHP=2M
w/o patch	w/ patch	changes
542.8ms		126.8ms		+77%

Note that skipping the swapcache requires attention to concurrent swapin
scenarios. Fortunately, swapcache_prepare() and shmem_add_to_page_cache()
can identify concurrent swapin and large swap entry split scenarios, and
return -EEXIST so the swapin can be retried.
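To illustrate the exclusion scheme described above, here is a minimal
userspace sketch (illustration only, not kernel code): it models the
SWAP_HAS_CACHE claim taken by swapcache_prepare() and dropped by
swapcache_clear() as a per-entry test-and-set, so a racing swapin of the
same entry observes -EEXIST and retries the fault instead of reading the
swap slot twice. All model_* names are hypothetical.

/* Illustrative userspace model only; the kernel's real exclusion is the
 * SWAP_HAS_CACHE flag taken by swapcache_prepare() and dropped by
 * swapcache_clear(). */
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct model_swap_entry {
	atomic_bool has_cache;		/* models SWAP_HAS_CACHE */
};

/* Models swapcache_prepare(): claim the entry or report a racing swapin. */
static int model_swapcache_prepare(struct model_swap_entry *e)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&e->has_cache, &expected, true) ?
		0 : -EEXIST;
}

/* Models swapcache_clear(): release the claim once the folio is installed. */
static void model_swapcache_clear(struct model_swap_entry *e)
{
	atomic_store(&e->has_cache, false);
}

static int model_shmem_swapin(struct model_swap_entry *e)
{
	if (model_swapcache_prepare(e))
		return -EEXIST;	/* concurrent swapin: caller retries the fault */

	/* ... allocate folio, read it from swap, add it to the mapping ... */
	model_swapcache_clear(e);
	return 0;
}

int main(void)
{
	struct model_swap_entry entry = { .has_cache = false };

	printf("first swapin: %d\n", model_shmem_swapin(&entry));
	return 0;
}

A second caller racing on the same entry fails the prepare step with
-EEXIST and retries the fault, which mirrors the retry behaviour relied on
by the patch below.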
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/shmem.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 100 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 41d7a181ed89..a110f973dec0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1966,6 +1966,66 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
         return ERR_PTR(error);
 }
 
+static struct folio *shmem_swap_alloc_folio(struct inode *inode, struct vm_area_struct *vma,
+                pgoff_t index, swp_entry_t entry, int order, gfp_t gfp)
+{
+        struct shmem_inode_info *info = SHMEM_I(inode);
+        struct folio *new;
+        void *shadow;
+        int nr_pages;
+
+        /*
+         * We have arrived here because our zones are constrained, so don't
+         * limit chance of success by further cpuset and node constraints.
+         */
+        gfp &= ~GFP_CONSTRAINT_MASK;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+        if (order > 0) {
+                gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+
+                gfp = limit_gfp_mask(huge_gfp, gfp);
+        }
+#endif
+
+        new = shmem_alloc_folio(gfp, order, info, index);
+        if (!new)
+                return ERR_PTR(-ENOMEM);
+
+        nr_pages = folio_nr_pages(new);
+        if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
+                                           gfp, entry)) {
+                folio_put(new);
+                return ERR_PTR(-ENOMEM);
+        }
+
+        /*
+         * Prevent parallel swapin from proceeding with the swap cache flag.
+         *
+         * Of course there is another possible concurrent scenario as well,
+         * that is to say, the swap cache flag of a large folio has already
+         * been set by swapcache_prepare(), while another thread may have
+         * already split the large swap entry stored in the shmem mapping.
+         * In this case, shmem_add_to_page_cache() will help identify the
+         * concurrent swapin and return -EEXIST.
+         */
+        if (swapcache_prepare(entry, nr_pages)) {
+                folio_put(new);
+                return ERR_PTR(-EEXIST);
+        }
+
+        __folio_set_locked(new);
+        __folio_set_swapbacked(new);
+        new->swap = entry;
+
+        mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
+        shadow = get_shadow_from_swap_cache(entry);
+        if (shadow)
+                workingset_refault(new, shadow);
+        folio_add_lru(new);
+        swap_read_folio(new, NULL);
+        return new;
+}
+
 /*
  * When a page is moved from swapcache to shmem filecache (either by the
  * usual swapin of shmem_get_folio_gfp(), or by the less common swapoff of
@@ -2189,6 +2249,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         struct shmem_inode_info *info = SHMEM_I(inode);
         struct swap_info_struct *si;
         struct folio *folio = NULL;
+        bool skip_swapcache = false;
         swp_entry_t swap;
         int error, nr_pages;
 
@@ -2210,6 +2271,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         /* Look it up and read it in.. */
         folio = swap_cache_get_folio(swap, NULL, 0);
         if (!folio) {
+                int order = xa_get_order(&mapping->i_pages, index);
+                bool fallback_order0 = false;
                 int split_order;
 
                 /* Or update major stats only when swapin succeeds?? */
@@ -2219,6 +2282,33 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                         count_memcg_event_mm(fault_mm, PGMAJFAULT);
                 }
 
+                /*
+                 * If uffd is active for the vma, we need per-page fault
+                 * fidelity to maintain the uffd semantics, then fallback
+                 * to swapin order-0 folio, as well as for zswap case.
+                 */
+                if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
+                                  !zswap_never_enabled()))
+                        fallback_order0 = true;
+
+                /* Skip swapcache for synchronous device. */
+                if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+                        folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+                        if (!IS_ERR(folio)) {
+                                skip_swapcache = true;
+                                goto alloced;
+                        }
+
+                        /*
+                         * Fallback to swapin order-0 folio unless the swap entry
+                         * already exists.
+                         */
+                        error = PTR_ERR(folio);
+                        folio = NULL;
+                        if (error == -EEXIST)
+                                goto failed;
+                }
+
                 /*
                  * Now swap device can only swap in order 0 folio, then we
                  * should split the large swap entry stored in the pagecache
@@ -2249,9 +2339,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 }
         }
 
+alloced:
         /* We have to do this with folio locked to prevent races */
         folio_lock(folio);
-        if (!folio_test_swapcache(folio) ||
+        if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
             folio->swap.val != swap.val ||
             !shmem_confirm_swap(mapping, index, swap)) {
                 error = -EEXIST;
@@ -2287,7 +2378,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         if (sgp == SGP_WRITE)
                 folio_mark_accessed(folio);
 
-        delete_from_swap_cache(folio);
+        if (skip_swapcache) {
+                folio->swap.val = 0;
+                swapcache_clear(si, swap, nr_pages);
+        } else {
+                delete_from_swap_cache(folio);
+        }
         folio_mark_dirty(folio);
         swap_free_nr(swap, nr_pages);
         put_swap_device(si);
@@ -2300,6 +2396,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         if (error == -EIO)
                 shmem_set_folio_swapin_error(inode, index, folio, swap);
 unlock:
+        if (skip_swapcache)
+                swapcache_clear(si, swap, folio_nr_pages(folio));
         if (folio) {
                 folio_unlock(folio);
                 folio_put(folio);
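For reference, the 1G sequential swapin measurements above could be
reproduced with a harness along these lines. This is only an assumed
sketch, not the exact test the numbers were produced with; it needs a
configured swap device (e.g. zram), and for the mTHP cases shmem large
folios must be enabled via /sys/kernel/mm/transparent_hugepage/shmem_enabled.

/* Assumed benchmark sketch: populate 1G of shmem via memfd, push it to
 * swap with MADV_PAGEOUT, then time the sequential faults swapping it
 * back in. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define SIZE	(1024UL * 1024 * 1024)

int main(void)
{
	struct timeval t0, t1;
	volatile char sink = 0;
	char *buf;
	size_t i;
	int fd;

	fd = memfd_create("shmem-swapin", 0);
	if (fd < 0 || ftruncate(fd, SIZE)) {
		perror("memfd");
		return 1;
	}
	buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 0x5a, SIZE);		/* populate the shmem folios */
	madvise(buf, SIZE, MADV_PAGEOUT);	/* reclaim them to swap */

	gettimeofday(&t0, NULL);
	for (i = 0; i < SIZE; i += 4096)	/* sequential swapin faults */
		sink += buf[i];
	gettimeofday(&t1, NULL);

	printf("sequential swapin: %.1f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1000.0 +
	       (t1.tv_usec - t0.tv_usec) / 1000.0);
	return 0;
}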