From patchwork Fri Oct 18 06:48:04 2024
X-Patchwork-Submitter: "Sridhar, Kanchana P"
X-Patchwork-Id: 13841265
From: Kanchana P Sridhar
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
    yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
    usamaarif642@gmail.com, ryan.roberts@arm.com, ying.huang@intel.com,
    21cnbao@gmail.com, akpm@linux-foundation.org, hughd@google.com,
    willy@infradead.org, bfoster@redhat.com, dchinner@redhat.com,
    chrisl@kernel.org, david@redhat.com
Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com,
    kanchana.p.sridhar@intel.com
Subject: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
Date: Thu, 17 Oct 2024 23:48:04 -0700
Message-Id: <20241018064805.336490-7-kanchana.p.sridhar@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20241018064805.336490-1-kanchana.p.sridhar@intel.com>
References: <20241018064805.336490-1-kanchana.p.sridhar@intel.com>

This patch invokes the swapin_readahead() based batching interface to
prefetch a batch of 4K folios for zswap load with batch decompressions
in parallel using IAA hardware. swapin_readahead() prefetches folios
based on vm.page-cluster and the usefulness of prior prefetches to the
workload.
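For context when experimenting with this interface: the number of folios
swapin_readahead() will consider prefetching is bounded by 2^page-cluster.
The value shown below is only an illustrative sketch; vm.page-cluster is
an existing sysctl and is not modified by this patch:

  # Allow readahead windows of up to 2^5 = 32 pages, matching the
  # 32-folio prefetch batch used as an example later in this message.
  sysctl -w vm.page-cluster=5

  # The adaptive readahead policy can still shrink the window if prior
  # prefetches were not useful to the workload.
  cat /proc/sys/vm/page-cluster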
As folios are created in the swapcache and the readahead code calls
swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
respective folio_batches get populated with the folios to be read.

Finally, the swapin_readahead() procedures call the newly added
process_ra_batch_of_same_type(), which:

 1) Reads all the non_zswap_batch folios sequentially by calling
    swap_read_folio().
 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch, which calls
    zswap_finish_load_batch() to decompress each SWAP_CRYPTO_SUB_BATCH_SIZE
    sub-batch (i.e. up to 8 pages in a prefetch batch of, say, 32 folios)
    in parallel with IAA.

Within do_swap_page(), we try to benefit from batch decompressions in both
of these scenarios:

 1) Single-mapped, SWP_SYNCHRONOUS_IO:
    We call swapin_readahead() with "single_mapped_path = true". This is
    done only in the !zswap_never_enabled() case.
 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
    We call swapin_readahead() with "single_mapped_path = false". This
    places folios in the swapcache: a design choice that handles cases
    where a folio that is "single-mapped" in process 1 could be prefetched
    in process 2, and that handles highly contended server scenarios with
    stability.

Checks are added at the end of do_swap_page(), after the folio has been
successfully loaded, to detect whether the single-mapped swapcache folio
is still single-mapped, and if so, folio_free_swap() is called on the
folio.

Within the swapin_readahead() functions, if single_mapped_path is true and
either the platform does not have IAA, or the platform has IAA but the
user selects a software compressor for zswap (details of the sysfs knob
follow), readahead/batching are skipped and the folio is loaded using
zswap_load().

A new swap parameter "singlemapped_ra_enabled" (false by default) is added
for platforms that have IAA, where zswap_load_batching_enabled() is true,
and where we want to give the user the option to run experiments with IAA
and with software compressors for zswap (the swap device is
SWP_SYNCHRONOUS_IO):

For IAA:
  echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled

For software compressors:
  echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled

If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.

Thanks to Ying Huang for the very helpful brainstorming discussions on the
swap_read_folio() plug design.

Suggested-by: Ying Huang
Signed-off-by: Kanchana P Sridhar
---
 mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
 mm/shmem.c      |   2 +-
 mm/swap.h       |  12 ++--
 mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
 mm/swapfile.c   |   2 +-
 5 files changed, 299 insertions(+), 61 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b5745b9ffdf7..9655b85fc243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+/*
+ * swapin readahead based batching interface for zswap batched loads using IAA:
+ *
+ * Should only be called for and if the faulting swap entry in do_swap_page
+ * is single-mapped and SWP_SYNCHRONOUS_IO.
+ *
+ * Detect if the folio is in the swapcache, is still mapped to only this
+ * process, and further, there are no additional references to this folio
+ * (for e.g.
if another process simultaneously readahead this swap entry + * while this process was handling the page-fault, and got a pointer to the + * folio allocated by this process in the swapcache), besides the references + * that were obtained within __read_swap_cache_async() by this process that is + * faulting in this single-mapped swap entry. + */ +static inline bool should_free_singlemap_swapcache(swp_entry_t entry, + struct folio *folio) +{ + if (!folio_test_swapcache(folio)) + return false; + + if (__swap_count(entry) != 0) + return false; + + /* + * The folio ref count for a single-mapped folio that was allocated + * in __read_swap_cache_async(), can be a maximum of 3. These are the + * incrementors of the folio ref count in __read_swap_cache_async(): + * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru(). + */ + + if (folio_ref_count(folio) <= 3) + return true; + + return false; +} + static inline bool should_try_to_free_swap(struct folio *folio, struct vm_area_struct *vma, unsigned int fault_flags) @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) swp_entry_t entry; pte_t pte; vm_fault_t ret = 0; + bool single_mapped_swapcache = false; void *shadow = NULL; int nr_pages; unsigned long page_idx; @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { - /* skip swapcache */ - folio = alloc_swap_folio(vmf); - if (folio) { - __folio_set_locked(folio); - __folio_set_swapbacked(folio); - - nr_pages = folio_nr_pages(folio); - if (folio_test_large(folio)) - entry.val = ALIGN_DOWN(entry.val, nr_pages); - /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread - * may finish swapin first, free the entry, and - * swapout reusing the same entry. It's - * undetectable as pte_same() returns true due - * to entry reuse. - */ - if (swapcache_prepare(entry, nr_pages)) { + if (zswap_never_enabled()) { + /* skip swapcache */ + folio = alloc_swap_folio(vmf); + if (folio) { + __folio_set_locked(folio); + __folio_set_swapbacked(folio); + + nr_pages = folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val = ALIGN_DOWN(entry.val, nr_pages); /* - * Relax a bit to prevent rapid - * repeated page faults. + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread + * may finish swapin first, free the entry, and + * swapout reusing the same entry. It's + * undetectable as pte_same() returns true due + * to entry reuse. */ - add_wait_queue(&swapcache_wq, &wait); - schedule_timeout_uninterruptible(1); - remove_wait_queue(&swapcache_wq, &wait); - goto out_page; + if (swapcache_prepare(entry, nr_pages)) { + /* + * Relax a bit to prevent rapid + * repeated page faults. + */ + add_wait_queue(&swapcache_wq, &wait); + schedule_timeout_uninterruptible(1); + remove_wait_queue(&swapcache_wq, &wait); + goto out_page; + } + need_clear_cache = true; + + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); + + shadow = get_shadow_from_swap_cache(entry); + if (shadow) + workingset_refault(folio, shadow); + + folio_add_lru(folio); + + /* To provide entry to swap_read_folio() */ + folio->swap = entry; + swap_read_folio(folio, NULL, NULL, NULL); + folio->private = NULL; + } + } else { + /* + * zswap is enabled or was enabled at some point. + * Don't skip swapcache. 
+ * + * swapin readahead based batching interface + * for zswap batched loads using IAA: + * + * Readahead is invoked in this path only if + * the sys swap "singlemapped_ra_enabled" swap + * parameter is set to true. By default, + * "singlemapped_ra_enabled" is set to false, + * the recommended setting for software compressors. + * For IAA, if "singlemapped_ra_enabled" is set + * to true, readahead will be deployed in this path + * as well. + * + * For single-mapped pages, the batching interface + * calls __read_swap_cache_async() to allocate and + * place the faulting page in the swapcache. This is + * to handle a scenario where the faulting page in + * this process happens to simultaneously be a + * readahead page in another process. By placing the + * single-mapped faulting page in the swapcache, + * we avoid race conditions and duplicate page + * allocations under these scenarios. + */ + folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, + vmf, true); + if (!folio) { + ret = VM_FAULT_OOM; + goto out; } - need_clear_cache = true; - - mem_cgroup_swapin_uncharge_swap(entry, nr_pages); - - shadow = get_shadow_from_swap_cache(entry); - if (shadow) - workingset_refault(folio, shadow); - - folio_add_lru(folio); - /* To provide entry to swap_read_folio() */ - folio->swap = entry; - swap_read_folio(folio, NULL, NULL, NULL); - folio->private = NULL; - } + single_mapped_swapcache = true; + nr_pages = folio_nr_pages(folio); + swapcache = folio; + } /* swapin with zswap support. */ } else { folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, - vmf); + vmf, false); swapcache = folio; } @@ -4528,8 +4604,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * yet. */ swap_free_nr(entry, nr_pages); - if (should_try_to_free_swap(folio, vma, vmf->flags)) + if (should_try_to_free_swap(folio, vma, vmf->flags)) { folio_free_swap(folio); + single_mapped_swapcache = false; + } add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages); @@ -4619,6 +4697,30 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } + + /* + * swapin readahead based batching interface + * for zswap batched loads using IAA: + * + * Don't skip swapcache strategy for single-mapped + * pages: As described above, we place the + * single-mapped faulting page in the swapcache, + * to avoid race conditions and duplicate page + * allocations between process 1 handling a + * page-fault for a single-mapped page, while + * simultaneously, the same swap entry is a + * readahead prefetch page in another process 2. + * + * One side-effect of this, is that if the race did + * not occur, we need to clean up the swapcache + * entry and free the zswap entry for the faulting + * page, iff it is still single-mapped and is + * exclusive to this process. 
+ */ + if (single_mapped_swapcache && + data_race(should_free_singlemap_swapcache(entry, folio))) + folio_free_swap(folio); + if (si) put_swap_device(si); return ret; @@ -4638,6 +4740,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } + + if (single_mapped_swapcache && + data_race(should_free_singlemap_swapcache(entry, folio))) + folio_free_swap(folio); + if (si) put_swap_device(si); return ret; diff --git a/mm/shmem.c b/mm/shmem.c index 66eae800ffab..e4549c04f316 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1624,7 +1624,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp, struct folio *folio; mpol = shmem_get_pgoff_policy(info, index, 0, &ilx); - folio = swap_cluster_readahead(swap, gfp, mpol, ilx); + folio = swap_cluster_readahead(swap, gfp, mpol, ilx, false); mpol_cond_put(mpol); return folio; diff --git a/mm/swap.h b/mm/swap.h index 2b82c8ed765c..2861bd8f5a96 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -199,9 +199,11 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags, struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated, bool skip_if_exists); struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, - struct mempolicy *mpol, pgoff_t ilx); + struct mempolicy *mpol, pgoff_t ilx, + bool single_mapped_path); struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, - struct vm_fault *vmf); + struct vm_fault *vmf, + bool single_mapped_path); static inline unsigned int folio_swap_flags(struct folio *folio) { @@ -304,13 +306,15 @@ static inline void show_swap_cache_info(void) } static inline struct folio *swap_cluster_readahead(swp_entry_t entry, - gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx) + gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx, + bool single_mapped_path) { return NULL; } static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, - struct vm_fault *vmf) + struct vm_fault *vmf, + bool single_mapped_path) { return NULL; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 0aa938e4c34d..66ea8f7f724c 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -44,6 +44,12 @@ struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly; static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly; static bool enable_vma_readahead __read_mostly = true; +/* + * Enable readahead in single-mapped do_swap_page() path. + * Set to "true" for IAA. + */ +static bool enable_singlemapped_readahead __read_mostly = false; + #define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2) #define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1) #define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK @@ -340,6 +346,11 @@ static inline bool swap_use_vma_readahead(void) return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap); } +static inline bool swap_use_singlemapped_readahead(void) +{ + return READ_ONCE(enable_singlemapped_readahead); +} + /* * Lookup a swap entry in the swap cache. 
A found folio will be returned * unlocked and with its refcount incremented - we rely on the kernel @@ -635,12 +646,49 @@ static unsigned long swapin_nr_pages(unsigned long offset) return pages; } +static void process_ra_batch_of_same_type( + struct zswap_decomp_batch *zswap_batch, + struct folio_batch *non_zswap_batch, + swp_entry_t targ_entry, + struct swap_iocb **splug) +{ + unsigned int i; + + for (i = 0; i < folio_batch_count(non_zswap_batch); ++i) { + struct folio *folio = non_zswap_batch->folios[i]; + swap_read_folio(folio, splug, NULL, NULL); + if (folio->swap.val != targ_entry.val) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + folio_put(folio); + } + + swap_read_zswap_batch_unplug(zswap_batch, splug); + + for (i = 0; i < folio_batch_count(&zswap_batch->fbatch); ++i) { + struct folio *folio = zswap_batch->fbatch.folios[i]; + if (folio->swap.val != targ_entry.val) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + folio_put(folio); + } + + folio_batch_reinit(non_zswap_batch); + + zswap_load_batch_reinit(zswap_batch); +} + /** * swap_cluster_readahead - swap in pages in hope we need them soon * @entry: swap entry of this memory * @gfp_mask: memory allocation flags * @mpol: NUMA memory allocation policy to be applied * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE + * @single_mapped_path: Called from do_swap_page() single-mapped path. + * Only readahead if the sys "singlemapped_ra_enabled" swap parameter + * is set to true. * * Returns the struct folio for entry and addr, after queueing swapin. * @@ -654,7 +702,8 @@ static unsigned long swapin_nr_pages(unsigned long offset) * are fairly likely to have been swapped out from the same node. */ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, - struct mempolicy *mpol, pgoff_t ilx) + struct mempolicy *mpol, pgoff_t ilx, + bool single_mapped_path) { struct folio *folio; unsigned long entry_offset = swp_offset(entry); @@ -664,12 +713,22 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, struct swap_info_struct *si = swp_swap_info(entry); struct blk_plug plug; struct swap_iocb *splug = NULL; + struct zswap_decomp_batch zswap_batch; + struct folio_batch non_zswap_batch; bool page_allocated; + if (single_mapped_path && + (!swap_use_singlemapped_readahead() || + !zswap_load_batching_enabled())) + goto skip; + mask = swapin_nr_pages(offset) - 1; if (!mask) goto skip; + zswap_load_batch_init(&zswap_batch); + folio_batch_init(&non_zswap_batch); + /* Read a page_cluster sized and aligned cluster around offset. */ start_offset = offset & ~mask; end_offset = offset | mask; @@ -678,6 +737,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, if (end_offset >= si->max) end_offset = si->max - 1; + /* Note that all swap entries readahead are of the same swap type. 
*/ blk_start_plug(&plug); for (offset = start_offset; offset <= end_offset ; offset++) { /* Ok, do the async read-ahead now */ @@ -687,14 +747,22 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, if (!folio) continue; if (page_allocated) { - swap_read_folio(folio, &splug, NULL, NULL); - if (offset != entry_offset) { - folio_set_readahead(folio); - count_vm_event(SWAP_RA); + if (swap_read_folio(folio, &splug, + &zswap_batch, &non_zswap_batch)) { + if (offset != entry_offset) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + folio_put(folio); } + } else { + folio_put(folio); } - folio_put(folio); } + + process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch, + entry, &splug); + blk_finish_plug(&plug); swap_read_unplug(splug); lru_add_drain(); /* Push any new pages onto the LRU now */ @@ -1009,6 +1077,9 @@ static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, * @mpol: NUMA memory allocation policy to be applied * @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE * @vmf: fault information + * @single_mapped_path: Called from do_swap_page() single-mapped path. + * Only readahead if the sys "singlemapped_ra_enabled" swap parameter + * is set to true. * * Returns the struct folio for entry and addr, after queueing swapin. * @@ -1019,10 +1090,14 @@ static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, * */ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, - struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf) + struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf, + bool single_mapped_path) { struct blk_plug plug; struct swap_iocb *splug = NULL; + struct zswap_decomp_batch zswap_batch; + struct folio_batch non_zswap_batch; + int type = -1, prev_type = -1; struct folio *folio; pte_t *pte = NULL, pentry; int win; @@ -1031,10 +1106,18 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, pgoff_t ilx; bool page_allocated; + if (single_mapped_path && + (!swap_use_singlemapped_readahead() || + !zswap_load_batching_enabled())) + goto skip; + win = swap_vma_ra_win(vmf, &start, &end); if (win == 1) goto skip; + zswap_load_batch_init(&zswap_batch); + folio_batch_init(&non_zswap_batch); + ilx = targ_ilx - PFN_DOWN(vmf->address - start); blk_start_plug(&plug); @@ -1057,16 +1140,38 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, if (!folio) continue; if (page_allocated) { - swap_read_folio(folio, &splug, NULL, NULL); - if (addr != vmf->address) { - folio_set_readahead(folio); - count_vm_event(SWAP_RA); + type = swp_type(entry); + + /* + * Process this sub-batch before switching to + * another swap device type. + */ + if ((prev_type >= 0) && (type != prev_type)) + process_ra_batch_of_same_type(&zswap_batch, + &non_zswap_batch, + targ_entry, + &splug); + + if (swap_read_folio(folio, &splug, + &zswap_batch, &non_zswap_batch)) { + if (addr != vmf->address) { + folio_set_readahead(folio); + count_vm_event(SWAP_RA); + } + folio_put(folio); } + + prev_type = type; + } else { + folio_put(folio); } - folio_put(folio); } if (pte) pte_unmap(pte); + + process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch, + targ_entry, &splug); + blk_finish_plug(&plug); swap_read_unplug(splug); lru_add_drain(); @@ -1092,7 +1197,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask, * or vma-based(ie, virtual address based on faulty address) readahead. 
*/ struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, - struct vm_fault *vmf) + struct vm_fault *vmf, bool single_mapped_path) { struct mempolicy *mpol; pgoff_t ilx; @@ -1100,8 +1205,10 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx); folio = swap_use_vma_readahead() ? - swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) : - swap_cluster_readahead(entry, gfp_mask, mpol, ilx); + swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf, + single_mapped_path) : + swap_cluster_readahead(entry, gfp_mask, mpol, ilx, + single_mapped_path); mpol_cond_put(mpol); return folio; @@ -1126,10 +1233,30 @@ static ssize_t vma_ra_enabled_store(struct kobject *kobj, return count; } +static ssize_t singlemapped_ra_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%s\n", + enable_singlemapped_readahead ? "true" : "false"); +} +static ssize_t singlemapped_ra_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + ssize_t ret; + + ret = kstrtobool(buf, &enable_singlemapped_readahead); + if (ret) + return ret; + + return count; +} static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled); +static struct kobj_attribute singlemapped_ra_enabled_attr = __ATTR_RW(singlemapped_ra_enabled); static struct attribute *swap_attrs[] = { &vma_ra_enabled_attr.attr, + &singlemapped_ra_enabled_attr.attr, NULL, }; diff --git a/mm/swapfile.c b/mm/swapfile.c index b0915f3fab31..10367eaee1ff 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2197,7 +2197,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, }; folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, - &vmf); + &vmf, false); } if (!folio) { swp_count = READ_ONCE(si->swap_map[offset]);
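
For completeness, the two experimental configurations described in the
commit message can be selected as follows. This is only a sketch of the
intended usage, assuming the iaa_crypto driver is loaded and registers the
"deflate-iaa" algorithm; the compressor names are examples of a test setup
and are not part of this patch:

  # IAA batched-load configuration:
  echo deflate-iaa > /sys/module/zswap/parameters/compressor
  echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled

  # Software-compressor configuration (e.g. zstd), keeping readahead
  # disabled in the single-mapped SWP_SYNCHRONOUS_IO path:
  echo zstd > /sys/module/zswap/parameters/compressor
  echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled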