From patchwork Thu Nov 21 22:25:18 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13882408
Received: from mail-pl1-f172.google.com (mail-pl1-f172.google.com
 [209.85.214.172])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7AFB31DAC89
	for <linux-block@vger.kernel.org>; Thu, 21 Nov 2024 22:26:03 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.214.172
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1732227965; cv=none;
 b=WFEBeuidD6YRXlImJFusp2qY8rtVJnDCkZthmPqCd7oyfvfwf4wUrwarpC152Pao0K5Kx20VduKXhL1dB56DQuBfxk3YNPZau7+8TZompMinWM8+9Tv9IVO2KH2TQh+jUXiQlF1sjY2FC2pnbOQ9xVsIRV8zaaZwyGU6eARqNKE=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1732227965; c=relaxed/simple;
	bh=Sxv2JR84jpiFjzfM9RUm9pUV5saHTEFoY0sIJMymkj4=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=NkmauHlonvT4cHp/EB50oI59H3cQnc8+FtsplM8JLoBR61Raf2kmwr7UyW0p/gODt2g6WnTBvPlVsjupacz8mI9CP5dx7z4qBeilj/Gcva46lCFqDdGbt1gYivTr2Ob1/bxgzo4CGbfU7yCPGehxqUbfouS/bpoueha+FXLpgEU=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com;
 spf=pass smtp.mailfrom=gmail.com;
 dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b=NrhzupN7; arc=none smtp.client-ip=209.85.214.172
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="NrhzupN7"
Received: by mail-pl1-f172.google.com with SMTP id
 d9443c01a7336-212776d6449so15897765ad.1
        for <linux-block@vger.kernel.org>;
 Thu, 21 Nov 2024 14:26:03 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1732227963; x=1732832763;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=tMb405c5DZI5c/GWeHnBtUPE9wRVlUNRdpkUDwoDqsM=;
        b=NrhzupN7ex5fAHvkNJL/mMPxAzMbYvQbRCNhPBdNC9EDAobcztI48QvKLv0BrvH8uh
         J8Je0/cvPHCR/zOIVOHAEUaGB0Cp5u4ujIRp4RnTYasfoC3Cfw9hhAL0jvkiP0ncr8ys
         apSEWygUTOrgu+gw6IU917hxo1JCPnJ+LYf/SAMWCZq7EwcRgXaK33E4CNX5FvEqwh3O
         OqOiILs5hAdIbOmwVhOwRu6/AEVHDVYSP1s6hjI7TWIAZ0oY1RHUrovIfd81e6a+ttlS
         DclZ8TZ2l8t8HR9g43WFylNSW8JyMxrYyTxkUkJzmoTh99RGs9YWwJFcZwGWv86EuWYx
         nIxw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1732227963; x=1732832763;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=tMb405c5DZI5c/GWeHnBtUPE9wRVlUNRdpkUDwoDqsM=;
        b=DfLaaxnKu/HHNgohaxea5EM+wCFF1TnP6kRi6Asj+tiuOhgR+z1yU+XcSSzocLiTkk
         eZsGOgM8RhuU1tZBVngQnmBckSck17YRKzOJD/iyMkgsv39CwMENEoLyUpcRA6vyCyCI
         DW11GslzOr8duMuuKanO1SLuK+BB7wooDFn34YUa8OxCUvi6HpggDyDfiFqmHXwrNSzb
         EG2BKyBrYGSGXjPYtC0utU3/n3EtpqctAFIGNbO1Oh4FwGT7WRchpOkSC8gT/qeny9ba
         JbGWnsuPR5LYcWkdmkcTBfCALS8CNhN6taTV0edM2F6tx5LPlRniNLlMwGjfuVq1JFgu
         BxmA==
X-Forwarded-Encrypted: i=1;
 AJvYcCWVhFxci49Pe7J+OTTClmT/KNURVxU7wOX69cwyvHNSywzflobRPUkMh/iBOeiCas0+RcYzLRLoKCOC+w==@vger.kernel.org
X-Gm-Message-State: AOJu0YxEzms1f1aw5x0/9TuqzBruNJ4IdEgL7Ek5D8HQpcLq67uNXrmT
	NXU5se4qqAAMsQaH/T4pth7zT1yMLsgZ3rtGZJlLnUZeosaitQc+
X-Gm-Gg: ASbGnctveaaWG5xRRFd8FZkcVsTAmNYxulper9KFYjYoAAIjELTpAgQgM2MXqql6m57
	X/lyRnXUxHiOFG+d/fQpRUM3BPuNZ15/UxWW6wcYf37iJky4K0Kgu8GXB5/JngAl5VQaub4YubT
	5EtPyWWILmi8knWNNumD1193xBbPYMXT9XBzrKWVMyY1ZNPdjVz+cbvaS+rLHBf1KlNvCgBbYqY
	Th6HBxGj/dv4Ol/2dNrrzU+2PQXdWEmaNmjChj9krkGCSAqeZb+CR5vYkntxgZaqEoGy87u
X-Google-Smtp-Source: 
 AGHT+IH1F7rSx1mWbIYnSHXRFDY20mIXNnt2o6SpVteSmqhSA3+G7SsM/KO4zEEQbFBp87aTMXC5tA==
X-Received: by 2002:a17:902:d4cb:b0:20c:8331:cb6e with SMTP id
 d9443c01a7336-2129f53714cmr8717185ad.19.1732227962445;
        Thu, 21 Nov 2024 14:26:02 -0800 (PST)
Received: from Barrys-MBP.hub ([2407:7000:8942:5500:9d64:b0ba:faf2:680e])
        by smtp.gmail.com with ESMTPSA id
 d9443c01a7336-2129dba22f4sm3334745ad.100.2024.11.21.14.25.54
        (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256);
        Thu, 21 Nov 2024 14:26:01 -0800 (PST)
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org,
	linux-mm@kvack.org
Cc: axboe@kernel.dk,
	bala.seshasayee@linux.intel.com,
	chrisl@kernel.org,
	david@redhat.com,
	hannes@cmpxchg.org,
	kanchana.p.sridhar@intel.com,
	kasong@tencent.com,
	linux-block@vger.kernel.org,
	minchan@kernel.org,
	nphamcs@gmail.com,
	ryan.roberts@arm.com,
	senozhatsky@chromium.org,
	surenb@google.com,
	terrelln@fb.com,
	usamaarif642@gmail.com,
	v-songbaohua@oppo.com,
	wajdi.k.feghali@intel.com,
	willy@infradead.org,
	ying.huang@intel.com,
	yosryahmed@google.com,
	yuzhao@google.com,
	zhengtangquan@oppo.com,
	zhouchengming@bytedance.com
Subject: [PATCH RFC v3 1/4] mm: zsmalloc: support objects compressed based on
 multiple pages
Date: Fri, 22 Nov 2024 11:25:18 +1300
Message-Id: <20241121222521.83458-2-21cnbao@gmail.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
In-Reply-To: <20241121222521.83458-1-21cnbao@gmail.com>
References: <20241121222521.83458-1-21cnbao@gmail.com>
Precedence: bulk
X-Mailing-List: linux-block@vger.kernel.org
List-Id: <linux-block.vger.kernel.org>
List-Subscribe: <mailto:linux-block+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-block+unsubscribe@vger.kernel.org>
MIME-Version: 1.0

From: Tangquan Zheng <zhengtangquan@oppo.com>

This patch introduces support for zsmalloc to store compressed objects
based on multi-pages. Previously, a large folio with nr_pages subpages
would undergo compression one by one, each at the granularity of
PAGE_SIZE. However, by compressing them at a larger granularity, we
can conserve both memory and CPU resources.

We define the granularity with a configuration option called
ZSMALLOC_MULTI_PAGES_ORDER, set to a default value of 2, which matches
the minimum order of anonymous mTHP. As a result, a large folio with
8 subpages will now be split into 2 parts instead of 8.

The introduction of the multi-pages feature necessitates the creation
of new size classes to accommodate it.

Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/block/zram/zram_drv.c |   3 +-
 include/linux/zsmalloc.h      |  10 +-
 mm/Kconfig                    |  18 +++
 mm/zsmalloc.c                 | 235 ++++++++++++++++++++++++++--------
 4 files changed, 207 insertions(+), 59 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 3dee026988dc..6cb7d1e57362 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1461,8 +1461,7 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 		return false;
 	}
 
-	if (!huge_class_size)
-		huge_class_size = zs_huge_class_size(zram->mem_pool);
+	huge_class_size = zs_huge_class_size(zram->mem_pool, 0);
 
 	for (index = 0; index < num_pages; index++)
 		spin_lock_init(&zram->table[index].lock);
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..9fa3e7669557 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -33,6 +33,14 @@ enum zs_mapmode {
 	 */
 };
 
+enum zsmalloc_type {
+	ZSMALLOC_TYPE_BASEPAGE,
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+	ZSMALLOC_TYPE_MULTI_PAGES,
+#endif
+	ZSMALLOC_TYPE_MAX,
+};
+
 struct zs_pool_stats {
 	/* How many pages were migrated (freed) */
 	atomic_long_t pages_compacted;
@@ -46,7 +54,7 @@ void zs_destroy_pool(struct zs_pool *pool);
 unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
 void zs_free(struct zs_pool *pool, unsigned long obj);
 
-size_t zs_huge_class_size(struct zs_pool *pool);
+size_t zs_huge_class_size(struct zs_pool *pool, enum zsmalloc_type type);
 
 void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 			enum zs_mapmode mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index 33fa51d608dc..6b302b66fc0a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -237,6 +237,24 @@ config ZSMALLOC_CHAIN_SIZE
 
 	  For more information, see zsmalloc documentation.
 
+config ZSMALLOC_MULTI_PAGES
+	bool "support zsmalloc multiple pages"
+	depends on ZSMALLOC && !CONFIG_HIGHMEM
+	help
+	  This option configures zsmalloc to support allocations larger than
+	  PAGE_SIZE, enabling compression across multiple pages. The size of
+	  these multiple pages is determined by the configured
+	  ZSMALLOC_MULTI_PAGES_ORDER.
+
+config ZSMALLOC_MULTI_PAGES_ORDER
+	int "zsmalloc multiple pages order"
+	default 2
+	range 1 9
+	depends on ZSMALLOC_MULTI_PAGES
+	help
+	  This option is used to configure zsmalloc to support the compression
+	  of multiple pages.
+
 menu "Slab allocator options"
 
 config SLUB
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 64b66a4d3e6e..ab57266b43f6 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -70,6 +70,12 @@
 
 #define ZSPAGE_MAGIC	0x58
 
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZSMALLOC_MULTI_PAGES_ORDER	(_AC(CONFIG_ZSMALLOC_MULTI_PAGES_ORDER, UL))
+#define ZSMALLOC_MULTI_PAGES_NR		(1 << ZSMALLOC_MULTI_PAGES_ORDER)
+#define ZSMALLOC_MULTI_PAGES_SIZE	(PAGE_SIZE * ZSMALLOC_MULTI_PAGES_NR)
+#endif
+
 /*
  * This must be power of 2 and greater than or equal to sizeof(link_free).
  * These two conditions ensure that any 'struct link_free' itself doesn't
@@ -120,7 +126,8 @@
 
 #define HUGE_BITS	1
 #define FULLNESS_BITS	4
-#define CLASS_BITS	8
+#define CLASS_BITS	9
+#define ISOLATED_BITS	5
 #define MAGIC_VAL_BITS	8
 
 #define ZS_MAX_PAGES_PER_ZSPAGE	(_AC(CONFIG_ZSMALLOC_CHAIN_SIZE, UL))
@@ -129,7 +136,11 @@
 #define ZS_MIN_ALLOC_SIZE \
 	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
 /* each chunk includes extra space to keep handle */
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_MAX_ALLOC_SIZE	(ZSMALLOC_MULTI_PAGES_SIZE)
+#else
 #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
+#endif
 
 /*
  * On systems with 4K page size, this gives 255 size classes! There is a
@@ -144,9 +155,22 @@
  *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
  *  (reason above)
  */
-#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
-#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
-				      ZS_SIZE_CLASS_DELTA) + 1)
+
+#define ZS_PAGE_SIZE_CLASS_DELTA	(PAGE_SIZE >> (CLASS_BITS - 1))
+#define ZS_PAGE_SIZE_CLASSES	(DIV_ROUND_UP(PAGE_SIZE - ZS_MIN_ALLOC_SIZE, \
+				      ZS_PAGE_SIZE_CLASS_DELTA) + 1)
+
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_MULTI_PAGES_SIZE_CLASS_DELTA	(ZSMALLOC_MULTI_PAGES_SIZE >> (CLASS_BITS - 1))
+#define ZS_MULTI_PAGES_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - PAGE_SIZE, \
+				      ZS_MULTI_PAGES_SIZE_CLASS_DELTA) + 1)
+#endif
+
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+#define ZS_SIZE_CLASSES	(ZS_PAGE_SIZE_CLASSES + ZS_MULTI_PAGES_SIZE_CLASSES)
+#else
+#define ZS_SIZE_CLASSES	(ZS_PAGE_SIZE_CLASSES)
+#endif
 
 /*
  * Pages are distinguished by the ratio of used memory (that is the ratio
@@ -182,7 +206,8 @@ struct zs_size_stat {
 static struct dentry *zs_stat_root;
 #endif
 
-static size_t huge_class_size;
+/* huge_class_size[0] for page, huge_class_size[1] for multiple pages. */
+static size_t huge_class_size[ZSMALLOC_TYPE_MAX];
 
 struct size_class {
 	spinlock_t lock;
@@ -260,6 +285,29 @@ struct zspage {
 	rwlock_t lock;
 };
 
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+static inline unsigned int class_size_to_zs_order(unsigned long size)
+{
+	unsigned int order = 0;
+
+	/* used large order to alloc page for zspage when class_size > PAGE_SIZE */
+	if (size > PAGE_SIZE)
+		return ZSMALLOC_MULTI_PAGES_ORDER;
+
+	return order;
+}
+#else
+static inline unsigned int class_size_to_zs_order(unsigned long size)
+{
+	return 0;
+}
+#endif
+
+static inline unsigned long class_size_to_zs_size(unsigned long size)
+{
+	return PAGE_SIZE * (1 << class_size_to_zs_order(size));
+}
+
 struct mapping_area {
 	local_lock_t lock;
 	char *vm_buf; /* copy buffer for objects that span pages */
@@ -510,11 +558,22 @@ static int get_size_class_index(int size)
 {
 	int idx = 0;
 
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+	if (size > PAGE_SIZE + ZS_HANDLE_SIZE) {
+		idx = ZS_PAGE_SIZE_CLASSES;
+		idx += DIV_ROUND_UP(size - PAGE_SIZE,
+				ZS_MULTI_PAGES_SIZE_CLASS_DELTA);
+
+		return min_t(int, ZS_SIZE_CLASSES - 1, idx);
+	}
+#endif
+
+	idx = 0;
 	if (likely(size > ZS_MIN_ALLOC_SIZE))
-		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
-				ZS_SIZE_CLASS_DELTA);
+		idx += DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
+				ZS_PAGE_SIZE_CLASS_DELTA);
 
-	return min_t(int, ZS_SIZE_CLASSES - 1, idx);
+	return  min_t(int, ZS_PAGE_SIZE_CLASSES - 1, idx);
 }
 
 static inline void class_stat_add(struct size_class *class, int type,
@@ -564,11 +623,11 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 	unsigned long total_freeable = 0;
 	unsigned long inuse_totals[NR_FULLNESS_GROUPS] = {0, };
 
-	seq_printf(s, " %5s %5s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %13s %10s %10s %16s %8s\n",
-			"class", "size", "10%", "20%", "30%", "40%",
+	seq_printf(s, " %5s %5s %5s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %9s %13s %10s %10s %16s %16s %8s\n",
+			"class", "size", "order", "10%", "20%", "30%", "40%",
 			"50%", "60%", "70%", "80%", "90%", "99%", "100%",
 			"obj_allocated", "obj_used", "pages_used",
-			"pages_per_zspage", "freeable");
+			"pages_per_zspage", "objs_per_zspage", "freeable");
 
 	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
 
@@ -579,7 +638,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 
 		spin_lock(&class->lock);
 
-		seq_printf(s, " %5u %5u ", i, class->size);
+		seq_printf(s, " %5u %5u %5u", i, class->size, class_size_to_zs_order(class->size));
 		for (fg = ZS_INUSE_RATIO_10; fg < NR_FULLNESS_GROUPS; fg++) {
 			inuse_totals[fg] += class_stat_read(class, fg);
 			seq_printf(s, "%9lu ", class_stat_read(class, fg));
@@ -594,9 +653,9 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
 		pages_used = obj_allocated / objs_per_zspage *
 				class->pages_per_zspage;
 
-		seq_printf(s, "%13lu %10lu %10lu %16d %8lu\n",
+		seq_printf(s, "%13lu %10lu %10lu %16d %16d %8lu\n",
 			   obj_allocated, obj_used, pages_used,
-			   class->pages_per_zspage, freeable);
+			   class->pages_per_zspage, objs_per_zspage, freeable);
 
 		total_objs += obj_allocated;
 		total_used_objs += obj_used;
@@ -811,7 +870,8 @@ static inline bool obj_allocated(struct page *page, void *obj,
 
 static void reset_page(struct page *page)
 {
-	__ClearPageMovable(page);
+	if (PageMovable(page))
+		__ClearPageMovable(page);
 	ClearPagePrivate(page);
 	set_page_private(page, 0);
 	page->index = 0;
@@ -863,7 +923,8 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
 	cache_free_zspage(pool, zspage);
 
 	class_stat_sub(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
-	atomic_long_sub(class->pages_per_zspage, &pool->pages_allocated);
+	atomic_long_sub(class->pages_per_zspage * (1 << class_size_to_zs_order(class->size)),
+			&pool->pages_allocated);
 }
 
 static void free_zspage(struct zs_pool *pool, struct size_class *class,
@@ -892,6 +953,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 	unsigned int freeobj = 1;
 	unsigned long off = 0;
 	struct page *page = get_first_page(zspage);
+	unsigned long page_size = class_size_to_zs_size(class->size);
 
 	while (page) {
 		struct page *next_page;
@@ -903,7 +965,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 		vaddr = kmap_local_page(page);
 		link = (struct link_free *)vaddr + off / sizeof(*link);
 
-		while ((off += class->size) < PAGE_SIZE) {
+		while ((off += class->size) < page_size) {
 			link->next = freeobj++ << OBJ_TAG_BITS;
 			link += class->size / sizeof(*link);
 		}
@@ -925,7 +987,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 		}
 		kunmap_local(vaddr);
 		page = next_page;
-		off %= PAGE_SIZE;
+		off %= page_size;
 	}
 
 	set_freeobj(zspage, 0);
@@ -975,6 +1037,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
 	struct zspage *zspage = cache_alloc_zspage(pool, gfp);
 
+	unsigned int order = class_size_to_zs_order(class->size);
+
 	if (!zspage)
 		return NULL;
 
@@ -984,12 +1048,14 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 	for (i = 0; i < class->pages_per_zspage; i++) {
 		struct page *page;
 
-		page = alloc_page(gfp);
+		if (order > 0)
+			gfp &= ~__GFP_MOVABLE;
+		page = alloc_pages(gfp | __GFP_COMP, order);
 		if (!page) {
 			while (--i >= 0) {
 				dec_zone_page_state(pages[i], NR_ZSPAGES);
 				__ClearPageZsmalloc(pages[i]);
-				__free_page(pages[i]);
+				__free_pages(pages[i], order);
 			}
 			cache_free_zspage(pool, zspage);
 			return NULL;
@@ -1047,7 +1113,9 @@ static void *__zs_map_object(struct mapping_area *area,
 			struct page *pages[2], int off, int size)
 {
 	size_t sizes[2];
+	void *addr;
 	char *buf = area->vm_buf;
+	unsigned long page_size = class_size_to_zs_size(size);
 
 	/* disable page faults to match kmap_local_page() return conditions */
 	pagefault_disable();
@@ -1056,12 +1124,16 @@ static void *__zs_map_object(struct mapping_area *area,
 	if (area->vm_mm == ZS_MM_WO)
 		goto out;
 
-	sizes[0] = PAGE_SIZE - off;
+	sizes[0] = page_size - off;
 	sizes[1] = size - sizes[0];
 
 	/* copy object to per-cpu buffer */
-	memcpy_from_page(buf, pages[0], off, sizes[0]);
-	memcpy_from_page(buf + sizes[0], pages[1], 0, sizes[1]);
+	addr = kmap_local_page(pages[0]);
+	memcpy(buf, addr + off, sizes[0]);
+	kunmap_local(addr);
+	addr = kmap_local_page(pages[1]);
+	memcpy(buf + sizes[0], addr, sizes[1]);
+	kunmap_local(addr);
 out:
 	return area->vm_buf;
 }
@@ -1070,7 +1142,9 @@ static void __zs_unmap_object(struct mapping_area *area,
 			struct page *pages[2], int off, int size)
 {
 	size_t sizes[2];
+	void *addr;
 	char *buf;
+	unsigned long page_size = class_size_to_zs_size(size);
 
 	/* no write fastpath */
 	if (area->vm_mm == ZS_MM_RO)
@@ -1081,12 +1155,16 @@ static void __zs_unmap_object(struct mapping_area *area,
 	size -= ZS_HANDLE_SIZE;
 	off += ZS_HANDLE_SIZE;
 
-	sizes[0] = PAGE_SIZE - off;
+	sizes[0] = page_size - off;
 	sizes[1] = size - sizes[0];
 
 	/* copy per-cpu buffer to object */
-	memcpy_to_page(pages[0], off, buf, sizes[0]);
-	memcpy_to_page(pages[1], 0, buf + sizes[0], sizes[1]);
+	addr = kmap_local_page(pages[0]);
+	memcpy(addr + off, buf, sizes[0]);
+	kunmap_local(addr);
+	addr = kmap_local_page(pages[1]);
+	memcpy(addr, buf + sizes[0], sizes[1]);
+	kunmap_local(addr);
 
 out:
 	/* enable page faults to match kunmap_local() return conditions */
@@ -1184,6 +1262,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	struct mapping_area *area;
 	struct page *pages[2];
 	void *ret;
+	unsigned long page_size;
+	unsigned long page_mask;
 
 	/*
 	 * Because we use per-cpu mapping areas shared among the
@@ -1208,12 +1288,14 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	read_unlock(&pool->migrate_lock);
 
 	class = zspage_class(pool, zspage);
-	off = offset_in_page(class->size * obj_idx);
+	page_size = class_size_to_zs_size(class->size);
+	page_mask = ~(page_size - 1);
+	off = (class->size * obj_idx) & ~page_mask;
 
 	local_lock(&zs_map_area.lock);
 	area = this_cpu_ptr(&zs_map_area);
 	area->vm_mm = mm;
-	if (off + class->size <= PAGE_SIZE) {
+	if (off + class->size <= page_size) {
 		/* this object is contained entirely within a page */
 		area->vm_addr = kmap_local_page(page);
 		ret = area->vm_addr + off;
@@ -1243,15 +1325,20 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 
 	struct size_class *class;
 	struct mapping_area *area;
+	unsigned long page_size;
+	unsigned long page_mask;
 
 	obj = handle_to_obj(handle);
 	obj_to_location(obj, &page, &obj_idx);
 	zspage = get_zspage(page);
 	class = zspage_class(pool, zspage);
-	off = offset_in_page(class->size * obj_idx);
+
+	page_size = class_size_to_zs_size(class->size);
+	page_mask = ~(page_size - 1);
+	off = (class->size * obj_idx) & ~page_mask;
 
 	area = this_cpu_ptr(&zs_map_area);
-	if (off + class->size <= PAGE_SIZE)
+	if (off + class->size <= page_size)
 		kunmap_local(area->vm_addr);
 	else {
 		struct page *pages[2];
@@ -1281,9 +1368,9 @@ EXPORT_SYMBOL_GPL(zs_unmap_object);
  *
  * Return: the size (in bytes) of the first huge zsmalloc &size_class.
  */
-size_t zs_huge_class_size(struct zs_pool *pool)
+size_t zs_huge_class_size(struct zs_pool *pool, enum zsmalloc_type type)
 {
-	return huge_class_size;
+	return huge_class_size[type];
 }
 EXPORT_SYMBOL_GPL(zs_huge_class_size);
 
@@ -1298,13 +1385,21 @@ static unsigned long obj_malloc(struct zs_pool *pool,
 	struct page *m_page;
 	unsigned long m_offset;
 	void *vaddr;
+	unsigned long page_size;
+	unsigned long page_mask;
+	unsigned long page_shift;
 
 	class = pool->size_class[zspage->class];
 	obj = get_freeobj(zspage);
 
 	offset = obj * class->size;
-	nr_page = offset >> PAGE_SHIFT;
-	m_offset = offset_in_page(offset);
+	page_size = class_size_to_zs_size(class->size);
+	page_shift = PAGE_SHIFT + class_size_to_zs_order(class->size);
+	page_mask = ~(page_size - 1);
+
+	nr_page = offset >> page_shift;
+	m_offset = offset & ~page_mask;
+
 	m_page = get_first_page(zspage);
 
 	for (i = 0; i < nr_page; i++)
@@ -1385,12 +1480,14 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 	obj_malloc(pool, zspage, handle);
 	newfg = get_fullness_group(class, zspage);
 	insert_zspage(class, zspage, newfg);
-	atomic_long_add(class->pages_per_zspage, &pool->pages_allocated);
+	atomic_long_add(class->pages_per_zspage * (1 << class_size_to_zs_order(class->size)),
+			&pool->pages_allocated);
 	class_stat_add(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
 	class_stat_add(class, ZS_OBJS_INUSE, 1);
 
 	/* We completely set up zspage so mark them as movable */
-	SetZsPageMovable(pool, zspage);
+	if (class_size_to_zs_order(class->size) == 0)
+		SetZsPageMovable(pool, zspage);
 out:
 	spin_unlock(&class->lock);
 
@@ -1406,9 +1503,14 @@ static void obj_free(int class_size, unsigned long obj)
 	unsigned long f_offset;
 	unsigned int f_objidx;
 	void *vaddr;
+	unsigned long page_size;
+	unsigned long page_mask;
 
 	obj_to_location(obj, &f_page, &f_objidx);
-	f_offset = offset_in_page(class_size * f_objidx);
+	page_size = class_size_to_zs_size(class_size);
+	page_mask = ~(page_size - 1);
+
+	f_offset = (class_size * f_objidx) & ~page_mask;
 	zspage = get_zspage(f_page);
 
 	vaddr = kmap_local_page(f_page);
@@ -1469,20 +1571,22 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 	void *s_addr, *d_addr;
 	int s_size, d_size, size;
 	int written = 0;
+	unsigned long page_size = class_size_to_zs_size(class->size);
+	unsigned long page_mask =  ~(page_size - 1);
 
 	s_size = d_size = class->size;
 
 	obj_to_location(src, &s_page, &s_objidx);
 	obj_to_location(dst, &d_page, &d_objidx);
 
-	s_off = offset_in_page(class->size * s_objidx);
-	d_off = offset_in_page(class->size * d_objidx);
+	s_off = (class->size * s_objidx) & ~page_mask;
+	d_off = (class->size * d_objidx) & ~page_mask;
 
-	if (s_off + class->size > PAGE_SIZE)
-		s_size = PAGE_SIZE - s_off;
+	if (s_off + class->size > page_size)
+		s_size = page_size - s_off;
 
-	if (d_off + class->size > PAGE_SIZE)
-		d_size = PAGE_SIZE - d_off;
+	if (d_off + class->size > page_size)
+		d_size = page_size - d_off;
 
 	s_addr = kmap_local_page(s_page);
 	d_addr = kmap_local_page(d_page);
@@ -1507,7 +1611,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 		 * kunmap_local(d_addr). For more details see
 		 * Documentation/mm/highmem.rst.
 		 */
-		if (s_off >= PAGE_SIZE) {
+		if (s_off >= page_size) {
 			kunmap_local(d_addr);
 			kunmap_local(s_addr);
 			s_page = get_next_page(s_page);
@@ -1517,7 +1621,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 			s_off = 0;
 		}
 
-		if (d_off >= PAGE_SIZE) {
+		if (d_off >= page_size) {
 			kunmap_local(d_addr);
 			d_page = get_next_page(d_page);
 			d_addr = kmap_local_page(d_page);
@@ -1541,11 +1645,12 @@ static unsigned long find_alloced_obj(struct size_class *class,
 	int index = *obj_idx;
 	unsigned long handle = 0;
 	void *addr = kmap_local_page(page);
+	unsigned long page_size = class_size_to_zs_size(class->size);
 
 	offset = get_first_obj_offset(page);
 	offset += class->size * index;
 
-	while (offset < PAGE_SIZE) {
+	while (offset < page_size) {
 		if (obj_allocated(page, addr + offset, &handle))
 			break;
 
@@ -1765,6 +1870,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
 	unsigned long handle;
 	unsigned long old_obj, new_obj;
 	unsigned int obj_idx;
+	unsigned int page_size = PAGE_SIZE;
 
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
@@ -1781,6 +1887,7 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
 	 */
 	write_lock(&pool->migrate_lock);
 	class = zspage_class(pool, zspage);
+	page_size = class_size_to_zs_size(class->size);
 
 	/*
 	 * the class lock protects zpage alloc/free in the zspage.
@@ -1796,10 +1903,10 @@ static int zs_page_migrate(struct page *newpage, struct page *page,
 	 * Here, any user cannot access all objects in the zspage so let's move.
 	 */
 	d_addr = kmap_local_page(newpage);
-	copy_page(d_addr, s_addr);
+	memcpy(d_addr, s_addr, page_size);
 	kunmap_local(d_addr);
 
-	for (addr = s_addr + offset; addr < s_addr + PAGE_SIZE;
+	for (addr = s_addr + offset; addr < s_addr + page_size;
 					addr += class->size) {
 		if (obj_allocated(page, addr, &handle)) {
 
@@ -2085,6 +2192,7 @@ static int calculate_zspage_chain_size(int class_size)
 {
 	int i, min_waste = INT_MAX;
 	int chain_size = 1;
+	unsigned long page_size = class_size_to_zs_size(class_size);
 
 	if (is_power_of_2(class_size))
 		return chain_size;
@@ -2092,7 +2200,7 @@ static int calculate_zspage_chain_size(int class_size)
 	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
 		int waste;
 
-		waste = (i * PAGE_SIZE) % class_size;
+		waste = (i * page_size) % class_size;
 		if (waste < min_waste) {
 			min_waste = waste;
 			chain_size = i;
@@ -2138,18 +2246,33 @@ struct zs_pool *zs_create_pool(const char *name)
 	 * for merging should be larger or equal to current size.
 	 */
 	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
-		int size;
+		unsigned int size = 0;
 		int pages_per_zspage;
 		int objs_per_zspage;
 		struct size_class *class;
 		int fullness;
+		int order = 0;
+		int idx = ZSMALLOC_TYPE_BASEPAGE;
+
+		if (i < ZS_PAGE_SIZE_CLASSES)
+			size = ZS_MIN_ALLOC_SIZE + i * ZS_PAGE_SIZE_CLASS_DELTA;
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+		if (i >= ZS_PAGE_SIZE_CLASSES)
+			size = PAGE_SIZE + (i - ZS_PAGE_SIZE_CLASSES) *
+					   ZS_MULTI_PAGES_SIZE_CLASS_DELTA;
+#endif
 
-		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
 		if (size > ZS_MAX_ALLOC_SIZE)
 			size = ZS_MAX_ALLOC_SIZE;
-		pages_per_zspage = calculate_zspage_chain_size(size);
-		objs_per_zspage = pages_per_zspage * PAGE_SIZE / size;
 
+#ifdef CONFIG_ZSMALLOC_MULTI_PAGES
+		order = class_size_to_zs_order(size);
+		if (order == ZSMALLOC_MULTI_PAGES_ORDER)
+			idx = ZSMALLOC_TYPE_MULTI_PAGES;
+#endif
+
+		pages_per_zspage = calculate_zspage_chain_size(size);
+		objs_per_zspage = pages_per_zspage * PAGE_SIZE * (1 << order) / size;
 		/*
 		 * We iterate from biggest down to smallest classes,
 		 * so huge_class_size holds the size of the first huge
@@ -2157,8 +2280,8 @@ struct zs_pool *zs_create_pool(const char *name)
 		 * endup in the huge class.
 		 */
 		if (pages_per_zspage != 1 && objs_per_zspage != 1 &&
-				!huge_class_size) {
-			huge_class_size = size;
+				!huge_class_size[idx]) {
+			huge_class_size[idx] = size;
 			/*
 			 * The object uses ZS_HANDLE_SIZE bytes to store the
 			 * handle. We need to subtract it, because zs_malloc()
@@ -2168,7 +2291,7 @@ struct zs_pool *zs_create_pool(const char *name)
 			 * class because it grows by ZS_HANDLE_SIZE extra bytes
 			 * right before class lookup.
 			 */
-			huge_class_size -= (ZS_HANDLE_SIZE - 1);
+			huge_class_size[idx] -= (ZS_HANDLE_SIZE - 1);
 		}
 
 		/*

From patchwork Thu Nov 21 22:25:19 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13882409
Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com
 [209.85.214.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 493951537CE
	for <linux-block@vger.kernel.org>; Thu, 21 Nov 2024 22:26:14 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.214.177
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1732227977; cv=none;
 b=goldh6tjCg+UbowVa5WwgDI9nrizcjdxIRKfuJ6W5GLFXlHGQObLOrs08J4Hs8vr8iB4wrcCeeVZNgnyGutLpExt1oRfkP8w+ly02SzWqFAF01Iu+9eCxtWBNk5UpsdKgs0HkhvmJVx3WsLIE2RbXU/mFLUFJCn+zSWEdBJTsEs=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1732227977; c=relaxed/simple;
	bh=qG6OTJCMemddpko/GNZEkgbpiDd4aOtsAL+esH98Dys=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=sMONVfIF6xAcwdBVzAZpZ0hQiCDL5F4ABQh2XWsYPoO4YXoYDXUwkv7b1SCfvgD0y8zsYkpTSBwLzH2dwQhGwzT+nEqdDMl8KONJbpAkseWxEdnJ8xtIqLjR08aK++Fvuz9d1GmBsWBAsNtoSTlkDPoDrIVtsfkS6GR+OX7KoTA=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com;
 spf=pass smtp.mailfrom=gmail.com;
 dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b=fJZB1kJH; arc=none smtp.client-ip=209.85.214.177
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="fJZB1kJH"
Received: by mail-pl1-f177.google.com with SMTP id
 d9443c01a7336-21271dc4084so14526605ad.2
        for <linux-block@vger.kernel.org>;
 Thu, 21 Nov 2024 14:26:14 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1732227973; x=1732832773;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=BCNqlqQ4rfs7IHemmBfATy34F1QXfYdj79CZjZxBivg=;
        b=fJZB1kJHKnFkeBYRsSUud02/SGaFpEoq6SYbIbeG34f+Ul4eKQeLfRJW+RD0tEdnEc
         Saa5r+D2Ci7jdlGAqIAw77eHkGhQnPq20XUiriaxfIebYU7hZmGBFG9Nhd/chz1e7GVY
         XtvnakWHNveFMFnviQl9d+O4aUs/TLFQDt2jdpb59isEIq43OypLf6Cd1HU9eFsDK6D0
         IECWiOZT0Cvz/okCwr3bZ6AbgQ46uxDtLMmeuHoPQz9uMEiWT4fZ7KsIoIWWb+3NET/j
         L+wVXBIxMkoIhttQA1eB8k+lKCxQT6CToNM7GrLo+LieFjBvTwCwUK0O+zB1MIop94fo
         ttKg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1732227973; x=1732832773;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=BCNqlqQ4rfs7IHemmBfATy34F1QXfYdj79CZjZxBivg=;
        b=B3Q8mK6cdW7j/GEo4spjFPWyjGJXprNn9e5iX2vcmOE9TA58RHi3ldopBszCWqqG0j
         ta0LUi3Z0BkuQgNlmMzR/HWRhfvROL3qjdqi/qU7mvVm77Y2EqyXETG05AY2k/AQ5gg1
         zmPsnT+j5dF8zochgZWn6CvN/zF2Z5EeQB6XuuSuSCCj6EvVwF910H44lR2MxJ3HCgTx
         Ww8RkN7K8D44+7H6ZBOJjL3L0RvAeOgtlKLf2OFIzABOHvjITd20Ke2+Oo2zPqhJ7VbR
         6R3hvGu8ULoY57cWsN6amWkv4ZgUVYWtI2vxHsgSnf47dUsFfeTfZ3PzOmSHzXAshW2t
         fMZA==
X-Forwarded-Encrypted: i=1;
 AJvYcCWjhNMA5pvvz4V3WlaTZBcjTYzJjHgQTXmiAo7X3ox/Zp8rW85RyH9FGQouLQdAM5yUEbfD2QGl4nVbzw==@vger.kernel.org
X-Gm-Message-State: AOJu0Yw9/QwpBPp8MWWFi0EClCjg4anF38f7upnsGusj0iOmEgoXYNMe
	KcLk/pR8s2E1y/E10sCMwqfdcHzLz1L6ymPKiD7XsVMhoye+HP1z
X-Gm-Gg: ASbGnctMYqn0BmW+VrlhU5F40TDKE6X6lFHwVxu9rADHgUF6AOr2IglHDW4zt5B5DAy
	VbRVq0TqrW3YMb4JwmP6REZHlu2REmk4zwfzuPEnMiZSvCj/MgyiAbNoQ6H4LUqhw90ojIZhp/y
	EH7/DEblfyUmdZJKfTo5GkPl7qsspLSppcRJlWyjTrjF1Sws8J0dafayO0BBIf/aJGheJHTF37s
	JWj+B2GLt6XKcTOIwtLQMPDQClIrD+AiRhHdO4Us+tLssQfKiyAC8RbG/h/LxOnUKdV3OYT
X-Google-Smtp-Source: 
 AGHT+IEIYvAU6oHpRQ9h3xJVhAtrmO6W7kXx41o4KSLXJIlB4kYtYFzhhJPkt8xEcKZau4F/uTnsCw==
X-Received: by 2002:a17:902:d4cb:b0:212:996:3536 with SMTP id
 d9443c01a7336-2129f5c3cf0mr8895795ad.10.1732227973295;
        Thu, 21 Nov 2024 14:26:13 -0800 (PST)
Received: from Barrys-MBP.hub ([2407:7000:8942:5500:9d64:b0ba:faf2:680e])
        by smtp.gmail.com with ESMTPSA id
 d9443c01a7336-2129dba22f4sm3334745ad.100.2024.11.21.14.26.05
        (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256);
        Thu, 21 Nov 2024 14:26:12 -0800 (PST)
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org,
	linux-mm@kvack.org
Cc: axboe@kernel.dk,
	bala.seshasayee@linux.intel.com,
	chrisl@kernel.org,
	david@redhat.com,
	hannes@cmpxchg.org,
	kanchana.p.sridhar@intel.com,
	kasong@tencent.com,
	linux-block@vger.kernel.org,
	minchan@kernel.org,
	nphamcs@gmail.com,
	ryan.roberts@arm.com,
	senozhatsky@chromium.org,
	surenb@google.com,
	terrelln@fb.com,
	usamaarif642@gmail.com,
	v-songbaohua@oppo.com,
	wajdi.k.feghali@intel.com,
	willy@infradead.org,
	ying.huang@intel.com,
	yosryahmed@google.com,
	yuzhao@google.com,
	zhengtangquan@oppo.com,
	zhouchengming@bytedance.com
Subject: [PATCH RFC v3 2/4] zram: support compression at the granularity of
 multi-pages
Date: Fri, 22 Nov 2024 11:25:19 +1300
Message-Id: <20241121222521.83458-3-21cnbao@gmail.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
In-Reply-To: <20241121222521.83458-1-21cnbao@gmail.com>
References: <20241121222521.83458-1-21cnbao@gmail.com>
Precedence: bulk
X-Mailing-List: linux-block@vger.kernel.org
List-Id: <linux-block.vger.kernel.org>
List-Subscribe: <mailto:linux-block+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-block+unsubscribe@vger.kernel.org>
MIME-Version: 1.0

From: Tangquan Zheng <zhengtangquan@oppo.com>

Currently, when a large folio with nr_pages is submitted to zram, it is
divided into nr_pages parts for compression and storage individually.
By transitioning to a higher granularity, we can notably enhance
compression rates while simultaneously reducing CPU consumption.

This patch introduces the capability for large folios to be divided
based on the granularity specified by ZSMALLOC_MULTI_PAGES_ORDER, which
defaults to 2. For instance, for folios sized at 128KiB, compression
will occur in eight 16KiB multi-pages.

This modification will notably reduce CPU consumption and enhance
compression ratios. The following data illustrates the time and
compressed data for typical anonymous pages gathered from Android
phones.

Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/block/zram/Kconfig    |   9 +
 drivers/block/zram/zcomp.c    |  17 +-
 drivers/block/zram/zcomp.h    |  12 +-
 drivers/block/zram/zram_drv.c | 449 +++++++++++++++++++++++++++++++---
 drivers/block/zram/zram_drv.h |  45 ++++
 5 files changed, 495 insertions(+), 37 deletions(-)

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 402b7b175863..716e92c5fdfe 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -145,3 +145,12 @@ config ZRAM_MULTI_COMP
 	  re-compress pages using a potentially slower but more effective
 	  compression algorithm. Note, that IDLE page recompression
 	  requires ZRAM_TRACK_ENTRY_ACTIME.
+
+config ZRAM_MULTI_PAGES
+	bool "Enable multiple pages compression and decompression"
+	depends on ZRAM && ZSMALLOC_MULTI_PAGES
+	help
+	  Initially, zram divided large folios into blocks of nr_pages, each sized
+	  equal to PAGE_SIZE, for compression. This option fine-tunes zram to
+	  improve compression granularity by dividing large folios into larger
+	  parts defined by the configuration option ZSMALLOC_MULTI_PAGES_ORDER.
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index bb514403e305..44f5b404495a 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -52,6 +52,11 @@ static void zcomp_strm_free(struct zcomp *comp, struct zcomp_strm *zstrm)
 
 static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
 {
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	unsigned long page_size = ZCOMP_MULTI_PAGES_SIZE;
+#else
+	unsigned long page_size = PAGE_SIZE;
+#endif
 	int ret;
 
 	ret = comp->ops->create_ctx(comp->params, &zstrm->ctx);
@@ -62,7 +67,7 @@ static int zcomp_strm_init(struct zcomp *comp, struct zcomp_strm *zstrm)
 	 * allocate 2 pages. 1 for compressed data, plus 1 extra for the
 	 * case when compressed size is larger than the original one
 	 */
-	zstrm->buffer = vzalloc(2 * PAGE_SIZE);
+	zstrm->buffer = vzalloc(2 * page_size);
 	if (!zstrm->buffer) {
 		zcomp_strm_free(comp, zstrm);
 		return -ENOMEM;
@@ -119,13 +124,13 @@ void zcomp_stream_put(struct zcomp *comp)
 }
 
 int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
-		   const void *src, unsigned int *dst_len)
+		   const void *src, unsigned int src_len, unsigned int *dst_len)
 {
 	struct zcomp_req req = {
 		.src = src,
 		.dst = zstrm->buffer,
-		.src_len = PAGE_SIZE,
-		.dst_len = 2 * PAGE_SIZE,
+		.src_len = src_len,
+		.dst_len = src_len * 2,
 	};
 	int ret;
 
@@ -136,13 +141,13 @@ int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
 }
 
 int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
-		     const void *src, unsigned int src_len, void *dst)
+		     const void *src, unsigned int src_len, void *dst, unsigned int dst_len)
 {
 	struct zcomp_req req = {
 		.src = src,
 		.dst = dst,
 		.src_len = src_len,
-		.dst_len = PAGE_SIZE,
+		.dst_len = dst_len,
 	};
 
 	return comp->ops->decompress(comp->params, &zstrm->ctx, &req);
diff --git a/drivers/block/zram/zcomp.h b/drivers/block/zram/zcomp.h
index ad5762813842..471c16be293c 100644
--- a/drivers/block/zram/zcomp.h
+++ b/drivers/block/zram/zcomp.h
@@ -30,6 +30,13 @@ struct zcomp_ctx {
 	void *context;
 };
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define ZCOMP_MULTI_PAGES_ORDER	(_AC(CONFIG_ZSMALLOC_MULTI_PAGES_ORDER, UL))
+#define ZCOMP_MULTI_PAGES_NR	(1 << ZCOMP_MULTI_PAGES_ORDER)
+#define ZCOMP_MULTI_PAGES_SIZE	(PAGE_SIZE * ZCOMP_MULTI_PAGES_NR)
+#define MULTI_PAGE_SHIFT (ZCOMP_MULTI_PAGES_ORDER + PAGE_SHIFT)
+#endif
+
 struct zcomp_strm {
 	local_lock_t lock;
 	/* compression buffer */
@@ -80,8 +87,9 @@ struct zcomp_strm *zcomp_stream_get(struct zcomp *comp);
 void zcomp_stream_put(struct zcomp *comp);
 
 int zcomp_compress(struct zcomp *comp, struct zcomp_strm *zstrm,
-		   const void *src, unsigned int *dst_len);
+		   const void *src, unsigned int src_len, unsigned int *dst_len);
 int zcomp_decompress(struct zcomp *comp, struct zcomp_strm *zstrm,
-		     const void *src, unsigned int src_len, void *dst);
+		     const void *src, unsigned int src_len, void *dst,
+		     unsigned int dst_len);
 
 #endif /* _ZCOMP_H_ */
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6cb7d1e57362..90f87894ff3e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -50,7 +50,7 @@ static unsigned int num_devices = 1;
  * Pages that compress to sizes equals or greater than this are stored
  * uncompressed in memory.
  */
-static size_t huge_class_size;
+static size_t huge_class_size[ZSMALLOC_TYPE_MAX];
 
 static const struct block_device_operations zram_devops;
 
@@ -296,11 +296,11 @@ static inline void zram_fill_page(void *ptr, unsigned long len,
 	memset_l(ptr, value, len / sizeof(unsigned long));
 }
 
-static bool page_same_filled(void *ptr, unsigned long *element)
+static bool page_same_filled(void *ptr, unsigned long *element, unsigned int page_size)
 {
 	unsigned long *page;
 	unsigned long val;
-	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
+	unsigned int pos, last_pos = page_size / sizeof(*page) - 1;
 
 	page = (unsigned long *)ptr;
 	val = page[0];
@@ -1426,13 +1426,40 @@ static ssize_t debug_stat_show(struct device *dev,
 	return ret;
 }
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+static ssize_t multi_pages_debug_stat_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct zram *zram = dev_to_zram(dev);
+	ssize_t ret = 0;
+
+	down_read(&zram->init_lock);
+	ret = scnprintf(buf, PAGE_SIZE,
+			"zram_bio write/read multi_pages count:%8llu %8llu\n"
+			"zram_bio failed write/read multi_pages count%8llu %8llu\n"
+			"zram_bio partial write/read multi_pages count%8llu %8llu\n"
+			"multi_pages_miss_free %8llu\n",
+			(u64)atomic64_read(&zram->stats.zram_bio_write_multi_pages_count),
+			(u64)atomic64_read(&zram->stats.zram_bio_read_multi_pages_count),
+			(u64)atomic64_read(&zram->stats.multi_pages_failed_writes),
+			(u64)atomic64_read(&zram->stats.multi_pages_failed_reads),
+			(u64)atomic64_read(&zram->stats.zram_bio_write_multi_pages_partial_count),
+			(u64)atomic64_read(&zram->stats.zram_bio_read_multi_pages_partial_count),
+			(u64)atomic64_read(&zram->stats.multi_pages_miss_free));
+	up_read(&zram->init_lock);
+
+	return ret;
+}
+#endif
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
 #ifdef CONFIG_ZRAM_WRITEBACK
 static DEVICE_ATTR_RO(bd_stat);
 #endif
 static DEVICE_ATTR_RO(debug_stat);
-
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+static DEVICE_ATTR_RO(multi_pages_debug_stat);
+#endif
 static void zram_meta_free(struct zram *zram, u64 disksize)
 {
 	size_t num_pages = disksize >> PAGE_SHIFT;
@@ -1449,6 +1476,7 @@ static void zram_meta_free(struct zram *zram, u64 disksize)
 static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 {
 	size_t num_pages, index;
+	int i;
 
 	num_pages = disksize >> PAGE_SHIFT;
 	zram->table = vzalloc(array_size(num_pages, sizeof(*zram->table)));
@@ -1461,7 +1489,10 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 		return false;
 	}
 
-	huge_class_size = zs_huge_class_size(zram->mem_pool, 0);
+	for (i = 0; i < ZSMALLOC_TYPE_MAX; i++) {
+		if (!huge_class_size[i])
+			huge_class_size[i] = zs_huge_class_size(zram->mem_pool, i);
+	}
 
 	for (index = 0; index < num_pages; index++)
 		spin_lock_init(&zram->table[index].lock);
@@ -1476,10 +1507,17 @@ static bool zram_meta_alloc(struct zram *zram, u64 disksize)
 static void zram_free_page(struct zram *zram, size_t index)
 {
 	unsigned long handle;
+	int nr_pages = 1;
 
 #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME
 	zram->table[index].ac_time = 0;
 #endif
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (zram_test_flag(zram, index, ZRAM_COMP_MULTI_PAGES)) {
+		zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
+		nr_pages = ZCOMP_MULTI_PAGES_NR;
+	}
+#endif
 
 	zram_clear_flag(zram, index, ZRAM_IDLE);
 	zram_clear_flag(zram, index, ZRAM_INCOMPRESSIBLE);
@@ -1503,7 +1541,7 @@ static void zram_free_page(struct zram *zram, size_t index)
 	 */
 	if (zram_test_flag(zram, index, ZRAM_SAME)) {
 		zram_clear_flag(zram, index, ZRAM_SAME);
-		atomic64_dec(&zram->stats.same_pages);
+		atomic64_sub(nr_pages, &zram->stats.same_pages);
 		goto out;
 	}
 
@@ -1516,7 +1554,7 @@ static void zram_free_page(struct zram *zram, size_t index)
 	atomic64_sub(zram_get_obj_size(zram, index),
 		     &zram->stats.compr_data_size);
 out:
-	atomic64_dec(&zram->stats.pages_stored);
+	atomic64_sub(nr_pages, &zram->stats.pages_stored);
 	zram_set_handle(zram, index, 0);
 	zram_set_obj_size(zram, index, 0);
 }
@@ -1526,7 +1564,7 @@ static void zram_free_page(struct zram *zram, size_t index)
  * Corresponding ZRAM slot should be locked.
  */
 static int zram_read_from_zspool(struct zram *zram, struct page *page,
-				 u32 index)
+				 u32 index, enum zsmalloc_type zs_type)
 {
 	struct zcomp_strm *zstrm;
 	unsigned long handle;
@@ -1534,6 +1572,12 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
 	void *src, *dst;
 	u32 prio;
 	int ret;
+	unsigned long page_size = PAGE_SIZE;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (zs_type == ZSMALLOC_TYPE_MULTI_PAGES)
+		page_size = ZCOMP_MULTI_PAGES_SIZE;
+#endif
 
 	handle = zram_get_handle(zram, index);
 	if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
@@ -1542,28 +1586,28 @@ static int zram_read_from_zspool(struct zram *zram, struct page *page,
 
 		value = handle ? zram_get_element(zram, index) : 0;
 		mem = kmap_local_page(page);
-		zram_fill_page(mem, PAGE_SIZE, value);
+		zram_fill_page(mem, page_size, value);
 		kunmap_local(mem);
 		return 0;
 	}
 
 	size = zram_get_obj_size(zram, index);
 
-	if (size != PAGE_SIZE) {
+	if (size != page_size) {
 		prio = zram_get_priority(zram, index);
 		zstrm = zcomp_stream_get(zram->comps[prio]);
 	}
 
 	src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
-	if (size == PAGE_SIZE) {
+	if (size == page_size) {
 		dst = kmap_local_page(page);
-		copy_page(dst, src);
+		memcpy(dst, src, page_size);
 		kunmap_local(dst);
 		ret = 0;
 	} else {
 		dst = kmap_local_page(page);
 		ret = zcomp_decompress(zram->comps[prio], zstrm,
-				       src, size, dst);
+				       src, size, dst, page_size);
 		kunmap_local(dst);
 		zcomp_stream_put(zram->comps[prio]);
 	}
@@ -1579,7 +1623,7 @@ static int zram_read_page(struct zram *zram, struct page *page, u32 index,
 	zram_slot_lock(zram, index);
 	if (!zram_test_flag(zram, index, ZRAM_WB)) {
 		/* Slot should be locked through out the function call */
-		ret = zram_read_from_zspool(zram, page, index);
+		ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_BASEPAGE);
 		zram_slot_unlock(zram, index);
 	} else {
 		/*
@@ -1636,13 +1680,24 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	struct zcomp_strm *zstrm;
 	unsigned long element = 0;
 	enum zram_pageflags flags = 0;
+	unsigned long page_size = PAGE_SIZE;
+	int huge_class_idx = ZSMALLOC_TYPE_BASEPAGE;
+	int nr_pages = 1;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (folio_size(page_folio(page)) >= ZCOMP_MULTI_PAGES_SIZE) {
+		page_size = ZCOMP_MULTI_PAGES_SIZE;
+		huge_class_idx = ZSMALLOC_TYPE_MULTI_PAGES;
+		nr_pages = ZCOMP_MULTI_PAGES_NR;
+	}
+#endif
 
 	mem = kmap_local_page(page);
-	if (page_same_filled(mem, &element)) {
+	if (page_same_filled(mem, &element, page_size)) {
 		kunmap_local(mem);
 		/* Free memory associated with this sector now. */
 		flags = ZRAM_SAME;
-		atomic64_inc(&zram->stats.same_pages);
+		atomic64_add(nr_pages, &zram->stats.same_pages);
 		goto out;
 	}
 	kunmap_local(mem);
@@ -1651,7 +1706,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	zstrm = zcomp_stream_get(zram->comps[ZRAM_PRIMARY_COMP]);
 	src = kmap_local_page(page);
 	ret = zcomp_compress(zram->comps[ZRAM_PRIMARY_COMP], zstrm,
-			     src, &comp_len);
+			     src, page_size, &comp_len);
 	kunmap_local(src);
 
 	if (unlikely(ret)) {
@@ -1661,8 +1716,8 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		return ret;
 	}
 
-	if (comp_len >= huge_class_size)
-		comp_len = PAGE_SIZE;
+	if (comp_len >= huge_class_size[huge_class_idx])
+		comp_len = page_size;
 	/*
 	 * handle allocation has 2 paths:
 	 * a) fast path is executed with preemption disabled (for
@@ -1691,7 +1746,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		if (IS_ERR_VALUE(handle))
 			return PTR_ERR((void *)handle);
 
-		if (comp_len != PAGE_SIZE)
+		if (comp_len != page_size)
 			goto compress_again;
 		/*
 		 * If the page is not compressible, you need to acquire the
@@ -1715,10 +1770,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
 
 	src = zstrm->buffer;
-	if (comp_len == PAGE_SIZE)
+	if (comp_len == page_size)
 		src = kmap_local_page(page);
 	memcpy(dst, src, comp_len);
-	if (comp_len == PAGE_SIZE)
+	if (comp_len == page_size)
 		kunmap_local(src);
 
 	zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
@@ -1732,7 +1787,7 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	zram_slot_lock(zram, index);
 	zram_free_page(zram, index);
 
-	if (comp_len == PAGE_SIZE) {
+	if (comp_len == page_size) {
 		zram_set_flag(zram, index, ZRAM_HUGE);
 		atomic64_inc(&zram->stats.huge_pages);
 		atomic64_inc(&zram->stats.huge_pages_since);
@@ -1745,10 +1800,19 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		zram_set_handle(zram, index, handle);
 		zram_set_obj_size(zram, index, comp_len);
 	}
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (page_size == ZCOMP_MULTI_PAGES_SIZE) {
+		/* Set multi-pages compression flag for free or overwriting */
+		for (int i = 0; i < ZCOMP_MULTI_PAGES_NR; i++)
+			zram_set_flag(zram, index + i, ZRAM_COMP_MULTI_PAGES);
+	}
+#endif
+
 	zram_slot_unlock(zram, index);
 
 	/* Update stats */
-	atomic64_inc(&zram->stats.pages_stored);
+	atomic64_add(nr_pages, &zram->stats.pages_stored);
 	return ret;
 }
 
@@ -1861,7 +1925,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
 	if (comp_len_old < threshold)
 		return 0;
 
-	ret = zram_read_from_zspool(zram, page, index);
+	ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_BASEPAGE);
 	if (ret)
 		return ret;
 
@@ -1892,7 +1956,7 @@ static int recompress_slot(struct zram *zram, u32 index, struct page *page,
 		zstrm = zcomp_stream_get(zram->comps[prio]);
 		src = kmap_local_page(page);
 		ret = zcomp_compress(zram->comps[prio], zstrm,
-				     src, &comp_len_new);
+				     src, PAGE_SIZE, &comp_len_new);
 		kunmap_local(src);
 
 		if (ret) {
@@ -2056,7 +2120,7 @@ static ssize_t recompress_store(struct device *dev,
 		}
 	}
 
-	if (threshold >= huge_class_size)
+	if (threshold >= huge_class_size[ZSMALLOC_TYPE_BASEPAGE])
 		return -EINVAL;
 
 	down_read(&zram->init_lock);
@@ -2178,7 +2242,7 @@ static void zram_bio_discard(struct zram *zram, struct bio *bio)
 	bio_endio(bio);
 }
 
-static void zram_bio_read(struct zram *zram, struct bio *bio)
+static void zram_bio_read_page(struct zram *zram, struct bio *bio)
 {
 	unsigned long start_time = bio_start_io_acct(bio);
 	struct bvec_iter iter = bio->bi_iter;
@@ -2209,7 +2273,7 @@ static void zram_bio_read(struct zram *zram, struct bio *bio)
 	bio_endio(bio);
 }
 
-static void zram_bio_write(struct zram *zram, struct bio *bio)
+static void zram_bio_write_page(struct zram *zram, struct bio *bio)
 {
 	unsigned long start_time = bio_start_io_acct(bio);
 	struct bvec_iter iter = bio->bi_iter;
@@ -2239,6 +2303,311 @@ static void zram_bio_write(struct zram *zram, struct bio *bio)
 	bio_endio(bio);
 }
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+
+/*
+ * The index is compress by multi-pages when any index ZRAM_COMP_MULTI_PAGES flag is set.
+ * Return: 0	: compress by page
+ *         > 0	: compress by multi-pages
+ */
+static inline int __test_multi_pages_comp(struct zram *zram, u32 index)
+{
+	int i;
+	int count = 0;
+	int head_index = index & ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+
+	for (i = 0; i < ZCOMP_MULTI_PAGES_NR; i++) {
+		if (zram_test_flag(zram, head_index + i, ZRAM_COMP_MULTI_PAGES))
+			count++;
+	}
+
+	return count;
+}
+
+static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+	u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+	if (bio->bi_io_vec->bv_len >= ZCOMP_MULTI_PAGES_SIZE)
+		return true;
+
+	zram_slot_lock(zram, index);
+	if (__test_multi_pages_comp(zram, index)) {
+		zram_slot_unlock(zram, index);
+		return true;
+	}
+	zram_slot_unlock(zram, index);
+
+	return false;
+}
+
+static inline bool test_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+	u32 index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
+
+	return !!__test_multi_pages_comp(zram, index);
+}
+
+static inline bool is_multi_pages_partial_io(struct bio_vec *bvec)
+{
+	return bvec->bv_len != ZCOMP_MULTI_PAGES_SIZE;
+}
+
+static int zram_read_multi_pages(struct zram *zram, struct page *page, u32 index,
+			  struct bio *parent)
+{
+	int ret;
+
+	zram_slot_lock(zram, index);
+	if (!zram_test_flag(zram, index, ZRAM_WB)) {
+		/* Slot should be locked through out the function call */
+		ret = zram_read_from_zspool(zram, page, index, ZSMALLOC_TYPE_MULTI_PAGES);
+		zram_slot_unlock(zram, index);
+	} else {
+		/*
+		 * The slot should be unlocked before reading from the backing
+		 * device.
+		 */
+		zram_slot_unlock(zram, index);
+
+		ret = read_from_bdev(zram, page, zram_get_element(zram, index),
+				     parent);
+	}
+
+	/* Should NEVER happen. Return bio error if it does. */
+	if (WARN_ON(ret < 0))
+		pr_err("Decompression failed! err=%d, page=%u\n", ret, index);
+
+	return ret;
+}
+
+static int zram_read_partial_from_zspool(struct zram *zram, struct page *page,
+				 u32 index, enum zsmalloc_type zs_type, int offset)
+{
+	struct zcomp_strm *zstrm;
+	unsigned long handle;
+	unsigned int size;
+	void *src, *dst;
+	u32 prio;
+	int ret;
+	unsigned long page_size = PAGE_SIZE;
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	if (zs_type == ZSMALLOC_TYPE_MULTI_PAGES)
+		page_size = ZCOMP_MULTI_PAGES_SIZE;
+#endif
+
+	handle = zram_get_handle(zram, index);
+	if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
+		unsigned long value;
+		void *mem;
+
+		value = handle ? zram_get_element(zram, index) : 0;
+		mem = kmap_local_page(page);
+		atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+		zram_fill_page(mem, PAGE_SIZE, value);
+		kunmap_local(mem);
+		return 0;
+	}
+
+	size = zram_get_obj_size(zram, index);
+
+	if (size != page_size) {
+		prio = zram_get_priority(zram, index);
+		zstrm = zcomp_stream_get(zram->comps[prio]);
+	}
+
+	src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO);
+	if (size == page_size) {
+		dst = kmap_local_page(page);
+		atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+		memcpy(dst, src + offset, PAGE_SIZE);
+		kunmap_local(dst);
+		ret = 0;
+	} else {
+		dst = kmap_local_page(page);
+		/* use zstrm->buffer to store decompress thp and copy page to dst */
+		atomic64_inc(&zram->stats.zram_bio_read_multi_pages_partial_count);
+		ret = zcomp_decompress(zram->comps[prio], zstrm, src, size, zstrm->buffer, page_size);
+		memcpy(dst, zstrm->buffer + offset, PAGE_SIZE);
+		kunmap_local(dst);
+		zcomp_stream_put(zram->comps[prio]);
+	}
+	zs_unmap_object(zram->mem_pool, handle);
+	return ret;
+}
+
+/*
+ * Use a temporary buffer to decompress the page, as the decompressor
+ * always expects a full page for the output.
+ */
+static int zram_bvec_read_multi_pages_partial(struct zram *zram, struct page *page, u32 index,
+			  struct bio *parent, int offset)
+{
+	int ret;
+
+	zram_slot_lock(zram, index);
+	if (!zram_test_flag(zram, index, ZRAM_WB)) {
+		/* Slot should be locked through out the function call */
+		ret = zram_read_partial_from_zspool(zram, page, index, ZSMALLOC_TYPE_MULTI_PAGES, offset);
+		zram_slot_unlock(zram, index);
+	} else {
+		/*
+		 * The slot should be unlocked before reading from the backing
+		 * device.
+		 */
+		zram_slot_unlock(zram, index);
+
+		ret = read_from_bdev(zram, page, zram_get_element(zram, index),
+				     parent);
+	}
+
+	/* Should NEVER happen. Return bio error if it does. */
+	if (WARN_ON(ret < 0))
+		pr_err("Decompression failed! err=%d, page=%u offset=%d\n", ret, index, offset);
+
+	return ret;
+}
+
+static int zram_bvec_read_multi_pages(struct zram *zram, struct bio_vec *bvec,
+			  u32 index, int offset, struct bio *bio)
+{
+	if (is_multi_pages_partial_io(bvec))
+		return zram_bvec_read_multi_pages_partial(zram, bvec->bv_page, index, bio, offset);
+	return zram_read_multi_pages(zram, bvec->bv_page, index, bio);
+}
+
+/*
+ * This is a partial IO. Read the full page before writing the changes.
+ */
+static int zram_bvec_write_multi_pages_partial(struct zram *zram, struct bio_vec *bvec,
+				   u32 index, int offset, struct bio *bio)
+{
+	struct page *page = alloc_pages(GFP_NOIO | __GFP_COMP, ZCOMP_MULTI_PAGES_ORDER);
+	int ret;
+	void *src, *dst;
+
+	if (!page)
+		return -ENOMEM;
+
+	ret = zram_read_multi_pages(zram, page, index, bio);
+	if (!ret) {
+		src = kmap_local_page(bvec->bv_page);
+		dst = kmap_local_page(page);
+		memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
+		kunmap_local(dst);
+		kunmap_local(src);
+
+		atomic64_inc(&zram->stats.zram_bio_write_multi_pages_partial_count);
+		ret = zram_write_page(zram, page, index);
+	}
+	__free_pages(page, ZCOMP_MULTI_PAGES_ORDER);
+	return ret;
+}
+
+static int zram_bvec_write_multi_pages(struct zram *zram, struct bio_vec *bvec,
+			   u32 index, int offset, struct bio *bio)
+{
+	if (is_multi_pages_partial_io(bvec))
+		return zram_bvec_write_multi_pages_partial(zram, bvec, index, offset, bio);
+	return zram_write_page(zram, bvec->bv_page, index);
+}
+
+
+static void zram_bio_read_multi_pages(struct zram *zram, struct bio *bio)
+{
+	unsigned long start_time = bio_start_io_acct(bio);
+	struct bvec_iter iter = bio->bi_iter;
+
+	do {
+		/* Use head index, and other indexes are used as offset */
+		u32 index = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
+				~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+		u32 offset = (iter.bi_sector & (SECTORS_PER_MULTI_PAGE - 1)) << SECTOR_SHIFT;
+		struct bio_vec bv = multi_pages_bio_iter_iovec(bio, iter);
+
+		atomic64_add(1, &zram->stats.zram_bio_read_multi_pages_count);
+		bv.bv_len = min_t(u32, bv.bv_len, ZCOMP_MULTI_PAGES_SIZE - offset);
+
+		if (zram_bvec_read_multi_pages(zram, &bv, index, offset, bio) < 0) {
+			atomic64_inc(&zram->stats.multi_pages_failed_reads);
+			bio->bi_status = BLK_STS_IOERR;
+			break;
+		}
+		flush_dcache_page(bv.bv_page);
+
+		zram_slot_lock(zram, index);
+		zram_accessed(zram, index);
+		zram_slot_unlock(zram, index);
+
+		bio_advance_iter_single(bio, &iter, bv.bv_len);
+	} while (iter.bi_size);
+
+	bio_end_io_acct(bio, start_time);
+	bio_endio(bio);
+}
+
+static void zram_bio_write_multi_pages(struct zram *zram, struct bio *bio)
+{
+	unsigned long start_time = bio_start_io_acct(bio);
+	struct bvec_iter iter = bio->bi_iter;
+
+	do {
+		/* Use head index, and other indexes are used as offset */
+		u32 index = (iter.bi_sector >> SECTORS_PER_PAGE_SHIFT) &
+				~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+		u32 offset = (iter.bi_sector & (SECTORS_PER_MULTI_PAGE - 1)) << SECTOR_SHIFT;
+		struct bio_vec bv = multi_pages_bio_iter_iovec(bio, iter);
+
+		bv.bv_len = min_t(u32, bv.bv_len, ZCOMP_MULTI_PAGES_SIZE - offset);
+
+		atomic64_add(1, &zram->stats.zram_bio_write_multi_pages_count);
+		if (zram_bvec_write_multi_pages(zram, &bv, index, offset, bio) < 0) {
+			atomic64_inc(&zram->stats.multi_pages_failed_writes);
+			bio->bi_status = BLK_STS_IOERR;
+			break;
+		}
+
+		zram_slot_lock(zram, index);
+		zram_accessed(zram, index);
+		zram_slot_unlock(zram, index);
+
+		bio_advance_iter_single(bio, &iter, bv.bv_len);
+	} while (iter.bi_size);
+
+	bio_end_io_acct(bio, start_time);
+	bio_endio(bio);
+}
+#else
+static inline bool test_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+	return false;
+}
+
+static inline bool want_multi_pages_comp(struct zram *zram, struct bio *bio)
+{
+	return false;
+}
+static void zram_bio_read_multi_pages(struct zram *zram, struct bio *bio) {}
+static void zram_bio_write_multi_pages(struct zram *zram, struct bio *bio) {}
+#endif
+
+static void zram_bio_read(struct zram *zram, struct bio *bio)
+{
+	if (test_multi_pages_comp(zram, bio))
+		zram_bio_read_multi_pages(zram, bio);
+	else
+		zram_bio_read_page(zram, bio);
+}
+
+static void zram_bio_write(struct zram *zram, struct bio *bio)
+{
+	if (want_multi_pages_comp(zram, bio))
+		zram_bio_write_multi_pages(zram, bio);
+	else
+		zram_bio_write_page(zram, bio);
+}
+
 /*
  * Handler function for all zram I/O requests.
  */
@@ -2276,6 +2645,25 @@ static void zram_slot_free_notify(struct block_device *bdev,
 		return;
 	}
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	int comp_count = __test_multi_pages_comp(zram, index);
+
+	if (comp_count > 1) {
+		zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
+		zram_slot_unlock(zram, index);
+		return;
+	} else if (comp_count == 1) {
+		zram_clear_flag(zram, index, ZRAM_COMP_MULTI_PAGES);
+		zram_slot_unlock(zram, index);
+		/*only need to free head index*/
+		index &= ~((unsigned long)ZCOMP_MULTI_PAGES_NR - 1);
+		if (!zram_slot_trylock(zram, index)) {
+			atomic64_inc(&zram->stats.multi_pages_miss_free);
+			return;
+		}
+	}
+#endif
+
 	zram_free_page(zram, index);
 	zram_slot_unlock(zram, index);
 }
@@ -2493,6 +2881,9 @@ static struct attribute *zram_disk_attrs[] = {
 #endif
 	&dev_attr_io_stat.attr,
 	&dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	&dev_attr_multi_pages_debug_stat.attr,
+#endif
 #ifdef CONFIG_ZRAM_WRITEBACK
 	&dev_attr_bd_stat.attr,
 #endif
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 134be414e210..ac4eb4f39cb7 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -28,6 +28,10 @@
 #define ZRAM_SECTOR_PER_LOGICAL_BLOCK	\
 	(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define SECTORS_PER_MULTI_PAGE_SHIFT	(MULTI_PAGE_SHIFT - SECTOR_SHIFT)
+#define SECTORS_PER_MULTI_PAGE	(1 << SECTORS_PER_MULTI_PAGE_SHIFT)
+#endif
 
 /*
  * ZRAM is mainly used for memory efficiency so we want to keep memory
@@ -38,7 +42,15 @@
  *
  * We use BUILD_BUG_ON() to make sure that zram pageflags don't overflow.
  */
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+#define ZRAM_FLAG_SHIFT (PAGE_SHIFT +  \
+	CONFIG_ZSMALLOC_MULTI_PAGES_ORDER + 1)
+#else
 #define ZRAM_FLAG_SHIFT (PAGE_SHIFT + 1)
+#endif
+
+#define ENABLE_HUGEPAGE_ZRAM_DEBUG 1
 
 /* Only 2 bits are allowed for comp priority index */
 #define ZRAM_COMP_PRIORITY_MASK	0x3
@@ -55,6 +67,10 @@ enum zram_pageflags {
 	ZRAM_COMP_PRIORITY_BIT1, /* First bit of comp priority index */
 	ZRAM_COMP_PRIORITY_BIT2, /* Second bit of comp priority index */
 
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	ZRAM_COMP_MULTI_PAGES,	/* Compressed by multi-pages */
+#endif
+
 	__NR_ZRAM_PAGEFLAGS,
 };
 
@@ -90,6 +106,16 @@ struct zram_stats {
 	atomic64_t bd_reads;		/* no. of reads from backing device */
 	atomic64_t bd_writes;		/* no. of writes from backing device */
 #endif
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+	atomic64_t zram_bio_write_multi_pages_count;
+	atomic64_t zram_bio_read_multi_pages_count;
+	atomic64_t multi_pages_failed_writes;
+	atomic64_t multi_pages_failed_reads;
+	atomic64_t zram_bio_write_multi_pages_partial_count;
+	atomic64_t zram_bio_read_multi_pages_partial_count;
+	atomic64_t multi_pages_miss_free;
+#endif
 };
 
 #ifdef CONFIG_ZRAM_MULTI_COMP
@@ -141,4 +167,23 @@ struct zram {
 #endif
 	atomic_t pp_in_progress;
 };
+
+#ifdef CONFIG_ZRAM_MULTI_PAGES
+ #define multi_pages_bvec_iter_offset(bvec, iter)				\
+	(mp_bvec_iter_offset((bvec), (iter)) % ZCOMP_MULTI_PAGES_SIZE)
+
+#define multi_pages_bvec_iter_len(bvec, iter)				\
+	min_t(unsigned int, mp_bvec_iter_len((bvec), (iter)),		\
+	      ZCOMP_MULTI_PAGES_SIZE - bvec_iter_offset((bvec), (iter)))
+
+#define multi_pages_bvec_iter_bvec(bvec, iter)				\
+((struct bio_vec) {						\
+	.bv_page	= bvec_iter_page((bvec), (iter)),	\
+	.bv_len		= multi_pages_bvec_iter_len((bvec), (iter)),	\
+	.bv_offset	= multi_pages_bvec_iter_offset((bvec), (iter)),	\
+})
+
+#define multi_pages_bio_iter_iovec(bio, iter)				\
+	multi_pages_bvec_iter_bvec((bio)->bi_io_vec, (iter))
+#endif
 #endif

From patchwork Thu Nov 21 22:25:20 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13882410
Received: from mail-pl1-f180.google.com (mail-pl1-f180.google.com
 [209.85.214.180])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4BC5C1DDC14
	for <linux-block@vger.kernel.org>; Thu, 21 Nov 2024 22:26:26 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.214.180
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1732227988; cv=none;
 b=GUrXYTgmMPn0m1k2S7HvubD4tfbYVlS+1QY8JAnM77HIqPdtK7V8ZYqWPqNxcZ4SLj6DcI6gus0ttMEzpgRRfCZtfCs5Vi92g+kcGpWhSg2Mw59XQMnYRu3c9A2DnLg+bBArqFACznz/F1ZCn0igJCjqpVPS2CAO65gU5qIUXU0=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1732227988; c=relaxed/simple;
	bh=n9Imun7lFdzH3n76Y/xW6eOk4MzmS6i8pe2+V22yAWQ=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=WJRhfBMusguNfH+xjH7Hss/C+ZVk7hnzQ/q4gO2XIZI8SJu79cBG4I0GfCHAFt2x0toESoNiXkWazHrnu200PKMxZYTcgTU7an3SIuYE13pNtoXP5MmftRtoKRWo11WzIUoVvdgL8IT0UH3/cal5rqgxxEk+4aUYkQpFXuxwqlI=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com;
 spf=pass smtp.mailfrom=gmail.com;
 dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b=Wa7scZAJ; arc=none smtp.client-ip=209.85.214.180
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="Wa7scZAJ"
Received: by mail-pl1-f180.google.com with SMTP id
 d9443c01a7336-21288d3b387so11422395ad.1
        for <linux-block@vger.kernel.org>;
 Thu, 21 Nov 2024 14:26:26 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1732227985; x=1732832785;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=bnUtikKG6kYaTFvoCtF8la8TvUYmRutVMcxq2kAfDzs=;
        b=Wa7scZAJpNYRElA1e3jjDUJSDTsTHQzPRD1epciI2TAwI3Jya/pOPfm3Kk/S/NcGnk
         zD4J6MKeyqbNwyIN1aZt/ykHV1g69nc1oSMvU9fYqtErh7eG5cqGbtl3iL4bNga6UQ3m
         RhRCzNzbzuj1ciTTK07r2fWxbLMLEbTqDfYJhM/350P0KGnFOehsPMHhy8QV5r7uumcu
         +vCxNGnRp363W1P9JMGfw9lUenfjZI7v/LMY2EUjA9/vSTvuVtUMYQ3Bv0rFmVJ3uID4
         V78Xhny0rMs+i8ZW/K9pcg9/zuNUU0Oiu2RUWm06T48P1XUwdi9GN4XaXakOoKMsHz7V
         c9sQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1732227985; x=1732832785;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=bnUtikKG6kYaTFvoCtF8la8TvUYmRutVMcxq2kAfDzs=;
        b=CxVWI9BY6expvXp4tzV+w/PTrvNpVCk3upEUIhWCxbGgV1mWYYALQLwQeaNH5ekm0j
         T0AKzcBrlZfdwkwrJjPHfWaltU02V3mbJ6d2FHpPxI2Zvne//TiZnJp7WLaQERfFUeez
         bZhiHde7oY4NLJSo48KM0BZ0t0YcXGe2P+rf7Qu6sk4DPyp8g9p5Lwc+Khobu5UwBPXY
         mYEGfIPCBSjbUSLTxq7IWdnHzJjpOfJZYcTVPgPmd1WudlDpEVdzLPC7ZOl/SL6Wapg4
         sjuobT63K535kR8gsuxz2mxIAQcsYfxg0qdykNMilTDF10SfZc36FeHrUPkYAwQfN1Ij
         7Edg==
X-Forwarded-Encrypted: i=1;
 AJvYcCWIWXVoxf2VhRG3YYkWWcH41bGjjeeoeHA2zGVZ5Bj5w977Bj61u3ve9a25QOBnVxJcdnCCAjm5ib5lng==@vger.kernel.org
X-Gm-Message-State: AOJu0YzOaPktzzAXoEVerqT7uXmYxddVIG4opxPhiuWfrFfYm9YL3Udz
	gOFcCbiYarPJqIz7HLFKOjqkzHRpdlXhIUg+skOCuzXRwCHLCvuL
X-Gm-Gg: ASbGncs5KYRE2dk8hJ+9H1dZhZXbGT7+g4jt65F01HSbK70kJ3e/G2tPkSk0bveWofX
	oEA1x3uL7RJ7tAhPHhWyUnsXcx+oi+cCY2lPiLZZBNujTMsoR72M6FtLjBYorRHfu0dqlUPOZiH
	KzsQujRzH0Sdt5JJ9WeVQHO9q66qx4M/3e+zTfMKpL5TNu6FwGRRkVpYMcjvmtoUmytNiBuIL5O
	c/YcM6Q9yllDjEuxQU4XyDlllUoUf0nokkHl8GXrq3kkIZkqHc8dLehcuZG2DGFvcVPrY1q
X-Google-Smtp-Source: 
 AGHT+IFYXTSN4Arskmw5br0QxBuvikrOLITGUKeWiVS90/P9GJacujBndu+rNI50ulwvEXnP/KRJnA==
X-Received: by 2002:a17:902:f646:b0:20c:b9ca:c12d with SMTP id
 d9443c01a7336-2129f6807d4mr7603135ad.38.1732227985525;
        Thu, 21 Nov 2024 14:26:25 -0800 (PST)
Received: from Barrys-MBP.hub ([2407:7000:8942:5500:9d64:b0ba:faf2:680e])
        by smtp.gmail.com with ESMTPSA id
 d9443c01a7336-2129dba22f4sm3334745ad.100.2024.11.21.14.26.17
        (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256);
        Thu, 21 Nov 2024 14:26:25 -0800 (PST)
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org,
	linux-mm@kvack.org
Cc: axboe@kernel.dk,
	bala.seshasayee@linux.intel.com,
	chrisl@kernel.org,
	david@redhat.com,
	hannes@cmpxchg.org,
	kanchana.p.sridhar@intel.com,
	kasong@tencent.com,
	linux-block@vger.kernel.org,
	minchan@kernel.org,
	nphamcs@gmail.com,
	ryan.roberts@arm.com,
	senozhatsky@chromium.org,
	surenb@google.com,
	terrelln@fb.com,
	usamaarif642@gmail.com,
	v-songbaohua@oppo.com,
	wajdi.k.feghali@intel.com,
	willy@infradead.org,
	ying.huang@intel.com,
	yosryahmed@google.com,
	yuzhao@google.com,
	zhengtangquan@oppo.com,
	zhouchengming@bytedance.com
Subject: [PATCH RFC v3 3/4] zram: backend_zstd: Adjust estimated_src_size to
 accommodate multi-page compression
Date: Fri, 22 Nov 2024 11:25:20 +1300
Message-Id: <20241121222521.83458-4-21cnbao@gmail.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
In-Reply-To: <20241121222521.83458-1-21cnbao@gmail.com>
References: <20241121222521.83458-1-21cnbao@gmail.com>
Precedence: bulk
X-Mailing-List: linux-block@vger.kernel.org
List-Id: <linux-block.vger.kernel.org>
List-Subscribe: <mailto:linux-block+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-block+unsubscribe@vger.kernel.org>
MIME-Version: 1.0

From: Barry Song <v-songbaohua@oppo.com>

If we continue using PAGE_SIZE as the estimated_src_size, we won't
benefit from the reduced CPU usage and improved compression ratio
brought by larger block compression.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/block/zram/backend_zstd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/zram/backend_zstd.c b/drivers/block/zram/backend_zstd.c
index 1184c0036f44..e126615eeff2 100644
--- a/drivers/block/zram/backend_zstd.c
+++ b/drivers/block/zram/backend_zstd.c
@@ -70,12 +70,12 @@ static int zstd_setup_params(struct zcomp_params *params)
 	if (params->level == ZCOMP_PARAM_NO_LEVEL)
 		params->level = zstd_default_clevel();
 
-	zp->cprm = zstd_get_params(params->level, PAGE_SIZE);
+	zp->cprm = zstd_get_params(params->level, ZCOMP_MULTI_PAGES_SIZE);
 
 	zp->custom_mem.customAlloc = zstd_custom_alloc;
 	zp->custom_mem.customFree = zstd_custom_free;
 
-	prm = zstd_get_cparams(params->level, PAGE_SIZE,
+	prm = zstd_get_cparams(params->level, ZCOMP_MULTI_PAGES_SIZE,
 			       params->dict_sz);
 
 	zp->cdict = zstd_create_cdict_byreference(params->dict,
@@ -137,7 +137,7 @@ static int zstd_create(struct zcomp_params *params, struct zcomp_ctx *ctx)
 
 	ctx->context = zctx;
 	if (params->dict_sz == 0) {
-		prm = zstd_get_params(params->level, PAGE_SIZE);
+		prm = zstd_get_params(params->level, ZCOMP_MULTI_PAGES_SIZE);
 		sz = zstd_cctx_workspace_bound(&prm.cParams);
 		zctx->cctx_mem = vzalloc(sz);
 		if (!zctx->cctx_mem)

From patchwork Thu Nov 21 22:25:21 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13882411
Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com
 [209.85.214.177])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id A6AE51ABEB0
	for <linux-block@vger.kernel.org>; Thu, 21 Nov 2024 22:26:37 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.214.177
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1732227999; cv=none;
 b=BAjVMVZMqSkJRS714rOyuR18EFDaGWQwuMwPutFancgEfeGODsGh5x1ho5lmpCdoZsJhCVlNwOcuFV+FrmicOM6TG4vGTtjf+E6u6QJzd1h/2xBSoYjYqAnBDG4rY23mfThM6zIHZdFoOTW+sWVwGQ2dV45vAiEusXVUAGwcGJ0=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1732227999; c=relaxed/simple;
	bh=E2LdAEWxZEQPoTCg5UwJe0V4hxYf0LhZnWfqs1pvwOg=;
	h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References:
	 MIME-Version;
 b=F8wwSHF38GXPC7DF0fKf+0PAF6ac39YojOtonDt7fb8DbvarR87rp+HqkGI+H/eRfOL1rodMdWdBMjuPE1jU4bsic+QM4hb9oM1xyKrZW6PcMqqPnV1LXlZ0VlZ/Qt9mXH8AE7FRNhfJ/O0JpUd9GAyq65QH/KloSC88oNpdzro=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com;
 spf=pass smtp.mailfrom=gmail.com;
 dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b=ZZA8/awG; arc=none smtp.client-ip=209.85.214.177
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=gmail.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com
 header.b="ZZA8/awG"
Received: by mail-pl1-f177.google.com with SMTP id
 d9443c01a7336-2128383b86eso14641415ad.2
        for <linux-block@vger.kernel.org>;
 Thu, 21 Nov 2024 14:26:37 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1732227997; x=1732832797;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=pZAV40eLukdAUm3AIOVig1F6aMrQYsyjN2fkwd6I2Qs=;
        b=ZZA8/awGzqG/HEXWV6kn0geRJcPaA5n7UHVUg2b+VfK0MssxL+fnWdDf4+yUOrxFJQ
         U4pgMbv2DSexq8zSdpkL6EsTl0mWPROnyA21sXNbgalFzR7+I//kcNm+9a/WS5g8Tk3f
         YzaH3hF9kLpmkjMnGzgu76tOzUKTb70tnxMuA/hPzJKuKeSU0eNLpucEjdlxKBeHWdhT
         fRYdZT4xLv9Pl9WadZFopqkGW8pqtRD2CbnMMKAA1uehRFBXD7a9CbO2DPrAnUnLXpau
         6zTciC43lopnxp4+2jpr+i1/3wKNXaGoqN9B2zy0mnTKfu5gnhKmHNCqBjVkWTL2IFpl
         czLg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1732227997; x=1732832797;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=pZAV40eLukdAUm3AIOVig1F6aMrQYsyjN2fkwd6I2Qs=;
        b=bbR41KLLOIRsbro5gvl8EoIvGSsHQJbkkZ7Csv21vh3LL4nQJxt+Dh1u3YDyxqev30
         lL1R4R7paUvlxbINurWRuyd1BD5Rtfi3nsWz0O0mExL3lzUUOR/CuzlcWx6Y/ejYfT5z
         S6/GwNbySW2aah3HOCWQWzexx+87VNHIr97DUIN2BABYjfFRAv4lhr0ghNtNJWeb6xTW
         xxCzcqYRuKhSxPikGi64AzYeb8FzmD51dUR9vWD70DOS6BtN/Jc3ULc30iaL2oLFRYBA
         XxgL9gkcSbdzuMjrVaigxUqfxSdC0IWiQ2oAUfrjmXAvaB4CbfzinEgMJ7OXb5VtUSXj
         iLqA==
X-Forwarded-Encrypted: i=1;
 AJvYcCVY/IGM9VNB/ljN5q83sKRD1S97nwUlGaQeyp2kTDWZqMBKSWS8Zmq6nC3MhyY5yllJ7DgH72iSSx9jyg==@vger.kernel.org
X-Gm-Message-State: AOJu0YyZ9GV+ZRsHE46tGEqbhNAXs6VZF8/7wA3ImxVAn5kH6lhD5W9N
	Oy1m1cU8IIgXmqTAcahYtbC4Gy/RbUWdSXtvvJ77dQdPaaNaVH0C
X-Gm-Gg: ASbGncvStpNzsIu8et1Wn44WHtEqke9KvQzsHVOV7GmOxpJvVmuXCYP9k5MXWFrbG1A
	ujsvu6TmYN2zsbQn3bn7p0vR2zzs7LKgPs+z74m+B9BjClP4ZpmSExLkSXmT7z4uik1F4BmxqJ+
	Z18HDZaDxBXy5RIrOrScAbPAW/d9JJU8BOQgDYijw9cJHRlHSVZ07Hn48Yl4mm+tRa8rHjzXIyW
	GKMr0s1fjmF+ntb9ukmQpA51JON3k8px/R/xyeV+Vn6fNZoBDppnoVpdZ+oLYxndhNlTfhU
X-Google-Smtp-Source: 
 AGHT+IFk8IcOaVtDODSKrSa6m+UG9KLonALKh/GnCgbMtgcUZBCu3HpUQkp54L4Z2yATbp/3yYDCjw==
X-Received: by 2002:a17:903:244b:b0:211:f6e4:d68e with SMTP id
 d9443c01a7336-2129fce2e23mr6801055ad.10.1732227996748;
        Thu, 21 Nov 2024 14:26:36 -0800 (PST)
Received: from Barrys-MBP.hub ([2407:7000:8942:5500:9d64:b0ba:faf2:680e])
        by smtp.gmail.com with ESMTPSA id
 d9443c01a7336-2129dba22f4sm3334745ad.100.2024.11.21.14.26.28
        (version=TLS1_3 cipher=TLS_CHACHA20_POLY1305_SHA256 bits=256/256);
        Thu, 21 Nov 2024 14:26:36 -0800 (PST)
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org,
	linux-mm@kvack.org
Cc: axboe@kernel.dk,
	bala.seshasayee@linux.intel.com,
	chrisl@kernel.org,
	david@redhat.com,
	hannes@cmpxchg.org,
	kanchana.p.sridhar@intel.com,
	kasong@tencent.com,
	linux-block@vger.kernel.org,
	minchan@kernel.org,
	nphamcs@gmail.com,
	ryan.roberts@arm.com,
	senozhatsky@chromium.org,
	surenb@google.com,
	terrelln@fb.com,
	usamaarif642@gmail.com,
	v-songbaohua@oppo.com,
	wajdi.k.feghali@intel.com,
	willy@infradead.org,
	ying.huang@intel.com,
	yosryahmed@google.com,
	yuzhao@google.com,
	zhengtangquan@oppo.com,
	zhouchengming@bytedance.com,
	Chuanhua Han <chuanhuahan@gmail.com>
Subject: [PATCH RFC v3 4/4] mm: fall back to four small folios if mTHP
 allocation fails
Date: Fri, 22 Nov 2024 11:25:21 +1300
Message-Id: <20241121222521.83458-5-21cnbao@gmail.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)
In-Reply-To: <20241121222521.83458-1-21cnbao@gmail.com>
References: <20241121222521.83458-1-21cnbao@gmail.com>
Precedence: bulk
X-Mailing-List: linux-block@vger.kernel.org
List-Id: <linux-block.vger.kernel.org>
List-Subscribe: <mailto:linux-block+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-block+unsubscribe@vger.kernel.org>
MIME-Version: 1.0

From: Barry Song <v-songbaohua@oppo.com>

The swapfile can compress/decompress at 4 * PAGES granularity, reducing
CPU usage and improving the compression ratio. However, if allocating an
mTHP fails and we fall back to a single small folio, the entire large
block must still be decompressed. This results in a 16 KiB area requiring
4 page faults, where each fault decompresses 16 KiB but retrieves only
4 KiB of data from the block. To address this inefficiency, we instead
fall back to 4 small folios, ensuring that each decompression occurs
only once.

Allowing swap_read_folio() to decompress and read into an array of
4 folios would be extremely complex, requiring extensive changes
throughout the stack, including swap_read_folio, zeromap,
zswap, and final swap implementations like zRAM. In contrast,
having these components fill a large folio with 4 subpages is much
simpler.

To avoid a full-stack modification, we introduce a per-CPU order-2
large folio as a buffer. This buffer is used for swap_read_folio(),
after which the data is copied into the 4 small folios. Finally, in
do_swap_page(), all these small folios are mapped.

Co-developed-by: Chuanhua Han <chuanhuahan@gmail.com>
Signed-off-by: Chuanhua Han <chuanhuahan@gmail.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 192 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f..e551570c1425 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4042,6 +4042,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
 	return folio;
 }
 
+#define BATCH_SWPIN_ORDER 2
+#define BATCH_SWPIN_COUNT (1 << BATCH_SWPIN_ORDER)
+#define BATCH_SWPIN_SIZE (PAGE_SIZE << BATCH_SWPIN_ORDER)
+
+struct batch_swpin_buffer {
+	struct folio *folio;
+	struct mutex mutex;
+};
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
 {
@@ -4120,7 +4129,101 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
 	return orders;
 }
 
-static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+static DEFINE_PER_CPU(struct batch_swpin_buffer, swp_buf);
+
+static int __init batch_swpin_buffer_init(void)
+{
+	int ret, cpu;
+	struct batch_swpin_buffer *buf;
+
+	for_each_possible_cpu(cpu) {
+		buf = per_cpu_ptr(&swp_buf, cpu);
+		buf->folio = (struct folio *)alloc_pages_node(cpu_to_node(cpu),
+				GFP_KERNEL | __GFP_COMP, BATCH_SWPIN_ORDER);
+		if (!buf->folio) {
+			ret = -ENOMEM;
+			goto err;
+		}
+		mutex_init(&buf->mutex);
+	}
+	return 0;
+
+err:
+	for_each_possible_cpu(cpu) {
+		buf = per_cpu_ptr(&swp_buf, cpu);
+		if (buf->folio) {
+			folio_put(buf->folio);
+			buf->folio = NULL;
+		}
+	}
+	return ret;
+}
+core_initcall(batch_swpin_buffer_init);
+
+static struct folio *alloc_batched_swap_folios(struct vm_fault *vmf,
+		struct batch_swpin_buffer **buf, struct folio **folios,
+		swp_entry_t entry)
+{
+	unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
+	struct batch_swpin_buffer *sbuf = raw_cpu_ptr(&swp_buf);
+	struct folio *folio = sbuf->folio;
+	unsigned long addr;
+	int i;
+
+	if (unlikely(!folio))
+		return NULL;
+
+	for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
+		addr = haddr + i * PAGE_SIZE;
+		folios[i] = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vmf->vma, addr);
+		if (!folios[i])
+			goto err;
+		if (mem_cgroup_swapin_charge_folio(folios[i], vmf->vma->vm_mm,
+					GFP_KERNEL, entry))
+			goto err;
+	}
+
+	mutex_lock(&sbuf->mutex);
+	*buf = sbuf;
+#ifdef CONFIG_MEMCG
+	folio->memcg_data = (*folios)->memcg_data;
+#endif
+	return folio;
+
+err:
+	for (i--; i >= 0; i--)
+		folio_put(folios[i]);
+	return NULL;
+}
+
+static void fill_batched_swap_folios(struct vm_fault *vmf,
+		void *shadow, struct batch_swpin_buffer *buf,
+		struct folio *folio, struct folio **folios)
+{
+	unsigned long haddr = ALIGN_DOWN(vmf->address, BATCH_SWPIN_SIZE);
+	unsigned long addr;
+	int i;
+
+	for (i = 0; i < BATCH_SWPIN_COUNT; i++) {
+		addr = haddr + i * PAGE_SIZE;
+		__folio_set_locked(folios[i]);
+		__folio_set_swapbacked(folios[i]);
+		if (shadow)
+			workingset_refault(folios[i], shadow);
+		folio_add_lru(folios[i]);
+		copy_user_highpage(&folios[i]->page, folio_page(folio, i),
+				addr, vmf->vma);
+		if (folio_test_uptodate(folio))
+			folio_mark_uptodate(folios[i]);
+	}
+
+	folio->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~(1UL << PG_head));
+	mutex_unlock(&buf->mutex);
+}
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf,
+		struct batch_swpin_buffer **buf,
+		struct folio **folios)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long orders;
@@ -4180,6 +4283,9 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 
 	pte_unmap_unlock(pte, ptl);
 
+	if (!orders)
+		goto fallback;
+
 	/* Try allocating the highest of the remaining orders. */
 	gfp = vma_thp_gfp_mask(vma);
 	while (orders) {
@@ -4194,14 +4300,29 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 		order = next_order(&orders, order);
 	}
 
+	/*
+	 * During swap-out, a THP might have been compressed into multiple
+	 * order-2 blocks to optimize CPU usage and compression ratio.
+	 * Attempt to batch swap-in 4 smaller folios to ensure they are
+	 * decompressed together as a single unit only once.
+	 */
+	return alloc_batched_swap_folios(vmf, buf, folios, entry);
+
 fallback:
 	return __alloc_swap_folio(vmf);
 }
 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
-static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+static struct folio *alloc_swap_folio(struct vm_fault *vmf,
+		struct batch_swpin_buffer **buf,
+		struct folio **folios)
 {
 	return __alloc_swap_folio(vmf);
 }
+static inline void fill_batched_swap_folios(struct vm_fault *vmf,
+		void *shadow, struct batch_swpin_buffer *buf,
+		struct folio *folio, struct folio **folios)
+{
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
@@ -4216,6 +4337,8 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
  */
 vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
+	struct folio *folios[BATCH_SWPIN_COUNT] = { NULL };
+	struct batch_swpin_buffer *buf = NULL;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *swapcache, *folio = NULL;
 	DECLARE_WAITQUEUE(wait, current);
@@ -4228,7 +4351,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte_t pte;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
-	int nr_pages;
+	int nr_pages, i;
 	unsigned long page_idx;
 	unsigned long address;
 	pte_t *ptep;
@@ -4296,7 +4419,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */
-			folio = alloc_swap_folio(vmf);
+			folio = alloc_swap_folio(vmf, &buf, folios);
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
@@ -4327,10 +4450,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
 
 				shadow = get_shadow_from_swap_cache(entry);
-				if (shadow)
+				if (shadow && !buf)
 					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
+				if (!buf)
+					folio_add_lru(folio);
 
 				/* To provide entry to swap_read_folio() */
 				folio->swap = entry;
@@ -4361,6 +4484,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 		page = folio_file_page(folio, swp_offset(entry));
+		/*
+		 * Copy data into batched small folios from the large
+		 * folio buffer
+		 */
+		if (buf) {
+			fill_batched_swap_folios(vmf, shadow, buf, folio, folios);
+			folio = folios[0];
+			page = &folios[0]->page;
+			goto do_map;
+		}
 	} else if (PageHWPoison(page)) {
 		/*
 		 * hwpoisoned dirty swapcache pages are kept for killing
@@ -4415,6 +4548,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			lru_add_drain();
 	}
 
+do_map:
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
@@ -4431,8 +4565,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/* allocated large folios for SWP_SYNCHRONOUS_IO */
-	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
-		unsigned long nr = folio_nr_pages(folio);
+	if ((folio_test_large(folio) || buf) && !folio_test_swapcache(folio)) {
+		unsigned long nr = buf ? BATCH_SWPIN_COUNT : folio_nr_pages(folio);
 		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
 		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
 		pte_t *folio_ptep = vmf->pte - idx;
@@ -4527,6 +4661,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		}
 	}
 
+	/* Batched mapping of allocated small folios for SWP_SYNCHRONOUS_IO */
+	if (buf) {
+		for (i = 0; i < nr_pages; i++)
+			arch_swap_restore(swp_entry(swp_type(entry),
+				swp_offset(entry) + i), folios[i]);
+		swap_free_nr(entry, nr_pages);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+		add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+		rmap_flags |= RMAP_EXCLUSIVE;
+		for (i = 0; i < nr_pages; i++) {
+			unsigned long addr = address + i * PAGE_SIZE;
+
+			pte = mk_pte(&folios[i]->page, vma->vm_page_prot);
+			if (pte_swp_soft_dirty(vmf->orig_pte))
+				pte = pte_mksoft_dirty(pte);
+			if (pte_swp_uffd_wp(vmf->orig_pte))
+				pte = pte_mkuffd_wp(pte);
+			if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
+			    !pte_needs_soft_dirty_wp(vma, pte)) {
+				pte = pte_mkwrite(pte, vma);
+				if ((vmf->flags & FAULT_FLAG_WRITE) && (i == page_idx)) {
+					pte = pte_mkdirty(pte);
+					vmf->flags &= ~FAULT_FLAG_WRITE;
+				}
+			}
+			flush_icache_page(vma, &folios[i]->page);
+			folio_add_new_anon_rmap(folios[i], vma, addr, rmap_flags);
+			set_pte_at(vma->vm_mm, addr, ptep + i, pte);
+			arch_do_swap_page_nr(vma->vm_mm, vma, addr, pte, pte, 1);
+			if (i == page_idx)
+				vmf->orig_pte = pte;
+			folio_unlock(folios[i]);
+		}
+		goto wp_page;
+	}
+
 	/*
 	 * Some architectures may have to restore extra metadata to the page
 	 * when reading from swap. This metadata may be indexed by swap entry
@@ -4612,6 +4782,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_put(swapcache);
 	}
 
+wp_page:
 	if (vmf->flags & FAULT_FLAG_WRITE) {
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
@@ -4638,9 +4809,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out_page:
-	folio_unlock(folio);
+	if (!buf) {
+		folio_unlock(folio);
+	} else {
+		for (i = 0; i < BATCH_SWPIN_COUNT; i++)
+			folio_unlock(folios[i]);
+	}
 out_release:
-	folio_put(folio);
+	if (!buf) {
+		folio_put(folio);
+	} else {
+		for (i = 0; i < BATCH_SWPIN_COUNT; i++)
+			folio_put(folios[i]);
+	}
 	if (folio != swapcache && swapcache) {
 		folio_unlock(swapcache);
 		folio_put(swapcache);