From patchwork Mon Apr 14 01:44:55 2025
X-Patchwork-Submitter: Dongsheng Yang
X-Patchwork-Id: 14049541
From: Dongsheng Yang
To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang
Subject: [RFC PATCH 01/11] pcache: introduce cache_dev for managing persistent memory-based cache devices
Date: Mon, 14 Apr 2025 01:44:55 +0000
Message-Id: <20250414014505.20477-2-dongsheng.yang@linux.dev>
In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev>
References: <20250414014505.20477-1-dongsheng.yang@linux.dev>

This patch introduces the core abstraction `cache_dev`, which represents a persistent memory (pmem) device used by pcache to cache block device data.
The `cache_dev` layer is responsible for pmem initialization, metadata layout, lifetime management, and user-facing administration via sysfs.

Key components:

- cache_dev metadata layout:
  - Superblock (`pcache_sb`) located at offset 4KB; includes magic, version, feature flags (endianness, data CRC), and segment count.
  - Persistent metadata regions (e.g., backing_dev info) follow, double-indexed with CRC and sequence numbers for crash consistency and safe updates.
  - Segments start after the metadata and are used to store cache data.

- DAX-based memory mapping:
  - Uses `fs_dax_get_by_bdev()` and `dax_direct_access()` to map the entire pmem space either directly or via `vmap()` if only a partial mapping can be obtained.
  - Ensures that the underlying pmem is directly accessible for low-latency, zero-copy access to cache segments.

- Sysfs interface:
  - `/sys/.../info`: displays cache superblock information.
  - `/sys/.../path`: shows the associated pmem path.
  - `/sys/.../adm`: accepts administrative commands, e.g. `backing-start` and `backing-stop` (see the usage sketch below).

- Metadata helpers:
  - Implements utility functions for safely locating the latest or oldest valid metadata copy using CRC and sequence numbers.
  - Provides `pcache_meta_get_next_seq()` to compute the next sequence number for a metadata update.

- Registration and formatting:
  - `cache_dev_register()` initializes the cache_dev from a pmem device path, optionally formatting it if `format=1` is specified.
  - `cache_dev_format()` initializes the SB and zeroes the metadata regions.
  - `cache_dev_unregister()` gracefully tears down a cache_dev once all backing devices have been detached.

Design rationale:

The `cache_dev` layer cleanly separates pmem-specific logic from higher-level pcache functionality. It offers a persistent, self-describing structure that supports safe recovery, scalable segment management, and runtime extensibility. It also serves as the anchoring point for dynamic attachment of backing block devices, which are managed through additional modules layered on top.

This patch lays the foundation for building a robust and high-performance block cache system based on persistent memory.
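For illustration only, here is a minimal userspace sketch of driving the `adm` attribute described above. The sysfs path `/sys/bus/pcache/devices/cache_dev0/adm` is an assumption (the bus/device layout is defined outside this patch); only the option keywords and the `backing-start` op name come from `adm_opt_tokens` and `adm_op_names` in this patch, and `/dev/sdb` is just a placeholder backing device.

  /*
   * Hypothetical usage sketch: start caching for a backing device
   * through the cache_dev "adm" sysfs attribute.
   * The sysfs path is an assumption; only the option keywords
   * (op=, path=, queues=, cache_size=) come from this patch.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* assumed sysfs location of the first registered cache_dev */
          const char *adm = "/sys/bus/pcache/devices/cache_dev0/adm";
          /* cache_size is in MiB, i.e. a 10 GiB cache for /dev/sdb */
          const char *cmd = "op=backing-start,path=/dev/sdb,queues=8,cache_size=10240";
          int fd = open(adm, O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          if (write(fd, cmd, strlen(cmd)) < 0)
                  perror("write");
          close(fd);
          return 0;
  }

Writing `op=backing-stop,backing_id=0` to the same attribute would stop caching for that backing device again.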
Signed-off-by: Dongsheng Yang --- drivers/block/pcache/cache_dev.c | 808 +++++++++++++++++++++++++ drivers/block/pcache/cache_dev.h | 81 +++ drivers/block/pcache/pcache_internal.h | 185 ++++++ 3 files changed, 1074 insertions(+) create mode 100644 drivers/block/pcache/cache_dev.c create mode 100644 drivers/block/pcache/cache_dev.h create mode 100644 drivers/block/pcache/pcache_internal.h diff --git a/drivers/block/pcache/cache_dev.c b/drivers/block/pcache/cache_dev.c new file mode 100644 index 000000000000..8b5fea7bfcee --- /dev/null +++ b/drivers/block/pcache/cache_dev.c @@ -0,0 +1,808 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include +#include + +#include "cache_dev.h" +#include "cache.h" +#include "backing_dev.h" +#include "segment.h" +#include "meta_segment.h" + +static struct pcache_cache_dev *cache_devs[PCACHE_CACHE_DEV_MAX]; +static DEFINE_IDA(cache_devs_id_ida); +static DEFINE_MUTEX(cache_devs_mutex); + +static ssize_t info_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pcache_sb *sb; + struct pcache_cache_dev *cache_dev; + ssize_t ret; + + cache_dev = container_of(dev, struct pcache_cache_dev, device); + sb = cache_dev->sb_addr; + + ret = sprintf(buf, "magic: 0x%llx\n" + "version: %u\n" + "flags: %x\n\n" + "segment_num: %u\n", + le64_to_cpu(sb->magic), + le16_to_cpu(sb->version), + le16_to_cpu(sb->flags), + cache_dev->seg_num); + + return ret; +} +static DEVICE_ATTR_ADMIN_RO(info); + +static ssize_t path_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pcache_cache_dev *cache_dev; + + cache_dev = container_of(dev, struct pcache_cache_dev, device); + + return sprintf(buf, "%s\n", cache_dev->path); +} +static DEVICE_ATTR_ADMIN_RO(path); + +enum { + PCACHE_ADM_OPT_ERR = 0, + PCACHE_ADM_OPT_OP, + PCACHE_ADM_OPT_FORCE, + PCACHE_ADM_OPT_DATA_CRC, + PCACHE_ADM_OPT_PATH, + PCACHE_ADM_OPT_BID, + PCACHE_ADM_OPT_QUEUES, + PCACHE_ADM_OPT_CACHE_SIZE, +}; + +enum { + PCACHE_ADM_OP_B_START, + PCACHE_ADM_OP_B_STOP, +}; + +static const char *const adm_op_names[] = { + [PCACHE_ADM_OP_B_START] = "backing-start", + [PCACHE_ADM_OP_B_STOP] = "backing-stop", +}; + +static const match_table_t adm_opt_tokens = { + { PCACHE_ADM_OPT_OP, "op=%s" }, + { PCACHE_ADM_OPT_FORCE, "force=%u" }, + { PCACHE_ADM_OPT_DATA_CRC, "data_crc=%u" }, + { PCACHE_ADM_OPT_PATH, "path=%s" }, + { PCACHE_ADM_OPT_BID, "backing_id=%u" }, + { PCACHE_ADM_OPT_QUEUES, "queues=%u" }, + { PCACHE_ADM_OPT_CACHE_SIZE, "cache_size=%u" }, /* unit is MiB */ + { PCACHE_ADM_OPT_ERR, NULL } +}; + + +struct pcache_cache_dev_adm_options { + u16 op; + bool force; + bool data_crc; + u32 backing_id; + u32 queues; + char path[PCACHE_PATH_LEN]; + u64 cache_size_M; +}; + +static int parse_adm_options(struct pcache_cache_dev *cache_dev, + char *buf, + struct pcache_cache_dev_adm_options *opts) +{ + substring_t args[MAX_OPT_ARGS]; + char *o, *p; + int token, ret = 0; + + o = buf; + + while ((p = strsep(&o, ",\n")) != NULL) { + if (!*p) + continue; + + token = match_token(p, adm_opt_tokens, args); + switch (token) { + case PCACHE_ADM_OPT_OP: + ret = match_string(adm_op_names, ARRAY_SIZE(adm_op_names), args[0].from); + if (ret < 0) { + cache_dev_err(cache_dev, "unknown op: '%s'\n", args[0].from); + ret = -EINVAL; + goto out; + } + opts->op = ret; + break; + case PCACHE_ADM_OPT_PATH: + if (match_strlcpy(opts->path, &args[0], + PCACHE_PATH_LEN) == 0) { + ret = -EINVAL; + goto out; + } + break; + case PCACHE_ADM_OPT_FORCE: + if (match_uint(args, &token) 
|| token != 1) { + ret = -EINVAL; + goto out; + } + opts->force = 1; + break; + case PCACHE_ADM_OPT_DATA_CRC: + if (match_uint(args, &token) || token != 1) { + ret = -EINVAL; + goto out; + } + opts->data_crc = 1; + break; + case PCACHE_ADM_OPT_BID: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + + opts->backing_id = token; + break; + case PCACHE_ADM_OPT_QUEUES: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + + if (token > PCACHE_QUEUES_MAX) { + cache_dev_err(cache_dev, "invalid queues: %u, larger than max %u\n", + token, PCACHE_QUEUES_MAX); + ret = -EINVAL; + goto out; + } + opts->queues = token; + break; + case PCACHE_ADM_OPT_CACHE_SIZE: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->cache_size_M = token; + break; + default: + cache_dev_err(cache_dev, "unknown parameter or missing value '%s'\n", p); + ret = -EINVAL; + goto out; + } + } + +out: + return ret; +} + +static ssize_t adm_store(struct device *dev, + struct device_attribute *attr, + const char *ubuf, + size_t size) +{ + int ret; + char *buf; + struct pcache_cache_dev_adm_options opts = { 0 }; + struct pcache_cache_dev *cache_dev; + + opts.backing_id = U32_MAX; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + cache_dev = container_of(dev, struct pcache_cache_dev, device); + + buf = kmemdup(ubuf, size + 1, GFP_KERNEL); + if (IS_ERR(buf)) { + cache_dev_err(cache_dev, "failed to dup buf for adm option: %d", (int)PTR_ERR(buf)); + return PTR_ERR(buf); + } + buf[size] = '\0'; + ret = parse_adm_options(cache_dev, buf, &opts); + if (ret < 0) { + kfree(buf); + return ret; + } + kfree(buf); + + mutex_lock(&cache_dev->adm_lock); + switch (opts.op) { + case PCACHE_ADM_OP_B_START: + u32 cache_segs = 0; + struct pcache_backing_dev_opts backing_opts = { 0 }; + + if (opts.cache_size_M > 0) + cache_segs = DIV_ROUND_UP(opts.cache_size_M, + PCACHE_SEG_SIZE / PCACHE_MB); + + backing_opts.path = opts.path; + backing_opts.queues = opts.queues; + backing_opts.cache_segs = cache_segs; + backing_opts.data_crc = opts.data_crc; + + ret = backing_dev_start(cache_dev, &backing_opts); + break; + case PCACHE_ADM_OP_B_STOP: + ret = backing_dev_stop(cache_dev, opts.backing_id); + break; + default: + mutex_unlock(&cache_dev->adm_lock); + cache_dev_err(cache_dev, "invalid op: %d\n", opts.op); + return -EINVAL; + } + mutex_unlock(&cache_dev->adm_lock); + + if (ret < 0) + return ret; + + return size; +} +static DEVICE_ATTR_WO(adm); + +static struct attribute *cache_dev_attrs[] = { + &dev_attr_info.attr, + &dev_attr_path.attr, + &dev_attr_adm.attr, + NULL +}; + +static struct attribute_group cache_dev_attr_group = { + .attrs = cache_dev_attrs, +}; + +static const struct attribute_group *cache_dev_attr_groups[] = { + &cache_dev_attr_group, + NULL +}; + +static void cache_dev_release(struct device *dev) +{ +} + +const struct device_type cache_dev_type = { + .name = "cache_dev", + .groups = cache_dev_attr_groups, + .release = cache_dev_release, +}; + +static void cache_dev_free(struct pcache_cache_dev *cache_dev) +{ + cache_devs[cache_dev->id] = NULL; + ida_simple_remove(&cache_devs_id_ida, cache_dev->id); + kfree(cache_dev); +} + +static struct pcache_cache_dev *cache_dev_alloc(void) +{ + struct pcache_cache_dev *cache_dev; + int ret; + + cache_dev = kzalloc(sizeof(struct pcache_cache_dev), GFP_KERNEL); + if (!cache_dev) + return NULL; + + mutex_init(&cache_dev->lock); + mutex_init(&cache_dev->seg_lock); + mutex_init(&cache_dev->adm_lock); + INIT_LIST_HEAD(&cache_dev->backing_devs); + + ret = 
ida_simple_get(&cache_devs_id_ida, 0, PCACHE_CACHE_DEV_MAX, + GFP_KERNEL); + if (ret < 0) + goto cache_dev_free; + + cache_dev->id = ret; + cache_devs[cache_dev->id] = cache_dev; + + return cache_dev; + +cache_dev_free: + kfree(cache_dev); + return NULL; +} + +static void cache_dev_dax_exit(struct pcache_cache_dev *cache_dev) +{ + if (cache_dev->dax_dev) + fs_put_dax(cache_dev->dax_dev, cache_dev); + + if (cache_dev->bdev_file) + fput(cache_dev->bdev_file); +} + +static int cache_dev_dax_notify_failure(struct dax_device *dax_dev, u64 offset, + u64 len, int mf_flags) +{ + + pr_err("%s: dax_dev %llx offset %llx len %lld mf_flags %x\n", + __func__, (u64)dax_dev, (u64)offset, (u64)len, mf_flags); + + return -EOPNOTSUPP; +} + +const struct dax_holder_operations cache_dev_dax_holder_ops = { + .notify_failure = cache_dev_dax_notify_failure, +}; + +static int cache_dev_dax_init(struct pcache_cache_dev *cache_dev, char *path) +{ + struct dax_device *dax_dev = NULL; + struct file *bdev_file = NULL; + struct block_device *bdev; + long total_pages, mapped_pages; + u64 bdev_size, start_off = 0; + struct page **pages = NULL; + void *vaddr = NULL; + int ret, id; + pfn_t pfn; + long i = 0; + + /* Copy the device path */ + memcpy(cache_dev->path, path, PCACHE_PATH_LEN); + + /* Open block device */ + bdev_file = bdev_file_open_by_path(path, BLK_OPEN_READ | BLK_OPEN_WRITE, cache_dev, NULL); + if (IS_ERR(bdev_file)) { + ret = PTR_ERR(bdev_file); + cache_dev_err(cache_dev, "failed to open bdev %s, err=%d\n", path, ret); + goto err; + } + + /* Get block device structure */ + bdev = file_bdev(bdev_file); + if (!bdev) { + ret = -EINVAL; + cache_dev_err(cache_dev, "failed to get bdev from file\n"); + goto fput; + } + + /* Get total device size */ + bdev_size = bdev_nr_bytes(bdev); + if (bdev_size == 0) { + ret = -ENODEV; + cache_dev_err(cache_dev, "device %s has zero size\n", path); + goto fput; + } + + /* Convert device size to total pages */ + total_pages = bdev_size >> PAGE_SHIFT; + + /* Get the DAX device */ + dax_dev = fs_dax_get_by_bdev(bdev, &start_off, cache_dev, &cache_dev_dax_holder_ops); + if (IS_ERR(dax_dev)) { + ret = PTR_ERR(dax_dev); + cache_dev_err(cache_dev, "failed to get dax_dev from bdev, err=%d\n", ret); + goto fput; + } + + /* Lock DAX access */ + id = dax_read_lock(); + + /* Try to access the entire device memory */ + mapped_pages = dax_direct_access(dax_dev, 0, total_pages, DAX_ACCESS, &vaddr, &pfn); + if (mapped_pages < 0) { + cache_dev_err(cache_dev, "dax_direct_access failed, err=%ld\n", mapped_pages); + ret = mapped_pages; + goto unlock; + } + + if (!pfn_t_has_page(pfn)) { + cache_dev_err(cache_dev, "pfn_t does not have a valid page mapping\n"); + ret = -EOPNOTSUPP; + goto unlock; + } + + /* If all pages are mapped in one go, use direct mapping */ + if (mapped_pages == total_pages) { + cache_dev->sb_addr = (struct pcache_sb *)vaddr; + } else { + /* Use vmap() to create a contiguous mapping */ + long chunk_size; + + cache_dev_debug(cache_dev, "partial mapping, using vmap\n"); + + pages = vmalloc_array(total_pages, sizeof(struct page *)); + if (!pages) { + ret = -ENOMEM; + goto unlock; + } + + i = 0; + do { + /* Access each page range in DAX */ + chunk_size = dax_direct_access(dax_dev, i, total_pages - i, DAX_ACCESS, NULL, &pfn); + if (chunk_size <= 0) { + ret = chunk_size ? 
chunk_size : -EINVAL; + goto vfree; + } + + if (!pfn_t_has_page(pfn)) { + ret = -EOPNOTSUPP; + goto vfree; + } + + /* Store pages in the array for vmap */ + while (chunk_size-- && i < total_pages) { + pages[i++] = pfn_t_to_page(pfn); + pfn.val++; + if (!(i & 15)) + cond_resched(); + } + } while (i < total_pages); + + /* Map all pages into a contiguous virtual address */ + vaddr = vmap(pages, total_pages, VM_MAP, PAGE_KERNEL); + if (!vaddr) { + cache_dev_err(cache_dev, "vmap failed"); + ret = -ENOMEM; + goto vfree; + } + + vfree(pages); + cache_dev->sb_addr = (struct pcache_sb *)vaddr; + } + + /* Unlock and store references */ + dax_read_unlock(id); + + cache_dev->bdev_file = bdev_file; + cache_dev->dax_dev = dax_dev; + cache_dev->bdev = bdev; + + return 0; + +vfree: + vfree(pages); +unlock: + dax_read_unlock(id); + fs_put_dax(dax_dev, cache_dev); +fput: + fput(bdev_file); +err: + return ret; +} + +void cache_dev_flush(struct pcache_cache_dev *cache_dev, void *pos, u32 size) +{ + dax_flush(cache_dev->dax_dev, pos, size); +} + +void cache_dev_zero_range(struct pcache_cache_dev *cache_dev, void *pos, u32 size) +{ + memset(pos, 0, size); + cache_dev_flush(cache_dev, pos, size); +} + +static int cache_dev_format(struct pcache_cache_dev *cache_dev, bool force) +{ + struct pcache_sb *sb = cache_dev->sb_addr; + u64 nr_segs; + u64 cache_dev_size; + u64 magic; + u16 flags = 0; + + magic = le64_to_cpu(sb->magic); + if (magic && !force) + return -EEXIST; + + cache_dev_size = bdev_nr_bytes(file_bdev(cache_dev->bdev_file)); + if (cache_dev_size < PCACHE_CACHE_DEV_SIZE_MIN) { + cache_dev_err(cache_dev, "dax device is too small, required at least %u", + PCACHE_CACHE_DEV_SIZE_MIN); + return -ENOSPC; + } + + nr_segs = (cache_dev_size - PCACHE_SEGMENTS_OFF) / ((PCACHE_SEG_SIZE)); + + /* Segment 0 to be backing info segment, clear it */ + cache_dev_zero_range(cache_dev, CACHE_DEV_BACKING_SEG(cache_dev), PCACHE_SEG_SIZE); + + sb->version = cpu_to_le16(PCACHE_VERSION); + +#if defined(__BYTE_ORDER) ? (__BIG_ENDIAN == __BYTE_ORDER) : defined(__BIG_ENDIAN) + flags |= PCACHE_SB_F_BIGENDIAN; +#endif + sb->flags = cpu_to_le16(flags); + + sb->magic = cpu_to_le64(PCACHE_MAGIC); + sb->seg_num = cpu_to_le16(nr_segs); + + sb->crc = cpu_to_le32(crc32(0, (void *)sb + 4, PCACHE_SB_SIZE - 4)); + + return 0; +} + +static int sb_validate(struct pcache_cache_dev *cache_dev) +{ + struct pcache_sb *sb = cache_dev->sb_addr; + u16 flags; + + if (le64_to_cpu(sb->magic) != PCACHE_MAGIC) { + cache_dev_err(cache_dev, "unexpected magic: %llx\n", + le64_to_cpu(sb->magic)); + return -EINVAL; + } + + flags = le16_to_cpu(sb->flags); + +#if defined(__BYTE_ORDER) ? 
(__BIG_ENDIAN == __BYTE_ORDER) : defined(__BIG_ENDIAN) + if (!(flags & PCACHE_SB_F_BIGENDIAN)) { + cache_dev_err(cache_dev, "cache_dev is not big endian\n"); + return -EINVAL; + } +#else + if (flags & PCACHE_SB_F_BIGENDIAN) { + cache_dev_err(cache_dev, "cache_dev is big endian\n"); + return -EINVAL; + } +#endif + return 0; +} + +static void backing_dev_info_init(struct pcache_cache_dev *cache_dev) +{ + struct pcache_segment_info *seg_info; + struct pcache_meta_segment *meta_seg; + struct pcache_backing_dev_info *backing_info, *backing_info_addr; + u32 seg_id; + u32 i; + + meta_seg = cache_dev->backing_info_seg; +again: + set_bit(meta_seg->segment.seg_info->seg_id, cache_dev->seg_bitmap); + /* Try to find the backing_dev_id with same path */ + pcache_meta_seg_for_each_meta(meta_seg, i, backing_info_addr) { + backing_info = pcache_meta_find_latest(&backing_info_addr->header, PCACHE_BACKING_DEV_INFO_SIZE); + if (!backing_info || backing_info->state != PCACHE_BACKING_STATE_RUNNING) + continue; + + seg_id = backing_info->cache_info.seg_id; +next_seg: + seg_info = pcache_segment_info_read(cache_dev, seg_id); + BUG_ON(!seg_info); + set_bit(seg_info->seg_id, cache_dev->seg_bitmap); + if (segment_info_has_next(seg_info)) { + seg_id = seg_info->next_seg; + goto next_seg; + } + } + + if (meta_seg->next_meta_seg) { + meta_seg = meta_seg->next_meta_seg; + goto again; + } +} + +static int cache_dev_init(struct pcache_cache_dev *cache_dev, + struct pcache_cache_dev_register_options *opts) +{ + struct pcache_sb *sb; + struct device *dev; + int ret; + + ret = sb_validate(cache_dev); + if (ret) + goto err; + + sb = cache_dev->sb_addr; + cache_dev->seg_num = le64_to_cpu(sb->seg_num); + + cache_dev->seg_bitmap = bitmap_zalloc(cache_dev->seg_num, GFP_KERNEL); + if (!cache_dev->seg_bitmap) + goto err; + + cache_dev->backing_info_seg = pcache_meta_seg_alloc(cache_dev, 0, PCACHE_BACKING_DEV_INFO_SIZE); + if (!cache_dev->backing_info_seg) + goto free_bitmap; + + backing_dev_info_init(cache_dev); + + dev = &cache_dev->device; + device_initialize(dev); + device_set_pm_not_required(dev); + dev->bus = &pcache_bus_type; + dev->type = &cache_dev_type; + dev->parent = &pcache_root_dev; + dev_set_name(&cache_dev->device, "cache_dev%d", cache_dev->id); + ret = device_add(&cache_dev->device); + if (ret) + goto free_backing_info_seg; + + return 0; + +free_backing_info_seg: + pcache_meta_seg_free(cache_dev->backing_info_seg); +free_bitmap: + bitmap_free(cache_dev->seg_bitmap); +err: + return ret; +} + +static void cache_dev_exit(struct pcache_cache_dev *cache_dev) +{ + device_unregister(&cache_dev->device); + pcache_meta_seg_free(cache_dev->backing_info_seg); + bitmap_free(cache_dev->seg_bitmap); +} + +int cache_dev_unregister(u32 cache_dev_id) +{ + struct pcache_cache_dev *cache_dev; + + if (cache_dev_id >= PCACHE_CACHE_DEV_MAX) { + pr_err("invalid cache_dev_id: %u\n", cache_dev_id); + return -EINVAL; + } + + cache_dev = cache_devs[cache_dev_id]; + if (!cache_dev) { + pr_err("cache_dev: %u, is not registered\n", cache_dev_id); + return -EINVAL; + } + + mutex_lock(&cache_dev->lock); + if (!list_empty(&cache_dev->backing_devs)) { + mutex_unlock(&cache_dev->lock); + return -EBUSY; + } + mutex_unlock(&cache_dev->lock); + + cache_dev_exit(cache_dev); + cache_dev_dax_exit(cache_dev); + cache_dev_free(cache_dev); + module_put(THIS_MODULE); + + return 0; +} + +int cache_dev_register(struct pcache_cache_dev_register_options *opts) +{ + struct pcache_cache_dev *cache_dev; + int ret; + + if (!try_module_get(THIS_MODULE)) + return 
-ENODEV; + + if (!strstr(opts->path, "/dev/pmem")) { + pr_err("%s: path (%s) is not pmem\n", + __func__, opts->path); + ret = -EINVAL; + goto module_put; + } + + cache_dev = cache_dev_alloc(); + if (!cache_dev) { + ret = -ENOMEM; + goto module_put; + } + + ret = cache_dev_dax_init(cache_dev, opts->path); + if (ret) + goto cache_dev_free; + + if (opts->format) { + ret = cache_dev_format(cache_dev, opts->force); + if (ret < 0) + goto dax_release; + } + + ret = cache_dev_init(cache_dev, opts); + if (ret) + goto dax_release; + + return 0; +dax_release: + cache_dev_dax_exit(cache_dev); +cache_dev_free: + cache_dev_free(cache_dev); +module_put: + module_put(THIS_MODULE); + + return ret; +} + +int cache_dev_find_backing_info(struct pcache_cache_dev *cache_dev, struct pcache_backing_dev *backing_dev, bool *new_backing) +{ + struct pcache_meta_segment *meta_seg; + struct pcache_backing_dev_info *backing_info, *backing_info_addr; + struct pcache_backing_dev_info *empty_backing_info; + bool empty_id_found = false; + u32 total_id = 0; + u32 empty_id; + int ret; + u32 i; + + meta_seg = cache_dev->backing_info_seg; +again: + /* Try to find the backing_dev_id with same path */ + pcache_meta_seg_for_each_meta(meta_seg, i, backing_info_addr) { + backing_info = pcache_meta_find_latest(&backing_info_addr->header, PCACHE_BACKING_DEV_INFO_SIZE); + + if (!backing_info || backing_info->state == PCACHE_BACKING_STATE_NONE) { + if (!empty_id_found) { + empty_id_found = true; + empty_backing_info = backing_info_addr; + empty_id = total_id; + } + total_id++; + continue; + } + + if (strcmp(backing_info->path, backing_dev->backing_dev_info.path) == 0) { + backing_dev->backing_dev_id = backing_info->backing_dev_id; + backing_dev->backing_dev_info_addr = backing_info_addr; + *new_backing = false; + ret = 0; + goto out; + } + total_id++; + } + + if (meta_seg->next_meta_seg) { + meta_seg = meta_seg->next_meta_seg; + goto again; + } + + if (empty_id_found) { + backing_dev->backing_dev_info_addr = empty_backing_info; + backing_dev->backing_dev_id = empty_id; + *new_backing = true; + ret = 0; + goto out; + } + + ret = -ENOSPC; + /* TODO allocate a new meta seg for backing_dev_info */ +out: + return ret; +} + +int cache_dev_add_backing(struct pcache_cache_dev *cache_dev, struct pcache_backing_dev *backing_dev) +{ + mutex_lock(&cache_dev->lock); + list_add_tail(&backing_dev->node, &cache_dev->backing_devs); + mutex_unlock(&cache_dev->lock); + + return 0; +} + +struct pcache_backing_dev *cache_dev_fetch_backing(struct pcache_cache_dev *cache_dev, u32 backing_dev_id) +{ + struct pcache_backing_dev *temp_backing_dev; + + mutex_lock(&cache_dev->lock); + list_for_each_entry(temp_backing_dev, &cache_dev->backing_devs, node) { + if (temp_backing_dev->backing_dev_id == backing_dev_id) { + list_del_init(&temp_backing_dev->node); + goto found; + } + } + temp_backing_dev = NULL; +found: + mutex_unlock(&cache_dev->lock); + return temp_backing_dev; +} + +int cache_dev_get_empty_segment_id(struct pcache_cache_dev *cache_dev, u32 *seg_id) +{ + int ret; + + mutex_lock(&cache_dev->seg_lock); + *seg_id = find_next_zero_bit(cache_dev->seg_bitmap, cache_dev->seg_num, 0); + if (*seg_id == cache_dev->seg_num) { + ret = -ENOSPC; + goto unlock; + } + + set_bit(*seg_id, cache_dev->seg_bitmap); + ret = 0; +unlock: + mutex_unlock(&cache_dev->seg_lock); + return ret; +} diff --git a/drivers/block/pcache/cache_dev.h b/drivers/block/pcache/cache_dev.h new file mode 100644 index 000000000000..cad1b97a38ba --- /dev/null +++ b/drivers/block/pcache/cache_dev.h 
@@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _PCACHE_CACHE_DEV_H +#define _PCACHE_CACHE_DEV_H + +#include + +#include "pcache_internal.h" +#include "meta_segment.h" + +#define cache_dev_err(cache_dev, fmt, ...) \ + pcache_err("cache_dev%u: " fmt, \ + cache_dev->id, ##__VA_ARGS__) +#define cache_dev_info(cache_dev, fmt, ...) \ + pcache_info("cache_dev%u: " fmt, \ + cache_dev->id, ##__VA_ARGS__) +#define cache_dev_debug(cache_dev, fmt, ...) \ + pcache_debug("cache_dev%u: " fmt, \ + cache_dev->id, ##__VA_ARGS__) + +/* + * PCACHE SB flags configured during formatting + * + * The PCACHE_SB_F_xxx flags define registration requirements based on cache_dev + * formatting. For a machine to register a cache_dev: + * - PCACHE_SB_F_BIGENDIAN: Requires a big-endian machine. + */ +#define PCACHE_SB_F_BIGENDIAN (1 << 0) + +struct pcache_sb { + __le32 crc; + __le16 version; + __le16 flags; + __le64 magic; + + __le16 seg_num; +}; + +struct pcache_cache_dev { + u16 id; + u16 seg_num; + struct pcache_sb *sb_addr; + struct device device; + struct mutex lock; + struct mutex adm_lock; + struct list_head backing_devs; + + char path[PCACHE_PATH_LEN]; + struct dax_device *dax_dev; + struct file *bdev_file; + struct block_device *bdev; + + struct mutex seg_lock; + unsigned long *seg_bitmap; + + struct pcache_meta_segment *backing_info_seg; +}; + +struct pcache_cache_dev_register_options { + char path[PCACHE_PATH_LEN]; + bool format; + bool force; +}; + +struct pcache_backing_dev; +int cache_dev_register(struct pcache_cache_dev_register_options *opts); +int cache_dev_unregister(u32 cache_dev_id); + +void cache_dev_flush(struct pcache_cache_dev *cache_dev, void *pos, u32 size); +void cache_dev_zero_range(struct pcache_cache_dev *cache_dev, void *pos, u32 size); + +int cache_dev_find_backing_info(struct pcache_cache_dev *cache_dev, + struct pcache_backing_dev *backing_dev, bool *new_backing); + +int cache_dev_add_backing(struct pcache_cache_dev *cache_dev, struct pcache_backing_dev *backing_dev); +struct pcache_backing_dev *cache_dev_fetch_backing(struct pcache_cache_dev *cache_dev, u32 backing_dev_id); +int cache_dev_get_empty_segment_id(struct pcache_cache_dev *cache_dev, u32 *seg_id); + +extern const struct bus_type pcache_bus_type; +extern struct device pcache_root_dev; + +#endif /* _PCACHE_CACHE_DEV_H */ diff --git a/drivers/block/pcache/pcache_internal.h b/drivers/block/pcache/pcache_internal.h new file mode 100644 index 000000000000..dd51d8339275 --- /dev/null +++ b/drivers/block/pcache/pcache_internal.h @@ -0,0 +1,185 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _PCACHE_INTERNAL_H +#define _PCACHE_INTERNAL_H + +#include +#include + +#define pcache_err(fmt, ...) \ + pr_err("pcache: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__) +#define pcache_info(fmt, ...) \ + pr_info("pcache: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__) +#define pcache_debug(fmt, ...) 
\ + pr_debug("pcache: %s:%u " fmt, __func__, __LINE__, ##__VA_ARGS__) + +#define PCACHE_KB (1024ULL) /* 1 Kilobyte in bytes */ +#define PCACHE_MB (1024ULL * PCACHE_KB) /* 1 Megabyte in bytes */ + +#define PCACHE_CACHE_DEV_MAX 1024 /* Maximum number of cache_dev instances */ +#define PCACHE_PATH_LEN 256 + +#define PCACHE_QUEUES_MAX 128 /* Maximum number of I/O queues */ + +#define PCACHE_PART_SHIFT 4 /* Bit shift for partition identifier */ + +/* pcache segment */ +#define PCACHE_SEG_SIZE (16 * 1024 * 1024ULL) /* Size of each PCACHE segment (16 MB) */ + +#define PCACHE_MAGIC 0x65B05EFA96C596EFULL /* Unique identifier for PCACHE cache dev */ +#define PCACHE_VERSION 1 + +/* Maximum number of metadata indices */ +#define PCACHE_META_INDEX_MAX 2 + +#define PCACHE_SB_OFF 4096 +#define PCACHE_SB_SIZE PAGE_SIZE + +#define PCACHE_CACHE_DEV_INFO_OFF (PCACHE_SB_OFF + PCACHE_SB_SIZE) +#define PCACHE_CACHE_DEV_INFO_SIZE PAGE_SIZE +#define PCACHE_CACHE_DEV_INFO_STRIDE (PCACHE_CACHE_DEV_INFO_SIZE * PCACHE_META_INDEX_MAX) + +#define PCACHE_SEGMENTS_OFF (PCACHE_CACHE_DEV_INFO_OFF + PCACHE_CACHE_DEV_INFO_STRIDE) +#define PCACHE_SEG_INFO_SIZE PAGE_SIZE + +#define PCACHE_BACKING_DEV_INFO_SIZE PAGE_SIZE + +#define PCACHE_CACHE_DEV_SIZE_MIN (512 * 1024 * 1024) /* 512 MB */ + +#define CACHE_DEV_SEGMENTS(cache_dev) ((void *)cache_dev->sb_addr + PCACHE_SEGMENTS_OFF) +#define CACHE_DEV_SEGMENT(cache_dev, id) ((void *)CACHE_DEV_SEGMENTS(cache_dev) + (u64)id * PCACHE_SEG_SIZE) + +#define BACKING_DEV_INFO_SEG_ID 0 +#define CACHE_DEV_BACKING_SEG(cache_dev) (CACHE_DEV_SEGMENT(cache_dev, BACKING_DEV_INFO_SEG_ID)) + +/* + * struct pcache_meta_header - PCACHE metadata header structure + * @crc: CRC checksum for validating metadata integrity. + * @seq: Sequence number to track metadata updates. + * @version: Metadata version. + * @res: Reserved space for future use. + */ +struct pcache_meta_header { + u32 crc; + u8 seq; + u8 version; + u16 res; +}; + +/* + * pcache_meta_crc - Calculate CRC for the given metadata header. + * @header: Pointer to the metadata header. + * @meta_size: Size of the metadata structure. + * + * Returns the CRC checksum calculated by excluding the CRC field itself. + */ +static inline u32 pcache_meta_crc(struct pcache_meta_header *header, u32 meta_size) +{ + return crc32(0, (void *)header + 4, meta_size - 4); /* CRC calculated starting after the crc field */ +} + +/* + * pcache_meta_seq_after - Check if a sequence number is more recent, accounting for overflow. + * @seq1: First sequence number. + * @seq2: Second sequence number. + * + * Determines if @seq1 is more recent than @seq2 by calculating the signed + * difference between them. This approach allows handling sequence number + * overflow correctly because the difference wraps naturally, and any value + * greater than zero indicates that @seq1 is "after" @seq2. This method + * assumes 8-bit unsigned sequence numbers, where the difference wraps + * around if seq1 overflows past seq2. + * + * Returns: + * - true if @seq1 is more recent than @seq2, indicating it comes "after" + * - false otherwise. + */ +static inline bool pcache_meta_seq_after(u8 seq1, u8 seq2) +{ + return (s8)(seq1 - seq2) > 0; +} + +/* + * pcache_meta_find_latest - Find the latest valid metadata. + * @header: Pointer to the metadata header. + * @meta_size: Size of each metadata block. + * + * Finds the latest valid metadata by checking sequence numbers. If a + * valid entry with the highest sequence number is found, its pointer + * is returned. 
Returns NULL if no valid metadata is found. + */ +static inline void *pcache_meta_find_latest(struct pcache_meta_header *header, + u32 meta_size) +{ + struct pcache_meta_header *meta, *latest = NULL; + u32 i; + + for (i = 0; i < PCACHE_META_INDEX_MAX; i++) { + meta = (void *)header + (i * meta_size); + + /* Skip if CRC check fails */ + if (meta->crc != pcache_meta_crc(meta, meta_size)) + continue; + + /* Update latest if a more recent sequence is found */ + if (!latest || pcache_meta_seq_after(meta->seq, latest->seq)) + latest = meta; + } + + return latest; +} + +/* + * pcache_meta_find_oldest - Find the oldest valid metadata. + * @header: Pointer to the metadata header. + * @meta_size: Size of each metadata block. + * + * Returns the oldest valid metadata by comparing sequence numbers. + * If an entry with the lowest sequence number is found, its pointer + * is returned. Returns NULL if no valid metadata is found. + */ +static inline void *pcache_meta_find_oldest(struct pcache_meta_header *header, + u32 meta_size) +{ + struct pcache_meta_header *meta, *oldest = NULL; + u32 i; + + for (i = 0; i < PCACHE_META_INDEX_MAX; i++) { + meta = (void *)header + (meta_size * i); + + /* Mark as oldest if CRC check fails */ + if (meta->crc != pcache_meta_crc(meta, meta_size)) { + oldest = meta; + break; + } + + /* Update oldest if an older sequence is found */ + if (!oldest || pcache_meta_seq_after(oldest->seq, meta->seq)) + oldest = meta; + } + + BUG_ON(!oldest); + + return oldest; +} + +/* + * pcache_meta_get_next_seq - Get the next sequence number for metadata. + * @header: Pointer to the metadata header. + * @meta_size: Size of each metadata block. + * + * Returns the next sequence number based on the latest metadata entry. + * If no latest metadata is found, returns 1. 
+ */ +static inline u32 pcache_meta_get_next_seq(struct pcache_meta_header *header, + u32 meta_size) +{ + struct pcache_meta_header *latest; + + latest = pcache_meta_find_latest(header, meta_size); + if (!latest) + return 1; + + return (latest->seq + 1); +} + +#endif /* _PCACHE_INTERNAL_H */ From patchwork Mon Apr 14 01:44:56 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049543 Received: from out-176.mta1.migadu.com (out-176.mta1.migadu.com [95.215.58.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 35E9318DB26 for ; Mon, 14 Apr 2025 01:45:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.176 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595132; cv=none; b=SYH4Pay3ZiKZftQG5JzeZsPxu3Nv9A1KVnzwafXsrVz2Mxnm74mQGb8eKpKoqosvEihZy9ic7VcmGEFnBxe+aKkwcITBz0Tx7Baz3AB0ES3SBHP2xPdhGlTTfiWR5FLLDrhWzcTS9ACZih/2S+degMFyi4TrT3+TbR3VltOsiAY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595132; c=relaxed/simple; bh=0JwQesxzmbdE1v2Isi0vz7bFbHUnKWRof5qvxOw8zjg=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Yhp5Wr1QJ3HamLtpgIxvdgePkbFmrFmvfcqlYq0F5IJRZ2O8jT9hNS7Ux+w+KUb9zBmsM4Tbmlx26drWk1f8mPnrz8SAzmvdXro6QWPDKW/RaWCSV77wejLZ0xL0tBGL976zVPU+iFsuahYUpwUrzoKnUrwcDnTGu5W8rmMkTak= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=SVVzovFh; arc=none smtp.client-ip=95.215.58.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="SVVzovFh" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
From: Dongsheng Yang
To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang
Subject: [RFC PATCH 02/11] pcache: introduce segment abstraction
Date: Mon, 14 Apr 2025 01:44:56 +0000
Message-Id: <20250414014505.20477-3-dongsheng.yang@linux.dev>
In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev>
References: <20250414014505.20477-1-dongsheng.yang@linux.dev>

pcache: introduce segment abstraction and metadata support

This patch introduces the basic infrastructure for managing segments in the pcache system. A "segment" is the minimum unit of allocation and persistence on the persistent memory used as cache.

Key features introduced:

- `struct pcache_segment` and associated helpers for managing segment data.
- Metadata handling for segments via `struct pcache_segment_info`, including type, state, data offset, and next-segment pointer.
- Support for reading and writing segment metadata with on-media consistency using the `pcache_meta_find_latest()` and `pcache_meta_find_oldest()` helpers (a small sketch of the two-copy scheme follows below).
- Abstractions for copying data between segments and bio vectors:
  - `segment_copy_to_bio()`
  - `segment_copy_from_bio()`
- Logical cursor `segment_pos_advance()` for iterating over data inside a segment.

Segment metadata is stored inline in each segment and versioned with CRC to ensure integrity and crash safety. The segment design also lays the foundation for segment chaining via `next_seg`, which will be used by cache_segment and other higher-level structures.

This patch is part of the core segment layer and will be used by metadata and data layers such as meta_segment and cache_segment in subsequent patches.
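As a concrete note on the on-media consistency scheme mentioned above, the following self-contained sketch (plain userspace C, not the kernel code) models the two-copy metadata idea: each entry is kept in PCACHE_META_INDEX_MAX (two) slots, readers pick the slot with the newer 8-bit sequence number, and writers always overwrite the older slot. CRC validation is omitted for brevity; the comparison mirrors `pcache_meta_seq_after()` from pcache_internal.h.

  /*
   * Simplified model of the two-copy metadata scheme
   * (see pcache_meta_find_latest()/pcache_meta_find_oldest()).
   * CRC checks are omitted; only the 8-bit sequence comparison
   * with wraparound is shown.
   */
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* true if seq1 is more recent than seq2, even across u8 wraparound */
  static bool seq_after(uint8_t seq1, uint8_t seq2)
  {
          return (int8_t)(seq1 - seq2) > 0;
  }

  int main(void)
  {
          /* copy 0 was written at seq 255, copy 1 at seq 0 (wrapped) */
          uint8_t seq[2] = { 255, 0 };
          int latest = seq_after(seq[1], seq[0]) ? 1 : 0;

          /*
           * An updater always rewrites the other (older) copy, bumps the
           * sequence and recomputes the CRC, so a crash mid-update leaves
           * the previous copy intact.
           */
          printf("latest copy: %d (seq %d)\n", latest, seq[latest]);
          printf("next update goes to copy %d with seq %d\n",
                 1 - latest, (seq[latest] + 1) & 0xff);
          return 0;
  }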
Signed-off-by: Dongsheng Yang --- drivers/block/pcache/segment.c | 175 +++++++++++++++++++++++++++++++++ drivers/block/pcache/segment.h | 78 +++++++++++++++ 2 files changed, 253 insertions(+) create mode 100644 drivers/block/pcache/segment.c create mode 100644 drivers/block/pcache/segment.h diff --git a/drivers/block/pcache/segment.c b/drivers/block/pcache/segment.c new file mode 100644 index 000000000000..01e43c9d9bfa --- /dev/null +++ b/drivers/block/pcache/segment.c @@ -0,0 +1,175 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +#include + +#include "pcache_internal.h" +#include "cache_dev.h" +#include "cache.h" +#include "backing_dev.h" +#include "meta_segment.h" +#include "segment.h" + +int segment_pos_advance(struct pcache_segment_pos *seg_pos, u32 len) +{ + u32 to_advance; + + while (len) { + to_advance = len; + + if (to_advance > seg_pos->segment->data_size - seg_pos->off) + to_advance = seg_pos->segment->data_size - seg_pos->off; + + seg_pos->off += to_advance; + + len -= to_advance; + } + + return 0; +} + +int segment_copy_to_bio(struct pcache_segment *segment, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off) +{ + struct bio_vec bv; + struct bvec_iter iter; + void *dst; + u32 to_copy, page_off = 0; + struct pcache_segment_pos pos = { .segment = segment, + .off = data_off }; +next: + bio_for_each_segment(bv, bio, iter) { + if (bio_off > bv.bv_len) { + bio_off -= bv.bv_len; + continue; + } + page_off = bv.bv_offset; + page_off += bio_off; + bio_off = 0; + + dst = kmap_local_page(bv.bv_page); +again: + segment = pos.segment; + + to_copy = min(bv.bv_offset + bv.bv_len - page_off, + segment->data_size - pos.off); + if (to_copy > data_len) + to_copy = data_len; + + flush_dcache_page(bv.bv_page); + memcpy(dst + page_off, segment->data + pos.off, to_copy); + + /* advance */ + pos.off += to_copy; + page_off += to_copy; + data_len -= to_copy; + if (!data_len) { + kunmap_local(dst); + return 0; + } + + /* more data in this bv page */ + if (page_off < bv.bv_offset + bv.bv_len) + goto again; + kunmap_local(dst); + } + + if (bio->bi_next) { + bio = bio->bi_next; + goto next; + } + + return 0; +} + +void segment_copy_from_bio(struct pcache_segment *segment, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off) +{ + struct bio_vec bv; + struct bvec_iter iter; + void *src; + u32 to_copy, page_off = 0; + struct pcache_segment_pos pos = { .segment = segment, + .off = data_off }; +next: + bio_for_each_segment(bv, bio, iter) { + if (bio_off > bv.bv_len) { + bio_off -= bv.bv_len; + continue; + } + page_off = bv.bv_offset; + page_off += bio_off; + bio_off = 0; + + src = kmap_local_page(bv.bv_page); +again: + segment = pos.segment; + + to_copy = min(bv.bv_offset + bv.bv_len - page_off, + segment->data_size - pos.off); + if (to_copy > data_len) + to_copy = data_len; + + memcpy_flushcache(segment->data + pos.off, src + page_off, to_copy); + flush_dcache_page(bv.bv_page); + + /* advance */ + pos.off += to_copy; + page_off += to_copy; + data_len -= to_copy; + if (!data_len) { + kunmap_local(src); + return; + } + + /* more data in this bv page */ + if (page_off < bv.bv_offset + bv.bv_len) + goto again; + kunmap_local(src); + } + + if (bio->bi_next) { + bio = bio->bi_next; + goto next; + } +} + +int pcache_segment_init(struct pcache_cache_dev *cache_dev, struct pcache_segment *segment, + struct pcache_segment_init_options *options) +{ + segment->seg_info = options->seg_info; + + segment->seg_info->type = options->type; + segment->seg_info->state = options->state; + segment->seg_info->seg_id 
= options->seg_id; + segment->seg_info->data_off = options->data_off; + + segment->cache_dev = cache_dev; + segment->data_size = PCACHE_SEG_SIZE - options->data_off; + segment->data = CACHE_DEV_SEGMENT(cache_dev, options->seg_id) + options->data_off; + + return 0; +} + +void pcache_segment_info_write(struct pcache_cache_dev *cache_dev, struct pcache_segment_info *seg_info, u32 seg_id) +{ + struct pcache_segment_info *seg_info_addr; + + seg_info->header.seq++; + + seg_info_addr = CACHE_DEV_SEGMENT(cache_dev, seg_id); + seg_info_addr = pcache_meta_find_oldest(&seg_info_addr->header, PCACHE_SEG_INFO_SIZE); + + memcpy(seg_info_addr, seg_info, sizeof(struct pcache_segment_info)); + + seg_info_addr->header.crc = pcache_meta_crc(&seg_info_addr->header, PCACHE_SEG_INFO_SIZE); + cache_dev_flush(cache_dev, seg_info_addr, PCACHE_SEG_INFO_SIZE); + +} + +struct pcache_segment_info *pcache_segment_info_read(struct pcache_cache_dev *cache_dev, u32 seg_id) +{ + struct pcache_segment_info *seg_info_addr; + + seg_info_addr = CACHE_DEV_SEGMENT(cache_dev, seg_id); + + return pcache_meta_find_latest(&seg_info_addr->header, PCACHE_SEG_INFO_SIZE); +} diff --git a/drivers/block/pcache/segment.h b/drivers/block/pcache/segment.h new file mode 100644 index 000000000000..c41cb8d5b921 --- /dev/null +++ b/drivers/block/pcache/segment.h @@ -0,0 +1,78 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _PCACHE_SEGMENT_H +#define _PCACHE_SEGMENT_H + +#include + +#include "pcache_internal.h" + +#define segment_err(segment, fmt, ...) \ + cache_dev_err(segment->cache_dev, "segment%d: " fmt, \ + segment->seg_id, ##__VA_ARGS__) +#define segment_info(segment, fmt, ...) \ + cache_dev_info(segment->cache_dev, "segment%d: " fmt, \ + segment->seg_id, ##__VA_ARGS__) +#define segment_debug(segment, fmt, ...) 
\ + cache_dev_debug(segment->cache_dev, "segment%d: " fmt, \ + segment->seg_id, ##__VA_ARGS__) + + +#define PCACHE_SEGMENT_STATE_NONE 0 +#define PCACHE_SEGMENT_STATE_RUNNING 1 + +#define PCACHES_TYPE_NONE 0 +#define PCACHES_TYPE_META 1 +#define PCACHE_SEGMENT_TYPE_DATA 2 + +struct pcache_segment_info { + struct pcache_meta_header header; /* Metadata header for the segment */ + u8 type; + u8 state; + u16 flags; + u32 next_seg; + u32 seg_id; + u32 data_off; +}; + +#define PCACHE_SEG_INFO_FLAGS_HAS_NEXT (1 << 0) + +static inline bool segment_info_has_next(struct pcache_segment_info *seg_info) +{ + return (seg_info->flags & PCACHE_SEG_INFO_FLAGS_HAS_NEXT); +} + +struct pcache_segment_pos { + struct pcache_segment *segment; /* Segment associated with the position */ + u32 off; /* Offset within the segment */ +}; + +struct pcache_segment_init_options { + u8 type; + u8 state; + u32 seg_id; + u32 data_off; + + struct pcache_segment_info *seg_info; +}; + +struct pcache_segment { + struct pcache_cache_dev *cache_dev; + + void *data; + u32 data_size; + + struct pcache_segment_info *seg_info; +}; + +int segment_copy_to_bio(struct pcache_segment *segment, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off); +void segment_copy_from_bio(struct pcache_segment *segment, + u32 data_off, u32 data_len, struct bio *bio, u32 bio_off); +int segment_pos_advance(struct pcache_segment_pos *seg_pos, u32 len); +int pcache_segment_init(struct pcache_cache_dev *cache_dev, struct pcache_segment *segment, + struct pcache_segment_init_options *options); + +void pcache_segment_info_write(struct pcache_cache_dev *cache_dev, struct pcache_segment_info *seg_info, u32 seg_id); +struct pcache_segment_info *pcache_segment_info_read(struct pcache_cache_dev *cache_dev, u32 set_id); + +#endif /* _PCACHE_SEGMENT_H */ From patchwork Mon Apr 14 01:44:57 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049544 Received: from out-188.mta1.migadu.com (out-188.mta1.migadu.com [95.215.58.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4935518C031 for ; Mon, 14 Apr 2025 01:45:33 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.188 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595135; cv=none; b=FggTz1xQThzRVoeizePbrwMHfm6zMIb9Csuf0/3LAw1HP1SRK65tjLRSzgSNaq70wniOkneXP1pOh3JJIlPncNylg7+XKq63CKXyMeU6wyX6U/N6b2GhNKSrG3Th3Q7ALC/t65qw0EHULJvY1wlaVsNdjUgQS9puhCfySkz4ymM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595135; c=relaxed/simple; bh=Wie/ZryAVDQijPRPJjGFG3UQ8ZTUQ+zyinsBIIkhtz4=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=NXRsRvClcPhvSu5NrlhbnbiX1L7ZLkPSSbic+dTgDFrUIbAuHj2R1gNMIzyXMfY9f8bb2APqP2hSsvQK6x4l6iZjCoTSwr4tMWEl28Lr/fz70EJxZO08QMLye1gt2L7aKuNYuW0yQ0jrnMYpDMGp1789RA5qgPB9AsgCaD1uF0g= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=cgTU36wO; arc=none smtp.client-ip=95.215.58.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: 
smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="cgTU36wO"
From: Dongsheng Yang
To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang
Subject: [RFC PATCH 03/11] pcache: introduce meta_segment abstraction
Date: Mon, 14 Apr 2025 01:44:57 +0000
Message-Id: <20250414014505.20477-4-dongsheng.yang@linux.dev>
In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev>
References: <20250414014505.20477-1-dongsheng.yang@linux.dev>

This patch introduces the `meta_segment` abstraction to persistently store metadata for pcache, specifically the `pcache_backing_dev_info` structure. Each `meta_segment` wraps a data segment and organizes the metadata space into multiple entries, each replicated multiple times for reliability. Metadata integrity is ensured using a sequence counter and CRC per metadata header.

Key highlights:

- `struct pcache_meta_segment`: Manages a segment dedicated to metadata.
- `struct pcache_meta_segment_info`: Describes the layout and size of metadata entries.
- `meta_seg_info_write()`: Writes updated metadata info with CRC protection.
- `meta_seg_meta()`: Computes the address of a given metadata entry (a worked layout example follows below).
- `pcache_meta_seg_for_each_meta`: A convenient macro to iterate over all metadata entries in the segment, simplifying metadata scanning and management logic.

Currently, only `pcache_backing_dev_info` is stored via this mechanism. This design provides a structured, verifiable, and extensible foundation for storing persistent metadata in the pcache framework.
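A rough worked example of the resulting layout, assuming 4 KiB pages (so `PCACHE_SEG_INFO_SIZE` and `PCACHE_BACKING_DEV_INFO_SIZE` are each one page); the constants mirror pcache_internal.h and the arithmetic follows `meta_seg_init()` and `meta_seg_meta()`, but the concrete numbers are only illustrative.

  /*
   * Back-of-the-envelope layout of a backing_dev_info meta segment,
   * assuming PAGE_SIZE == 4 KiB. Mirrors meta_seg_init()/meta_seg_meta().
   */
  #include <stdint.h>
  #include <stdio.h>

  #define SEG_SIZE        (16ULL * 1024 * 1024)  /* PCACHE_SEG_SIZE */
  #define PAGE_SZ         4096ULL                /* assumed PAGE_SIZE */
  #define META_INDEX_MAX  2ULL                   /* PCACHE_META_INDEX_MAX */
  #define SEG_INFO_SIZE   PAGE_SZ                /* PCACHE_SEG_INFO_SIZE */
  #define BACKING_INFO_SZ PAGE_SZ                /* PCACHE_BACKING_DEV_INFO_SIZE */

  int main(void)
  {
          /* the segment info copies sit at the start of the segment */
          uint64_t data_off = SEG_INFO_SIZE * META_INDEX_MAX;
          uint64_t data_size = SEG_SIZE - data_off;
          /* each metadata entry is replicated META_INDEX_MAX times */
          uint64_t meta_num = data_size / (BACKING_INFO_SZ * META_INDEX_MAX);
          /*
           * entry 3 starts at data + 3 * meta_size * META_INDEX_MAX
           * (cf. meta_seg_meta()); add data_off for the segment offset
           */
          uint64_t entry3 = data_off + 3 * BACKING_INFO_SZ * META_INDEX_MAX;

          printf("data_off = %llu bytes\n", (unsigned long long)data_off);
          printf("meta_num = %llu entries\n", (unsigned long long)meta_num);
          printf("entry 3 at segment offset %llu\n", (unsigned long long)entry3);
          return 0;
  }

With a 16 MiB segment this leaves room for 2047 double-indexed backing_dev_info entries per meta segment.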
Signed-off-by: Dongsheng Yang --- drivers/block/pcache/meta_segment.c | 61 +++++++++++++++++++++++++++++ drivers/block/pcache/meta_segment.h | 46 ++++++++++++++++++++++ 2 files changed, 107 insertions(+) create mode 100644 drivers/block/pcache/meta_segment.c create mode 100644 drivers/block/pcache/meta_segment.h diff --git a/drivers/block/pcache/meta_segment.c b/drivers/block/pcache/meta_segment.c new file mode 100644 index 000000000000..6a6dd9ad9041 --- /dev/null +++ b/drivers/block/pcache/meta_segment.c @@ -0,0 +1,61 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +#include "cache_dev.h" +#include "cache.h" +#include "backing_dev.h" +#include "meta_segment.h" + +static void meta_seg_info_write(struct pcache_meta_segment *meta_seg) +{ + struct pcache_meta_segment_info *info_addr; + + mutex_lock(&meta_seg->info_lock); + meta_seg->meta_seg_info.seg_info.header.seq++; + + info_addr = CACHE_DEV_SEGMENT(meta_seg->cache_dev, meta_seg->meta_seg_info.seg_info.seg_id); + info_addr = pcache_meta_find_oldest(&info_addr->seg_info.header, PCACHE_SEG_INFO_SIZE); + + memcpy(info_addr, &meta_seg->meta_seg_info, sizeof(struct pcache_meta_segment_info)); + info_addr->seg_info.header.crc = pcache_meta_crc(&info_addr->seg_info.header, PCACHE_SEG_INFO_SIZE); + + cache_dev_flush(meta_seg->cache_dev, info_addr, PCACHE_SEG_INFO_SIZE); + mutex_unlock(&meta_seg->info_lock); +} + +static void meta_seg_init(struct pcache_cache_dev *cache_dev, struct pcache_meta_segment *meta_seg, u32 seg_id, u32 meta_size) +{ + struct pcache_segment_init_options seg_opts = { 0 }; + + meta_seg->cache_dev = cache_dev; + mutex_init(&meta_seg->info_lock); + + seg_opts.type = PCACHES_TYPE_META; + seg_opts.state = PCACHE_SEGMENT_STATE_RUNNING; + seg_opts.seg_id = seg_id; + seg_opts.data_off = PCACHE_SEG_INFO_SIZE * PCACHE_META_INDEX_MAX; + seg_opts.seg_info = &meta_seg->meta_seg_info.seg_info; + + pcache_segment_init(cache_dev, &meta_seg->segment, &seg_opts); + + meta_seg->meta_seg_info.meta_size = meta_size; + meta_seg->meta_seg_info.meta_num = meta_seg->segment.data_size / (meta_size * PCACHE_META_INDEX_MAX); + + meta_seg_info_write(meta_seg); +} + +struct pcache_meta_segment *pcache_meta_seg_alloc(struct pcache_cache_dev *cache_dev, u32 seg_id, u32 meta_size) +{ + struct pcache_meta_segment *meta_seg; + + meta_seg = kzalloc(sizeof(struct pcache_meta_segment), GFP_KERNEL); + if (!meta_seg) + return NULL; + + meta_seg_init(cache_dev, meta_seg, seg_id, meta_size); + + return meta_seg; +} + +void pcache_meta_seg_free(struct pcache_meta_segment *meta_seg) +{ + kfree(meta_seg); +} diff --git a/drivers/block/pcache/meta_segment.h b/drivers/block/pcache/meta_segment.h new file mode 100644 index 000000000000..3e0886d0bd2b --- /dev/null +++ b/drivers/block/pcache/meta_segment.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _PCACHE_META_SEGMENT_H +#define _PCACHE_META_SEGMENT_H + +#include + +#include "pcache_internal.h" +#include "cache_dev.h" +#include "segment.h" + +struct pcache_cache_dev; +struct pcache_backing_dev_info; + +struct pcache_meta_segment_info { + struct pcache_segment_info seg_info; + u32 meta_size; + u32 meta_num; +}; + +struct pcache_meta_segment { + struct pcache_segment segment; + + struct pcache_cache_dev *cache_dev; + + struct pcache_meta_segment_info meta_seg_info; + struct mutex info_lock; + + struct pcache_meta_segment *next_meta_seg; +}; + +static inline void *meta_seg_meta(struct pcache_meta_segment *meta_seg, u32 meta_id) +{ + void *data = meta_seg->segment.data; + + return (data + 
meta_id * meta_seg->meta_seg_info.meta_size * PCACHE_META_INDEX_MAX); +} + +#define pcache_meta_seg_for_each_meta(meta_seg, i, meta) \ + for (i = 0; \ + i < meta_seg->meta_seg_info.meta_num && \ + ((meta = meta_seg_meta(meta_seg, i)) || true); \ + i++) + +struct pcache_meta_segment *pcache_meta_seg_alloc(struct pcache_cache_dev *cache_dev, u32 seg_id, u32 meta_size); +void pcache_meta_seg_free(struct pcache_meta_segment *meta_seg); +#endif /* _PCACHE_META_SEGMENT_H */ From patchwork Mon Apr 14 01:44:58 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049545 Received: from out-182.mta1.migadu.com (out-182.mta1.migadu.com [95.215.58.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2104719D09C for ; Mon, 14 Apr 2025 01:45:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595140; cv=none; b=JW7pXmai40WMNwx/s4Y9vcbyW1mjDKF2SkqIEjo/gJrb27351J8pz8hEa+kLcDoL1AlVJS41LDtfVqwXRPMje7T4QUmZMuKeiKL2p0bdiZk4XfbyZaxOp+iTgq6tn3iWQFDxKJ/XAlAjP0VYc2NuDhzRT4bknt4T2+0P514WXlA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595140; c=relaxed/simple; bh=y0UHDy8Dco3DHk7nxSKg7QhT8wFyGmzKQe5ykLA0ai8=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=UfsYg29S96567YhuhWEf8SlA+zEW9YUtnz77JNu4TnZEFbv0AoZ7P3N6DPNQqWaSfLEng8AB2WXLJoOBCvyeGtV+F0cr6b1DavqPspX76U2sEWaas2t3XRFm/UvkshsutDl3Y7Q3nqHjrf0JoBhV59LpNX8n0kG+oP4nM5cKac4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=ZcPFSFq7; arc=none smtp.client-ip=95.215.58.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="ZcPFSFq7" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
From: Dongsheng Yang
To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang
Subject: [RFC PATCH 04/11] pcache: introduce cache_segment abstraction
Date: Mon, 14 Apr 2025 01:44:58 +0000
Message-Id: <20250414014505.20477-5-dongsheng.yang@linux.dev>
In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev>
References: <20250414014505.20477-1-dongsheng.yang@linux.dev>

This patch introduces the `cache_segment` module, responsible for managing cache data and cache key segments used by the pcache system.

Each `cache_segment` is a wrapper around a physical segment on the persistent cache device, storing cached data and the metadata required to track its state and generation. The segment metadata is persistently recorded and reloaded to support crash recovery.

At backing device startup, a set of `cache_segments` is allocated according to the cache size requirement of the device. All cache data and cache keys will be stored within these segments.

Features:

- Segment metadata (`struct pcache_cache_seg_info`) with CRC and sequence tracking.
- Segment control (`struct pcache_cache_seg_gen`) to record the generation number, which tracks invalidation.
- Support for dynamic segment linking via `next_seg`.
- Segment reference counting via `cache_seg_get()` and `cache_seg_put()`, with automatic invalidation when the refcount reaches zero (an illustrative sketch of this pattern follows below).
- Metadata flush and reload via `cache_seg_info_write()` and `cache_seg_info_load()`.

This is a foundational piece enabling pcache to manage space efficiently and reuse segments.
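To illustrate the refcount/generation interplay described above: the snippet below is a hypothetical, simplified model only (cache_seg_get()/cache_seg_put() themselves are not part of this excerpt, and the real code persists the generation through the double-indexed `pcache_cache_seg_gen` record rather than a plain counter). It shows the assumed pattern: a user records the generation when taking a reference, and can later detect that the segment was invalidated once the last reference was dropped.

  /*
   * Hypothetical sketch of generation-based invalidation; not the
   * actual pcache cache_seg_get()/cache_seg_put() implementation.
   */
  #include <stdbool.h>
  #include <stdio.h>

  struct seg_model {
          int refs;               /* stands in for atomic_t refs */
          unsigned int gen;       /* stands in for cache_seg->gen */
  };

  /* take a reference and remember the generation we saw */
  static unsigned int seg_get(struct seg_model *seg)
  {
          seg->refs++;
          return seg->gen;
  }

  /* drop a reference; the last put invalidates the segment for reuse */
  static void seg_put(struct seg_model *seg)
  {
          if (--seg->refs == 0)
                  seg->gen++;     /* would also be persisted on pmem */
  }

  static bool seg_still_valid(struct seg_model *seg, unsigned int seen_gen)
  {
          return seg->gen == seen_gen;
  }

  int main(void)
  {
          struct seg_model seg = { 0, 0 };
          unsigned int seen = seg_get(&seg);

          seg_put(&seg);          /* last ref dropped: generation bumps */
          printf("still valid? %s\n",
                 seg_still_valid(&seg, seen) ? "yes" : "no");
          return 0;
  }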
Signed-off-by: Dongsheng Yang --- drivers/block/pcache/cache_segment.c | 247 +++++++++++++++++++++++++++ 1 file changed, 247 insertions(+) create mode 100644 drivers/block/pcache/cache_segment.c diff --git a/drivers/block/pcache/cache_segment.c b/drivers/block/pcache/cache_segment.c new file mode 100644 index 000000000000..f51301d75f70 --- /dev/null +++ b/drivers/block/pcache/cache_segment.c @@ -0,0 +1,247 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include "cache_dev.h" +#include "cache.h" +#include "backing_dev.h" + +static void cache_seg_info_write(struct pcache_cache_segment *cache_seg) +{ + mutex_lock(&cache_seg->info_lock); + pcache_segment_info_write(cache_seg->cache->backing_dev->cache_dev, + &cache_seg->cache_seg_info.segment_info, + cache_seg->segment.seg_info->seg_id); + mutex_unlock(&cache_seg->info_lock); +} + +static int cache_seg_info_load(struct pcache_cache_segment *cache_seg) +{ + struct pcache_segment_info *cache_seg_info; + int ret = 0; + + mutex_lock(&cache_seg->info_lock); + cache_seg_info = pcache_segment_info_read(cache_seg->cache->backing_dev->cache_dev, + cache_seg->segment.seg_info->seg_id); + if (!cache_seg_info) { + pr_err("can't read segment info of segment: %u\n", + cache_seg->segment.seg_info->seg_id); + ret = -EIO; + goto out; + } + memcpy(&cache_seg->cache_seg_info, cache_seg_info, sizeof(struct pcache_cache_seg_info)); +out: + mutex_unlock(&cache_seg->info_lock); + return ret; +} + +static void cache_seg_ctrl_load(struct pcache_cache_segment *cache_seg) +{ + struct pcache_cache_seg_ctrl *cache_seg_ctrl = cache_seg->cache_seg_ctrl; + struct pcache_cache_seg_gen *cache_seg_gen; + + mutex_lock(&cache_seg->ctrl_lock); + cache_seg_gen = pcache_meta_find_latest(&cache_seg_ctrl->gen->header, + sizeof(struct pcache_cache_seg_gen)); + if (!cache_seg_gen) { + cache_seg->gen = 0; + goto out; + } + + cache_seg->gen = cache_seg_gen->gen; +out: + mutex_unlock(&cache_seg->ctrl_lock); +} + +static void cache_seg_ctrl_write(struct pcache_cache_segment *cache_seg) +{ + struct pcache_cache_seg_ctrl *cache_seg_ctrl = cache_seg->cache_seg_ctrl; + struct pcache_cache_seg_gen *cache_seg_gen; + + mutex_lock(&cache_seg->ctrl_lock); + cache_seg_gen = pcache_meta_find_oldest(&cache_seg_ctrl->gen->header, + sizeof(struct pcache_cache_seg_gen)); + BUG_ON(!cache_seg_gen); + cache_seg_gen->gen = cache_seg->gen; + cache_seg_gen->header.seq = pcache_meta_get_next_seq(&cache_seg_ctrl->gen->header, + sizeof(struct pcache_cache_seg_gen)); + cache_seg_gen->header.crc = pcache_meta_crc(&cache_seg_gen->header, + sizeof(struct pcache_cache_seg_gen)); + mutex_unlock(&cache_seg->ctrl_lock); + + cache_dev_flush(cache_seg->cache->backing_dev->cache_dev, cache_seg_gen, sizeof(struct pcache_cache_seg_gen)); +} + +static int cache_seg_meta_load(struct pcache_cache_segment *cache_seg) +{ + int ret; + + ret = cache_seg_info_load(cache_seg); + if (ret) + goto err; + + cache_seg_ctrl_load(cache_seg); + + return 0; +err: + return ret; +} + +/** + * cache_seg_set_next_seg - Sets the ID of the next segment + * @cache_seg: Pointer to the cache segment structure. + * @seg_id: The segment ID to set as the next segment. + * + * A pcache_cache allocates multiple cache segments, which are linked together + * through next_seg. When loading a pcache_cache, the first cache segment can + * be found using cache->seg_id, which allows access to all the cache segments. 
+ */ +void cache_seg_set_next_seg(struct pcache_cache_segment *cache_seg, u32 seg_id) +{ + cache_seg->cache_seg_info.segment_info.flags |= PCACHE_SEG_INFO_FLAGS_HAS_NEXT; + cache_seg->cache_seg_info.segment_info.next_seg = seg_id; + cache_seg_info_write(cache_seg); +} + +int cache_seg_init(struct pcache_cache *cache, u32 seg_id, u32 cache_seg_id, + bool new_cache) +{ + struct pcache_cache_dev *cache_dev = cache->backing_dev->cache_dev; + struct pcache_cache_segment *cache_seg = &cache->segments[cache_seg_id]; + struct pcache_segment_init_options seg_options = { 0 }; + struct pcache_segment *segment = &cache_seg->segment; + int ret; + + cache_seg->cache = cache; + cache_seg->cache_seg_id = cache_seg_id; + spin_lock_init(&cache_seg->gen_lock); + atomic_set(&cache_seg->refs, 0); + mutex_init(&cache_seg->info_lock); + mutex_init(&cache_seg->ctrl_lock); + + /* init pcache_segment */ + seg_options.type = PCACHE_SEGMENT_TYPE_DATA; + seg_options.data_off = PCACHE_CACHE_SEG_CTRL_OFF + PCACHE_CACHE_SEG_CTRL_SIZE; + seg_options.seg_id = seg_id; + seg_options.seg_info = &cache_seg->cache_seg_info.segment_info; + pcache_segment_init(cache_dev, segment, &seg_options); + + cache_seg->cache_seg_ctrl = CACHE_DEV_SEGMENT(cache_dev, seg_id) + PCACHE_CACHE_SEG_CTRL_OFF; + /* init cache->cache_ctrl */ + if (cache_seg_is_ctrl_seg(cache_seg_id)) + cache->cache_ctrl = (struct pcache_cache_ctrl *)cache_seg->cache_seg_ctrl; + + if (new_cache) { + cache_seg->cache_seg_info.segment_info.type = PCACHE_SEGMENT_TYPE_DATA; + cache_seg->cache_seg_info.segment_info.state = PCACHE_SEGMENT_STATE_RUNNING; + cache_seg->cache_seg_info.segment_info.flags = 0; + cache_seg_info_write(cache_seg); + + /* clear outdated kset in segment */ + memcpy_flushcache(segment->data, &pcache_empty_kset, sizeof(struct pcache_cache_kset_onmedia)); + } else { + ret = cache_seg_meta_load(cache_seg); + if (ret) + goto err; + } + + atomic_set(&cache_seg->state, pcache_cache_seg_state_running); + + return 0; +err: + return ret; +} + +void cache_seg_destroy(struct pcache_cache_segment *cache_seg) +{ + /* clear cache segment ctrl */ + cache_dev_zero_range(cache_seg->cache->backing_dev->cache_dev, cache_seg->cache_seg_ctrl, + PCACHE_CACHE_SEG_CTRL_SIZE); + + clear_bit(cache_seg->segment.seg_info->seg_id, cache_seg->cache->backing_dev->cache_dev->seg_bitmap); +} + +#define PCACHE_WAIT_NEW_CACHE_INTERVAL 100 +#define PCACHE_WAIT_NEW_CACHE_COUNT 100 + +/** + * get_cache_segment - Retrieves a free cache segment from the cache. + * @cache: Pointer to the cache structure. + * + * This function attempts to find a free cache segment that can be used. + * It locks the segment map and checks for the next available segment ID. + * If no segment is available, it waits for a predefined interval and retries. + * If a free segment is found, it initializes it and returns a pointer to the + * cache segment structure. Returns NULL if no segments are available after + * waiting for a specified count. 
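+ * + * Note that the retry path is a busy-wait: it calls udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL) and retries up to PCACHE_WAIT_NEW_CACHE_COUNT times before giving up.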
+ */ +struct pcache_cache_segment *get_cache_segment(struct pcache_cache *cache) +{ + struct pcache_cache_segment *cache_seg; + u32 seg_id; + u32 wait_count = 0; + +again: + spin_lock(&cache->seg_map_lock); + seg_id = find_next_zero_bit(cache->seg_map, cache->n_segs, cache->last_cache_seg); + if (seg_id == cache->n_segs) { + spin_unlock(&cache->seg_map_lock); + /* reset the hint of ->last_cache_seg and retry */ + if (cache->last_cache_seg) { + cache->last_cache_seg = 0; + goto again; + } + + if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT) + return NULL; + + udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL); + goto again; + } + + /* + * found an available cache_seg, mark it used in seg_map + * and update the search hint ->last_cache_seg + */ + set_bit(seg_id, cache->seg_map); + cache->last_cache_seg = seg_id; + spin_unlock(&cache->seg_map_lock); + + cache_seg = &cache->segments[seg_id]; + cache_seg->cache_seg_id = seg_id; + + return cache_seg; +} + +static void cache_seg_gen_increase(struct pcache_cache_segment *cache_seg) +{ + spin_lock(&cache_seg->gen_lock); + cache_seg->gen++; + spin_unlock(&cache_seg->gen_lock); + + cache_seg_ctrl_write(cache_seg); +} + +void cache_seg_get(struct pcache_cache_segment *cache_seg) +{ + atomic_inc(&cache_seg->refs); +} + +static void cache_seg_invalidate(struct pcache_cache_segment *cache_seg) +{ + struct pcache_cache *cache; + + cache = cache_seg->cache; + cache_seg_gen_increase(cache_seg); + + spin_lock(&cache->seg_map_lock); + clear_bit(cache_seg->cache_seg_id, cache->seg_map); + spin_unlock(&cache->seg_map_lock); + + /* clean_work will clean the bad key in key_tree*/ + queue_work(cache->backing_dev->task_wq, &cache->clean_work); +} + +void cache_seg_put(struct pcache_cache_segment *cache_seg) +{ + if (atomic_dec_and_test(&cache_seg->refs)) + cache_seg_invalidate(cache_seg); +} From patchwork Mon Apr 14 01:44:59 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049559 Received: from out-183.mta1.migadu.com (out-183.mta1.migadu.com [95.215.58.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 920BD1A01CC for ; Mon, 14 Apr 2025 01:45:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.183 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595147; cv=none; b=IsnVjZZ1cd/UcP5CIpBYE2ovyveOoDgy5wbnpKZ2E5wqkQrFG+eYipZBcxx3Lt9PS5E0L1+rxCWSTMG7TpVyGIR9hy2TRHkyZn7aHC1lQx7hao4XPI+s83Dq3nCHOtIwOapnccLUVhrhUNt8Sc8RA8ABr1SsBjOdRJBXy130lj4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595147; c=relaxed/simple; bh=zVxMs05OmDfZgLnqLJ1G8lF3CN9mbZ5S2zzEFcA9Vp8=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=LSHuR8t4Sc9b7YPFsO9x+3hBwat7DNT+spl8/UcYUbgQDZKaWJjOLRZf9oPuxFQAy1mo8Xykw5h31a6CPPD8ByKR9zeQoUiA2blOjtnqwBdqUd4vzx8iib+c4TUkE2w2fC7dCsx5cn9r4lVBJqLxQql4JQMiHsjdmTun6T1/Bko= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=m6j3EeA0; arc=none smtp.client-ip=95.215.58.183 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev 
Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="m6j3EeA0" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1744595140; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Jjg0xfiy8siMDsSsG/q/l/t4ReMnhHpayR4Ys9YZtto=; b=m6j3EeA0xLb3IXcEhH2P9aRQjpZOQHgvm2ZzErmCRs65WokqJOKohab6m53j4p4Tkvb+Z+ Z4muG9Y6TACk4v3IBpUnKVqELLSQqVCc4IGoyhsZm7IBln12EuWocauqbNQTb4GO+OvIMD k2UpNazlkLaP58wg2VeuTWuW9wfBIog= From: Dongsheng Yang To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang Subject: [RFC PATCH 05/11] pcache: introduce lifecycle management of pcache_cache Date: Mon, 14 Apr 2025 01:44:59 +0000 Message-Id: <20250414014505.20477-6-dongsheng.yang@linux.dev> In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev> References: <20250414014505.20477-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT This patch introduces the core implementation for managing the lifecycle of the pcache_cache structure, which represents the central cache context in pcache. Key responsibilities covered by this patch include: - Allocation and initialization: `pcache_cache_alloc()` validates configuration options and allocates the memory for the pcache_cache instance, including segment array, ksets, request trees, and data heads. It sets up key internal fields such as segment maps, generation counters, and in-memory trees. - Subsystem initialization: It initializes all supporting subsystems including segment metadata, key tail/ dirty tail state, in-memory key trees, and background workers like GC and writeback. - Clean shutdown and destruction: `pcache_cache_destroy()` performs a staged shutdown: flushing remaining keys, cancelling work queues, tearing down trees, releasing segments, and freeing memory. It ensures all pending metadata and dirty data are safely handled before release. - Persistent state management: Provides helpers to encode and decode the key_tail and dirty_tail positions persistently, ensuring the cache can recover its position and metadata after a crash or reboot. By defining a consistent and crash-safe lifecycle model for pcache_cache, this patch lays the foundation for higher-level cache operations to be implemented safely and concurrently. 
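For orientation, a backing device would drive this lifecycle roughly as in the sketch below; it is purely illustrative, and every value and name in it is a placeholder, since the real attach path arrives in a later patch of this series:

	/* hypothetical caller; all initializer values are placeholders */
	static int example_attach(struct pcache_backing_dev *backing_dev,
				  struct pcache_cache_info *cache_info,
				  struct file *bdev_file, u64 dev_size_sectors)
	{
		struct pcache_cache_opts opts = {
			.cache_info = cache_info,	/* persistent info slot of the backing_dev */
			.n_segs     = 64,		/* cache segments to use (placeholder) */
			.new_cache  = true,		/* true: initialize, false: reload persisted state */
			.data_crc   = false,
			.dev_size   = dev_size_sectors,	/* in 512-byte sectors */
			.n_paral    = 1,		/* sizes the kset and data-head arrays */
			.bdev_file  = bdev_file,	/* backing device file, used for writeback */
		};
		struct pcache_cache *cache;

		cache = pcache_cache_alloc(backing_dev, &opts);
		if (!cache)
			return -ENOMEM;	/* covers validation and allocation failures */

		/* ... serve I/O via pcache_cache_handle_req() ... */

		pcache_cache_destroy(cache);	/* flush dirty keys, stop gc/writeback, free */
		return 0;
	}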
Signed-off-by: Dongsheng Yang --- drivers/block/pcache/cache.c | 394 ++++++++++++++++++++++ drivers/block/pcache/cache.h | 612 +++++++++++++++++++++++++++++++++++ 2 files changed, 1006 insertions(+) create mode 100644 drivers/block/pcache/cache.c create mode 100644 drivers/block/pcache/cache.h diff --git a/drivers/block/pcache/cache.c b/drivers/block/pcache/cache.c new file mode 100644 index 000000000000..0dd61ded4b82 --- /dev/null +++ b/drivers/block/pcache/cache.c @@ -0,0 +1,394 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +#include + +#include "logic_dev.h" +#include "cache.h" +#include "backing_dev.h" + +void cache_pos_encode(struct pcache_cache *cache, + struct pcache_cache_pos_onmedia *pos_onmedia, + struct pcache_cache_pos *pos) +{ + struct pcache_cache_pos_onmedia *oldest; + + oldest = pcache_meta_find_oldest(&pos_onmedia->header, sizeof(struct pcache_cache_pos_onmedia)); + BUG_ON(!oldest); + + oldest->cache_seg_id = pos->cache_seg->cache_seg_id; + oldest->seg_off = pos->seg_off; + oldest->header.seq = pcache_meta_get_next_seq(&pos_onmedia->header, sizeof(struct pcache_cache_pos_onmedia)); + oldest->header.crc = cache_pos_onmedia_crc(oldest); + cache_dev_flush(cache->backing_dev->cache_dev, oldest, sizeof(struct pcache_cache_pos_onmedia)); +} + +int cache_pos_decode(struct pcache_cache *cache, + struct pcache_cache_pos_onmedia *pos_onmedia, + struct pcache_cache_pos *pos) +{ + struct pcache_cache_pos_onmedia *latest; + + latest = pcache_meta_find_latest(&pos_onmedia->header, sizeof(struct pcache_cache_pos_onmedia)); + if (!latest) + return -EIO; + + pos->cache_seg = &cache->segments[latest->cache_seg_id]; + pos->seg_off = latest->seg_off; + + return 0; +} + +static void cache_info_set_seg_id(struct pcache_cache *cache, u32 seg_id) +{ + cache->cache_info->seg_id = seg_id; + backing_dev_info_write(cache->backing_dev); +} + +static struct pcache_cache *cache_alloc(struct pcache_backing_dev *backing_dev) +{ + struct pcache_cache *cache; + + cache = kvzalloc(struct_size(cache, segments, backing_dev->cache_segs), GFP_KERNEL); + if (!cache) + goto err; + + cache->seg_map = bitmap_zalloc(backing_dev->cache_segs, GFP_KERNEL); + if (!cache->seg_map) + goto free_cache; + + cache->req_cache = KMEM_CACHE(pcache_backing_dev_req, 0); + if (!cache->req_cache) + goto free_bitmap; + + cache->backing_dev = backing_dev; + cache->n_segs = backing_dev->cache_segs; + spin_lock_init(&cache->seg_map_lock); + spin_lock_init(&cache->key_head_lock); + + mutex_init(&cache->key_tail_lock); + mutex_init(&cache->dirty_tail_lock); + + INIT_DELAYED_WORK(&cache->writeback_work, cache_writeback_fn); + INIT_DELAYED_WORK(&cache->gc_work, pcache_cache_gc_fn); + INIT_WORK(&cache->clean_work, clean_fn); + + return cache; + +free_bitmap: + bitmap_free(cache->seg_map); +free_cache: + kvfree(cache); +err: + return NULL; +} + +static void cache_free(struct pcache_cache *cache) +{ + kmem_cache_destroy(cache->req_cache); + bitmap_free(cache->seg_map); + kvfree(cache); +} + +static void pcache_cache_info_init(struct pcache_cache_opts *opts) +{ + struct pcache_cache_info *cache_info = opts->cache_info; + + cache_info->n_segs = opts->n_segs; + cache_info->gc_percent = PCACHE_CACHE_GC_PERCENT_DEFAULT; + if (opts->data_crc) + cache_info->flags |= PCACHE_CACHE_FLAGS_DATA_CRC; +} + +static int cache_validate(struct pcache_backing_dev *backing_dev, + struct pcache_cache_opts *opts) +{ + struct pcache_cache_info *cache_info; + int ret = -EINVAL; + + if (opts->n_paral > PCACHE_CACHE_PARAL_MAX) { + backing_dev_err(backing_dev, 
"n_paral too large (max %u).\n", + PCACHE_CACHE_PARAL_MAX); + goto err; + } + + if (opts->new_cache) + pcache_cache_info_init(opts); + + cache_info = opts->cache_info; + + /* + * Check if the number of segments required for the specified n_paral + * exceeds the available segments in the cache. If so, report an error. + */ + if (opts->n_paral * PCACHE_CACHE_SEGS_EACH_PARAL > cache_info->n_segs) { + backing_dev_err(backing_dev, "n_paral %u requires cache size (%llu), more than current (%llu).", + opts->n_paral, opts->n_paral * PCACHE_CACHE_SEGS_EACH_PARAL * (u64)PCACHE_SEG_SIZE, + cache_info->n_segs * (u64)PCACHE_SEG_SIZE); + goto err; + } + + if (cache_info->n_segs > backing_dev->cache_dev->seg_num) { + backing_dev_err(backing_dev, "too large cache_segs: %u, segment_num: %u\n", + cache_info->n_segs, backing_dev->cache_dev->seg_num); + goto err; + } + + if (cache_info->n_segs > PCACHE_CACHE_SEGS_MAX) { + backing_dev_err(backing_dev, "cache_segs: %u larger than PCACHE_CACHE_SEGS_MAX: %u\n", + cache_info->n_segs, PCACHE_CACHE_SEGS_MAX); + goto err; + } + + return 0; + +err: + return ret; +} + +static int cache_tail_init(struct pcache_cache *cache, bool new_cache) +{ + int ret; + + if (new_cache) { + set_bit(0, cache->seg_map); + + cache->key_head.cache_seg = &cache->segments[0]; + cache->key_head.seg_off = 0; + cache_pos_copy(&cache->key_tail, &cache->key_head); + cache_pos_copy(&cache->dirty_tail, &cache->key_head); + + cache_encode_dirty_tail(cache); + cache_encode_key_tail(cache); + } else { + if (cache_decode_key_tail(cache) || cache_decode_dirty_tail(cache)) { + backing_dev_err(cache->backing_dev, "Corrupted key tail or dirty tail.\n"); + ret = -EIO; + goto err; + } + } + return 0; +err: + return ret; +} + +static void cache_segs_destroy(struct pcache_cache *cache) +{ + u32 i; + + for (i = 0; i < cache->n_segs; i++) + cache_seg_destroy(&cache->segments[i]); +} + +static int get_seg_id(struct pcache_cache *cache, + struct pcache_cache_segment *prev_cache_seg, + bool new_cache, u32 *seg_id) +{ + struct pcache_backing_dev *backing_dev = cache->backing_dev; + struct pcache_cache_dev *cache_dev = backing_dev->cache_dev; + int ret; + + if (new_cache) { + ret = cache_dev_get_empty_segment_id(cache_dev, seg_id); + if (ret) { + backing_dev_err(backing_dev, "no available segment\n"); + goto err; + } + + if (prev_cache_seg) + cache_seg_set_next_seg(prev_cache_seg, *seg_id); + else + cache_info_set_seg_id(cache, *seg_id); + } else { + if (prev_cache_seg) { + struct pcache_segment_info *prev_seg_info; + + prev_seg_info = &prev_cache_seg->cache_seg_info.segment_info; + if (!segment_info_has_next(prev_seg_info)) { + ret = -EFAULT; + goto err; + } + *seg_id = prev_cache_seg->cache_seg_info.segment_info.next_seg; + } else { + *seg_id = cache->cache_info->seg_id; + } + } + return 0; +err: + return ret; +} + +static int cache_segs_init(struct pcache_cache *cache, bool new_cache) +{ + struct pcache_cache_segment *prev_cache_seg = NULL; + struct pcache_cache_info *cache_info = cache->cache_info; + u32 seg_id; + int ret; + u32 i; + + for (i = 0; i < cache_info->n_segs; i++) { + ret = get_seg_id(cache, prev_cache_seg, new_cache, &seg_id); + if (ret) + goto segments_destroy; + + ret = cache_seg_init(cache, seg_id, i, new_cache); + if (ret) + goto segments_destroy; + + prev_cache_seg = &cache->segments[i]; + } + return 0; + +segments_destroy: + cache_segs_destroy(cache); + + return ret; +} + +static int cache_init_req_keys(struct pcache_cache *cache, u32 n_paral) +{ + u32 n_subtrees; + int ret; + u32 i; + + /* 
Calculate number of cache trees based on the device size */ + n_subtrees = DIV_ROUND_UP(cache->dev_size << SECTOR_SHIFT, PCACHE_CACHE_SUBTREE_SIZE); + ret = cache_tree_init(cache, &cache->req_key_tree, n_subtrees); + if (ret) + goto err; + + /* Set the number of ksets based on n_paral, often corresponding to blkdev multiqueue count */ + cache->n_ksets = n_paral; + cache->ksets = kcalloc(cache->n_ksets, PCACHE_KSET_SIZE, GFP_KERNEL); + if (!cache->ksets) { + ret = -ENOMEM; + goto req_tree_exit; + } + + /* + * Initialize each kset with a spinlock and delayed work for flushing. + * Each kset is associated with one queue to ensure independent handling + * of cache keys across multiple queues, maximizing multiqueue concurrency. + */ + for (i = 0; i < cache->n_ksets; i++) { + struct pcache_cache_kset *kset = get_kset(cache, i); + + kset->cache = cache; + spin_lock_init(&kset->kset_lock); + INIT_DELAYED_WORK(&kset->flush_work, kset_flush_fn); + } + + cache->n_heads = n_paral; + cache->data_heads = kcalloc(cache->n_heads, sizeof(struct pcache_cache_data_head), GFP_KERNEL); + if (!cache->data_heads) { + ret = -ENOMEM; + goto free_kset; + } + + for (i = 0; i < cache->n_heads; i++) { + struct pcache_cache_data_head *data_head = &cache->data_heads[i]; + + spin_lock_init(&data_head->data_head_lock); + } + + /* + * Replay persisted cache keys using cache_replay. + * This function loads and replays cache keys from previously stored + * ksets, allowing the cache to restore its state after a restart. + */ + ret = cache_replay(cache); + if (ret) { + backing_dev_err(cache->backing_dev, "failed to replay keys\n"); + goto free_heads; + } + + return 0; + +free_heads: + kfree(cache->data_heads); +free_kset: + kfree(cache->ksets); +req_tree_exit: + cache_tree_exit(&cache->req_key_tree); +err: + return ret; +} + +static void cache_destroy_req_keys(struct pcache_cache *cache) +{ + u32 i; + + for (i = 0; i < cache->n_ksets; i++) { + struct pcache_cache_kset *kset = get_kset(cache, i); + + cancel_delayed_work_sync(&kset->flush_work); + } + + kfree(cache->data_heads); + kfree(cache->ksets); + cache_tree_exit(&cache->req_key_tree); +} + +struct pcache_cache *pcache_cache_alloc(struct pcache_backing_dev *backing_dev, + struct pcache_cache_opts *opts) +{ + struct pcache_cache *cache; + int ret; + + ret = cache_validate(backing_dev, opts); + if (ret) + return NULL; + + cache = cache_alloc(backing_dev); + if (!cache) + return NULL; + + cache->bdev_file = opts->bdev_file; + cache->dev_size = opts->dev_size; + cache->cache_info = opts->cache_info; + cache->state = PCACHE_CACHE_STATE_RUNNING; + + ret = cache_segs_init(cache, opts->new_cache); + if (ret) + goto free_cache; + + ret = cache_tail_init(cache, opts->new_cache); + if (ret) + goto segs_destroy; + + ret = cache_init_req_keys(cache, opts->n_paral); + if (ret) + goto segs_destroy; + + ret = cache_writeback_init(cache); + if (ret) + goto destroy_keys; + + queue_delayed_work(cache->backing_dev->task_wq, &cache->gc_work, 0); + + return cache; + +destroy_keys: + cache_destroy_req_keys(cache); +segs_destroy: + cache_segs_destroy(cache); +free_cache: + cache_free(cache); + + return NULL; +} + +void pcache_cache_destroy(struct pcache_cache *cache) +{ + cache->state = PCACHE_CACHE_STATE_STOPPING; + cache_flush(cache); + + cancel_delayed_work_sync(&cache->gc_work); + flush_work(&cache->clean_work); + + cache_writeback_exit(cache); + + if (cache->req_key_tree.n_subtrees) + cache_destroy_req_keys(cache); + + cache_segs_destroy(cache); + cache_free(cache); +} diff --git 
a/drivers/block/pcache/cache.h b/drivers/block/pcache/cache.h new file mode 100644 index 000000000000..c50e94e0515c --- /dev/null +++ b/drivers/block/pcache/cache.h @@ -0,0 +1,612 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _PCACHE_CACHE_H +#define _PCACHE_CACHE_H + +#include "segment.h" + +/* Garbage collection thresholds */ +#define PCACHE_CACHE_GC_PERCENT_MIN 0 /* Minimum GC percentage */ +#define PCACHE_CACHE_GC_PERCENT_MAX 90 /* Maximum GC percentage */ +#define PCACHE_CACHE_GC_PERCENT_DEFAULT 70 /* Default GC percentage */ + +#define PCACHE_CACHE_PARAL_MAX 128 +#define PCACHE_CACHE_SEGS_EACH_PARAL 10 + +#define PCACHE_CACHE_SUBTREE_SIZE (4 * 1024 * 1024) /* 4MB total tree size */ +#define PCACHE_CACHE_SUBTREE_SIZE_MASK 0x3FFFFF /* Mask for tree size */ +#define PCACHE_CACHE_SUBTREE_SIZE_SHIFT 22 /* Bit shift for tree size */ + +/* Maximum number of keys per key set */ +#define PCACHE_KSET_KEYS_MAX 128 +#define PCACHE_CACHE_SEGS_MAX (1024 * 1024) /* maximum cache size for each device is 16T */ +#define PCACHE_KSET_ONMEDIA_SIZE_MAX struct_size_t(struct pcache_cache_kset_onmedia, data, PCACHE_KSET_KEYS_MAX) +#define PCACHE_KSET_SIZE (sizeof(struct pcache_cache_kset) + sizeof(struct pcache_cache_key_onmedia) * PCACHE_KSET_KEYS_MAX) + +/* Maximum number of keys to clean in one round of clean_work */ +#define PCACHE_CLEAN_KEYS_MAX 10 + +/* Writeback and garbage collection intervals in jiffies */ +#define PCACHE_CACHE_WRITEBACK_INTERVAL (5 * HZ) +#define PCACHE_CACHE_GC_INTERVAL (5 * HZ) + +/* Macro to get the cache key structure from an rb_node pointer */ +#define CACHE_KEY(node) (container_of(node, struct pcache_cache_key, rb_node)) + +struct pcache_cache_pos_onmedia { + struct pcache_meta_header header; + u32 cache_seg_id; + u32 seg_off; +}; + +/* Offset and size definitions for cache segment control */ +#define PCACHE_CACHE_SEG_CTRL_OFF (PCACHE_SEG_INFO_SIZE * PCACHE_META_INDEX_MAX) +#define PCACHE_CACHE_SEG_CTRL_SIZE PAGE_SIZE + +struct pcache_cache_seg_gen { + struct pcache_meta_header header; + u64 gen; +}; + +/* Control structure for cache segments */ +struct pcache_cache_seg_ctrl { + struct pcache_cache_seg_gen gen[PCACHE_META_INDEX_MAX]; /* Updated by blkdev, incremented in invalidating */ + u64 res[64]; +}; + +struct pcache_cache_seg_info { + struct pcache_segment_info segment_info; /* must be first member */ +}; + +#define PCACHE_CACHE_FLAGS_DATA_CRC (1 << 0) + +struct pcache_cache_info { + u32 seg_id; + u32 n_segs; + u16 gc_percent; + u16 flags; + u32 res2; +}; + +struct pcache_cache_pos { + struct pcache_cache_segment *cache_seg; + u32 seg_off; +}; + +enum pcache_cache_seg_state { + pcache_cache_seg_state_none = 0, + pcache_cache_seg_state_running +}; + +struct pcache_cache_segment { + struct pcache_cache *cache; + u32 cache_seg_id; /* Index in cache->segments */ + struct pcache_segment segment; + atomic_t refs; + + atomic_t state; + + struct pcache_cache_seg_info cache_seg_info; + struct mutex info_lock; + + spinlock_t gen_lock; + u64 gen; + struct pcache_cache_seg_ctrl *cache_seg_ctrl; + struct mutex ctrl_lock; +}; + +/* rbtree for cache entries */ +struct pcache_cache_subtree { + struct rb_root root; + spinlock_t tree_lock; +}; + +struct pcache_cache_tree { + struct pcache_cache *cache; + u32 n_subtrees; + struct kmem_cache *key_cache; + struct pcache_cache_subtree *subtrees; +}; + +#define PCACHE_CACHE_STATE_NONE 0 +#define PCACHE_CACHE_STATE_RUNNING 1 +#define PCACHE_CACHE_STATE_STOPPING 2 + +/* PCACHE Cache main structure */ +struct pcache_cache { + 
struct pcache_backing_dev *backing_dev; + struct pcache_cache_ctrl *cache_ctrl; + + u32 n_heads; + struct pcache_cache_data_head *data_heads; + + spinlock_t key_head_lock; + struct pcache_cache_pos key_head; + u32 n_ksets; + struct pcache_cache_kset *ksets; + + struct mutex key_tail_lock; + struct pcache_cache_pos key_tail; + + struct mutex dirty_tail_lock; + struct pcache_cache_pos dirty_tail; + + struct pcache_cache_tree req_key_tree; + struct work_struct clean_work; + + struct file *bdev_file; + u64 dev_size; + struct delayed_work writeback_work; + struct delayed_work gc_work; + + struct kmem_cache *req_cache; + + struct pcache_cache_info *cache_info; + + u32 state:8; + + u32 n_segs; + unsigned long *seg_map; + u32 last_cache_seg; + spinlock_t seg_map_lock; + struct pcache_cache_segment segments[]; /* Last member */ +}; + +/* PCACHE Cache options structure */ +struct pcache_cache_opts { + u32 cache_id; + void *owner; + u32 n_segs; + bool new_cache; + bool data_crc; + u64 dev_size; + u32 n_paral; + struct file *bdev_file; + struct pcache_cache_info *cache_info; +}; + +struct pcache_cache *pcache_cache_alloc(struct pcache_backing_dev *backing_dev, + struct pcache_cache_opts *opts); +void pcache_cache_destroy(struct pcache_cache *cache); + +struct pcache_cache_ctrl { + struct pcache_cache_seg_ctrl cache_seg_ctrl; + + /* Updated by gc_thread */ + struct pcache_cache_pos_onmedia key_tail_pos[PCACHE_META_INDEX_MAX]; + + /* Updated by writeback_thread */ + struct pcache_cache_pos_onmedia dirty_tail_pos[PCACHE_META_INDEX_MAX]; +}; + +struct pcache_cache_data_head { + spinlock_t data_head_lock; + struct pcache_cache_pos head_pos; +}; + +struct pcache_cache_key { + struct pcache_cache_tree *cache_tree; + struct pcache_cache_subtree *cache_subtree; + struct kref ref; + struct rb_node rb_node; + struct list_head list_node; + u64 off; + u32 len; + u64 flags; + struct pcache_cache_pos cache_pos; + u64 seg_gen; +}; + +#define PCACHE_CACHE_KEY_FLAGS_EMPTY (1 << 0) +#define PCACHE_CACHE_KEY_FLAGS_CLEAN (1 << 1) + +struct pcache_cache_key_onmedia { + u64 off; + u32 len; + u32 flags; + u32 cache_seg_id; + u32 cache_seg_off; + u64 seg_gen; + u32 data_crc; +}; + +struct pcache_cache_kset_onmedia { + u32 crc; + union { + u32 key_num; + u32 next_cache_seg_id; + }; + u64 magic; + u64 flags; + struct pcache_cache_key_onmedia data[]; +}; + +/* cache key */ +struct pcache_cache_key *cache_key_alloc(struct pcache_cache_tree *cache_tree); +void cache_key_init(struct pcache_cache_tree *cache_tree, struct pcache_cache_key *key); +void cache_key_get(struct pcache_cache_key *key); +void cache_key_put(struct pcache_cache_key *key); +int cache_key_append(struct pcache_cache *cache, struct pcache_cache_key *key); +int cache_key_insert(struct pcache_cache_tree *cache_tree, struct pcache_cache_key *key, bool fixup); +int cache_key_decode(struct pcache_cache *cache, + struct pcache_cache_key_onmedia *key_onmedia, + struct pcache_cache_key *key); +void cache_pos_advance(struct pcache_cache_pos *pos, u32 len); + +#define PCACHE_KSET_FLAGS_LAST (1 << 0) +#define PCACHE_KSET_MAGIC 0x676894a64e164f1aULL + +struct pcache_cache_kset { + struct pcache_cache *cache; + spinlock_t kset_lock; + struct delayed_work flush_work; + struct pcache_cache_kset_onmedia kset_onmedia; +}; + +extern struct pcache_cache_kset_onmedia pcache_empty_kset; + +struct pcache_cache_subtree_walk_ctx { + struct pcache_cache_tree *cache_tree; + struct rb_node *start_node; + struct pcache_request *pcache_req; + u32 req_done; + struct pcache_cache_key *key; + + 
struct list_head *delete_key_list; + struct list_head *submit_req_list; + + /* + * |--------| key_tmp + * |====| key + */ + int (*before)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx); + + /* + * |----------| key_tmp + * |=====| key + */ + int (*after)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx); + + /* + * |----------------| key_tmp + * |===========| key + */ + int (*overlap_tail)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx); + + /* + * |--------| key_tmp + * |==========| key + */ + int (*overlap_head)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx); + + /* + * |----| key_tmp + * |==========| key + */ + int (*overlap_contain)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx); + + /* + * |-----------| key_tmp + * |====| key + */ + int (*overlap_contained)(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx); + + int (*walk_finally)(struct pcache_cache_subtree_walk_ctx *ctx); + bool (*walk_done)(struct pcache_cache_subtree_walk_ctx *ctx); +}; + +int cache_subtree_walk(struct pcache_cache_subtree_walk_ctx *ctx); +struct rb_node *cache_subtree_search(struct pcache_cache_subtree *cache_subtree, struct pcache_cache_key *key, + struct rb_node **parentp, struct rb_node ***newp, + struct list_head *delete_key_list); +int cache_kset_close(struct pcache_cache *cache, struct pcache_cache_kset *kset); +void clean_fn(struct work_struct *work); +void kset_flush_fn(struct work_struct *work); +int cache_replay(struct pcache_cache *cache); +int cache_tree_init(struct pcache_cache *cache, struct pcache_cache_tree *cache_tree, u32 n_subtrees); +void cache_tree_exit(struct pcache_cache_tree *cache_tree); + +/* cache segments */ +struct pcache_cache_segment *get_cache_segment(struct pcache_cache *cache); +int cache_seg_init(struct pcache_cache *cache, u32 seg_id, u32 cache_seg_id, + bool new_cache); +void cache_seg_destroy(struct pcache_cache_segment *cache_seg); +void cache_seg_get(struct pcache_cache_segment *cache_seg); +void cache_seg_put(struct pcache_cache_segment *cache_seg); +void cache_seg_set_next_seg(struct pcache_cache_segment *cache_seg, u32 seg_id); + +/* cache info */ +void cache_info_write(struct pcache_cache *cache); +int cache_info_load(struct pcache_cache *cache); + +/* cache request*/ +int cache_flush(struct pcache_cache *cache); +void miss_read_end_work_fn(struct work_struct *work); +int pcache_cache_handle_req(struct pcache_cache *cache, struct pcache_request *pcache_req); + +/* gc */ +void pcache_cache_gc_fn(struct work_struct *work); + +/* writeback */ +void cache_writeback_exit(struct pcache_cache *cache); +int cache_writeback_init(struct pcache_cache *cache); +void cache_writeback_fn(struct work_struct *work); + +/* inline functions */ +static inline struct pcache_cache_subtree *get_subtree(struct pcache_cache_tree *cache_tree, u64 off) +{ + if (cache_tree->n_subtrees == 1) + return &cache_tree->subtrees[0]; + + return &cache_tree->subtrees[off >> PCACHE_CACHE_SUBTREE_SIZE_SHIFT]; +} + +static inline void *cache_pos_addr(struct pcache_cache_pos *pos) +{ + return (pos->cache_seg->segment.data + pos->seg_off); +} + +static inline void *get_key_head_addr(struct pcache_cache *cache) +{ + return cache_pos_addr(&cache->key_head); 
+} + +static inline u32 get_kset_id(struct pcache_cache *cache, u64 off) +{ + return (off >> PCACHE_CACHE_SUBTREE_SIZE_SHIFT) % cache->n_ksets; +} + +static inline struct pcache_cache_kset *get_kset(struct pcache_cache *cache, u32 kset_id) +{ + return (void *)cache->ksets + PCACHE_KSET_SIZE * kset_id; +} + +static inline struct pcache_cache_data_head *get_data_head(struct pcache_cache *cache, u32 i) +{ + return &cache->data_heads[i % cache->n_heads]; +} + +static inline bool cache_key_empty(struct pcache_cache_key *key) +{ + return key->flags & PCACHE_CACHE_KEY_FLAGS_EMPTY; +} + +static inline bool cache_key_clean(struct pcache_cache_key *key) +{ + return key->flags & PCACHE_CACHE_KEY_FLAGS_CLEAN; +} + +static inline void cache_pos_copy(struct pcache_cache_pos *dst, struct pcache_cache_pos *src) +{ + memcpy(dst, src, sizeof(struct pcache_cache_pos)); +} + +/** + * cache_seg_is_ctrl_seg - Checks if a cache segment is a cache ctrl segment. + * @cache_seg_id: ID of the cache segment. + * + * Returns true if the cache segment ID corresponds to a cache ctrl segment. + * + * Note: We extend the segment control of the first cache segment + * (cache segment ID 0) to serve as the cache control (pcache_cache_ctrl) + * for the entire PCACHE cache. This function determines whether the given + * cache segment is the one storing the pcache_cache_ctrl information. + */ +static inline bool cache_seg_is_ctrl_seg(u32 cache_seg_id) +{ + return (cache_seg_id == 0); +} + +/** + * cache_key_cutfront - Cuts a specified length from the front of a cache key. + * @key: Pointer to pcache_cache_key structure. + * @cut_len: Length to cut from the front. + * + * Advances the cache key position by cut_len and adjusts offset and length accordingly. + */ +static inline void cache_key_cutfront(struct pcache_cache_key *key, u32 cut_len) +{ + if (key->cache_pos.cache_seg) + cache_pos_advance(&key->cache_pos, cut_len); + + key->off += cut_len; + key->len -= cut_len; +} + +/** + * cache_key_cutback - Cuts a specified length from the back of a cache key. + * @key: Pointer to pcache_cache_key structure. + * @cut_len: Length to cut from the back. + * + * Reduces the length of the cache key by cut_len. + */ +static inline void cache_key_cutback(struct pcache_cache_key *key, u32 cut_len) +{ + key->len -= cut_len; +} + +static inline void cache_key_delete(struct pcache_cache_key *key) +{ + struct pcache_cache_subtree *cache_subtree; + + cache_subtree = key->cache_subtree; + if (!cache_subtree) + return; + + rb_erase(&key->rb_node, &cache_subtree->root); + key->flags = 0; + cache_key_put(key); +} + +static inline bool cache_data_crc_on(struct pcache_cache *cache) +{ + return (cache->cache_info->flags & PCACHE_CACHE_FLAGS_DATA_CRC); +} + +/** + * cache_key_data_crc - Calculates CRC for data in a cache key. + * @key: Pointer to the pcache_cache_key structure. + * + * Returns the CRC-32 checksum of the data within the cache key's position. 
+ */ +static inline u32 cache_key_data_crc(struct pcache_cache_key *key) +{ + void *data; + + data = cache_pos_addr(&key->cache_pos); + + return crc32(0, data, key->len); +} + +static inline u32 cache_kset_crc(struct pcache_cache_kset_onmedia *kset_onmedia) +{ + u32 crc_size; + + if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST) + crc_size = sizeof(struct pcache_cache_kset_onmedia) - 4; + else + crc_size = struct_size(kset_onmedia, data, kset_onmedia->key_num) - 4; + + return crc32(0, (void *)kset_onmedia + 4, crc_size); +} + +static inline u32 get_kset_onmedia_size(struct pcache_cache_kset_onmedia *kset_onmedia) +{ + return struct_size_t(struct pcache_cache_kset_onmedia, data, kset_onmedia->key_num); +} + +/** + * cache_seg_remain - Computes remaining space in a cache segment. + * @pos: Pointer to pcache_cache_pos structure. + * + * Returns the amount of remaining space in the segment data starting from + * the current position offset. + */ +static inline u32 cache_seg_remain(struct pcache_cache_pos *pos) +{ + struct pcache_cache_segment *cache_seg; + struct pcache_segment *segment; + u32 seg_remain; + + cache_seg = pos->cache_seg; + segment = &cache_seg->segment; + seg_remain = segment->data_size - pos->seg_off; + + return seg_remain; +} + +/** + * cache_key_invalid - Checks if a cache key is invalid. + * @key: Pointer to pcache_cache_key structure. + * + * Returns true if the cache key is invalid due to its generation being + * less than the generation of its segment; otherwise returns false. + * + * When the GC (garbage collection) thread identifies a segment + * as reclaimable, it increments the segment's generation (gen). However, + * it does not immediately remove all related cache keys. When accessing + * such a cache key, this function can be used to determine if the cache + * key has already become invalid. + */ +static inline bool cache_key_invalid(struct pcache_cache_key *key) +{ + if (cache_key_empty(key)) + return false; + + return (key->seg_gen < key->cache_pos.cache_seg->gen); +} + +/** + * cache_key_lstart - Retrieves the logical start offset of a cache key. + * @key: Pointer to pcache_cache_key structure. + * + * Returns the logical start offset for the cache key. + */ +static inline u64 cache_key_lstart(struct pcache_cache_key *key) +{ + return key->off; +} + +/** + * cache_key_lend - Retrieves the logical end offset of a cache key. + * @key: Pointer to pcache_cache_key structure. + * + * Returns the logical end offset for the cache key. + */ +static inline u64 cache_key_lend(struct pcache_cache_key *key) +{ + return key->off + key->len; +} + +static inline void cache_key_copy(struct pcache_cache_key *key_dst, struct pcache_cache_key *key_src) +{ + key_dst->off = key_src->off; + key_dst->len = key_src->len; + key_dst->seg_gen = key_src->seg_gen; + key_dst->cache_tree = key_src->cache_tree; + key_dst->cache_subtree = key_src->cache_subtree; + key_dst->flags = key_src->flags; + + cache_pos_copy(&key_dst->cache_pos, &key_src->cache_pos); +} + +/** + * cache_pos_onmedia_crc - Calculates the CRC for an on-media cache position. + * @pos_om: Pointer to pcache_cache_pos_onmedia structure. + * + * Calculates the CRC-32 checksum of the position, excluding the first 4 bytes. + * Returns the computed CRC value. 
+ */ +static inline u32 cache_pos_onmedia_crc(struct pcache_cache_pos_onmedia *pos_om) +{ + return pcache_meta_crc(&pos_om->header, sizeof(struct pcache_cache_pos_onmedia)); +} + +void cache_pos_encode(struct pcache_cache *cache, + struct pcache_cache_pos_onmedia *pos_onmedia, + struct pcache_cache_pos *pos); +int cache_pos_decode(struct pcache_cache *cache, + struct pcache_cache_pos_onmedia *pos_onmedia, + struct pcache_cache_pos *pos); + +static inline void cache_encode_key_tail(struct pcache_cache *cache) +{ + mutex_lock(&cache->key_tail_lock); + cache_pos_encode(cache, cache->cache_ctrl->key_tail_pos, &cache->key_tail); + mutex_unlock(&cache->key_tail_lock); +} + +static inline int cache_decode_key_tail(struct pcache_cache *cache) +{ + int ret; + + mutex_lock(&cache->key_tail_lock); + ret = cache_pos_decode(cache, cache->cache_ctrl->key_tail_pos, &cache->key_tail); + mutex_unlock(&cache->key_tail_lock); + + return ret; +} + +static inline void cache_encode_dirty_tail(struct pcache_cache *cache) +{ + mutex_lock(&cache->dirty_tail_lock); + cache_pos_encode(cache, cache->cache_ctrl->dirty_tail_pos, &cache->dirty_tail); + mutex_unlock(&cache->dirty_tail_lock); +} + +static inline int cache_decode_dirty_tail(struct pcache_cache *cache) +{ + int ret; + + mutex_lock(&cache->dirty_tail_lock); + ret = cache_pos_decode(cache, cache->cache_ctrl->dirty_tail_pos, &cache->dirty_tail); + mutex_unlock(&cache->dirty_tail_lock); + + return ret; +} +#endif /* _PCACHE_CACHE_H */ From patchwork Mon Apr 14 01:45:00 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049560 Received: from out-176.mta1.migadu.com (out-176.mta1.migadu.com [95.215.58.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 80A781A23B9 for ; Mon, 14 Apr 2025 01:45:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.176 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595148; cv=none; b=OWiAr+qe0xyPXq1edk8lL/Czs7PwruuERRz/j5a3qlxya8eT8Qe6kfnLhJlp0aJwlZGV9hK/VKqvh1tAq4qEHzd5L/D3+C/fRQ5th4NoQTAoCzg16H4X2MLGgYgEDmneh3nQZSOSyROmT7KBXRfYMvb+G6KutGSbFjbD0Lsmf38= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595148; c=relaxed/simple; bh=fJZb+T7ydJBmQg2edPbLgj8WD+nr9Nq9XAEx/4GGILk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Dh39Ivxf73HPIeC7hxIOHF2CpqleO6ded331wrBJf9ANsb64eVyQ8MbrQ/UUKQAcRLgeHmywUWhcVs3YwFzvszGxi+sVdwrXgYGPmRnVpdn4ErbtxSyK+H20UcSucOzKVU1uRVz7G8vbwnKIf/y+XQzPQWdZKkKf4MexOCq3bC0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=H7W+W61v; arc=none smtp.client-ip=95.215.58.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="H7W+W61v" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1744595144; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qEGvUEgO4KAwZ8jolAEZFp5P76r6dGDynUlkFVVzuhw=; b=H7W+W61vBdQ/L4c9vHM0CtgaeOXdiy+6PvClLmJgjL4I0PsJIz93Ph1BtFWaxCGGfE8ZQE nxtSGHa63DPdewVzBtrdl35EX4zyFdsmXH1Op5njLnt6nAai0w5w0GI813jvuCLfup++yK TIXbBDWreYP0KCbLc5SFdGAQWC34CwI= From: Dongsheng Yang To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang Subject: [RFC PATCH 06/11] pcache: gc and writeback Date: Mon, 14 Apr 2025 01:45:00 +0000 Message-Id: <20250414014505.20477-7-dongsheng.yang@linux.dev> In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev> References: <20250414014505.20477-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT This patch adds support for writeback and garbage collection (GC) to manage cache segment lifecycle and ensure long-term data integrity in the PCACHE system. The writeback logic traverses cached data (in the form of ksets) from `dirty_tail`, writing back valid keys to the backing device. This is done in FIFO order and ensures that data is synchronously flushed to persistent storage before being marked clean. After all dirty keys in a kset are written back, the segment is considered clean and can be reclaimed. The garbage collection mechanism monitors cache usage and, once the percentage of used segments exceeds the configured `gc_percent` threshold, begins reclaiming cache segments from `key_tail` forward. Only fully clean segments are eligible for reuse. Because writeback and GC operate in FIFO order, this model guarantees that, even if the cache device fails unexpectedly, the data on the backing device remains crash-consistent. Signed-off-by: Dongsheng Yang --- drivers/block/pcache/cache_gc.c | 150 ++++++++++++++++++++ drivers/block/pcache/cache_writeback.c | 183 +++++++++++++++++++++++++ 2 files changed, 333 insertions(+) create mode 100644 drivers/block/pcache/cache_gc.c create mode 100644 drivers/block/pcache/cache_writeback.c diff --git a/drivers/block/pcache/cache_gc.c b/drivers/block/pcache/cache_gc.c new file mode 100644 index 000000000000..b32cc2704dfb --- /dev/null +++ b/drivers/block/pcache/cache_gc.c @@ -0,0 +1,150 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include "cache.h" +#include "backing_dev.h" + +/** + * cache_key_gc - Releases the reference of a cache key segment. + * @cache: Pointer to the pcache_cache structure. + * @key: Pointer to the cache key to be garbage collected. + * + * This function decrements the reference count of the cache segment + * associated with the given key. If the reference count drops to zero, + * the segment may be invalidated and reused. + */ +static void cache_key_gc(struct pcache_cache *cache, struct pcache_cache_key *key) +{ + cache_seg_put(key->cache_pos.cache_seg); +} + +/** + * need_gc - Determines if garbage collection is needed for the cache. + * @cache: Pointer to the pcache_cache structure. 
+ * + * This function checks if garbage collection is necessary based on the + * current state of the cache, including the position of the dirty tail, + * the integrity of the key segment on media, and the percentage of used + * segments compared to the configured threshold. + * + * Return: true if garbage collection is needed, false otherwise. + */ +static bool need_gc(struct pcache_cache *cache) +{ + struct pcache_cache_kset_onmedia *kset_onmedia; + void *dirty_addr, *key_addr; + u32 segs_used, segs_gc_threshold; + + dirty_addr = cache_pos_addr(&cache->dirty_tail); + key_addr = cache_pos_addr(&cache->key_tail); + if (dirty_addr == key_addr) { + backing_dev_debug(cache->backing_dev, "key tail is equal to dirty tail: %u:%u\n", + cache->dirty_tail.cache_seg->cache_seg_id, + cache->dirty_tail.seg_off); + return false; + } + + /* Check if kset_onmedia is corrupted */ + kset_onmedia = (struct pcache_cache_kset_onmedia *)key_addr; + if (kset_onmedia->magic != PCACHE_KSET_MAGIC) { + backing_dev_debug(cache->backing_dev, "gc error: magic is not as expected. key_tail: %u:%u magic: %llx, expected: %llx\n", + cache->key_tail.cache_seg->cache_seg_id, cache->key_tail.seg_off, + kset_onmedia->magic, PCACHE_KSET_MAGIC); + return false; + } + + /* Verify the CRC of the kset_onmedia */ + if (kset_onmedia->crc != cache_kset_crc(kset_onmedia)) { + backing_dev_debug(cache->backing_dev, "gc error: crc is not as expected. crc: %x, expected: %x\n", + cache_kset_crc(kset_onmedia), kset_onmedia->crc); + return false; + } + + segs_used = bitmap_weight(cache->seg_map, cache->n_segs); + segs_gc_threshold = cache->n_segs * cache->cache_info->gc_percent / 100; + if (segs_used < segs_gc_threshold) { + backing_dev_debug(cache->backing_dev, "segs_used: %u, segs_gc_threshold: %u\n", segs_used, segs_gc_threshold); + return false; + } + + return true; +} + +/** + * last_kset_gc - Advances the garbage collection for the last kset. + * @cache: Pointer to the pcache_cache structure. + * @kset_onmedia: Pointer to the kset_onmedia structure for the last kset. 
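+ * + * Return: 0 if key_tail was advanced to the next segment, or -EAGAIN if + * dirty_tail has not yet left this segment and gc must wait.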
+ */ +static int last_kset_gc(struct pcache_cache *cache, struct pcache_cache_kset_onmedia *kset_onmedia) +{ + struct pcache_cache_segment *cur_seg, *next_seg; + + /* Don't move to the next segment if dirty_tail has not moved */ + if (cache->dirty_tail.cache_seg == cache->key_tail.cache_seg) + return -EAGAIN; + + cur_seg = cache->key_tail.cache_seg; + + next_seg = &cache->segments[kset_onmedia->next_cache_seg_id]; + cache->key_tail.cache_seg = next_seg; + cache->key_tail.seg_off = 0; + cache_encode_key_tail(cache); + + backing_dev_debug(cache->backing_dev, "gc advance kset seg: %u\n", cur_seg->cache_seg_id); + + spin_lock(&cache->seg_map_lock); + clear_bit(cur_seg->cache_seg_id, cache->seg_map); + spin_unlock(&cache->seg_map_lock); + + return 0; +} + +void pcache_cache_gc_fn(struct work_struct *work) +{ + struct pcache_cache *cache = container_of(work, struct pcache_cache, gc_work.work); + struct pcache_cache_kset_onmedia *kset_onmedia; + struct pcache_cache_key_onmedia *key_onmedia; + struct pcache_cache_key *key; + int ret; + int i; + + while (true) { + if (!need_gc(cache)) + break; + + kset_onmedia = (struct pcache_cache_kset_onmedia *)cache_pos_addr(&cache->key_tail); + + if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST) { + ret = last_kset_gc(cache, kset_onmedia); + if (ret) + break; + continue; + } + + for (i = 0; i < kset_onmedia->key_num; i++) { + struct pcache_cache_key key_tmp = { 0 }; + + key_onmedia = &kset_onmedia->data[i]; + + key = &key_tmp; + cache_key_init(&cache->req_key_tree, key); + + ret = cache_key_decode(cache, key_onmedia, key); + if (ret) { + backing_dev_err(cache->backing_dev, "failed to decode cache key in gc\n"); + break; + } + + cache_key_gc(cache, key); + } + + backing_dev_debug(cache->backing_dev, "gc advance: %u:%u %u\n", + cache->key_tail.cache_seg->cache_seg_id, + cache->key_tail.seg_off, + get_kset_onmedia_size(kset_onmedia)); + + cache_pos_advance(&cache->key_tail, get_kset_onmedia_size(kset_onmedia)); + cache_encode_key_tail(cache); + } + + queue_delayed_work(cache->backing_dev->task_wq, &cache->gc_work, PCACHE_CACHE_GC_INTERVAL); +} diff --git a/drivers/block/pcache/cache_writeback.c b/drivers/block/pcache/cache_writeback.c new file mode 100644 index 000000000000..5738d2abe831 --- /dev/null +++ b/drivers/block/pcache/cache_writeback.c @@ -0,0 +1,183 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include + +#include "cache.h" +#include "backing_dev.h" + +static inline bool is_cache_clean(struct pcache_cache *cache) +{ + struct pcache_cache_kset_onmedia *kset_onmedia; + struct pcache_cache_pos *pos; + void *addr; + + pos = &cache->dirty_tail; + addr = cache_pos_addr(pos); + kset_onmedia = (struct pcache_cache_kset_onmedia *)addr; + + /* Check if the magic number matches the expected value */ + if (kset_onmedia->magic != PCACHE_KSET_MAGIC) { + backing_dev_debug(cache->backing_dev, "dirty_tail: %u:%u magic: %llx, not expected: %llx\n", + pos->cache_seg->cache_seg_id, pos->seg_off, + kset_onmedia->magic, PCACHE_KSET_MAGIC); + return true; + } + + /* Verify the CRC checksum for data integrity */ + if (kset_onmedia->crc != cache_kset_crc(kset_onmedia)) { + backing_dev_debug(cache->backing_dev, "dirty_tail: %u:%u crc: %x, not expected: %x\n", + pos->cache_seg->cache_seg_id, pos->seg_off, + cache_kset_crc(kset_onmedia), kset_onmedia->crc); + return true; + } + + return false; +} + +void cache_writeback_exit(struct pcache_cache *cache) +{ + cache_flush(cache); + + while (!is_cache_clean(cache)) + schedule_timeout(HZ); + + 
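/* all dirty ksets have reached the backing device; the writeback worker can be stopped */ +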
cancel_delayed_work_sync(&cache->writeback_work); +} + +int cache_writeback_init(struct pcache_cache *cache) +{ + /* Queue delayed work to start writeback handling */ + queue_delayed_work(cache->backing_dev->task_wq, &cache->writeback_work, 0); + + return 0; +} + +static int cache_key_writeback(struct pcache_cache *cache, struct pcache_cache_key *key) +{ + struct pcache_cache_pos *pos; + void *addr; + ssize_t written; + u32 seg_remain; + u64 off; + + if (cache_key_clean(key)) + return 0; + + pos = &key->cache_pos; + + seg_remain = cache_seg_remain(pos); + BUG_ON(seg_remain < key->len); + + addr = cache_pos_addr(pos); + off = key->off; + + /* Perform synchronous writeback to maintain overwrite sequence. + * Ensures data consistency by writing in order. For instance, if K1 writes + * data to the range 0-4K and then K2 writes to the same range, K1's write + * must complete before K2's. + * + * Note: We defer flushing data immediately after each key's writeback. + * Instead, a `sync` operation is issued once the entire kset (group of keys) + * has completed writeback, ensuring all data from the kset is safely persisted + * to disk while reducing the overhead of frequent flushes. + */ + written = kernel_write(cache->bdev_file, addr, key->len, &off); + if (written != key->len) + return -EIO; + + return 0; +} + +static int cache_kset_writeback(struct pcache_cache *cache, + struct pcache_cache_kset_onmedia *kset_onmedia) +{ + struct pcache_cache_key_onmedia *key_onmedia; + struct pcache_cache_key *key; + u64 start = U64_MAX, end = U64_MAX; + int ret; + u32 i; + + /* Iterate through all keys in the kset and write each back to storage */ + for (i = 0; i < kset_onmedia->key_num; i++) { + struct pcache_cache_key key_tmp = { 0 }; + + key_onmedia = &kset_onmedia->data[i]; + + key = &key_tmp; + cache_key_init(NULL, key); + + ret = cache_key_decode(cache, key_onmedia, key); + if (ret) { + backing_dev_err(cache->backing_dev, "failed to decode key: %llu:%u in writeback.", + key->off, key->len); + return ret; + } + + if (start == U64_MAX || start > key->off) + start = key->off; + if (end == U64_MAX || end < key->off + key->len) + end = key->off + key->len; + + ret = cache_key_writeback(cache, key); + if (ret) { + backing_dev_err(cache->backing_dev, "writeback error: %d\n", ret); + return ret; + } + } + + /* Sync the entire kset's data to disk to ensure durability */ + vfs_fsync_range(cache->bdev_file, start, end, 1); + + return 0; +} + +static void last_kset_writeback(struct pcache_cache *cache, + struct pcache_cache_kset_onmedia *last_kset_onmedia) +{ + struct pcache_cache_segment *next_seg; + + backing_dev_debug(cache->backing_dev, "last kset, next: %u\n", last_kset_onmedia->next_cache_seg_id); + + next_seg = &cache->segments[last_kset_onmedia->next_cache_seg_id]; + + cache->dirty_tail.cache_seg = next_seg; + cache->dirty_tail.seg_off = 0; + cache_encode_dirty_tail(cache); +} + +void cache_writeback_fn(struct work_struct *work) +{ + struct pcache_cache *cache = container_of(work, struct pcache_cache, writeback_work.work); + struct pcache_cache_kset_onmedia *kset_onmedia; + int ret = 0; + void *addr; + + /* Loop until all dirty data is written back and the cache is clean */ + while (true) { + if (is_cache_clean(cache)) + break; + + addr = cache_pos_addr(&cache->dirty_tail); + kset_onmedia = (struct pcache_cache_kset_onmedia *)addr; + + if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST) { + last_kset_writeback(cache, kset_onmedia); + continue; + } + + ret = cache_kset_writeback(cache, kset_onmedia); + if 
(ret) + break; + + backing_dev_debug(cache->backing_dev, "writeback advance: %u:%u %u\n", + cache->dirty_tail.cache_seg->cache_seg_id, + cache->dirty_tail.seg_off, + get_kset_onmedia_size(kset_onmedia)); + + cache_pos_advance(&cache->dirty_tail, get_kset_onmedia_size(kset_onmedia)); + + cache_encode_dirty_tail(cache); + } + + queue_delayed_work(cache->backing_dev->task_wq, &cache->writeback_work, PCACHE_CACHE_WRITEBACK_INTERVAL); +} From patchwork Mon Apr 14 01:45:01 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049561 Received: from out-173.mta1.migadu.com (out-173.mta1.migadu.com [95.215.58.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DB8AF1A3BA1 for ; Mon, 14 Apr 2025 01:45:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.173 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595153; cv=none; b=DFIjI5rNAbGMWEuQRIFJN2kBBpnlRvGb7hjtGQEdrQc+W4CpRmTrYP4HE9laWkHFYEELsk6mLh7oxx4RMY2OZqFt3nOJ899OeM5uGKxImGpO8dis/W8jCSQ1wxgV98xLJTDSaQ9mPbdxgF+KIBFRd56lr8Hd5Ryxyvx13st+Pug= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595153; c=relaxed/simple; bh=VZckwGgBqeyJPiuj1COjAFk16qUJWiOt/jKHpfPHaeY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-Type; b=ZmydMGjkzc2sFRvaH6kUJVHB55OdBQtnFZvt/sporWzjT8qYlBvV31Feq0fcA5j0Va/sg9okPLY2BJRO+90++dyS1500HGEtxhfl7Sakr4Y+d3/gDyd+cNC6UevwBeCSveHWnLNQ9REZL7VThGFmhor2q3cS55o5cR/To3Ig/I4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=C/i6Wx+E; arc=none smtp.client-ip=95.215.58.173 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="C/i6Wx+E" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1744595149; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=PERYZnF60Rueh79zQCyZ/qsBqw4mzIQTx9IQaNEQUms=; b=C/i6Wx+ErnwGPEqGJu/SrAxhuhTnZfDXOMlPYevcCKw7vkKWFW49gFdj+7ffV0NLC5ZkBq OlbFnEgiNdDwPQ8cGkfnXDQthmnXSZSvoY5e0PaAJbXEGMVsMTxFjcHZeyJ1D+SkUqIqZV XWFClSRYVDfxftkxAd950hyOMHimr1Q= From: Dongsheng Yang To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang Subject: [RFC PATCH 07/11] pcache: introduce cache_key infrastructure for persistent metadata management Date: Mon, 14 Apr 2025 01:45:01 +0000 Message-Id: <20250414014505.20477-8-dongsheng.yang@linux.dev> In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev> References: <20250414014505.20477-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT This patch adds a comprehensive cache key management module to pcache. Cache keys represent a mapping between logical offsets and cached data in persistent memory. The new implementation provides functions for: - Allocation and initialization of cache keys with reference counting, using kmem_cache for efficient memory management. - Encoding cache keys into an on-media format (struct pcache_cache_key_onmedia) that stores offset, length, segment information (segment ID and offset), generation number, and flags, with optional data CRC for integrity. - Decoding on-media cache keys and validating their integrity, reporting errors if mismatches occur. - Appending keys to ksets and handling kset flush when a kset becomes full, including support for appending a “last kset” marker for segment chaining. - Insertion of cache keys into the cache tree (implemented as an RB-tree), including custom overlap fixup functions (fixup_overlap_tail, fixup_overlap_head, fixup_overlap_contain, fixup_overlap_contained) to handle various overlapping scenarios during key insertion. - Cache tree traversal and search functions (cache_subtree_walk, cache_subtree_search) for efficient key management and garbage collection, along with a background cleanup routine (clean_fn) to remove invalid keys. This cache_key infrastructure is a key part of the pcache metadata system, enabling persistent, crash-consistent tracking of cached data locations and facilitating recovery and garbage collection. 
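To make the flow between these pieces concrete, here is a minimal usage sketch (not code from this patch) of how a write path can drive the key infrastructure: allocate a refcounted key, insert it into the request key tree with overlap fixup enabled, then append it to the current kset for persistence. The helper name example_insert_write_key is hypothetical, and data placement (cache_data_alloc()), tree locking, error handling, and segment reference counting are omitted; the real sequence appears later in the series in cache_req.c.

/* Minimal usage sketch, assuming only the helpers introduced in this patch. */
static int example_insert_write_key(struct pcache_cache *cache, u64 off, u32 len)
{
	struct pcache_cache_key *key;
	int ret;

	key = cache_key_alloc(&cache->req_key_tree);	/* refcounted, from the kmem_cache */
	if (!key)
		return -ENOMEM;

	key->off = off;		/* logical offset on the backing device */
	key->len = len;		/* length of the cached range */

	/* Insert into the RB-tree; overlaps are resolved by the fixup callbacks. */
	ret = cache_key_insert(&cache->req_key_tree, key, true);
	if (ret) {
		cache_key_put(key);
		return ret;
	}

	/* Encode into the current kset; a full kset is closed out to pmem. */
	return cache_key_append(cache, key);
}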
Signed-off-by: Dongsheng Yang --- drivers/block/pcache/cache_key.c | 885 +++++++++++++++++++++++++++++++ 1 file changed, 885 insertions(+) create mode 100644 drivers/block/pcache/cache_key.c diff --git a/drivers/block/pcache/cache_key.c b/drivers/block/pcache/cache_key.c new file mode 100644 index 000000000000..d68055ae8c2f --- /dev/null +++ b/drivers/block/pcache/cache_key.c @@ -0,0 +1,885 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +#include "cache.h" +#include "backing_dev.h" + +struct pcache_cache_kset_onmedia pcache_empty_kset = { 0 }; + +void cache_key_init(struct pcache_cache_tree *cache_tree, struct pcache_cache_key *key) +{ + kref_init(&key->ref); + key->cache_tree = cache_tree; + INIT_LIST_HEAD(&key->list_node); + RB_CLEAR_NODE(&key->rb_node); +} + +struct pcache_cache_key *cache_key_alloc(struct pcache_cache_tree *cache_tree) +{ + struct pcache_cache_key *key; + + key = kmem_cache_zalloc(cache_tree->key_cache, GFP_NOWAIT); + if (!key) + return NULL; + + cache_key_init(cache_tree, key); + + return key; +} + +/** + * cache_key_get - Increment the reference count of a cache key. + * @key: Pointer to the pcache_cache_key structure. + * + * This function increments the reference count of the specified cache key, + * ensuring that it is not freed while still in use. + */ +void cache_key_get(struct pcache_cache_key *key) +{ + kref_get(&key->ref); +} + +/** + * cache_key_destroy - Free a cache key structure when its reference count drops to zero. + * @ref: Pointer to the kref structure. + * + * This function is called when the reference count of the cache key reaches zero. + * It frees the allocated cache key back to the slab cache. + */ +static void cache_key_destroy(struct kref *ref) +{ + struct pcache_cache_key *key = container_of(ref, struct pcache_cache_key, ref); + struct pcache_cache_tree *cache_tree = key->cache_tree; + + kmem_cache_free(cache_tree->key_cache, key); +} + +void cache_key_put(struct pcache_cache_key *key) +{ + kref_put(&key->ref, cache_key_destroy); +} + +void cache_pos_advance(struct pcache_cache_pos *pos, u32 len) +{ + /* Ensure enough space remains in the current segment */ + BUG_ON(cache_seg_remain(pos) < len); + + pos->seg_off += len; +} + +static void cache_key_encode(struct pcache_cache *cache, + struct pcache_cache_key_onmedia *key_onmedia, + struct pcache_cache_key *key) +{ + key_onmedia->off = key->off; + key_onmedia->len = key->len; + + key_onmedia->cache_seg_id = key->cache_pos.cache_seg->cache_seg_id; + key_onmedia->cache_seg_off = key->cache_pos.seg_off; + + key_onmedia->seg_gen = key->seg_gen; + key_onmedia->flags = key->flags; + + if (cache_data_crc_on(cache)) + key_onmedia->data_crc = cache_key_data_crc(key); +} + +int cache_key_decode(struct pcache_cache *cache, + struct pcache_cache_key_onmedia *key_onmedia, + struct pcache_cache_key *key) +{ + key->off = key_onmedia->off; + key->len = key_onmedia->len; + + key->cache_pos.cache_seg = &cache->segments[key_onmedia->cache_seg_id]; + key->cache_pos.seg_off = key_onmedia->cache_seg_off; + + key->seg_gen = key_onmedia->seg_gen; + key->flags = key_onmedia->flags; + + if (cache_data_crc_on(cache) && + key_onmedia->data_crc != cache_key_data_crc(key)) { + backing_dev_err(cache->backing_dev, "key: %llu:%u seg %u:%u data_crc error: %x, expected: %x\n", + key->off, key->len, key->cache_pos.cache_seg->cache_seg_id, + key->cache_pos.seg_off, cache_key_data_crc(key), key_onmedia->data_crc); + return -EIO; + } + + return 0; +} + +static void append_last_kset(struct pcache_cache *cache, u32 next_seg) +{ + 
struct pcache_cache_kset_onmedia *kset_onmedia; + + kset_onmedia = get_key_head_addr(cache); + kset_onmedia->flags |= PCACHE_KSET_FLAGS_LAST; + kset_onmedia->next_cache_seg_id = next_seg; + kset_onmedia->magic = PCACHE_KSET_MAGIC; + kset_onmedia->crc = cache_kset_crc(kset_onmedia); + cache_pos_advance(&cache->key_head, sizeof(struct pcache_cache_kset_onmedia)); +} + +int cache_kset_close(struct pcache_cache *cache, struct pcache_cache_kset *kset) +{ + struct pcache_cache_kset_onmedia *kset_onmedia; + u32 kset_onmedia_size; + int ret; + + kset_onmedia = &kset->kset_onmedia; + + if (!kset_onmedia->key_num) + return 0; + + kset_onmedia_size = struct_size(kset_onmedia, data, kset_onmedia->key_num); + + spin_lock(&cache->key_head_lock); +again: + /* Reserve space for the last kset */ + if (cache_seg_remain(&cache->key_head) < kset_onmedia_size + sizeof(struct pcache_cache_kset_onmedia)) { + struct pcache_cache_segment *next_seg; + + next_seg = get_cache_segment(cache); + if (!next_seg) { + ret = -EBUSY; + goto out; + } + + /* clear outdated kset in next seg */ + memcpy_flushcache(next_seg->segment.data, &pcache_empty_kset, + sizeof(struct pcache_cache_kset_onmedia)); + append_last_kset(cache, next_seg->cache_seg_id); + cache->key_head.cache_seg = next_seg; + cache->key_head.seg_off = 0; + goto again; + } + + kset_onmedia->magic = PCACHE_KSET_MAGIC; + kset_onmedia->crc = cache_kset_crc(kset_onmedia); + + /* clear outdated kset after current kset */ + memcpy_flushcache(get_key_head_addr(cache) + kset_onmedia_size, &pcache_empty_kset, + sizeof(struct pcache_cache_kset_onmedia)); + + /* write current kset into segment */ + memcpy_flushcache(get_key_head_addr(cache), kset_onmedia, kset_onmedia_size); + memset(kset_onmedia, 0, sizeof(struct pcache_cache_kset_onmedia)); + cache_pos_advance(&cache->key_head, kset_onmedia_size); + + ret = 0; +out: + spin_unlock(&cache->key_head_lock); + + return ret; +} + +/** + * cache_key_append - Append a cache key to the related kset. + * @cache: Pointer to the pcache_cache structure. + * @key: Pointer to the cache key structure to append. + * + * This function appends a cache key to the appropriate kset. If the kset + * is full, it closes the kset. If not, it queues a flush work to write + * the kset to media. + * + * Returns 0 on success, or a negative error code on failure. + */ +int cache_key_append(struct pcache_cache *cache, struct pcache_cache_key *key) +{ + struct pcache_cache_kset *kset; + struct pcache_cache_kset_onmedia *kset_onmedia; + struct pcache_cache_key_onmedia *key_onmedia; + u32 kset_id = get_kset_id(cache, key->off); + int ret = 0; + + kset = get_kset(cache, kset_id); + kset_onmedia = &kset->kset_onmedia; + + spin_lock(&kset->kset_lock); + key_onmedia = &kset_onmedia->data[kset_onmedia->key_num]; + cache_key_encode(cache, key_onmedia, key); + + /* Check if the current kset has reached the maximum number of keys */ + if (++kset_onmedia->key_num == PCACHE_KSET_KEYS_MAX) { + /* If full, close the kset */ + ret = cache_kset_close(cache, kset); + if (ret) { + kset_onmedia->key_num--; + goto out; + } + } else { + /* If not full, queue a delayed work to flush the kset */ + queue_delayed_work(cache->backing_dev->task_wq, &kset->flush_work, 1 * HZ); + } +out: + spin_unlock(&kset->kset_lock); + + return ret; +} + +/** + * cache_subtree_walk - Traverse the cache tree. + * @cache: Pointer to the pcache_cache structure. + * @ctx: Pointer to the context structure for traversal. + * + * This function traverses the cache tree starting from the specified node. 
+ * It calls the appropriate callback functions based on the relationships + * between the keys in the cache tree. + * + * Returns 0 on success, or a negative error code on failure. + */ +int cache_subtree_walk(struct pcache_cache_subtree_walk_ctx *ctx) +{ + struct pcache_cache_key *key_tmp, *key; + struct rb_node *node_tmp; + int ret; + + key = ctx->key; + node_tmp = ctx->start_node; + + while (node_tmp) { + if (ctx->walk_done && ctx->walk_done(ctx)) + break; + + key_tmp = CACHE_KEY(node_tmp); + /* + * If key_tmp ends before the start of key, continue to the next node. + * |----------| + * |=====| + */ + if (cache_key_lend(key_tmp) <= cache_key_lstart(key)) { + if (ctx->after) { + ret = ctx->after(key, key_tmp, ctx); + if (ret) + goto out; + } + goto next; + } + + /* + * If key_tmp starts after the end of key, stop traversing. + * |--------| + * |====| + */ + if (cache_key_lstart(key_tmp) >= cache_key_lend(key)) { + if (ctx->before) { + ret = ctx->before(key, key_tmp, ctx); + if (ret) + goto out; + } + break; + } + + /* Handle overlapping keys */ + if (cache_key_lstart(key_tmp) >= cache_key_lstart(key)) { + /* + * If key_tmp encompasses key. + * |----------------| key_tmp + * |===========| key + */ + if (cache_key_lend(key_tmp) >= cache_key_lend(key)) { + if (ctx->overlap_tail) { + ret = ctx->overlap_tail(key, key_tmp, ctx); + if (ret) + goto out; + } + break; + } + + /* + * If key_tmp is contained within key. + * |----| key_tmp + * |==========| key + */ + if (ctx->overlap_contain) { + ret = ctx->overlap_contain(key, key_tmp, ctx); + if (ret) + goto out; + } + + goto next; + } + + /* + * If key_tmp starts before key ends but ends after key. + * |-----------| key_tmp + * |====| key + */ + if (cache_key_lend(key_tmp) > cache_key_lend(key)) { + if (ctx->overlap_contained) { + ret = ctx->overlap_contained(key, key_tmp, ctx); + if (ret) + goto out; + } + break; + } + + /* + * If key_tmp starts before key and ends within key. + * |--------| key_tmp + * |==========| key + */ + if (ctx->overlap_head) { + ret = ctx->overlap_head(key, key_tmp, ctx); + if (ret) + goto out; + } +next: + node_tmp = rb_next(node_tmp); + } + + if (ctx->walk_finally) { + ret = ctx->walk_finally(ctx); + if (ret) + goto out; + } + + return 0; +out: + return ret; +} + +/** + * cache_subtree_search - Search for a key in the cache tree. + * @cache_subtree: Pointer to the cache tree structure. + * @key: Pointer to the cache key to search for. + * @parentp: Pointer to store the parent node of the found node. + * @newp: Pointer to store the location where the new node should be inserted. + * @delete_key_list: List to collect invalid keys for deletion. + * + * This function searches the cache tree for a specific key and returns + * the node that is the predecessor of the key, or first node if the key is + * less than all keys in the tree. If any invalid keys are found during + * the search, they are added to the delete_key_list for later cleanup. + * + * Returns a pointer to the previous node. 
+ */ +struct rb_node *cache_subtree_search(struct pcache_cache_subtree *cache_subtree, struct pcache_cache_key *key, + struct rb_node **parentp, struct rb_node ***newp, + struct list_head *delete_key_list) +{ + struct rb_node **new, *parent = NULL; + struct pcache_cache_key *key_tmp; + struct rb_node *prev_node = NULL; + + new = &(cache_subtree->root.rb_node); + while (*new) { + key_tmp = container_of(*new, struct pcache_cache_key, rb_node); + if (cache_key_invalid(key_tmp)) + list_add(&key_tmp->list_node, delete_key_list); + + parent = *new; + if (key_tmp->off >= key->off) { + new = &((*new)->rb_left); + } else { + prev_node = *new; + new = &((*new)->rb_right); + } + } + + if (!prev_node) + prev_node = rb_first(&cache_subtree->root); + + if (parentp) + *parentp = parent; + + if (newp) + *newp = new; + + return prev_node; +} + +/** + * fixup_overlap_tail - Adjust the key when it overlaps at the tail. + * @key: Pointer to the new cache key being inserted. + * @key_tmp: Pointer to the existing key that overlaps. + * @ctx: Pointer to the context for walking the cache tree. + * + * This function modifies the existing key (key_tmp) when there is an + * overlap at the tail with the new key. If the modified key becomes + * empty, it is deleted. Returns 0 on success, or -EAGAIN if the key + * needs to be reinserted. + */ +static int fixup_overlap_tail(struct pcache_cache_key *key, + struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx) +{ + int ret; + + /* + * |----------------| key_tmp + * |===========| key + */ + cache_key_cutfront(key_tmp, cache_key_lend(key) - cache_key_lstart(key_tmp)); + if (key_tmp->len == 0) { + cache_key_delete(key_tmp); + ret = -EAGAIN; + + /* + * Deleting key_tmp may change the structure of the + * entire cache tree, so we need to re-search the tree + * to determine the new insertion point for the key. + */ + goto out; + } + + return 0; +out: + return ret; +} + +/** + * fixup_overlap_contain - Handle case where new key completely contains an existing key. + * @key: Pointer to the new cache key being inserted. + * @key_tmp: Pointer to the existing key that is being contained. + * @ctx: Pointer to the context for walking the cache tree. + * + * This function deletes the existing key (key_tmp) when the new key + * completely contains it. It returns -EAGAIN to indicate that the + * tree structure may have changed, necessitating a re-insertion of + * the new key. + */ +static int fixup_overlap_contain(struct pcache_cache_key *key, + struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx) +{ + /* + * |----| key_tmp + * |==========| key + */ + cache_key_delete(key_tmp); + + return -EAGAIN; +} + +/** + * fixup_overlap_contained - Handle overlap when a new key is contained in an existing key. + * @key: The new cache key being inserted. + * @key_tmp: The existing cache key that overlaps with the new key. + * @ctx: Context for the cache tree walk. + * + * This function adjusts the existing key if the new key is contained + * within it. If the existing key is empty, it indicates a placeholder key + * that was inserted during a miss read. This placeholder will later be + * updated with real data from the backing_dev, making it no longer an empty key. + * + * If we delete key or insert a key, the structure of the entire cache tree may change, + * requiring a full research of the tree to find a new insertion point. 
+ */ +static int fixup_overlap_contained(struct pcache_cache_key *key, + struct pcache_cache_key *key_tmp, struct pcache_cache_subtree_walk_ctx *ctx) +{ + struct pcache_cache_tree *cache_tree = ctx->cache_tree; + int ret; + + /* + * |-----------| key_tmp + * |====| key + */ + if (cache_key_empty(key_tmp)) { + /* If key_tmp is empty, don't split it; + * it's a placeholder key for miss reads that will be updated later. + */ + cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key)); + if (key_tmp->len == 0) { + cache_key_delete(key_tmp); + ret = -EAGAIN; + goto out; + } + } else { + struct pcache_cache_key *key_fixup; + bool need_research = false; + + /* Allocate a new cache key for splitting key_tmp */ + key_fixup = cache_key_alloc(cache_tree); + if (!key_fixup) { + ret = -ENOMEM; + goto out; + } + + cache_key_copy(key_fixup, key_tmp); + + /* Split key_tmp based on the new key's range */ + cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key)); + if (key_tmp->len == 0) { + cache_key_delete(key_tmp); + need_research = true; + } + + /* Create a new portion for key_fixup */ + cache_key_cutfront(key_fixup, cache_key_lend(key) - cache_key_lstart(key_tmp)); + if (key_fixup->len == 0) { + cache_key_put(key_fixup); + } else { + /* Insert the new key into the cache */ + ret = cache_key_insert(cache_tree, key_fixup, false); + if (ret) + goto out; + need_research = true; + } + + if (need_research) { + ret = -EAGAIN; + goto out; + } + } + + return 0; +out: + return ret; +} + +/** + * fixup_overlap_head - Handle overlap when a new key overlaps with the head of an existing key. + * @key: The new cache key being inserted. + * @key_tmp: The existing cache key that overlaps with the new key. + * @ctx: Context for the cache tree walk. + * + * This function adjusts the existing key if the new key overlaps + * with the beginning of it. If the resulting key length is zero + * after the adjustment, the key is deleted. This indicates that + * the key no longer holds valid data and requires the tree to be + * re-researched for a new insertion point. + */ +static int fixup_overlap_head(struct pcache_cache_key *key, + struct pcache_cache_key *key_tmp, struct pcache_cache_subtree_walk_ctx *ctx) +{ + /* + * |--------| key_tmp + * |==========| key + */ + /* Adjust key_tmp by cutting back based on the new key's start */ + cache_key_cutback(key_tmp, cache_key_lend(key_tmp) - cache_key_lstart(key)); + if (key_tmp->len == 0) { + /* If the adjusted key_tmp length is zero, delete it */ + cache_key_delete(key_tmp); + return -EAGAIN; + } + + return 0; +} + +/** + * cache_insert_fixup - Fix up overlaps when inserting a new key. + * @cache_tree: Pointer to the cache_tree structure. + * @key: The new cache key to insert. + * @prev_node: The last visited node during the search. + * + * This function initializes a walking context and calls the + * cache_subtree_walk function to handle potential overlaps between + * the new key and existing keys in the cache tree. Various + * fixup functions are provided to manage different overlap scenarios. 
+ */ +static int cache_insert_fixup(struct pcache_cache_tree *cache_tree, + struct pcache_cache_key *key, struct rb_node *prev_node) +{ + struct pcache_cache_subtree_walk_ctx walk_ctx = { 0 }; + + /* Set up the context with the cache, start node, and new key */ + walk_ctx.cache_tree = cache_tree; + walk_ctx.start_node = prev_node; + walk_ctx.key = key; + + /* Assign overlap handling functions for different scenarios */ + walk_ctx.overlap_tail = fixup_overlap_tail; + walk_ctx.overlap_head = fixup_overlap_head; + walk_ctx.overlap_contain = fixup_overlap_contain; + walk_ctx.overlap_contained = fixup_overlap_contained; + + /* Begin walking the cache tree to fix overlaps */ + return cache_subtree_walk(&walk_ctx); +} + +/** + * cache_key_insert - Insert a new cache key into the cache tree. + * @cache_tree: Pointer to the cache_tree structure. + * @key: The cache key to insert. + * @fixup: Indicates if this is a new key being inserted. + * + * This function searches for the appropriate location to insert + * a new cache key into the cache tree. It handles key overlaps + * and ensures any invalid keys are removed before insertion. + * + * Returns 0 on success or a negative error code on failure. + */ +int cache_key_insert(struct pcache_cache_tree *cache_tree, struct pcache_cache_key *key, bool fixup) +{ + struct rb_node **new, *parent = NULL; + struct pcache_cache_subtree *cache_subtree; + struct pcache_cache_key *key_tmp = NULL, *key_next; + struct rb_node *prev_node = NULL; + LIST_HEAD(delete_key_list); + int ret; + + cache_subtree = get_subtree(cache_tree, key->off); + key->cache_subtree = cache_subtree; +search: + prev_node = cache_subtree_search(cache_subtree, key, &parent, &new, &delete_key_list); + if (!list_empty(&delete_key_list)) { + /* Remove invalid keys from the delete list */ + list_for_each_entry_safe(key_tmp, key_next, &delete_key_list, list_node) { + list_del_init(&key_tmp->list_node); + cache_key_delete(key_tmp); + } + goto search; + } + + if (fixup) { + ret = cache_insert_fixup(cache_tree, key, prev_node); + if (ret == -EAGAIN) + goto search; + if (ret) + goto out; + } + + /* Link and insert the new key into the red-black tree */ + rb_link_node(&key->rb_node, parent, new); + rb_insert_color(&key->rb_node, &cache_subtree->root); + + return 0; +out: + return ret; +} + +/** + * clean_fn - Cleanup function to remove invalid keys from the cache tree. + * @work: Pointer to the work_struct associated with the cleanup. + * + * This function cleans up invalid keys from the cache tree in the background + * after a cache segment has been invalidated during cache garbage collection. + * It processes a maximum of PCACHE_CLEAN_KEYS_MAX keys per iteration and holds + * the tree lock to ensure thread safety. 
+ */ +void clean_fn(struct work_struct *work) +{ + struct pcache_cache *cache = container_of(work, struct pcache_cache, clean_work); + struct pcache_cache_subtree *cache_subtree; + struct rb_node *node; + struct pcache_cache_key *key; + int i, count; + + for (i = 0; i < cache->req_key_tree.n_subtrees; i++) { + cache_subtree = &cache->req_key_tree.subtrees[i]; + +again: + if (cache->state == PCACHE_CACHE_STATE_STOPPING) + return; + + /* Delete up to PCACHE_CLEAN_KEYS_MAX keys in one iteration */ + count = 0; + spin_lock(&cache_subtree->tree_lock); + node = rb_first(&cache_subtree->root); + while (node) { + key = CACHE_KEY(node); + node = rb_next(node); + if (cache_key_invalid(key)) { + count++; + cache_key_delete(key); + } + + if (count >= PCACHE_CLEAN_KEYS_MAX) { + /* Unlock and pause before continuing cleanup */ + spin_unlock(&cache_subtree->tree_lock); + usleep_range(1000, 2000); + goto again; + } + } + spin_unlock(&cache_subtree->tree_lock); + } +} + +/* + * kset_flush_fn - Flush work for a cache kset. + * + * This function is called when a kset flush work is queued from + * cache_key_append(). If the kset is full, it will be closed + * immediately. If not, the flush work will be queued for later closure. + * + * If cache_kset_close detects that a new segment is required to store + * the kset and there are no available segments, it will return an error. + * In this scenario, a retry will be attempted. + */ +void kset_flush_fn(struct work_struct *work) +{ + struct pcache_cache_kset *kset = container_of(work, struct pcache_cache_kset, flush_work.work); + struct pcache_cache *cache = kset->cache; + int ret; + + spin_lock(&kset->kset_lock); + ret = cache_kset_close(cache, kset); + spin_unlock(&kset->kset_lock); + + if (ret) { + /* Failed to flush kset, schedule a retry. */ + queue_delayed_work(cache->backing_dev->task_wq, &kset->flush_work, 0); + } +} + +static int kset_replay(struct pcache_cache *cache, struct pcache_cache_kset_onmedia *kset_onmedia) +{ + struct pcache_cache_key_onmedia *key_onmedia; + struct pcache_cache_key *key; + int ret; + int i; + + for (i = 0; i < kset_onmedia->key_num; i++) { + key_onmedia = &kset_onmedia->data[i]; + + key = cache_key_alloc(&cache->req_key_tree); + if (!key) { + ret = -ENOMEM; + goto err; + } + + ret = cache_key_decode(cache, key_onmedia, key); + if (ret) { + cache_key_put(key); + goto err; + } + + /* Mark the segment as used in the segment map. */ + set_bit(key->cache_pos.cache_seg->cache_seg_id, cache->seg_map); + + /* Check if the segment generation is valid for insertion. */ + if (key->seg_gen < key->cache_pos.cache_seg->gen) { + cache_key_put(key); + } else { + ret = cache_key_insert(&cache->req_key_tree, key, true); + if (ret) { + cache_key_put(key); + goto err; + } + } + + cache_seg_get(key->cache_pos.cache_seg); + } + + return 0; +err: + return ret; +} + +int cache_replay(struct pcache_cache *cache) +{ + struct pcache_cache_pos pos_tail; + struct pcache_cache_pos *pos; + struct pcache_cache_kset_onmedia *kset_onmedia; + u32 count = 0; + int ret = 0; + void *addr; + + cache_pos_copy(&pos_tail, &cache->key_tail); + pos = &pos_tail; + + /* Mark the segment as used in the segment map. 
*/ + set_bit(pos->cache_seg->cache_seg_id, cache->seg_map); + + while (true) { + addr = cache_pos_addr(pos); + + kset_onmedia = (struct pcache_cache_kset_onmedia *)addr; + if (kset_onmedia->magic != PCACHE_KSET_MAGIC || + kset_onmedia->crc != cache_kset_crc(kset_onmedia)) { + break; + } + + /* Process the last kset and prepare for the next segment. */ + if (kset_onmedia->flags & PCACHE_KSET_FLAGS_LAST) { + struct pcache_cache_segment *next_seg; + + backing_dev_debug(cache->backing_dev, "last kset replay, next: %u\n", kset_onmedia->next_cache_seg_id); + + next_seg = &cache->segments[kset_onmedia->next_cache_seg_id]; + + pos->cache_seg = next_seg; + pos->seg_off = 0; + + set_bit(pos->cache_seg->cache_seg_id, cache->seg_map); + continue; + } + + /* Replay the kset and check for errors. */ + ret = kset_replay(cache, kset_onmedia); + if (ret) + goto out; + + /* Advance the position after processing the kset. */ + cache_pos_advance(pos, get_kset_onmedia_size(kset_onmedia)); + if (++count > 512) { + cond_resched(); + count = 0; + } + } + + /* Update the key_head position after replaying. */ + spin_lock(&cache->key_head_lock); + cache_pos_copy(&cache->key_head, pos); + spin_unlock(&cache->key_head_lock); + +out: + return ret; +} + +int cache_tree_init(struct pcache_cache *cache, struct pcache_cache_tree *cache_tree, u32 n_subtrees) +{ + int ret; + u32 i; + + cache_tree->cache = cache; + cache_tree->n_subtrees = n_subtrees; + + cache_tree->key_cache = KMEM_CACHE(pcache_cache_key, 0); + if (!cache_tree->key_cache) { + ret = -ENOMEM; + goto err; + } + /* + * Allocate and initialize the subtrees array. + * Each element is a cache tree structure that contains + * an RB tree root and a spinlock for protecting its contents.
+ */ + cache_tree->subtrees = kvcalloc(cache_tree->n_subtrees, sizeof(struct pcache_cache_subtree), GFP_KERNEL); + if (!cache_tree->subtrees) { + ret = -ENOMEM; + goto destroy_key_cache; + } + + for (i = 0; i < cache_tree->n_subtrees; i++) { + struct pcache_cache_subtree *cache_subtree = &cache_tree->subtrees[i]; + + cache_subtree->root = RB_ROOT; + spin_lock_init(&cache_subtree->tree_lock); + } + + return 0; + +destroy_key_cache: + kmem_cache_destroy(cache_tree->key_cache); +err: + return ret; +} + +void cache_tree_exit(struct pcache_cache_tree *cache_tree) +{ + struct pcache_cache_subtree *cache_subtree; + struct rb_node *node; + struct pcache_cache_key *key; + u32 i; + + for (i = 0; i < cache_tree->n_subtrees; i++) { + cache_subtree = &cache_tree->subtrees[i]; + + spin_lock(&cache_subtree->tree_lock); + node = rb_first(&cache_subtree->root); + while (node) { + key = CACHE_KEY(node); + node = rb_next(node); + + cache_key_delete(key); + } + spin_unlock(&cache_subtree->tree_lock); + } + kvfree(cache_tree->subtrees); + kmem_cache_destroy(cache_tree->key_cache); +} From patchwork Mon Apr 14 01:45:02 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049562
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1744595154; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=dxDMoRtSRWuK/H8EUpRueay0Al8ARSlmVenBnsLAI9A=; b=gswPM67S55W8LkRc8FaC+f9Cw6lPkaL8ApGqMV9R47LFz9KdL9KMsqP53+PJwa20KrJjbu 8bL9Qh7ipFiZznI06Tm3bNp2kkv4e4b313/XwwdiZbViSFJ4VNYjurZkAfbKznrT4A9zXl ThuBgRhL5GTehz/HEvl2jSNIbYDbQI8= From: Dongsheng Yang To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang Subject: [RFC PATCH 08/11] pcache: implement request processing and cache I/O path in cache_req Date: Mon, 14 Apr 2025 01:45:02 +0000 Message-Id: <20250414014505.20477-9-dongsheng.yang@linux.dev> In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev> References: <20250414014505.20477-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT This patch introduces the core request processing logic, which handles all I/O operations for the PCACHE system, including read, write, and flush. Read operations walk the in-memory cache tree to locate cached ranges. Missing ranges are submitted to the backing device through asynchronous requests, optionally inserting empty placeholder keys to prevent redundant reads. The traversal logic carefully handles all possible overlapping conditions between requested and cached ranges. Write operations allocate space from per-queue data heads, copy data into persistent memory segments, and append corresponding keys into the current kset for persistence. Flush operations traverse all active ksets and ensure any accumulated keys are written to persistent memory. Each kset is flushed atomically, allowing the cache metadata to remain consistent even in the event of a crash. This patch lays the foundation for the entire cache I/O path, ensuring that cache operations are efficient and crash-safe. Signed-off-by: Dongsheng Yang --- drivers/block/pcache/cache_req.c | 812 +++++++++++++++++++++++++++++++ 1 file changed, 812 insertions(+) create mode 100644 drivers/block/pcache/cache_req.c diff --git a/drivers/block/pcache/cache_req.c b/drivers/block/pcache/cache_req.c new file mode 100644 index 000000000000..9d0bce55caed --- /dev/null +++ b/drivers/block/pcache/cache_req.c @@ -0,0 +1,812 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include "cache.h" +#include "backing_dev.h" +#include "logic_dev.h" + +static int cache_data_head_init(struct pcache_cache *cache, u32 head_index) +{ + struct pcache_cache_segment *next_seg; + struct pcache_cache_data_head *data_head; + + data_head = get_data_head(cache, head_index); + next_seg = get_cache_segment(cache); + if (!next_seg) + return -EBUSY; + + cache_seg_get(next_seg); + data_head->head_pos.cache_seg = next_seg; + data_head->head_pos.seg_off = 0; + + return 0; +} + +/* + * cache_data_alloc - Allocate data for a cache key. + * @cache: Pointer to the cache structure. + * @key: Pointer to the cache key to allocate data for. + * @head_index: Index of the data head to use for allocation. 
+ * + * This function tries to allocate space from the cache segment specified by the + * data head. If the remaining space in the segment is insufficient to allocate + * the requested length for the cache key, it will allocate whatever is available + * and adjust the key's length accordingly. This function does not allocate + * space that crosses segment boundaries. + */ +static int cache_data_alloc(struct pcache_cache *cache, struct pcache_cache_key *key, u32 head_index) +{ + struct pcache_cache_data_head *data_head; + struct pcache_cache_pos *head_pos; + struct pcache_cache_segment *cache_seg; + u32 seg_remain; + u32 allocated = 0, to_alloc; + int ret = 0; + + data_head = get_data_head(cache, head_index); + + spin_lock(&data_head->data_head_lock); +again: + if (!data_head->head_pos.cache_seg) { + seg_remain = 0; + } else { + cache_pos_copy(&key->cache_pos, &data_head->head_pos); + key->seg_gen = key->cache_pos.cache_seg->gen; + + head_pos = &data_head->head_pos; + cache_seg = head_pos->cache_seg; + seg_remain = cache_seg_remain(head_pos); + to_alloc = key->len - allocated; + } + + if (seg_remain > to_alloc) { + /* If remaining space in segment is sufficient for the cache key, allocate it. */ + cache_pos_advance(head_pos, to_alloc); + allocated += to_alloc; + cache_seg_get(cache_seg); + } else if (seg_remain) { + /* If remaining space is not enough, allocate the remaining space and adjust the cache key length. */ + cache_pos_advance(head_pos, seg_remain); + key->len = seg_remain; + + /* Get for key: obtain a reference to the cache segment for the key. */ + cache_seg_get(cache_seg); + /* Put for head_pos->cache_seg: release the reference for the current head's segment. */ + cache_seg_put(head_pos->cache_seg); + head_pos->cache_seg = NULL; + } else { + /* Initialize a new data head if no segment is available. */ + ret = cache_data_head_init(cache, head_index); + if (ret) + goto out; + + goto again; + } + +out: + spin_unlock(&data_head->data_head_lock); + + return ret; +} + +static void cache_copy_from_req_bio(struct pcache_cache *cache, struct pcache_cache_key *key, + struct pcache_request *pcache_req, u32 bio_off) +{ + struct pcache_cache_pos *pos = &key->cache_pos; + struct pcache_segment *segment; + + segment = &pos->cache_seg->segment; + + segment_copy_from_bio(segment, pos->seg_off, key->len, pcache_req->req->bio, bio_off); +} + +static int cache_copy_to_req_bio(struct pcache_cache *cache, struct pcache_request *pcache_req, + u32 bio_off, u32 len, struct pcache_cache_pos *pos, u64 key_gen) +{ + struct pcache_cache_segment *cache_seg = pos->cache_seg; + struct pcache_segment *segment = &cache_seg->segment; + int ret; + + spin_lock(&cache_seg->gen_lock); + if (key_gen < cache_seg->gen) { + spin_unlock(&cache_seg->gen_lock); + return -EINVAL; + } + + ret = segment_copy_to_bio(segment, pos->seg_off, len, pcache_req->req->bio, bio_off); + spin_unlock(&cache_seg->gen_lock); + + return ret; +} + +/** + * miss_read_end_req - Handle the end of a miss read request. + * @cache: Pointer to the cache structure. + * @pcache_req: Pointer to the request structure. + * + * This function is called when a backing request to read data from + * the backing_dev is completed. If the key associated with the request + * is empty (a placeholder), it allocates cache space for the key, + * copies the data read from the bio into the cache, and updates + * the key's status. 
If the key has been overwritten by a write + * request during this process, it will be deleted from the cache + * tree and no further action will be taken. + */ +static void miss_read_end_req(struct pcache_backing_dev_req *backing_req, int ret) +{ + void *priv_data = backing_req->priv_data; + struct pcache_request *pcache_req = backing_req->upper_req; + struct pcache_cache *cache = backing_req->backing_dev->cache; + + if (priv_data) { + struct pcache_cache_key *key; + struct pcache_cache_subtree *cache_subtree; + + key = (struct pcache_cache_key *)priv_data; + cache_subtree = key->cache_subtree; + + /* if this key was deleted from cache_subtree by a write, key->flags should be cleared, + * so if cache_key_empty() return true, this key is still in cache_subtree + */ + spin_lock(&cache_subtree->tree_lock); + if (cache_key_empty(key)) { + /* Check if the backing request was successful. */ + if (ret) { + cache_key_delete(key); + goto unlock; + } + + /* Allocate cache space for the key and copy data from the backing_dev. */ + ret = cache_data_alloc(cache, key, pcache_req->queue->index); + if (ret) { + cache_key_delete(key); + goto unlock; + } + cache_copy_from_req_bio(cache, key, pcache_req, backing_req->bio_off); + key->flags &= ~PCACHE_CACHE_KEY_FLAGS_EMPTY; + key->flags |= PCACHE_CACHE_KEY_FLAGS_CLEAN; + + /* Append the key to the cache. */ + ret = cache_key_append(cache, key); + if (ret) { + cache_seg_put(key->cache_pos.cache_seg); + cache_key_delete(key); + goto unlock; + } + } +unlock: + spin_unlock(&cache_subtree->tree_lock); + cache_key_put(key); + } + + pcache_req_put(pcache_req, ret); +} + +/** + * submit_cache_miss_req - Submit a backing request when cache data is missing + * @cache: The cache context that manages cache operations + * @pcache_req: The cache request containing information about the read request + * + * This function is used to handle cases where a cache read request cannot locate + * the required data in the cache. When such a miss occurs during `cache_subtree_walk`, + * it triggers a backing read request to fetch data from the backing storage. + * + * If `pcache_req->priv_data` is set, it points to a `pcache_cache_key`, representing + * a new cache key to be inserted into the cache. The function calls `cache_key_insert` + * to attempt adding the key. On insertion failure, it releases the key reference and + * clears `priv_data` to avoid further processing. + */ +static void submit_cache_miss_req(struct pcache_cache *cache, struct pcache_backing_dev_req *backing_req) +{ + int ret; + + if (backing_req->priv_data) { + struct pcache_cache_key *key; + + /* Attempt to insert the key into the cache if priv_data is set */ + key = (struct pcache_cache_key *)backing_req->priv_data; + ret = cache_key_insert(&cache->req_key_tree, key, true); + if (ret) { + /* Release the key if insertion fails */ + cache_key_put(key); + backing_req->priv_data = NULL; + backing_req->ret = ret; + backing_dev_req_end(backing_req); + return; + } + } + backing_dev_req_submit(backing_req); +} + +/** + * create_cache_miss_req - Create a backing read request for a cache miss + * @cache: The cache structure that manages cache operations + * @parent: The parent request structure initiating the miss read + * @off: Offset in the parent request to read from + * @len: Length of data to read from the backing_dev + * @insert_key: Determines whether to insert a placeholder empty key in the cache tree + * + * This function generates a new backing read request when a cache miss occurs. 
The + * `insert_key` parameter controls whether a placeholder (empty) cache key should be + * added to the cache tree to prevent multiple backing requests for the same missing + * data. Generally, when the miss read occurs in a cache segment that doesn't contain + * the requested data, a placeholder key is created and inserted. + * + * However, if the cache tree already has an empty key at the location for this + * read, there is no need to create another. Instead, this function just send the + * new request without adding a duplicate placeholder. + * + * Returns: + * A pointer to the newly created request structure on success, or NULL on failure. + * If an empty key is created, it will be released if any errors occur during the + * process to ensure proper cleanup. + */ +static struct pcache_backing_dev_req *create_cache_miss_req(struct pcache_cache *cache, struct pcache_request *parent, + u32 off, u32 len, bool insert_key) +{ + struct pcache_backing_dev *backing_dev = cache->backing_dev; + struct pcache_backing_dev_req *backing_req; + struct pcache_cache_key *key = NULL; + + backing_req = backing_dev_req_create(backing_dev, parent, off, len, miss_read_end_req); + if (!backing_req) + goto out; + + /* Allocate a new empty key if insert_key is set */ + if (insert_key) { + key = cache_key_alloc(&cache->req_key_tree); + if (!key) { + backing_req->ret = -ENOMEM; + goto end_req; + } + + /* Initialize the empty key with offset, length, and empty flag */ + key->off = parent->off + off; + key->len = len; + key->flags |= PCACHE_CACHE_KEY_FLAGS_EMPTY; + } + + /* Attach the empty key to the request if it was created */ + if (key) { + cache_key_get(key); + backing_req->priv_data = key; + } + + return backing_req; + +end_req: + backing_dev_req_end(backing_req); +out: + return NULL; +} + +static int send_cache_miss_req(struct pcache_cache *cache, struct pcache_request *pcache_req, + u32 off, u32 len, bool insert_key) +{ + struct pcache_backing_dev_req *backing_req; + + backing_req = create_cache_miss_req(cache, pcache_req, off, len, insert_key); + if (!backing_req) + return -ENOMEM; + + submit_cache_miss_req(cache, backing_req); + + return 0; +} + +/* + * In the process of walking the cache tree to locate cached data, this + * function handles the situation where the requested data range lies + * entirely before an existing cache node (`key_tmp`). This outcome + * signifies that the target data is absent from the cache (cache miss). + * + * To fulfill this portion of the read request, the function creates a + * backing request (`backing_req`) for the missing data range represented + * by `key`. It then appends this request to the submission list in the + * `ctx`, which will later be processed to retrieve the data from backing + * storage. After setting up the backing request, `req_done` in `ctx` is + * updated to reflect the length of the handled range, and the range + * in `key` is adjusted by trimming off the portion that is now handled. + * + * The scenario handled here: + * + * |--------| key_tmp (existing cached range) + * |====| key (requested range, preceding key_tmp) + * + * Since `key` is before `key_tmp`, it signifies that the requested data + * range is missing in the cache (cache miss) and needs retrieval from + * backing storage. 
+ */ +static int read_before(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx) +{ + struct pcache_backing_dev_req *backing_req; + int ret; + + /* + * In this scenario, `key` represents a range that precedes `key_tmp`, + * meaning the requested data range is missing from the cache tree + * and must be retrieved from the backing_dev. + */ + backing_req = create_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, key->len, true); + if (!backing_req) { + ret = -ENOMEM; + goto out; + } + + list_add(&backing_req->node, ctx->submit_req_list); + ctx->req_done += key->len; + cache_key_cutfront(key, key->len); + + return 0; +out: + return ret; +} + +/* + * During cache_subtree_walk, this function manages a scenario where part of the + * requested data range overlaps with an existing cache node (`key_tmp`). + * + * |----------------| key_tmp (existing cached range) + * |===========| key (requested range, overlapping the tail of key_tmp) + */ +static int read_overlap_tail(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx) +{ + struct pcache_backing_dev_req *backing_req; + u32 io_len; + int ret; + + /* + * Calculate the length of the non-overlapping portion of `key` + * before `key_tmp`, representing the data missing in the cache. + */ + io_len = cache_key_lstart(key_tmp) - cache_key_lstart(key); + if (io_len) { + backing_req = create_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, true); + if (!backing_req) { + ret = -ENOMEM; + goto out; + } + + list_add(&backing_req->node, ctx->submit_req_list); + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + } + + /* + * Handle the overlapping portion by calculating the length of + * the remaining data in `key` that coincides with `key_tmp`. + */ + io_len = cache_key_lend(key) - cache_key_lstart(key_tmp); + if (cache_key_empty(key_tmp)) { + ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, false); + if (ret) + goto out; + } else { + ret = cache_copy_to_req_bio(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, + io_len, &key_tmp->cache_pos, key_tmp->seg_gen); + if (ret) { + list_add(&key_tmp->list_node, ctx->delete_key_list); + goto out; + } + } + + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + + return 0; + +out: + return ret; +} + +/** + * The scenario handled here: + * + * |----| key_tmp (existing cached range) + * |==========| key (requested range) + */ +static int read_overlap_contain(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx) +{ + struct pcache_backing_dev_req *backing_req; + u32 io_len; + int ret; + + /* + * Calculate the non-overlapping part of `key` before `key_tmp` + * to identify the missing data length. + */ + io_len = cache_key_lstart(key_tmp) - cache_key_lstart(key); + if (io_len) { + backing_req = create_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, true); + if (!backing_req) { + ret = -ENOMEM; + goto out; + } + list_add(&backing_req->node, ctx->submit_req_list); + + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + } + + /* + * Handle the overlapping portion between `key` and `key_tmp`. 
+ */ + io_len = key_tmp->len; + if (cache_key_empty(key_tmp)) { + ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, false); + if (ret) + goto out; + } else { + ret = cache_copy_to_req_bio(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, + io_len, &key_tmp->cache_pos, key_tmp->seg_gen); + if (ret) { + list_add(&key_tmp->list_node, ctx->delete_key_list); + goto out; + } + } + + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + + return 0; +out: + return ret; +} + +/* + * |-----------| key_tmp (existing cached range) + * |====| key (requested range, fully within key_tmp) + * + * If `key_tmp` contains valid cached data, this function copies the relevant + * portion to the request's bio. Otherwise, it sends a backing request to + * fetch the required data range. + */ +static int read_overlap_contained(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx) +{ + struct pcache_cache_pos pos; + int ret; + + /* + * Check if `key_tmp` is empty, indicating a miss. If so, initiate + * a backing request to fetch the required data for `key`. + */ + if (cache_key_empty(key_tmp)) { + ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, key->len, false); + if (ret) + goto out; + } else { + cache_pos_copy(&pos, &key_tmp->cache_pos); + cache_pos_advance(&pos, cache_key_lstart(key) - cache_key_lstart(key_tmp)); + + ret = cache_copy_to_req_bio(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, + key->len, &pos, key_tmp->seg_gen); + if (ret) { + list_add(&key_tmp->list_node, ctx->delete_key_list); + goto out; + } + } + + ctx->req_done += key->len; + cache_key_cutfront(key, key->len); + + return 0; +out: + return ret; +} + +/* + * |--------| key_tmp (existing cached range) + * |==========| key (requested range, overlapping the head of key_tmp) + */ +static int read_overlap_head(struct pcache_cache_key *key, struct pcache_cache_key *key_tmp, + struct pcache_cache_subtree_walk_ctx *ctx) +{ + struct pcache_cache_pos pos; + u32 io_len; + int ret; + + io_len = cache_key_lend(key_tmp) - cache_key_lstart(key); + + if (cache_key_empty(key_tmp)) { + ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, io_len, false); + if (ret) + goto out; + } else { + cache_pos_copy(&pos, &key_tmp->cache_pos); + cache_pos_advance(&pos, cache_key_lstart(key) - cache_key_lstart(key_tmp)); + + ret = cache_copy_to_req_bio(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, + io_len, &pos, key_tmp->seg_gen); + if (ret) { + list_add(&key_tmp->list_node, ctx->delete_key_list); + goto out; + } + } + + ctx->req_done += io_len; + cache_key_cutfront(key, io_len); + + return 0; +out: + return ret; +} + +/* + * read_walk_finally - Finalizes the cache read tree walk by submitting any + * remaining backing requests + * @ctx: Context structure holding information about the cache, + * read request, and submission list + * + * This function is called at the end of the `cache_subtree_walk` during a + * cache read operation. It completes the walk by checking if any data + * requested by `key` was not found in the cache tree, and if so, it sends + * a backing request to retrieve that data. Then, it iterates through the + * submission list of backing requests created during the walk, removing + * each request from the list and submitting it. 
+ * + * The scenario managed here includes: + * - Sending a backing request for the remaining length of `key` if it was + * not fulfilled by existing cache entries. + * - Iterating through `ctx->submit_req_list` to submit each backing request + * enqueued during the walk. + * + * This ensures all necessary backing requests for cache misses are submitted + * to the backing storage to retrieve any data that could not be found in + * the cache. + */ +static int read_walk_finally(struct pcache_cache_subtree_walk_ctx *ctx) +{ + struct pcache_backing_dev_req *backing_req, *next_req; + struct pcache_cache_key *key = ctx->key; + int ret; + + if (key->len) { + ret = send_cache_miss_req(ctx->cache_tree->cache, ctx->pcache_req, ctx->req_done, key->len, true); + if (ret) + goto out; + ctx->req_done += key->len; + } + + list_for_each_entry_safe(backing_req, next_req, ctx->submit_req_list, node) { + list_del_init(&backing_req->node); + submit_cache_miss_req(ctx->cache_tree->cache, backing_req); + } + + return 0; + +out: + return ret; +} + +/* + * This function is used within `cache_subtree_walk` to determine whether the + * read operation has covered the requested data length. It compares the + * amount of data processed (`ctx->req_done`) with the total data length + * specified in the original request (`ctx->pcache_req->data_len`). + * + * If `req_done` meets or exceeds the required data length, the function + * returns `true`, indicating the walk is complete. Otherwise, it returns `false`, + * signaling that additional data processing is needed to fulfill the request. + */ +static bool read_walk_done(struct pcache_cache_subtree_walk_ctx *ctx) +{ + return (ctx->req_done >= ctx->pcache_req->data_len); +} + +/* + * cache_read - Process a read request by traversing the cache tree + * @cache: Cache structure holding cache trees and related configurations + * @pcache_req: Request structure with information about the data to read + * + * This function attempts to fulfill a read request by traversing the cache tree(s) + * to locate cached data for the requested range. If parts of the data are missing + * in the cache, backing requests are generated to retrieve the required segments. + * + * The function operates by initializing a key for the requested data range and + * preparing a context (`walk_ctx`) to manage the cache tree traversal. The context + * includes pointers to functions (e.g., `read_before`, `read_overlap_tail`) that handle + * specific conditions encountered during the traversal. The `walk_finally` and `walk_done` + * functions manage the end stages of the traversal, while the `delete_key_list` and + * `submit_req_list` lists track any keys to be deleted or requests to be submitted. + * + * The function first calculates the requested range and checks if it fits within the + * current cache tree (based on the tree's size limits). It then locks the cache tree + * and performs a search to locate any matching keys. If there are outdated keys, + * these are deleted, and the search is restarted to ensure accurate data retrieval. + * + * If the requested range spans multiple cache trees, the function moves on to the + * next tree once the current range has been processed. This continues until the + * entire requested data length has been handled. 
+ */ +static int cache_read(struct pcache_cache *cache, struct pcache_request *pcache_req) +{ + struct pcache_cache_key key_data = { .off = pcache_req->off, .len = pcache_req->data_len }; + struct pcache_cache_subtree *cache_subtree; + struct pcache_cache_key *key_tmp = NULL, *key_next; + struct rb_node *prev_node = NULL; + struct pcache_cache_key *key = &key_data; + struct pcache_cache_subtree_walk_ctx walk_ctx = { 0 }; + LIST_HEAD(delete_key_list); + LIST_HEAD(submit_req_list); + int ret; + + walk_ctx.cache_tree = &cache->req_key_tree; + walk_ctx.req_done = 0; + walk_ctx.pcache_req = pcache_req; + walk_ctx.before = read_before; + walk_ctx.overlap_tail = read_overlap_tail; + walk_ctx.overlap_head = read_overlap_head; + walk_ctx.overlap_contain = read_overlap_contain; + walk_ctx.overlap_contained = read_overlap_contained; + walk_ctx.walk_finally = read_walk_finally; + walk_ctx.walk_done = read_walk_done; + walk_ctx.delete_key_list = &delete_key_list; + walk_ctx.submit_req_list = &submit_req_list; + +next_tree: + key->off = pcache_req->off + walk_ctx.req_done; + key->len = pcache_req->data_len - walk_ctx.req_done; + if (key->len > PCACHE_CACHE_SUBTREE_SIZE - (key->off & PCACHE_CACHE_SUBTREE_SIZE_MASK)) + key->len = PCACHE_CACHE_SUBTREE_SIZE - (key->off & PCACHE_CACHE_SUBTREE_SIZE_MASK); + + cache_subtree = get_subtree(&cache->req_key_tree, key->off); + spin_lock(&cache_subtree->tree_lock); + +search: + prev_node = cache_subtree_search(cache_subtree, key, NULL, NULL, &delete_key_list); + +cleanup_tree: + if (!list_empty(&delete_key_list)) { + list_for_each_entry_safe(key_tmp, key_next, &delete_key_list, list_node) { + list_del_init(&key_tmp->list_node); + cache_key_delete(key_tmp); + } + goto search; + } + + walk_ctx.start_node = prev_node; + walk_ctx.key = key; + + ret = cache_subtree_walk(&walk_ctx); + if (ret == -EINVAL) + goto cleanup_tree; + else if (ret) + goto out; + + spin_unlock(&cache_subtree->tree_lock); + + if (walk_ctx.req_done < pcache_req->data_len) + goto next_tree; + + return 0; +out: + spin_unlock(&cache_subtree->tree_lock); + + return ret; +} + +static int cache_write(struct pcache_cache *cache, struct pcache_request *pcache_req) +{ + struct pcache_cache_subtree *cache_subtree; + struct pcache_cache_key *key; + u64 offset = pcache_req->off; + u32 length = pcache_req->data_len; + u32 io_done = 0; + int ret; + + while (true) { + if (io_done >= length) + break; + + key = cache_key_alloc(&cache->req_key_tree); + if (!key) { + ret = -ENOMEM; + goto err; + } + + key->off = offset + io_done; + key->len = length - io_done; + if (key->len > PCACHE_CACHE_SUBTREE_SIZE - (key->off & PCACHE_CACHE_SUBTREE_SIZE_MASK)) + key->len = PCACHE_CACHE_SUBTREE_SIZE - (key->off & PCACHE_CACHE_SUBTREE_SIZE_MASK); + + ret = cache_data_alloc(cache, key, pcache_req->queue->index); + if (ret) { + cache_key_put(key); + goto err; + } + + if (!key->len) { + cache_seg_put(key->cache_pos.cache_seg); + cache_key_put(key); + continue; + } + + cache_copy_from_req_bio(cache, key, pcache_req, io_done); + + cache_subtree = get_subtree(&cache->req_key_tree, key->off); + spin_lock(&cache_subtree->tree_lock); + ret = cache_key_insert(&cache->req_key_tree, key, true); + if (ret) { + cache_seg_put(key->cache_pos.cache_seg); + cache_key_put(key); + goto unlock; + } + + ret = cache_key_append(cache, key); + if (ret) { + cache_seg_put(key->cache_pos.cache_seg); + cache_key_delete(key); + goto unlock; + } + + io_done += key->len; + spin_unlock(&cache_subtree->tree_lock); + } + + return 0; +unlock: + 
spin_unlock(&cache_subtree->tree_lock); +err: + return ret; +} + +/** + * cache_flush - Flush all ksets to persist any pending cache data + * @cache: Pointer to the cache structure + * + * This function iterates through all ksets associated with the provided `cache` + * and ensures that any data marked for persistence is written to media. For each + * kset, it acquires the kset lock, then invokes `cache_kset_close`, which handles + * the persistence logic for that kset. + * + * If `cache_kset_close` encounters an error, the function exits immediately with + * the respective error code, preventing the flush operation from proceeding to + * subsequent ksets. + */ +int cache_flush(struct pcache_cache *cache) +{ + struct pcache_cache_kset *kset; + u32 i, ret; + + for (i = 0; i < cache->n_ksets; i++) { + kset = get_kset(cache, i); + + spin_lock(&kset->kset_lock); + ret = cache_kset_close(cache, kset); + spin_unlock(&kset->kset_lock); + + if (ret) + return ret; + } + + return 0; +} + +/** + * pcache_cache_handle_req - Entry point for handling cache requests + * @cache: Pointer to the cache structure + * @pcache_req: Pointer to the request structure containing operation and data details + * + * This function serves as the main entry for cache operations, directing + * requests based on their operation type. Depending on the operation (`op`) + * specified in `pcache_req`, the function calls the appropriate helper function + * to process the request. + */ +int pcache_cache_handle_req(struct pcache_cache *cache, struct pcache_request *pcache_req) +{ + switch (pcache_req->op) { + case REQ_OP_FLUSH: + return cache_flush(cache); + case REQ_OP_WRITE: + return cache_write(cache, pcache_req); + case REQ_OP_READ: + return cache_read(cache, pcache_req); + default: + return -EIO; + } + + return 0; +} From patchwork Mon Apr 14 01:45:03 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049563 Received: from out-182.mta1.migadu.com (out-182.mta1.migadu.com [95.215.58.182]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C600E1AB528 for ; Mon, 14 Apr 2025 01:46:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.182 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595163; cv=none; b=Vg4jwjxKb88uzNP4lglppTqI4wRiaz2iXTKjr/W80tdqu30ogVtGAO9FKvx42/y6zU+sgOFVbircgxnJwyONQGtQx5uDFzVIdScGwujqderGw2ywCiW4f2Ju3wZR2wnw4ONhTwi/wP6Wv4yTn/t7pQUNZldX7oNugq9bFv+z/gA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595163; c=relaxed/simple; bh=sxmCPdxrpKlqtCvPbykqVOQ/y6mE9HQZwVV5lIA/aSo=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=n06U+BEK+lEptAQGUXeCxhht9Fqp1eSaWGPCJLprZB00PmxCnyj1+NBjjZ4Vh3LWhC6YTbkG7g5Im16OyqL4Bgi61Q3+SqVgwZMHzGdTbN6yVbj++TFTOF9Y3Mp2+Tyit/ELYxtp64jcfZy4oK6PuyxY/XHWcrq8TUc2GRFNdG4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=KwUmYwgt; arc=none smtp.client-ip=95.215.58.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: 
smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="KwUmYwgt" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1744595158; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=RZZFggpY+ZKiLRCW/hYI9LKJVuoQ9jFBC+b9AKHIz1M=; b=KwUmYwgtosbaDS2/8TrCmlHqVJ3D7yYuhAQg3OO/6UeE8K7rn2BAvf9eDDcrYdR9gMiLEq TNvwyNRTFdHp+3L/kFGKvdugeQ7hzydd0eneMxFKUo5raRCCHPHgYwsn1idWRQMw2Dph8E lsI+17sA3tcpUauYu4KDVRkhk63ZMAs= From: Dongsheng Yang To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang Subject: [RFC PATCH 09/11] pcache: introduce logic block device and request handling Date: Mon, 14 Apr 2025 01:45:03 +0000 Message-Id: <20250414014505.20477-10-dongsheng.yang@linux.dev> In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev> References: <20250414014505.20477-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT This patch introduces the logic block device layer (`pcache_logic_dev`), which connects pcache to the kernel block layer through a standard gendisk interface. It implements the infrastructure to expose the cache as a Linux block device (e.g., /dev/pcache0), enabling I/O submission via standard block device APIs. Key components added: - pcache_logic_dev: Represents the logical block device and encapsulates associated state, such as queues, gendisk, tag set, and open count tracking. - Block I/O path: Implements `pcache_queue_rq()` to translate block layer requests into internal `pcache_request` objects. Handles data reads, writes, and flushes by dispatching them to `pcache_cache_handle_req()` and completing them via `pcache_req_put()`. - Queue management: Initializes per-hctx queues and associates them with `pcache_queue`. Ensures multi-queue support by allocating queues according to the backing device's configuration. - Device lifecycle: Provides `logic_dev_start()` and `logic_dev_stop()` to manage device creation, queue setup, and gendisk registration/unregistration. Tracks open_count to ensure safe teardown. - blkdev integration: Adds `pcache_blkdev_init()` and `pcache_blkdev_exit()` to register/unregister the pcache major number. This forms the upper layer of pcache's I/O path and makes the cache visible as a standard Linux block device. 
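To make the completion model concrete, below is a minimal, hedged sketch (not part of this patch) of the pcache_request lifetime: pcache_queue_rq() initializes the kref, every asynchronous sub-operation takes an extra reference with pcache_req_get(), and each completion path drops one with pcache_req_put(); the final put runs end_req(), which either ends the blk-mq request or requeues it on -ENOMEM/-EBUSY. The demo_* names are illustrative only.

static void demo_async_done(struct pcache_request *pcache_req, int error)
{
	/*
	 * Drop the reference taken at dispatch; the final put completes
	 * (or, for -ENOMEM/-EBUSY, requeues) the original struct request.
	 */
	pcache_req_put(pcache_req, error);
}

static void demo_dispatch_async(struct pcache_request *pcache_req)
{
	/* pin the request while the asynchronous work is outstanding */
	pcache_req_get(pcache_req);

	/* ... start work that eventually calls demo_async_done() ... */
}
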
Signed-off-by: Dongsheng Yang --- drivers/block/pcache/logic_dev.c | 348 +++++++++++++++++++++++++++++++ drivers/block/pcache/logic_dev.h | 73 +++++++ 2 files changed, 421 insertions(+) create mode 100644 drivers/block/pcache/logic_dev.c create mode 100644 drivers/block/pcache/logic_dev.h diff --git a/drivers/block/pcache/logic_dev.c b/drivers/block/pcache/logic_dev.c new file mode 100644 index 000000000000..02917bac2210 --- /dev/null +++ b/drivers/block/pcache/logic_dev.c @@ -0,0 +1,348 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +#include "pcache_internal.h" +#include "cache.h" +#include "backing_dev.h" +#include "logic_dev.h" + +static int pcache_major; +static DEFINE_IDA(pcache_mapped_id_ida); + +static int minor_to_pcache_mapped_id(int minor) +{ + return minor >> PCACHE_PART_SHIFT; +} + +static int logic_dev_open(struct gendisk *disk, blk_mode_t mode) +{ + struct pcache_logic_dev *logic_dev = disk->private_data; + + mutex_lock(&logic_dev->lock); + logic_dev->open_count++; + mutex_unlock(&logic_dev->lock); + + return 0; +} + +static void logic_dev_release(struct gendisk *disk) +{ + struct pcache_logic_dev *logic_dev = disk->private_data; + + mutex_lock(&logic_dev->lock); + logic_dev->open_count--; + mutex_unlock(&logic_dev->lock); +} + +static const struct block_device_operations logic_dev_bd_ops = { + .owner = THIS_MODULE, + .open = logic_dev_open, + .release = logic_dev_release, +}; + +static inline bool pcache_req_nodata(struct pcache_request *pcache_req) +{ + switch (pcache_req->op) { + case REQ_OP_WRITE: + case REQ_OP_READ: + return false; + case REQ_OP_FLUSH: + return true; + default: + BUG(); + } +} + +static blk_status_t pcache_queue_rq(struct blk_mq_hw_ctx *hctx, + const struct blk_mq_queue_data *bd) +{ + struct request *req = bd->rq; + struct pcache_queue *queue = hctx->driver_data; + struct pcache_logic_dev *logic_dev = queue->logic_dev; + struct pcache_request *pcache_req = blk_mq_rq_to_pdu(bd->rq); + int ret; + + memset(pcache_req, 0, sizeof(struct pcache_request)); + kref_init(&pcache_req->ref); + blk_mq_start_request(bd->rq); + + pcache_req->queue = queue; + pcache_req->req = req; + pcache_req->op = req_op(bd->rq); + pcache_req->off = (u64)blk_rq_pos(bd->rq) << SECTOR_SHIFT; + if (!pcache_req_nodata(pcache_req)) + pcache_req->data_len = blk_rq_bytes(bd->rq); + else + pcache_req->data_len = 0; + + ret = pcache_cache_handle_req(logic_dev->backing_dev->cache, pcache_req); + pcache_req_put(pcache_req, ret); + + return BLK_STS_OK; +} + +static int pcache_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data, + unsigned int hctx_idx) +{ + struct pcache_logic_dev *logic_dev = driver_data; + + hctx->driver_data = &logic_dev->queues[hctx_idx]; + + return 0; +} + +const struct blk_mq_ops logic_dev_mq_ops = { + .queue_rq = pcache_queue_rq, + .init_hctx = pcache_init_hctx, +}; + +static int disk_start(struct pcache_logic_dev *logic_dev) +{ + struct gendisk *disk; + struct queue_limits lim = { + .max_hw_sectors = BIO_MAX_VECS * PAGE_SECTORS, + .io_min = 4096, + .io_opt = 4096, + .max_segments = BIO_MAX_VECS, + .max_segment_size = PAGE_SIZE, + .discard_granularity = 0, + .max_hw_discard_sectors = 0, + .max_write_zeroes_sectors = 0 + }; + int ret; + + memset(&logic_dev->tag_set, 0, sizeof(logic_dev->tag_set)); + logic_dev->tag_set.ops = &logic_dev_mq_ops; + logic_dev->tag_set.queue_depth = 128; + logic_dev->tag_set.numa_node = NUMA_NO_NODE; + logic_dev->tag_set.nr_hw_queues = logic_dev->num_queues; + logic_dev->tag_set.cmd_size = sizeof(struct pcache_request); + 
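/*
+	 * cmd_size reserves room for a struct pcache_request behind every
+	 * struct request, so pcache_queue_rq() can recover it with
+	 * blk_mq_rq_to_pdu() instead of allocating per-request state.
+	 */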
logic_dev->tag_set.timeout = 0; + logic_dev->tag_set.driver_data = logic_dev; + + ret = blk_mq_alloc_tag_set(&logic_dev->tag_set); + if (ret) { + logic_dev_err(logic_dev, "failed to alloc tag set %d", ret); + goto err; + } + + disk = blk_mq_alloc_disk(&logic_dev->tag_set, &lim, logic_dev); + if (IS_ERR(disk)) { + ret = PTR_ERR(disk); + logic_dev_err(logic_dev, "failed to alloc disk"); + goto out_tag_set; + } + + snprintf(disk->disk_name, sizeof(disk->disk_name), "pcache%d", + logic_dev->mapped_id); + + disk->major = pcache_major; + disk->first_minor = logic_dev->mapped_id << PCACHE_PART_SHIFT; + disk->minors = (1 << PCACHE_PART_SHIFT); + disk->fops = &logic_dev_bd_ops; + disk->private_data = logic_dev; + + logic_dev->disk = disk; + + set_capacity(logic_dev->disk, logic_dev->dev_size); + set_disk_ro(logic_dev->disk, false); + + /* Register the disk with the system */ + ret = add_disk(logic_dev->disk); + if (ret) + goto put_disk; + + return 0; + +put_disk: + put_disk(logic_dev->disk); +out_tag_set: + blk_mq_free_tag_set(&logic_dev->tag_set); +err: + return ret; +} + +static void disk_stop(struct pcache_logic_dev *logic_dev) +{ + del_gendisk(logic_dev->disk); + put_disk(logic_dev->disk); + blk_mq_free_tag_set(&logic_dev->tag_set); +} + +static struct pcache_logic_dev *logic_dev_alloc(struct pcache_backing_dev *backing_dev) +{ + struct pcache_logic_dev *logic_dev; + int ret; + + logic_dev = kzalloc(sizeof(struct pcache_logic_dev), GFP_KERNEL); + if (!logic_dev) + return NULL; + + logic_dev->backing_dev = backing_dev; + mutex_init(&logic_dev->lock); + INIT_LIST_HEAD(&logic_dev->node); + + logic_dev->mapped_id = ida_simple_get(&pcache_mapped_id_ida, 0, + minor_to_pcache_mapped_id(1 << MINORBITS), + GFP_KERNEL); + if (logic_dev->mapped_id < 0) { + ret = -ENOENT; + goto logic_dev_free; + } + + return logic_dev; + +logic_dev_free: + kfree(logic_dev); + + return NULL; +} + +static void logic_dev_free(struct pcache_logic_dev *logic_dev) +{ + ida_simple_remove(&pcache_mapped_id_ida, logic_dev->mapped_id); + kfree(logic_dev); +} + +static void logic_dev_destroy_queues(struct pcache_logic_dev *logic_dev) +{ + struct pcache_queue *queue; + int i; + + /* Stop each queue associated with the block device */ + for (i = 0; i < logic_dev->num_queues; i++) { + queue = &logic_dev->queues[i]; + if (queue->state == PCACHE_QUEUE_STATE_NONE) + continue; + } + + /* Free the memory allocated for the queues */ + kfree(logic_dev->queues); +} + +static int logic_dev_create_queues(struct pcache_logic_dev *logic_dev) +{ + int i; + struct pcache_queue *queue; + + logic_dev->queues = kcalloc(logic_dev->num_queues, sizeof(struct pcache_queue), GFP_KERNEL); + if (!logic_dev->queues) + return -ENOMEM; + + for (i = 0; i < logic_dev->num_queues; i++) { + queue = &logic_dev->queues[i]; + queue->logic_dev = logic_dev; + queue->index = i; + + queue->state = PCACHE_QUEUE_STATE_RUNNING; + } + + return 0; +} + +static int logic_dev_init(struct pcache_logic_dev *logic_dev, u32 queues) +{ + int ret; + + logic_dev->num_queues = queues; + logic_dev->dev_size = logic_dev->dev_size; + + ret = logic_dev_create_queues(logic_dev); + if (ret < 0) + goto err; + + return 0; +err: + return ret; +} + +static void logic_dev_destroy(struct pcache_logic_dev *logic_dev) +{ + logic_dev_destroy_queues(logic_dev); +} + +int logic_dev_start(struct pcache_backing_dev *backing_dev, u32 queues) +{ + struct pcache_logic_dev *logic_dev; + int ret; + + logic_dev = logic_dev_alloc(backing_dev); + if (!logic_dev) + return -ENOMEM; + + logic_dev->dev_size = 
backing_dev->dev_size; + ret = logic_dev_init(logic_dev, queues); + if (ret) + goto logic_dev_free; + + backing_dev->logic_dev = logic_dev; + + ret = disk_start(logic_dev); + if (ret < 0) + goto logic_dev_destroy; + + return 0; + +logic_dev_destroy: + logic_dev_destroy(logic_dev); +logic_dev_free: + logic_dev_free(logic_dev); + return ret; +} + +int logic_dev_stop(struct pcache_logic_dev *logic_dev) +{ + mutex_lock(&logic_dev->lock); + if (logic_dev->open_count > 0) { + mutex_unlock(&logic_dev->lock); + return -EBUSY; + } + mutex_unlock(&logic_dev->lock); + + disk_stop(logic_dev); + logic_dev_destroy(logic_dev); + logic_dev_free(logic_dev); + + return 0; +} + +int pcache_blkdev_init(void) +{ + pcache_major = register_blkdev(0, "pcache"); + if (pcache_major < 0) + return pcache_major; + + return 0; +} + +void pcache_blkdev_exit(void) +{ + unregister_blkdev(pcache_major, "pcache"); +} + +static void end_req(struct kref *ref) +{ + struct pcache_request *pcache_req = container_of(ref, struct pcache_request, ref); + struct request *req = pcache_req->req; + int ret = pcache_req->ret; + + if (req) { + /* Complete the block layer request based on the return status */ + if (ret == -ENOMEM || ret == -EBUSY) + blk_mq_requeue_request(req, true); + else + blk_mq_end_request(req, errno_to_blk_status(ret)); + } +} + +void pcache_req_get(struct pcache_request *pcache_req) +{ + kref_get(&pcache_req->ref); +} + +void pcache_req_put(struct pcache_request *pcache_req, int ret) +{ + /* Set the return status if it is not already set */ + if (ret && !pcache_req->ret) + pcache_req->ret = ret; + + kref_put(&pcache_req->ref, end_req); +} diff --git a/drivers/block/pcache/logic_dev.h b/drivers/block/pcache/logic_dev.h new file mode 100644 index 000000000000..2a8de0b02369 --- /dev/null +++ b/drivers/block/pcache/logic_dev.h @@ -0,0 +1,73 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _PCACHE_LOGIC_DEV_H +#define _PCACHE_LOGIC_DEV_H + +#include + +#include "pcache_internal.h" + +#define logic_dev_err(logic_dev, fmt, ...) \ + cache_dev_err(logic_dev->backing_dev->cache_dev, "logic_dev%d: " fmt, \ + logic_dev->mapped_id, ##__VA_ARGS__) +#define logic_dev_info(logic_dev, fmt, ...) \ + cache_dev_info(logic_dev->backing_dev->cache_dev, "logic_dev%d: " fmt, \ + logic_dev->mapped_id, ##__VA_ARGS__) +#define logic_dev_debug(logic_dev, fmt, ...) \ + cache_dev_debug(logic_dev->backing_dev->cache_dev, "logic_dev%d: " fmt, \ + logic_dev->mapped_id, ##__VA_ARGS__) + +#define PCACHE_QUEUE_STATE_NONE 0 +#define PCACHE_QUEUE_STATE_RUNNING 1 + +struct pcache_queue { + struct pcache_logic_dev *logic_dev; + u32 index; + + u8 state; +}; + +struct pcache_request { + struct pcache_queue *queue; + struct request *req; + + u64 off; + u32 data_len; + + u8 op; + + struct kref ref; + int ret; +}; + +struct pcache_logic_dev { + int mapped_id; /* id in block device such as: /dev/pcache0 */ + + struct pcache_backing_dev *backing_dev; + + int major; /* blkdev assigned major */ + int minor; + struct gendisk *disk; /* blkdev's gendisk and rq */ + + struct mutex lock; + unsigned long open_count; /* protected by lock */ + + struct list_head node; + + /* Block layer tags. 
*/ + struct blk_mq_tag_set tag_set; + + uint32_t num_queues; + struct pcache_queue *queues; + + u64 dev_size; +}; + +int logic_dev_start(struct pcache_backing_dev *backing_dev, u32 queues); +int logic_dev_stop(struct pcache_logic_dev *logic_dev); + +void pcache_req_get(struct pcache_request *pcache_req); +void pcache_req_put(struct pcache_request *pcache_req, int ret); + +int pcache_blkdev_init(void); +void pcache_blkdev_exit(void); +#endif /* _PCACHE_LOGIC_DEV_H */ From patchwork Mon Apr 14 01:45:04 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049564 Received: from out-174.mta1.migadu.com (out-174.mta1.migadu.com [95.215.58.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A25E21B041E for ; Mon, 14 Apr 2025 01:46:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.174 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595167; cv=none; b=g4h5Y+ruG69zD3c5oB060AP3zA6W7p4DdPnk8uSCsabhPCxmuzW31BI/lni/GGuOq2U0AzMPHYHRtap2L3k23JxsP61fKRaJ2BrGxlzd16K6n8qzg2PxNV2PB1deO4o8oxZyTzuwfigno4XIWG1ztvhOSvIPK2/d+Zi9RwY4rW4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595167; c=relaxed/simple; bh=9cBDjKJhSiTvib7W3fj8aRhdB5gLKX8e+OpW7b/UfMo=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=nKDNCfyhanJHalgsuhhI3ou6hsqIHpKcWxGXrkYbUnphHwzAanWTHYpDf5FjtO5benwzWgv9WPqHQz1ylE1cDtgfuW0O5Pv59WQc9lg1Jy8NNrxhMPr72Uc80QTYQkMOBjOrVR2vGS6y9NOBz/9iZJ/nXjWG5+OjIR+KA2+3QM8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=mKp6sPN0; arc=none smtp.client-ip=95.215.58.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="mKp6sPN0" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1744595162; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=3iTQTKfivIcZOcu/03Vqjvl30ny00w4N5DElQG4ahKU=; b=mKp6sPN0/1sBTM1SlLF4WxzvdwpR2bmeiK+CWM1Gj8tfZjW08win4lfe3xWh5o90+Bshlm Sa41MIWOF+l4IjB6C48Xg1A6P+kEGKF1bAmBAScRwZqrQ0qiN4aFK2cxcLwoQq7yPPify1 sNSGo30O5J0l3sxr3tL/LyHtQ/7hHTI= From: Dongsheng Yang To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang Subject: [RFC PATCH 10/11] pcache: add backing device management Date: Mon, 14 Apr 2025 01:45:04 +0000 Message-Id: <20250414014505.20477-11-dongsheng.yang@linux.dev> In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev> References: <20250414014505.20477-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT This patch introduces support for managing backing devices in the pcache framework. A backing device represents a real block device (e.g., /dev/sdX), which is wrapped with caching support and exposed via a logical device (e.g., /dev/pcacheX). Key highlights: - `pcache_backing_dev`: Encapsulates the state of a backing device, including its metadata, bioset, workqueues, cache instance, and associated logic device. - Metadata persistence: Uses `pcache_backing_dev_info` to persist path, cache config (segment count, GC percent), and other info. Supports update and recovery. - Sysfs interface: Exposes `path`, `cache_segs`, `mapped_id`, `cache_used_segs`, and GC control under `/sys/class/.../backing_devX/`. - I/O request handling: Implements a generic `pcache_backing_dev_req` abstraction, which maps and submits bio chains to the underlying device. Completion is handled asynchronously via `workqueue` to enable decoupled upper-layer processing. - Initialization flow: `backing_dev_start()` prepares the backing device by opening the file, initializing bioset/workqueue, loading or creating metadata, instantiating the cache, starting the logic block device, and registering sysfs. - Clean shutdown via `backing_dev_stop()`. This forms the middle layer of pcache, bridging cache logic and the logical block device to actual physical storage. 
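As a usage illustration of the backing request abstraction described above, here is a short, hedged sketch (not part of this patch): a caller wraps a byte range of the upper pcache_request into a pcache_backing_dev_req, attaches a completion callback, and hands it to the submit workqueue. The demo_* names are illustrative; the real callers live in the cache layer of this series.

static void demo_miss_done(struct pcache_backing_dev_req *backing_req, int ret)
{
	/* balance the upper-request reference taken by backing_dev_req_create() */
	pcache_req_put(backing_req->upper_req, ret);
}

static int demo_read_miss(struct pcache_backing_dev *backing_dev,
			  struct pcache_request *pcache_req, u32 off, u32 len)
{
	struct pcache_backing_dev_req *backing_req;

	/* map [off, off + len) of the upper request onto backing bios */
	backing_req = backing_dev_req_create(backing_dev, pcache_req,
					     off, len, demo_miss_done);
	if (!backing_req)
		return -ENOMEM;

	/* queued on submit_list; req_submit_fn() issues the bios from task_wq */
	backing_dev_req_submit(backing_req);

	return 0;
}
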
Signed-off-by: Dongsheng Yang --- drivers/block/pcache/backing_dev.c | 593 +++++++++++++++++++++++++++++ drivers/block/pcache/backing_dev.h | 105 +++++ 2 files changed, 698 insertions(+) create mode 100644 drivers/block/pcache/backing_dev.c create mode 100644 drivers/block/pcache/backing_dev.h diff --git a/drivers/block/pcache/backing_dev.c b/drivers/block/pcache/backing_dev.c new file mode 100644 index 000000000000..89a87e715f60 --- /dev/null +++ b/drivers/block/pcache/backing_dev.c @@ -0,0 +1,593 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +#include + +#include "pcache_internal.h" +#include "cache_dev.h" +#include "cache.h" +#include "backing_dev.h" +#include "logic_dev.h" +#include "meta_segment.h" + +static ssize_t path_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pcache_backing_dev *backing_dev; + + backing_dev = container_of(dev, struct pcache_backing_dev, device); + + return sprintf(buf, "%s\n", backing_dev->backing_dev_info.path); +} +static DEVICE_ATTR_ADMIN_RO(path); + +static ssize_t mapped_id_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pcache_backing_dev *backing_dev; + + backing_dev = container_of(dev, struct pcache_backing_dev, device); + + return sprintf(buf, "%u\n", backing_dev->logic_dev->mapped_id); +} +static DEVICE_ATTR_ADMIN_RO(mapped_id); + +/* sysfs for cache */ +static ssize_t cache_segs_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pcache_backing_dev *backing_dev; + + backing_dev = container_of(dev, struct pcache_backing_dev, device); + + return sprintf(buf, "%u\n", backing_dev->cache_segs); +} +static DEVICE_ATTR_ADMIN_RO(cache_segs); + +static ssize_t cache_used_segs_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pcache_backing_dev *backing_dev; + u32 segs_used; + + backing_dev = container_of(dev, struct pcache_backing_dev, device); + segs_used = bitmap_weight(backing_dev->cache->seg_map, backing_dev->cache->n_segs); + return sprintf(buf, "%u\n", segs_used); +} +static DEVICE_ATTR_ADMIN_RO(cache_used_segs); + +static ssize_t cache_gc_percent_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct pcache_backing_dev *backing_dev; + + backing_dev = container_of(dev, struct pcache_backing_dev, device); + + return sprintf(buf, "%u\n", backing_dev->backing_dev_info.cache_info.gc_percent); +} + +static ssize_t cache_gc_percent_store(struct device *dev, + struct device_attribute *attr, + const char *buf, + size_t size) +{ + struct pcache_backing_dev *backing_dev; + unsigned long val; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + backing_dev = container_of(dev, struct pcache_backing_dev, device); + ret = kstrtoul(buf, 10, &val); + if (ret) + return ret; + + if (val < PCACHE_CACHE_GC_PERCENT_MIN || + val > PCACHE_CACHE_GC_PERCENT_MAX) + return -EINVAL; + + backing_dev->backing_dev_info.cache_info.gc_percent = val; + backing_dev_info_write(backing_dev); + + return size; +} +static DEVICE_ATTR_ADMIN_RW(cache_gc_percent); + +static struct attribute *backing_dev_attrs[] = { + &dev_attr_path.attr, + &dev_attr_mapped_id.attr, + &dev_attr_cache_segs.attr, + &dev_attr_cache_used_segs.attr, + &dev_attr_cache_gc_percent.attr, + NULL +}; + +static struct attribute_group backing_dev_attr_group = { + .attrs = backing_dev_attrs, +}; + +static const struct attribute_group *backing_dev_attr_groups[] = { + &backing_dev_attr_group, + NULL +}; + +static void backing_dev_release(struct device 
*dev) +{ +} + +const struct device_type backing_dev_type = { + .name = "backing_dev", + .groups = backing_dev_attr_groups, + .release = backing_dev_release, +}; + +void backing_dev_info_write(struct pcache_backing_dev *backing_dev) +{ + struct pcache_backing_dev_info *info; + struct pcache_meta_header *meta; + + mutex_lock(&backing_dev->info_lock); + + meta = &backing_dev->backing_dev_info.header; + meta->seq++; + + info = pcache_meta_find_oldest(&backing_dev->backing_dev_info_addr->header, PCACHE_BACKING_DEV_INFO_SIZE); + memcpy(info, &backing_dev->backing_dev_info, sizeof(struct pcache_backing_dev_info)); + info->header.crc = pcache_meta_crc(&info->header, PCACHE_BACKING_DEV_INFO_SIZE); + + cache_dev_flush(backing_dev->cache_dev, info, PCACHE_BACKING_DEV_INFO_SIZE); + mutex_unlock(&backing_dev->info_lock); +} + +static int backing_dev_info_load(struct pcache_backing_dev *backing_dev) +{ + struct pcache_backing_dev_info *info; + int ret = 0; + + mutex_lock(&backing_dev->info_lock); + + info = pcache_meta_find_latest(&backing_dev->backing_dev_info_addr->header, PCACHE_BACKING_DEV_INFO_SIZE); + if (!info) { + ret = -EIO; + goto unlock; + } + + memcpy(&backing_dev->backing_dev_info, info, sizeof(struct pcache_backing_dev_info)); +unlock: + mutex_unlock(&backing_dev->info_lock); + return ret; +} + +static void backing_dev_free(struct pcache_backing_dev *backing_dev) +{ + drain_workqueue(backing_dev->task_wq); + destroy_workqueue(backing_dev->task_wq); + kmem_cache_destroy(backing_dev->backing_req_cache); + kfree(backing_dev); +} + +static void req_submit_fn(struct work_struct *work); +static void req_complete_fn(struct work_struct *work); +static struct pcache_backing_dev *backing_dev_alloc(struct pcache_cache_dev *cache_dev) +{ + struct pcache_backing_dev *backing_dev; + + backing_dev = kzalloc(sizeof(struct pcache_backing_dev), GFP_KERNEL); + if (!backing_dev) + return NULL; + + backing_dev->backing_req_cache = KMEM_CACHE(pcache_backing_dev_req, 0); + if (!backing_dev->backing_req_cache) + goto free_backing_dev; + + backing_dev->task_wq = alloc_workqueue("pcache-backing-wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 0); + if (!backing_dev->task_wq) + goto destroy_io_cache; + + backing_dev->cache_dev = cache_dev; + + mutex_init(&backing_dev->info_lock); + INIT_LIST_HEAD(&backing_dev->node); + INIT_LIST_HEAD(&backing_dev->submit_list); + INIT_LIST_HEAD(&backing_dev->complete_list); + spin_lock_init(&backing_dev->lock); + spin_lock_init(&backing_dev->submit_lock); + spin_lock_init(&backing_dev->complete_lock); + INIT_WORK(&backing_dev->req_submit_work, req_submit_fn); + INIT_WORK(&backing_dev->req_complete_work, req_complete_fn); + + return backing_dev; + +destroy_io_cache: + kmem_cache_destroy(backing_dev->backing_req_cache); +free_backing_dev: + kfree(backing_dev); + return NULL; +} + +static int backing_dev_cache_init(struct pcache_backing_dev *backing_dev, + struct pcache_backing_dev_opts *backing_opts, + bool new_backing_dev) +{ + struct pcache_cache_opts cache_opts = { 0 }; + int ret; + + backing_dev->cache_segs = backing_opts->cache_segs; + cache_opts.cache_info = &backing_dev->backing_dev_info.cache_info; + cache_opts.n_segs = backing_opts->cache_segs; + cache_opts.n_paral = backing_opts->queues; + cache_opts.new_cache = new_backing_dev; + cache_opts.data_crc = backing_opts->data_crc; + cache_opts.bdev_file = backing_dev->bdev_file; + cache_opts.dev_size = backing_dev->dev_size; + + backing_dev->cache = pcache_cache_alloc(backing_dev, &cache_opts); + if (!backing_dev->cache) { + ret = -ENOMEM; + 
goto err; + } + + return 0; + +err: + return ret; +} + +static void backing_dev_cache_destroy(struct pcache_backing_dev *backing_dev) +{ + if (backing_dev->cache) + pcache_cache_destroy(backing_dev->cache); +} + +static int backing_dev_sysfs_init(struct pcache_backing_dev *backing_dev) +{ + struct device *dev; + struct pcache_logic_dev *logic_dev = backing_dev->logic_dev; + int ret; + + dev = &backing_dev->device; + device_initialize(dev); + device_set_pm_not_required(dev); + dev->type = &backing_dev_type; + dev->parent = &backing_dev->cache_dev->device; + dev_set_name(dev, "backing_dev%d", backing_dev->backing_dev_id); + + ret = device_add(dev); + if (ret) + goto err; + + ret = sysfs_create_link(&disk_to_dev(logic_dev->disk)->kobj, + &backing_dev->device.kobj, "pcache"); + if (ret) + goto dev_unregister; + + bd_link_disk_holder(backing_dev->bdev, logic_dev->disk); + bd_link_disk_holder(backing_dev->cache_dev->bdev, logic_dev->disk); + + return 0; + +dev_unregister: + device_unregister(dev); +err: + return ret; +} + +static void backing_dev_sysfs_exit(struct pcache_backing_dev *backing_dev) +{ + struct pcache_logic_dev *logic_dev = backing_dev->logic_dev; + + bd_unlink_disk_holder(backing_dev->cache_dev->bdev, logic_dev->disk); + bd_unlink_disk_holder(backing_dev->bdev, logic_dev->disk); + sysfs_remove_link(&disk_to_dev(logic_dev->disk)->kobj, "pcache"); + device_unregister(&backing_dev->device); +} + +static int backing_dev_init(struct pcache_backing_dev *backing_dev, struct pcache_backing_dev_opts *backing_opts) +{ + struct pcache_cache_dev *cache_dev = backing_dev->cache_dev; + bool new_backing; + int ret; + + memcpy(backing_dev->backing_dev_info.path, backing_opts->path, PCACHE_PATH_LEN); + + backing_dev->bdev_file = bdev_file_open_by_path(backing_dev->backing_dev_info.path, + BLK_OPEN_READ | BLK_OPEN_WRITE, backing_dev, NULL); + if (IS_ERR(backing_dev->bdev_file)) { + backing_dev_err(backing_dev, "failed to open bdev: %d", (int)PTR_ERR(backing_dev->bdev_file)); + ret = PTR_ERR(backing_dev->bdev_file); + goto err; + } + + backing_dev->bdev = file_bdev(backing_dev->bdev_file); + backing_dev->dev_size = bdev_nr_sectors(backing_dev->bdev); + + ret = bioset_init(&backing_dev->bioset, 1024, 0, BIOSET_NEED_BVECS); + if (ret) + goto close_bdev; + + ret = cache_dev_find_backing_info(cache_dev, backing_dev, &new_backing); + if (ret) + goto bioset_exit; + + if (!new_backing) + backing_dev_info_load(backing_dev); + + ret = backing_dev_cache_init(backing_dev, backing_opts, new_backing); + if (ret) + goto bioset_exit; + + ret = logic_dev_start(backing_dev, backing_opts->queues); + if (ret) + goto destroy_cache; + + ret = backing_dev_sysfs_init(backing_dev); + if (ret) + goto logic_dev_stop; + + backing_dev->backing_dev_info.state = PCACHE_BACKING_STATE_RUNNING; + backing_dev->backing_dev_info.backing_dev_id = backing_dev->backing_dev_id; + backing_dev_info_write(backing_dev); + + cache_dev_add_backing(cache_dev, backing_dev); + + return 0; + +logic_dev_stop: + logic_dev_stop(backing_dev->logic_dev); +destroy_cache: + backing_dev_cache_destroy(backing_dev); +bioset_exit: + bioset_exit(&backing_dev->bioset); +close_bdev: + fput(backing_dev->bdev_file); +err: + return ret; +} + +static int backing_dev_destroy(struct pcache_backing_dev *backing_dev) +{ + backing_dev_sysfs_exit(backing_dev); + logic_dev_stop(backing_dev->logic_dev); + backing_dev_cache_destroy(backing_dev); + bioset_exit(&backing_dev->bioset); + fput(backing_dev->bdev_file); + + backing_dev->backing_dev_info.state = 
PCACHE_BACKING_STATE_NONE; + backing_dev_info_write(backing_dev); + + return 0; +} + +int backing_dev_start(struct pcache_cache_dev *cache_dev, struct pcache_backing_dev_opts *backing_opts) +{ + struct pcache_backing_dev *backing_dev; + int ret; + + /* Check if path starts with "/dev/" */ + if (strncmp(backing_opts->path, "/dev/", 5) != 0) + return -EINVAL; + + backing_dev = backing_dev_alloc(cache_dev); + if (!backing_dev) + return -ENOMEM; + + ret = backing_dev_init(backing_dev, backing_opts); + if (ret) + goto destroy_backing_dev; + + return 0; + +destroy_backing_dev: + backing_dev_free(backing_dev); + + return ret; +} + +int backing_dev_stop(struct pcache_cache_dev *cache_dev, u32 backing_dev_id) +{ + struct pcache_backing_dev *backing_dev; + + backing_dev = cache_dev_fetch_backing(cache_dev, backing_dev_id); + if (!backing_dev) + return -ENOENT; + + backing_dev_destroy(backing_dev); + backing_dev_free(backing_dev); + + return 0; +} + +/* pcache_backing_dev_req functions */ +static void end_req(struct kref *ref) +{ + struct pcache_backing_dev_req *backing_req = container_of(ref, struct pcache_backing_dev_req, ref); + struct pcache_backing_dev *backing_dev = backing_req->backing_dev; + + spin_lock(&backing_dev->complete_lock); + list_move_tail(&backing_req->node, &backing_dev->complete_list); + spin_unlock(&backing_dev->complete_lock); + + queue_work(backing_dev->task_wq, &backing_dev->req_complete_work); +} + +static void backing_dev_bio_end(struct bio *bio) +{ + struct pcache_backing_dev_req *backing_req = bio->bi_private; + int ret = bio->bi_status; + + if (ret && !backing_req->ret) + backing_req->ret = ret; + + kref_put(&backing_req->ref, end_req); + bio_put(bio); +} + +static int map_bio_pages(struct bio *bio, struct request *req, u32 req_off, u32 len) +{ + struct bio_vec src_bvec; + struct bvec_iter src_iter; + size_t mapped = 0, offset = 0; + struct bio *src_bio; + + src_bio = req->bio; + +next_bio: + bio_for_each_segment(src_bvec, src_bio, src_iter) { + struct page *page = src_bvec.bv_page; + size_t page_off = src_bvec.bv_offset; + size_t page_len = src_bvec.bv_len; + + if (offset + page_len <= req_off) { + offset += page_len; + continue; + } + + size_t start = (req_off > offset) ? 
(req_off - offset) : 0; + size_t map_len = min(len - mapped, page_len - start); + + if (bio_add_page(bio, page, map_len, page_off + start) != map_len) { + pr_err("Failed to map page to bio\n"); + break; + } + + mapped += map_len; + if (mapped >= len) + goto out; + + offset += page_len; + } + + if (src_bio->bi_next) { + src_bio = src_bio->bi_next; + goto next_bio; + } +out: + return 0; +} + +struct pcache_backing_dev_req *backing_dev_req_create(struct pcache_backing_dev *backing_dev, struct pcache_request *pcache_req, + u32 off, u32 len, backing_req_end_fn_t end_req) +{ + struct pcache_backing_dev_req *backing_req; + u32 mapped_len = 0; + struct bio *bio; + + backing_req = kmem_cache_zalloc(backing_dev->backing_req_cache, GFP_ATOMIC); + if (!backing_req) + return NULL; + + backing_req->backing_dev = backing_dev; + INIT_LIST_HEAD(&backing_req->node); + kref_init(&backing_req->ref); + backing_req->end_req = end_req; + backing_req->bio_off = off; +next_bio: + bio = bio_alloc_bioset(backing_dev->bdev, + BIO_MAX_VECS, + req_op(pcache_req->req), + GFP_ATOMIC, &backing_dev->bioset); + if (!bio) + goto free_backing_req; + + bio->bi_iter.bi_sector = (pcache_req->off + off + mapped_len) >> SECTOR_SHIFT; + bio->bi_iter.bi_size = 0; + bio->bi_private = backing_req; + bio->bi_end_io = backing_dev_bio_end; + kref_get(&backing_req->ref); + + if (backing_req->bio) + bio->bi_next = backing_req->bio; + backing_req->bio = bio; + + map_bio_pages(bio, pcache_req->req, off + mapped_len, len - mapped_len); + mapped_len += bio->bi_iter.bi_size; + if (mapped_len < len) + goto next_bio; + + pcache_req_get(pcache_req); + backing_req->upper_req = pcache_req; + + return backing_req; + +free_backing_req: + while (backing_req->bio) { + bio = backing_req->bio; + backing_req->bio = bio->bi_next; + bio_put(bio); + } + kmem_cache_free(backing_dev->backing_req_cache, backing_req); + + return NULL; +} + +static void req_submit_fn(struct work_struct *work) +{ + struct pcache_backing_dev *backing_dev = container_of(work, struct pcache_backing_dev, req_submit_work); + struct pcache_backing_dev_req *backing_req; + unsigned long flags; + LIST_HEAD(tmp_list); + + spin_lock(&backing_dev->submit_lock); + list_splice_init(&backing_dev->submit_list, &tmp_list); + spin_unlock(&backing_dev->submit_lock); + + while (!list_empty(&tmp_list)) { + backing_req = list_first_entry(&tmp_list, + struct pcache_backing_dev_req, node); + list_del_init(&backing_req->node); + while (backing_req->bio) { + struct bio *bio = backing_req->bio; + + backing_req->bio = bio->bi_next; + submit_bio_noacct(bio); + } + + local_irq_save(flags); + kref_put(&backing_req->ref, end_req); + local_irq_restore(flags); + } +} + +static void req_complete_fn(struct work_struct *work) +{ + struct pcache_backing_dev *backing_dev = container_of(work, struct pcache_backing_dev, req_complete_work); + struct pcache_backing_dev_req *backing_req; + unsigned long flags; + LIST_HEAD(tmp_list); + + spin_lock_irqsave(&backing_dev->complete_lock, flags); + list_splice_init(&backing_dev->complete_list, &tmp_list); + spin_unlock_irqrestore(&backing_dev->complete_lock, flags); + + while (!list_empty(&tmp_list)) { + backing_req = list_first_entry(&tmp_list, + struct pcache_backing_dev_req, node); + list_del_init(&backing_req->node); + backing_dev_req_end(backing_req); + } +} + +void backing_dev_req_submit(struct pcache_backing_dev_req *backing_req) +{ + struct pcache_backing_dev *backing_dev = backing_req->backing_dev; + + spin_lock(&backing_dev->submit_lock); + 
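/*
+	 * Queued under submit_lock; req_submit_fn() splices the list from
+	 * task_wq and submits each chained bio with submit_bio_noacct().
+	 */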
list_add_tail(&backing_req->node, &backing_dev->submit_list); + spin_unlock(&backing_dev->submit_lock); + + queue_work(backing_dev->task_wq, &backing_dev->req_submit_work); +} + +void backing_dev_req_end(struct pcache_backing_dev_req *backing_req) +{ + struct pcache_backing_dev *backing_dev = backing_req->backing_dev; + + if (backing_req->end_req) + backing_req->end_req(backing_req, backing_req->ret); + + kmem_cache_free(backing_dev->backing_req_cache, backing_req); +} diff --git a/drivers/block/pcache/backing_dev.h b/drivers/block/pcache/backing_dev.h new file mode 100644 index 000000000000..e929dc821d37 --- /dev/null +++ b/drivers/block/pcache/backing_dev.h @@ -0,0 +1,105 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _BACKING_DEV_H +#define _BACKING_DEV_H + +#include + +#include "pcache_internal.h" +#include "cache_dev.h" + +#define backing_dev_err(backing_dev, fmt, ...) \ + cache_dev_err(backing_dev->cache_dev, "backing_dev%d: " fmt, \ + backing_dev->backing_dev_id, ##__VA_ARGS__) +#define backing_dev_info(backing_dev, fmt, ...) \ + cache_dev_info(backing_dev->cache_dev, "backing_dev%d: " fmt, \ + backing_dev->backing_dev_id, ##__VA_ARGS__) +#define backing_dev_debug(backing_dev, fmt, ...) \ + cache_dev_debug(backing_dev->cache_dev, "backing_dev%d: " fmt, \ + backing_dev->backing_dev_id, ##__VA_ARGS__) + +#define PCACHE_BACKING_STATE_NONE 0 +#define PCACHE_BACKING_STATE_RUNNING 1 + +struct pcache_cache_info; +struct pcache_backing_dev_info { + struct pcache_meta_header header; + u8 state; + u8 res; + + u16 res1; + + u32 backing_dev_id; + u64 dev_size; /* nr_sectors */ + + char path[PCACHE_PATH_LEN]; + struct pcache_cache_info cache_info; +}; + +struct pcache_backing_dev_req; +typedef void (*backing_req_end_fn_t)(struct pcache_backing_dev_req *backing_req, int ret); + +struct pcache_request; +struct pcache_backing_dev_req { + struct bio *bio; + struct pcache_backing_dev *backing_dev; + + void *priv_data; + backing_req_end_fn_t end_req; + + struct pcache_request *upper_req; + u32 bio_off; + struct list_head node; + struct kref ref; + int ret; +}; + +struct pcache_logic_dev; +struct pcache_backing_dev { + u32 backing_dev_id; + struct pcache_cache_dev *cache_dev; + spinlock_t lock; + struct list_head node; + struct device device; + + struct pcache_backing_dev_info backing_dev_info; + struct pcache_backing_dev_info *backing_dev_info_addr; + struct mutex info_lock; + + struct block_device *bdev; + struct file *bdev_file; + + struct workqueue_struct *task_wq; + + struct bio_set bioset; + struct kmem_cache *backing_req_cache; + struct list_head submit_list; + spinlock_t submit_lock; + struct work_struct req_submit_work; + + struct list_head complete_list; + spinlock_t complete_lock; + struct work_struct req_complete_work; + + struct pcache_logic_dev *logic_dev; + u64 dev_size; + + u32 cache_segs; + struct pcache_cache *cache; +}; + +struct pcache_backing_dev_opts { + char *path; + u32 queues; + u32 cache_segs; + bool data_crc; +}; + +int backing_dev_start(struct pcache_cache_dev *cache_dev, struct pcache_backing_dev_opts *backing_opts); +int backing_dev_stop(struct pcache_cache_dev *cache_dev, u32 backing_dev_id); +void backing_dev_info_write(struct pcache_backing_dev *backing_dev); + +void backing_dev_req_submit(struct pcache_backing_dev_req *backing_req); +void backing_dev_req_end(struct pcache_backing_dev_req *backing_req); +struct pcache_backing_dev_req *backing_dev_req_create(struct pcache_backing_dev *backing_dev, + struct pcache_request *pcache_req, u32 off, u32 len, 
backing_req_end_fn_t end_req); +#endif /* _BACKING_DEV_H */ From patchwork Mon Apr 14 01:45:05 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Dongsheng Yang X-Patchwork-Id: 14049565 Received: from out-183.mta1.migadu.com (out-183.mta1.migadu.com [95.215.58.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 931621B81C1 for ; Mon, 14 Apr 2025 01:46:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.183 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595170; cv=none; b=a0h7RaZ7pRRsvKhsyhYLyBf37w1eGFQJBVqME95GQEb7rYuPJUgv5drXsAyPPyMWo0hBYKmMc1iWf3FHvxr06RjS1+nRCVBSDhazox5a33hm06K1H8/d0IFD+6Do1CFNO5BTLdpTS7kAuB5e1T0iZ0pVNQF2MAVAwOxgnw8YtLw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744595170; c=relaxed/simple; bh=Ij+nQA+8RujoN7hiZEYeXEqqlyqHInUV/AL/W+TxWEE=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-Type; b=MFYOLY8fBwNNo2LDjx5bjoki+B8baWYvJi6bukUQ/qp46ASjYw9IumEMNZlgYvCuC1LWFDtpJp9iXoUybJPwVcwC+Cvg055cDLJF3P8WdNhW0HsqA1VGIzjRUWvq3rStZg6TcgMaRF8YeNpsesEUmpZb6/Fef8hPQWvnjCPiVZM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=CrSg4cpx; arc=none smtp.client-ip=95.215.58.183 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="CrSg4cpx" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1744595166; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=C+lfN/jDmhx15H28wr+ZCKCFpi48AsJfaRSyB2d11d8=; b=CrSg4cpxZSZYNp3SKC0g7z3kvxtNwWAKZ4IXaWkyVsxL5lAyyB/oOPPMb8QXm3LwC25IaO 8vGMpwW1mUD7/JTeBoo1hA43ON+W8hon9qzTaEDJiYpADcImlDoKPWIGBXRpFLaYJ4WWgB 7caFwl1Dpqv+7dP+fZH5P7DhMCIe9ts= From: Dongsheng Yang To: axboe@kernel.dk, hch@lst.de, dan.j.williams@intel.com, gregory.price@memverge.com, John@groves.net, Jonathan.Cameron@Huawei.com, bbhushan2@marvell.com, chaitanyak@nvidia.com, rdunlap@infradead.org Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org, linux-bcache@vger.kernel.org, nvdimm@lists.linux.dev, Dongsheng Yang Subject: [RFC PATCH 11/11] block: introduce pcache (persistent memory to be cache for block device) Date: Mon, 14 Apr 2025 01:45:05 +0000 Message-Id: <20250414014505.20477-12-dongsheng.yang@linux.dev> In-Reply-To: <20250414014505.20477-1-dongsheng.yang@linux.dev> References: <20250414014505.20477-1-dongsheng.yang@linux.dev> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT This patch introduces the initial integration of `pcache`, a Linux kernel block layer module that leverages persistent memory (PMem) as a high-performance caching layer for traditional block devices (e.g., SSDs, HDDs). - Persistent Memory as Cache: - `pcache` uses DAX-enabled persistent memory (e.g., `/dev/pmemX`) to provide fast, byte-addressable, non-volatile caching for block devices. - Supports both direct-mapped and vmap-based access depending on DAX capabilities. - Modular Architecture: - `cache_dev`: represents a persistent memory device used as a cache. - `backing_dev`: represents an individual block device being cached. - `logic_dev`: exposes a block device (`/dev/pcacheX`) to userspace, serving as the frontend interface for I/O. - `cache`: implements core caching logic (hit/miss, writeback, GC, etc.). Design Motivation: `pcache` is designed to bridge the performance gap between slow-but-large storage (HDDs, SATA/NVMe SSDs) and emerging byte-addressable persistent memory. Compared to traditional block layer caching, `pcache` is persistent, low-latency, highly concurrent, and more amenable to modern storage-class memory devices than legacy caching designs. This patch finalizes the series by wiring up the initialization entry point (`pcache_init()`), sysfs bus registration, root device handling, and Kconfig glue. With this, the `pcache` subsystem is ready to load as a kernel module and serve as a cache engine for block I/O. 
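For orientation, a hedged userspace sketch of driving the new bus attributes follows (not part of this patch). cache_dev_register_store() accepts a comma-separated option string built from the path=, format= and force= tokens parsed by parse_register_options(); the /sys/bus/pcache/ location and the /dev/pmem0 path below are assumptions based on the bus name registered here.

/* Hedged userspace sketch: register a pmem device as a pcache cache_dev. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char opts[] = "path=/dev/pmem0,format=1,force=0";
	int fd = open("/sys/bus/pcache/cache_dev_register", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, opts, strlen(opts)) != (ssize_t)strlen(opts)) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}
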
Signed-off-by: Dongsheng Yang --- MAINTAINERS | 8 ++ drivers/block/Kconfig | 2 + drivers/block/Makefile | 2 + drivers/block/pcache/Kconfig | 16 +++ drivers/block/pcache/Makefile | 4 + drivers/block/pcache/main.c | 194 ++++++++++++++++++++++++++++++++++ 6 files changed, 226 insertions(+) create mode 100644 drivers/block/pcache/Kconfig create mode 100644 drivers/block/pcache/Makefile create mode 100644 drivers/block/pcache/main.c diff --git a/MAINTAINERS b/MAINTAINERS index 00e94bec401e..5ee5879072b9 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18026,6 +18026,14 @@ S: Maintained F: drivers/leds/leds-pca9532.c F: include/linux/leds-pca9532.h +PCACHE (Pmem as cache for block device) +M: Dongsheng Yang +M: Zheng Gu +R: Linggang Zeng +L: linux-block@vger.kernel.org +S: Maintained +F: drivers/block/pcache/ + PCI DRIVER FOR AARDVARK (Marvell Armada 3700) M: Thomas Petazzoni M: Pali Rohár diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index a97f2c40c640..27731dbed7f6 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -192,6 +192,8 @@ config BLK_DEV_LOOP_MIN_COUNT source "drivers/block/drbd/Kconfig" +source "drivers/block/pcache/Kconfig" + config BLK_DEV_NBD tristate "Network block device support" depends on NET diff --git a/drivers/block/Makefile b/drivers/block/Makefile index 1105a2d4fdcb..40b96ccbd414 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -43,3 +43,5 @@ obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk/ obj-$(CONFIG_BLK_DEV_UBLK) += ublk_drv.o swim_mod-y := swim.o swim_asm.o + +obj-$(CONFIG_BLK_DEV_PCACHE) += pcache/ diff --git a/drivers/block/pcache/Kconfig b/drivers/block/pcache/Kconfig new file mode 100644 index 000000000000..2dc77354a4b1 --- /dev/null +++ b/drivers/block/pcache/Kconfig @@ -0,0 +1,16 @@ +config BLK_DEV_PCACHE + tristate "Persistent memory for cache of Block Device (Experimental)" + depends on DEV_DAX && FS_DAX + help + PCACHE provides a mechanism to use persistent memory (e.g., CXL persistent memory, + DAX-enabled devices) as a high-performance cache layer in front of + traditional block devices such as SSDs or HDDs. + + PCACHE is implemented as a kernel module that integrates with the block + layer and supports direct access (DAX) to persistent memory for low-latency, + byte-addressable caching. + + Note: This feature is experimental and should be tested thoroughly + before use in production environments. + + If unsure, say 'N'. 
diff --git a/drivers/block/pcache/Makefile b/drivers/block/pcache/Makefile new file mode 100644 index 000000000000..0e7316ae20e1 --- /dev/null +++ b/drivers/block/pcache/Makefile @@ -0,0 +1,4 @@ +pcache-y := main.o cache_dev.o backing_dev.o segment.o meta_segment.o logic_dev.o cache.o cache_segment.o cache_key.o cache_req.o cache_writeback.o cache_gc.o + +obj-$(CONFIG_BLK_DEV_PCACHE) += pcache.o + diff --git a/drivers/block/pcache/main.c b/drivers/block/pcache/main.c new file mode 100644 index 000000000000..d0430c64aff3 --- /dev/null +++ b/drivers/block/pcache/main.c @@ -0,0 +1,194 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright(C) 2025, Dongsheng Yang + */ + +#include +#include +#include +#include + +#include "pcache_internal.h" +#include "cache_dev.h" +#include "logic_dev.h" + +enum { + PCACHE_REG_OPT_ERR = 0, + PCACHE_REG_OPT_FORCE, + PCACHE_REG_OPT_FORMAT, + PCACHE_REG_OPT_PATH, +}; + +static const match_table_t register_opt_tokens = { + { PCACHE_REG_OPT_FORCE, "force=%u" }, + { PCACHE_REG_OPT_FORMAT, "format=%u" }, + { PCACHE_REG_OPT_PATH, "path=%s" }, + { PCACHE_REG_OPT_ERR, NULL } +}; + +static int parse_register_options(char *buf, + struct pcache_cache_dev_register_options *opts) +{ + substring_t args[MAX_OPT_ARGS]; + char *o, *p; + int token, ret = 0; + + o = buf; + + while ((p = strsep(&o, ",\n")) != NULL) { + if (!*p) + continue; + + token = match_token(p, register_opt_tokens, args); + switch (token) { + case PCACHE_REG_OPT_PATH: + if (match_strlcpy(opts->path, &args[0], + PCACHE_PATH_LEN) == 0) { + ret = -EINVAL; + break; + } + break; + case PCACHE_REG_OPT_FORCE: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->force = (token != 0); + break; + case PCACHE_REG_OPT_FORMAT: + if (match_uint(args, &token)) { + ret = -EINVAL; + goto out; + } + opts->format = (token != 0); + break; + default: + pr_err("unknown parameter or missing value '%s'\n", p); + ret = -EINVAL; + goto out; + } + } + +out: + return ret; +} + +static ssize_t cache_dev_unregister_store(const struct bus_type *bus, const char *ubuf, + size_t size) +{ + u32 cache_dev_id; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (sscanf(ubuf, "cache_dev_id=%u", &cache_dev_id) != 1) + return -EINVAL; + + ret = cache_dev_unregister(cache_dev_id); + if (ret < 0) + return ret; + + return size; +} + +static ssize_t cache_dev_register_store(const struct bus_type *bus, const char *ubuf, + size_t size) +{ + struct pcache_cache_dev_register_options opts = { 0 }; + char *buf; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + buf = kmemdup(ubuf, size + 1, GFP_KERNEL); + if (IS_ERR(buf)) { + pr_err("failed to dup buf for adm option: %d", (int)PTR_ERR(buf)); + return PTR_ERR(buf); + } + buf[size] = '\0'; + + ret = parse_register_options(buf, &opts); + if (ret < 0) { + kfree(buf); + return ret; + } + kfree(buf); + + ret = cache_dev_register(&opts); + if (ret < 0) + return ret; + + return size; +} + +static BUS_ATTR_WO(cache_dev_unregister); +static BUS_ATTR_WO(cache_dev_register); + +static struct attribute *pcache_bus_attrs[] = { + &bus_attr_cache_dev_unregister.attr, + &bus_attr_cache_dev_register.attr, + NULL, +}; + +static const struct attribute_group pcache_bus_group = { + .attrs = pcache_bus_attrs, +}; +__ATTRIBUTE_GROUPS(pcache_bus); + +const struct bus_type pcache_bus_type = { + .name = "pcache", + .bus_groups = pcache_bus_groups, +}; + +static void pcache_root_dev_release(struct device *dev) +{ +} + +struct device pcache_root_dev = { + .init_name = 
"pcache", + .release = pcache_root_dev_release, +}; + +static int __init pcache_init(void) +{ + int ret; + + ret = device_register(&pcache_root_dev); + if (ret < 0) { + put_device(&pcache_root_dev); + goto err; + } + + ret = bus_register(&pcache_bus_type); + if (ret < 0) + goto device_unregister; + + ret = pcache_blkdev_init(); + if (ret < 0) + goto bus_unregister; + + return 0; + +bus_unregister: + bus_unregister(&pcache_bus_type); +device_unregister: + device_unregister(&pcache_root_dev); +err: + + return ret; +} + +static void pcache_exit(void) +{ + pcache_blkdev_exit(); + bus_unregister(&pcache_bus_type); + device_unregister(&pcache_root_dev); +} + +MODULE_AUTHOR("Dongsheng Yang "); +MODULE_DESCRIPTION("PMem for Cache of block device"); +MODULE_LICENSE("GPL v2"); +module_init(pcache_init); +module_exit(pcache_exit);