From patchwork Wed Apr 3 14:17:55 2024
X-Patchwork-Submitter: Hannes Reinecke
X-Patchwork-Id: 13616242
From: Hannes Reinecke <hare@kernel.org>
To: Christoph Hellwig
Cc: Keith Busch, Sagi Grimberg, Jens Axboe, linux-nvme@lists.infradead.org,
    linux-block@vger.kernel.org, Hannes Reinecke
Subject: [PATCH 1/2] block: track per-node I/O latency
Date: Wed, 3 Apr 2024 16:17:55 +0200
Message-Id: <20240403141756.88233-2-hare@kernel.org>
In-Reply-To: <20240403141756.88233-1-hare@kernel.org>
References: <20240403141756.88233-1-hare@kernel.org>

Add a new option 'BLK_NODE_LATENCY' to track per-node I/O latency.
This can be used by I/O schedulers to determine the 'best' queue
to send I/O to.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 block/Kconfig          |   6 +
 block/Makefile         |   1 +
 block/blk-mq-debugfs.c |   2 +
 block/blk-nlatency.c   | 388 +++++++++++++++++++++++++++++++++++++++++
 block/blk-rq-qos.h     |   6 +
 include/linux/blk-mq.h |  11 ++
 6 files changed, 414 insertions(+)
 create mode 100644 block/blk-nlatency.c

diff --git a/block/Kconfig b/block/Kconfig
index 1de4682d48cc..f8cef096a876 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -186,6 +186,12 @@ config BLK_CGROUP_IOPRIO
 	  scheduler and block devices process requests. Only some I/O schedulers
 	  and some block devices support I/O priorities.
 
+config BLK_NODE_LATENCY
+	bool "Track per-node I/O latency"
+	help
+	  Enable per-node I/O latency tracking. This can be used by I/O
+	  schedulers to determine the node with the least latency.
+
 config BLK_DEBUG_FS
 	bool "Block layer debugging information in debugfs"
 	default y

diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..9d2e71a3e36f 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
+obj-$(CONFIG_BLK_NODE_LATENCY)	+= blk-nlatency.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 94668e72ab09..cb38228b95d8 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -762,6 +762,8 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
 		return "latency";
 	case RQ_QOS_COST:
 		return "cost";
+	case RQ_QOS_NLAT:
+		return "node-latency";
 	}
 	return "unknown";
 }

diff --git a/block/blk-nlatency.c b/block/blk-nlatency.c
new file mode 100644
index 000000000000..037f5c64bbbf
--- /dev/null
+++ b/block/blk-nlatency.c
@@ -0,0 +1,388 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per-node request latency tracking.
+ *
+ * Copyright (C) 2023 Hannes Reinecke
+ *
+ * A simple per-node latency tracker for use by I/O schedulers.
+ * Latencies are measured over 'win_usec' microseconds and stored per node.
+ * If the number of measurements falls below 'lowat' the measurement is
+ * assumed to be unreliable and will become 'stale'.
+ * These 'stale' latencies can be 'decayed', where during each measurement
+ * interval the 'stale' latency value is decreased by 'decay' percent.
+ * Once the 'stale' latency reaches zero it will be updated by the
+ * measured latency.
+ */
+#include <linux/kernel.h>
+#include <linux/blk-mq.h>
+#include <linux/module.h>
+
+#include "blk-stat.h"
+#include "blk-rq-qos.h"
+#include "blk.h"
+
+#define NLAT_DEFAULT_LOWAT	2
+#define NLAT_DEFAULT_DECAY	50
+
+struct rq_nlat {
+	struct rq_qos rqos;
+
+	u64 win_usec;		/* latency measurement window in microseconds */
+	unsigned int lowat;	/* low watermark below which latency measurement is deemed unreliable */
+	unsigned int decay;	/* percentage for 'decaying' stale latencies */
+	bool enabled;
+
+	struct blk_stat_callback *cb;
+
+	unsigned int num;
+	u64 *latency;
+	unsigned int *samples;
+};
+
+static inline struct rq_nlat *RQNLAT(struct rq_qos *rqos)
+{
+	return container_of(rqos, struct rq_nlat, rqos);
+}
+
+static u64 nlat_default_latency_usec(struct request_queue *q)
+{
+	/*
+	 * We default to 2 msec for non-rotational storage, and 75 msec
+	 * for rotational storage.
+	 */
+	if (blk_queue_nonrot(q))
+		return 2000ULL;
+	else
+		return 75000ULL;
+}
+
+static void nlat_timer_fn(struct blk_stat_callback *cb)
+{
+	struct rq_nlat *nlat = cb->data;
+	int n;
+
+	for (n = 0; n < cb->buckets; n++) {
+		if (cb->stat[n].nr_samples < nlat->lowat) {
+			/*
+			 * 'Decay' the latency by the specified
+			 * percentage to ensure the queues are
+			 * being tested to balance out temporary
+			 * latency spikes.
+			 */
+			nlat->latency[n] =
+				div64_u64(nlat->latency[n] * nlat->decay, 100);
+		} else
+			nlat->latency[n] = cb->stat[n].mean;
+		nlat->samples[n] = cb->stat[n].nr_samples;
+	}
+	if (nlat->enabled)
+		blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+}
+
+static int nlat_bucket_node(const struct request *rq)
+{
+	if (!rq->mq_ctx)
+		return -1;
+	return cpu_to_node(blk_mq_rq_cpu((struct request *)rq));
+}
+
+static void nlat_exit(struct rq_qos *rqos)
+{
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	blk_stat_remove_callback(nlat->rqos.disk->queue, nlat->cb);
+	blk_stat_free_callback(nlat->cb);
+	kfree(nlat->samples);
+	kfree(nlat->latency);
+	kfree(nlat);
+}
+
+#ifdef CONFIG_BLK_DEBUG_FS
+static int nlat_win_usec_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	seq_printf(m, "%llu\n", nlat->win_usec);
+	return 0;
+}
+
+static ssize_t nlat_win_usec_write(void *data, const char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	char val[16] = { };
+	u64 usec;
+	int err;
+
+	if (blk_queue_dying(nlat->rqos.disk->queue))
+		return -ENOENT;
+
+	if (count >= sizeof(val))
+		return -EINVAL;
+
+	if (copy_from_user(val, buf, count))
+		return -EFAULT;
+
+	err = kstrtoull(val, 10, &usec);
+	if (err)
+		return err;
+	blk_stat_deactivate(nlat->cb);
+	nlat->win_usec = usec;
+	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+	return count;
+}
+
+static int nlat_lowat_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	seq_printf(m, "%u\n", nlat->lowat);
+	return 0;
+}
+
+static ssize_t nlat_lowat_write(void *data, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	char val[16] = { };
+	unsigned int lowat;
+	int err;
+
+	if (blk_queue_dying(nlat->rqos.disk->queue))
+		return -ENOENT;
+
+	if (count >= sizeof(val))
+		return -EINVAL;
+
+	if (copy_from_user(val, buf, count))
+		return -EFAULT;
+
+	err = kstrtouint(val, 10, &lowat);
+	if (err)
+		return err;
+	blk_stat_deactivate(nlat->cb);
+	nlat->lowat = lowat;
+	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+	return count;
+}
+
+static int nlat_decay_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	seq_printf(m, "%u\n", nlat->decay);
+	return 0;
+}
+
+static ssize_t nlat_decay_write(void *data, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	char val[16] = { };
+	unsigned int decay;
+	int err;
+
+	if (blk_queue_dying(nlat->rqos.disk->queue))
+		return -ENOENT;
+
+	if (count >= sizeof(val))
+		return -EINVAL;
+
+	if (copy_from_user(val, buf, count))
+		return -EFAULT;
+
+	err = kstrtouint(val, 10, &decay);
+	if (err)
+		return err;
+	if (decay > 100)
+		return -EINVAL;
+	blk_stat_deactivate(nlat->cb);
+	nlat->decay = decay;
+	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+	return count;
+}
+
+static int nlat_enabled_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	seq_printf(m, "%d\n", nlat->enabled);
+	return 0;
+}
+
+static int nlat_id_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+
+	seq_printf(m, "%u\n", rqos->id);
+	return 0;
+}
+
+static int nlat_latency_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	int n;
+
+	if (!nlat->enabled)
+		return 0;
+
+	for (n = 0; n < nlat->num; n++) {
+		if (n > 0)
+			seq_puts(m, " ");
+		seq_printf(m, "%llu", nlat->latency[n]);
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static int nlat_samples_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	int n;
+
+	if (!nlat->enabled)
+		return 0;
+
+	for (n = 0; n < nlat->num; n++) {
+		if (n > 0)
+			seq_puts(m, " ");
+		seq_printf(m, "%u", nlat->samples[n]);
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static const struct blk_mq_debugfs_attr nlat_debugfs_attrs[] = {
+	{"win_usec", 0600, nlat_win_usec_show, nlat_win_usec_write},
+	{"lowat", 0600, nlat_lowat_show, nlat_lowat_write},
+	{"decay", 0600, nlat_decay_show, nlat_decay_write},
+	{"enabled", 0400, nlat_enabled_show},
+	{"id", 0400, nlat_id_show},
+	{"latency", 0400, nlat_latency_show},
+	{"samples", 0400, nlat_samples_show},
+	{},
+};
+#endif
+
+static const struct rq_qos_ops nlat_rqos_ops = {
+	.exit = nlat_exit,
+#ifdef CONFIG_BLK_DEBUG_FS
+	.debugfs_attrs = nlat_debugfs_attrs,
+#endif
+};
+
+u64 blk_nlat_latency(struct gendisk *disk, int node)
+{
+	struct rq_qos *rqos;
+	struct rq_nlat *nlat;
+
+	rqos = nlat_rq_qos(disk->queue);
+	if (!rqos)
+		return 0;
+	nlat = RQNLAT(rqos);
+	if (node >= nlat->num)
+		return 0;
+
+	return div64_u64(nlat->latency[node], 1000);
+}
+EXPORT_SYMBOL_GPL(blk_nlat_latency);
+
+int blk_nlat_enable(struct gendisk *disk)
+{
+	struct rq_qos *rqos;
+	struct rq_nlat *nlat;
+
+	/* Latency tracking not enabled? */
+	rqos = nlat_rq_qos(disk->queue);
+	if (!rqos)
+		return -EINVAL;
+	nlat = RQNLAT(rqos);
+	if (nlat->enabled)
+		return 0;
+
+	/* Queue not registered? Maybe shutting down... */
+	if (!blk_queue_registered(disk->queue))
+		return -EAGAIN;
+
+	nlat->enabled = true;
+	memset(nlat->latency, 0, sizeof(u64) * nlat->num);
+	memset(nlat->samples, 0, sizeof(unsigned int) * nlat->num);
+	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(blk_nlat_enable);
+
+void blk_nlat_disable(struct gendisk *disk)
+{
+	struct rq_qos *rqos = nlat_rq_qos(disk->queue);
+	struct rq_nlat *nlat;
+
+	if (!rqos)
+		return;
+	nlat = RQNLAT(rqos);
+	if (nlat->enabled) {
+		blk_stat_deactivate(nlat->cb);
+		nlat->enabled = false;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_nlat_disable);
+
+int blk_nlat_init(struct gendisk *disk)
+{
+	struct rq_nlat *nlat;
+	int ret = -ENOMEM;
+
+	nlat = kzalloc(sizeof(*nlat), GFP_KERNEL);
+	if (!nlat)
+		return -ENOMEM;
+
+	nlat->num = num_possible_nodes();
+	nlat->lowat = NLAT_DEFAULT_LOWAT;
+	nlat->decay = NLAT_DEFAULT_DECAY;
+	nlat->win_usec = nlat_default_latency_usec(disk->queue);
+
+	nlat->latency = kzalloc(sizeof(u64) * nlat->num, GFP_KERNEL);
+	if (!nlat->latency)
+		goto err_free;
+	nlat->samples = kzalloc(sizeof(unsigned int) * nlat->num, GFP_KERNEL);
+	if (!nlat->samples)
+		goto err_free;
+	nlat->cb = blk_stat_alloc_callback(nlat_timer_fn, nlat_bucket_node,
+					   nlat->num, nlat);
+	if (!nlat->cb)
+		goto err_free;
+
+	/*
+	 * Add the rqos and the stats callback.
+	 */
+	mutex_lock(&disk->queue->rq_qos_mutex);
+	ret = rq_qos_add(&nlat->rqos, disk, RQ_QOS_NLAT, &nlat_rqos_ops);
+	mutex_unlock(&disk->queue->rq_qos_mutex);
+	if (ret)
+		goto err_free_cb;
+
+	blk_stat_add_callback(disk->queue, nlat->cb);
+
+	return 0;
+
+err_free_cb:
+	blk_stat_free_callback(nlat->cb);
+err_free:
+	kfree(nlat->samples);
+	kfree(nlat->latency);
+	kfree(nlat);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blk_nlat_init);

diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 37245c97ee61..2fc11ced0c00 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -17,6 +17,7 @@ enum rq_qos_id {
 	RQ_QOS_WBT,
 	RQ_QOS_LATENCY,
 	RQ_QOS_COST,
+	RQ_QOS_NLAT,
 };
 
 struct rq_wait {
@@ -79,6 +80,11 @@ static inline struct rq_qos *iolat_rq_qos(struct request_queue *q)
 	return rq_qos_id(q, RQ_QOS_LATENCY);
 }
 
+static inline struct rq_qos *nlat_rq_qos(struct request_queue *q)
+{
+	return rq_qos_id(q, RQ_QOS_NLAT);
+}
+
 static inline void rq_wait_init(struct rq_wait *rq_wait)
 {
 	atomic_set(&rq_wait->inflight, 0);

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 390d35fa0032..4d88bec43316 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1229,4 +1229,15 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 }
 #endif /* CONFIG_BLK_DEV_ZONED */
 
+#ifdef CONFIG_BLK_NODE_LATENCY
+int blk_nlat_enable(struct gendisk *disk);
+void blk_nlat_disable(struct gendisk *disk);
+u64 blk_nlat_latency(struct gendisk *disk, int node);
+int blk_nlat_init(struct gendisk *disk);
+#else
+static inline int blk_nlat_enable(struct gendisk *disk) { return 0; }
+static inline void blk_nlat_disable(struct gendisk *disk) {}
+static inline u64 blk_nlat_latency(struct gendisk *disk, int node) { return 0; }
+static inline int blk_nlat_init(struct gendisk *disk) { return -ENOTSUPP; }
+#endif
 #endif /* BLK_MQ_H */

From patchwork Wed Apr 3 14:17:56 2024
X-Patchwork-Submitter: Hannes Reinecke
X-Patchwork-Id: 13616243
From: Hannes Reinecke <hare@kernel.org>
To: Christoph Hellwig
Cc: Keith Busch, Sagi Grimberg, Jens Axboe, linux-nvme@lists.infradead.org,
    linux-block@vger.kernel.org, Hannes Reinecke
Subject: [PATCH 2/2] nvme: add 'latency' iopolicy
Date: Wed, 3 Apr 2024 16:17:56 +0200
Message-Id: <20240403141756.88233-3-hare@kernel.org>
In-Reply-To: <20240403141756.88233-1-hare@kernel.org>
References: <20240403141756.88233-1-hare@kernel.org>

Add a latency-based I/O policy for multipathing. It uses the
blk-nlatency latency tracker to provide latencies for each node, and
schedules I/O on the path with the least latency for the submitting
node.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/multipath.c | 57 ++++++++++++++++++++++++++++++-----
 drivers/nvme/host/nvme.h      |  1 +
 2 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5397fb428b24..18e7fe45c2c1 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
 static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
+	[NVME_IOPOLICY_LAT]	= "latency",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +30,10 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_NUMA;
 	else if (!strncmp(val, "round-robin", 11))
 		iopolicy = NVME_IOPOLICY_RR;
+#ifdef CONFIG_BLK_NODE_LATENCY
+	else if (!strncmp(val, "latency", 7))
+		iopolicy = NVME_IOPOLICY_LAT;
+#endif
 	else
 		return -EINVAL;
 
@@ -40,6 +45,28 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
 	return sprintf(buf, "%s\n", nvme_iopolicy_names[iopolicy]);
 }
 
+static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
+{
+	struct nvme_ns_head *h;
+	struct nvme_ns *ns;
+	bool enable = iopolicy == NVME_IOPOLICY_LAT;
+	int ret = 0;
+
+	mutex_lock(&subsys->lock);
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		list_for_each_entry_rcu(ns, &h->list, siblings) {
+			if (enable) {
+				ret = blk_nlat_enable(ns->disk);
+				if (ret)
+					break;
+			} else
+				blk_nlat_disable(ns->disk);
+		}
+	}
+	mutex_unlock(&subsys->lock);
+	return ret;
+}
+
 module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
 	&iopolicy, 0644);
 MODULE_PARM_DESC(iopolicy,
@@ -242,13 +269,16 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 {
 	int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
 	struct nvme_ns *found = NULL, *fallback = NULL, *ns;
+	int iopolicy = READ_ONCE(head->subsys->iopolicy);
 
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		if (nvme_path_is_disabled(ns))
 			continue;
 
-		if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
+		if (iopolicy == NVME_IOPOLICY_NUMA)
 			distance = node_distance(node, ns->ctrl->numa_node);
+		else if (iopolicy == NVME_IOPOLICY_LAT)
+			distance = blk_nlat_latency(ns->disk, node);
 		else
 			distance = LOCAL_DISTANCE;
 
@@ -339,15 +369,17 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 {
 	int node = numa_node_id();
+	int iopolicy = READ_ONCE(head->subsys->iopolicy);
 	struct nvme_ns *ns;
 
 	ns = srcu_dereference(head->current_path[node], &head->srcu);
 	if (unlikely(!ns))
 		return __nvme_find_path(head, node);
 
-	if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
+	if (iopolicy == NVME_IOPOLICY_RR)
 		return nvme_round_robin_path(head, node, ns);
-	if (unlikely(!nvme_path_is_optimized(ns)))
+	if (iopolicy == NVME_IOPOLICY_LAT ||
+	    unlikely(!nvme_path_is_optimized(ns)))
 		return __nvme_find_path(head, node);
 	return ns;
 }
@@ -803,15 +835,18 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
 {
 	struct nvme_subsystem *subsys =
 		container_of(dev, struct nvme_subsystem, dev);
-	int i;
+	int i, ret;
 
 	for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
 		if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
-			WRITE_ONCE(subsys->iopolicy, i);
-			return count;
+			ret = nvme_activate_iopolicy(subsys, i);
+			if (!ret) {
+				WRITE_ONCE(subsys->iopolicy, i);
+				return count;
+			}
+			return ret;
 		}
 	}
-
 	return -EINVAL;
 }
 SUBSYS_ATTR_RW(iopolicy, S_IRUGO | S_IWUSR,
@@ -847,6 +882,14 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
 
 void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
 {
+	if (!blk_nlat_init(ns->disk) &&
+	    READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_LAT) {
+		int ret = blk_nlat_enable(ns->disk);
+
+		if (unlikely(ret))
+			pr_warn("%s: Failed to enable latency tracking, error %d\n",
+				ns->disk->disk_name, ret);
+	}
+
 	if (nvme_ctrl_use_ana(ns->ctrl)) {
 		struct nvme_ana_group_desc desc = {
 			.grpid = anagrpid,

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 27397f8404d6..b07afb1aa5bb 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -402,6 +402,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
 enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
+	NVME_IOPOLICY_LAT,
 };
 
 struct nvme_subsystem {

From patchwork Thu May 9 20:43:24 2024
X-Patchwork-Submitter: John Meneghini
X-Patchwork-Id: 13660403
From: John Meneghini <jmeneghi@redhat.com>
To: tj@kernel.org, josef@toxicpanda.com, axboe@kernel.dk, kbusch@kernel.org,
    hch@lst.de, sagi@grimberg.me, emilne@redhat.com, hare@kernel.org
Cc: linux-block@vger.kernel.org, cgroups@vger.kernel.org,
    linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
    jmeneghi@redhat.com, jrani@purestorage.com, randyj@purestorage.com,
    aviv.coro@ibm.com
Subject: [PATCH v3 3/3] nvme: multipath: pr_notice when iopolicy changes
Date: Thu, 9 May 2024 16:43:24 -0400
Message-Id: <20240509204324.832846-4-jmeneghi@redhat.com>
In-Reply-To: <20240403141756.88233-1-hare@kernel.org>
References: <20240403141756.88233-1-hare@kernel.org>

Send a pr_notice whenever the iopolicy on a subsystem is changed. This
is important for support reasons: it is fully expected that users will
change the iopolicy with active I/O in progress.
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
---
 drivers/nvme/host/multipath.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index e9330bb1990b..0286e44a081f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -67,6 +67,10 @@ static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
 		}
 	}
 	mutex_unlock(&subsys->lock);
+
+	pr_notice("%s: %s enable %d status %d for subsysnqn %s\n", __func__,
+		  nvme_iopolicy_names[iopolicy], enable, ret, subsys->subnqn);
+
 	return ret;
 }
 
@@ -890,6 +894,8 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
 {
 	struct nvme_ctrl *ctrl;
+	int old_iopolicy = READ_ONCE(subsys->iopolicy);
+
 	WRITE_ONCE(subsys->iopolicy, iopolicy);
 
 	mutex_lock(&nvme_subsystems_lock);
@@ -898,6 +904,10 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
 		nvme_mpath_clear_ctrl_paths(ctrl);
 	}
 	mutex_unlock(&nvme_subsystems_lock);
+
+	pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
+		  nvme_iopolicy_names[old_iopolicy],
+		  nvme_iopolicy_names[iopolicy], subsys->subnqn);
 }
 
 static ssize_t nvme_subsys_iopolicy_store(struct device *dev,