[0/9] blk-mq: Reduce static requests memory footprint for shared sbitmap

Message ID	1626275195-215652-1-git-send-email-john.garry@huawei.com (mailing list archive)
Headers	show Return-Path: <linux-block-owner@kernel.org> From: John Garry <john.garry@huawei.com> To: <axboe@kernel.dk> CC: <linux-block@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <ming.lei@redhat.com>, <linux-scsi@vger.kernel.org>, <kashyap.desai@broadcom.com>, <hare@suse.de>, John Garry <john.garry@huawei.com> Subject: [PATCH 0/9] blk-mq: Reduce static requests memory footprint for shared sbitmap Date: Wed, 14 Jul 2021 23:06:26 +0800 Message-ID: <1626275195-215652-1-git-send-email-john.garry@huawei.com> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk
Series	blk-mq: Reduce static requests memory footprint for shared sbitmap \| expand [0/9] blk-mq: Reduce static requests memory footprint for shared sbitmap [1/9] blk-mq: Change rqs check in blk_mq_free_rqs() [2/9] block: Rename BLKDEV_MAX_RQ -> BLKDEV_DEFAULT_RQ [3/9] blk-mq: Relocate shared sbitmap resize in blk_mq_update_nr_requests() [4/9] blk-mq: Add blk_mq_tag_resize_sched_shared_sbitmap() [5/9] blk-mq: Invert check in blk_mq_update_nr_requests() [RFC,6/9] blk-mq: Refactor blk_mq_{alloc,free}_rqs [RFC,7/9] blk-mq: Allocate per tag set static rqs for shared sbitmap [8/9] blk-mq: Allocate per request queue static rqs for shared sbitmap [9/9] blk-mq: Clear mappings for shared sbitmap sched static rqs

Message ID

1626275195-215652-1-git-send-email-john.garry@huawei.com (mailing list archive)

Headers

From: John Garry <john.garry@huawei.com>
To: <axboe@kernel.dk>
CC: <linux-block@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
        <ming.lei@redhat.com>, <linux-scsi@vger.kernel.org>,
        <kashyap.desai@broadcom.com>, <hare@suse.de>,
        John Garry <john.garry@huawei.com>
Subject: [PATCH 0/9] blk-mq: Reduce static requests memory footprint for
 shared sbitmap
Date: Wed, 14 Jul 2021 23:06:26 +0800
Message-ID: <1626275195-215652-1-git-send-email-john.garry@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain
Precedence: bulk

Series

blk-mq: Reduce static requests memory footprint for shared sbitmap | expand

Message

John Garry July 14, 2021, 3:06 p.m. UTC

Currently a full set of static requests are allocated per hw queue per
tagset when shared sbitmap is used.

However, only tagset->queue_depth number of requests may be active at
any given time. As such, only tagset->queue_depth number of static
requests are required.

The same goes for using an IO scheduler, which allocates a full set of
static requests per hw queue per request queue.

This series very significantly reduces memory usage in both scenarios by
allocating static rqs per tagset and per request queue, respectively,
rather than per hw queue per tagset and per request queue.

For megaraid sas driver on my 128-CPU arm64 system with 1x SATA disk, we
save approx. 300MB(!) [370MB -> 60MB]

A couple of patches are marked as RFC, as maybe there is a better
implementation approach.

Any more testing would be appreciated also.

John Garry (9):
  blk-mq: Change rqs check in blk_mq_free_rqs()
  block: Rename BLKDEV_MAX_RQ -> BLKDEV_DEFAULT_RQ
  blk-mq: Relocate shared sbitmap resize in blk_mq_update_nr_requests()
  blk-mq: Add blk_mq_tag_resize_sched_shared_sbitmap()
  blk-mq: Invert check in blk_mq_update_nr_requests()
  blk-mq: Refactor blk_mq_{alloc,free}_rqs
  blk-mq: Allocate per tag set static rqs for shared sbitmap
  blk-mq: Allocate per request queue static rqs for shared sbitmap
  blk-mq: Clear mappings for shared sbitmap sched static rqs

 block/blk-core.c       |   2 +-
 block/blk-mq-sched.c   |  57 ++++++++++++--
 block/blk-mq-sched.h   |   2 +-
 block/blk-mq-tag.c     |  22 ++++--
 block/blk-mq-tag.h     |   1 +
 block/blk-mq.c         | 165 +++++++++++++++++++++++++++++++----------
 block/blk-mq.h         |   9 +++
 drivers/block/rbd.c    |   2 +-
 include/linux/blk-mq.h |   4 +
 include/linux/blkdev.h |   6 +-
 10 files changed, 215 insertions(+), 55 deletions(-)

Comments

Ming Lei July 20, 2021, 3:22 p.m. UTC | #1

On Wed, Jul 14, 2021 at 11:06:26PM +0800, John Garry wrote:
> Currently a full set of static requests are allocated per hw queue per
> tagset when shared sbitmap is used.
> 
> However, only tagset->queue_depth number of requests may be active at
> any given time. As such, only tagset->queue_depth number of static
> requests are required.
> 
> The same goes for using an IO scheduler, which allocates a full set of
> static requests per hw queue per request queue.
> 
> This series very significantly reduces memory usage in both scenarios by
> allocating static rqs per tagset and per request queue, respectively,
> rather than per hw queue per tagset and per request queue.
> 
> For megaraid sas driver on my 128-CPU arm64 system with 1x SATA disk, we
> save approx. 300MB(!) [370MB -> 60MB]
> 
> A couple of patches are marked as RFC, as maybe there is a better
> implementation approach.

There is another candidate for addressing this issue, and looks simpler:

 block/blk-mq-sched.c |  4 ++++
 block/blk-mq-tag.c   |  4 ++++
 block/blk-mq-tag.h   |  3 +++
 block/blk-mq.c       | 18 ++++++++++++++++++
 block/blk-mq.h       | 11 +++++++++++
 5 files changed, 40 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c838d81ac058..b9236ee0fe4e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -538,6 +538,10 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	if (!hctx->sched_tags)
 		return -ENOMEM;
 
+	blk_mq_set_master_tags(hctx->sched_tags,
+			q->queue_hw_ctx[0]->sched_tags, hctx->flags,
+			hctx_idx);
+
 	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
 	if (ret)
 		blk_mq_sched_free_tags(set, hctx, hctx_idx);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 86f87346232a..c471a073234d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -608,6 +608,10 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 				tags->nr_reserved_tags, set->flags);
 		if (!new)
 			return -ENOMEM;
+
+		blk_mq_set_master_tags(new,
+			hctx->queue->queue_hw_ctx[0]->sched_tags, set->flags,
+			hctx->queue_num);
 		ret = blk_mq_alloc_rqs(set, new, hctx->queue_num, tdepth);
 		if (ret) {
 			blk_mq_free_rq_map(new, set->flags);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 8ed55af08427..0a3fbbc61e06 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -21,6 +21,9 @@ struct blk_mq_tags {
 	struct request **static_rqs;
 	struct list_head page_list;
 
+	/* only used for blk_mq_is_sbitmap_shared() */
+	struct blk_mq_tags	*master;
+
 	/*
 	 * used to clear request reference in rqs[] before freeing one
 	 * request pool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2c4ac51e54eb..ef8a6a7e5f7c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2348,6 +2348,15 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 {
 	struct page *page;
 
+	if (blk_mq_is_sbitmap_shared(set->flags)) {
+		if (tags->master)
+			tags = tags->master;
+		if (hctx_idx < set->nr_hw_queues - 1) {
+			blk_mq_clear_rq_mapping(set, tags, hctx_idx);
+			return;
+		}
+	}
+
 	if (tags->rqs && set->ops->exit_request) {
 		int i;
 
@@ -2444,6 +2453,12 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	size_t rq_size, left;
 	int node;
 
+	if (blk_mq_is_sbitmap_shared(set->flags) && tags->master) {
+		memcpy(tags->static_rqs, tags->master->static_rqs,
+		       sizeof(tags->static_rqs[0]) * tags->nr_tags);
+		return 0;
+	}
+
 	node = blk_mq_hw_queue_to_node(&set->map[HCTX_TYPE_DEFAULT], hctx_idx);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
@@ -2860,6 +2875,9 @@ static bool __blk_mq_alloc_map_and_request(struct blk_mq_tag_set *set,
 	if (!set->tags[hctx_idx])
 		return false;
 
+	blk_mq_set_master_tags(set->tags[hctx_idx], set->tags[0], flags,
+			       hctx_idx);
+
 	ret = blk_mq_alloc_rqs(set, set->tags[hctx_idx], hctx_idx,
 				set->queue_depth);
 	if (!ret)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d08779f77a26..a08b89be6acc 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -354,5 +354,16 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 	return __blk_mq_active_requests(hctx) < depth;
 }
 
+static inline void blk_mq_set_master_tags(struct blk_mq_tags *tags,
+		struct blk_mq_tags *master_tags, unsigned int flags,
+		unsigned int hctx_idx)
+{
+	if (blk_mq_is_sbitmap_shared(flags)) {
+		if (hctx_idx)
+			tags->master = master_tags;
+		else
+			tags->master = NULL;
+	}
+}
 
 #endif


Thanks,
Ming

John Garry July 20, 2021, 5:05 p.m. UTC | #2

On 20/07/2021 16:22, Ming Lei wrote:
> On Wed, Jul 14, 2021 at 11:06:26PM +0800, John Garry wrote:
>> Currently a full set of static requests are allocated per hw queue per
>> tagset when shared sbitmap is used.
>>
>> However, only tagset->queue_depth number of requests may be active at
>> any given time. As such, only tagset->queue_depth number of static
>> requests are required.
>>
>> The same goes for using an IO scheduler, which allocates a full set of
>> static requests per hw queue per request queue.
>>
>> This series very significantly reduces memory usage in both scenarios by
>> allocating static rqs per tagset and per request queue, respectively,
>> rather than per hw queue per tagset and per request queue.
>>
>> For megaraid sas driver on my 128-CPU arm64 system with 1x SATA disk, we
>> save approx. 300MB(!) [370MB -> 60MB]
>>
>> A couple of patches are marked as RFC, as maybe there is a better
>> implementation approach.

Hi Ming,

To be clear, my concerns with my implementation were:
a. Interface of __blk_mq_{alloc,free}_rqs was a bit strange in passing a 
struct list_head * and struct request ** . But maybe it's ok.
b. We need to ensure that the host should not expect a set of static rqs 
per hw queue, i.e. should not use blk_mq_ops.init_request hctx_idx 
argument. The guard I have added is effectively useless against that. I 
am thinking that we should have a variant of blk_mq_ops.init_request 
without a hctx_idx argument, which needs to be used for shared sbitmap.

> 
> There is another candidate for addressing this issue, and looks simpler:

Thanks, I think that this solution does not deal with b., above, either.

> 
>   block/blk-mq-sched.c |  4 ++++
>   block/blk-mq-tag.c   |  4 ++++
>   block/blk-mq-tag.h   |  3 +++
>   block/blk-mq.c       | 18 ++++++++++++++++++
>   block/blk-mq.h       | 11 +++++++++++
>   5 files changed, 40 insertions(+)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index c838d81ac058..b9236ee0fe4e 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -538,6 +538,10 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
>   	if (!hctx->sched_tags)
>   		return -ENOMEM;
>   
> +	blk_mq_set_master_tags(hctx->sched_tags,
> +			q->queue_hw_ctx[0]->sched_tags, hctx->flags,
> +			hctx_idx);
> +
>   	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
>   	if (ret)
>   		blk_mq_sched_free_tags(set, hctx, hctx_idx);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 86f87346232a..c471a073234d 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -608,6 +608,10 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>   				tags->nr_reserved_tags, set->flags);
>   		if (!new)
>   			return -ENOMEM;
> +
> +		blk_mq_set_master_tags(new,
> +			hctx->queue->queue_hw_ctx[0]->sched_tags, set->flags,
> +			hctx->queue_num);
>   		ret = blk_mq_alloc_rqs(set, new, hctx->queue_num, tdepth);
>   		if (ret) {
>   			blk_mq_free_rq_map(new, set->flags);
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 8ed55af08427..0a3fbbc61e06 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -21,6 +21,9 @@ struct blk_mq_tags {
>   	struct request **static_rqs;
>   	struct list_head page_list;
>   
> +	/* only used for blk_mq_is_sbitmap_shared() */
> +	struct blk_mq_tags	*master;

It is a bit odd to have a pointer to struct blk_mq_tags in struct 
blk_mq_tags like this

> +
>   	/*
>   	 * used to clear request reference in rqs[] before freeing one
>   	 * request pool
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2c4ac51e54eb..ef8a6a7e5f7c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2348,6 +2348,15 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   {
>   	struct page *page;
>   
> +	if (blk_mq_is_sbitmap_shared(set->flags)) {
> +		if (tags->master)
> +			tags = tags->master;
> +		if (hctx_idx < set->nr_hw_queues - 1) {
> +			blk_mq_clear_rq_mapping(set, tags, hctx_idx);
> +			return;
> +		}
> +	}
> +
>   	if (tags->rqs && set->ops->exit_request) {
>   		int i;
>   
> @@ -2444,6 +2453,12 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   	size_t rq_size, left;
>   	int node;
>   
> +	if (blk_mq_is_sbitmap_shared(set->flags) && tags->master) {
> +		memcpy(tags->static_rqs, tags->master->static_rqs,
> +		       sizeof(tags->static_rqs[0]) * tags->nr_tags);
> +		return 0;
> +	}
> +
>   	node = blk_mq_hw_queue_to_node(&set->map[HCTX_TYPE_DEFAULT], hctx_idx);
>   	if (node == NUMA_NO_NODE)
>   		node = set->numa_node;
> @@ -2860,6 +2875,9 @@ static bool __blk_mq_alloc_map_and_request(struct blk_mq_tag_set *set,
>   	if (!set->tags[hctx_idx])
>   		return false;
>   
> +	blk_mq_set_master_tags(set->tags[hctx_idx], set->tags[0], flags,
> +			       hctx_idx);
> +
>   	ret = blk_mq_alloc_rqs(set, set->tags[hctx_idx], hctx_idx,
>   				set->queue_depth);
>   	if (!ret)
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index d08779f77a26..a08b89be6acc 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -354,5 +354,16 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>   	return __blk_mq_active_requests(hctx) < depth;
>   }
>   
> +static inline void blk_mq_set_master_tags(struct blk_mq_tags *tags,
> +		struct blk_mq_tags *master_tags, unsigned int flags,
> +		unsigned int hctx_idx)
> +{
> +	if (blk_mq_is_sbitmap_shared(flags)) {
> +		if (hctx_idx)
> +			tags->master = master_tags;
> +		else
> +			tags->master = NULL;
> +	}
> +}
>   
>   #endif
> 

I'll check this solution a bit further.

Thanks,
John