[1/3] blk-mq: add an API to estimate hardware queue node

Message ID 68fed570910230ce847f8f3b685eeea399640a7f.1458941500.git.shli@fb.com (mailing list archive)
State New, archived
Headers show

Commit Message

Shaohua Li March 25, 2016, 9:36 p.m. UTC
We allocate most data structures on the device's node, but some data
structures are not used for DMA and are mostly accessed by specific
CPUs/nodes, which can differ from the device's node. Allocating such
hot data on the device's node doesn't make sense. Add an API to
estimate a hardware queue's node. This can be used before blk-mq
actually establishes the mapping. The API is slow, but it is only
used at initialization time.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-mq.c         | 21 +++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 2 files changed, 23 insertions(+)

Comments

Christoph Hellwig March 29, 2016, 7:24 a.m. UTC | #1
On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
> We allocate most data structures on the device's node, but some data
> structures are not used for DMA and are mostly accessed by specific
> CPUs/nodes, which can differ from the device's node. Allocating such
> hot data on the device's node doesn't make sense. Add an API to
> estimate a hardware queue's node. This can be used before blk-mq
> actually establishes the mapping. The API is slow, but it is only
> used at initialization time.

I think this is the wrong way around.  I've got some prototype code
that just leaves the CPU assignments to the drivers and picks it up
in blk-mq.  Give me a few days to post it.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shaohua Li March 29, 2016, 4:47 p.m. UTC | #2
On Tue, Mar 29, 2016 at 12:24:43AM -0700, Christoph Hellwig wrote:
> On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
> > We allocate most data structures on the device's node, but some data
> > structures are not used for DMA and are mostly accessed by specific
> > CPUs/nodes, which can differ from the device's node. Allocating such
> > hot data on the device's node doesn't make sense. Add an API to
> > estimate a hardware queue's node. This can be used before blk-mq
> > actually establishes the mapping. The API is slow, but it is only
> > used at initialization time.
> 
> I think this is the wrong way around.  I've got some prototype code
> that just leaves the CPU assignments to the drivers and picks it up
> in blk-mq.  Give me a few days to post it.

This looks weird, shouldn't the cpu assignment be determined by block
core (blk-mq) because block core decides how to use the queue?
Jens Axboe March 29, 2016, 4:50 p.m. UTC | #3
On 03/29/2016 10:47 AM, Shaohua Li wrote:
> On Tue, Mar 29, 2016 at 12:24:43AM -0700, Christoph Hellwig wrote:
>> On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
>>> We allocate most data structures on the device's node, but some data
>>> structures are not used for DMA and are mostly accessed by specific
>>> CPUs/nodes, which can differ from the device's node. Allocating such
>>> hot data on the device's node doesn't make sense. Add an API to
>>> estimate a hardware queue's node. This can be used before blk-mq
>>> actually establishes the mapping. The API is slow, but it is only
>>> used at initialization time.
>>
>> I think this is the wrong way around.  I've got some prototype code
>> that just leaves the CPU assignments to the drivers and picks it up
>> in blk-mq.  Give me a few days to post it.
>
> This looks weird, shouldn't the cpu assignment be determined by block
> core (blk-mq) because block core decides how to use the queue?

I agree, that belongs in blk-mq proper; the driver should just
follow the rules outlined, not impose its own in this regard. It'll
also help with IRQ affinity mappings, once we get that in.
Christoph Hellwig March 29, 2016, 5:44 p.m. UTC | #4
On Tue, Mar 29, 2016 at 10:50:11AM -0600, Jens Axboe wrote:
> >This looks weird, shouldn't the cpu assignment be determined by block
> >core (blk-mq) because block core decides how to use the queue?
> 
> I agree, that belongs in the blk-mq proper, the driver should just follow
> the rules outlined, not impose their own in this regard. It'll also help
> with irq affinity mappings, once we get that in.

It's not going to work that way, unfortunately.  Lots of drivers simply
have no control over the underlying interrupts.  Think of any RDMA
storage or other layered drivers - they get low-level queues from a layer
they don't control and need a block queue for each of them.

My plan is to make the block layer follow what the networking layer
does - get the low-level queues / MSI-X pairs and then use the
infrastructure in lib/cpu_rmap.c to figure out the number of queues
and queue placement for them.

The lower half of that is about to get rewritten by Thomas after
discussion between him, me, and a few others to provide drivers with
nice APIs for spreading MSI-X vectors over CPUs or nodes.
Jens Axboe March 29, 2016, 8:51 p.m. UTC | #5
On 03/29/2016 11:44 AM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2016 at 10:50:11AM -0600, Jens Axboe wrote:
>>> This looks weird, shouldn't the cpu assignment be determined by block
>>> core (blk-mq) because block core decides how to use the queue?
>>
>> I agree, that belongs in the blk-mq proper, the driver should just follow
>> the rules outlined, not impose their own in this regard. It'll also help
>> with irq affinity mappings, once we get that in.
>
> It's not going to work that way unfortunately.  Lots of drivers simply
> have no control over the underlying interrupts.  Think of any RDMA
> storage or other layer drivers - they get low level queues from a layer
> they don't control and need a block queue for each of them.
>
> My plan is to make the block layer follow what the networking layer
> does - get the low level queues / MSI-X pairs and then use the
> infrastructure in lib/cpu_rmap.c to figure out the number of queues
> and queue placement for them.

That sounds fine, as long as we get it in general infrastructure. Note 
that we need to be able to support sets of queues for blk-mq, once we 
start splitting them up (for scheduling). As long as we can map sets of 
queues to sets of CPUs, then that's not a concern.

For Shaohua's patches, not binding per-device data to the home node 
makes a lot of sense, though.
Christoph Hellwig March 29, 2016, 9:20 p.m. UTC | #6
On Tue, Mar 29, 2016 at 02:51:37PM -0600, Jens Axboe wrote:
> For Shaohua's patches, not binding per-device data to the home node makes a
> lot of sense, though.

It does.  But exposing the node estimation function just to work around
the fact that the way the cpu/node mapping is done is currently upside
down doesn't seem helpful.  Let's try to revamp that first and then
rebase the rest of the series on top of that.
Jens Axboe March 29, 2016, 9:22 p.m. UTC | #7
On 03/29/2016 03:20 PM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2016 at 02:51:37PM -0600, Jens Axboe wrote:
>> For Shaohua's patches, not binding per-device data to the home node makes a
>> lot of sense, though.
>
> It does.  But exposing the node estimation function just to work around
> the fact that the way the cpu/node mapping is done is upside down
> currently doesn't seem helpful.  Let's try to revamp that first and then
> rebase the rest of the series on top of that.

That's fine with me, it's not exactly urgent.

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 050f7a1..ec214d3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2330,6 +2330,27 @@  void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
+int blk_mq_estimate_hw_queue_node(unsigned int total_queues,
+	unsigned int index)
+{
+	unsigned int *map;
+	int node;
+
+	if (total_queues == 1)
+		return NUMA_NO_NODE;
+	map = kzalloc(sizeof(*map) * nr_cpu_ids, GFP_KERNEL);
+	if (!map)
+		return NUMA_NO_NODE;
+	if (blk_mq_update_queue_map(map, total_queues, cpu_online_mask)) {
+		kfree(map);
+		return NUMA_NO_NODE;
+	}
+	node = blk_mq_hw_queue_to_node(map, index);
+	kfree(map);
+	return node;
+}
+EXPORT_SYMBOL(blk_mq_estimate_hw_queue_node);
+
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15a73d4..892b41b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -188,6 +188,8 @@  void blk_mq_insert_request(struct request *, bool, bool, bool);
 void blk_mq_free_request(struct request *rq);
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
+int blk_mq_estimate_hw_queue_node(unsigned int total_queues,
+	unsigned int index);
 
 enum {
 	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */