Message ID | 68fed570910230ce847f8f3b685eeea399640a7f.1458941500.git.shli@fb.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote: > we allocate most data structure in device's node, but some data > structures are not for DMA and mostly used by specific cpus/node which > could diff from device's node. Allocating such hot data in device's > node doesn't make sense. Add an API to estimate hardware queue node. > This can be used before blk-mq actually establishes the mapping. This > API runs slow, but it only used in initialization time. I think this is the wrong way around. I've got some proprotype code that just leaves the cpu assignments to the drivers and picks it up in blk-mq. Give me a few days to post it.. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Mar 29, 2016 at 12:24:43AM -0700, Christoph Hellwig wrote: > On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote: > > we allocate most data structure in device's node, but some data > > structures are not for DMA and mostly used by specific cpus/node which > > could diff from device's node. Allocating such hot data in device's > > node doesn't make sense. Add an API to estimate hardware queue node. > > This can be used before blk-mq actually establishes the mapping. This > > API runs slow, but it only used in initialization time. > > I think this is the wrong way around. I've got some proprotype code > that just leaves the cpu assignments to the drivers and picks it up > in blk-mq. Give me a few days to post it.. This looks weird, shouldn't the cpu assignment be determined by block core (blk-mq) because block core decides how to use the queue? -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/29/2016 10:47 AM, Shaohua Li wrote: > On Tue, Mar 29, 2016 at 12:24:43AM -0700, Christoph Hellwig wrote: >> On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote: >>> we allocate most data structure in device's node, but some data >>> structures are not for DMA and mostly used by specific cpus/node which >>> could diff from device's node. Allocating such hot data in device's >>> node doesn't make sense. Add an API to estimate hardware queue node. >>> This can be used before blk-mq actually establishes the mapping. This >>> API runs slow, but it only used in initialization time. >> >> I think this is the wrong way around. I've got some proprotype code >> that just leaves the cpu assignments to the drivers and picks it up >> in blk-mq. Give me a few days to post it.. > > This looks weird, shouldn't the cpu assignment be determined by block > core (blk-mq) because block core decides how to use the queue? I agree, that belongs in the blk-mq proper, the driver should just follow the rules outlined, not impose their own in this regard. It'll also help with irq affinity mappings, once we get that in.
On Tue, Mar 29, 2016 at 10:50:11AM -0600, Jens Axboe wrote: > >This looks weird, shouldn't the cpu assignment be determined by block > >core (blk-mq) because block core decides how to use the queue? > > I agree, that belongs in the blk-mq proper, the driver should just follow > the rules outlined, not impose their own in this regard. It'll also help > with irq affinity mappings, once we get that in. It's not going to work that way unfortunately. Lots of driver simply have no control over the underlying interrupts. Think of any RDMA storage or other layer drivers - they get low level queues from a layer they don't control and need a block queue for each of them. My plan is to make the block layer follow what the networking layer does - get the low level queues / MSI-X pairs and then use the infrastructure in lib/cpu_rmap.c to figure out the number of queues and queue placement for them. The lower half of that is about to get rewritten by Thomas after disccussion between him, me and a few others to provide drivers nice APIs for spreading MSI-X vectors over CPUs or nodes. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/29/2016 11:44 AM, Christoph Hellwig wrote: > On Tue, Mar 29, 2016 at 10:50:11AM -0600, Jens Axboe wrote: >>> This looks weird, shouldn't the cpu assignment be determined by block >>> core (blk-mq) because block core decides how to use the queue? >> >> I agree, that belongs in the blk-mq proper, the driver should just follow >> the rules outlined, not impose their own in this regard. It'll also help >> with irq affinity mappings, once we get that in. > > It's not going to work that way unfortunately. Lots of driver simply > have no control over the underlying interrupts. Think of any RDMA > storage or other layer drivers - they get low level queues from a layer > they don't control and need a block queue for each of them. > > My plan is to make the block layer follow what the networking layer > does - get the low level queues / MSI-X pairs and then use the > infrastructure in lib/cpu_rmap.c to figure out the number of queues > and queue placement for them. That sounds fine, as long as we get it in general infrastructure. Note that we need to be able to support sets of queues for blk-mq, once we start splitting them up (for scheduling). As long as we can map sets of queues to sets of CPUs, then that's not a concern. For Shaohua's patches, not binding per-device data to the home node makes a lot of sense, though.
On Tue, Mar 29, 2016 at 02:51:37PM -0600, Jens Axboe wrote: > For Shaohua's patches, not binding per-device data to the home node makes a > lot of sense, though. It does. But exposing the node estimation function just to work around the fact that the way the cpu/node mapping is done is upside down currently doesn't seem helpful. Let's try to revamp that first and the rebase the rest of the series on top of that. -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/29/2016 03:20 PM, Christoph Hellwig wrote: > On Tue, Mar 29, 2016 at 02:51:37PM -0600, Jens Axboe wrote: >> For Shaohua's patches, not binding per-device data to the home node makes a >> lot of sense, though. > > It does. But exposing the node estimation function just to work around > the fact that the way the cpu/node mapping is done is upside down > currently doesn't seem helpful. Let's try to revamp that first and the > rebase the rest of the series on top of that. That's fine with me, it's not exactly urgent.
diff --git a/block/blk-mq.c b/block/blk-mq.c index 050f7a1..ec214d3 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2330,6 +2330,27 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set) } EXPORT_SYMBOL(blk_mq_free_tag_set); +int blk_mq_estimate_hw_queue_node(unsigned int total_queues, + unsigned int index) +{ + unsigned int *map; + int node; + + if (total_queues == 1) + return NUMA_NO_NODE; + map = kzalloc(sizeof(*map) * nr_cpu_ids, GFP_KERNEL); + if (!map) + return NUMA_NO_NODE; + if (blk_mq_update_queue_map(map, total_queues, cpu_online_mask)) { + kfree(map); + return NUMA_NO_NODE; + } + node = blk_mq_hw_queue_to_node(map, index); + kfree(map); + return node; +} +EXPORT_SYMBOL(blk_mq_estimate_hw_queue_node); + int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr) { struct blk_mq_tag_set *set = q->tag_set; diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 15a73d4..892b41b 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -188,6 +188,8 @@ void blk_mq_insert_request(struct request *, bool, bool, bool); void blk_mq_free_request(struct request *rq); void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq); bool blk_mq_can_queue(struct blk_mq_hw_ctx *); +int blk_mq_estimate_hw_queue_node(unsigned int total_queues, + unsigned int index); enum { BLK_MQ_REQ_NOWAIT = (1 << 0), /* return when out of requests */
we allocate most data structure in device's node, but some data structures are not for DMA and mostly used by specific cpus/node which could diff from device's node. Allocating such hot data in device's node doesn't make sense. Add an API to estimate hardware queue node. This can be used before blk-mq actually establishes the mapping. This API runs slow, but it only used in initialization time. Signed-off-by: Shaohua Li <shli@fb.com> --- block/blk-mq.c | 21 +++++++++++++++++++++ include/linux/blk-mq.h | 2 ++ 2 files changed, 23 insertions(+)