Message ID | 1498653880-29223-1-git-send-email-maxg@mellanox.com (mailing list archive)
State      | New, archived
On Wed, Jun 28, 2017 at 03:44:40PM +0300, Max Gurtovoy wrote:
> This patch performs sequential mapping between CPUs and queues.
> In case the system has more CPUs than HWQs then there are still
> CPUs to map to HWQs. In a hyperthreaded system, map the unmapped
> CPUs and their siblings to the same HWQ.
> This actually fixes a bug that left unmapped HWQs in a system with
> 2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
> running NVMEoF (which opens up to a maximum of 64 HWQs).

From a first look at it, it looks good.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
On Wed, Jun 28, 2017 at 03:44:40PM +0300, Max Gurtovoy wrote:
> This patch performs sequential mapping between CPUs and queues.
> In case the system has more CPUs than HWQs then there are still
> CPUs to map to HWQs. In a hyperthreaded system, map the unmapped
> CPUs and their siblings to the same HWQ.
> This actually fixes a bug that left unmapped HWQs in a system with
> 2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
> running NVMEoF (which opens up to a maximum of 64 HWQs).
>
> Performance results running fio (72 jobs, 128 iodepth)
> using null_blk (w/w.o patch):

Can you also test with Sagi's series that uses the proper IRQ-level
mapping?
Hi Max,

> This patch performs sequential mapping between CPUs and queues.
> In case the system has more CPUs than HWQs then there are still
> CPUs to map to HWQs. In a hyperthreaded system, map the unmapped
> CPUs and their siblings to the same HWQ.
> This actually fixes a bug that left unmapped HWQs in a system with
> 2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
> running NVMEoF (which opens up to a maximum of 64 HWQs).

The explanation can be a bit clearer...

I still need to take a look at the patch itself, but do note that
ideally we will never get to blk_mq_map_queues since we prefer to map
queues based on MSIX assignments. For nvme-rdma, this is merely a
fallback, and looking ahead, MSIX-based mapping should be the primary
mapping logic.

Can you please test with my patchset on converting nvme-rdma to
MSIX-based mapping (I assume you are testing with mlx5, yes)? I'd be
very much interested to know if the original problem exists with it
applied.

I'll take a closer look into the patch.
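[For context: the MSIX-based mapping Sagi refers to follows the pattern
sketched below — try per-vector IRQ affinity first, and fall back to
blk_mq_map_queues() when no affinity information is available. This is
a simplified reconstruction modeled on the later blk-mq-rdma work;
treat the names and signatures (blk_mq_rdma_map_queues,
ib_get_vector_affinity) as approximate, not as the exact code in his
series.]

int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
		struct ib_device *dev, int first_vec)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		/* which CPUs does this completion vector interrupt? */
		mask = ib_get_vector_affinity(dev, first_vec + queue);
		if (!mask)
			goto fallback;

		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}

	return 0;

fallback:
	/* no affinity info from the HCA - use the generic CPU mapping */
	return blk_mq_map_queues(set);
}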
On 06/28/2017 09:44 AM, Max Gurtovoy wrote:
> This patch performs sequential mapping between CPUs and queues.
> In case the system has more CPUs than HWQs then there are still
> CPUs to map to HWQs. In a hyperthreaded system, map the unmapped
> CPUs and their siblings to the same HWQ.
> This actually fixes a bug that left unmapped HWQs in a system with
> 2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
> running NVMEoF (which opens up to a maximum of 64 HWQs).
>
> Performance results running fio (72 jobs, 128 iodepth)
> using null_blk (w/w.o patch):
>
> bs     IOPS(read submit_queues=72)  IOPS(write submit_queues=72)  IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
> -----  ---------------------------  ----------------------------  ---------------------------  ----------------------------
> 512    4890.4K/4723.5K              4524.7K/4324.2K               4280.2K/4264.3K              3902.4K/3909.5K
> 1k     4910.1K/4715.2K              4535.8K/4309.6K               4296.7K/4269.1K              3906.8K/3914.9K
> 2k     4906.3K/4739.7K              4526.7K/4330.6K               4301.1K/4262.4K              3890.8K/3900.1K
> 4k     4918.6K/4730.7K              4556.1K/4343.6K               4297.6K/4264.5K              3886.9K/3893.9K
> 8k     4906.4K/4748.9K              4550.9K/4346.7K               4283.2K/4268.8K              3863.4K/3858.2K
> 16k    4903.8K/4782.6K              4501.5K/4233.9K               4292.3K/4282.3K              3773.1K/3773.5K
> 32k    4885.8K/4782.4K              4365.9K/4184.2K               4307.5K/4289.4K              3780.3K/3687.3K
> 64k    4822.5K/4762.7K              2752.8K/2675.1K               4308.8K/4312.3K              2651.5K/2655.7K
> 128k   2388.5K/2313.8K              1391.9K/1375.7K               2142.8K/2152.2K              1395.5K/1374.2K

Max, thanks for this patch. I was having the same issue and was
thinking about how to address it... Your patch fixes it for me; I just
tested it. I can test any follow-up patch you send, based on the
reviews you got on the list.

Cheers,
Guilherme
On 6/28/2017 5:38 PM, Sagi Grimberg wrote:
> Hi Max,

Hi Sagi,

>> This patch performs sequential mapping between CPUs and queues.
>> In case the system has more CPUs than HWQs then there are still
>> CPUs to map to HWQs. In a hyperthreaded system, map the unmapped
>> CPUs and their siblings to the same HWQ.
>> This actually fixes a bug that left unmapped HWQs in a system with
>> 2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
>> running NVMEoF (which opens up to a maximum of 64 HWQs).
>
> The explanation can be a bit clearer...
>
> I still need to take a look at the patch itself, but do note that
> ideally we will never get to blk_mq_map_queues since we prefer to map
> queues based on MSIX assignments. For nvme-rdma, this is merely a
> fallback, and looking ahead, MSIX-based mapping should be the primary
> mapping logic.

We still have a fallback option in your series, so we surely need some
fix to blk_mq_map_queues (also for stable kernels IMO. Jens/Christoph?).

> Can you please test with my patchset on converting nvme-rdma to
> MSIX-based mapping (I assume you are testing with mlx5, yes)?

Sure. Is V6 the last version of the patchset? I'll test it with a
ConnectX-5 adapter and send the results.

> I'd be very much interested to know if the original problem exists
> with it applied.

It will exist in case set->nr_hw_queues > dev->num_comp_vectors.

> I'll take a closer look into the patch.

Thanks.
> +static int cpu_to_queue_index(unsigned int nr_queues, const int cpu,
> +			      const struct cpumask *online_mask)
>  {
> -	return cpu * nr_queues / nr_cpus;
> +	/*
> +	 * Non online CPU will be mapped to queue index 0.
> +	 */
> +	if (!cpumask_test_cpu(cpu, online_mask))
> +		return 0;

Why not map offline cpus to what they would've mapped to if they were
online?

> +		if (cpu < nr_queues) {
> +			map[cpu] = cpu_to_queue_index(nr_queues, cpu, online_mask);
> +		} else {
> +			first_sibling = get_first_sibling(cpu);
> +			if (first_sibling == cpu)
> +				map[cpu] = cpu_to_queue_index(nr_queues, cpu,
> +							      online_mask);

This seems to be the only reference to online_mask left (on each side
of the if statement). I think you can just not pass it and use
cpu_online_mask in cpu_to_queue_index.
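[The simplification Sagi suggests would look roughly like this — a
hypothetical sketch, not code from the thread: drop the mask parameter
(the function is static and only ever called with cpu_online_mask) and
give offline CPUs the same index they would get if they were online:]

static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
{
	return cpu % nr_queues;
}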
>> Can you please test with my patchset on converting nvme-rdma to
>> MSIX-based mapping (I assume you are testing with mlx5, yes)?
>
> Sure. Is V6 the last version of the patchset? I'll test it with a
> ConnectX-5 adapter and send the results.

Yes.

>> I'd be very much interested to know if the original problem exists
>> with it applied.
>
> It will exist in case set->nr_hw_queues > dev->num_comp_vectors.

We don't ask for more hw queues than num_comp_vectors.
On 6/28/2017 5:58 PM, Sagi Grimberg wrote:
>> +static int cpu_to_queue_index(unsigned int nr_queues, const int cpu,
>> +			      const struct cpumask *online_mask)
>>  {
>> -	return cpu * nr_queues / nr_cpus;
>> +	/*
>> +	 * Non online CPU will be mapped to queue index 0.
>> +	 */
>> +	if (!cpumask_test_cpu(cpu, online_mask))
>> +		return 0;
>
> Why not map offline cpus to what they would've mapped to if they were
> online?

I didn't change the logic for offline CPUs.
Should that be done in this patch?

>> +		if (cpu < nr_queues) {
>> +			map[cpu] = cpu_to_queue_index(nr_queues, cpu, online_mask);
>> +		} else {
>> +			first_sibling = get_first_sibling(cpu);
>> +			if (first_sibling == cpu)
>> +				map[cpu] = cpu_to_queue_index(nr_queues, cpu,
>> +							      online_mask);
>
> This seems to be the only reference to online_mask left (on each side
> of the if statement). I think you can just not pass it and use
> cpu_online_mask in cpu_to_queue_index.

Another user of cpu_to_queue_index may want to pass a different mask.
But it's currently a static function, so I guess we can do it.
On 6/28/2017 6:01 PM, Sagi Grimberg wrote:
>>> Can you please test with my patchset on converting nvme-rdma to
>>> MSIX-based mapping (I assume you are testing with mlx5, yes)?
>>
>> Sure. Is V6 the last version of the patchset? I'll test it with a
>> ConnectX-5 adapter and send the results.
>
> Yes.
>
>>> I'd be very much interested to know if the original problem exists
>>> with it applied.
>>
>> It will exist in case set->nr_hw_queues > dev->num_comp_vectors.
>
> We don't ask for more hw queues than num_comp_vectors.

I've tested Sagi's patches and they fix the connection establishment
bug for NVMEoF. Here are the results:

fio: 72 jobs, 128 iodepth. NVMEoF with register_always=N, 1 subsystem,
1 namespace. num_comp_vectors is 60 and therefore num_queues is 60.
I ran a comparison to my original patch with 60 queues, and also with
64 queues (possible with my patch because it is not limited by
num_comp_vectors).

bs     IOPS(read, queues=60(Sagi)/60(Max)/64(Max))
-----  -------------------------------------------
512    3424.9K/3587.8K/3619.2K
1k     3421.8K/3613.5K/3630.6K
2k     3416.4K/3605.7K/3630.2K
4k     2361.6K/2399.9K/2404.1K
8k     1368.7K/1370.7K/1370.6K
16k    692K/691K/692K
32k    345K/348K/348K
64k    175K/174K/174K
128k   88K/87K/87K

bs     IOPS(write, queues=60(Sagi)/60(Max)/64(Max))
-----  --------------------------------------------
512    3243.6K/3329.7K/3392.9K
1k     3249.7K/3341.2K/3379.2K
2k     3251.2K/3336.9K/3385.9K
4k     2685.8K/2683.9K/2683.3K
8k     1336.6K/1355.1K/1361.6K
16k    690K/690K/691K
32k    348K/348K/348K
64k    174K/174K/174K
128k   87K/87K/87K

My conclusion is that Sagi's patch is correct (although we see slightly
lower performance with it: 100K-200K fewer IOPS for small block sizes),
so you can add:

Tested-by: Max Gurtovoy <maxg@mellanox.com>

Nevertheless, we should review and consider pushing my fixes to the
block layer for other users of this mapping function.
>>> +static int cpu_to_queue_index(unsigned int nr_queues, const int cpu,
>>> +			      const struct cpumask *online_mask)
>>> {
>>> -	return cpu * nr_queues / nr_cpus;
>>> +	/*
>>> +	 * Non online CPU will be mapped to queue index 0.
>>> +	 */
>>> +	if (!cpumask_test_cpu(cpu, online_mask))
>>> +		return 0;
>>
>> Why not map offline cpus to what they would've mapped to if they were
>> online?
>
> I didn't change the logic for offline CPUs.
> Should that be done in this patch?

The patch clearly treats offline cpus differently: it maps them to
queue 0.
On Wed, Jun 28, 2017 at 08:11:23PM +0300, Max Gurtovoy wrote:
> Nevertheless, we should review and consider pushing my fixes to the
> block layer for other users of this mapping function.

Totally agree with it, and it would also be easier to backport to, say,
stable kernels.
Looks fine,
Reviewed-by: Christoph Hellwig <hch@lst.de>
So finally:
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
On Wed, Jun 28, 2017 at 03:44:40PM +0300, Max Gurtovoy wrote:
> This patch performs sequential mapping between CPUs and queues.
> In case the system has more CPUs than HWQs then there are still
> CPUs to map to HWQs. In a hyperthreaded system, map the unmapped
> CPUs and their siblings to the same HWQ.
> This actually fixes a bug that left unmapped HWQs in a system with
> 2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
> running NVMEoF (which opens up to a maximum of 64 HWQs).

Christoph/Sagi/Keith,

any updates on this patch? Without it I'm not able to run NVMf on a box
with 44 cores and 88 threads without adding -i 44 to the nvme connect
statement.

Thanks,
Johannes
On 7/5/2017 10:59 AM, Johannes Thumshirn wrote:
> On Wed, Jun 28, 2017 at 03:44:40PM +0300, Max Gurtovoy wrote:
>> This patch performs sequential mapping between CPUs and queues.
>> In case the system has more CPUs than HWQs then there are still
>> CPUs to map to HWQs. In a hyperthreaded system, map the unmapped
>> CPUs and their siblings to the same HWQ.
>> This actually fixes a bug that left unmapped HWQs in a system with
>> 2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
>> running NVMEoF (which opens up to a maximum of 64 HWQs).
>
> Christoph/Sagi/Keith,
>
> any updates on this patch? Without it I'm not able to run NVMf on a
> box with 44 cores and 88 threads without adding -i 44 to the nvme
> connect statement.

Hi Johannes,

this was already merged into the main tree (Jens added it to his pull
request).

Cheers,
Max.
On Wed, Jul 05, 2017 at 11:11:36AM +0300, Max Gurtovoy wrote:
> Hi Johannes,
> this was already merged into the main tree (Jens added it to his pull
> request).

Oops, sorry for the noise.
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8e61e86..2cca4fc 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -14,10 +14,15 @@
 #include "blk.h"
 #include "blk-mq.h"
 
-static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
-			      const int cpu)
+static int cpu_to_queue_index(unsigned int nr_queues, const int cpu,
+			      const struct cpumask *online_mask)
 {
-	return cpu * nr_queues / nr_cpus;
+	/*
+	 * Non online CPU will be mapped to queue index 0.
+	 */
+	if (!cpumask_test_cpu(cpu, online_mask))
+		return 0;
+	return cpu % nr_queues;
 }
 
 static int get_first_sibling(unsigned int cpu)
@@ -36,55 +41,26 @@ int blk_mq_map_queues(struct blk_mq_tag_set *set)
 	unsigned int *map = set->mq_map;
 	unsigned int nr_queues = set->nr_hw_queues;
 	const struct cpumask *online_mask = cpu_online_mask;
-	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
-	cpumask_var_t cpus;
-
-	if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
-		return -ENOMEM;
-
-	cpumask_clear(cpus);
-	nr_cpus = nr_uniq_cpus = 0;
-	for_each_cpu(i, online_mask) {
-		nr_cpus++;
-		first_sibling = get_first_sibling(i);
-		if (!cpumask_test_cpu(first_sibling, cpus))
-			nr_uniq_cpus++;
-		cpumask_set_cpu(i, cpus);
-	}
-
-	queue = 0;
-	for_each_possible_cpu(i) {
-		if (!cpumask_test_cpu(i, online_mask)) {
-			map[i] = 0;
-			continue;
-		}
+	unsigned int cpu, first_sibling;
 
+	for_each_possible_cpu(cpu) {
 		/*
-		 * Easy case - we have equal or more hardware queues. Or
-		 * there are no thread siblings to take into account. Do
-		 * 1:1 if enough, or sequential mapping if less.
+		 * First do sequential mapping between CPUs and queues.
+		 * In case we still have CPUs to map, and we have some number of
+		 * threads per cores then map sibling threads to the same queue for
+		 * performance optimizations.
 		 */
-		if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
-			map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
-			queue++;
-			continue;
+		if (cpu < nr_queues) {
+			map[cpu] = cpu_to_queue_index(nr_queues, cpu, online_mask);
+		} else {
+			first_sibling = get_first_sibling(cpu);
+			if (first_sibling == cpu)
+				map[cpu] = cpu_to_queue_index(nr_queues, cpu, online_mask);
+			else
+				map[cpu] = map[first_sibling];
 		}
-
-		/*
-		 * Less then nr_cpus queues, and we have some number of
-		 * threads per cores. Map sibling threads to the same
-		 * queue.
-		 */
-		first_sibling = get_first_sibling(i);
-		if (first_sibling == i) {
-			map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
-						    queue);
-			queue++;
-		} else
-			map[i] = map[first_sibling];
 	}
 
-	free_cpumask_var(cpus);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(blk_mq_map_queues);
This patch performs sequential mapping between CPUs and queues.
In case the system has more CPUs than HWQs then there are still
CPUs to map to HWQs. In a hyperthreaded system, map the unmapped
CPUs and their siblings to the same HWQ.
This actually fixes a bug that left unmapped HWQs in a system with
2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
running NVMEoF (which opens up to a maximum of 64 HWQs).

Performance results running fio (72 jobs, 128 iodepth)
using null_blk (w/w.o patch):

bs     IOPS(read submit_queues=72)  IOPS(write submit_queues=72)  IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
-----  ---------------------------  ----------------------------  ---------------------------  ----------------------------
512    4890.4K/4723.5K              4524.7K/4324.2K               4280.2K/4264.3K              3902.4K/3909.5K
1k     4910.1K/4715.2K              4535.8K/4309.6K               4296.7K/4269.1K              3906.8K/3914.9K
2k     4906.3K/4739.7K              4526.7K/4330.6K               4301.1K/4262.4K              3890.8K/3900.1K
4k     4918.6K/4730.7K              4556.1K/4343.6K               4297.6K/4264.5K              3886.9K/3893.9K
8k     4906.4K/4748.9K              4550.9K/4346.7K               4283.2K/4268.8K              3863.4K/3858.2K
16k    4903.8K/4782.6K              4501.5K/4233.9K               4292.3K/4282.3K              3773.1K/3773.5K
32k    4885.8K/4782.4K              4365.9K/4184.2K               4307.5K/4289.4K              3780.3K/3687.3K
64k    4822.5K/4762.7K              2752.8K/2675.1K               4308.8K/4312.3K              2651.5K/2655.7K
128k   2388.5K/2313.8K              1391.9K/1375.7K               2142.8K/2152.2K              1395.5K/1374.2K

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
---
 block/blk-mq-cpumap.c | 68 ++++++++++++++++---------------------------
 1 files changed, 22 insertions(+), 46 deletions(-)
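[To see why the old scheme left HWQs unmapped, here is a standalone
userspace sketch — not kernel code — that replays both mappings for the
72-CPU / 64-HWQ case from the commit message. The sibling topology
(CPU i paired with CPU i+36) is an assumption for illustration; the
real kernel reads it from the sibling masks in sysfs/topology.]

/*
 * Replay the old and new blk_mq_map_queues() math for
 * 72 CPUs / 64 HWQs. Assumed topology: CPU i and CPU i+36
 * are hyperthread siblings of the same core.
 */
#include <stdio.h>

#define NR_CPUS   72	/* 2 sockets * 18 cores * 2 threads */
#define NR_QUEUES 64	/* NVMEoF opens up to 64 HWQs */
#define NR_CORES  36	/* unique cores == nr_uniq_cpus in the old code */

static int first_sibling(int cpu)
{
	return cpu % NR_CORES;
}

int main(void)
{
	int old_map[NR_CPUS], new_map[NR_CPUS];
	int used_old[NR_QUEUES] = { 0 }, used_new[NR_QUEUES] = { 0 };
	int cpu, q, queue = 0, unmapped_old = 0, unmapped_new = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		/*
		 * Old scheme (sibling branch only, the one taken when
		 * nr_queues < nr_cpus and hyperthreading is on):
		 * spread the 36 unique cores over the 64 queues.
		 */
		if (first_sibling(cpu) == cpu)
			old_map[cpu] = queue++ * NR_QUEUES / NR_CORES;
		else
			old_map[cpu] = old_map[first_sibling(cpu)];
		used_old[old_map[cpu]] = 1;

		/* New scheme: sequential first, then siblings share. */
		if (cpu < NR_QUEUES)
			new_map[cpu] = cpu % NR_QUEUES;
		else
			new_map[cpu] = new_map[first_sibling(cpu)];
		used_new[new_map[cpu]] = 1;
	}

	for (q = 0; q < NR_QUEUES; q++) {
		unmapped_old += !used_old[q];
		unmapped_new += !used_new[q];
	}
	printf("old: %d unmapped HWQs, new: %d unmapped HWQs\n",
	       unmapped_old, unmapped_new);
	return 0;
}

[Under these assumptions the sketch prints "old: 28 unmapped HWQs, new:
0 unmapped HWQs": the old scheme only ever touches one queue index per
unique core (36 of the 64), while the sequential mapping covers all 64
queues before folding the remaining 8 sibling threads back onto their
cores' queues.]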